- Demonstrate the use of Open Source image processing software for automating the decoding of historic documents which have secret message text embedded within them.
- Identify a valid source of Ground Truth Data.
- Explain how some elements of this application are inherently simple, which bodes well for feasibility and success.
The central feature of the 1916 Riverbank Laboratories monograph (included here) is a series of “Classifier” cards to manually decode the secret messages inscribed by Francis Bacon, using the Biformed Alphabet scheme he had invented. Foremost among Bacon’s documents was the 1623 First Folio of William Shakespeare, which Fr.B was covertly the real, primary author of.
The last sentence at the bottom of the covered page (at left) reads, CUT OUT SHADED PART WITH A SHARP KNIFE.
With the rectangular slot open, the classifier could be positioned over the document to be decoded, so that, one at a time, its letters could be manually identified as being either an ‘a’ or ‘b’ form, equivalent to a binary ‘0’ or ‘1’.
Contemporary AI software can emulate this process readily, using Convolutional Neural Networks (left). The computing resources required to implement the image comparison algorithm (“template matching”) are freely available from Google Colab. The end result is a best-choice selection among a group of pre-defined Labels, expressed with a Confidence level: in the very simple example at left, the CNN might report that the examined image appears to be a “Bird”, with a Confidence value of (say) 87%.
Therefore a CNN makes Predictions, which are probabilistic: not either ‘true’ or ‘false’, but instead a numerical value between 0.0 and 1.0, which corresponds to a Confidence value between zero and 100%.
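As a minimal sketch of how such a Prediction can be turned into a Label with a Confidence value, the snippet below applies a softmax to two raw scores. The function names and the example scores are assumptions made for illustration; they are not taken from the project's own code.

```python
import math

LABELS = ("a", "b")  # the two biliteral forms


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(scores, labels=LABELS):
    """Return (label, confidence) for the most probable label."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Hypothetical raw scores for one glyph, one score per label.
label, confidence = predict([0.3, 2.2])
print(f"{label} ({confidence:.0%})")
```

The confidence is always between 0.0 and 1.0, so a threshold (for example, flagging anything below 0.6 for manual review) could be layered on top.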
The Labels for this application are only two in number, ‘a’ and ‘b’, which correspond to binary 0 and 1. Bacon used Biformed Alphabets to encode his secret messages, that is, two font styles within the same block of text. If our program can reliably predict whether any single letter is of the ‘a’ or ‘b’ form, the hidden message can be decoded with vastly greater speed than in the past.
With greater speed would come wider scope of coverage: the most prolific decipherer of Fr.B’s concealed messages, E.W.G, said that at least a third of the overall volume remains undeciphered. Therefore new revelations from Fr.B could arise from what is newly decoded.
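Once every letter has been labelled ‘a’ or ‘b’, the decoding step itself is mechanical. The sketch below assumes Bacon's published biliteral table, in which five forms encode one letter of the 24-letter alphabet (I/J and U/V each share a code). The function name is hypothetical.

```python
# Bacon's 24-letter alphabet: I/J share a code, as do U/V.
ALPHABET = "ABCDEFGHIKLMNOPQRSTUWXYZ"


def decode_biliteral(forms):
    """Decode a sequence of 'a'/'b' classifications, five per letter."""
    forms = [f for f in forms if f in ("a", "b")]  # ignore anything else
    letters = []
    usable = len(forms) - len(forms) % 5  # drop an incomplete final group
    for i in range(0, usable, 5):
        bits = "".join("1" if f == "b" else "0" for f in forms[i:i + 5])
        value = int(bits, 2)
        # Values 24..31 have no letter assigned in the table.
        letters.append(ALPHABET[value] if value < len(ALPHABET) else "?")
    return "".join(letters)


print(decode_biliteral("aabab ababa babba"))  # prints FLY
```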
In Machine Learning, Ground Truth means checking the results of machine learning (such as an application of a CNN) for accuracy against the real world. The term is borrowed from meteorology, where “ground truth” refers to information obtained from measurements on some terrestrial site.
Some components of the ground truth data for this experiment are available from the same Bodleian Library download page already mentioned in Experiment One:
The previously-used image of the Prologue page from Troilus and Cressida came from the “image PDF” file shown above. The “text PDF” file provides the corresponding Prologue page in text, rather than as an image:
For the project to be successful, there are five attributes for each letter of unencoded text which need to be determinable on request.
The Five Attributes
1. Which of the 24 letters of the Elizabethan English alphabet any given glyph (uncharacterized image blob) is.
2. Whether upper or lower case.
3. Whether ‘a’ or ‘b’ biliteral value.
4. Bounding box coordinates of each glyph.
5. Direct lookup between text position and image (glyph) position.
An example of #5 would be, “letter number 23 of the Prologue is ‘T’, so show the tiny image of that one glyph by itself, using the values of its bounding box”.
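One way the five attributes might be held together is a small record per glyph, indexed by text position. This is only a sketch: the `Glyph` class, the `(left, top, right, bottom)` box layout, and the toy image are assumptions, and the image is a plain list of pixel rows so that the example runs without any imaging library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    char: str    # attributes 1 and 2: the letter, in its own case
    form: str    # attribute 3: 'a' or 'b'
    box: tuple   # attribute 4: (left, top, right, bottom) in pixels


def crop(image, box):
    """Cut the rectangle described by box out of a list-of-rows image."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def glyph_image(image, glyphs, position):
    """Attribute 5: the picture of the glyph at a 1-based text position."""
    return crop(image, glyphs[position - 1].box)


# Toy 4-row by 6-column "page"; each pixel value is 10*row + column.
page = [[10 * r + c for c in range(6)] for r in range(4)]
glyphs = [Glyph("T", "a", (0, 0, 2, 4)), Glyph("h", "b", (2, 1, 5, 3))]
print(glyph_image(page, glyphs, 2))  # prints [[12, 13, 14], [22, 23, 24]]
```

With a NumPy or OpenCV image array, the equivalent crop would be the slice `image[top:bottom, left:right]`.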
The ‘text PDF’ representation is a reliable information source for providing the first two, and will comprise elements of our Ground Truth data.
The third (‘which of ‘a’ or ‘b’ biliteral value’) is already available from Experiment One. The assumption will be made that this data also qualifies as Ground Truth.
The fourth, presumably, can eventually come from custom OpenCV programs. Below is described the first step towards that (more will be needed).
In the initial iteration of this project, it was believed that the goals could easily be achieved using a relatively simple customization of the Tesseract computer vision library. Tesseract was for decades the most venerated name in Optical Character Recognition software libraries.
Eurotext.png, grayscale, 450 × 351, 57Kb
Shown here is a test image commonly used in Computer Vision trials. A supposed Acid Test, the text is intentionally riddled with punctuation marks that could be troublesome, in five different languages.
Yet like all modern text fonts, none of the glyphs overlap, and there is no noise in the background image. This is in contrast to digital facsimiles of the First Folio, which has overlapping glyphs, and lots of background noise.
OCR Text output
The character recognition by Tesseract of the above test image is nearly perfect.
The discovery and display of the letter Bounding Boxes is also near-perfect.
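For readers who want to reproduce the bounding boxes, the sketch below parses Tesseract's character box output, in which each line reads `char left bottom right top page` with the origin at the bottom-left of the image. It assumes the text was obtained through the `pytesseract` wrapper, which may not be what the original program used, and the sample coordinates are invented for illustration.

```python
def parse_boxes(box_text, image_height):
    """Convert Tesseract box lines to (char, left, top, right, bottom).

    Tesseract measures y from the bottom edge; most imaging code measures
    from the top edge, so the y values are flipped here.
    """
    boxes = []
    for line in box_text.splitlines():
        parts = line.split()
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        char = parts[0]
        left, bottom, right, top = map(int, parts[1:5])
        boxes.append((char, left, image_height - top,
                      right, image_height - bottom))
    return boxes


# In practice box_text could come from (assumption, requires pytesseract):
#   box_text = pytesseract.image_to_boxes(Image.open("Eurotext.png"))
sample = "T 36 314 50 330 0\nh 52 314 62 331 0"  # invented values
for box in parse_boxes(sample, image_height=351):
    print(box)
```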
Lab Notebook, “Experiment-2”
On every execution of the Python program code, the tiny image inside each bounding box gets saved into its own .jpeg file, and will be referred to here as the ‘pictures-of-letters’. This will be important in future Experiments.
Now, though, the Python algorithms can access each picture-of-letter directly as an ordinary program object.
This enables the main CV (Computer Vision) technique which will be used here, Template Matching: a Template image is superimposed over a larger Target image to compare the two for similarity. This is the basis for Recognition, and by extension, Decoding.
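To make the idea concrete, here is a minimal pure-Python sketch of Template Matching that scores every position by the sum of squared differences (lower means more similar). It is illustrative only; on real page images OpenCV's `cv2.matchTemplate` with the `cv2.TM_SQDIFF` method computes this kind of score far more efficiently. The function name and the toy images are assumptions.

```python
def match_template(target, template):
    """Slide template over target; return (best_score, (x, y)).

    Both images are lists of rows of grayscale values. The score is the
    sum of squared differences, so 0 means a perfect match.
    """
    th, tw = len(template), len(template[0])
    best = None
    for y in range(len(target) - th + 1):
        for x in range(len(target[0]) - tw + 1):
            ssd = sum(
                (target[y + dy][x + dx] - template[dy][dx]) ** 2
                for dy in range(th)
                for dx in range(tw)
            )
            if best is None or ssd < best[0]:
                best = (ssd, (x, y))
    return best


target = [
    [0, 0, 0, 0, 0],
    [0, 0, 9, 1, 0],
    [0, 0, 1, 9, 0],
    [0, 0, 0, 0, 0],
]
template = [[9, 1], [1, 9]]
print(match_template(target, template))  # prints (0, (2, 1))
```

For decoding, the same glyph region could be scored against both the ‘a’ and ‘b’ Templates for its letter, taking the form with the lower score.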
For this experiment, the Target is the Bodleian Library facsimile of the T & C Prologue, used without manual preprocessing of any kind.
The Templates would be a set of 96 glyphs: the 24 letters of the Elizabethan alphabet, in both upper and lower case, each in two Biliteral variants (24 × 2 × 2 = 96). Each glyph is known, with certainty, to be either the ‘a’ or ‘b’ variant for that letter.
An existing font, below, is an attempt to duplicate the look of the First Folio’s type. But it can’t suffice for use here, since there is no way to have the bounding boxes of letters overlapping.
The quick brown fox jumped over the lazy dog.
Waltz, bad nymph, for quick jigs vex.
Sphinx of black quartz, judge my vow
How quickly daft jumping zebras vex!
Pack my box with five dozen liquor jugs.
Mr Jock, TV quiz PhD, bags few lynx
It can be seen in the above sample from The Prologue that the bounding boxes of actual First Folio text overlap in many places. The letters ‘s’ and ‘f’ are particularly troublesome, as their bounding boxes overlap in two places: with the letter before, and with the letter after. This throws a proverbial wrench into algorithms depending on bounding boxes.
Also, the noise in the background image is readily visible.
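The overlap problem is easy to test for once boxes exist. The sketch below assumes boxes are `(left, top, right, bottom)` tuples in reading order; the function names and coordinates are invented for illustration.

```python
def boxes_overlap(a, b):
    """True if two (left, top, right, bottom) boxes share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def overlapping_neighbours(boxes):
    """Indices i where box i overlaps box i + 1 in reading order."""
    return [
        i for i in range(len(boxes) - 1)
        if boxes_overlap(boxes[i], boxes[i + 1])
    ]


# Invented example: the second glyph (say, a long 's') reaches over
# both the glyph before it and the glyph after it.
boxes = [(0, 5, 10, 20), (8, 0, 18, 20), (16, 5, 26, 20), (30, 5, 40, 20)]
print(overlapping_neighbours(boxes))  # prints [0, 1]
```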
This is the resulting image from the same Tesseract program as above, as it tries to discover the bounding boxes of a sample from the 1623 First Folio.
It is rife with erroneous identifications. The OCR text returned is gibberish.
Then from where can our set of 96 non-overlapping glyphs come? These would be the essential Templates used to match against a rectangular region within the Target (in this case, the Folio Prologue page). The project can’t go forward without obtaining these.
The next Experiment will explore using the 1916 decoder cards (above), which were produced by W.F.F at Riverbank Laboratories. Note how there is ample space between all the letters, so there is no overlapping.
There is also a great deal of noise in the background image, but this might be filterable by our custom Python code.
If each glyph can be saved to its own .jpeg file, a simple naming scheme could provide the crucial requirement of Attribute #5: each letter-picture could have a unique two-character file name, which concisely encodes the letter, its case, and its biliteral value. For example, the upper ‘c’ letter in the decoder card above would be saved in a file, c0.jpeg. The lower ‘c’ letter would be saved in a file, c1.jpeg. For the custom Python code, this would allow ready access to any pair of letter-pictures to match against. This would satisfy the requirement, “Direct lookup between text position and image (glyph) position.”
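A sketch of how such a naming scheme might look in code follows. It assumes that the digit 0 stands for the ‘a’ form and 1 for the ‘b’ form, and that the case of the letter in the file name records the case of the glyph; both function names are hypothetical.

```python
def glyph_filename(letter, form):
    """Build a two-character file name such as 'c0.jpeg' or 'C1.jpeg'."""
    if len(letter) != 1 or not letter.isalpha():
        raise ValueError("letter must be a single alphabetic character")
    if form not in ("a", "b"):
        raise ValueError("form must be 'a' or 'b'")
    return f"{letter}{'ab'.index(form)}.jpeg"  # assumed: a -> 0, b -> 1


def parse_glyph_filename(name):
    """Recover (letter, is_upper, form) from a name made by glyph_filename."""
    stem = name.rsplit(".", 1)[0]
    letter, digit = stem[0], stem[1]
    return letter, letter.isupper(), "ab"[int(digit)]


print(glyph_filename("c", "a"))         # prints c0.jpeg
print(parse_glyph_filename("C1.jpeg"))  # prints ('C', True, 'b')
```

One design point to weigh: on file systems that ignore case, `C0.jpeg` and `c0.jpeg` would name the same file, so a variant that spells out the case might be safer there.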
Elements of Simplicity
View the Jupyter Notebook: the experimental commands used here, and the results recorded.
View and download the Python source code from Github.
Notes for this page:
- The downloadable Bodleian file listed above, “XML as PDF” provides a representation of the First Folio as a Text Encoding Initiative document. A more comprehensive abstraction of a book’s content and metadata can hardly be imagined. So from that, we can know that the text transcription in the PDF version has been proofread by human beings, making it suitable as Ground Truth data. We are even given the names of the proofreaders: thank you Pip Willcox, Lucienne Cummings, Judith Siefring, and Emma Stanford! Your work is accepted here as Ground Truth!