Bootstrap – New Gorhambury

The two reference alphabets as a measuring instrument

LC1 and LC2 are not merely illustrations — they are a calibrated instrument. Each card shows every letter of the italic alphabet printed from a specific set of metal type blocks. The two cards differ because two physically distinct sets of blocks were used to compose the Folio’s italic passages, and those two sets produce consistently different letterforms. The differences are baked into the metal and repeat identically every time a given block is pressed to paper.

When you hold LC1 alongside a passage of Folio text, you are asking: did the compositor reach into Case 1 or Case 2 when he picked up each individual type block? The answer is encoded in the precise shape of the printed letter, and LC1 versus LC2 is the key that reads it.

Two kinds of difference, two kinds of confidence

The differences between the two cards fall into two distinct categories, and this distinction drives everything that follows.

For four letter types — h, v, z, and I/J — the difference is structural. The Form 1 h has an open counter (the arch returns without enclosing the space beneath it). The Form 2 h has a closed counter (the arch curves all the way round, creating a shape resembling an italic b). These are not subtly different letterforms — they are architecturally distinct. Anyone can distinguish them by eye, even in a JPEG scan, without measurement tools.

For every other letter — e, r, s, o, n, t, u and so on — the difference is one of degree rather than kind. The Form 2 e has a slightly tighter aperture and marginally heavier stroke than the Form 1 e. The Form 2 r has a slightly different shoulder curve. These differences are real and consistent — again, they are in the metal — but they require careful direct comparison against the reference cards, and at typical scan resolution they sit close to the threshold of what can be reliably distinguished without magnification or image measurement tools.

This two-tier distinction has a direct consequence: the four structural biforms can be classified with near-certainty by anyone with the reference cards. The remaining twenty-two letter types require either a trained eye with a loupe, or computer image analysis tools that do not yet exist in this implementation.

What the known decoded message contributes

Once the Prologue has been decoded — giving us the message Francis St Alban descended from the mighty heroes of Troy… — that decoded text becomes an independent source of information about every letter that follows.

Here is why. Every group of five consecutive italic letters must jointly encode exactly one letter of the hidden message, following Bacon’s fixed table. There is no choice in this: the table is deterministic and has only 24 valid entries. If you already know which hidden message letter the group must produce, then you automatically know the exact Form a/b pattern the five printed letters must follow — all five positions at once, not just the ones you examined visually.

In concrete terms: suppose you are looking at a group of five letters and you have visually confirmed that the second letter is Form a (it is an h with an open counter). That single confirmed position reduces the number of possible hidden message letters from 24 to 16, because 16 of the 24 codes in Bacon's table have a in second place. Now add the message constraint: the hidden message at this position must be D, because D is the next letter in Descended, and the code for D is aaabb. Only one of those 16 patterns produces D, so the pattern is completely determined. All five letters are now classified, including the four whose subtle differences you could not see clearly.

This is not an approximation or a best guess. Given one vision-confirmed bit and one known message position, the classification of all five letters in the group is mathematically certain.
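The back-filling step can be sketched in a few lines. This is a minimal illustration, assuming the standard 24-letter table in which the codes run in binary order from aaaaa (A) to babbb (Z), with I/J and U/V sharing codes. The function name backfill and the choice of L as the example letter are illustrative only.

```python
# Bacon's 24-letter alphabet (assumed standard ordering; I/J and U/V share codes).
ALPHABET = "ABCDEFGHIKLMNOPQRSTUWXYZ"
TO_AB = str.maketrans("01", "ab")
CODE_OF = {ch: format(i, "05b").translate(TO_AB) for i, ch in enumerate(ALPHABET)}
SHARED = {"J": "I", "V": "U"}


def backfill(hidden_letter, confirmed):
    """Return the five-position a/b pattern required by a known hidden letter.

    confirmed maps a position (0-4) to the bit seen by eye; a mismatch with
    the required pattern raises ValueError instead of being silently ignored.
    """
    letter = hidden_letter.upper()
    code = CODE_OF[SHARED.get(letter, letter)]
    for position, bit in confirmed.items():
        if code[position] != bit:
            raise ValueError(
                f"position {position} was seen as {bit!r} "
                f"but {letter} requires {code[position]!r}"
            )
    return code


# One confirmed bit (second letter is Form b) plus the known hidden letter L.
print(backfill("L", {1: "b"}))  # ababa
```

In this sketch the confirmed bit does not select the pattern; it acts as a consistency check on the pattern that the known hidden letter dictates.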

How much vision input is actually needed

In the demonstration run on Line 2 of the Prologue, five h positions were classified by visual inspection. Those five confirmed bits — covering 14% of the 35 letters in the passage — were sufficient to resolve all 35 letters and all seven hidden message characters, with 100% accuracy verified against the known answer sheet.

The remaining 30 letters were classified entirely by working backwards from the message: if the hidden letter must be L, the group must be ababa, so every letter in the group takes its bit value directly from that code, regardless of whether its subtle typeform difference was visible.

The structural biforms act as anchor points. Because h appears in ordinary English prose roughly once in every eighteen to twenty letters, a five-letter group has about a one-in-four chance of containing at least one h. That single anchor, combined with the message constraint, typically resolves the entire group. Groups with no structural biforms at all are resolved by the message constraint alone, with no vision input required, as long as the message frontier has reached that position.

The frontier and why it moves forward

At any point in the process, there is a frontier: the boundary between the portion of the message already decoded and the portion not yet decoded. Everything behind the frontier can be resolved by message constraint alone, regardless of visual input. Everything ahead of the frontier must be resolved by vision first, with the message then confirming the result.

Crucially, the frontier advances by one hidden letter every time a group of five is successfully resolved. Each new hidden letter immediately becomes available as a constraint for resolving the next group. The process feeds itself: more decoded message means less visual work needed to decode the next passage, which produces more decoded message, and so on.

The practical consequence is that the system becomes progressively easier to operate as it runs. The first pages require more visual work — examining more structural biforms per group, relying more heavily on direct comparison against the reference cards. Later pages, by which point the decoded message runs to thousands of characters, require almost no visual work at all. The long message context is so constraining that even a single confirmed h in a group is sufficient to resolve it, and many groups resolve with zero visual input because the message leaves only one possible continuation at that position.

What this means for the 900 remaining pages

The workflow for any new italic page is the following sequence.

First, extract the letters in order from the italic passage, stripping spaces and punctuation. This is mechanical and can be done directly from the Bodleian TEI XML, which marks every italic element and preserves the original spelling.

Second, locate every occurrence of h, v, z, I and J in that sequence. Compare each one against LC1/LC2 or UC1/UC2. Write down a or b for each. This is the only step that requires looking at the actual letterforms on the page.

Third, group the sequence into fives and apply the message constraint to each group. For every group where the message frontier has passed, all five bits are resolved automatically. For groups at or beyond the frontier, the vision-confirmed bits from step two — combined with whatever message context is available — resolve all remaining positions in nearly all cases.

Fourth, decode each resolved group via the Bacon table. Append each new hidden letter to the known message. This advances the frontier, making the next group on the same page, or the first group on the following page, immediately resolvable.
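The mechanical parts of this workflow (steps one to three, before any message constraint is applied) can be sketched as follows. The sketch assumes the italic passage is already available as a plain string and does not attempt to parse the TEI XML. It takes the clear-biform set literally as h, v, z, I and J, and all function names are hypothetical.

```python
# Letters whose two forms are described as distinguishable by eye.
CLEAR_BIFORMS = set("hvzIJ")


def extract_letters(passage):
    """Step one: keep letters only, in order, dropping spaces and punctuation."""
    return [ch for ch in passage if ch.isalpha()]


def clear_biform_positions(letters):
    """Step two: indices that need a visual comparison against the cards."""
    return [i for i, ch in enumerate(letters) if ch in CLEAR_BIFORMS]


def group_in_fives(letters):
    """Step three: non-overlapping groups of five; a short tail is dropped."""
    usable = len(letters) - len(letters) % 5
    return ["".join(letters[i:i + 5]) for i in range(0, usable, 5)]


sample = "In Troy, there lyes the Scene."  # arbitrary sample string
letters = extract_letters(sample)
print(clear_biform_positions(letters))  # [0, 7, 16]
print(group_in_fives(letters))          # ['InTro', 'yther', 'elyes', 'theSc']
```

Only the positions returned by clear_biform_positions require looking at the page; everything else in the sketch is bookkeeping.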

Each play’s italic passages extend the decoded message by several hundred characters. By the time the full Folio has been processed — working through the preliminary epistles, the choruses, the prologues and epilogues, and any other extended italic passages — the total decoded message runs to many thousands of characters, and the character of that message, its vocabulary and grammar and internal coherence, provides a powerful ongoing check that the classification is proceeding correctly. Any error in a vision-confirmed bit would produce an incoherent or impossible message character, flagging the mistake immediately and allowing it to be corrected before it propagates further.

The classification problem stated precisely

Every letter token in an italic passage must be assigned a binary label — a (Form 1, from the LC1/UC1 type case) or b (Form 2, from the LC2/UC2 type case). The complete label sequence for a passage of n italic letters is a binary string of length n. That string, read in non-overlapping windows of five, indexes into Bacon’s 24-character lookup table to produce ⌊n/5⌋ hidden characters.
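As an illustration of the windowed lookup, the following sketch decodes a label string five positions at a time. It assumes the standard 24-letter table ordered from aaaaa (A) to babbb (Z); the function name decode_labels and the use of "?" for the eight unused codes are choices made for this example.

```python
# Bacon's 24-letter alphabet (assumed standard ordering; I/J and U/V share codes).
ALPHABET = "ABCDEFGHIKLMNOPQRSTUWXYZ"
TO_BITS = str.maketrans("ab", "01")


def decode_labels(labels):
    """Read an a/b string in non-overlapping windows of five.

    Returns one hidden character per complete window, so floor(n/5) in total.
    Windows that hit one of the eight unused codes are shown as '?'.
    """
    hidden = []
    usable = len(labels) - len(labels) % 5
    for start in range(0, usable, 5):
        index = int(labels[start:start + 5].translate(TO_BITS), 2)
        hidden.append(ALPHABET[index] if index < len(ALPHABET) else "?")
    return "".join(hidden)


labels = "aabab baaaa aaaaa abbaa aaaba".replace(" ", "")
print(decode_labels(labels))  # FRANC
```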

The classification task is therefore: given a glyph image g of letter ℓ and two prototype sets P₁ (all 26 letters in Form 1) and P₂ (all 26 letters in Form 2), assign the label ŷ ∈ {a, b} according to whether g is closer, in some feature space, to P₁[ℓ] (label a) or to P₂[ℓ] (label b).

The reference alphabets as prototype classifiers

LC1 and LC2 are not just visual aids. They are the two class prototypes for a nearest-prototype binary classifier operating independently per letter type. For letter ℓ, the classifier is:

ŷ = a if dist(g, P₁[ℓ]) < dist(g, P₂[ℓ])

ŷ = b otherwise

The distance metric is not defined in closed form in this implementation — it is whatever similarity function Claude’s vision model computes internally when comparing two image patches. For the four clearly visible biforms (h, v, z, I/J), the inter-class distance in any reasonable feature space is large relative to intra-class variance, making the classification robust. For subtle biforms, the inter-class distance is smaller than typical intra-class variance from print quality variation, making automated classification unreliable without normalisation and careful feature selection.
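To make the decision rule concrete, here is a minimal sketch in which Euclidean distance over small hand-made feature vectors stands in for the undefined metric. The two features (counter closure and stroke weight), their values, and the function name classify are all hypothetical and are not measurements from the reference cards.

```python
import math


def classify(glyph, letter, p1, p2):
    """Nearest-prototype rule for one letter type.

    Returns 'a' if the glyph is strictly closer to the Form 1 prototype,
    and 'b' otherwise (so ties fall to 'b', as in the rule above).
    """
    d1 = math.dist(glyph, p1[letter])
    d2 = math.dist(glyph, p2[letter])
    return "a" if d1 < d2 else "b"


# Hypothetical feature vectors: (counter closure, stroke weight).
P1 = {"h": (0.0, 1.0)}  # open counter
P2 = {"h": (1.0, 1.0)}  # closed counter

print(classify((0.2, 1.1), "h", P1, P2))  # a
print(classify((0.9, 0.9), "h", P1, P2))  # b
```

For a structural biform the two prototypes sit far apart along one feature, so small print-quality noise does not flip the label. For a subtle biform the prototypes would sit close together and the same noise could.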

Information content of a confirmed bit

Each letter in the stream carries exactly 1 bit of cipher information — its Form a or Form b label. The 32 possible five-bit strings over {a,b}⁵ map onto 24 Bacon characters, so the code is slightly redundant (log₂ 24 ≈ 4.58 bits of message entropy per quintet, against 5 bits of label capacity). Eight of the 32 possible codes are unused.

When no bits in a quintet are confirmed, there are 2⁵ = 32 candidate label strings, mapping to 24 candidate Bacon characters.

Each confirmed bit reduces this by approximately half:

Confirmed bits    Candidate strings    Candidate characters (approx.)
0                 32                   24
1                 16                   12
2                 8                    6
3                 4                    3
4                 2                    2
5                 1                    1

The known message prefix adds a further hard constraint. If the quintet’s absolute index q satisfies q < |M| (where M is the known decoded message), then the only valid Bacon character at position q is M[q], eliminating all label strings that do not produce that character. Since each Bacon character corresponds to exactly one five-bit code, the message constraint fixes all five bits simultaneously — regardless of how many were confirmed by vision. Zero confirmed bits plus one known message position is sufficient for complete resolution.
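The candidate counts can be checked by enumeration. The sketch below assumes the standard 24-letter table ordered from aaaaa (A) to babbb (Z); the function name candidates is illustrative. It also shows that the halving is only approximate: the exact count depends on which position is confirmed and with which value.

```python
# Map each valid five-letter code to its Bacon character (assumed standard table).
ALPHABET = "ABCDEFGHIKLMNOPQRSTUWXYZ"
TO_AB = str.maketrans("01", "ab")
CODES = {format(i, "05b").translate(TO_AB): ch for i, ch in enumerate(ALPHABET)}


def candidates(confirmed):
    """Bacon characters whose code agrees with every confirmed position.

    confirmed maps a position (0-4) to 'a' or 'b'.
    """
    return [
        ch for code, ch in CODES.items()
        if all(code[pos] == bit for pos, bit in confirmed.items())
    ]


print(len(candidates({})))                     # 24
print(len(candidates({4: "a"})))               # 12
print(len(candidates({0: "b"})))               # 8
print(candidates({1: "b", 2: "a", 3: "b"}))    # ['L', 'M']
```

Averaged over all five positions and both bit values, one confirmed bit leaves 12 candidates, which is where the approximate figure in the table comes from.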

The constraint propagation algorithm

The bootstrap is a single-pass constraint propagation over a factor graph. The variables are the n binary labels {y₀, y₁, …, yₙ₋₁}. The factors are:

1. Vision factors: hard unary constraints yᵢ = c for positions where a clear biform was classified
2. Bacon factors: hard quintet constraints; for each window [y₅ₖ, …, y₅ₖ₊₄], the joint assignment must be one of the 24 valid codes
3. Message factors: hard character constraints; for quintet k with |M| > k, the quintet must decode to M[k]

Combining factors 2 and 3: for any quintet k within the known message, the valid joint assignment is exactly the single five-bit code B⁻¹(M[k]), where B⁻¹ is the inverse Bacon lookup. This makes the factor graph trivially soluble for all quintets within the message frontier: each quintet is an isolated cluster of five variables with a single valid joint state.

The propagation proceeds as:

    for each quintet k in [0, n//5):
        confirmed ← {i: y_{5k+i} for i where vision_factor is set}
        candidates ← {code ∈ B : code[i] = confirmed[i] for all i in confirmed}
        if |M| > k:
            candidates ← candidates ∩ {B⁻¹(M[k])}   # message hard constraint
        if |candidates| == 1:
            code ← candidates[0]
            for i in 0..4:
                y_{5k+i} ← code[i]   # resolve all five positions
            propagate to adjacent quintets if shared (none in this cipher)

Because quintets are non-overlapping, there is no lateral propagation between quintets. The graph is a collection of independent 5-cliques. Resolution within each clique is complete in O(1) once the message constraint is applied.
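A runnable version of the propagation loop might look like the sketch below. It assumes the standard 24-letter table ordered from aaaaa (A) to babbb (Z), represents vision factors as a dictionary from absolute position to bit, and leaves unresolved positions as None. The function name propagate and these data shapes are choices made for the example, not part of the original implementation.

```python
# Bacon's 24-letter alphabet (assumed standard ordering; I/J and U/V share codes).
ALPHABET = "ABCDEFGHIKLMNOPQRSTUWXYZ"
TO_AB = str.maketrans("01", "ab")
CODE_OF = {ch: format(i, "05b").translate(TO_AB) for i, ch in enumerate(ALPHABET)}
SHARED = {"J": "I", "V": "U"}


def propagate(n, vision, message):
    """Resolve labels for n letters, quintet by quintet.

    vision:  {absolute position: 'a' or 'b'} from clear biforms
    message: known hidden letters, one per quintet from the start
    Returns a list of n labels; unresolved positions stay None.
    """
    labels = [None] * n
    for pos, bit in vision.items():
        labels[pos] = bit
    for k in range(n // 5):
        base = 5 * k
        confirmed = {i: vision[base + i] for i in range(5) if base + i in vision}
        cands = {
            code for code in CODE_OF.values()
            if all(code[i] == bit for i, bit in confirmed.items())
        }
        if k < len(message):  # message hard constraint
            letter = message[k].upper()
            cands &= {CODE_OF[SHARED.get(letter, letter)]}
        if len(cands) == 1:   # unique joint state: write all five bits
            (code,) = cands
            labels[base:base + 5] = list(code)
    return labels


print(propagate(12, {1: "b", 6: "a"}, "L"))
```

In the usage line, the first quintet resolves to ababa because the message supplies L, while the second quintet keeps only its single vision bit because no message letter is available for it. A quintet whose vision bits contradict the message ends with an empty candidate set and is left unresolved.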

Convergence properties

The algorithm converges in at most two iterations in the regime where the known message covers all quintets on the page:

Iteration 1: Every quintet within the message frontier resolves completely. The back-filled bit values for positions not confirmed by vision are written into the state vector.

Iteration 2: No new quintets resolve (all were resolved in iteration 1). Stable.

The running output of the algorithm confirms this: 7/7 quintets resolved in iteration 1 with 5 vision-supplied bits covering 14% of the stream. The remaining 30 bits (86%) were determined entirely by the Bacon inverse and the message constraint — no image comparison was performed for them.

The bootstrap as a frontier-advancing process

The constraint system has a strict frontier at position q* = |M|. For quintets with absolute index k < q*, resolution is guaranteed by the message constraint regardless of vision input. For quintets with k ≥ q*, the message constraint is unavailable and resolution depends entirely on the number of vision-confirmed bits per quintet.

Decoding one new quintet advances q* by one, making the next quintet resolvable on the following iteration. The process is therefore self-extending: each resolved quintet strictly expands the domain over which the message constraint applies.

The frontier advances by exactly one position for each quintet decoded, so the advance per page is bounded by the number of new italic letters processed. For a 1,400-letter page (e.g. Henry V Prologue), that is up to 280 new quintets, advancing the frontier by as many as 280 positions and making the following page's quintets resolvable with correspondingly less vision input.

The role of clear biforms as anchors

For quintets beyond the current frontier, the message constraint is absent and resolution requires vision bits. The expected number of clear-biform positions per quintet depends on the letter frequency distribution of the text and the size of the clear-biform set {h, v, z, I, J}.

In Early Modern English italic prose, h alone appears at approximately 5–6% frequency. Across a five-letter quintet, the probability of at least one h is:

P(at least one h in quintet) = 1 − (1 − 0.055)⁵ ≈ 0.25

Adding v (≈2%), z (<1%), I/J (≈2%) raises the combined clear-biform frequency to roughly 10%, giving:

P(at least one clear biform in quintet) = 1 − (0.90)⁵ ≈ 0.41
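Both figures can be reproduced directly. The sketch assumes, as the formulas do, that letters within a quintet occur independently at a fixed frequency; the function name is illustrative.

```python
def p_at_least_one(frequency, group_size=5):
    """Probability that a group contains at least one letter of the given
    per-letter frequency, assuming independent positions."""
    return 1.0 - (1.0 - frequency) ** group_size


print(round(p_at_least_one(0.055), 2))  # 0.25  (h alone)
print(round(p_at_least_one(0.10), 2))   # 0.41  (h, v, z, I/J combined)
```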

Approximately 40% of quintets contain at least one clear-biform position, giving one confirmed bit. With one confirmed bit the candidate set reduces from 24 to 12. For quintets at the frontier (message constraint absent), one confirmed bit is generally insufficient for unique resolution — two or more are needed.

However, two confirmed bits in a quintet are often enough when combined with a partial message prior: if the decoded message has been running coherently as English/Latin prose, a bigram or trigram language model over the decoded sequence assigns near-zero probability to most of the roughly 6 residual candidates, typically leaving 1–3 viable options. A trigram English character model reduces ambiguity to a single candidate in approximately 70–80% of cases with just two confirmed bits, recovering unique resolution in those cases without requiring the full message constraint.
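The idea of ranking residual candidates by a language prior can be illustrated with a toy model. The sketch below uses raw bigram counts rather than the trigram model mentioned above, and it trains on a single phrase purely for demonstration, so it says nothing about the quoted accuracy figures. The function names and the candidate list are hypothetical.

```python
from collections import Counter


def bigram_counts(corpus):
    """Count adjacent letter pairs, ignoring case, spaces and punctuation
    (the cipher stream itself carries no word boundaries)."""
    letters = [ch for ch in corpus.upper() if ch.isalpha()]
    return Counter(zip(letters, letters[1:]))


def rank_candidates(previous_letter, candidates, counts):
    """Order candidate hidden letters by how often each follows the
    previous decoded letter; ties keep their original order."""
    return sorted(
        candidates,
        key=lambda ch: counts[(previous_letter, ch)],
        reverse=True,
    )


counts = bigram_counts("the mighty heroes of Troy")  # toy corpus
print(rank_candidates("H", ["Q", "E", "X"], counts))  # ['E', 'Q', 'X']
```

A real prior would need a much larger training text and smoothing for unseen pairs; the sketch only shows where such a model would plug into the candidate list left by the vision bits.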

The asymptotic regime

As the decoded message length grows across multiple pages, the language model prior strengthens because longer context windows provide better prediction of the next character. By the time the frontier has advanced through the preliminary matter and several full play texts — several thousand decoded characters — the prior alone may be sufficient to resolve quintets with zero vision bits, because the joint probability of the decoded sequence under a trained character n-gram model is overwhelmingly concentrated on a single continuation.

At that point the classification problem effectively inverts: instead of using letterform comparison to determine bits, the bits are determined by the message, and the bit assignments serve as post-hoc verification that the correct letterform was identified — closing the loop between the cipher and the typographic evidence.

The two reference cards

Imagine you have two strips of paper, each showing the complete italic alphabet — every letter from a to z — but printed from two subtly different sets of metal type blocks. Call them Card 1 and Card 2. Most letters look nearly identical between the two cards. A few look obviously different — the letter h, for instance, has a clearly open arch on Card 1 and a closed, rounded arch on Card 2 that makes it look like an italic b.

These two cards are LC1 and LC2 for lowercase letters, UC1 and UC2 for uppercase. They are the master keys.

What you do with each letter on a page

You open the First Folio to an italic passage. You hold Card 1 alongside the page so that the a on the card sits next to every a in the text, the b sits next to every b, and so on. Then you hold Card 2 alongside in the same way.

For each letter in the text, you ask one question: does this letter look more like the version on Card 1, or the version on Card 2?

If it matches Card 1, write down the letter a. If it matches Card 2, write down b.

You do this for every single letter in the italic passage, skipping spaces and punctuation, until you have a long string of a’s and b’s. Then you take them in groups of five and look each group up in a table — Bacon’s 24-letter table — which converts each group of five into one hidden letter. String those hidden letters together and a concealed message appears.

Why most pages have no pre-existing answer sheet

For the Troilus and Cressida Prologue we had a pre-made answer sheet — the TSV file — which told us for every single letter in that passage whether it was Form a or Form b. Someone had already done the comparison work, letter by letter, and written it all down.

For the other 900 pages, no such answer sheet exists. You have to do the comparison yourself.

The practical difficulty

The differences between the two card forms are not always large. For h, v, z, and I/J the difference is obvious enough to spot with the naked eye. For most other letters — e, r, s, o, n and so on — the difference is much more subtle. It might be that the stroke is slightly thicker, or the curve is slightly rounder, or a tiny serif is slightly longer. These differences are real and consistent — they were cut into metal type blocks — but they require careful, close comparison against the reference cards rather than a quick glance.

The bootstrap: how knowing part of the answer helps you find the rest

Here is the key insight. Once you have decoded the Prologue — which we have — you know the first 224 letters of the hidden message: Francis St Alban descended from the mighty heroes of Troy… and so on. That known message becomes a powerful tool for decoding the next page.

Think of it like a crossword puzzle. Suppose you are trying to fill in a five-letter answer and you already know from another clue that the third letter must be Q. That one confirmed letter dramatically narrows how many words could possibly fit. In most cases it narrows it to just one.

The cipher works the same way. Each group of five letters on the page becomes a five-letter code, and that code must decode to exactly one letter of the hidden message. If you already know what that hidden letter should be — because it is the next letter in the already-decoded sequence — then you know what the complete five-bit pattern must be. And if you know the complete pattern, you automatically know the form of every letter in that group of five, even the ones whose subtle differences you could not see clearly.

Walking through a new page, step by step

Step one. Find all the italic letters on the page and write them out in order, skipping spaces and punctuation.

Step two. Go through the list and mark every h, v, z, I and J. These are the letters where the difference between Card 1 and Card 2 is large enough to see clearly. Look at each one against the two reference cards and write down a or b.

Step three. Group the letters into fives. In any group where you have marked even one or two letters from Step two, you already know one or two of the five positions. Look up which hidden letters are possible given those confirmed positions. In most cases the known message — the tail end of what has already been decoded — narrows it to just one possibility. That one possibility tells you what all five positions must be, including the ones you could not see clearly.

Step four. Write down the newly decoded hidden letter. That letter is now added to the end of the known message, which means it becomes available to help decode the next group of five on the same page or the page that follows.

Step five. Repeat, working through every group of five letters on the page. Each group you successfully decode extends the known message by one letter, which makes the next group easier to resolve.

Why this works so well in practice

Two things conspire to make the process surprisingly efficient.

The first is that h is a common letter in English prose, appearing roughly once in every eighteen to twenty letters. In a typical page of italic Folio text you will encounter it many times. Each occurrence gives you a free confirmed bit, the clearest and most reliable of all the biform distinctions, and these bits are scattered throughout the page at frequent intervals. Those confirmed bits act like tent pegs, anchoring each nearby group of five and allowing the known message to resolve the rest.

The second is that the known message itself is in plain English and Latin prose, which is highly predictable. Once you have decoded twenty or thirty letters, the probable continuation is strongly constrained by the grammar and vocabulary of the language. Even if no confirmed letter falls within a particular group of five, the message context often narrows the possibilities to one — especially when combined with even one confirmed bit from a subtle biform.

The frontier and what stops it

The process leapfrogs forward, each decoded letter making the next one easier to decode. The only thing that stops it is if you reach a stretch of text where no h, v, z or I/J appears for several consecutive groups of five, and the message context is also ambiguous. In that situation the group remains uncertain until either a later confirmed bit casts backward light on it, or a human expert examines the subtle biforms in that stretch under magnification.

In practice, for a page like the Henry V Prologue, which is rich in h (from words like high, hear, his, he, Harry, here), this kind of stall almost never occurs. The confirmed bits from those letters alone are enough to pin down every group, with the known message filling in whatever gaps remain.

What the 900 pages add up to

Each play’s italic passages extend the decoded message by hundreds or thousands of letters. As the message grows longer, the constraint it provides grows stronger. By the time you reach the later plays in the Folio, the known message is so long that even a single confirmed h in a group of five is enough to resolve that group with near-certainty, and groups with no confirmed bits at all can often be resolved from the message alone. The system becomes more powerful the further into the Folio you go — which is, of course, exactly what you would expect if the whole thing was designed by a very careful mind.