Home      Rongorongo concordance


An Algorithm for Producing Concordances and Random Text

Nota Bene: this algorithm is reproduced below verbatim from the archives of the Voynich Interest Group.


Monkey Breeding in a Few Easy Lessons

Lesson 1: Monkey I.

In this lesson we shall teach a monkey to ape this simple text:

ABABABABABC
First, sort into alphabetical order all its substrings up to a length of your choice, say, up to four letters long, and call it CONCORDANCE OF THE COLLECTED WORKS, or THE CONCORDANCE, for short.
ABAB
ABAB
ABAB
ABAB
ABC
BABA
BABA
BABA
BABC
BC
C
Second, decide to what order of approximation you wish to have Monkey ape your text. Third order? Second order? Fifth order? Sorry, you can't have fifth order because those substring are only four letters long. The most you can have is fourth-order. Fourth order? Fourth order it is. Let N=4.

Good. Now here is the idea: BABA is the beginning of Monkey's collected works, and it consists of his last pearl of wisdom, A, in the context of his N-1 previous works, namely, BAB. Each new pearl of wisdom will now depend on its previous context. What is the context of Monkey's next pearl of wisdom? Why, ABA, of course. We now look in THE CONCORDANCE for all the strings that start with the context ABA. Lo and behold, there are four of them! We pick one at random. Its fourth letter (N=4) shall be Monkey's next pearl of wisdom, B. So be it. We have so far: BABAB. The context for Monkey's next pearl of wisdom is now BAB (the last three letters), we look for all the strings that start with BAB, pick one, output its fourth letter, and so on, and so on. With a bit of luck we will at some stage pick the string which has C for its fourth letter (BABC). The context will be ABC, and that will be the end of Monkey's works, as no string in THE CONCORDANCE starts with ABC.

Lesson 2. Turbo Monkey, Part I.

Wherein we turbocharge our monkey.

First we number the entries in THE CONCORDANCE.

Next, we count how many occurrences there are of each three-letters context, and we write that figure next to the first occurrence. But next to strings shorter than four letters we write zero, to show that those contexts are dead ends. Like this:

 1  4 ABAB
 2    ABAB
 3    ABAB
 4    ABAB
 5  0 ABC
 6  4 BABA
 7    BABA
 8    BABA
 9    BABC
10  0 BC
11  0 C
Now we fill each gap with -1, -2, and so on, as will fit in. Like this:
 1  4 ABAB
 2 -1 ABAB
 3 -2 ABAB
 4 -3 ABAB
 5  0 ABC
 6  4 BABA
 7 -1 BABA
 8 -2 BABA
 9 -3 BABC
10  0 BC
11  0 C
Our monkey is now turbocharged.

Lesson 3. Turbo Monkey, Part II.

In which we learn to drive our turbocharged monkey.

First we start the turbocharged monkey like the plain monkey, picking an entry at random out of THE CONCORDANCE, writing it out, and noting the context for the next letter to be written. Search now for an occurrence of the current context in THE CONCORDANCE, using, for instance, a binary search. Say the binary search for context ABA returns entry number 9:

9 -3 BABC
The -3 tells you that the first occurrence of context BAB is in entry number 9-3, i.e. entry number 6. Look it up:
6 4 BABA
The 4 there tells you that there are 4 occurrences of context BAB in THE CONCORDANCE: 6, 7, 8, 9. Pick one at random, and output its fourth letter. Update the context (remember: the context is always the last N-1 letters written), and continue in the same manner.

With this indexing system, text is generated so fast, even on a lowly XT, that it fills the screen faster than you can read it!

Lesson 4. Turbo Monkey, Part III.

In which we learn to compute the fourth-order entropy of THE COLLECTED WORKS (copyright Frogguy, 1990, 1997).

Consider once again the turbocharged engine of your monkey, also known as THE CONCORDANCE OF THE COLLECTED WORKS (tm:). Count its contexts (sum the positive numbers next to each entry, ignore the negative ones, which are pointers back to the first occurrence of each context):

 1   4 ABAB
 2  -1 ABAB
 3  -2 ABAB
 4  -3 ABAB
 5   0 ABC
 6   4 BABA
 7  -1 BABA
 8  -2 BABA
 9  -3 BABC
10   0 BC
11   0 C

Sum: 8
Now, compute the entropy for each context. That is:
- sumOf (f[i]*log(f[i]))
in which f[i] denotes the relative frequency of letter i in that context and log(x) denotes the base-2 logarithm of x.

In context ABA (entries 1 to 4), only B is found. Its relative frequency is 100%, which is 1. So, the entropy for context ABA is -1*log(1) = 0.

In context BAB (entries 6 to 9), A is found 3 times, C one time. The relative frequency of is therefore 0.75, that of C is 0.25. The entropy for context BAB is then: - (0.75*log(0.75)+0.25*log(0.25)) = 0.8113

There are no other productive contexts. Now, add the products of the relative frequency of each context by its entropy. The result is the fourth-order entropy of the text:

Context ABA: relative frequency = 4/8
             entropy = 0                4/8*0      = 0
Context BAB: relative frequency = 4/8
             entropy = 0.8113           4/8*0.8113 = 0.4056

Entropy of text: 0+0.4056 = 0.4056
End of lesson.

HOMEWORK

Using THE CONCORDANCE, compute the third, second, and first-order entropy of THE COLLECTED WORKS (copyright Frogguy 1990, 1997).

SUPPLEMENTARY HOMEWORK (for the really, really, dedicated)

Figure out at least one way of computing the correlation between the second-order and fourth-order frequency tables as defined in my "Eureka!" posting. (I found two ways. One is computable from the concordance as built above, the other needs two separate concordances. The former will work for letter-monkeys -- those monkeys for whom a single letter is a pearl of wisdom, but it would need far too much RAM storage for word-monkeys)

NOTA BENE

The algorithm I gave for computing the entropy is equivalent to the one in Bennett 1976. It treats the sample text as if it were the whole of the virtually possible corpus. In the next volume of "Monkey-Breeding For The Millions" we shall see how to code it. I still do not know if I should talk about solving how to estimate the entropy of the population from the sample (the gist of what Jim Reeds sent me 5 years ago), since the entropy of texts does not seem a much useful measure for our purposes. Opinions, please!


Home     Rongorongo concordance