Home     Dictionary introduction     Rapanui-English     English-Rapanui     Grammar

How the Rapanui-English Dictionary was Produced


At first, Englert's dictionary was translated entirely by hand. This proved so painfully time-consuming that the process had to be automated. After the dictionary had been scanned from Englert's "La Tierra de Hotu Matu'a," the first task was to train the OCR software to recognize the Greek letter eta used by Englert for the velar nasal, and to interpret it as "g". The result was saved in Rich Text Format. A piece of software was written to strip the RTF files down to plain ASCII while retaining the font information: regular, bold, italics. Since dictionary entries are in boldface and Rapanui examples in italics, this made it possible to distinguish between lexical items (boldface), their Spanish translations (regular) and the examples (italics for the Rapanui text, regular for the Spanish).

At this stage, the dictionary file looked like this:

{\bpa}
	{rodear. circundar; campo cercado, sitio cerrado. propiedad
	particular; }
	{\ii agapó ku-rere-mai-  te kori kiroto ki tooku pa, }
	{anoche entró un ladrón en mi sitio. }
{\bp }
	{a veces en lugar de }
	{\ipe: pahé }
	{= }
	{\ipehé. }

Note how the RTF format has been reduced to the bare minimum: \b for boldface, \i for italics.
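
As an illustration only, the stripped-down format can be read back into (style, text) pairs with a few lines of code. The original tools were written in Euphoria and are not reproduced here; the Python sketch below is an assumption about how such a reader might look, and the names GROUP, STYLES and parse_runs are hypothetical. It assumes that groups are never nested and that the one-letter code \b or \i is glued directly to the text that follows, as in the excerpt.

```python
import re

# One {...} group, optionally starting with \b (bold) or \i (italics).
# Assumption: groups are not nested and the style code is a single letter.
GROUP = re.compile(r"\{(?:\\([bi]))?([^{}]*)\}")
STYLES = {"b": "bold", "i": "italic", None: "regular"}

def parse_runs(text):
    """Return a list of (style, text) pairs, one per {...} group."""
    runs = []
    for match in GROUP.finditer(text):
        style = STYLES[match.group(1)]
        # Collapse line breaks and repeated blanks inside a group.
        body = " ".join(match.group(2).split())
        runs.append((style, body))
    return runs

SAMPLE = r"""{\bpa}
    {rodear. circundar; campo cercado, sitio cerrado. propiedad
    particular; }
    {\ii agapó ku-rere-mai-  te kori kiroto ki tooku pa, }
    {anoche entró un ladrón en mi sitio. }
"""

for style, body in parse_runs(SAMPLE):
    print(style, "|", body)
```

Bold runs are then the lexical items, italic runs the Rapanui examples, and regular runs the Spanish text.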

It was evident that there were many mistakes, some due to the OCR but many in the original, so more software was written to trace them. This software took as input the stripped-down RTF file shown above and produced a frequency count of all the Rapanui words in it. The output was examined for words with a frequency of 1, these being likely to be mistakes. This showed that in many cases the OCR had missed a space and run two words together, or had inserted a non-existent space. It also showed that, in the original, the Greek eta was often missing, replaced with a space. Those mistakes in the input file were corrected by hand, and the new file was again submitted to the same frequency count. After twelve iterations, it became obvious that little more could be done automatically.
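
The frequency count described above can be sketched as follows. This is a Python illustration, not the original Euphoria program; the names ITALIC, WORD, rapanui_frequencies and singletons are hypothetical. It assumes that Rapanui text is exactly what sits inside \i groups, and that hyphens and punctuation separate words, which may not match how the original program tokenized.

```python
import re
from collections import Counter

# Rapanui text is assumed to be the content of {\i ...} groups.
ITALIC = re.compile(r"\{\\i([^{}]*)\}")
# A word is a run of letters; hyphens and punctuation act as separators.
WORD = re.compile(r"[^\W\d_]+")

def rapanui_frequencies(text):
    """Count every word found in the italic (Rapanui) runs."""
    counts = Counter()
    for run in ITALIC.findall(text):
        counts.update(word.lower() for word in WORD.findall(run))
    return counts

def singletons(counts):
    """Words seen exactly once: the candidates for manual review."""
    return sorted(word for word, n in counts.items() if n == 1)

SAMPLE = r"""{\bpa}
    {\ii agapó ku-rere-mai te kori kiroto ki tooku pa, }
    {anoche entró un ladrón en mi sitio. }
    {\ite kori}
"""

counts = rapanui_frequencies(SAMPLE)
print(singletons(counts))
```

On a full dictionary file the list of singletons is what would be read by eye; on a four-line sample nearly every word is a singleton, so the output is only meaningful at scale.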

However, one task remained which could be automated: checking that every word in the Rapanui examples had a corresponding entry in the dictionary. It turned out that some did not. For instance, one looks in vain in the original for such a basic word as toru "three". The missing entries were inserted. Matching the words in the examples against the dictionary entries revealed more mistakes, which were corrected by hand; the resulting file was then submitted to the same checking algorithm again, and the process was repeated until no more errors were detected. Another piece of software then rewrote the file into a more human-friendly format and split it into one file per letter, looking like this:

||pa
	|=rodear. circundar; campo cercado, sitio cerrado. propiedad
	particular; 
	|:i agapó ku-rere-mai-  te kori kiroto ki tooku pa,
	|=anoche entró un ladrón en mi sitio.

	|-rodear. circundar; campo cercado, sitio cerrado. propiedad
	particular; 
	|:i agapó ku-rere-mai-  te kori kiroto ki tooku pa,
	|-anoche entró un ladrón en mi sitio.


||pa
	|=a veces en lugar de
	|:pe: pahé =  pehé.

	|-a veces en lugar de
	|:pe: pahé =  pehé.

Note how dictionary entries are announced by a double pipe ||, the Spanish text by |= or |-, and the Rapanui by |:.

Note also how the Spanish text is repeated, once announced by |= and once by |-. The |- copy is where the English translation is to be added, overwriting the Spanish.
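
With the markers in place, the cross-check of example words against dictionary entries becomes simple to express. The sketch below is a Python illustration under stated assumptions, not the original Euphoria code: headwords_and_examples and missing_entries are hypothetical names, a word is taken to be any run of letters, and an unmarked line is treated as the continuation of the line before it.

```python
import re

WORD = re.compile(r"[^\W\d_]+")  # assumption: a word is a run of letters

def headwords_and_examples(text):
    """Collect the || headwords and the |: Rapanui example lines."""
    headwords, examples = set(), []
    in_example = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("||"):
            headwords.add(line[2:].strip().lower())
            in_example = False
        elif line.startswith("|:"):
            examples.append(line[2:].strip())
            in_example = True
        elif line.startswith("|"):
            in_example = False            # |= or |- : Spanish or English text
        elif line and in_example:
            examples[-1] += " " + line    # unmarked continuation line
    return headwords, examples

def missing_entries(text):
    """Example words that have no dictionary entry of their own."""
    headwords, examples = headwords_and_examples(text)
    words = {w.lower() for ex in examples for w in WORD.findall(ex)}
    return sorted(words - headwords)

SAMPLE = """||pa
    |=rodear. circundar; campo cercado
    |:i agapó ku-rere-mai te kori kiroto ki tooku pa,
    |=anoche entró un ladrón en mi sitio.
||kori
    |=ladrón
"""

print(missing_entries(SAMPLE))
```

A check of this kind only tests whether a headword exists. It cannot tell whether the sense used in an example is among the senses listed under that headword.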

More errors were discovered and corrected during the process of inserting the English translation. Sometimes the OCR had missed the change to boldface, and what should have been a separate entry found itself amidst Spanish translations.

The above excerpt now looked like this:

||pa
	|=rodear. circundar; campo cercado, sitio cerrado. propiedad
	particular; 
	|:i agapó ku-rere-mai-  te kori kiroto ki tooku pa,
	|=anoche entró un ladrón en mi sitio.

	|-to surround; enclosed field; private property; 
	|:i agapó ku-rere-mai-  te kori kiroto ki tooku pa,
	|-last night a thief entered my property. 

||pa
	|=a veces en lugar de
	|:pe: pahé =  pehé.

	|-sometimes found instead of
	|:pe: pahé =  pehé.

A final piece of software was written to turn this into HTML, giving, for the example above:

<LI>
<B><a href="#TOP">pa</A>,</B>
<OL TYPE="1">
	<LI>to surround; enclosed field; private property;
	<I>i agap&oacute; ku-rere-mai-&aacute; te kori kiroto ki tooku pa,
	</I>last night a thief entered my property.
</LI>
<LI>sometimes found instead of
	<I>pe: pah&eacute; =  peh&eacute;.</I>
</LI>
</OL>

This program also took care of the headers and links to the other URLs, producing ready-to-upload HTML code.
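
The core of such a conversion can be sketched as below. This is a Python illustration, not the original Euphoria program, and it leaves out the page headers and the links. The names to_entities, parse_senses and render are hypothetical. It assumes that consecutive blocks with the same headword become numbered senses of one entry, that only the |- and |: lines are published, and that those lines do not wrap.

```python
from html import escape
from html.entities import codepoint2name

def to_entities(text):
    """Escape &, <, > and replace non-ASCII letters with named entities."""
    out = []
    for ch in escape(text, quote=False):
        code = ord(ch)
        if code > 127 and code in codepoint2name:
            out.append("&" + codepoint2name[code] + ";")
        elif code > 127:
            out.append("&#%d;" % code)   # no named entity: numeric fallback
        else:
            out.append(ch)
    return "".join(out)

def parse_senses(text):
    """Return [(headword, [(marker, text), ...]), ...] for |- and |: lines."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("||"):
            entries.append((line[2:].strip(), []))
        elif entries and line[:2] in ("|-", "|:"):
            entries[-1][1].append((line[:2], line[2:].strip()))
    return entries

def render(entries):
    """Group consecutive blocks with the same headword into one <OL>."""
    html, previous = [], None
    for head, parts in entries:
        if head != previous:
            if previous is not None:
                html.append("</OL>\n</LI>")
            html.append('<LI>\n<B>%s,</B>\n<OL TYPE="1">' % to_entities(head))
            previous = head
        html.append("<LI>")
        for marker, body in parts:
            body = to_entities(body)
            html.append("<I>%s</I>" % body if marker == "|:" else body)
        html.append("</LI>")
    if previous is not None:
        html.append("</OL>\n</LI>")
    return "\n".join(html)

SAMPLE = """||pa
    |=rodear. circundar; campo cercado
    |-to surround; enclosed field; private property;
    |:i agapó ku-rere-mai-á te kori kiroto ki tooku pa,
    |-last night a thief entered my property.

||pa
    |=a veces en lugar de
    |-sometimes found instead of
    |:pe: pahé =  pehé.
"""

print(render(parse_senses(SAMPLE)))
```

Named entities such as &oacute; are used here only to mirror the excerpt above; numeric references or plain UTF-8 output would work as well.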

Yet that was not the end of it. Browsing the dictionary in a Web browser revealed that some entries were still missing. For instance, the very common word roto "inside" was missing. Why had this not been reported by the cross-checking programs? Because roto has other meanings, such as "lagoon," and was present in the dictionary under those meanings.

All software was written in Euphoria, an interpreted language for DOS, Windows, and Linux, which combines the power of LISP and APL with the ease of BASIC. Although interpreted, it executes quite fast: a benchmark run on its earliest available version showed it to be at best as fast as Borland Pascal and at worst 8 times slower.

Should you be interested: www.rapideuphoria.com


La Tierra de Hotu Matu'a: Historia y Etnologia de la Isla de Pascua, Gramática y Diccionario del Antiguo Idioma de la Isla, by Padre Sebastian Englert, O.F.M.Cap. Sixth edition, Editorial Universitaria, Santiago de Chile, 1993. Difficult to obtain, about US$50 in second-hand bookshops. Try www.biblio.com and www.abebooks.com
