APPENDIX 5: AUTOMATED INTERFACE FOR THE L2000 STUDY

Introduction

Obtaining the results for the study of the place of Latin languages on the Internet involved a long, meticulous and repetitive measurement of the results given by the search engines. Each final result (per search engine) required counting the number of pages containing each of the 1200 or so graphical variants of the 57 terms used (1600 once German had been added). This process was followed by manual corrections of the 73 variants which had homographic problems (shown in red and capitals in the table of terms in Appendix 3). The results returned for each variant were then grouped by occurrence to arrive at 57 scores, classed by language. An average of the scores was then worked out, and the coefficient of variation and the confidence interval were calculated for each language in order to get the final result. To this must be added the meticulous verification of the results, some of which were undoubtedly distorted by typing errors or by lapses of attention caused by continuous repetitive tasks.

None of these results (by term, by variant or final) is easy to revise: an error in a score detected a posteriori, when writing up, requires a systematic recalculation of the term concerned and of the final statistics.

The automation of these manual procedures is therefore desirable. It was decided to invest in the design of an algorithm capable of submitting the 1600 or so variants to the search engines, collating the counts and organising them so that, after correction of the homographs, the final statistical calculations can be made. This automation also allows us to use several search engines without any significant further work.

Technology used

To allow optimal management of the whole, it was decided to use a database as the central element through which the different applications could be linked.
The database runs under PostgreSQL, which is widely used on the Internet, and the PHP programming language was chosen for the interface between the database and the Internet applications.

System function details

Database

The database is made up of three main tables which are the framework of the automated system:

A table of the 1600 graphical variants of the terms: this table holds the different variants of the sample. They are classed according to the term and the language to which they belong, and are recorded with their associated parameters: homographies, etc. Appendix 3 shows the contents of this table.

Scores table: all the results (page counts) obtained from the search engines via the PHP interface are kept here. These scores are classed according to the variants to which they relate and the search engine on which the score was found.

Table of results by concept: once the 1600 variants have been entered into the database and the score of each one obtained, the results for each term are calculated. This is done by adding the scores of the variants belonging to the same term in the same language. The results obtained in this way (classed by word, by search engine and by language) are used to arrive at the final results and serve as the basis for the statistical calculations.
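The calculation of the results by concept can be sketched as follows. This is only a minimal illustration written in Python rather than the PHP of the actual system; the record layout, the engine names and the page counts are all invented for the example.

```python
from collections import defaultdict

# Hypothetical score records: (term, language, variant, engine, page_count).
# All names and numbers are invented for illustration only.
scores = [
    ("tuesday", "fr", "mardi", "engine_a", 1000),
    ("tuesday", "fr", "mardis", "engine_a", 200),
    ("tuesday", "es", "martes", "engine_a", 900),
    ("tuesday", "fr", "mardi", "engine_b", 500),
]


def results_by_concept(records):
    """Sum the variant scores per (term, language, engine)."""
    totals = defaultdict(int)
    for term, language, _variant, engine, count in records:
        totals[(term, language, engine)] += count
    return dict(totals)


totals = results_by_concept(scores)
```

Keeping the search engine in the key means that each engine yields its own set of term results, which is what allows the engines to be compared afterwards.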

The design is open: the list of languages and the selected search engines are set as parameters. This set-up allows great flexibility in adding new terms, search engines and languages.
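One way to picture such an open design is a schema in which languages and search engines are rows in parameter tables rather than fixed columns. The sketch below is an assumption, not the schema of the study: the table and column names are hypothetical, and SQLite (through Python's standard library) stands in for PostgreSQL so that the example is self-contained.

```python
import sqlite3

# Hypothetical schema: every table and column name here is an assumption.
SCHEMA = """
CREATE TABLE languages (code TEXT PRIMARY KEY);
CREATE TABLE engines (name TEXT PRIMARY KEY);
CREATE TABLE variants (
    id INTEGER PRIMARY KEY,
    term TEXT NOT NULL,
    language TEXT NOT NULL REFERENCES languages(code),
    form TEXT NOT NULL,
    homograph INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE scores (
    variant_id INTEGER NOT NULL REFERENCES variants(id),
    engine TEXT NOT NULL REFERENCES engines(name),
    page_count INTEGER NOT NULL,
    PRIMARY KEY (variant_id, engine)
);
CREATE TABLE results_by_concept (
    term TEXT NOT NULL,
    language TEXT NOT NULL REFERENCES languages(code),
    engine TEXT NOT NULL REFERENCES engines(name),
    total INTEGER NOT NULL,
    PRIMARY KEY (term, language, engine)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Adding a language or a search engine is a row insert, not a schema change.
conn.execute("INSERT INTO languages (code) VALUES ('de')")
conn.execute("INSERT INTO engines (name) VALUES ('engine_a')")

table_names = {
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
}
language_count = conn.execute("SELECT COUNT(*) FROM languages").fetchone()[0]
```

With this layout, the addition of German described in the introduction would amount to one new row in the languages table plus its variants.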

The PHP Interfaces

The PHP software is made up of three types of interface:

The data capture interfaces

These interfaces are for filling the graphical variant and results fields in the database.

The first is a user interface available to the administrator for specifying the graphical variants and their associated properties¹. The second is a machine interface which stores the results for each of the chosen search engines while the program is running. The data keyed in through the first interface are kept from one program execution to the next; those in the second interface are volatile and are replaced each time the program is run. Before being replaced, the previous data are backed up.
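The "back up, then replace" behaviour of the machine interface might look like the following sketch. It is written in Python for illustration, uses in-memory structures in place of the database, and every name in it is hypothetical.

```python
import copy
from datetime import datetime, timezone

# Hypothetical in-memory stand-ins for the scores table and its backups.
current_scores = {}   # {(variant, engine): page_count}
backups = []          # list of (timestamp, snapshot) pairs


def store_run(new_scores, now=None):
    """Back up the previous scores, then replace them with the new run."""
    stamp = now or datetime.now(timezone.utc)
    if current_scores:
        backups.append((stamp, copy.deepcopy(current_scores)))
    current_scores.clear()
    current_scores.update(new_scores)


# Two successive runs: the first is backed up when the second arrives.
store_run({("mardi", "engine_a"): 1000})
store_run({("mardi", "engine_a"): 1100})
```

Dating each snapshot is what would later make it possible to compare runs over time.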

The data processing interfaces

There are two types: those which allow the modification of the "results" table of the graphical variants, and those which allow the calculation of the values associated with each word. The first type is used to correct the following homographic problems:

The partition of the count between Spanish and Portuguese was carried out automatically pro rata from the partial results of the study

For this reason the forms CAL and CAI have not been counted

the form CAII has been excluded

faça

faças

The results given were calculated a posteriori by using the same method as for the words ending in "-IDADES" on which were based the coefficients for Portuguese in relation to English. The form BOLI

bolígrafo

JOI

JOIA

jóia

The results given were calculated a posteriori by using the same method as for the words ending in "-IDADES" on which were based the coefficients for Portuguese in relation to English. MARTI, without diacritical marks, is a homograph of the name of a celebrity (José Martí), so its result has not been counted for the Rumanian equivalent of "mardi". The result for MARDI GRAS has been subtracted from the result for MARDI (Tuesday) in French so as not to count its frequent use in English.
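Two of the corrections listed above lend themselves to a short worked example: the pro rata split of a shared count between two languages, and the removal of the MARDI GRAS pages from the MARDI count. The exact formulas used in the study are not given, so the Python sketch below is an assumption, and all the figures in it are invented.

```python
def split_pro_rata(shared_count, partial_a, partial_b):
    """Split a shared count in proportion to two partial results."""
    total = partial_a + partial_b
    if total == 0:
        return 0.0, 0.0
    share_a = shared_count * partial_a / total
    return share_a, shared_count - share_a


def subtract_phrase(word_count, phrase_count):
    """Remove the pages counted for a longer phrase, never going below zero."""
    return max(word_count - phrase_count, 0)


# A homograph found on 1000 pages, with partial results of 300 (Spanish)
# and 100 (Portuguese) obtained from the unambiguous variants.
spanish_share, portuguese_share = split_pro_rata(1000, 300, 100)

# Pages containing MARDI, less the pages containing MARDI GRAS.
mardi_french = subtract_phrase(5000, 1200)
```

Here the Spanish share is 750 pages and the Portuguese share 250, and the corrected French count is 3800.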

Also, the use of some search engines required further processing of certain results, especially for search engines that do not differentiate diacritical marks or that handle them erratically.
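The further processing itself is not described, but the variants affected can at least be identified mechanically. The following Python sketch, which is an illustration and not the method of the study, groups the forms that become identical once diacritical marks are ignored; the unaccented forms in the example list are invented.

```python
import unicodedata
from collections import defaultdict


def strip_diacritics(text):
    """Return the text with combining marks removed."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def colliding_variants(variants):
    """Group variants that become identical once diacritics are ignored."""
    groups = defaultdict(list)
    for form in variants:
        groups[strip_diacritics(form).lower()].append(form)
    return {key: forms for key, forms in groups.items() if len(forms) > 1}


collisions = colliding_variants(["jóia", "joia", "bolígrafo", "faça", "faca"])
```

On a search engine that ignores diacritical marks, each group found in this way would return a single merged count that then has to be corrected.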

The interfaces of the second type update the "results by concept" table in the database.

The interfaces for displaying the results

The database used for storing the results contains:

(a) 1600 graphical variants of terms classed by concept (57) and by language (7)

(b) the results of these 1600 graphical variants, measured by 6 search engines (9600 results)

(c) the results for the 57 terms, calculated from the 9600 results, for 6 search engines and 7 languages (2394 results). These results are seen as absolute figures or in proportion to the results for English.

To access this information, display interfaces had to be created which take into account the following two conditions:

allow the rapid and accurate retrieval of results from the available information

make use of the updated results each time a modification is made in the database

The interface allows results such as (a) to be obtained, which allowed the creation of the table in Appendix 3. The results from (b) are available in Appendix 8, and those of (c) in Appendix 9. The interface for Appendix 9 also calculates the mean, the standard deviation and the coefficient of variation of the results when the results are viewed as a percentage. This interface also highlights the characteristics of the search engines (Appendix 4).
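The statistics calculated for Appendix 9 can be illustrated with a short Python sketch. The term names and page counts are invented, and the use of the sample standard deviation (rather than the population one) is an assumption, since the study does not say which was used.

```python
import statistics

# Hypothetical term results for one language and for English (page counts).
language_counts = {"term_1": 50, "term_2": 30, "term_3": 40}
english_counts = {"term_1": 1000, "term_2": 1000, "term_3": 1000}


def percent_of_english(counts, reference):
    """Express each term result as a percentage of the English result."""
    return {term: 100.0 * counts[term] / reference[term] for term in counts}


def summarize(values):
    """Return the mean, sample standard deviation and coefficient of variation."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return mean, stdev, stdev / mean


percentages = percent_of_english(language_counts, english_counts)
mean, stdev, cv = summarize(list(percentages.values()))
```

With these figures the percentages are 5, 3 and 4, giving a mean of 4, a standard deviation of 1 and a coefficient of variation of 0.25.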

Conclusion and plans for the next version

The present system is a definite improvement on the manual method. It transforms a slow and mind-numbing operation needing 10 days' work for 1200 graphical variants and a single search engine into 2 days' work for 1600 graphical variants and 6 search engines, with results which are easier to use. Furthermore, this system allows easy integration of other languages, other linguistic samples or other search engines into the study.
This flexibility allows us to anticipate adding new functions to the database and the interfaces in the future. Regular backing-up and dating of the results allows a dynamic analysis of the growth in the presence of the studied Latin languages on the Internet, and can transform this study into a true permanent observatory of these changes. These results also allow an evaluation of the way in which each search engine treats the multilingualism of the Internet.
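The gain quoted in the conclusion can be expressed as a rate of counts obtained per day of work. The short Python calculation below uses only the figures given in the text and assumes that one count corresponds to one variant on one search engine.

```python
# Figures quoted in the conclusion.
manual_counts = 1200 * 1       # graphical variants x search engines
manual_days = 10
automated_counts = 1600 * 6
automated_days = 2

manual_rate = manual_counts / manual_days            # counts per day
automated_rate = automated_counts / automated_days   # counts per day
speedup = automated_rate / manual_rate
```

This gives 120 counts per day for the manual method against 4800 for the automated system, that is, roughly forty times as many.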

¹ Language, associated terms, homographic problems, diacritical variants.