APPENDIX 5: AUTOMATED INTERFACE FOR THE L2000 STUDY
Introduction
Obtaining the results for the study of the place of Latin languages on the Internet involved a long, meticulous and repetitive measurement of the results returned by the search engines. Each final result (per search engine) required a count of the number of pages containing each of the 1200 or so graphical variants of the 57 terms used (1600 once German had been added). This process was followed by manual correction of the 73 variants that had homographic problems (shown in red capitals in the table of terms in Appendix 3). The results returned for each variant were then grouped by occurrence, yielding 57 scores classed by language. An average of the scores was then worked out, and the coefficient of variation and the confidence interval were calculated for each language to obtain the final result. On top of this came the meticulous verification of the results, some of which were undoubtedly distorted by typing errors or by lapses of attention caused by the continuous repetitive work.
None of these results, whether by variant, by term or final, is easy to revise: an error in a score detected a posteriori, while writing up, requires a systematic recalculation of the term concerned and of the final statistics.
The automation of these manual procedures is therefore desirable. It was decided to invest in designing an algorithm able to submit the 1600 or so variants to the search engines, collect the counts and organise them so that, once the homographs have been corrected, the final statistical calculations can be made. This automation also allows several search engines to be used without significant additional work.
Technology used
To manage all of this data in one place, it was decided to use a database as the central element through which the different applications could be linked.
The database runs under PostgreSQL, which is widely used on the Internet, and the PHP programming language was chosen for the interface between the database and the Internet applications.
System function details
Database
The database is made up of three main tables which form the framework of the automated system:
- A table of the 1600 graphical variants of the terms: this table holds the different variants of the sample. They are classed according to the term and the language to which they belong, and are recorded with their associated parameters: homographies, etc. Appendix 3 shows the contents of this table.
- Scores table: all the results (page counts) obtained from the search engines via the PHP interface are kept here. These scores are classed by the variant to which they relate and by the search engine on which the score was found.
- Table of results by concept: once the 1600 variants have been entered into the database and the score of each one obtained, the results for each term are calculated. This is done by adding the scores of the variants belonging to the same term in the same language. The results obtained in this way (classed by word, by search engine and by language) are used to arrive at the final results and serve as the basis for the statistical calculations.
The design is open because the list of languages and the selected search engines are set as parameters. This set-up allows great flexibility in adding new terms, search engines and languages.
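The three tables and the per-term summation can be sketched as follows. This is only an illustrative sketch: the table names, column names and page counts are invented for the example, and an in-memory SQLite database stands in for the PostgreSQL database used by the study.

```python
import sqlite3

# Hypothetical schema: names are illustrative, not those of the study.
SCHEMA = """
CREATE TABLE variants (
    id        INTEGER PRIMARY KEY,
    variant   TEXT NOT NULL,
    term      TEXT NOT NULL,       -- the concept the variant belongs to
    language  TEXT NOT NULL,
    homograph INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE scores (
    variant_id INTEGER NOT NULL REFERENCES variants(id),
    engine     TEXT NOT NULL,
    page_count INTEGER NOT NULL,
    PRIMARY KEY (variant_id, engine)
);
CREATE TABLE results_by_concept (
    term     TEXT NOT NULL,
    language TEXT NOT NULL,
    engine   TEXT NOT NULL,
    total    INTEGER NOT NULL,
    PRIMARY KEY (term, language, engine)
);
"""

def build_results(conn):
    """Recompute results_by_concept by summing variant scores
    per term, language and search engine."""
    conn.execute("DELETE FROM results_by_concept")
    conn.execute(
        """
        INSERT INTO results_by_concept (term, language, engine, total)
        SELECT v.term, v.language, s.engine, SUM(s.page_count)
        FROM scores AS s
        JOIN variants AS v ON v.id = s.variant_id
        GROUP BY v.term, v.language, s.engine
        """
    )

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Invented sample rows: two English variants of one term.
conn.executemany(
    "INSERT INTO variants (id, variant, term, language) VALUES (?, ?, ?, ?)",
    [(1, "uniformity", "UNIFORMITY", "en"),
     (2, "uniformities", "UNIFORMITY", "en")],
)
conn.executemany(
    "INSERT INTO scores (variant_id, engine, page_count) VALUES (?, ?, ?)",
    [(1, "engine_a", 100), (2, "engine_a", 40)],
)
build_results(conn)
rows = conn.execute(
    "SELECT term, language, engine, total FROM results_by_concept"
).fetchall()
```

Because languages and search engines appear only as column values, adding a new language or engine in this sketch means inserting rows, not changing the schema, which is one way to read the "open design" described above.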
The PHP Interfaces
The PHP software is made up of three types of interface:
The data capture interfaces
These interfaces are used to fill the graphical variant and results fields in the database.
The first is a user interface through which the administrator specifies the graphical variants and their associated properties (1). The second is a machine interface which, for each of the chosen search engines, stores the results while the program is running. The data keyed in through the first interface persists from one program execution to the next; the data in the second interface is volatile and is replaced each time the program is run. The data about to be replaced is backed up first.
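The "back up, then replace" behaviour of the machine interface might look like the following minimal sketch. The in-memory dictionary, the function names and the sample counts are all assumptions made for illustration; the study itself stores this data in its database.

```python
from datetime import datetime, timezone

def new_store():
    """Hypothetical in-memory stand-in for the scores table and its backups."""
    return {"current": {}, "backups": []}

def store_run(store, engine, scores, when=None):
    """Save the engine's previous scores with a timestamp,
    then replace them with the scores of the new run."""
    when = when or datetime.now(timezone.utc)
    previous = store["current"].get(engine)
    if previous is not None:
        store["backups"].append(
            {"engine": engine, "saved_at": when.isoformat(), "scores": previous}
        )
    store["current"][engine] = dict(scores)

store = new_store()
store_run(store, "engine_a", {"uniformidades": 120})  # first run, nothing to back up
store_run(store, "engine_a", {"uniformidades": 150})  # second run replaces the first
```

After the second call, `store["current"]` holds only the latest counts, while the earlier run survives as a dated entry in `store["backups"]`.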
The data processing interfaces
There are two types: those which allow modification of the "results" table of the graphical variants, and those which calculate the values associated with each word. The first type is used to correct the following homographic problems:
The most common anomalies concern the graphical variants of plurals ending in "-IDADES", common to Spanish and Portuguese and equivalent to "-ities" in English ("uniformities", "uniformidades"). The count was divided between Spanish and Portuguese automatically, pro rata to the partial results of the study. Coefficients (one per search engine) weighting Spanish relative to Portuguese were calculated from these partial results, and these coefficients were used to distribute the results for words ending in "-IDADES" between Spanish and Portuguese.
In addition, some search engines required further processing of certain results, in particular those that do not differentiate diacritical marks or that handle them erratically.
The interfaces of the second type update the "results by concept" table in the database.
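The pro rata division of an "-IDADES" count between Spanish and Portuguese, described above, can be illustrated as follows. The function name, the rounding rule and the figures are assumptions for the example; the text does not say how the study rounds or exactly which partial results feed the coefficients.

```python
def split_shared_count(shared_count, spanish_partial, portuguese_partial):
    """Split a page count shared by Spanish and Portuguese pro rata
    to the partial results already obtained for one search engine."""
    total = spanish_partial + portuguese_partial
    if total <= 0:
        raise ValueError("partial results must have a positive sum")
    spanish_share = spanish_partial / total      # weighting coefficient for Spanish
    spanish = round(shared_count * spanish_share)
    portuguese = shared_count - spanish          # remainder, so the parts sum to the whole
    return spanish, portuguese

# Invented figures: 1000 pages contain "uniformidades"; partial results for
# this engine are 3000 pages in Spanish and 1000 in Portuguese.
spanish, portuguese = split_shared_count(1000, 3000, 1000)
# spanish == 750, portuguese == 250
```

Assigning the remainder to Portuguese, and not rounding both parts independently, keeps the two counts summing exactly to the original shared count.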
The interfaces for displaying the results
The database used for storing the results contains:
(a) 1600 graphical variants of terms classed by concept (57) and by language (7)
(b) the results of these 1600 graphical variants, measured by 6 search engines (9600 results)
(c) the results for the 57 terms, calculated from the 9600 results for 6 search engines and 7 languages (2394 results). These results can be viewed either as absolute figures or in proportion to the results for English.
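The figures quoted in (b) and (c), and the "in proportion to English" view, can be checked with a short sketch. The function name, the language codes and the sample totals are invented for illustration.

```python
TERMS, LANGUAGES, ENGINES, VARIANTS = 57, 7, 6, 1600

# Consistency of the counts given in the list above.
assert VARIANTS * ENGINES == 9600          # (b) one result per variant and engine
assert TERMS * LANGUAGES * ENGINES == 2394  # (c) one result per term, language and engine

def relative_to_english(totals, reference="en"):
    """Express each language's total as a percentage of the English total."""
    base = totals[reference]
    if base == 0:
        raise ValueError("reference total must be non-zero")
    return {lang: 100.0 * count / base for lang, count in totals.items()}

# Invented totals for one term on one search engine.
percentages = relative_to_english({"en": 2000, "es": 100, "fr": 50})
# percentages == {"en": 100.0, "es": 5.0, "fr": 2.5}
```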
To access this information, display interfaces had to be created which meet the following two conditions:
- allow the rapid and accurate retrieval of results from the available information
- make use of the updated results each time a modification is made in the database
The interface returns results such as (a), which was used to create the table in Appendix 3. The results from (b) are available in Appendix 8, and those of (c) in Appendix 9. The interface for Appendix 9 also calculates the mean, the standard deviation and the coefficient of variation of the results when they are viewed as percentages. This interface also highlights the characteristics of the search engines (Appendix 4).
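The summary statistics computed for Appendix 9 might be sketched as below. Two assumptions are made here: the standard deviation is taken to be the sample standard deviation, and the "variance coefficient" of the text is read as the coefficient of variation (standard deviation divided by mean). The source does not confirm either choice, and the input values are invented.

```python
import statistics

def summarise(percentages):
    """Return mean, sample standard deviation and coefficient of variation
    for one language's per-term percentages."""
    mean = statistics.mean(percentages)
    stdev = statistics.stdev(percentages)          # sample (n - 1) standard deviation
    cv = stdev / mean if mean else float("nan")    # undefined when the mean is zero
    return {"mean": mean, "stdev": stdev, "cv": cv}

# Invented percentages (relative to English) for three terms in one language.
summary = summarise([2.0, 4.0, 6.0])
# mean 4.0, standard deviation 2.0, coefficient of variation 0.5
```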
Conclusion and plans for the next version
The present system is a clear improvement on the manual method. It turns a slow and mind-numbing operation requiring 10 days' work for 1200 graphical variants and a single search engine into 2 days' work for 1600 variants and 6 search engines, with results that are easier to use. The system also allows other languages, other linguistic samples or other search engines to be integrated into the study easily.
This flexibility allows us to anticipate adding new functionality to the database and the interfaces in the future. Regular backing-up and dating of the results allows a dynamic analysis of the growth in the presence of the Latin languages studied on the Internet, and turns this study into a true permanent observatory of these changes. These results also make it possible to evaluate the way in which each search engine handles the multilingualism of the Internet.
(1) Language, associated terms, homographic problems, diacritical variants.