
APPENDIX 5: AUTOMATED INTERFACE FOR THE L2000 STUDY

Introduction

Obtaining the results for the study of the place of Latin languages on the Internet involved long, meticulous and repetitive measurement of the results returned by the search engines. Each final result (per search engine) required counting the number of pages containing each of the 1200 or so graphical variants of the 57 terms used (1600 once German had been added). This was followed by manual correction of the 73 variants that had homographic problems (shown in red capitals in the table of terms in Appendix 3). The results returned for each variant were then grouped by occurrence, giving 57 scores classed by language. The scores were averaged, and the coefficient of variation and the confidence interval were calculated for each language to obtain the final result. On top of this came the meticulous verification of the results, some of which were undoubtedly distorted by typing errors or by lapses of attention brought on by continuous repetitive work.
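The final statistical step described above can be illustrated with a short Python sketch. It is not the code used in the study: the function name, the use of the sample standard deviation, and the 95% normal approximation (z = 1.96) for the interval are all assumptions, and the scores are invented for the example.

```python
import math
import statistics


def summarise_scores(scores, z=1.96):
    """Summarise the per-term scores of one language.

    Assumptions (not stated in the study): sample standard deviation,
    and a normal-approximation interval with z = 1.96 (about 95%).
    """
    n = len(scores)
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores)       # sample standard deviation
    cv = stdev / mean                      # coefficient of variation
    half_width = z * stdev / math.sqrt(n)  # half-width of the interval
    return {
        "mean": mean,
        "cv": cv,
        "interval": (mean - half_width, mean + half_width),
    }


# Hypothetical scores for one language (the study used 57 per language).
example = summarise_scores([0.9, 1.1, 1.0, 1.2, 0.8])
```

Because the summary is recomputed from the list of scores, correcting a single score only requires calling the function again.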

Results held in this form, whether by term, by variant or final, are not very flexible to use: an error in a score detected a posteriori, while writing up, means systematically recalculating the term and then the final statistics.

Automating these manual procedures is therefore desirable. It was decided to invest in designing an algorithm that submits the 1600 or so variants to the search engines, collates the counts and organises them so that, once the homographs have been corrected, the final statistical calculations can be made. This automation also allows several search engines to be used without significant further work.

Technology used

To manage all of this effectively, it was decided to use a database as the central element through which the different applications are linked.
The database runs under PostgreSQL, which is widely used on the Internet, and the PHP programming language was chosen for the interface between the database and the Internet applications.

System function details

Database

The database is made up of three main tables, which form the framework of the automated system. Its design is open: the list of languages and the selected search engines are set as parameters, which gives great flexibility in adding new terms, search engines and languages.

The PHP Interfaces

The PHP software is made up of three types of interface:

The data capture interfaces

These interfaces are used to fill the graphical variant and results fields in the database.

The first is a user interface through which the administrator specifies the graphical variants and their associated properties (see note 1). The second is a machine interface which, for each of the chosen search engines, stores the results while the program is running. The data keyed in through the first interface is kept from one program execution to the next; the data in the second interface is volatile and is replaced each time the program is run. Before being replaced, the existing data are saved.
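The save-then-replace behaviour of the machine interface can be sketched as follows. This is an illustration only: the study used PHP with PostgreSQL, whereas the sketch uses Python with an in-memory SQLite database so that it runs on its own, and the table names, column names, engine name, counts and dates are all hypothetical.

```python
import sqlite3
from datetime import date


def replace_results(conn, engine, new_counts, run_date):
    """Archive the current counts for one engine, then replace them.

    All table and column names are hypothetical; the real schema is
    not given in this appendix.
    """
    with conn:  # a single transaction: archive, delete, insert
        conn.execute(
            "INSERT INTO results_archive "
            "(variant, engine, page_count, archived_on) "
            "SELECT variant, engine, page_count, ? "
            "FROM results WHERE engine = ?",
            (run_date.isoformat(), engine),
        )
        conn.execute("DELETE FROM results WHERE engine = ?", (engine,))
        conn.executemany(
            "INSERT INTO results (variant, engine, page_count) "
            "VALUES (?, ?, ?)",
            [(variant, engine, count) for variant, count in new_counts.items()],
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE results (variant TEXT, engine TEXT, page_count INTEGER)"
)
conn.execute(
    "CREATE TABLE results_archive "
    "(variant TEXT, engine TEXT, page_count INTEGER, archived_on TEXT)"
)

# Two successive runs with invented counts: the second archives the first.
replace_results(conn, "engine_a", {"lunes": 100}, date(2001, 1, 1))
replace_results(conn, "engine_a", {"lunes": 150}, date(2001, 2, 1))
```

Dating each archived row is one possible way of keeping the successive measurements comparable over time.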

The data processing interfaces

There are two types: those that modify the "results" table of the graphical variants, and those that calculate the values associated with each word. The first type is used to correct homographic problems.

In addition, some search engines required further processing of certain results, in particular those that do not differentiate diacritical marks or that handle them erratically.
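To illustrate why such engines need extra processing, the following Python sketch finds variants that become indistinguishable once diacritical marks are ignored. It is an assumption about how the problem could be detected, not a description of the study's own processing; the function names and the sample words are hypothetical.

```python
import unicodedata
from collections import defaultdict


def fold_diacritics(text):
    """Remove combining marks, e.g. 'corazón' becomes 'corazon'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def ambiguous_groups(variants):
    """Group variants that an accent-insensitive engine cannot tell apart."""
    groups = defaultdict(list)
    for variant in variants:
        groups[fold_diacritics(variant).lower()].append(variant)
    # Keep only the folded forms shared by more than one variant.
    return {key: vs for key, vs in groups.items() if len(vs) > 1}


# Invented sample: only the two Spanish spellings collide after folding.
groups = ambiguous_groups(["corazón", "corazon", "coração", "cœur"])
```

For an engine that ignores diacritics, the counts returned for the variants in one group would cover the same pages and would need to be reconciled before being added together.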

The interfaces of the second type update the "results by concept" table in the database.
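A minimal Python sketch of this kind of update is given below. It assumes that the value for a concept in a language is simply the sum of the page counts of its graphical variants, after any manual homograph corrections have been applied; the appendix does not state the actual rule, and the function name, row layout and figures are hypothetical.

```python
from collections import defaultdict


def totals_by_concept(rows, corrections=None):
    """Sum variant counts per (concept, language).

    rows: iterable of (concept, language, variant, page_count).
    corrections: optional {(language, variant): corrected_count} holding
    manually corrected counts for homographic variants.
    Summation is an assumption, not the study's documented rule.
    """
    corrections = corrections or {}
    totals = defaultdict(int)
    for concept, language, variant, count in rows:
        # A manual correction, when present, replaces the raw count.
        count = corrections.get((language, variant), count)
        totals[(concept, language)] += count
    return dict(totals)


# Invented rows and one invented correction.
rows = [
    ("monday", "es", "lunes", 100),
    ("monday", "es", "Lunes", 20),
    ("monday", "fr", "lundi", 80),
]
totals = totals_by_concept(rows, corrections={("es", "lunes"): 90})
```

Keeping the corrections separate from the raw counts means the totals can be recomputed whenever either changes.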

The interfaces for displaying the results

The database used for storing the results contains:

To access this information, display interfaces had to be created that meet the following two conditions:
  • allow the rapid and accurate retrieval of results from the available information
  • make use of the updated results each time a modification is made in the database

The interface provides results such as (a), which allowed the creation of the table in Appendix 3. The results from (b) are available in Appendix 8, and those of (c) in Appendix 9. The interface for Appendix 9 also calculates the mean, the standard deviation and the coefficient of variation of the results when they are viewed as percentages. This interface also highlights the characteristics of the search engines (Appendix 4).

Conclusion and plans for the next version

The present system is a definite improvement on the manual method. It turns a slow and mind-numbing operation needing 10 days' work for 1200 graphical variants and a single search engine into 2 days' work for 1600 variant occurrences and 6 search engines, with results that are easier to use. The system also allows other languages, other linguistic samples and other search engines to be integrated easily into the study.
This flexibility lets us anticipate adding new functions to the database and the interfaces in the future. Regular backing-up and dating of the results allows a dynamic analysis of the growth in the presence of the studied Latin languages on the Internet, and can turn this study into a true permanent observatory of these changes. The results can also be used to evaluate how each search engine treats the plurilingualism of the Internet.

1. Language, associated terms, homographic problems, diacritical variants.