For its third update, the study must deal with a number of new parameters related to the use of AltaVista as the measuring tool. It therefore begins the transition towards a new stage that is much more reliable in terms of methodology.
As for content, the third set of results shows that French keeps growing more quickly than English, confirming the relatively slow trend of 1997. Spanish continues its rapid progress and is clearly closing in on French. The lead of French over Spanish fell from 140% in 1996 to 92% in 1997, and is now down to 39%. Extrapolating, we can see a ratio of 1 to 9 between French and English for the year 2000 and parity between Spanish and French at the same time...
First of all, we would like to mention a study by Alis Technologies, carried out with the support of the Internet Society: "Web Languages Hit Parade". The study claims to be the "first major study of the actual distribution of languages on the Internet", and to use a "rigorous method for exploring the Web".
The Alis methodology is very different from ours. It attaches great importance to data processing: it is based on a program which automatically recognizes several languages (17) on the Web. The measurement protocol consists of picking 60,000 Internet sites at random by IP number(*), validating a subset of 8,000 Web sites suitable for measurement, and running the program on them. Alis then applies corrections to the result, the nature of which is not specified. It is a very interesting procedure, since it can be automated and repeated as often as required. Furthermore, it can be applied simultaneously to several languages.
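The sampling step of such a protocol can be sketched as follows. This is only an illustration under stated assumptions: the figures 60,000 and 8,000 come from the description above, but the function names are ours and the validation criterion actually used by Alis is not specified, so a placeholder test (`is_global`) stands in for it.

```python
import ipaddress
import random


def sample_ipv4_addresses(n, seed=None):
    """Draw n distinct IPv4 addresses uniformly from the 32-bit address space."""
    rng = random.Random(seed)
    # random.sample draws distinct integers from the range without materializing it.
    numbers = rng.sample(range(2**32), n)
    return [ipaddress.IPv4Address(number) for number in numbers]


def keep_measurable(addresses, is_measurable, limit):
    """Keep the first `limit` addresses accepted by the `is_measurable` predicate."""
    kept = []
    for address in addresses:
        if is_measurable(address):
            kept.append(address)
            if len(kept) == limit:
                break
    return kept


candidates = sample_ipv4_addresses(60_000, seed=1998)
# Placeholder criterion (assumption): a real study would test whether a Web
# server answers at the address and returns enough text to classify.
subset = keep_measurable(candidates, lambda address: address.is_global, 8_000)
print(len(candidates), len(subset))
```

A language-recognition program would then be run on the pages served by the addresses in `subset`.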
Compared to our work, the results show a much stronger presence of English (82% against 70%). The ratio French/Spanish is still very close to our results.
The main difference between the two approaches lies in the claimed ambition: the sole aim of the study led by FUNREDES is to provide a very broad estimate; Alis, on the contrary, strongly asserts the high validity of its results.
Of course, this encourages us to take a closer look at the Alis methodology.
Let us conclude by saying that the current limits of the study made by Alis encourage us to keep on defending our view, and even to make it more systematic as far as linguistics is concerned, in order to bring a more credible approach to the measurement of the languages on the Internet.
(*) The IP numbers (IP stands for Internet Protocol) uniquely identify each system connected to the Internet. They have a standardized structure of four sections, each with a value from 0 to 255, separated by dots. The names of the systems (or domains) are translated into IP numbers by a service known as the "Domain Name System" (DNS). E.g., the IP number for <funredes.org> is 184.108.40.206.
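As a small illustration of this structure, the sketch below splits the address quoted in the note into its four sections and shows the single 32-bit number they encode. It uses Python's standard `ipaddress` module; the variable names are ours.

```python
import ipaddress

# The address quoted in the note above, parsed and validated by the standard library.
address = ipaddress.IPv4Address("184.108.40.206")

# The four dot-separated sections, each one byte wide.
sections = [int(part) for part in str(address).split(".")]

# The same address seen as one 32-bit integer.
as_integer = int(address)

print(sections, as_integer)
```

Each section occupies one byte, which is why the whole address fits in 32 bits.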
On the one hand, the powerful search engine from Digital Corp. brings innovations: it now takes diacritical marks into account (accents and other "special" characters which do not exist in the English alphabet). And just like Alis, it offers an option to separate languages (Alis identifies 17 languages, and AltaVista claims 25). Even without further investigation, the algorithms can already be said to be different.
On the other hand, the size of the universe taken into account by AltaVista has not really improved: it remains about 100 million Websites, for a universe experiencing very strong exponential growth. In relative terms, AltaVista's coverage may have fallen from about 70% to something much more limited, perhaps about 20%. That is still a large enough percentage to extrapolate our results. However, one has to wonder whether this approach might favor the older sites, and hence the ones that are available in English rather than in other languages.
The study of the evolution of the AltaVista search engine led to surprising situations and, as we are now going to see, it compelled us to take other search engines into account and to use them to carry on with the work.
Some cross-checks show that searching for a word without diacritics returns all the forms of the word with diacritics. Thus, the search "peche" displays "peche", "pêche" (fishing, or peach), "pèche" (conjugated form of "to sin"), "péché" (a sin), and all possible spelling mistakes and typos, such as "péche" or "pëche". This led us to perform searches without diacritics for the comparisons with English. It also requires choosing the sample of words very carefully.
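The behaviour described here can be imitated with a few lines of code. The sketch below is not AltaVista's algorithm, which we do not know; it only shows one common way of folding accented forms onto the same unaccented key, using Unicode decomposition. The function name is ours.

```python
import unicodedata


def strip_diacritics(word):
    """Return the word with its combining marks (accents, diaeresis...) removed."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


variants = ["peche", "pêche", "pèche", "péché", "péche", "pëche"]
folded = {strip_diacritics(variant) for variant in variants}
print(folded)
```

All six spellings collapse onto the single key "peche", which is why a search without diacritics cannot distinguish between them.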
Search by language
Using the current version of AltaVista, we discovered an inconsistency which will prevent us from using this way of counting. There may be a logic behind this apparent inconsistency but, be that as it may, it is not compatible with the goal of counting. What is it?
In some cases, the result for "all languages" (ANY) seems equal to the sum of the results for each language, or it is higher, which is normal, since not all languages are taken into account and some pages are multilingual. But in other cases (most of our sample list), this result is lower than the one given by the search in English, which makes interpretation very difficult!
Here are some examples, for words or phrases: FUNREDES, FUNDACION REDES Y DESARROLLO, iberian, INTERNET, WEB (EN=English, FR=French, ES=Spanish, DE=German):
It seems that for common English words (those in the dictionary of AltaVista?), the result systematically gives a value for "any language" which is lower than the value for "English" (but what does this value mean?), and that for non-English words, whether they are compound or not, the value for "any language" is close to the sum of the values by language. We have requested explanations from AltaVista and we are waiting for an answer.
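The check we apply to each word can be written down as a small function. It is a sketch of our reasoning only: the labels and the function name are ours, and the counts in the usage lines are invented for illustration, not measured values.

```python
def classify_any_count(any_count, by_language):
    """Compare the 'any language' count with the per-language counts."""
    total_by_language = sum(by_language.values())
    largest_single_language = max(by_language.values())
    if any_count < largest_single_language:
        # ANY is smaller than the count of one language alone.
        return "inconsistent"
    if any_count >= total_by_language:
        # ANY covers the measured languages plus unmeasured ones.
        return "expected"
    # ANY lies between the largest language and the sum of the languages.
    return "overlap"


# Hypothetical counts, for illustration only.
print(classify_any_count(100, {"EN": 60, "FR": 20, "ES": 10}))
print(classify_any_count(50, {"EN": 60, "FR": 5, "ES": 2}))
```

The second call corresponds to the anomaly described above, where "any language" returns fewer documents than "English".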
AltaVista displays two results for a request. The first one, at the top of the page, is the total number of pages of its sample which mention the given word ("documents"). The second, at the bottom of the page, indicates the number of times that the given word appears in the pages of the sample ("occurrences"). There is another inconsistency here: sometimes the second one is identical in each language; at other times, the result differs according to the language (apparently for expressions made of several words, such as "fundacion redes y desarrollo").
This anomaly is an obstacle for our measurements, but with a few tricks it is still possible to compare the Alis algorithm with that of AltaVista. If one asks AltaVista for all the documents which do not contain a word that almost certainly does not exist (e.g., by searching for the expression "- qwxk49fnr8e4"), the result seems to be the total number of pages which the AltaVista algorithm assigns to a given language. Of course, the option "any language" then gives the total universe of AltaVista pages: a little more than 100 million when the measurement was made. Cross-checking with very frequent words or combinations (e.g. "of+he" in English) confirms the validity of the result. On this matter, our experiments show that, although measuring very frequent short words gave apparently convincing results in the past, the method now yields values which are not very reliable.
(**) A correction has to be made to take into account the difference between the total and the sum of the measured languages. What does this value of almost 17% represent? In theory, it could be the sum of the languages which are not measured, but the percentage is far too large for that. Perhaps, in addition to the unmeasured languages, it includes the multilingual Web sites that the algorithm could not classify. The fact that the number is so large leads us to think that multilingual sites are not counted under several languages (if they were, the total could be lower than the sum of the counts by language). We therefore make the hypothesis that "the remainder" covers the multilingual sites, the sites in languages which the algorithm does not recognize, the sites written in one of the "recognized" languages which the algorithm failed to identify (errors of the algorithm), and the pages whose content is not characteristic of any language (images, formulas...). We also make the (probably false!) hypothesis that the errors are equally distributed across languages, and we therefore ignore them. It remains for us to fix a parameter for dividing the remainder between the multilingual sites and the other languages. After several tests, we chose the following pair as the most probable one: 15% of multilingual sites (or sites neutral with respect to language) and 2.24% of sites in the other languages (e.g. 100 languages with 0.02%).
We can see that the comparison gives a higher value for English with the Alis method than with the method that we call "complement of the empty set" in AltaVista. Now, as we will see further on, our word-by-word counting method makes us suspect that the AltaVista count also overstates the importance of English. That casts doubt on the result of Alis Technologies and justifies a study with more elaborate linguistic criteria. The comparison between the three methods gives the following result:
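The arithmetic behind "the remainder" can be sketched as follows. The two percentages are the ones chosen in the note above; the function name and the page counts in the usage line are hypothetical and only serve to show the calculation.

```python
def remainder_share(total_pages, pages_by_language):
    """Percentage of the index that is not attributed to any measured language."""
    classified = sum(pages_by_language.values())
    return 100.0 * (total_pages - classified) / total_pages


# Split of the remainder chosen in the note (percent of the total).
MULTILINGUAL_SHARE = 15.0
OTHER_LANGUAGES_SHARE = 2.24
assumed_remainder = MULTILINGUAL_SHARE + OTHER_LANGUAGES_SHARE

# Hypothetical page counts, for illustration only.
example = remainder_share(10_000, {"EN": 7_000, "FR": 1_000, "ES": 276})
print(round(example, 2), round(assumed_remainder, 2))
```

The two chosen parameters add up to 17.24%, which is the size of the remainder the split is meant to account for.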
MEASUREMENTS OF FEBRUARY 1998
|MEASUREMENT||ENGLISH||FRENCH||SPANISH||ENGLISH/FRENCH||FRENCH/SPANISH||ENGLISH/SPANISH|
|M1: HOTBOT FEBRUARY 98||100221545||6090080||3230690||16.46||1.89||31.02|
|M2: EXCITE FEBRUARY 98||23689345||1430583||910317||16.56||1.57||26.02|
|M3: ALTAVISTA ALL LANGUAGES 2/98||26017027||1478396||1115708||17.60||1.33||23.32|
|M4: ALTAVISTA BY LANGUAGE 2/98||70718558||2946712||2058398||24.00||1.43||34.36|
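The last three columns of each row can be recomputed from the first three. The sketch below does so, reading the three counts as English, French and Spanish in that order; this reading is an inference from the arithmetic, and the function name is ours.

```python
# Counts copied from the table above, read as (English, French, Spanish).
counts = {
    "M1: HOTBOT FEBRUARY 98": (100221545, 6090080, 3230690),
    "M2: EXCITE FEBRUARY 98": (23689345, 1430583, 910317),
    "M3: ALTAVISTA ALL LANGUAGES 2/98": (26017027, 1478396, 1115708),
    "M4: ALTAVISTA BY LANGUAGE 2/98": (70718558, 2946712, 2058398),
}


def ratios(english, french, spanish):
    """Return English/French, French/Spanish and English/Spanish, to two decimals."""
    return (
        round(english / french, 2),
        round(french / spanish, 2),
        round(english / spanish, 2),
    )


table = {name: ratios(*row) for name, row in counts.items()}
for name, row in table.items():
    print(name, row)
```

The recomputed ratios agree with the published columns for all four measurements.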
There is some coherence between the results of the three search engines, but the variations must be acknowledged as well. It seems that AltaVista favors English less than the other two. Can the introduction of language recognition explain the differences? Probably. What value can now be given to the study of the trend? We still have some doubts, and that too justifies repeating this study with a more solid methodology, both linguistically and technically.
|MEASUREMENT||ENGLISH/FRENCH||FRENCH/SPANISH|
|AVERAGE MARCH 1996||21.91||2.40|
|AVERAGE MARCH 1997||19.99||1.92|
|AVERAGE FEBRUARY 1998||17.60||1.33|
The progressions are almost linear, and the extrapolation shows the ratio English/French equal to 1 in 2006 and the ratio French/Spanish equal to 1 in 2000.
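One possible way of carrying out such an extrapolation is a least-squares straight line through the three measurements. The sketch below is an assumption on our part: the text does not state how the extrapolation was computed, the measurement dates are encoded here as mid-month decimal years, and the crossing dates it prints depend on those choices, so they need not coincide with the years quoted above.

```python
def fit_line(xs, ys):
    """Least-squares straight line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def year_reaching(target, slope, intercept):
    """Value of x at which the fitted line reaches `target`."""
    return (target - intercept) / slope


# Mid-March 1996, mid-March 1997, mid-February 1998 (date encoding is an assumption).
dates = [1996 + 2.5 / 12, 1997 + 2.5 / 12, 1998 + 1.5 / 12]
english_french = [21.91, 19.99, 17.60]
french_spanish = [2.40, 1.92, 1.33]

for label, series in (("English/French", english_french),
                      ("French/Spanish", french_spanish)):
    slope, intercept = fit_line(dates, series)
    print(label, "reaches 1 around", round(year_reaching(1.0, slope, intercept), 1))
```

With only three points, the fitted dates are very sensitive to each measurement, which is one more reason to treat the extrapolation with caution.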
MEASUREMENTS OF THE DIACRITIC CHARACTERS
A result of interest to those who support the correct use of languages on the network is a measurement of the relationship between spellings with and without diacritic characters. The results are stable across all the engines.
|Percentage of sites without diacritic (average)||20%||50%|
It is now time to consolidate the method with the support of linguists. In collaboration with the Latin Union and its team of language specialists, Funredes has begun generalizing the study and including three other Latin languages, while paying close attention to the linguistic methodology. A sample of words to be measured which best fits the linguistic criteria is being drawn up (the obstacles are numerous!) and will be used as a basis for measuring the presence of the following six languages: English, Spanish, French, Italian, Portuguese and Romanian. The results will be published in a few weeks.
Copyright © 1996-1999 FUNREDES
Created: 24 VIII 1998
Last Modified: 02 VII 1999