Statistical Considerations on French Language and Culture on the Internet

This study was presented at the Visionarios conference, which took place in Caracas (Venezuela) from April 22 to April 24, 1998.
Author:
Daniel Pimienta
Acknowledgments to Marcelo Sztrum, Catherine Dhaussy, and Daniel Prado for their contributions.

Introduction

For its third update, the study has to deal with a number of new parameters arising from the use of AltaVista as the measuring tool. It therefore marks the transition towards a new stage, with a much more reliable methodology.

As for the content itself, the third set of results shows that French keeps growing faster than English, confirming the relatively slow trend observed in 1997. Spanish continues its rapid progress and is clearly catching up with French. The lead of French over Spanish fell from 140% in 1996 to 92% in 1997, and is now down to 39%. Extrapolating, we can expect a ratio of 1 to 9 between French and English in the year 2000, and parity between Spanish and French by the same date.

What is new in measuring the presence of languages on the Internet?

ALIS TECHNOLOGIES

First of all, we would like to mention a study by Alis Technologies, carried out with the support of the Internet Society: "Web Languages Hit Parade". The study presents itself as the "first major study of the actual distribution of languages on the Internet" and claims to use a "rigorous method for exploring the Web".

The Alis methodology is very different from ours and relies heavily on data processing. It is based on a program that automatically recognizes several languages (17) in the Web space. The measurement protocol consists of randomly picking 60,000 Internet sites by IP number(*), retaining a validated subset of 8,000 Web sites suitable for measurement, and running the program on them. Alis then applies corrections to the result, the nature of which is not specified. The procedure is very interesting, since it can be automated and repeated as often as required. Furthermore, it can be applied to several languages simultaneously.

Compared to our work, the results show a much stronger presence of English (82% against 70%). The French/Spanish ratio, however, remains very close to ours.

The main difference between the two approaches lies in their stated ambition: the sole aim of the FUNREDES study is to provide a very broad estimate; Alis, on the contrary, strongly affirms the high validity of its results.

Naturally, this encourages us to take a closer look at the Alis methodology.

  1. We are not in a position to judge the quality of the language-recognition program. Only cross-checking against results obtained by other methods would make it possible to validate its output.
  2. As far as statistics are concerned, however, the method appears quite doubtful. Why would a sample of 8,000 Web pages taken at random from a universe of more than 100 million pages provide a serious basis for extrapolation? Of course, polling institutes have shown an amazing capacity to predict voting intentions with remarkable precision, starting from samples of 2,000 voters for an electorate of 50 million. But polling institutes do not pick the members of their samples at random; quite the contrary! The sample is normalized, i.e. the proportions of certain parameters (social, economic, geographical...) among its members are very precisely weighted.
  3. The problem mentioned in 2 could have been addressed if the study had repeated the procedure several dozen times and published the average of the results (especially if the variance had turned out to be very small). That would have increased the credibility of the results. It seems that, for the moment, the numerous "hand-made checks" on the automatic process make this approach impossible. Even so, measuring just 3 distinct samples could have reassured (or worried) us about the argument in 2.
  4. The corrective adjustment of the results remains very mysterious (though it seems to remain a prerogative of survey specialists :-)...)
  5. Finally, Alis' ambition, for now, is to measure nothing but the presence of languages in the Web space. It aims neither at measuring other parts of the Internet nor, even less, at the cultural evaluation that is the very philosophy of our study.
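
The check suggested in point 3 can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population is synthetic (a hypothetical 70% share of English pages), the draw is a simple random sample, and the function names `draw_share` and `repeat_measurement` are invented for the illustration. Nothing here reproduces the Alis procedure or its data.

```python
import random
import statistics

def draw_share(rng, true_share, sample_size):
    """Share of 'English' pages in one simple random sample (synthetic data)."""
    hits = sum(1 for _ in range(sample_size) if rng.random() < true_share)
    return hits / sample_size

def repeat_measurement(true_share=0.70, sample_size=8000, runs=30, seed=0):
    """Repeat the sampling several dozen times; return mean and standard deviation."""
    rng = random.Random(seed)
    shares = [draw_share(rng, true_share, sample_size) for _ in range(runs)]
    return statistics.mean(shares), statistics.stdev(shares)

mean_share, spread = repeat_measurement()
print(f"mean = {mean_share:.4f}, standard deviation = {spread:.4f}")
```

A small spread across runs would support the extrapolation, while a large one would confirm the doubt raised in point 2. A simulation of this kind says nothing about possible bias in the way a real sample was selected.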

Let us conclude by saying that the current limits of the Alis study encourage us to keep defending our approach, and even to make it more systematic from a linguistic point of view, in order to bring more credibility to the measurement of languages on the Internet.


(*) IP numbers (IP stands for Internet Protocol) uniquely identify each system connected to the Internet. They have a standardized structure of four sections, each with a value from 0 to 255, separated by dots. The names of the systems (or domains) are translated into IP numbers by a mechanism known as the Domain Name System (DNS). E.g., the IP number for <funredes.org> is 205.160.164.9.
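
As an illustration of the structure described in this note, the sketch below splits the example address into its four sections using Python's standard `ipaddress` module. Each section of an IPv4 address is a number from 0 to 255, and the module rejects anything outside that range. The helper name `split_ipv4` is ours, chosen for the example.

```python
import ipaddress

def split_ipv4(text):
    """Return the four sections of a dotted IPv4 address as integers.

    ipaddress.IPv4Address raises a ValueError subclass when a section
    falls outside the range 0-255 or the address is malformed.
    """
    address = ipaddress.IPv4Address(text)
    return [int(part) for part in str(address).split(".")]

print(split_ipv4("205.160.164.9"))
```

The translation from a name to a number requires a network lookup and is deliberately left out of this sketch.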


ALTAVISTA

On the one hand, the powerful search engine from Digital Equipment Corporation brings innovations: it now takes into account diacritical marks (accents and other "special" characters that do not exist in the English alphabet). And, like Alis, it offers an option to separate languages (Alis identifies 17 languages, and AltaVista claims 25). Even without further investigation, it is clear that the two algorithms are different.

On the other hand, the size of the universe covered by AltaVista has not really grown: it remains at about 100 million Web pages, while the Web itself is growing exponentially. In relative terms, AltaVista's coverage may have dropped from about 70% to a much more limited figure, perhaps about 20%. That is still a large enough proportion to extrapolate from. However, one has to wonder whether this coverage favors older sites, and hence sites in English rather than in other languages.

Studying the changes in the AltaVista search engine led to surprising situations and, as we are about to see, compelled us to bring in other search engines in order to continue our work.

On the diacritics

Cross-checks show that searching for a word without diacritics returns all the variants of the word with diacritics. Thus, the search "peche" returns "peche", "pêche" (fishing, or peach), "pèche" (conjugated form of "to sin"), "péché" (a sin), and all possible spelling mistakes and typos, such as "péche" or "pëche". This led us to perform searches without diacritics for the comparisons with English. It also requires choosing the sample of words very carefully.
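
The behaviour observed here can be mimicked with a simple diacritic-folding function. The sketch below uses Python's standard `unicodedata` module; it illustrates the effect only and is not AltaVista's actual matching algorithm, which we do not know. The name `strip_diacritics` is ours.

```python
import unicodedata

def strip_diacritics(word):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

variants = ["peche", "pêche", "pèche", "péché", "péche", "pëche"]
for variant in variants:
    print(variant, "->", strip_diacritics(variant))
```

All six spellings fold to the same unaccented form, which is why a word with several accented homographs inflates the count and has to be kept out of the sample.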

Search by language

Using the current version of AltaVista, we discovered an inconsistency that prevents us from using this way of counting. There may be a logic behind this apparent inconsistency but, be that as it may, it is not compatible with our counting goal. What is it?

In some cases, the result for "all languages" (ANY) equals the sum of the results for each language, or is higher, which is normal, since not all languages are taken into account and some pages are multilingual. But in other cases (most of our sample list), this result is lower than the one given by the search in English alone, which makes interpretation very difficult!

Here are some examples, for words or phrases: FUNREDES, FUNDACION REDES Y DESARROLLO, iberian, INTERNET, WEB (EN=English, FR=French, ES=Spanish, DE=German):

                                      ANY         EN         FR         ES      DE
FUNREDES
  # DOCUMENTS                         572        294         85        164       4
  # OCCURRENCES                      4043       4043       4043       4043    4043
"fundacion redes y desarrollo"
  # DOCUMENTS                         156         26         24         91       0
  # OCCURRENCES                       200         31         24        100       0
IBERIAN
  # DOCUMENTS                       11094      10266         25        214      33
  # OCCURRENCES                     18946      18946      18946      18946   18946
INTERNET
  # DOCUMENTS                     4846307    7794545     314441     264538
  # OCCURRENCES                  30098345   30098345   30098345   30098345
WEB
  # DOCUMENTS                     5093017   10397446     244279     191402
  # OCCURRENCES                  35497288   35497288   35497288   35497288

It seems that for common English words (those in AltaVista's dictionary?), the value for "any language" is systematically lower than the value for "English" (but what does this value then mean?), and that for the other words, whether compound or not, the value for "any language" is close to the sum of the values by language. We have requested an explanation from AltaVista and are waiting for an answer.
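
The pattern can be checked mechanically on the document counts of the table above. The following sketch compares, for each query, the ANY value with the English value and with the sum of the per-language values. The figures are copied from the table; the function name `check` and the layout of the dictionary are ours.

```python
# Document counts copied from the table (ANY, then per language).
# INTERNET and WEB have no German figure in the table.
counts = {
    "FUNREDES": {"ANY": 572, "EN": 294, "FR": 85, "ES": 164, "DE": 4},
    "fundacion redes y desarrollo": {"ANY": 156, "EN": 26, "FR": 24, "ES": 91, "DE": 0},
    "IBERIAN": {"ANY": 11094, "EN": 10266, "FR": 25, "ES": 214, "DE": 33},
    "INTERNET": {"ANY": 4846307, "EN": 7794545, "FR": 314441, "ES": 264538},
    "WEB": {"ANY": 5093017, "EN": 10397446, "FR": 244279, "ES": 191402},
}

def check(row):
    """Compare the ANY count with the English count and the per-language sum."""
    by_language = sum(value for key, value in row.items() if key != "ANY")
    return {
        "sum_by_language": by_language,
        "any_below_english": row["ANY"] < row["EN"],
        "any_over_sum": round(row["ANY"] / by_language, 2),
    }

for query, row in counts.items():
    print(query, check(row))
```

For FUNREDES, the Spanish phrase and IBERIAN, the ANY value is slightly above the sum by language; for INTERNET and WEB it is lower than the English value alone.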

AltaVista displays two results for a query. The first, at the top of the page, is the total number of pages in its sample that mention the given word ("documents"). The second, at the bottom of the page, is the number of times the given word appears in the pages of the sample ("occurrences"). There is another inconsistency here: sometimes the second figure is identical for every language; at other times it differs by language (apparently for expressions made up of several words, such as "fundacion redes y desarrollo").

This anomaly is an obstacle for our measurements, but with a little trickery it is still possible to compare the Alis algorithm with AltaVista's. If one asks AltaVista for all the documents that do not contain a word that almost certainly does not exist (e.g., by searching for the expression "- qwxk49fnr8e4"), the result appears to be the total number of pages that the AltaVista algorithm assigns to a given language. Naturally, the "any language" option then gives the size of AltaVista's whole universe of pages: a little more than 100 million when the measurement was made. Cross-checking with very frequent words or combinations (e.g. "of+he" in English) confirms the validity of the result. On this point, our experiments show that, although measuring very frequent short words gave apparently convincing results in the past, that method now yields values that are not very reliable.

Comparison AltaVista/Alis

                              ALTAVISTA                              ALIS RESULTS
LANGUAGE                    TOTAL COUNT   % WITHOUT    % WITH(**)    WITHOUT      WITH
                                          CORRECTION   CORRECTION    CORRECTION   CORRECTION
ANY                           107958869
ENGLISH                        70065677   64.90%       76.35%        84.00        82.30
JAPANESE                        4369675   4.05%        4.76%         3.10         1.6
GERMAN                          4009554   3.71%        4.37%         4.50         4.00
FRENCH                          1951446   1.81%        2.13%         1.8          1.5
SPANISH                         1495195   1.38%        1.63%         1.20         1.10
ITALIAN                         1490109   1.38%        1.62%         1.00         0.80
PORTUGUESE                       905676   0.84%        0.99%         0.70         0.70
DUTCH                            849045   0.79%        0.93%         0.6          0.4
SWEDISH                          804266   0.74%        0.88%         1.10         0.60
CHINESE                          742741   0.69%        0.81%
RUSSIAN                          499447   0.46%        0.54%         0.30         0.10
CZECH                            469659   0.44%        0.51%         0.30         0.30
FINNISH                          411951   0.38%        0.45%         0.40         0.30
NORWEGIAN                        336751   0.31%        0.37%         0.60         0.30
DANISH                           300481   0.28%        0.33%         0.30         0.30
POLISH                           280975   0.26%        0.31%
KOREAN                           215064   0.20%        0.23%
HUNGARIAN                        197043   0.18%        0.21%
GREEK                             83780   0.08%        0.09%
ESTONIAN                          78955   0.07%        0.09%
HEBREW                            48843   0.05%        0.05%
ICELANDIC                         34749   0.03%        0.04%
ROMANIAN                          28052   0.03%        0.03%
LATVIAN                           22616   0.02%        0.02%
LITHUANIAN                        20539   0.02%        0.02%

THE REMAINDER                  18246580   16.90%
THE CORRECTED REMAINDER(**)     2052750   2.24%
MULTILINGUAL SITES                        15%

(**) A correction has to be made to account for the difference between the total and the sum of the measured languages. What does this value of almost 17% represent? In theory, it could be the sum of the languages that are not measured, but the percentage is far too large for that. Perhaps, in addition to the unmeasured languages, it includes the multilingual websites that the algorithm could not classify. The size of the number leads us to think that multilingual sites are not counted under several languages (otherwise the total could be lower than the sum of the counts by language). We therefore assume that "the remainder" covers the multilingual sites and the sites in languages not recognized by the algorithm, as well as sites written in one of the "recognized" languages that the algorithm failed to identify (errors of the algorithm), not forgetting pages whose content is not characteristic of any language (images, formulas...). We also assume (probably wrongly!) that the errors are evenly distributed across languages, and therefore ignore them. It remains to fix a parameter that splits the remainder between multilingual sites and other languages. After several tests, we chose the following pair, which seems to us the most plausible: 15% of multilingual sites (or sites that are neutral as regards language) and 2.24% of sites in other languages (e.g. 100 languages at 0.02% each).
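
The arithmetic of the correction appears to be the following: the 15% attributed to multilingual sites is removed from the total, and every count is then expressed as a share of what is left. The sketch below is a reconstruction from the published figures, not the original calculation. It reproduces the table values for the three languages shown, and the constant and function names are ours.

```python
TOTAL = 107_958_869          # AltaVista "any language" count
REMAINDER = 18_246_580       # total minus the sum of the measured languages
MULTILINGUAL_SHARE = 0.15    # share attributed to multilingual or neutral sites

counts = {"ENGLISH": 70_065_677, "FRENCH": 1_951_446, "SPANISH": 1_495_195}

# Base used for the corrected percentages (assumed: total minus multilingual part).
corrected_total = TOTAL * (1 - MULTILINGUAL_SHARE)
# What is left of the remainder once the multilingual part is taken out.
corrected_remainder = REMAINDER - TOTAL * MULTILINGUAL_SHARE

def pct(count, base):
    """Percentage of base, rounded to two decimals."""
    return round(100 * count / base, 2)

for language, count in counts.items():
    print(language, pct(count, TOTAL), pct(count, corrected_total))
print("OTHER LANGUAGES", round(corrected_remainder), pct(corrected_remainder, corrected_total))
```

Under this reading, the corrected remainder of 2052750 pages and its 2.24% share both follow directly from the choice of 15% for multilingual sites; a different choice would shift every corrected percentage.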


The comparison shows a higher value for English with the Alis method than with the method we call "complement of the empty set" in AltaVista. Yet, as we will see further on, our word-by-word counting method leads us to suspect that the AltaVista count also overstates the importance of English. That casts doubt on the Alis Technologies result and justifies a study based on more elaborate linguistic criteria. The comparison between the three methods gives the following result:

METHOD                         EN/FR   FR/ES
COMPLEMENT OF THE EMPTY SET    35.90   1.31
ALIS                           46.67   1.36
FUNREDES                       17.60   1.33

 

MEASUREMENTS OF FEBRUARY 1998

The innovations in AltaVista and the anomalies we observed led us to cross-check with other search engines. We therefore carried out a set of 5 measurements:

  • M1: With Hotbot (adding with and without diacritics)
  • M2: With Excite (adding with and without diacritics)
  • M3: With AltaVista all languages without diacritics
  • M4: With AltaVista by language without diacritics
  • M5: The sum of the two previous results

To allow comparison with our results from previous years, we initially thought that indicator M5 was the most suitable, in spite of the reservations mentioned above. But the correlation results led us to change our mind and adopt indicator M3, which is independent of the language-recognition algorithm.

SUMMARY OF THE RESULTS

                                    ENGLISH    FRENCH   SPANISH   EN/FR   FR/ES   EN/ES
M1: HOTBOT FEBRUARY 98            100221545   6090080   3230690   16.46    1.89   31.02
M2: EXCITE FEBRUARY 98             23689345   1430583    910317   16.56    1.57   26.02
M3: ALTAVISTA ALL LANGUAGES 2/98   26017027   1478396   1115708   17.60    1.33   23.32
M4: ALTAVISTA BY LANGUAGE 2/98     70718558   2946712   2058398   24.00    1.43   34.36
M5: M3+M4                          96735585   4425108   3174106   21.86    1.39   30.48
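
The ratio columns follow directly from the counts, and M5 is simply the element-wise sum of M3 and M4. The short sketch below recomputes them from the figures in the table; the variable and function names are ours.

```python
# Counts from the summary table, in the order (English, French, Spanish).
measurements = {
    "M1": (100_221_545, 6_090_080, 3_230_690),
    "M2": (23_689_345, 1_430_583, 910_317),
    "M3": (26_017_027, 1_478_396, 1_115_708),
    "M4": (70_718_558, 2_946_712, 2_058_398),
}
# M5 is defined as the sum of M3 and M4.
measurements["M5"] = tuple(a + b for a, b in zip(measurements["M3"], measurements["M4"]))

def ratios(english, french, spanish):
    """Return EN/FR, FR/ES and EN/ES rounded to two decimals."""
    return (
        round(english / french, 2),
        round(french / spanish, 2),
        round(english / spanish, 2),
    )

for name, row in measurements.items():
    print(name, row, ratios(*row))
```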

COMMENTS

There is some consistency between the results of the three search engines, but the variations must be acknowledged as well. It seems that AltaVista strengthens English less than the two others. Can the introduction of language recognition explain the differences? Probably. How much weight can now be given to the trend analysis? We still have some doubts, which is a further reason to redo this study with a methodology that is more solid both linguistically and technically.

TRENDS

                          EN/FR   FR/ES
AVERAGE MARCH 1996        21.91   2.40
AVERAGE MARCH 1997        19.99   1.92
AVERAGE FEBRUARY 1998     17.60   1.33

The progressions are almost linear, and the extrapolation shows the ratio English/French equal to 1 in 2006 and the ratio French/Spanish equal to 1 in 2000.

MEASUREMENTS OF THE DIACRITIC CHARACTERS

One result of interest to those who advocate the correct use of languages on the network is the ratio between spellings with and without diacritical characters. The results are stable across all the engines.

                                                   French   Spanish
Percentage of sites without diacritic (average)    20%      50%


CONCLUSION

It is now time to consolidate the method with the support of linguists. In collaboration with the Latin Union and its team of language specialists, FUNREDES has begun to generalize the study to include three other Latin languages, paying close attention to the linguistic methodology. A sample of words that best fits the linguistic criteria is being drawn up (the obstacles are numerous!) and will serve as the basis for measuring the presence of the following six languages: English, Spanish, French, Italian, Portuguese and Romanian. The results will be published in a few weeks.

 

Copyright © 1996-1999 FUNREDES
Created: 24 VIII 1998
Last Modified: 02 VII 1999
