L3: The Third Study on Languages and the Internet

Statistical Considerations on French Language and Culture on the Internet

This study was presented at the Visionarios conference, which took place in Caracas (Venezuela) from April 22 to April 24, 1998.
Author: Daniel Pimienta
Acknowledgments to Marcelo Sztrum, Catherine Dhaussy, and Daniel Prado for their contributions.

Introduction

For its third update, the study has to deal with a number of new parameters that stem from the use of AltaVista as the measuring tool. It therefore begins the transition towards a new stage, one that should be much more reliable in terms of methodology.

As for content, the third set of results shows that French keeps growing faster than English, which confirms the relatively slow trend of 1997. Spanish continues its rapid progress and is clearly catching up with French. The lead of French over Spanish had decreased from 140% in 1996 to 92% in 1997, and it is now reduced to 39%. Extrapolating, we can expect a ratio of 1 to 9 between French and English for the year 2000, and parity between Spanish and French at about the same time...

What is new in measuring the presence of languages on the Internet?

ALIS TECHNOLOGIES

First of all, we would like to mention a study by Alis Technologies, carried out with the support of the Internet Society: "Web Languages Hit Parade". The study describes itself as the "first major study of the actual distribution of languages on the Internet" and claims to use a "rigorous method for exploring the Web".

The Alis methodology is very different from ours. It relies heavily on automated data processing: it is based on a program which automatically recognizes several languages (17) on the Web. The measurement protocol consists in randomly picking 60,000 Internet sites on the basis of their IP number(*), retaining a validated subset of 8,000 Web sites suitable for measurement, and running the program on that subset. Alis then applies some corrections to the result, the nature of which is not specified. It is a very interesting procedure, since it can be automated and repeated as often as required. Furthermore, it can be applied simultaneously to several languages.

Compared to our work, the Alis results show a much stronger presence of English (82% against 70%). The French/Spanish ratio, however, remains very close to ours.

The main difference between the two approaches lies in the claimed ambition: the sole aim of the study led by FUNREDES is to provide a very broad estimate; on the contrary, Alis strongly affirms the high validity of its results.

Of course, this encourages us to have a closer look at Alis methodology.

1. It is not possible for us to pass any judgement on the quality of the language recognition program. Only cross-checks against results obtained by other methods could validate the output of this program.
2. However, as far as statistics are concerned, the method appears quite doubtful. Why would a sample of 8,000 Web pages taken randomly from a universe of more than 100 million pages provide a serious basis for extrapolation? Of course, the survey institutes have shown their amazing capacity to extrapolate voting intentions with remarkable precision, starting from samples of 2,000 voters for an electorate of 50 million. But the survey institutes do not pick the elements of their samples at random; quite the contrary! The sample is normalized, i.e. the proportions of certain parameters (social, economic, geographical...) among its elements are very precisely weighted.
3. The problem mentioned in 2 could have been addressed if the study had repeated the procedure several dozen times and published the average of the results (especially if the variance had turned out to be very small). That could have increased the credibility of the results. It seems that, for the moment, the numerous "hand-made checks" of the automatic process make this approach impossible. However, measuring even 3 distinct samples could have reassured (or worried) us about the argument mentioned in 2.
4. The corrective adjustment of the results remains very mysterious (it seems to remain a prerogative of survey specialists, though :-)...).
5. Last of all, Alis' ambition is, for now, to measure nothing but the presence of languages on the Web. They do not aim at measuring other parts of the Internet nor, even less, at approaching the cultural evaluation which in fact constitutes the very philosophy of our study.
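
The repeated-sampling check suggested above (draw several independent samples, then publish the mean and the spread of the estimates) can be sketched as follows. This is only an illustration on a synthetic universe: the universe size, the 70% English share, the number of runs and all function names are assumptions made for the example, not data or code from Alis or FUNREDES.

```python
import random
import statistics

def estimate_share(universe, sample_size, rng):
    """Share of pages labelled "en" in one simple random sample."""
    sample = rng.sample(universe, sample_size)
    return sum(1 for lang in sample if lang == "en") / sample_size

def repeated_estimates(universe, sample_size, runs, seed=0):
    """Repeat the sampling procedure `runs` times with a seeded generator."""
    rng = random.Random(seed)
    return [estimate_share(universe, sample_size, rng) for _ in range(runs)]

# Synthetic universe (NOT real data): 200,000 pages, 70% labelled English.
universe = ["en"] * 140_000 + ["other"] * 60_000

estimates = repeated_estimates(universe, sample_size=8_000, runs=30)
print(f"mean share: {statistics.mean(estimates):.4f}")
print(f"std dev:    {statistics.stdev(estimates):.4f}")
```

A small standard deviation across runs would support extrapolating from a single sample; a large one would confirm the doubts expressed above. The sketch says nothing about errors of the language recognition program itself, which repeated sampling cannot detect.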

Let us conclude by saying that the current limits of the Alis study encourage us to keep defending our approach, and even to make it more systematic from a linguistic point of view, in order to bring more credibility to the measurement of languages on the Internet.

(*)The IP numbers (IP stands for Internet Protocol) uniquely identify each system connected to the Internet. They have a standardized structure of four sections, each with a value from 0 to 255, separated by dots. The names of the systems (or domains) are translated into IP numbers by the Domain Name System (DNS). E.g., the IP number for <funredes.org> is 205.160.164.9.
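
As a small illustration of the dotted format described in this note, the sketch below uses Python's standard `ipaddress` module, which accepts only four decimal sections in the range 0 to 255. The helper name `is_valid_ipv4` is introduced here for the example only.

```python
import ipaddress

def is_valid_ipv4(text):
    """Return True if `text` is a well-formed dotted IPv4 address."""
    try:
        ipaddress.IPv4Address(text)
        return True
    except ValueError:  # AddressValueError is a subclass of ValueError
        return False

# The address quoted in the note for funredes.org.
print(is_valid_ipv4("205.160.164.9"))
# A section above 255 is rejected.
print(is_valid_ipv4("205.160.164.999"))
```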

ALTAVISTA

On the one hand, the powerful search engine from Digital Equipment Corporation brings innovations: it now takes diacritic signs into account (accents and other "special" characters which do not exist in the English alphabet). And just like Alis, it offers an option to separate languages (Alis identifies 17 languages, and AltaVista claims 25). Even without further investigation, it is clear that the two algorithms are different.

On the other hand, the size of the universe covered by AltaVista has not really grown: it remains about 100 million Web pages, while the Web itself is growing exponentially. In relative terms, AltaVista's coverage may have dropped from about 70% to something much more limited, perhaps about 20%. That is still a large enough share to extrapolate our results. However, one has to wonder whether this situation might favor the older sites, and hence sites in English rather than in other languages.

Studying the changes in the AltaVista search engine led to some surprises and, as we will see, compelled us to take other search engines into account in order to continue our work.

On the diacritics

Some cross-checks show that searching for a word without diacritics returns all the variants of the word with diacritics. Thus, the search "peche" returns "peche", "pêche" (fishing, or peach), "pèche" (conjugated form of "to sin"), "péché" (a sin), and all possible spelling mistakes and typos, such as "péche" or "pëche". This led us to perform searches without diacritics for the comparisons with English. It also requires choosing the sample of words very carefully.
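
The folding behavior described above can be imitated with a short sketch. How AltaVista implements it is not documented here; the version below is an assumption that uses Unicode decomposition from Python's standard `unicodedata` module, and the function name `strip_diacritics` is hypothetical.

```python
import unicodedata

def strip_diacritics(word):
    """Remove combining marks (accents, diaeresis...) from a word."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# The variants listed in the text, including the two typos.
variants = ["peche", "pêche", "pèche", "péché", "péche", "pëche"]

# Every variant folds to the same search key.
folded = {strip_diacritics(v) for v in variants}
print(folded)
```

Because distinct words such as "pêche" and "péché" collapse onto one key, a word list used for counting should avoid terms whose unaccented form is ambiguous.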

Search by language

Using the current version of AltaVista, we discovered an inconsistent behavior which prevents us from using the search by language for counting. There may be a logic behind this apparent inconsistency but, be that as it may, this logic is not compatible with the goal of counting. What is it?

In some cases, the result for "all languages" (ANY) is equal to or higher than the sum of the results for each language. This is normal, since not all languages are taken into account and some pages are multilingual. But in other cases (most of our sample list), the ANY result is lower than the result of the search in English alone, which makes any interpretation very difficult!

Here are some examples, for words or phrases: FUNREDES, FUNDACION REDES Y DESARROLLO, iberian, INTERNET, WEB (EN=English, FR=French, ES=Spanish, DE=German):

FUNREDES	ANY	EN	FR	ES	DE
# DOCUMENTS	572	294	85	164	4
# OCCURRENCES	4043	4043	4043	4043	4043

"fundacion redes y desarrollo"	ANY	EN	FR	ES	DE
# DOCUMENTS	156	26	24	91	0
# OCCURRENCES	200	31	24	100	0

IBERIAN	ANY	EN	FR	ES	DE
# DOCUMENTS	11094	10266	25	214	33
# OCCURRENCES	18946	18946	18946	18946	18946

INTERNET	ANY	EN	FR	ES
# DOCUMENTS	4846307	7794545	314441	264538
# OCCURRENCES	30098345	30098345	30098345	30098345

WEB	ANY	EN	FR	ES
# DOCUMENTS	5093017	10397446	244279	191402
# OCCURRENCES	35497288	35497288	35497288	35497288

It seems that for common English words (those in the dictionary of AltaVista?), the value for "any language" is systematically lower than the value for "English" (but what does this value mean?), whereas for the other terms, whether single words or phrases, the value for "any language" is close to the sum of the values by language. We have requested explanations from AltaVista and we are waiting for an answer.
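
The check described above can be written down as a small sketch applied to the document counts from the tables. It assumes that the four values given for INTERNET and WEB are, in order, ANY, EN, FR and ES; the labels "anomalous", "plausible" and "overlap" and the function name are introduced here for illustration only.

```python
# Document counts copied from the tables above.
DOCUMENTS = {
    "FUNREDES": {"ANY": 572, "EN": 294, "FR": 85, "ES": 164, "DE": 4},
    "fundacion redes y desarrollo": {"ANY": 156, "EN": 26, "FR": 24, "ES": 91, "DE": 0},
    "IBERIAN": {"ANY": 11094, "EN": 10266, "FR": 25, "ES": 214, "DE": 33},
    "INTERNET": {"ANY": 4846307, "EN": 7794545, "FR": 314441, "ES": 264538},
    "WEB": {"ANY": 5093017, "EN": 10397446, "FR": 244279, "ES": 191402},
}

def classify(row):
    """Compare the ANY count with the per-language counts."""
    per_language = [n for key, n in row.items() if key != "ANY"]
    if max(per_language) > row["ANY"]:
        return "anomalous"   # one language alone exceeds "any language"
    if sum(per_language) <= row["ANY"]:
        return "plausible"   # ANY covers the measured languages plus others
    return "overlap"         # pages would have to be counted in several languages

for term, row in DOCUMENTS.items():
    print(f"{term:30s} {classify(row)}")
```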

AltaVista displays two results for a request. The first one, at the top of the page, is the total number of pages of its sample which mention the given word ("documents"). The second, at the bottom of the page, indicates the number of times that the given word appears in the pages of the sample ("occurrences"). There is another inconsistency here: sometimes the second value is identical for every language; at other times it differs according to the language (apparently for expressions made of several words, such as "fundacion redes y desarrollo").

This anomaly is an obstacle for our measurements, but with a few tricks it is still possible to compare the Alis algorithm with that of AltaVista. If one asks AltaVista for all the documents that do not contain a word which almost certainly does not exist (e.g., by searching for the expression "- qwxk49fnr8e4"), the result appears to be the total number of pages that the AltaVista algorithm assigns to a given language. With the option "any language", the same query gives the size of the total universe of AltaVista pages: a little more than 100 million when the measurement was made. Cross-checking with very frequent words or combinations (e.g. "of+he" in English) confirms the validity of the result. On this point, our experiments show that although measuring very frequent short words gave apparently convincing results in the past, that method now yields values which are not very reliable.

Comparison AltaVista/Alis

LANGUAGE	ALTAVISTA TOTAL COUNT	ALTAVISTA % WITHOUT CORRECTION	ALTAVISTA % WITH CORRECTION(**)	ALIS % WITHOUT CORRECTION	ALIS % WITH CORRECTION
ANY	107958869
ENGLISH	70065677	64.90%	76.35%	84.00	82.30
JAPANESE	4369675	4.05%	4.76%	3.10	1.6
GERMAN	4009554	3.71%	4.37%	4.50	4.00
FRENCH	1951446	1.81%	2.13%	1.8	1.5
SPANISH	1495195	1.38%	1.63%	1.20	1.10
ITALIAN	1490109	1.38%	1.62%	1.00	0.80
PORTUGUESE	905676	0.84%	0.99%	0.70	0.70
DUTCH	849045	0.79%	0.93%	0.6	0.4
SWEDISH	804266	0.74%	0.88%	1.10	0.60
CHINESE	742741	0.69%	0.81%
RUSSIAN	499447	0.46%	0.54%	0.30	0.10
CZECH	469659	0.44%	0.51%	0.30	0.30
FINNISH	411951	0.38%	0.45%	0.40	0.30
NORWEGIAN	336751	0.31%	0.37%	0.60	0.30
DANISH	300481	0.28%	0.33%	0.30	0.30
POLISH	280975	0.26%	0.31%
KOREAN	215064	0.20%	0.23%
HUNGARIAN	197043	0.18%	0.21%
GREEK	83780	0.08%	0.09%
ESTONIAN	78955	0.07%	0.09%
HEBREW	48843	0.05%	0.05%
ICELANDIC	34749	0.03%	0.04%
ROMANIAN	28052	0.03%	0.03%
LATVIAN	22616	0.02%	0.02%
LITHUANIAN	20539	0.02%	0.02%

THE REMAINDER	18246580	16.90%
THE CORRECTED REMAINDER(**)	2052750		2.24%
ASSUMED SHARE OF MULTILINGUAL SITES(**)		15%

(**)A correction has to be made in order to take into account the difference between the total and the sum of the measured languages. What does this value of almost 17% represent? In theory, it may represent the sum of the values of the languages which are not measured. But this percentage is far too large for that. Perhaps, in addition to the languages which are not measured, it includes the multilingual websites that the algorithm did not know how to classify. The fact that the number is so large leads us to think that multilingual sites are not counted under several languages (otherwise the total could be lower than the sum of the per-language counts). We therefore make the hypothesis that "the remainder" covers the multilingual sites, the sites in languages which are not recognized by the algorithm, the sites which are written in one of the "recognized" languages but were not identified (errors of the algorithm), and the pages whose content is not characteristic of any language (images, formulas...). We also make the (probably false!) hypothesis that the errors are equally distributed among languages, and we therefore ignore them. It remains for us to fix a parameter to split the remainder between the multilingual sites and the other languages. After several tests, we chose the following pair: 15% of multilingual sites (or sites that are neutral with respect to language) and 2.24% of sites in the other languages (e.g. 100 languages with 0.02% each), because it appears to us to be the most probable one.
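
The note does not give the exact formula behind the corrected percentages. The sketch below is one reading, stated as an assumption: the multilingual pages (15% of the total) are removed from the base, the rest of the remainder is attributed to other languages, and each language is expressed as a share of the reduced base. With the figures from the table, this reading reproduces the published 76.35% for English, 2.13% for French and 2.24% for the other languages. All names in the code are introduced for the example.

```python
# Figures taken from the comparison table above.
TOTAL = 107_958_869        # "any language" count
REMAINDER = 18_246_580     # total minus the sum of the measured languages
MULTILINGUAL_SHARE = 0.15  # share of multilingual or neutral pages chosen in the note

def corrected_shares(counts, total, remainder, multilingual_share):
    """Return the count for other languages and the corrected percentages."""
    multilingual = round(multilingual_share * total)
    other_languages = remainder - multilingual
    base = total - multilingual  # assumed base: pages attributable to one language
    shares = {lang: 100 * n / base for lang, n in counts.items()}
    shares["OTHER LANGUAGES"] = 100 * other_languages / base
    return other_languages, shares

counts = {"ENGLISH": 70_065_677, "FRENCH": 1_951_446, "SPANISH": 1_495_195}
other, shares = corrected_shares(counts, TOTAL, REMAINDER, MULTILINGUAL_SHARE)

print("other languages:", other)
for lang, pct in shares.items():
    print(f"{lang:16s} {pct:6.2f}%")
```

Changing `MULTILINGUAL_SHARE` shows how sensitive the corrected percentages are to this single parameter.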

The comparison shows a higher value for English with the Alis method than with the AltaVista method that we call "complement of the empty set". Yet, as we will see further on, our word-by-word counting method makes us suspect that the AltaVista count itself overstates the importance of English. That raises questions about the result of Alis Technologies and justifies a study with more elaborate linguistic criteria. The comparison between the three methods gives the following result:

	EN/FR	FR/ES
METHOD COMPLEMENT OF THE EMPTY SET	35.90	1.31
METHOD ALIS	46.67	1.36
METHOD FUNREDES	17.60	1.33

MEASUREMENTS OF FEBRUARY 1998

The innovations of AltaVista and the observed anomalies led us to cross-check with other search engines. We thus carried out a set of 5 measurements:

M1: With Hotbot (adding with and without diacritics)
M2: With Excite (adding with and without diacritics)
M3: With AltaVista all languages without diacritics
M4: With AltaVista by language without diacritics
M5: The sum of the two previous results

To compare with our results of the previous years, we initially thought that the indicator M5 was the most adequate, in spite of the reservations mentioned above. But the correlation results led us to change our mind and to use the indicator M3, which is independent of the language recognition algorithm.

SUMMARY OF THE RESULTS

	ENGLISH	FRENCH	SPANISH	EN/FR	FR/ES	EN/ES
M1: HOTBOT FEBRUARY 98	100221545	6090080	3230690	16.46	1.89	31.02
M2: EXCITE FEBRUARY 98	23689345	1430583	910317	16.56	1.57	26.02
M3: ALTAVISTA ALL LANGUAGES 2/98	26017027	1478396	1115708	17.60	1.33	23.32
M4: ALTAVISTA BY LANGUAGE 2/98	70718558	2946712	2058398	24.00	1.43	34.36
M5: M3+M4	96735585	4425108	3174106	21.86	1.39	30.48
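
The ratios in the summary table follow directly from the counts, and M5 is the column-wise sum of M3 and M4. A minimal sketch of that arithmetic, with variable and function names chosen here for illustration:

```python
# Counts (English, French, Spanish) copied from the summary table.
COUNTS = {
    "M1": (100_221_545, 6_090_080, 3_230_690),
    "M2": (23_689_345, 1_430_583, 910_317),
    "M3": (26_017_027, 1_478_396, 1_115_708),
    "M4": (70_718_558, 2_946_712, 2_058_398),
}
# M5 is defined as the sum of the two AltaVista measurements.
COUNTS["M5"] = tuple(a + b for a, b in zip(COUNTS["M3"], COUNTS["M4"]))

def ratios(en, fr, es):
    """Pairwise ratios between the three language counts."""
    return {"EN/FR": en / fr, "FR/ES": fr / es, "EN/ES": en / es}

for name, (en, fr, es) in COUNTS.items():
    line = " ".join(f"{key}={value:.2f}" for key, value in ratios(en, fr, es).items())
    print(name, line)
```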

COMMENTS

There is some coherence between the results of the three search engines, but we need to acknowledge the variations as well. It seems that AltaVista favors English less than the two others. Can the introduction of language recognition explain the differences? Probably. How much weight can now be given to the study of trends? We still have some doubts, and that also justifies our redoing this study with a more solid methodology, both linguistically and technically.

TRENDS

	EN/FR	FR/ES
AVERAGE MARCH 1996	21.91	2.40
AVERAGE MARCH 1997	19.99	1.92
AVERAGE FEBRUARY 1998	17.60	1.33

The progressions are almost linear, and the extrapolation shows the ratio English/French equal to 1 in 2006 and the ratio French/Spanish equal to 1 in 2000.

MEASUREMENTS OF THE DIACRITIC CHARACTERS

A result of interest for those who advocate the correct use of languages on the network is the measurement of the relationship between spellings with and without diacritic characters. The results are stable across all the engines.

	French	Spanish
Percentage of sites without diacritic (average)	20%	50%

CONCLUSION

It is now time to consolidate the method with the support of linguists. In collaboration with the Latin Union and its team of language specialists, Funredes has begun generalizing the study and including three other Latin languages, while paying close attention to the linguistic methodology. A sample of words to be measured which best fits the linguistic criteria is being developed (the obstacles are numerous!) and will be used as a basis for measuring the presence of the following six languages: English, Spanish, French, Italian, Portuguese and Romanian. The results will be published in a few weeks.
