4. RESULTS IN DETAIL

4.1 Internet Methodology

The rapid evolution, since our last study, of the search engines that index the content of the Web called for a supplementary study and a thorough re-examination of the Internet methodology used to obtain our results. This study went through three stages:

Identifying and selecting the main search engines available.
Verifying how they perform searches, in particular their counting methods¹.
Selecting the search engines which would best suit our methodology.

4.1.1 Identification of the main search engines available on the Web and their shortlisting

During this first stage the following search engines were identified: Altavista, Excite, Fastsearch², Google, Hotbot, Infoseek, iWon, Lycos, Northern Light, Yahoo and Webtop.

Webtop, a completely new search engine, has not yet been tested enough to warrant its inclusion. Hotbot, Lycos and Yahoo have been left out because they are direct partners of other search engines and give the same results: Lycos uses Fastsearch's index, and Yahoo that of Google. Hotbot and iWon share the same index, that of Inktomi. Hotbot, which was our choice for the previous study, has not been included again as it no longer shows the count results. Inktomi's index is not offered directly to users: it is accessed through iWon. Excite could not be used because, like Hotbot, it did not show the count results at the time of the study³.

The following search engines were used: Altavista, Fastsearch, Google, Infoseek, iWon and Northern Light.

4.1.2 Validation of the selected search engines as part of the applied methodology

The automation of the counting process, described in detail in Appendix 5, made it possible to obtain results from the 6 preselected search engines, each one being queried for the 1600 variants of the 57 terms in each language. The results showed clear differences between the search engines and caused us considerable concern about the robustness of our methodology. It was clear that we needed to analyse the search engines used in order to explain these differences and to determine which of them produce the most credible results.

Several criteria were put forward to validate the use of a particular search engine in this study. To be useful in the application of our methodology, a search engine must have the following characteristics:

possess an index of a size comparable to the Web,
handle diacritical marks consistently,
give consistent results with regard to the counting of the pages found,
have a homogeneous index with regard to languages.

The results returned by each search engine, the details concerning their selection, as well as other general information are available in Appendix 4.

4.1.3 Final selection of the search engines for the application of the methodology

Just two search engines, Google and Fastsearch, were chosen from the shortlist to support the study of the presence of different languages on the Web in August 2000. The results returned by these two search engines were added together⁴ in order to arrive at the final result. The similarity of the results from the two engines, whose indexing and search methods differ, is a good indicator of the robustness of our methodology:

	English⁵	Spanish	French	Italian	Portuguese	Rumanian	German⁶
Google	210	7.86%	7.33%	4.65%	2.82%	0.27%	7.89%
Fast	147	8.41%	7.33%	4.60%	3.95%	0.37%	8.47%

However, a follow-up count carried out in June 2001 suggests that Google does not handle diacritical marks satisfactorily, so we have had to set aside the results it returns.

4.2 Linguistic Methodology

Aside from the exploratory introduction of German equivalents and the correction of a few errors which cropped up when defining the word variants⁷, the linguistic methodology remains the same as in the last study.

The selection of 57 terms for each language made in 1998 was extended to include German. Each term, together with a certain number of variants (orthographic, with or without diacritical marks, synonymic, dialectal or morphosyntactic...), was chosen on the principle that it could be considered both equivalent to those of the same class in all of the studied languages and distinctive, that is to say without (or almost without) interlinguistic homographs⁸ among its variants or other equivalent obstacles. The sample of the 57 terms (with all the corresponding variants in each language) can be found in Appendix 3.

4.2.1 New problems posed by German

German word construction is very different from that of the other languages studied up until now: in languages like German a single word can be 'composed' of several radicals which, in the equivalent terms in the other languages studied (with the partial exception of English, though much less so than German), are separated into distinct words, constituting a syntagm.

Because the equivalent terms had already been defined for 'non-composed' words and the searches were made for separate words, without an indefinite context before or after, German is heavily 'penalised': frequent terms such as Ziegenkäse, equivalent to "goat's cheese" in English and "fromage de chèvre" in French, are systematically ruled out.

As a first step, the results obtained through the old methodology have been increased by 30% here, to give a minimum possible threshold. But in order to get results as reliable as those for the other languages, another search will undoubtedly need to be carried out for 'non-isolated' words (with and without an unspecified context before and after the word), supplemented, as far as possible, by a factor which expresses the difference in the number of words between German and the other languages in the study. This corrective factor will probably be found in research carried out on similar interlinguistic corpora.

4.2.2 Other problems

The linguistic work is best understood by consulting Appendix 3 and Appendix 7. Further details of the linguistic methodology can be found in the previous study, L4, Chapter 4.2.

It is worth remembering that there is frequently a bias on the Web between words with diacritical marks (accents, etc.) and their variants without diacritical marks. In the case of German, the morphosyntactic upper / lower case distinction has not been taken into account, as it is ignored by our chosen search engines.

Furthermore, the decision was made not to include words of fewer than 4 letters, to avoid possible homographs (most often, but not only, with abbreviations). Homographs between at least two of the studied languages were extremely common, especially but not only between Spanish and Portuguese, and we emphasise that we also had to avoid coincidences with borrowed words. Sometimes a case-form homograph, such as the German variants Montage / Montages of Montag (Monday), coincided with a French word that is itself a borrowed homograph, since the French word is borrowed by almost all other languages in the world of cinema.

4.3 Statistical Methodology

The 90% and 99% confidence intervals (intervals of reliability) of the results were established using Student's t-distribution, under the hypothesis of a normal distribution.
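As an illustration of this kind of interval, here is a minimal Python sketch. It is not the authors' procedure: the helper name, the sample size of 57 (one value per term) and the tabulated critical value of about 1.673 (two-sided 90%, 56 degrees of freedom) are all assumptions made for the example. Only the mean and standard deviation are taken from the report (Spanish, Table 7).

```python
import math

def confidence_interval(mean, std_dev, n, t_critical):
    """Two-sided interval: mean -/+ t_critical * std_dev / sqrt(n).

    Hypothetical helper for illustration; t_critical must be looked up
    in a Student t table for n - 1 degrees of freedom.
    """
    half_width = t_critical * std_dev / math.sqrt(n)
    return mean - half_width, mean + half_width

# Assumptions (not stated in this section): n = 57 observations and a
# critical value of about 1.673 for a 90% interval with 56 degrees of freedom.
# Mean and standard deviation are the Spanish figures from Table 7,
# expressed in percent relative to English.
low, high = confidence_interval(mean=10.95, std_dev=9.46, n=57, t_critical=1.673)
print(f"90% interval: {low:.2f} - {high:.2f}")
```

Under these assumptions the interval comes out close to the one reported for Spanish in Table 7. The exact sample size and critical value used by the authors are not given here, so small differences are to be expected.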

4.3.1 Tally results for the selected search engines

Table 16 in Appendix 4 shows, by language, the results from the Web as given by the 6 selected search engines in August 2000.

4.3.2 Statistical calculations of results relative to English

These are mean percentages which represent the presence of the Latin languages (and German) relative to English (June 2001 results).

Table 7 : Details of the statistical results

	Spanish	French	Italian	Portuguese	Rumanian	German⁹
*Mean*	10.95%	8.86%	5.88%	5.40%	0.32%	13.4%
*Standard deviation*	9.46%	5.09%	5.55%	5.49%	0.33%	8.97%
*Variance coefficient*	0.86	0.57	0.94	1.01	1.02	0.66
*Interval of reliability at 90%*	8.89-13.01	7.75-9.97	4.67-7.09	4.20-6.60	0.25-0.39	11.45-15.37

The variance coefficient (coefficient of variation) is the square root of the standard deviation squared divided by the mean squared, that is, the standard deviation divided by the mean. A value greater than 1 signifies a strong dispersal and an unreliable mean. A value lower than 1 indicates a weaker dispersal, and the lower the value, the more reliable the result. The interval of reliability, taken relative to the mean, is therefore narrower the lower the variance coefficient.
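The definition above can be checked against Table 7 with a short Python sketch. The function and variable names are ours, chosen for illustration, and the inputs are the rounded means and standard deviations printed in the table, so the last digit may differ slightly from the published coefficients.

```python
# (mean, standard deviation) in percent relative to English, from Table 7.
table_7 = {
    "Spanish": (10.95, 9.46),
    "French": (8.86, 5.09),
    "Italian": (5.88, 5.55),
    "Portuguese": (5.40, 5.49),
    "Rumanian": (0.32, 0.33),
    "German": (13.4, 8.97),
}

def variance_coefficient(mean, std_dev):
    """sqrt(std_dev**2 / mean**2), which reduces to std_dev / mean for mean > 0."""
    return std_dev / mean

for language, (mean, std_dev) in table_7.items():
    cv = variance_coefficient(mean, std_dev)
    label = "strong dispersal" if cv > 1 else "weaker dispersal"
    print(f"{language}: {cv:.2f} ({label})")
```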

4.3.3 Calculations of the absolute results

According to the conclusions set out in Chapter 3.2.2, the absolute presence of the languages studied can be shown as:

English	52%
Spanish	5.69%
French	4.61%
Italian	3.06%
Portuguese	2.81%
Rumanian	0.17%
German¹⁰	6.97%
Remainder	24.69%

Here Spanish, as we predicted in the previous study, has surpassed French, and there is more German content than content in any of the Latin languages.

These results were obtained from the relative results (previous chapter) together with a realistic approximation of the weighting of the languages not included in the study, referred to as "remainder".
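The figures listed above are consistent with multiplying each language's mean presence relative to English (Table 7) by the absolute share assigned to English. The following Python sketch reproduces that arithmetic; the names are ours, and the assumption that this is exactly how the authors computed the list is an inference from the numbers, not something stated in the text.

```python
ENGLISH_SHARE = 52.0  # absolute share of English used in this section, in percent

# Mean presence relative to English, in percent (Table 7).
relative_to_english = {
    "Spanish": 10.95,
    "French": 8.86,
    "Italian": 5.88,
    "Portuguese": 5.40,
    "Rumanian": 0.32,
    "German": 13.4,
}

def absolute_share(relative_percent, english_share=ENGLISH_SHARE):
    """Convert a percentage relative to English into an absolute percentage."""
    return relative_percent * english_share / 100.0

absolute = {lang: absolute_share(rel) for lang, rel in relative_to_english.items()}

# Whatever is left after English and the studied languages is the remainder.
remainder = 100.0 - ENGLISH_SHARE - sum(absolute.values())

for lang, share in absolute.items():
    print(f"{lang}: {share:.2f}%")
print(f"Remainder: {remainder:.2f}%")
```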

The approximation for the languages not included in the study was made by measuring the size of each language's domain in Fastsearch's index. When these counts were made, Fastsearch's index contained 575 million pages in 31 languages. To obtain the index count (using Fastsearch's own algorithm), the "advanced search" option is used and a search is submitted for each language, applying the technique first used in the previous study and called "complement of the empty set" (a search for the number of pages which do not contain an imaginary word¹¹). The resulting Table 17 can be found in Appendix 6. This table gives an approximate weighting for each language, deduced from the capacity of the search engine's algorithm to recognise languages, which is undoubtedly not perfect. For example, if a search is submitted for the letter "è" in English sites (in Google or Fastsearch), the result is a million pages, including some in Thai, Korean, Japanese, Russian...

A different way of weighting the languages we have not studied is to look at their evolution between L4 and L5. From the table of hypotheses of absolute values set out in Chapter 3.2.1, and from the absolute values of the languages included in the study carried out in September 1998, we arrive at the following table:

Table 8 : Hypothesis of the evolution of the weighting of the languages studied

Languages studied	L5's hypotheses of absolute weighting			L4 Sept 1998	Evolution L4/L5
ENGLISH	55.00%	50.00%	45.00%	75.00%	-26.67%	-33.33%	-40%
SPANISH	6.02%	5.48%	4.93%	2.53%	137.94%	116.60%	94.86%
FRENCH	4.87%	4.43%	3.99%	2.81%	73.31%	57.65%	41.99%
ITALIAN	3.23%	2.94%	2.65%	1.50%	115.33%	96.00%	76.67%
PORTUGUESE	2.97%	2.70%	2.43%	0.82%	262.20%	229.27%	196.34%
RUMANIAN	0.18%	0.16%	0.14%	0.15%	20.00%	6.67%	-6.67%
Remainder of languages	20.35%	27.59%	34.83%	17.19%	18.38%	60.50%	102.62%

Once again, it is the hypothesis of the absolute weighting of English at 50% which is the most realistic.

In fact, a growth of less than 18.38% for the other languages¹² does not appear sufficient: this would represent an evolution twice as slow as that of Rumanian and 4 to 15 times slower than that of the other Latin languages studied. On the other hand, an increase of 102% for the non-studied languages seems exaggerated: that would translate as faster world-wide growth than that of the Latin languages studied (excepting Portuguese). A growth of, on average, 60% for the non-studied languages would put them on the same level as French and would seem more credible. This crosschecking strengthens our hypothesis that the absolute value of English stands at 50% as a final result.
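The evolution columns of Table 8 are consistent with a simple relative change between the L4 share and the L5 share. The Python sketch below recomputes them for the 50% English hypothesis; the function and variable names are ours, and the input figures are copied from the table.

```python
# Absolute shares in percent, copied from Table 8.
l4_sept_1998 = {
    "English": 75.00, "Spanish": 2.53, "French": 2.81, "Italian": 1.50,
    "Portuguese": 0.82, "Rumanian": 0.15, "Remainder": 17.19,
}
l5_english_at_50 = {
    "English": 50.00, "Spanish": 5.48, "French": 4.43, "Italian": 2.94,
    "Portuguese": 2.70, "Rumanian": 0.16, "Remainder": 27.59,
}

def evolution_percent(old_share, new_share):
    """Relative change from old_share to new_share, in percent."""
    return (new_share - old_share) / old_share * 100.0

for language, old in l4_sept_1998.items():
    new = l5_english_at_50[language]
    print(f"{language}: {evolution_percent(old, new):+.2f}%")
```

Swapping in the 55% or 45% column of Table 8 gives the other two evolution columns in the same way.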

4.4 Comparison with other studies

4.4.1 Comparison with the previous studies

The English / French and French / Spanish relationships have changed between the first and the present study in the following way¹³:

Table 9 : Evolution of the relationships of the weighting between English, French and Spanish

	English/French	French/Spanish	English/Spanish
March 1996 (L1)	21.91	2.40	52.58
March 1997 (L2)	19.99	1.92	38.38
March 1998 (L3)	17.60	1.33	23.32
September 1998 (L4)	35.59	1.11	39.53
August 2000 (L5)	13.66	0.91	12.38
June 2001 (L5)	11.28	0.81	9.14

The figures for L1 to L3, as we recall, are too approximate to be taken seriously. Real comparison is only possible from L4 onwards.
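The last row of Table 9 appears to follow from the mean percentages relative to English given in Table 7. The Python sketch below shows that relationship; the function name is ours, and because the inputs are rounded table values the results can differ from the published ratios in the last digit.

```python
def language_ratios(french_rel, spanish_rel):
    """Ratios between English, French and Spanish.

    Inputs are percentages relative to English, so English itself is 100.
    """
    english_french = 100.0 / french_rel
    french_spanish = french_rel / spanish_rel
    # The third ratio is the product of the first two.
    english_spanish = english_french * french_spanish
    return english_french, french_spanish, english_spanish

# Means relative to English from Table 7: French 8.86%, Spanish 10.95%.
ef, fs, es = language_ratios(french_rel=8.86, spanish_rel=10.95)
print(f"English/French {ef:.2f}, French/Spanish {fs:.2f}, English/Spanish {es:.2f}")
```

A French/Spanish ratio below 1 means that Spanish pages outnumber French pages.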

4.4.2 Comparison with similar studies (Alis and Inktomi)

Alis' study of 1998 has not been repeated: we therefore retain the analysis carried out during L4. Inktomi, on the other hand, has published results which caused a great stir on the Internet and are now used as an official source in numerous reports.

Table 10 : Results of Inktomi's study (February 2000)

LANGUAGE	PROPORTION (%)
English	86.54
German	5.83
French	2.36
Italian	1.55
Spanish	1.23
Portuguese	0.75
Dutch	0.54
Finnish	0.50
Swedish	0.36
Japanese	0.34

These figures contribute to the erroneous view that English continues to represent more than 80% of existing webpages. However, it is easy to find an aberration in these results, both in their presentation and their interpretation...

In fact, the proportion said to be represented by English (86%) is not relative to all languages but only to the 10 languages shown in the table, whose total adds up to 100%! If we follow the hypothesis that 30% of webpages are in languages not included by Inktomi, the actual figure for English would be 86.54% × (100% − 30%) = 60.58%!
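The correction described above can be sketched in a few lines of Python. The figures are those of Table 10; the 30% share for languages outside Inktomi's list is the hypothesis stated in the text, and the function name is ours.

```python
# Proportions in percent, copied from Table 10 (Inktomi, February 2000).
inktomi = {
    "English": 86.54, "German": 5.83, "French": 2.36, "Italian": 1.55,
    "Spanish": 1.23, "Portuguese": 0.75, "Dutch": 0.54, "Finnish": 0.50,
    "Swedish": 0.36, "Japanese": 0.34,
}

def rescale(shares, uncovered_percent):
    """Rescale shares that sum to 100% over a subset of languages so that
    they leave uncovered_percent for all languages outside the subset."""
    factor = (100.0 - uncovered_percent) / 100.0
    return {language: value * factor for language, value in shares.items()}

# The ten listed languages alone already add up to 100%.
print(f"Sum of listed languages: {sum(inktomi.values()):.2f}%")

# Hypothesis from the text: 30% of pages are in languages not listed.
rescaled = rescale(inktomi, uncovered_percent=30.0)
print(f"English after rescaling: {rescaled['English']:.2f}%")
```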

Apart from this evident slip, which devalues the results as a whole (but doesn't prevent the majority of Internet marketers from citing these absurd figures...), it is still interesting to compare our results, based on a sample of terms whose linguistic selection criteria are presented in our reports, with those of the language search algorithms used by the different search engines, whose workings remain out of view. See, for example, Table 17 in Appendix 6, which gives figures from Fastsearch and compares them with those of our study.

Until there is proof to the contrary, we must consider our method the more rigorous and conclude that the Internet language search algorithms have an unfortunate tendency to overestimate the presence of English.

¹ NB: Producing a count is not part of the main function of search engines, which is to identify the pages that contain the search terms, in order of relevance. Some search engines provide a count of the pages which match the search criteria; others do not. In all cases the returned results must be checked for weaknesses.
² Also known as Alltheweb.
³ This problem has since been corrected; the results from Excite will be included in the next version of this study as long as it continues to offer this function.
⁴ For the statistical calculations we added together the results from the two search engines to get a longer series of random variable values.
⁵ Millions of pages in English.
⁶ These are the gross results, without the 30% correction.
⁷ These were minor errors which did not cause too much disparity in the published results of the last study. This is detailed in Appendix 3.
⁸ We are referring to the forms which may be written the same way in more than one language; homographs within a language are considered as the same word (graphically).
⁹ Results increased by 30% (see note on Table 1, Chapter 3.2.1).
¹⁰ Results increased by 30% (see note on Table 1, Chapter 3.2.1).
¹¹ The search argument is, for example, < - "hgavdhjgduhgedujhgsdfyuhg"> .
¹² The "other languages" cover different realities: Scandinavian and Asian languages are growing faster than the less prevalent languages.
¹³ This change is to be viewed with reservation, as the linguistic methodology used to get the results for L1 to L3 was not as strong as that used for L4 and this study.