L5 Title

3. Overview of the study and its results

3.1 Methodology

The results were obtained using the 1998 methodology. First, 57 words were selected in each language, each with its variants (with or without diacritical marks, and with orthographic, synonymic, dialectal or morphosyntactic variations), such that the words had an equivalent meaning and connotation across the studied languages (for details of the linguistic criteria, see part 4.2 and Appendix 7). The search results were then analysed and compared to deduce, by statistical methods, the percentage presence of each language. For each occurrence, the ratio of the Latin language to English was treated as a random variable, and the statistical techniques were applied under the hypothesis that this random variable follows a normal (Gaussian) distribution. The results below summarise those given by the two search engines that satisfied the selection criteria described in Appendix 4. The measurements used in this study were carried out between August 2000 and June 2001.
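The statistical step can be illustrated with a minimal sketch. It assumes, as the text states, that the per-term ratio of a language to English is normally distributed. The hit counts are invented purely for illustration, and the function name and the 95% interval are assumptions of this sketch, not details taken from the study.

```python
import math
import statistics


def mean_ratio_with_interval(counts, english_counts, z=1.96):
    """Return the mean per-term ratio to English and an approximate 95% interval.

    Assumes the per-term ratios are normally distributed (the study's stated
    hypothesis); z=1.96 is the usual normal approximation.
    """
    ratios = [c / e for c, e in zip(counts, english_counts)]
    mean = statistics.mean(ratios)
    # Standard error of the mean, from the sample standard deviation.
    sem = statistics.stdev(ratios) / math.sqrt(len(ratios))
    return mean, (mean - z * sem, mean + z * sem)


# Invented hit counts for four terms (the study used 57 terms per language).
spanish_hits = [1200, 950, 400, 2100]
english_hits = [10000, 9500, 5000, 18000]

mean, (low, high) = mean_ratio_with_interval(spanish_hits, english_hits)
print(f"mean ratio: {mean:.2%}, interval: {low:.2%} to {high:.2%}")
```

With only a handful of terms the interval is wide; with 57 terms per language, as in the study, it would narrow accordingly.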

3.2 Commented summary on the results

3.2.1 Results relative to English

The following table shows the mean ratio of each Latin language (plus German) to English, obtained by measuring the occurrence of the terms on the Web in June 2001.

Table 1: Mean ratio of Latin languages (plus German) to English

LANGUAGE    WWW
SPANISH     10.95%
FRENCH      8.86%
ITALIAN     5.88%
PORTUGUESE  5.40%
ROMANIAN    0.32%
GERMAN      > 13.42% (estimate; see note 1)



3.2.2 Results in full

The results above make it possible to evaluate the presence of the Latin languages relative to English, and to approximate that of German. To give figures for the absolute presence of these languages on the Web, it is first necessary to make a hypothesis about the absolute presence of English. The table below shows the absolute values derived from various hypotheses on the presence of English.

Table 2: Absolute presence of the studied languages on the Web

If ENGLISH = 65.00% 60.00% 55.00% 52.00% 50.00% 45.00% 40.00%
then SPANISH = 7.12% 6.57% 6.02% 5.69% 5.48% 4.93% 4.38%
then FRENCH = 5.76% 5.32% 4.87% 4.61% 4.43% 3.99% 3.54%
then ITALIAN = 3.82% 3.53% 3.23% 3.06% 2.94% 2.65% 2.35%
then PORTUGUESE = 3.51% 3.24% 2.97% 2.81% 2.70% 2.43% 2.16%
then ROMANIAN = 0.21% 0.19% 0.18% 0.17% 0.16% 0.14% 0.13%
then GERMAN (see note 2) = 8.71% 8.04% 7.37% 6.97% 6.70% 6.03% 5.37%
Thus the presence of other languages = 5.83% 13.10% 20.35% 24.96% 27.59% 34.83% 42.07%

This table gives a clearer picture of the absolute weight of the studied languages among the pages that exist on the Web. One of the most significant indicators is the share left over for the remaining languages, which leads us to hypothesise that the absolute presence of English is most probably around 52%.
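The derivation behind Table 2 can be sketched as follows: each language's ratio to English (Table 1) is scaled by the assumed English share, and whatever remains is attributed to the other languages. The function name is hypothetical, and because the sketch starts from the rounded ratios of Table 1, its output may differ from the published table in the last digit.

```python
# Mean ratio of each language to English, in percent (Table 1).
RATIO_TO_ENGLISH = {
    "Spanish": 10.95,
    "French": 8.86,
    "Italian": 5.88,
    "Portuguese": 5.40,
    "Romanian": 0.32,
    "German": 13.42,  # corrected estimate, see note 1
}


def absolute_presence(english_share):
    """Return the share of Web pages per language for a given English share (percent)."""
    shares = {
        language: ratio * english_share / 100
        for language, ratio in RATIO_TO_ENGLISH.items()
    }
    # The remainder after English and the studied languages goes to all other languages.
    shares["Other languages"] = 100 - english_share - sum(shares.values())
    return shares


for language, share in absolute_presence(52.0).items():
    print(f"{language}: {share:.2f}%")
```

Calling `absolute_presence` with 65, 60, 55 and so on reproduces the other columns of the table under the same assumptions.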

Chinese and Japanese in all likelihood have a presence comparable to that of German or Spanish (between 5 and 8%). Next come the languages that represent between 0.5 and 3% (Korean, Dutch, Russian and the four Scandinavian languages, giving a total of between 8 and 10%), then the languages with a weak presence, like Romanian (around ten languages at 0.1%, making up 1%) and, finally, the very numerous languages whose presence is marginal. This last group is very difficult to estimate; on the hypothesis that there are 200 languages, each with a presence of 0.01%, we get a total of 2%. One of the things we know least about, and which remains for future evaluation, is how many languages are actually present on the Internet, given that the languages existing today number somewhere between 3,000 and 6,000.

These estimates give a total weight of about 25% for the non-studied languages, which leads us to support the hypothesis that the absolute presence of English is 52%.
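The arithmetic behind the 25% figure can be checked with a short sketch. It assumes that the range of 5 to 8% applies to Chinese and to Japanese individually, which is one reading of the text and the one under which the total centres on 25%. The group labels are paraphrases.

```python
# (low, high) estimates in percent of Web pages, as read from the text.
# Assumption: the 5-8% range applies to Chinese and Japanese individually.
ESTIMATES = {
    "Chinese": (5.0, 8.0),
    "Japanese": (5.0, 8.0),
    "Korean, Dutch, Russian, Scandinavian languages": (8.0, 10.0),
    "about ten languages at 0.1% each": (1.0, 1.0),
    "200 marginal languages at 0.01% each": (2.0, 2.0),
}

low = sum(lo for lo, _ in ESTIMATES.values())
high = sum(hi for _, hi in ESTIMATES.values())
midpoint = (low + high) / 2

print(f"non-studied languages: {low:.0f}% to {high:.0f}%, midpoint {midpoint:.0f}%")
# prints "non-studied languages: 21% to 29%, midpoint 25%"
```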

The estimate of 25% for the languages that have not been directly included in the study is reinforced by the dynamic evolution of their weighting detailed in chapter 4.3.3.

3.3 Relationship between the number of speakers and their presence on the Web

It is clear that the values for absolute presence are not truly indicative of the strength of a language on the networks. To get a meaningful result, the values that show the presence of the languages on the Internet need to be compared with their presence in the real world. The relative presence of these languages is calculated without putting too much emphasis on 'multilingualism'. This method still contains the methodological weaknesses found in the L4 study.

Table 3: Presence of the studied languages (figures in millions)

                                         English  Spanish  Portuguese  French  Italian  Romanian  German
Absolute presence (number of speakers)   630      375      190         130     60       30        120
Relative presence (% of world popn.)     10.50%   6.25%    3.17%       2.17%   1.00%    0.50%     2.00%


Table 4: Weighted presence of the studied languages on the WWW

LANGUAGE    Absolute presence 2001  Weighted presence 1998  Weighted presence 2000  Weighted presence 2001
ENGLISH     52.00%                  7.14                    5.71                    4.95
SPANISH     5.69%                   0.40                    0.78                    0.91
FRENCH      4.61%                   1.30                    2.02                    2.12
ITALIAN     3.06%                   1.50                    2.77                    3.06
PORTUGUESE  2.81%                   0.26                    0.68                    0.88
ROMANIAN    0.17%                   0.30                    0.38                    0.34
GERMAN      > 6.97%                 Not available           3.15 (see note 3)       3.49 (see note 3)

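The weighted presence is the quotient of a language's absolute presence on the Web and its relative presence in the world population (Table 3). A quotient equal to 1 is considered a 'normal' result; a quotient lower than 1 is considered weak, and one higher than 1 a respectable result.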
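As a minimal sketch of how the 2001 quotients can be recomputed, the code below divides each language's share of Web pages (at an English share of 52%) by its share of world population from Table 3. The function names are hypothetical. Because the inputs are the rounded published percentages, the last digit may differ slightly from the table for some languages.

```python
# Share of Web pages in 2001 and share of world population, both in percent
# (values copied from the tables of this section).
WEB_SHARE = {
    "English": 52.00, "Spanish": 5.69, "French": 4.61, "Italian": 3.06,
    "Portuguese": 2.81, "Romanian": 0.17, "German": 6.97,
}
POPULATION_SHARE = {
    "English": 10.50, "Spanish": 6.25, "French": 2.17, "Italian": 1.00,
    "Portuguese": 3.17, "Romanian": 0.50, "German": 2.00,
}


def weighted_presence(language):
    """Quotient of Web share and population share for one language."""
    return WEB_SHARE[language] / POPULATION_SHARE[language]


def verdict(quotient):
    """Classify a quotient using the thresholds given in the text."""
    if quotient > 1:
        return "respectable"
    if quotient < 1:
        return "weak"
    return "normal"


for language in WEB_SHARE:
    quotient = weighted_presence(language)
    print(f"{language}: {quotient:.2f} ({verdict(quotient)})")
```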

There is a strong increase in Spanish and especially in Portuguese, but both remain below the 'normal' threshold. German and Italian obtained an excellent score and French a good one.

3.4 Content productivity of internauts classed by language

The latest results from the Global Reach study (http://www.glreach.com), released at the end of March 2001, give the number of Internet users by language:

Table 5: Numbers of internauts classed by language

                       English  Spanish  Portuguese  French  Italian  Romanian  German  Remainder
Internauts (millions)  215.6    20.4     16.6        14.2    11.5     0.6       27.5    146.2
Distribution in %      47.6%    4.5%     3.7%        3.1%    2.5%     0.13%     6.1%    32.2%



Comparing these results with those of our own study (see Table 6), it should be possible to deduce which language communities produce the most information on the Web.

Table 6: Productivity of speakers

LANGUAGE    Pages               Internauts  P/I
ENGLISH     52.00%              47.6%       1.09
SPANISH     5.69%               4.5%        1.26
FRENCH      4.61%               3.7%        1.25
ITALIAN     3.06%               3.7%        1.25
PORTUGUESE  2.81%               2.5%        1.12
ROMANIAN    0.17%               0.13%       1.31
GERMAN      6.97% (see note 4)  6.1%        1.14



We arrived at a remarkable result: the proportion of webpages available in each language and the proportion of internauts using it are of the same magnitude! The ratio of pages to users is around 1 for all the languages studied (see note 5), which would indicate that, at this time, the quantity of webpages produced in a particular language is directly proportional to the number of internauts who use that language. The English result is surprising: a higher value would be expected as a result of multilingualism (see note 6). Could this indicate that the productivity of native English speakers is lower than that of the speakers of the other languages mentioned, and is it proof that speakers of other Western languages are more conscious of the linguistic stakes of the Internet? It would be very interesting to know how much web content exists for languages new to the Internet...
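The P/I column is simply the share of pages divided by the share of internauts. A minimal sketch for four of the rows of Table 6 follows; the function name is hypothetical.

```python
# (share of Web pages, share of internauts), both in percent, from Table 6.
SHARES = {
    "English": (52.00, 47.6),
    "Spanish": (5.69, 4.5),
    "Romanian": (0.17, 0.13),
    "German": (6.97, 6.1),
}


def pages_per_internaut(pages_share, users_share):
    """Ratio of the share of pages to the share of users (the P/I column)."""
    return pages_share / users_share


for language, (pages, users) in SHARES.items():
    ratio = pages_per_internaut(pages, users)
    print(f"{language}: P/I = {ratio:.2f}")
```

A ratio of exactly 1 would mean that a language's share of pages matches its share of users.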


  1. In this version of the study, the quality of the results for German differs from that of the other languages. German word formation differs greatly from that of the other studied languages, which would heavily 'penalise' the German results when the search engines are asked for results "by isolated or separated terms" (or rather by 'morphological unit'), that is to say with no context before or after the term. To obtain results as valid as those for the other languages, further research needs to be carried out by 'non-isolated term' (i.e. with an indeterminate context before and after), also making use, as far as possible, of a specific factor that accounts for the difference in the quantity of separated terms among the studied languages, or between certain languages and German. Our solution here has been to keep the sample, extending the same linguistic methodology to include 57 equivalent terms in German, and to search again by term or isolated 'morphological unit'. The results obtained in this way needed to be corrected upwards by at least 30% to reflect the linguistic reality. The figure of 13.42% is obtained by increasing the first gross figure of 10.5% by 30%.
  2. Results increased by 30% (see above note).
  3. See "2" above.
  4. See "2" above.
  5. The difference in the absolute values is less than 25%, and it is difficult to draw conclusions from the small variations, which are probably within the "confidence interval" of the figures published by Global Reach, which does not have a standard methodology for all languages.
  6. The number of non-English speakers who publish pages in English (or translate their pages into English) is, as we know, very high.