Lenguas y culturas - Langues et cultures - Languages and Cultures

 

 

 

 

INDICATORS OF LANGUAGES IN THE INTERNET

SYNTHESIS OF RESULTS FROM VERSION 3, March 2022

 

Version 1 : 2017, with 130 languages with L1 > 5 million speakers

Version 2 : 2021, with 329 languages with L1 > 1 million speakers and important bias reduction

Version 3 : 3/2022, with comprehensive bias reduction and redefinition of some outputs

 

 

More than a new version, this marks the maturity of the method:
all the biases are now controlled to an acceptable threshold,
and the indicators produced are reliable within a -20% to +20% confidence interval.

 




 
The Observatory is pleased to share the results of version 3 of its model for computing indicators of the presence of languages on the Internet,
which, like version 2 (announced in 2021), covers the 329 languages with more than one million native speakers.

A confidence interval of -20% to +20% may seem wide by the standards of other statistical work,
but for data about the place of languages on the Internet, a subject that has always been very difficult to measure
and is prone to chronic misinformation,
this is a feat.


All the results are available under the CC-BY-SA 4.0 license.

What do the results tell us?

The transition of the Internet from the domination of European languages, led by English,
towards Asian languages and Arabic, led by Chinese, is well advanced,
and the winner is multilingualism,
but African languages are slow to take their place.


SYNTHESIS

Access a short PDF document describing the linguistic resource obtained

(English version presented at LREC2022 in SIGUL2022 workshop)

English

Français

Português

Español




 

CREDITS

 

 

      STUDIES CONDUCTED BY 

 

      WITH SUPPORT OF         for Version 1 (2017) and Version 3 (2022)

      WITH SUPPORT OF         via funding from the Cultural and Educational Dept. of the Brazilian Ministry of Foreign Affairs and coordination from         for Version 2 (2021)

                                                   

      MAIN RESEARCHER : Daniel Pimienta
 
      MAIN ASSISTANT for Versions 2 and 3 : Álvaro Blanco
 
      LINGUISTIC & FUNDER COORDINATION SUPPORT FOR Version 2 : Gilvan Müller de Oliveira
 
      WORKING METHODOLOGY : partially based on an idea of Daniel Prado



 


 

METHODOLOGICAL NOTE

 

 

 

 

    This is an indirect approximation of the space occupied by languages on the Internet, using different data sources and statistical techniques.

 

    All computations and results are based on L1+L2, where L1 is the mother tongue and L2 the second language(s).

 

    According to our main demo-linguistic source (Ethnologue #24), the world population (L1) and the population of L1+L2 speakers are:

L1 = 7 231 699 136     L1+L2 = 10 361 716 756       (L1+L2)/L1 = 1.4328
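
As a quick arithmetic check of the ratio quoted above, the following Python sketch divides the two totals. It assumes that the second figure is the L1+L2 total, which is what the stated ratio of 1.4328 implies; the variable names are illustrative only and do not come from the Observatory's model.

```python
# Totals quoted in the note above (source: Ethnologue #24).
l1_speakers = 7_231_699_136        # world population, counted as L1 speakers
l1_l2_speakers = 10_361_716_756    # assumption: this is the L1+L2 total

# Ratio of L1+L2 speaker counts to L1 speaker counts.
ratio = l1_l2_speakers / l1_speakers
print(round(ratio, 4))  # 1.4328
```

A ratio above 1 is expected, since a person who speaks second languages is counted once for each language.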

 

    The confidence interval of all the produced figures is estimated to be within the window
-20% … V … +20%, where V is the published value.

 

Read the results below as follows: the % of Web contents in English is higher than 16% and lower than 24%;

the % of contents for the rest of languages is between 18% and 26%.
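
The reading rule above can be sketched in Python. This is a minimal illustration, assuming the window is relative (20% of the published value on either side), which matches the English example; the function name `confidence_window` is hypothetical and not part of the Observatory's tooling.

```python
def confidence_window(value_pct: float, margin: float = 0.20) -> tuple[float, float]:
    """Return the (low, high) bounds of a relative +/- margin window around a percentage."""
    low = round(value_pct * (1 - margin), 2)
    high = round(value_pct * (1 + margin), 2)
    return low, high

english_contents = 19.60  # % of Web contents in English, taken from the table
print(confidence_window(english_contents))  # (15.68, 23.52)
```

The bounds 15.68% and 23.52% correspond to the "higher than 16% and lower than 24%" reading given for English.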



 

 

 

 

 

 

                                                                  

 

 

ALL INDICATORS FOR THE 30 LANGUAGES WITH THE HIGHEST CONTENT PERCENTAGE

| RANK | ISO | LANGUAGES | CONN. (% L1+L2) | WORLD POP. (% L1+L2) | INTERNAUTS (% of speakers) | CONTENTS (% L1+L2) | VIRT. PRES. | C. PROD. |
|---|---|---|---|---|---|---|---|---|
| 1 | zho | Chinese Macro | 18,46% | 14,72% | 71,38% | 21,60% | 1,47 | 1,17 |
| 2 | eng | English | 14,83% | 13,01% | 64,86% | 19,60% | 1,51 | 1,32 |
| 3 | spa | Spanish | 6,79% | 5,24% | 73,72% | 7,85% | 1,50 | 1,16 |
| 4 | hin | Hindi | 4,19% | 5,80% | 41,16% | 3,76% | 0,65 | 0,90 |
| 5 | rus | Russian | 3,51% | 2,49% | 80,32% | 3,76% | 1,51 | 1,07 |
| 6 | fra | French | 2,98% | 2,58% | 65,80% | 3,33% | 1,29 | 1,12 |
| 7 | por | Portuguese | 2,99% | 2,49% | 68,43% | 3,13% | 1,26 | 1,05 |
| 8 | ara | Arabic Macro | 3,97% | 3,53% | 63,99% | 3,09% | 0,87 | 0,78 |
| 9 | jpn | Japanese | 1,99% | 1,22% | 92,63% | 2,66% | 2,18 | 1,34 |
| 10 | deu | German, Standard | 2,04% | 1,30% | 89,17% | 2,37% | 1,82 | 1,16 |
| 11 | msa | Malay Macro | 2,36% | 2,36% | 56,93% | 1,96% | 0,83 | 0,83 |
| 12 | tur | Turkish | 1,17% | 0,85% | 78,05% | 1,14% | 1,35 | 0,98 |
| 13 | ita | Italian | 0,87% | 0,66% | 75,83% | 1,00% | 1,53 | 1,14 |
| 14 | kor | Korean | 0,90% | 0,79% | 65,16% | 0,98% | 1,24 | 1,09 |
| 15 | fas | Persian Macro | 1,08% | 0,81% | 75,91% | 0,88% | 1,09 | 0,82 |
| 16 | ben | Bengali | 1,11% | 2,58% | 24,55% | 0,88% | 0,34 | 0,79 |
| 17 | vie | Vietnamese | 0,92% | 0,74% | 70,96% | 0,85% | 1,15 | 0,92 |
| 18 | urd | Urdu | 0,95% | 2,22% | 24,38% | 0,66% | 0,30 | 0,70 |
| 19 | tha | Thai | 0,80% | 0,59% | 77,95% | 0,65% | 1,12 | 0,82 |
| 20 | pol | Polish | 0,60% | 0,39% | 87,09% | 0,63% | 1,59 | 1,04 |
| 21 | mar | Marathi | 0,69% | 0,96% | 41,06% | 0,58% | 0,60 | 0,83 |
| 22 | tel | Telugu | 0,68% | 0,92% | 41,69% | 0,56% | 0,60 | 0,82 |
| 23 | tam | Tamil | 0,61% | 0,82% | 42,15% | 0,51% | 0,62 | 0,83 |
| 24 | jav | Javanese | 0,62% | 0,66% | 53,76% | 0,44% | 0,66 | 0,70 |
| 25 | nld | Dutch | 0,38% | 0,24% | 91,14% | 0,41% | 1,73 | 1,08 |
| 26 | guj | Gujarati | 0,44% | 0,60% | 41,47% | 0,36% | 0,61 | 0,83 |
| 27 | ukr | Ukrainian | 0,40% | 0,32% | 71,02% | 0,35% | 1,09 | 0,88 |
| 28 | kan | Kannada | 0,41% | 0,57% | 41,11% | 0,33% | 0,59 | 0,82 |
| 29 | ron | Romanian | 0,32% | 0,23% | 79,57% | 0,30% | 1,29 | 0,93 |
| 30 | aze | Azerbaijani Macro | 0,33% | 0,23% | 81,54% | 0,28% | 1,21 | 0,85 |
|  |  | REMAIN | 22,60% | 30,10% |  | 15,13% |  |  |
|  |  | TOTAL | 100,00% | 100,00% |  | 100,00% |  |  |


LEGEND
ISO = three-letter ISO 639 language code
L1+L2 = first- and second-language speakers
Internauts = % of the language's speakers who are connected
World Pop. = % of the language's speaker population over the world total of L1+L2 speakers
CONN. = % of the language's connected speakers over the world total of L1+L2 connected persons
CONTENTS = % of Web contents in each language over the total of Internet webpages (NOT over the total of websites!)
VIRT. PRES. = Virtual presence: the ratio of CONTENTS over World Pop. for each language
C. PROD. = Content productivity: the ratio of CONTENTS over CONN. for each language
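
The two ratio indicators defined in the legend can be recomputed from the percentage columns. The sketch below is only an illustration: it copies the CONN., World Pop. and CONTENTS values for three languages from the table, and the helper names `parse_pct` and `derived_indicators` are hypothetical. Because the published percentages are already rounded, recomputed ratios for other rows may differ from the table in the last digit.

```python
def parse_pct(text: str) -> float:
    """Convert a decimal-comma percentage such as '21,60%' to a float (21.6)."""
    return float(text.rstrip("%").replace(",", "."))

def derived_indicators(conn: str, world_pop: str, contents: str) -> tuple[float, float]:
    """Return (virtual presence, content productivity), rounded to 2 decimals."""
    c = parse_pct(contents)
    virt_pres = round(c / parse_pct(world_pop), 2)   # CONTENTS over World Pop.
    c_prod = round(c / parse_pct(conn), 2)           # CONTENTS over CONN.
    return virt_pres, c_prod

# (CONN., World Pop., CONTENTS) as printed in the table for three languages.
rows = {
    "zho": ("18,46%", "14,72%", "21,60%"),
    "eng": ("14,83%", "13,01%", "19,60%"),
    "hin": ("4,19%", "5,80%", "3,76%"),
}

results = {iso: derived_indicators(*values) for iso, values in rows.items()}
for iso, (virt_pres, c_prod) in results.items():
    print(iso, virt_pres, c_prod)
```

For these three rows the recomputed values agree with the VIRT. PRES. and C. PROD. figures in the table (for example 1,47 and 1,17 for Chinese).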

 

 

 

See the results for the top languages by category of indicator

 

Read the basic methodological note

 

Check the comparison with other similar data (W3Techs and InternetWorldStats)

 

Access the full results for all 329 languages by downloading corresponding Excel files

 

 
