The model uses Ethnologue as the source for demo-linguistic data (distribution of L1+L2 speakers per country),

ITU and World Bank for connectivity data (% of persons connected to the Internet per country),

and a large set of data sources (*) to produce 5 indicators:



Internauts : % of connected persons per language

Traffic : % of traffic per language (statistical work based on Alexa and SimilarWeb data from several hundred selected websites) (**)


Usage : % of Internet usage per language, from data divided between subscribers to the main social networks, connecting infrastructure (World Bank data), open applications, streaming and e-commerce (T-Index from Translated)


Interfaces and translation languages : count of the presence of languages in a wide range of application interfaces and online translation applications


Indexes : measure of the strength of countries in terms of Information Society indicators (24 different indicators), converted into per-language figures



The average of those indicators is assumed to be a fair approximation of contents

within a confidence interval of -20% to +20%.



(*)   Most sources offer data per country. The data per language is obtained by weighting with demo-linguistic data.


(**) Most sources fall short of covering all countries; extrapolation techniques, either weighting by the % of connected people or using a quartile approach, are used.
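As an illustration of the averaging step, here is a minimal sketch in Python. The indicator values are invented, the function name `estimate_contents` is hypothetical, and the -20% to +20% interval is assumed to be relative to the estimate; none of this comes from the model's actual spreadsheets.

```python
from statistics import mean

# Hypothetical indicator values for one language, in % of the world total.
# These numbers are invented for illustration; they are not model results.
indicators = {
    "internauts": 5.0,
    "traffic": 4.0,
    "usage": 6.0,
    "interfaces": 5.5,
    "indexes": 4.5,
}


def estimate_contents(values, tolerance=0.20):
    """Average the indicators and return (estimate, low, high).

    Assumption: the -20% to +20% interval is relative to the estimate.
    """
    estimate = mean(values)
    return estimate, estimate * (1 - tolerance), estimate * (1 + tolerance)


estimate, low, high = estimate_contents(indicators.values())
print(f"contents: {estimate:.2f}% (from {low:.2f}% to {high:.2f}%)")
```

With these invented figures, the estimate is 5% of contents, to be read as somewhere between 4% and 6%.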






Why would the mean of previous indicators be a fair approximation of Web contents?


The logical method to measure the presence of languages on the Web would seem to be applying

a reliable language recognition algorithm to all existing webpages and counting…



Yes… but the Web is too large for that method to be practically applicable, and the attempts lose meaning for 2 main reasons:

1)         The sample supposed to represent the whole universe is biased

2)         Multilingualism is not taken into account



As a result, the outcomes are extremely biased.



Only two possibilities remain:



a)         For those who use the logical method: focus on biases and pay due attention to multilingualism

b)        For other players: use alternative methods.




The rationale of our alternative method




The data that can be relied upon, because of their limited biases, are:

-            demo-linguistic data (distribution of L1+L2 speakers by country)

-            Internet connection rate data (% of persons connected to the Internet per country).




From those 2 sources,

and a working hypothesis stating that, within each country, all language speakers have the same connection rate,

it is possible to compute the connection rate per language.
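The computation can be sketched as follows. This is a minimal illustration under the working hypothesis stated above; the countries, languages, speaker counts and rates are invented, and the function name `connection_rate_per_language` is hypothetical.

```python
# Hypothetical inputs, invented for illustration only.
# speakers[country][language] = number of L1+L2 speakers
speakers = {
    "A": {"x": 80, "y": 20},
    "B": {"x": 10, "y": 90},
}
# connected[country] = share of persons connected to the Internet
connected = {"A": 0.5, "B": 0.2}


def connection_rate_per_language(speakers, connected):
    """Return {language: connected speakers / all speakers}."""
    total = {}
    online = {}
    for country, by_language in speakers.items():
        # Working hypothesis: same rate for every language within a country.
        rate = connected[country]
        for language, count in by_language.items():
            total[language] = total.get(language, 0) + count
            online[language] = online.get(language, 0.0) + count * rate
    return {language: online[language] / total[language] for language in total}


rates = connection_rate_per_language(speakers, connected)
for language, rate in sorted(rates.items()):
    print(f"{language}: {rate:.1%} of speakers connected")
```

In this toy case language x, mostly spoken in the better-connected country A, ends up with a higher connection rate than language y, although each country applies a single rate to all its speakers.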



In the absence of further data,

this would be a first fair approximation of web contents per language,

as experience has shown that the percentage of contents seems to be linked to the percentage of Internet users by some sort of natural economic law.



To improve on this, and to account for the fact that some languages do better (or worse) than average in terms of content production,

it is possible to modulate the previous figures using other indirect parameters.



This is exactly what our model does,

considering factors such as:

traffic, use of applications, existence of interfaces or translation programs, existence of e-government, open data and other Information Society attributes.




Beyond the main indicator of speakers connected to the Internet, it can be considered that


languages generate more or less content, for economic, social, cultural, network, education or other reasons, as a consequence of:

more or less Internet traffic, resulting from tariff, cultural or education reasons,


more or fewer speakers subscribed to applications,


more or less information society support where speakers live (e.g. e-government),


their absence from (or presence in) application interfaces or translation programs

and, in general, their level of technological support for digital life, which can drastically limit or foster their use.



As a general rule, contents are produced by L1 speakers;

however, L2 speakers of a given language may also decide to generate contents in it for economic reasons

(no wonder the productivity of some major languages is so high compared to others!).



Obviously, our indirect method cannot replace a real measurement.

However, in the absence of such a measurement, and in the context of extremely biased results from incomplete measurements,

it is a fair, or even better, approximation, as long as it duly reflects those different factors.



The method is basically a way of obtaining the distribution of contents per language

as a modulation of the distribution of connected speakers per language,

as a function of various measured parameters.



Obviously, as with every statistical approach, all biases need to be exposed, made explicit and analyzed…










Demo linguistic
- Version 1: Yoshua (2017)
- Version 2: Ethnologue #24 (2021)
- Version 3: Ethnologue #24 (2021)
- Experts may disagree with some data but yet the best data available

L2 extrapolation
- Version 1: Compute L2 results from L1. Strong bias favors language with high presence in developing countries (English and French mainly)
- Version 2: Ethnologue provides L2 data therefore this bias […]

Main weighting
- All speakers of each country are computed the same connected %. Light bias against European languages in developing countries and in favor of immigration languages in developed countries
- As long as the model is not used to compare languages within a country and is limited to speakers population over one million, the bias is […]
- This working hypothesis is the basis of the model as it allows most computing as a modulation of the value around the % of connected persons per country.

[…] technics for […]
- The bias favor the most connected countries but effects are considered marginal (specially when the source covers more than 70% of total)












SOURCES BIASES : 0 = totally biased 20 = totally unbiased

- Version 1: 18. ITU a fair source with yearly […] estimated when no data is given by country officials.
- Version 2: 15. ITU stopped updating its […]
- Version 3: 19. World Bank took over the data […] and updates are frequent

- Version 1: 13. Alexa strongly biased against Asian languages and lightly biases in favor of European languages (except Portuguese). Selection bias somehow controlled by using the truncated mean at 20%.
- Version 2: 11. The Alexa bias against Asian countries seems overcome but a new bias and an error affects now European countries.
- Version 3: 16. Technic implemented to cancel the selection bias. Uses a mix of Alexa error-filtered and SimilarWeb. A small bias remains which affect many European languages.(*) Chinese's result out of proportion.
- Tool's biases are reflected in result

- Version 1: 12. Rely in data from main social networks. Biased against non occidental languages.
- Version 2: 12. Same
- Version 3: 15. Integration of non occidental social networks. Some improvements still possible for V4.

- Version 1: 19. Those are objective data and sampling is wide.
- Version 2: 19. Same
- Version 3: 19. Same

- Version 1: 15. The sampling needs to be […]
- Version 2: 18. Sampling close to […]
- Version 3: 18. Same

- Version 1: 5. Depends strongly on Wikimedia statistics which are excellent but strongly biases against non occidental languages and highly favor some languages (French, Hebrew, […]
- Version 2: 8. Technics used to control Wikimedia statistics biases.
- Version 3: OUT. After dense effort to include all online encyclopedias beyond Wikimedia, it is concluded it is better to suppress this indicator as the goal is not reachable as an input.













(*)  The use of top-ranked websites disadvantages countries with a higher information literacy rate, where a larger portion of traffic goes to websites that are not top-ranked.
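The "truncated mean at 20%" mentioned above for the Alexa data can be illustrated with a short Python sketch. It assumes that 20% of the values are dropped at each end of the sorted sample; the source does not say whether the 20% applies per tail or in total, and the function name `truncated_mean` and the sample values are hypothetical.

```python
def truncated_mean(values, proportion=0.20):
    """Mean after dropping `proportion` of the sorted values at each end.

    Assumption: the cut applies to each tail, not to the sample as a whole.
    """
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    cut = int(len(ordered) * proportion)  # number of values dropped per tail
    kept = ordered[cut:len(ordered) - cut]
    return sum(kept) / len(kept)


# Invented sample with one extreme value.
sample = [1.0, 2.0, 3.0, 4.0, 100.0]
plain = sum(sample) / len(sample)
trimmed = truncated_mean(sample)
print(f"plain mean: {plain:.1f}, truncated mean: {trimmed:.1f}")
```

The extreme value dominates the plain mean but is discarded by the truncated mean, which is why this kind of estimator can limit the effect of a few unrepresentative websites.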











V1 was strongly biased against non-European languages,

and at the same time biased in favor of the few European languages with a high presence in developing countries with low connectivity rates (mainly English and French).



V2 solved the second main bias and reduced the negative bias against non-European languages, but not enough, as the content input indicator remained strongly biased.



V3 solved the content bias by suppressing that indicator as an input and removed almost all negative biases against non-European languages. Overall, a slight negative bias against European languages now remains, but the reliability of the results has improved and reached a new quality threshold.



The evolution of the method has brought a switch from strong negative biases against non-European languages to light negative biases against European languages… and a possible positive bias towards Chinese due to the new Traffic indicator process.



That said, the data are to be taken with caution,

as reliable only within a -20% to +20% confidence interval,

especially when comparing raw results which are within this interval

(as shown in the inverted pyramid of the main content per language for the 4 languages in position 4).
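One way to apply this caution when comparing two languages is sketched below in Python. It treats two results as comparable only when their intervals do not overlap; this overlap rule, the relative reading of the interval, the function names and the figures are all assumptions made for illustration, not part of the published method.

```python
def interval(value, tolerance=0.20):
    """Return (low, high), assuming the interval is relative to the value."""
    return value * (1 - tolerance), value * (1 + tolerance)


def distinguishable(a, b, tolerance=0.20):
    """True when the intervals around a and b do not overlap."""
    low_a, high_a = interval(a, tolerance)
    low_b, high_b = interval(b, tolerance)
    return high_a < low_b or high_b < low_a


# Invented content percentages for three languages.
print(distinguishable(5.0, 4.5))  # intervals overlap: ranking not reliable
print(distinguishable(5.0, 2.0))  # intervals are disjoint: ranking reliable
```

Under this reading, languages whose raw results fall inside each other's interval are best presented as sharing the same position rather than ranked against each other.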















Content productivity is measured on the basis of L1+L2 figures. It would be quite useful to check the value of another content productivity factor based only on L1;

as Version 3 of the model computes everything on an L1+L2 basis, this would require another version of the model.



The USAGE indicator can still be improved and its biases reduced by focusing on:

-            Its video streaming component, adding other sources to YouTube and Netflix


-            Its open data component, adding to the single source and focusing stats on open data, MOOCs, etc.


-            The biases have evolved from high against non-European languages to low against European languages; this needs to be addressed.



The TRAFFIC indicator gives a result for Chinese that is out of proportion compared to the other languages. This needs to be investigated. The impact on the final result is however marginal: a more proportionate value would leave Chinese equal to English and, in any case, within the same confidence interval.

















If you want to know the details of the methodology, read:

 The method behind the unprecedented production of indicators of the presence of languages in the Internet, Sept. 2022





See the introduction


See the results for the top languages by category of indicator


Compare with other similar data (W3Techs and InternetWorldStats)


Access the full results for all 329 languages by downloading corresponding Excel files