INDICATORS OF LANGUAGES IN THE INTERNET, V3, March 2022
BASIC METHODOLOGICAL PROCESS
The model uses Ethnologue as the source for demo-linguistic data (distribution of L1+L2 speakers per country),
ITU and the World Bank for connectivity data (% of persons connected to the Internet per country),
and a large set of other data sources (*) to produce 5 indicators:
- Internauts: % of connected persons per language
- Traffic: % of traffic per language (statistical work based on Alexa and SimilarWeb data for several hundred selected websites) (**)
- Usage: % of Internet usage per language, from data covering subscribers to the main social networks, connecting infrastructure (World Bank data), open applications, streaming and e-commerce (T-Index from Translated)
- Interfaces and translation languages: a count of the presence of languages across a wide range of application interfaces and online translation applications
- Indexes: a measure of the strength of countries in terms of Information Society indicators (24 different indicators), converted into per-language figures
The average of those indicators is assumed to be a fair approximation of contents,
within a confidence interval of -20% to +20%.
(*) Most sources offer data per country. The data per language is obtained by weighting with demo-linguistic data.
(**) Few sources cover all countries; extrapolation techniques, either weighting by the % of connected people or using a quartile approach, are used to fill the gaps.
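As an illustration of the averaging step described above, here is a minimal Python sketch. It assumes an unweighted mean of the five indicators and reads the -20% +20% interval as relative to the estimate; both readings are assumptions, and the indicator values are made up for the example.

```python
from statistics import mean

# Hypothetical indicator values (% of world total) for one language.
# The five keys follow the indicators listed above; the numbers are made up.
indicators = {
    "internauts": 10.0,
    "traffic": 12.0,
    "usage": 9.0,
    "interfaces": 11.0,
    "indexes": 8.0,
}


def estimate_contents(values, tolerance=0.20):
    """Return (low, estimate, high) for the contents share of one language.

    Assumption: the estimate is the plain mean of the indicators and the
    interval is +/- `tolerance` relative to that mean.
    """
    estimate = mean(values.values())
    return estimate * (1 - tolerance), estimate, estimate * (1 + tolerance)


low, estimate, high = estimate_contents(indicators)
print(f"contents ~ {estimate:.1f}% (between {low:.1f}% and {high:.1f}%)")
```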
Why would the mean of the previous indicators be a fair approximation of Web contents?
The logical method to measure the presence of languages on the Web would seem to be to apply
a reliable language recognition algorithm to all existing webpages and count…
Yes… but the Web is too large for that method to be practically applicable, and such attempts lose meaning for 2 main reasons:
1) The sample supposed to represent the whole universe is biased
2) Multilingualism is not taken into account
and the results are extremely biased for those reasons.
Only two possibilities remain:
a) For whoever uses the logical method, focus on biases and pay due attention to multilingualism
b) For other players, use alternative methods.
The rationale of our alternative method
The data that can be relied upon, because their biases are limited, are:
- demo-linguistic data (distribution of L1+L2 speakers by country)
- Internet connecting rate data (% of persons connected to the Internet per country).
From those 2 sources,
and a working hypothesis stating that, within each country, speakers of all languages have the same connecting rate,
it is possible to compute the connecting rate per language.
In the absence of further data,
this would be the first fair approximation of Web contents per language,
as experience has shown that the percentage of contents seems to be linked to the percentage of Internet users by some sort of natural economic law.
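The computation described above can be sketched in Python as follows. The country names, language names and figures are hypothetical; the only rule taken from the text is the working hypothesis that every language spoken in a country gets that country's connecting rate.

```python
# Hypothetical inputs: speakers (L1+L2, in millions) per country and language,
# and the share of each country's population connected to the Internet.
speakers = {
    "CountryA": {"lang1": 40.0, "lang2": 10.0},
    "CountryB": {"lang1": 5.0, "lang2": 20.0},
}
connected_rate = {"CountryA": 0.80, "CountryB": 0.50}


def connected_speakers_by_language(speakers, connected_rate):
    """Sum connected speakers per language across all countries.

    Working hypothesis from the text: within a country, every language
    has the same connecting rate as the country as a whole.
    """
    totals = {}
    for country, by_language in speakers.items():
        rate = connected_rate[country]
        for language, count in by_language.items():
            totals[language] = totals.get(language, 0.0) + count * rate
    return totals


def total_speakers_by_language(speakers):
    """Sum all speakers per language across all countries."""
    totals = {}
    for by_language in speakers.values():
        for language, count in by_language.items():
            totals[language] = totals.get(language, 0.0) + count
    return totals


connected = connected_speakers_by_language(speakers, connected_rate)
everyone = total_speakers_by_language(speakers)

# Connecting rate per language, and each language's share of connected speakers.
rate_by_language = {lang: connected[lang] / everyone[lang] for lang in connected}
grand_total = sum(connected.values())
internauts = {lang: 100.0 * n / grand_total for lang, n in connected.items()}

print(rate_by_language)
print(internauts)
```

The `internauts` shares play the role of the first indicator; the other indicators then modulate these figures.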
To refine this approximation and account for the fact that some languages do better (or worse) than average in terms of content production,
the previous figures can be modulated using other indirect parameters.
This is exactly what our model does,
considering factors such as:
traffic, use of applications, existence of interfaces or translation programs, existence of e-government, open data and other Information Society attributes.
Beyond the main indicator of speakers connected to the Internet, it can be considered that
languages generate, for economic, social, cultural, network education or other reasons, more or less content as a consequence of:
- more or less Internet traffic, for tariff, cultural or educational reasons
- more or fewer speakers subscribed to applications
- more or less Information Society support where speakers live (e.g. e-government)
- their absence from (or presence in) application interfaces or translation programs
- and, in general, their level of technological support for digital life, which can drastically limit or foster their use.
As a general rule, contents are produced by L1 speakers;
however, L2 speakers of a given language may also decide to generate contents for economic reasons
(no wonder the productivity of some major languages is so high compared to others!).
Obviously, our indirect method cannot replace a real measurement.
However, in the absence of such a measurement, and in the context of extremely biased results from incomplete measurements,
it is a fair, or even better, approximation, as long as it duly reflects those different factors.
The method is basically a way of obtaining the distribution of contents per language
as a modulation of the distribution of connected speakers per language,
as a function of various measured parameters.
• Obviously, as with every statistical approach, all biases need to be exposed, made explicit and analyzed…
BIAS EVOLUTION ACROSS THE VERSIONS
METHOD BIASES

| ELEMENT | Version 1 | Version 2 | Version 3 |
|---|---|---|---|
| Demo-linguistic source | Yoshua (2017) | Ethnologue #24 (2021). Experts may disagree with some data, but it is still the best data available. | Ethnologue #24 (2021) |
| L2 extrapolation | L2 results computed by extrapolation from L1. Strong bias in favor of languages with a high presence in developing countries (mainly English and French). | Solved. Ethnologue provides L2 data, therefore this bias has disappeared. | Same |
| Main weighting hypothesis | All speakers within each country are assigned the same connected %. Light bias against European languages in developing countries and in favor of immigration languages in developed countries. | Same. As long as the model is not used to compare languages within a country and is limited to languages with a speaker population over one million, the bias is acceptable. | Same. This working hypothesis is the basis of the model, as it allows most computations to be made as a modulation of the value around the % of connected persons per country. |
| Extrapolation techniques for sources | The bias favors the most connected countries, but the effects are considered marginal (especially when the source covers more than 70% of the total). | Same | Same |

SOURCE BIASES: 0 = totally biased, 20 = totally unbiased

| ELEMENT | Version 1 | Version 2 | Version 3 |
|---|---|---|---|
| Internauts | 18: ITU is a fair source with yearly updates* | 15: ITU stopped updating its estimates when no data is given by country officials. | 19: World Bank took over the data and updates are frequent. |
| Traffic | 13: Alexa is strongly biased against Asian languages and lightly biased in favor of European languages (except Portuguese). Selection bias is somewhat controlled by using the truncated mean at 20%. | 11: The Alexa bias against Asian countries seems to have been overcome, but a new bias and an error now affect European countries. | 16: Technique implemented to cancel the selection bias. Uses a mix of error-filtered Alexa data and SimilarWeb. A small bias remains which affects many European languages. (*) The tool's biases are reflected in the results: the result for Chinese is out of proportion. |
| Usage | 12: Relies on data from the main social networks. Biased against non-occidental languages. | 12: Same | 15: Integration of non-occidental social networks. Some improvements still possible for V4. |
| Interface | 19: These are objective data and the sampling is wide. | 19: Same | 19: Same |
| Indexes | 15: The sampling needs to be enlarged. | 18: Sampling close to exhaustive. | 18: Same |
| Contents | 5: Depends strongly on Wikimedia statistics, which are excellent but strongly biased against non-occidental languages and highly favor some languages (French, Hebrew, Swedish…). | 8: Techniques used to control the biases of Wikimedia statistics. | OUT: After an intensive effort to include all online encyclopedias beyond Wikimedia, it was concluded that it is better to suppress this indicator, as the goal is not reachable as an input. |

(*) The use of top-ranked websites penalizes countries with a higher information literacy rate, where a larger portion of traffic goes to websites outside the top ranks.
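The Version 1 Traffic entry in the table above mentions a truncated mean at 20% as a way to limit selection bias. The source does not say how the cut is applied, so the sketch below assumes that 20% of the observations are dropped at each end of the sorted sample; the function name and the sample values are hypothetical.

```python
def truncated_mean(values, proportion=0.20):
    """Mean of `values` after dropping `proportion` of them at each end.

    Assumption: the cut is symmetric and applied to both tails.
    """
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    cut = int(len(ordered) * proportion)
    kept = ordered[cut:len(ordered) - cut]
    return sum(kept) / len(kept)


# Made-up per-website shares for one language, with one outlier.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
plain = sum(sample) / len(sample)
trimmed = truncated_mean(sample)
print(plain, trimmed)
```

With an outlier in the sample, the truncated mean stays close to the bulk of the observations while the plain mean is pulled towards the extreme value.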
V1 was strongly biased against non-European languages,
and at the same time biased in favor of the few European languages with a high presence in developing countries with low connectivity rates (mainly English and French).
V2 solved the second main bias and reduced the negative bias against non-European languages, but not enough, as the content input indicator remained strongly biased.
V3 solved the content bias by suppressing it as an input, and removed almost all negative biases against non-European languages. Overall, a slight negative bias against European languages now remains, but the reliability of the results has improved and reached a new quality threshold.
The evolution of the method has thus shifted from strong negative biases against non-European languages to light negative biases against European languages… and a possible positive bias towards Chinese due to the new Traffic indicator process.
That said, the data are to be taken with caution,
as they are reliable only within a -20% to +20% confidence interval,
especially when comparing raw results which are within this interval
(as shown in the inverted pyramid of the main content per language for the 4 languages in position 4).
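To make this caution concrete, the following Python sketch checks whether two results can be ranked at all once each is widened by the interval. It assumes the interval is relative to each value; the function names and the figures are hypothetical.

```python
def interval(value, tolerance=0.20):
    """Return the (low, high) bounds around `value`, relative to the value."""
    return value * (1 - tolerance), value * (1 + tolerance)


def distinguishable(a, b, tolerance=0.20):
    """True when the two intervals do not overlap, so a ranking is meaningful."""
    low_a, high_a = interval(a, tolerance)
    low_b, high_b = interval(b, tolerance)
    return high_a < low_b or high_b < low_a


# Made-up contents shares (%): the first pair is clearly separated,
# the second pair falls within the same interval.
print(distinguishable(20.0, 5.0))
print(distinguishable(3.8, 3.3))
```

Under this reading, languages whose intervals overlap should be treated as sharing the same position rather than ranked against each other.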
POTENTIAL IMPROVEMENTS FOR VERSION 4
Content productivity is measured on the basis of L1+L2 figures. It would be quite useful to check the value of another content productivity factor based only on L1;
as Version 3 of the model computes everything on an L1+L2 basis, this would require another version of the model.
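The text does not define content productivity here. The sketch below assumes it is the ratio of a language's share of contents to its share of connected speakers, and shows how an L1-only variant would differ from the L1+L2 one. The function name and all figures are hypothetical.

```python
def content_productivity(contents_share, connected_share):
    """Ratio of a language's contents share to its connected-speaker share.

    Assumption: this ratio is what "content productivity" refers to;
    a value above 1 means more contents than the speaker share alone predicts.
    """
    if connected_share <= 0:
        raise ValueError("connected_share must be positive")
    return contents_share / connected_share


# Made-up shares (% of world total) for one language.
contents = 12.0
connected_l1_l2 = 10.0  # connected L1+L2 speakers, the basis used in Version 3
connected_l1 = 6.0      # connected L1 speakers only, the proposed variant

productivity_l1_l2 = content_productivity(contents, connected_l1_l2)
productivity_l1 = content_productivity(contents, connected_l1)
print(productivity_l1_l2, productivity_l1)
```

For a language with many L2 speakers, the L1-only ratio is higher than the L1+L2 ratio, which is the contrast the proposed check would expose.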
The USAGE indicator can still be improved and its biases reduced by focusing on:
- Its video streaming component, adding other sources to YouTube and Netflix
- Its open data component, adding other sources to the single current one and focusing stats on open data, MOOCs, etc.
- The biases have evolved from high against non-European languages to low against European languages; this needs to be addressed.
The TRAFFIC indicator gives a result for Chinese that is out of proportion compared to the other languages. This needs to be investigated. The impact on the final result is however marginal: a more proportionate value would leave Chinese equal to English and, in any case, within the same confidence interval.

The method behind the unprecedented production of indicators of the presence of languages in the Internet, Sept. 2022
See the results for the top languages by category of indicator
Compare with other similar data (W3Techs and InternetWorldStats)
Access the full results for all 329 languages by downloading corresponding Excel files