INDICATORS OF LANGUAGES IN THE INTERNET, V3, March 2022

BASIC METHODOLOGICAL PROCESS

The model uses Ethnologue as the source for demo-linguistic data (distribution of L1+L2 speakers per country), ITU and the World Bank for connectivity data (% of persons connected to the Internet per country), and a large set of other data sources (*) to produce 5 indicators:

Internauts (Internet users): % of connected persons per language

Traffic: % of traffic per language (statistical work based on Alexa and SimilarWeb data for several hundred selected websites) (**)

Usage: % of Internet usage per language, combining data on subscribers to the main social networks, connection infrastructure (World Bank data), open applications, streaming and e-commerce (T-Index from Translated)

Interfaces and translation languages: count of the presence of each language across a wide range of application interfaces and online translation applications

Indexes: strength of countries on Information Society indicators (24 different indicators), converted into per-language values

The average of these indicators is assumed to be a fair approximation of the share of contents per language, within a confidence interval of -20% to +20%.

(*) Most sources offer data per country. The data per language is obtained by weighting with demo-linguistic data.

(**) Few sources cover all countries; the gaps are filled by extrapolation techniques, either weighting by the % of connected people or using a quartile approach.
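
The averaging step described above can be sketched as follows. This is only an illustration under stated assumptions: the indicator values are hypothetical, the average is taken as a plain unweighted mean, and the -20% +20% interval is read as relative to the estimate. The function name `contents_estimate` is ours, not the model's.

```python
from statistics import mean

# Hypothetical indicator values (% for one language); not real model data.
indicators = {
    "internauts": 5.0,
    "traffic": 4.0,
    "usage": 6.0,
    "interfaces": 5.5,
    "indexes": 4.5,
}

def contents_estimate(values, tolerance=0.20):
    """Return (low, estimate, high) for the unweighted mean of the indicators.

    Assumption: the interval is relative, i.e. estimate * (1 -/+ tolerance).
    """
    estimate = mean(values)
    return estimate * (1 - tolerance), estimate, estimate * (1 + tolerance)

low, estimate, high = contents_estimate(indicators.values())
print(f"contents estimate: {estimate:.2f}% (between {low:.2f}% and {high:.2f}%)")
```

With these made-up inputs the estimate is 5% of contents, with bounds at 4% and 6%.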

The rationale of our alternative method

The data that can be relied upon, because their biases are limited, are:

- demo-linguistic data (distribution of L1+L2 speakers by country)

- Internet connection rate data (% of persons connected to the Internet per country).

From these two sources, and under the working hypothesis that all language speakers within a given country have the same connection rate, it is possible to compute the connection rate per language.
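
A minimal sketch of this computation, assuming hypothetical countries, languages and figures (none of them taken from Ethnologue, ITU or the World Bank). Each country's connection rate is applied uniformly to all of its speakers, as the working hypothesis states; the function names are illustrative.

```python
from collections import defaultdict

# Hypothetical inputs for illustration only.
# speakers[country][language] = number of L1+L2 speakers in that country
speakers = {
    "A": {"lang_x": 80, "lang_y": 20},
    "B": {"lang_x": 10, "lang_y": 90},
}
# connected[country] = share of persons connected to the Internet
connected = {"A": 0.9, "B": 0.5}

def connected_speakers(speakers, connected):
    """Connected speakers per language, using each country's rate for all its speakers."""
    totals = defaultdict(float)
    for country, by_language in speakers.items():
        for language, count in by_language.items():
            totals[language] += count * connected[country]
    return dict(totals)

def connection_rate(speakers, connected):
    """Connected speakers divided by total speakers, per language."""
    online = connected_speakers(speakers, connected)
    total = defaultdict(float)
    for by_language in speakers.values():
        for language, count in by_language.items():
            total[language] += count
    return {language: online[language] / total[language] for language in online}

def share_of_connected(speakers, connected):
    """Each language's share of all connected speakers."""
    online = connected_speakers(speakers, connected)
    world = sum(online.values())
    return {language: value / world for language, value in online.items()}

rates = connection_rate(speakers, connected)
shares = share_of_connected(speakers, connected)
print(rates)
print(shares)
```

In this toy case lang_x has 77 connected speakers out of 90 and lang_y has 63 out of 110, so lang_x holds 55% of connected speakers even though it has fewer speakers overall.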

In the absence of further data, this would be a first fair approximation of web contents per language, as experience has shown that the percentage of contents seems to be linked to the percentage of Internet users by some sort of natural economic law.

To improve this approximation and account for the fact that some languages do better (or worse) than average in terms of content production, the previous figures can be modulated using other indirect parameters. This is exactly what our model does, considering factors such as traffic, use of applications, existence of interfaces or translation programs, existence of e-government, open data and other Information Society attributes.

Beyond the main indicator of speakers connected to the Internet, it can be considered that languages generate more or less content, for economic, social, cultural, network, educational or other reasons, as a consequence of:

- more or less Internet traffic, resulting from tariff, cultural or educational reasons;

- more or fewer speakers subscribed to applications;

- more or less Information Society support where the speakers live (e.g. e-government);

- their absence from (or presence in) application interfaces or translation programs;

- and, in general, their level of technological support for digital life, which can drastically limit or foster their use.

As a general rule, contents are produced by L1 speakers; however, L2 speakers of a given language may also decide to generate contents in it for economic reasons (no wonder the productivity of some major languages is so high compared to others!).

Our indirect method obviously cannot replace a real measurement. However, in the absence of such a measurement, and given the extremely biased results obtained from incomplete measurements, it is a fair or better approximation, as long as it duly reflects these different factors.

The method is basically a way of obtaining the distribution of contents per language as a modulation of the distribution of connected speakers per language, as a function of various measured parameters.

Obviously, as with every statistical approach, all biases need to be exposed, made explicit and analyzed.

BIAS EVOLUTION ACROSS THE VERSIONS

METHOD BIASES

ELEMENT	Version 1	Version 2	Version 3

Demo-linguistic source	Yoshua (2017)	Ethnologue #24 (2021). Experts may disagree with some data, but it is still the best data available.	Ethnologue #24 (2021)

L2 extrapolation	L2 results computed by extrapolation from L1. Strong bias in favor of languages with a high presence in developing countries (mainly English and French).	Solved. Ethnologue provides L2 data, so this bias has disappeared.	Same

Main weighting hypothesis	All speakers in each country are assigned the same connected %. Light bias against European languages in developing countries and in favor of immigration languages in developed countries.	Same. As long as the model is not used to compare languages within a country and is limited to languages with a speaker population over one million, the bias is acceptable.	Same. This working hypothesis is the basis of the model, as it allows most computations to be made as a modulation of the value around the % of connected persons per country.

Extrapolation techniques for sources	The bias favors the most connected countries, but the effects are considered marginal (especially when the source covers more than 70% of the total).	Same	Same

SOURCE BIASES: 0 = totally biased, 20 = totally unbiased

ELEMENT	Version 1	Version 2	Version 3

Internauts	18: ITU is a fair source with yearly updates*	15: ITU stopped updating its estimates when no data is given by country officials.	19: The World Bank took over the data and updates are frequent.

Traffic	13: Alexa is strongly biased against Asian languages and slightly biased in favor of European languages (except Portuguese). Selection bias is somewhat controlled by using the truncated mean at 20%.	11: The Alexa bias against Asian countries seems to have been overcome, but a new bias and an error now affect European countries.	16: Technique implemented to cancel the selection bias. Uses a mix of error-filtered Alexa data and SimilarWeb. A small bias remains, which affects many European languages.(*) The tools' biases are reflected in the result; the result for Chinese is out of proportion.

Usage	12: Relies on data from the main social networks. Biased against non-Western languages.	12: Same	15: Integration of non-Western social networks. Some improvements are still possible for V4.

Interface	19: These are objective data and the sampling is wide.	19: Same	19: Same

Indexes	15: The sampling needs to be enlarged.	18: Sampling close to exhaustive.	18: Same

Contents	5: Depends strongly on Wikimedia statistics, which are excellent but strongly biased against non-Western languages and highly favorable to some languages (French, Hebrew, Swedish…).	8: Techniques used to control the biases of Wikimedia statistics.	OUT: After an intense effort to include all online encyclopedias beyond Wikimedia, it was concluded that it is better to drop this indicator as an input, as the goal is not reachable.
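
The "truncated mean at 20%" mentioned for the Version 1 Traffic indicator can be illustrated with the sketch below. The source does not say whether 20% is cut from each tail or in total; this example assumes 20% of the values are dropped at each end. The sample values and the function name `truncated_mean` are hypothetical.

```python
def truncated_mean(values, cut_percent=20):
    """Mean of the values left after dropping cut_percent % of them at each end.

    Assumption: the cut applies to both tails; the source does not specify this.
    """
    if not 0 <= cut_percent < 50:
        raise ValueError("cut_percent must be in [0, 50)")
    ordered = sorted(values)
    k = len(ordered) * cut_percent // 100  # number of values dropped per tail
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Hypothetical per-website shares for one language, with one outlier.
sample = [1, 2, 3, 4, 100]
print(sum(sample) / len(sample))   # plain mean, pulled up by the outlier
print(truncated_mean(sample))      # outlier and smallest value removed
```

Here the plain mean is 22.0 while the truncated mean is 3.0, which shows how trimming limits the weight of a few atypical websites in the selection.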

BIAS SUMMARY

V1 was strongly biased against non-European languages and, at the same time, biased in favor of the few European languages with a high presence in developing countries with low connectivity rates (mainly English and French).

V2 solved the second main bias and reduced the negative bias against non-European languages, but not enough, as the content input indicator remained strongly biased.

V3 solved the content bias by dropping contents as an input and removed almost all negative biases against non-European languages. Overall, a slight negative bias against European languages now remains, but the reliability of the results has improved and reached a new quality threshold.

The evolution of the method has thus shifted from strong negative biases against non-European languages to light negative biases against European languages, plus a possible positive bias towards Chinese due to the new Traffic indicator process.

That said, the data are to be taken with caution, as they are reliable only within a -20% to +20% confidence interval, especially when comparing raw results that fall within this interval of each other (as shown in the inverted pyramid of the main content per language for the 4 languages in position 4).
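
One possible reading of this caution, sketched below: two results should not be ranked against each other when their -20% +20% intervals overlap. The overlap test, the relative interpretation of the interval, the function names and the sample values are all assumptions made for illustration.

```python
def interval(value, tolerance=0.20):
    """Relative interval around a value: (value * 0.8, value * 1.2) by default."""
    return value * (1 - tolerance), value * (1 + tolerance)

def distinguishable(a, b, tolerance=0.20):
    """True when the intervals around a and b do not overlap."""
    low_a, high_a = interval(a, tolerance)
    low_b, high_b = interval(b, tolerance)
    return high_a < low_b or high_b < low_a

# Hypothetical content shares (%), not model results.
print(distinguishable(3.0, 3.5))   # False: (2.4, 3.6) overlaps (2.8, 4.2)
print(distinguishable(20.0, 3.0))  # True: (16, 24) and (2.4, 3.6) are disjoint
```

Under this reading, languages whose raw results are close, such as several languages sharing the same position, cannot be ordered reliably.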

POTENTIAL IMPROVEMENTS FOR VERSION 4

Content productivity is measured on the basis of L1+L2 figures. It would be quite useful to check the value of another content productivity factor based only on L1; as Version 3 of the model computes everything on an L1+L2 basis, this would require another version of the model.

The USAGE indicator can still be improved and its biases reduced by focusing on:

- its video streaming component, adding other sources to YouTube and Netflix;

- its open data component, adding sources to the single one currently used and focusing statistics on open data, MOOCs, etc.

- The biases have evolved from strongly against non-European languages to slightly against European languages; this needs to be addressed.

The TRAFFIC indicator gives a result for Chinese that is out of proportion compared to the other languages. This needs to be investigated. The impact on the final result is, however, marginal: a more proportionate value would leave Chinese equal to English and, in any case, within the same confidence interval.