INDICATORS OF LANGUAGES IN THE INTERNET, V3, March 2022

 

 

 

 

 

 

 

BASIC METHODOLOGICAL PROCESS

 

 

The model uses Ethnologue as the source for demo-linguistic data (distribution of L1+L2 speakers per country),

ITU and the World Bank for connectivity data (% of persons connected to the Internet per country),

and a large set of data sources (*) to produce 5 indicators:

 

 

Internauts: % of connected persons per language

Traffic: % of traffic per language (statistical work based on Alexa and SimilarWeb data from several hundred selected websites) (**)

Usage: % of Internet usage per language, from data divided between subscribers to the main social networks, connecting infrastructure (World Bank data), open applications, streaming and e-commerce (T-Index from Translated)

Interfaces and translation languages: counting the presence of languages in a wide range of application interfaces and online translation applications

Indexes: measuring the strength of countries in terms of Information Society indicators and converting it into figures per language (24 different indicators)

 

 

The average of those indicators is assumed to be a fair approximation of contents,

within a confidence interval of ±20%.
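
As an illustration only, the averaging step can be sketched in Python. The function name, the indicator keys and the figures are invented for this example, and reading the ±20% interval as relative to the estimate is an assumption, not something the method states.

```python
from statistics import mean

def approximate_contents(indicators, tolerance=0.20):
    """Return (estimate, low, high) for one language.

    `indicators` maps indicator names to that language's share in %.
    The interval is taken as +/- `tolerance` relative to the estimate
    (an assumption made for this sketch).
    """
    estimate = mean(indicators.values())
    return estimate, estimate * (1 - tolerance), estimate * (1 + tolerance)

# Hypothetical shares for a single language (not real results).
example = {
    "internauts": 10.0,
    "traffic": 12.0,
    "usage": 8.0,
    "interfaces": 11.0,
    "indexes": 9.0,
}
estimate, low, high = approximate_contents(example)
print(f"contents ~ {estimate:.1f}% (between {low:.1f}% and {high:.1f}%)")
```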

 

 

(*)   Most sources offer data per country. The data per language is obtained by weighting with demo-linguistic data.

 

(**) Most sources hardly cover all countries; extrapolation techniques, weighting with the % of connected people or using a quartile approach, are used.
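
To illustrate footnote (*), here is a minimal Python sketch of how a per-country figure might be converted into per-language shares. The function name, the data layout and the proportional split by L1+L2 speaker counts are assumptions made for this example, not the model's actual implementation, and all figures are invented.

```python
from collections import defaultdict

def per_language(country_values, speakers):
    """Convert per-country values into per-language shares (in %).

    country_values: {country: value of some indicator}
    speakers: {(country, language): number of L1+L2 speakers}
    Each country's value is split across its languages in proportion
    to speaker counts (assumption of this sketch).
    """
    speakers_by_country = defaultdict(float)
    for (country, _language), count in speakers.items():
        speakers_by_country[country] += count

    by_language = defaultdict(float)
    for (country, language), count in speakers.items():
        if country in country_values and speakers_by_country[country] > 0:
            weight = count / speakers_by_country[country]
            by_language[language] += country_values[country] * weight

    total = sum(by_language.values())
    return {language: 100 * value / total for language, value in by_language.items()}

# Hypothetical countries A and B, hypothetical languages x, y, z.
values = {"A": 60.0, "B": 40.0}
speaker_counts = {("A", "x"): 3, ("A", "y"): 1, ("B", "x"): 1, ("B", "z"): 1}
shares = per_language(values, speaker_counts)
print(shares)
```

Because L1+L2 counts can include the same person under several languages, other weighting choices are possible; the proportional split is only the simplest one.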

 

 

 

 

 

Why would the mean of the previous indicators be a fair approximation of Web contents?

 

The logical method for measuring the presence of languages on the Web would seem to be to apply

a reliable language recognition algorithm to all existing webpages and count…

 

 

Yes… but the Web is too large for that method to be practically applicable, and the attempts lose meaning for 2 main reasons:

1)         The sample supposed to represent the whole universe is biased

2)         Multilingualism is not taken into account

 

 

and the results are extremely biased for those reasons.

 

 

Only two possibilities remain:

 

 

a)         For whoever uses the logical method: focus on biases and pay due attention to multilingualism.

b)        For other players: use alternative methods.

 

 

 

The rationale of our alternative method

 

 

 

The data that can be relied upon, because of their limited biases, are:

-            demo-linguistic data (distribution of L1+L2 speakers by country)

-            Internet connection rate data (% of persons connected to the Internet per country).

 

 

 

From those 2 sources,

and a working hypothesis stating that, within each country, speakers of all languages have the same connection rate,

it is possible to compute the connection rate per language.
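
The computation described above can be sketched as follows. This is a minimal Python illustration under the stated working hypothesis (one connection rate per country, applied to every language spoken there); the function name, the data layout and the figures are invented for the example.

```python
def connected_speakers(speakers, connected_rate):
    """Number of connected speakers per language.

    speakers: {(country, language): number of L1+L2 speakers}
    connected_rate: {country: fraction of persons connected, 0..1}
    Working hypothesis: within a country, every language has the
    country's connection rate.
    """
    connected = {}
    for (country, language), count in speakers.items():
        connected[language] = connected.get(language, 0.0) + count * connected_rate[country]
    return connected

# Hypothetical data: two countries, two languages.
speakers = {("A", "x"): 100, ("A", "y"): 50, ("B", "x"): 200}
connected_rate = {"A": 0.80, "B": 0.25}

connected = connected_speakers(speakers, connected_rate)

# Total speakers per language, needed for the per-language rate.
total_speakers = {}
for (_country, language), count in speakers.items():
    total_speakers[language] = total_speakers.get(language, 0) + count

# Connection rate per language, and each language's share of all connected speakers.
rate_per_language = {lang: connected[lang] / total_speakers[lang] for lang in connected}
world_connected = sum(connected.values())
share_per_language = {lang: 100 * connected[lang] / world_connected for lang in connected}
print(rate_per_language, share_per_language)
```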

 

 

In the absence of further data,

this would be a first fair approximation of web contents per language,

as experience has shown that the percentage of contents seems to be linked to the percentage of Internet users by some sort of natural economic law.

 

 

To improve on this and account for the fact that some languages do better (or worse) than average in terms of content production,

it is possible to try to modulate the previous figures using other indirect parameters.

 

 

This is exactly what our model does,

considering factors such as:

traffic, use of applications, existence of interfaces or translation programs, existence of e-government, open data and other Information Society attributes.

 

 

 

Beyond the main indicator of speakers connected to the Internet, it can be considered that

languages generate more or less contents, for economic, social, cultural, network, educational or other reasons, as a consequence of:

-            more or less Internet traffic, resulting from tariff, cultural or educational reasons

-            more or fewer speakers subscribed to applications

-            more or less information society support where speakers live (e.g. e-government)

-            their absence from (or presence in) application interfaces or translation programs

and, in general, their level of technological support for digital life, which can drastically limit or foster their use.

 

 

As a general rule, contents are produced by L1 speakers;

however, L2 speakers of a given language may also decide to generate contents for economic reasons

(no wonder the productivity of some major languages is so high compared to others!).

 

 

Our indirect method obviously cannot replace a real measurement.

However, in the absence of such a measurement, and in the context of extremely biased results from incomplete measurements,

it is a fair or better approximation, as long as it duly reflects those different factors.

 

 

The method is basically a way of obtaining the distribution of contents per language

as a modulation of the distribution of connected speakers per language,

as a function of various measured parameters.
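
One way to read this "modulation" is as a factor applied to the share of connected speakers. The Python sketch below is an interpretation, not the model's stated formula: it assumes the contents estimate is the plain mean of the five indicators, so the factor is simply that mean divided by the Internauts value. The function name and all figures are hypothetical.

```python
def modulation_factor(indicators, base="internauts"):
    """Ratio between the mean of all indicators and the base indicator.

    A factor above 1 means the other indicators pull the contents
    estimate above the share of connected speakers; below 1, under it.
    """
    estimate = sum(indicators.values()) / len(indicators)
    return estimate / indicators[base]

# Hypothetical language whose other indicators exceed its share of connected speakers.
above_average = {"internauts": 10.0, "traffic": 14.0, "usage": 12.0,
                 "interfaces": 13.0, "indexes": 11.0}
# Hypothetical language whose other indicators fall short of it.
below_average = {"internauts": 10.0, "traffic": 8.0, "usage": 6.0,
                 "interfaces": 9.0, "indexes": 7.0}

factor_up = modulation_factor(above_average)
factor_down = modulation_factor(below_average)
print(factor_up, factor_down)
```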

 

 

        Obviously, as for every statistical approach, all biases need to be exposed, made explicit and analyzed…

 

 

BIAS EVOLUTION ACROSS THE VERSIONS

METHOD BIASES

ELEMENT | Version 1 | Version 2 | Version 3

Demo-linguistic source | Yoshua (2017) | Ethnologue #24 (2021). Experts may disagree with some data, but it is still the best data available. | Ethnologue #24 (2021)

L2 extrapolation | L2 results computed from L1 by extrapolation. Strong bias in favor of languages with a high presence in developing countries (mainly English and French). | Solved. Ethnologue provides L2 data, therefore this bias disappeared. | Same

Main weighting hypothesis | All speakers in each country are assigned the same connected %. Light bias against European languages in developing countries and in favor of immigration languages in developed countries. | Same. As long as the model is not used to compare languages within a country and is limited to speaker populations over one million, the bias is acceptable. | Same. This working hypothesis is the basis of the model, as it allows most computing to be done as a modulation of the value around the % of connected persons per country.

Extrapolation techniques for sources | The bias favors the most connected countries, but the effects are considered marginal (especially when the source covers more than 70% of the total). | Same | Same

SOURCES BIASES: 0 = totally biased, 20 = totally unbiased

ELEMENT | Version 1 | Version 2 | Version 3

Internauts | 18: ITU, a fair source with yearly updates* | 15: ITU stopped updating its estimates when no data is given by country officials. | 19: World Bank took over the data and updates are frequent.

Traffic | 13: Alexa strongly biased against Asian languages and lightly biased in favor of European languages (except Portuguese). Selection bias somewhat controlled by using the truncated mean at 20%. | 11: The Alexa bias against Asian countries seems to be overcome, but a new bias and an error now affect European countries. | 16: Technique implemented to cancel the selection bias. Uses a mix of error-filtered Alexa data and SimilarWeb. A small bias remains which affects many European languages. (*) The tool's biases are reflected in the results; the result for Chinese is out of proportion.

Usage | 12: Relies on data from the main social networks. Biased against non-Western languages. | 12: Same | 15: Integration of non-Western social networks. Some improvements still possible for V4.

Interface | 19: Those are objective data and the sampling is wide. | 19: Same | 19: Same

Indexes | 15: The sampling needs to be enlarged. | 18: Sampling close to exhaustive. | 18: Same

Contents | 5: Depends strongly on Wikimedia statistics, which are excellent but strongly biased against non-Western languages and highly favor some languages (French, Hebrew, Swedish…). | 8: Techniques used to control the biases of Wikimedia statistics. | OUT: After an intensive effort to include all online encyclopedias beyond Wikimedia, it was concluded that it is better to suppress this indicator, as the goal is not reachable as an input.

(*)  The use of top-ranked websites penalizes countries with a higher information literacy rate, where a larger portion of traffic goes to non-top websites.
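
The "truncated mean at 20%" mentioned for the Traffic indicator in Version 1 can be illustrated with a short Python sketch. How the 20% is applied is not stated in the text; this example assumes that 20% of the observations are dropped at each end of the sorted values. The function name and the sample figures are invented.

```python
def truncated_mean(values, proportion=0.20):
    """Mean of the values left after dropping `proportion` of the
    observations at EACH end of the sorted list (assumption of this sketch)."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    dropped = int(len(ordered) * proportion)  # observations removed per end
    kept = ordered[dropped:len(ordered) - dropped]
    return sum(kept) / len(kept)

# Hypothetical per-website values with one extreme outlier.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
plain = sum(sample) / len(sample)
trimmed = truncated_mean(sample)
print(plain, trimmed)
```

Dropping the extremes limits the weight of a few atypical websites in the selection, which is the kind of control over selection bias the table refers to.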

 

 

 

 

 

 

BIAS SUMMARY

 

 

 

V1 was strongly biased against non-European languages,

 and at the same time biased in favor of the few European languages with a high presence in developing countries with a low connectivity rate (mainly English and French).

 

 

V2 solved the second main bias and reduced the negative bias against non-European languages, but not enough, as the content input indicator remained strongly biased.

 

 

V3 solved the content bias by suppressing that indicator as an input and removed almost all negative biases against non-European languages. Overall, a slight negative bias against European languages now remains, but the reliability of the results has improved and reached a new quality threshold.

 

 

The evolution of the method has brought a switch from strong negative biases against non-European languages to light negative biases against European languages… and a possible positive bias towards Chinese due to the new Traffic indicator process.

 

 

That said, the data are to be taken with caution,

 as they are reliable only within a ±20% confidence interval,

especially when comparing raw results that lie within this interval of one another

(as shown in the inverted pyramid of the main content per language for the 4 languages in position 4).
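
A minimal Python sketch of what this caution means when ranking two languages: if their ±20% intervals overlap, the raw results cannot be told apart. Treating the interval as relative to each value is an assumption of this example, and the function names and figures are hypothetical.

```python
def interval(value, tolerance=0.20):
    """Lower and upper bounds of a value under a relative tolerance."""
    return value * (1 - tolerance), value * (1 + tolerance)

def distinguishable(a, b, tolerance=0.20):
    """True only if the two intervals do not overlap."""
    low_a, high_a = interval(a, tolerance)
    low_b, high_b = interval(b, tolerance)
    return high_a < low_b or high_b < low_a

# Hypothetical shares of contents, in %.
close_pair = distinguishable(3.0, 3.5)   # intervals overlap
far_pair = distinguishable(20.0, 3.0)    # intervals are disjoint
print(close_pair, far_pair)
```

Languages whose results are not distinguishable in this sense are best presented as sharing the same rank.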

 

 

 

 

 

 

POTENTIAL IMPROVEMENTS FOR VERSION 4

 

 

 

 

 

 

 

Content productivity is measured on the basis of L1+L2 figures. It would be quite useful to check the value of another content productivity factor based only on L1;

as Version 3 of the model computes everything on an L1+L2 basis, this would require another version of the model.

 

 

The USAGE indicator can still be improved and its biases reduced by focusing on:

-            Its video streaming component, adding other sources to YouTube and Netflix

-            Its open data component, adding to the single current source and focusing statistics on open data, MOOCs, etc.

-            The biases, which have evolved from high against non-European languages to low against European languages; this needs to be addressed.

 

 

The TRAFFIC indicator gives a result for Chinese that is out of proportion compared to the other languages. This needs to be investigated. The impact on the final result is, however, marginal: a more proportionate value would leave Chinese equal to English and, in any case, within the same confidence interval.

 

 

 

 

 

 

THE GRAPHIC VIEW OF THE EVOLUTION OF THE METHOD FROM V1 TO V3

 

 

 

 

 

 


If you want to know the details of the methodology, read:

 The method behind the unprecedented production of indicators of the presence of languages in the Internet, Sept. 2022


 

See the introduction

 

See the results for the top languages by category of indicator

 

Compare with other similar data (W3Techs and InternetWorldStats)

 

Access the full results for all 329 languages by downloading corresponding Excel files

 
