
APPENDIX 4: SELECTION OF THE SEARCH ENGINES FOR THE L2000 STUDY

Introduction

The search engines used in the previous study have changed, and others have appeared in the last two years. A systematic study of their compatibility with the methodology used here was therefore needed; incompatibilities led to the abandonment of certain search engines. The search engines selected (see
Chapter 4.1.1) were: AltaVista, Fastsearch (Alltheweb), Google, Infoseek, iWon and Northernlight. These six search engines, each independent of the others, compete with one another in searching for keywords on the Internet.

It appeared that the measurements of the presence on the Web of the terms of our sample vary greatly depending on which search engine is used. To understand this phenomenon, which risks disqualifying our methodology entirely, a study was carried out from August 2000 taking into account the following elements, which are likely to influence the validity of our results:
  • the number of pages indexed,
  • the way in which these pages are indexed,
  • the coherence of the count results given.

    Results by search engine and by language

    The results below (Table 16) give the total number of Internet pages counted for the 1,600 terms of the study. The column for English gives the total number of pages (in millions) returned by each search engine for all of the English terms in August 2000. The figures in the other columns show, for each language, the total number of counted pages as a percentage of the English figure. For example, for iWon: 212 million pages were counted containing the English terms and 2.14 million containing the Portuguese terms (1.01% of 212).
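
    As an illustration of how the percentages in Table 16 are derived, here is a minimal Python sketch; the figures are the iWon counts quoted above, and the function name is ours, not part of the study.

        # Each language's total page count is expressed as a percentage of the
        # English total for the same search engine (the basis of Table 16).
        def relative_share(language_pages_millions, english_pages_millions):
            """Return the language's page count as a percentage of the English count."""
            return 100.0 * language_pages_millions / english_pages_millions

        # Counts quoted in the text for iWon (August 2000), in millions of pages.
        english_pages = 212.0
        portuguese_pages = 2.14

        print(f"{relative_share(portuguese_pages, english_pages):.2f}%")   # -> 1.01%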

    Table 16: Results of the study for the 6 selected search engines

                      English   Spanish   French   Italian   Portuguese   Rumanian   German
    Altavista         188 M     9.28%     9.56%    4.50%     3.98%        0.19%      16.06%
    Fast              147 M     8.41%     7.33%    4.60%     3.95%        0.37%       8.47%
    Google            210 M     7.86%     7.33%    4.65%     2.82%        0.27%       7.89%
    Infoseek           37 M     2.49%     3.97%    2.98%     0.96%        0.03%       5.39%
    iWon              212 M     4.13%     2.64%    0.69%     1.01%        0.35%       5.44%
    Northern Light    145 M     6.32%     5.26%    3.66%     3.50%        0.26%       5.23%



    As can be seen, the results, apart from those of Fast and Google, differ greatly from one search engine to another, which seriously calls the validity of our method into question. It was therefore necessary to analyse the specifications of each search engine in order to determine the reason for these differences, and also to establish which engines provide the most accurate results according to our criteria.

    An analysis of the search engines demands prior knowledge of the quantitative characteristics of the Internet.

    Data relative to the Internet and to the search engines

    How big is the Web?

    Several sets of data are available:

    The search engines with the largest indexes

    Competition rages in the search-engine market, providing a strong incentive to increase index sizes. At the time of writing, the leaders in Web indexing are the engines compared in the following section. It is important to note that the search engines index a good proportion of the Web space that interests us (between 25% and 50%), which makes it possible to apply our methodology without too many statistical pitfalls (2).

    How are the pages indexed?

    We should point out that not all the pages detected by the search engines are included in their indexes. The following table compares the number of Web pages actually present in each index with the volume of Web space analysed (3).

     

                     ANALYSED PAGES (millions)   INDEXED PAGES (millions)
    Altavista                  400                          250
    Fast                       700                          400
    Excite                     920                          250
    Inktomi                   1000                          110
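
    As an indication only, the proportion of analysed pages that each engine actually keeps in its index can be read off the table above; a small Python sketch of that calculation (the figures are those of the table, in millions):

        # Ratio of pages kept in the index to pages analysed (July 2000 figures).
        engines = {
            "Altavista": (400, 250),
            "Fast": (700, 400),
            "Excite": (920, 250),
            "Inktomi": (1000, 110),
        }

        for name, (analysed, indexed) in engines.items():
            print(f"{name:10s} keeps {100 * indexed / analysed:.0f}% of the pages it analyses")
        # Inktomi, for instance, keeps only about 11% of the pages it analyses.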



    It is interesting to understand how this reduction is carried out and how it can affect the validity of the results of our study. Two differing approaches have been identified:
    1. Inktomi: a basic index of 110 million pages selected and ranked from a source of 1 billion pages. The selection criterion for the basic index considers only the URLs of the pages that are most often cited (that is, the pages with the largest number of links pointing to them). This technique selects the most sought-after pages and ranks them easily in order of 'popularity', while keeping access times down thanks to the reduced size of the index. This approach, entirely in line with the primary objective of a search engine, unfortunately prevents us from applying our methodology to it, because the statistical distribution of pages is distorted by an algorithm that favours certain pages and thereby biases the linguistic breakdown (the most popular pages, most often in English, have a greater probability of being selected). The effect is clearest for Rumanian, which obtains abnormally weak scores and more often than not returns zero counts (a simplified numerical sketch of this effect follows the list).

    2. Altavista, Excite, Fast and Google: a larger index, with less emphasis on selection and greater independence from content (excluding mirror sites and pages returning 401 (4) or 404 (5)). With this technique the indexes are larger; even if they do not necessarily give the most relevant results, they remain compatible with our methodology, since they do not favour one language over another. A point of note: Google keeps an image of each page at the time it is added to the index, allowing the information to be retrieved even after the indexed page has been removed from the Web.
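
    To make the consequences of the first approach concrete, the Python sketch below simulates a popularity-based index selection on entirely hypothetical figures (the 80/20 language split and the link distributions are assumptions of ours, not Inktomi's actual data or algorithm): keeping only the most-linked pages shifts the language proportions of the index away from those of the crawl, which is precisely the bias that invalidates a proportional measurement.

        import random

        random.seed(0)

        def make_page(i):
            # Hypothetical crawl: 80% of pages in English, and English pages
            # receive more inbound links on average (means of 50 vs 10).
            lang = "en" if random.random() < 0.80 else "ro"
            links = random.expovariate(1 / 50) if lang == "en" else random.expovariate(1 / 10)
            return {"url": f"page-{i}", "lang": lang, "inbound_links": links}

        crawl = [make_page(i) for i in range(100_000)]

        # Popularity-based index: keep only the 10% most-linked pages.
        index = sorted(crawl, key=lambda p: p["inbound_links"], reverse=True)[: len(crawl) // 10]

        def share(pages, lang):
            return 100 * sum(p["lang"] == lang for p in pages) / len(pages)

        print(f"Rumanian share of the crawl: {share(crawl, 'ro'):.1f}%")
        print(f"Rumanian share of the index: {share(index, 'ro'):.2f}%")
        # The Rumanian share of the index collapses compared with the crawl,
        # whereas an unselective index would preserve the proportions.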

    Validating the search engines as part of our methodology

    Altavista

    AltaVista has for many years been one of the most used Web search engines. Its index is still one of the largest; however, once again (6), it was not possible to include it in our study, for the following reasons:

    Infoseek

    Infoseek's index is too small to be used in our study (36 million pages in English, compared with more than 150 million for most of the other search engines). This small index gives English an advantage over the other languages present on the Internet.

    iWon

    iWon uses the same index as Hotbot (Inktomi), which we used in the previous study. Inktomi's method of selecting pages is not compatible with our methodology, as explained in the preceding paragraph.

    Northern Light

    This search engine could not be used in our study because it does not take diacritical marks into account (in particular, it does not correctly interpret the marks used in Rumanian). Furthermore, it systematically searches for the plurals of terms in English, but not in the other languages.
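
    As an illustration of the diacritics problem (a hypothetical normalisation, not a description of Northern Light's actual processing), an engine that strips diacritical marks before matching conflates accented Rumanian forms with their unaccented spellings, so the counts it returns no longer correspond to the terms of the sample:

        import unicodedata

        def strip_diacritics(text):
            """Remove combining marks, as a diacritics-insensitive engine might do."""
            decomposed = unicodedata.normalize("NFD", text)
            return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

        # The accented Rumanian terms become indistinguishable from their
        # unaccented spellings, so pages containing either form are counted together.
        print(strip_diacritics("ţară"))    # -> tara
        print(strip_diacritics("pâine"))   # -> paine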

    Google

    This search engine claims to possess the largest Web index at this time; it also has an extremely fast interface. Like AltaVista, it 'truncated' (7) its results when the study started, which caused it to be excluded. However, a later check showed that this shortcoming had been corrected between August 2000 and November 2000. Google has therefore been retained to provide the final results for August 2000, but not for the later calculations.

    Fastsearch

    Fastsearch has one of the largest indexes, offers a quick response time, does not truncate its results and does not restrict its index to the most popular sites. Fastsearch has therefore been selected together with Google for the first calculations, and alone for the later ones.

    Conclusion

    Fastsearch is, at the time of this study (June 2001), the only search engine allowing us to apply our methodology for the proportional measurement of languages on the Web, and it has therefore been used for the calculation of the final results of the study. The fact that the figures provided by Google and Fastsearch up to January 2001 are statistically very close (their confidence intervals largely overlap) is nevertheless an essential element in maintaining confidence in the validity of our methodology.
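
    As an indication of what 'statistically very close' means here, the Python sketch below computes a normal-approximation 95% confidence interval for a proportion and checks whether two such intervals overlap; the counts are purely illustrative and are not figures from the study.

        import math

        def proportion_ci(hits, sample_size, z=1.96):
            """95% normal-approximation confidence interval for a proportion."""
            p = hits / sample_size
            half_width = z * math.sqrt(p * (1 - p) / sample_size)
            return (p - half_width, p + half_width)

        def intervals_overlap(a, b):
            return a[0] <= b[1] and b[0] <= a[1]

        # Illustrative counts only: the share of sampled terms found by two engines.
        ci_fast_like = proportion_ci(733, 10_000)
        ci_google_like = proportion_ci(745, 10_000)

        print(ci_fast_like, ci_google_like)
        print("Intervals overlap:", intervals_overlap(ci_fast_like, ci_google_like))
        # Overlapping intervals mean the two measurements are statistically
        # compatible, which is the sense of the conclusion above.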




    1. It seems that half of these are not indexed directly but are estimated by an algorithm that counts the links on pages. The precise nature of this algorithm is not available to us, so we cannot draw conclusions about it, but it does not appear to have any impact on our measurements.
    2. However, nothing prevents us from thinking that, for a sample comprising 25% to 50% of the universe, there may be a selection bias in the index that favours the most-used languages, with English in first place. In particular, it is highly probable that the newest sites are not indexed as quickly as older ones, which implies a statistical bias against the younger languages on the Internet.
    3. The figures given in the previous paragraph are from March 2000 and these are from July 2000, which explains the difference.
    4. A page with restricted access, not available to the general public.
    5. A non-existent page within a correctly referenced site.
    6. Altavista was omitted from the previous study for the same reasons.
    7. Meaning that it does not take into account some pages that correspond to the search criteria, which reduces the counts so that they no longer correspond to reality.