
APPENDIX 4: SELECTION OF THE SEARCH ENGINES FOR THE L2000 STUDY

Introduction

The search engines used in the previous study have changed, and others have appeared in the last two years. A systematic study of their compatibility with the methodology used here was therefore needed; incompatibilities led to the abandonment of certain search engines. The search engines selected (see
Chapter 4.1.1) were: AltaVista, Fastsearch (Alltheweb), Google, Infoseek, iWon and Northernlight. These six search engines, each independent of the others, compete with one another in searching for keywords on the Internet.

It appeared that the measurements of the presence on the Web of the terms of our sample vary greatly depending on which search engine is used. To understand this phenomenon, which risks disqualifying our methodology entirely, a study was carried out from August 2000 taking into account the following elements, which are likely to influence the validity of our results:
  • the number of pages indexed,
  • the way in which these pages are indexed,
  • the coherence of the count results given.

    Results by search engine and by language

    The results below (Table 16) give the total number of Internet pages counted for the 1,600 terms of the study. The column for English gives the total number of pages (in millions) returned by each search engine for all of the English terms in August 2000. The figures in the other columns show, for each language, the total number of counted pages as a percentage of the English figure. For example, for iWon: 212 million pages were counted containing the English terms and 2.14 million containing the Portuguese terms (1.01% of 212).
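
    As an illustration of how the percentages in Table 16 are derived, here is a minimal Python sketch; the figures are the iWon counts quoted above, and the function name is ours, not part of the study.

        # Each language's total page count is expressed as a percentage of the
        # English total for the same search engine (the basis of Table 16).
        def relative_share(language_pages_millions, english_pages_millions):
            """Return the language's page count as a percentage of the English count."""
            return 100.0 * language_pages_millions / english_pages_millions

        # Counts quoted in the text for iWon (August 2000), in millions of pages.
        english_pages = 212.0
        portuguese_pages = 2.14

        print(f"{relative_share(portuguese_pages, english_pages):.2f}%")   # -> 1.01%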

    Table 16: Results of the study for the 6 selected search engines

                      English   Spanish   French   Italian   Portuguese   Rumanian   German
    Altavista         188 M     9.28%     9.56%    4.50%     3.98%        0.19%      16.06%
    Fast              147 M     8.41%     7.33%    4.60%     3.95%        0.37%       8.47%
    Google            210 M     7.86%     7.33%    4.65%     2.82%        0.27%       7.89%
    Infoseek           37 M     2.49%     3.97%    2.98%     0.96%        0.03%       5.39%
    iWon              212 M     4.13%     2.64%    0.69%     1.01%        0.35%       5.44%
    Northern Light    145 M     6.32%     5.26%    3.66%     3.50%        0.26%       5.23%



    As can be seen, the results, apart from those of Fast and Google, differ greatly from one search engine to another, which seriously calls the validity of our method into question. It was therefore necessary to analyse the specifications of each search engine in order to determine the reason for these differences, and also to establish which engines provide the most accurate results according to our criteria.

    An analysis of the search engines demands prior knowledge of the quantitative characteristics of the Internet.

    Data relative to the Internet and to the search engines

    How big is the Web?

    Several sets of data are available:

    The search engines with the largest indexes

    Competition rages in the search-engine market, providing a strong incentive to increase index sizes. At the time of writing, the leaders in Web indexing are the engines compared in the following section. It is important to note that the search engines index a good proportion of the Web space that interests us (between 25% and 50%), which makes it possible to apply our methodology without too many statistical pitfalls (2).

    How are the pages indexed?

    We should point out that not all the pages detected by the search engines are included in their indexes. The following table compares the number of Web pages actually present in each index with the volume of Web space analysed (3).

     

                     ANALYSED PAGES (millions)   INDEXED PAGES (millions)
    Altavista                  400                          250
    Fast                       700                          400
    Excite                     920                          250
    Inktomi                   1000                          110
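
    As an indication only, the proportion of analysed pages that each engine actually keeps in its index can be read off the table above; a small Python sketch of that calculation (the figures are those of the table, in millions):

        # Ratio of pages kept in the index to pages analysed (July 2000 figures).
        engines = {
            "Altavista": (400, 250),
            "Fast": (700, 400),
            "Excite": (920, 250),
            "Inktomi": (1000, 110),
        }

        for name, (analysed, indexed) in engines.items():
            print(f"{name:10s} keeps {100 * indexed / analysed:.0f}% of the pages it analyses")
        # Inktomi, for instance, keeps only about 11% of the pages it analyses.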



    It is interesting to understand how this reduction is carried out and how it can affect the validity of the results of our study. Two differing approaches have been identified:
    1. Inktomi: a basic index of 110 million pages selected and ranked from a source of 1 billion pages. The selection criterion for the basic index considers only the URLs of the pages that are most often cited (that is, the pages with the largest number of links pointing to them). This technique selects the most sought-after pages and ranks them easily in order of 'popularity', while keeping access times down thanks to the reduced size of the index. This approach, entirely in line with the primary objective of a search engine, unfortunately prevents us from applying our methodology to it, because the statistical distribution of pages is distorted by an algorithm that favours certain pages and thereby biases the linguistic breakdown (the most popular pages, most often in English, have a greater probability of being selected). The effect is clearest for Rumanian, which obtains abnormally weak scores and more often than not returns zero counts (a simplified numerical sketch of this effect follows the list).

    2. Altavista, Excite, Fast and Google: a larger index, with less emphasis on selection and greater independence from content (excluding mirror sites and pages returning 401 (4) or 404 (5)). With this technique the indexes are larger; even if they do not necessarily give the most relevant results, they remain compatible with our methodology, since they do not favour one language over another. A point of note: Google keeps an image of each page at the time it is added to the index, allowing the information to be retrieved even after the indexed page has been removed from the Web.
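
    To make the consequences of the first approach concrete, the Python sketch below simulates a popularity-based index selection on entirely hypothetical figures (the 80/20 language split and the link distributions are assumptions of ours, not Inktomi's actual data or algorithm): keeping only the most-linked pages shifts the language proportions of the index away from those of the crawl, which is precisely the bias that invalidates a proportional measurement.

        import random

        random.seed(0)

        def make_page(i):
            # Hypothetical crawl: 80% of pages in English, and English pages
            # receive more inbound links on average (means of 50 vs 10).
            lang = "en" if random.random() < 0.80 else "ro"
            links = random.expovariate(1 / 50) if lang == "en" else random.expovariate(1 / 10)
            return {"url": f"page-{i}", "lang": lang, "inbound_links": links}

        crawl = [make_page(i) for i in range(100_000)]

        # Popularity-based index: keep only the 10% most-linked pages.
        index = sorted(crawl, key=lambda p: p["inbound_links"], reverse=True)[: len(crawl) // 10]

        def share(pages, lang):
            return 100 * sum(p["lang"] == lang for p in pages) / len(pages)

        print(f"Rumanian share of the crawl: {share(crawl, 'ro'):.1f}%")
        print(f"Rumanian share of the index: {share(index, 'ro'):.2f}%")
        # The Rumanian share of the index collapses compared with the crawl,
        # whereas an unselective index would preserve the proportions.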

    Validating the search engines as part of our methodology

    Altavista

    AltaVista has for many years been one of the most used Web search engines. Its index is still one of the largest; however, once again (6), it was not possible to include it in our study, for the following reasons:

    Infoseek

    Infoseek's index is too small to be used in our study (36 million pages in English, compared with more than 150 million for most of the other search engines). This small index gives English an advantage over the other languages present on the Internet.

    iWon

    iWon uses the same index as Hotbot (Inktomi), which we used in the previous study. Inktomi's method of selecting pages is not compatible with our methodology, as explained in the preceding paragraph.

    Northern Light

    This search engine could not be used in our study because it does not take diacritical marks into account (in particular, it does not correctly interpret the marks used in Rumanian). Furthermore, it systematically searches for the plurals of terms in English, but not in the other languages.
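
    As an illustration of the diacritics problem (a hypothetical normalisation, not a description of Northern Light's actual processing), an engine that strips diacritical marks before matching conflates accented Rumanian forms with their unaccented spellings, so the counts it returns no longer correspond to the terms of the sample:

        import unicodedata

        def strip_diacritics(text):
            """Remove combining marks, as a diacritics-insensitive engine might do."""
            decomposed = unicodedata.normalize("NFD", text)
            return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

        # The accented Rumanian terms become indistinguishable from their
        # unaccented spellings, so pages containing either form are counted together.
        print(strip_diacritics("ţară"))    # -> tara
        print(strip_diacritics("pâine"))   # -> paine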

    Google

    This search engine claims to possess the largest Web index at this time; it also has an extremely fast interface. Like AltaVista, it 'truncated' (7) its results when the study started, which caused it to be excluded. However, a later check showed that this shortcoming had been corrected between August 2000 and November 2000. Google has therefore been retained to provide the final results for August 2000, but not for the later calculations.

    Fastsearch

    Fastsearch has one of the largest indexes, offers a quick response time, does not truncate its results and does not restrict its index to the most popular sites. Fastsearch has therefore been selected together with Google for the first calculations, and alone for the later ones.

    Conclusion

    Fastsearch is, at the time of this study (June 2001), the only search engine allowing us to apply our methodology for the proportional measurement of languages on the Web, and it has therefore been used for the calculation of the final results of the study. The fact that the figures provided by Google and Fastsearch up to January 2001 are statistically very close (their confidence intervals largely overlap) is nevertheless an essential element in maintaining confidence in the validity of our methodology.
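
    As an indication of what 'statistically very close' means here, the Python sketch below computes a normal-approximation 95% confidence interval for a proportion and checks whether two such intervals overlap; the counts are purely illustrative and are not figures from the study.

        import math

        def proportion_ci(hits, sample_size, z=1.96):
            """95% normal-approximation confidence interval for a proportion."""
            p = hits / sample_size
            half_width = z * math.sqrt(p * (1 - p) / sample_size)
            return (p - half_width, p + half_width)

        def intervals_overlap(a, b):
            return a[0] <= b[1] and b[0] <= a[1]

        # Illustrative counts only: the share of sampled terms found by two engines.
        ci_fast_like = proportion_ci(733, 10_000)
        ci_google_like = proportion_ci(745, 10_000)

        print(ci_fast_like, ci_google_like)
        print("Intervals overlap:", intervals_overlap(ci_fast_like, ci_google_like))
        # Overlapping intervals mean the two measurements are statistically
        # compatible, which is the sense of the conclusion above.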




    1. It seems that half of these are not indexed directly but are estimated by an algorithm that counts the links on pages. The precise nature of this algorithm is not available to us, so we cannot draw conclusions about it, but it does not appear to have any impact on our measurements.
    2. However, nothing prevents us from thinking that, for a sample comprising 25% to 50% of the universe, there may be a selection bias in the index that favours the most-used languages, with English in first place. In particular, it is highly probable that the newest sites are not indexed as quickly as older ones, which implies a statistical bias against the younger languages on the Internet.
    3. The figures given in the previous paragraph are from March 2000 and these are from July 2000, which explains the difference.
    4. A page with restricted access, not available to the general public.
    5. A non-existent page within a correctly referenced site.
    6. Altavista was omitted from the previous study for the same reasons.
    7. Meaning that it does not take into account some pages that correspond to the search criteria, which reduces the counts so that they no longer correspond to reality.