International Journal of Internet Science

A peer reviewed open access journal for empirical findings, methodology, and theory of social and behavioral science concerning the Internet and its implications for individuals, social groups, organizations, and society.

Volume 3, Issue 1 (2008)

Objectivity, Reliability, and Validity of Search Engine Count Estimates
Dietmar Janetzko
National College of Ireland, Dublin, Ireland

Abstract: Count estimates ("hits") provided by Web search engines have received much attention as a yardstick for measuring phenomena as diverse as language statistics, the popularity of authors, or the similarity between words. Common to these activities is the intention to use Web search engines not only for search but also for ad hoc measurement. Using search engine count estimates (SECEs) in this way means that a phenomenon of interest, e.g., the popularity of an author, is conceived of as a measurand, and SECEs are taken to be its quantitative measures. However, the data quality of SECEs has not yet been studied systematically, and concerns have been raised about the use of this kind of data. This article examines the data quality of SECEs, focusing on the classical goodness criteria of objectivity, reliability, and validity. The results of a series of studies indicate that, with the exception of Boolean queries that use disjunction or negation, the objectivity, test-retest reliability, and parallel-test reliability of SECEs are good for most types of browsers and search engines examined. Estimating validity required model development (all-subsets regression), which yielded satisfactory results using an exploratory approach to feature selection. The findings are discussed in the light of previous objections, and perspectives for using Web search count estimates are delineated.

Keywords: Data quality, goodness criteria, Web mining, search engines, search engine counts

