Focused Sampling: Computing Topical Web Statistics

Aggregate statistical data about the web is useful in numerous scenarios, such as market research, intelligence, and social studies. In many of these applications, one is interested not in generic data about the whole web but in highly focused information pertinent to a specific domain or topic. Furthermore, timely acquisition of this information provides a competitive advantage.

Focused statistical data can be gathered by a brute-force crawl of the whole web or by a "focused crawl", which collects mainly pages that are relevant to the topic of interest. Crawling, however, is expensive and requires substantial resources. For the purpose of gathering statistical data, random sampling of web pages is a much faster, cheaper, and even more reliable approach. We develop the first efficient method for generating a random sample of web pages relevant to a given user-specified topic. Previously known techniques could produce only an unfocused sample of pages from the whole web. Our method is based on a new random walk on (a modified version of) a subgraph of the web graph.
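
To make the random-walk idea concrete, here is a minimal sketch in Python. It is not the walk developed in the report, whose details the abstract does not give. It is the standard max-degree random walk, run on the subgraph induced by topic-relevant pages, under several loud assumptions: links are treated as undirected, a relevance test is available for every page, an upper bound on the degree within the relevant subgraph is known, and that subgraph is connected. All names (max_degree_walk, TOY_GRAPH, RELEVANT, sample_relevant) and the toy data are hypothetical.

```python
import random


def max_degree_walk(graph, is_relevant, start, steps, max_degree, rng):
    """Run a max-degree random walk restricted to relevant nodes.

    Each relevant neighbor is chosen with probability 1/max_degree and the
    walk stays put with the remaining probability. On a connected undirected
    graph this makes the transition matrix symmetric, so the stationary
    distribution is uniform over the relevant nodes.
    """
    node = start
    for _ in range(steps):
        # Only links that stay inside the relevant subgraph are followed.
        neighbors = [n for n in graph[node] if is_relevant(n)]
        slot = rng.randrange(max_degree)
        if slot < len(neighbors):
            node = neighbors[slot]
        # Otherwise the walk takes a self-loop and stays at `node`.
    return node


# Hypothetical toy graph (undirected adjacency lists). Page "e" is off-topic.
TOY_GRAPH = {
    "a": ["b", "c", "e"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c", "e"],
    "e": ["a", "d"],
}
RELEVANT = {"a", "b", "c", "d"}


def sample_relevant(num_samples, seed=0):
    """Draw samples by running one independent 30-step walk per sample."""
    rng = random.Random(seed)
    return [
        # 3 is the largest degree inside the relevant subgraph (node "c").
        max_degree_walk(TOY_GRAPH, RELEVANT.__contains__, "a", 30, 3, rng)
        for _ in range(num_samples)
    ]
```

With enough steps per walk, each relevant page in the toy graph should be returned about equally often, and the off-topic page is never returned. A topical statistic can then be estimated by averaging over the sampled pages rather than crawling all of them.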

By: Ziv Bar-Yossef; Tapas Kanungo; Robert Krauthgamer

Published in: RJ10339 in 2005


