Focused Sampling: Computing Topical Web Statistics

Aggregate statistical data about the web is very useful in numerous scenarios, such as market research, intelligence, and social studies. In many of these applications, one is interested not in generic data about the whole web but rather in highly focused information pertinent to a specific domain or topic. Furthermore, timely acquisition of this information provides a competitive advantage.

Focused statistical data can be gathered by a brute force crawl of the whole web, or by a "focused crawl", which collects mainly pages that are relevant to the topic of interest. Crawling, however, is an expensive enterprise, requiring substantial resources. For the purpose of gathering statistical data, random sampling of web pages is a much faster, cheaper, and even more reliable approach. We develop the first efficient method for generating a random sample of web pages relevant to a given user-specified topic. Previously, techniques for getting only an unfocused sample of pages from the whole web were known. Our method is based on a new random walk on (a modified version of) a subgraph of the web graph.

By: Ziv Bar-Yossef; Tapas Kanungo; Robert Krauthgamer

Published in: RJ10339 in 2005

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties). I have read and understand this notice and am a member of the scientific community outside or inside of IBM seeking a single copy only.

rj10339.pdf

Questions about this service can be mailed to reports@us.ibm.com .