Topic Distillation and Spectral Filtering

This paper discusses topic distillation, an information retrieval problem that is emerging as a critical task for the WWW. Algorithms for this problem must distill a small number of high-quality documents addressing a broad topic from a large set of candidates. We give a review of the literature, and compare the problem with related tasks such as classification, clustering, and indexing. We then describe a general approach to topic distillation with applications to searching and partitioning, based on the algebraic properties of matrices derived from particular documents within the corpus. Our method, which we call spectral filtering, combines the use of terms, hyperlinks, and anchor-text to improve retrieval performance. We give results for broad-topic queries on the WWW, and also give some anecdotal results applying the same techniques to US Supreme Court law cases, US patents, and a set of Wall Street Journal newspaper articles.

By: Soumen Chakrabarti, Byron E. Dom, David Gibson, Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins

Published in: RJ10127 in 1998

This Research Report is not available electronically. Please request a copy from the contact listed below. IBM employees should contact ITIRC for a copy.

Questions about this service can be mailed to reports@us.ibm.com.