Subject-Based Searching Using Automatically Extracted Metadata; The AIM subject prototype

Searches for text documents are usually based on keywords. However, the usefulness of a document also depends on other characteristics such as writing style, language, and subject. Subject is of paramount importance, because synonyms can produce unwanted results when searching by keyword alone. Conversely, if the subjects of a query and a candidate document are determined correctly, subject-based searching can help identify the most useful documents. This paper describes the architecture of a working prototype that automatically determines the subjects of text documents and queries. Based on these subject assignments, queries are matched with document collections that have a high probability of providing useful data. As the experimental results indicate, the AIM subject prototype could narrow the search space to 2% of its original size and still retrieve roughly half of the relevant documents.
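The abstract does not say how the prototype assigns subjects, so the following is only a minimal sketch of the general idea of subject-based routing, not the AIM implementation. The subject vocabularies, the overlap-count scoring, and the names `SUBJECT_TERMS`, `guess_subject`, and `route_query` are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical subject vocabularies; the report does not list these.
SUBJECT_TERMS = {
    "medicine": {"patient", "therapy", "physician", "doctor", "clinical"},
    "computing": {"database", "query", "server", "software", "index"},
}


def guess_subject(text):
    """Return the subject whose term set overlaps the text most, or None."""
    words = Counter(text.lower().split())
    scores = {
        subject: sum(words[term] for term in terms)
        for subject, terms in SUBJECT_TERMS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def route_query(query, collections):
    """Keep only the collections whose guessed subject matches the query's."""
    subject = guess_subject(query)
    if subject is None:
        # No subject evidence in the query: fall back to every collection.
        return list(collections)
    return [
        name
        for name, docs in collections.items()
        if guess_subject(" ".join(docs)) == subject
    ]


# Toy collections (illustrative data, not from the report).
collections = {
    "clinic_notes": ["the physician reviewed the patient therapy"],
    "it_docs": ["the database server answers each query"],
}
selected = route_query("doctor clinical advice", collections)
```

In this toy run the query shares no keyword with the document in `clinic_notes`, yet the collection is still selected because both are assigned the same subject. That is the synonym problem the abstract attributes to keyword-only search.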

By: Thomas Kirsche and Rob Barrett

Published in: RJ10063 in 1996

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).
