Attribute Classification Using Feature Analysis

Database integration and migration are important but labor-intensive tasks. To transform data from one representation to another, an expert user must identify and express correspondences between the attributes of different schemata. Many attributes in a source schema might correspond to a particular target attribute. Our aim is to ease the user's burden by classifying source attributes so that they can be matched to target attributes automatically and intelligently.

For categorical data, we present a novel variation of existing Naive Bayes classification techniques based on domain-independent feature selection. For numerical data, we use a quantile-based classification method, discovering characteristic distributions of the data. We show through extensive experiments that automatic classification of attributes is both feasible and useful for identifying potential matches. The techniques are exploited for several different tasks in Clio, a tool for semi-automatic schema mapping.
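The abstract does not spell out the quantile-based method, so the following is only a minimal sketch of the general idea, under stated assumptions: each numerical attribute is summarized by a vector of evenly spaced empirical quantiles, and an unlabeled attribute is assigned the label of the closest known signature. The function names, the default of ten quantile intervals, and the mean-absolute-difference distance are all illustrative choices, not taken from the report.

```python
def quantile_signature(values, k=10):
    """Summarize a numeric attribute by k+1 evenly spaced empirical quantiles.

    Hypothetical helper: the report's actual signature may differ.
    """
    data = sorted(values)
    if not data:
        raise ValueError("attribute has no values")
    n = len(data)
    signature = []
    for i in range(k + 1):
        # Linear interpolation between the two nearest order statistics.
        pos = i * (n - 1) / k
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        signature.append(data[lo] * (1 - frac) + data[hi] * frac)
    return signature


def signature_distance(a, b):
    """Mean absolute difference between two signatures of equal length.

    Assumed distance measure, chosen for simplicity.
    """
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def classify(values, labeled_signatures, k=10):
    """Return the label whose known signature is closest to that of `values`."""
    sig = quantile_signature(values, k)
    return min(
        labeled_signatures,
        key=lambda label: signature_distance(sig, labeled_signatures[label]),
    )


# Toy data invented for illustration only.
known = {
    "age": quantile_signature([18, 25, 33, 41, 52, 67, 80]),
    "salary": quantile_signature([28000, 35000, 52000, 61000, 90000]),
}
guess = classify([22, 30, 45, 58, 71], known)
```

On this toy data the unlabeled column falls in the same range as the known "age" attribute, so it is matched to that label. A raw quantile comparison like this is sensitive to units and scale; a practical system would likely normalize values or combine the result with other evidence.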

By: Felix Naumann, Ching-Tien Ho, Xuqing Tian, Laura Haas, Nimrod Megiddo

Published in: RJ10264 in 2002

