Use Domain Knowledge to Improve Data Mining Performance of Very Large Datasets via Clustering

Data mining is a computationally intensive task. Unlike a data query, which retrieves information already stored in a repository, data mining involves exhaustive computation to uncover information hidden in the data: the patterns that the data contains [1]. The task is therefore, to a great extent, open-ended. Using statistical analysis methods, data mining tools analyze the data and compute the relationships among its attributes (also called "features"), seeking strong correlations that may be evidence of new and important information [2, 3].
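As a rough illustration of the attribute-correlation search described above, the following minimal sketch computes the Pearson correlation for every pair of attributes and keeps the strong ones. It is not the method of this report: the function names, the 0.8 threshold, and the toy records are all assumptions made for the example.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant attribute carries no correlation signal
    return cov / sqrt(var_x * var_y)


def strong_correlations(table, threshold=0.8):
    """Return (attr_a, attr_b, r) for every attribute pair with |r| >= threshold.

    `table` maps an attribute name to its column of values; the threshold
    is an arbitrary choice for this sketch.
    """
    found = []
    for a, b in combinations(sorted(table), 2):
        r = pearson(table[a], table[b])
        if abs(r) >= threshold:
            found.append((a, b, r))
    return found


# Hypothetical toy records, invented for illustration only.
records = {
    "age": [30, 40, 50, 60, 70],
    "systolic_bp": [110, 120, 130, 140, 150],
    "visits": [3, 1, 4, 1, 5],
}
pairs = strong_correlations(records)
```

Even this simple search visits every pair of attributes, so its cost grows quadratically with the number of attributes and linearly with the number of records, which suggests why reducing the dataset before mining can pay off.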

We present methods for using domain knowledge, particularly in the medical domain, to reduce the dataset size for further data mining analysis.

By: Uri Shani; Simona Cohen

Published in: Research Report H-0239, 2006

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).
