The frequent substring pattern mining problem is the problem of enumerating all substrings appearing more frequently than some threshold in a given string. This paper introduces a novel mining algorithm that is faster and requires less working memory than existing algorithms. Moreover, the algorithm generates a data structure that is useful for browsing frequent substring patterns and their contexts in the original text.
My proposed algorithm is divide-and-conquer approach that decomposes the mining task into a set of smaller tasks by using a ternary partitioning technique. After the process of frequent substring mining, the algorithm generates an incomplete suffix array which represents a pruned suffix trie based on the number of occurrences. This incomplete suffix represents the appearance positions of frequent substring patterns compactly so that their contexts can be browsed quickly. Although the average time complexity of this algorithm is O(n log n), a trial performance study shows the proposed algorithm runs faster than the pattern enumeration using a suffix tree.
By: Yuta Tsuboi
Published in: RT0548 in 2007
LIMITED DISTRIBUTION NOTICE:
This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties). I have read and understand this notice and am a member of the scientific community outside or inside of IBM seeking a single copy only.
Questions about this service can be mailed to reports@us.ibm.com .