Scalable Algorithm for Distributed In-Memory Text Indexing

Text based search remains an important technique to retrieve data. Existing and emerging domains including the massive and rapidly changing Web, sensor networks, stream computing and others involve searching huge quantity of data coming at large rates of 10GB/s - 100GB/s. High input data rates in these applications impose real time constraints on indexing. This makes it necessary to have high indexing rates for large volumes of data. Future parallel architectures with storage class memories will pave the way for high speed in-memory text indexing, where index can be completely stored in a high capacity storage class memory.
We present sequential and distributed in-memory text indexing algorithms. These algorithms leverage our novel index data structures that make text document indexing and merging process more efficient. We compare our design with the indexing algorithm in CLucene and show that the asymptotic time complexity of our sequential indexing algorithm (O(R.γ/k)) is better than CLucene (O(R.γ/k * (log(R/k) + θ))), where R is the number of documents, γ and θ are document-term distribution parameters and k is the merge-factor parameter in index construction. We also present the asymptotic time complexity for our distributed indexing algorithm and show that it is better in scalability compared to distributed version of CLucene.
The experimental results of our sequential & distributed indexing algorithm using actual website data on an MPP architecture (BG/L) are also presented along with comparison to sequential & distributed indexing using CLucene. We demonstrate around 2x improvement in indexing performance for the sequential algorithm and around 3x – 7x improvement in indexing performance for the distributed algorithm.

By: Ankur Narang, Vikas Agarwal, Monu Kedia, Vijay K Garg

Published in: RI08009 in 2008

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties). I have read and understand this notice and am a member of the scientific community outside or inside of IBM seeking a single copy only.

lucenebg-resrep.pdf

Questions about this service can be mailed to reports@us.ibm.com .