Fast String Search on the Cell Processor

Copyright © (2008) by Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distrubuted for profit or commericial advantage. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.

Your latest PC is probably a dual core or a quad core, but the one which will sit on your desk in a couple of years will be a “manycore”, like the Cell processor by Sony, Toshiba and IBM, which has 9 cores, or the Intel Terascale prototype, which has 80. Manycores are the future, and since they are so different to program from traditional processors, programmers must get ready for the change now, e.g. learning how to map basic algorithms efficiently onto the new hardware.

String searching is one of these basic algorithms, and it has a host of applications: search engines, network intrusion detection systems, virus scanners, spam filters, DNA analysis and many others. The Cell processor, with its multiple cores, promises to speed it up a lot.

In this article we show how we mapped it efficiently on the Cell. We present two implementations. The “fast” one supports a small dictionary size (100 patterns approximately) and provides a throughput of 40 Gbps, which is 100 times faster than reference implementations on x86 architectures. The “heavyduty” implementation is slower (3.3 4.3 Gbps) but it supports dictionaries with tens of thousands of strings –and beyond, depending on the available RAM.

This task is not trivial: we had to change our algorithm significantly to reach top performance. In particular, to exploit the memory subsystem at its best, we employ a pipelined parallelization strategy, and we shuffle the data layout to fight congestion: these techniques are quite unfamiliar to most programmers of traditional architectures. The details follow.

By: Daniele Paolo Scarpazza; Oreste Villa; Fabrizio Petrini

Published in: ACM Operating Systems Review, volume 42, (no 1), pages 13-20 in 2008

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties). I have read and understand this notice and am a member of the scientific community outside or inside of IBM seeking a single copy only.

rc24428.pdf

Questions about this service can be mailed to reports@us.ibm.com .