DotStar: Breaking the Scalability and Performance Barriers in Regular Expression Set Matching

Regular expressions are widely used to parse data and to detect recurrent patterns and information: they are a common choice for defining configurable rules for a variety of systems. In fact, many dataintensive applications rely on regular expression parsing as the first line of defense to perform on-line data filtering. Unfortunately, few solutions can keep up with the increasing data rates and the complexity posed by several hundreds of regular expressions.

In this paper we present DotStar (.*), a complete software tool-chain that can compile a set of regular expressions into an automaton that is highly optimized to run on multi-core processors with vector/SIMD extensions. DotStar relies on several algorithmic breakthroughs, to transform the user-provided regular expressions into a sequence of more manageable intermediate representations. The resulting automaton is both space and time efficient, and can perform the search in a single pass, without backtracking.

The experimental evaluation, performed on a state-of-the-art Intel quad-core processor, shows that DotStar can efficiently handle both small sets of regular expressions, such as those used in protocol parsing, and much larger sets like the ones designed for Network Intrusion Detection Systems (NIDS). The experimental results show that we can achieve processing rates ranging from 2.2 Gbits/sec with the more demanding sets of NIDS expressions, to 5 Gbits/sec with XML parsing, with a performance speedup of almost two orders of magnitude when compared to popular libraries such as Boost, reaching, and in some case exceeding, the performance of specialized ASICs and FPGAs.

By: Davide Pasetto; Fabrizio Petrini

Published in: RC24645 in 2008

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties). I have read and understand this notice and am a member of the scientific community outside or inside of IBM seeking a single copy only.

rc24645.pdf

Questions about this service can be mailed to reports@us.ibm.com .