Optimized XML/HTML parsing within single thread

Disclosed is a method for efficiently parsing an XML/HTML document within a single thread in middleware. It requires no modification of the existing Web application. It consists of three components: (1) an XML/HTML parser (p1) which implements an extended version of a standard API, (2) a queue (q1) which stores a fragment (df(i)) of the XML/HTML document, and (3) a main processor (m1) which manages the overall process. When m1 receives the fragment df(i), m1 first passes df(i) to q1 and then invokes p1 to parse df(i). p1 behaves like an implementation of the standard API, except that it reads df(i) from q1. If there is no data in q1, p1 suspends the parsing process, storing all of its internal state so that parsing can later resume with no regression, and returns an event that tells m1 the parsing process is suspended. When m1 receives the next fragment df(i+1) and passes it to q1, m1 calls p1 again to resume the parsing process from the first data of df(i+1). When m1 receives an EOF signal, which marks the end of the XML/HTML document, m1 passes the signal to q1 so that p1 can detect the end of the document through q1 and finish the parsing process. The method disclosed here does not require buffering the whole XML/HTML document before parsing starts, even when the program runs within a single thread. Therefore, it allows the program to parse very large XML/HTML documents without holding the entire document in memory. It also minimizes performance overhead, because the parsing process can be suspended and resumed efficiently.
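
The interaction between m1, q1, and p1 described above can be illustrated with a minimal single-threaded sketch in Python. It is not the implementation from the report: the class name `SuspendableParser`, the `run` function, the event names (`TAG`, `TEXT`, `SUSPENDED`, `END_DOCUMENT`), and the `EOF` sentinel are all hypothetical, and the toy tokenizer only separates tags from text. Unlike the disclosed method, which saves the parser's full internal state, this sketch simply keeps the unconsumed characters and rescans an incomplete token on resume.

```python
from collections import deque

EOF = object()  # hypothetical sentinel that m1 places in q1 at end of document


class SuspendableParser:
    """p1 (toy version): pulls fragments from q1 and emits events."""

    def __init__(self, queue):
        self.queue = queue   # q1, shared with the main processor
        self.buf = ""        # unconsumed characters kept across suspensions
        self.done = False    # set once the EOF sentinel has been read

    def next_event(self):
        while True:
            event = self._scan()
            if event is not None:
                return event
            if self.done:
                if self.buf:  # flush whatever is left as text
                    text, self.buf = self.buf, ""
                    return ("TEXT", text)
                return ("END_DOCUMENT", None)
            if not self.queue:
                # No data in q1: keep state and tell m1 we are suspended.
                return ("SUSPENDED", None)
            item = self.queue.popleft()
            if item is EOF:
                self.done = True
            else:
                self.buf += item

    def _scan(self):
        """Return one complete token from the buffer, or None if more data is needed."""
        if not self.buf:
            return None
        if self.buf[0] == "<":
            end = self.buf.find(">")
            if end < 0:
                return None  # tag is split across fragments
            tag, self.buf = self.buf[1:end], self.buf[end + 1:]
            return ("TAG", tag)
        start = self.buf.find("<")
        if start < 0:
            return None      # text may continue in the next fragment
        text, self.buf = self.buf[:start], self.buf[start:]
        return ("TEXT", text)


def run(fragments):
    """m1 (toy version): feed each fragment to q1, then drive p1 until it suspends."""
    q1 = deque()
    p1 = SuspendableParser(q1)
    events = []
    for item in list(fragments) + [EOF]:
        q1.append(item)
        while True:
            event = p1.next_event()
            if event[0] == "SUSPENDED":
                break        # wait for the next fragment
            events.append(event)
            if event[0] == "END_DOCUMENT":
                break
    return events
```

For example, `run(["<a>he", "llo</", "a>"])` yields a `TAG` event for `a`, a `TEXT` event for `hello`, a `TAG` event for `/a`, and a final `END_DOCUMENT` event, even though both the text and the closing tag are split across fragments. At no point does the sketch hold more than the current incomplete token plus the newest fragment.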

By: Masayoshi Teraguchi, Sachiko Yoshihama

Published in: RT0851 in 2009

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).

RT0851.pdf

Questions about this service can be mailed to reports@us.ibm.com.