Robust Web Data Extraction with XML Path Expressions

Automated extraction of structured Web data has attracted considerable interest in
both academia and industry. A particularly promising approach is to employ XML
technologies to translate semi-structured HTML documents to "pure" XML documents.
In this approach, HTML documents are first normalized into XHTML and then mapped
to the desired XML application format by using XML path expressions and regular
expressions.
In this paper we describe a methodology for creating XML path (XPath) expressions
that are capable of extracting data from virtually any HTML page, while placing an
emphasis on the persistent integrity of these expressions. This robustness is critical given
the vulnerability of extraction technologies to the continually changing content, structure,
and formatting of pages on the Web. We define categories of extraction rules in terms of
their dependence on content, structural, or formatting features, and provide practical tips
on how to create dependable data extraction patterns for the Web.
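
The pipeline summarized above (normalized XHTML, then an XPath expression, then a regular expression, then the target XML) can be sketched roughly as follows. This is an illustration only and is not taken from the report: the two sample pages, the element names `product` and `price`, and the helper names `extract_price` and `to_xml` are all hypothetical, and the input is assumed to be already well-formed and free of XML namespaces. The sketch contrasts a purely positional (structure-dependent) path with one anchored on a content label, which is one plausible reading of the rule categories mentioned in the abstract.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical page, assumed to be already normalized to well-formed XHTML.
PAGE_V1 = """<html><body>
<table>
<tr><td>Name</td><td>Widget</td></tr>
<tr><td>Price</td><td>USD 19.99</td></tr>
</table>
</body></html>"""

# The same page after a hypothetical redesign: a navigation table now
# precedes the data table, shifting every positional index.
PAGE_V2 = """<html><body>
<table>
<tr><td>Home</td><td>Search</td></tr>
<tr><td>Help</td><td>Contact</td></tr>
</table>
<table>
<tr><td>Name</td><td>Widget</td></tr>
<tr><td>Price</td><td>USD 19.99</td></tr>
</table>
</body></html>"""

# Structure-dependent rule: relies only on element positions.
POSITIONAL = "./body/table[1]/tr[2]/td[2]"
# Content-dependent rule: finds the row whose label cell reads "Price".
ANCHORED = ".//tr[td='Price']/td[2]"


def extract_price(page, path):
    """Apply an XPath expression, then a regular expression, to one page."""
    root = ET.fromstring(page)
    cell = root.find(path)
    if cell is None or cell.text is None:
        return None
    # The regular expression isolates the numeric part of the cell text.
    match = re.search(r"\d+\.\d{2}", cell.text)
    return match.group(0) if match else None


def to_xml(price):
    """Map the extracted value into a (hypothetical) target XML format."""
    product = ET.Element("product")
    ET.SubElement(product, "price").text = price
    return ET.tostring(product, encoding="unicode")


if __name__ == "__main__":
    for name, page in (("v1", PAGE_V1), ("v2", PAGE_V2)):
        print(name, "positional:", extract_price(page, POSITIONAL))
        print(name, "anchored:  ", extract_price(page, ANCHORED))
    print(to_xml(extract_price(PAGE_V2, ANCHORED)))
```

In this toy case the positional path stops finding the price once the navigation table is added, while the label-anchored path still does. The sketch uses the limited XPath subset in Python's standard library; a full XPath engine would allow more expressive anchors.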

By: Jussi Myllymaki, Jared Jackson

Published in: RJ10245 in 2002

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).
