Phantom XML

The original purpose of the W3C Extensible Markup Language (XML) was as a generic representation and interchange format for structured and semi-structured data. In practice this means that mappings exist for many data formats between the native form and an XML version. There are even standards for setting up such mappings, such as SQLX for mapping relational databases and the Global Grid Forum "Data Format Description Language" (DFDL) for mapping binary formats. More mappings will certainly arise and there might be a time when XML will be the long awaited data unification language...but there are still some rocks on the path. Indeed, XML suffers from a specific problem: the default data representation as character sequences is not well suited for processing, not even for processing by the XML standard languages. XPath 2.0, XSLT 2.0, and XQuery 1.0, part of the next generation of such languages, acknowledge this fact by having their behaviour specified in terms of an abstract version of XML, the data model, with a separate document describing the relationship between instances of this data model and actual "serialized" XML documents. An especially interesting case is when data is only accessed through queries in XPath. In this case the naive model does not work: it is overly expensive in space (and thus time) to parse or convert entire data structures to an in-memory XML data model instance and then run a generic XPath engine on the converted results.

We propose to make XPath access space efficient in general by the following strategy: First, virtualize the XML Data Model to allow creation of lightweight cursor-based adapters making various data structures appear as if they were XML without introducing unnecessary overhead. An example of this is to wrap access to the entire file system as a single virtual XML node corresponding to the "current file or directory" from which XPath navigation gives access to the whole filesystem. Second, catalog useful profiles of restricted Data Model instances for which such access is as efficient as directly accessing the underlying data structure. Examples, explained in the paper, are streaming, streaming with back-pointers, linear, and random access. Third, describe an analysis of XPath queries that determines the profile requirements for that query (for a specific XPath evaluator). This analysis can determine, for example, that simple "downwards" XPath queries can be executed within the constraints of the streaming profile. Finally, provide a cache wrapper that allows use of a data source with a restricted profile in a context requiring a more complete profile, for example, materializing all actually visited nodes in an internal tree permits use of the random access profile over streaming data sources.

In this paper, we describe each of these components in detail and show through examples how they interact in a running XML processing system. We call the approach for phantom XML to highlight the fact that the combination of direct adapters and an optional cache implies that no traditional XML is stored in the system. We measure actual efficiency (in space and time) of running some realistic example XPath queries against, both real XML documents and virtual XML data, using our implementation. This demonstrates, for example, that streaming XPaths runs in constant space at competitive speeds, and that XPaths with locally bounded predicates run in space unrelated to the overall data size.

By: Kristoffer Rose; Lionel Villard

Published in: RC23850 in 2006

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties). I have read and understand this notice and am a member of the scientific community outside or inside of IBM seeking a single copy only.

rc23850.pdf

Questions about this service can be mailed to reports@us.ibm.com .