HIL: A High-Level Scripting Language for Entity Integration

We introduce HIL, a high-level scripting language for entity resolution and integration. HIL aims to provide the core logic for complex data processing flows that aggregate facts from large collections of structured or unstructured data into a set of clean, unified entities. Such flows typically include many stages of data processing, starting from the output of information extraction and continuing with entity resolution, mapping, and fusion. A HIL program captures the overall integration flow through a combination of SQL-like rules that link, map, fuse, and aggregate entities.

HIL differs from previous tool-driven schema mapping systems in that it is a programming framework that (1) allows for more flexible specification of the integration rules, (2) incorporates entity resolution, (3) allows user-defined functions for customized cleansing, normalization and matching of values, and (4) uses a notion of logical indexes in its data model to facilitate the modular construction and aggregation of entities.

As a result, HIL can accurately express complex integration tasks, while still being high-level and focused on the logical entities (rather than the physical operations). Compilation algorithms translate the HIL specification into efficient run-time queries that execute on Hadoop. We show how our framework is applied to a real-world integration of entities in the financial domain, based on public filings archived by the U.S. Securities and Exchange Commission (SEC).
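The idea of building entities incrementally through a logical index, with a user-defined function for normalization, can be illustrated with a minimal Python sketch. This is not HIL syntax and not the compiled Hadoop queries; the record fields, the `normalize_name` helper, and the sample data are all hypothetical, chosen only to show how several records can contribute facts to one entity through a shared key.

```python
from collections import defaultdict


def normalize_name(name: str) -> str:
    """Hypothetical user-defined function: lowercase the name,
    drop punctuation, and collapse runs of whitespace."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())


# Hypothetical records, standing in for the output of information extraction.
filings = [
    {"name": "John A. Smith", "company": "Acme Corp", "title": "CEO"},
    {"name": "john a smith", "company": "Acme Corp", "title": "Director"},
    {"name": "Mary Jones", "company": "Globex", "title": "CFO"},
]


def build_person_index(records):
    """Fuse records into person entities.

    The dictionary plays the role of a logical index: each record is
    mapped to a key, and every record sharing that key adds its facts
    to the same entity, so separate rules could populate it independently.
    """
    index = defaultdict(lambda: {"names": set(), "positions": set()})
    for rec in records:
        key = normalize_name(rec["name"])  # link: records with equal keys match
        entity = index[key]
        entity["names"].add(rec["name"])  # keep every observed spelling
        entity["positions"].add((rec["company"], rec["title"]))  # aggregate facts
    return dict(index)


people = build_person_index(filings)
```

In this sketch, matching is reduced to equality on a normalized name, which is far simpler than the entity resolution rules the abstract describes; the point is only that the index lets facts about one person accumulate from several records.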

By: Mauricio Hernández, Georgia Koutrika, Rajasekar Krishnamurthy, Lucian Popa, Ryan Wisnesky

Published in: RJ10499, 2012
