Towards an Interoperability Standard for Text and Multi-Modal Analytics

Unstructured information may be thought of as the direct product of human communications. Examples include natural language documents, email, speech, images and video. It is information that was not encoded for machines to understand but rather authored for humans to understand. We say it is “unstructured” because it lacks the explicit semantics (“structure”) required for computer programs to interpret the information as intended by the human author or as required by the application. A growing number of applications see increasing value in exploiting unstructured information. This growth is largely driven by the wealth of unstructured information found on the external web, in corporate intranets, document repositories and call centers, and in customer and employee business communications. For unstructured information to be processed by traditional applications, it must first be analyzed to assign application-specific semantics to the unstructured content. This analysis is performed by software components called text and multi-modal analytics. This report motivates and proposes elements of an architecture specification for creating, composing and facilitating the interoperability of text and multi-modal analytics, based on the open-source UIMA project originated at IBM Research.

By: David Ferrucci; Adam Lally; Daniel Gruhl; Edward Epstein; Marshall Schor; J. William Murdock; Andy Frenkiel; Eric W. Brown; Thomas Hampp; Yurdaer Doganata; Christopher Welty; Lisa Amini; Galina Kofman; Lev Kozakov; Yosi Mass

Published in: RC24122 in 2006

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).


Questions about this service can be mailed to reports@us.ibm.com.