The Talent FST System

This document describes the TALENT 5.1 Finite State subsystem (TFST). It positions the process of finite state matching over annotations within the larger context of TALENT’s infrastructure. It then describes how to specify rules for pattern matching over sequences of annotations, and how to specify the features or properties of annotations that a match should target. The process of grammar writing and development is broadly outlined, and some indication is given of the different kinds of tasks and applications which can effectively utilise finite state technology. Some general guidelines on grammar development and matching strategies are offered, and sample grammars are included as examples. Finally, a brief sketch is offered of an experimental environment for finite state grammar development.

The TFST subsystem, as described here, is hosted by the Talent 5.1 document processing infrastructure. Design considerations addressing the encapsulation of FS matching functionality as a Talent (meta-)plugin are discussed in (Neff, Byrd, and Boguraev, 2003). Work is in progress on making the full functionality available, via a new and revised formalism (which targets a typed feature structures-based representation model), within an emerging framework for unstructured information management (UIM; see Ferrucci and Lally, 2003, for details of the UIM architecture).

Evolution of the TFST capability is motivated, beyond well-articulated arguments promoting the deployment of finite state processing techniques for NLP application development, by considerations of enabling such processing within industrial-strength NLP frameworks which exploit emerging notions like pipelined architectures, open-ended intercomponent communication, and in particular the adoption of linguistic annotations as a fundamental descriptive/analytic device.

By: Branimir K. Boguraev, Mary S. Neff

Published in: RC22976 in 2003

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).
