ONDUX

6 June 2010

proceedings article
Published by Association for Computing Machinery (ACM)

p. 807-818
https://doi.org/10.1145/1807167.1807254

Abstract

Information extraction by text segmentation (IETS) applies to cases in which the data values of interest are organized in implicit semi-structured records available in textual sources (e.g., postal addresses, bibliographic information, ads). It is an important practical problem that has been frequently addressed in the recent literature. In this paper we introduce ONDUX (On Demand Unsupervised Information Extraction), a new unsupervised probabilistic approach for IETS. Like other unsupervised IETS approaches, ONDUX relies on information available in pre-existing data to associate segments in the input string with attributes of a given domain. Unlike other approaches, it relies on very effective matching strategies instead of explicit learning strategies. The effectiveness of these matching strategies is also exploited to disambiguate the extraction of certain attributes through a reinforcement step that explores the sequencing and positioning of attribute values, learned on demand directly from test data with no previous human-driven training, a feature unique to ONDUX. This gives ONDUX a high degree of flexibility and results in superior effectiveness, as demonstrated by the experimental evaluation we report on textual sources from different domains, in which ONDUX is compared with a state-of-the-art IETS approach.
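
The matching idea described in the abstract can be illustrated with a minimal sketch. This is not the ONDUX algorithm itself: it only shows, under simplifying assumptions, how tokens of an input string might be associated with attributes by looking them up in pre-existing data, and how adjacent tokens with the same attribute can be merged into segments. The knowledge base contents, the function names `label_token` and `segment`, and the exact-match rule are all hypothetical, and the probabilistic matching and reinforcement step mentioned in the abstract are not modeled.

```python
from itertools import groupby

# Hypothetical knowledge base (an assumption for illustration):
# attribute -> terms observed in pre-existing records of the domain.
KNOWLEDGE_BASE = {
    "street": {"main", "st", "oak", "avenue"},
    "city": {"springfield", "portland"},
    "phone": {"555-0100", "555-0199"},
}


def label_token(token, kb):
    """Return the first attribute whose known terms contain the token, else None."""
    lowered = token.lower()
    for attribute, terms in kb.items():
        if lowered in terms:
            return attribute
    return None


def segment(text, kb):
    """Merge consecutive tokens that share an attribute label into one segment."""
    labeled = [(label_token(tok, kb), tok) for tok in text.split()]
    return [
        (attribute, " ".join(tok for _, tok in group))
        for attribute, group in groupby(labeled, key=lambda pair: pair[0])
    ]


record = "12 Main St Springfield 555-0100"
print(segment(record, KNOWLEDGE_BASE))
```

In this toy run the token `12` matches nothing in the knowledge base and is left unlabeled (`None`). Resolving such unmatched tokens is the kind of ambiguity that, according to the abstract, the reinforcement step addresses using the sequencing and positioning of attribute values.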
