Integrating Unstructured Data into Relational Databases
- 1 January 2006
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
Abstract
In this paper we present a system for automatically integrating unstructured text into a multi-relational database using state-of-the-art statistical models for structure extraction and matching. We show how to extend current high-performing models, Conditional Random Fields and their semi-Markov counterparts, to effectively exploit a variety of recognition clues available in a database of entities, thereby significantly reducing the dependence on manually labeled training data. Our system is designed to load unstructured records into columns spread across multiple tables in the database while resolving the relationship of the extracted text with existing column values and preserving the cardinality and link constraints of the database. We show how to combine the inference algorithms of statistical models with the database-imposed constraints for optimal data integration.
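The abstract mentions semi-Markov models that use a database of entities as recognition clues. The following is only a minimal sketch of that general idea, not the paper's system: it scores whole candidate segments against a toy in-memory dictionary and picks the best segmentation with a segment-level Viterbi pass. The `DB` contents, the label names, the hand-set scores, and the `MAX_LEN` limit are all assumptions made for illustration; a real model would learn feature weights from data and would also include transition features and the database constraints the abstract describes.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical database columns: label -> known values (lowercased).
DB: Dict[str, Set[str]] = {
    "author": {"jane doe", "john smith"},
    "venue": {"icde", "vldb"},
}
OTHER = "other"  # label for text that matches no column
MAX_LEN = 3      # longest segment considered, in tokens (assumed)


def segment_score(tokens: List[str], label: str) -> float:
    """Score one candidate segment with hand-set values (not learned weights)."""
    if label == OTHER:
        # Unmatched text is cheapest to explain one token at a time.
        return 0.1 if len(tokens) == 1 else -1.0
    text = " ".join(tokens).lower()
    # Reward an exact match against the column, proportional to its length.
    return 2.0 * len(tokens) if text in DB[label] else -1.0


def best_segmentation(tokens: List[str]) -> List[Tuple[str, str]]:
    """Segment-level Viterbi: best[i] is the top score for tokens[:i]."""
    n = len(tokens)
    best = [float("-inf")] * (n + 1)
    back = [None] * (n + 1)
    best[0] = 0.0
    labels = list(DB) + [OTHER]
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_LEN), end):
            for label in labels:
                score = best[start] + segment_score(tokens[start:end], label)
                if score > best[end]:
                    best[end] = score
                    back[end] = (start, label)
    # Follow back-pointers from the end to recover the segments.
    segments = []
    end = n
    while end > 0:
        start, label = back[end]
        segments.append((" ".join(tokens[start:end]), label))
        end = start
    return segments[::-1]


if __name__ == "__main__":
    print(best_segmentation("Jane Doe and John Smith ICDE 2006".split()))
```

Scoring whole segments rather than single tokens is what lets a multi-word database value such as a full name count as one match, which is the usual motivation for the semi-Markov formulation.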