Efficient Batch Top-k Search for Dictionary-based Entity Recognition

1 January 2006

conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE)

Abstract

We consider the problem of speeding up Entity Recognition systems that exploit existing large databases of structured entities to improve extraction accuracy. These systems require the computation of the maximum similarity scores of several overlapping segments of the input text with the entity database. We formulate a Batch-Top-K problem with the goal of sharing computations across overlapping segments. Our proposed algorithm performs a factor of three faster than independent Top-K queries and only a factor of two slower than an unachievable lower bound on total cost. We then propose a novel modification of the popular Viterbi algorithm for recognizing entities so as to work with easily computable bounds on match scores, thereby reducing the total inference time by a factor of eight compared to stateof- the-art methods.

Keywords

This publication has 11 references indexed in Scilit:

Exploiting dictionaries in named entity extraction
Published by Association for Computing Machinery (ACM) ,2004
Mining reference tables for automatic text segmentation
Published by Association for Computing Machinery (ACM) ,2004
Efficient set joins on similarity predicates
Published by Association for Computing Machinery (ACM) ,2004
Top-k Query Evaluation with Probabilistic Guarantees
Published by Elsevier ,2004
Robust and efficient fuzzy match for online data cleaning
Published by Association for Computing Machinery (ACM) ,2003
Optimal aggregation algorithms for middleware
Journal of Computer and System Sciences, 2003
Shallow parsing with conditional random fields
Published by Association for Computational Linguistics (ACL) ,2003
Automatic segmentation of text into structured records
ACM SIGMOD Record, 2001
Learning to Parse Natural Language with Maximum Entropy Models
Machine Learning, 1999
Robust part-of-speech tagging using a hidden Markov model
Computer Speech & Language, 1992