Progress in recognizing typeset mathematics

Abstract
Printed mathematics has a number of features which distinguish it from conventional text. These include structure in two dimensions (fractions, exponents, limits), frequent font changes, symbols with variable shape (quotient bars), and substantially differing notational conventions from source to source. When compounded with more generic problems such as noise and merged or broken characters, printed mathematics offers a challenging arena for recognition. Our project was initially driven by the goal of scanning and parsing some 5,000 pages of elaborate mathematics (tables of definite integrals). While our prototype system demonstrates success on translating noise-free typeset equations into Lisp expressions appropriate for further processing, a more semantic top-down approach appears necessary for higher levels of performance. Such an approach may benefit the incorporation of these programs into a more general document processing viewpoint. We intend to release to the public our somewhat refined prototypes as utility programs in the hope that they will be of general use in the construction of custom OCR packages. These utilities are quite fast even as originally prototyped in Lisp, where they may be of particular interest to those working on 'intelligent' optical processing. Some routines have been re-written in C++ as well. Additional programs providing formula recognition and parsing also form a part of this system. It is important however to realize that distinct conflicting grammars are needed to cover variations in contemporary and historical typesetting, and thus a single simple solution is not possible.

This publication has 0 references indexed in Scilit: