An overview of MultiText

Abstract
http://multitext.uwaterloo.ca

The research focus of the MultiText project is the development and prototyping of scalable technologies for distributed information retrieval systems. The MultiText system is based on the network-of-workstations architecture shown in Figure 1. The system is composed of several major components: The index engines maintain the index file structures and provide search capabilities. The text servers are specialized by document type and provide retrieval capabilities for arbitrary text passages specified at the word level. Finally, the marshaller/dispatcher interacts with clients and coordinates query and update activities. Research issues are addressed in the context of this distributed architecture. Issues of concern to the MultiText Project include data distribution, load balancing, fast update, compression, fault tolerance, document structure, relevance ranking, and user interaction.

Support for document structure is a particular feature of the MultiText system. The system can support multiple document formats within a single integrated database and provide specific support for structure inherent in each document type. The MultiText query language, GCL, provides facilities for directly referencing document structure and allows queries to reference equivalent structural elements across differently formatted documents.

Ranking in the MultiText system is based on passage retrieval, with the score of a passage based on its length and the score of a document based on the score of the passages contained within it. As well as ranking full documents, the method allows ranking of arbitrary document components. Scores do not depend on collection-wide statistics, making the ranking method particularly suitable for use in a dynamic distributed environment.

In order to implement the structural retrieval capabilities of GCL, the MultiText system supports a full positional index using inverted list file structures.
Text in the database is concatenated into a single term sequence, essentially one large document, and is addressed by term position. Markup is used to identify documents and other structural elements at query time. For each specific term (e.g., "the") the index stores a sorted list of the positions in the sequence where the term occurs. The index provides two basic access operations. Given a specific term and a position in the term sequence, one operation returns the location of the first occurrence of the term after the specified position; the other returns the location of the last occurrence before the specified position. The GCL query language is implemented in terms of these basic operations. The index structures used by MultiText efficiently support the operations, minimizing the number of disk accesses and allowing large portions of the index lists to be skipped during query processing.

On a network of four inexpensive PCs, costing less than US$10,000, the current version of MultiText can search an index for 100GB of text and return the top 20 documents in less than a second on average, with good retrieval effectiveness. We continue to work to improve performance. Strategies for prefetching and managing multiple query streams are of particular interest.
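
The two basic access operations on the positional index described above can be illustrated with a small in-memory sketch. This is not the MultiText implementation: the names `PositionalIndex`, `next_after`, `prev_before`, and `find_phrase` are hypothetical, and a binary search over Python lists stands in for the disk-based inverted list structures, which the abstract does not describe in detail. The phrase search is only a simple stand-in for a query built on top of the two operations; it is not GCL.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


class PositionalIndex:
    """Toy positional index over a single term sequence (hypothetical sketch)."""

    def __init__(self, terms):
        # Map each term to the sorted list of positions where it occurs.
        # Positions are appended in increasing order, so each list is sorted.
        self.postings = defaultdict(list)
        for position, term in enumerate(terms):
            self.postings[term].append(position)

    def next_after(self, term, position):
        """First occurrence of term strictly after position, or None."""
        plist = self.postings.get(term, [])
        i = bisect_right(plist, position)
        return plist[i] if i < len(plist) else None

    def prev_before(self, term, position):
        """Last occurrence of term strictly before position, or None."""
        plist = self.postings.get(term, [])
        i = bisect_left(plist, position)
        return plist[i - 1] if i > 0 else None


def find_phrase(index, phrase, start=-1):
    """Position of the first occurrence of phrase beginning after start.

    Uses only next_after, so whole stretches of each posting list are
    skipped instead of being scanned sequentially.
    """
    position = start
    while True:
        first = index.next_after(phrase[0], position)
        if first is None:
            return None
        # Each later term must occur exactly one position after the previous one.
        if all(index.next_after(term, first + k - 1) == first + k
               for k, term in enumerate(phrase[1:], start=1)):
            return first
        position = first


terms = "the quick fox saw the lazy fox".split()
index = PositionalIndex(terms)
print(index.next_after("the", 0))           # 4
print(index.prev_before("fox", 6))          # 2
print(find_phrase(index, ["the", "lazy"]))  # 4
```

Because both operations take an arbitrary position as input, a query processor can jump directly to the next candidate location rather than walking every entry in a list, which is the property the abstract credits for skipping large portions of the index lists.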
