Gauging Similarity with n-Grams: Language-Independent Categorization of Text
- 10 February 1995
- journal article
- research article
- Published by American Association for the Advancement of Science (AAAS) in Science
- Vol. 267 (5199), 843-848
- https://doi.org/10.1126/science.267.5199.843
Abstract
A language-independent means of gauging topical similarity in unrestricted text is described. The method combines information derived from n-grams (consecutive sequences of n characters) with a simple vector-space technique that makes sorting, categorization, and retrieval feasible in a large multilingual collection of documents. No prior information about document content or language is required. Context, as it applies to document similarity, can be accommodated by a well-defined procedure. When an existing document is used as an exemplar, the completeness and accuracy with which topically related documents are retrieved is comparable to that of the best existing systems. The results of a formal evaluation are discussed, and examples are given using documents in English and Japanese.
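The general idea of pairing character n-grams with a vector-space comparison can be illustrated with a minimal sketch. It is not the article's procedure: the abstract does not specify the weighting, the normalization, or how context is handled, so the choices below (overlapping n-grams with n = 3, raw counts, lowercasing, whitespace collapsing, cosine similarity) are assumptions made for illustration, and the names `ngram_profile` and `cosine_similarity` are hypothetical.

```python
from collections import Counter
from math import sqrt


def ngram_profile(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in `text`.

    Assumption: lowercasing and whitespace collapsing are illustrative
    preprocessing choices, not steps taken from the article.
    """
    cleaned = " ".join(text.lower().split())
    # A text shorter than n yields an empty profile.
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse n-gram count vectors."""
    # Counter returns 0 for n-grams missing from `b`.
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_exemplar(exemplar: str, documents: list[str], n: int = 3) -> list[tuple[float, str]]:
    """Sort `documents` by similarity to `exemplar`, most similar first."""
    target = ngram_profile(exemplar, n)
    scored = [(cosine_similarity(target, ngram_profile(d, n)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Because the profile is built from raw characters rather than words, the same functions accept text in any script that Python can represent as a `str`, which is the sense in which such a scheme needs no prior knowledge of the language. Calling `rank_by_exemplar(query_document, collection)` returns the collection ordered by score, mirroring the exemplar-based retrieval the abstract mentions.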