TEXTQUEST: DOCUMENT CLUSTERING OF MEDLINE ABSTRACTS FOR CONCEPT DISCOVERY IN MOLECULAR BIOLOGY

Abstract
We present an algorithm for large-scale document clustering of biological text obtained from Medline abstracts. The algorithm is based on statistical treatment of terms, stemming, the idea of a “go-list”, unsupervised machine learning and graph layout optimization. The method is flexible and robust, and is controlled by a small number of parameters. Experiments show that the resulting document clusters are meaningful, as assessed by cluster-specific terms. Although the approach is statistical and involves minimal semantic analysis, the terms provide a shallow description of the document corpus and support concept discovery.
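
The abstract names the stages of the pipeline but does not specify them, so the following is only a minimal sketch of how such stages could fit together, not the TextQuest implementation. It assumes that a go-list is a whitelist of terms to keep (the inverse of a stop-list), uses a crude suffix stripper in place of a real stemmer, weights terms by TF-IDF, and groups documents by connected components of a cosine-similarity graph as a stand-in for the unspecified unsupervised learning step. Graph layout optimization is not covered. All function names, the example documents, the go-list and the threshold are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def crude_stem(token):
    """Strip 'ing' or a plural 's' (a stand-in for a real stemmer)."""
    if token.endswith("ing") and len(token) >= 6:
        return token[:-3]
    if token.endswith("s") and not token.endswith("ss") and len(token) >= 4:
        return token[:-1]
    return token


def tokenize(text, go_stems):
    """Lowercase, split on non-letters, stem, keep only go-list stems."""
    stems = (crude_stem(t) for t in re.findall(r"[a-z]+", text.lower()))
    return [s for s in stems if s in go_stems]


def tfidf_vectors(docs, go_stems):
    """Return one sparse {term: weight} vector per document."""
    counts = [Counter(tokenize(d, go_stems)) for d in docs]
    df = Counter(term for c in counts for term in c)
    n = len(docs)
    return [{t: tf * (math.log(n / df[t]) + 1.0) for t, tf in c.items()}
            for c in counts]


def cosine(a, b):
    """Cosine similarity of two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def cluster(vectors, threshold):
    """Link documents with similarity >= threshold; return components."""
    parent = list(range(len(vectors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)
    groups = defaultdict(list)
    for i in range(len(vectors)):
        groups[find(i)].append(i)
    return sorted(groups.values())


def top_terms(vectors, members, k):
    """Terms with the highest summed weight inside one cluster."""
    total = Counter()
    for i in members:
        total.update(vectors[i])
    return [term for term, _ in total.most_common(k)]


# Hypothetical toy corpus and go-list, for illustration only.
docs = [
    "Kinase phosphorylation regulates signaling kinases",
    "Phosphorylation by kinases in cell signaling",
    "Ribosome binding during translation of proteins",
    "Translation initiation at ribosomes",
]
go_list = {"kinase", "phosphorylation", "signaling", "ribosome", "translation"}
go_stems = {crude_stem(t) for t in go_list}

vectors = tfidf_vectors(docs, go_stems)
clusters = cluster(vectors, threshold=0.3)
for members in clusters:
    print(members, top_terms(vectors, members, 2))
```

On this toy corpus the two kinase abstracts and the two ribosome abstracts fall into separate groups, and the highest-weighted terms in each group play the role of the cluster-specific terms the abstract uses to assess cluster quality. The threshold is the kind of small, explicit parameter the abstract alludes to; its value here is arbitrary.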
