GaP
- 25 July 2004
- proceedings article
- Published by Association for Computing Machinery (ACM)
- p. 122-129
- https://doi.org/10.1145/1008992.1009016
Abstract
We present a probabilistic model for a document corpus that combines many of the desirable features of previous models. The model is called "GaP" for Gamma-Poisson, the distributions of the first and last random variable. GaP is a factor model; that is, it gives an approximate factorization of the document-term matrix into a product of matrices Λ and X. These factors have strictly non-negative terms. GaP is a generative probabilistic model that assigns finite probabilities to documents in a corpus. It can be computed with an efficient and simple EM recurrence. For a suitable choice of parameters, the GaP factorization maximizes independence between the factors, so it can be used as an independent-component algorithm adapted to document data. The form of the GaP model is empirically as well as analytically motivated. It gives very accurate results as a probabilistic model (measured via perplexity) and as a retrieval model. The GaP model projects documents and terms into a low-dimensional space of "themes," and models texts as "passages" of terms on the same theme.
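The generative reading of the abstract can be illustrated with a short sketch: Gamma-distributed theme weights X, a non-negative term-by-theme matrix Λ, and Poisson term counts with mean ΛX. This is only an illustration under stated assumptions, not the paper's implementation. The function name `sample_gap_corpus`, the Gamma shape and scale values, the Dirichlet draw used to obtain a non-negative Λ, and the normalization of Λ's columns are all hypothetical choices made here; the EM recurrence mentioned in the abstract is not reproduced.

```python
import numpy as np


def sample_gap_corpus(n_terms, n_themes, n_docs, shape=1.2, scale=1.0, seed=0):
    """Draw a synthetic document-term count matrix from a Gamma-Poisson factor model.

    All parameter values and the way Lambda is drawn are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)

    # Lambda: non-negative (n_terms x n_themes) matrix. As an assumption, each
    # column is drawn from a flat Dirichlet so it sums to 1 over terms.
    lam = rng.dirichlet(np.ones(n_terms), size=n_themes).T

    # X: non-negative (n_themes x n_docs) theme weights, Gamma distributed
    # (the "first" random variable in the Gamma-Poisson naming).
    x = rng.gamma(shape, scale, size=(n_themes, n_docs))

    # Observed counts: Poisson with mean Lambda @ X
    # (the "last" random variable in the Gamma-Poisson naming).
    rate = lam @ x
    counts = rng.poisson(rate)
    return lam, x, counts


if __name__ == "__main__":
    lam, x, counts = sample_gap_corpus(n_terms=50, n_themes=3, n_docs=20)
    print("Lambda:", lam.shape, "X:", x.shape, "counts:", counts.shape)
```

Under this sketch, the product Λ X approximates the expected document-term matrix, and both factors are non-negative by construction, which is the sense in which the model gives a non-negative factorization.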