A tree-based model for homogeneous groupings of multinomials
- 19 October 2005
- journal article
- research article
- Published by Wiley in Statistics in Medicine
- Vol. 24 (22), 3513-3522
- https://doi.org/10.1002/sim.2182
Abstract
The motivation of this paper is to provide a tree-based method for grouping multinomial data according to their classification probability vectors. We produce an initial tree by binary recursive partitioning whereby multinomials are successively split into two subsets and the splits are determined by maximizing the likelihood function. If the number of multinomials k is too large, we propose to order the multinomials, and then build the initial tree based on a dramatically smaller number k–1 of possible splits. The tree is then pruned from the bottom up. The pruning process involves a sequence of hypothesis tests of a single homogeneous group against the alternative that there are two distinct, internally homogeneous groups. As pruning criteria, the Bayesian information criterion and the Wilcoxon rank-sum test are proposed. The tree-based model is illustrated on genetic sequence data. Homogeneous groupings of genetic sequences present new opportunities to understand and align these sequences. Copyright © 2005 John Wiley & Sons, Ltd.
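The splitting and pruning steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the multinomials arrive already ordered (the abstract does not state the ordering rule), represents each multinomial as a list of category counts, and uses a simple BIC with one free probability vector per group. That parameter count and the function names `group_loglik`, `best_ordered_split`, `bic`, and `keep_split` are all hypothetical choices made for illustration.

```python
import math


def group_loglik(rows):
    """Maximized multinomial log-likelihood of rows pooled into one
    homogeneous group. The multinomial coefficients are omitted because
    they are identical for every grouping of the same rows."""
    totals = [sum(col) for col in zip(*rows)]
    n = sum(totals)
    return sum(c * math.log(c / n) for c in totals if c > 0)


def best_ordered_split(rows):
    """Scan the k-1 cut points of an ordered list of multinomials and
    return (cut_index, loglik) for the two-group split with the largest
    maximized likelihood."""
    best = None
    for i in range(1, len(rows)):
        ll = group_loglik(rows[:i]) + group_loglik(rows[i:])
        if best is None or ll > best[1]:
            best = (i, ll)
    return best


def bic(loglik, n_groups, n_categories, n_obs):
    """BIC under the assumption (ours, not the paper's) that each group
    contributes n_categories - 1 free parameters."""
    n_params = n_groups * (n_categories - 1)
    return -2.0 * loglik + n_params * math.log(n_obs)


def keep_split(rows):
    """Compare one homogeneous group against the best two-group split.
    Returns (cut_index, True) when the two-group model has lower BIC."""
    n_obs = sum(sum(r) for r in rows)
    n_cat = len(rows[0])
    cut, ll_two = best_ordered_split(rows)
    ll_one = group_loglik(rows)
    return cut, bic(ll_two, 2, n_cat, n_obs) < bic(ll_one, 1, n_cat, n_obs)


# Made-up counts: the first two rows favor category 0, the last two category 1.
counts = [[40, 5, 5], [38, 6, 6], [5, 40, 5], [6, 39, 5]]
cut, keep = keep_split(counts)
```

Applying `keep_split` recursively to each resulting subset would grow a tree, and re-checking the splits from the leaves upward would correspond to the bottom-up pruning the abstract mentions. The Wilcoxon rank-sum criterion is not sketched here.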