Bayesian search of functionally divergent protein subgroups and their function specific residues

Open Access

26 July 2006

journal article
research article
Published by Oxford University Press (OUP) in Bioinformatics

Vol. 22 (20), 2466-2474
https://doi.org/10.1093/bioinformatics/btl411

Abstract

Motivation: The rapid increase in the amount of protein sequence data has created a need for an automated identification of evolutionarily related subgroups from large datasets. The existing methods typically require a priori specification of the number of putative groups, which defines the resolution of the classification solution.

Results: We introduce a Bayesian model-based approach to simultaneous identification of evolutionary groups and conserved parts of the protein sequences. The model-based approach provides an intuitive and efficient way of determining the number of groups from the sequence data, in contrast to the ad hoc methods often exploited for similar purposes. Our model recognizes the areas in the sequences that are relevant for the clustering and regards other areas as noise. We have implemented the method using a fast stochastic optimization algorithm which yields a clustering associated with the estimated maximum posterior probability. The method has been shown to have high specificity and sensitivity in simulated and real clustering tasks. With real datasets the method also highlights the residues close to the active site.

Availability: Software ‘kPax’ is available at Author Webpage

Contact: pekka.marttinen@helsinki.fi

Supplementary information: Author Webpage
