Euk-PLoc: an ensemble classifier for large-scale eukaryotic protein subcellular location prediction
- 19 January 2007
- journal article
- Published by Springer Nature in Amino Acids
- Vol. 33 (1) , 57-67
- https://doi.org/10.1007/s00726-006-0478-8
Abstract
With the avalanche of newly found protein sequences emerging in the post-genomic era, it is highly desirable to develop an automated method for identifying their subcellular locations quickly and reliably, because the knowledge thus obtained can provide key clues for revealing their functions and understanding how they interact with each other in cellular networking. However, predicting the subcellular location of eukaryotic proteins is a challenging problem, particularly when unknown query proteins do not have significant homology to proteins of known subcellular locations and when more locations need to be covered. To cope with the challenge, protein samples are formulated by hybridizing the information derived from the gene ontology database and amphiphilic pseudo amino acid composition. Based on such a representation, a novel ensemble hybridization classifier was developed by fusing many basic individual classifiers through a voting system. Each of these basic classifiers was engineered by the KNN (K-Nearest Neighbor) principle. As a demonstration, a new benchmark dataset was constructed that covers the following 18 localizations: (1) cell wall, (2) centriole, (3) chloroplast, (4) cyanelle, (5) cytoplasm, (6) cytoskeleton, (7) endoplasmic reticulum, (8) extracell, (9) Golgi apparatus, (10) hydrogenosome, (11) lysosome, (12) mitochondria, (13) nucleus, (14) peroxisome, (15) plasma membrane, (16) plastid, (17) spindle pole body, and (18) vacuole. To avoid homology bias, none of the proteins included has ≥25% sequence identity to any other in the same subcellular location. The overall success rates thus obtained via the 5-fold and jackknife cross-validation tests were 81.6 and 80.3%, respectively, which were 40–50% higher than those achieved by the other existing methods on the same strict dataset. The powerful predictor, named “Euk-PLoc”, is available as a web-server at http://202.120.37.186/bioinf/euk.
Furthermore, to support the needs of people working in the relevant areas, a downloadable file will be provided at the same website to list the results predicted by Euk-PLoc for all eukaryotic protein entries (excluding fragments) in the Swiss-Prot database that do not have subcellular location annotations or are annotated as being uncertain. The large-scale results will be updated twice a year to include the new entries of eukaryotic proteins and reflect the continuous development of Euk-PLoc.
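The idea of fusing several KNN-based classifiers through a voting system can be illustrated with a minimal sketch. The abstract does not say how the basic classifiers differ from one another, which distance measure they use, or how votes are weighted, so the code below assumes one classifier per value of K, Euclidean distance, and an unweighted majority vote. The function names `knn_predict` and `ensemble_predict` and the toy data are hypothetical and are not taken from Euk-PLoc.

```python
from collections import Counter
import math


def knn_predict(train, query, k):
    """Label `query` by majority vote among its k nearest training samples.

    `train` is a list of (feature_vector, label) pairs. Euclidean distance
    is an assumption made for this sketch.
    """
    neighbours = sorted(train, key=lambda s: math.dist(s[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def ensemble_predict(train, query, ks=(1, 3, 5)):
    """Fuse several KNN classifiers (one per K, an assumption) by majority vote."""
    votes = Counter(knn_predict(train, query, k) for k in ks)
    return votes.most_common(1)[0][0]


# Toy two-dimensional samples; real inputs would be feature vectors built from
# gene ontology information and amphiphilic pseudo amino acid composition.
TRAIN = [
    ((0.0, 0.0), "nucleus"),
    ((0.1, 0.2), "nucleus"),
    ((0.2, 0.1), "nucleus"),
    ((1.0, 1.0), "cytoplasm"),
    ((0.9, 1.1), "cytoplasm"),
    ((1.1, 0.9), "cytoplasm"),
]

print(ensemble_predict(TRAIN, (0.15, 0.1)))
```

Each basic classifier casts one vote for a location, and the ensemble returns the location with the most votes. A jackknife test in this setting would call `ensemble_predict` once per sample, each time with that sample removed from the training list.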