PSLDoc: Protein subcellular localization prediction based on gapped‐dipeptides and probabilistic latent semantic analysis

7 February 2008

journal article
research article
Published by Wiley in Proteins-Structure Function and Bioinformatics

Vol. 72 (2) , 693-710
https://doi.org/10.1002/prot.21944

Abstract

Prediction of protein subcellular localization (PSL) is important for genome annotation, protein function prediction, and drug discovery. Many computational approaches for PSL prediction based on protein sequences have been proposed in recent years for Gram‐negative bacteria. We present PSLDoc, a method based on gapped‐dipeptides and probabilistic latent semantic analysis (PLSA) to solve this problem. A protein is treated as a term string composed of gapped‐dipeptides, which are defined as any two residues separated by one or more positions. The weight of each gapped‐dipeptide is calculated from a position‐specific scoring matrix, which incorporates sequence evolutionary information. PLSA is then applied for feature reduction, and the reduced vectors are input to five one‐versus‐rest support vector machine classifiers. The localization site with the highest probability is assigned as the final prediction. It has been reported that there is a strong correlation between sequence homology and subcellular localization (Nair and Rost, Protein Sci 2002;11:2836–2847; Yu et al., Proteins 2006;64:643–651). To properly evaluate the performance of PSLDoc, a target protein can be classified into the low‐ or high‐homology data set. PSLDoc's overall accuracy on the low‐ and high‐homology data sets reaches 86.84% and 98.21%, respectively, which compares favorably with that of CELLO II (Yu et al., Proteins 2006;64:643–651). In addition, we set a confidence threshold to achieve high precision at specified levels of recall. When the confidence threshold is set at 0.7, PSLDoc achieves a precision of 97.89%, which is considerably better than that of PSORTb v.2.0 (Gardy et al., Bioinformatics 2005;21:617–623). Our approach demonstrates that this feature representation for proteins can be successfully applied to the prediction of protein subcellular localization and improves prediction accuracy. Moreover, because the representation is general, our method can be extended to eukaryotic proteomes in the future. The web server of PSLDoc is publicly available at http://bio-cluster.iis.sinica.edu.tw/~bioapp/PSLDoc/. Proteins 2008.
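
The following Python sketch illustrates two steps named in the abstract: turning a sequence into gapped‐dipeptide terms, and assigning the most probable localization site subject to a confidence threshold. It is a minimal illustration under stated assumptions, not the PSLDoc implementation. The function names, the `max_gap` parameter, the convention that the gap counts intervening residues starting from zero, and the site labels are all hypothetical. Plain counts stand in for the position‐specific scoring matrix weighting, and the PLSA and support vector machine stages are omitted.

```python
from collections import Counter


def gapped_dipeptides(sequence, max_gap=2):
    """Count terms (a, d, b): residue a, then residue b, with d residues between.

    Assumption: d runs from 0 to max_gap. Raw counts are used here in place
    of the PSSM-based weights described in the abstract.
    """
    counts = Counter()
    n = len(sequence)
    for i in range(n):
        for d in range(max_gap + 1):
            j = i + d + 1  # index of the second residue of the pair
            if j >= n:
                break
            counts[(sequence[i], d, sequence[j])] += 1
    return counts


def predict_site(probabilities, threshold=0.7):
    """Return the site with the highest probability, or None if it is below threshold.

    Assumption: a prediction is accepted when its probability is at least
    the threshold; the abstract does not say whether the bound is inclusive.
    """
    site = max(probabilities, key=probabilities.get)
    if probabilities[site] < threshold:
        return None
    return site


if __name__ == "__main__":
    terms = gapped_dipeptides("MKKLLPT", max_gap=2)
    print(len(terms), "distinct gapped-dipeptide terms")

    # Illustrative probabilities, not model output.
    scores = {"cytoplasmic": 0.82, "periplasmic": 0.10, "extracellular": 0.08}
    print(predict_site(scores))
```

Returning `None` below the threshold mirrors the trade-off the abstract describes: withholding low‐confidence predictions raises precision at the cost of recall.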
