Improving gene annotation using peptide mass spectrometry
Open Access
- 22 December 2006
- journal article
- Published by Cold Spring Harbor Laboratory in Genome Research
- Vol. 17 (2) , 231-239
- https://doi.org/10.1101/gr.5646507
Abstract
Annotation of protein-coding genes is a key goal of genome sequencing projects. In spite of tremendous recent advances in computational gene finding, comprehensive annotation remains a challenge. Peptide mass spectrometry is a powerful tool for researching the dynamic proteome and suggests an attractive approach to discover and validate protein-coding genes. We present algorithms to construct and efficiently search spectra against a genomic database, with no prior knowledge of encoded proteins. By searching a corpus of 18.5 million tandem mass spectra (MS/MS) from human proteomic samples, we validate 39,000 exons and 11,000 introns at the level of translation. We present translation-level evidence for novel or extended exons in 16 genes, confirm translation of 224 hypothetical proteins, and discover or confirm over 40 alternative splicing events. Polymorphisms are efficiently encoded in our database, allowing us to observe variant alleles for 308 coding SNPs. Finally, we demonstrate the use of mass spectrometry to improve automated gene prediction, adding 800 correct exons to our predictions using a simple rescoring strategy. Our results demonstrate that proteomic profiling should play a role in any genome sequencing project.
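To make the idea of searching spectra against genomic sequence "with no prior knowledge of encoded proteins" concrete, the sketch below enumerates candidate tryptic peptides from a naive six-frame translation of a DNA string. This is an illustrative baseline only, not the database construction described in the article: it ignores splicing and polymorphisms, and the function names (`six_frame_translations`, `tryptic_peptides`, `candidate_peptides`) and the `min_len` parameter are assumptions introduced for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. containing N) become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Yield the three forward and three reverse-strand translations."""
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            yield translate(strand[offset:])


def tryptic_peptides(protein, min_len=6):
    """Split at stop codons, then cleave after K or R unless followed by P."""
    peptides = []
    for orf in protein.split("*"):
        start = 0
        for i, aa in enumerate(orf):
            is_last = i == len(orf) - 1
            cleave = aa in "KR" and not is_last and orf[i + 1] != "P"
            if cleave or is_last:
                peptide = orf[start:i + 1]
                if len(peptide) >= min_len:
                    peptides.append(peptide)
                start = i + 1
    return peptides


def candidate_peptides(dna, min_len=6):
    """Collect the distinct candidate peptides over all six reading frames."""
    found = set()
    for frame in six_frame_translations(dna):
        found.update(tryptic_peptides(frame, min_len))
    return found
```

In a real search, each candidate peptide would then be scored against observed MS/MS spectra. A six-frame enumeration of this kind cannot represent peptides that span exon junctions, which is one reason a splice-aware database is needed to validate introns.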