Novel gene and gene model detection using a whole genome open reading frame analysis in proteomics
Open Access
- 1 February 2006
- journal article
- Published by Springer Nature in Genome Biology
- Vol. 7 (4) , R35
- https://doi.org/10.1186/gb-2006-7-4-r35
Abstract
Background: Defining the location of genes and the precise nature of gene products remains a fundamental challenge in genome annotation. Interrogating tandem mass spectrometry data using genomic sequence provides an unbiased method to identify novel translation products. A six-frame translation of the entire human genome was used as the query database to search for novel blood proteins in the data from the Human Proteome Organization Plasma Proteome Project. Because this target database is orders of magnitude larger than the databases traditionally employed in tandem mass spectra analysis, careful attention to significance testing is required. Confidence of identification is assessed using our previously described Poisson statistic, which estimates the significance of multi-peptide identifications incorporating the length of the matching sequence, number of spectra searched and size of the target sequence database.
Results: Applying a false discovery rate threshold of 0.05, we identified 282 significant open reading frames, each containing two or more peptide matches. There were 627 novel peptides associated with these open reading frames that mapped to a unique genomic coordinate placed within the start/stop points of previously annotated genes. These peptides matched 1,110 distinct tandem MS spectra. Peptides fell into four categories based upon where their genomic coordinates placed them relative to annotated exons within the parent gene.
Conclusion: This work provides evidence for novel alternative splice variants in many previously annotated genes. These findings suggest that annotation of the genome is not yet complete and that proteomics has the potential to further add to our understanding of gene structures.
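The six-frame translation mentioned in the Background can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names (`six_frame_translation`, `stop_to_stop_orfs`), the stop-to-stop definition of an open reading frame, and the `min_len` cutoff are assumptions made here for illustration, and only the standard genetic code is used.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Translate three offsets on each strand; keys are '+1'..'+3' and '-1'..'-3'."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


def stop_to_stop_orfs(protein, min_len=7):
    """Assumed ORF definition: stretches between stop codons ('*') of at least min_len residues."""
    return [segment for segment in protein.split("*") if len(segment) >= min_len]


frames = six_frame_translation("ATGGCCTAA")
for name, protein in frames.items():
    print(name, protein)
```

Each translated frame could then be split into candidate open reading frames with `stop_to_stop_orfs` and used as a search database for tandem mass spectra. Because six frames are kept for every genomic position, such a database is far larger than a curated protein database, which is why the abstract stresses significance testing.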