Massive Sequence Comparisons as a Help in Annotating Genomic Sequences

Open Access

12 June 2001

journal article
research article
Published by Cold Spring Harbor Laboratory in Genome Research

Vol. 11 (7) , 1296-1303
https://doi.org/10.1101/gr.gr-1776r

Abstract

An all-by-all comparison of all publicly available protein sequences from plants has been performed, followed by a clustering step. Within each of the 1064 resulting clusters (which contain orthologous as well as paralogous sequences), the sequences have been submitted to a pyramidal classification and their domains delineated by an automated procedure à la PRODOM. This process provides a means of easily checking a cluster for any apparent inconsistency, for example, whether one sequence is shorter or longer than the others or whether a domain is missing. In such cases, aligning the DNA sequence of the gene with a closely homologous protein often reveals (in 10% of the clusters) probable sequencing errors (leading to frameshifts) or probable wrong intron/exon predictions. The composition of the clusters, their pyramidal classifications, and domain decompositions, as well as our comments when appropriate, are available from http://chlora.infobiogen.fr:1234/PHYTOPROT.
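
The consistency check described in the abstract can be illustrated with a minimal sketch. It does not reproduce the authors' pipeline: the abstract does not state the clustering criterion, so single-linkage grouping (connected components of the similarity graph) is assumed here, and the function names `cluster_by_hits` and `length_outliers`, the 30% length tolerance, and the toy data are all hypothetical.

```python
from collections import defaultdict
from statistics import median


def cluster_by_hits(seq_ids, hits):
    """Group sequences into connected components of the similarity graph.

    `hits` is an iterable of (id_a, id_b) pairs that passed some
    similarity threshold in the all-by-all comparison (assumed input).
    """
    parent = {s: s for s in seq_ids}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(list)
    for s in seq_ids:
        groups[find(s)].append(s)
    return sorted(sorted(members) for members in groups.values())


def length_outliers(cluster, lengths, tolerance=0.3):
    """Return members whose length deviates from the cluster median
    by more than `tolerance` (a fraction of the median; assumed value)."""
    med = median(lengths[s] for s in cluster)
    return [s for s in cluster if abs(lengths[s] - med) > tolerance * med]


# Toy data: sequence lengths in residues and pairwise similarity hits.
lengths = {"A": 300, "B": 310, "C": 150, "D": 295, "E": 500, "F": 505}
hits = [("A", "B"), ("B", "C"), ("C", "D"), ("E", "F")]

clusters = cluster_by_hits(list(lengths), hits)
for cluster in clusters:
    print(cluster, "suspicious:", length_outliers(cluster, lengths))
```

A sequence flagged this way is only a candidate for closer inspection; in the workflow the abstract describes, the next step would be to align the gene's DNA sequence with a closely homologous protein to look for a frameshift or a mispredicted intron/exon boundary.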
