Rule-based knowledge aggregation for large-scale protein sequence analysis of influenza A viruses
Open Access
- 13 February 2008
- journal article
- conference paper
- Published by Springer Nature in BMC Bioinformatics
- Vol. 9 (S1), S7
- https://doi.org/10.1186/1471-2105-9-S1-S7
Abstract
The explosive growth of biological data provides opportunities for new statistical and comparative analyses of large information sets, such as alignments comprising tens of thousands of sequences. In such studies, sequence annotations frequently play an essential role, and reliable results depend on metadata quality. However, the semantic heterogeneity and annotation inconsistencies in biological databases greatly increase the complexity of aggregating and cleaning metadata. Manual curation of datasets, traditionally favoured by life scientists, is impractical for studies involving thousands of records. In this study, we investigate quality issues that affect major public databases, and quantify the effectiveness of an automated metadata extraction approach that combines structural and semantic rules. We applied this approach to more than 90,000 influenza A records, to annotate sequences with protein name, virus subtype, isolate, host, geographic origin, and year of isolation.
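As an illustration of what combining structural and semantic rules for metadata extraction might look like, the following minimal sketch parses an influenza A strain designation out of a free-text record description. It is not the authors' implementation: the function name `parse_strain`, the regular expression, the small `KNOWN_HOSTS` vocabulary, and the two-digit-year pivot are all assumptions made for this example. The structural rule relies on the conventional strain naming pattern `A/host/location/isolate/year(subtype)`, in which the host field is typically omitted for human isolates; the semantic rule checks the extracted host against a controlled vocabulary.

```python
import re
from typing import Optional

# Hypothetical controlled vocabulary for the semantic rule (illustrative only).
KNOWN_HOSTS = {"human", "chicken", "duck", "goose", "swine", "equine"}

# Structural rule: "A", then three or four slash-separated fields,
# then the subtype in parentheses, e.g. A/chicken/Hong Kong/220/97(H5N1).
STRAIN_RE = re.compile(r"(A(?:/[^/()]+){3,4})\s*\((H\d{1,2}N\d{1,2})\)")


def normalise_year(raw: str, pivot: int = 30) -> Optional[int]:
    """Expand a two-digit year; the pivot value is an assumption of this sketch."""
    raw = raw.strip()
    if not raw.isdigit():
        return None
    if len(raw) == 4:
        return int(raw)
    if len(raw) == 2:
        yy = int(raw)
        return 2000 + yy if yy < pivot else 1900 + yy
    return None


def parse_strain(description: str) -> Optional[dict]:
    """Extract subtype, host, location, isolate and year from a description."""
    match = STRAIN_RE.search(description)
    if match is None:
        return None
    fields = [f.strip() for f in match.group(1).split("/")[1:]]
    if len(fields) == 4:
        host, location, isolate, year = fields
    else:
        # Three fields: host omitted, which by convention indicates a human isolate.
        location, isolate, year = fields
        host = "human"
    host = host.lower()
    return {
        "subtype": match.group(2),
        "host": host,
        # Semantic rule: flag hosts that are not in the vocabulary for review.
        "host_known": host in KNOWN_HOSTS,
        "location": location,
        "isolate": isolate,
        "year": normalise_year(year),
    }


record = parse_strain(
    "Influenza A virus (A/chicken/Hong Kong/220/97(H5N1)) hemagglutinin"
)
print(record)
```

Records for which the structural rule finds no match, or whose host is not in the vocabulary, would be set aside for further rules or manual inspection rather than silently annotated.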