Knowledge accumulation and resolution of data inconsistencies during the integration of microbial information sources
- 27 June 2005
- journal article
- Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Knowledge and Data Engineering
- Vol. 17 (8), 1111-1126
- https://doi.org/10.1109/tkde.2005.131
Abstract
The Internet has emerged as an ever-growing environment of multiple heterogeneous and autonomous data sources that contain relevant but overlapping information on microorganisms. Microbiologists might therefore benefit considerably from the design of intelligent software agents that assist in navigating this information-rich environment, together with the development of data mining tools that can aid in the discovery of new information. These applications depend heavily on well-conditioned data samples that are correlated with multiple information sources; hence, accurate database merging operations are desirable. Information systems designed for joining the related knowledge provided by different microbial data sources are hampered by the labeling mechanism for referencing microbial strains and cultures, which suffers from syntactic variation in the practical usage of the labels; in addition, synonymy and homonymy are known to exist among the labels. The situation is further complicated by the observation that the label equivalence knowledge is itself recorded in fragments across several data sources, which can be suspected of providing information that might be both incomplete and incorrect. This paper presents how extraction and integration of label equivalence information from several distributed data sources has led to the construction of a so-called integrated strain database, which helps to resolve most of the above problems. Given that information retrieved from autonomous resources might be overlapping, incomplete, and incorrect, much effort was devoted to the completion of missing information, the discovery of new associations between information objects, and the development and application of tools for error detection and correction.
Through a thorough evaluation of the different levels of incompleteness and incorrectness encountered within the incorporated data sources, we have finally given proof of the added value of the integrated strain database as a necessary service provider for the seamless integration of microbial information sources.
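To make the idea of integrating label equivalence information more concrete, the following is a minimal sketch and not the method of the paper. It assumes that each data source supplies pairs of strain labels it considers equivalent, that syntactic variants differ only in case and separator characters, and that equivalence may be merged transitively with a union-find structure. The names `normalize`, `LabelSets`, and `integrate`, as well as the labels in the usage note, are hypothetical.

```python
import re
from collections import defaultdict


def normalize(label: str) -> str:
    """Collapse simple syntactic variants (assumption: only case and
    separator characters such as spaces, hyphens, dots, underscores vary)."""
    return re.sub(r"[\s\-_.]+", "", label).upper()


class LabelSets:
    """Union-find over normalized labels (hypothetical helper class)."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        # Register unseen labels as their own representative.
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            # Path halving keeps the trees shallow.
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def integrate(sources):
    """Merge equivalence pairs from several sources into label groups.

    `sources` is an iterable of lists of (label, label) pairs, one list
    per data source. Returns a sorted list of sorted label groups.
    """
    sets = LabelSets()
    for pairs in sources:
        for a, b in pairs:
            sets.union(normalize(a), normalize(b))
    groups = defaultdict(set)
    for label in list(sets.parent):
        groups[sets.find(label)].add(label)
    return sorted(sorted(group) for group in groups.values())
```

For example, `integrate([[("COLA 12", "colb-7")], [("COLB 7", "COLC 3")]])` places the three made-up labels in one group even though no single source links the first to the third. Merging transitively in this naive way would also propagate any incorrect or homonymous link supplied by a source, which is why the abstract stresses error detection and correction.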