Knowledge accumulation and resolution of data inconsistencies during the integration of microbial information sources

27 June 2005

journal article
Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Knowledge and Data Engineering

Vol. 17 (8), 1111-1126
https://doi.org/10.1109/tkde.2005.131

Abstract

The Internet has become a growing environment of multiple heterogeneous and autonomous data sources that contain relevant but overlapping information on microorganisms. Microbiologists could therefore benefit considerably from intelligent software agents that assist in navigating this information-rich environment, together with data mining tools that aid in the discovery of new information. These applications depend heavily on well-conditioned data samples that are correlated across multiple information sources; hence, accurate database merging operations are desirable. Information systems designed to join the related knowledge provided by different microbial data sources are hampered by the labeling mechanism used to reference microbial strains and cultures: labels show syntactic variation in practical usage, and synonymy and homonymy are also known to exist among them. The situation is further complicated by the fact that label equivalence knowledge is itself recorded in fragments across several data sources, any of which may provide incomplete or incorrect information. This paper presents how the extraction and integration of label equivalence information from several distributed data sources led to the construction of a so-called integrated strain database, which helps to resolve most of these problems. Because information retrieved from autonomous resources may be overlapping, incomplete, and incorrect, considerable effort went into completing missing information, discovering new associations between information objects, and developing and applying tools for error detection and correction. Through a thorough evaluation of the different levels of incompleteness and incorrectness encountered within the incorporated data sources, we demonstrate the added value of the integrated strain database as a necessary service provider for the seamless integration of microbial information sources.
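
The label problem described in the abstract can be illustrated with a minimal sketch. It is not the authors' method: it assumes that strain labels take the form "collection acronym plus number", that syntactic variants differ only in case, spacing, and separator characters, and that equivalences reported by different sources can be merged transitively with a union-find structure. The function names (`normalize`, `merge_equivalences`) and all strain numbers are hypothetical and chosen for illustration only.

```python
import re
from collections import defaultdict


def normalize(label: str) -> str:
    """Map syntactic variants of a label to one canonical spelling.

    Assumption: a label is an acronym followed by a number, possibly
    separated by whitespace or one of - _ : . (e.g. "atcc-5678").
    """
    match = re.fullmatch(r"\s*([A-Za-z]+)[\s\-_:.]*(\d+)\s*", label)
    if match is None:
        # Unknown pattern: only uppercase and collapse whitespace.
        return " ".join(label.upper().split())
    return f"{match.group(1).upper()} {match.group(2)}"


class UnionFind:
    """Disjoint-set structure over label strings."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            # Path halving keeps the trees shallow.
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def merge_equivalences(pairs):
    """Merge pairwise label equivalences from several sources into groups."""
    uf = UnionFind()
    for a, b in pairs:
        uf.union(normalize(a), normalize(b))
    groups = defaultdict(set)
    for label in list(uf.parent):
        groups[uf.find(label)].add(label)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical equivalence statements, as if taken from two sources.
source_a = [("LMG 1234", "ATCC-5678")]
source_b = [("atcc5678", "DSM 42"), ("CCUG 99", "NCTC:7")]
merged = merge_equivalences(source_a + source_b)
```

Note that a purely transitive merge like this one propagates any incorrect equivalence to the whole group and cannot tell homonyms apart, which is consistent with the abstract's emphasis on error detection and correction.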

This publication has 43 references indexed in Scilit.