Biological Sequence Simulation for Testing Complex Evolutionary Hypotheses: indel-Seq-Gen Version 2.0
Open Access
- 3 August 2009
- journal article
- research article
- Published by Oxford University Press (OUP) in Molecular Biology and Evolution
- Vol. 26 (11), 2581-2593
- https://doi.org/10.1093/molbev/msp174
Abstract
Sequence simulation is an important tool in validating biological hypotheses as well as testing various bioinformatics and molecular evolutionary methods. Hypothesis testing relies on the representational ability of the sequence simulation method. Simple hypotheses are testable through simulation of random, homogeneously evolving sequence sets. However, testing complex hypotheses, for example, local similarities, requires simulation of sequence evolution under heterogeneous models. To this end, we previously introduced indel-Seq-Gen version 1.0 (iSGv1.0; indel, insertion/deletion). iSGv1.0 allowed heterogeneous protein evolution and motif conservation as well as insertion and deletion constraints in subsequences. Despite these advances, neither iSGv1.0 nor other currently available sequence simulation methods are sufficient for complex hypothesis testing. indel-Seq-Gen version 2.0 (iSGv2.0) aims at simulating evolution of highly divergent DNA sequences and protein superfamilies. iSGv2.0 improves upon iSGv1.0 through the addition of lineage-specific evolution, motif conservation using PROSITE-like regular expressions, indel tracking, subsequence-length constraints, as well as coding and noncoding DNA evolution. Furthermore, we formalize the sequence representation used for iSGv2.0 and uncover a flaw in the modeling of indels used in current state-of-the-art methods, which biases simulation results for hypotheses involving indels. We fix this flaw in iSGv2.0 by using a novel discrete stepping procedure. Finally, we present an example simulation of the calycin-superfamily sequences and compare the performance of iSGv2.0 with iSGv1.0 and a random model of sequence evolution.
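To make the idea of motif conservation with PROSITE-like regular expressions more concrete, here is a minimal Python sketch. It is not the iSGv2.0 implementation, and the abstract does not describe iSGv2.0's actual pattern syntax or acceptance rule. The function names `prosite_to_regex` and `motif_preserved` are hypothetical. The sketch assumes the common PROSITE conventions: elements separated by `-`, `x` for any residue, `[...]` for allowed residues, `{...}` for forbidden residues, `(n)` or `(n,m)` for repeats, and `<` / `>` for terminal anchors. Rarer forms, such as an anchor inside brackets, are not handled.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Illustrative only: covers the common PROSITE elements, not every
    corner of the syntax.
    """
    pattern = pattern.strip().rstrip(".")  # PROSITE patterns end with '.'
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):  # anchor at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):  # anchor at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (3) or (2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.group(1), match.group(2), match.group(3)
        if core == "x":  # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):  # forbidden residues
            core = "[^" + core[1:-1] + "]"
        # A single residue or an [ABC] class is already valid regex syntax.
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def motif_preserved(sequence: str, pattern: str) -> bool:
    """Return True if the sequence still contains a match for the pattern."""
    return re.search(prosite_to_regex(pattern), sequence) is not None


# PROSITE's N-glycosylation site pattern, used here as a familiar example.
N_GLYCOSYLATION = "N-{P}-[ST]-{P}."
print(prosite_to_regex(N_GLYCOSYLATION))
print(motif_preserved("AANGSAA", N_GLYCOSYLATION))
print(motif_preserved("AANPSAA", N_GLYCOSYLATION))
```

A simulator could, under this assumption, reject or resample any proposed substitution or indel for which `motif_preserved` turns false. That is only one possible way of enforcing such a constraint.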