A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes

Open Access

31 October 2008

journal article
research article
Published by Springer Nature in BMC Genomics

Vol. 9 (1) , 1-18
https://doi.org/10.1186/1471-2164-9-517

Abstract

Background: The challenges of accurate gene prediction and enumeration are further aggravated in large genomes that contain highly repetitive transposable elements (TEs). Yet TEs play a substantial role in genome evolution and are themselves an important subject of study. Repeat annotation based on counting occurrences of k-mers has previously been used to distinguish TEs from low-copy genic regions, but currently available software solutions are impractical because of high memory requirements or specialization for specific user tasks.

Results: Here we introduce the Tallymer software, a flexible and memory-efficient collection of programs for k-mer counting and indexing of large sequence sets. Unlike previous methods, Tallymer is based on enhanced suffix arrays, which allows much greater flexibility in the choice of k-mer size. Tallymer can process data sets of several billion bases. We used it in a variety of applications to study the genomes of maize and other plant species. In particular, Tallymer was used to index a set of whole genome shotgun sequences from maize (B73) (total size 10⁹ bp). We analyzed k-mer frequencies for a wide range of k. At this low genome coverage (≈ 0.45×), highly repetitive 20-mers constituted 44% of the genome but represented only 1% of all possible k-mers. Similarly low complexity was seen in the repeat fractions of sorghum and rice. When applying our method to other maize data sets, High-C₀t derived sequences showed the greatest enrichment for low-copy sequences. Among annotated TEs, the most highly repetitive were of the Ty3/gypsy class of retrotransposons, followed by the Ty1/copia class and DNA transposons. Among expressed sequence tags (ESTs), a notable fraction contained high-copy k-mers, suggesting that transposons are still active in maize. Retrotransposons in Mo17 and McC cultivars were readily detected using the B73 20-mer frequency index, indicating their conservation despite extensive rearrangement across cultivars. Among one hundred annotated bacterial artificial chromosomes (BACs), k-mer frequency could be used to detect transposon-encoded genes with 92% sensitivity, compared to 96% using alignment-based repeat masking, while both methods showed 92% specificity.

Conclusion: The Tallymer software was effective in a variety of applications to aid genome annotation in maize, despite limitations imposed by the relatively low coverage of sequence available. For more information on the software, see http://www.zbh.uni-hamburg.de/Tallymer.
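To make the idea of k-mer frequency analysis concrete, the following is a minimal Python sketch. It is not Tallymer's method: Tallymer is based on enhanced suffix arrays, whereas this illustration uses a simple in-memory hash table, which would not scale to billions of bases. The function names (`count_kmers`, `repeat_summary`), the toy reads, and the copy-number threshold are all hypothetical and chosen only for illustration; only the forward strand is counted.

```python
from collections import Counter

VALID_BASES = set("ACGT")


def count_kmers(seqs, k):
    """Count every k-mer occurrence across a collection of sequences.

    Illustrative only: a hash-based counter, not a suffix-array index.
    Windows containing characters other than A, C, G, T are skipped.
    """
    counts = Counter()
    for seq in seqs:
        s = seq.upper()
        for i in range(len(s) - k + 1):
            kmer = s[i:i + k]
            if set(kmer) <= VALID_BASES:
                counts[kmer] += 1
    return counts


def repeat_summary(counts, min_copies):
    """Summarize how much of the data is made up of high-copy k-mers.

    occurrence_fraction: share of all k-mer occurrences whose k-mer
        appears at least min_copies times (a proxy for the share of
        sequence that is repetitive).
    distinct_fraction: share of distinct k-mers that reach min_copies.
    """
    total_occurrences = sum(counts.values())
    high = {kmer: c for kmer, c in counts.items() if c >= min_copies}
    high_occurrences = sum(high.values())
    return {
        "occurrence_fraction": (
            high_occurrences / total_occurrences if total_occurrences else 0.0
        ),
        "distinct_fraction": len(high) / len(counts) if counts else 0.0,
    }


# Hypothetical toy input: one repetitive read and one unique read.
toy_reads = ["ACACACACAC", "GATTACA"]
toy_counts = count_kmers(toy_reads, k=3)
toy_summary = repeat_summary(toy_counts, min_copies=4)
```

In this toy input, a small number of distinct high-copy k-mers accounts for most k-mer occurrences. That is the same kind of contrast the abstract reports for maize, where highly repetitive 20-mers made up a large share of the sequence but only a small share of the k-mers.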
