Automated SNP detection from a large collection of white spruce expressed sequences: contributing factors and approaches for the categorization of SNPs
Open Access
- 6 July 2006
- journal article
- research article
- Published by Springer Nature in BMC Genomics
- Vol. 7 (1) , 174
- https://doi.org/10.1186/1471-2164-7-174
Abstract
High-throughput genotyping technologies represent a highly efficient way to accelerate genetic mapping and enable association studies. As a first step toward this goal, we aimed to develop a resource of candidate Single Nucleotide Polymorphisms (SNPs) in white spruce (Picea glauca [Moench] Voss), a softwood tree of major economic importance. A white spruce SNP resource encompassing 12,264 SNPs was constructed from a set of 6,459 contigs derived from Expressed Sequence Tags (ESTs) by using the Bayesian-based statistical software PolyBayes. Several parameters influencing SNP prediction were analysed, including the a priori expected polymorphism, the probability score (PSNP), and the contig depth and length. SNP detection in 3' and 5' reads from the same clones revealed a level of inconsistency between overlapping sequences as low as 1%. A subset of 245 predicted SNPs was verified through the independent resequencing of genomic DNA of a genotype also used to prepare cDNA libraries. The validation rate reached a maximum of 85% for SNPs predicted with either PSNP ≥ 0.95 or PSNP ≥ 0.99. A total of 9,310 SNPs were detected by using PSNP ≥ 0.95 as a criterion. The SNPs were distributed among 3,590 contigs encompassing an array of broad functional categories, with an overall frequency of 1 SNP per 700 nucleotide sites. Experimental and statistical approaches were used to evaluate the proportion of paralogous SNPs, with estimates in the range of 8 to 12%. The 3,789 coding SNPs identified through coding region annotation and ORF prediction were distributed into 39% nonsynonymous and 61% synonymous substitutions. Overall, there were 0.9 SNPs per 1,000 nonsynonymous sites and 5.2 SNPs per 1,000 synonymous sites, for a genome-wide nonsynonymous to synonymous substitution rate ratio (Ka/Ks) of 0.17.
We integrated the SNP data in the ForestTreeDB database along with functional annotations to provide a tool facilitating the choice of candidate genes for mapping purposes or association studies.
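The arithmetic behind two figures in the abstract can be sketched in a few lines of Python. This is only an illustration, not the authors' pipeline: the function names and the record layout (a dictionary with a `psnp` key) are assumptions made here, and the sketch does not reproduce how PolyBayes computes its scores. It shows how the reported Ka/Ks ratio follows from the two per-site SNP rates, and how a PSNP threshold such as 0.95 would be applied to a list of candidate SNPs.

```python
# Per-1,000-site SNP rates as quoted in the abstract.
NONSYN_SNPS_PER_1000_SITES = 0.9
SYN_SNPS_PER_1000_SITES = 5.2


def ka_ks_ratio(nonsyn_rate: float, syn_rate: float) -> float:
    """Ratio of nonsynonymous to synonymous SNP rates (same units for both)."""
    if syn_rate <= 0:
        raise ValueError("synonymous rate must be positive")
    return nonsyn_rate / syn_rate


def filter_by_psnp(candidates, threshold=0.95):
    """Keep candidate SNPs whose probability score meets the threshold.

    `candidates` is a hypothetical list of dicts with a 'psnp' key;
    this layout is an assumption, not the PolyBayes output format.
    """
    return [snp for snp in candidates if snp["psnp"] >= threshold]


if __name__ == "__main__":
    ratio = ka_ks_ratio(NONSYN_SNPS_PER_1000_SITES, SYN_SNPS_PER_1000_SITES)
    print(f"Ka/Ks = {ratio:.2f}")

    # Made-up example records, for illustration only.
    demo = [
        {"contig": "c1", "pos": 120, "psnp": 0.99},
        {"contig": "c1", "pos": 345, "psnp": 0.80},
        {"contig": "c2", "pos": 57, "psnp": 0.95},
    ]
    print(len(filter_by_psnp(demo)), "of", len(demo), "candidates retained")
```

Dividing 0.9 by 5.2 gives roughly 0.173, which rounds to the 0.17 reported in the abstract.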