Reconstructing Ancestral Haplotypes with a Dictionary Model

1 April 2006

journal article
research article
Published by Mary Ann Liebert Inc in Journal of Computational Biology

Vol. 13 (3) , 767-785
https://doi.org/10.1089/cmb.2006.13.767

Abstract

We propose a dictionary model for haplotypes. According to the model, a haplotype is constructed by randomly concatenating haplotype segments from a given dictionary of segments. A haplotype block is defined as a set of haplotype segments that begin and end with the same pair of markers. In this framework, haplotype blocks can overlap, and the model provides a setting for testing the accuracy of simpler models invoking only nonoverlapping blocks. Each haplotype segment in a dictionary has an assigned probability and alternate spellings that account for genotyping errors and mutation. The model also allows for missing data, unphased genotypes, and prior distributions on the parameters. Likelihood evaluations rely on forward and backward recurrences similar to the ones encountered in hidden Markov models. Parameter estimation is carried out with an EM algorithm. The search for the optimal dictionary is particularly difficult because of the variable dimension of the model space. We define a minimum description length criterion to evaluate each dictionary and use a combination of greedy search and careful initialization to select the best dictionary for a given dataset. Application of the model to simulated data gives encouraging results. In a real dataset, we are able to reconstruct a parsimonious dictionary that captures patterns of linkage disequilibrium well.
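The forward recurrence mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes fully phased haplotypes written as allele strings, segments anchored at a fixed starting marker, segment probabilities that sum to one for each starting marker, and no alternate spellings or missing data. The names `haplotype_likelihood` and `example_dictionary`, and all probabilities, are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# A segment is (start marker index, allele string, probability).
# This representation is an assumption made for illustration only.
Segment = Tuple[int, str, float]


def haplotype_likelihood(haplotype: str, dictionary: List[Segment]) -> float:
    """Sum the probabilities of all ways to spell `haplotype` by
    concatenating dictionary segments, using a forward recurrence."""
    n = len(haplotype)

    # Group segments by the marker at which they start.
    by_start: Dict[int, List[Tuple[str, float]]] = defaultdict(list)
    for start, alleles, prob in dictionary:
        if alleles:  # skip empty segments so every step moves forward
            by_start[start].append((alleles, prob))

    # forward[i] = probability of generating the first i markers such that
    # a segment boundary falls exactly at position i.
    forward = [0.0] * (n + 1)
    forward[0] = 1.0
    for start in range(n):
        if forward[start] == 0.0:
            continue  # no concatenation reaches this boundary
        for alleles, prob in by_start.get(start, []):
            end = start + len(alleles)
            if end <= n and haplotype[start:end] == alleles:
                forward[end] += forward[start] * prob
    return forward[n]


# Hypothetical dictionary over four markers. The segment spanning markers
# 0-3 overlaps the two shorter blocks (markers 0-1 and markers 2-3).
example_dictionary: List[Segment] = [
    (0, "01", 0.5),
    (0, "00", 0.3),
    (0, "0110", 0.2),
    (2, "10", 0.7),
    (2, "11", 0.3),
]

if __name__ == "__main__":
    # Two spellings contribute: "01" + "10" and the single segment "0110".
    print(haplotype_likelihood("0110", example_dictionary))
```

In this toy dictionary the haplotype `0110` can be spelled two ways, so its likelihood is the sum of two path probabilities, which is what lets overlapping blocks coexist in one model. A backward recurrence run from the last marker would give the complementary quantities needed for the expected segment counts in an EM update.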
