Genome-wide local ancestry inference in admixed individuals with scalable penalized nearest neighbor algorithm

Statistical Methods for Post Genomic Data (SMPGD) 2020, Pasteur Institute, Paris, France
Authors

Ghislain Durif

Julien Mairal

Michael Blum

Published

January 23, 2020

Keywords: Machine Learning, Optimization, Population genomics, Local ancestry inference, Python

Summary

In most eukaryotic species, genetic material is transmitted between generations through sexual reproduction. During this process, each individual inherits half of their genome from each of their parents. Because of genetic recombination, the genome of an individual is a non-uniform combination of the genetic material of their ancestors. This process directly affects the transmission of individual phenotypes and the evolution of species and populations.

During inter-population (or inter-species) breeding events, descendants inherit an admixture of genetic material from both source populations (or species). Genome-wide locus ancestry, i.e. the population of origin of each locus, can be studied with local ancestry inference (LAI). LAI is useful for characterizing the admixture events (time, proportion) in a species' history. Local ancestry inference can also be used to study biological adaptation and phenotypic variation, or to explore population-specific disease predisposition.

We present Loter2, a machine-learning-based library for genome-wide local ancestry inference, derived from Loter (https://github.com/bcm-uga/Loter). Our method uses a locus-based penalized nearest-neighbor-like approach to determine the ancestry of each locus in haplotypes from admixed individuals, using reference haplotypes of individuals from the different potential source populations. For each admixed haplotype, Loter2 seeks the closest reference haplotype in terms of SNP similarity. The problem is solved with an efficient and scalable dynamic programming algorithm (with linear complexity). Loter2 implements a specific penalized optimization scheme to account for (i) intra-population variability in the reference populations (with a penalty on switches between reference populations), (ii) phasing errors in haplotypes (by allowing switches between homologous haplotypes), and (iii) a priori locus similarity between admixed and reference populations (based on locus-specific supervised learning of local ancestry). In addition, we use a bagging technique to obtain more robust results and to avoid hyper-parameter tuning, which simplifies usage.
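The penalized nearest-neighbor idea described above can be illustrated with a minimal sketch. This is not the Loter2 implementation: the function name `infer_local_ancestry`, its arguments, the 0/1 allele encoding and the single `penalty` parameter are assumptions made for illustration. To keep it short, the sketch penalizes every change of copied reference haplotype and omits phasing-error correction, the locus-specific prior and bagging. The cost of one pass is proportional to the number of loci times the number of reference haplotypes.

```python
import numpy as np


def infer_local_ancestry(admixed, references, ref_labels, penalty=1.0):
    """Label each locus of one admixed haplotype with a source population.

    Hypothetical helper, not part of the Loter2 API.
    admixed    : (n_loci,) array of 0/1 alleles
    references : (n_refs, n_loci) array of 0/1 alleles
    ref_labels : (n_refs,) population label of each reference haplotype
    penalty    : cost paid each time the copied reference haplotype changes
    """
    admixed = np.asarray(admixed)
    references = np.asarray(references)
    ref_labels = np.asarray(ref_labels)
    n_refs, n_loci = references.shape

    # mismatch[k, j] = 1 if reference k differs from the admixed haplotype at locus j
    mismatch = (references != admixed[None, :]).astype(float)

    cost = mismatch[:, 0].copy()            # best cost of a path ending in each reference
    back = np.zeros((n_refs, n_loci), dtype=int)

    for j in range(1, n_loci):
        best = int(np.argmin(cost))          # cheapest reference to switch from
        switch_cost = cost[best] + penalty
        stay = cost <= switch_cost           # staying is at least as cheap as switching
        back[:, j] = np.where(stay, np.arange(n_refs), best)
        cost = np.where(stay, cost, switch_cost) + mismatch[:, j]

    # Backtrack the optimal sequence of copied reference haplotypes
    path = np.empty(n_loci, dtype=int)
    path[-1] = int(np.argmin(cost))
    for j in range(n_loci - 1, 0, -1):
        path[j - 1] = back[path[j], j]

    return ref_labels[path]


# Toy usage: two reference haplotypes from hypothetical populations "A" and "B"
refs = np.array([[0] * 8, [1] * 8])
labels = np.array(["A", "B"])
ancestry = infer_local_ancestry([0, 0, 0, 0, 1, 1, 1, 1], refs, labels, penalty=1.0)
```

A larger `penalty` favours long ancestry tracts with fewer switches, while a smaller one follows the closest reference more locally; this is the trade-off that averaging over several runs is meant to remove from the user's hands.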

Loter2 processes haplotype data, obtained either by in silico haplotype estimation (phasing) from SNP genotype data or directly by haplotype sequencing. For instance, in diploid species, each locus can be homozygous (two copies of the ancestral allele or two copies of the derived allele) or heterozygous (one ancestral and one derived allele). We used the phasing software Beagle (Browning & Browning, 2016) in the experiments.
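To make the distinction between genotype and haplotype data concrete, here is a small sketch, assuming alleles are coded 0 (ancestral) and 1 (derived). The helper names are hypothetical and do not come from Loter2 or Beagle. It shows why phasing is needed: a genotype only records the number of derived alleles per locus, so several haplotype pairs are compatible with it.

```python
import numpy as np


def genotype_from_haplotypes(hap_a, hap_b):
    """Number of derived alleles per locus: 0 or 2 = homozygous, 1 = heterozygous."""
    return np.asarray(hap_a) + np.asarray(hap_b)


def count_compatible_phasings(genotype):
    """Number of distinct unordered haplotype pairs that yield this genotype."""
    n_het = int(np.sum(np.asarray(genotype) == 1))
    # Each heterozygous locus can be oriented two ways; swapping the two
    # haplotypes as a whole gives the same pair, hence the exponent n_het - 1.
    return 1 if n_het == 0 else 2 ** (n_het - 1)


hap_a = [0, 1, 1, 0]
hap_b = [0, 0, 1, 1]
genotype = genotype_from_haplotypes(hap_a, hap_b)
n_phasings = count_compatible_phasings(genotype)
```

In this toy case the two heterozygous loci leave two possible haplotype pairs, and a phasing tool has to choose between them, which is where phasing errors can arise.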

We assess the performance of Loter2 and compare it with state-of-the-art approaches for local ancestry inference on simulated genotype data, generated with the software msprime (Kelleher et al., 2016), using human chromosome recombination maps and realistic scenarios of admixture events in human history.