# Machine Learning for Evolutionary Genetics

Dr. Dan Schrider

Evolutionary phenomenon > altered genealogy > how will this show up in the sequence alignment?

The way we actually do this is in reverse: sequence alignment > capture properties of the genealogy > infer the presence of an evolutionary phenomenon

PopGen stat - Pi!

**Why use one stat when you can use many? More statistics - more info about the genealogy - more power!**

### Genetic Diversity within populations

#### Sample of genomes (n) is usually much smaller than N

- randomly sample a subset of individuals from the population
- but what's a population? "group of randomly mating individuals without subdivisions", but this really doesn't exist in nature
- subtle violations of this assumption might be fine
- How big should your sample be? 10s to 100s is fine for some things; for others you want more

#### Type of sequencing:

Typically using Illumina, typically limited to SNPs

Harder to detect indels, and harder still to detect structural variants

### Outline of Talk

#### 1. General framework for likelihood-free population genetic inference

## Positive selection (mutation and selection):

Selective sweeps skew patterns of diversity

- **Reduces diversity (pi)** - in areas linked to the beneficial mutation
- Number of sequences vs # of distinct haplotypes - fewer and fewer distinct haplotypes
- **SFS**: excess of RARE variants and excess of common (high-frequency derived) variants
- **Excess LD**

Hard sweeps vs soft sweeps (selection on standing variation)

Dan made a tool called ***Soft/Hard Inference through Classification (S/HIC)*** to detect different types of selective sweeps

Many stats are better than 1 stat, but how can NO statistics be better than many stats?
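As a reference point for what "many statistics" means here, this is a minimal sketch of three of the summaries named above (pi, number of distinct haplotypes, and the SFS), computed from a toy alignment. The function names and the 0/1 string encoding (0 = ancestral, 1 = derived) are assumptions made for illustration, not S/HIC's actual feature set or code.

```python
from collections import Counter
from itertools import combinations


def nucleotide_diversity(haps):
    """pi: mean number of differences over all pairs of sequences."""
    pairs = list(combinations(haps, 2))
    total = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    return total / len(pairs)


def distinct_haplotypes(haps):
    """Number of unique sequences in the sample."""
    return len(set(haps))


def site_frequency_spectrum(haps):
    """Unfolded SFS: entry k-1 counts sites whose derived allele is in k sequences."""
    n = len(haps)
    n_sites = len(haps[0])
    derived_counts = Counter(
        sum(int(h[j]) for h in haps) for j in range(n_sites)
    )
    return [derived_counts.get(k, 0) for k in range(1, n)]


# Toy alignment (hypothetical data): 5 sequences x 6 segregating sites.
haps = ["100100", "100011", "010011", "001011", "100011"]

pi = nucleotide_diversity(haps)
n_haps = distinct_haplotypes(haps)
sfs = site_frequency_spectrum(haps)
print(pi, n_haps, sfs)
```

In this toy sample the SFS has most of its mass at the two ends (singletons and variants present in 4 of 5 sequences), which is the shape described in the SFS bullet above. A real analysis would compute these in windows along the genome and compare them against simulations.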
Skip the genealogy inference (just sequence alignment -> infer evolution)

## Introgression

- Speciation followed by gene flow
- Example of adaptive introgression (gene flow and mimicry in Heliconius)
- Example with simulans and sechellia

dmin = look at all cross-species pairs and see how diverged they are (branch lengths), then take the shortest branch length, i.e. the one between the two most similar individuals

FILET - uses a lot of statistics and can detect introgression over a much greater range of the parameter space (proportion of individuals involved in the migration event, by when it occurred)

But STILL, a CNN (this one for detecting introgression is called UNET) is better than FILET

#### 2. Deep learning for population genomic time-series

How much better can we do if we take repeated samplings from the same population?

Tracking haplotype frequencies over time

#### 3. Adventures in phylogenetic inference

Incomplete lineage sorting! :)
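Going back to dmin from the introgression section: a minimal sketch of the idea, assuming aligned sequences of equal length and using the raw count of differences as a stand-in for branch length. The function names and toy sequences are hypothetical, and this is not FILET's implementation.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def d_min(pop1, pop2):
    """Smallest divergence over all cross-species pairs of sequences."""
    return min(hamming(a, b) for a, b in product(pop1, pop2))


def mean_divergence(pop1, pop2):
    """Average divergence over all cross-species pairs, for comparison."""
    dists = [hamming(a, b) for a, b in product(pop1, pop2)]
    return sum(dists) / len(dists)


# Toy data (hypothetical): the third sequence in species_a looks like
# it came from species_b, as an introgressed haplotype would.
species_a = ["AAAAAAAA", "AAAAAAAT", "CCCCCCCA"]
species_b = ["CCCCCCCC", "CCCCCCCG", "CCCCCCGC"]

shortest = d_min(species_a, species_b)
average = mean_divergence(species_a, species_b)
print(shortest, average)
```

A dmin far below the average cross-species divergence is the kind of signal that suggests one haplotype crossed the species boundary recently, whereas the average alone would be dominated by the non-introgressed pairs.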