# Machine Learning for Evolutionary Genetics
Dr. Dan Schrider
Evolutionary phenomenon > Altered genealogy > How will this show up in the sequence alignment?
The way we actually do this is going in reverse:
Sequence alignment > capture properties of genealogy > infer presence of evolutionary phenomenon
PopGen stat - π (nucleotide diversity)!
**Why use one stat when you can use many? More statistics - more info about genealogy - more power!**
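As a concrete reference point for π, here is a minimal sketch that computes the average number of pairwise differences between sampled sequences. The function name and the toy alignment are made up for illustration, and it assumes equal-length, gap-free sequences.

```python
from itertools import combinations

def nucleotide_diversity(haplotypes):
    """Average number of differences per pair of sequences (pi, not per-site)."""
    pairs = list(combinations(haplotypes, 2))
    total_diffs = sum(
        sum(a != b for a, b in zip(x, y))  # differences within one pair
        for x, y in pairs
    )
    return total_diffs / len(pairs)

# Toy alignment (hypothetical data): 4 sequences, 4 sites
sample = ["AACA", "ATCA", "ATGA", "TTGA"]
pi = nucleotide_diversity(sample)
pi_per_site = pi / len(sample[0])
```

Dividing by the number of sites gives the per-site value that is usually reported.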
### Genetic Diversity within populations
#### Sample of genomes (n) is usually much smaller than N
- randomly sample subset of individuals from population
- but what's a population? "A group of randomly mating individuals without subdivision" - but this doesn't really exist in nature
- subtle violations of this assumption might be fine
- How big should your sample be? 10s to 100s fine for some things, others you want more
#### Type of sequencing:
Typically using Illumina, typically limited to SNPs
Indels are harder to detect, and structural variants harder still
### Outline of Talk
#### 1. General framework for likelihood-free population genetic inference
## Positive selection (mutation and selection): Selective sweeps skew patterns of diversity
- **Reduces diversity (pi)** - in regions linked to the beneficial mutation
- **Fewer distinct haplotypes** - the number of distinct haplotypes drops relative to the number of sequences sampled
- **SFS** (site frequency spectrum): excess of rare variants and excess of high-frequency derived variants
- **Excess LD**
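The signatures listed above can each be measured from a window of haplotypes. The sketch below is illustrative only: the function names and the toy window are assumptions, alleles are coded as `"0"` (ancestral) and `"1"` (derived), and `r_squared` assumes both sites are polymorphic.

```python
def distinct_haplotypes(haps):
    """Number of distinct haplotypes in the window."""
    return len(set(haps))

def site_frequency_spectrum(haps):
    """Unfolded SFS: sfs[k-1] = number of sites with k derived copies."""
    n = len(haps)
    sfs = [0] * (n - 1)
    for site in zip(*haps):          # iterate over alignment columns
        k = site.count("1")
        if 0 < k < n:                # skip monomorphic sites
            sfs[k - 1] += 1
    return sfs

def r_squared(haps, i, j):
    """LD between sites i and j as r^2 (both sites must be polymorphic)."""
    n = len(haps)
    p_i = sum(h[i] == "1" for h in haps) / n
    p_j = sum(h[j] == "1" for h in haps) / n
    p_ij = sum(h[i] == "1" and h[j] == "1" for h in haps) / n
    d = p_ij - p_i * p_j
    return d * d / (p_i * (1 - p_i) * p_j * (1 - p_j))

# Toy window (hypothetical data): 5 haplotypes, 4 sites
window = ["1100", "1100", "1100", "0010", "0011"]
n_haps = distinct_haplotypes(window)
sfs = site_frequency_spectrum(window)
ld = r_squared(window, 0, 1)
```

Computing several such statistics per window gives the kind of multi-statistic feature vector a classifier can be trained on.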
Hard sweeps vs soft sweeps (selection on standing variation). Dan made a tool called ***Soft/Hard Inference through Classification (S/HIC)*** to detect different types of selective sweeps
Many stats are better than one stat, but how can NO statistics be better than many stats?
Skip the genealogy inference (just sequence alignment -> infer evolution)
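One way to picture "no statistics" is that the alignment itself becomes the model input. The sketch below is an assumed encoding, not the one used in the talk: it converts an alignment into a 0/1 matrix (major allele = 0, anything else = 1), the kind of image-like array a CNN could consume. The function name and data are hypothetical.

```python
from collections import Counter

def alignment_to_matrix(alignment):
    """Encode an alignment as rows = sequences, columns = sites, values 0/1."""
    columns = []
    for site in zip(*alignment):
        major = Counter(site).most_common(1)[0][0]   # most frequent base at this site
        columns.append([0 if base == major else 1 for base in site])
    # transpose back so that each row is one sequence
    return [list(row) for row in zip(*columns)]

# Toy alignment (hypothetical data)
alignment = ["AACA", "ATCA", "ATGA", "TTCA"]
matrix = alignment_to_matrix(alignment)
```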
## Introgression
- Speciation followed by gene flow
- Example of adaptive introgression: gene flow of mimicry alleles in *Heliconius*
- Example with *Drosophila simulans* and *D. sechellia*
dmin = look at all cross-species pairs of sequences, see how diverged each pair is (branch lengths), and take the smallest divergence, i.e. the distance between the two most similar individuals from different species
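A minimal sketch of the dmin idea, using the raw number of differences as a stand-in for branch length. The function names and sequences are hypothetical, and sequences are assumed to be aligned and of equal length.

```python
def hamming(a, b):
    """Number of sites at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def d_min(pop_a, pop_b):
    """Smallest divergence over all cross-species sequence pairs."""
    return min(hamming(a, b) for a in pop_a for b in pop_b)

# Toy data (hypothetical): two sequences from each of two species
species_1 = ["ACGTAC", "ACGTTC"]
species_2 = ["TCGAAG", "ACGTTG"]
closest = d_min(species_1, species_2)
```

An unusually small value in a window suggests that at least one sequence in one species is closely related to a sequence in the other, as expected after recent gene flow.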
FILET - uses a lot of statistics and can detect introgression across a much greater range of the parameter space, defined by:
- the proportion of individuals involved in the migration event
- when the migration event occurred
But STILL, a CNN (the one for detecting introgression is a U-Net) is better than FILET
#### 2. Deep learning for population genomic time-series
How much better can we do if we take repeated samples from the same population?
Tracking haplotype frequencies over time
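To make "tracking haplotype frequencies over time" concrete, here is a small sketch that turns repeated samples into one frequency trajectory per haplotype. The function name and the toy time series are assumptions for illustration.

```python
from collections import Counter

def haplotype_trajectories(samples_by_time):
    """Map each haplotype to its list of frequencies, one per time point."""
    all_haps = sorted({h for sample in samples_by_time for h in sample})
    trajectories = {h: [] for h in all_haps}
    for sample in samples_by_time:
        counts = Counter(sample)
        for h in all_haps:
            trajectories[h].append(counts[h] / len(sample))
    return trajectories

# Toy time series (hypothetical data): 3 time points, 4 haplotypes sampled at each
time_series = [
    ["00", "00", "01", "11"],
    ["00", "01", "11", "11"],
    ["11", "11", "11", "01"],
]
traj = haplotype_trajectories(time_series)
```

Stacking these trajectories gives a time-by-haplotype array that could serve as input to a model.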
#### 3. Adventures in phylogenetic inference
Incomplete lineage sorting! :)