MemPanG25 - Day 2b - Implicit pangenomics and pangenome genotyping

# MemPanG25 - Day 2b - Implicit pangenomics and pangenome genotyping ## REMEMBER: How to download files on your computer **ON YOUR COMPUTER**, not into the machine you've logged in, open another terminal and run the following command after specifying your user, the IP of your virtual machine, and the path(s) of the file(s) to download: scp -i mempang25_users.pem user@XX.XX.XX.XX:/path/of/your/file/filename.ext . ## implicit pangenome graphs ### Learning objectives - understand what we mean with 'implicit' - understand why we need implicit pangenome grpahs - work with the implicit graph model ### Getting started ### Primates MHC We already prepared the latest and greates assemblies of several different primates. You can find their chromosomes that are homologous to human chromosome 6: cd $DIR_MEMPANG25 mkdir -p day2_impg/assemblies cd day2_impg/assemblies #wget https://garrisonlab.s3.amazonaws.com/teaching/primates16.hsa6.fa.gz cp $DIR_MEMPANG25/Share/day2_impg/primates16.hsa6.fa.gz . samtools faidx primates16.hsa6.fa.gz To work with the implicit graph model, we need to align the assemblies against the genome to use as target to query the pangenome alignment. We will use CHM13 as a target genome: cd $DIR_MEMPANG25/day2_impg mkdir alignments # IT TAKES TIME wfmash assemblies/primates16.hsa6.fa.gz --target-prefix chm13 -t 32 -p 95 > alignments/primates16-vs-chm13.hsa6.paf # ALIGNMENTS ARE ALSO HERE, JUST IN CASE! #cp $DIR_MEMPANG25/Share/day2_impg/primates16-vs-chm13.hsa6.paf alignments What are we asking wfmash with the `--target-prefix` parameter? Query the MHC region from the pangenome alignment with the `impg query` command: cd $DIR_MEMPANG25/day2_impg impg query --paf-files alignments/primates16-vs-chm13.hsa6.paf -r chm13#1#chr6:28385000-33300000 > primates16.MHC.bed The first time the commands takes a bit longer because it generates an index for the PAF file. Take a look at the result: cat primates16.MHC.bed What do you think? <details> <summary>Click me for the answer</summary> The ranges are fragmented. This can be expected, given the alignment parameter (`-p 95`) and the sequence diversity in such a region. </details> Let's try to merge intervals closer than 500kbps. `impg query` merges only contiguous intervals (distance 0) by default. Run impg query --help to check how to ask it to merge intervals closer than 500kps. <details> <summary>Click me for the answer</summary> impg query --paf-files alignments/primates16-vs-chm13.hsa6.paf -r chm13#1#chr6:28385000-33300000 -d 500000 > primates16.MHC.d500kbp.bed </details> Check how the result changed. Prepare a bgzipped FASTA file `primates16.MHC.d500kbp.fa.gz` containing the sequences from the ranges specified in the `primates16.MHC.bed`. Hint: you can use `samtools` or `bedtools` for that. Now let's run PGGB with this new FASTA file to get an "explicit" pangenome graph for the MHC region: cd $DIR_MEMPANG25/day2_impg pggb -i primates16.MHC.d500kbp.fa.gz -o pggb.primates16.MHC.d500kbp -G 1100 -t 32 You will get an error because. Try to understand why by checking [PanSN-spec](https://github.com/pangenome/PanSN-spec). <details> <summary>Click me for the answer</summary> Sequence names like `hg002#M#chr6_MATERNAL...` do not match the Pangenome Sequence Naming (PanSN). Indeed, the HPRC decided to restrict the `haplotype_id` to be an integer. found at this locus. </details> To continue, simply specify the number of haplotypes (otherwise, fix the error by changing the `haplotype_id` for the `hg002` sample): cd $DIR_MEMPANG25/day2_impg # IT TAKES TIME pggb -i primates16.MHC.d500kbp.fa.gz -o pggb.primates16.MHC.d500kbp -G 1100 -t 32 -n 16 # PGGB'S OUTPUT IS ALSO HERE, JUST IN CASE! #cp -r $DIR_MEMPANG25/Share/day2_impg/pggb.primates16.MHC.d500kbp . Take a look at the results. See if there are inverted sequences. Can you see the C4 locus? Extract it from the built pangenome graph and visualize it. Is it the same the C4 region seen in the human pangenome? <details> <summary>Click me for the answer</summary> Similar, but not the same. The diverged small loop inside the structural variant suggests recurrent insertion of the endogenous retrovirus found at this locus. </details> BONUS: try to inject and untangle the C4 annotations, making a gggenes plot of them. ## Haplotype deconvolution ### Learning objectives - understand how we leverage the pangenome to genotype samples - genotype short-read samples ### Getting started ### C4 We are going to genotype short read samples in the C4 region. Let's prepare a BED file with its coordinate on hg38: cd $DIR_MEMPANG25 mkdir -p day2_genotyping/regions_of_interest echo -e "chr6\t31972057\t32055418" > day2_genotyping/regions_of_interest/C4.bed Prepare also the reference genome. We have already downloaded the full `grch38` genome. To save time for later steps, let's prepare the a FASTA file with only `chr6`, the chromosome where our region of interest is: cd $DIR_MEMPANG25/day2_genotyping mkdir reference cd reference # DO NOT RUN! WE HAVE ALREADY PREPARED THE FULL REFERENCE #wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa #wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa.fai # Let's prepare chr6 to save time later samtools faidx $DIR_MEMPANG25/Share/day2_genotyping/GRCh38_full_analysis_set_plus_decoy_hla.fa chr6 > GRCh38_full_analysis_set_plus_decoy_hla.chr6.fa samtools faidx GRCh38_full_analysis_set_plus_decoy_hla.chr6.fa We will use year 1 assemblies from the Human Pangenome Reference Consortium. For simplicity, we will download the PGGB pangenome graph of chromosome 6 and get the contigs in FASTA format with ODGI. To save time, we will work on a few samples: cd $DIR_MEMPANG25/day2_genotyping mkdir -p pangenomes cd pangenomes # DO NOT RUN IT! WE HAVE ALREAD DOWNLOADED THE GRAPH AND PREPARED CONTIGS FOR A SUBSET OF SAMPLES # Download chromosome 6 pangenome graph # wget https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/scratch/2021_11_16_pggb_wgg.88/chroms/chr6.pan.fa.a2fb268.4030258.6a1ecc2.smooth.gfa.gz # gunzip chr6.pan.fa.a2fb268.4030258.6a1ecc2.smooth.gfa.gz # Convert the graph to FASTA file while fixing PanSN in grch38 and chm13 # odgi paths -i chr6.pan.fa.a2fb268.4030258.6a1ecc2.smooth.gfa -f -t 16 -P | sed -e 's/chm13#chr/chm13#1#chr/g' -e 's/grch38#chr/grch38#1#chr/g' > chr6.pan.fa # samtools faidx chr6.pan.fa # Prepare a smaller pangenome to save time # samtools faidx chr6.pan.fa $(grep 'HG00438#\|HG00673#\|HG00735#' chr6.pan.fa.fai | cut -f 1) chm13#1#chr6 grch38#1#chr6 > chr6.pan.3+2refs.fa # COPY THE FASTA FILE INSTEAD OF PRODUCING IT, CALLING IT `chr6.pan.fa` cp $DIR_MEMPANG25/Share/day2_genotyping/chr6.pan.3+2refs.fa chr6.pan.fa echo "Indexing..." samtools faidx chr6.pan.fa We will genotype samples in the C4 region. For that, we need to find the corresponding region on all the assemblies in the pangenome. As easy way, we can align the pangenome against the reference. cd $DIR_MEMPANG25/day2_genotyping mkdir wfmash wfmash \ reference/GRCh38_full_analysis_set_plus_decoy_hla.chr6.fa pangenomes/chr6.pan.fa \ -s 10k -p 95 \ -t 16 \ > wfmash/chr6.pangenome_to_reference.paf and then project the alignments of the C4 region onto the reference by using `impg`: cd $DIR_MEMPANG25/day2_genotyping mkdir impg impg query \ -p wfmash/chr6.pangenome_to_reference.paf \ -b regions_of_interest/C4.bed \ > impg/C4.projected.bedpe Let's extract now the C4 regions in the pangenome: # Merge regions bedtools sort -i impg/C4.projected.bedpe | \ bedtools merge -d 100000 > impg/C4.merged.bed bedtools getfasta -fi pangenomes/chr6.pan.fa -bed impg/C4.merged.bed | bgzip -@ 16 > impg/C4.extracted.fa.gz samtools faidx impg/C4.extracted.fa.gz Now we build a C4 region pangenome graph with PGGB: cd $DIR_MEMPANG25/day2_genotyping mkdir pggb pggb -i impg/C4.extracted.fa.gz -o pggb/C4 -t 16 # Rename the final ODGI graph in a more human-friendly way mv pggb/C4/*smooth.final.og pggb/C4/C4.final.og Let's chop the graph and get a haplotype coverage matrix: cd $DIR_MEMPANG25/day2_genotyping mkdir odgi odgi chop \ -i pggb/C4/C4.final.og \ -c 32 \ -o odgi/C4.chopped.og odgi paths \ -i odgi/C4.chopped.og \ -H | \ cut -f 1,4- | \ gzip > odgi/C4.paths_matrix.tsv.gz We are going to use pre-aligned reads against the human reference genome hg38 in CRAM format. This format requires a reference genome to fetch reads from the original alignments. We already downloaded 39 samples from 1000G, subsetting reads covering the C4 region to save space and time, so just copy them in your working folder: cd $DIR_MEMPANG25/day2_genotyping mkdir -p sequencing_reads/C4 cd sequencing_reads/C4 #for sample in HG00438 HG00673 HG00735; do wget https://garrisonlab.s3.amazonaws.com/teaching/subset_short_read_samples/chr6_30972057_33055418/$sample.final.chr6_30972057_33055418.cram; wget https://garrisonlab.s3.amazonaws.com/teaching/subset_short_read_samples/chr6_30972057_33055418/$sample.final.chr6_30972057_33055418.cram.crai; done cp $DIR_MEMPANG25/Share/day2_genotyping/subset_short_read_samples/chr6_30972057_33055418/*cram* sequencing_reads/C4 <details> <summary>Click me for stuff to run at home</summary> In the future, if you are interested in working with other samples and/or regions, here is how you can download the full pre-aligned reads: cd $DIR_MEMPANG25/day2_genotyping mkdir sequencing_reads cd sequencing_reads # Get list of 1000G samples wget https://ftp-trace.ncbi.nlm.nih.gov/1000genomes/ftp/1000G_2504_high_coverage/additional_698_related/1000G_698_related_high_coverage.sequence.index # Select samples (3 in this example) grep 'HG00438\|HG00673\|HG00735' 1000G_698_related_high_coverage.sequence.index | cut -f 1 > 1000G.selected.txt # Add also links for CRAM indexes sed 's/$/.crai/' 1000G.selected.txt >> 1000G.selected.txt # Download CRAM and index files wget -i 1000G.selected.txt </details> We will index the C4 region pangenome and align sequencing reads to it with [BWA-MEM2](https://github.com/bwa-mem2/bwa-mem2). Let's prepare the index: cd $DIR_MEMPANG25/day2_genotyping mkdir -p alignments/C4 bwa-mem2 index impg/C4.extracted.fa.gz and then align the sequencing reads against the pangenome. We will align only those for the samples we are using: cd $DIR_MEMPANG25/day2_genotyping cat pangenomes/chr6.pan.fa.fai | cut -f 1 -d '#' | uniq > samples.txt ls sequencing_reads/C4/*cram | grep -Ff samples.txt | while read CRAM; do echo $CRAM NAME=$(basename $CRAM .cram) # Extract reads covering the C4 region and then align them against the pangenome samtools view \ -T $DIR_MEMPANG25/Share/day2_genotyping/GRCh38_full_analysis_set_plus_decoy_hla.fa \ -L regions_of_interest/C4.bed \ -M \ -b \ $CRAM | \ samtools sort -n | \ samtools fasta | \ bwa-mem2 mem -t 6 impg/C4.extracted.fa.gz - | \ samtools view -b -F 4 -@ 2 - \ > alignments/C4/$NAME.reads_vs_extracted.bam done Inject the pangenome alignments into the pangenome graph: cd $DIR_MEMPANG25/day2_genotyping odgi view \ -i odgi/C4.chopped.og \ -g > odgi/C4.chopped.gfa ls sequencing_reads/C4/*cram | grep -Ff samples.txt | while read CRAM; do echo $CRAM NAME=$(basename $CRAM .cram) gfainject \ --gfa odgi/C4.chopped.gfa \ --bam alignments/C4/$NAME.reads_vs_extracted.bam \ > alignments/C4/$NAME.injected.gaf done Let's get the sample coverage vectors: cd $DIR_MEMPANG25/day2_genotyping ls sequencing_reads/C4/*cram | grep -Ff samples.txt | while read CRAM; do echo $CRAM NAME=$(basename $CRAM .cram) gafpack \ --gfa odgi/C4.chopped.gfa \ --gaf alignments/C4/$NAME.injected.gaf \ --len-scale | \ gzip > alignments/C4/$NAME.coverage.gafpack.gz done Prepare clusters: cd $DIR_MEMPANG25/day2_genotyping mkdir -p clusters # First generate the similarity matrix using odgi odgi similarity \ -i odgi/C4.chopped.og \ > odgi/C4.similarity.tsv # Download the clustering script from cosigt repository wget https://raw.githubusercontent.com/davidebolo1993/cosigt/16b18815cf9fdfcbf2afbf588a02740c27941ee3/cosigt_smk/workflow/scripts/cluster.r -P clusters # Run the clustering script to generate the JSON Rscript clusters/cluster.r odgi/C4.similarity.tsv clusters/C4.clusters.json You can take a look at the generated clusters: cat clusters/C4.clusters.tsv | column -t Now the actual genotyping step. cd $DIR_MEMPANG25/day2_genotyping mkdir -p cosigt/C4 ls sequencing_reads/C4/*cram | grep -Ff samples.txt | while read CRAM; do echo $CRAM NAME=$(basename $CRAM .cram) cosigt \ -i $NAME \ -p odgi/C4.paths_matrix.tsv.gz \ -g alignments/C4/$NAME.coverage.gafpack.gz \ -c clusters/C4.clusters.json \ -o cosigt/C4 mv cosigt/C4/cosigt_genotype.tsv cosigt/C4/$NAME.cosigt_genotype.tsv mv cosigt/C4/sorted_combos.tsv cosigt/C4/$NAME.sorted_combos.tsv done Check the output: cd $DIR_MEMPANG25/day2_genotyping/cosigt/C4 grep 'final' -h *.cosigt_genotype.tsv | column -t Are the genotypes correct? Let's make a visualization of the region by using the GRCh38 reference annotations to annotate the pangenome. Download GENCODE annotation: cd $DIR_MEMPANG25/day2_genotyping mkdir annotation wget -c https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_47/gencode.v47.annotation.gtf.gz -P annotation Then subset the annotation falling in the C4 region: # Subset GTF bedtools intersect -wa -a annotation/gencode.v47.annotation.gtf.gz -b regions_of_interest/C4.bed > annotation/C4_chr6_31972057_32055418.gtf We will use the `annotate.r` script from `cosigt`' repository to prepare the annotation. This scipt expects the filename to specify start and end. cd $DIR_MEMPANG25/day2_genotyping wget -c https://raw.githubusercontent.com/davidebolo1993/cosigt/ed9f117a7e1dad23e262e9d78dd777a97a0fde74/cosigt_smk/workflow/scripts/annotate.r -P annotation Rscript annotation/annotate.r annotation/C4_chr6_31972057_32055418.gtf chr6 annotation/C4_chr6_31972057_32055418.bed Prepare the final annotation to be injected in the pangenome graph: cd $DIR_MEMPANG25/day2_genotyping sed 's/chr6/grch38#1#chr6/g' annotation/C4_chr6_31972057_32055418.bed -i odgi procbed -i pggb/C4/C4.final.og -b annotation/C4_chr6_31972057_32055418.bed > annotation/C4_chr6_31972057_32055418.procbed.bed Why did we renamed the sequences (with the `sed` command) in `C4_chr6_31972057_32055418.bed` before running `odgi procbed`? Let's inject the annotation into the graph: cd $DIR_MEMPANG25/day2_genotyping odgi inject -i pggb/C4/C4.final.og -b annotation/C4_chr6_31972057_32055418.procbed.bed -o annotation/C4.final.inject.og -P Check the injected paths: odgi paths -i annotation/C4.final.inject.og -L | tail -n 12 Flip the paths in order to have everything in the same orientation: odgi flip -i annotation/C4.final.inject.og --ref-flips <(cut -f 1 annotation/C4_chr6_31972057_32055418.procbed.bed | uniq) -o annotation/C4.final.inject.flip.og Why are we using the `--ref-flips` argument of `odgi inject`? Annotate the flipped graph: odgi untangle -i annotation/C4.final.inject.flip.og -R <(cut -f 4 annotation/C4_chr6_31972057_32055418.procbed.bed) -j 0.5 -g > annotation/C4.gggenes.tsv and plot the annotated pangenome together with the clustering: wget -c https://raw.githubusercontent.com/davidebolo1993/cosigt/ed9f117a7e1dad23e262e9d78dd777a97a0fde74/cosigt_smk/workflow/scripts/plotgggenes.r -P annotation # In this example we have few samples, so we reduce the height of the figure in the script sed 's/height=20/height=2/g' annotation/plotgggenes.r -i Rscript annotation/plotgggenes.r annotation/C4.gggenes.tsv clusters/C4.clusters.json annotation/C4.gggenes.png BONUS: Try to genotype a few other samples. You can select them from the full chr6 pangenome at `$DIR_MEMPANG25/Share/day2_genotyping/chr6.pan.fa`. Remember to include the `grch38` for the annotations. Try to genotype the same pangenome in the CYP2D6 locus. CYP2D6 is an enzyme responsible for the metabolism of many drugs. It is located at chr22:42077656-42253758 on GRCh38. You can find the reads at `$DIR_MEMPANG25/Share/day2_genotyping/subset_short_read_samples/chr22_42077656_42253758`