subwaystation
    • Create new note
    • Create a note from template
      • Sharing URL Link copied
      • /edit
      • View mode
        • Edit mode
        • View mode
        • Book mode
        • Slide mode
        Edit mode View mode Book mode Slide mode
      • Customize slides
      • Note Permission
      • Read
        • Only me
        • Signed-in users
        • Everyone
        Only me Signed-in users Everyone
      • Write
        • Only me
        • Signed-in users
        • Everyone
        Only me Signed-in users Everyone
      • Engagement control Commenting, Suggest edit, Emoji Reply
    • Invite by email
      Invitee

      This note has no invitees

    • Publish Note

      Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note

      Your note will be visible on your profile and discoverable by anyone.
      Your note is now live.
      This note is visible on your profile and discoverable online.
      Everyone on the web can find and read all notes of this public team.
      See published notes
      Unpublish note
      Please check the box to agree to the Community Guidelines.
      View profile
    • Commenting
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
      • Everyone
    • Suggest edit
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
    • Emoji Reply
    • Enable
    • Versions and GitHub Sync
    • Note settings
    • Note Insights New
    • Engagement control
    • Transfer ownership
    • Delete this note
    • Save as template
    • Insert from template
    • Import from
      • Dropbox
      • Google Drive
      • Gist
      • Clipboard
    • Export to
      • Dropbox
      • Google Drive
      • Gist
    • Download
      • Markdown
      • HTML
      • Raw HTML
Menu Note settings Note Insights Versions and GitHub Sync Sharing URL Create Help
Create Create new note Create a note from template
Menu
Options
Engagement control Transfer ownership Delete this note
Import from
Dropbox Google Drive Gist Clipboard
Export to
Dropbox Google Drive Gist
Download
Markdown HTML Raw HTML
Back
Sharing URL Link copied
/edit
View mode
  • Edit mode
  • View mode
  • Book mode
  • Slide mode
Edit mode View mode Book mode Slide mode
Customize slides
Note Permission
Read
Only me
  • Only me
  • Signed-in users
  • Everyone
Only me Signed-in users Everyone
Write
Only me
  • Only me
  • Signed-in users
  • Everyone
Only me Signed-in users Everyone
Engagement control Commenting, Suggest edit, Emoji Reply
  • Invite by email
    Invitee

    This note has no invitees

  • Publish Note

    Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note

    Your note will be visible on your profile and discoverable by anyone.
    Your note is now live.
    This note is visible on your profile and discoverable online.
    Everyone on the web can find and read all notes of this public team.
    See published notes
    Unpublish note
    Please check the box to agree to the Community Guidelines.
    View profile
    Engagement control
    Commenting
    Permission
    Disabled Forbidden Owners Signed-in users Everyone
    Enable
    Permission
    • Forbidden
    • Owners
    • Signed-in users
    • Everyone
    Suggest edit
    Permission
    Disabled Forbidden Owners Signed-in users Everyone
    Enable
    Permission
    • Forbidden
    • Owners
    • Signed-in users
    Emoji Reply
    Enable
    Import from Dropbox Google Drive Gist Clipboard
       Owned this note    Owned this note      
    Published Linked with GitHub
    • Any changes
      Be notified of any changes
    • Mention me
      Be notified of mention me
    • Unsubscribe
    <!--- panacus 0.2 https://anaconda.org/bioconda/panacus ---> # HPRC HUGO24 Workshop - Building and Analyzing Pangenome Graphs #### NOTE: Allow starting a web server for browsing outputs Let's get a web server running. Log-in again to your virtual machine from another terminal and run: ```shell python -m http.server 889N # replace N with your user number (for example: N = 1 for user 1) ``` You can access this by pointing your web browser at `http://<your_ip>:889N/`, where <your_ip> is the ip address of your instance. curl ifconfig.me can show you your ip address. <br/> ##### TEMPORARY SOLUTION TO DOWNLOAD FILES ON YOUR COMPUTER rsync -av user1@<YOUR_MACHINE>:/path/of/your/file . ## PGGB ### Learning objectives - build pangenome graphs using pggb - explore pggb's results - understand how parameters affect the built pangenome graphs ### Getting started Make sure you have `pggb` v0.5.4 and its tools installed. It is already available on the course workstations. If you want to build everything on your laptop, follow the instructions at the [pggb homepage](https://github.com/pangenome/pggb) (`guix`, `docker`, `singularity`, and `conda` alternatives are available). So make sure you have checked out pggb repository: cd ~ git clone https://github.com/pangenome/pggb.git Now create a directory to work on for this part of the tutorial tutorial: mkdir hprc_hugo24_pggb cd hprc_hugo24_pggb ln -s ../pggb/data <!--- Take a look at the sequence names of `data/HLA/DRB1-3123.fa.gz.fai`: cut -f 1 data/HLA/DRB1-3123.fa.gz.fai gi|568815592:32578768-32589835 gi|568815529:3998044-4011446 gi|568815551:3814534-3830133 gi|568815561:3988942-4004531 gi|568815567:3779003-3792415 gi|568815569:3979127-3993865 gi|345525392:5000-18402 gi|29124352:124254-137656 gi|28212469:126036-137103 gi|28212470:131613-146345 gi|528476637:32549024-32560088 gi|157702218:147985-163915 They do not correspond to the [PanSN-spec](https://github.com/pangenome/PanSN-spec) with '#' as the delimiter. Let's fix this first: cp data/HLA/DRB1-3123.fa.gz ./ gunzip DRB1-3123.fa.gz cat DRB1-3123.fa | while read line; do [[ $line = '>'* ]] && echo $(echo $line | cut -f 1 -d" ")#1 || echo $line; done > DRB1-3123.pansn.fa bgzip DRB1-3123.pansn.fa samtools faidx DRB1-3123.pansn.fa.gz ---> ### Build HLA pangenome graphs The [human leukocyte antigen (HLA)](https://en.wikipedia.org/wiki/Human_leukocyte_antigen) system is a complex of genes on chromosome 6 in humans which encode cell-surface proteins responsible for the regulation of the immune system. Let's build a pangenome graph from a collection of sequences of the DRB1-3123 gene: pggb -i data/HLA/DRB1-3123.fa.gz -n 12 -t 16 -o out_DRB1_3123 Run `pggb` without parameters to get information on the meaning of each parameter: pggb Take a look at the files in the `out_DRB1_3123` folder. Why did we specify `-n 12`? <details> <summary>Click me for the answer</summary> The pggb graph is defined by the number of mappings per segment of each genome `-n, --n-mappings N`. Ideally, you should set this to equal the number of haplotypes in the pangenome. Because, one expects the `number of haplotypes minus 1` as the maximum number of secondary mappings and alignments. Keep in mind that the total work of alignment is proportional to `N*N`, and these multimappings can be highly redundant. If you provide a `N` that is not equal to the number of haplotypes, provide the actual number of haplotypes to `-H`. This helps smoothxg to determine the right POA problem size. <!--- `pggb` expects sequence names to follow the [PanSN-spec](https://github.com/pangenome/PanSN-spec). This potentially enables `pggb` to automatically detect the correct paremeter setting for `-c, --n-mappings N` and `-n, --n-haplotypes N`. For teaching, we set this manually. Setting this parameter is important, because the `pggb` graph is defined by the number of mappings per segment of each genome `-c, --n-mappings N`. Ideally, you should set this to the number of haplotypes minus 1 in the pangenome. Because that's the maximum number of secondary mappings and alignments that we expect. Keep in mind that the total work of alignment is proportional to `O(N) = N*N`, and these multimappings can be highly redundant. `-n, --n-haplotypes N` helps `smoothxg` to determine the right POA problem size. ---> </details> <br /> <!--- How many alignments were executed during the pairwise alignment (take a look at the PAF output)? Visualize the alignments: cd out_DRB1_3123 pafplot *.paf --size 2000 cd .. This creates `*.paf.png`. Examine the PNG on your local laptop. ---> How many alignments were executed during the pairwise alignment (take a look at the PAF output)? Visualize the alignments: cd out_DRB1_3123 pafplot *.paf --size 2000 cd .. Use `odgi stats` to obtain the graph length, and the number of nodes, edges, and paths. Do you think the resulting pangenome graph represents the input sequences well? Check the length and the number of the input sequences to answer this question. <details> <summary>Click me for the answer</summary> The total graph length is much longer than the length of most of the input sequences. This indicates an underalignment of all the sequences. </details> <br /> How many blocks were selected and 'smoothed' during the two rounds of graph normalization? <details> <summary>Hint</summary> Take a look at the `*.log` file to answer this question. </details> <br /> Try building the same pangenome graph by specifying a lower percent identity (`-p 90` by default): pggb -i data/HLA/DRB1-3123.fa.gz -p 80 -n 12 -t 16 -o out2_DRB1_3123 Check graph statistics. Does this pangenome graph represent better or worse the input sequences than the previously produced graph? <details> <summary>Click me for the answer</summary> The total graph length is now closer to each length of most of the input sequences. </details> <br /> Try to decrease the number of mappings to keep for each segment: pggb -i data/HLA/DRB1-3123.fa.gz -p 80 -n 6 -t 16 -o out3_DRB1_3123 How does it affect the graph? Try to increase the target sequence length for the partial order alignment (POA) problem (`-G 700,900,1100` by default): pggb -i data/HLA/DRB1-3123.fa.gz -p 80 -n 12 -t 16 -G 1400,1800,2200 -o out4_DRB1_3123 How is this changing the runtime and the memory usage? How is this affecting graph statistics? How many blocks were selected and 'smoothed' during the two rounds of graph normalization? <details> <summary>Hint</summary> Take a look at the `*.log` file to answer this question. </details> <br /> Take the second `pggb` run and try to increase the segment length (`-s 5000` by default): pggb -i data/HLA/DRB1-3123.fa.gz -s 20000 -p 80 -n 12 -t 16 -o out5_DRB1_3123 How is this affecting graph statistics? Why? <details> <summary>Hint</summary> The length of a segment for mapping is now so large for the given sequence identity, that some mappings are not possible anymore. </details> <br /> <!--- `pggb` produces intermediate graphs during the process. Let's keep all of them: pggb -i DRB1-3123.pansn.fa.gz -p 80 -n 12 -c 11 -t 16 --keep-temp-files -o out2_DRB1_3123_keep_intermediate_graphs What does the file with name ending with `.seqwish.gfa` contain? And what about the file with name ending with `.smooth.1.gfa`? Take a look at the graph statistics of all the GFA files in the `out2_DRB1_3123_keep_intermediate_graphs` folder. <details> <summary>Click me for the answer</summary> - The `.seqwish.gfa` graph is the graph that directly comes out of `seqwish`. - `.smooth.[0-1].gfa` graphs are intermediate graphs after each round of `smoothxg`. </details> <br /> ---> ## nf-core/pangenome ### Learning objectives In this exercise you learn how to - run a [nf-core](https://nf-co.re/) [Nextflow](https://www.nextflow.io/) pipeline, - configure the resources according to what is available, - deal with alternative parameter names, - understand the [nf-core/pangenome](https://github.com/nf-core/pangenome) pipeline's output: - [MutiQC](https://multiqc.info/), - used CPU, RAM, ... - workflow timeline, - output folders ### Getting started Make sure you have `screen`, `wget`, `git`, `Nextflow`, and `Docker` installed. All tools are already available on the course VMs. Now create a directory to work in for this tutorial: cd ~ mkdir hprc_hugo24_nextflow cd hprc_hugo24_nextflow One can distribute the available compute resources efficiently across the different processes of the Nextflow pipeline using [config](https://www.nextflow.io/docs/latest/config.html) files. During the course you have access to 32 threads with 64 gigabytes of memory. To ensure that each run only consumes up to these resources, please create the following config file: hprc_hugo24_config="executor { cpus = 32 memory = 64.GB }" echo "$hprc_hugo24_config" > hprc_hugo24.config Prepare input data: # download the HPRC PGGB chr20 graph wget https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/pggb/chroms/chr20.hprc-v1.0-pggb.gfa.gz gunzip chr20.hprc-v1.0-pggb.gfa.gz We want to build a graph with 4 haplotypes, so we need to extract a subset from all the sequences in the graph: # extract all sequences in FASTA format with ODGI odgi paths -i chr20.hprc-v1.0-pggb.gfa -f -t 32 -P > chr20.hprc-v1.0-pggb.gfa.fa # index the FASTA samtools faidx chr20.hprc-v1.0-pggb.gfa.fa We select the two references CHM13, GRCH38, and the 2 haplotypes of the HG01978 diploid sample: grep "chm13\|grch38\|HG01978" chr20.hprc-v1.0-pggb.gfa.fa.fai | cut -f 1 > chr20.pan4.txt # fetch the sequences of the desired haplotypes samtools faidx chr20.hprc-v1.0-pggb.gfa.fa -r chr20.pan4.txt > chr20.hprc.pan4.fa # zip it bgzip -@32 chr20.hprc.pan4.fa # index the FASTA samtools faidx chr20.hprc.pan4.fa.gz ### Building an HPRC 4 haplotypes chr20 pangenome graph with nf-core/pangenome Whilst we can limit the maximum allocatable resources with `hprc_hugo24.config`, one can assign resources for each step of the pipeline using a different config file: chr20_hprc_pan4_config="process { withName:'MULTIQC|MULTIQC_COMMUNITY|SAMTOOLS_FAIDX|CUSTOM_DUMPSOFTWAREVERSIONS' { // these tools can only make use of one thread and need little RAM cpus = 1 memory = 1.GB } withName:'TABIX_BGZIP|ODGI_STATS|WFMASH_ALIGN|VG_DECONSTRUCT' { cpus = 8 memory = 8.GB } withName:'WFMASH_MAP_ALIGN|WFMASH_MAP|SEQWISH|ODGI_BUILD|ODGI_UNCHOP|ODGI_SORT|ODGI_LAYOUT|WFMASH_MAP_COMMUNITY|ODGI_SQUEEZE' { cpus = 16 memory = 16.GB } withName:'SMOOTHXG' { cpus = 32 memory = 32.GB } withName:'GFAFFIX|ODGI_VIEW|ODGI_VIZ*|ODGI_DRAW|SPLIT_APPROX_MAPPINGS_IN_CHUNKS|PAF2NET|NET2COMMUNITIES|EXTRACT_COMMUNITIES' { // these tools can only make use of one thread and need medium RAM cpus = 1 memory = 8.GB } }" echo "$chr20_hprc_pan4_config" > chr20.hprc.pan4.config Let's build the chromosome 20 pangenome graph. If you are interested in setting additional parameters you can always visit https://nf-co.re/pangenome/1.1.2/parameters for details. All parameters starting with one `-` are handed over to Nextflow, all parameters starting with two `-` are handled by the pipeline itself: nextflow run nf-core/pangenome -r 1.1.2 --input chr20.hprc.pan4.fa.gz --outdir chr20.hprc.pan4_out --n_haplotypes 4 --wfmash_map_pct_id 98 --wfmash_segment_length 10k --wfmash_n_mappings 3 --seqwish_min_match_length 311 --smoothxg_poa_length "1000," -c hprc_hugo24.config,chr20.hprc.pan4.config --wfmash_exclude_delim '#' -profile docker --wfmash_chunks 4 <details> <summary>Click me for the explanations of some Nextflow parameters</summary> - `nextflow run`: Execute a Nextflow pipeline. - `nf-core/pangenome -r 1.1.2`: Select pipeline https://github.com/nf-core/pangenome for execution in its version 1.1.2. - `--n_haplotypes 4`: We have 4 haplotypes as input. - `--wfmash_map_pct_id 98`: Genomic sequences of human vary by ~2%. - `--wfmash_n_mappings 3`: We want to retain 3 mappings for each segment, because each segment could map to 3 other haplotypes. - `seqwish_min_match_length 311`: Filter exact matches below this length. This can smooth the graph locally and prevent the formation of complex local graph topologies from forming due to differential alignments. - `--wfmash_exclude_delim '#'`: Our input sequences follows the pansn spec so the idea is to skip mappings when the query and target have the same prefix: '#'. Since our sample `HG01978` still consists of contigs, this will reduce our mapping problem and speed up `WFMASH_MAP`. - `--wfmash_chunks 4`: One advantage that `nf-core/pangenome` has over `pggb` is that it can parallelize the often heavy base-pair level alignments across nodes of a cluster. The parameter `--wfmash_chunks` determines into how many equally large subproblems the alignments should be split after the `WFMASH_MAP` process. It is recommended that this number roughly fits the number of available nodes one has. During the course, a full cluster is not available, so we are improvising. In `chr20.hprc.pan4.config` the number of CPUs for `WFMASH_ALIGN` is set to 8. Assuming we are able to run this in parallel on our 32T/64GB machine, one can expect that at most 4 `WFMASH_ALIGN` process can be executed in parallel. </details> <br /> *In which folder can the final ODGI graph be found? And in which folder in the final GFA graph?* <details> <summary>Click me for the answer</summary> - ODGI: *FINAL_ODGI* - GFA: *FINAL_GFA* </details> Open the MultiQC report and other statistics on your local machine in order to take a closer look. chr20.hprc.pan4_out/multiqc/multiqc_report.html . chr20.hprc.pan4_out/pipeline_info/execution_*.html . chr20.hprc.pan4_out/pipeline_info/pipeline_dag_*.html . In the MultiQC report you will find vital graph statistics, lots of 1D graph visualizations and a 2D graph visualization serving both as quantitative and qualitative graph validation information. In `execution_report_*.html*` you can find an overview of the executed pipeline and especially the resource consumption of each process of the pipeline. If you notice that a process is consuming much less RAM than it was given in `chr20.hprc.pan4.config` you would want to adjust this in future runs. Assuming you want to run `nf-core/pangenome` on a cluster, it is crucial to limit the allocated resources for each process, so your jobs usually have a higher chance to be submitted by the cluster scheduler. In `execution_timeline_*.html` one can observe when which process was executed and which processes were submitted in parallel, assuming resources were available. ## ODGI ### Learning objectives - extract subgraphs representing loci of interest - visualize graph annotation - make phylogenetic trees ### Getting started Check out odgi repository (we need one of its example): cd ~ git clone https://github.com/pangenome/odgi.git Now create a directory to work on for this tutorial: cd ~ mkdir hprc_hugo24_subgraphs cd hprc_hugo24_subgraphs ln -s ~/odgi/test ### Exploring the HPRC chromosome 6 pangenome graph Download the pangenome graph of the Human chromosome 6 in GFA format, decompress it, and convert it to a graph in odgi format. wget https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/scratch/2021_11_16_pggb_wgg.88/chroms/chr6.pan.fa.a2fb268.4030258.6a1ecc2.smooth.gfa.gz gunzip chr6.pan.fa.a2fb268.4030258.6a1ecc2.smooth.gfa.gz odgi build -g chr6.pan.fa.a2fb268.4030258.6a1ecc2.smooth.gfa -o chr6.pan.og -t 32 -P This graph contains contigs of 88 haploid, phased human genome assemblies from 44 individuals, plus the `chm13` and `grch38` reference genomes. #### Extraction The [major histocompatibility complex (MHC)](https://en.wikipedia.org/wiki/Major_histocompatibility_complex) is a large locus in vertebrate DNA containing a set of closely linked polymorphic genes that code for cell surface proteins essential for the adaptive immune system. In humans, the MHC region occurs on chromosome 6. The human MHC is also called the HLA (human leukocyte antigen) complex (often just the HLA). To see the coordinates of some HLA genes, execute: head test/chr6.HLA_genes.bed -n 5 The coordinates are expressed with respect to the `grch38` reference genome. To extract the subgraph containing all the HLA genes annotated in the `chr6.HLA_genes.bed` file, execute: odgi extract -i chr6.pan.og -o chr6.pan.MHC.og -b <(bedtools merge -i test/chr6.HLA_genes.bed -d 10000000) -d 0 -E -t 32 -P The instruction extracts: - the nodes belonging to the `grch38#chr6` path ranges specified in the the `chr6.HLA_genes.bed` file via `-b`, - all nodes between the min and max positions touched by the given path ranges, also if they belong to other paths (`-E`), - the edges connecting all the extracted nodes, and - the paths traversing all the extracted nodes. How many paths are present in the extracted subgraph? With 90 haplotypes (44 diploid samples plus 2 haploid reference genomes), how many paths would you expect in the subgraph if the MHC locus were solved with a single contig per haplotype? To visualize the graph, execute: odgi sort -i chr6.pan.MHC.og -o - -O | odgi viz -i - -o chr6.pan.MHC.png -s '#' -a 20 Why are we using `odgi sort` before visualizing the graph? Are there haplotypes where the MHC locus is not resolved with a single contig? If so, which ones? Counts the number of contigs for each haplotype. Generate the graph layout with `odgi layout`. (remember to specify the number of threads). Visualize the layout with `odgi draw` Specify `-P` to get information on the progress. The MHC locus includes the complement component 4 (C4) region, which encodes proteins involved in the complement system. In humans, the C4 gene exists as 2 functionally distinct genes, C4A and C4B, which both vary in structure and copy number ([Sekar et al., 2016](https://www.nature.com/articles/nature16549)). Moreover, C4A and C4B genes segregate in both long and short genomic forms, distinguished by the presence or absence of a human endogenous retroviral (HERV) sequence. Find C4 coordinates: wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.ncbiRefSeq.gtf.gz zgrep 'gene_id "C4A"\|gene_id "C4B"' hg38.ncbiRefSeq.gtf.gz | awk '$1 == "chr6"' | cut -f 1,4,5 | bedtools sort | bedtools merge -d 15000 | bedtools slop -l 10000 -r 20000 -g hg38.chrom.sizes | sed 's/chr6/grch38#chr6/g' > hg38.ncbiRefSeq.C4.coordinates.bed Extract the C4 locus: odgi extract -i chr6.pan.og -b hg38.ncbiRefSeq.C4.coordinates.bed -o - -d 0 -E -t 32 -P | odgi explode -i - --biggest 1 --sorting-criteria P --optimize -p chr6.pan.C4 odgi sort -i chr6.pan.C4.0.og -o chr6.pan.C4.sorted.og -p Ygs -x 100 -t 32 -P What `odgi explode` is doing? Regarding the `odgi viz` visualization, select the haplotypes to visualize odgi paths -i chr6.pan.C4.sorted.og -L | grep 'chr6\|HG00438\|HG0107\|HG01952' > chr6.selected_paths.txt and visualize them: # odgi viz: default (binned) mode odgi viz -i chr6.pan.C4.sorted.og -o chr6.pan.C4.sorted.png -c 12 -w 100 -y 50 -p chr6.selected_paths.txt # odgi viz: color by strand odgi viz -i chr6.pan.C4.sorted.og -o chr6.pan.C4.sorted.z.png -c 12 -w 100 -y 50 -p chr6.selected_paths.txt -z # odgi viz: color by position odgi viz -i chr6.pan.C4.sorted.og -o chr6.pan.C4.sorted.du.png -c 12 -w 100 -y 50 -p chr6.selected_paths.txt -du # odgi viz: color by depth odgi viz -i chr6.pan.C4.sorted.og -o chr6.pan.C4.sorted.m.png -c 12 -w 100 -y 50 -p chr6.selected_paths.txt -m -B Spectral:4 For the `chr6.pan.C4.sorted.m.png` image we used the Spectra color palette with 4 levels of node depths, so white indicates no depth, while grey, red, and yellow indicate depth 1, 2, and greater than or equal to 3, respectively. What information does this image provide us about the state of the C4 region in the selected haplotypes? Visualize all haplotypes with `odgi viz`, coloring by depth. How many haplotypes have three copies of the C4 region? How many haplotypes are missing the HERV sequence? What is the copy number state of the `chm13` and `grch38` reference genomes? Use `odgi layout` and `odgi draw` to compute and visualize the layout of the C4 locus. Try to find out how to add the following /home/participant/odgi/test/chr6.C4.bed to `odgi draw`'s SVG output. The HERV sequence may be present or absent in the C4 regions across haplotypes: how does this reflect on the structure of the graph? ### Primate chromosome 6 Download the pangenome graph of the primate chromosome 6 in GFA format, decompress it, and convert it to a graph in `odgi` format. wget https://zenodo.org/record/7933393/files/primates14.chr6.fa.gz.667b9b6.c2fac19.ee137be.smooth.final.gfa.zst zstd -d primates14.chr6.fa.gz.667b9b6.c2fac19.ee137be.smooth.final.gfa.zst odgi build -g primates14.chr6.fa.gz.667b9b6.c2fac19.ee137be.smooth.final.gfa -o primates14.chr6.fa.gz.667b9b6.c2fac19.ee137be.smooth.final.og -t 16 -P This graph contains contigs of chromosome 6 of six diploid (2 haplotypes for each sample), phased primate genome assemblies, plus the chm13 and grch38 reference genomes. Primate genomes were downloaded from https://genomeark.github.io/t2t-all/. Compute the dissimilarity (distance) between all possible pairs of haplotypes: odgi similarity -i primates14.chr6.fa.gz.667b9b6.c2fac19.ee137be.smooth.final.og --distances -t 32 -D '#' -p 2 -P > primates14.chr6.fa.gz.667b9b6.c2fac19.ee137be.smooth.final.dist.tsv The `-D` and `-p` options specifies to use the 2nd occurrence of the # character to group the results. As path names follow the PanSN-spec specification, this means that results are grouped by haplotype. Take a look at the output: head primates14.chr6.fa.gz.667b9b6.c2fac19.ee137be.smooth.final.dist.tsv Construct a phylogenetic tree by using the jaccard.distance: library(tidyverse) library(ape) library(ggtree) path_dist_tsv <- 'primates14.chr6.fa.gz.667b9b6.c2fac19.ee137be.smooth.final.dist.tsv' # Read sparse matrix sparse_matrix_df <- read_tsv(path_dist_tsv) # Prepare distance matrix jaccard_dist_df <- sparse_matrix_df %>% arrange(group.a, group.b) %>% select(group.a, group.b, dice.distance) %>% pivot_wider(names_from = group.b, values_from = dice.distance) %>% column_to_rownames(var = "group.a") # Clustering jaccard_hc <- as.dist(jaccard_dist_df) %>% hclust() # Open a pdf device with the specified width and height pdf(file = "dendrogram.haplotypes.pdf", width = 5, height = 6) # Plot the dendrogram plot( jaccard_hc, # Label at same height hang = -1, main = 'primate14.chr6', xlab = 'Haplotype', ylab = 'Jaccard distance', sub = '', cex = 1.2 ) # Close the device and save the file dev.off() Try to make the tree by grouping the results by sample. ## Bonus: Pangenome growth curve ### Learning objectives In this exercise you learn how to - Evaluate and interpret the growth curve of a pangenome. ### Getting started Make sure you have `panacus` installed. It is already available on the course VMs. Now create a directory to work on for this tutorial: cd ~ mkdir hprc_openness cd hprc_openness Download the 44 haplotypes chrM HPRC human pangenome graph ([Liao, Asri, Ebler et al., 2023](https://doi.org/10.1038/s41586-023-05896-x)) from the [HPRC Pangenome Resources](https://github.com/human-pangenomics/hpp_pangenome_resources) and the 50 haplotypes *E. coli* pangenome graph ([Garrison, Guarracino et al., 2023](https://www.biorxiv.org/content/10.1101/2023.04.05.535718v1)): mkdir chrM cd chrM wget -c https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/pggb/chroms/chrM.hprc-v1.0-pggb.gfa.gz gunzip chrM.hprc-v1.0-pggb.gfa.gz mv chrM.hprc-v1.0-pggb.gfa chrM.gfa cd .. mkdir ecoli50 cd ecoli50 wget -c https://zenodo.org/record/7937947/files/ecoli50.gfa.zst zstd -d ecoli50.gfa.zst cd .. ### `odgi heaps` `odgi heaps` calculates permutations of the path pangenome coverage in the graph. The number of permutations affects the accuracy of the subsequent power law regression. This regression happens in this [Rscript](https://github.com/pangenome/odgi/blob/master/scripts/heaps_fit.R) that uses the heap's law ([Tettelin et al., 2008](https://www.sciencedirect.com/science/article/pii/S1369527408001239?via=ihub#section0020)) to calculate a pangenome growth curve from all `odgi heaps` permutations. For more details, take a look at https://en.wikipedia.org/wiki/Pan-genome#Classification. ### `Panacus` `panacus` is able to calculate the pangenome openness without the need to perform any permutations. Indeed, it directly applies the binomial formula described in [Parmigiani et al., 2022](https://www.biorxiv.org/content/10.1101/2022.11.15.516472v2), Section 2.1, Eq 1. `panacus` exposes a parameter (`-c`) that allow users to chose which graph feature (sequence, node, edge) is taken into account to calculate the growth histogram parameter. The coverage `-l` parameter sets the _minimum number_ of haplotypes visiting a graph feature in order for this graph feature to be included into the calculation at all. With `-q` one can set the _minimum fraction_ of haplotypes that must share a graph feature *after each time a haplotype is added to the growth histograph*. For example, assuming a 100 haplotypes pangenome graph, setting `-c bps -q 0,1,0.5,0.1` would calculate the pangenome growth in sequence scape (`-c bps`) for 4 different cases. Remember, quorum sets the minimum fraction of haplotypes for a nucleotide to be included in the openness calculation. - without setting any quorum (`-q 0`), so all sequences are considered. - limited to sequences that are traversed by 100% of haplotypes (`-q 1`). This is the `core pangenome`. - limited to sequences that are traversed by at least 50% of the haplotypes (`-q 0.5`). This is the `shell pangenome`. - limited to sequences that are traversed by at least 10% of the haplotypes (`-q 0.1`). This is the `cloud pangenome`. `panacus` fits two curves, one following heap's law, and one using heap's power law for modeling the data. ### Pangenome growth curve of the HPRC chrM pangenome graph Create the matrix of path pangenome coverage permutations of the chrM graph with `odgi heaps` subsequently performing the heap's law regression: cd chrM odgi heaps -i chrM.gfa -S -n 100 > chrM.gfa.heaps.tsv Rscript PATH_TO_ODGI_GIT_REPOSITORY/scripts/heaps_fit.R chrM.gfa.heaps.tsv chrM.gfa.heaps.tsv.pdf # this should be executed on your local machine Taking a look at the PDF, we can surprisingly observe 2 traces of permutations. *How can this happen*? <details> <summary>**Hint**</summary> Take a look at the 1D visualization of the graph. </details> <details> <summary>**Explanation**</summary> The CHM13 reference was linearized differently than all the other mitochondrial sequences. Therefore it has an additional `tip`. `odgi heaps` permutation algorithm is reflecting this, because the permutation always starts with the beginning of each genome. </details> So to get a cleaner pangenome growth curve with `odgi heap` we remove the CHM13 reference sequence and run the analysis again: odgi paths -i chrM.gfa -L | head -n 1 > chrM.gfa.chm13 odgi paths -i chrM.gfa -X chrM.gfa.chm13 -o chrM.gfa.no_chm13.og odgi heaps -i chrM.gfa.no_chm13.og -S -n 100 > chrM.gfa.no_chm13.og.heaps.tsv Rscript PATH_TO_ODGI_GIT_REPOSITORY/scripts/heaps_fit.R chrM.gfa.no_chm13.og.heaps.tsv chrM.gfa.heaps.no_chm13.og.tsv.pdf # this should be executed on your local machine This looks much better! With every added genome, the number of newly added bases is really low. Let's take a closer look with `panacus`. Create and visualize a growth histogramm of the chrM graph in sequence space with `panacus`: RUST_LOG=info panacus histgrowth chrM.gfa -c bp -q 0,1,0.5,0.1 -t 16 > chrM.gfa.histgrowth.tsv panacus-visualize.py -e -l "lower right" chrM.gfa.histgrowth.tsv > chrM.gfa.histgrowth.tsv.pdf <details> <summary>**See PNG**</summary> ![](https://hackmd.io/_uploads/rkRXGBvHn.png) </details> Taking a look at the top Figure in the PDF there is a nearly asymptotic growth of the number of nucleotides with increasing number of genomes. <!--- A value of`γ=0.019` indicates a closed pangenome, because $γ$ is very close to $0$. ---> The bottome Figure shows the added number of bases per added individual. Which is basically 0 for all, except for the first one. In conclusion, ChrM is a closed pangenome. The core pangenome (`quorum >= 1`) is very large, with the shell pangenome (`quorum >= 0.5`) being even slightly larger, the cloud pangenome (`quorum >= 0.1`) is not noticably larger. ### Pangenome growth curve of the *E.coli* pangenome graph Create the matrix of path pangenome coverage permutations of the *E.coli* graph with `odgi heaps` subsequently performing the power law regression: cd ~/day2_openness/ecoli50 odgi heaps -i ecoli50.gfa -S -n 100 -t 16 > ecoli50.gfa.heaps.tsv Rscript heaps_fit.R ecoli50.gfa.heaps.tsv ecoli50.gfa.heaps.tsv.pdf With every added genome, the number of newly added bases is at least 100k. Let's take a closer look with `panacus`. Create and visualize a growth histograph of the *E. coli* graph in sequence space with `panacus`: RUST_LOG=info panacus histgrowth ecoli50.gfa -c bp -q 0,1,0.5,0.1 -t 16 > ecoli50.gfa.histgrowth.tsv panacus-visualize.py -e -l "upper left" ecoli50.gfa.histgrowth.tsv > ecoli50.gfa.histgrowth.tsv.pdf <details> <summary>**See PNG**</summary> ![](https://hackmd.io/_uploads/BJG17rvr3.png) </details> Taking a look at the top Figure in the PDF there is a polynomial growth of the number of nucleotides with increasing number of genomes visible. <!---The value `1>γ=0.386>0` indicates an open pangenome. ---> The bottom Figure shows the added number of bases per added individual. Even the 50th added individual adds more than 100k new bases to the pangenome. In conclusion, the *E. coli* pangenome is open. The core pangenome (`quorum >= 1`) is quite small, with the shell pangenome (`quorum >= 0.5`) not adding much more sequence, and the cloud pangenome (`quorum >= 0.1`) adding some more sequence.

    Import from clipboard

    Paste your markdown or webpage here...

    Advanced permission required

    Your current role can only read. Ask the system administrator to acquire write and comment permission.

    This team is disabled

    Sorry, this team is disabled. You can't edit this note.

    This note is locked

    Sorry, only owner can edit this note.

    Reach the limit

    Sorry, you've reached the max length this note can be.
    Please reduce the content or divide it to more notes, thank you!

    Import from Gist

    Import from Snippet

    or

    Export to Snippet

    Are you sure?

    Do you really want to delete this note?
    All users will lose their connection.

    Create a note from template

    Create a note from template

    Oops...
    This template has been removed or transferred.
    Upgrade
    All
    • All
    • Team
    No template.

    Create a template

    Upgrade

    Delete template

    Do you really want to delete this template?
    Turn this template into a regular note and keep its content, versions, and comments.

    This page need refresh

    You have an incompatible client version.
    Refresh to update.
    New version available!
    See releases notes here
    Refresh to enjoy new features.
    Your user state has changed.
    Refresh to load new user state.

    Sign in

    Forgot password

    or

    By clicking below, you agree to our terms of service.

    Sign in via Facebook Sign in via Twitter Sign in via GitHub Sign in via Dropbox Sign in with Wallet
    Wallet ( )
    Connect another wallet

    New to HackMD? Sign up

    Help

    • English
    • 中文
    • Français
    • Deutsch
    • 日本語
    • Español
    • Català
    • Ελληνικά
    • Português
    • italiano
    • Türkçe
    • Русский
    • Nederlands
    • hrvatski jezik
    • język polski
    • Українська
    • हिन्दी
    • svenska
    • Esperanto
    • dansk

    Documents

    Help & Tutorial

    How to use Book mode

    Slide Example

    API Docs

    Edit in VSCode

    Install browser extension

    Contacts

    Feedback

    Discord

    Send us email

    Resources

    Releases

    Pricing

    Blog

    Policy

    Terms

    Privacy

    Cheatsheet

    Syntax Example Reference
    # Header Header 基本排版
    - Unordered List
    • Unordered List
    1. Ordered List
    1. Ordered List
    - [ ] Todo List
    • Todo List
    > Blockquote
    Blockquote
    **Bold font** Bold font
    *Italics font* Italics font
    ~~Strikethrough~~ Strikethrough
    19^th^ 19th
    H~2~O H2O
    ++Inserted text++ Inserted text
    ==Marked text== Marked text
    [link text](https:// "title") Link
    ![image alt](https:// "title") Image
    `Code` Code 在筆記中貼入程式碼
    ```javascript
    var i = 0;
    ```
    var i = 0;
    :smile: :smile: Emoji list
    {%youtube youtube_id %} Externals
    $L^aT_eX$ LaTeX
    :::info
    This is a alert area.
    :::

    This is a alert area.

    Versions and GitHub Sync
    Get Full History Access

    • Edit version name
    • Delete

    revision author avatar     named on  

    More Less

    Note content is identical to the latest version.
    Compare
      Choose a version
      No search result
      Version not found
    Sign in to link this note to GitHub
    Learn more
    This note is not linked with GitHub
     

    Feedback

    Submission failed, please try again

    Thanks for your support.

    On a scale of 0-10, how likely is it that you would recommend HackMD to your friends, family or business associates?

    Please give us some advice and help us improve HackMD.

     

    Thanks for your feedback

    Remove version name

    Do you want to remove this version name and description?

    Transfer ownership

    Transfer to
      Warning: is a public team. If you transfer note to this team, everyone on the web can find and read this note.

        Link with GitHub

        Please authorize HackMD on GitHub
        • Please sign in to GitHub and install the HackMD app on your GitHub repo.
        • HackMD links with GitHub through a GitHub App. You can choose which repo to install our App.
        Learn more  Sign in to GitHub

        Push the note to GitHub Push to GitHub Pull a file from GitHub

          Authorize again
         

        Choose which file to push to

        Select repo
        Refresh Authorize more repos
        Select branch
        Select file
        Select branch
        Choose version(s) to push
        • Save a new version and push
        • Choose from existing versions
        Include title and tags
        Available push count

        Pull from GitHub

         
        File from GitHub
        File from HackMD

        GitHub Link Settings

        File linked

        Linked by
        File path
        Last synced branch
        Available push count

        Danger Zone

        Unlink
        You will no longer receive notification when GitHub file changes after unlink.

        Syncing

        Push failed

        Push successfully