<!--- panacus 0.2 https://anaconda.org/bioconda/panacus --->

# MemPanG23 - Day 2b - Pangenome openness and nf-core/pangenome

#### NOTE: Allow starting a web server for browsing outputs

Let's get a web server running. Log in again to your virtual machine from another terminal and run:

```shell
python -m http.server 889N # replace N with your user number (for example: N = 1 for user 1)
```

You can access this by pointing your web browser at `http://<your_ip>:889N/`, where `<your_ip>` is the IP address of your instance. `curl ifconfig.me` can show you your IP address.

<br/>

##### TEMPORARY SOLUTION TO DOWNLOAD FILES ON YOUR COMPUTER

```shell
rsync -av user1@<YOUR_MACHINE>:/path/of/your/file .
```

## Pangenome openness

### Learning objectives

In this exercise you learn how to

- evaluate and interpret the openness of a pangenome.

### Getting started

Make sure you have `panacus` installed. It is already available on the course VMs.

Now create a directory to work in for this tutorial:

```shell
cd ~
mkdir day2_openness
cd day2_openness
```

Download the 44-haplotype chrM HPRC human pangenome graph ([Liao, Asri, Ebler et al., 2023](https://doi.org/10.1038/s41586-023-05896-x)) from the [HPRC Pangenome Resources](https://github.com/human-pangenomics/hpp_pangenome_resources) and the 50-haplotype *E. coli* pangenome graph ([Garrison, Guarracino et al., 2023](https://www.biorxiv.org/content/10.1101/2023.04.05.535718v1)):

```shell
mkdir chrM
cd chrM
wget -c https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/pggb/chroms/chrM.hprc-v1.0-pggb.gfa.gz
gunzip chrM.hprc-v1.0-pggb.gfa.gz
mv chrM.hprc-v1.0-pggb.gfa chrM.gfa
cd ..
mkdir ecoli50
cd ecoli50
wget -c https://zenodo.org/record/7937947/files/ecoli50.gfa.zst
zstd -d ecoli50.gfa.zst
cd ..
```

### `odgi heaps`

`odgi heaps` calculates permutations of the path pangenome coverage in the graph. The number of permutations affects the accuracy of the subsequent power law regression. This regression is performed by this [R script](https://github.com/pangenome/odgi/blob/master/scripts/heaps_fit.R), which uses Heaps' law ([Tettelin et al., 2008](https://www.sciencedirect.com/science/article/pii/S1369527408001239?via=ihub#section0020)) to calculate a pangenome growth curve from all `odgi heaps` permutations. For more details, take a look at https://en.wikipedia.org/wiki/Pan-genome#Classification.
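For orientation, here is a minimal sketch of the power-law ("Heaps' law") model behind this classification. The notation is generic, and the exact parametrization used by `heaps_fit.R` may differ:

$$
n(N) \approx \kappa \, N^{-\alpha} \qquad\text{and}\qquad P(N) \approx \kappa' \, N^{\gamma},
$$

where $n(N)$ is the number of new features (here: bases) contributed by the $N$-th added genome and $P(N)$ is the total pangenome size after $N$ genomes. Following Tettelin et al., $\alpha > 1$ indicates a closed pangenome and $\alpha \le 1$ an open one; correspondingly, a $\gamma$ close to $0$ points to a closed pangenome and a $\gamma$ clearly between $0$ and $1$ to an open one.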
### `panacus`

`panacus` is able to calculate the pangenome openness without the need to perform any permutations. Instead, it directly applies the binomial formula described in [Parmigiani et al., 2022](https://www.biorxiv.org/content/10.1101/2022.11.15.516472v2), Section 2.1, Eq. 1.

`panacus` exposes a parameter (`-c`) that allows users to choose which graph feature (sequence, node, edge) is taken into account when calculating the growth histogram. The coverage parameter `-l` sets the _minimum number_ of haplotypes that must visit a graph feature for it to be included in the calculation at all. With `-q` one can set the _minimum fraction_ of haplotypes that must share a graph feature *after each time a haplotype is added to the growth histogram*.

For example, assuming a 100-haplotype pangenome graph, setting `-c bp -q 0,1,0.5,0.1` would calculate the pangenome growth in sequence space (`-c bp`) for 4 different cases. Remember, the quorum sets the minimum fraction of haplotypes that must traverse a nucleotide for it to be included in the openness calculation:

- without setting any quorum (`-q 0`), so all sequences are considered,
- limited to sequences that are traversed by 100% of the haplotypes (`-q 1`); this is the `core pangenome`,
- limited to sequences that are traversed by at least 50% of the haplotypes (`-q 0.5`); this is the `shell pangenome`,
- limited to sequences that are traversed by at least 10% of the haplotypes (`-q 0.1`); this is the `cloud pangenome`.

`panacus` fits two curves to the data, one following Heaps' law and one following a power law.
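To see why no permutations are needed, here is a sketch of the underlying counting idea (a simplified, quorum-free view; the exact expression is Eq. 1 of Parmigiani et al., 2022). A feature covered by $c$ of the $n$ haplotypes is missing from a random subset of $m$ haplotypes with probability $\binom{n-c}{m}\big/\binom{n}{m}$, so its expected contribution to the growth curve at step $m$ is

$$
1 - \frac{\binom{n-c}{m}}{\binom{n}{m}} .
$$

Summing this over all features (weighted by their length in sequence space) gives the average growth over *all* possible orderings in a single pass, which a permutation-based approach such as `odgi heaps` can only approximate by sampling.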
### A Closed Pangenome?

Create the matrix of path pangenome coverage permutations of the chrM graph with `odgi heaps`, then perform the Heaps' law regression:

```shell
cd chrM
odgi heaps -i chrM.gfa -S -n 100 > chrM.gfa.heaps.tsv
Rscript PATH_TO_ODGI_GIT_REPOSITORY/scripts/heaps_fit.R chrM.gfa.heaps.tsv chrM.gfa.heaps.tsv.pdf # this should be executed on your local machine
```

Taking a look at the PDF, we can surprisingly observe 2 traces of permutations. *How can this happen?*

<details>
<summary>**Hint**</summary>

Take a look at the 1D visualization of the graph.

</details>

<details>
<summary>**Explanation**</summary>

The CHM13 reference was linearized differently than all the other mitochondrial sequences. Therefore it has an additional `tip`. The `odgi heaps` permutation algorithm reflects this, because each permutation always starts at the beginning of each genome.

</details>

So, to get a better pangenome growth curve with `odgi heaps`, we remove the CHM13 reference sequence and run the analysis again:

```shell
odgi paths -i chrM.gfa -L | head -n 1 > chrM.gfa.chm13
odgi paths -i chrM.gfa -X chrM.gfa.chm13 -o chrM.gfa.no_chm13.og
odgi heaps -i chrM.gfa.no_chm13.og -S -n 100 > chrM.gfa.no_chm13.og.heaps.tsv
Rscript PATH_TO_ODGI_GIT_REPOSITORY/scripts/heaps_fit.R chrM.gfa.no_chm13.og.heaps.tsv chrM.gfa.heaps.no_chm13.og.tsv.pdf # this should be executed on your local machine
```

This looks much better! With every added genome, the number of newly added bases is really low. Let's take a closer look with `panacus`.

Create and visualize a growth histogram of the chrM graph in sequence space with `panacus`:

```shell
RUST_LOG=info panacus histgrowth chrM.gfa -c bp -q 0,1,0.5,0.1 -t 16 > chrM.gfa.histgrowth.tsv
panacus-visualize.py -e -l "lower right" chrM.gfa.histgrowth.tsv > chrM.gfa.histgrowth.tsv.pdf
```

<details>
<summary>**See PNG**</summary>

![](https://hackmd.io/_uploads/rkRXGBvHn.png)

</details>

Taking a look at the top figure in the PDF, there is a nearly asymptotic growth of the number of nucleotides with an increasing number of genomes. <!--- A value of `γ=0.019` indicates a closed pangenome, because $γ$ is very close to $0$. ---> The bottom figure shows the number of added bases per added individual, which is basically 0 for all but the first one. In conclusion, chrM is a closed pangenome. The core pangenome (`quorum >= 1`) is very large, the shell pangenome (`quorum >= 0.5`) is only slightly larger, and the cloud pangenome (`quorum >= 0.1`) is not noticeably larger.

### An Open Pangenome

Create the matrix of path pangenome coverage permutations of the *E. coli* graph with `odgi heaps`, then perform the power law regression:

```shell
cd ~/day2_openness/ecoli50
odgi heaps -i ecoli50.gfa -S -n 100 -t 16 > ecoli50.gfa.heaps.tsv
Rscript heaps_fit.R ecoli50.gfa.heaps.tsv ecoli50.gfa.heaps.tsv.pdf
```

With every added genome, the number of newly added bases is at least 100k. Let's take a closer look with `panacus`.

Create and visualize a growth histogram of the *E. coli* graph in sequence space with `panacus`:

```shell
RUST_LOG=info panacus histgrowth ecoli50.gfa -c bp -q 0,1,0.5,0.1 -t 16 > ecoli50.gfa.histgrowth.tsv
panacus-visualize.py -e -l "upper left" ecoli50.gfa.histgrowth.tsv > ecoli50.gfa.histgrowth.tsv.pdf
```

<details>
<summary>**See PNG**</summary>

![](https://hackmd.io/_uploads/BJG17rvr3.png)

</details>

Taking a look at the top figure in the PDF, a polynomial growth of the number of nucleotides with an increasing number of genomes is visible. <!--- The value `1>γ=0.386>0` indicates an open pangenome. ---> The bottom figure shows the number of added bases per added individual. Even the 50th added individual adds more than 100k new bases to the pangenome. In conclusion, the *E. coli* pangenome is open. The core pangenome (`quorum >= 1`) is quite small, with the shell pangenome (`quorum >= 0.5`) not adding much more sequence, and the cloud pangenome (`quorum >= 0.1`) adding some more sequence.

#### Acknowledgments

The MemPanG23 team wants to thank [Daniel Doerr](https://danieldoerr.de/) for revising and polishing the `Pangenome openness` part of the tutorial.

## nf-core/pangenome

### Learning objectives

In this exercise you learn how to

- run an [nf-core](https://nf-co.re/) [Nextflow](https://www.nextflow.io/) pipeline,
- configure the resources according to what is available,
- deal with alternative parameter names,
- understand the [nf-core/pangenome](https://github.com/nf-core/pangenome) pipeline's output:
  - [MultiQC](https://multiqc.info/),
  - used CPU, RAM, ...,
  - workflow timeline,
  - output folders,
  - where are my communities?

### Getting started

Make sure you have `screen`, `wget`, `git`, `Nextflow`, and `Docker` installed. All tools are already available on the course VMs.

If you haven't done so before, clone the `pggb` repository into your home folder:

```shell
cd ~
git clone https://github.com/pangenome/pggb.git
```

Now create a directory to work in for this tutorial:

```shell
mkdir day2_nextflow
cd day2_nextflow
ln -s ~/pggb/data
```

Download the 8-haplotype yeast FASTA sequences ([Yeast Population Reference Panel](https://yjx1217.github.io/Yeast_PacBio_2016/welcome/)):

```shell
wget https://zenodo.org/record/7933393/files/scerevisiae8.fa.gz
```

One can distribute the available compute resources efficiently across the different processes of the Nextflow pipeline using [config](https://www.nextflow.io/docs/latest/config.html) files. During the course you have access to 16 threads with 32 gigabytes of memory. To ensure that each run only consumes up to these resources and does not take them away from other course participants, please create the following config file:

```shell
mempang23_config="executor {
    cpus = 16
    memory = 32.GB
}"
echo "$mempang23_config" > mempang23.config
```

### Building an LPA pangenome graph with nf-core/pangenome

Whilst we can limit the maximum allocatable resources with `mempang23.config`, one can assign resources to each step of the pipeline using a different config file:

```shell
wget https://raw.githubusercontent.com/nf-core/pangenome/a_brave_new_world/conf/hla.config
```

Check it out!
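If you cannot check it out right away, the sketch below shows the general shape of a process-scoped Nextflow config, written in the same shell-variable style as `mempang23.config` above. The only value confirmed by this tutorial is that `hla.config` gives `WFMASH_ALIGN` 4 CPUs; the memory value and the file name are illustrative assumptions, not the actual contents of `hla.config`:

```shell
# Sketch of a process-scoped config (illustrative, not the verbatim hla.config).
# Only the 4 CPUs for WFMASH_ALIGN are confirmed by the tutorial text.
example_config="process {
    withName: WFMASH_ALIGN {
        cpus   = 4
        memory = 8.GB
    }
}"
echo "$example_config" > example_process.config
```

Several such files can be passed to `-c` as a comma-separated list, as in the commands below, so a global cap like `mempang23.config` and per-process settings like `hla.config` can be combined.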
Let's build an LPA pangenome graph. If you are interested in setting additional parameters, you can always visit https://nf-co.re/pangenome/a_brave_new_world/parameters for details. All parameters starting with a single dash (`-`) are handed over to Nextflow, while all parameters starting with two dashes (`--`) are handled by the pipeline itself:

```shell
nextflow run nf-core/pangenome -r a_brave_new_world -profile docker -c mempang23.config,hla.config --input data/LPA/LPA.fa.gz --n_haplotypes 14 --outdir LPA14
```

*In which folder can the final ODGI graph be found? And in which folder is the final GFA graph?*

<details>
<summary>**See Solution**</summary>

- ODGI: *odgi_sort*
- GFA: *odgi_view*

</details>

Open the MultiQC report and other statistics on your local machine in order to take a closer look:

```shell
LPA14/multiqc/multiqc_report.html
LPA14/pipeline_info/execution_*.html
LPA14/pipeline_info/pipeline_dag_*.html
```
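One way to get these files onto your computer is the `rsync` workaround from the beginning of this tutorial; a possible invocation is sketched below. The user number, machine address, and the `day2_nextflow` working directory are placeholders you need to adapt:

```shell
# Run on your local machine; adjust user number, host, and paths.
rsync -av user1@<YOUR_MACHINE>:day2_nextflow/LPA14/multiqc/multiqc_report.html .
rsync -av "user1@<YOUR_MACHINE>:day2_nextflow/LPA14/pipeline_info/execution_*.html" .
rsync -av "user1@<YOUR_MACHINE>:day2_nextflow/LPA14/pipeline_info/pipeline_dag_*.html" .
```

Alternatively, you can browse the `LPA14` folder through the web server started at the beginning of the session.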
In the MultiQC report you will find vital graph statistics, lots of 1D graph visualizations, and a 2D graph visualization, serving both as quantitative and qualitative graph validation information. In `execution_report_*.html` you can find an overview of the executed pipeline and especially the resource consumption of each process of the pipeline. If you notice that a process is consuming much less RAM than it was given in `hla.config`, you might want to adjust this. Assuming you want to run `nf-core/pangenome` on a cluster, it is crucial to limit the allocated resources for each process, so your jobs usually have a higher chance of being scheduled by the cluster scheduler. In `execution_timeline_*.html` one can observe when each process was executed and which processes were submitted in parallel, assuming resources were available.

Also take a look at all the output folders. *In which one is the final graph stored?* If you are not sure, take a look at the `pipeline_dag_*.html` file.

### Parallelizing base-level alignments across a cluster

One advantage that `nf-core/pangenome` has over `pggb` is that it can parallelize the often heavy base-level alignments across the nodes of a cluster. The parameter `--wfmash_chunks` determines into how many equally large subproblems the alignments should be split after the `WFMASH_MAP` process. It is recommended that this number roughly matches the number of available nodes. During the course, a full cluster is not available, so we are improvising: in `hla.config` the number of CPUs for `WFMASH_ALIGN` is set to 4. Assuming we are able to run this in parallel on our simulated 16-thread/32 GB machine, one can expect that at most 4 `WFMASH_ALIGN` processes can be executed in parallel. For this lesson it is sufficient to only execute the alignment step of the pipeline, which is set with `--wfmash_only`:

```shell
nextflow run nf-core/pangenome -r a_brave_new_world -profile docker -c mempang23.config,hla.config --input data/LPA/LPA.fa.gz --n_haplotypes 14 --outdir LPA14_wfmash --wfmash_only --wfmash_chunks 4
```

Examine the `execution_timeline_*.html` to find out if it worked. If you are interested, you can play around with the resources in `hla.config` and see how this affects the parallelism.

### Building a community yeast pangenome

In the previous tutorials, you learned how to partition the yeast sequences into communities manually. The `nf-core/pangenome` pipeline is able to do this automatically. On top of that, it will create a pangenome graph for each of the communities on the fly, merging all of them into one final graph.

Before we create the yeast graph, let's open a `screen` session, since the graph construction will take ~8 minutes:

```shell
screen -R yeast
```

Create your own `yeast.config`. You can start with the `hla.config` one:

```shell
cp hla.config yeast.config
```

Modify it as you see fit. Once you have run the pipeline, or even during the run, you may have to adjust it (e.g., because you did not reserve enough resources for a specific process). Let's start building:

```shell
nextflow run nf-core/pangenome -r a_brave_new_world -profile docker -c mempang23.config,yeast.config --input scerevisiae8.fa.gz --n_haplotypes 8 --outdir yeast8 --wfmash_map_pct_id 95 --communities --smoothxg_poa_length "1100,"
```

Since this will take some time, you can watch all the steps, or you can check out what's happening in the background. Type `CONTROL+A+D` in order to detach from the `screen` session. With `less .nextflow.log` you can observe what Nextflow is actually doing. If you scroll down to the bottom, you can see in which work folder each process is storing its files and commands. You can `cd` in there and take a look at, for example, `.command.sh` or `.command.log`.

You can always go back to the `screen` session via `screen -r yeast`. Once there, type `CONTROL+C` several times in order to abort the pipeline run. Nextflow stores all intermediate files, so we can just continue again with `-resume`:

```shell
nextflow run nf-core/pangenome -r a_brave_new_world -profile docker -c mempang23.config,yeast.config --input scerevisiae8.fa.gz --n_haplotypes 8 --outdir yeast8 --wfmash_map_pct_id 95 --communities --smoothxg_poa_length "1100," -resume
```

Once the pipeline is done, you can take a look at the MultiQC report of each community and the final graph(s).

### Bonus Challenge

Configure the config file(s) for the lowest run time possible. You can even collaborate with your colleagues on the same VM, executing only one Nextflow instance.
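As a purely illustrative starting point for the bonus challenge: if, say, two participants on the same VM pool their shares, the global cap could be raised in the same way `mempang23.config` was written above. The numbers below simply add up two 16-thread/32 GB shares; they are an assumption, not a recommended setting, so coordinate with your colleagues before claiming more resources:

```shell
# Assumption: two participants pooling their 16-thread/32 GB shares into one run.
pooled_config="executor {
    cpus = 32
    memory = 64.GB
}"
echo "$pooled_config" > pooled.config
```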
