C. Titus Brown
    • Create new note
    • Create a note from template
      • Sharing URL Link copied
      • /edit
      • View mode
        • Edit mode
        • View mode
        • Book mode
        • Slide mode
        Edit mode View mode Book mode Slide mode
      • Customize slides
      • Note Permission
      • Read
        • Only me
        • Signed-in users
        • Everyone
        Only me Signed-in users Everyone
      • Write
        • Only me
        • Signed-in users
        • Everyone
        Only me Signed-in users Everyone
    • Invite by email
      Invitee

      This note has no invitees

    • Publish Note

      Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note No publishing access yet

      Your note will be visible on your profile and discoverable by anyone.
      Your note is now live.
      This note is visible on your profile and discoverable online.
      Everyone on the web can find and read all notes of this public team.

      Your account was recently created. Publishing will be available soon, allowing you to share notes on your public page and in search results.

      Your team account was recently created. Publishing will be available soon, allowing you to share notes on your public page and in search results.

      Explore these features while you wait
      Complete general settings
      Bookmark and like published notes
      Write a few more notes
      Complete general settings
      Write a few more notes
      See published notes
      Unpublish note
      Please check the box to agree to the Community Guidelines.
      View profile
    • Commenting
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
      • Everyone
    • Suggest edit
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
    • Emoji Reply
    • Enable
    • Versions and GitHub Sync
    • Note settings
    • Note Insights New
    • Make a copy
    • Transfer ownership
    • Delete this note
    • Save as template
    • Insert from template
    • Import from
      • Dropbox
      • Google Drive
      • Gist
      • Clipboard
    • Export to
      • Dropbox
      • Google Drive
      • Gist
    • Download
      • Markdown
      • HTML
      • Raw HTML
Menu Note settings Note Insights Versions and GitHub Sync Sharing URL Create Help
Create Create new note Create a note from template
Menu
Options
Make a copy Transfer ownership Delete this note
Import from
Dropbox Google Drive Gist Clipboard
Export to
Dropbox Google Drive Gist
Download
Markdown HTML Raw HTML
Back
Sharing URL Link copied
/edit
View mode
  • Edit mode
  • View mode
  • Book mode
  • Slide mode
Edit mode View mode Book mode Slide mode
Customize slides
Note Permission
Read
Only me
  • Only me
  • Signed-in users
  • Everyone
Only me Signed-in users Everyone
Write
Only me
  • Only me
  • Signed-in users
  • Everyone
Only me Signed-in users Everyone
  • Invite by email
    Invitee

    This note has no invitees

  • Publish Note

    Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note No publishing access yet

    Your note will be visible on your profile and discoverable by anyone.
    Your note is now live.
    This note is visible on your profile and discoverable online.
    Everyone on the web can find and read all notes of this public team.

    Your account was recently created. Publishing will be available soon, allowing you to share notes on your public page and in search results.

    Your team account was recently created. Publishing will be available soon, allowing you to share notes on your public page and in search results.

    Explore these features while you wait
    Complete general settings
    Bookmark and like published notes
    Write a few more notes
    Complete general settings
    Write a few more notes
    See published notes
    Unpublish note
    Please check the box to agree to the Community Guidelines.
    View profile
    Engagement control
    Commenting
    Permission
    Disabled Forbidden Owners Signed-in users Everyone
    Enable
    Permission
    • Forbidden
    • Owners
    • Signed-in users
    • Everyone
    Suggest edit
    Permission
    Disabled Forbidden Owners Signed-in users Everyone
    Enable
    Permission
    • Forbidden
    • Owners
    • Signed-in users
    Emoji Reply
    Enable
    Import from Dropbox Google Drive Gist Clipboard
       Owned this note    Owned this note      
    Published Linked with GitHub
    • Any changes
      Be notified of any changes
    • Mention me
      Be notified of mention me
    • Unsubscribe
    --- tags: ggg, ggg2026, ggg201b --- # Week 7 - De novo assembly, take2! - GGG 201B lab section: Genomics (Winter 2026) 2/20/26 Titus Brown, ctbrown@ucdavis.edu ## At the beginning * start zoom / recording! * clean up directory & conda envs * next week: guest lecture on RNAseq ## Learning objectives, hands-on section of week 7: At the end of today, * you'll have installed the megahit assembler, the quast assembly metrics analysis package, and the prokka bacterial annotation software * you'll have done an assembly and annotation of an E. coli genome * you'll have learned how to measure the computational requirements for a task But first... ## Start up RStudio Server on farm Go to: https://ondemand.farm.hpc.ucdavis.edu/ and log in. Now select RStudio, and fill in the following values: (**note: 10 GB, 8 cores**) Account: ctbrowngrp Partition: low **Number of cores: 8** **Amount of memory: 10** Conda environment: r-4.4.2 Number of hours: 3 leave everything else as-is, and click "Launch". ## Go to the "Terminal" tab in RStudio You should be at a prompt that looks something like this: ``` ctbrown@cpu-3-50:~/$ ``` ## Load conda ``` module load conda ``` ## Create a few conda environments ```bash conda create -p ~ctbrown/scratch3/2026-conda/$USER/megahit \ -y megahit quast conda create -p ~ctbrown/scratch3/2026-conda/$USER/prokka \ -y prokka conda create -p ~ctbrown/scratch3/2026-conda/$USER/sourmash \ -y sourmash ln -s ~ctbrown/scratch3/2026-conda/$USER/{megahit,prokka,sourmash} \ ~/.conda/envs ``` ### Cleaning up old disk space If you run out of disk space with the above, it is because despite my best efforts, conda is putting the downloaded packages in your home directory under `~/.conda`. These can safely be removed - run ``` conda clean --all ``` to do so, and then re-run any failed commands. You can also remove the variant calling directory which is reasonably large: ``` rm -fr ~/201b-variants ``` If you run into problems with a particular conda environment install that failed the first time you ran it, just remove the entire directory like so: ``` rm -fr ~ctbrown/scratch3/2026-conda/$USER/prokka ``` and then re-run conda. ### Measuring disk space usage The command ``` du -sk ~/.??* ~/* | sort -n ``` will tell you what is using how much disk space in your home directory. The command: ``` df -h ~/ ``` will tell you how much free disk space you have to work with. ## Activate the snakemake conda environment ```bash! conda activate snakemake ``` ### You should now be at a prompt that looks like this: ``` (snakemake) ctbrown@cpu-3-50:~/$ ``` ## Now! Grab a github repository We are going to use a new snakemake workflow to do de novo assembly: [link](https://github.com/ngs-docs/2026-ggg-201b-assembly) Let's copy all of [these files](https://github.com/ngs-docs/2026-ggg-201b-assembly) into our home directory: ```bash git clone https://github.com/ngs-docs/2026-ggg-201b-assembly.git \ ~/201b-assembly cd ~/201b-assembly ``` ## We need some files! Now let's link in some data files: ```bash ln -s ~ctbrown/data/ggg201b/SRR2584857_?.fastq.gz . ``` Note: '?' in the command, what ends up in the directory? ## Check that we have everything ready to go: Try doing a "dry run" with snakemake: ``` snakemake -n ``` What does it report? Dry runs are useful for debugging - snakemake will tell you what it's going to run, without actually running it. ## Run snakemake, look at the outputs ``` /usr/bin/time -v snakemake -j 8 --use-conda ``` It will take about 5 minutes and at the end you should see something like this: ``` Command being timed: "snakemake -j 8 --use-conda" User time (seconds): 845.95 System time (seconds): 27.35 Percent of CPU this job got: 293% Elapsed (wall clock) time (h:mm:ss or m:ss): 4:57.08 Average shared text size (kbytes): 0 Average unshared data size (kbytes): 0 Average stack size (kbytes): 0 Average total size (kbytes): 0 Maximum resident set size (kbytes): 913920 Average resident set size (kbytes): 0 Major (requiring I/O) page faults: 4 Minor (reclaiming a frame) page faults: 2987910 Voluntary context switches: 117661 Involuntary context switches: 99176 Swaps: 0 File system inputs: 831576 File system outputs: 2523368 Socket messages sent: 0 Socket messages received: 0 Signals delivered: 0 Page size (bytes): 4096 Exit status: 0 ``` So, this next bit is kind of a digression from assembly, but it's a pretty important computational point so I'm going to discuss it :). Let's look at the following lines: ``` User time (seconds): 845.95 System time (seconds): 27.35 Percent of CPU this job got: 293% Elapsed (wall clock) time (h:mm:ss or m:ss): 4:57.08 Maximum resident set size (kbytes): 913920 ``` The fourth line is how long it took, in minutes/seconds, according to the actual clock on the wall. So, you waited ~5 minutes. The fifth line is how much memory it required. This is essentailly a lower limit on how much memory you need to allocate via ondemand to run this workflow - in this case, about 1 GB (1 million kilobytes). The first three lines are trickier to explain, but it boils down to, in order: * how much CPU time was used by the user, to do read comparisons etc; * how much CPU time was used by the system, to do things like read/write from the disk; * how _many_ CPUs were used - in this case, about 3, because ~300%. So anyway this is how you figure out how "expensive" a particular task or set of tasks (in this case, the entire workflow) is. And this can guide how much memory and time you ask to reserve via OnDemand. ## Examining the assembly itself The primary output of this job is a bunch of assembled DNA contigs. Run: ``` head SRR2584857-assembly.fa ``` what do you see? Yeah, not that useful :laughing: ## Annotation The next step after genome assembly is genome _annotation_. Last week Dr. Soto showed you how to _view_ and explore annotated genome features, but first we have to generate them! Let's do so by running the [prokka](https://github.com/tseemann/prokka) tool. ``` snakemake --use-conda prokka_assembly -j 4 ``` Honestly, reading the text output of this is fascinating in terms of understanding the process! (This is saved in the .log file below). The important / interesting files output by prokka are all under the directory `SRR2584857_annot` and are: * SRR2584857_annot.txt - summary of output * SRR2584857_annot.err - summary of errors * SRR2584857_annot.faa - **protein sequences** * SRR2584857_annot.ffn - **the gene sequences** * SRR2584857_annot.gff - **GFF file containing features!** * SRR2584857_annot.tsv - spreadsheet of genes / annotations Additional files: * SRR2584857_annot.fna - the original assembly file, unchanged (but renamed :) * SRR2584857_annot.fsa - annotated assembly file (contigs named) * SRR2584857_annot.gbk - genbank format annotation * SRR2584857_annot.log - log of the output * SRR2584857_annot.sqn - not sure exactly * SRR2584857_annot.tbl - some other format ### Annotation basics Genome annotation is almost entirely based on: * simple models of genes, e.g. open reading frames (ORFs); * homology (inferred by sequence similarity) using either BLAST or HMMER; * other data sets (especially: RNAseq) * less reliable: eukaryotic gene finders Open reading frames are really simple: you find ATGs followed by long stretches of non-stop-codons, and you call it a gene! ### How annotation differs in euks and bacteria/archaea/viruses tl;dr Introns suck. Bacterial and viral gene prediction mostly Just Works, except when biological weirdness gets in the way (e.g. non-canonical amino acids). If you are annotating the genes in a eukaryotic genome, *especially* one with no close relatives that are well annotated, you will almost always want good comprehensive RNAseq. Here one big problem is that RNAseq is always of a particular tissue/mixture of tissues/life stage, so genes that are not expressed then/there won't be represented. There is also the problem that transcription from the genome is pretty noisy, so there are a lot of places that are transcribed but that (don't seem to be) genes. Often, protein-coding is used as a filter. But if you _just_ use sequence similarity, you almost always miss UTRs. So it always ends up being a mixture of computational techniques. Check out the following two papers for the actual process used to construct genes anotations from data - * Figure 1 of [Tissue resolved, gene structure refined equine transcriptome](https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-016-3451-2) * Figure 1 in [Identification of long non-coding RNA in the horse transcriptome](https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-017-3884-2) ## Stats for the assembly Let's take a look at the assembly statistics, as calculated by quast. Let's look at ``` cat SRR2584857_quast/report.txt ``` what does it tell us? In RStudio, open the file `icarus.html`. Let's explore the contig length distribution. Topics to cover - * N50 and L50 * why not just "mean"? ## Evaluating coverage The coverage formula for average coverage from shotgun sequencing is pretty simple: $$ C = B / G $$ where R is the total number of bases sequenced and G is the estimated genome size. The total number of bases sequenced can be broken down into: $$ B = R * L $$ where R is the number of reads and L is the average length of reads. To get the number of reads in a FASTQ file, you can run: ``` gunzip -c SRR2584857_1.fastq.gz | wc -l ``` and then divide that number by 4. To get your average read length, you can either run something like FastQC or just look at the FASTQ file like so: ``` gunzip -c SRR2584857_1.fastq.gz | head ``` G for this genome is about 4.5m. What's the coverage of these reads? ### Evaluate coverage with k-mers Let's use k-mers to look at the reads. Wee're going to use [sourmash](https://sourmash.readthedocs.io/) for this. First, use sourmash to turn the reads and assembly into k-mers: ``` snakemake --use-conda sketch_reads sketch_assembly -j 4 -p ``` Next, activate the sourmash environment install the abundhist plugin: ``` conda activate sourmash pip install sourmash_plugin_abundhist ``` and then run it like so: ``` sourmash scripts abundhist SRR2584857-reads.sig.zip --figure abundhist.png ``` you can also try: ``` sourmash scripts abundhist SRR2584857-reads.sig.zip \ -I SRR2584857-assembly.sig.zip \ --figure abundhist-genome.png ``` What do these figures show? Note that you can adjust the number of bins by adding `--bins`, e.g. `--bins 50`. Do these plots agree or disagree with your coverage estimate from just straight up math??

    Import from clipboard

    Paste your markdown or webpage here...

    Advanced permission required

    Your current role can only read. Ask the system administrator to acquire write and comment permission.

    This team is disabled

    Sorry, this team is disabled. You can't edit this note.

    This note is locked

    Sorry, only owner can edit this note.

    Reach the limit

    Sorry, you've reached the max length this note can be.
    Please reduce the content or divide it to more notes, thank you!

    Import from Gist

    Import from Snippet

    or

    Export to Snippet

    Are you sure?

    Do you really want to delete this note?
    All users will lose their connection.

    Create a note from template

    Create a note from template

    Oops...
    This template has been removed or transferred.
    Upgrade
    All
    • All
    • Team
    No template.

    Create a template

    Upgrade

    Delete template

    Do you really want to delete this template?
    Turn this template into a regular note and keep its content, versions, and comments.

    This page need refresh

    You have an incompatible client version.
    Refresh to update.
    New version available!
    See releases notes here
    Refresh to enjoy new features.
    Your user state has changed.
    Refresh to load new user state.

    Sign in

    Forgot password
    or
    Sign in via Google Sign in via Facebook Sign in via X(Twitter) Sign in via GitHub Sign in via Dropbox Sign in with Wallet
    Wallet ( )
    Connect another wallet

    New to HackMD? Sign up

    By signing in, you agree to our terms of service.

    Help

    • English
    • 中文
    • Français
    • Deutsch
    • 日本語
    • Español
    • Català
    • Ελληνικά
    • Português
    • italiano
    • Türkçe
    • Русский
    • Nederlands
    • hrvatski jezik
    • język polski
    • Українська
    • हिन्दी
    • svenska
    • Esperanto
    • dansk

    Documents

    Help & Tutorial

    How to use Book mode

    Slide Example

    API Docs

    Edit in VSCode

    Install browser extension

    Contacts

    Feedback

    Discord

    Send us email

    Resources

    Releases

    Pricing

    Blog

    Policy

    Terms

    Privacy

    Cheatsheet

    Syntax Example Reference
    # Header Header 基本排版
    - Unordered List
    • Unordered List
    1. Ordered List
    1. Ordered List
    - [ ] Todo List
    • Todo List
    > Blockquote
    Blockquote
    **Bold font** Bold font
    *Italics font* Italics font
    ~~Strikethrough~~ Strikethrough
    19^th^ 19th
    H~2~O H2O
    ++Inserted text++ Inserted text
    ==Marked text== Marked text
    [link text](https:// "title") Link
    ![image alt](https:// "title") Image
    `Code` Code 在筆記中貼入程式碼
    ```javascript
    var i = 0;
    ```
    var i = 0;
    :smile: :smile: Emoji list
    {%youtube youtube_id %} Externals
    $L^aT_eX$ LaTeX
    :::info
    This is a alert area.
    :::

    This is a alert area.

    Versions and GitHub Sync
    Get Full History Access

    • Edit version name
    • Delete

    revision author avatar     named on  

    More Less

    Note content is identical to the latest version.
    Compare
      Choose a version
      No search result
      Version not found
    Sign in to link this note to GitHub
    Learn more
    This note is not linked with GitHub
     

    Feedback

    Submission failed, please try again

    Thanks for your support.

    On a scale of 0-10, how likely is it that you would recommend HackMD to your friends, family or business associates?

    Please give us some advice and help us improve HackMD.

     

    Thanks for your feedback

    Remove version name

    Do you want to remove this version name and description?

    Transfer ownership

    Transfer to
      Warning: is a public team. If you transfer note to this team, everyone on the web can find and read this note.

        Link with GitHub

        Please authorize HackMD on GitHub
        • Please sign in to GitHub and install the HackMD app on your GitHub repo.
        • HackMD links with GitHub through a GitHub App. You can choose which repo to install our App.
        Learn more  Sign in to GitHub

        Push the note to GitHub Push to GitHub Pull a file from GitHub

          Authorize again
         

        Choose which file to push to

        Select repo
        Refresh Authorize more repos
        Select branch
        Select file
        Select branch
        Choose version(s) to push
        • Save a new version and push
        • Choose from existing versions
        Include title and tags
        Available push count

        Pull from GitHub

         
        File from GitHub
        File from HackMD

        GitHub Link Settings

        File linked

        Linked by
        File path
        Last synced branch
        Available push count

        Danger Zone

        Unlink
        You will no longer receive notification when GitHub file changes after unlink.

        Syncing

        Push failed

        Push successfully