Anton Nekrutenko

@nekrut

Joined on Sep 28, 2020

  • In the previous lecture we saw the principle behind dynamic programming. This approach is extremely useful for comparing biological sequences, which is coincidentally one of the main points of this course. This lecture explains how this is done. In writing this text I relied heavily on the wonderful course taught by Ben Langmead at Johns Hopkins. The cover image shows pairwise alignments for the human, mouse, and dog KIF3 locus from Dubchak et al. 2000. How different are two sequences? Suppose you have two sequences of the same length: A C A T G C C T A and A C T G C C T A C. How different are they? In other words, how many bases should be changed to turn one sequence into the other: A C A T G C C T A
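For two sequences of the same length, the number of bases that must change to turn one into the other is the Hamming distance. The sketch below counts mismatching positions for the two sequences from the preview above. The function name `hamming_distance` is chosen here for illustration and does not come from the lecture.

```python
def hamming_distance(seq_a: str, seq_b: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("Hamming distance is defined only for equal-length sequences")
    # zip pairs up bases position by position; each mismatch counts as 1
    return sum(a != b for a, b in zip(seq_a, seq_b))


# The two sequences from the text, written without spaces
first = "ACATGCCTA"
second = "ACTGCCTAC"
distance = hamming_distance(first, second)
print(distance)
```

This only handles substitutions. Sequences of different lengths, or sequences related by insertions and deletions, need the dynamic programming approach the lecture goes on to describe.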
  • Introduction to Data Driven Life Sciences | Spring 2025 https://xkcd.com/2054/ Place and Time 005 Life Sciences | Tuesday, Thursday 10:35am - 11:50am EST :::success The class is in person only. However, for those who are located in Hershey or unable to attend a particular lecture<sup>1</sup> a Zoom link is provided. :::
  • Log in to https://usegalaxy.org Import this history: https://usegalaxy.org/u/cartman/h/d3 Run this workflow: https://usegalaxy.org/u/cartman/w/wf-ui-bof Look at the invocation report and discuss it!
  • Based on Laila Los' presentation at GCC 23 as well as previous styles used by James and Anton. This guide is intended for infographics that will be used on Galaxy sites (in the center carousel) and all other graphical materials such as tweets, flyers, etc. :::warning This is a draft to be discussed at the next outreach meeting. ::: Colors Use the following colors only:
  • Quiz Prep Go to https://colab.research.google.com Log in with your PSU account and create a new notebook Complete the quiz Share your notebook with aun1@psu.edu. Make sure you set me as an editor: Questions Question 1 (50 pts) Using a for loop, write code that will calculate the sum of the numbers 1 through 100.
  • The problem The difficulty with sequencing nucleic acids is nicely summarized by Hutchinson:2007: The chemical properties of different DNA molecules were so similar that it appeared difficult to separate them. The chain length of naturally occurring DNA molecules was much greater than for proteins and made complete sequencing seem unapproachable. The 20 amino acid residues found in proteins have widely varying properties that had proven useful in the separation of peptides. The existence of only four bases in DNA therefore seemed to make sequencing a more difficult problem for DNA than for protein. No base-specific DNAases were known. Protein sequencing had depended upon proteases that cleave adjacent to certain amino acids. It is therefore not surprising that protein sequencing was developed before DNA sequencing by Sanger and Tuppy:1951.
  • Syllabus - About this course Lecture 1 - Introduction and History Lecture 2 - Shell I Lecture 3 - Shell II Lecture 4 - Intermission + History of Sequencing Lecture 5 - Python 1 - Variables, expressions, statements, functions Lecture 6 - Python 2 - Strings and lists and FASTQ Lecture 7 - Python 3 - A more careful look at lists and dictionaries Lecture 8 - Python 4 - Recap of what we learned so far
  • --- tags: BMMB554-23 --- # Lecture 28: Why academia is important <iframe src="https://docs.google.com/presentation/d/e/2PACX-1vSZCGlYuPouXTAr918Ipy6qsU4vatfK1IBWIBNTHBM7drMj8raDYK3S7d6pijOjmY--3n3OuxlPUxxf/embed?start=false&loop=false&delayms=60000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>
  • An overview of bulk RNAseq An overview of ShapeSeq An overview of ribosomal profiling Galactic resources https://training.galaxyproject.org/training-material/topics/transcriptomics/
  • Given a set of replicates from a single experimental condition, estimate the chances of parallel mutations occurring by chance. The overall design Suppose you are adapting E. coli to a new environment (high salinity, a particular antibiotic, and so on). You have several replicates (parallel experiments) for treatment (blue) and control (red): After N generations you sequence the last timepoint of your experiment. This gives you the following information: mutation rate per site per generation Tn/Ts ratio genes that contain mutations (mutations and how many replicates share this mutation)
  • Notes on microbial population genetics From Dykhuizen & Hartl, 1983 Particularities of variant calling Indels = trouble Left-aligning of indels (a variant of re-aligning) is extremely important for obtaining accurate variant calls. To illustrate how left-aligning works, we expanded on an example provided by Tan et al. 2015. Suppose you have a reference sequence and a sequencing read: Reference GGGCACACACAGGG Read GGGCACACAGGG
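The left-aligning idea can be sketched for the reference and read shown in the preview above. The read is missing one `CA` unit, and that deletion can be placed at several positions within the `CACACACA` repeat without changing the resulting read. Left-aligning picks the leftmost placement. This is a simplified illustration only: the function name and the 0-based `(start, length)` representation are assumptions made here, and it is not the procedure from Tan et al. 2015 or from any particular variant caller.

```python
def left_align_deletion(ref: str, start: int, length: int) -> int:
    """Shift a deletion of ref[start:start+length] as far left as possible.

    The deletion can move one base to the left whenever the base just
    before it equals the last deleted base, because the sequence left
    after the deletion is then unchanged.
    """
    while start > 0 and ref[start - 1] == ref[start + length - 1]:
        start -= 1
    return start


def apply_deletion(ref: str, start: int, length: int) -> str:
    """Return the reference with ref[start:start+length] removed."""
    return ref[:start] + ref[start + length:]


reference = "GGGCACACACAGGG"
read = "GGGCACACAGGG"

# One possible (right-most) placement: delete the last "CA" of the repeat
right_start, del_len = 9, 2
left_start = left_align_deletion(reference, right_start, del_len)

# Both placements reproduce the same read
print(left_start, apply_deletion(reference, left_start, del_len) == read)
```

Because every placement inside the repeat yields the same read, different aligners can report the same deletion at different coordinates. Normalizing to the leftmost position makes such calls comparable.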
  • Logistics After all we will have individual projects. Here are the logistical details. Pick a project from the list below. If you want, you can propose your own project as well. Fill in this poll to indicate: Which project you select Time you can meet with me to discuss the details Projects
  • Based on the course taught by Ben Langmead at Johns Hopkins. The cover image is from the Wikipedia article on the Burrows-Wheeler transform. The challenge of really large datasets You have a lot of reads and you need to align them against the genome. You can use classical alignment approaches based on the dynamic programming logic (which we will discuss in the next lecture). It will work, but the performance of these approaches expressed in big O notation is: $\mathcal{O}(mn)$ where $n$ is the read length and $m$ is the genome length. $m$ can be very large. For instance, in the case of the human genome it is $3\times10^9$. On top of that we have very many reads. The latest Illumina NovaSeq 6000 machine can produce as many as 20 billion 150 bp reads. Taking this into account our big O notation becomes: $\mathcal{O}(dmn)$
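To get a feel for the scale of $\mathcal{O}(dmn)$, the numbers quoted above can be multiplied out. This is back-of-the-envelope arithmetic only. It assumes that $d$ stands for the number of reads (the preview does not define it), and the variable names are chosen here for illustration.

```python
# Figures quoted in the text
genome_length = 3 * 10**9      # m: human genome length in bases
read_length = 150              # n: read length in bases
num_reads = 20 * 10**9         # d: number of reads (assumed meaning of d)

# Dynamic programming fills one matrix cell per (genome base, read base) pair
cells_per_read = genome_length * read_length
total_cells = num_reads * cells_per_read

print(f"cells per read: {cells_per_read:.1e}")
print(f"cells for all reads: {total_cells:.1e}")
```

Even at an optimistic billion cells per second, a workload of this size would take far longer than any practical run, which is the motivation for index-based approaches such as the Burrows-Wheeler transform.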
  • Wrapping up assembly Projects Logistics After all we will have individual projects. Here are the logistical details. Pick a project from the list below. If you want, you can propose your own project as well. Next class (April 11th) I will circulate a sign-up sheet where you will: a. Indicate what project you decided to pick b. Reserve a time to meet with me to discuss details (if necessary - this is optional)
  • Before we begin, some fundamental terms: Contig - A sequence reconstructed by assembling together sequencing reads Scaffold - An ordered collection of contigs. The sequence within the gaps between the contigs is usually not known. N50 - A statistic used for assessing the contiguity of a genome assembly. The contigs in an assembly are sorted by size and added, starting with the largest. The reported value is the size of the contig that brings the running total to at least 50% of the genome size. <small>Image credit = Mike Schatz</small> Galaxy histories containing zebra finch assembly are here
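The N50 definition above translates into a few lines of Python. This is a minimal sketch: the function name `n50` and the optional `genome_size` argument are assumptions made here. When `genome_size` is omitted, the total assembly length is used as the reference size, which is a common convention but not something the text states.

```python
def n50(contig_lengths, genome_size=None):
    """Return the N50 of a list of contig lengths.

    Contigs are sorted from largest to smallest and summed; the length of
    the contig that brings the running total to at least half of the
    reference size is returned.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig")
    # Assumption: fall back to the total assembly length if no genome size is given
    reference_size = genome_size if genome_size is not None else sum(contig_lengths)
    running_total = 0
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if 2 * running_total >= reference_size:
            return length
    # The assembly covers less than half of the supplied genome size
    return None


# Hypothetical contig lengths, for illustration only
example_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(example_contigs))
```

With these hypothetical lengths the total is 54, and the running sum 10 + 9 + 8 = 27 reaches half of it at the contig of length 8.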
  • The Galaxy project has been recognised by major global projects in the last years. It has proliferated into multiple new directions. Positive: The working groups seem to work well and major changes could be implemented The various UI efforts in the last year demonstrated the real potential of our community and team The project and the community are able to quickly adapt to new scientific use-cases and software development initiatives (e.g., VGP, Pha4ge ...) It is a global effort with true community support and major funding for main instances Negative: The team does not grow at the same speed. Having money does not make it easy to hire, certain grant aims are lagging behind
  • :::info Somehow I missed this article. Take a look. It describes new players on the sequencing market. ::: The following is based on three primary sources: Ben Langmead's Teaching Materials Pevzner and Compeau Bioinformatics Book (referred to in the text as "CP") Mike Schatz's Teaching Materials
  • Links Galaxy Hub - https://galaxyproject.org/ Galaxy Training Network - https://training.galaxyproject.org/ Galaxy US - https://usegalaxy.org Galaxy EU - https://usegalaxy.eu Galaxy Australia - https://usegalaxy.org.au Galaxy GitHub home - https://github.com/galaxyproject History and current state
  • From the Science issue containing Levene et al. 2003. PacBio was the second (after Helicos) single molecule sequencing (SMS) technology on the market. Today it is an important technology enabling assembly of complex genomes and transcriptomes. Fundamentals The era of Pacific Biosciences begins with a publication by Eid et al. 2009. Yet, as was the case with other technologies, it did not appear out of thin air. One of the key publications (written by some of the same authors) predating the birth of PacBio is the one by Levene et al. 2003. In particular it had the following figure: Figure 1 from Levene et al. 2003. An apparatus for single-molecule enzymology using zero-mode waveguides. The overall idea of this device is that it can detect fluorescent ligands (such as, for example, labeled dNTPs) in a very small volume: 10<sup>-21</sup> litre. Specifically, envision a glass slide fused to a metal (aluminum) film with tiny holes. In fact, the diameter of the holes is smaller than the wavelength of light that is used to illuminate the slide. As a result only molecules at the bottom of this "hole" will be detectable. This makes it possible to track single-molecule dynamics. Now imagine that you put a single DNA polymerase molecule at the bottom of such a "hole". The polymerase will be pulling dNTPs close to the bottom of the "hole" as it performs the polymerization reaction. Thus, only nucleotides that are being added to the DNA strand will be detected by the device at any given time. If you now make a movie of this process, you are recording real-time polymerization kinetics. And this is exactly what the PacBio process does:
  • :::info This text and figures are based on the official Oxford Nanopore documentation from the official nanoporetech site and official videos. ::: The basic principle of the Oxford Nanopore Technology (ONT) is straightforward. Molecules are driven through a pore in a membrane. As a molecule moves through the membrane, it blocks the current. In fact, the extent of "blockage" is sequence specific, generating current change profiles that can be "translated" into sequence. Length-specific translocation speed It began as a technique for measuring polynucleotide length Blockage can be used to identify nucleotides