VernaML

repo link

Overview

0. Data Collection / Annotation (RNAGLib)

  1. Retrieve set of RNAs from PDB
  2. Find RNA-(protein/rna/ligand/ion) interface positions
  3. Slice graphs into interfaces and non-interface
  4. Find labels for binding partners (later after we try binary prediction)
    This part of the project is now moved to another document. See RNAGLib
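
Step 2 above can be pictured as a distance test: an RNA residue counts as an interface position if any of its atoms lies within a cutoff of an atom belonging to a binding partner. The following is a minimal sketch in plain Python, assuming coordinates have already been parsed into simple dictionaries. The function name `find_interface_residues`, the data layout, and the toy coordinates are illustrative assumptions and not the code in interfaces.py; the 5 Angstrom default mirrors the cutoff mentioned in the Week 10 log.

```python
import math

def find_interface_residues(rna_atoms, partner_atoms, cutoff=5.0):
    """Return RNA residue ids with at least one atom within `cutoff`
    Angstroms of any partner atom.

    rna_atoms: dict mapping residue id -> list of (x, y, z) coordinates
    partner_atoms: list of (x, y, z) coordinates of the binding partner
    (hypothetical layout, chosen only for this sketch)
    """
    interface = set()
    for res_id, coords in rna_atoms.items():
        # Brute-force all-pairs check: fine for a sketch, slow on large structures.
        if any(math.dist(a, b) <= cutoff for a in coords for b in partner_atoms):
            interface.add(res_id)
    return interface

# Toy example with made-up coordinates
rna = {
    ("A", 1): [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    ("A", 2): [(20.0, 0.0, 0.0)],
}
ligand = [(3.0, 0.0, 0.0)]
interface = find_interface_residues(rna, ligand)
```

On real structures a spatial index avoids the all-pairs loop; Biopython's `Bio.PDB.NeighborSearch` is one option.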

1. Tune VeRNAl to produce a fingerprint

  1. Use the binary prediction task to experiment with VeRNAl parameters
  2. Start with prebuilt metagraphs
  3. Try building metagraphs with the interface graphs
    • if the metagraph has a higher proportion of interface motifs then we might get better results in predicting them
  4. Analyze motif importance
  5. Tune VeRNAl fuzziness

2. Train a model to predict RNA-(protein/rna/ligand/ion/functions)

  1. First predict with simple classifiers on motif fingerprint (SVM)
  2. Make prediction model without fingerprint using RGCN
  3. Incorporate fingerprints as node attributes in RGCN

3. Pipelining and Testing

  1. Refactor all code written
  2. Turn the whole thing into one Python pipeline
  3. Test all code written
  4. Update repo with clear instructions on how to use it

TODO

Motif Importance

  • Graph feature importance into histogram

  • Show other contexts of transcription case study motif

    • Use FR3D to search for motif in other structures
    • Graph its importance on other datasets
  • Update get_weights function to use l1 regularization

  • Take another look at the maximal-only motifs calculation

  • Analyse motif importance for other tools too

  • Rerun motif importance after RNAGLib data cleanup

  • Compute more classifier stats (F1, recall, precision)
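
The extra classifier stats listed above all follow from the counts of true positives, false positives, and false negatives. A minimal sketch in plain Python, assuming binary labels where 1 means interface and 0 means non-interface; the function name `binary_classification_stats` and the toy labels are illustrative and not part of the existing training code.

```python
def binary_classification_stats(y_true, y_pred):
    """Precision, recall and F1 for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    # Fall back to 0.0 when a denominator is zero (no predicted or no actual positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Toy labels, made up for illustration
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
stats = binary_classification_stats(y_true, y_pred)
```

If scikit-learn is already a dependency, `sklearn.metrics.precision_recall_fscore_support` computes the same quantities.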

Resources/Papers

Papers

What are RNA Motifs
VeRNAl Paper
RNAMigos
Maetschke & Yuan 2009
Maticzka et al 2014 (graphprot)
Zhang et al. 2016
Yan and Huang 2017
RNA drugs review
RNA 3D motif
RNA 3D motif Atlas (BGSU Motifs)
JAR3D
FR3D
CaRNAaval

Repos

Graph drawing in 3D
VeRNAl
Migos
RNAfold

Ideas

  • Make predictions at the graph level
  • Use SSE decomposition to predict long-range RNA-RNA-interactions

Citations

  • BGSU RNA Site
    • Leontis, N. B., & Zirbel, C. L. (2012). Nonredundant 3D Structure Datasets for RNA Knowledge Extraction and Benchmarking. In RNA 3D Structure Analysis and Prediction N. Leontis & E. Westhof (Eds.), (Vol. 27, pp. 281–298). Springer Berlin Heidelberg. doi:10.1007/978-3-642-25740-7_13

Results

Dataset        | # Graphs | Avg. # Nodes | # Edges | Download Link
RNA-RNA        |          |              |         |
RNA-Protein    |          |              |         |
RNA-Small Mol. |          |              |         |
RNA-Ion        |          |              |         |

Data

Progress / Log

Every week, update with new TODOs and move completed TODOs into the Log.

Week 10 (Interim Wrap up)

  • Receive feedback and redraft
  • Abstract
  • Submit to Yue Li by the following Tuesday
  • At the node level, try a one-hop expansion to collect the motifs of all nodes around that node
    • AUROC increase ~= 0.1
  • Make a data set with cutoff = 5 Angstroms
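
The one-hop idea above amounts to taking, for each node, the union of its own motifs and those of its direct neighbours. A minimal sketch, assuming the graph is available as an adjacency dictionary and motif membership as a dictionary of sets; the name `one_hop_motifs` and both data layouts are assumptions made for illustration.

```python
def one_hop_motifs(adjacency, node_motifs, node):
    """Union of motif ids for `node` and all of its direct neighbours."""
    motifs = set(node_motifs.get(node, ()))
    for neighbour in adjacency.get(node, ()):
        motifs |= set(node_motifs.get(neighbour, ()))
    return motifs

# Toy graph: n1 - n2, n1 - n3, n3 - n4 (made-up node and motif ids)
adjacency = {"n1": ["n2", "n3"], "n2": ["n1"], "n3": ["n1", "n4"], "n4": ["n3"]}
node_motifs = {"n1": {"m0"}, "n2": {"m1"}, "n3": set(), "n4": {"m2"}}
expanded = one_hop_motifs(adjacency, node_motifs, "n1")
```

A node with no motifs of its own (`n3` here) still picks up the motifs of its neighbours, which is the point of the expansion.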

Week 9 (interim report writing)

  • Introduction
  • Methods
    • Datasets
    • prepare_data
    • train
  • Results
    • datasets
    • model performance
    • visualizations
  • Discussion
  • References

Week 8

  • Write script to make motif fingerprints train/onehot_encoder.py
    • Overall goal: given a node/graph output a list of motifs which it belongs to
    • Build_onehot_nodes
    • Build_onehot_graphs
      • Graphs are a little trickier: need to slice and subset
  • Debug missing keys in ion graphs
  • Make binary predictions at node level and get accuracy estimates
  • Build onehots with Carnaval and 3d motif atlas libraries
  • k-fold cross validation
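
The fingerprint goal stated above (given a node or graph, output the motifs it belongs to) can be sketched as a 0/1 vector with one position per motif. This is an illustration under assumptions, not the contents of train/onehot_encoder.py: the names `onehot_nodes` and `onehot_graph`, the motif vocabulary, and the choice of an element-wise OR for the graph level are all hypothetical.

```python
def onehot_nodes(node_motifs, motif_ids):
    """Map each node to a 0/1 vector with one position per motif id."""
    index = {m: i for i, m in enumerate(motif_ids)}
    vectors = {}
    for node, motifs in node_motifs.items():
        vec = [0] * len(motif_ids)
        for m in motifs:
            if m in index:  # ignore motifs outside the chosen vocabulary
                vec[index[m]] = 1
        vectors[node] = vec
    return vectors

def onehot_graph(node_vectors, nodes, n_motifs):
    """Graph-level vector: element-wise OR over the nodes of a sliced subgraph."""
    vec = [0] * n_motifs
    for node in nodes:
        for i, bit in enumerate(node_vectors.get(node, [0] * n_motifs)):
            vec[i] |= bit
    return vec

# Toy motif vocabulary and memberships (made up)
motif_ids = ["m0", "m1", "m2"]
node_motifs = {"n1": {"m0", "m2"}, "n2": {"m1"}, "n3": set()}
node_vectors = onehot_nodes(node_motifs, motif_ids)
graph_vector = onehot_graph(node_vectors, ["n2", "n3"], len(motif_ids))
```

Because a node can belong to several motifs, the vectors are strictly multi-hot rather than one-hot.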

Week 7

  • fix README.md and errors with Mega
  • include some example images with chimera
    • Maybe draw some 3D structures with superposed graph. See this repo.
  • create PDB annotations file for graph level predictions

Week 6

  • Balance complement data set.
    • For each interface, sample one random node, expand that node with bfs_expand until number of nodes is equal to the interface.
  • Add annotations at the node level
    • for protein and rna: True/False
    • for Ion/Ligand: give name
  • Test balancing is working correctly
  • Debug segmentation fault
  • Describe training data (make a table with #nodes, #graphs, #edges, download link)
  • upload training data to mega
  • update repo with How to get data section - "VERSION 0 of DATA"
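
The balancing step above (sample one random node, then grow it breadth-first until it matches the interface size) can be sketched as follows. Two choices here are assumptions rather than something the log states: the seed is drawn from non-interface nodes, and the expansion never enters interface nodes. The names `bfs_expand_sketch` and `sample_complement` are hypothetical and do not reproduce the project's own bfs_expand.

```python
import random
from collections import deque

def bfs_expand_sketch(adjacency, seed, target_size, blocked=frozenset()):
    """Grow a node set from `seed` in BFS order until it reaches `target_size`
    nodes or no reachable, unblocked nodes remain."""
    selected = {seed}
    queue = deque([seed])
    while queue and len(selected) < target_size:
        current = queue.popleft()
        for neighbour in adjacency.get(current, ()):
            if neighbour in selected or neighbour in blocked:
                continue
            selected.add(neighbour)
            queue.append(neighbour)
            if len(selected) == target_size:
                break
    return selected

def sample_complement(adjacency, interface_nodes, rng=random):
    """Sample a non-interface node set of the same size as the interface."""
    candidates = sorted(n for n in adjacency if n not in interface_nodes)
    if not candidates:
        return set()
    seed = rng.choice(candidates)
    return bfs_expand_sketch(adjacency, seed, len(interface_nodes), blocked=set(interface_nodes))

# Toy path graph 0-1-2-3-4-5 with nodes 0 and 1 as the interface
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
interface_nodes = {0, 1}
complement = sample_complement(adjacency, interface_nodes, rng=random.Random(0))
```

The result can be smaller than the interface when the seed sits in a small component, so a check like the "Test balancing is working correctly" item remains useful.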

Week 5

  • Set up working with Chimera for sanity checks on interfaces
  • Debug ligand annotations
    • Gotcha: Nucleic acid bases can also be ligands!
    • Ligands were initially computed by finding all non-polymer binding partners, then keeping those above a molecular mass threshold, but a molecule called A turned up in the list
    • FIX: added another conditional in ligand checking to make sure residue is non-polymer in interfaces.py
  • Rewrite data collection into script (currently notebook)
    • Gotcha: Was initially only collecting Representative structures, but there are redundancies within chains
    • Adjust script to retrieve specific chains to create the data set
    • But we need all chains in the structures to find interfaces between chains?
      • Check this with Carlos
  • Clean up sliced graphs to be connected.
    • see graph_utils.py
    • For disconnected graphs:
      • connect them if possible by adding just 1 or 2 edges (less than a threshold)
      • otherwise put into separate files
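
The connect-or-split rule for disconnected graphs can be sketched with a connected-components pass: joining k components needs at least k - 1 new edges, so that count can be compared against the threshold. This is only an illustration and not the code in graph_utils.py; the names `connected_components` and `connect_or_split`, and the default budget of 2 edges, are assumptions, and the sketch does not decide which edges to add.

```python
def connected_components(adjacency):
    """Return the connected components of an undirected graph as a list of node sets."""
    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        component = {start}
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbour in adjacency.get(node, ()):
                if neighbour not in component:
                    component.add(neighbour)
                    stack.append(neighbour)
        seen |= component
        components.append(component)
    return components

def connect_or_split(adjacency, max_new_edges=2):
    """Return ("connect", components) if the graph can be joined within the
    edge budget, otherwise ("split", components) for writing to separate files."""
    components = connected_components(adjacency)
    edges_needed = len(components) - 1  # minimum edges to join all components
    action = "connect" if edges_needed <= max_new_edges else "split"
    return action, components

# Toy graphs (made up): two components vs. four isolated nodes
two_parts = {1: [2], 2: [1], 3: [4], 4: [3]}
four_parts = {1: [], 2: [], 3: [], 4: []}
action_a, comps_a = connect_or_split(two_parts)
action_b, comps_b = connect_or_split(four_parts)
```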

Week 4

  • Modularize and refactor data collection / data annotation
    • remove src dir and put everything in prepare_data
    • write prepare_data/main.py
  • test and debug data annotation / data collection
    • DEBUGGING: when getting interfaces on 5fk3 it's not getting any in the list. RNA molecule with a SAM ligand.

Week 3

  • Set up and debug the VeRNAl environment on local machine
  • Write a script to subset RNA graphs into those with interfaces slice.py
  • Change interfaces to ignore DNA and distinguish between ions and ligands using ligand list downloaded from PDB
  • test interfaces.py
  • debug interfaces.py

Week 2

  • debug retrieve_structures.py
  • test and validate data collection
  • Write a script for finding interfaces: interfaces.py

Week 1

  • Read Biopython tutorial and watch videos
  • Data exploration on PDB data and working with CIF files
  • Write a script to retrieve representative set of RNA molecules retrieve_structures.py