# ENN-Gene

ENN-Gene is an application that simplifies the local training of custom Convolutional Neural Network models on genomic data via an easy-to-use Graphical User Interface. ENN-Gene allows multiple streams of input information, including sequence, evolutionary conservation, and secondary structure, and includes utilities that can perform the needed preprocessing steps, allowing simplified input such as genomic coordinates. ENN-Gene deals with all steps of training and evaluation of Convolutional Neural Networks for genomics, empowering researchers who may not be experts in machine learning to perform this powerful type of analysis.

> We are preparing a video tutorial for an even easier start with the ENN-Gene application!

### Installation

TODO

### Implementation

ENN-Gene is built atop TensorFlow, one of the most popular Deep Learning frameworks. It can run on either CPU or GPU, the latter offering a considerable speed-up when training larger CNNs. ENN-Gene accepts BED genomic interval files corresponding to a user-determined genome or transcriptome reference. Classifiers can be built for two or more classes. Full reproducibility of the process is ensured by logging all the user's input and exporting it as a yaml file that can be loaded in ENN-Gene for future runs.

The application consists of three consecutive modules, each performing a specific task. To select a task, browse the select box in the side panel.

`Output folder`
A subfolder is created for each task in the given output folder (so you can use the same path for all the tasks you run). After running a task, several files are exported to the task subfolder:

* Parameters.yaml file - contains all the parameters set by the user. The yaml file can be imported in a future run to reproduce the setup.
* Log file - contains logged user input, warnings, errors, etc.
* Parameters.tsv file - a tsv table with one row per run, shared across the task. You can easily manage and compare results across the specific task with different parameter setups.
* Other task-specific files.

The ENN-Gene application uses the [Streamlit framework](https://www.streamlit.io/), which is still in its early stages of development. Currently, all input files (or folders) must be defined by an absolute path.

> **Due to the nature of the Streamlit framework, it is strongly recommended to fill in the input fields top to bottom to avoid resetting already specified options.** Hopefully, this behavior will be removed with a framework update.

#### 1 Preprocessing

In the first module, data is preprocessed into a format convenient for CNN input.

<!-- TODO fix first
`Use already preprocessed file`
Check this option to save preprocessing time if you already have files prepared from a previous run and you just want to e.g. change the chromosomes' distribution among categories.
-->

`Branches`
You may select one or more input types engineered from the given interval files. Each input type later corresponds to a branch in the neural network.

* Sequence – one-hot encoded RNA or DNA sequence (the sketch below illustrates the idea of the encoding). Requires a reference genome/transcriptome in a fasta file.
* Secondary structure – computed by the [ViennaRNA](https://www.tbi.univie.ac.at/RNA/) package, one-hot encoded. (Also requires the reference genome in a fasta file.)
* Conservation score – computed based on the user-provided reference file/s.

`Alphabet`
Define the nature of the sequences (DNA or RNA).

`Apply strand`
Choose to apply (if available) or ignore strand information.
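ENN-Gene performs this encoding internally; purely as an illustration of what one-hot encoding means for the Sequence branch, here is a minimal sketch in Python. The function name, the treatment of unknown characters such as N, and the column order of the alphabet are assumptions made for this example, not ENN-Gene's actual implementation.

```python
import numpy as np

def one_hot_encode(sequence, alphabet="DNA"):
    """Illustrative one-hot encoding of a DNA/RNA sequence.

    Each character becomes a vector with a single 1; unknown characters
    (e.g. N) are encoded as all zeros here, which is an assumption made
    for this sketch, not necessarily ENN-Gene's rule.
    """
    letters = "ACGT" if alphabet == "DNA" else "ACGU"
    index = {base: i for i, base in enumerate(letters)}
    encoded = np.zeros((len(sequence), len(letters)), dtype=np.float32)
    for position, base in enumerate(sequence.upper()):
        if base in index:
            encoded[position, index[base]] = 1.0
    return encoded

print(one_hot_encode("ACGTN"))        # 5 x 4 matrix, last row all zeros
print(one_hot_encode("ACGU", "RNA"))  # RNA alphabet uses a U column instead of T
```

Each position of the sequence becomes one row, and the `Alphabet` option above decides whether a T or a U column is used.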
`Window size`
All samples prepared for the neural network must be of the same length, defined by the window size. Longer sequences get shortened, while shorter sequences are completed based on the reference. Both are done by randomly placing a window of the selected size on the original sequence.

`Seed`
Parameter used for the reproducibility of the semi-random window placement.

`Path to the reference fasta file`
File containing the reference genome/transcriptome. Required when the Sequence or Secondary structure branch is selected.

`Path to folder containing reference conservation files`
Required when the Conservation score branch is selected.

`Number of CPUs`
You may assign multiple CPUs for the computation of the secondary structure.

##### Input Coordinate Files

`Number of input files`
There can be an arbitrary number of input files in BED format (two at minimum). Each input file corresponds to one class for the classification. The class name is based on the file name.

`File no. 1, File no. 2, ...`
Enter an absolute path for each interval file separately.

##### Dataset Size Reduction

`Classes to be reduced`
The number of samples in the selected class/es can be reduced to save computing resources when training larger networks.

`Target dataset size`
Define a target size for each dataset selected for reduction. Input a decimal number if you want to reduce the sample size by a ratio (e.g. 0.1 to get 10%), or an integer if you wish to set the final dataset size (e.g. 5000 if you want exactly 5000 samples). A hint showing the number of rows in the original input file is displayed at the end.

*Note: Samples are randomly shuffled across the chromosomes before the size reduction. If you split the dataset by chromosomes after reducing its size, make sure all the classes in all the categories (train, test, etc.) contain at least some data, as some small chromosomes might get fully removed.*

`Seed`
Parameter used for the reproducibility of the semi-random dataset size reduction.

##### Data Split

Supplied data must be split into training, validation and testing datasets.

*Note: The testing dataset is used for a direct evaluation of the trained model. If you train multiple models on the same datasets, you might want to keep a 'blackbox' dataset for a final evaluation.*

`Random`
Data are split into the categories randomly across the chromosomes, based on the given ratio.

`Target ratio`
Defines the ratio of the number of samples between the categories. Required format: train:validation:test:blackbox (a minimal illustration is sketched at the end of this subsection).

`Seed`
Parameter used for the reproducibility of the semi-random data split.

`By chromosomes`
Specific chromosomes might be selected for each category. To use this option, a fasta file with the reference genome/transcriptome must be provided (the same one required for the Sequence and Secondary structure branches). The list of available chromosomes is then inferred from the provided reference (scaffolds are ignored); that may take up to a few minutes.

*Note: When selecting the chromosomes for the categories, Streamlit will issue a warning 'Running read_and_cache(...)'. You may disregard that and continue selecting the chromosomes. However, if your machine cannot handle it and the process gets stuck, your input might get nullified. In that case, you will want to wait until the warning disappears.*

`Run`
After all the parameters are set and selected, press the run button.
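ENN-Gene performs the split itself when you press Run; the following sketch only illustrates, under assumed shuffling and rounding rules, how a `Target ratio` string in the train:validation:test:blackbox format can be turned into four categories. The function `split_by_ratio` is hypothetical and exists only for this example.

```python
import random

def split_by_ratio(samples, ratio="10:2:2:1", seed=42):
    """Illustrative ratio-based split into train/validation/test/blackbox.

    The ratio string follows the train:validation:test:blackbox format
    described above; the shuffling and rounding rules are assumptions
    made for this sketch.
    """
    parts = [float(x) for x in ratio.split(":")]
    total = sum(parts)
    rng = random.Random(seed)  # the seed makes the semi-random split reproducible
    shuffled = samples[:]
    rng.shuffle(shuffled)

    categories = ["train", "validation", "test", "blackbox"]
    result, start = {}, 0
    for name, part in zip(categories, parts):
        count = round(len(shuffled) * part / total)
        result[name] = shuffled[start:start + count]
        start += count
    result["blackbox"].extend(shuffled[start:])  # put any rounding leftovers somewhere
    return result

splits = split_by_ratio(list(range(1000)), ratio="8:1:1:0")
print({name: len(items) for name, items in splits.items()})
```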
Depending on the amount of data, the selected options, and the hardware available, the preprocessing might take several minutes to hours. Files with the preprocessed datasets are exported to the 'datasets' subfolder of the `output folder` defined at the beginning.

#### 2 Training

In the second module, the neural network architecture as well as the hyperparameters are set, and the model is trained using the data preprocessed in the first module.

`Datasets folder`
Define a path to the folder containing the datasets created by the Preprocessing module.

`Branches`
Select the branches you want the model to be composed of. You may choose only from the branches preprocessed in the first module.

`Output TensorBoard log files`
For more information see the [official site](https://www.tensorflow.org/tensorboard).

##### Training Options

`Batch size`
Number of training samples utilized in one iteration.

`No. of training epochs`
An epoch is one complete pass through the training data. There can be an arbitrary number of training epochs.

`Apply early stopping`
A regularization technique to avoid overfitting when training for too many epochs. The model stops training if the monitored metric (accuracy) does not improve by more than 0.1 (min_delta) during 10 training epochs (patience).

`Optimizer`
Select an optimizer. Available options:

* Stochastic Gradient Descent ([SGD](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/SGD)) - parameters are set as follows: momentum = 0.9, [nesterov](http://proceedings.mlr.press/v28/sutskever13.pdf) = True.
* [RMSprop](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/RMSprop) - parameters are set to default.
* [Adam](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam) - parameters are set to default. Implements the [Adam algorithm](https://arxiv.org/abs/1412.6980).

`Learning rate options`
Applies only to the SGD optimizer. Available options:

* Use fixed learning rate - applies the same learning rate value throughout the whole training.
* Use [learning rate scheduler](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/LearningRateScheduler) - gradually decreases the learning rate from the given value.
* Apply [one cycle policy](https://arxiv.org/abs/1506.01186) - uses the learning rate value as a maximum. The implementation for Keras is taken from [here](https://github.com/titu1994/keras-one-cycle), originally ported from the [fast.ai project](https://github.com/fastai/fastai).

`Learning rate`
Corresponds to the step size during the gradient descent.

`Metric`
Choose a metric. Available options: accuracy.

`Loss function`
Choose a loss function. Available options: categorical crossentropy.

##### Network Architecture

The last section determines the network architecture. You may define the architecture for each of the selected branches separately, as well as for the common part of the network following the branches' concatenation. (A minimal sketch of such a branched model is shown at the end of this module.)

`Number of layers`
First set a number of layers per each part (branch or common part of the network).

`Layer type`
Types available for the branches:

* [Convolution layer](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Conv1D)
* [Locally connected 1D layer](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LocallyConnected1D)

Types available for the connected part of the neural network:

* [Dense layer](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense)

`Show advanced options`
If checked, you may set options specific to each layer type. If not, the defaults apply.

*Note: Due to the nature of the Streamlit framework, it is necessary to keep the checkbox checked for the options to be applied. If it is unchecked, the options get reset.*

Common options are:

* `Batch normalization` Applies [batch normalization](https://arxiv.org/abs/1502.03167) to the layer if checked.
* `Dropout rate` Select a [dropout](https://jmlr.org/papers/v15/srivastava14a.html) rate.

Options available for the Convolution and Locally connected 1D layers:

* `Number of filters` The number of output filters in the convolution.
* `Kernel size` Specifies the length of the 1D convolutional window.

Options available for the Dense layer:

* `Number of units` Dimensionality of the output space.

*Note: The softmax activation function is used for the last layer.*

`Run`
When all is set, press the run button to start training the model. The training time depends on many variables (dataset size, network architecture, number of epochs, hardware available, etc.). You can monitor the progress on the chart indicating the metric and loss function values. The resulting model and other files are exported to the 'training' subfolder in the selected `output folder`.
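ENN-Gene assembles the model from the options above; the sketch below is only an illustration, with an assumed window size, class count, filter numbers, and pooling, of what a two-branch architecture (Sequence and Conservation score) with the SGD settings and early stopping described in this module could look like in Keras. None of the layer sizes shown here are ENN-Gene defaults.

```python
import tensorflow as tf

WINDOW_SIZE = 100   # assumed window size for this sketch
NUM_CLASSES = 2     # assumed number of classes

def branch(name, channels):
    """One input branch: a Conv1D layer followed by pooling (illustrative only)."""
    inputs = tf.keras.Input(shape=(WINDOW_SIZE, channels), name=name)
    x = tf.keras.layers.Conv1D(filters=40, kernel_size=8, activation="relu")(inputs)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.Dropout(0.2)(x)
    x = tf.keras.layers.GlobalMaxPooling1D()(x)  # flattening choice assumed for this sketch
    return inputs, x

seq_in, seq_out = branch("sequence", 4)        # one-hot encoded sequence branch
cons_in, cons_out = branch("conservation", 1)  # conservation score branch (assumed shape)

# Common part of the network after concatenating the branches, ending in softmax.
merged = tf.keras.layers.Concatenate()([seq_out, cons_out])
dense = tf.keras.layers.Dense(32, activation="relu")(merged)
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(dense)

model = tf.keras.Model(inputs=[seq_in, cons_in], outputs=outputs)

# SGD with momentum = 0.9 and nesterov = True, as described above.
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

# Early stopping on accuracy with min_delta = 0.1 and patience = 10, as described above;
# it would be passed to model.fit(..., callbacks=[early_stopping]).
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="accuracy", min_delta=0.1, patience=10
)
model.summary()
```

In the application itself, all of these choices are made through the GUI and logged to the parameters.yaml file.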
#### 3 Prediction

In the last module, a trained model can be used to classify novel, unseen data. The sequences provided for classification are preprocessed similarly to the first module for the purposes of the CNN.

##### Model

You can either use a model trained with the ENN-Gene application, or any custom trained model.

`Use a model trained by the ENN-Gene application`
Preferred option. When this option is selected, you must provide:

* `Training folder containing the model (hdf5 file)` Besides the hdf5 file with the trained model, the folder must contain the parameters.yaml file logged when training the model. From that file, the parameters necessary for sequence preprocessing are read and displayed below the field.

`Use a custom trained model`
When using a model trained otherwise than through the application, the necessary parameters must be provided separately. When this option is selected, you must provide:

* `Trained model (hdf5 file)` Path to the hdf5 file with the trained model.
* `Window size` The size of the window must be the same as the one used for training the given model.
* `Seed` Parameter used for the reproducibility of the semi-random window placement.
* `Number of classes` Must be the same as the number of classes used for training the given model.
* `Class labels` Provide names of the classes for better interpretation of the results. The order of the classes must be the same as when encoding them for training the given model.
* `Branches` Define the branches used in the given model's architecture. Available options are: Sequence, Secondary structure, Conservation score.

##### Sequences

You can provide the input sequences you wish to classify in the following formats:

* BED file
* FASTA file
* Text input - paste one sequence per line.

*Note: If the Conservation score branch is applied, only files in BED format are accepted, as the coordinates are necessary to get the score.*

`Alphabet`
Define the nature of the sequences (DNA or RNA).

`BED file`
When providing the sequences via an interval file, the following must be specified:

* `Path to the BED file containing intervals to be classified`
* `Apply strand` Choose to apply (if available) or ignore strand information.
* `Path to the reference fasta file` File containing the reference genome/transcriptome. Required when the Sequence or Secondary structure branch is selected.
* `Path to folder containing reference conservation files` Required when the Conservation score branch is selected.

*Note: When providing the sequences via a FASTA file or text input, sequences shorter than the window size will be padded with Ns (which might affect the prediction accuracy). Longer sequences will be cut to the length of the window (see the sketch below).*
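ENN-Gene handles this adjustment internally; the sketch below only illustrates the behavior described in the note above. Padding on the right and cutting from the start of the sequence are assumptions of this example, not ENN-Gene's exact rules.

```python
def fit_to_window(sequence, window_size):
    """Illustrative adjustment of an input sequence to the window size.

    Shorter sequences are padded with Ns, longer ones are cut to the
    window length; where the padding and the cut happen is assumed
    here for the sake of the example.
    """
    if len(sequence) < window_size:
        return sequence + "N" * (window_size - len(sequence))
    return sequence[:window_size]

print(fit_to_window("ACGTACGT", 12))          # 'ACGTACGTNNNN'
print(fit_to_window("ACGTACGTACGTACGT", 12))  # cut to 12 characters
```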
`Run`
After all the parameters are set and selected, press the run button. Calculating the predictions might take minutes to hours, depending on the number of sequences, the branches, the hardware available, etc. Results are exported to the 'prediction' subfolder in the selected `output folder`. Information about the input sequences is preserved in the result file (e.g. the fasta header or the coordinates from a bed file), while two more columns with the results are appended. One column shows the raw probabilities predicted by the model, the other the class with the highest probability (#TODO decide upon the threshold).
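The prediction step is run from the GUI; purely as an illustration of how a trained hdf5 model can be applied to encoded windows and reduced to the class with the highest probability, here is a minimal sketch. The model path, the class labels, and the single-branch input shape are placeholders assumed for this example.

```python
import numpy as np
import tensorflow as tf

# Placeholder path and labels; in ENN-Gene these come from the training
# folder, its parameters.yaml file, or the user's input.
MODEL_PATH = "training/model.hdf5"
CLASS_LABELS = ["positive", "negative"]

model = tf.keras.models.load_model(MODEL_PATH)

# Assume `windows` holds one-hot encoded sequences of shape
# (n_samples, window_size, 4), matching a single Sequence branch.
windows = np.zeros((3, 100, 4), dtype=np.float32)

probabilities = model.predict(windows)        # raw class probabilities per sample
predicted = np.argmax(probabilities, axis=1)  # index of the class with the highest probability

for probs, label_index in zip(probabilities, predicted):
    print(list(np.round(probs, 3)), "->", CLASS_LABELS[label_index])
```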
