# Bioinformatics A.Y. 2018/2019 - Report
Development of a set of LSTM-based neural networks for discriminating oncogenic sequences in gene fusions.
- **Stefano Brilli** - s249914@studenti.polito.it
- **Francesco Chiarlo** - s253666@studenti.polito.it
- **Michele D'Amico** - s246501@studenti.polito.it
## Architectures & Results
To evaluate the results of each model in a general and comparable way, a three-phase approach was adopted:
- **1 - Holdout** - Selection of the best hyperparameters;
- **2 - Training** - Fitting the model with the chosen hyperparameters, using one of two variants:
  - **2.a<sup>*</sup>** algorithm 7.2, enabled with the option `--early_stopping_epoch`
  - **2.b<sup>*</sup>** algorithm 7.3, enabled with the option `--early_stopping_on_loss`
- **3 - Testing** - Prediction on data *never seen* by the model.

Note that all the results reported below were obtained exclusively with algorithm **2.b**, since it gave better performance.
> <sup>*</sup> (Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. The MIT Press, pp. 246-250.)
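
The two training variants differ in how they reuse the outcome of the holdout phase. As described in the cited chapter, algorithm 7.2 retrains for the number of epochs found by early stopping, while algorithm 7.3 keeps training until the monitored loss falls to the value recorded at the early-stopping point. The Python sketch below illustrates that difference under simplifying assumptions: `ToyModel`, `retrain_fixed_epochs`, and `train_until_loss` are hypothetical names, the loss curve is artificial, and none of this is the project's actual code.

```python
from typing import Callable


class ToyModel:
    """Stand-in for a real network: each epoch shrinks the loss by 10%."""

    def __init__(self, initial_loss: float = 1.0) -> None:
        self.loss = initial_loss

    def train_one_epoch(self) -> None:
        self.loss *= 0.9


def retrain_fixed_epochs(train_one_epoch: Callable[[], None], n_epochs: int) -> int:
    """Variant 2.a: train for the epoch count found during holdout."""
    for _ in range(n_epochs):
        train_one_epoch()
    return n_epochs


def train_until_loss(
    train_one_epoch: Callable[[], None],
    current_loss: Callable[[], float],
    target_loss: float,
    max_epochs: int = 500,
) -> int:
    """Variant 2.b: train until the monitored loss reaches the target.

    max_epochs is a safety cap added by this sketch (an assumption),
    because the target value may never be reached.
    """
    epochs = 0
    while current_loss() > target_loss and epochs < max_epochs:
        train_one_epoch()
        epochs += 1
    return epochs


if __name__ == "__main__":
    # Illustrative values only: 24 epochs for 2.a, target loss 0.539 for 2.b.
    model_a = ToyModel()
    retrain_fixed_epochs(model_a.train_one_epoch, n_epochs=24)

    model_b = ToyModel()
    used = train_until_loss(
        model_b.train_one_epoch, lambda: model_b.loss, target_loss=0.539
    )
    print(f"2.a final loss: {model_a.loss:.3f}; 2.b stopped after {used} epochs")
```

In a real run, `current_loss` would evaluate the network on the chosen data split and `train_one_epoch` would perform one pass over the training data.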
---
### Model #1 - Unidirectional LSTM on DNA sequences
<img src=https://i.imgur.com/LBuTDAX.png height=180>
**Network characteristics:**
- One-hot encoding
- Batch size: **16**
- Number of units in the FC layer: **32**
- Dropout rate: **0.4**
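
As an illustration of the one-hot encoding listed above, the following sketch turns a DNA string into one row per nucleotide. The alphabet `ACGT`, the helper name `one_hot_encode`, and the handling of unknown symbols are assumptions of this example; the report does not specify them.

```python
from typing import List

DNA_ALPHABET = "ACGT"  # assumed alphabet; the report does not list it


def one_hot_encode(sequence: str, alphabet: str = DNA_ALPHABET) -> List[List[int]]:
    """Return one row per symbol, with a single 1 at the symbol's index.

    Symbols outside the alphabet (for example 'N') become all-zero rows.
    This is a choice made for the sketch, not something stated in the report.
    """
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    encoded = []
    for symbol in sequence.upper():
        row = [0] * len(alphabet)
        if symbol in index:
            row[index[symbol]] = 1
        encoded.append(row)
    return encoded


if __name__ == "__main__":
    for symbol, row in zip("ACGTN", one_hot_encode("ACGTN")):
        print(symbol, row)
```

The resulting matrix has shape (sequence length, alphabet size), which is the usual input layout for a recurrent layer processing one symbol per time step.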
#### Results:
|Holdout| learning rate|LSTM units|Dropout|ES_epoch|Loss|Accuracy | F1|Precision|Recall|
|:-----:|:------------:|:--------:|:-----:|:------:|:--:|:-------:|:-:|:-------:|:----:|
| 1|1e-3|32|0.1| 18|0.589|0.738|0.689|0.838|0.627|
| 2|1e-3|32|0.3| 31|0.609|0.753|0.707|0.862|0.629|
| 3|1e-3|32|0.5| 34|0.586|0.738|0.708|0.812|0.658|
| 4|1e-4|48|0.1| 5|0.609|0.671|0.572|0.799|0.477|
| 5|1e-3|48|0.1| 26|0.715|0.563|0.367|0.589|0.289|
| 6|1e-4|48|0.3| 23|0.605|0.740|0.696|0.835|0.629|
| 7|1e-4|48|0.5| 56|0.709|0.587|0.421|0.663|0.331|
| 8|1e-3|64|0.1| 18|0.581|0.740|0.698|0.814|0.646|
| 9|1e-4|64|0.1| 47|0.639|0.704|0.619|0.778|0.535|
|10|5e-4|64|0.3| 12|0.602|0.717|0.614|0.852|0.512|
|11|1e-3|64|0.3| 24|0.596|0.758|0.723|0.839|0.659|
|12|1e-3|64|0.5| 6|0.622|0.686|0.604|0.799|0.525|
|13|1e-4|64|0.5| 45|0.628|0.693|0.616|0.734|0.549|
|**14**|**1e-4**|**16**|**0.1**|**24**|**0.539**|**0.799**|**0.799**|**0.820**|**0.802**|
|15|1e-4|16|0.3| 24|0.559|0.717|0.701|0.746|0.691|
|16|1e-4|16|0.5| 24|0.559|0.755|0.720|0.825|0.659|
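
The bold row in the table above marks the configuration carried forward to the training phase. The report does not state the exact selection criterion, so the sketch below simply assumes that the run with the highest holdout F1 is chosen; `select_best` is a hypothetical helper, and only four rows of the table are copied in for brevity.

```python
from typing import Dict, List

# Four of the holdout rows from the table above (id, loss, accuracy, F1).
HOLDOUT_RUNS: List[Dict[str, float]] = [
    {"holdout": 3, "loss": 0.586, "accuracy": 0.738, "f1": 0.708},
    {"holdout": 11, "loss": 0.596, "accuracy": 0.758, "f1": 0.723},
    {"holdout": 14, "loss": 0.539, "accuracy": 0.799, "f1": 0.799},
    {"holdout": 16, "loss": 0.559, "accuracy": 0.755, "f1": 0.720},
]


def select_best(
    runs: List[Dict[str, float]],
    metric: str = "f1",
    higher_is_better: bool = True,
) -> Dict[str, float]:
    """Return the run with the best value of `metric`.

    Selecting on F1 is an assumption of this sketch; pass metric="loss"
    with higher_is_better=False to select on the lowest loss instead.
    """
    chooser = max if higher_is_better else min
    return chooser(runs, key=lambda run: run[metric])


if __name__ == "__main__":
    best = select_best(HOLDOUT_RUNS)
    print(f"Selected holdout run: {best['holdout']}")
```

For this table, selecting on F1, accuracy, or lowest loss all point to the same run, so the choice of criterion does not change the outcome here.
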

|Test|Model|Loss|Accuracy|F1|Precision|Recall|AP|TN|FP|FN|TP|
|:--:|:--:|:--:|:--:|:------:|:-------:|:----:|:-:|:--:|:--:|:--:|:--:|
|-|14|0.386|0.871|0.865|0.919|0.828|0.930|211|19|40|190|
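
The TN, FP, FN, and TP columns allow the main metrics to be recomputed by hand. The sketch below does this for the counts in the test row above; `metrics_from_counts` is a hypothetical helper. Values recomputed this way need not match the reported columns to the last digit, for example if the reported metrics were averaged over batches during evaluation (an assumption, the report does not say).

```python
from typing import Dict


def metrics_from_counts(tn: int, fp: int, fn: int, tp: int) -> Dict[str, float]:
    """Derive accuracy, precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


if __name__ == "__main__":
    # Counts taken from the test row above (model 14).
    for name, value in metrics_from_counts(tn=211, fp=19, fn=40, tp=190).items():
        print(f"{name}: {value:.3f}")
```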
---
### Model #2 - Bidirectional LSTM on DNA sequences
<img src=https://i.imgur.com/wi2kkeR.png height=180>
**Network characteristics:**
- One-hot encoding
- Max pooling size: **2**
- Batch size: **20**
- Learning rate: **5e-4**
#### Results:
|Holdout|Dropout|lstm_units|kernel_size|conv_num_filters|ES_epoch|Loss|Accuracy| F1 |Precision|Recall|
|:-----:|:-----:|:--------:|:---------:|:--------------:|:------:|:--:|:------:|:--:|:-------:|:----:|
| 1|[0.1;0.1;0.1;0.1]| 16| 3 | 50| 7 | 0.57295 | 0.73866 | 0.71711 | 0.77771 | 0.68804 |
| 2|[0.1;0.1;0.1;0.1]| 16| 5 | 50| 5 | 0.64208 | 0.69978 | 0.62190 | 0.77794 | 0.54552 |
| 3|[0.1;0.1;0.1;0.1]| 16| 10| 50| 3 | 0.62138 | 0.71490 | 0.61748 | 0.87659 | 0.50720 |
| 4|[0.3;0.3;0.3;0.3]| 16| 3 | 50| 42| 0.60901 | 0.69330 | 0.68989 | 0.71246 | 0.71297 |
| 5|[0.3;0.3;0.3;0.3]| 16| 5 | 50| 17| 0.50554 | 0.76674 | 0.69009 | 0.89745 | 0.60612 |
| 6|[0.3;0.3;0.3;0.3]| 16| 10| 50| 8 | 0.62940 | 0.69114 | 0.70468 | 0.69709 | 0.76915 |
| 7|[0.3;0.3;0.3;0.3]| 32| 5 | 50| 15| 0.49978 | 0.76026 | 0.71858 | 0.85039 | 0.65371 |
| 8|[0.3;0.3;0.3;0.3]| 32| 10| 50| 6| 0.62290 | 0.68035 | 0.61435 | 0.80636 | 0.52854 |
|**9**|**[0.3;0.3;0.3;0.3]**|**16**|**3**| **25**| **48**| **0.48879** | **0.79914** | **0.77643** | **0.86375** | **0.72218** |
|10|[0.3;0.3;0.3;0.3]| 16| 5 | 25| 14| 0.55411 | 0.78186 | 0.74186 | 0.85337 | 0.69907 |
|11|[0.3;0.3;0.3;0.3]| 16| 10| 25| 9 | 0.64211 | 0.63715 | 0.62165 | 0.68523 | 0.60172 |

| Test | Model | Loss | Accuracy | F1 | Precision | Recall | AP | TN | FP | FN | TP |
|:------:|:---:|:----:|:--------:|:--:|:---------:|:------:|:---:|:--:|:--:|:--:|:--:|
| -| 9 |0.706| 0.711|0.677|0.750|0.633| 0.799| 180 | 50|83|147|
---
### Model #3 - Unidirectional LSTM on protein sequences
<img src=https://i.imgur.com/lA81Rtd.png height=200>
**Network characteristics:**
- One-hot encoding
#### Results:
| Holdout | Layer size\* | Dropout\* | L2 Regularization\* | Learning rate | Batch size | ES_epoch | Loss | Accuracy | F1 | Precision | Recall |
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| 1| [128;128;128] | [0.5;0.5;0.5;0.5] | [0;0;0;0] | 1e-4 | 32 | 11 | 0.51546 | 0.73002 | 0.63162 | 0.94690 | 0.50496|
| 2| [128;128;128] | [0.5;0.5;0.5;0.5] | [1e-3;1e-3;1e-3;1e-3] | 1e-4 | 32 | 10 | 0.73175 | 0.73218 | 0.64949 | 0.88817 | 0.54384|
| 3| [128;128;128] | [0.5;0.5;0.5;0.5] | [1e-4;1e-4;1e-4;1e-4] | 1e-4 | 32 | 11 | 0.54805 | 0.73002 | 0.63080 | 0.94690 | 0.50474|
| 4| [128;128;128] | [0.5;0.5;0.7;0.7] | [0;0;0;0] | 1e-4 | 32 | 11 | 0.52107 | 0.71490 | 0.62148 | 0.89304 | 0.51461|
| 5| [128;128;128] | [0.3;0.3;0.5;0.5] | [0;0;0;0] | 1e-4 | 32 | 11 | 0.53654 | 0.74082 | 0.66415 | 0.87473 | 0.57688|
| 6| [128;128;128] | [0.1;0.1;0.5;0.5] | [0;0;0;0] | 1e-4 | 32 | 10 | 0.52756 | 0.72570 | 0.62877 | 0.92785 | 0.50865|
| 7| [128;128;128] | [0.4;0.4;0.2;0.2] | [0;0;0;0] | 5e-4 | 32 | 11 | 0.51169 | 0.72786 | 0.68798 | 0.85348 | 0.60160|
| 8| [128;128;128] | [0.4;0.4;0.5;0.5] | [0;0;0;0] | 5e-4 | 32 | 11 | 0.51184 | 0.73650 | 0.64521 | 0.92785 | 0.52804|
| 9| [128;128;128] | [0.4;0.4;0.8;0.8] | [0;0;0;0] | 5e-4 | 32 | 10 | 0.53321 | 0.70842 | 0.63742 | 0.81173 | 0.55339|
|10| [128;128;128] | [0.4;0.4;0.8;0.8] | [0;0;0;0] | 1e-3 | 32 | 6 | 0.51737 | 0.73002 | 0.69366 | 0.79503 | 0.63938|
|11| [128;128;64] | [0.5;0.5;0.5;0.5] | [0;0;0;0] | 1e-4 | 32 | 10 | 0.52251 | 0.73002 | 0.66371 | 0.87113 | 0.57291|
|12| [128;128;64] | [0.5;0.5;0.5;0.5] | [0;0;0;0] | 1e-4 | 16 | 7 | 0.53382 | 0.74730 | 0.69236 | 0.87422 | 0.60221|
|13| [64;64;64] | [0.5;0.5;0.5;0.5] | [0;0;0;0] | 1e-4 | 32 | 24 | 0.52053 | 0.74082 | 0.64615 | 0.91960 | 0.54510|
|14| [64;64;32] | [0.3;0.3;0.3;0.3] | [0;0;0;0] | 1e-4 | 16 | 17 | 0.52438 | 0.74730 | 0.65596 | 0.82823 | 0.57388|
|15| [32;32;32] | [0.5;0.5;0.5;0.5] | [0;0;0;0] | 1e-4 | 32 | 39 | 0.53357 | 0.74730 | 0.70721 | 0.87628 | 0.63022|
|16| [32;32;32] | [0.5;0.5;0.5;0.5] | [0;0;0;0] | 1e-4 | 16 | 25 | 0.52169 | 0.76890 | 0.72858 | 0.88953 | 0.64982|
|17| [32;32;32] | [0.3;0.3;0.5;0.5] | [0;0;0;0] | 1e-4 | 32 | 28 | 0.53589 | 0.75162 | 0.70089 | 0.87739 | 0.62547|
|**18**|**[16;16;32]**|**[0.3;0.3;0.3;0.3]**|**[0;0;0;0]** |**1e-4**|**16**|**27**|**0.53081**| **0.75378** | **0.76873** | **0.77319** | **0.79635**|

\* [LSTM; LSTM; FC]

| Test | Model | Loss | Accuracy | F1 | Precision | Recall | AP | TN | FP | FN | TP |
|:------:|:---:|:----:|:--------:|:--:|:---------:|:------:|:---:|:--:|:--:|:--:|:--:|
| -| 18 | 0.708 | 0.652 | 0.643 |0.680| 0.634 | 0.697| 153 | 77 | 83 | 147 |
---
### Model #4 - Bidirectional LSTM on protein sequences
##### Model A
<img src=https://i.imgur.com/c6fa65w.png height=250>
**Network characteristics:**
- Embedding encoding
- Batch size: **32**
- Initial learning rate: **5e-4**
- Dropout rate: **[0.2, 0.4, 0.6]**
- Number of LSTM units: **0.5 \* emb_size**
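
Unlike one-hot encoding, an embedding layer consumes integer token ids and learns an `emb_size`-dimensional vector for each id. The sketch below shows one way a protein sequence could be converted to such ids. The 20-letter amino-acid alphabet, the reserved padding index 0, the shared index for unknown symbols, and the helper names are all assumptions of this example, not details taken from the report.

```python
from typing import Dict, List, Optional

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues (assumed alphabet)
PAD_INDEX = 0  # index 0 reserved for padding (assumption)
UNKNOWN_INDEX = len(AMINO_ACIDS) + 1  # shared id for any other symbol


def build_vocabulary(alphabet: str = AMINO_ACIDS) -> Dict[str, int]:
    """Assign ids 1..len(alphabet) to the residues, in alphabet order."""
    return {residue: i for i, residue in enumerate(alphabet, start=1)}


def encode_for_embedding(
    sequence: str,
    max_len: int,
    vocabulary: Optional[Dict[str, int]] = None,
) -> List[int]:
    """Map residues to integer ids, then truncate or right-pad to max_len."""
    vocabulary = vocabulary or build_vocabulary()
    ids = [vocabulary.get(r, UNKNOWN_INDEX) for r in sequence.upper()[:max_len]]
    return ids + [PAD_INDEX] * (max_len - len(ids))


if __name__ == "__main__":
    print(encode_for_embedding("MKV", max_len=6))
```

With this layout, the embedding layer would need `len(AMINO_ACIDS) + 2` input ids to cover padding and unknown symbols.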
#### Results:
|Holdout|Dropout|emb_size|ES_epoch|Loss|Accuracy| F1|Precision|Recall|
|:-----:|:-----:|:------:|:------:|:--:|:------:|:-:|:-------:|:----:|
|A1 |[0.2,0.2,0.1]|16|20|0.674|0.736|0.723|0.766|0.699|
|A2 |[0.2,0.2,0.3]|16|61|0.605|0.711|0.680|0.797|0.620|
|A3 |[0.2,0.2,0.5]|16|45|0.628|0.739|0.727|0.755|0.725|
|A4 |[0.4,0.4,0.1]|16|45|0.634|0.745|0.737|0.769|0.729|
|A5 |[0.4,0.4,0.3]|16|19|0.670|0.711|0.718|0.726|0.731|
|A6 |[0.4,0.4,0.5]|16|37|0.644|0.719|0.729|0.717|0.765|
|A7 |[0.6,0.6,0.1]|16|35|0.635|0.730|0.729|0.752|0.726|
|A8 |[0.6,0.6,0.3]|16|24|0.663|0.752|0.728|0.791|0.687|
|A9 |[0.6,0.6,0.5]|16|47|0.624|0.732|0.743|0.721|0.787|
|**A10**|**[0.2,0.2,0.1]**|**32**|**35**|**0.673**|**0.767**|**0.743**|**0.834**|**0.685**|
|A11|[0.2,0.2,0.3]|32|47|0.643|0.732|0.666|0.881|0.559|
|A12|[0.2,0.2,0.5]|32|45|0.692|0.719|0.726|0.720|0.752|
|A13|[0.4,0.4,0.1]|32|31|0.731|0.741|0.717|0.797|0.668|
|A14|[0.4,0.4,0.3]|32|32|0.667|0.754|0.730|0.781|0.705|
|A15|[0.4,0.4,0.5]|32|35|0.740|0.708|0.713|0.752|0.707|
|A16|[0.6,0.6,0.1]|32|26|0.717|0.713|0.724|0.719|0.748|
|A17|[0.6,0.6,0.3]|32|19|0.764|0.706|0.699|0.754|0.670|
|A18|[0.6,0.6,0.5]|32|26|0.696|0.717|0.724|0.711|0.751|
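
The 18 holdout runs above appear to cover a full grid: two embedding sizes, three values shared by the first two dropout entries, and three values for the last dropout entry. The sketch below enumerates that grid in the same order as the table. The grid structure is inferred from the rows, and `build_grid` is a hypothetical helper rather than the project's code.

```python
from itertools import product
from typing import Dict, List

EMB_SIZES = [16, 32]
SHARED_DROPOUT = [0.2, 0.4, 0.6]  # first two dropout values in each row
LAST_DROPOUT = [0.1, 0.3, 0.5]  # third dropout value in each row


def build_grid() -> List[Dict[str, object]]:
    """Enumerate the configurations in table order (A1 to A18)."""
    grid = []
    combos = product(EMB_SIZES, SHARED_DROPOUT, LAST_DROPOUT)
    for run, (emb_size, d_shared, d_last) in enumerate(combos, start=1):
        grid.append(
            {
                "holdout": f"A{run}",
                "dropout": [d_shared, d_shared, d_last],
                "emb_size": emb_size,
                "lstm_units": emb_size // 2,  # 0.5 * emb_size
            }
        )
    return grid


if __name__ == "__main__":
    for config in build_grid():
        print(config)
```

Because `product` varies its last argument fastest, the embedding size changes slowest, which reproduces the row order of the table.
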
##### Model B
<img src=https://i.imgur.com/1PK16VV.png height=180>
**Network characteristics:**
- One-hot encoding
- Batch size: **20**
- Initial learning rate: **5e-4**
- Dropout rate: **[0.5, 0.6]**
- Number of LSTM units: **[6, 10]**
- Convolutional kernel size: **3**
- Max pooling size: **2**
#### Results:
|Holdout|Dropout|lstm_units|conv_num_filters|ES_epoch|Loss |Accuracy|F1 |Precision|Recall|
|:-----:|:-----:|:--------:|:--------------:|:------:|:---:|:------:|:--:|:-------:|:----:|
|B1|[0.5,0.6,0.5]| 6|50|10|0.550|0.721|0.721|0.758|0.738|
|B2|[0.5,0.5,0.6]| 6|50|12|0.545|0.717|0.717|0.740|0.741|
|B3|[0.6,0.5,0.5]| 6|50|12|0.554|0.726|0.714|0.763|0.711|
|B4|[0.5,0.6,0.6]| 6|50|10|0.541|0.715|0.722|0.756|0.727|
|B5|[0.6,0.5,0.6]| 6|50|12|0.544|0.728|0.716|0.763|0.715|
|B6|[0.6,0.6,0.5]| 6|50|12|0.546|0.724|0.707|0.786|0.675|
|**B7**|**[0.6,0.6,0.6]**| **6**|**50**|**10**|**0.541**|**0.715**|**0.722**|**0.756**|**0.727**|
|B8|[0.5,0.5,0.5]|10|50|12|0.544|0.706|0.680|0.779|0.639|

|Test|Model|Loss |Accuracy| F1|Precision|Recall|AP |TN |FP |FN |TP |
|:--:|:-----:|:---:|:------:|:-:|:-------:|:----:|:--:|:--:|:--:|:--:|:--:|
|-|A10|0.729|0.652|0.610|0.709|0.551|0.760|173|57|103|127|
|-|B7|0.708|0.667|0.615|0.722|0.584|0.772|172|58|95|135|