# DTCC - Group Meeting 1
## Current State
### Input features
* Non-sequential
* SentenceTransformer embeddings
* distiluse-base-multilingual-cased-v1
* Filtered readability metrics
* https://github.com/andreasvc/readability/
* https://i.imgur.com/VOhZrwQ.png
* TF-IDF vectors of preprocessed and raw sentences
* Sequential
* Flair TransformerWordEmbeddings
* deepset/gbert-base
--> When combined, the embeddings and readability metrics are scaled (see the sketch below)
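
A minimal sketch of this feature combination, assuming scikit-learn's StandardScaler for the scaling step; `extract_readability_metrics` stands in for a hypothetical wrapper around the readability package linked above, and whether the two feature groups are scaled jointly or separately is left open in the notes.

```python
# Sketch only: how the non-sequential features could be built and scaled.
# `extract_readability_metrics` is a hypothetical wrapper around the
# readability package linked above, not part of the actual pipeline.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.preprocessing import StandardScaler

model = SentenceTransformer("distiluse-base-multilingual-cased-v1")

def build_features(sentences, extract_readability_metrics):
    embeddings = model.encode(sentences)  # 512-dim sentence embeddings
    # one row of filtered readability metrics per sentence
    metrics = np.array([extract_readability_metrics(s) for s in sentences])
    # concatenate, then scale the combined feature matrix
    return StandardScaler().fit_transform(np.hstack([embeddings, metrics]))
```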
### Regression models
* MLP
* layers: 3, neurons: 256-128-8, activation: relu
* ...
* layers: 1, neurons: 32, activation: sigmoid
* Gradient Tree Boosting Regression
* Random Forest Regression
* Support Vector Regression
* LSTM -> Linear (see the PyTorch sketch below)
* HIDDEN_DIM = 64, N_LAYERS = 2, DROPOUT = 0.20, BIDIRECTIONAL = True
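
Minimal PyTorch sketches of the MLP (256-128-8, ReLU) and the LSTM -> Linear model with the hyperparameters above. The input dimensions, the single regression output, and using the last time step for the prediction are assumptions, not taken from the notes.

```python
import torch
import torch.nn as nn

class MLPRegressor(nn.Module):
    # layers: 3, neurons: 256-128-8, activation: relu
    def __init__(self, input_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 8), nn.ReLU(),
            nn.Linear(8, 1),  # single regression output assumed
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

class LSTMRegressor(nn.Module):
    # HIDDEN_DIM = 64, N_LAYERS = 2, DROPOUT = 0.20, BIDIRECTIONAL = True
    def __init__(self, embedding_dim, hidden_dim=64, n_layers=2,
                 dropout=0.2, bidirectional=True):
        super().__init__()
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=n_layers,
                            dropout=dropout, bidirectional=bidirectional,
                            batch_first=True)
        out_dim = hidden_dim * (2 if bidirectional else 1)
        self.linear = nn.Linear(out_dim, 1)

    def forward(self, x):
        # x: (batch, seq_len, embedding_dim) word embeddings from Flair
        output, _ = self.lstm(x)
        # use the last time step's representation for the prediction
        return self.linear(output[:, -1, :]).squeeze(-1)
```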
### Results
* sklearn comparison results: https://docs.google.com/spreadsheets/d/1Wf4cxbCPNvC8xeZS1lz_aKal6pL3COmF/edit#gid=1431294104
| Features | Model | RMSE |
| -------- | -------- | -------- |
| SentenceTransformer Embeddings + Readability Metrics | GradientTreeBoosting | 0.653 |
| SentenceTransformer Embeddings + Readability Metrics | PyTorch MLP | 0.6502 |
| Flair TransformerWordEmbeddings | PyTorch LSTM -> Linear | 0.6857 |
## Questions / TODO
* Word-level complexity features to feed into the LSTM
* Simply copying and concatenating the readability metrics did not work out
* Ideas for 'manual' features (see the word-feature sketch at the end of this section):
* Word length
* Number of syllables
* POS tag
--> Generally difficult because the features must be aligned with the model's tokenization
* Feature ablation of the readability features
* Deriving custom complexity metrics
* Balancing the readability features and the embedding features --> Beyond simple concatenation
* Feed the complexity features into the network at a later step? (see the late-fusion sketch at the end of this section)
* Reduce ST vector to lower dimensionality?
* Vector operation between the feature vectors?
* More training data for a more complex model
* num_labels = 1 or num_labels = 7?
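
A sketch of the 'manual' word-level features from the list above (word length, number of syllables, POS tag). spaCy and pyphen are assumptions, since the notes do not name libraries, and the German models are assumed because of deepset/gbert-base; aligning these per-word features with the transformer's subword tokenization is exactly the difficulty noted above.

```python
# Hypothetical word-level feature extractor; spaCy and pyphen are assumed,
# not taken from the notes. Syllable counts via hyphenation points are only
# a rough proxy for the true syllable count.
import pyphen
import spacy

nlp = spacy.load("de_core_news_sm")     # assumed German pipeline
hyphenator = pyphen.Pyphen(lang="de")
POS_TAGS = ["NOUN", "VERB", "ADJ", "ADV", "PRON", "DET", "ADP", "AUX"]

def word_features(sentence):
    """One feature vector per word: [length, syllables, one-hot POS]."""
    features = []
    for token in nlp(sentence):
        # hyphenation points + 1 approximates the syllable count
        syllables = hyphenator.inserted(token.text).count("-") + 1
        pos_onehot = [1.0 if token.pos_ == tag else 0.0 for tag in POS_TAGS]
        features.append([float(len(token.text)), float(syllables)] + pos_onehot)
    return features
```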
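
A sketch of the late-fusion idea, i.e. feeding the complexity features into the network at a later step after compressing the ST embedding to a lower dimensionality; all layer sizes are illustrative assumptions.

```python
# Late fusion sketch: compress the high-dimensional sentence embedding
# first, then inject the readability metrics into the smaller space so
# they are not drowned out by simple up-front concatenation.
import torch
import torch.nn as nn

class LateFusionMLP(nn.Module):
    def __init__(self, embedding_dim, n_metrics):
        super().__init__()
        self.embed_net = nn.Sequential(
            nn.Linear(embedding_dim, 128), nn.ReLU(),
            nn.Linear(128, 32), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Linear(32 + n_metrics, 16), nn.ReLU(),
            nn.Linear(16, 1),
        )

    def forward(self, embedding, metrics):
        compressed = self.embed_net(embedding)
        fused = torch.cat([compressed, metrics], dim=-1)
        return self.head(fused).squeeze(-1)
```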