# DTCC - Group Meeting 1
## Current State
### Input features
* Non-sequential
* SentenceTransformer embeddings
* distiluse-base-multilingual-cased-v1
* Filtered readability metrics
* https://github.com/andreasvc/readability/
* https://i.imgur.com/VOhZrwQ.png
* TF-IDF vectors of preprocessed and raw sentences
* Sequential
* Flair TransformerWordEmbeddings
* deepset/gbert-base
--> When combined, the embeddings and readability metrics are scaled (see the sketch below)
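
A minimal sketch of this feature combination, assuming scikit-learn's StandardScaler for the scaling step; `extract_readability_metrics` stands in for a hypothetical wrapper around the readability package linked above, and whether the two feature groups are scaled jointly or separately is left open in the notes.

```python
# Sketch only: how the non-sequential features could be built and scaled.
# `extract_readability_metrics` is a hypothetical wrapper around the
# readability package linked above, not part of the actual pipeline.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.preprocessing import StandardScaler

model = SentenceTransformer("distiluse-base-multilingual-cased-v1")

def build_features(sentences, extract_readability_metrics):
    embeddings = model.encode(sentences)  # 512-dim sentence embeddings
    # one row of filtered readability metrics per sentence
    metrics = np.array([extract_readability_metrics(s) for s in sentences])
    # concatenate, then scale the combined feature matrix
    return StandardScaler().fit_transform(np.hstack([embeddings, metrics]))
```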
### Regression models
* MLP
* layers: 3, neurons: 256-128-8, activation: relu
* ...
* layers: 1, neurons: 32, activation: sigmoid
* Gradient Tree Boosting Regression
* Random Forest Regression
* Support Vector Regression
* LSTM -> Linear (see the PyTorch sketch below)
* HIDDEN_DIM = 64, N_LAYERS = 2, DROPOUT = 0.20, BIDIRECTIONAL = True
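
Minimal PyTorch sketches of the MLP (256-128-8, ReLU) and the LSTM -> Linear model with the hyperparameters above. The input dimensions, the single regression output, and using the last time step for the prediction are assumptions, not taken from the notes.

```python
import torch
import torch.nn as nn

class MLPRegressor(nn.Module):
    # layers: 3, neurons: 256-128-8, activation: relu
    def __init__(self, input_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 8), nn.ReLU(),
            nn.Linear(8, 1),  # single regression output assumed
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

class LSTMRegressor(nn.Module):
    # HIDDEN_DIM = 64, N_LAYERS = 2, DROPOUT = 0.20, BIDIRECTIONAL = True
    def __init__(self, embedding_dim, hidden_dim=64, n_layers=2,
                 dropout=0.2, bidirectional=True):
        super().__init__()
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=n_layers,
                            dropout=dropout, bidirectional=bidirectional,
                            batch_first=True)
        out_dim = hidden_dim * (2 if bidirectional else 1)
        self.linear = nn.Linear(out_dim, 1)

    def forward(self, x):
        # x: (batch, seq_len, embedding_dim) word embeddings from Flair
        output, _ = self.lstm(x)
        # use the last time step's representation for the prediction
        return self.linear(output[:, -1, :]).squeeze(-1)
```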
### Results
* sklearn comparison results: https://docs.google.com/spreadsheets/d/1Wf4cxbCPNvC8xeZS1lz_aKal6pL3COmF/edit#gid=1431294104
| Features | Model | RMSE |
| -------- | -------- | -------- |
| SentenceTransformer Embeddings + Readability Metrics | GradientTreeBoosting | 0.653 |
| SentenceTransformer Embeddings + Readability Metrics | PyTorch MLP | 0.6502 |
| Flair TransformerWordEmbeddings | PyTorch LSTM -> Linear | 0.6857 |
## Questions / TODO
* Word-level complexity features to feed into the LSTM
* Simply copying and concatenating the readability metrics did not work out
* Ideas for 'manual' features (see the word-feature sketch at the end of this section):
* Word length
* Number of syllables
* POS tag
--> Generally difficult because the features must be aligned with the model's tokenization
* Feature ablation of the readability features
* Deriving custom complexity metrics
* Balancing the readability features and the embedding features --> Beyond simple concatenation
* Feed the complexity features into the network at a later step? (see the late-fusion sketch at the end of this section)
* Reduce ST vector to lower dimensionality?
* Vector operation between the feature vectors?
* More training data for a more complex model
* num_labels = 1 or num_labels = 7?
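
A sketch of the 'manual' word-level features from the list above (word length, number of syllables, POS tag). spaCy and pyphen are assumptions, since the notes do not name libraries, and the German models are assumed because of deepset/gbert-base; aligning these per-word features with the transformer's subword tokenization is exactly the difficulty noted above.

```python
# Hypothetical word-level feature extractor; spaCy and pyphen are assumed,
# not taken from the notes. Syllable counts via hyphenation points are only
# a rough proxy for the true syllable count.
import pyphen
import spacy

nlp = spacy.load("de_core_news_sm")     # assumed German pipeline
hyphenator = pyphen.Pyphen(lang="de")
POS_TAGS = ["NOUN", "VERB", "ADJ", "ADV", "PRON", "DET", "ADP", "AUX"]

def word_features(sentence):
    """One feature vector per word: [length, syllables, one-hot POS]."""
    features = []
    for token in nlp(sentence):
        # hyphenation points + 1 approximates the syllable count
        syllables = hyphenator.inserted(token.text).count("-") + 1
        pos_onehot = [1.0 if token.pos_ == tag else 0.0 for tag in POS_TAGS]
        features.append([float(len(token.text)), float(syllables)] + pos_onehot)
    return features
```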
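
A sketch of the late-fusion idea, i.e. feeding the complexity features into the network at a later step after compressing the ST embedding to a lower dimensionality; all layer sizes are illustrative assumptions.

```python
# Late fusion sketch: compress the high-dimensional sentence embedding
# first, then inject the readability metrics into the smaller space so
# they are not drowned out by simple up-front concatenation.
import torch
import torch.nn as nn

class LateFusionMLP(nn.Module):
    def __init__(self, embedding_dim, n_metrics):
        super().__init__()
        self.embed_net = nn.Sequential(
            nn.Linear(embedding_dim, 128), nn.ReLU(),
            nn.Linear(128, 32), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Linear(32 + n_metrics, 16), nn.ReLU(),
            nn.Linear(16, 1),
        )

    def forward(self, embedding, metrics):
        compressed = self.embed_net(embedding)
        fused = torch.cat([compressed, metrics], dim=-1)
        return self.head(fused).squeeze(-1)
```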