---
tags: Edu
---

# Hands-on NLP

Master 1 Computer Science -- Artificial Intelligence

Coordination and lectures: Marc Evrard and Nona Naderi

January-February, 2023-24

MCC = CC × 20% (2 graded assignments) + project × 80%

## Project

### Information

#### Presentation

* In English
* All group members should participate equally during the presentation
* Reminder: **Groups of 2 to 3 members**

#### Format to submit

* Submit on eCampus by **Thursday evening, February 29**, at the latest
* Notebook (**only 1** per group): Include code and report (in markdown)
* The slides in PDF format (if different from the notebook, **only 1** per group)

Remember to include **all member names** in the Notebook/Slides.

* Include external Python modules, if used
* Do not include data (you should include a link to the data in the NB)
* Keep the size of the NB under 10 MB (e.g., avoid using Plotly)

#### Report structure

1. Intro (explanation of the task in your own words)
2. Preprocessing: data description and cleaning
3. Training: models, hyperparameters, etc.
4. Evaluation: metric description, performance on dev and test sets, error analysis
5. Discussion and conclusion (what you did; what worked; what didn't; if you had more time...)

#### Evaluation of the project presentation (15 min + 5 min questions)

* Preprocessing (exploration, cleaning) (`/4`)
* Model (`/4`)
* Code readability (`/4`)
* Performance evaluation and analysis/explanation (`/4`)
* Oral presentation (`/4`, **individual** grade)

#### Recommendation

* Use 3 levels for your approach:
    * Use a baseline approach, e.g., with simple word embedding extraction (Word2vec, FastText, etc.) and a basic ML algorithm
    * Use a second, more expressive baseline with Transformers as feature extraction only and the same basic ML algorithms, or directly with the already fine-tuned head for the tasks
    * Use a more comprehensive approach with a Transformers model that you fine-tuned on your data (or neural seq-to-seq models)
* Train and evaluate on the labeled data
* Try a prediction on other datasets to evaluate or, for generative approaches, manually inspect some produced examples

#### Examples of domains

* Sentiment analysis
* Topic Modeling
* Tweets Event Classification
* Sequence Tagging/NER/POS
* Parsing
* Stance Detection
* Punctuation Restoration (ASR post-processing)
* Text Summarization
* Multimodal Embeddings
* Text Generation/ChatBot
* Sarcasm/Irony Detection
* Recommender System
* Question Answering

### Current schedule

<iframe class="airtable-embed" src="https://airtable.com/embed/app0EwPzQaluBIdBq/shr4kuJdC7WueND1L?viewControls=on" frameborder="0" onmousewheel="" width="100%" height="533" style="background: transparent; border: 1px solid #ccc;"></iframe>
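
As an illustration of the first level in the Recommendation section (word embeddings plus a basic ML algorithm), here is a minimal sketch using only the Python standard library. Everything in it is an assumption made for the example: the 2-d vectors in `EMBEDDINGS` are hand-written toy values standing in for pretrained Word2vec or FastText vectors, the nearest-centroid classifier is just one possible "basic ML algorithm", and the tiny sentiment dataset is invented. A real project would likely load pretrained vectors and use a library classifier instead.

```python
from collections import defaultdict
from math import sqrt

# Toy 2-d "embeddings" (hypothetical values for illustration only; a real
# project would load pretrained Word2vec or FastText vectors instead).
EMBEDDINGS = {
    "great": (0.9, 0.1),
    "good": (0.8, 0.2),
    "love": (0.7, 0.1),
    "bad": (0.1, 0.9),
    "awful": (0.2, 0.8),
    "hate": (0.1, 0.7),
}
DIM = 2


def embed(text):
    """Mean-pool the vectors of known words; zero vector if none are known."""
    vectors = [EMBEDDINGS[w] for w in text.lower().split() if w in EMBEDDINGS]
    if not vectors:
        return (0.0,) * DIM
    return tuple(sum(v[i] for v in vectors) / len(vectors) for i in range(DIM))


def fit_centroids(texts, labels):
    """Compute one centroid per label from the embedded training texts."""
    grouped = defaultdict(list)
    for text, label in zip(texts, labels):
        grouped[label].append(embed(text))
    return {
        label: tuple(sum(v[i] for v in vecs) / len(vecs) for i in range(DIM))
        for label, vecs in grouped.items()
    }


def predict(centroids, text):
    """Return the label whose centroid is closest (Euclidean) to the text."""
    vec = embed(text)
    return min(
        centroids,
        key=lambda label: sqrt(
            sum((a - b) ** 2 for a, b in zip(vec, centroids[label]))
        ),
    )


def accuracy(centroids, texts, labels):
    """Fraction of texts whose predicted label matches the gold label."""
    correct = sum(predict(centroids, t) == l for t, l in zip(texts, labels))
    return correct / len(labels)


# Invented toy data: train on one split, evaluate on a held-out dev split.
train_texts = ["great movie love it", "good film", "awful plot", "hate this bad film"]
train_labels = ["pos", "pos", "neg", "neg"]
dev_texts = ["love this good movie", "bad and awful"]
dev_labels = ["pos", "neg"]

centroids = fit_centroids(train_texts, train_labels)
dev_accuracy = accuracy(centroids, dev_texts, dev_labels)
print(f"Dev accuracy: {dev_accuracy:.2f}")
```

The same train/dev structure carries over to the second level: only `embed` would change, for example to return features extracted from a Transformer model, while the classifier and the evaluation stay the same, which keeps the comparison between levels fair.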