---
tags: Edu
---
# Hands-on NLP
Master 1 Computer Science -- Artificial Intelligence
Coordination and lectures: Marc Evrard and Nona Naderi
January-February, 2023-24
<!-- ## Outline
Machine learning and artificial intelligence (AI), in general, have begun to influence every aspect of our lives and societies, from the advertisements we see to the medical diagnoses we receive and the cars we drive. The rapid growth of AI research and applications brings both unprecedented opportunities and legitimate worries about its potential misuse. This class aims at raising awareness about properly using the data that powers our algorithms and other ethical issues related to fairness, including: experimentation on human subjects; avoiding, detecting, or correcting bias; explainability of decisions made and interpretability of models; privacy; confidence, reliability, and testing; and adversarial attacks and defenses. -->
Grading (MCC) = continuous assessment (CC) × 20% (2 graded assignments) + project × 80%
## Project
### Information
#### Presentation
* In English
* All group members should participate equally during the presentation
* Reminder: **Groups of 2 to 3 members**
#### Submission format
* Submit on eCampus by **Thursday evening, February 29**, at the latest
* Notebook (**only 1** per group): include code and report (in Markdown)
* Slides in PDF format (if different from the notebook, **only 1** per group)
* Remember to include **all member names** in the notebook/slides
* Include any external Python modules, if used
* Do not include data (you should include a link to the data in the NB)
* Keep the size of the NB under 10 MB (e.g., avoid using Plotly)
#### Report structure
1. Intro (explanation of the task in your own words)
2. Preprocessing: data description and cleaning
3. Training: models, hyperparameters, etc.
4. Evaluation: metric description, performance on dev and test sets, error analysis
5. Discussion and Conclusion
(what you did; what worked; what didn't; if you had more time...)
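
For the evaluation step of the report, a metric can be computed and misclassifications collected along the following lines. This is only a minimal sketch for a classification task: the function names `macro_f1` and `confusions` and the toy labels are assumptions made for illustration, and a library implementation (e.g., from scikit-learn) may be preferable in practice.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over the classes seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


def confusions(gold, pred):
    """Count (gold, predicted) pairs for misclassified examples only."""
    return Counter((g, p) for g, p in zip(gold, pred) if g != p)


# Toy labels, made up for illustration only.
dev_gold = ["pos", "neg", "pos", "neg"]
dev_pred = ["pos", "neg", "neg", "neg"]

dev_score = macro_f1(dev_gold, dev_pred)
dev_errors = confusions(dev_gold, dev_pred)
```

Reporting the same metric on both the dev and test sets, together with the most frequent confusion pairs, gives a starting point for the error analysis.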
#### Evaluation of the project presentation (15 min + 5 min questions)
* Preprocessing (exploration, cleaning) (`/4`)
* Model (`/4`)
* Code readability (`/4`)
* Performance evaluation and analysis/explanation (`/4`)
* Oral presentation (`/4`, **individual** grade)
#### Recommendations
* Use 3 levels for your approach:
    * A baseline approach, e.g., simple word-embedding extraction (Word2vec, FastText, etc.) with a basic ML algorithm
    * A second, more expressive baseline that uses Transformers either for feature extraction only, with the same basic ML algorithm, or directly with a head already fine-tuned for the task
    * A more comprehensive approach with a Transformer model that you fine-tune on your data (or neural seq-to-seq models)
* Train and evaluate on the labeled data
* Try predictions on other datasets to evaluate or, for generative approaches, manually inspect some generated examples.
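
The first level (word embeddings plus a basic ML algorithm) could look roughly like the sketch below. It is an illustration under stated assumptions, not a reference solution: `TOY_VECTORS` is a made-up 2-dimensional table standing in for pretrained Word2vec or FastText vectors, and a nearest-centroid classifier stands in for whichever basic ML algorithm you choose.

```python
import math

# Toy 2-d "embeddings", made up for illustration; in a real project these
# would come from a pretrained Word2vec or FastText model.
TOY_VECTORS = {
    "great": (0.9, 0.1),
    "good": (0.8, 0.2),
    "bad": (0.1, 0.9),
    "awful": (0.0, 1.0),
    "movie": (0.5, 0.5),
}


def embed(text):
    """Mean-pool the vectors of known tokens; zero vector if none are known."""
    vectors = [TOY_VECTORS[tok] for tok in text.lower().split() if tok in TOY_VECTORS]
    if not vectors:
        return (0.0, 0.0)
    return tuple(sum(dim) / len(vectors) for dim in zip(*vectors))


def fit_centroids(texts, labels):
    """Nearest-centroid 'training': average the document vectors per class."""
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(embed(text))
    return {
        label: tuple(sum(dim) / len(vecs) for dim in zip(*vecs))
        for label, vecs in by_label.items()
    }


def predict(centroids, text):
    """Return the label whose centroid is closest (Euclidean) to the text vector."""
    vec = embed(text)
    return min(centroids, key=lambda label: math.dist(vec, centroids[label]))


# Toy training data, made up for illustration only.
train_texts = ["great movie", "good movie", "bad movie", "awful movie"]
train_labels = ["pos", "pos", "neg", "neg"]

centroids = fit_centroids(train_texts, train_labels)
predictions = [predict(centroids, t) for t in ["good", "awful"]]
```

For the second level, the same classifier can be kept while `embed` is swapped for sentence features extracted from a Transformer, which makes the comparison between levels easier to interpret.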
#### Example domains
* Sentiment analysis
* Topic Modeling
* Tweets Event Classification
* Sequence Tagging/NER/POS
* Parsing
* Stance Detection
* Punctuation Restoration (ASR post-processing)
* Text Summarization
* Multimodal Embeddings
* Text Generation/ChatBot
* Sarcasm/Irony Detection
* Recommender System
* Question Answering
<!-- * Condition Detection From Text -->
<!--
* Tabular problem (try to avoid NLP or advanced image processing in this class)
* More than 10 features (ideally more than 100)
* More than 1000 instances (ideal 10k to 1M)
* Most problems you have chosen are already solved (e.g., Kaggle)
* Make sure you summarize the results of these solutions in your notebook (or slides) as state-of-the-art
* And give arguments for choosing your solution (that must be, of course, original)
-->
### Current schedule
<iframe class="airtable-embed" src="https://airtable.com/embed/app0EwPzQaluBIdBq/shr4kuJdC7WueND1L?viewControls=on" frameborder="0" onmousewheel="" width="100%" height="533" style="background: transparent; border: 1px solid #ccc;"></iframe>