---
author: Marc Evrard
date: 2023
title: "L2-ISD2 (2024-25)"
tags: Edu
---

Info S4: Introduction à la science des données II
========

Year 2024-2025

Teachers
--------

* Yue Ma (CM-TP)
* Marc Evrard (CM-TP)

Prerequisites
-------------

Basic probability and statistics, basic algebra, programming experience in Python

Prepares for
------------

Introduction to Machine Learning (Introduction à l’apprentissage statistique)

Assessments
-----------

100% Continuous evaluation

* Session 1: 30% CC + 70% Project
* Session 2: 100% Improved Project

<!--
Plan
----

Part | Week     | Course                       | Practical
:---:| -------: | ---------------------------- | -----------
1    | March 11 | Classification               | TP 1: Iris
2    | March 18 | Unsupervised Learning        | TP 2: Digits
3    | March 25 | Regression & Evaluat.        | TP 3: Ocean
4    | April 1  | Presentations TP 3           | Presentations TP 3
5    | April 8  | ML Projects Checklists       | Project
6    | April 15 | Project                      | Project
6    | April 22 | *Holidays*                   |
7    | April 29 | Project Presentation (May 3) | Submission Deadline: May 1
-->

Data Scientists Check-List
--------------------------

* Frame the problem and look at the big picture
* Collect the data
* Explore the data to gain insights
* Prepare the data for Machine Learning algorithms
* Select a model and train it
* Fine-tune your model
* Present your solution

<!--
---

Questions TP-1
--------------

<iframe class="airtable-embed" src="https://airtable.com/embed/appnj3rDDj4i0djDb/shrNZHi8U93ytzijt" frameborder="0" onmousewheel="" width="100%" height="533" style="background: transparent; border: 1px solid #ccc;"></iframe>
-->

---

Project
-------

### Information

<!--
*** For year 2024-25 ***

#### Presentation

* In English
* All group members should participate equally during the presentation
-->

#### Deadline

Submit your notebook and slides (**only 1 submission per team**) on eCampus the day before the presentation at the latest (**May 1, 23:59**).

#### Recommendation

* Tabular problem (try to avoid NLP or advanced image processing in this class)
* More than 10 features (ideally more than 100)
* More than 1000 instances (ideally 10k to 1M)
* Most problems you have chosen are already solved (e.g., [Codabench](https://www.codabench.org/), [Kaggle](https://www.kaggle.com/))
    * Make sure you summarize the results of these solutions in your presentation as state-of-the-art (SOTA)
    * And give arguments for choosing your solution (which must, of course, be original)

#### Format to submit

* Notebook (**only 1** per team): Include code and report (in markdown)
* The slides in PDF format (if different from the notebook, **only 1** per team)
* Don't forget to include **all team member names** in the Notebook/Slides.
* Include any external Python modules you use
* Do not include data
    * You should include a **link** to the data in the Notebook
* Keep the size of the NB under 20 MB (e.g., avoid using Plotly)

#### Report structure

1. Intro (explanation of the **task** in your own words and **SOTA**)
2. Preprocessing (exploration of the data, cleaning, etc.)
3. Modeling (training)
4. Evaluation
5. Conclusion (what you did, what worked, what didn't, what you would do with more time, etc.)
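
As a rough illustration of how steps 2 to 4 of the report structure above could map onto notebook cells, here is a minimal sketch using scikit-learn. The synthetic dataset from `make_classification`, the logistic regression model, and the accuracy metric are all placeholder assumptions chosen for brevity; they are not the expected choices for your project, which should use your own data, models, and metrics.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 2. Preprocessing: synthetic tabular data as a stand-in (assumption);
#    replace with your own dataset, exploration, and cleaning.
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 3. Modeling: scaling and classifier in one pipeline, so the scaler
#    is fitted on training folds only during cross-validation.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
model.fit(X_train, y_train)

# 4. Evaluation: report cross-validation scores and a held-out test score.
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
print(f"Test accuracy: {test_accuracy:.3f}")
```

Keeping the test set untouched until the final evaluation is one common way to make the reported performance easier to defend during the presentation.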

#### Evaluation of the project presentation (15 min + 5 min questions)

* Intro and preprocessing (exploration, cleaning) (/5)
* Models (/5)
* Performance evaluation and analysis/explanation (/5)
* Code readability (/5)
* Oral presentation + answering questions (/20, individual grade)

### Schedule

<iframe class="airtable-embed" src="https://airtable.com/embed/appnj3rDDj4i0djDb/shr4QNc36aqX4ssfG" frameborder="0" onmousewheel="" width="100%" height="533" style="background: transparent; border: 1px solid #ccc;"></iframe>

<!--
#### Group 1

<iframe class="airtable-embed" src="https://airtable.com/embed/appnj3rDDj4i0djDb/shrkGadMsNSw4163Q" frameborder="0" onmousewheel="" width="100%" height="533" style="background: transparent; border: 1px solid #ccc;"></iframe>

#### Group 2

<iframe class="airtable-embed" src="https://airtable.com/embed/appnj3rDDj4i0djDb/shrjQnYMbwj2Kqa7C" frameborder="0" onmousewheel="" width="100%" height="533" style="background: transparent; border: 1px solid #ccc;"></iframe>

#### Group 3

<iframe class="airtable-embed" src="https://airtable.com/embed/appnj3rDDj4i0djDb/shrOhyGRJbF5elXRG?viewControls=on" frameborder="0" onmousewheel="" width="100%" height="533" style="background: transparent; border: 1px solid #ccc;"></iframe>
-->

---

References
----------

### Main References

* VanderPlas, J. (2017). *Python Data Science Handbook: Essential Tools for Working with Data.* https://jakevdp.github.io/PythonDataScienceHandbook
* McKinney, W. (2017). *Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython (2^nd^ ed.).* https://github.com/wesm/pydata-book
* Géron, A. (2019). *Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems (2^nd^ ed.).* https://github.com/ageron/handson-ml2
* Grus, J. (2019). *Data Science from Scratch: First Principles with Python (2^nd^ ed.).*
* VanderPlas, J. (2016). *A Whirlwind Tour of Python.* https://jakevdp.github.io/WhirlwindTourOfPython

<!--
Great references for **machine learning algorithm** theory:

* @russell2003artificial _Artificial Intelligence: A Modern Approach_ (4th ed.)
* @bishop2006pattern _Pattern Recognition and Machine Learning_
-->

### Online References

* *The **Python** Language Reference*: https://docs.python.org/3/reference/index.html
* *The **Python** Tutorial*: https://docs.python.org/3/tutorial/
* ***JupyterLab** Documentation*: https://jupyterlab.readthedocs.io/en/stable/
* ***NumPy** Documentation*: https://numpy.org/doc/stable/
* ***Matplotlib** Documentation*: https://matplotlib.org/stable/contents.html
* ***pandas** Documentation*: https://pandas.pydata.org/docs/
* ***Scikit-learn** User Guide*: https://scikit-learn.org/stable/user_guide.html

### More Advanced References (ML)

* Bishop, C. M. (2006). *Pattern Recognition and Machine Learning.* Springer.
* (**French**) Course archive: *IFT 603 - Techniques d'apprentissage*, Université de Sherbrooke (Hugo Larochelle)
    * http://www.dmi.usherb.ca/~larocheh/cours/ift603_H2015/contenu.html