# MSc Thesis proposal - Bart van Oort

> Supervisors: Luís Cruz, Maurício Aniche, Arie van Deursen (TU Delft)
> In collaboration with: ING AI FinTech lab

Original proposal title: Code smells for machine learning code

## Problem description

- In the current day and age, Machine Learning (ML) is used to help solve a massive variety of computing problems.
- ML code is often written in Python or Jupyter Notebooks.
- Python is not a statically typed language, and static analysis tools are not very prevalent for it, especially for Jupyter notebooks.
- Specifically for ML code written in Python, the only available linter I currently know of is one from a previous MSc thesis: https://github.com/MarkHaakman/dslinter. It does report false positives, though.
- ML code is often written by data scientists who have great experience in data analysis and statistics, but who are not very experienced with practices from classical software engineering.
- As a result, there is little classical software engineering experience in the field, so static analysis tooling, code quality, software architecture and testing practices (among others) are not (yet) very prevalent or well applied in the field of ML.

## Research topics

Following from the problem description, a number of interesting questions came to mind that can be investigated during the thesis. These have been categorised into multiple topics; each of the subsections below outlines one topic to be researched during my thesis.

### Empirical analysis of the current state of the art in ML code quality practices

With this topic, I aim to get a better understanding of how ML is currently being developed, how much experience in code quality practices the people developing ML code have, and to what degree code quality practices are being applied in industry. Based on the answers I find during my research, I will be able to prioritise which ML code quality practices need more research, tooling or standardisation to make ML developers in the field more productive and to reduce their chances of making programmer errors.

- Who writes ML code? What kind of experience do these people / teams generally have? What level of knowledge about code quality practices do they have?
- What (classical) software engineering and code quality practices are currently being applied in developing ML code? E.g. static analysis (e.g. `pylint`), dynamic analysis, testing, code reviews, editor integrations, CI/CD, architectural decoupling. How prevalent is the use of these techniques? (See the sketch after this list for an example of the kind of smells static analysis could catch in ML code.)
- Are there standardised practices for creating ML projects? For example, when someone wants to write a ReactJS application (whether in JavaScript or TypeScript), [`create-react-app`](https://github.com/facebook/create-react-app) can be used to set up the necessary tooling to develop, build, test and deploy that application. Similarly, for writing an Elm web-app there is [`create-elm-app`](https://github.com/halfzebra/create-elm-app), and for generating Rust projects based on a template there is [`cargo-generate`](https://github.com/ashleygwilliams/cargo-generate). Are there any such tools for ML code?
- Python is the most used language for writing ML code, but what other programming languages are used for it? What reasons do ML developers have to use Python over these languages, or vice versa?
- Regarding ML code written in these other languages, what is the state of (classical) software engineering and code quality practices being applied there?
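To make the static analysis question concrete, the sketch below shows two ML-specific code smells that a linter could, in principle, flag. These particular checks are my own illustrative assumptions, not necessarily part of `dslinter`'s actual rule set.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"age": [25, 35, 45, 55], "label": [0, 0, 0, 0]})

# Smell 1: chained indexing. It is ambiguous whether this modifies `df`
# or a temporary copy; pandas can only warn about this at runtime
# (SettingWithCopyWarning), whereas a linter could flag it statically.
df[df["age"] > 30]["label"] = 1

# Preferred: a single .loc call with explicit row and column selection.
df.loc[df["age"] > 30, "label"] = 1

# Smell 2: no fixed random seed, so the split (and any metric computed
# on it) is not reproducible across runs.
train, test = train_test_split(df, test_size=0.25)

# Preferred: pin the seed so experiments can be reproduced.
train, test = train_test_split(df, test_size=0.25, random_state=42)
```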
### Improving `dslinter`

This topic focuses on understanding and improving `dslinter`, the linter that resulted from previous research on code smells in ML code.

- What kinds of code smells does `dslinter` pick up? Why are those considered code smells?
- When does `dslinter` give false positives? Why are they false positives?
- What is required, and what can be done, to make `dslinter` more reliable?
- Are there any other code linters available specifically for ML code in Python? What techniques do they apply and what can we learn from them?
- Once `dslinter` works reliably, how can its use be standardised in the ML development world?

### Deployment & integration with existing infrastructure

This is a topic that I personally find very interesting. There can be thousands of ML developers creating fancy models for solving all sorts of problems, but if none of those models are deployable and production-ready, they may never be used. Similarly, if a model cannot (easily) be integrated into existing infrastructure, it may never be used.

- How does ML code go from scrapbook to first prototype, to minimum viable product (MVP), to production? What steps happen in between? How can we streamline this process and help developers through it? How applicable are DevOps practices here? (See the first sketch after this list for an illustration of one such step.)
- How do ML developers experience this process? What do they spend their time on? What _should_ they be spending their time on?
- What software engineering practices are being applied in deploying ML code?
- How is ML code integrated into a company's IT infrastructure in the software-architectural sense? What interface should exist between the ML code and the rest of the IT infrastructure, so that the complexity of the ML code, its inputs and its outputs is abstracted away from the rest of the system? And vice versa: how can the complexity of the rest of the system be abstracted away from the ML developer, allowing them to do what they do best: develop ML code? (See the second sketch after this list.)
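To make the scrapbook-to-production question more tangible, the sketch below shows one typical step in that journey: notebook-style top-level code refactored into an importable function with a unit test that a CI pipeline could run on every commit. All names and the file layout are hypothetical.

```python
# Notebook-style code (top-level statements, hard to reuse or test):
#
#   df = pd.read_csv("train.csv")
#   model = LogisticRegression()
#   model.fit(df.drop(columns=["label"]), df["label"])
#
# The same logic refactored into an importable, testable unit.
import pandas as pd
from sklearn.linear_model import LogisticRegression


def train_model(df: pd.DataFrame, label_column: str = "label") -> LogisticRegression:
    """Train a classifier on `df`; kept free of I/O so it can be unit-tested
    with an in-memory DataFrame instead of a CSV file on disk."""
    model = LogisticRegression(max_iter=1000)
    model.fit(df.drop(columns=[label_column]), df[label_column])
    return model


def test_train_model_produces_fitted_classifier():
    """A unit test a CI pipeline could run on every commit."""
    df = pd.DataFrame({"x1": [0, 1, 0, 1], "x2": [1, 1, 0, 0],
                       "label": [0, 1, 0, 1]})
    model = train_model(df)
    assert model.predict(df[["x1", "x2"]]).shape == (4,)
```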
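The last question is essentially about finding the right abstraction boundary. Below is a minimal sketch of one possible shape for that interface, assuming a scikit-learn classifier serialised with joblib; all names (`Customer`, `ChurnModel`, `predict_risk`) are hypothetical. The rest of the system talks to the model in terms of domain objects, while feature engineering and the ML library stay hidden behind the facade.

```python
from dataclasses import dataclass

import joblib
import pandas as pd


@dataclass
class Customer:
    """Domain object as the rest of the system knows it."""
    age: int
    tenure_months: int
    monthly_spend: float


class ChurnModel:
    """Facade that hides model loading, feature engineering and the
    ML library's API behind a single domain-level method."""

    def __init__(self, model_path: str) -> None:
        # Assumes a scikit-learn classifier serialised with joblib
        # by the training code.
        self._model = joblib.load(model_path)

    def predict_risk(self, customer: Customer) -> float:
        # Feature engineering lives here, next to the model, so changes
        # to the feature set do not ripple into the rest of the system.
        features = pd.DataFrame([{
            "age": customer.age,
            "tenure_months": customer.tenure_months,
            "spend_per_month_of_tenure":
                customer.monthly_spend / max(customer.tenure_months, 1),
        }])
        # predict_proba returns class probabilities; take P(churn).
        return float(self._model.predict_proba(features)[0][1])
```

The design choice here is that callers never see DataFrames, tensors or library-specific types, which also gives the ML developer the mirror-image benefit: they can change features or swap the underlying model without touching any code outside the facade.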