# Software Development Workflow

[TOC]

## Introduction

It is important to create a development environment and workflow that not only allows effective collaboration but also sets a foundation for the growth and evolution of your project. In this guide, we discuss organizing your project in a repository and setting up a workflow for personal and collaborative projects.

## Project Organization

In software development, early choices affect the final outcome of your project. One of the most important is how to structure the project. Organizing your projects systematically is a crucial first step towards making your work reproducible.

### Essential principles

- **Directory Structure**: Employ a consistent and meaningful directory naming convention.
- **Naming Files and Directories**: Use underscores or hyphens instead of spaces.
- **Handling Access Levels**: Use different Git repositories for the public and private parts of your project. Use `.gitignore` or a dedicated untracked folder for sensitive content and for files that are too large.
- **Clear Documentation**: Include a `README` at the root to provide a project summary, and add an appropriate `LICENSE` to your project. The license establishes the terms under which others can use, reuse, and modify your work, so that it is legally safeguarded and the usage rights are clearly defined.
- **Adhere to Coding Standards**: Follow a consistent coding style to enhance code readability.

### Other recommendations

- **Code Reusability**: Store reusable software elements in a separate repository for efficiency across projects, and consider packaging them.
- **Code Modularity**: Aim for modular code design to improve maintainability and reusability, especially in larger projects.
- **Dependency Management**: Use virtual environments (Python) or similar tools to manage project dependencies, ensuring consistent environments.
- **CI/CD Integration**: Consider setting up Continuous Integration/Continuous Deployment pipelines to streamline testing and deployment processes.

A common repository structure that works well for MATLAB and Python projects:

```shell
your_project/
│
├── build/                  # compiled application for distribution (if applicable)
├── docs/                   # documentation directory
├── lib/                    # third-party libraries
├── notebooks/              # Jupyter notebooks or MATLAB Live Editor scripts
├── src/                    # your project's source code, including the main script
│   └── mypkg/              # package
│       ├── module          # nested module
│       └── subpkg1/        # sub-package
├── tests/                  # your test directory
│
├── data/                   # data files used in the project (if applicable)
├── processed_data/         # files from your analysis (if applicable)
├── results/                # results (if applicable)
│
├── .gitignore              # untracked files
├── requirements.txt        # software dependencies (Python)
├── README.md               # overview
└── LICENSE                 # license information
```

This structure is a guideline and can be adapted to the specific needs and practices of your project. Some additional observations:

- Naming convention: use lowercase for folders. Particular metadata files are often capitalized, such as README, LICENSE, CONTRIBUTING, CODE_OF_CONDUCT, CHANGELOG, CITATION.cff, NOTICE, and MANIFEST.
- Carefully consider how users will access your software. They may not have access to your repository structure when installing it as a library.
- Generally, all content that is generated at build time or runtime should be added to `.gitignore`. This likely includes the contents of the `processed_data` and `results` folders.
- Git cannot track empty folders. If you want to add empty folders to enforce a folder structure, e.g., `processed_data` or `results`, add a `.gitkeep` file to the folder.
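
The last two observations can be combined in a small scaffolding script. The following is a minimal sketch, not a definitive tool: the function name `scaffold` and the lists `DIRECTORIES` and `GENERATED` are illustrative assumptions, the folder names are taken from the layout above (`build/` is left out because it is produced by a build step), and the script only writes a `.gitignore` if none exists yet.

```python
from pathlib import Path

# Folder names follow the layout shown above; adapt them to your project.
DIRECTORIES = [
    "docs",
    "lib",
    "notebooks",
    "src/mypkg/subpkg1",
    "tests",
    "data",
    "processed_data",
    "results",
]

# Assumption: these folders only hold generated content, so Git should keep
# the folder itself but ignore whatever ends up inside it.
GENERATED = ["processed_data", "results"]


def scaffold(root: Path) -> list[Path]:
    """Create the project skeleton under `root` and return the created directories."""
    created = []
    for name in DIRECTORIES:
        directory = root / name
        directory.mkdir(parents=True, exist_ok=True)
        created.append(directory)

    gitignore_lines = []
    for name in GENERATED:
        # Git does not track empty folders, so add a placeholder file.
        (root / name / ".gitkeep").touch()
        # Ignore everything in the folder except the placeholder.
        gitignore_lines += [f"{name}/*", f"!{name}/.gitkeep"]

    gitignore = root / ".gitignore"
    if not gitignore.exists():  # never overwrite an existing .gitignore
        gitignore.write_text("\n".join(gitignore_lines) + "\n")

    # Empty placeholders for the metadata files listed in the layout.
    for filename in ("README.md", "LICENSE", "requirements.txt"):
        (root / filename).touch()
    return created
```

Calling `scaffold(Path("your_project"))` creates the folders and leaves `processed_data/` and `results/` tracked but empty. For a richer starting point, a project template is usually the better choice.
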
:::info
:book: **Further reading**
- [CodeRefinery - Organizing your projects](https://coderefinery.github.io/reproducible-research/organizing-projects/)
- [ArjanCodes guide to structuring Python projects](https://arjancodes.com/blog/guide-to-structuring-python-projects/)
- [A collection of `.gitignore` templates](https://github.com/github/gitignore)
- [TU Delft Software License Policy](https://filelist.tudelft.nl/TUDelft/Over_TU_Delft/Strategie/TU%20Delft%20Research%20Software%20Guidelines.pdf)
- [Choosing between a `src/` layout and a flat layout for Python](https://packaging.python.org/en/latest/discussions/src-layout-vs-flat-layout/)
:::

## Project templates

Templates are versatile tools that aim to standardize the software development process across various domains.

### GitHub repository templates

You can make an existing repository a template, so that you and others can generate new repositories with the same directory structure, branches, and files. Note that a template repository cannot include files stored using Git LFS. For more information, see [Creating a template repository](https://docs.github.com/en/repositories/creating-and-managing-repositories/creating-a-template-repository).

- [Template repository for MATLAB/Octave projects](https://github.com/Remi-Gau/template_matlab_analysis)

### Cookiecutter for Python

[Cookiecutter](https://www.cookiecutter.io/) creates Python projects from project templates. Its advantage is that new projects are set up quickly from a standardized template structure and can include everything needed to get started, such as directory layouts, sample code, and even integrations with tools and services.

- **Cookiecutter PyPackage:** A comprehensive template for Python projects, facilitating the creation of Python packages with best practices in testing, documentation, and package structure. Ideal for developers looking to distribute their Python libraries.
  - GitHub: [cookiecutter-pypackage](https://github.com/audreyfeldroy/cookiecutter-pypackage)
  - GitHub: [Netherlands eScience Center template](https://github.com/NLeSC/python-template)
- **Cookiecutter Data Science:** Tailored for data science projects, this template organizes data, models, analyses, and notebooks, ensuring that data science projects are reproducible and well-documented from the start.
  - GitHub: [cookiecutter-data-science](https://github.com/drivendata/cookiecutter-data-science)
- **Cookiecutter Machine Learning:** Designed specifically for machine learning projects, this template includes directories for datasets, models, notebooks, and scripts, supporting ML project best practices and facilitating experimentation and collaboration.
  - https://dagshub.com/DagsHub/Cookiecutter-MLOps
  - https://github.com/Chim-SO/cookiecutter-mlops

For installation instructions, see [Cookiecutter installation instructions](https://cookiecutter.readthedocs.io/en/1.7.3/installation.html).

:::info
:book: **Further reading**
- [Tutorial for Cookiecutter](https://cookiecutter-python.readthedocs.io/en/latest/tutorial.html)
- [Example - Cookiecutter Jupyter Book](https://github.com/executablebooks/cookiecutter-jupyter-book)
- [Cookiecutter templates on GitHub](https://github.com/search?q=cookiecutter&type=Repositories)
:::

## Reusing projects and repositories

#### Packaging

Create an installable package or library so that other projects can install it as a dependency in their environment.

#### Git submodules

[Git submodules](https://www.atlassian.com/git/tutorials/git-submodule) allow you to keep a Git repository as a subdirectory of another Git repository. A submodule is a record that points to a specific commit in the external repository.

#### Git subtree

[Git subtree](https://www.atlassian.com/git/tutorials/git-subtree) allows you to merge the history of one repository into another as a subdirectory.
It essentially brings the contents of one repository into another as if they were part of its directory structure.

In summary, submodules are more suitable when you need to maintain separate histories and explicit references to specific commits of nested repositories, while subtrees are useful when you want to merge the history of nested repositories into a single repository without maintaining separate references.

:::warning
:warning: **Avoid:**
- Storing commonly used code in a separate folder on your system and adding that folder to the Python path. Other users and developers will not have access to these folders.
- Copying and pasting code directly, as you lose any upstream changes made in the external repository.
:::

## Dependency management

Managing dependencies is a critical aspect of any software project. Efficient dependency management ensures that your project is reproducible, easy to set up, and less prone to conflicts between the different libraries that your code depends on.

### Python

Ensuring that every contributor uses the same dependency versions is essential for project consistency and stability.

- **Virtual Environments**: Use `venv` or `virtualenv` to create isolated Python environments for your projects. This prevents package versions from interfering with each other across different projects.
- **Requirements File**: Use a `requirements.txt` file to list all dependencies with their specific versions. You can generate this file by running `pip freeze > requirements.txt` in an activated virtual environment.
- **Dependency Management Tools**: Tools like `poetry` and `pipenv` provide more sophisticated dependency management by handling virtual environment creation and dependency resolution in a more integrated manner.

:::success
:point_up: **Tip!**
Consider using Conda; it is a preferred choice within the research software community. Conda is a language-agnostic package manager that manages both packages and environments.
It is ideal for projects requiring specific Python versions, packages not available via pip, and other dependencies such as R libraries and C or C++ libraries.
:::

:::info
:book: **Further reading**
- [CodeRefinery - Recording dependencies](https://coderefinery.github.io/reproducible-research/dependencies/)
- [The Turing Way - Package Management Systems](https://the-turing-way.netlify.app/reproducible-research/renv/renv-package)
- [pipenv documentation](https://pipenv.pypa.io/en/latest/)
- [poetry documentation](https://python-poetry.org/docs/)
- [Conda documentation](https://docs.conda.io/projects/conda/en/stable/user-guide/getting-started.html)
:::

### MATLAB

MATLAB does not use virtual environments in the same sense as Python, but it lets you set up paths and toolboxes that serve a similar purpose by organizing and encapsulating project-specific functions and scripts. Dependency management in MATLAB often involves ensuring that the correct toolboxes are licensed and available, and using MATLAB's Project feature to manage and share paths and environments with others.

MATLAB toolbox requirements can be found with the function [**`matlab.codetools.requiredFilesAndProducts`**](https://nl.mathworks.com/help/matlab/ref/matlab.codetools.requiredfilesandproducts.html) or with the [**Dependency Analyzer**](https://nl.mathworks.com/help/matlab/matlab_prog/analyze-project-dependencies.html).
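
### Checking pinned Python dependencies

For the Python workflow described above, a pinned `requirements.txt` only helps if the active environment actually matches it. The sketch below is one possible check, using the standard-library `importlib.metadata` module. The helper names `parse_pins` and `find_mismatches` are hypothetical, and the parser is a deliberate simplification: it only understands plain `name==version` lines such as those written by `pip freeze`, and it compares version strings literally.

```python
from importlib import metadata


def parse_pins(text):
    """Return {package: version} for every `name==version` line in the text."""
    pins = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0]          # drop trailing comments
        line = line.split(";", 1)[0].strip() # drop environment markers
        if "==" not in line:
            continue  # skip blank lines, options, and unpinned requirements
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins):
    """Compare pinned versions with what is installed in the current environment."""
    problems = []
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed (pinned {wanted})")
            continue
        if installed != wanted:  # literal string comparison, an assumption
            problems.append(f"{name}: installed {installed}, pinned {wanted}")
    return problems
```

A typical use would be `find_mismatches(parse_pins(Path("requirements.txt").read_text()))` with `Path` imported from `pathlib`, run inside the activated virtual environment. An empty list means every pinned package is installed at the pinned version. Tools such as `poetry` and `pipenv` perform stricter checks through their lock files.
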