# URSSI Winter School - Jan 2024
Hello all!
### Software Design and Modularity by Eva Maxfield Brown
#### How to Have a Simpler Program?
1. easier to understand when broken down into pieces (helps collaboration)
2. easier to maintain, reuse, and extend
3. avoids extra work
#### Core Ideas
1. **Decomposable**: modules and teamwork
2. **Composable**: reused in different places
3. **Understandable**: each module can be examined, developed, etc in isolation
4. **Continuity**: small changes affect small part of module
5. **Isolation**: easy to identify where errors come from
* Cohesion: being consistent (e.g., consistent naming, simple single-purpose functions).
* Avoid coupling: avoid tight interdependence, e.g., modules calling into each other's internals from many different scripts
#### Basic Application
1. Find the similarity of steps
2. Figure out what we need to know and change
3. Define and write basic functions
* Functions can be tested with different examples
#### Common Patterns
Repository of patterns: https://github.com/faif/python-patterns
Q&A: what is handler?
`handler`: a function or class with a certain universal signature/characteristics that does the actual work (e.g., a `class` per case); `factory`: a specific function (`def`) that picks and returns the right handler. Ex:
```python
from typing import Dict, Type, Union


class GreekLocalizer:  # Handler
    """A simple localizer a la gettext"""

    def __init__(self) -> None:
        self.translations = {"dog": "σκύλος", "cat": "γάτα"}

    def localize(self, msg: str) -> str:
        """We'll punt if we don't have a translation"""
        return self.translations.get(msg, msg)


class EnglishLocalizer:  # Handler
    """Simply echoes the message"""

    def localize(self, msg: str) -> str:
        return msg


Localizer = Union[GreekLocalizer, EnglishLocalizer]


def get_localizer(language: str = "English") -> Localizer:  # Factory
    """Factory"""
    localizers: Dict[str, Type[Localizer]] = {
        "English": EnglishLocalizer,
        "Greek": GreekLocalizer,
    }
    return localizers[language]()
```
Q&A: lazy evaluation
* `object` vs `class` attributes
* `@property`
* issues arise when the computation is complicated/expensive
Lazy evaluation example:
```python
reader = CSVReader("path")
reader.path == "path"   # plain attribute, set eagerly in __init__
data = reader.load()    # normal method call: runs the load every time
reader.data             # lazy property: computed on first access, then cached
```
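The pseudocode above can be made concrete with `functools.cached_property`; the `Dataset` class and its contents here are made-up stand-ins, not the lecture's actual code:

```python
import functools


class Dataset:
    """Hypothetical sketch: lazy evaluation via a cached property."""

    def __init__(self, path: str) -> None:
        self.path = path  # cheap: just store the path

    @functools.cached_property
    def data(self) -> list:
        # expensive work runs only on first access; the result is cached
        print(f"loading {self.path}...")
        return [1, 2, 3]


ds = Dataset("example.csv")
first = ds.data    # triggers the load
second = ds.data   # served from the cache, no reload
```

The object is cheap to construct; the cost is paid only if (and when) `data` is actually used.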
```python
# variant: the handlers can be plain functions instead of classes
def get_localizer(language: str = "English") -> Localizer:
    """Factory"""
    localizers: Dict[str, Callable[[], Localizer]] = {
        "English": parse_english,
        "Greek": parse_greek,
    }
    return localizers[language]()
```
Live demo:
`abc` = abstract base class
```python
from abc import ABC, abstractmethod


class Localizer(ABC):
    # every downstream class must define this method
    @abstractmethod
    def parse(self):
        pass

    # every downstream class has this method "built-in"
    def example(self):
        print("wow a new method")


# No error, because the parse method is defined
class EnglishLocalizer(Localizer):
    def parse(self):
        print("hello")


# Error on instantiation, because the parse method is not defined
class GreekLocalizer(Localizer):
    pass


eng = EnglishLocalizer()
eng.parse()               # works (even though Localizer never implemented it)
greek = GreekLocalizer()  # fails: can't instantiate a class with an
                          # unimplemented abstract method
```
### Basics of Packaging Python Programs by Kyle Niemeyer
* How do you make your modules/functions installable?
#### Environments
* installing packages
* `pip install [package]`: installing in the global environment may cause package conflicts later. Solutions:
* virtual environments: `venv` is Python-native, lightweight, and fast, but you can end up with duplicate packages across environments; `conda env` is easy to use and caches packages, but can be slow.
* `venv`
* Note: Avoid large file
* create a virtual environment: `python3 -m venv <hidden_directory_name, ex: .venv>`
* list what's inside the virtual environment: `ls -a .venv/bin`
* activate the virtual environment: `. .venv/bin/activate`
* check where Python packages will be installed: `which python3`
* install the packages you want: `pip install numpy`
* deactivate: `deactivate`
* `conda`
* `conda config --set auto_activate_base false`
* `conda env create -n some_name`
* `conda activate some_name`
* `conda deactivate`
* `pipx` (for applications): creates a new virtual environment per package
* install a package into its own environment: `pipx install [package]`
* when you check the source, it will show up under `Users/.../.local/...`
* install `matplotlib` into the ipython environment: `pipx inject ipython matplotlib`
* system managers, e.g., `brew`
#### Python Packaging
* module: python file (.py) that contains definitions and statements
* package: collection of modules in the same directory
* the package directory must contain `__init__.py` for Python to "see" it; usually leave it blank to start
* Python to package
* example:
* `rescale` function from sofware carpentry
* package in six lines:
* `mkdir <package>`
* `cd <package>`
* `git init`
* `mkdir -p src/rescale tests docs` separate folders help keep the tests apart from the sources, etc.; ex: with separated folders, `pip install` won't install the test files
* `touch src/rescale/__init__.py src/rescale/rescale.py`
* `touch pyproject.toml` you may not have seen this often yet (it's pretty new); similar in role to `setup.py`; includes the info telling Python how to install the package.
* [build-system]
* [project], descriptive metadata, including name, version, etc.
* ... (see Metadata below)
* Install and import package
* from the parent folder (the same folder that contains the .toml file), `pip install -e .` installs in "editable" mode
* `import <package>` or `from <package> import <module/function>` import and call
* more complicated structure, e.g., the `compphys` package structure
* `constants.py`: the module that stores all the constants
* `assets`: a folder of text files (usually not recommended, but some projects need text files for constants, look-up tables, etc.) or orphan scripts (non-importable because there is no `__init__.py`)
* `__init__.py` is executed before any other module in the package is imported
* it only covers the current folder; each subdirectory needs its own `__init__.py`
* use relative paths, ex: `from . import constants`, `from .. import constants`
Note: Don't reinvent the wheel unless you can do it better.
* Other files that belong with your package
* description and other info
* terms and conditions
* track of changes
* Create a README
* `touch README.md`
* must have:
* name of the package
* brief description of what the package does/provides
* install instructions
* a **brief** usage example
* software license (with more info in a separate LICENSE file)
* may contain (or these can be separate files):
* badges near the top to show key info, such as the version, and the tests are currently passing
* info about how ppl can contribute to the package
* a code of conduct for how people interact around your project
* contact info for authors and/or maintainers
* Create a software license
* `touch LICENSE`
* commonly used: BSD 3-clause license (copy and paste, but remember to change the year and the name)
* can be redistributed, but credit must be given to the origin
* ! DON'T CHANGE THE LANGUAGE (it's lawyer-approved wording)
* Keep a changelog
* particularly for the major changes between different versions
* what are you adding, fixing, changing?
* format can be based on the [keep a changelog](http://keepachangelog.com)
* the git log can be too detailed, ex: one function change can span multiple commits
* Metadata: `pyproject.toml` is like a machine-readable `README`
* descriptive
* name
* version
* description
* readme: path to `README.md`
* keywords
* authors and maintainer (note: in dictionary format)
* urls
* classifiers
* functional
* requires-python: which version of python works
* dependencies: what packages and versions you used
* [project.optional-dependencies]
* `test = ["pytest>=6"]`
* check
* plot
* [project.scripts]
* `project-cli = "project.__main__:main"`
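Putting the pieces above together, a minimal `pyproject.toml` might look like the following sketch; the package name, versions, and choice of build backend are illustrative assumptions, not the workshop's exact file:

```toml
[build-system]
requires = ["setuptools>=61"]
build-backend = "setuptools.build_meta"

[project]
name = "rescale"
version = "0.1.0"
description = "Rescale arrays onto the interval [0, 1]"
readme = "README.md"
requires-python = ">=3.9"
license = {file = "LICENSE"}
authors = [{name = "Your Name", email = "you@example.com"}]
dependencies = ["numpy"]

[project.optional-dependencies]
test = ["pytest>=6"]
```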
### Collaboration with Git/GitHub/Workflows by Madicken Munk
#### Git Commands
* Individual (without a remote)
* `git init` create git repository, and created the "master" branch
* `git branch -M main` change branch name from "master" to "main" (branch name is already main in newer versions of git)
* `git add README.md` add README file to the version control
* `git commit -m 'add readme with code description....'` Add log
* note: check your version with `git --version`; behavior differs slightly across versions
* note: for an existing project, after `git init`, don't run `git add *`; add files separately instead, to avoid adding unwanted files (e.g., keys, compiled files, or large static files).
* `git status` shows what git is tracking
* Individual (with a remote)
* `git log --oneline` see the history
* Go to github and create a new repository under your account or your organization's account
* if you already have a local README file, make sure you don't check 'create README file' when creating the new repository.
* if the folder already exists locally, there's no need to download the repo from GitHub
* `git remote add origin <repo url>`
* `git fetch`
* `git remote -v` list the configured remotes and their URLs
* `git branch -M main`
* `git push -u origin main` upload the main branch to the remote branch, origin.
* `-u`: set upstream. This links your local main to origin/main, so from now on you can just run `git push`.
* The upstream branch needs to exist in the remote repo.
#### Collaborative Workflow
* Centralized Workflow (all members work on the same repo)
* `git pull` before you start your work, everyone commit to main.
* `git push` if there is no merge conflict, otherwise, fix the conflict and push again.
* cons: people change the same files and cause conflicts.
* Feature Branching workflow (members work on the same repo but use individual branches to do feature development)
* The contributors need to have the merge right.
* `git checkout -b <new branch name>` checks out the named branch; if it doesn't exist, creates a new branch (a pointer at the current point of history)
* `git branch` see what branch you're on now
* diagram to learn git branching: https://learngitbranching.js.org
* `git push -u <remote name, default is origin> <branch name>`
* go to Github and ask for pull request, and add title and description for your new branch
* note: when writing the title, make it clear so the PR is easier to review.
* note: you can have a development branch as the default, so it won't disrupt the users
* pull request
* fosters code review, particularly if you're using GitHub.
* need to add permission for merge rights
* NEVER send a pull request from master/main
* main branch should be stable, you don't want to mess it up
* NEVER send a large pull request without notice
* being mindful of progress plan within the team
* Forking workflow (each collaborator "forks" a copy of the centralized repo, then the collaborators submit pull requests from their forks to the centralized repo)
* none of the contributors directly interacts with the centralized repo. This keeps the centralized repo stable (it doesn't matter how messy a contributor's origin is. lol)
* click the fork on the Github repo
* copy the repo URL under your account
* `git clone <URL>`
* `cd <directory>`
* `git remote -v` check the remote
* `git remote add upstream <URL of the centralized repo>` add the remote of the centralized repo
* `git checkout -b <new branch name>`
* change/edit files you want
* `git add <file>`
* `git commit -m 'xxxxxx'`
* `git commit --amend` fix your commit message
* `git push origin <branch name>` push to the origin (under your own account)
* then if you go to the centralized repo, you can see the changes of the fork, and you can ask for pull request with the title and description.
* The admin of the repo can accept or reject the request and merge.
* Then you can delete the fork on Github under the pull request page
* Since the local copy doesn't have the new commits on the upstream (centralized repo), before deleting the local branch you'll need to pull from upstream so local and upstream match: `git pull upstream main` (sync the local directory from upstream main)
* `git branch -d <branch name>` delete the branch
* note: `git diff` to see what's changed
#### Other tips/tricks
* https://learngitbranching.js.org/
* Visualize git branching, merging, and more
* Click on 'try this special link' in pop-up intro to go to an unprompted interface
* tagging commit
* `git tag -a <tagname>` ex: `git tag v0.1.0 C6`
* the tagged version will stay (and archived) even after other commits/changes
* tags are usually used for version numbers, or "paper" (the code use for the paper)
* `git log --oneline` see all the commits with their id/number
* `git cherry-pick <hash commit/commit id>` choosing a commit from one branch and applying it to another.
* `git revert --continue` creates a new commit that is the opposite of an existing commit.
* `git restore <file>` restoring files in the working tree from either the index or another commit.
* `git reset` updates your branch, moving the tip in order to add or remove commits from the branch. This operation changes the commit history; it can also be used to restore the index, like `git restore`.
* commit and push often (i.e., everyday)!
* `git commit --amend` if you want to edit typo or so in the commit messages
* `git stash` when you make changes on the wrong branch, this can be used to hide the changes in your working tree and bring them back
* `git status` checks the branch status
* `git stash pop` throws away the stash after applying it
* `git rebase` changes the base of the developer's branch from one commit to another; usually used if you care a lot about a clean history
### Testing, Formatting, Linting, Type-Checking, and Continuous Integration
#### Testing
* consists of writing small functions that test:
* code functions
* integration
* edge cases
* input and output
* need to add an extra line (`test = ["pytest>=<version no.>"]`) in the toml file
* add the test functions
* `assert_allclose`
* helps when comparing floating-point values
* test the known (a given input should return the expected output)
* test bad values
* preload the data/results
* `pytest <directory>`
* it will discover and count the tests in your package/code
* `pytest <directory> -v` gives more detail of each test
* pytest is looking for the word "test", so be aware of your function naming
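As a sketch of these ideas, here is a small test file in the shape pytest expects; the `rescale` function and test names are illustrative stand-ins, not the exact workshop code:

```python
# test_rescale.py -- pytest collects functions whose names start with "test"
import numpy as np
from numpy.testing import assert_allclose


def rescale(values):
    """Map values linearly onto [0, 1] (stand-in for the package's function)."""
    arr = np.asarray(values, dtype=float)
    return (arr - arr.min()) / (arr.max() - arr.min())


def test_known_values():
    # test the known: a fixed input should return the expected output
    assert_allclose(rescale([1.0, 2.0, 3.0]), [0.0, 0.5, 1.0])


def test_endpoints():
    # floats rarely compare exactly equal, so use a tolerance-based check
    result = rescale(np.linspace(10.0, 20.0, 5))
    assert_allclose(result[0], 0.0)
    assert_allclose(result[-1], 1.0)
```

Running `pytest` in the project root discovers and runs both functions automatically.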
* how to run testing
* `pip install -e '.[lint,test]'` `pre-commit` will be installed as a dependency here
* `pytest tests/`
* `pre-commit run --all-files`
* running the demo
`git clone git@github.com:evamaxfield/winter-school-lectures.git`
`git fetch`
`git switch just-testing`
`python3 -m venv .venv`
`source .venv/bin/activate`
`pip install -e '.[test]'`
`pytest`
#### Formatting
* making the format consistent, such as spacing, commas, etc.
* note: how to use pre-commit: https://stackoverflow.com/collectives/articles/71270196/how-to-use-pre-commit-to-automatically-correct-commits-and-merge-requests-with-g
* running formatting, linting, etc.: `pre-commit run --all-files` (needs the `.pre-commit-config.yaml` file)
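A minimal `.pre-commit-config.yaml` might look like the following; the specific hooks and pinned versions here are illustrative assumptions, not the workshop's exact config:

```yaml
# .pre-commit-config.yaml -- hook versions are assumptions; pin your own
repos:
  - repo: https://github.com/psf/black
    rev: 24.1.1
    hooks:
      - id: black          # formatter
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.1.14
    hooks:
      - id: ruff           # linter
```

With this file in place, `pre-commit run --all-files` runs every hook over the repo.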
#### Linting
* fixes duplicated package imports, alerts you to unused variables left over from debugging, reminds you about function and module doc standards, and helps you fix code that is doing extra work.
#### Type Checking
* one of the first things to add in a new project
* pros: catches errors early and improves tooling support; cons: can be annoying
* numpy and pandas already ship type annotations for you
* mypy documentation: https://mypy.readthedocs.io/en/stable/
* standard library docs: https://docs.python.org/3.11/library/typing.html
* typing cheat sheet from mypy: https://mypy.readthedocs.io/en/stable/cheat_sheet_py3.html
* mypy doesn't check code that has no type annotations
* tips: use dataclass instead of dictionary
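The dataclass tip can be sketched like this (all names here are made up for illustration):

```python
from dataclasses import dataclass

# With a plain dict, mypy cannot check keys or value types:
#   config = {"name": "rescale", "retries": 3}
#   config["retires"]   # typo: passes type checking, fails at runtime


@dataclass
class Config:
    name: str
    retries: int = 3


config = Config(name="rescale")
# config.retires  -> mypy (and runtime) error: no attribute "retires"
```

The dataclass gives mypy (and your editor) the field names and types, so typos and type mismatches are caught before the code runs.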
#### Continuous Integration (CI)
* reduce or remove bugs entering code over time
* CI systems (GitHub Actions, GitLab CI/CD, Azure Pipelines, etc.)
* GitHub Actions
* defined in `ci.yml`
* ex: update to the latest versions of the packages and run on a schedule
* if everything runs successfully, you can add a `publish` job to `ci.yml` to publish on PyPI
#### Credit
https://pydev-guide.github.io/quickstart/
### Peer Code Review by Madicken Munk
What is code review?
* peer review the code to help the software development
* discover bugs
* maintain compliance
* ...
Different perspectives:
* maintainer: keeps things sustainably running in the project.
* contributor: a person submitting a PR, issues, or feedback to a project.
* user: people who use the code but don't contribute upstream.
#### Summarize code review best practices
1. Read the PR description: the purpose of the code
2. Read the documentation submitted with the code. Do I understand how to use it now?
3. Look through the code: follow the logic?
4. Check the tests: have tests been added for any new features? Do they help understand the code in any way?
5. Consider the users: will it impact users? ex: is the API changing from what already exists?
6. Look through the code: suggest constructive improvement
7. Try to use the feature: particularly for a new feature. Do I run into any errors when I use it?
Others
* is this code you are comfortable moving forward with?
* adding dependencies
* if you notice something you'd like changed that isn't directly related to the PR
Dos
* thank the contributor for their contribution
* state what things you particularly like about what was submitted
* be positive about the possibilities
* ask questions about things you're confused about
* if doing multiple rounds of review, try not to let things stagnate
* use suggestions to show your thought process in code changes
Don'ts
* Nitpicking
* Don't reject code for stylistic or small changes
Tips
* Clear project description in your README
* If you know the contributor, setting up a Zoom call can be helpful
Steps as a developer/contributor
1. Read through reviewer's comments
2. Answer reviewer questions, if there is any.
3. Incorporate any code suggestions
4. ...
#### Demonstrate code review on a pull request (PR)
* If you don't want the pull request reviewed yet, but just want to show you're working on it, you can make it a draft pull request
1. Compare to see the changes
2. Start review and leave suggestions/comment to the changes
* make a suggestion by putting this in the comment
\```suggestion
xxx
xxx
\```
3. Request changes (GitHub won't allow a merge while changes are still requested)
### Remote Development on HPCs
* Set up SSH Keys
* `ssh-keygen -t ed25519`
* Install miniconda (optional)
* Activate conda; you'll see that `python` now links to the Python under miniconda. You can see the path by typing `which python`
* Then you can start using `pip` to install packages
* (on some HPCs) You can use `module avail` to see what compilers/software are available on your account
* Use the SLURM scheduler, or get an interactive shell with `srun -N <no. of nodes> --pty bash`
* Use `lscpu` to see cpu's info
### Documentation and Versioning
#### Documentation
Types of Documentation
* Theory manual
* User and developer guides
* README
* LICENSE
* INSTALL
* CITATION
* ABOUT
* CHANGELOG
* Code comments
* both for you and for developers (so, someone who knows the code)
* people can read code; don't add unhelpful comments/unnecessary cruft. Explain what you're doing through the code itself
* useful comments
* name things
* the reasoning behind the code
* Self-documenting code
* naming: a class, variable or function name should tell you why it exists, what it does, and how it is used
* underscores
* internal to a module: single leading underscore
* avoids conflicts with Python keywords/builtins: single trailing underscore
* magic ("dunder") methods: double underscores
* functions/methods: better to use verbs
* human-readable phrase
* Booleans: "is_something"
* be consistent
* simple functions: functions should be small enough to be understandable and testable; they should only do one thing.
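A quick sketch of the naming conventions above (all names invented for illustration):

```python
# Illustrative naming conventions -- every name here is made up

_cache = {}                # single leading underscore: internal to the module


def normalize(input_):     # single trailing underscore: avoids shadowing `input`
    return str(input_).strip().lower()


class Sample:
    def __repr__(self):    # double underscores: "magic" (dunder) method
        return "Sample()"


def is_empty(text):        # Boolean reads as a yes/no phrase
    return len(text) == 0
```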
* docstrings
* this documentation is shown when people use `help()` or `?`
* should be descriptive and concise
* explain the arguments and the function's behavior
* how you intend it to be used (an example)
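A minimal numpy-style docstring might look like the following; the function itself is an illustrative stand-in:

```python
import numpy as np


def rescale(values, low=0.0, high=1.0):
    """Rescale values linearly onto the interval [low, high].

    Parameters
    ----------
    values : array_like
        Input numbers to rescale.
    low, high : float, optional
        Bounds of the output interval (defaults 0.0 and 1.0).

    Returns
    -------
    numpy.ndarray
        Rescaled copy of ``values``.
    """
    arr = np.asarray(values, dtype=float)
    unit = (arr - arr.min()) / (arr.max() - arr.min())
    return low + unit * (high - low)
```

`help(rescale)` (or `rescale?` in IPython) will display this text, and Sphinx's napoleon extension can render it into the HTML docs.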
* Sphinx: automates generating HTML documentation
* it looks for the docstrings, then creates the docs
* numpy-style docstrings
* google-style docstrings
* publishing with GitHub Pages and Actions, with steps:
1. `touch sphinx.yml` and edit it
2. push the workflow file
3. turn on GitHub Pages
* Generated API document
* `git add docs .github/workflows/sphinx.yml`
* `git commit -m 'adds sphinx action'`
* `git push`
* On GitHub -> Settings -> Pages -> Branch: gh-pages + /(root), then Save
* Wait a few minutes, then see https://[username].github.io/test-docs/
#### Sphinx
Complete working version: https://github.com/kyleniemeyer/test-docs
* Get sphinx running locally
1. `pip install sphinx`
2. Run `sphinx-quickstart docs` on the command line
3. Add extensions to the `conf.py` file (around line 17), and edit the file to the configuration you like.
* example: 'sphinx.ext.autodoc' and 'sphinx.ext.napoleon'
4. Create individual .rst files for each module.
* Name of file is MODULENAME.rst
* Locate in source directory (or same spot as index.rst)
* include `.. automodule:: package.module` in each .rst file
5. Run `make html` on command line
* local sphinx to webpage
1. Create folder in package main directory `mkdir -p .github/workflows`
2. Inside `.github/workflows`, create `sphinx.yml`
* name of yml is not important
3. Add content to .yml file [from workshop slides](https://kyleniemeyer.github.io/research-software-dev-modules/module-documentation/#/5/8)
4. Push .yml and docs/ from sphinx to Github
5. Ensure your Github branch/settings matches your sphinx.yml file
* If using default .yml file from Step 3, create new branch on Github called "gh-pages"
6. Look at webpage
#### Version number
* Three commonly used schemes
* SemVer: Semantic Versioning (MAJOR.MINOR.PATCH):
* MAJOR: when you make incompatible API changes,
* MINOR: when you add functionality in a backwards-compatible manner, and
* PATCH: when you make backwards-compatible bug fixes.
* ZeroVer
* CalVer: Calendar based versioning
### Open Science & Software Citation
#### Copyright
* Right of First Publication: copyright automatically goes to the first creator of any creative work
* Copyrightable
* Facts and ideas are not copyrightable.
* Expressions of ideas are copyrightable
* ex: game rules are not copyrightable (they are ideas), but specific creations based on a game rule are copyrightable
* ex: a function signature, a choice of name, etc. are not copyrightable, but the `std()` code that actually computes the standard deviation is copyrightable
* If you don't give a license, others have no permission to use the code, so it doesn't count as open-source
* ex: data is not copyrightable, since it is "facts"
* so some people avoid using unlicensed software in their work.
* Public Domain - Give up your copyright, others can do whatever without crediting you
* just need to write: "This work has been placed in the public domain."
* Software licenses
* proprietary
* free/open source (FOSS, FLOSS, OSS)
* permissive: allow further distribution under any license e.g., BSD 3-clause, MIT
* copyleft (more restrictive): requires modifications to be shared under the same license ("viral"), ex: GPL. Some companies avoid copyleft code entirely. (However, if you only use a GPL package as a dependency, without copying its code into yours, you should be okay)
* ==pick an existing license, and copy and paste on your `LICENSE.txt `in your repo. Don't create your own==
* beyond copyright & licenses
* Patents: cover ideas and concepts; modern issue with "patent trolls"
* Trademarks: symbols that represent a business or org.
* Export control: gov. may forbid the transfer of source code (and data, ideas) to another country or foreign national without permission
* if you're working under export control, it's better not to use GitHub, because access is hard to control unless the repo is private; you get more control with GitLab.
* HIPAA compliance: software that deals with human patient data.
* Note: when your package gets more popular, it's better to double-check all your dependencies' licenses.
* Different versions can have different licenses (ex: open-source at the beginning, but sold for the next version)
* Note: the copyright of a pre-print version (before the publisher starts their copyright and printing processes) belongs to the authors
* resources
* https://choosealicense.com
#### Citation
* add a CITATION.cff file
```
cff-version: 1.2.0
message: "Please cite the following works when using this software."
type: software
title: Title
abstract: Title
authors:
  - family-names: ...
    given-names: ...
    orcid: ...
    affiliation: ...
doi: ...
repository-code: ...
url: ...
keywords: ...
license: ...
```
* For research, need one more step: **archiving** software (or data)
* because the software is not static; different versions can give different behavior/results
* retrieve a DOI by using: Zenodo (free, hosted by CERN) or figshare
* Zenodo and GitHub have great integration
* go to zenodo.org and sign up/sign in
* go to settings, and link your GitHub account
* you can choose specific repos and ask zenodo to track
* `git tag -a v0.0.1 -m "version 0.0.1"`
* `git push origin --tags`
* Go to GitHub repo -> Releases -> v0.0.1
* Create release from tag
* Grab the DOI from Zenodo!
* How to cite:
* Name/Description
* Author(s)/Developer(s)
* DOI or other persistent identifier
* Version number/commit hash
* Location (e.g., GitHub repo)
* (If there’s a paper describing it, cite that too)
* Reproducibility: Repro-packs
* you get to use your own figures even after the paper is published
* Kyle's practice:
* produce a single repro-pack for an entire paper, which contains:
* Python plotting scripts and associated results data
* Figures (PDFs for plots, always; not resolution-dependent)
* Any other relevant data: input files, configuration files, etc.
* Upload to Figshare/Zenodo under CC-BY license
* Cite using the resulting DOI in the associated paper(s)
* Benefits
* improve reproducibility and impact of your work
* reuse your figures without violating the journal copyright.
* How to cite/mention
* Appendix A. Availability of material ...
* Cite the software/dataset with the specific version's doi.
* Journal of Open Sources Software (JOSS)
* not specific to Python (but reviewer needs to be able to run it)
* developer-friendly journal for research software packages
* open access and no fees
* submit software together with short Markdown paper
* the reviewer reviews on GitHub
* No rejections as long as the repo is in good shape and of reasonable size (not too small: roughly at least 1000 lines in total, and executable); novelty is not a criterion, etc.
* Note: also consider submitting a talk/paper to SCIPY! (good to advertise your software and for the future career)