
# 2025 URSSI Summer School at University of Alaska, Fairbanks

**Agenda and details:** https://github.com/si2-urssi/summerschool-Aug2025

Zulip space: https://urssi-softwareschool.zulipchat.com

## Research Software Lessons

### Collaborating with git + github

Brigitta made a nice [tutorial](https://bsipocz.github.io/URSSI_2025Aug_first_PR/prs-and-reviewing) for this.

* Remember about the pink sticky note if you need help!
* Do the exercises on this page (https://learngitbranching.js.org/) as a review of git once you get back.
* Get a GitHub repository to your local machine
    * Create a **fork** in your own github environment. Otherwise, you may run into trouble uploading your changes later.
    * Run `git clone [repo-url]` in a local folder to download the repository.
        * Recommendation: Use the `-o [git-username]` option with `git clone [repo-url]` to give the remote a name other than the default `origin`. That prevents later confusion when the git project exists in different locations, e.g., when multiple people work on several forks.
    * Add another repo as a remote with `git remote add [name] [repo-url]`
    * If you're following Brigitta, her equivalent of `upstream` is called `bsipocz`, and `origin` is `bsipocz-2nd`.
* You may see `git switch` in documentation to change branches. (think of this as a way to get *all* of the files from a particular commit)
* `git restore` is the new term to check out individual files from a commit. (think of this as a way to get versions of specific files)
* `git checkout` does both of these, and is the older "classic" language but all of your instructors are used to it. :slightly_smiling_face:
* All branches exist three times within git: (1) on the remote location, (2) in your local cache, and (3) in a local branch.
    * You edit the local copy on your local machine
    * You update the cached copy from remote using `git fetch`. That lets you inspect the incoming changes before they touch your local branch, i.e., what would happen to your code if you were to `git pull`.
    * You update the local copy from remote using `git pull`
    * You update the remote copy from local using `git push`
* A forking workflow is recommended because you can *always* work this way and have your fork be as messy as you'd like. Even if you have write permission to make branches without forking, branching directly in the shared repo can be harder for collaboration.
* Steps to turn a direct clone of a repo into a fork setup
    * Yes, this is possible! If you're using `origin` and `upstream` as your remote names:
        * `git remote rename [old name] [new name]` (if you cloned your lab repo using the default name, this would be `git remote rename origin upstream`)
        * `git remote add origin [address to your fork]`
    * You can also choose to *remove* a remote with `git remote rm [name]`, which is nice if you also want to drop the cached branches of that remote for whatever reason. In that case you would do:
        * `git remote rm origin`
        * `git remote add upstream [address to lab repo]`
        * `git remote add origin [address to your fork]`

Tips
* Make commits small and thematically separate
* In the commit message, say what you changed, not which file you changed. People can see the list of files changed in the commit.
* Make the commit message short, but descriptive.
* You can prefix your commit message with things like "DOC:" to say what type of change the commit is, which makes it easier to search later.
* `git log --graph --oneline` is the incantation for showing the git branch history in the terminal.
* `git add -p` interactively adds parts of file changes to the commit (to keep commits small and thematic)

### Software Design + Packaging

#### project structure

* Classifiers:
    * Select from a standard list
    * https://pypi.org/classifiers/
* Tests:
    * Unit test
    * Workflow test (integration)
* Continuous integration (CI):
    * Github actions (`.github/workflows`): Run some workflows when a PR is submitted (e.g., install the package in its current state and run the tests, then report status, pass/fail, etc.)
    * readthedocs:
        * CI for docs.
          Builds the documentation from PRs; can be configured to show a preview when a PR is submitted.
* Other workflows:
    * issue_template: minimally reproducible issue for a bug report; functionality/feature request; motivating examples / use cases

#### packaging

* build backend:
    * setuptools
    * hatchling: supplies many defaults that would otherwise have to be configured explicitly with setuptools
    * what it is actually doing: look at `pyproject.toml`, pick the build system named there, which parses the toml to create the wheel (metadata), compiles, and copies over the license/readme. source --> binary
* wheel: built for users when distributing the package, can be part of CI.
    * vs docker
    * manylinux2014
        * glibc: on an OS, the C library is embedded in the OS itself and treated differently from other libraries. --> whichever C library version the OS ships is fixed and can't be changed (to update the C library, you need to update the OS).
        * minimal requirement for package installation. (newer OS will work fine)
        * --> pick the oldest version one would care about
* other build
    * Cython
        * use case 1: wrote code in python but some bits are slow (e.g., a complicated math operation that numpy lacks or that needs multiple np operations); drop to C and do a few for loops to optimise
            * end product code is more similar to python
        * use case 2: access a C library from python
* other features
    * version info may show up in many places like the toml and the docs. setuptools-scm (source control management) can detect that the project uses git and derive the version from it.
        * dirty tag: not all changes have been committed
    * license:
        * https://spdx.dev
            * Lists most licenses available.
        * https://choosealicense.com/
            * Lists many licenses available and gives a bit more in-depth info about them.
            * Good for those new to licenses.
    * install optional packages:
        * `pip install -e ".[examples,docs]"`

### Git commands, structuring python packages

* `$ git add -p` ← allows you to choose what parts of a file to add
* `$ git diff --cached` # Shows you what's going to go in your commit
* `$ git diff` # Shows differences between the working directory and the staging area
* `$ conda create -n example-pyproject python=3.13 pip` # Creates a new environment for the project
* `$ conda activate example-pyproject`
* `$ python -m pip` (makes sure pip runs with the python of your active environment) → use that to install your package (non-editable, though): `$ python -m pip install .`
* `$ pip list` # Shows that your pip-installed package is installed
* `$ conda install -n <project> python-build`
* `$ python -m build .`

### Code Review

* DO: thank them for their contribution
* Rebasing locally is what Brigitta does
* Don't be intimidated about contributing to a repo you don't know
    * Ask questions to the maintainer
    * Don't be afraid of things being broken
* Labels for each module (github)
* Unit tests and tests reaching out to apis that are variable
* Codecov report (a bot on github)

Tests
* `$ alias ghc` # a github command line tool
* Run tests locally?
* A PR is based on the branch, not the specific request: new commits pushed to the branch show up in the PR
* Can't merge part of a PR; that has to be a separate branch
* To make a new PR, make a new branch, and make commits to that
    * ex) a branch for a bug fix, a branch for docs, another branch for other unrelated bugs, a feature branch, etc. Good to keep documentation together with the related code changes
* Good practice to start new work from a new branch
    * 1) remember to fetch FIRST to get the most up-to-date info from the remote (e.g., github)
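
The "fetch first, then start a new branch" habit from the notes above can be sketched as a short shell session. This is only an illustration under assumptions: the remote name `upstream`, the branch names `main` and `fix-typo`, and the throwaway directories are hypothetical, and a local directory stands in for the GitHub repository so that the sketch runs offline. It uses `git switch -c`; `git checkout -b` is the classic equivalent.

```shell
set -eu

tmp="$(mktemp -d)"

# Helper: run git with a throwaway identity so commits work on any machine.
g() { git -c user.name="Example Dev" -c user.email="dev@example.org" "$@"; }

# 1. Stand-in for the shared project repo (normally on GitHub):
#    a local repo with a single commit on `main`.
git init --quiet "$tmp/project"
git -C "$tmp/project" symbolic-ref HEAD refs/heads/main
echo "hello" > "$tmp/project/README.md"
g -C "$tmp/project" add README.md
g -C "$tmp/project" commit --quiet -m "DOC: add README"

# 2. Your working repo, with the project registered as the remote `upstream`.
git init --quiet "$tmp/work"
git -C "$tmp/work" remote add upstream "$tmp/project"

# 3. Fetch FIRST: this only updates the cached copy (upstream/main).
git -C "$tmp/work" fetch --quiet upstream

# 4. Then start the new work on its own branch, based on upstream/main.
git -C "$tmp/work" switch --quiet -c fix-typo upstream/main

git -C "$tmp/work" branch --show-current   # prints: fix-typo
```

Each unrelated piece of work (a bug fix, a docs change, a feature) would repeat steps 3 and 4 with its own branch name, so that each PR stays small.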
Testing the documentation too
* A PR doesn't close until after its branch is merged
* On Github, "closed" (also "closes" or "fixes") followed by an issue number is a magic term in a PR description: merging the PR closes that issue automatically
* You can edit pull requests
* Commits, Files changed, and status are the important views under Pull Requests

github > Files changed > Review changes > generally click "Start a review", or else each comment is sent as its own email

Click the "+" button next to a line (you can select multiple lines), and then click the "Preview" button to see the selected lines shown in that one comment

`$ git grep` # searches the tracked files of the repo for a pattern

Simplify the api earlier on to save time later

### Testing and Linting

Where is testing used?
* Integration: testing a module
* Functional: beyond integration, the collection of code did something that it's designed to do
* System: work flow
* End-to-end: + environment
* blackbox: I don't care what the software does internally, but I know what it's supposed to do, so I test whether it does that
* whitebox: I know everything internal about the software

Testing Frameworks
* unittest: older style, built-in
* doctest: built-in; point it at a module and it runs the examples in the docstrings as tests
* nose: (less maintained, not recommended, but has been used)
* pytest: most popular
* hypothesis: randomised (property-based) testing, e.g., generate a lot of floats to test
* Python testing layout
    * testing frameworks know to look for files that begin with `test_`

In packaging
* you can view the content of the wheel file by changing the extension from `.whl` to `.zip` and then unzipping it. It will contain the files in the package (e.g., `triangle.py`) but it WON'T contain `tests/`.

### Documenting and Versioning

What is documentation
- documenting what the code does
- API reference
- User guide
- Examples
- Installation guide
- Docstrings
- Attribution Licenses
- Link to outside references and help
- Edits, or commit history, or PRs
- Troubleshooting / FAQs
- Bug Report
- What is this software for
- Link to other works that use the library

Why?
- makes code accessible
- do it for your users
- do it for you
- BUILD A COMMUNITY!

Check out JupyterBook for displaying notebooks online

GitHub cheatsheet: https://education.github.com/git-cheat-sheet-education.pdf

## Open Science Practices

The FAIR Principle
- Findable, Accessible, Interoperable, & Reusable
    - Findable
        - Assigned unique and persistent identifiers.
        - Richly described by clear metadata.
        - Indexed in a searchable resource.
    - Accessible
        - Retrievable by their IDs using a standard protocol that:
            - is open, free, and universal.
            - allows for auth. where necessary.
        - A big part of this is the internet!
    - Interoperable
        - Uses a formal, accessible, broad language for sharing information.
        - Provided in a known, standard format.
        - Includes qualified references
    - Reusable
        - Richly described with accurate and relevant attributes.
        - Released under an accessible license.
        - Meets community standards.

Datanutrition.org is a good source for creating shorthand info.
- Idea is to create something like a nutrition label. Includes lots of key data in an extremely small and easily digested format.

Version Control
- the practice of tracking and managing changes to code or other types of files.
- can be useful in undoing mistakes.

### Ethos

### Open Science Tools and Resources

### Open Data

Metadata Tagging Best Practices
- Remember, the more metadata that you add, the easier it will be for users of your data to use it effectively.

When in doubt:
- Seek and comply with repository/community standards.
- Investigate open science online resources for metadata, e.g., Turing Way.
- Useful and informative metadata:
    - uses standards that are commonly used in your field.

Accompanying Documentation
- When creating your data, in addition to adding metadata, it is a best practice to create a document that users can refer to. The document can be a README, a user guide, or even a quick start (or all 3)

Data Versioning Guidelines
- Very field dependent.
- Versioning helps when you make things better, and sometimes you need to go back to an earlier version.
- Raw data will never be versioned (though it should be available), but new data processing should be versioned.

Types of Licenses
- really important to do.
- Types
    - Attribution (BY) - must credit the creator.
    - Non Commercial (NC) - work cannot be used for commercial means
    - [And many more that i missed!]

Important questions to ask:
- when and if to share data?
    - How might this data affect someone else? What unethical things can be done with the data?
    - Often comes down to who provides the funding.
- should the data be shared?
    - Determined by laws and funding agencies.
    - Verify your data is shareable!
        - Reasons it wouldn't be:
            - military secrets.
            - Private medical info.
            - Indigenous/cultural concerns.
    - Export and Security Considerations
        - may prevent the release of data.
    - Controlled Information Considerations (CIC)
        - e.g., Controlled Unclassified Information (CUI): info, like personal information, that should not be shared publicly. Can also apply to developing tech (e.g., specific materials that DoE battery tech is being made of)
        - DO NOT ASK FOR FORGIVENESS
            - "Forgiveness" when messing this up is demanding your resignation as opposed to sending you to prison.
    - Intellectual Property Considerations
        - Data that is subject to intellectual property, copyright, and licensing concerns.
        - Can be artwork, graphics, books.
        - Control over intellectual property is very different for staff versus faculty.
        - It is good to talk to your uni's copyright/intellectual property/patent people.
            - They can help route you to the most open method.
            - Institutions will usually want to make your methods and data as open as possible because it gets them more citations and makes them more relevant. They will sometimes want you to patent things so they can make money.
- Where to share data?
    - Lots of locations.
    - Sharing via email or websites is popular, but not recommended for long-term.
      These lack findability and/or long-term archival support.
    - Sharing data as part of a supplemental of a publication is acceptable in some fields.
    - A long-term repository that provides a permanent identifier is the best option for sharing data.
        - hot/spinning disk - more expensive tier of storage. Best for high volume read/write/access.
        - glacier/cold - least expensive, but reeeally slow.
- Ensuring accessibility
    - Good repositories will share (or offer) your open data through standard protocols, like HTTPS or SFTP.
    - Some things require logins, and not everybody can get a login.
    - Want people to be able to see files before actually having to download.
- Working with a repository
    - DON'T DO THIS ALONE!!!
    - ASK FOR HELP!!! Your library & IT people can help with this.
- How to enable reuse of data!
    - Obtain a DOI - digital object identifier.
        - Zenodo does this when storing data!
    - Make it easy to cite your data!
        - Include a citation statement!
            - Good to have this in multiple formats.
    - Sharing data is a group effort!

Types of sharing
- Advanced Sharing - immediately goes online. Sharing at the time of collection, or soon after.
- Intermediate sharing - you check that the data is good, then send it out. Usually at the time of publication.
- Minimum sharing - end of grant. What most funding agencies want, but it is rarely enforced.
- No sharing - not shared at all. Lots of good and valid (and not good) reasons to restrict data.

### Open Results