# GitHub Tutorial for Bioinformatics and Computational Biology
## 1. Why use GitHub?
GitHub is a platform for storing, version-controlling, sharing, and documenting code.
For bioinformatics and computational biology, GitHub is especially useful for:
- Keeping track of analysis scripts
- Recording changes to code over time
- Sharing reproducible workflows
- Collaborating with lab members
- Documenting software, pipelines, and tutorials
- Building a visible research portfolio
In short:
> Git tracks your code history.
> GitHub lets you store and share that history online.
---
## 2. Git vs GitHub
These two are related but not the same.
| Tool | What it does |
|---|---|
| Git | A version control system on your computer |
| GitHub | An online platform that hosts Git repositories |
You use **Git** locally, then push your work to **GitHub**.
---
## 3. Basic vocabulary
| Term | Meaning |
|---|---|
| Repository / repo | A project folder tracked by Git |
| Commit | A saved snapshot of your changes |
| Branch | An independent line of development within a repo |
| Push | Upload local commits to GitHub |
| Pull | Download updates from GitHub and merge them into your local branch |
| Clone | Download a GitHub repo to your computer |
| README | The front page documentation of a repo |
| `.gitignore` | A file telling Git what not to track |
---
## 4. Install Git
Check whether Git is already installed:
```bash
git --version
```
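If the command is not found, Git still needs to be installed. The sketch below checks for `git` and, if it is missing, prints install commands for a few common package managers. Which command applies depends on your system, and `git_status` is only an illustrative variable name.

```bash
# Check whether the git command is available on this machine.
if command -v git >/dev/null 2>&1; then
    git_status="installed"
    git --version
else
    git_status="missing"
    echo "Git not found. Pick the install command that matches your system:"
    echo "  Debian/Ubuntu:    sudo apt install git"
    echo "  macOS (Homebrew): brew install git"
    echo "  Conda:            conda install -c conda-forge git"
fi
echo "Git is $git_status"
```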
---
## 5. Set up your Git identity
You only need to do this once per computer.
```bash
git config --global user.name "Your Name"
git config --global user.email "your_email@example.com"
```
Check your settings:
```bash
git config --global --list
```
---
## 6. Create a new project folder
Example:
```bash
mkdir my_scRNAseq_project
cd my_scRNAseq_project
```
Initialize Git:
```bash
git init
```
Now this folder is a Git repository.
---
## 7. Recommended project structure for bioinformatics
A simple structure:
```text
my_scRNAseq_project/
├── README.md
├── scripts/
│ ├── 01_qc.py
│ ├── 02_clustering.py
│ └── 03_marker_analysis.py
├── notebooks/
│ └── exploratory_analysis.ipynb
├── results/
│ ├── figures/
│ └── tables/
├── data/
│ └── README.md
├── env/
│ └── environment.yml
├── docs/
│ └── notes.md
└── .gitignore
```
Suggested usage:
| Folder | Purpose |
|---|---|
| `scripts/` | Main reusable analysis scripts |
| `notebooks/` | Exploratory work |
| `results/` | Figures and result tables |
| `data/` | Data descriptions and download instructions; usually not the raw data itself |
| `env/` | Conda or pip environment files |
| `docs/` | Notes, protocols, and methods explanations |
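The layout above can be created in one go. This is a minimal sketch that assumes you run it from the parent directory of the project; the `project` variable and the empty placeholder files are illustrative, and `mkdir -p` leaves existing folders untouched.

```bash
# Name of the project folder (same example name as above)
project="my_scRNAseq_project"

# Create every folder in the recommended layout; -p also creates parents
for dir in scripts notebooks results/figures results/tables data env docs; do
    mkdir -p "$project/$dir"
done

# Create empty placeholder files to fill in later
touch "$project/README.md" \
      "$project/data/README.md" \
      "$project/docs/notes.md" \
      "$project/.gitignore"

# Show what was created
find "$project" -not -path "*/.git/*" | sort
```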
---
## 8. What should not be uploaded?
In bioinformatics, avoid uploading:
- Raw sequencing data
- Large `.fastq`, `.bam`, `.h5ad`, `.rds`, `.loom`, `.h5` files
- Private patient or clinical data
- Passwords or API keys
- Huge intermediate files
> GitHub officially recommends keeping repositories under 1 GB, and strongly recommends staying under 5 GB for performance and maintainability. Individual files over 100 MB are blocked entirely.
Instead, upload:
- Scripts
- Small example data
- Metadata templates
- Documentation
- Environment files
- Instructions for downloading or generating data
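Before committing, it can help to scan the project for files that are too large for GitHub. The helper below is a sketch: `find_large_files` is a hypothetical function name, and the 50M threshold is an assumed, conservative cutoff chosen to stay well below the 100 MB per-file limit.

```bash
# List files under a directory that are larger than a given size,
# skipping Git's own internal .git folder.
#   $1: directory to scan
#   $2: size threshold in find syntax (for example 50M)
find_large_files() {
    find "$1" -type f -size +"$2" ! -path "*/.git/*"
}

# Scan the current project for files larger than 50 MB
find_large_files . 50M
```

Anything this prints is a candidate for `.gitignore` or for external storage.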
---
## 9. Create a `.gitignore`
Because these files should stay out of the repository, it is important to set up a `.gitignore`.
A `.gitignore` file tells Git which files to ignore.
Example `.gitignore` for bioinformatics:
```gitignore
# Large data files
*.fastq
*.fastq.gz
*.fq
*.fq.gz
*.bam
*.sam
*.cram
*.vcf
*.vcf.gz
*.h5
*.h5ad
*.rds
*.loom
# Large folders
data/raw/
data/processed/
results/intermediate/
# Python
__pycache__/
*.pyc
.ipynb_checkpoints/
# R
.Rhistory
.RData
.Rproj.user/
# Local Conda and virtual environments
.conda/
.venv/
# macOS system files
.DS_Store
# Secrets
*.key
*.pem
.env
config_private.yaml
```
Create the file:
```bash
touch .gitignore
```
Then edit it using your text editor.
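To confirm that a pattern works as intended, `git check-ignore` reports whether a path is ignored and which rule is responsible. The sketch below uses a throwaway repository and made-up file names so that it is self-contained; in your own project you would run only the `git check-ignore` lines.

```bash
# Build a scratch repository with a two-line .gitignore
demo_repo=$(mktemp -d)
git init -q "$demo_repo"
cd "$demo_repo" || exit 1
printf '%s\n' '*.bam' 'data/raw/' > .gitignore
mkdir -p data/raw scripts
touch sample.bam data/raw/reads.fastq scripts/01_qc.py

# -v prints the .gitignore line that causes each path to be ignored
git check-ignore -v sample.bam data/raw/reads.fastq

# -q prints nothing; the exit status is 0 only if the path is ignored
if git check-ignore -q scripts/01_qc.py; then
    echo "scripts/01_qc.py is ignored"
else
    echo "scripts/01_qc.py is not ignored"
fi
```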
---
## 10. Check project status
To see what has changed:
```bash
git status
```
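This tells you which files are:
- Untracked (new files Git is not yet tracking)
- Modified (tracked files changed since the last commit)
- Staged (changes added with `git add` and ready to commit)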
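For a compact view of the same information, `git status --short` prints one line per file with a two-character code. The sketch below builds a scratch repository with made-up file names to show the states side by side; the identity is set only for that scratch repository.

```bash
demo_repo=$(mktemp -d)
git init -q "$demo_repo"
cd "$demo_repo" || exit 1
git config user.name "Demo User"
git config user.email "demo@example.com"

# One committed file that is then edited: modified, not staged (" M")
echo "v1" > tracked.txt
git add tracked.txt
git commit -q -m "Add tracked file"
echo "v2" >> tracked.txt

# One new file added to the staging area: staged ("A ")
echo "new" > staged.txt
git add staged.txt

# One new file Git has never seen: untracked ("??")
echo "tmp" > untracked.txt

short_status=$(git status --short)
echo "$short_status"
```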
---
## 11. Add files to Git
Add one file:
```bash
git add README.md
```
Add everything:
```bash
git add .
```
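Be careful with `git add .`: it stages every new, changed, or deleted file under the current directory that `.gitignore` does not exclude, which can include large data files.
Always check first: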
```bash
git status
```
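A dry run is another way to check. `git add -n` (short for `--dry-run`) lists what would be staged without staging anything. The sketch below uses a scratch repository with made-up files to keep it self-contained.

```bash
demo_repo=$(mktemp -d)
git init -q "$demo_repo"
cd "$demo_repo" || exit 1
echo "# Demo" > README.md
echo "*.fastq" > .gitignore
touch reads.fastq

# Preview what "git add ." would stage; the ignored FASTQ file is left out
preview=$(git add -n .)
echo "$preview"

# Nothing has actually been staged yet
git diff --cached --name-only
```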
---
## 12. Commit your changes
A commit is a saved checkpoint.
```bash
git commit -m "Initial project setup"
```
Good commit messages are short but informative.
Examples:
```bash
git commit -m "Add Scanpy QC script"
git commit -m "Update clustering workflow"
git commit -m "Fix marker gene plotting function"
git commit -m "Add conda environment file"
```
Avoid vague messages like:
```bash
git commit -m "update"
git commit -m "stuff"
git commit -m "final final version"
```
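Informative messages pay off when you read the history later. As a sketch, the commands below make two commits in a scratch repository and then list them with `git log --oneline`, which prints one line per commit, newest first. The file names and the identity are placeholders.

```bash
demo_repo=$(mktemp -d)
git init -q "$demo_repo"
cd "$demo_repo" || exit 1
# Identity is set only for this scratch repository
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "# Demo" > README.md
git add README.md
git commit -q -m "Initial project setup"

echo "print('qc')" > 01_qc.py
git add 01_qc.py
git commit -q -m "Add Scanpy QC script"

# One line per commit: abbreviated hash followed by the message
log_lines=$(git log --oneline)
echo "$log_lines"
```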
---
## 13. Create a GitHub repository
On GitHub:
1. Go to GitHub.
2. Click **New repository**.
3. Choose a repository name.
4. Choose public or private.
5. Do not initialize the repository with a README if you already have one locally; otherwise the first push will be rejected because the two histories differ.
6. Click **Create repository**.
Then connect your local repo to GitHub.
Example:
```bash
git remote add origin https://github.com/your_username/my_scRNAseq_project.git
```
Push your code:
```bash
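# Rename the current branch to "main"
git branch -M main
# Upload the commits and set origin/main as the upstream for later pushes and pulls
git push -u origin main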
```
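If a push fails, a mistyped remote URL is a common cause. The sketch below shows how to inspect the configured remote; it uses a scratch repository and the placeholder URL from above so that it runs without network access. In your own project, run only the `git remote` lines.

```bash
demo_repo=$(mktemp -d)
git init -q "$demo_repo"
cd "$demo_repo" || exit 1
git remote add origin https://github.com/your_username/my_scRNAseq_project.git

# List configured remotes with their fetch and push URLs
git remote -v

# Print only the URL stored for "origin"
origin_url=$(git remote get-url origin)
echo "$origin_url"
```

If the URL is wrong, `git remote set-url origin <new_url>` replaces it.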
---
## 14. Clone an existing repository
To download a GitHub repo:
```bash
git clone https://github.com/username/repository_name.git
```
Then enter the folder:
```bash
cd repository_name
```
---
## 15. Pull updates from GitHub
Before you start working, especially in a shared project, pull the latest changes:
```bash
git pull
```
This downloads the latest changes from GitHub and merges them into your current branch.
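To see what pulling does without touching a real project, the sketch below simulates two collaborators on one machine. A local bare repository stands in for GitHub, and the names `alice`, `bob`, and `remote.git` are purely illustrative.

```bash
work_dir=$(mktemp -d)
cd "$work_dir" || exit 1

# A bare repository plays the role of the GitHub remote
git init -q --bare remote.git
git -C remote.git symbolic-ref HEAD refs/heads/main

# First collaborator: create the initial commit and push it
# (stderr is silenced to hide the "empty repository" warning)
git clone -q remote.git alice 2>/dev/null
cd alice || exit 1
git config user.name "Alice Example"
git config user.email "alice@example.com"
echo "# Shared project" > README.md
git add README.md
git commit -q -m "Add README"
git branch -M main
git push -q -u origin main
cd ..

# Second collaborator: clone the project
git clone -q remote.git bob

# Alice pushes a new script after Bob has cloned
cd alice || exit 1
echo "print('qc')" > 01_qc.py
git add 01_qc.py
git commit -q -m "Add QC script"
git push -q
cd ..

# Bob pulls and receives Alice's new commit
cd bob || exit 1
git pull -q
ls
```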
---
## 16. A simple daily workflow
A common workflow:
```bash
git status
git pull
# edit scripts
git status
git add .
git commit -m "Describe what changed"
git push
```
For example:
```bash
git status
git add scripts/01_qc.py
git commit -m "Add mitochondrial filtering to QC script"
git push
```
---
## 17. Writing a useful README
Every bioinformatics repo should have a clear `README.md`.
A simple README structure:
```markdown
# Project title
## Overview
Briefly describe the project.
## Data
Describe the dataset used.
Do not upload private data, and avoid committing large raw data; describe where to download it instead.
## Environment
Explain how to install dependencies.
Example:
```bash
conda env create -f env/environment.yml
conda activate my_env
```