---
title: GitHub Tutorial for Bioinformatics and Computational Biology

---

# GitHub Tutorial for Bioinformatics and Computational Biology

## 1. Why use GitHub?

GitHub is a platform for storing, version-controlling, sharing, and documenting code.

For bioinformatics and computational biology, GitHub is especially useful for:

- Keeping track of analysis scripts
- Recording changes to code over time
- Sharing reproducible workflows
- Collaborating with lab members
- Documenting software, pipelines, and tutorials
- Building a visible research portfolio

In short:

> Git tracks your code history.  
> GitHub lets you store and share that history online.

---

## 2. Git vs GitHub

These two are related but not the same.

| Tool | What it does |
|---|---|
| Git | A version control system on your computer |
| GitHub | An online platform that hosts Git repositories |

You use **Git** locally, then push your work to **GitHub**.

---

## 3. Basic vocabulary

| Term | Meaning |
|---|---|
| Repository / repo | A project folder tracked by Git |
| Commit | A saved snapshot of your changes |
| Branch | An independent line of development within a repo |
| Push | Upload local commits to GitHub |
| Pull | Download updates from GitHub and merge them into your local branch |
| Clone | Download a GitHub repo to your computer |
| README | The front page documentation of a repo |
| `.gitignore` | A file telling Git what not to track |

---

## 4. Install Git

Check whether Git is already installed:

```bash
git --version
```
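
If this prints a version number, Git is installed. If the command is not found, Git has to be installed first. The snippet below is a minimal sketch of that check. The function name `check_git` is made up for this tutorial, and the install commands it prints are common choices on each platform, not the only options.

```bash
# check_git is a hypothetical helper name used only in this tutorial.
check_git() {
    if command -v git >/dev/null 2>&1; then
        # Git is on the PATH: report its version.
        echo "Git found: $(git --version)"
    else
        # Git is missing: print common install commands (assumed defaults).
        echo "Git not found. Common ways to install it:"
        echo "  Debian/Ubuntu:    sudo apt install git"
        echo "  macOS (Homebrew): brew install git"
        echo "  Conda:            conda install -c conda-forge git"
    fi
}

check_git
```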

---

## 5. Set up your Git identity

You only need to do this once per computer.

```bash
git config --global user.name "Your Name"
git config --global user.email "your_email@example.com"
```

Check your settings:

```bash
git config --global --list
```
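
The `--global` settings apply to every repository on the computer. On a shared machine, such as a lab workstation, it may be preferable to set the identity for one repository only by leaving out `--global`. The sketch below assumes a throwaway directory so that nothing else on the machine is touched.

```bash
# Work in a throwaway directory so no existing project is changed.
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q

# Without --global, the setting is stored in this repository's .git/config only.
git config user.name "Your Name"
git config user.email "your_email@example.com"

# Show the identity stored for this repository.
git config --local user.name
git config --local user.email
```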

---

## 6. Create a new project folder

Example:

```bash
mkdir my_scRNAseq_project
cd my_scRNAseq_project
```

Initialize Git:

```bash
git init
```

Now this folder is a Git repository.

---

## 7. Recommended project structure for bioinformatics

A simple structure:

```text
my_scRNAseq_project/
├── README.md
├── scripts/
│   ├── 01_qc.py
│   ├── 02_clustering.py
│   └── 03_marker_analysis.py
├── notebooks/
│   └── exploratory_analysis.ipynb
├── results/
│   ├── figures/
│   └── tables/
├── data/
│   └── README.md
├── env/
│   └── environment.yml
├── docs/
│   └── notes.md
└── .gitignore
```
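
The folders and placeholder files in the layout above can be created in one go. This is only a sketch: it creates the skeleton in the current directory and leaves the analysis scripts and notebook for you to add.

```bash
project=my_scRNAseq_project

# -p creates parent folders as needed and does not fail if they already exist.
mkdir -p "$project/scripts" "$project/notebooks" \
         "$project/results/figures" "$project/results/tables" \
         "$project/data" "$project/env" "$project/docs"

# Create empty placeholder files to fill in later.
touch "$project/README.md" "$project/data/README.md" \
      "$project/env/environment.yml" "$project/docs/notes.md" \
      "$project/.gitignore"

# List everything that was created.
find "$project" | sort
```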

Suggested usage:

| Folder | Purpose |
|---|---|
| `scripts/` | Main reusable analysis scripts |
| `notebooks/` | Exploratory work |
| `results/` | Figures and result tables |
| `data/` | Data documentation and download instructions; usually not the raw data itself |
| `env/` | Conda or pip environment files |
| `docs/` | Notes, protocols, and method explanations |

---

## 8. What should not be uploaded?

In bioinformatics, avoid uploading:

- Raw sequencing data
- Large `.fastq`, `.bam`, `.h5ad`, `.rds`, `.loom`, `.h5` files 
- Private patient or clinical data
- Passwords or API keys
- Huge intermediate files

> GitHub officially recommends keeping repositories under 1 GB, and strongly recommends staying under 5 GB for performance and maintainability. Individual files over 100 MB are blocked entirely.


Instead, upload:

- Scripts
- Small example data
- Metadata templates
- Documentation
- Environment files
- Instructions for downloading or generating data
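
Before the first commit, it can help to scan the project for files that are too large to upload. The sketch below wraps `find` in a helper; the name `list_large_files` and the 50M threshold are illustrative choices, and the `M` size suffix assumes GNU or BSD `find`.

```bash
# list_large_files is a hypothetical helper name.
# Usage: list_large_files [directory] [size], for example: list_large_files . 50M
list_large_files() {
    dir="${1:-.}"
    min_size="${2:-50M}"
    # Report regular files above the threshold, skipping Git's internal .git/ folder.
    find "$dir" -type f -size "+$min_size" ! -path '*/.git/*'
}

list_large_files . 50M
```

Anything this prints is a candidate for `.gitignore` or for storage outside the repository.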



---

## 9. Create a `.gitignore`

To keep these files out of your repository, set up a `.gitignore` before your first commit.

A `.gitignore` file tells Git which files to ignore.

Example `.gitignore` for bioinformatics:

```gitignore
# Large data files
*.fastq
*.fastq.gz
*.fq
*.fq.gz
*.bam
*.sam
*.cram
*.vcf
*.vcf.gz
*.h5
*.h5ad
*.rds
*.loom

# Large folders
data/raw/
data/processed/
results/intermediate/

# Python
__pycache__/
*.pyc
.ipynb_checkpoints/

# R
.Rhistory
.RData
.Rproj.user/

# Conda environments
.conda/
.env/

# System files for Mac users
.DS_Store

# Secrets
*.key
*.pem
.env
config_private.yaml
```

Create the file:

```bash
touch .gitignore
```

Then edit it using your text editor.
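
To confirm that the rules behave as intended, `git check-ignore` reports whether a path is ignored and which rule is responsible. The sketch below uses a throwaway repository and a two-rule `.gitignore`; the file names are made up for the demonstration.

```bash
# Work in a throwaway repository.
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q

# A two-rule .gitignore, just for this demonstration.
printf '%s\n' '*.bam' 'data/raw/' > .gitignore

# Create two files that should be ignored and one that should be tracked.
mkdir -p scripts data/raw
touch sample.bam data/raw/reads.fastq scripts/01_qc.py

# -v prints the .gitignore line responsible for ignoring each path.
git check-ignore -v sample.bam data/raw/reads.fastq

# Ignored files do not appear in the status listing.
git status --short
```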

---

## 10. Check project status

To see what has changed:

```bash
git status
```

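This tells you which files are:

- Untracked: new files that Git is not following yet
- Modified: tracked files with changes that are not staged
- Staged: changes that will go into the next commit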
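
A compact way to see these states side by side is `git status --short`. The sketch below builds a throwaway repository with one file in each state; the file names and contents are placeholders.

```bash
# Work in a throwaway repository with a local identity for the demo commit.
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q
git config user.name "Your Name"
git config user.email "your_email@example.com"

# Start with one committed file.
echo "# My project" > README.md
git add README.md
git commit -q -m "Initial project setup"

# Put one file in each state.
echo "More details" >> README.md      # modified, not yet staged
mkdir -p scripts
echo "print('qc')" > scripts/01_qc.py
git add scripts/01_qc.py              # new file, staged
touch notes.txt                       # untracked

# Status codes: '??' untracked, ' M' modified but unstaged, 'A ' newly staged.
git status --short
```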

---

## 11. Add files to Git

Add one file:

```bash
git add README.md
```

Add everything:

```bash
git add .
```

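Be careful with `git add .`: it stages every new and modified file in the current folder that `.gitignore` does not exclude, which can include large data files.

Always check first: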

```bash
git status
```
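
One way to preview what `git add .` would do is the `--dry-run` option, which lists the files without staging them. The sketch below assumes a throwaway repository with a one-rule `.gitignore`; the file names are placeholders.

```bash
# Work in a throwaway repository.
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q
printf '%s\n' '*.bam' > .gitignore
mkdir -p scripts
touch scripts/01_qc.py sample.bam

# --dry-run (or -n) lists what would be staged without staging anything.
git add --dry-run .

# Stage for real, then list exactly what is staged.
git add .
git diff --cached --name-only
```

Here `sample.bam` never shows up, because the `.gitignore` rule excludes it.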

---

## 12. Commit your changes

A commit is a saved checkpoint.

```bash
git commit -m "Initial project setup"
```

Good commit messages are short but informative.

Examples:

```bash
git commit -m "Add Scanpy QC script"
git commit -m "Update clustering workflow"
git commit -m "Fix marker gene plotting function"
git commit -m "Add conda environment file"
```

Avoid vague messages like:

```bash
git commit -m "update"
git commit -m "stuff"
git commit -m "final final version"
```
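
Informative messages pay off when you read the history later. The sketch below makes two commits in a throwaway repository and prints them with `git log --oneline`; the file contents are placeholders.

```bash
# Work in a throwaway repository with a local identity for the demo commits.
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q
git config user.name "Your Name"
git config user.email "your_email@example.com"

echo "# My project" > README.md
git add README.md
git commit -q -m "Initial project setup"

mkdir -p scripts
echo "print('qc')" > scripts/01_qc.py
git add scripts/01_qc.py
git commit -q -m "Add Scanpy QC script"

# One line per commit, newest first: short hash followed by the message.
git log --oneline
```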

---

## 13. Create a GitHub repository

On GitHub:

1. Go to GitHub.
2. Click **New repository**.
3. Choose a repository name.
4. Choose public or private.
5. Do not initialize with a README if you already have one locally; otherwise the GitHub and local histories differ and your first push is rejected.
6. Click **Create repository**.

Then connect your local repo to GitHub.

Example:

```bash
git remote add origin https://github.com/your_username/my_scRNAseq_project.git
```

Push your code:

```bash
git branch -M main
git push -u origin main
```
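
The same steps can be rehearsed offline before touching GitHub. In the sketch below, a bare repository in a temporary folder stands in for the GitHub repo, which is an assumption made purely for practice; with a real project you would use the GitHub URL instead.

```bash
# A bare repository in a temporary folder stands in for the GitHub repo.
remote_dir=$(mktemp -d)
git init -q --bare "$remote_dir"

# A local project with one commit.
project_dir=$(mktemp -d)
cd "$project_dir"
git init -q
git config user.name "Your Name"
git config user.email "your_email@example.com"
echo "# My project" > README.md
git add README.md
git commit -q -m "Initial project setup"

# The same three steps as above, with a local path instead of a GitHub URL.
git remote add origin "$remote_dir"
git branch -M main
git push -q -u origin main

# Confirm where 'origin' points and which remote branch 'main' tracks.
git remote -v
git status --short --branch
```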

---

## 14. Clone an existing repository

To download a GitHub repo:

```bash
git clone https://github.com/username/repository_name.git
```

Then enter the folder:

```bash
cd repository_name
```
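
`git clone` also accepts an optional second argument that sets the name of the local folder. The sketch below clones from a small local repository, used here as a stand-in for a GitHub URL; the folder name `my_copy` is arbitrary.

```bash
# Build a tiny source repository to clone from (a stand-in for a GitHub URL).
source_dir=$(mktemp -d)
git init -q "$source_dir"
git -C "$source_dir" config user.name "Your Name"
git -C "$source_dir" config user.email "your_email@example.com"
echo "# Example" > "$source_dir/README.md"
git -C "$source_dir" add README.md
git -C "$source_dir" commit -q -m "Add README"

# The optional second argument sets the name of the local folder.
work_dir=$(mktemp -d)
cd "$work_dir"
git clone -q "$source_dir" my_copy
cd my_copy

# The clone remembers where it came from as the remote 'origin'.
git remote -v
git log --oneline
```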

---

## 15. Pull updates from GitHub

Before you start working, especially in a shared project, pull the latest changes:

```bash
git pull
```

This downloads the latest changes from GitHub and merges them into your current branch.
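
To see how a pull plays out between two people, the sketch below simulates collaborators A and B on one machine. A local bare repository stands in for GitHub, and the optional `--ff-only` flag, which is an addition not used elsewhere in this tutorial, makes `git pull` stop instead of merging if the two histories have diverged.

```bash
# A bare repository stands in for GitHub; its default branch is set to 'main'.
remote_dir=$(mktemp -d)
git init -q --bare "$remote_dir"
git -C "$remote_dir" symbolic-ref HEAD refs/heads/main

# Collaborator A creates the first commit and pushes it.
a_dir=$(mktemp -d)
cd "$a_dir"
git init -q
git config user.name "Collaborator A"
git config user.email "a@example.com"
echo "# Shared project" > README.md
git add README.md
git commit -q -m "Initial project setup"
git branch -M main
git remote add origin "$remote_dir"
git push -q -u origin main

# Collaborator B clones the project.
b_parent=$(mktemp -d)
git clone -q "$remote_dir" "$b_parent/project"

# A pushes a new commit...
echo "print('qc')" > 01_qc.py
git add 01_qc.py
git commit -q -m "Add QC script"
git push -q

# ...and B receives it with git pull.
cd "$b_parent/project"
git pull -q --ff-only
git log --oneline
```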

---

## 16. A simple daily workflow

A common workflow:

```bash
git status
git pull
# edit scripts
git status
git add .
git commit -m "Describe what changed"
git push
```

For example:

```bash
git status
git add scripts/01_qc.py
git commit -m "Add mitochondrial filtering to QC script"
git push
```
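
An optional extra step is to review your changes with `git diff` before committing. The sketch below walks through one edit in a throwaway repository; the variable names written into the script (`min_genes`, `max_pct_mt`) are placeholders, not part of any real pipeline.

```bash
# Work in a throwaway repository with one committed script.
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q
git config user.name "Your Name"
git config user.email "your_email@example.com"
mkdir -p scripts
echo "min_genes = 200" > scripts/01_qc.py
git add scripts/01_qc.py
git commit -q -m "Add Scanpy QC script"

# Edit the script, then inspect the change before staging it.
echo "max_pct_mt = 20" >> scripts/01_qc.py
git diff                   # unstaged changes, line by line

git add scripts/01_qc.py
git diff --cached --stat   # summary of what the next commit will contain

git commit -q -m "Add mitochondrial filtering to QC script"
```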

---

## 17. Writing a useful README

Every bioinformatics repo should have a clear `README.md`.

A simple README structure:

````md
# Project title

## Overview

Briefly describe the project.

## Data

Describe the dataset used.

Do not upload private or large raw data directly unless appropriate.

## Environment

Explain how to install dependencies.

Example:

```bash
conda env create -f env/environment.yml
conda activate my_env
```
````
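
The `conda env create` command in the README expects an `env/environment.yml` file. The sketch below writes a small one from the shell. The channel, Python version, and package list are placeholders chosen for illustration, not requirements of this tutorial; only the environment name `my_env` is taken from the README example.

```bash
mkdir -p env

# Write a small, illustrative environment file.
# The channel, Python version, and package list are placeholders.
cat > env/environment.yml <<'EOF'
name: my_env
channels:
  - conda-forge
dependencies:
  - python=3.11
  - scanpy
EOF

# Show the file that was written.
cat env/environment.yml
```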




