# GitHub Tutorial for Bioinformatics and Computational Biology

## 1. Why use GitHub?

GitHub is a platform for storing, version-controlling, sharing, and documenting code.

For bioinformatics and computational biology, GitHub is especially useful for:

- Keeping track of analysis scripts
- Recording changes to code over time
- Sharing reproducible workflows
- Collaborating with lab members
- Documenting software, pipelines, and tutorials
- Building a visible research portfolio

In short:

> Git tracks your code history.
> GitHub lets you store and share that history online.

---

## 2. Git vs GitHub

These two are related but not the same.

| Tool | What it does |
|---|---|
| Git | A version control system on your computer |
| GitHub | An online platform that hosts Git repositories |

You use **Git** locally, then push your work to **GitHub**.

---

## 3. Basic vocabulary

| Term | Meaning |
|---|---|
| Repository / repo | A project folder tracked by Git |
| Commit | A saved snapshot of your changes |
| Branch | A separate line of development within your project |
| Push | Upload local commits to GitHub |
| Pull | Download updates from GitHub and merge them into your local branch |
| Clone | Download a GitHub repo to your computer |
| README | The front page documentation of a repo |
| `.gitignore` | A file telling Git what not to track |

---

## 4. Install Git

Check whether Git is already installed:

```bash
git --version
```

If this prints a version number, Git is installed. If the command is not found, install Git from https://git-scm.com or with your system's package manager.

---

## 5. Set up your Git identity

You only need to do this once per computer.

```bash
git config --global user.name "Your Name"
git config --global user.email "your_email@example.com"
```

Check your settings:

```bash
git config --global --list
```

---

## 6. Create a new project folder

Example:

```bash
mkdir my_scRNAseq_project
cd my_scRNAseq_project
```

Initialize Git:

```bash
git init
```

Now this folder is a Git repository.

---

## 7. Recommended project structure for bioinformatics

A simple structure:

```text
my_scRNAseq_project/
├── README.md
├── scripts/
│   ├── 01_qc.py
│   ├── 02_clustering.py
│   └── 03_marker_analysis.py
├── notebooks/
│   └── exploratory_analysis.ipynb
├── results/
│   ├── figures/
│   └── tables/
├── data/
│   └── README.md
├── env/
│   └── environment.yml
├── docs/
│   └── notes.md
└── .gitignore
```

Suggested usage:

| Folder | Purpose |
|---|---|
| `scripts/` | Main reusable analysis scripts |
| `notebooks/` | Exploratory work |
| `results/` | Figures and result tables |
| `data/` | Data descriptions and download instructions, usually not the raw data itself |
| `env/` | Conda or pip environment files |
| `docs/` | Notes, protocols, and methods explanations |

---

## 8. What should not be uploaded?

In bioinformatics, avoid uploading:

- Raw sequencing data
- Large `.fastq`, `.bam`, `.h5ad`, `.rds`, `.loom`, `.h5` files
- Private patient or clinical data
- Passwords or API keys
- Huge intermediate files

> GitHub officially recommends keeping repositories under 1 GB, and strongly recommends staying under 5 GB for performance and maintainability. Individual files over 100 MB are blocked entirely.

Instead, upload:

- Scripts
- Small example data
- Metadata templates
- Documentation
- Environment files
- Instructions for downloading or generating data

---

## 9. Create a `.gitignore`

Because of these limits, it is important to set up a `.gitignore` file. It tells Git which files and folders to ignore.

Example `.gitignore` for bioinformatics:

```gitignore
# Large data files
*.fastq
*.fastq.gz
*.fq
*.fq.gz
*.bam
*.sam
*.cram
*.vcf
*.vcf.gz
*.h5
*.h5ad
*.rds
*.loom

# Large folders
data/raw/
data/processed/
results/intermediate/

# Python
__pycache__/
*.pyc
.ipynb_checkpoints/

# R
.Rhistory
.RData
.Rproj.user/

# Conda environments
.conda/
.env/

# System files for Mac users
.DS_Store

# Secrets
*.key
*.pem
.env
config_private.yaml
```

Create the file:

```bash
touch .gitignore
```

Then edit it using your text editor.

---

## 10. Check project status

To see what has changed:

```bash
git status
```

This tells you which files are:

- Untracked (new files Git is not yet following)
- Modified (tracked files changed since the last commit)
- Staged (changes added and ready to commit)

---

## 11. Add files to Git

Add one file:

```bash
git add README.md
```

Add everything:

```bash
git add .
```

Be careful with `git add .`: it stages every new and modified file in the current folder that is not covered by `.gitignore`. Always check first:

```bash
git status
```

---

## 12. Commit your changes

A commit is a saved checkpoint.

```bash
git commit -m "Initial project setup"
```

Good commit messages are short but informative.

Examples:

```bash
git commit -m "Add Scanpy QC script"
git commit -m "Update clustering workflow"
git commit -m "Fix marker gene plotting function"
git commit -m "Add conda environment file"
```

Avoid vague messages like:

```bash
git commit -m "update"
git commit -m "stuff"
git commit -m "final final version"
```

---

## 13. Create a GitHub repository

On GitHub:

1. Go to GitHub.
2. Click **New repository**.
3. Choose a repository name.
4. Choose public or private.
5. Do not initialize with a README if you already have one locally.
6. Click **Create repository**.

Then connect your local repo to GitHub.

Example:

```bash
git remote add origin https://github.com/your_username/my_scRNAseq_project.git
```

Push your code:

```bash
git branch -M main
git push -u origin main
```

---

## 14. Clone an existing repository

To download a GitHub repo:

```bash
git clone https://github.com/username/repository_name.git
```

Then enter the folder:

```bash
cd repository_name
```

---

## 15. Pull updates from GitHub

Before working, especially in a shared project:

```bash
git pull
```

This downloads the latest changes from GitHub and merges them into your current branch.

---

## 16. A simple daily workflow

A common workflow:

```bash
git status
git pull
# edit scripts
git status
git add .
git commit -m "Describe what changed"
git push
```

For example:

```bash
git status
git add scripts/01_qc.py
git commit -m "Add mitochondrial filtering to QC script"
git push
```

---

## 17. Writing a useful README

Every bioinformatics repo should have a clear `README.md`.

A simple README structure:

```{md}
# Project title

## Overview

Briefly describe the project.

## Data

Describe the dataset used.
Do not upload private or large raw data directly unless appropriate.

## Environment

Explain how to install dependencies.

Example:

```{bash}
conda env create -f env/environment.yml
conda activate my_env
```