# A Practical Guide for Bioinformatics Project and Data Management

This document was created as a discussion guide and handout for the MEC Lab's Bioinformatics Rodeo (15th Oct 2025).

## Contents

[toc]

## Introduction

### 🚨 The challenge

Many of you are embarking on your first undergrad/masters/PhD bioinformatics research projects, which is super exciting! But without a good system in place early on, things can quickly snowball into a disorganised mess of files and scripts.

**Organising bioinformatics projects is challenging due to:**
- Huge datasets
- Hundreds and even thousands of intermediate files
- Workflows spanning different types of scripts, tools, and programming languages

**With poor organisation often leading to:**
- Lost time: spending hours or even days trying to find certain files or work out what you did
- Irreproducible results: unable to regenerate results for publication
- Data loss: accidentally overwriting or deleting critical files
- Barriers to collaboration: collaborators are unable to understand or use your work
- Career impacts: slow research progress, delayed publications, irreproducible studies

### ✅ The solution

Good project and data management saves time, reduces stress, and importantly ensures your research is reproducible for the scientific community.

The overall goal is to ensure that someone unfamiliar with your project (including future you) can look at your files and understand exactly what you did and why.

Taking a few minutes at the start of your project to think and decide on a logical organisation strategy will save you a lot of time and effort in the long run. At the same time, the pursuit of perfection and the temptation to begin with only the "best" methods (e.g. GitHub version control) can easily be overwhelming and slow down your research progress when first starting out.

### 🎯 Session goals

The goal of this session is to introduce some key practical strategies for bioinformatics project and data management that you can implement today, aka what I wish someone had told me when I was first starting out too.

We will focus on 3 topics in this 1-hour session:
1. How to set up a logical and standardised project file structure
2. Useful file naming conventions
3. Effective documentation for reproducibility

By adopting some of the recommended practices outlined in this document, you will set up a solid foundation for managing your project from day one 💪

> Perfect is the enemy of good. Start simple, build habits, and improve over time.

## (1) A standardised file and directory structure

![turing.file.management](https://hackmd.io/_uploads/rJ_kky6age.jpg)
Credit: The Turing Way. Illustration by Scriberia (CC-BY 4.0).

Every bioinformatics project should follow a clear, logical, and consistent directory structure.

Pros of following a standardised template:
- Allows you and others to navigate any project quickly and clearly
- Reduces decision fatigue on where files should go
- Easier to write scripts that work across projects

### Example template

NB. There is flexibility based on project specifics and personal preferences, but this template is a great skeleton to begin with.

```
project_template/              # Change to your project name
│
├── 00_scripts/                # Scripts specific to the project
│
├── 01_data/                   # Data files
│   ├── raw/                   # Raw data (NEVER EDIT): e.g. sequencing reads
│   ├── reference/             # Reference files: e.g. sample metadata, reference genomes
│   └── working/               # Intermediate data: e.g. cleaned, transformed, processed data
│
├── 02_results/                # Analysis outputs
│   ├── figures/               # For final publication and preliminary exploration
│   ├── reports/               # e.g. QC reports
│   └── tables/                # For final publication and preliminary exploration
│
├── 03_docs/                   # Project documentation hub (though keep elsewhere if preferred)
│   └── lab_notebook.md
│
└── README.md                  # Project overview in root directory
```

**To copy this template**

```bash
# Create template directory structure
mkdir -p project_template/{00_scripts,01_data/{raw,reference,working},02_results/{figures,reports,tables},03_docs}

# Create an empty README file in the root project directory
touch project_template/README.md
```

### Breakdown of key directories

#### `00_scripts`
- All of your project-specific scripts go here
- Store scripts that are useful across multiple projects in a different directory, e.g. `utils`
- Can be split into logical subdirectories corresponding to major analysis steps, ordered by execution
- Store the stdout logs of each submission script in a `logs` subdirectory

:::warning
**Discussion**
- How to organise similar scripts across different projects (e.g. same workflow with different parameters and data)
- When to use submission scripts vs quick command line manipulations?
:::

#### `01_data/raw`
- Never edit! Keep raw sequencing data exactly as received
- **MEC lab-specific**: If you're working on Unity, then raw data is usually uploaded to a shared space, e.g. `/nese/meclab/Shared`.
    - If you're still fairly new to bioinformatics, then you may feel more comfortable copying the raw data into your own project directory in a scratch space to work with. This creates a low-risk, stress-free environment to start practising the basics.
    - Otherwise, you can read the raw data directly from `nese` in your scripts, especially for huge datasets. If possible, create a backup (e.g. in cold storage) before doing this, so it's fine if files accidentally get corrupted. Don't hesitate to ask Lisa or a group member to check your script before you do this for the first time too.

#### `01_data/working`
- Store intermediate files from data processing steps here, e.g. filtered, cleaned, transformed, analysis data.
- Use logical subdirectories, e.g. by sample ID or file type. Different systems may work better for different analysis stages
- These files can be regenerated from the raw data at any time using scripts in `00_scripts`

:::warning
**Discussion**
- Which intermediate files are safe to delete and when?
:::

#### `01_data/reference`
- Reference files used by scripts in `00_scripts` during data processing, e.g. reference genomes, downloaded public data, sample metadata
- If these files are shared across projects, consider storing elsewhere or using symbolic links

#### `02_results/figures`
- Final figures for the project, split by analysis type
- For exploratory/preliminary work, you can use date-stamped directories, e.g. `exploratory/20251015_initial_clustering`

#### `02_results/reports`
- Reports for the project, e.g. QC HTML reports

#### `02_results/tables`
- Final tables for the project, split by analysis type

#### `03_docs`
- Project-specific lab notebooks and other documents can be stored here, though elsewhere is fine too (e.g. HackMD, Google Drive, Notion etc)

## (2) Useful file naming conventions

File and directory names should meet these 3 criteria:
1. **Human-readable**: understandable at a glance
2. **Machine-readable**: can be parsed automatically via code
3. **Self-sorting**: appear in a logical order within an alphanumeric system

### 1. Human-readable

* **Name files descriptively**: the name should indicate what's inside a file without opening it.
    * You can include key information that makes sense for a given set of files, e.g. project/experiment name, sample ID, filtering condition etc.
    * At the same time, avoid overly long filenames. You can use abbreviations, provided they are explained in a README file.
* **Name files consistently**: once decided, maintain the same naming conventions for a given set of files, so it can be explained simply in the README file

### 2. Machine-readable

* **Avoid spaces or special characters**: these often break scripts and code unless they are carefully quoted or escaped!
    * Safe characters: letters `a-z, A-Z`, numbers `0-9`, underscores `_`, hyphens `-`, periods `.`
* **Use delimiters consistently**: this makes it easier to split file names via code
    * Underscores `_` are often recommended, as hyphens `-` and periods `.` can be confused with mathematical operators and file extensions
* **Use file extensions consistently**: this is essential for parsing files via code, given bioinformatics revolves around converting between different file formats to run different analyses. File extensions should:
    * Follow typical naming conventions (or follow documentation for a specific tool)
    * Be located at the end of the file name, separated by a period `.`, e.g. `file.csv, file.vcf, file.fasta`
    * Be chained when a series of operations has been performed on the same file, e.g. compressing `file.fasta.gz`, indexing `file.fasta.fai`, sorting `file.sort.bam` etc.

:::spoiler **Useful code for parsing filenames**
Here are some examples of code demonstrating how you can easily parse file names once they are set up properly.

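
**Check a file name against the safe characters**

As an extra sketch (not part of the original session material), the safe-character rule above can be checked automatically before files enter a pipeline. The function name `is_safe_name` is hypothetical, and the pattern simply encodes the safe character list given above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds only if a file name is made up entirely of
# the safe characters listed above (letters, numbers, `_`, `-`, `.`).
is_safe_name() {
    local name="$1"
    [[ "$name" =~ ^[A-Za-z0-9._-]+$ ]]
}

# Check one good and one bad name from the examples table in this guide
for f in "honu_metadata_20250211.txt" "mymetadata 21125.txt"; do
    if is_safe_name "$f"; then
        echo "OK:  $f"
    else
        echo "BAD: $f"
    fi
done
```

The first name passes, while the second is flagged because it contains a space. A check like this could be run over a whole directory listing, though what to do with flagged files is left to you.
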

**Extract a file prefix based on delimiters**

```bash
### In bash ###
original_name="sampA1_metadata.txt"

# Extract the desired prefix
prefix="${original_name%%_*}" # Keep all characters before the 1st underscore

# Build the new filename
new_name="${prefix}_analyses.txt"

# Print
echo "$new_name" # sampA1_analyses.txt

### In R ###
original_name <- "sampA1_metadata.txt"

# Extract the desired prefix
prefix <- sub("_.*", "", original_name) # Delete the 1st underscore and everything after it

# Build the new filename
new_name <- paste0(prefix, "_analyses.txt")

# Print
print(new_name) # sampA1_analyses.txt
```

**Replace/add file extensions**

```bash
### In bash ###
# `${file%.bam}` removes the extension, and `basename` removes any leading directory path
file="sampA1.bam"
echo $(basename "${file%.bam}").sort.bam # sampA1.sort.bam

# The suffix to remove must match the end of the current file name
file="sampA1.sort.bam"
echo $(basename "${file%.sort.bam}").cov # sampA1.cov

# syntax `${file%.bam}` means: remove the shortest match of '.bam' from the end of the variable 'file'.
```
:::

### 3. Self-sorting

* **Order file prefixes consistently**: to group files meaningfully within the same directory
```
# e.g. order by sample ID, then by quality filter
sample_01_minq10.bam
sample_01_minq20.bam
sample_02_minq10.bam
sample_02_minq20.bam
```
* **Left-pad numbers for sorting**: for files to appear in a numerical order
```
# Left-padded: in numerical order
sample_01.bam
sample_02.bam
sample_10.bam
sample_20.bam

# Un-padded: not in numerical order
sample_1.bam
sample_10.bam
sample_2.bam
sample_20.bam
```
* **Use ISO date formats (YYYYMMDD)**: for files to appear in date order
```
20250911 # 11th Sep 2025
20251109 # 9th Nov 2025
```
* **Sort scripts and directories numerically**: you can add left-padded numbers for sorting files and directories in a custom order
```
# e.g. order directories by genome assembly workflow steps
01_qc
02_de_novo_assembly
03_polishing
04_purge_dups
05_scaffolding
```
* NB. Some clusters don't allow script or job names that start with a number. You can add a letter in front to get around this, e.g. `S01_qc.sh` rather than `01_qc.sh`

### Examples

| ❌ Bad file name | ✅ Better file name | Changes |
|-------------------|---------|----------|
| `mymetadata 21125.txt` | `honu_metadata_20250211.txt` | Added project name, removed space, ISO format date |
| `1_script.check.sh` | `S01_fastqc.sh` | Added left-padding + letter, added analysis info, consistent delimiters |
| `chemyd.fasta`, `chemyd.fa.gz`, `greenturt.textfile` | `chemyd.fasta`, `chemyd.fasta.gz`, `chemyd.txt` | Consistent file prefix and extensions |

## (3) Effective documentation for reproducibility

The ultimate reproducibility goal is for anyone to look at your files and understand exactly what you did and why, including being able to redo your entire workflow and achieve the same results. To achieve this, effective documentation is essential.

Different types of documentation serve specific purposes:

| Documentation type | Purpose | Main audience | Update frequency |
|-------------------|---------|----------|------------------|
| **README** | Project overview | Everyone | Start of project, major changes, publication |
| **Sample metadata** | Sample information | You, analysis scripts, collaborators | When data arrives |
| **Lab notebook** | Record key decisions and observations throughout your project | You | Daily: as you analyse |
| **Workflow** | Streamlined record of steps, commands and programs used to achieve an analysis | You, code readers, collaborators | After finishing an analysis |
| **Script comments** | Explain code logic | You, code readers | Daily: as you write code |
| **Methods writeup** | Formal methods description | Everyone, for paper/thesis | Final write up |

### Breakdown of key document types

#### Project README file
- Every project should have a README file at the root level. This is the first file anyone will read - it's your project's front door 🚪
- It should be human-readable, even on a terminal (markdown format is popular)

A good README answers these questions:
1. **What** is the goal of the project?
2. **Where** did the data come from?
3. **How** is the project organised?
4. **How** do I run the analysis?
5. **Who** do I contact for questions?

:::spoiler **Example**
NB. Made up example

```markdown
# Project: CheMyd population structure

## Contact
Name: Charley Yen
Contact: eugenieyen@umass.edu
Last updated: 2025-10-15

## Overview
- Study goal: Describe the population structure of green sea turtles (Chelonia mydas) across Australia, Lalo and Florida
- Experimental design: Whole genome sequencing of wild populations
- Sample info: n=75 blood samples collected from wild nesting females (n=25 Australia, n=25 Lalo, n=25 Florida). For full metadata, see `chemyd_popgen_project_2025/01_data/reference/chemyd_popgen_sample_metadata.csv`

## Data Description
- Location: chemyd_popgen_project_2025/01_data/raw/fastq
- Date received: 2025-01-15
- Sequencing: Illumina NovaSeq, 150 bp paired ends, 25X coverage
- Reference: rCheMyd1.pri.cur.20210525

## Analysis Workflow
1. QC (chemyd_popgen_project_2025/00_scripts/S01_fastqc.sh)
2. Trim (chemyd_popgen_project_2025/00_scripts/S02_trim_reads.sh)
3. Align (chemyd_popgen_project_2025/00_scripts/S03_align_reads.sh)
4. Variant calling (chemyd_popgen_project_2025/00_scripts/S04_call_variants.sh)
5. SNP filtering (chemyd_popgen_project_2025/00_scripts/S05_filter_snps.sh)
6. Population structure (chemyd_popgen_project_2025/00_scripts/S06_admixture.sh)
```
:::

NB. In addition to the project README, it can be useful to include a README file for your raw data too, focusing on sample and experiment metadata.

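
To make starting a README less of a hurdle, the questions above can be turned into a skeleton file that you then fill in by hand. The following is a minimal sketch, not a lab standard: the function name `make_readme` is hypothetical, the headings are borrowed from the made-up example README above, and the demo runs in a throwaway directory.

```bash
#!/usr/bin/env bash
# Hypothetical helper: write a README skeleton into a project directory,
# refusing to overwrite a README that already exists.
make_readme() {
    local project_dir="$1"
    local readme="${project_dir}/README.md"

    if [[ -e "$readme" ]]; then
        echo "README already exists: $readme" >&2
        return 1
    fi

    # Headings mirror the example README; each section is left empty on purpose
    cat > "$readme" <<EOF
# Project: $(basename "$project_dir")

## Contact
Name:
Contact:
Last updated: $(date +%Y-%m-%d)

## Overview

## Data Description

## Analysis Workflow
EOF
}

# Demo in a temporary directory so no real project is touched
demo_dir="$(mktemp -d)/project_template"
mkdir -p "$demo_dir"
make_readme "$demo_dir" && cat "$demo_dir/README.md"
```

Refusing to overwrite an existing file is a deliberate choice here, so that running the helper twice cannot wipe out a README you have already written.
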
#### Sample metadata
- This provides an overview of all sample metadata, often in `.csv` format
- Not only does it track all relevant sample information, but it can also be read in directly by analysis scripts

:::spoiler **Example**
```bash
Sample,Sex,Nest_ID,Treatment
Aus_3D,Male,D,HeatShock
Aus_3F,Female,F,HeatShock
Aus_3G,Male,G,HeatShock
Aus_4D,Female,D,HeatShock
Aus_4F,Female,F,HeatShock
Aus_4G,Male,G,HeatShock
Aus_5G,Male,G,Control
Aus_6J,Female,J,Control
```
:::

#### Lab notebook
- This serves as your personal analysis diary. Suggested information:
    - Day-to-day decisions
    - Observations
    - Problems encountered and solutions
    - Job resource consumption
    - Next steps
- Be honest and complete: record failures and mistakes (you'll learn from them), note when you're unsure, include negative results
- Structure by analysis or chronologically
- While some structure is needed, this should be a quick and informal place to scribble down notes
- Make it a habit: update as you go along

:::spoiler **Example entry**
```markdown
## 2025-01-15
- Received raw fastq files from sequencing core
- Initial QC shows adapters present in 30% of reads
- Decision: Will use Trimmomatic with ILLUMINACLIP
- Complete: (4 CPUs: wallclock time=50min, max mem=8GB)
- Next step: QC looks good, proceed to alignment
```
:::

#### Workflow
- This can serve as a streamlined record of all steps, commands and programs you used to achieve an analysis outcome. This is especially important if you mix submission scripts and interactive command line manipulations.
- The goal is to provide a document that you or others can follow to repeat your analysis. Ideally use markdown format to mix coding blocks and detailed descriptions (e.g. HackMD, RMarkdown)

#### Script comments
- Comments help you and others understand what your script and each code block does
- Include a brief script header to detail what's in the script and who created it
- Useful for explaining why you made a choice, complex logic, warnings on unusual behaviour or parameters, workarounds implemented for bugs etc.

:::spoiler **Example comment**
Good comments that explain decisions:
```R
# Use default DESeq2 normalization
# NB. Tried TMM normalization but results were similar,
# and DESeq2 normalization is more standard in the field
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData = metadata,
                              design = ~ condition)

# Use default Wald test over LRT as we have a two-group comparison
dds <- DESeq(dds)
```
:::

:::warning
**Discussion**
- How do you decide what goes into your script comments vs workflow vs lab notebook?
:::

## Cheatsheet

### Project directory template

```bash
project_template/              # Change to your project name
│
├── 00_scripts/                # Scripts specific to the project
│
├── 01_data/                   # Data files
│   ├── raw/                   # Raw data (NEVER EDIT): e.g. sequencing reads
│   ├── reference/             # Reference files: e.g. sample metadata, reference genomes
│   └── working/               # Intermediate data: e.g. cleaned, transformed, processed data
│
├── 02_results/                # Analysis outputs
│   ├── figures/               # For final publication and preliminary exploration
│   ├── reports/               # e.g. QC reports
│   └── tables/                # For final publication and preliminary exploration
│
├── 03_docs/                   # Project documentation hub (though keep elsewhere if preferred)
│   └── lab_notebook.md
│
└── README.md                  # Project overview in root directory

# To copy this template
mkdir -p project_template/{00_scripts,01_data/{raw,reference,working},02_results/{figures,reports,tables},03_docs}
```

### Checklists

#### File naming checklist
- [ ] Descriptive name that explains content
- [ ] Consistent with your project naming convention
- [ ] No spaces or special characters (use `_` or `-`)
- [ ] Consistent delimiters
- [ ] Consistent file extension at end of file name, separated by `.`
- [ ] Date in YYYYMMDD format if applicable
- [ ] Numbers are left-padded if applicable (`01` not `1`)

#### Reproducibility checklist before sharing with collaborators
- [ ] README in the root directory that explains the project
- [ ] Files are organised clearly and logically
- [ ] Raw data is preserved, documented, and backed up
- [ ] Results can be reproduced identically from the raw data using scripts
- [ ] Scripts are well-commented and easy to follow
- [ ] Sample metadata is complete and correct
- [ ] Software versions are recorded
- [ ] Key decisions are documented

### Day-to-day habits

#### Daily
1. Name files clearly as you create them (follow the file naming checklist)
2. Add clear comments as you write code
3. Record what you did and why in your lab notebook

#### Weekly/Monthly
1. Update your README if the project structure changed
2. Review if any intermediate files can be cleaned up

### Common pitfalls and how to avoid them

| ❌ Don't | ✅ Do |
|---------|------|
| Make up project directory structure as you go along | Start with the project template and go from there |
| Have an outdated README file | Update whenever major project changes occur |
| Use ambiguous file names `final_v2.txt`, `test.bam` | Follow the file naming checklist |
| Edit raw data files | Only read in raw data and save any processing steps to new files in `01_data/working`. Ensure there is a backup of the raw data. |
| Keep every single intermediate file ever | Schedule regular reviews to clean up your project directory |
| Skip documenting what you did | Fill out your lab notebook as you go |
| Think "I'll remember" | No you won't. Write it down now |
| Only organise at the end of the project | Organise as you go |

## Further reading

- Article: [A Quick Guide to Organizing Computational Biology Projects](https://pmc.ncbi.nlm.nih.gov/articles/PMC2709440/) (Noble, 2009)
- Short article: [How to structure a README file](https://www.makeareadme.com/)
- Online book: [The Turing Way](https://book.the-turing-way.org/)