CMSC 701 — Semester Project Options

# CMSC 701 Spring 2026 — Semester Project Options The semester project is intended to be a **mini research project** completed in groups of two. Projects should combine: - A clear algorithmic problem statement - A principled data structure or algorithmic contribution - Theoretical analysis (when appropriate) - A careful empirical evaluation - A short conference-style write-up Below is a list of suggested project directions. You may also propose your own project (subject to approval). --- # 1. Lightweight Long-Read Mapper with Sparse Seeding ## Overall Goal Design and implement a lightweight mapper for long-read sequencing technologies (e.g., ONT or PacBio) using sparse seeding strategies such as syncmers or strobemers, followed by chaining and lightweight alignment. The actual implementation and heuristics you apply are the key to this project, and so the above are only suggestions — you may choose any type of underlying index (hashing-based, suffix array based, FM-index based, de Bruijn graph based, etc.) Your system should: - Build a compact reference index - Map noisy long reads - Evaluate sensitivity, speed, and memory usage ## Why This Is Interesting Long-read mapping remains one of the most impactful algorithmic problems in genomics. Modern tools rely heavily on sparse seeds and chaining heuristics. This project lets you explore seed density theory, error robustness, and cache-aware hash-based indexing. ## Relevant Concepts - Minimizers, syncmers, strobemers - Seed density and coverage guarantees - Hash tables and compact dictionaries - Chaining via dynamic programming - Error models for long reads ## Jumping Off Point ### Papers - Li, H. (2018). *Minimap2: pairwise alignment for nucleotide sequences.* https://doi.org/10.1093/bioinformatics/bty191 - Kristoffer Sahlin et al., Multi-context seeds enable fast and high-accuracy read mapping with advanced strobemers (2024 preprint) — available at https://www.biorxiv.org/content/10.1101/2024.10.29.620855v3.full.pdf (preprint) - Edgar, R. (2021). *Syncmers are more sensitive than minimizers.* https://doi.org/10.7717/peerj.10805 ### Starter Data - Human chromosome 20 reference (GRCh38): https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/ - Oxford Nanopore public WGS datasets: https://s3.amazonaws.com/nanopore-human-wgs - Simulated long reads: - PBSIM: https://github.com/pfnet/pbsim - NanoSim: https://github.com/bcgsc/NanoSim --- # 2. Spectrum Preserving Tilings and Compression of Tiling Metadata ## Overall Goal Implement Spectrum Preserving Tilings (SPT) and explore novel methods for compressing and querying tiling metadata. Support efficient k-mer membership queries while minimizing memory footprint. Possible directions: - Succinct representations of tiling assignments - Rank/select–based membership support - Trade-off analysis between sparsity and lookup time ## Why This Is Interesting SPT enables sparse and modular reference indexing without losing k-mer information. The core open question is how to represent tiling information compactly while preserving fast queries — a clean data-structural challenge. ## Relevant Concepts - Spectrum preserving tilings - Universal hitting sets - Sparse indexing - Elias–Fano encoding - Succinct bitvectors (rank/select) ## Jumping Off Point ### Papers - *Spectrum Preserving Tilings Enable Sparse and Modular Reference Indexing* (RECOMB 2023) https://doi.org/10.1007/978-3-031-29119-7_2 ### Starter Data - Human chromosome 20 reference: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/ - Small bacterial genomes (RefSeq): https://ftp.ncbi.nlm.nih.gov/refseq/release/bacteria/ - Synthetic stress-test sequences (self-generated) --- # 3. Dynamic Colored de Bruijn Graph with Succinct Color Encoding ## Overall Goal Design a colored de Bruijn graph representation that: - Compresses color sets effectively - Supports incremental addition of new genomes - Enables efficient k-mer + color queries ## Why This Is Interesting Color information dominates memory usage in pangenome indexing. Balancing compression with dynamic updates is still an open algorithmic problem with clear theoretical and practical dimensions. ## Relevant Concepts - BOSS representation - Rank/select structures - Elias–Fano encoding - Dynamic compressed sequences - Amortized update analysis ## Jumping Off Point ### Papers - Bowe et al. (2012). *Succinct de Bruijn graphs.* https://doi.org/10.1007/978-3-642-33122-0_14 - Pandey et al. (2018). *Mantis: A Fast, Small, and Exact Large-Scale Sequence-Search Index.* https://doi.org/10.1016/j.cels.2018.05.021 ### Starter Data - 10–100 bacterial genomes (NCBI RefSeq): https://ftp.ncbi.nlm.nih.gov/refseq/release/bacteria/ - Simulated strain collections: https://github.com/HadrienG/InSilicoSeq --- # 4. Cache-Efficient Genome Indexing: Designing for the Memory Hierarchy ## Overall Goal Develop a cache-aware or cache-oblivious genome indexing structure optimized for modern memory hierarchies. Possible directions: - Cache-aware FM-index layout - Blocked / tiled suffix array layouts - Cache-efficient k-mer dictionary - Branch-predictor-friendly rank/select You must measure: - Cache misses (e.g., via `perf`) - Runtime vs naive layout - Memory usage ## Why This Is Interesting In large-scale genomics, memory bandwidth and cache behavior dominate runtime. Succinct data structures are often theoretically optimal but practically memory-bound. This project bridges algorithm theory with systems-level performance engineering. ## Relevant Concepts - Cache-aware vs cache-oblivious algorithms - Memory layout and spatial locality - Blocked data structures - I/O complexity models - Succinct indexing ## Jumping Off Point ### Papers - Ferragina & Manzini (2000). *The FM-index.* https://doi.org/10.1109/SFCS.2000.892127 - Gagie et al. (2020). *Optimal text searching in BWT-runs bounded space.* https://doi.org/10.1016/j.tcs.2019.10.022 - Prokop (1999). *Cache-oblivious algorithms* (background). https://doi.org/10.15760/etd.12276 ### Starter Data - Human chromosome 20 reference - Repetitive genome collections (NCBI): https://ftp.ncbi.nlm.nih.gov/genomes/all/ --- # 5. Approximate Membership Query Structures for Genomic k-mers ## Overall Goal Implement and benchmark multiple AMQ data structures for large k-mer sets, including: - Bloom filters - XOR filters - Cuckoo filters - Possibly learned filters Evaluate: - Memory per k-mer - False positive rate - Query speed - Cache behavior ## Why This Is Interesting AMQs are fundamental to large-scale genomic search systems. Genome repetitiveness and skewed distributions may make learned or hybrid approaches particularly effective. ## Relevant Concepts - Probabilistic data structures - False positive analysis - Cache-aware hashing - Learned index structures ## Jumping Off Point ### Papers - Graf & Lemire (2020). *Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters.* https://arxiv.org/abs/2002.07311 - Pandey et al. (2018). *Mantis.* https://doi.org/10.1016/j.cels.2018.05.021 - Limasset. (2026). *ZOR filters: fast and smaller than fuse filters.* https://arxiv.org/abs/2602.03525 ### Starter Data - k-mer sets from bacterial genomes - Jellyfish k-mer counter: https://github.com/gmarcais/Jellyfish --- # 6. Grammar-Compressed Genome Representation with Pattern Queries ## Overall Goal Implement grammar-based compression (e.g., RePair-style) for genomes and support substring queries directly on the compressed representation. ## Why This Is Interesting Grammar compression can achieve very strong compression on repetitive genomes. Supporting queries introduces nontrivial algorithmic challenges at the intersection of parsing and indexing. ## Relevant Concepts - Grammar compression - Parse trees - Pattern matching on compressed text - Space–time trade-offs ## Jumping Off Point ### Papers - [Larsson & Moffat (2000). *RePair.*](https://hackmd.io/CiszTBeWQtKiyKqIUwEmnw?both) - [Bille et al. Practical and Effective Re-Pair Compression](https://arxiv.org/pdf/1704.08558) - Navarro. *Indexing highly repetitive string collections* (survey). [part 1](https://users.dcc.uchile.cl/~gnavarro/ps/acmcs20.3.pdf) [part 2](https://users.dcc.uchile.cl/~gnavarro/ps/acmcs20.2.pdf) ### Starter Data - Human chromosomes (GRCh38): https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/ --- # 7. Sketching for Large-Scale Single-Cell RNA-seq ## Overall Goal Develop a sketch-based representation (e.g., MinHash or random projections) of high-dimensional gene expression vectors to support: - Approximate similarity search - Fast clustering - Reduced memory storage ## Why This Is Interesting Single-cell datasets can contain millions of cells. Sublinear sketching algorithms could drastically reduce memory and runtime while preserving biological signal. ## Relevant Concepts - MinHash - Johnson–Lindenstrauss lemma - Approximate nearest neighbors - High-dimensional sparse vectors ## Jumping Off Point ### Papers - Ondov et al. (2016). *Mash: Fast genome and metagenome distance estimation using MinHash.* https://doi.org/10.1186/s13059-016-0997-X - Zheng et al. (2017). *Massively parallel digital transcriptional profiling of single cells.* https://doi.org/10.1038/ncomms14049 - Hie et al. (2019). Geometric Sketching Compactly Summarizes the Single-Cell Transcriptomic Landscape. Cell Systems. DOI: 10.1016/j.cels.2019.05.003 - “Benchmarking sketching methods on spatial transcriptomics data” (2025 bioRxiv), https://www.biorxiv.org/content/10.1101/2025.08.26.672376.full ### Starter Data - 10x PBMC 3k dataset: https://www.10xgenomics.com/datasets/3-k-pbmc-from-a-healthy-donor-3-standard-1-2-0 - Allen Mouse Brain Cell Types: https://celltypes.brain-map.org/ --- # 8. Hierarchical Compressed Representations of Single-Cell Data ## Overall Goal Construct a hierarchical (multi-resolution) representation of single-cell gene expression data that supports: - Fast aggregated queries - Coarse-to-fine clustering - Compressed storage You may encode the hierarchy using succinct tree representations. ## Why This Is Interesting Hierarchical summaries enable scalable exploration and interactive analysis. Efficient encoding of large trees with aggregation support is a clean data-structural challenge. ## Relevant Concepts - Hierarchical clustering - Succinct tree encodings (LOUDS, balanced parentheses) - Tree queries with aggregation - Multiscale data representations ## Jumping Off Point ### Papers - Price et al. (2009). *FastTree.* https://doi.org/10.1093/molbev/msp077 - Navarro & Sadakane (2014). *Fully functional succinct trees.* https://doi.org/10.1137/130912062 ### Starter Data - Tabula Sapiens single-cell atlas: https://tabula-sapiens-portal.ds.czbiohub.org - 10x Genomics datasets: https://www.10xgenomics.com/resources/datasets --- # Expectations for All Projects Each project should include: - A precise problem formulation - Description of the proposed algorithm/data structure - Theoretical analysis (when appropriate) - Careful empirical evaluation - Comparison to at least one baseline - A final 7-9 page conference-style paper - Reproducible code and experiments ---