# 3-19-24 Meeting with Titus
:::info
**Student:** Meghan Sleeper
**Project:** Identifying commonly differentially methylated regions (DMRs) in colorectal cancer tissue samples
:::
## :triangular_flag_on_post: Project Background
:::success
My project seeks to aggregate genome-wide methylation data from colorectal cancer patients and healthy controls and to identify differentially methylated regions that are common across all samples.
:::
### Methylation patterns observed in cancers:
Epigenome-wide association studies (EWAS) investigate relationships between epigenetic modifications across the entire genome and a particular condition.
Differentially methylated regions (DMRs) associated with a condition are identified by comparing methylation in two groups.
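
To make the two-group comparison concrete, here is a minimal sketch that flags candidate regions by the difference in mean beta value between groups. The region names, beta values, and the `min_delta` threshold are all made up for illustration, and real DMR callers use statistical tests rather than a plain cutoff, so this is not the method used in the workflow.

```python
from statistics import mean

# Hypothetical beta values (fraction methylated, 0 to 1) per region.
# All numbers are invented for illustration only.
tumor = {
    "chr1:1000-1500": [0.82, 0.78, 0.85],
    "chr1:9000-9400": [0.40, 0.45, 0.42],
    "chr2:300-800": [0.15, 0.10, 0.20],
}
normal = {
    "chr1:1000-1500": [0.30, 0.35, 0.28],
    "chr1:9000-9400": [0.41, 0.44, 0.43],
    "chr2:300-800": [0.60, 0.65, 0.58],
}


def candidate_dmrs(group_a, group_b, min_delta=0.25):
    """Return {region: mean(a) - mean(b)} where the absolute difference >= min_delta."""
    hits = {}
    # Only compare regions that were measured in both groups.
    for region in group_a.keys() & group_b.keys():
        delta = mean(group_a[region]) - mean(group_b[region])
        if abs(delta) >= min_delta:
            hits[region] = round(delta, 3)
    return hits


dmrs = candidate_dmrs(tumor, normal)
for region, delta in sorted(dmrs.items()):
    direction = "hypermethylated" if delta > 0 else "hypomethylated"
    print(region, direction, "in tumor, delta =", delta)
```

A positive difference means the region is more methylated in the first group (here, tumor); a negative difference means it is less methylated.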

DNA methylation (DNAm) patterns observed in cancer cells:

| Genomic feature | Change | Impact of change |
| ------------------ | ---------------- | -------------------------- |
| Intergenic repeats | Hypomethylation | Genomic instability |
| Gene promoters | Hypomethylation | Gene reactivation |
| Gene promoters | Hypermethylation | Gene silencing |
| Enhancer | Hypermethylation | Reduces gene transcription |
| CTCF binding sites | Both | Genomic instability |

Methylation arrays cover only ~2% of potentially methylated sites but are commonly used because they are inexpensive.
Whole-genome bisulfite sequencing (WGBS) is used less often because of its high cost, which often restricts a study to a single cancer tissue sample.
There is value in aggregating WGBS data from multiple studies and analysing them collectively.
## Samples overview:
\*Ref IDs are the last 3-4 digits of the BioProject accession (last 3 digits of the GEO series accession in parentheses when available).

| ref | bioproject | # of CRC samples | # of normal samples | # of adenoma (pre-cancer) samples |
| ------------- | ------------------------------------------------------------------ | ---------------- | ------------------- | --------------------------------- |
| 480(644) | [PRJNA201480](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA201480) | 1 | 1 | 0 |
| 431(318) | [PRJNA635431](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA635431) | 5 | 5 | 0 |
| 833 | [PRJNA230833](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA230833) | 6 | 0 | 3 |
| 9055(271) | [PRJNA229055](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA229055) | 2 | 0 | 0 |
| 217(171) | [PRJNA328217](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA328217) | 3 | 0 | 0 |
| 194(215) | [PRJNA315194](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA315194) | 0 | 1 | 0 |
| 535(161,348) | [PRJNA34535](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA34535) | 0 | 2 | 0 |
| **sample count** | | 17 | 9 | 3 |
* [Sample info spreadsheet](https://csuchico-my.sharepoint.com/:x:/r/personal/msleeper1_csuchico_edu/Documents/MS_sample_data_info.xlsx?d=w69524862b4d24ca088cdd0d993e46d8c&csf=1&web=1&e=g1z3bI)
:::warning
#### Challenges/Questions:
* Publicly available sample data that meet the criteria (true WGBS of colon tissue) have been harder to find than originally expected.
* Sample data often lack metadata on age and sex, which are important covariates in epigenetic studies.
* Can samples be compared across studies when they used different platforms/sequencing methods? How can I account for batch effects?
* **Titus mentioned:** look for similarities among samples of the same condition from different studies, e.g., cluster the samples to make sure the controls cluster together.
* [ranking similarity of bed files](https://rdrr.io/github/Bao-Lab/GPSmatch/man/rankSimilarity.html)
* It is probably best for me to reach out to **Logan or Viki** from Janine LaSalle's lab again to ask what they would recommend, since they work with WGBS data.
* Ask them how they run cluster analyses.
:::
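
One way to try the clustering check Titus suggested is sketched below: correlate beta values between every pair of samples and confirm that each sample's closest match has the same condition, even when it comes from a different study. The study names and beta values are hypothetical, and nearest-neighbor correlation is a stand-in for a full cluster analysis, not a recommendation from the meeting.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical samples keyed by (study, condition); each list holds beta
# values over the same five regions. All numbers are invented.
samples = {
    ("studyA", "normal"): [0.10, 0.80, 0.30, 0.60, 0.20],
    ("studyB", "normal"): [0.15, 0.75, 0.35, 0.55, 0.25],
    ("studyA", "CRC"): [0.70, 0.20, 0.90, 0.10, 0.65],
    ("studyC", "CRC"): [0.65, 0.30, 0.85, 0.15, 0.70],
}


def nearest_neighbors(samples):
    """Map each sample to the other sample it correlates with most strongly."""
    nearest = {}
    for name, betas in samples.items():
        scored = [
            (pearson(betas, other_betas), other)
            for other, other_betas in samples.items()
            if other != name
        ]
        nearest[name] = max(scored)[1]
    return nearest


nearest = nearest_neighbors(samples)
for (study, condition), (nn_study, nn_condition) in nearest.items():
    status = "same condition" if condition == nn_condition else "DIFFERENT condition"
    print(f"{study}/{condition} -> {nn_study}/{nn_condition} ({status})")
```

If samples instead pair up by study rather than by condition, that would point to a batch effect that needs to be handled before calling DMRs.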
### Data analysis
[dmr_workflow](https://github.com/MSleeper1/dmr_workflow/tree/main)
* [Original analysis scripts](https://github.com/MSleeper1/dmr_workflow/tree/main/script_dev) are primarily bash scripts.
* I have been working to produce an [automated workflow using snakemake](https://github.com/MSleeper1/dmr_workflow/tree/main/pipeline) to improve reproducibility.
* **pre-processing subworkflow** includes all steps that need to be carried out on the cluster due to disk and memory usage. It starts by downloading files from SRA and includes QC checks, trimming, alignment, merging, and methylation reporting. It outputs small beta files that can be worked with locally.
* **primary workflow** can be carried out locally. It includes segmentation and DMR calling.
:::warning
#### Challenges/Questions:
* How can I run these workflows if there is not enough disk space for the intermediate files?
* I have implemented `temporary()` to delete intermediate files once they have been used, but what if there is not enough space for those temporary files while the pipeline is running?
* **Titus advised:** Try using [shadow rules](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#shadow-rules) and [local temp directory](https://github.com/dib-lab/farm-notes/blob/latest/example-scripts.md#example-script-for-using-local-temp-directories) (does not count against space quota).
* [example workflow](https://github.com/dib-lab/genome-grist/blob/latest/genome_grist/conf/Snakefile) that uses these tricks
* **Tessa and Mo** are good resources for cluster configuration and other snakemake questions.
:::
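
The idea behind the local temp directory trick can be sketched in plain Python: do the heavy work in a throwaway directory, move only the small final output back, and let everything else be deleted. This is not Snakemake's shadow-rule implementation. The names `run_in_scratch`, `scratch_root`, and `toy_step` are hypothetical, and the node-local scratch path on the cluster would need to be supplied.

```python
import shutil
import tempfile
from pathlib import Path


def run_in_scratch(step, final_output, scratch_root=None):
    """Run step(workdir) in a throwaway directory and keep only its result.

    `step` must return the path of the single file to keep. Everything else
    written to `workdir` is deleted when the `with` block exits.
    `scratch_root` is where the throwaway directory is created; None means
    the system default temp location.
    """
    final_output = Path(final_output)
    final_output.parent.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory(dir=scratch_root) as workdir:
        result = step(Path(workdir))
        # shutil.move also works across filesystems (copy, then delete).
        shutil.move(str(result), str(final_output))
    return final_output


def toy_step(workdir):
    """Stand-in for a heavy step: one large intermediate, one small output."""
    intermediate = workdir / "aligned.tmp"
    intermediate.write_text("x" * 10_000)
    summary = workdir / "sample.beta.txt"
    summary.write_text("chr1\t1000\t0.82\n")
    return summary


results_dir = Path(tempfile.mkdtemp()) / "results"
out = run_in_scratch(toy_step, results_dir / "sample.beta.txt")
print(out.read_text())
```

Only `sample.beta.txt` ends up in the results directory; the large intermediate never touches the quota-limited location.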
### Presentations/Posters with background and preliminary data
:black_small_square: [Summer seminar presentation](https://csuchico.box.com/s/ibdr0rw3vy57b6wmrueem9h1g52ectax): background on topic and research overview
:black_small_square: [Fall research poster for MATH615](https://csuchico.box.com/s/t3lfqfoqxzsdx2nbk6kokfesc9utz120): preliminary look at locations of differentially methylated regions.