# PDAC single-cell application: Bioinformatics methods

To analyse the G&T-Seq data, we will first perform a standard analysis of the transcriptome data. After standard quality control and read filtering, we will quantify expression using, e.g., Alevin [https://doi.org/10.1186/s13059-019-1670-y] or a standard aligner such as STAR [https://doi.org/10.1093/bioinformatics/bts635] combined with UMI-tools [https://doi.org/10.1101/gr.209601.116]. For the main analysis, we will use R/Bioconductor [https://doi.org/10.1038/nmeth.3252], leveraging Bioconductor packages for single-cell analysis [https://doi.org/10.1038/s41592-019-0654-x] as well as Seurat [https://doi.org/10.1016/j.cell.2021.04.048]. To improve comparability between patient samples, we can use the integration based on canonical correlation analysis (CCA) offered by Seurat [https://doi.org/10.1016/j.cell.2019.05.031].

We will then seek to define sub-clusters among the tumour cells, the circulating tumour cells (CTCs) and the CTC clusters, and to compare between these categories. Possible questions include: Are there differences in expression between circulating cells and cells taken from the solid tumours that are seen consistently across patients? Likewise, are there such differences between single CTCs and CTC clusters, or between cells collected from the portal versus the peripheral vein? Such comparisons require care to ensure sound statistical inference: many earlier workflows incorrectly treated individual cells rather than patients as the statistical units, which strongly inflates the number of false-positive results [https://doi.org/10.1101/2021.03.12.435024].
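The inflation of false positives can be illustrated with a small simulation. The following is a minimal sketch in plain Python, not part of the planned R/Bioconductor workflow. All names and settings (five patients, 100 cells per patient and compartment, the standard deviations, a single gene on a continuous expression scale) are assumptions chosen for illustration. The data are generated under the null hypothesis: each patient has a private shift between CTCs and tumour cells, but the shift has mean zero across patients.

```python
import random
import statistics

T_CRIT_DF4 = 2.776  # two-sided 5% critical value of Student's t, 4 degrees of freedom
Z_CRIT = 1.96       # two-sided 5% critical value of the standard normal


def simulate_null(rng, n_patients=5, n_cells=100, shift_sd=0.5):
    """Simulate one gene with no consistent CTC-versus-tumour difference.

    Each patient has a private baseline and a private shift with mean zero,
    so any apparent overall difference is patient-to-patient noise.
    """
    tumour, ctc, patient_diffs = [], [], []
    for _ in range(n_patients):
        baseline = rng.gauss(0.0, 1.0)
        shift = rng.gauss(0.0, shift_sd)
        t_cells = [rng.gauss(baseline, 1.0) for _ in range(n_cells)]
        c_cells = [rng.gauss(baseline + shift, 1.0) for _ in range(n_cells)]
        tumour.extend(t_cells)
        ctc.extend(c_cells)
        patient_diffs.append(statistics.fmean(c_cells) - statistics.fmean(t_cells))
    return tumour, ctc, patient_diffs


def cell_level_significant(tumour, ctc):
    """Two-sample z-test that (wrongly) treats every cell as independent."""
    se = (statistics.variance(tumour) / len(tumour)
          + statistics.variance(ctc) / len(ctc)) ** 0.5
    z = (statistics.fmean(ctc) - statistics.fmean(tumour)) / se
    return abs(z) > Z_CRIT


def patient_level_significant(patient_diffs):
    """One-sample t-test on per-patient differences (patients are the units)."""
    n = len(patient_diffs)
    se = statistics.stdev(patient_diffs) / n ** 0.5
    t = statistics.fmean(patient_diffs) / se
    return abs(t) > T_CRIT_DF4  # critical value is only valid for 5 patients


def false_positive_rates(n_sim=500, seed=1):
    """Fraction of null simulations called significant by each test."""
    rng = random.Random(seed)
    cell_hits = patient_hits = 0
    for _ in range(n_sim):
        tumour, ctc, diffs = simulate_null(rng)
        cell_hits += cell_level_significant(tumour, ctc)
        patient_hits += patient_level_significant(diffs)
    return cell_hits / n_sim, patient_hits / n_sim


if __name__ == "__main__":
    cell_fpr, patient_fpr = false_positive_rates()
    print(f"cell-level false-positive rate:    {cell_fpr:.2f}")
    print(f"patient-level false-positive rate: {patient_fpr:.2f}")
```

Under these assumptions, the cell-level test should reject far more often than the nominal 5%, whereas the test on per-patient differences should stay close to the nominal level. Taking the difference within each patient also removes the patient-specific baseline.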
To overcome this issue, one may either use pseudo-bulks, i.e., sum up counts across all cells of a subpopulation in a sample and then apply established bulk methods such as limma-voom [https://doi.org/10.1186/gb-2014-15-2-r29], or use mixed-effects models that explicitly model the hierarchy of statistical dependence between cells. Some benchmarks recommend the former [https://doi.org/10.1038/nmeth.4612] and others the latter [https://doi.org/10.1038/s41467-021-21038-1]. We will explore both approaches. In either case, the proper use of linear models to subtract patient-specific baselines will be crucial to obtain good statistical power. Prior removal of cells that are atypical for the cluster or sub-population under consideration will also improve interpretability. Our Sleepwalk tool [https://doi.org/10.1101/gr.251447.119] will assist in visually identifying such outliers.

For downstream analysis of the differential-expression results, we will use gene-set enrichment approaches, e.g., with the Hallmark collection [https://doi.org/10.1016/j.cels.2015.12.004] to identify signalling-induced states, or tools that identify activated gene-regulatory networks, such as SCENIC [https://doi.org/10.1038/s41596-020-0336-2] or others [https://doi.org/10.1038/s41592-019-0690-6].

For the WES data obtained from the genomic part of the G&T-Seq data, we will compare these data to the germline reference obtained. For the bulk data, SNP and CNV calls will be obtained using our established standard pipeline (PUT SOMETHING IN HERE); for the single-cell data, we might employ DLP+ [https://doi.org/10.1016/j.cell.2019.10.026], SCmut [https://doi.org/10.1093/bioinformatics/btz288] and/or SCAN-SNV [https://doi.org/10.1038/s41467-019-11857-8]. Given the fast pace of method development in this new area, we anticipate that further methods will be published in the coming months, and we will evaluate them as they appear.
The mutation calls for the single cells will foremost allow us to ask whether CTCs carry additional mutations, possibly including those that cause them to leave the tumour (and whether this is also the case for cells collected immediately after surgery, where the cause of detachment from the tumour may be the surgery rather than a mutation). Here, simple counting statistics on the mutational load, as well as enrichment analyses for EMT-related genes, might be of use. Cross-referencing mutation calls and transcriptome will also offer unique opportunities to study the effect of mutations on the transcriptome.

Finally, the single-cell bisulfite data will allow us to ask similar questions at the epigenetic level. For analysis, we will use the Bismark suite [https://doi.org/10.1093/bioinformatics/btr167] to obtain methylation calls at the single-cell level. A common way to proceed from there is to aggregate data by splitting the genome into overlapping sections of a few kb and to search for differentially methylated regions, a workflow offered, e.g., by methylKit [https://doi.org/10.1186/gb-2012-13-10-r87]. As with the transcriptomics data, we have to be careful not to fall into the “pseudoreplication trap” of treating each cell as an independent sample; this means that the chi-square test usually employed is incorrect for single-cell data.

We are currently working on a new method that addresses this challenge as follows: First, we determine an expected methylation level for each locus by averaging over all cells and then performing LOESS-style regression [[ISBN 978-0-387-22732-0](https://www.springer.com/gp/book/9780387987750)] with quasi-binomial likelihood; we then continue with each cell’s residual relative to this expectation. Scanning along the chromosomes, we then locate sections where the residual variance across cells is high.
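The expectation, residual and scanning steps can be sketched as follows. This is a simplified illustration in plain Python, not the method under development. As stated assumptions, it replaces the LOESS-style quasi-binomial regression with a degree-zero local fit (a tricube-weighted ratio of pooled methylated calls to pooled coverage), it scans non-overlapping fixed-width windows, and it runs on simulated binary calls. All function names, the bandwidth and the window width are hypothetical.

```python
import random
import statistics
from collections import defaultdict


def tricube(u):
    """Tricube kernel weight for a scaled distance u (zero for |u| >= 1)."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def expected_methylation(calls, bandwidth):
    """Smoothed, cell-averaged methylation level at every covered position.

    calls maps cell id -> {position: 0 or 1}; uncovered positions are absent.
    """
    meth = defaultdict(int)  # methylated calls per position, pooled over cells
    cov = defaultdict(int)   # number of cells covering each position
    for cell_calls in calls.values():
        for pos, call in cell_calls.items():
            meth[pos] += call
            cov[pos] += 1
    positions = sorted(cov)
    expected = {}
    for pos in positions:
        num = den = 0.0
        for other in positions:  # quadratic scan, acceptable for a toy example
            w = tricube((other - pos) / bandwidth)
            num += w * meth[other]
            den += w * cov[other]
        expected[pos] = num / den
    return expected


def mean_residuals_per_window(calls, expected, window):
    """Per-cell mean residual (call minus expectation) in fixed-width windows."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for cell, cell_calls in calls.items():
        for pos, call in cell_calls.items():
            idx = pos // window
            sums[idx][cell] += call - expected[pos]
            counts[idx][cell] += 1
    return {idx: {cell: sums[idx][cell] / counts[idx][cell] for cell in sums[idx]}
            for idx in sums}


def rank_windows_by_variance(window_means, min_cells=10):
    """Windows sorted by decreasing variance of mean residuals across cells."""
    scored = [(statistics.variance(list(per_cell.values())), idx)
              for idx, per_cell in window_means.items()
              if len(per_cell) >= min_cells]
    return [idx for _, idx in sorted(scored, reverse=True)]


def simulate_calls(rng, n_cells=40, coverage=0.6):
    """Toy data: half of the cells lose methylation between 1000 and 1499 bp."""
    calls = {}
    for cell in range(n_cells):
        calls[cell] = {}
        for pos in range(0, 3000, 50):
            if rng.random() < coverage:
                altered = cell >= n_cells // 2 and 1000 <= pos < 1500
                p_meth = 0.1 if altered else 0.8
                calls[cell][pos] = int(rng.random() < p_meth)
    return calls


if __name__ == "__main__":
    calls = simulate_calls(random.Random(0))
    expected = expected_methylation(calls, bandwidth=200)
    window_means = mean_residuals_per_window(calls, expected, window=500)
    print("windows ranked by residual variance:", rank_windows_by_variance(window_means))
```

In the toy data, the window covering 1000 to 1499 bp (index 2 at a window width of 500) is the only one in which cells differ systematically, so it should rank first. Note that ranking windows by variance across cells does not by itself account for the patient hierarchy.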
Averaging the residuals within each of these sections provides a feature vector for each cell that can be used as input for workflows designed for single-cell transcriptomics; e.g., we may use the feature vectors to span a PCA space for the cells. Furthermore, we can employ mixed models (to account for the hierarchical setting, analogous to the approach of [https://doi.org/10.1038/s41467-021-21038-1] for scRNA-Seq) to find individual loci or methylation principal components associated with, e.g., differences between primary tumour and CTCs.

Finally, as an alternative approach to integrate the modalities of WES, transcriptome and epigenetics, we will also explore factor-analysis approaches such as MOFA+ [https://doi.org/10.1186/s13059-020-02015-1].
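To illustrate how the per-cell feature vectors of average residuals could span a PCA space, the following minimal sketch computes the leading principal component by power iteration in plain Python. The feature matrix `FEATURES` (rows are cells, columns are genomic sections) is made up for illustration, and the function names are hypothetical. A real analysis would rely on an established PCA implementation and would have to handle sections without coverage.

```python
import random

# Hypothetical mean residuals: 6 cells (rows) x 3 sections (columns).
# The first three cells differ from the last three in section 0 only.
FEATURES = [
    [0.35, 0.02, -0.01],
    [0.30, -0.03, 0.02],
    [0.40, 0.01, 0.00],
    [-0.35, 0.01, 0.02],
    [-0.30, 0.00, -0.02],
    [-0.40, -0.02, 0.01],
]


def centre_columns(matrix):
    """Subtract the column mean from every feature (column)."""
    n_rows = len(matrix)
    means = [sum(col) / n_rows for col in zip(*matrix)]
    return [[x - m for x, m in zip(row, means)] for row in matrix]


def leading_component(matrix, n_iter=100, seed=0):
    """First principal axis and per-cell scores, found by power iteration.

    The sign of the axis (and hence of the scores) is arbitrary.
    """
    x = centre_columns(matrix)
    rng = random.Random(seed)
    axis = [rng.gauss(0.0, 1.0) for _ in x[0]]
    for _ in range(n_iter):
        # Multiply by X, then by X transposed, then renormalise.
        scores = [sum(a * b for a, b in zip(row, axis)) for row in x]
        axis = [sum(s * row[j] for s, row in zip(scores, x))
                for j in range(len(axis))]
        norm = sum(c * c for c in axis) ** 0.5
        axis = [c / norm for c in axis]
    scores = [sum(a * b for a, b in zip(row, axis)) for row in x]
    return axis, scores


if __name__ == "__main__":
    axis, scores = leading_component(FEATURES)
    print("loadings:", [round(c, 3) for c in axis])
    print("scores:  ", [round(s, 3) for s in scores])
```

In this toy matrix, the first component is dominated by section 0 and separates the two groups of cells. Such scores could then serve as the response in a mixed model with patient as a random effect, as outlined above.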