# GIW/BIOINFO 2017 Conference
###### tags: `shared` `conference`

[TOC]

## GIW/BIOINFO 2017 in Seoul
Paper presentation notes

## GIW/BIOINFO Day 1
### Tutorial 1: Machine Learning
+ Introduction to Sparse Coding in Genomics Data Analysis
  - Sparse coding to find biomarkers
  - First part of part I: introduction to ML, from the definition (weighting, activation function) to high-dimensional data (and thus sparse coding)
  - Idea of sparse coding: signs do not matter but the magnitudes matter => force some weights to be zero
  - Machine learning
    - Training data: parameter learning; validation data: hyper-parameter learning (for learning)
    - Testing data: for performance evaluation
    - Assumption: data points are i.i.d.
    - Main objective is the regularizer
  - Prediction models and data models
    - Prediction model: y <- f_w(x)
    - Data model: y = f_w(x) + epsilon, zero-mean Gaussian noise N(0, sigma^2)
    - Model space -> (x,y): w (sampling); (x,y) -> model space: w^hat (estimation)
    - Model space is also known as hypothesis space
  - Sparse coding
    - s-sparse if nnz(w) <= s (nnz == number of non-zero entries)
  - Estimation (sparse coding)
    - Least Absolute Shrinkage and Selection Operator (LASSO)
    - Fused LASSO: D matrix as a penalty matrix
      - 1-D fused LASSO: find step-wise functions to find differentially-expressed data
      - 2-D fused LASSO: fMRI images
    - Dantzig selector
  - Three forms of sparse coding
  - What are we looking for by LASSO?
  - Types of error
    - Estimation error: w* = w ?
    - Variable selection error: supp(w*) != supp(w), where supp(w) := {i in {1, …, p}: w_i != 0}
    - The second type of error often matters (which biomarkers you use)
  - How to choose the hyper-parameters?
(figure)
    - CV has no guarantee on the variable selection error
  - Notable results for LASSO
    - LASSO: finite sample control
    - LASSO selects correct variables even when s is very small
    - LASSO begins to include false variables even when s < n/(2 log p)
  - R packages: glmnet, SLOPE
  - HIV dataset available: https://goo.gl/c6fCMx
    - Genotypic predictors of human immunodeficiency virus type 1 drug resistance, Soo-Yon Rhee et al., PNAS 2005
    - `hiv.train$x # 704 observations of 208 binary mutation variables`
    - `fit = glmnet(hiv.train$x, hiv.train$y) # fit the model`
    - `plot(fit, xvar="lambda", main="HIV model coefficient paths") # plot the paths for the fit`
  - Why graphical models? (recent research)
    - We are often interested in their relations
  - Gaussian Markov random field (Gaussian MRF, GMRF)
    - Correlation (local perspective) vs. conditional correlation (global perspective)
    - Conditional correlation can be obtained by observing the interacting nodes
    - Sparser graph -> more interpretable
    - GMRF optimization problem (can be solved in a reasonable time even if the number of genes is large)
    - Problem: the matrix is NOT sparse -> not a sparse graph
  - Grade 1 breast cancer (unsupervised: only microarray data used)
    - Choosing lambda
  - Biomarker discovery
    - Genetic data
    - Classical model: multiple hypothesis testing, calculating p-values for them
      - Benjamini-Hochberg vs.
Bonferroni
    - Multivariate hypothesis testing: combinatorial explosion, so computationally expensive
    - MHT as model selection: X * beta -> y, where beta is a sparse vector
    - Model selection by sparse coding
      - Using SL1: SLOPE
      - `fit.slope = SLOPE(hiv.train$x, hiv.train$y, fdr=0.2)`
      - `fit.slope`
  - Biomarkers -> relations
    - On-going research

### GIW 1: Biological networks
+ BTNET: Boosted tree based gene regulatory network inference algorithm using time-course measurement data
  - Transcription factors - target genes
  - How to identify GRN
    - Correlation, information theoretic, dynamic Bayesian network, tree-based ensemble
  - Problem: steady-state expression data but dynamic biological process
  - AdaBoost: focus on samples with poor predictions
  - Gradient boosting: trained based on the error
  - Variable Importance Score (VIS): contribution to the reduction of the variance of the target value (i.e., target gene expression)
  - AUROC/AUPR as evaluation metrics
  - Future work: web interface & dynamic inference
  - BTNET is available online, implemented in Python
+ SPSNet: Subpopulation-sensitive network-based analysis of heterogeneous gene expression data
  - Disease heterogeneity (lies at the molecular level)
  - Systematic biases enter in the form of batch effects
  - DE analysis at gene level -> pathway level -> sub-network level
  - Disease changes only certain parts of a pathway (not the whole pathway)
  - Generating candidate subnetworks -> compute subnetwork scores -> determine statistical significance
    - Rank genes in the order of their expression -> fuzzy: top-10 (1) ~ bottom-10 (0)
    - Node & nodes within 5 steps -> pathway/subnetwork
  - Simulation methodology (SPSNet): N1 & N2 (artificially up/down-regulated) & subnetwork information
  - SPSNet can be used to recognize and eliminate extrinsic heterogeneity such as batch effects
+ Network-based identification of pathway-specific protein domains and their implication for human disease
  - Network propagation
  - co-pathway protein network
  - Pathway
specificity
    - PPA score between protein-pathway
    - DPA score between domain-pathway
+ A homologous mapping method for three-dimensional reconstruction of protein networks reveals disease-associated mutations
  - Network-based approach to cell behaviors and disease mechanism
  - PPI data is incomplete
    - High false positive rate
  - hDiSNet: human 3-dimensional structural PPI network
    - Validation: gene ontology (biological function) (biological process and cellular component)
    - Validation: gene co-expression profiles (gene expression profiles)
  - Mutations tend to occur in the interacting domain
  - Method for predicting/constructing PPI networks
+ Comorbidity Scoring with causal Disease Networks
  - Disease association network -> disease causality network
  - Main steps: disease association -> disease causality (pathway & medical literature) -> SSS^c algorithms
  - Proposed method
    - Settling directed edges (metabolic pathway & literature)
    - Limitation to SSL for causality matrix: non-symmetry -> not good for Laplacian
  - Toy example: graph with(out) direction

### Keynote Speech 1
+ Studying Stem Cell Systems through multiple ‘omics data
  - Wnt: “wint signaling”; mesodermal; lineages
  - Biological sample: XXX-seq data
  - Integrated analysis -> SNP (“snip”)
  - Modeling Wnt signaling pathway
  - “Organizing the first 24 hours of life”
  - Rules of developmental signaling pathways: initiation -> progression -> termination
  - Terminator of Wnt signaling
    - Highly induced in response to Wnt
    - Transcription factor
    - Repressive domain
    - Acts globally
  - SP5 binds in promoter regions of many Wnt target genes
    - What if disabled?
    - Wildtype (WT) vs. disabled
  - SP5 reins in their response to Wnt
  - What makes it so selective?
  - Generation and characterization of a mesodermal progenitor population
    - RNA-seq reveals 2 key sets of genes
  - What organizes neural progenitor cells?
    - Wnt instructs A/P positional identity of NPCs
  - Challenge: biological -> computational
  - Question: Data driven analysis?
    - Not done yet (presently, query-motivated)
    - Waiting to be mined

### GIW 2: Sequence Analysis, Computational Methods
+ Introducing difference recurrence relations for faster semi-global alignment of long sequences
  - Genealogy of Alignment Algorithms -> Contribution: Difference DP/libgaba/new implementation technique
  - Narrower variables allow for parallelization
  - libgaba available on GitHub
+ MUGAN: Multi-GPU accelerated AmpliconNoise server for rapid microbial diversity assessment
  - Expectation-maximization (EM)

### BIOINFO: RNA Informatics
+ Integrative analysis of genomic and epigenomic regulation of the transcriptome in liver cancer
  - Mutations acquired by tumor recurrence
    - Mutations in recurrent HCC
    - Reprogrammed Expression by Recurrence (RER) (by hierarchical clustering/KM analysis)
    - GOLGB1 and SF3B3 are novel targets
    - SOX4 and IL-8 are downstream targets (interaction networks)
  - DNA copy numbers and methylation
    - Look at: distributions and correlation
  - Multiomics-based classification
    - Validation using TCGA
    - Subtype markers
+ Differential, pathway, and network analysis of RNA-seq data
  - Read count data
    - RNA composition
    - Poisson model
      - Calculate the probability of observing k read counts
      - But really Poisson? No. Variance >> mean
    - Negative binomial distribution: composition of Poisson and gamma distributions
      - lambda ~ Gamma(alpha, beta), alpha: dispersion variable
    - R: DESeq, edgeR, … etc.
    - Focus on AUC for performance evaluation
    - Read count bias (Oshlack and Wakefield, 2009)
    - Dispersion is the upper and lower bounds of SNR
  - Pathway (GSEA) analysis of RNA-seq data
    - Step 1: Normalize the read count data using TMM, DESeq, voom (not RPKM)
    - Step 2: Apply GSEA
    - Small replicate: gene-permuting GSEA (often used)
      - But it will cause many false positives
      - Instead, use absolute gene statistic in one-tailed GSEA
  - Gene-set clustering
    - Difficult to identify biological themes
    - GO, DAVID, Enrichment Map
    - Expathnet: new web server
    - PPI weighted distance

## GIW/BIOINFO Day 2
### BIOINFO: Cancer Genomics
+ Utilization of PDX library and cancer genome database for precision oncology
  - Why sequence cancer samples
    - Understand more about cancers
  - Ideal study plan for cancer genome research
    - Large number of samples
    - High depth of sequencing
      - small cancer cell proportion
      - cancer cell heterogeneity
    - Large area of the whole genome
      - structural variation
      - variation in noncoding region
    - Many samples from the same patient
      - Multiple sites from the same tumor
      - Normal sample (blood, adjacent normal tissue)
      - Metastatic loci
      - Samples from different phases
    - Integration of multi-omics data
      - Transcriptome
      - Epigenome (DNA methylation, histone modification, chromatin structure)
      - Proteome
    - (Important) With clinical outcome
      - Clinical information we need if possible
  - PDX: patient-derived xenograft (drug, clinical data, genomic data)
    - Targeted mainly on breast cancer (most in eastern Asia)
    - Make PDX -> experiment (now in process)
    - Metastasis study using xenograft tissues
    - Drug response test with PDX models
  - Clinical cancer genome database: http://ccgd.snu.ac.kr
  - Question: Since fewer patients recurred, it’s not good for the research (success rate is around 30%)
+ Systematic analysis of cancer omics and phenotypic data for finding synthetic lethal target and biomarker combinations
  - Challenges in cancer therapies
    - Wide variety of cellular response against
target-specific cancer drugs
    - Difficult to predict clinical outcome
    - How to effectively combine different information of biomarkers?
  - Analysis of cancer omics and siRNA screening data: cancer phenotype
  - Cell line enrichment analysis (CLEA)
  - STK11 loss of function is a potential biomarker for lung cancer therapy
    - Major mutation pattern in patient samples
    - STK11mt is unique in lung cancers
    - ATP1A1 is a selective target for STK11 mutant cancers
  - Why siRNA library screening?
  - Algorithm for building network: weighted Tanimoto score
  - Pathway analysis and validation of the biological functions
  - On-going: screening of a CSC-inhibitor on colon cell line panel; transcriptome analysis of the sensitive cell lines
    - Feature extraction from images
    - Omics data mining
+ Pilot targeted next generation sequencing data
  - HNSCC: head and neck squamous cell carcinoma
  - More precise target is needed
  - General design of umbrella trial
    - Part 1: pilot targeted next generation sequencing (NGS pipeline … etc.)
    - Part 2: umbrella trials
  - Establish web-based report system
  - Molecular tumor board meeting (MTB) for interpreting NGS data
  - TP53 was the most common mutational event, occurring in 47 patients (51%), followed by CDKN2A (n=23), CCND1 (n=22), PIK3CA (n=19)
  - Comparison of HPV-positive and -negative
+ Application of RNA sequencing data in cancer research
  - N=59: 38 for experiment; 21 for control
  - Use of WTS data to discern functionality

### Keynote Speech 2
+ Genome Editing in Plants, Animals, and Human Cells
  - Genome editing timeline (Kim JS. Nature Protocols)
  - Kim H and Kim JS: comparison of programmable nucleases
  - CRISPR: adaptive immune system in bacteria
  - Genome surgery: in-/ex-vivo therapy
  - DNA/genome-wide cleavage scores
  - How to avoid off-target effects?
    - CRISPR RGEN Tools (Cas-Designer)

### BIOINFO: Machine Learning for Bioinformatics
+ Learning Graphical Structure from High-Dimensional Data using Sorted L1 Regularization
  - Bonferroni versus Benjamini-Hochberg (B-H: slightly larger chance of false discovery but much more powerful)
  - MHT as model selection
    - Linear combination of the biomarkers of interest
    - Coefficients tell the significance of the biomarkers
    - Polynomial complexity (compared to combinatorial explosion: testing every possible combination)
    - Transformed problem: X * beta -> y => convex optimization problem
    - New methods/algorithms modify the regularization term (R packages available)
  - Graphical LASSO/SLOPE: not requiring clinical outcome (unsupervised)
  - Graphical SLOPE
    - Precision matrix (n x n, Theta): representing a graph
    - Gaussian MRF
    - Markov property: conditionally independent if the nodes are disconnected (theta = 0), but difficult
  - Case study: Neuroblastoma (NRC), breast cancer
    - p-values versus graph structures (grade 1-3)
    - Difference of networks
    - Node degree distribution plot
  - FDR control: replacing this using graphical LASSO
  - Question: lambda sorted in decreasing order, also sorting the markers
  - Question: the Gaussian noise assumption does not matter much
  - Question: only need to provide FDR -> number of lambdas -> graphical LASSO will calculate them for us
+ Network based ML Algorithms of Intra-relation, Integration, and Inter-relation for Diseases-Drugs
  - Integration of heterogeneous data
    - Layers == networks for different types of data (with causality)
    - Integration of networks: addition/subtraction of networks
      - Disease networks
      - Gene regulatory networks
      - Metabolic pathways
      - Protein interaction networks
      - Etc.
  - Connecting diseases
    - Cosine similarity over a disease vector
    - Scoring for the co-occurrence
  - Semi-supervised scoring
    - Only a few pieces of information on one or two contracted diseases will be given
    - Graph-based SSL algorithm (solution (probabilities) can be obtained in closed form)
    - Passing through activation function, sort the associated diseases
    - Case study: diabetes (Type I and Type II)
  - Hierarchical disease networks
    - Intra- and inter-relation
    - Tridiagonal block matrix (weight matrix)
    - Limitations
      - Existence of rectangular blocks
      - High computational complexity and sparseness; O(n^3)
    - Proposed method
      - Matrix separation
      - Nyström method
      - Changed solution (only needs cheap computational effort)
      - 121.63 -> 12.85 (secs; 500 layers) in computation time
  - Causal network
    - Metabolic pathway + disease association networks
    - Flow function: comparing inflow and outflow to decide the direction
    - Comorbidity scoring
  - Drug network
  - Future work: theoretical work & finding applications
  - Question: edges: connected or not (1/0)
  - Question: demographic data?
prospect for precision medicine
  - Question: images such as MRI are another new area
+ Functional prediction of causal regulatory variants identifies a novel autism gene (topic changed)
  - Cancer dependency maps (by genome-wide screen)
  - Bayesian gene regulatory network
  - Cutoff values on dependency and estimate them respectively
  - Prediction and validation in PDX mice (still under validation)
  - Prediction in TCGA samples
  - Comparison with other algorithms (peaks separation)
+ Computational analysis of intra-tumor heterogeneity in cancer
  - Tumor heterogeneity -> challenge in cancer therapy
  - Major subjects
    - Apply NGS data
    - Analysis of tumor clonality
  - Tumor heterogeneity in colorectal cancer (CRC)
    - 46 positive/42 negative samples
  - PyClone model
    - Hierarchical Bayesian clustering model
    - Oligo-/Multi-clone (low/high heterogeneity)
    - Vascular invasion and stages are dramatically different in these two groups
  - Clinical relevance according to clonality
  - Characteristics of mutations according to clonality

### Keynote Speech 3
+ Fun stories in translational biomedical informatics
  - Translational bridge: basic and clinical
  - Patient data and phenotype are the key
    - e.g., social networks: big data (anonymous social network)
    - Database for comments is available online
  - The power of machine learning
    - Many datasets available
    - Predictive analytics is finally making inroads into healthcare
  - UW Medicine clinical data repository (UW == University of Washington)
    - Statistical analysis on clinical data
    - Enrichment analysis on clinical data (term enrichment)
    - That was only for coded data (20%)
    - 80% is unstructured data; e.g., clinical note/text, image
    - Pilot NLP engine
      - Identifying concepts in discharge notes
    - Lots of questions and opportunities
  - Interpreting genomes and precision medicine
    - 80% of rare diseases have genetic origins
    - Rare diseases are significant to human health
    - Many mechanisms of pathogenicity (what function/why this function)
    - MutPred: using
functional properties (MutPred2 is now available)
  - Community challenges enable real citizen science
    - Started in protein structure prediction -> CASP (Critical Assessment of Structure Prediction)
    - CASP targets can also be used for evaluating function prediction
    - CAGI challenge

## GIW/BIOINFO Day 3
### GIW 8: Cancer Bioinformatics
+ WISARD: a comprehensive workbench for the analysis of related database
  - WISARD: Workbench for Integrated Superfast Association Test with Related Dataset
  - Handling 9 genotype file formats
  - Supports five operating systems
  - WISARD website
  - Why fast
    - Matrix optimization
    - Analysis-wise optimization
+ Isoform specific gene expression analysis of KRAS in the prognosis of lung adenocarcinoma patients
  - Lung cancer (3 types) -> KRAS mutant patients show worst prognostic records
  - KRAS4A & KRAS4B
  - Large-scale genome data from TCGA project
  - Isoform-level expression analysis (proportions of K-RAS isoforms)
  - Linear regression analysis (# mutation vs. expression): shows a clear association, especially in K-RAS4A
  - Survival analysis (better in KRAS mutant/amplified groups)
  - Multivariate analyses
  - Low K-RAS proportion and expression is the key factor for the disease
  - Limitations: the use of overall survival & the lack of a proper independent cohort for validation
  - TCGA: largest cohort dataset with well-defined clinical records
  - Question: Better to use the results for predictions
+ BRCA-Pathway: a structural integration and visualization system of TCGA Breast Cancer data on KEGG Pathways
  - Motivation: biological mechanism
  - Web service
    - TCGA data + pathway + driver genes, TFs, hallmark gene sets
    - Use correlation coefficients between genes
    - Survival analysis (mutation/no mutation; p-value=0.05) and oncoprint
    - User data visualization
    - REST-API: easy access by users
  - Mapping multi-omics data onto KEGG pathway
  - Question: Why breast cancer? The largest data size; other kinds of cancer are future work
  - Question: Possible for classification?
Can choose subgroups for various combinations of clinical records
+ Stage-Dependent Gene Expression Profiling in Colorectal Cancer
  - Previous work has focused on the correlation between cancer stage and individual gene expression
    - Examine 9 microRNAs
    - Gene expression patterns have been analyzed for limited stages with small samples, without proper data pre-processing
  - Workflow
    - Data acquisition
    - Data modification
    - Analysis of functional characteristics
    - Analysis of structural characteristics
  - Computing correlation using LMER (linear mixed-effect regression model)
    - A representative gene RPSY1 shows different correlation coefficients between LM and LMER
  - Functional characteristics of the stage-dependent genes (subgroups: INC/DEC)
    - Low expression of TAP1 is associated with the development or progression of colorectal cancer
    - Both types of genes are closely associated with each other
  - DFS using KM analysis
    - High expression and low expression for two subgroups
    - MPP2 (p-value=0.012)
    - TXNL1 (p-value=0.072)
  - Continuously increasing genes are related to nerves and development system while other genes are related to cell cycle and metabolic process
  - Question: Is LMER really better? Or are just a few cases different while other results are the same? Basically the same.
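
The KM (Kaplan-Meier) analysis mentioned for the high/low expression subgroups can be illustrated with a minimal estimator. This is only a sketch in plain Python, not the presenters' pipeline: the function name `kaplan_meier` and the toy follow-up data are assumptions made for illustration.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times  -- follow-up durations (hypothetical units, e.g. months)
    events -- 1 if the event (e.g. relapse) was observed, 0 if censored
    Returns a list of (time, survival probability) at each event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed = 0   # events occurring exactly at time t
        leaving = 0    # subjects leaving the risk set at time t
        while i < len(data) and data[i][0] == t:
            observed += data[i][1]
            leaving += 1
            i += 1
        if observed:
            # Multiply by the conditional probability of surviving time t.
            survival *= 1 - observed / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Toy data (made up): four patients, the second one is censored.
print(kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1]))
# prints [(1, 0.75), (3, 0.375), (4, 0.0)]
```

Running the estimator once per expression subgroup gives the two curves that a log-rank test would then compare.
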
+ MGRFE: multilayer recursive feature elimination based on embedded genetic algorithm for cancer classification
  - (presenter not shown)

### BIOINFO: Big Data Informatics Supported by Center for Plant Aging Research
+ Cancer Precision Medicine through Transforming Biomedical Big Data to Actionable Knowledge
  - Big data challenges
    - Efficient storage and access
    - Data analytics to mine valuable information (most focus)
  - Biomedicine big data challenge: multi-scale, complex, heterogeneous, and distributed
  - Constructing BD2K intelligence engine
    - Biomedical Entity Search Tool (BEST): `"<disease>"+target:[mutation | drug]`
  - Predict best drug given patient information
    - Synergistic effect S = observed effect E - additive effect A
    - Sub-challenge 1-A: predict drug synergy from all available data
  - Prediction model
    - RBF SVR (regression)
    - Choose rather simple model due to label imbalance
    - Use AWS to speed up the training time (~1 hr for 1800 cores)
  - Translatability of deep learning model (difficult; tried various types of models)
    - Regression tasks not good, but tree-based models provide reliable results
    - Synergistic rules generation (qualitative analysis)
  - Biomedical QA with deep neural reader (attention-sum reader & biological feature)

### Keynote Speech 4
+ Peer through your Brain with Multilevel Data
  - World's largest brain dataset: Chinese southern biobank
  - How to integrate these data to make predictions?
    - Association study; causality; modeling
    - I. Nonlinear association (need/try nonlinear transformation using different kernels)
    - II. Causality inference (of time-course data)
    - Current trend: kernel methods
  - Intelligent image reader
  - Depression
    - Searching the roots of depression
    - Data driven approaches (time-course data); functional connectivity -> abnormal links
    - Reward and punishment processing in depression
    - Depression uncouples brain hate circuit
    - Painting the full depression circuit
  - Quantity of your brain
    - Dynamic brain
    - ffMRI + EEG
    - Entropy vs.
entropy rate
    - Entropy gene: expression strongly related to entropy
  - Integrated Multi-scale data for Prediction (IMuDP)
  - Question: Animal models will be built later? (human first due to machines for fast scanning)
  - Question: How to identify specific areas for specific emotion? Literature
  - Question: How to quantify the emotions such as creativity? All sorts: math, picture, images, etc.
  - Question: How to combine these data? Way too large data: dimension reduction -> prediction (but labour intensive; want to make this process automatic)

### GIW 10: Biomarkers Discovery
+ Identifying statistically significant combinatorial markers for survival analysis
  - Survival analysis
    - Methods for statistical analysis of longitudinal data on the occurrence of events
    - Failure time (an event happened)
    - Censoring (survival time not exactly available; e.g., drop out)
    - For finding significant markers
  - Challenge: multiple-testing and finding ALL significant combinations of biomarkers
  - Significant markers for precision medicine
    - Standard therapy -> alternative therapy 1 -> alternative therapy 2
  - Limitless-Arity Multiple-testing Procedure (LAMP)
    - Minimum p-value
      - x: frequency of mutation in GWAS data
      - If p_min > min. threshold -> not testable -> not included in hypothesis
      - Monotonically decreasing in x (for p-values)
    - Cannot be applied to survival data (no time information)
  - Log-rank test
    - Chi squared D.O.F.=1
  - Proposed method: contingency tables
    - p-value lower bound
    - Algorithm similar to LAMP
  - Experiment
    - TCGA breast and ovarian cancer data (gene expression profiles) (BRCA & OV)
    - z-score >=2 -> 1; else -> 0
  - SW available: rtrelator.github.io/SurvivalLAMP
  - Question: 1/0 omitting down-regulated genes -> try to deal with continuous data
  - Question: simulated data? NO
  - Question: SNP? Not yet
  - Question: FDR control?
Not yet
+ Identifying feature genes from gene expression data by twin kernel method
  - Motivation: biological aspect
    - Curse of dimensionality
    - Only few groups of genes are related to them
    - Select “biologically plausible” genes
  - Linear selection methods cannot handle the nonlinear relationships in the dataset well
  - As a result, kernel methods were introduced to deal with the nonlinear relationships
    - Computationally complex due to SVD
    - Local vs. global kernels
  - Want a general unified mixture model with low computational cost that is able to select significant genes
  - CND (conditionally negative definite) (p,q)-kernels were introduced
  - (p,q)-based twin kernel
    - Less computationally intensive
    - Symmetric and CND
    - Q-analogue of a mixed kernel of two multi-quadric kernels
    - Rather a distance measure d = ||phi(x) - phi(x’)||^2 = H_{pq}(x, x’)
  - Algorithm
    - Select top 10 ~ 100 genes as feature genes
    - 10% as testing
    - SVM/KNN as classifier (two-class datasets)
  - Results
    - Easy separation using p,q-domain
    - PCA analysis was also applied
    - Select first 50 genes for survival analysis
  - Question: Parameter tuning? 10-fold CV
  - Question: Contain both global and local properties of polynomial and Gaussian kernels
+ Exact Association Test for Small Size Sequencing Data
  - NGS technology allows identifying all rare SNPs or other types of variants
  - Asymptotic tests
  - Exact Association Test (EXAT) not relying on the assumption of large sample size
  - Generalized CMH test (GCMH)
  - Analyze IPMN patients
    - Small sample size
    - Contingency table
  - Detection rate, QQ plot, Venn diagram for testing methods, pair-wise scatter plot of p-values for testing methods
  - Question: Why normal noise?
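
To make the contrast between asymptotic and exact tests on a small contingency table concrete, here is a minimal sketch of Fisher's exact test in plain Python. The notes do not record the EXAT statistic itself, so Fisher's test is swapped in as the textbook exact test; the function name `fisher_exact_two_sided` and the example counts are assumptions for illustration only.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    With all margins fixed, the top-left cell follows a hypergeometric
    distribution, so no large-sample approximation is needed.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Probability of every table that shares the observed margins.
    probs = {
        x: comb(row1, x) * comb(row2, col1 - x) / total
        for x in range(lowest, highest + 1)
    }
    p_observed = probs[a]
    # Sum the tables that are no more likely than the observed one
    # (small relative tolerance guards against float rounding).
    p_value = sum(p for p in probs.values() if p <= p_observed * (1 + 1e-9))
    return min(1.0, p_value)


# Hypothetical counts: rows = cases/controls, columns = carrier/non-carrier.
print(fisher_exact_two_sided(5, 1, 1, 5))
```

Because the p-value is a finite sum over all tables with the same margins, it stays valid for cohorts of only a handful of patients, which is the situation the talk targets.
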
### Keynote Speech 5
+ Single Cell Tumor Transcriptomics: Algorithms and Applications
  - CellHub: tackling disease, one cell at a time (big program in Singapore; not completed)
    - Fluidigm C1; 10X Chromium
  - Promise of single cell tumor transcriptomics
    - Unbiased analysis of all cell types in tumor, cell type markers, biomarkers, patient stratification
    - EMT/cancer stem cells/new cell types?
  - Challenges
    - Cells stick together; e.g., epithelial cells => custom dissociation
    - Noise => new algorithms
    - Distortion & bias => new algorithms
    - Cost => dropping
  - Algorithms (guiding principles)
    - Make as few assumptions as possible
    - Account for unique properties of the data
    - Exploit unique properties of the data; e.g. non-parametric methods
    - Scalability
  - Three algorithms for single cell transcriptomics
    - Normalization
    - Clustering
    - Differentially-expressed genes
  - Why new normalization method?
    - Scaling is NOT adequate
    - Too many zeros in scRNA-seq
    - imputed Quantile normalization (iQ)
    - Use normalization as the benchmark
    - iQ maximizes the tightness of clusters
  - semi-supervised clustering -> RCA clustering algorithm (R package)
  - NODES for DEs
  - Problem of tumor heterogeneity
    - genetic/stemness/microenvironment
    - EMT markers are up-regulated only in fibroblasts

### GIW 11: Systems Biology and Biological Interactions
+ GxGrare: Gene-gene interaction analysis for rare variants from NGS data
  - Heritability
  - Why rare variants?
  - Issues for rare variants
    - Low power
    - Bi-directional effect
  - Gene-level test for rare variants: burden/non-burden test
  - Issues in genetic interaction analysis
    - Data sparsity
  - MDR overview
  - Validation methods
    - Information gain

## GIW/BIOINFO Day 4
### BIOINFO: Systems Biology
+ Systematic approach for predicting pluripotency of stem cells and human chronological age using DNA methylation data
  - It is now possible to extract information from identical cells (information at the cell level)
  - Research aims
    - Predict cell pluripotency from DNA methylation data
    - Find epigenetic markers for mouse ESCs
  - Windowing approach to extract features from sequence data
  - Cell-cell ordering
    - Using single cell gene expression profile
    - SLICER/TSCAN/Wanderlust
    - Pseudo-time construction
  - Regression modeling
    - F-test; FDR < 0.05
    - Alpha selection of Elastic Net
  - Blood age predictors showed good results with few features
  - Objective
    - tissue-specific age prediction model
  - Methylation data from TCGA and GEO
  - Feature selection
    - Association test
    - Bootstrap analysis using Elastic Net
  - Feature analysis
    - Tissue-specific aging-related CpGs contain more negative ageCpGs
    - Ratio of CpG islands differentiates tissue-specific/common markers
  - Question: Will cancer affect the results? Depends on the type of disease
  - Question: Is aging information the same from one person?
Possibly different
+ A Proteogenomic Analysis of Early Onset Gastric Cancer
  - Early Onset Gastric Cancer (EOGC)
    - Most genomic studies focused on elderly GC patients
    - Poor prognosis
    - Young diffuse-type of GC (from data exploration)
  - 46 significantly mutated genes in GCs
  - Categorization of nonsynonymous somatic SNVs …
  - Utility of proteomic data
+ Network-based augmenting and interpreting disease genomics data
  - None of them (gene expression profiles, GWAS, WES of tumors) are sufficient alone
  - NGSEA
    - Overlap-based and rank-based
    - Drawbacks of GSEA
    - Overview of NGSEA: DEGs -> scoring genes using their network neighbors -> network
    - Mapping drug-disease association
    - Showed very high prediction power for CRC (6 of them are previously known drugs for colorectal cancer)
    - Identify new drug candidates
  - GWAB (genome-wide association boosting)
    - More effectively retrieves disease genes than GWAS alone
  - Driver vs. passenger mutations
    - Subject: how to distinguish them
    - Based on recurrent mutations: use deleteriousness of the mutation
    - MutSigCV (including more information to reduce FPs)
  - MUFFINN
    - Predict driver through pathway information
  - No comprehensive gold standard cancer gene set
    - Each cancer gene set has different trade-off criteria
+ Drug repositioning for cancer therapy based on large-scale drug-induced transcriptional signatures
  - Challenges of drug development; e.g., high risk of failure
  - Drug repositioning (DR)
  - Transcriptome-based drug discovery
  - CMAP approach

### Keynote Speech 6
+ (Sun Kim)
  - 3D time series analysis (genes, time, conditions)
    - Detecting cluster patterns
  - Domain knowledge can be used more aggressively -> biological networks (domain knowledge into a computational framework)
  - Perturbed pathway
    - Cross-correlation between differential expression vectors
  - Influence maximization in the time-bounded network
    - Differential expression vectors
    - Delay -> maximum cross correlation
  - Wish to use literature and your input as domain
expert knowledge in a single computational framework
  - No guarantee that all pathways selected by analysis are relevant to the context of data
  - AI modeling?
    - Deep learning with domain (biology) knowledge
    - We know little about biology though
  - DeepFam
    - Based on CNN
    - Alignment-free method
    - Modeling family without alignment (usually faster)
    - But still black box? Capturing motifs
  - Hybrid approach of relation network & CNN for breast cancer subtype classification
    - DL needs grid-like data; e.g., matrix
    - DeepMind for relation network paper?
    - Input (MA) -> PPIN -> convolution on graph -> pooling -> relation network for each edge -> MLP -> ensemble -> softmax(subtypes)
    - y = V g V^T x, g: learned (from relation network)
    - LumA, LumB, Her2, Basal; TCGA RNA-seq expression
    - Monte-Carlo cross-validation is used
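
Monte-Carlo cross-validation, mentioned in the last bullet, repeats a random train/test split many times instead of partitioning the data into fixed folds. The sketch below shows only the splitting logic in plain Python; the function name `monte_carlo_splits`, its defaults, and the sample count are assumptions for illustration, and the talk's actual settings were not noted.

```python
import random


def monte_carlo_splits(n_samples, n_repeats=5, test_fraction=0.2, seed=0):
    """Yield (train_indices, test_indices) for repeated random splits.

    Unlike k-fold CV, each repeat draws a fresh random test set, so a
    sample may appear in several test sets or in none of them.
    """
    rng = random.Random(seed)          # seeded for reproducible splits
    n_test = max(1, round(n_samples * test_fraction))
    indices = list(range(n_samples))
    for _ in range(n_repeats):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        yield sorted(shuffled[n_test:]), sorted(shuffled[:n_test])


# Hypothetical usage: 10 samples, 3 repeats, 30% held out each time.
for train, test in monte_carlo_splits(10, n_repeats=3, test_fraction=0.3):
    print(len(train), len(test), test)
```

In practice a model would be fitted on `train` and scored on `test` inside the loop, with the scores averaged over the repeats.
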