# GIW/BIOINFO 2017 Conference
###### tags: `shared` `conference`
[TOC]
## GIW/BIOINFO 2017 in Seoul Paper presentations notes
## GIW/BIOINFO Day 1
### Tutorial 1: Machine Learning
+ Introduction to Sparse Coding in Genomics Data Analysis
- Sparse coding to find biomarkers
- First part of part I: introduction to ML, from definition (weighting, activation function) to mentioning high dimensional data (and thus sparse coding)
- Idea of sparse coding: signs do not matter but the magnitudes matter => force some weights to be exactly zero
- Machine learning
- Training data: parameter learning; validation data: hyper-parameter learning (both used for learning)
- Testing data: for performance evaluation
- Assumption: data points are iid
- Main objective is the regularizer
- Prediction models and data models
- Prediction model: y <- f_w(x)
- Data model: y=f_w(x) + epsilon, zero-mean Gaussian noise N(0, sigma^2)
- Model space -> (x,y): w (sampling); (x,y) -> model space: w^hat (estimation)
- Model space is also known as hypothesis space
- Sparse coding
- s-sparse if nnz(w) <= s (nnz == number of non-zero)
- Estimation (sparse coding)
- Least Absolute Shrinkage and Selection Operator (LASSO)
- Fused LASSO: D matrix as a penalty matrix
- 1-D fused LASSO: find step-wise functions to find differentially-expressed data
- 2-D fused LASSO: fMRI images
- Dantzig selector
- Three forms of sparse coding
- What are we looking for by LASSO?
- Types of error
- Estimation error: w* = w ?
- Variable selection error: supp(w*) != supp(w), where supp(w):={i in {1, …, p}: w_i != 0}
- Second type error often matters (what biomarkers you use)
- How to choose the hyper-parameters? (figure)
- CV has no guarantee on the variable selection error
- Notable results for LASSO
- LASSO: finite sample control
- LASSO selects correct variables even when s is very small
- LASSO begins to include false variables even when s < n/(2log p)
- R packages: glmnet, SLOPE
- HIV dataset available: https://goo.gl/c6fCMx
- Genotypic predictors of human immunodeficiency virus type 1 drug resistance, Soo-Yon Rhee et al., PNAS 2005
- `library(glmnet)` # load the package that provides glmnet()
- `hiv.train$x` # 704 observations of 208 binary mutation variables
- `fit = glmnet(hiv.train$x, hiv.train$y)` # fit the model
- `plot(fit, xvar="lambda", main="HIV model coefficient paths")` # plot the paths for the fit
- Why graphical models? (recent research)
- We are often interested in their relations
- Gaussian Markov random field (Gaussian MRF, GMRF)
- Correlation (local perspective) vs. conditional correlation (global perspective)
- Conditional correlation can be obtained by observing the interacting nodes
- Sparser graph -> more interpretable
- GMRF optimization problem (can be solved in a reasonable time even if the number of genes is large)
- Problem: the matrix is NOT sparse -> not a sparse graph
- Grade 1 breast cancer (unsupervised: only microarray data used)
- Choosing lambda
- Biomarker discovery
- Genetic data
- Classical model: multiple hypothesis testing, calculating p-values for them
- Benjamini-Hochberg vs. Bonferroni
- Multivariate hypothesis testing: combinatorial explosion, hence computationally expensive
- MHT as model selection: X * beta -> y, where beta is a sparse vector
- Model selection by sparse coding
- Using SL1: SLOPE
- `fit.slope = SLOPE(hiv.train$x, hiv.train$y, fdr=0.2)`
- `fit.slope`
- Biomarkers -> relations
- On-going research
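
To make concrete how the L1 penalty sets some weights exactly to zero, here is a minimal sketch of LASSO fitted by cyclic coordinate descent with soft-thresholding, in plain Python. It is not a description of what `glmnet` does internally. The function names (`soft_threshold`, `lasso_cd`, `support`), the objective scaling `(1/2n)*||y - Xw||^2 + lam*||w||_1`, and the toy data are assumptions made for illustration.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; values with |z| <= t become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Minimise (1/2n)*||y - Xw||^2 + lam*||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # correlation of column j with the residual that ignores feature j
            rho = 0.0
            for i in range(n):
                pred_wo_j = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_wo_j)
            rho /= n
            w[j] = soft_threshold(rho, lam) / col_sq[j]
    return w


def support(w, tol=1e-12):
    """supp(w): indices of the non-zero weights."""
    return {j for j, wj in enumerate(w) if abs(wj) > tol}


# Toy design with orthogonal columns; only the first column drives y.
X = [[1, 1, 1], [1, -1, 1], [-1, 1, 1], [-1, -1, 1]]
y = [2, 2, -2, -2]
w_hat = lasso_cd(X, y, lam=0.5)
print(w_hat, support(w_hat))
```

In this toy case the selected support is correct while the estimated weight is shrunk below the true value of 2, which illustrates the difference between estimation error and variable selection error noted above.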
### GIW 1: Biological networks
+ BTNET: Boosted tree based gene regulatory network inference algorithm using time-course measurement data
- Transcription factors - target genes
- How to identify GRN
- Correlation, information theoretic, dynamic Bayesian network, tree-based ensemble
- Problem: steady-state expression data but dynamic biological process
- Adaboost: Focus on samples with poor predictions
- Gradient boosting: Trained based on the error
- Variable Importance Score (VIS): contribution to the reduction of the variance of target value (i.e., target gene expression)
- AUROC/AUPR as evaluation metrics
- Future work: Web interface & dynamic inference
- BTNET is available online, implemented in Python
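
As a reminder of how the AUROC metric mentioned above can be computed for a ranked list of predicted regulatory edges, here is a minimal sketch in plain Python. It uses the pairwise (rank-based) definition of AUROC. The function name `auroc` and the toy scores are assumptions for illustration and are not taken from BTNET; AUPR is not shown.

```python
def auroc(scores, labels):
    """AUROC as the probability that a random true edge outranks a random
    false edge (ties count as 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative edge")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical importance scores for four candidate edges; 1 marks a true edge.
print(auroc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # 0.75
```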
+ SPSNet: Subpopulation-sensitive network-based analysis of heterogeneous gene expression data
- Disease heterogeneity (lies at the molecular level)
- Systematic biases enter in the form of batch effects
- DE analysis at gene level -> pathway level -> sub-network level
- Disease changes only certain part of the whole pathway (not the whole pathway)
- Generating candidate subnetworks -> compute subnetwork scores -> determine statistical significance
- Rank genes in the order of their expression -> fuzzy: top-10 (1) ~ bottom-10 (0)
- Node & nodes within 5 steps -> pathway/subnetwork
- Simulation methodology (SPSNet): N1 & N2 (artificially up/down-regulated) & subnetwork information
- SPSNet can be used to recognize and eliminate extrinsic heterogeneity such as batch effect
+ Network-based identification of pathway-specific protein domains and their implication for human disease
- Network propagation
- co-pathway protein network
- Pathway specificity
- PPA score between protein-pathway
- DPA score between domain-pathway
+ A homologous mapping method for three-dimensional reconstruction of protein networks reveals disease-associated mutations
- Network-based approach to cell behaviors and disease mechanism
- PPI data is incomplete
- High false positive rate
- hDiSNet: human 3-dimensional structural PPI network
- Validation: gene ontology (biological function) (biological process and cellular component)
- Validation: gene co-expression profiles (gene expression profiles)
- Mutations prefer occurring in the interacting domain
- Method for predicting/construction of PPI network
+ Comorbidity Scoring with causal Disease Networks
- Disease association network -> disease causality network
- Main steps: Disease association -> disease causality (pathway & medical literature) -> SSS^c algorithms
- Proposed method
- Settling directed edges (metabolic pathway & literature)
- Limitation to SSL for causality matrix: non-symmetry -> not good for Laplacian
- Toy example: graph with(out) direction
### Keynote Speech 1
+ Studying Stem Cell Systems through multiple ‘omics data
- Wnt: “wint signaling”; mesodermal; lineages
- Biological sample: XXX-seq data
- Integrated analysis -> SNP (“snip”)
- Modeling Wnt signaling pathway
- “Organizing the first 24 hours of life”
- Rules of developmental signaling pathways: initiation -> progression -> termination
- Terminator of Wnt signaling
- Highly induced in response to Wnt
- Transcription factor
- Repressive domain
- Acts globally
- SP5 binds in promoter regions of many Wnt target genes - what if disabled?
- Wildtype (WT) vs. disabled
- SP5 reins in their response to Wnt
- What makes it so selective?
- Generation and characterization of a mesodermal progenitor population
- RNA-seq reveals 2 key sets of genes
- What organizes neural progenitor cells?
- Wnt instructs A/P positional identity of NPCs
- Challenge: biological -> computational
- Question: Data driven analysis?
- Haven’t done yet (presently, query-motivated)
- Waiting to be mined
### GIW 2: Sequence Analysis, Computational Methods
+ Introducing difference recurrence relations for faster semi-global alignment of long sequences
- Genealogy of Alignment Algorithms -> Contribution: Difference DP/libgaba/new implementation technique
- Narrower variables allow for parallelization
- libgaba available on GitHub
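
The difference recurrences and the libgaba implementation are not reproduced here. As background only, this is a minimal sketch of plain semi-global alignment scoring by dynamic programming, where the whole query must be aligned but skipping a prefix or suffix of the reference is free. The function name `semi_global_score`, the linear gap model, and the scoring values are assumptions for illustration.

```python
def semi_global_score(query, ref, match=1, mismatch=-1, gap=-1):
    """Best score for aligning the whole query against any substring of ref."""
    m = len(ref)
    # Row 0 is all zeros because skipping a prefix of ref is free.
    prev = [0] * (m + 1)
    for i in range(1, len(query) + 1):
        # Column 0: the first i query bases are all aligned against gaps.
        curr = [i * gap] + [0] * m
        for j in range(1, m + 1):
            sub = match if query[i - 1] == ref[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + sub,  # query base against ref base
                          prev[j] + gap,      # query base against a gap
                          curr[j - 1] + gap)  # ref base against a gap
        prev = curr
    # Taking the maximum over the last row makes the ref suffix free.
    return max(prev)


print(semi_global_score("ACGT", "TTACGTTT"))  # 4
```

Each cell depends only on its three neighbours, so the matrix can also be described by differences between adjacent cells rather than by absolute scores.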
+ MUGAN: Multi-GPU accelerated AmpliconNoise server for rapid microbial diversity assessment
- Expectation-maximization (EM)
### BIOINFO: RNA Informatics
+ Integrative analysis of genomic and epigenomic regulation of the transcriptome in liver cancer
- Mutations acquired by tumor recurrence
- Mutations in recurrent HCC
- Reprogrammed Expression by Recurrence (RER) (by hierarchical clustering/KM analysis)
- GOLGB1 and SF3B3 are novel targets
- SOX4 and IL-8 are downstream targets (interaction networks)
- DNA copy numbers and methylation
- Look at: distributions and correlation
- Multiomics-based classification
- Validation using TCGA
- Subtype markers
+ Differential, pathway, and network analysis of RNA-seq data
- Read count data
- RNA composition
- Poisson model
- calculate the probability to observe k read counts
- But really Poisson? No. Variance >> mean
- Negative binomial distribution: composition of Poisson and gamma distribution
- lambda ~ Gamma(alpha,beta), alpha: dispersion variable
- R: DESeq, edgeR, etc.
- Focus on AUC for performance evaluation
- Read count bias (Oshlack and Wakefield, 2009)
- Dispersion is the upper and lower bounds of SNR
- Pathway (GSEA) analysis of RNA-seq data
- Step 1: Normalize the read count data using TMM, DESeq, voom (not RPKM)
- Step 2: Apply GSEA
- Small replicate: gene-permuting GSEA (often used)
- But, it will cause many false positives
- Instead, use absolute gene statistic in one-tailed GSEA
- Gene-set clustering
- Difficult to identify biological themes
- GO, DAVID, Enrichment Map
- Expathnet: new web server
- PPI weighted distance
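
To illustrate the "variance >> mean" point from the RNA-seq talk above, here is a minimal sketch in plain Python that compares a Poisson model with the negative binomial obtained by mixing Poisson rates over a gamma distribution. It assumes the gamma is parameterised by shape `alpha` and rate `beta` (the notes do not say which parameterisation the speaker used); the function names and the numerical values are illustrative only.

```python
from math import exp, lgamma, log


def poisson_pmf(k, lam):
    """Probability of observing k read counts under Poisson(lam)."""
    return exp(k * log(lam) - lam - lgamma(k + 1))


def nb_pmf(k, alpha, beta):
    """P(K = k) when K | lam ~ Poisson(lam) and lam ~ Gamma(shape=alpha, rate=beta)."""
    p = beta / (1.0 + beta)
    log_coef = lgamma(k + alpha) - lgamma(k + 1) - lgamma(alpha)
    return exp(log_coef + alpha * log(p) + k * log(1.0 - p))


def mean_and_variance(pmf, k_max=2000):
    """Numerically sum the pmf over 0..k_max to get the mean and variance."""
    probs = [pmf(k) for k in range(k_max + 1)]
    mean = sum(k * pk for k, pk in enumerate(probs))
    var = sum((k - mean) ** 2 * pk for k, pk in enumerate(probs))
    return mean, var


# Same mean (20) under both models; only the negative binomial is overdispersed.
print(mean_and_variance(lambda k: poisson_pmf(k, 20.0)))
print(mean_and_variance(lambda k: nb_pmf(k, alpha=2.0, beta=0.1)))
```

Under this parameterisation the negative binomial has mean `alpha/beta` and variance `alpha/beta + alpha/beta^2`, so the extra variance grows as `alpha` shrinks.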
## GIW/BIOINFO Day 2
### BIOINFO: Cancer Genomics
+ Utilization of PDX library and cancer genome database for precision oncology
- Why sequence cancer samples
- Understand more about cancers
- Ideal study plan for cancer genome research
- Large number of samples
- High depth of sequencing
- small cancer cell proportion
- cancer cell heterogeneity
- Large area of the whole genome
- structural variation
- variation in noncoding region
- Many samples from the same patient
- Multiple site from the same tumor
- Normal sample (blood, adjacent normal tissue)
- Metastatic loci
- Samples from different phase
- Integration of multi-omics data
- Transcriptome
- Epigenome (DNA methylation, Histone modification, chromatin structure)
- Proteome
- (Important) With clinical outcome
- Clinical information we need if possible
- PDX: patient-derived xenograft (drug, clinical data, genomic data)
- Targeted mainly on breast cancer (most in eastern Asia)
- Make PDX -> experiment (now in process)
- Metastasis study using xenograft tissues
- Drug response test with PDX models
- Clinical cancer genome database: http://ccgd.snu.ac.kr
- Question: Since fewer patients recurred, it’s not good for the research (success rate is around 30%)
+ Systematic analysis of cancer omics and phenotypic data for finding synthetic lethal target and biomarker combinations
- Challenges in cancer therapies
- Wide variety of cellular response against target-specific cancer drugs
- Difficult to predict clinical outcome
- How to effectively combine different information of biomarkers?
- Analysis of cancer omics and siRNA screening data: cancer phenotype
- Cell line enrichment analysis (CLEA)
- STK11 loss of function is a potential biomarker for lung cancer therapy
- Major mutation pattern in patient samples
- STK11mt is unique in lung cancers
- ATP1A1 is a selective target for STK11 mutant cancers
- Why siRNA library screening?
- Algorithm for building network: weighted Tanimoto score
- Pathway analysis and validate the biological functions
- On-going: screening of a CSC inhibitor on a colon cell line panel; transcriptome analysis of the sensitive cell lines
- Feature extraction from images
- Omics data mining
+ Pilot targeted next generation sequencing data
- HNSCC: head and neck cancer
- More precise target is needed
- General design of umbrella trial
- Part 1: pilot targeted next generation sequencing (NGS pipeline … etc.)
- Part 2: umbrella trials
- Establish web-based report system
- Molecular tumor board meeting (MTB) for interpreting NGS data
- TP53 was the most common mutational event occurring in 47 patients (51%) followed by CDKN2A (n=23), CCND1 (n=22), PIK3CA (n=19)
- Comparison of HPV-positive and -negative
+ Application of RNA sequencing data in cancer research
- N=59: 38 for experiment; 21 for control
- Use of WTS data to discern functionality
### Keynote Speech 2
+ Genome Editing in Plants, Animals, and Human Cells
- Genome editing timeline (Kim JS. Nature Protocols)
- Kim H and Kim JS: comparison of programmable nucleases
- CRISPR: adaptive immune system in bacteria
- Genome surgery: in-/ex-vivo therapy
- DNA/genome-wide cleavage scores
- How to avoid off-target effects?
- CRISPR RGEN Tools (Cas-Designer)
### BIOINFO: Machine Learning for Bioinformatics
+ Learning Graphical Structure from High-Dimensional Data using Sorted L1 Regularization
- Bonferroni versus Benjamini-Hochberg (B-H: slightly larger chance of false discovery but much more powerful)
- MHT as model selection
- Linear combination of the biomarkers of interest
- Coefficients indicate the significance of the biomarkers
- Polynomial complexity (compared to combinatorial explosion: testing every possible combination)
- Transformed problem: X * beta -> y => convex optimization problem
- New methods/algorithms modify the regularization term (R packages available)
- Graphical LASSO/SLOPE: not requiring clinical outcome (unsupervised)
- Graphical SLOPE
- Precision matrix (n x n, Theta): representing a graph
- Gaussian MRF
- Markov property: conditionally independent if the nodes are disconnected (theta = 0), but difficult
- Case study: Neuroblastoma (NRC), breast cancer
- p-values versus graph structures (grade 1-3)
- Difference of networks
- Node degree distribution plot
- FDR control: replacing this using graphical LASSO
- Question: lambda sorted in decreasing order, also sorting the markers
- Question: Gaussian noise assumption? Does not matter much
- Question: only need to provide FDR -> number of lambda’s -> graphical LASSO will calculate them for us
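
The Bonferroni versus Benjamini-Hochberg trade-off noted in this talk can be seen on a small example. The following is a minimal sketch in plain Python of the two standard procedures; the function names and the list of p-values are assumptions for illustration, not data from the talk.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject H_i when p_i <= alpha / m (controls the family-wise error rate)."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def benjamini_hochberg(pvals, q=0.05):
    """Reject the k smallest p-values, where k is the largest rank with
    p_(k) <= k * q / m (controls the false discovery rate)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * q / m:
            k_max = rank
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected


# Hypothetical p-values for ten candidate biomarkers.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(sum(bonferroni(pvals)), sum(benjamini_hochberg(pvals)))  # 1 2
```

On these values Bonferroni keeps one marker and Benjamini-Hochberg keeps two, which matches the "more powerful at the cost of some false discoveries" summary.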
+ Network based ML Algorithms of Intra-relation, Integration, and Inter-relation for Diseases-Drugs
- Integration of heterogeneous data
- Layers == networks for different type of data (with causality)
- Integration of networks: addition/subtraction of networks
- Disease networks
- Gene regulatory networks
- Metabolic pathways
- Protein interaction networks
- Etc.
- Connecting diseases
- Cosine similarity over a disease vector
- Scoring for the co-occurrence
- Semi-supervised scoring
- Will be only given a few pieces of information on one or two diseases contracted
- Graph-based SSL algorithm (solution (probabilities) can be obtained in a closed-form)
- Passing through activation function, sort the associated diseases
- Case study: diabetes (Type I and Type II)
- Hierarchical disease networks
- Intra- and inter-relation
- Tridiagonal block matrix (weight matrix)
- Limitations
- Existence of rectangular blocks
- High computational complexity and sparseness; O(n^3)
- Proposed method
- Matrix separation
- Nystrom method
- Changed solution (only need cheap computational efforts)
- 121.63 -> 12.85 (secs; 500 layers) in computation time
- Causal network
- Metabolic pathway + diseases association networks
- Flow function: comparing inflow and outflow to decide the direction
- Comorbidity scoring
- Drug network
- Future work: theoretical work & finding applications
- Question: edges: connected or not (1/0)
- Question: demographic data? prospect for precision medicine
- Question: image such as MRI is another new area
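
For the graph-based SSL scoring mentioned in this talk, a commonly used formulation (assumed here, since the notes do not record the exact objective) minimises `sum_i (f_i - y_i)^2 + mu * f^T L f` with graph Laplacian `L = D - W`, which has the closed-form solution `f = (I + mu*L)^(-1) y`. The sketch below reaches the same solution by a simple fixed-point iteration in plain Python, so no matrix inversion is needed. The function name `ssl_scores`, the value of `mu`, and the toy path graph are illustrative assumptions.

```python
def ssl_scores(W, y, mu=1.0, n_iter=500):
    """Solve (I + mu*L) f = y for a symmetric weight matrix W with zero
    diagonal, using Jacobi iteration (the system is diagonally dominant)."""
    n = len(W)
    deg = [sum(row) for row in W]
    f = [float(v) for v in y]
    for _ in range(n_iter):
        f = [(y[i] + mu * sum(W[i][j] * f[j] for j in range(n)))
             / (1.0 + mu * deg[i])
             for i in range(n)]
    return f


# Path graph 0 - 1 - 2 - 3; only disease 0 is labelled as contracted.
W = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
y = [1, 0, 0, 0]
print(ssl_scores(W, y))
```

The scores decay with distance from the labelled disease, so sorting them gives a ranking of associated diseases. The O(n^3) cost listed under the limitations corresponds to inverting the matrix directly.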
+ Functional prediction of causal regulatory variants identifies a novel autism gene (topic changed)
- Cancer dependency maps (by genome-wide screen)
- Bayesian gene regulatory network
- Cutoff values on dependency and estimate them respectively
- Prediction and validation in PDX mice (still under validation)
- Prediction in TCGA samples
- Comparison with other algorithms (peaks separation)
+ Computational analysis of intra-tumor heterogeneity in cancer
- Tumor heterogeneity -> challenge in cancer therapy
- Major subjects
- Apply NGS data
- Analysis of tumor clonality
- Tumor heterogeneity in colorectal cancer (CRC)
- 46 positive/42 negative samples
- PyClone model
- Hierarchical Bayesian clustering model
- Oligo-/Multi-clone (low/high heterogeneity)
- Vascular invasion and stages are dramatically different in these two groups
- Clinical relevance according to clonality
- Characteristics of mutations according to clonality
### Keynote Speech 3
+ Fun stories in translational biomedical informatics
- Translational bridge: basic and clinical
- Patient data and phenotype is the key
- e.g., social networks: big data (anonymous social network)
- Database for comments is available online
- The power of machine learning
- Many datasets available
- Predictive analytics is finally making inroads into healthcare
- UW medicine clinical data repository (UW == University of Washington)
- Statistical analysis on clinical data
- Enrichment analysis on clinical data (term enrichment)
- That was only for coded data (20%)
- 80% is unstructured data; e.g., clinical note/text, image
- Pilot NLP engine
- Identifying concepts in discharge notes
- Lots of questions and opportunities
- Interpreting genomes and precision medicine
- 80% of rare diseases have genetic origins
- Rare diseases are significant to human health
- Many mechanisms of pathogenicity (what function/ why this function)
- MutPred: using functional properties (MutPred2 is now available)
- Community challenges enable real citizen science
- Started in protein structure prediction -> CASP (Critical Assessment of Structure Prediction)
- CASP targets can also be used for evaluating function prediction
- CAGI challenge
## GIW/BIOINFO Day 3
### GIW 8: Cancer Bioinformatics
+ WISARD: a comprehensive workbench for the analysis of related database
- WISARD: Workbench for Integrated Superfast Association Test with Related Dataset
- Handling 9 genotype file formats
- Support five operating systems
- WISARD website
- Why fast
- Matrix optimization
- Analysis-wise optimization
+ Isoform specific gene expression analysis of KRAS in the prognosis of lung adenocarcinoma patients
- Lung cancer (3 types) -> KRAS mutant patients show worst prognostic records
- KRAS4A & KRAS4B
- Large-scale genome data from TCGA project
- Isoform-level expression analysis (proportions of K-RAS isoforms)
- Linear regression analysis (# mutation vs. expression): shows clear association, especially in K-RAS4A
- Survival analysis (better in KRAS mutant/amplified groups)
- Multivariate analyses
- Low K-RAS proportion and expression is the key factor for the disease
- Limitations: the use of overall survival & the lack of proper independent cohort for validation
- TCGA: largest cohort dataset with well-defined clinical records
- Question: Better to use the results for predictions
+ BRCA-Pathway: a structural integration and visualization system of TCGA Breast Cancer data on KEGG Pathways
- Motivation: biological mechanism
- Webservice
- TCGA data + pathway + driver genes, TFs, hallmark gene sets
- Use correlation coefficients between genes
- Survival analysis (mutation/no mutation; p-value=0.05) and oncoprint
- User data visualization
- REST-API: easy access by users
- Mapping multi-omics data onto KEGG pathway
- Question: Why breast cancer? The largest data size; other kinds of cancer are future work
- Question: Possible for classification? Can choose subgroup for various combination of clinical records
+ Stage-Dependent Gene Expression Profiling in Colorectal Cancer
- Previous works have focused on the correlation between cancer stage and individual gene expression
- Examine 9 microRNAs
- Gene expression patterns have been analyzed for limited stages with small samples, without proper data pre-processing
- Workflow
- Data acquisition
- Data modification
- Analysis of functional characteristics
- Analysis of structural characteristics
- Computing correlation using LMER (linear mixed-effect regression model)
- A representative gene RPSY1 shows different correlation coefficients between LM and LMER
- Functional characteristics of the stage-dependent genes (subgroups: INC/DEC)
- Low expression of TAP1 is associated with the development or progression of colorectal cancer
- Both types of genes are closely associated with each other
- DFS using KM analysis
- High expression and low expression for two subgroups
- MPP2 (p-value=0.012)
- TXNL1 (p-value=0.072)
- Continuously increasing genes are related to nerves and development system while other genes are related to cell cycle and metabolic process
- Question: Is LMER really better? Or are just a few cases different while other results are the same? Basically the same.
+ MGRFE: multilayer recursive feature elimination based on embedded genetic algorithm for cancer classification
- (presenter not shown)
### BIOINFO: Big Data Informatics Supported by Center for Plant Aging Research
+ Cancer Precision Medicine through Transforming Biomedical Big Data to Actionable Knowledge
- Big data challenges
- Efficient storage and access
- Data analytics to mine valuable information (most focus)
- Biomedicine big data challenge: multi-scale, complex, heterogeneous, and distributed
- Constructing BD2K intelligence engine
- Biomedical entity Search Tool (BEST): “<disease>”+target:[mutation | drug]
- Predict best drug given patient information
- Synergistic effect S = Observed effect E + Additive effect A
- Sub-challenge 1-A: predict drug synergy from all available data
- Prediction model
- RBF SVR (regression)
- Choose rather simple model due to label imbalance
- Use AWS to speed up the training time (~1hr. for 1800 cores)
- Translatability of deep learning model (difficult; tried various types of models)
- Regression tasks not good, but tree-based models provide reliable results
- Synergistic rules generation (qualitative analysis)
- Biomedical QA with deep neural reader (attention-sum reader & biological feature)
### Keynote Speech 4
+ Peer through your Brain with Multilevel Data
- World's largest brain dataset: Chinese southern biobank
- How to integrate these data to make predictions?
- Association study; causality; modeling
- I. Nonlinear association (need/try nonlinear transformation using different kernels)
- II. Causality inference (of time-course data)
- Current trend: kernel methods
- Intelligent image reader
- Depression
- Searching the roots of depression
- Data driven approaches (time-course data); functional connectivity -> abnormal links
- Reward and punishment processing in depression
- Depression uncouples brain hate circuit
- Painting the full depression circuit
- Quantity of your brain
- Dynamic brain
- ffMRI + EEG
- Entropy vs. entropy rate
- Entropy gene: expression strongly related to entropy
- Integrated Multi-scale data for Prediction (IMuDP)
- Question: Animal models will be built later? (human first due to machines for fast scanning)
- Question: How to identify specific areas for specific emotion? Literature
- Question: How to quantify the emotions such as creativity? All sorts: math, picture, images, etc.
- Question: How to combine these data? Way too large data: dimension reduction -> prediction (but labour intensive; want to make this process automatic)
### GIW 10: Biomarkers Discovery
+ Identifying statistically significant combinatorial markers for survival analysis
- Survival analysis
- Methods for statistical analysis of longitudinal data on the occurrence of events
- Failure time (an event happened)
- Censoring (survival time not exactly available; e.g., drop out)
- For finding significant markers
- Challenge: multiple-testing and finding ALL significant combinations of biomarkers
- Significant markers for precision medicine
- Standard therapy -> alternative therapy 1 -> alternative therapy 2
- Limitless-Arity Multiple-testing Procedure (LAMP)
- Minimum p-value
- x: frequency of mutation in GWAS data
- If p_min > min. threshold -> not testable -> not included in the hypotheses
- Monotonically decreasing in x (for p-values)
- Cannot be applied to survival data (no time information)
- Log-rank test
- Chi squared D.O.F.=1
- Proposed method: contingency tables
- p-value lower bound
- Algorithm similar to LAMP
- Experiment
- TCGA breast and ovarian cancer data (gene expression profiles) (BRCA & OV)
- z-score >=2 -> 1; else -> 0
- SW available: rtrelator.github.io/SurvivalLAMP
- Question: 1/0 omitting down-regulated genes -> try to deal with continuous data
- Question: simulated data? NO
- Question: SNP? Not yet
- Question: FDR control? Not yet
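
The log-rank test with one degree of freedom mentioned in this talk can be sketched as follows for a single binary marker (for example, z-score >= 2 versus the rest). This is a minimal textbook two-group log-rank test in plain Python; it is not the SurvivalLAMP implementation, and the function name and toy survival times are assumptions for illustration.

```python
from math import erfc, sqrt


def logrank_test(times_a, events_a, times_b, events_b):
    """Return (chi-squared statistic, p-value) with 1 degree of freedom.
    events are 1 for an observed failure and 0 for a censored time."""
    data = [(t, e, 0) for t, e in zip(times_a, events_a)] + \
           [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e == 1})
    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        # subjects still at risk just before time t, per group
        n_a = sum(1 for u, _, g in data if u >= t and g == 0)
        n_b = sum(1 for u, _, g in data if u >= t and g == 1)
        # failures observed exactly at time t
        d_a = sum(1 for u, e, g in data if u == t and e == 1 and g == 0)
        d_b = sum(1 for u, e, g in data if u == t and e == 1 and g == 1)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0.0:
        return 0.0, 1.0
    stat = obs_minus_exp ** 2 / variance
    # survival function of a chi-squared variable with 1 degree of freedom
    return stat, erfc(sqrt(stat / 2.0))


# Hypothetical failure times: marker-positive patients fail earlier.
stat, p = logrank_test([1, 2, 3, 4, 5], [1] * 5, [6, 7, 8, 9, 10], [1] * 5)
print(stat, p)
```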
+ Identifying feature genes from gene expression data by twin kernel method
- Motivation: biological aspect
- Curse of dimensionality
- Only few groups of genes are related to them
- Select “biological plausible” genes
- Linear selection method cannot handle the nonlinear relationship in the dataset well
- As a result, kernel methods were introduced to deal with the nonlinear relationship
- Computationally complex due to SVD
- Local vs. global kernels
- Want a general unified mixture model with low computational cost that is able to select significant genes
- CND (conditionally negative definite) (p,q)-kernels were introduced
- (p,q)-based twin kernel
- Less computationally intensive
- Symmetric and CND
- Q-analogue of a mixed kernel of two multi-quadric kernels
- Rather a distance measure d=|| phi(x) - phi(x’) || ^ 2 = H_{pq} (x,x’)
- Algorithm
- Select top 10 ~ 100 genes as feature genes
- 10 % as testing
- SVM/KNN as classifier (two-class datasets)
- Results
- Easy separation using p,q-domain
- PCA analysis was also applied
- Select first 50 genes for survival analysis
- Question: Parameter tuning? 10-fold CV
- Question: Contain both global and local properties of polynomial and Gaussian kernels
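
The distance form `|| phi(x) - phi(x') ||^2` noted in this talk can be evaluated from kernel values alone, since it expands to `k(x,x) - 2*k(x,x') + k(x',x')`. The sketch below shows this in plain Python. The (p,q)-based twin kernel itself is not reproduced; a Gaussian kernel and a linear kernel are used as stand-ins, and all function names and parameter values are illustrative assumptions.

```python
from math import exp


def gaussian_kernel(x, z, gamma=0.5):
    """Stand-in local kernel: exp(-gamma * ||x - z||^2)."""
    return exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def linear_kernel(x, z):
    """Stand-in global kernel: the ordinary dot product."""
    return sum(a * b for a, b in zip(x, z))


def kernel_distance_sq(kernel, x, z):
    """||phi(x) - phi(z)||^2 expressed through kernel evaluations only."""
    return kernel(x, x) - 2.0 * kernel(x, z) + kernel(z, z)


x, z = [0.0, 0.0], [1.0, 1.0]
print(kernel_distance_sq(linear_kernel, x, z))    # 2.0, the squared Euclidean distance
print(kernel_distance_sq(gaussian_kernel, x, z))  # equals 2 - 2*exp(-1) for these inputs
```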
+ Exact Association Test for Small Size Sequencing Data
- NGS technology makes it possible to identify all rare SNPs or other types of variants
- Asymptotic tests
- Exact Association Test (EXAT) not relying on the assumption of large sample size
- Generalized CMH test (GCMH)
- Analyze IPMN patients
- Small sample size
- Contingency table
- Detection rate, QQ plot, Venn-diagram for testing methods, pair-wise scatter plot of p-value for testing methods
- Question: Why normal noise?
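
EXAT itself is not reproduced here. To illustrate what "exact" means for a contingency table with a small sample size, the following is a minimal sketch of the classical two-sided Fisher exact test for a 2x2 table in plain Python, which enumerates all tables with the same margins instead of relying on an asymptotic approximation. The function name and the example tables are assumptions for illustration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # hypergeometric probability of x in the top-left cell, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # sum over all tables that are no more probable than the observed one
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))


# Hypothetical counts: variant carriers vs. non-carriers in cases and controls.
print(fisher_exact_two_sided(3, 0, 0, 3))
print(fisher_exact_two_sided(3, 1, 1, 3))
```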
### Keynote Speech 5
+ Single Cell Tumor Transcriptomics: Algorithms and Applications
- CellHub: tackling disease, one cell at a time (big program in Singapore; not completed)
- Fluidigm C1; 10X Chromium
- Promise of single cell tumor transcriptomics
- Unbiased analysis of all cell types in tumor, cell type markers, biomarkers, patient stratification
- EMT/cancer stem cells/new cell types?
- Challenges
- Cells stick together; e.g., epithelial cells => custom dissociation
- Noise => new algorithms
- Distortion & bias => new algorithms
- Cost => dropping
- Algorithms (guiding principles)
- Make as few assumptions as possible
- Account for unique properties of the data
- Exploit unique properties of the data; e.g. non-parametric methods
- Scalability
- Three algorithms for single cell transcriptomics
- Normalization
- Clustering
- Differentially-expressed genes
- Why new normalization method?
- Scaling is NOT adequate
- Too many zeros in scRNA-seq
- imputed Quantile normalization (iQ)
- Use normalization as the benchmark
- iQ maximizes the tightness of clusters
- semi-supervised clustering -> RCA clustering algorithm (R package)
- NODES for DEs
- Problem of tumor heterogeneity
- genetic/stemness/microenvironment
- EMT markers are up-regulated only in fibroblast
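
As background for the normalization part of this keynote, here is a minimal sketch of plain quantile normalization in Python, which forces every cell to share the same expression distribution rather than only rescaling it. The imputation step of iQ is not reproduced, and ties (such as the many zeros in scRNA-seq) are broken naively by position. The function name and the toy matrix are assumptions for illustration.

```python
def quantile_normalize(columns):
    """columns: one list of expression values per cell, all the same length.
    Returns new lists in which every cell has the same sorted values."""
    n = len(columns[0])
    if any(len(col) != n for col in columns):
        raise ValueError("all columns must have the same length")
    sorted_cols = [sorted(col) for col in columns]
    # reference distribution: mean of the r-th smallest value across cells
    reference = [sum(col[r] for col in sorted_cols) / len(columns)
                 for r in range(n)]
    result = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])  # ties broken by position
        normalised = [0.0] * n
        for rank, i in enumerate(order):
            normalised[i] = reference[rank]
        result.append(normalised)
    return result


# Two hypothetical cells measured on three genes.
print(quantile_normalize([[5, 2, 3], [4, 1, 6]]))
# [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```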
### GIW 11 Systems Biology and Biological Interactions
+ GxGrare: Gene-gene interaction analysis for rare variants from NGS data
- Heritability
- Why rare variants?
- Issues for rare variants
- Low power
- Bi-directional effect
- Gene-level test for rare variants: burden/non-burden test
- Issues in genetic interaction analysis
- Data sparsity
- MDR overview
- Validation methods
- Information gain
## GIW/BIOINFO Day 4
### BIOINFO: Systems Biology
+ Systematic approach for predicting pluripotency of stem cells and human chronological age using DNA methylation data
- It is now possible to extract information from identical cells (information at the cell level)
- Research aims
- Predict cell pluripotency from DNA methylation data
- Find epigenetic markers for mouse ESCs
- Windowing approach to extract features from sequence data
- Cell-cell ordering
- Using single cell gene expression profile
- SLICER/TSCAN/Wanderlust
- Pseudo-time construction
- Regression modeling
- F-test; FDR < 0.05
- Alpha selection of Elastic Net
- Blood age predictors showed good results with small features
- Objective
- tissue-specific age prediction model
- Methylation data from TCGA and GEO
- Feature selection
- Association test
- Bootstrap analysis using Elastic Net
- Feature analysis
- Tissue-specific aging-related CpGs contain more negative ageCpGs
- Ratio of CpG islands differentiates tissue-specific/common markers
- Question: Cancer will affect the results? Depends on the type of diseases
- Question: Is the aging information the same within one person? Possibly different
+ A Proteogenomic Analysis of Early Onset Gastric Cancer
- Early Onset Gastric Cancer (EOGC)
- Most genomic studies focused on elderly GC patients
- Poor prognosis
- Young diffuse-type of GC (from data exploration)
- 46 significantly mutated genes in GCs
- Categorization of nonsynonymous somatic SNVs …
- Utility of proteomic data
+ Network-based augmenting and interpreting disease genomics data
- None of them (gene expression profiles, GWAS, WES of tumors) are sufficient alone
- NGSEA
- Overlap-based and rank-based
- Drawbacks of GSEA
- Overview of NGSEA: DEGs -> scoring genes using their network neighbors -> network
- Mapping drug-disease association
- Showed very high prediction power for CRC (6 of them are previously known drugs for colorectal cancer)
- Identify new drug candidates
- GWAB (genome-wide association boosting)
- More effectively retrieves disease genes than GWAS alone
- Driver vs. passenger mutations
- Subject: how to distinguish them
- Based on recurrent mutations: use deleteriousness of the mutation
- Mutsig CV (including more information to reduce FPs)
- MUFFIN
- Predict driver through pathway information
- No comprehensive gold standard cancer gene set
- Each cancer gene set has different trade-off criteria
+ Drug repositioning for cancer therapy based on large-scale drug-induced transcriptional signatures
- Challenges of drug development; e.g., high risk of failure
- Drug repositioning (DR)
- Transcriptome-based drug discovery
- CMAP approach
### Keynote Speech 6
+ (Sun Kim)
- 3D time series analysis (genes, time, conditions)
- Detecting cluster patterns
- Domain knowledge can be used more aggressively -> biological networks (domain knowledge into a computational framework)
- Perturbed pathway
- Cross-correlation between differential expression vector
- Influence maximization in the time-bounded network
- Differential expression vectors
- Delay -> maximum cross correlation
- Wish to use literature and your input as domain expert knowledge in a single computational framework
- No guarantee that all pathways selected by analysis are relevant to the context of data
- AI modeling?
- Deep learning with domain (biology) knowledge
- We know little about biology, though
- DeepFam
- Based on CNN
- Alignment-free method
- Modeling family without alignment (usually faster)
- But still black box? Capturing motifs
- Hybrid approach of relation network & CNN for breast subtype classification
- DL needs grid-like data; e.g., matrix
- DeepMind for relation network paper?
- Input (MA) -> PPIN -> convolution on graph -> pooling -> relation network for each edge -> MLP -> ensemble -> softmax(subtypes)
- y=VgV^Tx, g: learned (from relation network)
- LumA, LumB, Her2, Basal; TCGA RNA-seq expression
- Monte-Carlo cross-validation is used
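
Monte-Carlo cross-validation, as mentioned above, repeatedly draws a random train/test split instead of partitioning the data into fixed folds. Here is a minimal sketch in plain Python; the function name, the number of repetitions, the test fraction, and the seed are assumptions for illustration, since the notes do not record the settings used in the talk.

```python
import random


def monte_carlo_splits(n_samples, n_splits=10, test_fraction=0.2, seed=0):
    """Yield (train_indices, test_indices) for repeated random hold-out splits.
    Unlike k-fold CV, a sample may appear in several test sets or in none."""
    rng = random.Random(seed)
    n_test = max(1, int(round(n_samples * test_fraction)))
    for _ in range(n_splits):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        yield sorted(indices[n_test:]), sorted(indices[:n_test])


# Five repetitions on 50 hypothetical samples; a classifier would be
# trained on `train` and evaluated on `test` inside the loop.
for train, test in monte_carlo_splits(50, n_splits=5):
    print(len(train), len(test))
```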