
# Dealing with missing values

After performing RAMClust (SOP 5), the features can be passed directly to MUVR (as the X matrix). MUVR is scale-invariant, so no data normalization (log transformation, scaling, or centering) is needed at this point. For the selected features of interest, however, some normalization may be needed afterwards, depending on the statistical model. In most cases, a log transformation is necessary. Before performing it, check whether the dataset contains missing (N/A) or zero values, because the logarithm of a missing value stays missing and the logarithm of zero is negative infinity.

- If you have missing values, a dynamic imputation, e.g. mvImpWrap() from the StatTools package (https://gitlab.com/CarlBrunius/StatTools), is probably a good idea:

````
library(StatTools)
library(doParallel) # for parallel computing

nCore <- detectCores() - 1 # leave one core free

imputed <- mvImpWrap(
  X,
  guess = NULL,
  forceZero = TRUE, # TRUE for PLS; does not matter for RF
  method = c("PLS", "RF"), # RF is slower but performs slightly better; PLS is faster
  rfMeth = c("rf", "Rborist", "ranger"), # does not need to be specified if using PLS
  nComp = 2,
  nCore = nCore,
  tol1 = 0.05,
  n1 = 15,
  tol2 = 0.025,
  n2 = 100 # number of iterations; default is 60
)
````

- If you have zero values in your original data, the question is whether you believe they are:
  - Missing values: use dynamic imputation.
  - Actual zero values (or at least really low values): add a pseudocount before the log transformation.

For the latter, the idea is to add a very low value (e.g. min(X[X!=0])/1000) to the whole dataset, which replaces the zero values with that small value.

Once you have made sure that the dataset no longer contains missing or zero values, you are free to apply a log transformation or any other data normalization strategy.

- version 1.0 by Stef - Dec 29, 2021
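
The check for missing and zero values and the pseudocount step described above can be sketched in base R as follows. This is only an illustration under stated assumptions: the matrix `X` is made-up toy data (samples in rows, features in columns), and `logWithPseudocount()` is a hypothetical helper name, not part of StatTools or MUVR.

````r
# Toy feature matrix (made-up values); zeros are treated as true low values
X <- matrix(
  c(120, 0,   35,
     98, 4.2, 12,
    143, 7.9,  0),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("sample", 1:3), paste0("feature", 1:3))
)

# Count missing and zero values before any transformation
nMissing <- sum(is.na(X))
nZero <- sum(X == 0, na.rm = TRUE)

# Hypothetical helper: add a pseudocount to the whole matrix, then log-transform
logWithPseudocount <- function(X) {
  # Missing values must be imputed first (e.g. with mvImpWrap())
  if (anyNA(X)) stop("Impute missing values before the log transformation")
  pseudo <- min(X[X != 0]) / 1000 # smallest non-zero value divided by 1000
  log(X + pseudo)
}

Xlog <- logWithPseudocount(X)
````

The helper stops when missing values are still present, so imputation always comes first. The pseudocount is added to every value, not only to the zeros, which matches the description above.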