Data manipulation in R
Replace missing values with non-missing values in groups
Filling missing value in group
Get data structure as data.frame
Find minimum, maximum in a column
Find elements of a vector that are unmatched to elements in another vector
Handle errors with tryCatch{}
This allows iteration that results in erros to be skipped. Without this error handling, the iteration discontinues.
Read specific lines/rows of a file with separators \n
Read a file containing quotation marks/quotes. Quotes may be unclosed single or double quotes (e.g. I've signed up)
Duplicate/repeat each row for N times
Replace non missing values with a number
Assign a name to an object and export it to the global environment
keep data in original order after subsetting with filter() using dply::arrnge(match(variable,vector-order))
Multiply multiple columns with a specific columnm, creating new columns for the multiplication
Deselect multiple columns by their names
Rename multiple columns
Rename columns with dplyr::rename_at(vars=, list(~function1, ~function2,...))
or dplyr::rename(new.var.name=old.var.name)
Rename columns with data.table::setnames()
Column-bind two data.frames that have different row numbers
String manipulation
Create all combinations of two vectors, ordering the result as vector 1 and then vector 2. This cross join is more efficient than apply(expand.grip(x,y))
and custom sorting
Replace multiple strings with multiple replacements
Join multiple strings into a single string. Collapse multiple strings to 1 string.
Count occurrences of categories in ordinal/binary variables
Suppose a cross table is generated for multiple binary variables (categories=0,1) where some variables don't have a category, say 1. Convert your variable to a factor, and set the categories you wish to include in the result using levels. Values with a count of zero will then also appear in the result:
Produces 2-way cross table
Date and time
Convert datetime from character to datetime type
search pattern "yyyy-mm-dd"
Matrix
Summarise a matrix
Export a matix and import it as a data.frame. It's important that the imported data.frame has a rowname that is from the matrix
Pad leading zeros to numeric variables
Set plotting parameters by par(mar(),mpg(),oma())
mar()
controls the margins, where plot axis titles are printed
mpg()
sets the axis label locations relative to the edge of the inner plot window. The first value represents the location the labels (i.e. xlab and ylab in plot), the second the tick-mark labels, and third the tick marks. The default is c(3, 1, 0). Increase the label location when your axis label overlaps with axis ticks

Mutating joins and filtering joins
Read selective lines of a file into R
*How to extract the first line from a text file?
Grep the line that starts with keywords from a no column text file
Count number of non-missing values per row for multiple columns
Count number of non-missing values in a column
Combine multiple columns into one column
Search or match one or multiple patterns
Use grep()
,glob2rx()
and grep()
, or match(patterns,object)
glob2rx()
changes wildcard or globbing pattern into regular expression
Stack multiple columns to a single column
*stacking columns into 1 column in R
Split a column of a data.frame into columns by delimiter.
Substring a column by position
Subset part of a column by delimiter by splitting the column into multiple new columns using tidyr::separate
and deselecting unwated columns
Find position of first underscore in a string. This allows you to subset part of a string if the string has a delimiter
custom sort a character vector, data.frame…
Custom sort a character vector by part of the values
Sort a character vector in a user-defined order with order(match(have,want))
*How to sort a character vector according to a specific order?
Sort a data.frames based on ascending and descending order of multiple columns
Run part of a R file by a range of line number
Sort a data.frame by an user-defined order.
Find the positions of NAs in the matrix using which(is.na(matrix_name),arr.ind=T)
Get positions for NAs only in the “middle” of a matrix column
Convert a matrix to a data.frame without losing its dimnames. This method writes the matrix's rownames into 1st column and uses its colnames as column names for the data.frame
Convert a data.frame to a matrix. The key is to use matrix(as.matrix(DF))
. dimnames=
specifies the rownames and colnames of the matrix.
Reshape data from long to wide
Method 1: use reshape()
. Limitation is only 1 timevar can be used. Use method 2 if multiple timevar
Method 2: use reshape2::dcast()
. Limitation: when there are duplicate variables, dcast() applies its default function length to your data (e.g. when your value.var takes a character column, but the reshaped values will become 1 or 0). To overcome this limitation, you will create another new column that uniquely identifies the combinations of the ~ variables. Function arguments:
- value.var = can take only 1 measure variable if packageVersion("reshape2") 1.4.3
- On the left of ~ : the transposed data has one row per unique value of "variable"
- On the right of ~ : the transposed columns are unique combination of wave and value
- drop= : should missing combinations dropped or kept? If variable contains NA, set drop= to TRUE; Otherwise set it to FALSE
Method 3: use reshape2::melt() and tidyr::spread()
Method 3: reshape data when identifier columns containing duplicates using tidyr::spread()
The trick is to create a dummy column containing unique number representing rows of the reshaped data using gather()
unite()
group_by
and mutate()
and then reshape data using the dummy column as the key column, and eventually drop this dummy column
Method 4: reshape data from long to wide using
Conditionally create a new column with mutate(new.var=case_when)
Values of new variable are specified on the right of ~
Conditionally filter rows based on values of a column.
*onditional filter of grouped factors - dplyr
Negatively filter rows based on a pattern in a column
Negatively filter columns with column names ending with a pattern
*How to drop columns by name pattern in R?
Reorder columns in a flexible way using setdiff()
Select columns and order them using dplyr::select_()
.
Subset rows using dplyr::filter()
Use multiple patterns with filter(str_detect(variable,"A|B|C"))
. str_detect() accepts length-1 pattern
Searching a particular string, an equivalent to using wildcard
Positive filtration with a pattern containing special characters that need to escape: dot
Positive filtration with a search pattern of multiple strings
Positive filtration with logical operators
Subset part of a string by deleting the part that we don't want
Sum up a column with multiple group_by columns
Convert factor columns to character or other type lapply()
Convert multiple POSIXct columns to Date
An example of using nested foreach loops. foreach()
loops are used for their return value, like lapply. In this way they are very different from for loops which are used for their side effects. By using the appropriate .combine functions, the inner foreach loop can return vectors which are combined row-wise into a matrix by the outer foreach loop. Notice that packages used in the inner foreach loop need to be called in the outer foreach loop
Subsetting using subset()
Calculation in a data.frame
*Creating a new column to a data frame using a formula from another variable
Remove duplicated rows based on specific columns
*removing duplicate units from data frame
Remove rows with column ID that occurs more than once and then remove these rows using match()
Subset/Remove rows with column ID that occurs more than once and then remove these rows using filter()
Left outer join in R, with merging key columns are same-named or different-named. Prefer dplyr::left_join()
to merge()
because it keeps the order of merging key of the joined file the same as the left table
Subset values that begin with rs
in a data.frame
*Using grep to help subset a data frame in R
sort a data.frame by multiple columns
*How to Sort a Data Frame by Multiple Columns in R
Subset rows where one or multiple columns are not missing
Merge, join multiple data.frame simultaneously using list(df1,df2,df3...) %>% reduce(left_join, by = c("key1","key2"..))
Merging a list of > 2 data.frames using join_all. To do this, firstly adds all the data.frames to join to a list. These df should have same-named merging key. Then join_all merges all the df as a single df. Note that merge() is limited to combining 2 files. The following code merge 5 data.frames for 16 groups, which are constructed by 4 for loops
*join_all: Recursively join a list of data frames
- see complete code in
PRS_UKB_201711_step11-01_inner-join_summedPRS-across-all-traits_PCs_impCov.R
- alternatively use Reduce(merge()) to do same thing as join_all
library(plyr)
count=0
for (a in 1:length(unique(base$group_keyword1))){
for (b in 1:length(unique(base$group_keyword2))){
for (c in 1:length(unique(base$group_keyword3))){
for (d in 1:length(unique(base$group_keyword4))){
aa=unique(base$group_keyword1)[a]
bb=unique(base$group_keyword2)[b]
cc=unique(base$group_keyword3)[c]
dd=unique(base$group_keyword4)[d]
count=count+1;
print(paste("aa=",aa,"bb=",bb,"cc=",cc,"dd=",dd,sep = " "))
per_group_info= base[base$group_keyword1==aa & base$group_keyword2== bb & base$group_keyword3== cc & base$group_keyword4== dd, c("inputFilePath","outputFolderPath")]
per_group_input <- as.character(per_group_info$inputFilePath)
file1= read.table(per_group_input[1],header = T, stringsAsFactors = F)
file2= read.table(per_group_input[2],header = T, stringsAsFactors = F) %>% select(-c(FID,PHENO))
file3= read.table(per_group_input[3],header = T, stringsAsFactors = F) %>% select(-c(FID,PHENO))
file4= read.table(per_group_input[4],header = T, stringsAsFactors = F) %>% select(-c(FID,PHENO))
file5= read.table(per_group_input[5],header = T, stringsAsFactors = F) %>% select(-c(FID,PHENO))
dfs <- list(file1,file2,file3,file4,file5)
joined <- join_all(dfs,by=c("IID"),type = "inner")
joined_pc= merge.data.frame(joined,pc_file,by.x = "IID",by.y = "INDID")
joined_pc_impCov= merge.data.frame(joined_pc,impCov_file,by.x = "IID", by.y = "V2")
colnames(joined_pc_impCov)[54] <- "impCov"
colnames(joined_pc_impCov)[1] <- "ID"
outputFolderPath <- as.character(unique(per_group_info$outputFolderPath))
outputFilePath <- paste0(outputFolderPath,"/","summed-PRS_S1-S8_all-pheno_10PCs_impCov")
write.table(joined_pc_impCov
,col.names=T
,row.names = F
,file=outputFilePath
,dec="."
,sep=" "
,quote=FALSE
,na = "NA" )
print(paste0("================================= iteration", count, "========================="))
}
}
}
}
Merging 2 data.frames with different-named merging key columns
Merging 2 data sets using merge(). By default the data frames are merged on the columns with names they both have, but separate specifications of the columns can be given by by.x
and by.y
. To prevent this default, skip by= if the merging key is same-named in both data sets.
Read a file containing dashes in its column names. For example, column "UKB-SS-S1" becomes "UKB.SS.S1" However, you can change the column names back to orginal ones using qsub
*How to replace the “.” in column names generated by read.csv() with a single space when exporting?
Convert a data.frame to a vector.
*R-friendly way to convert R data.frame column to a vector?
Row-subsetting and column-subsetting in R. Note this subsetting is based on character columns. factor columns won't work
Conditionally copy columns to a new column based on another column
Conditionally execute code in R. Don't nest if within if
When there are 2 conditions to evaluate
When there are 3 conditions to evaluate
Create a new variable based on the conditions of a variable using a single or nested ifelse()
-
|
for "or", which is different from ||
used by if()
-
'&' for "and", which is different from &&
used by if()
-
==
for comparing variables, which is similar to ==
used by if()
-
Nested ifelse statement
When nesting ifelse()
, make sure each ifelse()
contains 3 arugmenets: (1) condition to evaluate,(2) value if the condition is evaluated as TRUE, (3) value if the condition is evaluated as FALSE. This example has 3 condictions and 2 ifelse()
. The 3rd argument in the first ifelse() is a nested ifelse().
Recode variables using dplyr::recode
- syntax form:
dplyr::recode(variable,)
Append output from every iteration to a base data.frame
*Add row to dataframe
Remove columns that contain only NAs
Read files with delimiters as varying number of white spaces
Combine multiple data frames by row using do.call(rbind,df_list)
Combine multiple data frames by column using (1) do.call(cbind,df_list)
,(2) join_all(data_list,by=mergingKey,type=)