# Data Transformation
###### tags: `R` `statistics` `p-value` `t-test` `Mann Whitney U test` `dplyr`

## Import Dataset

A **data frame (df)** is used for storing data tables. It is a list of vectors of equal length.

## Let's create our own data set

**Using MS Excel**

Create "myData.txt"

**Attribute Information:**
* age: age in years
* sex: M: 1, F: 0
* height: cm
* weight: kg
* group: student, resident, attending
* dm: 0, 1
* htn: 0, 1
* dyslipidemia: 0, 1

We will import the myData.txt file.

First, put your file into the **"data" folder**.

```r=
# load libraries
library(dplyr)    # for data transformation
library(ggplot2)  # for graphs

# import the tab-separated myData.txt file and name it data1
data1 <- read.table("data/myData.txt", header = TRUE, sep = "\t")

# check the structure of the dataset
str(data1)
```

## dplyr

* Each function takes a data frame and returns a new data frame
* Comparisons: >, >=, <, <=, !=, and ==
* Logical operators: & (and), | (or), and ! (not)

**filter():** Pick observations by their values

![](https://i.imgur.com/DQQQ11i.png)

```r=
# find males with DM (conditions separated by commas are combined with "and")
m_dm <- filter(data1, sex == 1, dm == 1)
str(m_dm)

# find individuals with HTN and height >= 165
h_htn <- filter(data1, height >= 165 & htn == 1)
str(h_htn)

# find students with weight between 55 and 60 kg (between() includes both ends)
s_w <- filter(data1, between(weight, 55, 60) & group == "student")
str(s_w)
```

**arrange():** Reorder the rows

![](https://i.imgur.com/P9FLbyg.png)

```r=
# arrange in ascending order
data_arr1 <- arrange(data1, height)
View(data_arr1)

# use more than one column (ties in height are ordered by weight)
data_arr2 <- arrange(data1, height, weight)
View(data_arr2)

# in descending order
data_arr3 <- arrange(data1, desc(height))
View(data_arr3)
```

**select():** Pick variables by their names

![](https://i.imgur.com/0QVUQxX.png)

```r=
# pick the age, sex and dm columns
age_sex_dm <- select(data1, age, sex, dm)
str(age_sex_dm)

# pick the columns from age to htn
age_to_htn <- select(data1, age:htn)
str(age_to_htn)

# remove the columns from age to htn
no_age_to_htn <- select(data1, -(age:htn))
str(no_age_to_htn)

# rename the sex column to gender (data1 itself is unchanged)
new_data <- rename(data1, gender = sex)
str(data1)
str(new_data)
```

**mutate():** Create new variables

**transmute():** Keep the new variables only

![](https://i.imgur.com/XR8MpYf.png)

```r=
# add new columns age_sex and dm_htn
new_columns <- mutate(data1,
                      age_sex = age - 10 * sex,
                      dm_htn = dm + htn)
str(new_columns)

# save only the new columns
new_data2 <- transmute(data1,
                       age_sex = age - 10 * sex,
                       dm_htn = dm + htn)
str(new_data2)
```

**summarize():** Collapse many values into a summary

**group_by():** Operate group by group

![](https://i.imgur.com/QZkJ5us.png)

```r=
# summarize the mean age, the SD of age and the total number of patients
summarize(data1, age_mean = mean(age), sd = sd(age), n = n())

# group by sex and group
group1 <- group_by(data1, sex, group)
summarize(group1, age_mean = mean(age), sd = sd(age), n = n())

# the better way: use the pipe (%>%)
data1 %>%
  group_by(sex, group) %>%
  summarize(age_mean = mean(age), sd = sd(age), n = n())

# use count
data1 %>% count(sex, group)
data1 %>% count(sex, htn)
```
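
**Putting the verbs together**

Since myData.txt is created by hand, the sketch below uses a small made-up data frame with the same columns, so it can be run without the file. It shows one way the verbs above can be chained with `%>%`. The `toy` data frame, its values and the `n_conditions` column are assumptions for illustration only; they are not part of the original data.

```r
library(dplyr)

# Hypothetical toy data with the same columns as myData.txt (values are made up)
toy <- data.frame(
  age          = c(24, 31, 45, 27, 52, 38),
  sex          = c(1, 0, 1, 0, 1, 0),
  height       = c(172, 160, 168, 158, 175, 163),
  weight       = c(68, 52, 80, 55, 85, 60),
  group        = c("student", "student", "attending",
                   "resident", "attending", "resident"),
  dm           = c(0, 0, 1, 0, 1, 0),
  htn          = c(0, 0, 1, 0, 1, 1),
  dyslipidemia = c(0, 0, 1, 0, 0, 1)
)

# filter -> mutate -> group_by -> summarize -> arrange in one pipeline
result <- toy %>%
  filter(height >= 160) %>%                          # keep height >= 160 cm
  mutate(n_conditions = dm + htn + dyslipidemia) %>% # hypothetical new column
  group_by(group) %>%                                # one summary row per group
  summarize(age_mean = mean(age),
            conditions_mean = mean(n_conditions),
            n = n()) %>%
  arrange(desc(age_mean))                            # oldest group first

print(result)
```

Each step receives the data frame produced by the previous step, so no intermediate objects such as `group1` are needed.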