# Data Transformation
###### tags: `R` `statistics` `p-value` `t-test` `Mann Whitney U test` `dplyr`
## Import Dataset
A **data frame (df)** stores tabular data.
Structurally, it is a list of vectors of equal length, one vector per column.
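
To make the "list of equal-length vectors" idea concrete, here is a minimal sketch in base R. The `toy` data frame and its values are made up for illustration and are not part of myData.txt.

```r
# Toy data frame with made-up values (not taken from myData.txt)
toy <- data.frame(
  age   = c(25, 31, 47),
  sex   = c(1, 0, 1),
  group = c("student", "resident", "attending"),
  stringsAsFactors = FALSE
)

is.list(toy)           # a data frame is a list underneath
sapply(toy, length)    # every column (vector) has the same length
toy$age                # one column comes back as a plain vector
toy[["group"]][2]      # list-style access works as well
```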
## Let's create our own data set
**Using MS Excel**
Create "myData.txt" and save it as tab-delimited text with a header row.
**Attribute Information:**
* age: age in years
* sex: M = 1, F = 0
* height: cm
* weight: kg
* group: student, resident, attending
* dm: diabetes mellitus, 0 or 1
* htn: hypertension, 0 or 1
* dyslipidemia: 0 or 1
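
If Excel is not at hand, a file with the same layout can also be written from R. The sketch below is only an illustration: the four rows are hypothetical values, and it writes to a temporary folder (an assumption made so nothing is overwritten) rather than to the data folder used in this note.

```r
# Hypothetical example rows; replace them with your own measurements
my_data <- data.frame(
  age          = c(24, 29, 45, 52),
  sex          = c(1, 0, 0, 1),
  height       = c(172, 160, 158, 175),
  weight       = c(68, 55, 60, 80),
  group        = c("student", "resident", "attending", "attending"),
  dm           = c(0, 0, 1, 1),
  htn          = c(0, 0, 0, 1),
  dyslipidemia = c(0, 0, 1, 0),
  stringsAsFactors = FALSE
)

# Write a tab-separated file with a header row and no row names.
# A temporary path is used here instead of data/myData.txt.
out_path <- file.path(tempdir(), "myData.txt")
write.table(my_data, out_path, sep = "\t", row.names = FALSE, quote = FALSE)

# Read it back with the same arguments as the import step
check <- read.table(out_path, header = TRUE, sep = "\t")
str(check)
```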
We will import the myData.txt file.
First, put the file into the **"data"** folder inside your working directory.
```r=
# load library
library(dplyr) # for data transformation
library(ggplot2) # for graph
# import the tab-separated myData.txt file and name it data1
data1 <- read.table("data/myData.txt", header = TRUE, sep = "\t")
# check the structure of the dataset
str(data1)
```
## dplyr
* Each dplyr verb takes a data frame as input and returns a new data frame
* Comparison operators: >, >=, <, <=, !=, and ==
* Logical operators: & (and), | (or), and ! (not)
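
The operators above work element by element on vectors, which is what makes them usable as row conditions. A small base R sketch, with made-up `height` and `htn` values:

```r
height <- c(158, 165, 172)   # made-up values
htn    <- c(1, 1, 0)

height >= 165                # FALSE  TRUE  TRUE
htn == 1                     #  TRUE  TRUE FALSE
height >= 165 & htn == 1     # FALSE  TRUE FALSE
height >= 165 | htn == 1     #  TRUE  TRUE  TRUE
!(htn == 1)                  # FALSE FALSE  TRUE
```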
<br/>
**filter():** Pick observations by their values

```r=
# find males with DM (comma-separated conditions are combined with "and")
m_dm <- filter(data1, sex == 1, dm == 1)
str(m_dm)
# find individuals with HTN and height >= 165
h_htn <- filter(data1, height >= 165 & htn == 1)
str(h_htn)
# find students with weight from 55 to 60 kg
# note: between() is inclusive, so 55 and 60 themselves are kept
s_w <- filter(data1, between(weight, 55, 60) & group == "student")
str(s_w)
```
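
Two details of `filter()` are easy to miss: conditions separated by commas behave like `&`, and `between()` keeps both endpoints. The sketch below checks both on a made-up `toy` data frame (the values are illustrative only).

```r
library(dplyr)

# Made-up values for illustration
toy <- data.frame(
  weight = c(54, 55, 58, 60, 61),
  group  = c("student", "student", "resident", "student", "student"),
  stringsAsFactors = FALSE
)

# Comma-separated conditions and & select the same rows
a <- filter(toy, weight >= 55, group == "student")
b <- filter(toy, weight >= 55 & group == "student")
a$weight
b$weight

# between() includes both endpoints ...
inclusive <- filter(toy, between(weight, 55, 60))
inclusive$weight   # 55 58 60

# ... while strict inequalities drop rows exactly at 55 and 60
strict <- filter(toy, weight > 55, weight < 60)
strict$weight      # 58
```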
<br/>
**arrange():** Reorder the rows

```r=
# arrange in ascending order
data_arr1 <- arrange(data1, height)
View(data_arr1)
# use more than one column
data_arr2 <- arrange(data1, height, weight)
View(data_arr2)
# in descending order
data_arr3 <- arrange(data1, desc(height))
View(data_arr3)
```
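
When several columns are given, `arrange()` uses the later ones only to break ties in the earlier ones. A short sketch on a made-up `toy` data frame shows this without needing `View()`:

```r
library(dplyr)

# Made-up values; two rows share height 170 so the tie-break is visible
toy <- data.frame(
  height = c(170, 160, 170, 165),
  weight = c(70, 55, 62, 58)
)

# Ascending height, ties broken by ascending weight
arrange(toy, height, weight)$weight         # 55 58 62 70

# Descending height, ties still broken by ascending weight
arrange(toy, desc(height), weight)$weight   # 62 70 58 55
```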
<br/>
**select():** Pick variables by their names

```r=
# pick age, sex and dm columns
age_sex_dm <- select(data1, age, sex, dm)
str(age_sex_dm)
# pick the columns from age to htn
age_to_htn <- select(data1, age:htn)
str(age_to_htn)
# remove the columns from age to htn
no_age_to_htn <- select(data1, -(age:htn))
str(no_age_to_htn)
# rename the sex column to gender (syntax: new_name = old_name)
new_data <- rename(data1, gender = sex)
str(data1)
str(new_data)
```
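
Neither `select()` nor `rename()` changes the original data frame; each returns a modified copy. A minimal sketch on a made-up `toy` data frame:

```r
library(dplyr)

# Made-up values for illustration
toy <- data.frame(age = c(25, 31), sex = c(1, 0), dm = c(0, 1))

renamed <- rename(toy, gender = sex)   # new_name = old_name
names(renamed)             # "age" "gender" "dm"
names(toy)                 # "age" "sex" "dm" (original is untouched)

# A minus sign drops a column instead of keeping it
names(select(toy, -dm))    # "age" "sex"
```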
<br/>
**mutate():** Create new variables from existing ones
**transmute():** Keep only the new variables

```r=
# add new columns age_sex and dm_htn
new_columns <- mutate(data1, age_sex = age - 10 * sex, dm_htn = dm + htn)
str(new_columns)
# save only the new columns
new_data2 <- transmute(data1, age_sex = age - 10 * sex, dm_htn = dm + htn)
str(new_data2)
```
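
As a further illustration (not part of the original exercise), the sketch below derives body mass index from height in cm and weight in kg. The `toy` values and the column names `height_m` and `bmi` are assumptions chosen for the example. It also shows that a column created inside `mutate()` can be used later in the same call.

```r
library(dplyr)

# Made-up values for illustration
toy <- data.frame(height = c(160, 175), weight = c(64, 61.25))

# height_m is created first and reused right away to compute bmi
with_bmi <- mutate(toy, height_m = height / 100, bmi = weight / height_m^2)
with_bmi

# transmute() returns only the newly created columns
only_new <- transmute(toy, height_m = height / 100, bmi = weight / height_m^2)
names(only_new)   # "height_m" "bmi"
```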
<br/>
**summarize():** Collapse many values into a single summary row
**group_by():** Make the following operations run group by group

```r=
# summarize the mean and SD of age and the total number of patients
summarize(data1, age_mean = mean(age), sd = sd(age), n = n())
# group by sex and group
group1 <- group_by(data1, sex, group)
summarize(group1, age_mean = mean(age), sd = sd(age), n = n())
# the better way: use the pipe (%>%)
data1 %>%
  group_by(sex, group) %>%
  summarize(age_mean = mean(age), sd = sd(age), n = n())
# use count
data1 %>% count(sex, group)
data1 %>% count(sex, htn)
```
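
`count()` gives the same group sizes as `group_by()` followed by `summarize(n = n())`. The sketch below compares the two on a made-up `toy` data frame. It assumes dplyr 1.0.0 or later, because it passes `.groups = "drop"` to get an ungrouped result.

```r
library(dplyr)

# Made-up values for illustration
toy <- data.frame(
  sex   = c(1, 1, 0, 0, 0),
  group = c("student", "student", "student", "resident", "resident"),
  age   = c(22, 24, 23, 30, 34),
  stringsAsFactors = FALSE
)

# Long form: group, then summarize
by_hand <- toy %>%
  group_by(sex, group) %>%
  summarize(age_mean = mean(age), n = n(), .groups = "drop")

# Short form: count() does the grouping and counting in one step
counted <- toy %>% count(sex, group)

by_hand$n   # 2 1 2
counted$n   # 2 1 2
```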
<br/>