###### tags: `Statistical Analysis in Simple Steps Using R`
[TOC]
# Ch2 Data management in R
**Data Types:**
* Vectors
* Matrices
* Lists
* Data Frames
* Factors
* Arrays
| Data Type | Example |
| -------- | -------- |
| Numeric | 15.517 |
| Integer | 18L |
| Complex | 2+5i |
| Character | "abc", "PQR" |
| Logical | TRUE, FALSE |
**Data Structures:**
R is an object-oriented language: a style of software design that lets the user specify not only the data types but also the operations that can be performed on a defined data structure.
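As a quick check of the table above, the sketch below asks R for the class of one value of each basic type. The variable names are illustrative and not from the original notes. Note that a bare `18` is stored as numeric; the `L` suffix is what makes it an integer.

~~~r
# Illustrative values, one per row of the data-type table
num_val <- 15.517
int_val <- 18L        # the L suffix requests integer storage
cpl_val <- 2+5i
chr_val <- "abc"
lgl_val <- TRUE

# class() reports the type R assigned to each value
types <- sapply(list(num_val, int_val, cpl_val, chr_val, lgl_val), class)
types

class(18)             # "numeric": without the L suffix, 18 is not an integer
~~~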
---
## 2.1 Vectors
~~~
v1=c(5,2,3,7,8,9,1,4,10,15)
v1
length(v1)
class(v1)
str(v1) ## structure of the object v1
is.vector(v1)
~~~
### 2.1.1 Vector Operations
1. Scalar Arithmetic
~~~
v1*5
newv1=v1*5
~~~
2. Vector Arithmetic
~~~
v1=c(5,2,3,7,8,9,1,4,10,15)
v2=c(2,3,6,5,8,9,10,11,14,16)
v1+v2
v1-v2
v1*v2
v1/v2
~~~
3. Functional Transformations
~~~
log(v1)
sqrt(v2)
v1^2
~~~
### 2.1.2 Indexing Vectors
~~~
v1[5]
v1[-8]
~~~
### 2.1.3 Subsetting a Vector
1. Subsetting a Vector
~~~
subset(v1,v1>=5)
~~~
2. Random Selection of Elements of a Vector
~~~
sample(v1,5)
~~~
3. Bootstrapping a Vector
~~~
sample(v1, 20, replace=TRUE)
~~~
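Sampling with replacement is the basis of the bootstrap. As a minimal sketch (the seed, the 1000 replicates, and the names `n_boot` and `boot_means` are assumptions for illustration, not from the original notes), the resampled means can be used to gauge how variable the mean of `v1` is:

~~~r
v1 <- c(5,2,3,7,8,9,1,4,10,15)

set.seed(123)    # arbitrary seed so the resampling is reproducible
n_boot <- 1000   # assumed number of bootstrap replicates

# Each replicate resamples length(v1) elements with replacement and takes the mean
boot_means <- replicate(n_boot, mean(sample(v1, length(v1), replace=TRUE)))

mean(v1)         # mean of the original vector
sd(boot_means)   # bootstrap estimate of the standard error of the mean
~~~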
### 2.1.4 Combining Vectors
~~~
v3=c(v1,v2)
~~~
~~~
vecnum=c(1,2,3)
vecchr<-c("a","b", "c")
veccom=c(vecnum,vecchr)
class(veccom)
typeof(veccom)
length(veccom)
attributes(veccom)
~~~
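The combined vector `veccom` above is of class character because an atomic vector can hold only one type, so `c()` coerces its arguments to the most flexible type present. A short sketch of that coercion order, using illustrative values (`mixed` and `back` are hypothetical names):

~~~r
# c() coerces mixed inputs to one type, in the order logical < numeric < character
class(c(TRUE, 2))      # "numeric": TRUE becomes 1
class(c(2, "a"))       # "character": 2 becomes "2"
class(c(TRUE, "a"))    # "character": TRUE becomes "TRUE"

# Converting back only works for strings that look like numbers
mixed <- c(1, 2, "c")
back  <- suppressWarnings(as.numeric(mixed))   # "c" cannot be parsed and becomes NA
back                                           # 1 2 NA
~~~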
* $R$ has $6$ basic ***data types***.
1. character
2. numeric (real or decimal)
3. integer
4. logical
5. complex
6. raw
* $R$ has many ***data structures***.
1. atomic vector
2. list
3. matrix
4. data frame
5. factors
---
## 2.2 Matrices
~~~
mat1=matrix(c(1,2,3,7,8,9), nrow=2, ncol=3, byrow=TRUE)
mat1
mat2=matrix(c(2,3,5,9,6,4), nrow=2, ncol=3, byrow=TRUE)
mat2
~~~
### 2.2.1 Matrix Operations
~~~
mat1+mat2
mat1-mat2
mat1*mat2
mat1/mat2
~~~
* Matrix Multiplication
~~~
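mat3=matrix(c(1,2,3,4,5,6), nrow=3, ncol=2) ## illustrative 3x2 matrix (assumed; mat3 is not defined in the original notes)
mat1%*%mat3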
~~~
* Matrix Transpose
~~~
t(mat1)
~~~
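Note that `*` multiplies matrices element by element, whereas `%*%` is the matrix product and needs the number of columns of the left matrix to equal the number of rows of the right one. A small sketch using `mat1` and `mat2` as defined above (the name `prod22` is illustrative):

~~~r
mat1 <- matrix(c(1,2,3,7,8,9), nrow=2, ncol=3, byrow=TRUE)
mat2 <- matrix(c(2,3,5,9,6,4), nrow=2, ncol=3, byrow=TRUE)

dim(mat1 * mat2)     # 2 3: the element-wise product keeps the shape

# mat1 %*% mat2 would fail (2x3 times 2x3 is non-conformable),
# so transpose the second matrix into a 3x2 matrix first
prod22 <- mat1 %*% t(mat2)
dim(prod22)          # 2 2
prod22
~~~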
### 2.2.2 Indexing Matrices
~~~
matsub1=mat1[2,]
matsub2=mat1[,1]
mat1[2,3]
~~~
### 2.2.3 Subsetting and Sampling Matrices
~~~
a1=1:10
a2=rep(1:2,5)
a3=rep(5,10)
a4=6:15 ## seq(6:15) would give 1:10, not 6 to 15
mata=matrix(cbind(a1,a2,a3,a4),nrow=10,ncol=4)
~~~
* Subsetting the Matrices
~~~
subset(mata, mata[,4]>10) ## keep rows whose 4th column exceeds 10
~~~
* Random Sampling of Matrices
~~~
apply(mata,1,sample,2)
t(apply(mata,1,sample,2))
~~~
* Bootstrapping Matrices
~~~
matarow=sample(1:nrow(mata),100, replace=TRUE)
matacol=sample(1:ncol(mata),100, replace=TRUE)
mapply(function(row,column) mata[row,column], row=matarow, column=matacol)
~~~
* Combining Matrices
~~~
mat4=matrix(c(3,6),nrow=2,ncol=1)
cbind(mat1,mat4)
mat5=matrix(c(3,6,9),nrow=1,ncol=3)
rbind(mat1,mat5)
~~~
---
## 2.3 Lists
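An R `list` is similar to a vector in that it holds multiple elements. Unlike a vector, however, a list is a composite variable: each element can be of a different type, so variables of many different types can be stored in a single list.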
~~~
x.list <- list(1:3, "H.Y.Pan", matrix(3:6, nrow = 2), sin)
x.list
~~~
~~~
a1=c(1,2,3,4)
a2=c("a","b","c","d")
a3=c(TRUE,FALSE,FALSE,TRUE)
lst1=list(a1,a2,a3)
lst1
names(lst1)=c("n1","n2","n3")
~~~
### 2.3.1 Naming the elements of a List
Name each element of the list when creating it:
~~~
x.list <- list(
seq = 1:3,
name = "G.T.Wang",
mat = matrix(3:6, nrow = 2),
fun = sin)
x.list
~~~
Alternatively, after creating the list, use the `names` function to assign a name to each element:
~~~
names(x.list) <- c("seq", "name", "mat", "fun")
~~~
A list may also contain other lists as elements, forming a nested data structure:
~~~
y.list <- list(
var1 = list( name = "pi", val = pi),
var2 = list( name = "e", val = exp(1))
)
y.list
~~~
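Elements of a nested list can be reached by chaining `$` or `[[ ]]`. A short sketch, rebuilding `y.list` from the block above so that it runs on its own (the name `vals` is illustrative):

~~~r
y.list <- list(
  var1 = list(name = "pi", val = pi),
  var2 = list(name = "e", val = exp(1))
)

y.list$var1$val                  # value stored in the first inner list
y.list[["var2"]][["name"]]       # "e"

# Pull the val element out of every inner list
vals <- sapply(y.list, function(v) v$val)
vals
~~~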
### 2.3.2 List Operations
~~~
a1=c(2,7,3,8)
a2=c(8,9,2,5)
a3=c(3,7,9,2,6,5)
lst2=list(a1,a2,a3)
lapply(lst2, function(x) sum(x))
~~~
~~~
unlist(lst1)
~~~
### 2.3.3 Indexing a List
~~~
lst1[2]
lst1$n2[2] ## n2 is an element of lst1, not a standalone object
x.list[1:3]
x.list[-4]
x.list[c(TRUE, TRUE, TRUE, FALSE)]
x.list[c("seq", "name", "mat")]
~~~
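Single brackets return a sub-list, while double brackets or `$` extract the element itself. A minimal sketch, rebuilding `lst1` with its names from the earlier block so that it runs on its own:

~~~r
lst1 <- list(n1=c(1,2,3,4), n2=c("a","b","c","d"), n3=c(TRUE,FALSE,FALSE,TRUE))

class(lst1[2])      # "list": still a list, of length 1
class(lst1[[2]])    # "character": the vector stored in position 2
lst1$n2             # the same vector, accessed by name
lst1[["n2"]][2]     # "b": second entry of that vector
~~~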
### 2.3.4 Subsetting and Sampling Lists
* Subsetting a List
~~~
Filter(function(x) length(x)>4, lst2)
~~~
* Random Sampling in a List
~~~
sample(lst1,2)
~~~
* Bootstrapping a List
~~~
sample(lst1, 10, replace=TRUE)
~~~
### 2.3.5 Combining Lists
~~~
append(lst1, list(n4=c(2,3,4,5)), after=3)
append(lst1,lst2)
~~~
---
## 2.4 Data Frames
An R `data frame` is a variable type for storing tabular data, much like an `Excel` sheet. It resembles a matrix, but each column of a `data frame` can hold a different type of data, and a column can even be a list.
* Each row of a data frame contains related elements.
* Every column of a data frame has the same length, as in a matrix.
~~~
rollno=c(1,2,3,4)
mathmarks=c(67,45,88,90)
scimarks=c(66,56,90,92)
students=data.frame(rollno, mathmarks, scimarks)
~~~
~~~
x.data.frame <- data.frame(
x = letters[1:6],
y = rnorm(6),
z = runif(6) > 0.5
)
x.data.frame
class(x.data.frame)
~~~
When a `data frame` is created, if any of the vectors has named elements, R uses the first such named vector to name the rows of the `data frame`.
~~~
y <- rnorm(5)
names(y) <- month.name[1:5]
data.frame(
x = letters[1:5],
y = y,
z = runif(5) > 0.5
)
~~~
If you do not want R to derive the `data frame` row names from a vector's element names, set `row.names` to `NULL`:
~~~
data.frame(
x = letters[1:5],
y = y,
z = runif(5) > 0.5,
row.names = NULL
)
~~~
You can also use the `row.names` argument to set the row names of the `data frame` explicitly:
~~~
data.frame(
x = letters[1:5],
y = y,
z = runif(5) > 0.5,
row.names = c("Taipei", "Hsinchu", "Taichung", "Tainan", "Kaohsiung")
)
~~~
### 2.4.1 Data Frame Operations
~~~
apply(students, 2, mean)
lapply(students, mean)
sapply(students, mean)
~~~
**Note**: Here `apply()` and `sapply()` return a vector, while `lapply()` returns a list.
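The return types mentioned in the note can be checked directly. A small sketch, rebuilding the `students` data frame from above; the result names `res_apply`, `res_lapply`, and `res_sapply` are illustrative:

~~~r
students <- data.frame(rollno=c(1,2,3,4),
                       mathmarks=c(67,45,88,90),
                       scimarks=c(66,56,90,92))

res_apply  <- apply(students, 2, mean)   # column means via the matrix interface
res_lapply <- lapply(students, mean)     # one list element per column
res_sapply <- sapply(students, mean)     # the lapply result simplified to a vector

is.list(res_lapply)                      # TRUE
is.list(res_sapply)                      # FALSE
res_sapply["mathmarks"]                  # 72.5
~~~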
### 2.4.2 Indexing Data Frames
~~~
class(mtcars)
mtcars[3,]
mtcars[,3]
mtcars
~~~
### 2.4.3 Subsetting and Sampling
~~~
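subset(mtcars,mtcars$cyl>=4) ## every car in mtcars has cyl >= 4, so this keeps all rows
mtcars[sample(nrow(mtcars),15),] ## 15 random rows without replacement
mtcars[sample(nrow(mtcars),100, replace=TRUE),] ## bootstrap sample of 100 rows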
~~~
### 2.4.4 Combining Data Frames
~~~
engmarks=c(62,61,58,59)
students=cbind(students, engmarks)
~~~
To add a row to an existing data frame:
~~~
row5=c(5,89,86,61)
students=rbind(students, row5)
~~~
~~~
rollno=c(1,3,5,2,4)
gkmarks=c(67,78,90,95,79)
studentsgk=data.frame(rollno, gkmarks)
merge(students, studentsgk, by="rollno")
~~~
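By default `merge()` keeps only the rows whose key appears in both data frames. A hedged sketch of what happens when the keys do not match exactly, using small made-up data frames (`left_df`, `right_df`, `inner`, and `outer` are hypothetical names, and the values are for illustration only):

~~~r
left_df  <- data.frame(rollno=c(1,2,3), mathmarks=c(67,45,88))
right_df <- data.frame(rollno=c(2,3,4), gkmarks=c(95,78,79))

inner <- merge(left_df, right_df, by="rollno")             # rollno 2 and 3 only
outer <- merge(left_df, right_df, by="rollno", all=TRUE)   # rollno 1 to 4, NA where missing

nrow(inner)   # 2
nrow(outer)   # 4
outer
~~~

Setting `all.x=TRUE` or `all.y=TRUE` instead of `all=TRUE` keeps the unmatched rows of only the first or only the second data frame.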
---
## 2.5 Factors
~~~
vec1=c(1,2,2,1,1,2,1,2,1,2)
fac1=factor(vec1)
fac1
levels(fac1)=c("Male", "Female")
~~~
### 2.5.1 Uses of Factors
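Factors represent categorical variables. They serve as grouping variables in analyses such as the **t-test** and **ANOVA**, and as categorical predictors or outcomes in **logistic regression**.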
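As an illustration, a factor can serve as the grouping variable in a two-sample t-test. The scores and the names `score`, `gender`, and `tt` below are made up for this sketch and are not from the original notes:

~~~r
# Hypothetical scores for ten people, with the grouping coded 1/2 as in vec1
score  <- c(62, 70, 68, 59, 64, 75, 61, 72, 66, 71)
gender <- factor(c(1,2,2,1,1,2,1,2,1,2), labels=c("Male", "Female"))

table(gender)                 # five observations per level

# The formula interface splits score by the levels of the factor
tt <- t.test(score ~ gender)
tt$estimate                   # group means for Male and Female
~~~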
### 2.5.2 Factor Operations
~~~
table(fac1)
~~~
### 2.5.3 Indexing
~~~
fac1[3]
~~~
---
## 2.6 Arrays

~~~
vec1=c(2,3,4,1,2,3,7,8,9,5,6,7,1,4,5)
arr1=array(vec1, dim=c(2,3,2)) ## 2x3x2 holds 12 values, so only the first 12 elements of vec1 are used
~~~
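An array extends a matrix to more than two dimensions; here `dim=c(2,3,2)` means 2 rows, 3 columns, and 2 layers. A small sketch of indexing such an array, using twelve illustrative values so that the data fill it exactly (`arr_demo` is a hypothetical name):

~~~r
arr_demo <- array(1:12, dim=c(2,3,2))   # filled column by column, layer by layer

dim(arr_demo)       # 2 3 2
arr_demo[2,3,1]     # 6: row 2, column 3, first layer
arr_demo[,,2]       # the second layer as a 2x3 matrix
arr_demo[1,,]       # row 1 across all columns and layers (a 3x2 matrix)
~~~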
---
## 2.7 Missing Values
~~~
vec1=c(11,15,NA,27,14,28,NA,10,18,NA,NA,12,17,19,20)
mean(vec1) ## returns NA because vec1 contains missing values
~~~
### 2.7.1 Counting Missing Values
~~~
is.na(vec1)
table(is.na(vec1))
~~~
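Because `TRUE` counts as 1 in arithmetic, the number, positions, and share of missing values can also be obtained directly from `is.na()`. A brief sketch (`n_missing`, `where_missing`, and `prop_missing` are illustrative names):

~~~r
vec1 <- c(11,15,NA,27,14,28,NA,10,18,NA,NA,12,17,19,20)

n_missing     <- sum(is.na(vec1))     # 4 missing values
where_missing <- which(is.na(vec1))   # positions 3, 7, 10, 11
prop_missing  <- mean(is.na(vec1))    # share of missing values, 4/15

n_missing
where_missing
prop_missing
~~~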
### 2.7.2 Omitting Missing Values
~~~
mean(vec1, na.rm=TRUE)
vec2=na.omit(vec1)
vec2
~~~
### 2.7.3 Replacing Missing Values
~~~
vec1[is.na(vec1)]=round(mean(vec1, na.rm=TRUE))
~~~