# R Toolbox for Introductory Statistics
Each workflow below includes:
1. **Use when:** a general description of the task **and** concrete p-set examples
2. **Key formulas** (if applicable)
3. **Clean R code template**
---
## 1. Basic Data & Vector Manipulation
**Use when:** importing data, creating or transforming simple vectors, and getting quick summaries.
**Examples:**
* Loading **penguins.csv** to analyze body masses
* Converting **body_mass_g → body_mass_lb** for weight comparisons
* Subsetting the first six bill lengths in **pset1**
* Summarizing quiz scores in **pset2**
```r
# 1. Read a CSV file
df <- read.csv("path/to/file.csv")
# 2. Create simple sequences and transform vectors
v <- 1:N
v2 <- a * v + b
# 3. Convert units (example: grams → pounds; 1 kg ≈ 2.205 lb)
df$body_mass_lb <- (df$body_mass_g / 1000) * 2.205
# 4. Summaries and tables
summary(df$var) # five-number summary + mean
mean(df$var, na.rm = TRUE) # mean excluding NAs
table(df$group) # counts by category
head(df$var, 6) # first six values
```
---
## 2. Binomial Distribution
**Use when:** computing probabilities and moments for independent “yes/no” trials.
**Examples:**
* “Exact chance exactly 25 of 80 customers return” in **pset1**
* “Probability at least 3 defectives in 20” in **pset3**
* “Expected heads in 100 coin flips”
### Formulas
$$
P(X = k) = \binom{n}{k}\,p^k\,(1-p)^{n-k},
\quad
E[X] = np,
\quad
\mathrm{SD}(X) = \sqrt{np(1-p)}.
$$
### R code
```r
# PMF: P(X = k)
dbinom(k, size = n, prob = p)
# CDF: P(X ≤ k)
pbinom(k, size = n, prob = p)
# Upper tail: P(X ≥ k)
1 - pbinom(k-1, size = n, prob = p)
# Quantiles: smallest k with P(X ≤ k) ≥ α
qbinom(alpha, size = n, prob = p)
# Mean & SD of X
mu <- n * p
sigma <- sqrt(n * p * (1 - p))
# For sample proportion p̂ = X/n
mu_p <- p
sigma_p <- sqrt(p * (1 - p) / n)
```
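**Worked sketch:** the block below plugs numbers into the template for the "expected heads in 100 coin flips" example. A fair coin (`p = 0.5`) and the cutoff of 60 heads are assumptions chosen for illustration, not values taken from a p-set.

```r
# X = number of heads in n = 100 flips; p = 0.5 is an assumed value
n <- 100
p <- 0.5

mu    <- n * p                   # expected heads: 50
sigma <- sqrt(n * p * (1 - p))   # SD: 5

# P(X >= 60), computed two equivalent ways
upper_a <- 1 - pbinom(59, size = n, prob = p)
upper_b <- pbinom(59, size = n, prob = p, lower.tail = FALSE)

# Sanity check: the PMF sums to 1 over k = 0, ..., n
total <- sum(dbinom(0:n, size = n, prob = p))
```

Note the `k - 1` in the upper tail: `P(X ≥ 60) = 1 - P(X ≤ 59)` because X is discrete.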
---
## 3. Normal Distribution & CLT
**Use when:** finding probabilities or quantiles for a normal variable or a sample mean via the Central Limit Theorem.
**Examples:**
* “Probability a 2-minute call exceeds 5 min” in **pset4**
* “Cutoff for fastest 5% of race times” in **final review**
### Formulas
$$
P(a < X < b) = \Phi\!\Bigl(\tfrac{b-\mu}{\sigma}\Bigr)\;-\;\Phi\!\Bigl(\tfrac{a-\mu}{\sigma}\Bigr),
\quad
q_\alpha = \mu + \sigma\,\Phi^{-1}(\alpha).
$$
### R code
```r
# P(a < X < b)
pnorm(b, mean = mu, sd = sigma) -
pnorm(a, mean = mu, sd = sigma)
# P(X > c)
1 - pnorm(c, mean = mu, sd = sigma)
# Quantile q such that P(X ≤ q) = α
qnorm(alpha, mean = mu, sd = sigma)
# For sample mean X̄ ∼ N(µ, σ²/n)
pnorm(b, mean = mu, sd = sigma/sqrt(n)) -
pnorm(a, mean = mu, sd = sigma/sqrt(n))
```
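**Worked sketch:** a small example of the quantile and CLT templates. The mean of 30 minutes, SD of 4 minutes, and sample size of 16 are made-up values for illustration only.

```r
# Assumed model: race times X ~ N(mu = 30, sd = 4), in minutes
mu    <- 30
sigma <- 4

# "Fastest 5%" means the lowest times, so use the 5th percentile
cutoff <- qnorm(0.05, mean = mu, sd = sigma)

# Same cutoff from the formula q_alpha = mu + sigma * Phi^{-1}(alpha)
cutoff_formula <- mu + sigma * qnorm(0.05)

# CLT: the mean of n = 16 times has SD sigma / sqrt(n) = 1
n       <- 16
se_mean <- sigma / sqrt(n)

# P(28 < X-bar < 32), i.e. within 2 standard errors of mu
p_mean <- pnorm(32, mean = mu, sd = se_mean) -
  pnorm(28, mean = mu, sd = se_mean)
```

For "slowest 5%" style questions, switch to `qnorm(0.95, ...)`; which tail is "best" depends on the context of the problem.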
---
## 4. One-Sample Z-Test for Proportion
**Use when:** testing a single population proportion in large samples.
**Examples:**
* “Is the true defect rate ≠ 5%?” in **pset5**
* “Majority test vs. 50%” in **final practice**
### Formula
$$
z = \frac{\hat p - p_0}{\sqrt{p_0(1-p_0)/n}},
\quad
p\text{-value} = 1 - \Phi(z)\ (\text{upper-tailed}),\quad \Phi(z)\ (\text{lower-tailed}),\quad 2[1-\Phi(|z|)]\ (\text{two-sided}).
$$
### R code
```r
p_hat <- x / n
se <- sqrt(p0 * (1 - p0) / n)
z <- (p_hat - p0) / se
# One-sided p-value, upper tail (H1: p > p0); use pnorm(z) for H1: p < p0
p_value_one <- 1 - pnorm(z)
# Two-sided p-value
p_value_two <- 2 * (1 - pnorm(abs(z)))
```
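**Worked sketch:** the z-test template with numbers filled in. The counts (28 defects in 400 items) are hypothetical, and the `>= 10` large-sample check is a common rule of thumb that may differ from the one used in your course.

```r
# Hypothetical data: x = 28 defects in n = 400 items; H0: p = 0.05
x  <- 28
n  <- 400
p0 <- 0.05

# Large-sample check (rule of thumb, an assumption here)
stopifnot(n * p0 >= 10, n * (1 - p0) >= 10)

p_hat <- x / n                     # sample proportion: 0.07
se    <- sqrt(p0 * (1 - p0) / n)   # standard error under H0
z     <- (p_hat - p0) / se

p_upper <- pnorm(z, lower.tail = FALSE)            # H1: p > p0
p_lower <- pnorm(z)                                # H1: p < p0
p_two   <- 2 * pnorm(abs(z), lower.tail = FALSE)   # H1: p != p0
```

Pick the p-value that matches the alternative hypothesis stated in the problem; the standard error always uses `p0`, not `p_hat`, because the test is computed under H0.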
---
## 5. Difference in Proportions & Confidence Interval
**Use when:** comparing two independent proportions.
**Examples:**
* “Difference in click-through rates between A/B groups” in **pset6**
* “Repair rates FR vs. UK” in **review session**
### Formulas
$$
\hat d = \hat p_1 - \hat p_2,\quad
\mathrm{SE} = \sqrt{\frac{\hat p_1(1-\hat p_1)}{n_1} + \frac{\hat p_2(1-\hat p_2)}{n_2}},\quad
\text{CI: }\hat d \pm z^*\,\mathrm{SE}.
$$
### R code
```r
diff_hat <- p1_hat - p2_hat
se_diff <- sqrt(p1_hat*(1-p1_hat)/n1 + p2_hat*(1-p2_hat)/n2)
z_crit <- qnorm(1 - alpha/2)
lower <- diff_hat - z_crit * se_diff
upper <- diff_hat + z_crit * se_diff
```
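**Worked sketch:** the interval above wrapped in a small function that starts from raw counts. The function name `diff_prop_ci` and the A/B click counts are hypothetical, introduced only for illustration.

```r
# Hypothetical helper: large-sample CI for p1 - p2 from counts
diff_prop_ci <- function(x1, n1, x2, n2, conf_level = 0.95) {
  p1_hat   <- x1 / n1
  p2_hat   <- x2 / n2
  diff_hat <- p1_hat - p2_hat
  se_diff  <- sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
  z_crit   <- qnorm(1 - (1 - conf_level) / 2)
  c(estimate = diff_hat,
    lower    = diff_hat - z_crit * se_diff,
    upper    = diff_hat + z_crit * se_diff)
}

# Made-up A/B data: 120 clicks out of 1000 vs. 90 clicks out of 1000
ci <- diff_prop_ci(x1 = 120, n1 = 1000, x2 = 90, n2 = 1000)
ci
```

If the interval excludes 0, the data are consistent with a real difference at the chosen confidence level.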
---
## 6. One- and Two-Sample t-Tests & Confidence Intervals
**Use when:**
* One-sample: comparing a sample mean to a hypothesized value ($\bar x$ vs. $\mu_0$).
* Two-sample: comparing means of two independent groups with unknown variances.
**Examples:**
* “Test whether average height > 60 cm” in **pset4**
* “Compare treatment vs. control scores” in **pracfin-post**
### Formula (one-sample)
$$
t = \frac{\bar x - \mu_0}{s/\sqrt{n}},\quad df = n-1.
$$
### R code
```r
# One-sample t-test (two-sided by default; add alternative = "greater" or "less" for a one-sided test)
res1 <- t.test(x, mu = mu0)
# Two-sample Welch’s t-test
res2 <- t.test(x1, x2, var.equal = FALSE)
# Extract results
res2$statistic
res2$p.value
res2$conf.int
```
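**Worked sketch:** the one-sample formula computed by hand and checked against `t.test()`, for an "average height > 60 cm" style question. The eight height values are made up for illustration.

```r
# Made-up sample of heights (cm); H0: mu = 60 vs. H1: mu > 60
x   <- c(62, 58, 65, 61, 59, 64, 63, 60)
mu0 <- 60

# By hand: t = (x-bar - mu0) / (s / sqrt(n)), df = n - 1
n        <- length(x)
t_manual <- (mean(x) - mu0) / (sd(x) / sqrt(n))
p_manual <- pt(t_manual, df = n - 1, lower.tail = FALSE)

# Built-in equivalent; alternative = "greater" matches H1: mu > 60
res <- t.test(x, mu = mu0, alternative = "greater")

# The two approaches should agree
c(manual = t_manual, t.test = unname(res$statistic))
c(manual = p_manual, t.test = res$p.value)
```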
---
## 7. Paired (Dependent) t-Test & Confidence Interval
**Use when:** analyzing before/after or matched-pair measurements.
**Examples:**
* “Weight change pre- vs. post-program” in **final practice**
* “Reaction time difference within subjects” in **review**
### Formulas
$$
d_i = X_{i,\text{after}} - X_{i,\text{before}},\quad
\bar d = \frac{1}{n}\sum d_i,\quad
t = \frac{\bar d}{s_d/\sqrt{n}},\quad
\text{CI: }\bar d \pm t^*_{n-1}\,\frac{s_d}{\sqrt{n}}.
$$
### R code
```r
# Paired t-test
res_paired <- t.test(x_after, x_before, paired = TRUE)
# Extract
res_paired$statistic
res_paired$conf.int
res_paired$p.value
```
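**Worked sketch:** a paired t-test is the same as a one-sample t-test on the differences, which is what the formulas above describe. The before/after weights below are made up for illustration.

```r
# Made-up before/after weights (kg) for 6 subjects
x_before <- c(82, 75, 91, 68, 77, 85)
x_after  <- c(80, 74, 87, 69, 73, 82)

d <- x_after - x_before            # d_i = after - before

res_paired <- t.test(x_after, x_before, paired = TRUE)
res_diff   <- t.test(d, mu = 0)    # one-sample test on the differences

# By-hand 95% CI: d-bar +/- t* * s_d / sqrt(n)
n      <- length(d)
t_crit <- qt(0.975, df = n - 1)
ci     <- mean(d) + c(-1, 1) * t_crit * sd(d) / sqrt(n)
```

The order of arguments sets the sign: `t.test(x_after, x_before, paired = TRUE)` tests after minus before, so a negative mean difference here indicates a decrease.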
---
## 8. Correlation & Scatterplot
**Use when:** measuring and visualizing linear association between two continuous variables.
**Examples:**
* “Bill length vs. bill depth by species” in **pset3**
* “Height vs. weight correlation” in **review session**
### Formula
$$
r = \frac{\sum_i (x_i-\bar x)(y_i-\bar y)}{\sqrt{\sum (x_i-\bar x)^2 \sum (y_i-\bar y)^2}}.
$$
### R code
```r
# Scatterplot
plot(Y ~ X, data = df,
main = "Y vs X", xlab = "X", ylab = "Y")
# Pearson’s r
cor(df$X, df$Y, use = "complete.obs")
```
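**Worked sketch:** the formula for r computed term by term and compared with `cor()`. This uses R's built-in `cars` dataset (speed and stopping distance) as a stand-in, since the p-set data are not reproduced here.

```r
# Built-in dataset `cars`: speed and stopping distance
x <- cars$speed
y <- cars$dist

# By hand: sum of cross-products over the root of the product of sums of squares
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# Built-in Pearson correlation
r_builtin <- cor(x, y)

c(manual = r_manual, cor = r_builtin)
```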
---
## 9. Quantiles & Interquartile Range
**Use when:** summarizing a distribution with its median and quartiles, or flagging potential outliers.
**Examples:**
* “Find Q1, median, Q3 of exam scores” in **pset2**
* “Compute IQR for glucose levels” in **final review**
### R code
```r
# 25th, 50th, 75th percentiles
quantile(df$var, probs = c(0.25, 0.5, 0.75))
# Interquartile range IQR = Q3 - Q1
IQR(df$var)
```
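**Worked sketch:** flagging outliers with the 1.5 × IQR rule. This rule is a common convention and is assumed here; check which rule your course uses. The data vector is made up for illustration.

```r
# Made-up scores with one unusually large value
x <- c(2, 4, 5, 5, 6, 7, 8, 9, 30)

q   <- quantile(x, probs = c(0.25, 0.75))   # Q1 and Q3 (R's default method)
iqr <- IQR(x)                               # Q3 - Q1

# Fences: values beyond these are flagged as potential outliers
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[2]] + 1.5 * iqr

outliers <- x[x < lower_fence | x > upper_fence]
outliers
```

`quantile()` supports several interpolation methods through its `type` argument, so hand calculations from a textbook can differ slightly from R's default.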
---
## 10. χ² Goodness-of-Fit
**Use when:** testing whether observed category counts match expected proportions.
**Examples:**
* “Do color frequencies fit a uniform model?” in **pset5**
* “Expected vs. observed genotype counts” in **final practice**
### Formula
$$
\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i},\quad
df = k - 1,\quad
p\text{-value} = 1 - F_{\chi^2}(\chi^2; df).
$$
### R code
```r
obs <- c(O1, O2, …)
p_exp <- c(p1, p2, …)
chi_res <- chisq.test(x = obs, p = p_exp, rescale.p = FALSE)
# Critical χ² value
qchisq(1 - alpha, df = length(obs) - 1)
```
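**Worked sketch:** the χ² statistic computed directly from the formula and compared with `chisq.test()`. The observed counts and the uniform null model are made up for illustration.

```r
# Made-up counts for 4 categories; H0: all categories equally likely
obs   <- c(18, 22, 20, 40)
p_exp <- rep(1 / 4, 4)

expected <- sum(obs) * p_exp                    # E_i = n * p_i = 25 each
chi_sq   <- sum((obs - expected)^2 / expected)  # sum of (O - E)^2 / E
df_chi   <- length(obs) - 1                     # k - 1
p_val    <- pchisq(chi_sq, df = df_chi, lower.tail = FALSE)

# Built-in test should give the same statistic and p-value
chi_res <- chisq.test(x = obs, p = p_exp)

c(manual = chi_sq, chisq.test = unname(chi_res$statistic))
c(manual = p_val,  chisq.test = chi_res$p.value)
```

`chi_res$expected` returns the expected counts, which is a quick way to confirm that each is large enough for the χ² approximation.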
---
## 11. Linear Regression & Diagnostics
**Use when:** modeling one continuous response against one or more predictors, testing slopes, and making predictions.
**Examples:**
* “Predict credit card default from income” in **pset6**
* “Fit OBP ~ position in baseball data” in **pset1**
### Formulas
$$
\hat Y = \hat\beta_0 + \hat\beta_1 X_1 + \cdots,\quad
t_j = \frac{\hat\beta_j}{\mathrm{SE}(\hat\beta_j)}.
$$
### R code
```r
# Fit linear model
mod <- lm(Y ~ X1 + X2 + ..., data = df)
# Coefficients & summary
coef(summary(mod))
# Confidence interval for slope X1
confint(mod, "X1", level = 0.95)
# Predictions at new data
new_data <- data.frame(X1 = x0, X2 = x2, ...)
predict(mod, new_data, interval = "confidence")
predict(mod, new_data, interval = "prediction")
# Diagnostic plots
par(mfrow = c(1, 2))
plot(mod, which = 1) # Residuals vs Fitted
plot(mod, which = 2) # Normal Q–Q
```
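**Worked sketch:** the regression template run end to end on R's built-in `mtcars` data, used here only as a stand-in for the p-set datasets. The choice of predictors (`wt`, `hp`) and the new data point are arbitrary illustrations.

```r
# Fit mpg on weight and horsepower using built-in data
mod <- lm(mpg ~ wt + hp, data = mtcars)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coef_tab <- coef(summary(mod))

# t_j = estimate / SE, which should match the "t value" column
t_manual <- coef_tab[, "Estimate"] / coef_tab[, "Std. Error"]

# Intervals at one new observation (arbitrary values)
new_data <- data.frame(wt = 3, hp = 120)
ci_mean  <- predict(mod, newdata = new_data, interval = "confidence")
pi_new   <- predict(mod, newdata = new_data, interval = "prediction")

rbind(confidence = ci_mean[1, ], prediction = pi_new[1, ])
```

Both intervals share the same fitted value, but the prediction interval is wider because it also accounts for the variability of an individual observation around the mean response.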
---
## 12. One-Way ANOVA & Post-Hoc
**Use when:** comparing means across three or more independent groups.
**Examples:**
* “Compare mean OBP across positions” in **pset1**
* “Test differences in repairability scores by country” in **review**
### Formula
$$
F = \frac{\text{MS}_{\text{Between}}}{\text{MS}_{\text{Within}}},\quad
df_1 = k-1,\quad
df_2 = N-k.
$$
### R code
```r
# Fit ANOVA
aov_mod <- aov(Y ~ Group, data = df)
# ANOVA table
summary(aov_mod)
# Post-hoc Tukey
TukeyHSD(aov_mod, "Group", conf.level = 0.95)
```
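**Worked sketch:** the F ratio and its degrees of freedom recovered from the ANOVA table, using R's built-in `PlantGrowth` data (three groups of plant weights) as a stand-in for the p-set data.

```r
# One-way ANOVA on built-in data: weight by group (ctrl, trt1, trt2)
aov_mod <- aov(weight ~ group, data = PlantGrowth)
tab     <- summary(aov_mod)[[1]]   # ANOVA table as a data frame

k <- nlevels(PlantGrowth$group)    # number of groups
N <- nrow(PlantGrowth)             # total observations

# F = MS_Between / MS_Within, with df1 = k - 1 and df2 = N - k
f_manual <- tab[["Mean Sq"]][1] / tab[["Mean Sq"]][2]
p_manual <- pf(f_manual, df1 = k - 1, df2 = N - k, lower.tail = FALSE)

# Pairwise comparisons, only meaningful if the overall F-test is significant
TukeyHSD(aov_mod, "group", conf.level = 0.95)
```

The grouping variable must be a factor for `TukeyHSD()`; if it is stored as text or numbers, convert it first with `factor()`.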
---
## 13. Random Sampling & Group Proportions
**Use when:** assigning treatment/control or computing proportions by group.
**Examples:**
* “Simulate random assignment in an experiment” in **pset4**
* “Proportion of smokers vs. non-smokers in treatment” in **final practice**
### R code
```r
# Random assignment
set.seed(2025)
treat_ids <- sample(df$id, size = floor(nrow(df)/2), replace = FALSE)
df$group <- ifelse(df$id %in% treat_ids, "Treatment", "Control")
# Proportions by group
prop.table(table(df$group, df$factor), margin = 1)
```
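**Worked sketch:** the assignment template on a tiny made-up roster, so the output can be inspected. The data frame and its `smoker` column are hypothetical.

```r
set.seed(2025)  # makes the random assignment reproducible

# Made-up roster: 20 subjects, 8 smokers and 12 non-smokers
df <- data.frame(id     = 1:20,
                 smoker = rep(c("yes", "no"), times = c(8, 12)))

# Randomly pick half of the ids for treatment
treat_ids <- sample(df$id, size = floor(nrow(df) / 2), replace = FALSE)
df$group  <- ifelse(df$id %in% treat_ids, "Treatment", "Control")

table(df$group)   # group sizes

# Row proportions: share of smokers within each group
row_props <- prop.table(table(df$group, df$smoker), margin = 1)
row_props
```

With `margin = 1` each row sums to 1, so the table reads as "within this group, what fraction are smokers?"; use `margin = 2` to condition on the columns instead.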
---
## 14. Exploratory Plots
**Use when:** quickly visualizing distributions or group comparisons.
**Examples:**
* “Histogram of exam scores” in **pset2**
* “Boxplot of glucose by treatment” in **final review**
* “Barplot of favorite ice-cream flavors” in **pset3**
```r
# Histogram
hist(df$var, main = "Histogram of var", xlab = "var")
# Boxplot
boxplot(var ~ group, data = df,
main = "var by Group", xlab = "Group", ylab = "var")
# Barplot
barplot(table(df$factor), main = "Counts of factor", ylab = "Frequency")
```
---
## 15. Model Comparison via AIC
**Use when:** comparing the relative quality of two or more fitted models.
**Examples:**
* “Compare simple vs. full climate models” in **pracfinsol**
* “Decide whether to include interaction term” in **final review**
```r
AIC(mod_base, mod_full, mod_other)
```
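**Worked sketch:** comparing a main-effects model with one that adds an interaction term, using the built-in `mtcars` data as a stand-in. The model formulas are arbitrary illustrations, not the course's climate models.

```r
# Two candidate models fit to the same data
mod_base <- lm(mpg ~ wt + hp, data = mtcars)   # main effects only
mod_full <- lm(mpg ~ wt * hp, data = mtcars)   # adds the wt:hp interaction

# One row per model: number of parameters (df) and AIC
aic_tab <- AIC(mod_base, mod_full)
aic_tab

# AIC = -2 * log-likelihood + 2 * (number of estimated parameters)
ll         <- logLik(mod_base)
aic_manual <- -2 * as.numeric(ll) + 2 * attr(ll, "df")
```

Lower AIC indicates the preferred model, and only differences between models fit to the same response and the same rows are meaningful.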
---
## 16. Proportion Tests with `prop.test()`
**Use when:** conducting one- or two-sample proportion hypothesis tests via a built-in function.
**Examples:**
* “Test difference in pass rates between classes” in **pset5**
* “Macy’s vs. Filene’s theft rates” in **pracfin-post**
```r
# One-sample proportion test
prop.test(x = successes, n = n, p = p0,
alternative = "greater", correct = FALSE)
# Two-sample proportion test
prop.test(x = c(x1, x2), n = c(n1, n2),
alternative = "two.sided", correct = FALSE)
```
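**Worked sketch:** with `correct = FALSE`, the one-sample `prop.test()` reproduces the hand-computed z-test: its `X-squared` statistic is z². The counts below are hypothetical.

```r
# Hypothetical data: 28 successes in 400 trials; H0: p = 0.05, H1: p > 0.05
x  <- 28
n  <- 400
p0 <- 0.05

# Hand-computed z-test
z        <- (x / n - p0) / sqrt(p0 * (1 - p0) / n)
p_manual <- pnorm(z, lower.tail = FALSE)

# Built-in test without continuity correction
res <- prop.test(x = x, n = n, p = p0,
                 alternative = "greater", correct = FALSE)

c(z_squared = z^2,      X_squared = unname(res$statistic))
c(manual    = p_manual, prop.test = res$p.value)
```

Leaving `correct = TRUE` (the default) applies a continuity correction, so the results will no longer match the hand formula exactly.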
---
## 17. Nonparametric Tests
**Use when:** data strongly violate normality or equal-variance assumptions.
**Examples:**
* “Compare medians of skewed distributions” (no direct p-set example but useful fallback)
```r
# Kruskal–Wallis for ≥3 groups
kruskal.test(Y ~ Group, data = df)
# Wilcoxon signed-rank for paired data
wilcox.test(x_after, x_before, paired = TRUE)
# Wilcoxon rank-sum for two independent samples
wilcox.test(x1, x2)
```
---
## 18. Critical Values Reference
* **Normal (Z):**
```r
qnorm(1 - alpha/2)
```
* **t-Distribution:**
```r
qt((1 + conf.level)/2, df = n - 1)
```
* **Chi-Square:**
```r
qchisq(1 - alpha, df = k - 1)
```
---
### How to Use This Toolbox
1. **Identify** the question type on your exam.
2. **Locate** the matching workflow above.
3. **Recall** its formula and R template.
4. **Plug in** your data values ($n$, $\bar x$, $s$, $p$, etc.) and **run** the code.