# R Toolbox for Introductory Statistics

Each workflow below includes:

1. **Use when:** a general description of the task **and** concrete p-set examples
2. **Key formulas** (if applicable)
3. **Clean R code template**

---

## 1. Basic Data & Vector Manipulation

**Use when:** importing data, creating or transforming simple vectors, and getting quick summaries.

**Examples:**

* Loading **penguins.csv** to analyze body masses
* Converting **body\_mass\_g → body\_mass\_lb** for weight comparisons
* Subsetting the first six bill lengths in **pset1**
* Summarizing quiz scores in **pset2**

```r
# 1. Read a CSV file
df <- read.csv("path/to/file.csv")

# 2. Create simple sequences and transform vectors
v  <- 1:N
v2 <- a * v + b

# 3. Convert units (example: grams → pounds)
body_mass_lb <- (body_mass_g / 1000) * 2.205

# 4. Summaries and tables
summary(df$var)             # five-number summary + mean
mean(df$var, na.rm = TRUE)  # mean excluding NAs
table(df$group)             # counts by category
head(df$var, 6)             # first six values
```

---

## 2. Binomial Distribution

**Use when:** computing probabilities and moments for independent “yes/no” trials.

**Examples:**

* “Exact chance exactly 25 of 80 customers return” in **pset1**
* “Probability at least 3 defectives in 20” in **pset3**
* “Expected heads in 100 coin flips”

### Formulas

$$
P(X = k) = \binom{n}{k}\,p^k\,(1-p)^{n-k}, \quad E[X] = np, \quad \mathrm{SD}(X) = \sqrt{np(1-p)}.
$$

### R code

```r
# PMF: P(X = k)
dbinom(k, size = n, prob = p)

# CDF: P(X ≤ k)
pbinom(k, size = n, prob = p)

# Upper tail: P(X ≥ k)
1 - pbinom(k - 1, size = n, prob = p)

# Quantiles: smallest k with P(X ≤ k) ≥ α
qbinom(alpha, size = n, prob = p)

# Mean & SD of X
mu    <- n * p
sigma <- sqrt(n * p * (1 - p))

# For sample proportion p̂ = X/n
mu_p    <- p
sigma_p <- sqrt(p * (1 - p) / n)
```

---

## 3. Normal Distribution & CLT

**Use when:** finding probabilities or quantiles for a normal variable or a sample mean via the Central Limit Theorem.
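
As a minimal sketch of the sample-mean case, the numbers below are made up for illustration (they do not come from any p-set): a population with mean 100 and SD 15, and a sample of size 36. Under the CLT the sample mean is treated as approximately normal with standard error $\sigma/\sqrt{n}$.

```r
# Hypothetical values, chosen only for illustration
mu    <- 100   # population mean
sigma <- 15    # population SD
n     <- 36    # sample size

# Standard error of the sample mean
se_mean <- sigma / sqrt(n)   # 15 / 6 = 2.5

# P(X̄ > 105): upper-tail probability for the sample mean
p_upper <- 1 - pnorm(105, mean = mu, sd = se_mean)

# Same probability via the standardized z-score
z         <- (105 - mu) / se_mean   # (105 - 100) / 2.5 = 2
p_upper_z <- 1 - pnorm(z)

p_upper
p_upper_z
```

Both lines print the same probability, which shows that passing `sd = sigma/sqrt(n)` to `pnorm()` is equivalent to standardizing by hand.
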

**Examples:**

* “Probability a 2-minute call exceeds 5 min” in **pset4**
* “Cutoff for fastest 5% of race times” in **final review**

### Formulas

$$
P(a < X < b) = \Phi\!\Bigl(\tfrac{b-\mu}{\sigma}\Bigr)\;-\;\Phi\!\Bigl(\tfrac{a-\mu}{\sigma}\Bigr), \quad q_\alpha = \mu + \sigma\,\Phi^{-1}(\alpha).
$$

### R code

```r
# P(a < X < b)
pnorm(b, mean = mu, sd = sigma) - pnorm(a, mean = mu, sd = sigma)

# P(X > c)
1 - pnorm(c, mean = mu, sd = sigma)

# Quantile q such that P(X ≤ q) = α
qnorm(alpha, mean = mu, sd = sigma)

# For sample mean X̄ ∼ N(µ, σ²/n)
pnorm(b, mean = mu, sd = sigma/sqrt(n)) - pnorm(a, mean = mu, sd = sigma/sqrt(n))
```

---

## 4. One-Sample Z-Test for Proportion

**Use when:** testing a single population proportion in large samples.

**Examples:**

* “Is the true defect rate ≠ 5%?” in **pset5**
* “Majority test vs. 50%” in **final practice**

### Formula

$$
z = \frac{\hat p - p_0}{\sqrt{p_0(1-p_0)/n}}, \quad p\text{-value} = 1 - \Phi(z)\ \text{(upper-tailed)} \quad \text{or} \quad 2\,[1-\Phi(|z|)]\ \text{(two-sided)}.
$$

### R code

```r
p_hat <- x / n
se    <- sqrt(p0 * (1 - p0) / n)
z     <- (p_hat - p0) / se

# One-sided p-value, upper tail (H1: p > p0)
p_value_one <- 1 - pnorm(z)
# For a lower-tailed test (H1: p < p0), use pnorm(z) instead

# Two-sided p-value
p_value_two <- 2 * (1 - pnorm(abs(z)))
```

---

## 5. Difference in Proportions & Confidence Interval

**Use when:** comparing two independent proportions.

**Examples:**

* “Difference in click-through rates between A/B groups” in **pset6**
* “Repair rates FR vs. UK” in **review session**

### Formulas

$$
\hat d = \hat p_1 - \hat p_2,\quad \mathrm{SE} = \sqrt{\frac{\hat p_1(1-\hat p_1)}{n_1} + \frac{\hat p_2(1-\hat p_2)}{n_2}},\quad \text{CI: }\hat d \pm z^*\,\mathrm{SE}.
$$

### R code

```r
diff_hat <- p1_hat - p2_hat
se_diff  <- sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
z_crit   <- qnorm(1 - alpha/2)
lower    <- diff_hat - z_crit * se_diff
upper    <- diff_hat + z_crit * se_diff
```

---

## 6. One- and Two-Sample t-Tests & Confidence Intervals

**Use when:**

* One-sample: comparing a mean to a known value ($\bar x$ vs. $\mu_0$).
* Two-sample: comparing means of two independent groups with unknown variances.

**Examples:**

* “Test whether average height > 60 cm” in **pset4**
* “Compare treatment vs. control scores” in **pracfin-post**

### Formula (one-sample)

$$
t = \frac{\bar x - \mu_0}{s/\sqrt{n}},\quad df = n-1.
$$

### R code

```r
# One-sample t-test (two-sided by default;
# add alternative = "greater" or "less" for a one-sided test)
res1 <- t.test(x, mu = mu0)

# Two-sample Welch’s t-test (does not assume equal variances)
res2 <- t.test(x1, x2, var.equal = FALSE)

# Extract results
res2$statistic
res2$p.value
res2$conf.int
```

---

## 7. Paired (Dependent) t-Test & Confidence Interval

**Use when:** analyzing before/after or matched-pair measurements.

**Examples:**

* “Weight change pre- vs. post-program” in **final practice**
* “Reaction time difference within subjects” in **review**

### Formulas

$$
d_i = X_{i,\text{after}} - X_{i,\text{before}},\quad \bar d = \frac{1}{n}\sum d_i,\quad t = \frac{\bar d}{s_d/\sqrt{n}},\quad \text{CI: }\bar d \pm t^*_{n-1}\,\frac{s_d}{\sqrt{n}}.
$$

### R code

```r
# Paired t-test (differences are computed as x_after - x_before)
res_paired <- t.test(x_after, x_before, paired = TRUE)

# Extract
res_paired$statistic
res_paired$conf.int
res_paired$p.value
```

---

## 8. Correlation & Scatterplot

**Use when:** measuring and visualizing linear association between two continuous variables.

**Examples:**

* “Bill length vs. bill depth by species” in **pset3**
* “Height vs. weight correlation” in **review session**

### Formula

$$
r = \frac{\sum_i (x_i-\bar x)(y_i-\bar y)}{\sqrt{\sum_i (x_i-\bar x)^2 \sum_i (y_i-\bar y)^2}}.
$$

### R code

```r
# Scatterplot
plot(Y ~ X, data = df, main = "Y vs X", xlab = "X", ylab = "Y")

# Pearson’s r (rows with missing values are dropped)
cor(df$X, df$Y, use = "complete.obs")
```

---

## 9. Quantiles & Interquartile Range

**Use when:** summarizing distributions with medians, quartiles, and outlier detection.
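
For the outlier-detection use, one common convention is the 1.5 × IQR rule; this is an assumption here, since the toolbox itself does not name a rule. A minimal sketch with made-up scores:

```r
# Made-up exam scores for illustration; 20 is a deliberately low value
scores <- c(20, 61, 64, 68, 70, 72, 75, 78, 80, 83, 95)

# Quartiles and IQR (R's default quantile method)
q   <- quantile(scores, probs = c(0.25, 0.75))
iqr <- IQR(scores)

# Fences under the assumed 1.5 * IQR rule
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[2]] + 1.5 * iqr

# Values outside the fences are flagged as potential outliers
outliers <- scores[scores < lower_fence | scores > upper_fence]
outliers
```

Note that `boxplot()` computes its whiskers from hinges, which can differ slightly from `quantile()` in small samples.
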

**Examples:**

* “Find Q1, median, Q3 of exam scores” in **pset2**
* “Compute IQR for glucose levels” in **final review**

### R code

```r
# 25th, 50th, 75th percentiles
quantile(df$var, probs = c(0.25, 0.5, 0.75))

# Interquartile range: IQR = Q3 - Q1
IQR(df$var)
```

---

## 10. χ² Goodness-of-Fit

**Use when:** testing whether observed category counts match expected proportions.

**Examples:**

* “Do color frequencies fit a uniform model?” in **pset5**
* “Expected vs. observed genotype counts” in **final practice**

### Formula

$$
\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i},\quad df = k - 1,\quad p\text{-value} = 1 - F_{\chi^2}(\chi^2; df).
$$

### R code

```r
# Observed counts and expected proportions (p_exp must sum to 1)
obs   <- c(O1, O2, …)
p_exp <- c(p1, p2, …)
chi_res <- chisq.test(x = obs, p = p_exp, rescale.p = FALSE)

# Critical χ² value
qchisq(1 - alpha, df = length(obs) - 1)
```

---

## 11. Linear Regression & Diagnostics

**Use when:** modeling one continuous response against one or more predictors, testing slopes, and making predictions.

**Examples:**

* “Predict credit card default from income” in **pset6**
* “Fit OBP \~ position in baseball data” in **pset1**

### Formulas

$$
\hat Y = \hat\beta_0 + \hat\beta_1 X_1 + \cdots,\quad t_j = \frac{\hat\beta_j}{\mathrm{SE}(\hat\beta_j)}.
$$

### R code

```r
# Fit linear model
mod <- lm(Y ~ X1 + X2 + ..., data = df)

# Coefficient table (estimates, SEs, t-values, p-values)
coef(summary(mod))

# Confidence interval for slope X1
confint(mod, "X1", level = 0.95)

# Predictions at new data
new_data <- data.frame(X1 = x0, X2 = x2, ...)
predict(mod, new_data, interval = "confidence")  # interval for the mean response
predict(mod, new_data, interval = "prediction")  # interval for a new observation

# Diagnostic plots
par(mfrow = c(1, 2))
plot(mod, which = 1)  # Residuals vs Fitted
plot(mod, which = 2)  # Normal Q–Q
```

---

## 12. One-Way ANOVA & Post-Hoc

**Use when:** comparing means across three or more independent groups.
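
A minimal runnable sketch, using R's built-in `PlantGrowth` data (plant weights in three groups) as a stand-in, since no p-set data is included here. It checks that the degrees of freedom equal $k-1$ and $N-k$ and that the F statistic is the ratio of the two mean squares.

```r
# PlantGrowth ships with base R: weight (numeric), group (factor: ctrl, trt1, trt2)
data(PlantGrowth)

fit <- aov(weight ~ group, data = PlantGrowth)
tab <- summary(fit)[[1]]          # ANOVA table as a data frame

k <- nlevels(PlantGrowth$group)   # number of groups
N <- nrow(PlantGrowth)            # total sample size

# Degrees of freedom: between-groups, then within-groups
tab$Df
c(k - 1, N - k)

# F statistic recomputed from the mean squares
F_manual <- tab$`Mean Sq`[1] / tab$`Mean Sq`[2]
c(manual = F_manual, table = tab$`F value`[1])
```
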

**Examples:**

* “Compare mean OBP across positions” in **pset1**
* “Test differences in repairability scores by country” in **review**

### Formula

$$
F = \frac{\text{MS}_{\text{Between}}}{\text{MS}_{\text{Within}}},\quad df_1 = k-1,\quad df_2 = N-k.
$$

### R code

```r
# Fit ANOVA (Group should be a factor)
aov_mod <- aov(Y ~ Group, data = df)

# ANOVA table
summary(aov_mod)

# Post-hoc Tukey
TukeyHSD(aov_mod, "Group", conf.level = 0.95)
```

---

## 13. Random Sampling & Group Proportions

**Use when:** assigning treatment/control or computing proportions by group.

**Examples:**

* “Simulate random assignment in an experiment” in **pset4**
* “Proportion of smokers vs. non-smokers in treatment” in **final practice**

### R code

```r
# Random assignment (set.seed makes the draw reproducible)
set.seed(2025)
treat_ids <- sample(df$id, size = floor(nrow(df)/2), replace = FALSE)
df$group  <- ifelse(df$id %in% treat_ids, "Treatment", "Control")

# Proportions by group (margin = 1: each row sums to 1)
prop.table(table(df$group, df$factor), margin = 1)
```

---

## 14. Exploratory Plots

**Use when:** quickly visualizing distributions or group comparisons.

**Examples:**

* “Histogram of exam scores” in **pset2**
* “Boxplot of glucose by treatment” in **final review**
* “Barplot of favorite ice-cream flavors” in **pset3**

```r
# Histogram
hist(df$var, main = "Histogram of var", xlab = "var")

# Boxplot
boxplot(var ~ group, data = df, main = "var by Group", xlab = "Group", ylab = "var")

# Barplot
barplot(table(df$factor), main = "Counts of factor", ylab = "Frequency")
```

---

## 15. Model Comparison via AIC

**Use when:** comparing the relative quality of two or more fitted models.

**Examples:**

* “Compare simple vs. full climate models” in **pracfinsol**
* “Decide whether to include interaction term” in **final review**

```r
# Lower AIC indicates the preferred model among those compared
AIC(mod_base, mod_full, mod_other)
```

---

## 16. Proportion Tests with `prop.test()`

**Use when:** conducting one- or two-sample proportion hypothesis tests via a built-in function.
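
A small sketch of how `prop.test()` with `correct = FALSE` relates to a hand-computed one-sample z-test; the counts are made up for illustration. For a one-sample test, the reported `X-squared` statistic is the square of the z statistic, and the two-sided p-values agree.

```r
# Made-up data: 58 successes in 100 trials, null value p0 = 0.5
x  <- 58
n  <- 100
p0 <- 0.5

# Hand-computed z-test
p_hat    <- x / n
z        <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
p_manual <- 2 * (1 - pnorm(abs(z)))

# Built-in test without continuity correction
res <- prop.test(x = x, n = n, p = p0,
                 alternative = "two.sided", correct = FALSE)

# Compare the two approaches
c(z_squared = z^2,      x_squared = unname(res$statistic))
c(manual_p  = p_manual, builtin_p = res$p.value)
```

With the default `correct = TRUE`, `prop.test()` applies a continuity correction and the numbers no longer match the hand calculation exactly.
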

**Examples:**

* “Test difference in pass rates between classes” in **pset5**
* “Macy’s vs. Filene’s theft rates” in **pracfin-post**

```r
# One-sample proportion test
prop.test(x = successes, n = n, p = p0, alternative = "greater", correct = FALSE)

# Two-sample proportion test
prop.test(x = c(x1, x2), n = c(n1, n2), alternative = "two.sided", correct = FALSE)
```

---

## 17. Nonparametric Tests

**Use when:** data strongly violate normality or equal-variance assumptions.

**Examples:**

* “Compare medians of skewed distributions” (no direct p-set example but useful fallback)

```r
# Kruskal–Wallis for ≥3 groups
kruskal.test(Y ~ Group, data = df)

# Wilcoxon signed-rank for paired data
wilcox.test(x_after, x_before, paired = TRUE)

# Wilcoxon rank-sum for two independent samples
wilcox.test(x1, x2)
```

---

## 18. Critical Values Reference

* **Normal (Z):**

  ```r
  qnorm(1 - alpha/2)
  ```

* **t-Distribution:**

  ```r
  qt((1 + conf.level)/2, df = n - 1)
  ```

* **Chi-Square:**

  ```r
  qchisq(1 - alpha, df = k - 1)
  ```

---

### How to Use This Toolbox

1. **Identify** the question type on your exam.
2. **Locate** the matching workflow above.
3. **Recall** its formula and R template.
4. **Plug in** your data values ($n$, $\bar x$, $s$, $p$, etc.) and **run** the code.
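
The four steps can be sketched end to end on a question like “Test whether average height > 60 cm”. The data vector below is made up for illustration and is not the p-set data; the point is only that the hand formula and `t.test()` agree.

```r
# Made-up sample of heights (cm), for illustration only
x   <- c(61.2, 58.9, 63.4, 60.1, 62.7, 59.8, 64.0, 61.5)
mu0 <- 60

# Steps 1-2: one mean compared against a fixed value -> one-sample t-test
# Step 3: t = (x_bar - mu0) / (s / sqrt(n)), df = n - 1
n      <- length(x)
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))
p_val  <- 1 - pt(t_stat, df = n - 1)   # upper-tailed alternative

# Step 4: plug the same values into the built-in template
res <- t.test(x, mu = mu0, alternative = "greater")

c(manual = t_stat, builtin = unname(res$statistic))
c(manual = p_val,  builtin = res$p.value)
```
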