# Data Analysis 101
## Sanity check
1. Is back-testing reliable? Use cross-validation.
1. Can I interpret the model?
1. Is the result significant?
1. Overfit/underfit
1. Consider variable transforms and always scale
1. Bias-variance tradeoff
## Loss functions
Consider $N$ data points $\{x_i,y_i\}_{i=1}^N$ and predictions $\{\hat y_i\}_{i=1}^N$. For regression with $y_i \in \mathbb{R}$:
- $L_1$ (Lasso) loss $\propto \sum_i \lvert \hat y_i - y_i \rvert$ penalizes all errors linearly, so it is more robust to outliers than $L_2$; used as a penalty on coefficients (Lasso), it encourages them to be exactly zero (sparsity).
- $L_2$ loss $\propto \sum_i (\hat y_i - y_i)^2$ penalizes large errors heavily.
- Huber loss $\propto \begin{cases} 0.5\,(\hat y_i - y_i)^2 & \lvert \hat y_i - y_i \rvert < \delta \\ \delta\,(\lvert \hat y_i - y_i \rvert - 0.5\,\delta) & \text{otherwise} \end{cases}$ incurs $L_2$ loss for small errors and $L_1$ loss for large errors, with the threshold set by the parameter $\delta$.
For classification:
- Binary cross entropy $\propto -\sum_i \left[ y_i \log P(\hat y_i) + (1 - y_i) \log(1 - P(\hat y_i)) \right]$ for $y_i \in \{0,1\}$, where $P(\hat y_i)$ is the predicted probability of class $1$, penalizes misclassification by the log probability of the correct class.
- Multiclass cross entropy $\propto -\sum_i t_i \cdot \log(P(\hat y_i))$ for $y_i \in \{1,\ldots,K\}$, with $t_i$ the one-hot encoding of $y_i$ and $P(\hat y_i)$ the vector of predicted class probabilities, penalizes misclassification by the log probability of the correct class, regardless of the probabilities of the other classes.
- Hinge loss $\propto \sum_i \max(0, 1 - \hat y_i \cdot y_i)$ for $y_i \in \{-1,+1\}$ and raw score $\hat y_i$, used as the misclassification penalty term in the soft-margin SVM objective (see the sketch below).
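A minimal NumPy sketch of (most of) these losses; the function names and toy arrays are made up for illustration:
```python
import numpy as np

def l1_loss(y_hat, y):
    # sum of absolute errors
    return np.sum(np.abs(y_hat - y))

def l2_loss(y_hat, y):
    # sum of squared errors; large errors dominate
    return np.sum((y_hat - y) ** 2)

def huber_loss(y_hat, y, delta=1.0):
    # quadratic for small errors, linear for large errors
    err = np.abs(y_hat - y)
    quad = 0.5 * err ** 2
    lin = delta * (err - 0.5 * delta)
    return np.sum(np.where(err < delta, quad, lin))

def binary_cross_entropy(p_hat, y, eps=1e-12):
    # y in {0,1}, p_hat = predicted probability of class 1
    p_hat = np.clip(p_hat, eps, 1 - eps)
    return -np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

def hinge_loss(score, y):
    # y in {-1,+1}, score = raw (signed) classifier output
    return np.sum(np.maximum(0.0, 1.0 - y * score))

y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.1, 1.8, 3.5])
print(l1_loss(y_hat, y), l2_loss(y_hat, y), huber_loss(y_hat, y, delta=0.3))
```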
## Analysis of variance (ANOVA)
Compare the means of a continuous variable across $\geq 2$ groups using an $F$ test (use Student's $t$ test if there are just $2$ groups).
Assumptions:
- Equal variances between groups (homoscedasticity). Check with Levene's, Bartlett's, or the Brown-Forsythe test.
- Residuals are normal. Check with the Shapiro-Wilk test or a histogram.
- Data points are independent.
Hypotheses:
- Null: All group means are equal.
- Alternative: At least one group mean is different.
Steps:
Let $X^1, \ldots, X^N$ be $N$ groups of $M$ observations of a variable. We denote
- Overall mean $\mu = \frac{1}{MN}\sum_{n=1}^N \sum_{m=1}^M X^n_m$.
- Group mean $\mu^n = \frac{1}{M} \sum_{m=1}^M X^n_m$ for $n=1,\ldots,N$.
- Between-group sum-of-squares (SS) $SSB = M \sum_{n=1}^N (\mu^n-\mu)^2$.
- Degrees of freedom (DOF) for $SSB$: $DOF_{SSB} = N - 1$.
- Within-group SS $SSW^n = \sum_{m=1}^M (X^n_m-\mu^n)^2$ for $n=1,\ldots,N$, then $SSW = \sum_{n=1}^N SSW^n$.
- DOF for $SSW$: $DOF_{SSW} = N (M - 1)$.
- $F$-statistic $F = \frac{SSB/DOF_{SSB}}{SSW/DOF_{SSW}}$.
- Check the $p$-value against the $F$-distribution with $(DOF_{SSB}, DOF_{SSW})$ degrees of freedom.
We can decompose each data point as
\begin{align}X^n_m &= \mu + (\mu^n - \mu) + (X^n_m - \mu^n)\\
&=\mu + \alpha^n + \epsilon^n_m.
\end{align}
We assume that $\epsilon^n_m \overset{\text{iid}}{\sim} N(0,\sigma^2)$ for all $n$ and $m$, with a common variance $\sigma^2$. The null hypothesis states that $\alpha^1 = \alpha^2 = \cdots = \alpha^N = 0$.
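A minimal sketch of the steps above on synthetic equal-sized groups, cross-checked against `scipy.stats.f_oneway` (the group data are made up):
```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# N groups of M observations each (synthetic data)
groups = [rng.normal(loc=mu, scale=1.0, size=20) for mu in (0.0, 0.3, 1.0)]
N, M = len(groups), len(groups[0])

mu_all = np.mean(np.concatenate(groups))           # overall mean
mu_grp = np.array([g.mean() for g in groups])      # group means

ssb = M * np.sum((mu_grp - mu_all) ** 2)           # between-group SS
ssw = sum(np.sum((g - m) ** 2) for g, m in zip(groups, mu_grp))  # within-group SS
dof_b, dof_w = N - 1, N * (M - 1)

F = (ssb / dof_b) / (ssw / dof_w)
p = stats.f.sf(F, dof_b, dof_w)                    # upper-tail p-value

# should agree with scipy's implementation
F_sp, p_sp = stats.f_oneway(*groups)
print(F, p, F_sp, p_sp)
```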
## Classification result evaluation
### Confusion Matrix
For binary classification, it looks like this.
| | Pred no | Pred yes |
| -------- | -------- | -------- |
| Actual no | TN | FP |
| Actual yes | FN | TP |
A perfect classifier will have 0 off-diagonal entries. We can summarise the matrix using the *precision*.
$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}$.
It is the accuracy over all positive predictions; it is okay not to capture all positive truths (type II errors). We can trivially get $\text{Precision}=1$ by making only one correct positive prediction, since there is then no false positive. Indeed, we are not looking at the negative predictions at all! Therefore, we also consider the *recall* or *sensitivity*.
$\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}$.
It is the accuracy over all positive truths; it is okay to predict positive when the truth is negative (type I errors). We can combine the two as a harmonic mean called the $F_1$ score.
$F_1 = \frac{2}{1/\text{Precision} + 1/\text{Recall}} = \frac{\text{TP}}{\text{TP} + \frac{\text{FN} +\text{FP}}{2}}$.
$F_1$ is high only when both precision and recall are high, since the harmonic mean is dominated by the smaller of the two. Indeed, the two quantities trade off, and we may prioritise one over the other depending on the application. We can plot precision vs recall to observe the trade-off.
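For example, with scikit-learn's metrics (the label arrays here are made up):
```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0, 0, 1])

# rows = actual, columns = predicted: [[TN, FP], [FN, TP]]
print(confusion_matrix(y_true, y_pred))
print(precision_score(y_true, y_pred),
      recall_score(y_true, y_pred),
      f1_score(y_true, y_pred))
```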
### The receiver operating characteristic (ROC) curve
ROC plots the true positive rate (recall) against the false positive rate, which equals $1 -$ *specificity* (the true negative rate). As recall increases we typically admit more false positives, so specificity drops. To compare two curves, we can measure the area under the curve (AUC). A perfect classifier has an AUC of $1$ while a coin flip produces an AUC of $0.5$.
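A short scikit-learn sketch, assuming the classifier exposes predicted probabilities (the dataset is synthetic):
```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]          # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_te, scores)  # one (FPR, TPR) point per threshold
print("AUC:", roc_auc_score(y_te, scores))      # 1.0 = perfect, 0.5 = coin flip
```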
## Support Vector Machine (SVM) for classification
Ideal for small-ish datasets, a (linear) SVM is essentially a loss function designed to achieve the largest margin between classes; the fitted boundary is therefore determined only by the *support vectors*, the data points closest to it. We can train it using e.g. SGD.
A more flexible approach is to use a *soft margin* that balances accuracy and margin size, i.e. some misclassifications may result. The softness can be controlled by a regularization parameter (commonly denoted $C$).
We can use the *kernel trick* to achieve non-linear boundaries, e.g. with polynomial or Gaussian radial basis function kernels. By implicitly transforming the features via the kernel, possibly into spaces of higher dimension than the original data, we hope the data become linearly separable in that richer space. Be careful about over/underfitting as a result of the choice of kernel (and resulting boundary), e.g. a higher-degree polynomial implies less regularization.
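A short scikit-learn sketch of a soft-margin SVM with an RBF kernel (the moons dataset and the hyperparameters are illustrative); note the scaling step, since SVMs are sensitive to feature scale:
```python
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

# C controls the softness of the margin (small C = softer margin, more regularization);
# gamma controls the width of the RBF kernel.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```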
## Support Vector Machine (SVM) for regression
Instead of separating data points by a large margin, we now look to include the data points within a margin whose size/width is a parameter. Within the margin, adding more data points has no effect on the prediction.
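A minimal scikit-learn sketch where `epsilon` sets the half-width of the margin (tube); the toy data and hyperparameters are made up:
```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(100, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=100)

# epsilon is the half-width of the tube; only points outside it become support vectors
reg = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print("number of support vectors:", len(reg.support_))
```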
## Decision Tree
It works for both regression and classification and does not require scaling or centering of the dataset. The main idea: to make a prediction, we drop a test point in at the root and, at each branching of the tree, it goes to the left or right child depending on whether a specific feature is below or above a threshold. It traverses down the tree and takes on the class (or value) of the leaf node it lands in.
The tree is trained using the CART algorithm. Each split is chosen to minimize a (weighted sum of a) score. For classification, we use the Gini impurity or the entropy. A split is defined by a feature and a corresponding threshold. Indeed, we search all possible features and thresholds exhaustively. For example, if you have 1000 data points, a real-valued feature takes on at most 1000 distinct values, so there are at most 1000 candidate split points. We do this exhaustive search efficiently by first sorting the values of the feature, then using a sliding-window implementation to keep track of the score. We repeat this until training terminates according to a maximum depth or other stopping criteria.
This training algorithm is greedy: it optimises the score at each node (split) and does not look ahead. To handle regression, the feature space is divided into regions whose prediction is the mean of the target values in that region, and the score for a split is simply the mean squared error.
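A brute-force sketch of a single CART split on one real-valued feature using Gini impurity (sort first, then scan candidate thresholds); this is a simplified illustration, not the optimized sliding-window version:
```python
import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(x, y):
    # exhaustively score every threshold between consecutive sorted values
    order = np.argsort(x)
    x, y = x[order], y[order]
    best = (np.inf, None)
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue
        thr = 0.5 * (x[i] + x[i - 1])
        left, right = y[:i], y[i:]
        # weighted Gini of the two children
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best[0]:
            best = (score, thr)
    return best

x = np.array([2.0, 1.0, 3.5, 4.0, 0.5, 3.0])
y = np.array([0, 0, 1, 1, 0, 1])
print(best_split(x, y))   # expect a threshold between 2.0 and 3.0
```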
## Ensemble
We can combine the strength of several different predictors/learners (logistic regression, SVM, trees) by treating their predictions as votes and taking the majority or some weighted average as the final prediction.
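For instance, with scikit-learn's `VotingClassifier` (the choice of base learners and the dataset are arbitrary):
```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# hard voting: each learner casts one vote; the majority class wins
vote = VotingClassifier([
    ("lr", LogisticRegression(max_iter=1000)),
    ("svc", SVC()),
    ("tree", DecisionTreeClassifier(max_depth=4)),
], voting="hard")
vote.fit(X, y)
print(vote.score(X, y))
```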
### Bagging (or Bootstrap Aggregating)
We could also use the same learning method but train each weak learner on a subset of the dataset, where the data points are sampled (with replacement!) from the full dataset. We then take, say, the mode or average of the predictions from the weak learners. The final result has less variance than an individual weak learner but usually a similar degree of bias.
Pasting is bagging but with random subsets chosen without replacement. Note that the weak learners can be trained in parallel, making the method very scalable. Since each learner trains on a subset, the unseen (out-of-bag) data can conveniently be used for its evaluation.
A random forest is an ensemble of decision trees where each split additionally considers only a random subset of the features. Extra-Trees (extremely randomized trees) go further and use random thresholds instead of exhaustively optimizing each split. Bias gets slightly worse but variance gets better. A nice feature of random forests is that we can check how the features are used in each tree and distill a measure of feature importance!
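A sketch with scikit-learn showing out-of-bag evaluation for bagging and feature importances for a random forest (dataset and hyperparameters are illustrative):
```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=0)

# bagging: each tree sees a bootstrap sample; unseen (out-of-bag) points give a free validation score
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                        bootstrap=True, oob_score=True, random_state=0).fit(X, y)
print("OOB score:", bag.oob_score_)

# random forest: inspect how much each feature contributed across the trees
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print("feature importances:", rf.feature_importances_)
```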
### Boosting and AdaBoost
Boosting also uses multiple learners, but they are trained sequentially and each learns from the previous ones. A popular method is AdaBoost. The main idea is that each data point is given a weight, where misclassification gives it a higher weight (of being chosen), and each learner is also given a weight based on its performance. For each learner $j=1,\ldots, J$:
1. For a dataset of size $N$, initialize the weights $w_i \gets 1/N$ for $i=1,\ldots,N$ (before the first learner). Draw a training subset of size smaller than $N$ by sampling data points with probability proportional to their weights; in the first iteration all data points are equally likely.
2. Train the $j$-th learner and evaluate its weighted error rate $r_j$ (on the full dataset).
3. Evaluate the weight of the learner as $\alpha_j \propto \log \frac{1 - r_j}{r_j}$. The better the performance, the more weight the learner has in the final ensemble prediction.
4. Update the weight of each misclassified data point: $w_i \gets w_i \exp(\alpha_j)$. The weights of correctly predicted data points are unchanged. The idea is that wrongly predicted data points are given higher weights so they are more likely to be included in the training of subsequent learners; where the current learner has failed, we hope the future learners will be correct.
5. Normalize such that $\sum_{i=1}^N w_i = 1$.
We end up with $J$ learners. We make predictions by running all of them, and the final prediction is the weighted result according to the weights $\{\alpha_j\}_{j=1}^J$. It is common to apply AdaBoost to decision stumps (trees with only one split) as weak learners. We expect better bias in the ensemble than in the individual weak learners.
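With scikit-learn, using decision stumps (`max_depth=1`) as the weak learners (dataset and parameters are illustrative):
```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# 200 sequentially trained stumps; learning_rate scales each learner's weight alpha_j
ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                         n_estimators=200, learning_rate=0.5, random_state=0)
ada.fit(X, y)
print(ada.score(X, y))
```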
### Gradient Boosting for regression
Consider $N$ data points $\{x_i,y_i\}_{i=1}^N$, a weak learner $f_j$ and a differentiable loss function $L(y_i, f(x_i))$. Let $f_0 \equiv \frac{1}{N}\sum_i y_i$ so that the first predictor just outputs the mean for every input.
For (weak learner) $j = 1, \ldots, J$:
1. Compute the gradient (pseudo-residual) for each data point $i$ $$r_i^j = -\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}$$ evaluated at $f(x_i) = f_{j-1}(x_i)$. For regression with the $L_2$ loss, the pseudo-residual is simply the residual $(y_i - f_{j-1}(x_i))$.
2. Train $f_j$ to predict the pseudo-residuals, so that $f_j(x_i)$ approximates $r_i^j$. Note that we usually do not have $N$ unique predictions. For example, we may train a decision tree with 8 to 32 leaves (the number can differ between learners), so we only have that many distinct outputs. Say we have $K$ leaves and $x_i$ gets mapped to some leaf $k$. We need to decide what the output value of leaf $k$ should be: the output for leaf $k$ (for learner $j$) is the $\gamma_k^j$ that minimizes $\sum_{i\in I_k} L(y_i, f_{j-1}(x_i) + \gamma_k^j)$, where $I_k$ is the set of indices of the data points in leaf $k$. We can solve for $\gamma_k^j$ numerically (e.g. by gradient descent). It turns out that for $L_2$ loss this value is simply the average of the pseudo-residuals in that leaf
$$
\gamma_k^j = \frac{1}{\lvert I_k \rvert}\sum_{i'\in I_k} r_{i'}^j.
$$
3. With learning rate $0 \leq \nu \leq 1$, update the prediction to $f_j(x_i) = f_{j-1}(x_i) + \nu \, \gamma_k^j$, where $k$ is the leaf that $x_i$ maps to (so after the first learner, the prediction is $f_0(x_i) + \nu \, \gamma_k^1$).
We iteratively obtain more refined predictions by adding more and more levels of (learning-rate-scaled) tree outputs, each of which is an averaging of pseudo-residuals.
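A bare-bones sketch of this loop for the $L_2$ loss, using shallow `DecisionTreeRegressor`s as the weak learners (data and hyperparameters are made up); scikit-learn's `GradientBoostingRegressor` implements the full recipe:
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=200)

nu, J = 0.1, 100                      # learning rate and number of weak learners
pred = np.full_like(y, y.mean())      # f_0: predict the mean everywhere
trees = []

for j in range(J):
    residuals = y - pred              # pseudo-residuals for L2 loss
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    # for L2 loss each leaf already outputs the mean pseudo-residual of its data points
    pred += nu * tree.predict(X)
    trees.append(tree)

print("final training MSE:", np.mean((y - pred) ** 2))
```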
### Gradient Boosting for classification
We work with log-odds as our prediction. From the log-odds of a class $c$ (here estimated from the class counts $\#c$ and $\#!c$)
$$
l_c= \log\frac{p(c)}{1 - p(c)} = \log\frac{\#c/N}{\#!c/N} = \log\frac{\#c}{\#!c}
$$
we can recover the probability $p(c)$ by taking the sigmoid of the log-odds: $\sigma(l_c) = \frac{1}{1 + \exp(-l_c)} \equiv p(c)$.
The pseudo-residual for data point $x_i$ is simply the observed label minus the current predicted probability, $y_i - p_i$ (with $y_i \in \{0,1\}$ and $p_i$ the predicted probability of class $1$): it equals $1 - p_i$ for a positive example and $-p_i$ for a negative one.
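A short sketch of the log-odds bookkeeping for the binary case with labels $y_i \in \{0,1\}$ (everything here is illustrative):
```python
import numpy as np

y = np.array([1, 0, 1, 1, 0, 1])          # binary labels

# initial prediction: log-odds from the class counts
log_odds = np.log(y.sum() / (len(y) - y.sum()))
p = 1.0 / (1.0 + np.exp(-log_odds))       # sigmoid recovers the probability

# pseudo-residuals: observed label minus current predicted probability
residuals = y - p
print(log_odds, p, residuals)
```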
(to be continued)
## Other techniques
1. PCA or Factor analysis
1. Time series analysis
1. Cluster analysis
## Other concepts
1. AIC
1. Durbin-Watson statistic