# Data Analysis 101

## Sanity check

1. Is back-testing reliable? Use cross-validation.
2. Can I interpret the model?
3. Is the result significant?
4. Overfit/underfit?
5. Consider variable transforms and always scale.
6. Bias-variance tradeoff.

## Loss functions

Consider $N$ data points $\{x_i,y_i\}_{i=1}^N$ and predictions $\{\hat y_i\}_{i=1}^N$.

For regression with $y_i \in \mathbb{R}$:

- $L_1$ loss (the Lasso penalty when used for regularization) $\propto \sum_i \lvert \hat y_i - y_i \rvert$ penalizes errors linearly; its constant gradient keeps pushing every error towards exactly zero and it is less sensitive to outliers than $L_2$.
- $L_2$ loss $\propto \sum_i (\hat y_i - y_i)^2$ penalizes large errors heavily.
- Huber loss $\propto \begin{cases} 0.5\,(\hat y_i - y_i)^2 & \lvert \hat y_i - y_i \rvert < \delta \\ \delta\,(\lvert \hat y_i - y_i \rvert - 0.5\,\delta) & \text{otherwise} \end{cases}$ incurs $L_2$ loss for small errors and $L_1$ loss for large errors, with the transition controlled by the parameter $\delta$.

For classification:

- Binary cross entropy $\propto -\sum_i \left[ y_i \log P(\hat y_i) + (1 - y_i) \log(1 - P(\hat y_i)) \right]$ for $y_i \in \{0,1\}$ penalizes misclassification by the log probability of the correct class.
- Multiclass cross entropy $\propto -\sum_i t_i \cdot \log P(\hat y_i)$ for $y_i \in \{1,\ldots,K\}$ and $t_i$ the one-hot encoding of $y_i$ penalizes misclassification by the log probability of the correct class, regardless of the probabilities of the other classes.
- Hinge loss $\propto \sum_i \max(0, 1 - \hat y_i \cdot y_i)$ for $y_i \in \{-1,+1\}$ is used as the additional term for a soft margin in SVMs.

## Analysis of variance (ANOVA)

Compare the means of a continuous variable between $\geq 2$ groups using an $F$ test. (Use Student's $t$ test if there are just $2$ groups.)

Assumptions:
- Equal variances between groups (homoscedasticity). Check with Levene's, Bartlett's, or the Brown-Forsythe test.
- Residuals are normal. Check with the Shapiro-Wilk test or a histogram.
- Data points are independent.

Hypotheses:
- Null: all group means are equal.
- Alternative: at least one group mean is different.

Steps: let $X^1, \ldots, X^N$ be $N$ groups of $M$ observations of a variable. We denote
- Overall mean $\mu = \frac{1}{MN}\sum_{n=1}^N \sum_{m=1}^M X^n_m$.
- Group mean $\mu^n = \frac{1}{M} \sum_{m=1}^M X^n_m$ for $n=1,\ldots,N$.
- Between-group sum-of-squares (SS) $SSB = M \times \sum_{n=1}^N (\mu^n-\mu)^2$.
- Degrees of freedom (DOF) for $SSB$: $DOF_{SSB} = N - 1$.
- Within-group SS $SSW^n = \sum_{m=1}^M (X^n_m-\mu^n)^2$ for $n=1,\ldots,N$, then $SSW = \sum_{n=1}^N SSW^n$.
- DOF for $SSW$: $DOF_{SSW} = N \times (M - 1)$.
- $F$-statistic $F = \frac{SSB/DOF_{SSB}}{SSW/DOF_{SSW}}$.
- Check the $p$-value according to the $F$-distribution with $(DOF_{SSB}, DOF_{SSW})$ degrees of freedom.

We can state each data point as
\begin{align}
X^n_m &= \mu + (\mu^n - \mu) + (X^n_m - \mu^n)\\
&= \mu + \alpha^n + \epsilon^n_m.
\end{align}
We assume that $\epsilon^n_m \overset{\text{iid}}{\sim} N(0,\sigma^2)$ for all $n$ and $m$. The null states that $\alpha^1 = \alpha^2 = \cdots = \alpha^N = 0$.
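As a quick check of the recipe above, here is a minimal sketch (assuming NumPy/SciPy are available; the group data are made-up numbers) that computes the $F$ statistic directly from the sums of squares and compares it with `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy import stats

# N groups of M observations each (made-up numbers, equal group sizes as assumed above)
groups = np.array([
    [23.1, 25.4, 24.8, 22.9, 26.0],
    [27.5, 28.1, 26.9, 29.3, 27.7],
    [24.0, 23.5, 25.1, 24.6, 23.9],
])
N, M = groups.shape

mu = groups.mean()                            # overall mean
mu_n = groups.mean(axis=1)                    # group means

ssb = M * np.sum((mu_n - mu) ** 2)            # between-group sum of squares
ssw = np.sum((groups - mu_n[:, None]) ** 2)   # within-group sum of squares

dof_ssb = N - 1
dof_ssw = N * (M - 1)

F = (ssb / dof_ssb) / (ssw / dof_ssw)
p = stats.f.sf(F, dof_ssb, dof_ssw)           # upper-tail p-value of the F-distribution

print(f"F = {F:.3f}, p = {p:.4f}")
print(stats.f_oneway(*groups))                # cross-check with SciPy's one-way ANOVA
```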
## Classification result evaluation

### Confusion Matrix

For binary classification, it looks like this.

| | Pred no | Pred yes |
| -------- | -------- | -------- |
| Actual no | TN | FP |
| Actual yes | FN | TP |

A perfect classifier has zero off-diagonal entries.

We can summarise the matrix using the *precision*, $\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}$. It is the accuracy over all positive predictions; it tolerates not capturing all positive truths (type II errors). We can trivially get $\text{Precision}=1$ by making only one correct positive prediction, since there is then no false positive. Indeed, precision does not look at the negative predictions at all! Therefore, we also consider the *recall* or *sensitivity*, $\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}$. It is the accuracy over all positive truths; it tolerates predicting positive when the truth is negative (type I errors).

We can combine them as a harmonic mean called the $F_1$ score,
$$F_1 = \frac{2}{1/\text{Precision} + 1/\text{Recall}} = \frac{\text{TP}}{\text{TP} + \frac{\text{FN} +\text{FP}}{2}}.$$
$F_1$ is high when precision is close to recall. Indeed, the two quantities trade off, and we may prioritise one over the other depending on the application. We can plot precision vs recall to observe the trade-off.

### The receiver operating characteristic (ROC) curve

The ROC curve plots the true positive rate (recall) against the false positive rate, which is $1 -$ specificity (the true negative rate). The higher the recall, the more false positives we are likely to make, so specificity drops. To compare two curves, we can measure the area under the curve (AUC). A perfect classifier has an AUC of $1$, while a coin flip produces an AUC of $0.5$.

## Support Vector Machine (SVM) for classification

Ideal for small-ish datasets, (linear) SVM is essentially a loss function designed to achieve the largest margin between classes; it is therefore determined only by the *support vectors*, the data points closest to the boundary. We can train it using e.g. SGD.

A more flexible approach is to use a *soft margin* that balances accuracy and margin size, i.e. some misclassifications may result. The softness is controlled by a parameter.

We can use the *kernel trick* to achieve non-linear boundaries, e.g. polynomial or Gaussian radial basis function kernels. By transforming the features using the kernel, possibly into a space of higher dimension than the original data, we hope the data become linearly separable in that richer space. Be careful about over/underfitting as a result of the choice of kernel (and the resulting boundary); e.g. a higher-degree polynomial implies less regularization.

## Support Vector Machine (SVM) for regression

Instead of separating data points by a large margin, we now look to include the data points within a margin whose width is a parameter. Within the margin, adding more data points has no effect on the prediction.

## Decision Tree

It works for both regression and classification and does not require scaling or centering of the dataset. The main idea is that, to make a prediction, we throw a test point into the tree; at each branching it goes to the left or right branch depending on whether a specific feature is below or above a threshold, traverses down the tree, and takes on the class of the leaf node it lands in.

The tree is trained using the CART algorithm. Each split is chosen as the minimizer of a (weighted sum of) score; for classification, we use the Gini impurity or the entropy. A split is defined by a feature and a corresponding threshold, and we search all possible features and thresholds exhaustively. For example, with 1000 data points a real-valued feature takes on at most 1000 distinct values, so there are at most 1000 candidate split points to try. The exhaustive search can be done efficiently by first sorting the feature values and then sliding the threshold along them while updating the score incrementally. We repeat this until training terminates according to a maximum depth or other stopping criteria. This training algorithm is greedy: it optimises the score at each node (split) and does not see into the future.

To handle regression, we do a sort of classification where the feature space is divided into regions whose prediction is the mean of the target values in that region. The score for regression is simply the mean squared error.
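To make the exhaustive split search concrete, here is a minimal sketch for a single real-valued feature with Gini impurity. The helper names and the toy data are illustrative only, and for clarity the impurity is recomputed at each threshold rather than updated incrementally as described above.

```python
import numpy as np

def gini(labels: np.ndarray) -> float:
    """Gini impurity of a set of class labels."""
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / labels.size
    return 1.0 - np.sum(p ** 2)

def best_split(x: np.ndarray, y: np.ndarray):
    """Exhaustively search thresholds on one feature, minimizing the
    size-weighted Gini impurity of the two children (CART-style)."""
    order = np.argsort(x)                  # sort once, then slide the threshold
    x_sorted, y_sorted = x[order], y[order]
    n = x.size
    best_thr, best_score = None, np.inf
    for i in range(1, n):                  # candidate split between positions i-1 and i
        if x_sorted[i] == x_sorted[i - 1]:
            continue                       # identical values cannot be separated
        left, right = y_sorted[:i], y_sorted[i:]
        score = (left.size * gini(left) + right.size * gini(right)) / n
        if score < best_score:
            best_score = score
            best_thr = 0.5 * (x_sorted[i - 1] + x_sorted[i])
    return best_thr, best_score

# Toy example: the feature separates the classes around x ~ 0.5
x = np.array([0.1, 0.2, 0.3, 0.6, 0.7, 0.9])
y = np.array([0,   0,   0,   1,   1,   1  ])
print(best_split(x, y))   # threshold near 0.45, weighted impurity 0.0
```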
## Ensemble

We can combine the strengths of several different predictors/learners (logistic regression, SVM, trees) by treating their predictions as votes and taking the majority, or some weighted average, as the final prediction.

### Bagging (or Bootstrap Aggregating)

We can also use the same learner type but train each weak learner on a subset of the dataset whose points are sampled (with replacement!) from the full dataset. We then take, say, the mode or the average of the collection of predictions from the weak learners. The final result has less variance than the individual weak learners but usually a similar degree of bias. Pasting is bagging where the random subsets are chosen without replacement. Note that the weak learners can be trained in parallel, making the method very scalable. Since each learner is trained on a subset, the unseen (out-of-bag) data can conveniently be used to evaluate that weak learner.

A random forest is an ensemble of decision trees. Extra-Trees are a special kind of random forest in which each tree is trained more randomly: we may use a random threshold and split only on a subset of features instead of exhaustive optimization. Bias gets worse but variance gets better. A nice feature of random forests is that we can check how the features are used in each tree and distill information about feature importance!

### Boosting and AdaBoost

Boosting also uses multiple learners, but they are trained sequentially and each learns from the past learners. A popular method is AdaBoost. The main idea is that each data point is given a weight, where misclassification gives it a higher weight (of being chosen); each learner is also given a weight based on its performance.

Initialize the data-point weights $w_i \gets 1/N$ for $i=1,\ldots,N$. Then for each learner $j=1,\ldots, J$:

1. Sample a training subset where each data point is drawn according to the weights $w_i$. In the first iteration all data points are equally likely.
2. Train the $j$-th learner and evaluate its error $r_j$ (on the full dataset).
3. Evaluate the weight of the learner as $\alpha_j \propto \log \frac{1 - r_j}{r_j}$. The better the performance, the more weight the learner has in the final ensemble prediction.
4. Update the weight of each misclassified data point as $w_i \gets w_i \exp(\alpha_j)$; the weights of correctly predicted data points are unchanged. The idea is that wrongly predicted data points are given higher weights so they are more likely to be included in the training of subsequent learners. Where the current learner has failed, we hope the future learners will be correct.
5. Normalize so that $\sum_{i=1}^N w_i = 1$.

We end up with $J$ learners. We make predictions by running all of them, and the final prediction is the weighted vote according to the weights $\{\alpha_j\}_{j=1}^J$. It is common to apply AdaBoost to decision stumps (trees with only one split) as the weak learners. We expect better bias in the ensemble than in the weak learners.
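A minimal sketch of the AdaBoost loop just described, assuming decision stumps from scikit-learn as the weak learners, labels in $\{-1,+1\}$, a weighted error rate, resampling according to the data weights, and a proportionality constant of $1$ in $\alpha_j$; the dataset is made up for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Toy binary dataset with labels in {-1, +1}
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

N, J = len(X), 20
w = np.full(N, 1.0 / N)              # data-point weights, initialized uniformly
stumps, alphas = [], []

for j in range(J):
    # 1. Resample the training set according to the current weights
    idx = rng.choice(N, size=N, replace=True, p=w)
    stump = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])

    # 2. Weighted error rate on the full dataset
    pred = stump.predict(X)
    miss = pred != y
    r = np.clip(w[miss].sum(), 1e-10, 1 - 1e-10)

    # 3. Learner weight
    alpha = np.log((1 - r) / r)

    # 4. Boost the weights of misclassified points, then 5. normalize
    w[miss] *= np.exp(alpha)
    w /= w.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the alpha-weighted vote of all stumps
votes = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print("training accuracy:", np.mean(np.sign(votes) == y))
```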
### Gradient Boosting for regression

Consider $N$ data points $\{x_i,y_i\}_{i=1}^N$, weak learners $f_j$ for $j=1,\ldots,J$, and a differentiable loss function $L(y_i, f(x_i))$.

Let $f_0 \equiv \frac{1}{N}\sum_i y_i$, so the first predictor just outputs the mean for every input. For each weak learner $j = 1, \ldots, J$:

1. Compute the gradient (pseudo-residual) for each data point $i$,
   $$r_i^j = -\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}$$
   evaluated at $f(x_i) = f_{j-1}(x_i)$. For regression with the $L_2$ loss, the gradient is simply the residual $(y_i - f_{j-1}(x_i))$.
2. Train $f_j$ to predict the pseudo-residuals, so that $f_j(x_i)$ predicts $r_i^j$. Note that we usually do not have $N$ unique predictions: for example, we may train a decision tree with 8 to 32 leaves (the number can differ between learners), so there are only that many distinct outputs. Say there are $K$ leaves and $x_i$ maps to leaf $k$; we need to decide what the output of leaf $k$ should be. The output of leaf $k$ (for learner $j$) is the $\gamma_k^j$ that minimizes $\sum_{i\in I_k} L(y_i, f_{j-1}(x_i) + \gamma_k^j)$, where $I_k$ is the set of indices of the data points in leaf $k$. We can use gradient descent to solve for $\gamma_k^j$. It turns out that for the $L_2$ loss this value is simply the average of the pseudo-residuals in that leaf,
   $$ \gamma_k^j = \frac{1}{\lvert I_k \rvert}\sum_{i'\in I_k} r_{i'}^j. $$
3. With learning rate $0 \leq \nu \leq 1$, the running prediction becomes $f_{j-1}(x_i) + \nu\, \gamma_k^j$, where $k$ is the leaf that $x_i$ maps to (after the first learner this is $f_0(x_i) + \nu\, \gamma_k^1$).

We iteratively obtain more refined predictions using more and more levels of (learning-rate-scaled) tree outputs, each of which is an average of pseudo-residuals. (A code sketch is given at the end of this note.)

### Gradient Boosting for classification

We work with log-odds as our prediction. From the log-odds of a class $c$,
$$ l_c = \log\frac{p(c)}{1 - p(c)} = \log\frac{\#c/N}{\#!c/N} = \log\frac{\#c}{\#!c}, $$
we can recover the probability $p(c)$ by taking the sigmoid of the log-odds, $\sigma(l_c) = \frac{1}{1 + \exp(-l_c)} \equiv p(c)$. The pseudo-residual for data point $x_i$ is simply the observed label minus the predicted probability $p_i$ of the positive class: $1 - p_i$ if $x_i$ actually belongs to the positive class and $-p_i$ otherwise.

(to be continued)

## Other techniques

1. PCA or factor analysis
2. Time series analysis
3. Cluster analysis

## Other concepts

1. AIC
2. Durbin-Watson statistic

# Sources

Black-Scholes: https://www.streetofwalls.com/finance-training-courses/quantitative-hedge-fund-training/important-quant-math-topics/
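The code sketch referenced in the gradient-boosting section above: a minimal implementation of the regression recipe with the $L_2$ loss, where fitting a regression tree to the pseudo-residuals already stores the mean residual $\gamma_k^j$ in each leaf, so no separate leaf-output step is needed. The tree depth, learning rate, and toy data are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Toy 1-D regression data (made up): noisy sine wave
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=300)

J, nu = 50, 0.1                      # number of weak learners, learning rate
f = np.full_like(y, y.mean())        # f_0: predict the mean everywhere
trees = []

for j in range(J):
    # 1. Pseudo-residuals; for the L2 loss these are just the residuals y - f_{j-1}
    r = y - f
    # 2. Fit a small tree to the pseudo-residuals. With L2 loss, each leaf of the
    #    fitted tree stores the mean residual of that leaf, i.e. gamma_k^j.
    tree = DecisionTreeRegressor(max_depth=3).fit(X, r)
    # 3. Update the running prediction with the learning-rate-scaled leaf outputs
    f += nu * tree.predict(X)
    trees.append(tree)

def predict(X_new, base=y.mean()):
    """Sum the scaled tree outputs on top of the initial mean prediction."""
    return base + nu * sum(t.predict(X_new) for t in trees)

print("train MSE:", np.mean((predict(X) - y) ** 2))
```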
