# Data Analysis 101

## Sanity check

1. Is back-testing reliable? Use cross-validation.
1. Can I interpret the model?
1. Is the result significant?
1. Overfit/underfit?
1. Consider variable transforms and always scale.
1. Bias-variance trade-off.

## Loss functions

Consider $N$ data points $\{x_i,y_i\}_{i=1}^N$ and predictions $\{\hat y_i\}_{i=1}^N$.

For regression with $y_i \in \mathbb{R}$:

- $L_1$ (Lasso) loss $\propto \sum_i \lvert \hat y_i - y_i \rvert$ penalizes every error proportionally and encourages errors to go to exactly zero.
- $L_2$ loss $\propto \sum_i (\hat y_i - y_i)^2$ penalizes large errors heavily.
- Huber loss $\propto \sum_i \begin{cases} 0.5 (\hat y_i - y_i)^2 & \lvert \hat y_i - y_i \rvert < \delta \\ \delta ( \lvert \hat y_i - y_i \rvert - 0.5 \, \delta) & \text{otherwise} \end{cases}$ incurs $L_2$ loss for small errors and $L_1$ loss for large errors, with the crossover determined by the parameter $\delta$.

For classification:

- Binary cross entropy $\propto -\sum_i \left[ y_i \log P(\hat y_i) + (1 - y_i) \log(1 - P(\hat y_i)) \right]$ for $y_i \in \{0,1\}$ penalizes misclassification by the log probability of the correct class.
- Multiclass cross entropy $\propto -\sum_i t_i \cdot \log P(\hat y_i)$ for $y_i \in \{1,\ldots,K\}$ and $t_i$ the one-hot encoding of $y_i$ penalizes misclassification by the log probability of the correct class, regardless of the probabilities of the other classes.
- Hinge loss $\propto \sum_i \max(0, 1 - \hat y_i \cdot y_i)$ for $y_i \in \{-1,+1\}$, used as the additional term for a soft margin in SVMs.
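
To make the regression losses concrete, here is a minimal NumPy sketch (the function names and sample data are illustrative, not from any particular library):

```python
import numpy as np

def l1_loss(y_hat, y):
    """L1: every error is penalized proportionally; robust to outliers."""
    return np.sum(np.abs(y_hat - y))

def l2_loss(y_hat, y):
    """L2: squaring makes large errors dominate."""
    return np.sum((y_hat - y) ** 2)

def huber_loss(y_hat, y, delta=1.0):
    """Quadratic for |error| < delta, linear otherwise."""
    err = np.abs(y_hat - y)
    return np.sum(np.where(err < delta,
                           0.5 * err ** 2,
                           delta * (err - 0.5 * delta)))

y = np.array([0.0, 1.0, 2.0])
y_hat = np.array([0.1, 1.2, 5.0])  # last prediction is a large outlier
print(l1_loss(y_hat, y), l2_loss(y_hat, y), huber_loss(y_hat, y))
# L2 is dominated by the outlier; Huber treats it only linearly.
```
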
## Analysis of variance (ANOVA)

Compares the means of a continuous variable across $\geq 2$ groups using an $F$-test (use Student's $t$-test if there are just $2$ groups).

Assumptions:

- Equal variances between groups (homoscedasticity). Check with Levene's, Bartlett's, or the Brown-Forsythe test.
- Residuals are normal. Check with the Shapiro-Wilk test or a histogram.
- Data points are independent.

Hypotheses:

- Null: all group means are equal.
- Alternative: at least one group mean is different.

Steps: Let $X^1, \ldots, X^N$ be $N$ groups of $M$ observations of a variable. We denote

- Overall mean $\mu = \frac{1}{MN}\sum_{n=1}^N \sum_{m=1}^M X^n_m$.
- Group mean $\mu^n = \frac{1}{M} \sum_{m=1}^M X^n_m$ for $n=1,\ldots,N$.
- Between-group sum-of-squares $SSB = M \times \sum_{n=1}^N (\mu^n-\mu)^2$.
- Degrees of freedom (DOF) for $SSB$: $DOF_{SSB} = N - 1$.
- Within-group sum-of-squares (SS) $SSW^n = \sum_{m=1}^M (X^n_m-\mu^n)^2$ for $n=1,\ldots,N$, then $SSW = \sum_{n=1}^N SSW^n$.
- DOF for $SSW$: $DOF_{SSW} = N \times (M - 1)$.
- $F$-statistic $F = \frac{SSB/DOF_{SSB}}{SSW/DOF_{SSW}}$.
- Check the $p$-value according to the $F$-distribution with $(DOF_{SSB}, DOF_{SSW})$ degrees of freedom.

We can decompose each data point as

\begin{align}X^n_m &= \mu + (\mu^n - \mu) + (X^n_m - \mu^n)\\ &=\mu + \alpha^n + \epsilon^n_m. \end{align}

We assume that $\epsilon^n_m \overset{\text{iid}}{\sim} N(0,\sigma^2)$ for all $n$ and $m$. The null states that $\alpha^1 = \alpha^2 = \cdots = \alpha^N = 0$.
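
A minimal sketch of these steps on synthetic data (the group means and sizes are made up for illustration); `scipy.stats.f_oneway` serves as a cross-check:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# N = 3 groups of M = 20 observations; the third group's mean is shifted.
groups = np.stack([rng.normal(0.0, 1.0, 20),
                   rng.normal(0.0, 1.0, 20),
                   rng.normal(1.0, 1.0, 20)])
N, M = groups.shape

mu = groups.mean()                            # overall mean
mu_n = groups.mean(axis=1)                    # group means
ssb = M * np.sum((mu_n - mu) ** 2)            # between-group SS
ssw = np.sum((groups - mu_n[:, None]) ** 2)   # within-group SS
dof_ssb, dof_ssw = N - 1, N * (M - 1)
F = (ssb / dof_ssb) / (ssw / dof_ssw)
p = stats.f.sf(F, dof_ssb, dof_ssw)           # right-tail p-value

print(F, p)
print(stats.f_oneway(*groups))                # should agree
```
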
## Classification result evaluation

### Confusion matrix

For binary classification, it looks like this.

| | Pred no | Pred yes |
| -------- | -------- | -------- |
| Actual no | TN | FP |
| Actual yes | FN | TP |

A perfect classifier has $0$ in the off-diagonal entries. We can summarise the matrix using the *precision*, $\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}$. It is the accuracy over all positive predictions; it is okay to miss some positive truths (type II errors). We can trivially get $\text{Precision}=1$ by making a single correct positive prediction, since then there are no false positives. But then we are not looking at the negative predictions at all! Therefore, we also consider the *recall* or *sensitivity*, $\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}$. It is the accuracy over all positive truths; it is okay to predict positive when the truth is negative (type I errors).

We can combine them as a harmonic mean called the $F_1$ score: $F_1 = \frac{2}{1/\text{Precision} + 1/\text{Recall}} = \frac{\text{TP}}{\text{TP} + \frac{\text{FN} +\text{FP}}{2}}$. $F_1$ is high only when precision and recall are both high; the harmonic mean punishes an imbalance between them. Indeed, the two quantities trade off, and we may prioritise one over the other depending on the application. We can plot precision vs recall to observe the trade-off.

### The receiver operating characteristic (ROC) curve

ROC plots the true positive rate (recall) against the false positive rate, which is $1 -$ the true negative rate (*specificity*). The higher the recall, the more false positives we are likely to incur, so specificity drops. To compare two curves, we can measure the area under the curve (AUC). A perfect classifier has an AUC of $1$, while a coin flip produces an AUC of $0.5$.
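
A small sketch computing these metrics from raw confusion-matrix counts (the counts are arbitrary example numbers):

```python
# Confusion-matrix counts (arbitrary example numbers).
tp, fp, fn, tn = 80, 10, 20, 90

precision = tp / (tp + fp)
recall = tp / (tp + fn)                 # a.k.a. sensitivity
f1 = 2 / (1 / precision + 1 / recall)   # harmonic mean
f1_alt = tp / (tp + (fn + fp) / 2)      # equivalent closed form
specificity = tn / (tn + fp)            # true negative rate
fpr = 1 - specificity                   # x-axis of the ROC curve

print(precision, recall, f1, f1_alt, fpr)
```
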
## Support Vector Machine (SVM) for classification

Ideal for small-ish datasets, the (linear) SVM is essentially a loss function designed to achieve the largest margin between classes; the boundary is therefore determined only by the *support vectors*, the data points closest to it. We can train it using, e.g., SGD.

A more flexible approach is to use a *soft margin* that balances accuracy and margin size, i.e. some misclassifications may result. The softness can be controlled by a parameter.

We can use the *kernel trick* to achieve non-linear boundaries, e.g. polynomial or Gaussian radial basis function kernels. By transforming the features with the kernel, possibly into a space of higher dimension than the original data, we hope the data becomes linearly separable in that richer space. Be careful about over-/underfitting as a result of the choice of kernel (and the resulting boundary); e.g. a higher-degree polynomial implies less regularization.

## Support Vector Machine (SVM) for regression

Instead of separating data points by a large margin, we now look to include the data points within a margin whose width is a parameter. Within the margin, adding more data points has no effect on the prediction.

## Decision Tree

It works for both regression and classification and does not require scaling or centering of the dataset. The main idea: to make a prediction, we drop a test point into the tree, and at each branching it goes to the left or right branch depending on a threshold on a specific feature. It traverses down the tree and takes on the class of the leaf node it reaches.

The tree is trained using the CART algorithm. Each split is chosen to minimize a (weighted sum of a) score; for classification we use the Gini impurity or the entropy. A split is defined by a feature and a corresponding threshold, and we search all possible features and thresholds exhaustively. For example, with 1000 data points a real-valued feature takes on at most 1000 distinct values, so there are at most 1000 candidate split points. We do this exhaustive search efficiently by first sorting the feature values, then using a sliding-window implementation to keep a running track of the score (a minimal sketch is given at the end of this note). We repeat until training terminates according to a maximum depth or other criteria. This training algorithm is greedy: it optimises the score at each node (split) and does not look into the future.

To handle regression, we do a sort of classification where the feature space is divided into regions whose label is the mean of the truth values in that region. The score for regression is simply the mean squared error.

## Ensemble

We can combine the strengths of several different predictors/learners (logistic regression, SVM, trees) by treating their predictions as votes and taking the majority, or some weighted average, as the final prediction.

### Bagging (or Bootstrap Aggregating)

We can also use the same learning method but train each weak learner on a subset of the dataset whose points are sampled (with replacement!) from the full dataset. We then take, say, the mode or average of the collection of predictions from the weak learners. The final result has less variance than the individual weak learners but usually a similar degree of bias. *Pasting* is bagging where the random subsets are chosen without replacement. Note that the weak learners can be trained in parallel, making the method very scalable. Since each learner trains on a subset, the unseen (or out-of-bag) data can conveniently be used for its evaluation.

A random forest is an ensemble of decision trees. Extra-Trees are a special kind of random forest where each tree is trained more randomly: we may use a random threshold and split on only a subset of the features instead of optimizing exhaustively. Bias gets worse but variance gets better. A nice feature of random forests is that we can check how the features are used in each tree and distill feature-importance information!

### Boosting and AdaBoost

Boosting also uses multiple learners, but they are trained sequentially and each learns from the past learners. A popular method is AdaBoost. The main idea is that each data point is given a weight, where misclassification gives it a higher weight (of being chosen). Each learner is also given a weight (based on its performance). Initialize the data-point weights $w_i \gets 1/N$ for $i=1,\ldots,N$; then for each learner $j=1,\ldots, J$:

1. Sample a subset of size at most $N$ where each data point is drawn according to the weights. In the first iteration all data points are equally likely.
1. Train the $j$-th learner and evaluate its error rate $r_j$ (on the full dataset).
1. Evaluate the weight of the learner by $\alpha_j \propto \log \frac{1 - r_j}{r_j}$. The better the performance, the more weight the learner has in the final ensemble prediction.
1. Update the weight of each misclassified data point: $w_i \gets w_i \exp(\alpha_j)$. The weights of correctly predicted data points are unchanged. The idea is that wrongly predicted data points get higher weights, so they are more likely to be included in the training of subsequent learners. Where the current learner has failed, we hope the future learners will be correct.
1. Normalize such that $\sum_{i=1}^N w_i = 1$.

We end up with $J$ learners. We make predictions by running all of them, and the final prediction is the weighted result according to the weights $\{\alpha_j\}_{j=1}^J$. It is common to apply AdaBoost to decision stumps (trees with only one split) as weak learners. We expect better bias in the ensemble than in the weak learners.

### Gradient Boosting for regression

Consider $N$ data points $\{x_i,y_i\}_{i=1}^N$, weak learners $f_j$, and a differentiable loss function $L(y_i, f(x_i))$.

Let $f_0 \equiv \frac{1}{N}\sum_i y_i$, so the first predictor just outputs the mean for every input. For (weak learner) $j = 1, \ldots, J$:

1. Compute the gradient (pseudo-residuals) for each data point $i$: $$r_i^j = -\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}$$ evaluated at $f(x_i) = f_{j-1}(x_i)$. For regression with the $L_2$ loss, the gradient is simply the residual $(y_i - f_{j-1}(x_i))$.
1. Train $f_j$ to predict the pseudo-residuals, so that $f_j(x_i)$ predicts $r_i^j$. Note that we usually do not have $N$ unique predictions: for example, if each learner is a decision tree with 8 to 32 leaves (the number could differ per learner), we only have that many distinct outputs. Say we have $K$ leaves and $x_i$ gets mapped to some leaf $k$. We need to decide the output value for leaf $k$: it should be the $\gamma_k^j$ that minimizes $\sum_{i\in I_k} L(y_i, f_{j-1}(x_i) + \gamma_k^j)$, where $I_k$ is the set of indices of the data points in leaf $k$. We can use gradient descent to solve for $\gamma_k^j$. It turns out that for the $L_2$ loss this value is simply the average of the pseudo-residuals in that leaf: $$ \gamma_k^j = \frac{1}{\lvert I_k \rvert}\sum_{i'\in I_k} r_{i'}^j. $$
1. With learning rate $0 \leq \nu \leq 1$, the new prediction is $f_j(x_i) = f_{j-1}(x_i) + \nu \gamma_k^j$, where $x_i$ maps to leaf $k$.

We iteratively obtain more refined predictions by stacking more and more levels of (learning-rate-scaled) tree outputs, each of which is an averaging of pseudo-residuals. (A minimal from-scratch sketch is given at the end of this note.)

### Gradient Boosting for classification

We work with log-odds as our prediction. From the log-odds of a class $c$, $$ l_c= \log\frac{p(c)}{1 - p(c)} = \log\frac{\#c/N}{\#!c/N} = \log\frac{\#c}{\#!c}, $$ we can recover the probability $p(c)$ by taking the sigmoid of the log-odds: $\sigma(l_c) = \frac{1}{1 + \exp(-l_c)} \equiv p(c)$. The pseudo-residual for data point $x_i$ is the observed label minus the predicted probability of the positive class, $y_i - p_i$; its magnitude is $1 - p(\hat y_i)$ for a correct prediction and $p(\hat y_i)$ for a wrong one, where $\hat y_i$ is the predicted class for data point $x_i$. (to be continued)

## Other techniques

1. PCA or factor analysis
1. Time series analysis
1. Cluster analysis

## Other concepts

1. AIC
1. Durbin-Watson statistic
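
As promised in the Decision Tree section, here is a minimal sketch of the exhaustive split search for one real-valued feature with binary labels, using Gini impurity and the sort-then-sweep trick (the function names and toy data are illustrative):

```python
import numpy as np

def gini_from_counts(pos, total):
    """Gini impurity of a node given its positive count and size."""
    p = pos / total
    return 1.0 - p ** 2 - (1.0 - p) ** 2

def best_split(x, y):
    """Sort the feature once, then sweep left to right keeping a
    running positive count so each candidate threshold costs O(1)."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    n = len(x)
    total_pos = int(np.sum(y))
    left_pos = 0
    best_score, best_thr = np.inf, None
    for i in range(1, n):          # first i points go to the left child
        left_pos += y[i - 1]
        if x[i] == x[i - 1]:
            continue               # equal values cannot be separated
        score = (i * gini_from_counts(left_pos, i)
                 + (n - i) * gini_from_counts(total_pos - left_pos, n - i)) / n
        if score < best_score:
            best_score, best_thr = score, 0.5 * (x[i] + x[i - 1])
    return best_thr, best_score

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_split(x, y))  # threshold 6.5 separates the classes perfectly
```
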
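And the gradient-boosting sketch referenced above, for the $L_2$ loss. I use `sklearn.tree.DecisionTreeRegressor` as a stand-in for the 8-to-32-leaf trees described in the text; for the $L_2$ loss, fitting a regression tree to the pseudo-residuals already puts the mean residual in each leaf, so no separate $\gamma_k^j$ optimization step is needed:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

nu, J = 0.1, 100                       # learning rate, number of learners
f0 = y.mean()                          # f_0 predicts the overall mean
pred = np.full_like(y, f0)
trees = []
for _ in range(J):
    residuals = y - pred               # pseudo-residuals under L2 loss
    tree = DecisionTreeRegressor(max_leaf_nodes=16)
    tree.fit(X, residuals)             # leaf value = mean residual in leaf
    pred += nu * tree.predict(X)       # learning-rate-scaled update
    trees.append(tree)

def predict(X_new):
    return f0 + nu * sum(t.predict(X_new) for t in trees)

print(np.mean((predict(X) - y) ** 2))  # training MSE shrinks as J grows
```
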
