---
title: ML
tags: Templates, Talk
description: View the slide with "Slide Mode".
---
# Trees
- Steps to build a tree
```
# grow a node: repeat until a stopping condition is met
while not (node is pure or max depth reached or too few samples):
    pick the split (feature, threshold) with the best impurity gain
    split the node and recurse on each child
```
What if a leaf ends up with no data points? This shouldn't happen if splits are chosen from the training data itself (each branch gets at least one sample).
Suppose we have this tree and these two examples: sample 1 with f1 = 3, sample 2 with f1 = 2.
```
              root
     f1 < 4  /    \  f1 > 4
         {1, 2}    { }
```
Labels: $y_1 = 0$, $y_2 = 1$.
Split score: $Gini_{split} = \sum_{branch} N_{branch} \cdot H_{branch}$, where $H = 1 - \sum_k p_k^2$ is the Gini impurity of a branch. In this example the left branch has $H = 1 - (0.5^2 + 0.5^2) = 0.5$, and the empty right branch contributes 0.
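A minimal sketch of this split score in code (the function names are mine, not from any library):
```python
import numpy as np

def gini_impurity(labels):
    """H = 1 - sum_k p_k^2 for one branch."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_split_score(branches):
    """Weighted sum of branch impurities; lower is better."""
    return sum(len(b) * gini_impurity(b) for b in branches if len(b) > 0)

# the toy example above: left branch has labels {0, 1}, right branch is empty
print(gini_split_score([np.array([0, 1]), np.array([])]))  # 2 * 0.5 = 1.0
```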
Regression
- sort all the data by f1
- take each pair of adjacent points and use their average (midpoint) as a candidate split threshold
- evaluate each candidate by the squared error (variance) of the resulting branches and keep the best one (see the sketch after this list)
- why the average and not the median? (for two points they are the same value; the midpoint is simply the conventional choice)
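A minimal sketch of this candidate-threshold search for one feature (`best_split_1d` is a hypothetical helper):
```python
import numpy as np

def best_split_1d(x, y):
    """Best threshold on a single feature for a regression tree.
    Candidates are midpoints of adjacent sorted values; score is total SSE."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_thr, best_sse = None, np.inf
    for i in range(len(x) - 1):
        if x[i] == x[i + 1]:
            continue  # no usable threshold between identical values
        thr = (x[i] + x[i + 1]) / 2.0  # midpoint of the adjacent pair
        left, right = y[: i + 1], y[i + 1:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_thr, best_sse = thr, sse
    return best_thr, best_sse

x = np.array([2.0, 3.0, 5.0, 7.0])
y = np.array([1.0, 1.2, 3.0, 3.1])
print(best_split_1d(x, y))  # threshold 4.0 separates the two clusters
```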
Time complexity: O(mn log m) per tree
m = number of samples, n = number of features (sorting each feature's values costs m log m)
## Random Forest
O(M x mn log m) for a forest of M trees (same per-tree cost as above)
Trees are trained on independent bootstrap samples, so they can be grown in parallel
`If regression`: average the trees' predictions; if classification: take a majority vote (see the sketch below)
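A minimal bagging sketch, assuming scikit-learn (`fit_forest` / `predict_forest` are my own names):
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_forest(X, y, n_trees=10, max_features="sqrt", seed=0):
    """Bagging: each tree sees a bootstrap sample and random feature subsets."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))  # bootstrap sample (with replacement)
        tree = DecisionTreeRegressor(max_features=max_features)
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_forest(trees, X):
    # regression: average the trees' predictions (classification would use majority vote)
    return np.mean([t.predict(X) for t in trees], axis=0)
```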
## Compare
### RF vs GBDT vs Adaboost vs Xgboost
- All are built on decision trees
- RF uses bagging: bootstrap samples + random feature subsets, with trees trained independently
- GBDT steps (see the sketch after these notes)
    - start with an initial guess (e.g. the mean of the targets for regression)
    - for t in epochs:
        - compute the current prediction and the residual (the negative gradient of the loss)
        - train a new tree (tree_t) to predict the residual
        - combine: new prediction = previous prediction + $\alpha$ * tree_t's prediction ($\alpha$ = learning rate)
`Loss functions of GBDT`: squared error for regression, log loss for classification; the residuals fitted at each step are the negative gradients of the loss
`Convergence condition`: e.g. stop when the validation loss stops improving or after a fixed number of trees
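A minimal gradient-boosting sketch with squared loss, assuming scikit-learn trees (`fit_gbdt` / `predict_gbdt` and the hyperparameters are my choices):
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbdt(X, y, n_trees=100, alpha=0.1, max_depth=3):
    """Gradient boosting with squared loss: each tree fits the current residuals."""
    init = y.mean()                      # initial guess
    pred = np.full(len(y), init)
    trees = []
    for _ in range(n_trees):
        residual = y - pred              # negative gradient of the squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)
        pred += alpha * tree.predict(X)  # shrinkage / learning rate alpha
        trees.append(tree)
    return init, trees

def predict_gbdt(init, trees, X, alpha=0.1):
    return init + alpha * np.sum([t.predict(X) for t in trees], axis=0)
```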
# SVM
## Concepts
## Kernel Usage
- when to use linear / polynomial / RBF (the RBF kernel is the Gaussian kernel)
    - depends on the data pattern: roughly linearly separable data suits a linear kernel, otherwise try a non-linear one (a quick comparison sketch follows this section)
    - to see the pattern, project the data to a low-dimensional space and visualize it (PCA is linear, t-SNE is non-linear)
`kernel trick` - rarely used in practice these days
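A quick comparison sketch on a toy non-linear dataset, assuming scikit-learn (the dataset and settings are my choices):
```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ["linear", "poly", "rbf"]:
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    print(kernel, clf.score(X_test, y_test))  # RBF usually wins on this non-linear pattern
```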
# Deep Learning
## Constraint to use Deep Learning
- Training data size?
    - with a pretrained model, a few hundred labeled examples can be enough
    - training from scratch typically needs on the order of 10K+ examples
- Computation resources?
    - inference on device vs. in the cloud
    - quantization: 8-bit instead of 32-bit cuts memory by 4x, allowing larger batches
    - student / teacher (knowledge distillation): the student has fewer parameters and learns from the teacher (see the sketch below)
    - training data pre-processing / compression
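A minimal knowledge-distillation loss sketch, assuming PyTorch (the function name, temperature `T`, and weight `lam` are my choices):
```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, lam=0.5):
    """Soft targets from the teacher (temperature T) + hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                      # rescale for the temperature
    hard = F.cross_entropy(student_logits, labels)
    return lam * soft + (1 - lam) * hard
```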
## Momentum
Accumulate past gradients (an exponentially weighted history) and use them to adjust the direction of the current update; this damps oscillations and speeds up movement along consistent directions.
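One common formulation ($g_t$ = current gradient, $\eta$ = learning rate, $\beta \approx 0.9$):

$$v_t = \beta\, v_{t-1} + g_t, \qquad w_t = w_{t-1} - \eta\, v_t$$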

## Adagrad
Parameters that are updated frequently (large accumulated squared gradients) get smaller steps; rarely updated parameters keep a larger effective step so they can still learn. Useful for sparse features.
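Accumulated squared gradients give a per-parameter step size ($\epsilon$ avoids division by zero):

$$G_t = G_{t-1} + g_t^2, \qquad w_t = w_{t-1} - \frac{\eta}{\sqrt{G_t} + \epsilon}\, g_t$$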

## RMSprop
Like Adagrad, but uses an exponentially decaying average of recent squared gradients instead of the full sum, so the effective learning rate does not shrink toward zero.
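Decaying average of squared gradients (typically $\rho \approx 0.9$):

$$E[g^2]_t = \rho\, E[g^2]_{t-1} + (1-\rho)\, g_t^2, \qquad w_t = w_{t-1} - \frac{\eta}{\sqrt{E[g^2]_t} + \epsilon}\, g_t$$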

## Adam
Combines momentum (first moment of the gradients) with RMSprop-style scaling (second moment), plus bias correction for the early steps.
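The standard update (defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$):

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$$
$$\hat m_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat v_t = \frac{v_t}{1-\beta_2^t}, \qquad w_t = w_{t-1} - \eta\, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}$$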
## Sigmoid vs. ReLU
- sigmoid: when the prediction must be bounded in (0, 1), e.g. a probability; use softmax for multi-class outputs
- softmax: all outputs are non-negative and sum to 1
- ReLU: common in CV; advantage: preserves the magnitude of large positive inputs (no saturation); all negative inputs go to 0
(when negative scores / outputs must be preserved, use tanh, which is bounded in (-1, 1))
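A tiny numeric illustration of the three activations (the input values are arbitrary):
```python
import numpy as np

x = np.array([-3.0, -0.5, 0.0, 2.0, 10.0])
sigmoid = 1 / (1 + np.exp(-x))   # bounded in (0, 1), saturates for large |x|
relu    = np.maximum(0, x)       # negatives become 0, large positives pass through
tanh    = np.tanh(x)             # bounded in (-1, 1), keeps the sign
print(sigmoid, relu, tanh, sep="\n")
```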
## Batch Normalization
Standardizes each layer's activations over the mini-batch so outputs stay at a similar scale; this stabilizes training and also acts as a mild regularizer against overfitting.
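Per feature, over a mini-batch $B$ (with learnable scale $\gamma$ and shift $\beta$):

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma\, \hat{x}_i + \beta$$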
## SGD
Stochastic vs. mini-batch: the same idea in practice; take one or a few random samples to form a batch for each update. The shuffling / randomness makes updates cheap and noisy, which often helps learning.
GD: feeds the model all of the data for every update, which is expensive and usually unnecessary.
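A minimal mini-batch loop sketch (the helpers in the commented lines are hypothetical):
```python
import numpy as np

def minibatches(X, y, batch_size=32, seed=0):
    """Shuffle once per epoch, then yield consecutive slices as mini-batches."""
    idx = np.random.default_rng(seed).permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]

# for X_batch, y_batch in minibatches(X, y):
#     grad = compute_gradient(model, X_batch, y_batch)   # hypothetical helper
#     model = apply_update(model, grad, lr=0.01)          # hypothetical helper
```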
## RNN / GRU / LSTM
Vanilla RNN: only one hidden state (no gates, no separate cell state)
GRU: fewer gates and parameters, so faster than LSTM; accuracy is usually close to LSTM
Encoder and decoder often share the same architecture but not the same parameters. The encoder represents the input sentence and produces a single vector at its last step.
The decoder uses that representation as its initial input; at each step it outputs scores over the vocabulary, takes the argmax (the position of the highest-scoring word), and feeds that word back as the input of the next decoder step (see the sketch below).
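A minimal greedy-decoding sketch (everything here, including `decoder_step`, is a hypothetical placeholder):
```python
import numpy as np

def greedy_decode(decoder_step, context, bos_id, eos_id, max_len=50):
    """Greedy decoding: emit the argmax at each step and feed it back as the next input.
    `decoder_step(prev_token, state)` returns (vocab_scores, new_state)."""
    token, state, output = bos_id, context, []
    for _ in range(max_len):
        scores, state = decoder_step(token, state)
        token = int(np.argmax(scores))    # position of the highest-scoring word
        if token == eos_id:
            break
        output.append(token)
    return output
```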
## Group Convolution
Split the input channels into groups and convolve each group independently, then concatenate the outputs; with $g$ groups the parameter count and FLOPs drop by roughly a factor of $g$ (depthwise convolution is the extreme case of one channel per group).
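A quick parameter-count comparison, assuming PyTorch:
```python
import torch.nn as nn

regular = nn.Conv2d(64, 64, kernel_size=3)            # 64*64*3*3 + 64 = 36,928 params
grouped = nn.Conv2d(64, 64, kernel_size=3, groups=4)  # 64*(64/4)*3*3 + 64 = 9,280 params
print(sum(p.numel() for p in regular.parameters()),
      sum(p.numel() for p in grouped.parameters()))
```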
## What can cause a model to fail to converge? Is a model that does not converge necessarily a failed model?
## VGG 3x3 Kernels: Core Advantage
Two stacked 3x3 convolutions cover the same receptive field as one 5x5 convolution but with fewer parameters ($2 \cdot 3^2 C^2 = 18C^2$ vs. $5^2 C^2 = 25C^2$ for $C$ channels) and an extra non-linearity in between.
## Attention D
During encoding, each step outputs a vector, but without attention only the last step's vector is used. With attention, the decoder uses all encoder steps: the decoder's hidden representation is compared (e.g. via dot product) with every encoder state to produce attention weights, and the weighted sum of the encoder states becomes the context for that decoding step (see the sketch below).
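A minimal dot-product attention sketch over the encoder states (the names are mine):
```python
import numpy as np

def dot_product_attention(decoder_hidden, encoder_states):
    """decoder_hidden: (d,), encoder_states: (T, d) -> context vector (d,)."""
    scores = encoder_states @ decoder_hidden          # one score per encoder step
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                          # softmax over encoder steps
    return weights @ encoder_states                   # weighted sum = context
```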
## Overfitting Prevention
- Dropout
    - active only during training (turned off at inference)
    - introduces noise, which reduces overfitting
    - breaks co-adaptation between units, so the network cannot rely on fixed feature combinations and must learn from each feature more independently
- Shrink the network (number of layers / layer width)
- Regularization (L1 / L2)
- Multi-task learning (e.g. BERT trains on masked-word prediction and next-sentence prediction; or predicting image and text jointly)
- Review the learning rate; reduce it if the loss / likelihood stops improving
- Early stopping: stop training if there is no improvement for several epochs (see the sketch after this list)
    - track the model's validation performance per epoch and keep the best checkpoint
- Batch Normalization
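A minimal early-stopping loop sketch (all helpers passed in are hypothetical):
```python
def train_with_early_stopping(model, train_step, evaluate, snapshot,
                              patience=5, max_epochs=100):
    """Stop when validation loss has not improved for `patience` epochs; keep the best."""
    best_loss, best_state, bad_epochs = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_step(model)                     # one epoch of training
        val_loss = evaluate(model)            # track performance per epoch
        if val_loss < best_loss:
            best_loss, best_state, bad_epochs = val_loss, snapshot(model), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break                         # no improvement for `patience` epochs
    return best_state
```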
## Vanishing / Exploding Gradients
- exploding: gradients overflow 32-bit floats (loss becomes NaN / Inf)
    - fix: gradient clipping (see the snippet below)
- vanishing: sigmoid saturates (gradient ≈ 0 for large |x|); ReLU gives 0 for all negative inputs, so those units receive no gradient and stop learning
- no single fix for vanishing gradients; `dropout` does not really address it; common mitigations are careful initialization, batch normalization, residual connections, and gated units (LSTM / GRU)
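A typical gradient-clipping snippet, assuming PyTorch (`model`, `loss`, `optimizer` come from the surrounding training loop):
```python
import torch

# clip the global gradient norm before the optimizer step
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```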