# 423 - DataScience in Industry
# Week 1
## Lect 1 - 23 Feb - 2021
### Lab 1 - 25 Feb - 2021
## Lect 2 - 26 Feb - 2021
# Week 2
## Lect 3 - 2 March - 2021
### Lab 2 - 4 March - 2021
## Lect 4 - 5 March - 2021
# Week 3
## Lect 5 - 9 March - 2021
### Lab 3 - 11 March - 2021
- Assignment
- Speedy and Nick can help with the assignment
- Interesting task for today
## Lect 6 - 12 March - 2021
.. What do you do with a date variable?
.. Look at the date carefully: R won't necessarily read it the same way it appears in the CSV. How would it know whether the format is DD/MM/YY or YY/MM/DD etc.?
.. Missing data: always look for it. You will be challenged with different kinds of missingness in the next assignment. An empty cell can be missing, and so can placeholders such as -99, or -1 when the value has no meaning.
.. Why do we have data that contains nothing meaningful? No background information about the data is given so that everyone starts on a level playing field.
.. The target variable can be Y.
.. Think logically about the quiz questions.
- Data roles
- valid roles for variables are:
- Outcome i.e. the Y variable (multiple)
- Predictors i.e. the X variables
- Case Weight (positive numeric)
- Train-test split (factor or binary)
- Observation identifier (100% unique)
- Observation stratifier (factor)
.. An ID should be unique and often includes a checksum. A PolicyNumber might be a number plus a checksum. An ID might be composed of a date plus a number followed by a postcode, etc. Sometimes an ID can be broken down into smaller variables, e.g. 2006G1204.
.. Explore the data with questions about the ID sequence. If there is no sequence, try to create one.
**Stratifier Role:**
Sometimes called the Grouping role. Sets of observations might sensibly be grouped by a stratifier variable, e.g. the hospital name for patient data. A stratifier variable is used to drive a train-test resampling strategy. A stratifier variable does not take further part in the modelling. A stratifier variable is categorical.
A stratifier role is needed when observations are not entirely independent of each other. A good stratifier variable is one that enables observations that are similar (i.e. dependent) to be sampled so that they are all in train or all in test.
Leave-group-out resampling, using the stratifier variable, is the solution to non-independent observations. Sometimes the stratifier role is unfilled and unnecessary because the observations are independent. Sometimes there are many stratifier candidates. When there are, and they form a hierarchy (e.g. country - city - suburb), the highest level is likely to be the right level. The stratifier role being filled is mutually exclusive with the train-test role being filled.
The leave-group-out resampling style attempts to keep independence between train observations and test observations. If a group is split across train and test, the test cases will be too well represented in training and will give a biased assessment of model performance. This is a form of data leakage.
A conventional k-fold cross validation is restricted to situations in which observations are independent of each other.
.. Example: patient data for the whole of New Zealand - patient ID, patient name, diagnostic info, age, gender etc. - plus which hospital each patient is in. The task is to predict the cost of their admission to hospital, or the death rate.
.. Is the hospital they are in a predictor variable? There is no clear answer. But what if you have been told that every hospital is treated the same?
.. The complication is that hospitals have different doctors, so patients are not getting the same care.
A stratifier variable defines strata of data from the same place. Train and test with the strata kept together rather than with purely random data: this gives a better representation and a better-behaved model assessment.
**Character Strings**
In the context of the predictor role, variables that are character strings are of no use in that form (see the sketch below).
.. Low cardinality: convert them to factors (check that the levels are fixed) but ultimately dummy-encode them.
.. High cardinality:
.. use a specific encoding to turn them into numeric variables, e.g. a word2vec embedding or an n-gram vector.
.. use hash encoding to turn them into a number of numeric variables.
.. use target encoding to turn them into a numeric variable (given there is an outcome variable).
For distance operations, strings can be compared for their similarity / dissimilarity in their raw form.
The distance matrix is very important, e.g. for KNN.
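A minimal R sketch of the encodings above, assuming a hypothetical data frame `d` with a character column `colour` and an outcome `y`; the target-encoding step here is a simple group-mean version, not a specific package's implementation.

```r
# Hypothetical example data (assumption, not from the lecture).
d <- data.frame(colour = c("red", "blue", "green", "blue"),
                y      = c(10, 20, 15, 25),
                stringsAsFactors = FALSE)

# Low cardinality: fix the levels, then dummy-encode.
d$colour <- factor(d$colour, levels = c("red", "blue", "green"))
dummies  <- model.matrix(~ colour - 1, data = d)     # one 0/1 column per level

# High cardinality (with an outcome): simple target encoding,
# i.e. replace each level with the mean outcome for that level.
level_means      <- tapply(d$y, d$colour, mean)
d$colour_encoded <- as.numeric(level_means[as.character(d$colour)])

# Raw strings can still be compared directly for distance operations.
adist(c("asphalt", "concrete", "gravel"))            # edit-distance matrix
```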
# Week 4
## Lect 7 - Tuesday 16 March - 2021
Take home message from last week
There must be at least one variable in the Predictor role.
. Allocate the variables to their roles early in the process.
. Not all roles are necessary to fill.
. All variables need to be allocated to a single role. If a variable has no feasible role, try to do without it.
. The roles imply things that will matter downstream.
. Stratifier/group roles are subtle and often overlooked.
#### *Missing Data*
Missing data is a very common real-life problem. Very rarely will data have no missing values. A further problem is that in the future you may end up having missing data in the test data.
When you read in a CSV file you may end up with values that do not mean what they appear to mean.
In statistics, missing data occurs when no value is stored for an observation or for a variable in an observation. Missing data may be represented as NA, "", -99, etc. In R we want to get to the NA format, since NA is R's way of saying Not Available or Not Applicable; this is done as part of data tidying/cleaning.
Common techniques for missing values.
--> *Imputation*
the process of replacing missing data with substituted values. There are two types:
. Introducing whole observations that are missing
. Infilling values that are missing
We are concerned with the latter.
--> *Partial Deletion*
Also called row-wise deletion. The removal of whole observations that have any (relevant) missing values. Clearly a missing value for a variable that is not relevant to a model should not cause partial deletion. Removal of observations may cause model bias if the pattern of missingness is not random.
--> *why missing an issue; why can't we train?*
Most methods cannot deal with missing values.
It is said that lm doesn't mind missing values, but this is not quite true: lm uses partial deletion and still gives you a model. KNN, by contrast, crashes if there are missing values.
Tree-based models such as Random Forest tolerate missing values.
Kernel methods tend to raise errors in the presence of missing training or prediction data.
**Concepts**
1. Not Applicable is different from Not Available.
In the domain of selling homes, a variable is defined as "the surface of any adjacent laneway" with levels "asphalt", "concrete", "gravel". What does an NA value for this variable imply? Is this Not Applicable or Not Available? Should this value be imputed? What else can we do?
Add **None** as a new level and assign it to the missing values.
This introduces a wee bit of noise into our data/model.
2. Other place-holders
- NaN - not a number e.g. log(-1)
- -Infinity - A huge negative number e.g. -1/0
- Infinity - A huge positive number e.g. 1/0
3. Attrition
- A type of missingness that can occur in longitudinal (long-term) studies, e.g. people move out of the country
4. Censored data
- data that is partially known and partially missing
- left censored data
- right censored data
--> *Significance of all above:*
Many methods crash in the presence of missing data - by avoiding these, the choice of methods becomes restricted. Some methods discard observations with missing variables as they train (typically through a parameter like na.action = na.omit). For these methods, doing nothing about missing values is the same as explicit partial deletion. So you might rather stick with a tolerant method like Random Forest.
--> *Partial deletion:*
it's imperfect because it can lead to weaker training sets (due to fewer observations) and sometimes biased models, e.g. if there are fewer women in the data and I use partial deletion, I may end up with a model biased towards males.
--> *Missingness*
might not be independent of the value (were it not missing). Consider a health survey that has missing data due to a person being too sick to complete it.
--> *Imputation*
restores the number of observations but is imperfect i.e. it can increase model variance and sometimes introduces model bias.
**Excessively Missing**
A variable that is heavily missing (say > 50% missing) is a good candidate to drop as a predictor. Leaving such a variable in will cause the removal of at least this proportion of observations if partial deletion is used.
Observations that are excessively missing are good candidates to remove. Imputing these is mostly an exercise in observation creation.
Strategy 1
First, remove excessively missing variables.
Second, remove excessively missing observations.
Third, impute missing values.
Strategy 2
First, remove excessively missing variables.
Second, remove excessively missing observations.
Third, use a method tolerant of missing values (Random Forest etc.).
Strategy 3
First, remove excessively missing variables.
Second, delete observations with any missing values.
Which strategy is best depends on whether missing values are expected in the future - anticipate this early. A sketch of Strategies 1 and 2 follows.
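A hedged R sketch of Strategies 1 and 2, assuming a hypothetical data frame `d` and a 50% threshold for "excessively missing"; `caret::preProcess` is one way to do the imputation step.

```r
library(caret)

# Step 1: remove excessively missing variables (> 50% NA here).
d <- d[, colMeans(is.na(d)) <= 0.5, drop = FALSE]

# Step 2: remove excessively missing observations (> 50% NA here).
d <- d[rowMeans(is.na(d)) <= 0.5, , drop = FALSE]

# Step 3 (Strategy 1): impute the remaining missing values.
pp        <- preProcess(d, method = "medianImpute")   # or "knnImpute" / "bagImpute"
d_imputed <- predict(pp, newdata = d)

# Step 3 (Strategy 2): skip imputation and use a tolerant method instead,
# e.g. the caret methods "rpart", "C5.0" or "ada".
```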
**Informative Missingness**
If missingness has a systematic relationship to the outcome variable, we have informative missingness. We can create extra "shadow" variables to capture this information by assigning 1 to a missing value and 0 to a non-missing value (or the other way around - either works). A sketch follows.
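A small sketch of shadow-variable creation, assuming a hypothetical data frame `d`; run this before any imputation so the missingness pattern is preserved.

```r
# Add a 0/1 shadow variable for every column that has missing values.
has_na <- names(d)[colSums(is.na(d)) > 0]
for (v in has_na) {
  d[[paste0(v, "_missing")]] <- as.integer(is.na(d[[v]]))   # 1 = was missing
}
```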
#### **TYPES**
- Missing completely at random MCAR (no pattern)
The likelihood of a missing value is independent of the observation, other variables and of things like time
- Missing at random MAR (pattern to it)
The likelihood of a missing value is predictable from other variables and/or something time-like such as observation order.
- Missing not at random MNAR – (biased)
The likelihood of a missing value is related to the value were it not missing i.e. it is suppressed because of its value. Can you detect/deduce this?
The value is missing because of the value it would have had - a cause-and-effect relationship. In certain surveys, people refuse to answer about their income when it is either very low or very high.
Can we deduce Missing Not at Random (biased)? Not from the data alone, because we would need to see the missing values. We need domain knowledge.
**MCAR**
Missing Completely at Random means there is no relationship between the missingness of the data and any values, observed or missing. The missing data points are a random subset of the data. There is nothing systematic going on that makes some data more likely to be missing than others. Both partial deletion and imputation should be unbiased strategies in this situation.
MCAR is rare in practice.
**MAR**
Missing at Random means there is a systematic relationship between the propensity for missing values and the observed data, but not the missing data. Whether a value is missing has nothing to do with the missing values themselves, but it does have to do with the values of an observation's other (observed) variables.
Imputation should be an unbiased strategy in this situation.
However partial deletion might be biased.
For example, if men are more likely to tell you their weight than women, weight is MAR.
**MNAR**
Missing Not at Random means there is a relationship between the propensity of a value to be missing and its (missing) value.
For example, if people who weigh a lot are more likely to avoid declaring their weight than others, then the weight variable is MNAR.
Both imputation and partial deletion strategies can introduce bias in this situation. Can we prove this is happening from the data? Sadly, no.
Can we suspect this from an understanding of the data collection process? Absolutely yes.
**DIAGNOSIS**
In most missing data situations we don't have the luxury of getting hold of the missing data, so we can't test directly whether missingness is related to the missing value.
In diagnosing the randomness of the missing data, use your scientific and psychological knowledge of the data, the domain and the data collection process. The more sensitive the issue, the more likely people are to withhold or lie. "They're not going to tell you as much about their cocaine usage as they are about their phone usage."
Many fields have common situations relating to missing data. Educate yourself in your field's literature.
MNAR is difficult to diagnose. Sometimes diagnosing it does not particularly help, as there is no remedy available other than survey redesign. It is important to explain that a model based on MNAR missing values will be biased. However, it is not possible to quantify this bias.
### Lab 3 - 18 March - 2021
Missing data - R markdown..
## Lect 8 - 19 March - 2021
We will go through the Model answer once all the students submit their assignments.
Need more work on reporting - how to read the data?
**RESPONSE**
--> Imputation (Must not be MNAR)
Really the process of predicting the missing values. The main model predicts a target; imputation is itself a model that needs to train/learn to predict the missing values and then infill them once predicted.
It is the better of two evils.
Types of imputation:
- Fast/simple: median for numeric and mode for nominal (check the distribution first - you don't want a bimodal shape).
The median is the first thing to jump to. Median/mode imputation is typically used when we don't have a large number of observations and can't afford to throw observations away.
- Slow/complex: KNN and bag imputation (but these deal with patterned missingness well).
K Nearest Neighbour imputation handles categorical and numeric variables.
Bag imputation deals with patterned missingness very well.
- Time Series: Special methods may be needed for time-series data.
--> Create another category level
For categorical variables (red, blue, green), create a new level to represent not-applicable missing values: turn it into a 4th level called something like 'None'. We can do something similar with a numeric variable (see the height-of-oldest-child example below).
--> Partial deletion
Omit observations with missing data and/or omit an entire variable.
Column-wise / row-wise (strictly, partial deletion usually means row-wise).
--> Restricted choice of methods
Methods exist that are intrinsically tolerant of missing data; however, these are a small subset of relevant methods.
If a value is missing because it is Not Applicable (like the height of the oldest child of someone who does not have any children) then it would not make sense to try and guess what it might be. Within a dataset, each variable would need to be considered on its own merits for a missing value strategy:
- Categorical var 1 ... extra level
- Numeric var 2 ... leave as NA
- Numeric var 3 ... impute
- Categorical var 4 ... impute
- Numeric var 5 ... column omit
- Numeric var 6 ... variable omit
**MV-Tolerant Methods**
A Venn diagram shows the classification and regression methods, and it tells us that there are really only three method groups that are intrinsically tolerant of missing values:
. ada (and its variations)
. C5.0 (and its variations)
. rpart (and its variations)
- Classification
Ada
AdaBoost.M1
AdaBag
C5.0
C5.0Cost
C5.0Rules
C5.0Tree
- Using the caret package naming standard for methods.
We are using caret because it is well documented.
rpart
rpart1se
rpart2
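As a sketch of finding the missing-value-tolerant families in caret's catalogue, using the "Handle Missing Predictor Data" tag listed later in these notes:

```r
library(caret)

info     <- getModelInfo()   # metadata for every caret method
tolerant <- names(info)[sapply(info, function(m)
  "Handle Missing Predictor Data" %in% m$tags)]
tolerant                     # expect the rpart, C5.0 and ada families here
```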
**HIGH-LEVEL-STRATEGY**
Write your missing-data strategy down at the start of your work.
For a given problem, you will need to declare and implement a strategy for missing data
--> Factors that affect this strategy are:
• Proportion missing: small / not small
If only, say, 0.1% of observations have missing values, it is better to remove them.
• Informative missingness: by variable
Is the missingness relevant to the thing we are predicting? This is not easy to establish.
• Missing type: MCAR / MAR / MNAR
Try to get ahead of this.
• Expect in future: Yes / No
• Tolerant methods: Yes / No
Do the tolerant methods suit this data?
That is how you break the problem down:

Kernel (Gaussian) methods generally do not support missing values.
Sometimes we have to make a compromise, and we have to understand what we are doing.
**DETAIL STRATEGY**
For each column that has missing values, your steps should be:
1) Understand why values are missing, if you can. Using the reason discovered (and a Not-Applicable scenario), an obvious value to infill might become apparent (often 0 for numeric or "none" for categorical).
2) If you suspect informative missingness, create a shadow variable.
If in doubt about informative missingness, create the shadow variable anyway, just to be safe.
3) When the missingness is excessive and variable importance is not high, the variable should be removed.
If the variable is
4) Possibly*, impute any remaining missing variable values using the rest of the observation.
For each observation that has missing values, your steps should be:
5) When the missingness is excessive, the observation should be removed.
6) Possibly*, any remaining observations with missing values should be removed.
*Only "Possibly" because you may intend to use a method that is tolerant of missing values, and so you can afford to leave them in your data.
**TAKE HOME MESSAGES**

--> MCAR: both partial deletion and imputation are fine.
--> MAR: there is a pattern, but we can't see the missing values. The pattern can work against us - what if we are deleting the more interesting observations? That can bias the model. E.g. a dataset with a question asking for your weight: women may be more reluctant to provide their weight, so there is a pattern.
If there is a pattern, I might end up removing more missing values for one sex and thus create a lot of bias.
--> MNAR: both partial deletion and imputation can introduce bias.
We want to stay in the green zone (of the summary chart).
**Expecting future missing data**
Expecting future missing data implies the need to consider how to predict in the presence of missing data. This implies either:
•an imputation sub-model being available
•a tolerant method used to create the model
•rejection of incomplete observations
Rejection may or may not be allowed by the application.
You may need a strategy for missing data even when you have no missing training data. Otherwise, how will you handle missing values in future unseen observations or in the test data?
You can try to predict observation missingness, i.e. model the number of missing values per observation. If this produces a viable model you can rule out MCAR. A sketch follows.
If your model, above, shows that certain roles (e.g. Outcome, ID, Stratifier) are important variables, this points to structural missingness. You should investigate why this is happening.
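A hedged sketch of modelling per-observation missingness, assuming a hypothetical data frame `d`; rpart is used here because it tolerates NA predictors.

```r
library(rpart)

d$n_missing <- rowSums(is.na(d))              # count of missing values per observation
miss_fit    <- rpart(n_missing ~ ., data = d) # rpart tolerates NA in predictors
miss_fit$variable.importance                  # a viable model here argues against MCAR
```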
## Create a strategy for missing data and document this in your report.
You might find at the end of the modelling that certain variables were not important (i.e. they were feature-selected out).
Remove excessively missing variables and observations early, and in that order.
Never impute the outcome variable.
Visualize the pattern of missingness in your report (a sketch follows). Suggest what the causes might be based on domain knowledge of the data set. Record what the impact might be upon the model you are reporting.
The report needs to acknowledge the missingness of the data.
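Two common ways to visualise the missingness pattern in R, assuming a hypothetical data frame `d`; either plot can go straight into the report.

```r
library(mice)     # md.pattern()
library(visdat)   # vis_miss()

md.pattern(d)     # table/plot of the combinations of missing columns
vis_miss(d)       # heatmap of missing vs present cells, column by column
```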
# Week 5
## Lect 9 - 23 March - 2021
**Model Answer discussion of Assignment 1**
- Try to find the interesting facts about data
#### Lab 4 - 25 March - 2021
**Outliers**
- Visual Outliers
- If you use ggplot, there is a bug when combining ggplot with plotly.
- Don't use ggplotly.
- Univariable outliers
Ask the question: are the outliers dependent or independent?
Synthesize data to show that univariable outliers, on their own, need not make much difference to the model.
- Model-based outliers
A powerful technique, but it cuts both ways: either the model is not good or the observation is not good, so we have to investigate whichever way the result goes.
## Lect 10 - 26 March - 2021
#### **OUTLIERS**
We tend to assume that when we find outliers we have to get rid of them,
. either by deleting them
. or by providing new values for them.
If we have outliers, there is no obligation to fix them - you may not actually have a problem.
NOTE: Don't treat an outlier simply as something stopping you getting a result. For example, if someone disagrees with you, you don't tell them to step out of the room or make them disappear. In statistics there is a balance between what we accept and what we reject.
1. In statistics, an outlier is an observation that differs significantly from the majority of observations.
2. Outside of statistics, an outlier is an infeasible observation based on your understanding of the problem domain e.g. No house has 148 bathrooms.
3. By definition, statistical outliers must be rare.
4. Outliers are notoriously hard to prove statistically as it all depends on the distribution that is assumed (and its parameters).
5. No Free Lunch – there is no best way to detect them.
==> What if I have -99, or 0 bathrooms? It suggests an unreliable observation.
Many apparent outliers will be false outliers, so we have to find which ones are true and which are false.
Suppose a dataset is supposed to represent observations about houses. An observation has 172 bathrooms.
What can we say about this?
172 bathrooms is a logical outlier. No statistical argument is needed. It is likely that a hotel or hospital has appeared in the dataset.
Can we confirm this suspicion?
Since a hotel or hospital is not a house, it might be best to remove this observation AND ANY OTHERS THAT ARE SUGGESTED BY THIS REASONING. Suppose an observation has -99 bathrooms. What does this suggest? Replace this value with NA?
1. NOVELTIES
- Outliers without any associated judgement.
When you talk about novelties, they are an opportunity to learn, whereas an outlier is something to solve.
Thinking of novelties rather than outliers is a good idea as it encourages us to explain the situation rather than discard it.
One approach to novelties is to replace them with NA. However, we cannot pretend the values are missing at random; imputation will introduce some (acceptable?) bias. If you replace a variable's value with 10000000, the mean will immediately be affected in a big way. In contrast, the median is barely affected. In fact you can add up to 50% outliers before the median is degraded; we say that the median is 50% robust. ML algorithms are robust in varying degrees; most are 0% robust, i.e. not robust.
2. ANOMALIES - outliers whose detection is the sole purpose of an analysis.
3. WINSORIZING - replacing outliers with feasible values.
Not widely heard of; it is used in certain fields, but not many come to mind.
4. ROBUSTNESS - the intrinsic tolerance to outliers. As an example, consider the difference in behaviour between mean and median.
Some methods are tolerant of outliers, and if outliers are giving us problems then robust modelling is an option (refer to the univariable outliers / robust modelling material).
There can be residual outliers on a chart even when the bulk of the data is not sitting on the fitted line; then the data is not actually well fitted and the model is not fitted properly (high bias).
**Significance of Outliers**
- Why are outliers an issue
--> in everyday life?
--> in statistics?
--> in data science?
Outliers are one of the indications that something may be wrong. When you think something is wrong you attempt to prove your suspicions by investigating the situation.
The negative age example: the value is -99. Perhaps this is a placeholder for missing. Is -99 the only negative value? Is there evidence of -99 being used for fields that are from the same source but not part of this study?
By the way, 1-Apr is April Fool’s day.
Outliers are not the only indicator of problems.
● Consider high cardinality of categorical variables
● Consider high missingness.
● Consider whole number measurements
● Consider the presence of duplicates
**Outliers to WHAT**
- Global mean: outliers are always on the outside, based on an assumed distribution.
- Local mean: outliers can be on the inside (say in a hollow zone). This allows for "mixtures".
- Model: outliers are viewed in the context of some model's residuals. This can be a high-dimensional model.
- Modal: outliers are rare when compared to the frequency of the mode.
- Mediocrity: outliers are interesting observations; outliers are where your assumptions will be challenged.
- Quality: outliers are of significantly lower repeatability, i.e. the values change when the observation is repeated.
**RESPONSE to OUTLIERS**
Never ever go for deletion at the first stage.
1. Verify - Always start by investigating the outliers. Trace the data upstream to determine your response.
2. Allow - Leave the observation intact; maybe plan to use ML methods that are robust to outliers.
Using a robust method is somewhat similar to deleting observations, except that the method might have a more flexible criterion. Robust methods do not detect whether a future unseen observation is an outlier.
3. Deletion - Omit observations if you would be willing to not predict observations like this in future.
An assignable cause can make deletion an obvious choice. Can these cases be detected in future unseen observations? Should they be predicted?
4. Modify - to the correct value, once this is researched; or to NA, in which case we now have a missing-value issue to deal with - Missing Not at Random.
Or to a feasible value (winsorizing) - but what feasible value is justified? It's a tough call between deleting and winsorizing, but winsorizing is often the better solution.
Can these cases be detected in future unseen observations? Should they be changed before being predicted?
*Points to remember*
- Whether the outliers are found manually or automatically, they need to be dealt with programmatically.
- Avoid the temptation to do manual deletions or corrections. If you are winsorizing, write some code: if an observation's value is 'something', assign it the value 'xyz'. It's not pretty, but it is still programmatic.
- The process must be repeatable and documented.
- Consider the implications for future unseen observations.
--> Even though we tolerate manual outlier detection, the process of responding must be programmatic so that the same results can be regenerated in the future.
--> Omitted observations need to be recorded in the report (even if it is an appendix.) It makes sense to describe how the observations can be identified and what the response to them was.
--> Develop a strategy for how outliers in future unseen data can be detected and responded to
### ATTRIBUTABLE CAUSE OUTLIERS
- These are outliers that, after investigation, can be explained.
- The outliers were sand-pit (test) data that should have been deleted. Omit all observations with this explanation, whether they are outliers or not.
- The outliers relate to a survey taker who was suspected of faking survey entries. Omit all observations related to this person, whether they are outliers or not.
- The outliers relate to a period leading up to an instrument failure. Omit all observations from this period.
- The outliers relate to an old computer system that recorded data in imperial units. Convert these to metric and re-examine.
- The outliers relate to a particular branch that did not follow the data collection guidelines correctly. Omit all observations related to this branch, whether they are outliers or not.
- The outliers relate to prices where no decimal point was entered, e.g. $12.50 entered as 1250.
### Impossible Observations
- These are outliers that are physically or logically impossible.
- The probability of rain is 1.2 for some observations. Omit these provided we have plenty of observations; otherwise set the value to 1.
- A negative concentration was found. Replace this value with zero if the rest of the observation looks fine; otherwise omit the observation.
- The efficiency of an engine is recorded at 120%. We have perpetual motion! This produces a loss of confidence in the observations - omit ones like this.
- A person's height is recorded as 1.75 km. Change it to metres, but confidence in the observation is reduced.
### Causes of outliers.
1. Chance ( false outliers )
2. Mixture of observation types
3. Measurement errors
4. Data management errors
Clearly these causes generate small effects as well as big effects. Only the big effects will produce outliers. There are likely to be many more small-effect situations than big-effect situations.
### 1. CHANCE OUTLIER

Chance outliers are false outliers in the sense that although they appear to be invalid they are in fact valid observations provided their frequency is appropriate.
The probability of a value below 3 standard deviations is about 0.1%, given the variable has a normal distribution. We can therefore expect an average of about 6 such false "outliers" from a sample of 6000 observations (see the sketch below).
What if there are more than 6? Are there significantly more? Is the distribution really normal? Are the sample standard deviation and mean precise enough? Perhaps some of them are real outliers?
Boxplots and bagplots are reasonable ways to investigate chance outliers, especially after reshaping.
Outliers can occur by chance in any distribution. If we are certain about the distribution we can calculate that probability. Often outliers indicate that the sample has a heavy-tailed distribution and the analysis assumes a distribution without a heavy tail.
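The expected-count arithmetic above, as a short R calculation (the notes round the tail probability to 0.1%):

```r
p <- pnorm(-3)   # ~0.00135; rounded to ~0.1% in the notes
p * 6000         # ~8 expected chance "outliers"; the rounded 0.1% figure gives ~6
```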
## Lect 11 - 30 March - 2021
#### **OUTLIERS - CONTINUE**
### 2. Mixture of observation types
A mixture of two or more distributions, which might be several distinct sub-populations.
A house with 172 bathrooms is probably a hotel. Hotels do not belong in a dataset about houses.
Mixtures of observation types are suggested by bi-modal or multi-modal distributions. Outliers, being rare, also suffer from imbalance - so do not expect balanced peaks.
Cluster or density-based analysis is a good way to detect outliers in the presence of mixtures of observations e.g. houses and hotels.
### 3. Data Management errors
For example:
- Switching the house_area and number_of_bathrooms values for an observation.
- Assigning the wrong council information because the house-plot number was misread from the plans.

Data management errors are surprisingly common. They can arise from changes in measurement units, integer truncation, time zone changes, and mis-matching of outcome variable to predictor variables.
The latter is sometimes a consequence of fuzzy matching: we connect two sources of data using a probabilistic match between them, for example based on name and date of birth.
### 4. Measurement error

For example:
- The house_area measurement might be missing the garage.
- The surveyor's electronic distance meter was out of calibration.
Sometimes called "measurement uncertainty".
Types of measurement error:
1)Instrumental errors
2)Environmental errors
3)Observational errors
Random measurement errors often (but not always) have a normal distribution due to the central limit theorem.
We estimate measurement errors by repeatedly measuring a quantity and calculating the standard deviation of the sequence of measurements
### **Types of outlier detection**
1. General
- Histogram (1-d)
- Scatter plots (2-d)
Histograms and scatter plots, while useful, do not by themselves detect outliers.
2. Model-independent
- Z scores 1-d
- Boxplots 1-d
- Bagplots 2-d
- Mahalanobis distance (multi-dim)
The model-independent techniques assume the data is a single cluster of points. The Mahalanobis distance is the most elaborate. Use a mosaic plot for categorical data.
3. Model-based
- Residuals to some regression model(n-d)
- Cook's distance
- Misclassified observations by some classification model (n-d)
- Outliers to some clustering model (n-d)
The model based techniques make assumptions that are embedded in the assumed model. For example
● Cook’s distance assumes linearity.
● LOF assumes points are linked into chain-like clusters.
● The residual outliers of your preferred model makes an excellent final check.
It's easy to find outliers with a regression model.
### **Outliers to what?**

● Boxplots and Bagplots assume a single cluster of points and are mainly visual aids. They do calculate the outliers. The outliers are not labeled except through extra coding.
● A “one-class” SVM is an example of an anomaly detection method.
● LOF is an example of density based local outlier estimator.
● A Random-Forest is a method to generate residuals from which outliers might stand out.
**Box Plot**

The boxplot itself does not assume anything about the distribution, BUT its "outlier prediction" implicitly assumes the distribution is roughly normal.
The default is an IQR multiplier of 1.5. This default can be changed via a boxplot parameter.
For a normal distribution the quartiles sit at Z = ±0.675, so the IQR is about 1.35 standard deviations. A multiplier of 1.5 puts the whisker limit at 0.675 + 1.5 × 1.35 ≈ 2.7, i.e. by default boxplots declare any observation with |Z| > 2.7 to be an outlier (less than 1% of a perfectly normal sample). This is clearly not magic, nor is it particularly sound to accept the default multiplier unless the distribution is close to normal. A sketch follows.
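A minimal sketch of the two rules above on a hypothetical numeric vector `x`:

```r
# Simple z-score rule.
z <- (x - mean(x)) / sd(x)
which(abs(z) > 3)

# Boxplot rule: coef is the IQR multiplier (default 1.5).
b <- boxplot.stats(x, coef = 1.5)
b$out                          # values flagged beyond the whiskers
boxplot(x, range = 1.5)        # the same rule, drawn
```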
**BOX-Plot NOTCH**

- The solid box is the interquartile range: the 25th percentile value to the 75th percentile value.
- The notch, when shown, is an approximate confidence interval for the median.
- The middle bar is the median (50th percentile) value.
- The whiskers are the minimum and maximum values excluding outliers.
- A single outlier can be seen. This is decided based on the value being > 75th percentile + 1.5 × IQR.
- While difficult to do, it is important to label the outliers with their observation IDs.
**BagPlots**

- A bagplot is the 2-d version of the boxplot.
- Like boxplots, it has a parameter that controls the sensitivity of outlier detection.
- Bagplots are from the “aplpack” package.
- While difficult to do, it is important to label the outliers with their observation IDs.
- A single outlier can be seen. This is decided based on the value being outside the inner bag inflated 3 times.
Note that the criterion for outliers in bagplots is different from boxplots - hence the default factor of 3 versus the multiplier of 1.5.
Bagplots are an improvement over univariable outlier checks, but they are still a long way short of looking at all dimensions at the same time. A sketch follows.
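A sketch of a bagplot on hypothetical numeric vectors `x` and `y`, using the aplpack package mentioned above:

```r
library(aplpack)

bp <- bagplot(x, y, factor = 3)   # factor = how far the bag is inflated to form the loop
bp$pxy.outlier                    # coordinates of the flagged outliers (label these by ID)
```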
### **Power transforms**
Since boxplots and bagplots assume normal distributions when detecting outliers, you can try transforming the non-normal data to be more normal in shape and hence get fewer false outliers.
This is especially relevant to “count” data (typically integer). For example, the number of bathrooms that a house has.
- Log, 1/x, x², √x are all possible transforms to consider. These are generally called power transforms.
- Any outlier detection process can legitimately use such transforms as preprocessing steps, even if the transforms do not become part of the data cleaning pipeline.
- Box-Cox and Yeo-Johnson transforms are special in that they learn from your data and perform an appropriate transform (or no transform) to make the data more normal in shape.
In order to avoid false outliers (because the distribution is far from normal), it might be useful to have the option of transforming the distribution to look more normal. Let's be clear: it does not make it normal. It merely reshapes it to appear more normal.
Count data typically always shows outliers when visualised in a box plot because its distribution looks more like a Poisson, Binomial or Negative Binomial distribution. It is definitely not going to be normal. Such data benefits from a transform.

If the data are counts and strictly positive, use Box-Cox.
If zeros or negative values are present, use Yeo-Johnson.
The two charts (one per transform) are read in the following way:
● Project a non-normal distribution upwards. Choose an appropriate curve (one of the 5 shown, or a curve in between that is not shown). Using that curve, project the distribution to the right to get its reshaped distribution.
● The transform will stretch or squeeze depending on the chosen curve. The Box-Cox transform suits count data, which is never negative (except through centering which, if needed, should be delayed).
Is it fair to reshape a long-tailed distribution? One argument says data is data - keep it as it is. A sketch of the caret transforms follows.
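A hedged sketch of letting caret estimate the transforms for a hypothetical data frame `d` of numeric variables:

```r
library(caret)

pp_bc <- preProcess(d, method = "BoxCox")       # strictly positive data only
pp_yj <- preProcess(d, method = "YeoJohnson")   # handles zeros and negative values
d_yj  <- predict(pp_yj, newdata = d)

boxplot(d_yj)   # re-check for outliers on the reshaped data
```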
### **Univariable outliers**

This shows two variables X & Y. Both histograms show (high) outliers. The boxplots confirm this.
Having identified outliers one might be worried about how well we can hope to model Y ~ X if we leave the outliers in place.
One might (wrongly) be tempted to remove the outliers since we have identified them.
It cannot be stressed strongly enough that an outlier does not prove you have a problem.
- There are X&Y outliers but the model fits this data really well
- This is possible because all the X outliers corresponded with Y outliers
- Uni-variable outliers do not prove we will have modelling problems.
The model (i.e. the 45° line) shows no significant outliers. The outlier observations can be seen at the far end of the 45° line, but they are a good fit to the model.
In this (artificial) example, the presence of outliers means nothing to the modelling process. The bulk of the observations belong to the model in the same way that the rare outliers belong. This demonstrates the weakness of univariable outliers. It also demonstrates the advantage of model-based outliers: the actual-vs-predicted chart is a model-based outlier detector, and it is not fooled by the apparent outliers.
Just because a variable shows univariate outliers does not mean the model will have problems.
### **Mahalanobis Distance**

The Mahalanobis distance is a measure of the distance between a point P and a distribution D, introduced by P. C. Mahalanobis in 1936. It is a multi-dimensional generalization of the idea of measuring how many standard deviations away P is from the mean of D. This distance is zero if P is at the mean of D, and grows as P moves away from the mean along each principal component axis.
If each of these axes is re-scaled to have unit variance, then the Mahalanobis distance corresponds to standard Euclidean distance in the transformed space. The Mahalanobis distance is thus unitless and scale-invariant, and takes into account the correlations of the data set.

I can check homogeneity. I can sort by date, or by any other variable, to make more sense of the data.
The squared Mahalanobis distance D² follows a chi-squared distribution (for multivariate normal data).
A chi-squared method of finding the cutoff threshold for the Mahalanobis distance is given in the worksheet; a sketch is also shown below.
Sometimes a subjective (visual) cutoff works fine too.
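A sketch of the chi-squared cutoff for the Mahalanobis distance, assuming `d` is a hypothetical data frame and only its numeric columns are used:

```r
X   <- scale(d[, sapply(d, is.numeric)])
d2  <- mahalanobis(X, center = colMeans(X), cov = cov(X))  # squared distances
cut <- qchisq(0.999, df = ncol(X))                         # D^2 ~ chi-squared
which(d2 > cut)                                            # candidate outliers
```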
### **Cook's Distance**
Cook's distance applies to linear models, not non-linear ones.

In statistics, Cook's distance or Cook's D is a commonly used estimate of the influence of a data point when performing a least-squares regression analysis. In a practical ordinary least squares analysis, Cook's distance can be used in several ways:
● to indicate influential data points that are particularly worth checking for validity;
● to indicate regions of the design space where it would be good to be able to obtain more data points. It is named after the American statistician R. Dennis Cook, who introduced the concept in 1977.

I can move the threshold a bit higher to isolate the clearest outliers.
A cutoff threshold of 4 × mean(D_C) is sometimes recommended. Note that the points in the chart should ideally be labelled.
In this case the purpose of the chart was to investigate pattern rather than identify outliers. A sketch follows.
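A sketch using the 4 × mean(D_C) cutoff above, assuming a hypothetical linear model of `y` on the other columns of `d`:

```r
fit <- lm(y ~ ., data = d)
dc  <- cooks.distance(fit)
cut <- 4 * mean(dc)
which(dc > cut)                                  # influential observations to investigate
plot(dc, type = "h"); abline(h = cut, lty = 2)   # labelled versions are better for reports
```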
### **Local Outlier Factor**
Outliers are rare. LOF is suited to numeric data and does not assume a single cluster.

LOF stands for Local Outlier Factor. In R it is implemented in the dbscan package.

LOF threshold of 1.6 has been used here. A value of 1 would have generated too many observations to follow up on.
The local outlier factor is based on a concept of a local density, where locality is given by k nearest neighbors, whose distance is used to estimate the density. By comparing the local density of an object to the local densities of its neighbors, one can identify regions of similar density, and points that have a substantially lower density than their neighbors. These are considered to be outliers.
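A sketch of LOF with the dbscan package, assuming `d` is a hypothetical data frame and using the 1.6 threshold mentioned above:

```r
library(dbscan)

X  <- scale(d[, sapply(d, is.numeric)])
lf <- lof(X, minPts = 10)   # local density relative to neighbours (older versions use k =)
which(lf > 1.6)             # observations to follow up on
```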
### **Model Residuals**
We are more concerned with the magnitude of a residual than its sign.

We can be vague about which model is being used to generate these residuals.
We can create
● an all-purpose model like Random Forest
● a linear model, if we are certain the problem is linear
● the best candidate model we can find.
Rather than employ our test data, we can resample our train data (e.g. 10-fold cross-validation) to arrive at a residual for each observation (see the sketch below). Remember that outliers will not only be present in the test data.
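A hedged sketch of out-of-fold residuals via caret, assuming a hypothetical data frame `d` with numeric outcome `y`; Random Forest is the all-purpose choice mentioned above.

```r
library(caret)

ctrl <- trainControl(method = "cv", number = 10, savePredictions = "final")
fit  <- train(y ~ ., data = d, method = "rf", trControl = ctrl)

res  <- fit$pred$obs - fit$pred$pred        # one held-out residual per observation
head(fit$pred[order(-abs(res)), ])          # the largest residuals first
```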

# Week 7
## Lect 1 - 27 April - 2021
- Assignment 2 - model answer discussion
## Lect 2 - 29 April - 2021
#### **SAMPLING STRATEGIES**
It's a slightly difficult area - whenever we do a data science project, we pick a way to split the data and never improve on this step. The first thing to say is that the standard way is not always the best way to do things.
First, know what kinds of sampling techniques exist:
- Stratified / unstratified sampling.
There are better ways than a naive test/train split.
When you are working with time series you must not do a conventional test/train split: time series observations are not independent, because observations ahead in time are related to past observations.
Typically we would set aside, say, the most recent month of data as the test set and train on the older data.
Sampling/resampling: when we use caret (or any other framework) something special happens during training. We feel like we are managing the test/train split, but caret does further splitting by itself for the hyper-parameters.
. You can tell caret to do 10-fold cross-validation.
. The caret package (train in particular) does multiple splits.
LABS:
1 Introduction
1.1 Packages
1.2 Data
2 Sampling 2-ways
2.1 Unstratified Random Sampling
2.2 Stratified Random sampling
3 Sampling 3-ways
4 Cross-Validation sampling
4.1 10 fold
4.2 5 fold 4 repeats
5 Bootstrap sampling
5.1 Random samples (with replacement)
6 Sampling based upon the Predictors (method 1)
7 Sampling based upon the Predictors (method 2)
7.1 Sampling for Time Series
8 Sampling based upon a stratifier group variable
## Lect 2 - Fri 30 April - 2021
-- Resampling
sampling vs resampling

Sampling enables us to determine some characteristic of the population, e.g. an unbiased estimate of a model's MSE.
Resampling enables us to determine the distribution of some characteristic of the population e.g. the MSE: mean, median, variance, quantiles, skew etc.
These terms are frequently misused and come to mean little on-line i.e. be cautious.
**Concepts:**
- tuning hyper-parameters
- selecting a best model
- predicting the accuracy on unseen data
In modelling we can use stratified and unstratified sampling and resampling. We can think of stratified sampling as “supervised”. Sometimes we sample so as to reach a manageable amount of observations with suitable distributional properties.

Observation independence (no relationship between observations, whether linear or non-linear)
When would observations not be independent?
✗ Duplicated data (rows that are copies of other rows)
✗ Medical data where a patient has multiple observations
✗ Time Series data
Randomised Split:
Normally we do a one-time train-test (2-way) split (most common).
However, it is not uncommon to do a one-time train-validation-test (3-way) split (not the caret approach, but still applicable).
The goal is that each set is independent of the other(s); the data is partitioned so that dependent information is in one set or the other - not spread across them.
If I have duplicate cases across test and train, the measured accuracy will be inflated; this is also called data leakage.
Randomised splitting assumes observation independence
**Stratified Randomised Split:**
This is an outcome-stratified form of a randomised split.
If the outcome is nominal, the observations belonging to each level are randomly split and then recombined.
If the outcome is numeric, the outcome is treated as an n-level nominal variable based on n-1 quantiles. It is then split as per a nominal outcome.
**The goal of stratifying the split is to ensure the splits partition the outcome with minimal sampling variance.**
Stratified sampling assumes observation independence.
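A sketch of an outcome-stratified 80/20 split with caret, assuming a hypothetical data frame `d` with outcome `y`:

```r
library(caret)

set.seed(1)
idx     <- createDataPartition(d$y, p = 0.8, list = FALSE)  # stratifies on the outcome
train_d <- d[idx, ]
test_d  <- d[-idx, ]
```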
**Cross Validation**
Cross validation (CV) is a form of partitioning of observations into m equal sized groups called folds.
Folds are generated with one or more assigned as a hold-out and the others are combined.
Repeat this until every fold has had a turn at being held-out.
When m rises to N (the observation count), this is known as Leave One Out Cross Validation (LOOCV)
CV assumes observation independence
Caret package implements a stratified form of cross validation
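A sketch of stratified cross-validation as caret sets it up (using the same hypothetical `train_d` as above); the repeated-CV settings match the "5 fold 4 repeats" lab exercise.

```r
library(caret)

folds <- createFolds(train_d$y, k = 10)                            # stratified folds
ctrl  <- trainControl(method = "repeatedcv", number = 5, repeats = 4)
fit   <- train(y ~ ., data = train_d, method = "knn", trControl = ctrl)
```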

**Bootstrapping**
Bootstrapping is a form of resampling with replacement.
The process of sampling with replacement will (on average for large N) yield 63.2% of unique observations in one group and 36.8% in a holdout group.
Bootstrapping assumes observation independence

**Leave Group out resampling**
None of the previous techniques deals with observations that are not independent; Leave-Group-Out resampling does.
Leave Group Out (LGO) is a stratified form of sampling.
There must be one or more factor variables that describe the relationship between observations.
Observations that share the same level are more similar than those that do not share the same level.
LGO assumes observation independence between groups but not within groups
**Variable Roles From Data Roles presentation**
Valid Roles for variables are:
Outcome i.e. the Y variable
Predictors i.e. the X variables
Case Weight (numeric)
Train-test split indicator (factor or binary)
Observation identifier (100% unique)
Observation Group Stratifier (factor)
Anything that is not one of these roles should be removed. A variable should only have a single role.
A new variable shop+date could have the role of ID
Variable Turnover has the role of Outcome
Date can be expanded into year, quarter, month, dow, holiday
The variable shop could have the role Group or the role Predictor
How do we decide between Group Role or Predictor Role?
If we expect to predict the novel case where shop = "K-Mart" and "K-Mart" does not appear in our training data, then shop is a bit of a problem predictor.
Since shop has 89 levels (i.e. high cardinality) in the data, shop is a bit of a problem predictor.
When a predictor is a problem, there is a good chance it is not a predictor at all.
Observations from the same shop are not independent of each other.
We do not want TopShop/Dorothy/O’Brian in both train and test as this would “leak data”
This style of data leakage needs to be considered whenever the observations are not independent
**GROUPKFOLD - how it works**
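A hedged sketch of how group k-fold works in caret, using the shop/turnover example from above; no shop then appears in both a training fold and its hold-out.

```r
library(caret)

folds <- groupKFold(d$shop, k = 10)          # list of training-row indices, one per fold
ctrl  <- trainControl(method = "cv", index = folds)
fit   <- train(turnover ~ ., data = d, method = "rf", trControl = ctrl)
```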
**Time series resampling**
Time series sampling is a sequential form of sampling where the sampling is applied to the data in its time ordered sequence
There must be a variable that defines the ordering of observations.
The past affects the future but the future does not affect the past.
To sample time-series data fairly, we must pick a point in time: the past is the train data and the rest is the test data. A sketch follows.

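A sketch of ordered time-series resampling with caret, assuming a hypothetical, time-ordered data frame `d`; the window sizes are illustrative only.

```r
library(caret)

slices <- createTimeSlices(1:nrow(d), initialWindow = 100,
                           horizon = 20, fixedWindow = FALSE)   # train on past, test on future
ctrl   <- trainControl(method = "timeslice", initialWindow = 100,
                       horizon = 20, fixedWindow = FALSE)
```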
# Week 8
## Lect 1 - Tue May 4 - 2021
**Maximally dissimilar sampling**
Suppose we have 1 million observations and a support vector machine model. We could let it run over a weekend, or we could try a smaller sample - but we want to choose the smaller subset that is most interesting. We weed out the uninteresting observations by reducing the data in a way that maximises dissimilarity (e.g. via clustering).
This is a predictor supervised form of a non-random sampling.
Suppose we have 1,000,000 observations and we discover that SVM takes 14 hours to run.
What are your options?
Leave it to run over a weekend
Give up and try a different method
Use fewer training observations while maximising the modelling power of the retained data
The idea behind this is simple.
1. Choose the numeric variables that should drive the dissimilarity calculation
2. Do any preprocessing to these variables (e.g. YJ, centre, scale)
- e.g. expand dates into day / week / month features
3. Locate the observations that are most dissimilar
The chosen set represents a diverse sub-sample from the original data. We can sample train and test sets from this sub-sample.
By doing this we have avoided duplicates and near duplicates. The data has a more uniform distributional shape.
Two possible approaches are
caret::maxDissim()
cluster::pam() i.e. k-medoids based
Out of 1 million observations, I might need only 1,000. PAM (Partitioning Around Medoids) gives me the "middle" observation of each cluster.
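A hedged sketch of both approaches, assuming `X` is a hypothetical matrix of preprocessed numeric predictors and we want roughly 1,000 rows:

```r
library(caret)

set.seed(1)
start <- sample(nrow(X), 5)                            # small random starting set
extra <- maxDissim(X[start, ], X[-start, ], n = 995)   # grow it by maximum dissimilarity
X_sub <- rbind(X[start, ], X[-start, ][extra, ])

# Alternative: cluster::pam(X, k = 1000) and keep the medoid of each cluster.
```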
**Take-home messages**
In Caret we employ resampling of the train data in order to tune the hyper-parameters and to create a distribution of performance metrics.
Ask the question – Are the observations independent of each other?
Leave-Group-Out and Time-Series sampling are the only sampling styles that allow observations to not be independent.
Some hyper-parameter tuning setups cannot be done in caret but can be in mlr3.
### **METHOD SELECTION STRATEGIES**
- what methods are out there?
- We mean the data science algorithm (hyper-parametre)
- candidate methods
Method: a data science algorithm (typically with hyper-parameters) e.g.
KNN (parameter: number-of-neighbours)
Candidate Methods: a set of methods that are suitable for a given problem.
Candidate Models: a set of models derived from the same data and produced in a manner that makes it fair to compare them.
Method + Train Data ⇒ Model
Model + Unseen Data ⇒ Predictions
How do you discover available methods?
Which available methods should be attempted?
121 models available...
#### **No Free Lunch Theorem**

Read: a-blog-about-lunch-and-data-science-how-there-is-no-such-a-thing-as-free-lunch
A method that works well with a specific data set does so because it is responsive to the unique qualities of that data set, however, another data set might have other unique challenges that will not be solved with the same method.

**Example:**
An author writes a report about predicting university
student drop-outs using student data.
The report only discusses the use of SVM.
The author concludes that an SVM model can give a classification accuracy of 89% on unseen data.
Has the author done a good job?
How good is 89%?
SVM is good if there is a class imbalance.
If the training data was unbalanced and 87% of students did not drop out, then the most naive model (the null model) would be right 87% of the time. The SVM model would only have improved on this by 2 percentage points.
It is hoped that the study involved checking more than SVM. But how many more?
How did missing-values / outliers / feature engineering drive the choice of method?
Is there a method (say, Gaussian Processes with polynomial kernel, degree = 2) that is superior?
Is there a transparent method like GLM that is statistically indistinguishable in its performance? If that is so, should a transparent model be used in preference?
**Candidate Methods**
Within R there are over 238 different regression & classification methods available
Brute force approach: Try them all to be sure you have found the best method. Why might this be impractical?
Can we make the candidate list smaller?

Support Vector Machines come with their own kernels (polynomial etc.).
Outliers present --> treat them (e.g. set to NA and impute) before passing the data to the method, or use a robust method.
Observation weighting --> choose methods that understand observation weighting.
Implicit feature selection --> choose methods such as Random Forest, decision trees, elastic net and GLM that use implicit feature selection.
Caret methods can be filtered on the following criteria to a feasible subset.
Methods can be classification, regression or both
Methods can be two-class only
Methods can be nominal predictors only
Methods can produce class probabilities
Methods can be binary predictors only
Methods can be robust to outliers
Methods can utilise observation weightings
Methods can be linear or non-linear
Methods can be intrinsically tolerant of missing values
Methods can be tolerant of large numbers of predictors (i.e. implicit feature selection)
Methods can have implicit feature engineering
The following is a basic list of relevant method characteristics.
https://topepo.github.io/caret/index.html
Accepts Case Weights
Bagging
Bayesian Model
Binary Predictors Only
Boosting
Categorical Predictors Only
Cost Sensitive Learning
Discriminant Analysis
Distance Weighted Discrimination
Ensemble Model
Feature Extraction
Feature Selection Wrapper
Gaussian Process
Generalized Additive Model
Generalized Linear Model
Handle Missing Predictor Data
Implicit Feature Selection
Quantile Regression
Radial Basis Function
Random Forest
Regularization
Relevance Vector Machines
Ridge Regression
Robust Methods
Robust Model
ROC Curves
Rule-Based Model
Self-Organising Maps
String Kernel
Support Vector Machines
Supports Class Probabilities
Text Mining
Tree-Based Model
Two Class Only
**Choosing Candidate Methods**
Ensure you have examples of each of the main method "flavours":
Neural networks
Ordinary Least Squares (OLS) - the family of linear models, ridge, lasso
Tree based - classification tree, random forest
Kernel methods - e.g. SVM, which builds an N×N similarity matrix; the kernel tries to establish how observations relate to each other
Ensemble methods
## LAB - Thur May 6 - 2021
Method selection - lab work.. helpful for assignment 3
## Lect 2 - Friday May 7 - 2021

This is not a formal division of methods. It is a subjective grouping. Some methods fall into multiple categories.
In order to experiment with diverse methods we need to know something about method diversity.
Different styles of method have different advantages and disadvantages.
In choosing methods draw from
Knowledge about the data set i.e. the implied constraints
Previous experience
Industry practice
Business knowledge
Common sense
Genetic-like algorithm: Concentrate your search around those methods with the best performance, but also keep trying random methods.
**Common problem Methods**
Logistic Regression
Predicting churn
Credit scoring & Fraud detection
Predicting biological (sigmoid) response
Effectiveness of marketing campaigns
Linear regression
Time to travel between locations
Predicting future sales quantities
Impact of alcohol on coordination
Predicting revenue
Trees
Investment choices
Predicting churn
Loan defaulters
Build vs Buy decisions
Sales-lead quality
SVM
Disease detection
Handwriting recognition
Text topic categorisation
Stock market prediction
Naive Bayes
Sentiment analysis
Recommendations
Spam detection
Face recognition
Random Forest
Patients at high risk
Parts failure in manufacturing
Loan defaulters
Deep Learning
Computer vision
Natural Language processing
**Strategy 1 – Good Practice**
Filter the available methods down using problem constraints
Research the literature for precedents.
Sample methods from each style of model (and the occasional wild-card).
Try further methods similar to ones that perform well.
Investigate methods that surprise you; for example, precedent methods that perform poorly.
Abandon methods that perform worse than the NULL method (a simple baseline you can always build). Also abandon methods that fail to train.
Continue until a reasonable period of time runs out.
Record the process and report the breadth of the search
**Strategy 2 – Poor Practice**
Try your favourite methods – ones you like and train quickly.
Report on the one that performs best on the test data
Do not mention what you tried in the report
Imply your choice is the global best despite no evidence
**Take-home messages**
There is no prior-discernible “best method” for any specific problem
Ensure you have put enough effort into finding a globally best method. Show this in your report.
Employ the constraints that the problem provides
Use tools & frameworks that make this tedious process easy and reliable. Caret is one such framework which provides descriptive tags for methods.
### **CARET**
Caret stands for Classification And REgression Training (not "Classification & Regression Tree", which is CART).
It is an R package designed for:
* Resampling
* Data Visualisation
* Utilising roles for variables
* Feature selection
* Feature engineering
* Standardised method training / predicting
* Method categorisation
* Model hyperparameter training
* Model evaluation
* Model selection
- There is a similar python library called “scikit-learn” (sk-learn)
- tidymodels is an alternative/complementary set of packages to caret
- caret works with packages:
• recipes - preprocessing
• caretEnsemble - ensembling
• textrecipes - text preprocessing
• embed - encoding
• themis - rebalancing
• breakDown - sensitivity analysis
• modelgrid - groups of models
• timetk - time series
The caret package was developed to:
• create a uniform interface for training using ~238 methods.
• create a uniform interface for prediction using ~238 methods.

• create a common workflow for hyper-parameter tuning
• provide standard tools for data sampling / resampling (incl. stratified sampling)
• create a common visualisation for candidate model selection
**Uniform interface**
• Many methods utilise a formula to describe the roles of the variables. For example, lm(y ~ x1+x2+x3, data = d) In this case y has the role of outcome variable, x1,x2 and x3 have the role of predictors.
• Other methods utilise a predictor matrix. For example svm(x = xvars, y = yvar). In this case the x matrix has the role of predictor and the y vector has the role of outcome variable.
• Other methods can be used both ways.
• Caret provides a uniform interface for all its supported methods.
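A minimal sketch (using the built-in iris data, not lecture data) of that uniform interface: the same train()/predict() calls work whether the underlying method is formula-based or matrix-based.

```r
library(caret)
data(iris)

# Formula interface -- same call shape regardless of the underlying method
fit_lm <- train(Sepal.Length ~ ., data = iris, method = "lm")

# Matrix/vector interface -- also supported uniformly
x <- iris[, c("Sepal.Width", "Petal.Length", "Petal.Width")]
y <- iris$Sepal.Length
fit_knn <- train(x = x, y = y, method = "knn")

# Prediction is uniform too
predict(fit_lm,  newdata = head(iris))
predict(fit_knn, newdata = head(x))
```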
**Preprocessing**


**SCOPE**
Caret is not a framework for every data science problem. It is not suitable for:
• Image processing
• Deep learning
• Big Data
• Audio & Video
• Anomaly detection
• Clustering
• Tuned preprocessing
**Take home messages**
• Caret is one of the leading frameworks for routine Classification and Regression
• Caret makes best practice easy
• scikit-learn or tidymodels or MLR3 or something else will eventually replace caret
# Week 9
## Lect 1 - Tues May 11 - 2021
### **Model Tuning**

**Model Tuning** is the process of optimising a method’s hyper-parameters through an iterative process.
In Caret this process is automated and utilises resampling.
Fundamentally it involves producing multiple models (each with different hyper-parameters) and selecting the best model.
Fundamentally this is a model selection problem.
Hyper-parameters are method parameters that must be supplied to the method and whose value cannot be directly estimated from data.
They can be specified by the practitioner.
They can be set using heuristics.
They can be set using optimisation (called model tuning).
Hyperparameter = method parameter
Some examples of method hyper-parameters include:
The learning rate for training a neural network.
The C and sigma values for support vector machines.
The k in k-nearest neighbours
“Many models have important parameters which cannot be directly estimated from the data. For example, in the K-nearest neighbor classification model … This type of model parameter is referred to as a tuning parameter because there is no analytical formula available to calculate an appropriate value.”
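A minimal sketch (using the built-in iris data) of tuning a hyper-parameter in caret by supplying a grid of candidate values and letting resampling pick the best.

```r
library(caret)

ctrl <- trainControl(method = "cv", number = 5)

# Candidate values of the hyper-parameter k for k-nearest neighbours
grid <- expand.grid(k = c(3, 5, 7, 9, 11))

fit <- train(Species ~ ., data = iris,
             method    = "knn",
             trControl = ctrl,
             tuneGrid  = grid)

fit$bestTune   # the selected hyper-parameter value
fit$results    # the resampled metric for every candidate k
```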
**Model Parameters** are configuration variables that are internal to the
model and whose value can be estimated from the given data.
They are required by the model when making predictions.
They are estimated or learned from data during training.
They are normally* not set manually by the practitioner.
They are saved as part of the trained model.
Some examples of model parameters include:
The weights in an artificial neural network.
The support vectors in a support vector machine.
The coefficients in a linear regression or logistic regression.

**SEARCHING**
In Caret, searching for hyper-parameters that produce an optimum metric (RMSE, accuracy etc.) can be done either by a grid method or stochastically.

Stochastic: having a random probability distribution or pattern that may be analysed statistically but may not be predicted precisely.
“….for most datasets only a few of the hyperparameters really matter, but [...] different hyperparameters are important on different datasets. This phenomenon makes grid search a poor choice for configuring algorithms for new datasets.
https://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf
A random search strategy has similar accuracy but is faster.
Caret’s train function allows the user to specify alternative rules for selecting the optimal hyper-parameters. The argument selectionFunction can be used to supply a function to algorithmically determine this. There are three existing functions in the package:
best chooses the largest/smallest value
oneSE attempts to capture the spirit of Breiman et al (1984)
tolerance selects the least complex model within some percent tolerance of the best value.
The default method is best
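A minimal sketch of the two options above: the selectionFunction argument and random search, both set through trainControl().

```r
library(caret)

# oneSE: prefer the simplest model within one standard error of the best
ctrl_onese  <- trainControl(method = "cv", number = 5,
                            selectionFunction = "oneSE")

# Random search over the hyper-parameter space instead of a full grid
ctrl_random <- trainControl(method = "cv", number = 5, search = "random")

# Either control object can be passed as trControl =
fit <- train(Species ~ ., data = iris,
             method     = "knn",
             trControl  = ctrl_random,
             tuneLength = 10)   # number of random candidates to try
```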

There are several plausible optimal values for “# Boosting Iterations”. Three is the highest, but given the fluctuations, 1 or 2 might be more realistic (i.e. similar performance but lower complexity).
**Take-home messages**
There is generally no prior-discernible “best hyper-parameters” for any specific problem
Ensure you have put enough effort into finding a globally best set of hyper-parameters. Show this in your report.
Use tools & frameworks that make this tedious process easy and reliable.
## Lect 2 - Thursday May 13 - 2021
LAB - discussion - Model Tuning
Model selection & Hyper parameters.
## Lect 1 - Friday May 14 - 2021
Model selection can be based on any one of many model performance metrics.
This becomes particularly important for classification where the metrics perform quite differently.
Hyper-parameter optimisation demands resampling, to avoid (wrongly) assessing the model on the same data used to train it.
Statistically, choosing the best model requires a distribution of metrics, which arises naturally from resampling.
### Class Imbalance
Imbalance applies to categorical variables
• Class imbalance is the unequal proportions of levels (categories) in the outcome variable
• Predictor variables can also be imbalanced. Sometimes this is important too. When?
This is specific to the outcome variable. Strictly, we should say that class imbalance applies to nominal variables but not to ordinal variables.
Imbalance in the categorical predictors is important if your goal is to use your model to explain cause and effect.
### Causes of imbalance
● Chance - The occurrence of some things are just rare e.g. cancer
● Cost - Some types of observation are expensive / time consuming / difficult. We tend to have a disproportionate number of cheap / fast / easy observations.
● Sampling Bias - The sampling of observations might be biased away from certain characteristics.
Normally the reason is chance
### Significance
This is a classification issue only
● “Null” models do quite well when assessed using accuracy
● Certain methods become distracted by the more frequent outcomes
● Model performance becomes harder to judge (more sensitive to the chosen metric)
a) Hyper-parameters can get incorrectly set
b) Model selection can get difficult
c) The final assessment of being fit-for-purpose can be invalid
For example, class levels A : B : C at 20:50:30. Is 55% accuracy doing “well”?
Why would “null” models do well?
What is a null model in classification?
Why is accuracy a poor metric in the presence of class imbalance?
Null model would predict the mode i.e. the B level.
We will be right 50% of the time using the train data. 55% is only slightly better than our null model. This is not doing well at all.

This comparison against the null model is a great way to evaluate your classification performance, whether you have imbalance or not.
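A minimal sketch of that calculation for the 20:50:30 example above.

```r
# Null model: always predict the most frequent class (the mode, "B")
class_props   <- c(A = 0.20, B = 0.50, C = 0.30)
null_accuracy <- max(class_props)    # 0.50

model_accuracy <- 0.55               # the accuracy we are judging
model_accuracy - null_accuracy       # only 0.05 better than the null model
```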
### Prevention
• Good experimental design
Often too late for this.
• Redefining the problem by coalescing levels
For multi-class imbalanced problems.
• Redefining the problem as anomaly detection
For binary imbalanced problems.
Good experimental design is the best prevention BUT... this is seldom in a data scientist’s control.
Coalescing levels is about reallocating the levels so that the cardinality drops. Typically we merge several levels together.
By redefining a classification problem into an anomaly detection problem, the hope is that with the right feature engineering, the anomalies can correspond strongly with the rare class. This only makes sense for binary situations

## Response to imbalance
1) Observation weighting
2) Rebalancing the train data
3) Cost sensitive methods
4) Appropriate metrics for assessing models
The list shows the actions we take when we have class imbalance.
The first 3 are mutually exclusive.
The last one can be used alongside the first 3.
#### 1 Observation weighting
1) Observation weighting
• Weight the observations to compensate for their rareness
• Weighting = 1/proportion
2) Use classifier methods that utilise weightings
3) Supply the weight role to the method

The advantage of weighting is that we can get back to perfect balance WITHOUT throwing away any observations.
The disadvantage is that not all methods utilise weights.
Do not assume a metric is necessarily a weighted metric just because the method was one that utilised weights. The weights of the out-of-sample observations are not utilised when calculating a metric.
Weights are numeric (decimal) values; they do not have to be integers and they do not have to sum to 1.
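A minimal sketch (not from the lecture), assuming a data frame `d` with a factor outcome `y`, of the weighting = 1/proportion rule, passing the case weights to a caret method assumed here to honour them.

```r
library(caret)

# Weight each observation by 1 / (proportion of its class)
class_prop <- prop.table(table(d$y))
w <- as.numeric(1 / class_prop[as.character(d$y)])

fit <- train(y ~ ., data = d,
             method    = "rpart",      # assumed to accept case weights
             weights   = w,
             trControl = trainControl(method = "cv", number = 5))
```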

Since class imbalance is only a classification issue, there is no point showing regression methods that utilise weights. Notice that two of these methods are cost-based methods.
### 2) Rebalancing the train data
➔ By down-sampling the frequent classes
● Down-sample the common observations in the train data
e.g. 10% : 20% : 70% to 1-in-3 to arrive at 10:20:23 → 18% : 38% : 43%
● Not on test data
➔ By up-sampling the rare classes
● Up-sample the rare observations in the train data e.g. 18% : 38% : 43% by 2 to arrive at 36:38:43 →31% : 32% : 37%
● Not on test data
➔ Hybrids of up & down
Remember that down-sampling while maximising diversity is a good strategy
Up-sampling means creating new (duplicated) observations. Down-sampling means throwing away unwanted observations. Hybrids of both up- and down-sampling are better than either on their own.
### Rebalancing methods

The step_rose() function is based on Random Over-Sampling Examples (ROSE). This is binary only.
The step_smote function is based upon Synthetic Minority Over-sampling Technique (SMOTE). This is multi-class.
The idea about maximum dissimilarity goes like this: Suppose your data had (near) duplicated observations. Any down-sampling technique that seeks to remove observations should begin by down-sampling the duplicates until there is only one of each.
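A minimal sketch, assuming a training data frame `train_df` with factor outcome `y`, of rebalancing via recipe steps from the themis package (rebalancing steps are skipped when baking new data, so test data is never rebalanced).

```r
library(recipes)
library(themis)

# Down-sample the frequent classes
rec_down <- recipe(y ~ ., data = train_df) %>%
  step_downsample(y)

# SMOTE up-sampling of the rare classes (SMOTE needs numeric predictors first)
rec_smote <- recipe(y ~ ., data = train_df) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_smote(y)
```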
### Cost sensitive methods
Cost-based methods attempt to automatically compensate for the rare levels as part of their methodology.
Using Caret naming standards for methods:
• mlpKerasDropoutCost, mlpKerasDecayCost
• C5.0Cost
• svmRadialWeights
• rpartCost
• svmLinearWeights, svmLinearWeights2
# Week 10
## Lecture 1 - May 18 - Tuesday
### 4) Appropriate metrics for assessing models
Use a measure of classification performance that is less prone to favour the most common outcomes
➢ Area under the curve (only for binary classifiers?)
➢ Precision: the number of true positives divided by all positive predictions. Also called Positive Predictive Value.
➢ Recall: the number of true positives divided by the number of positive values in the test data. Also called Sensitivity or the True Positive Rate.
➢ F1 Score: the combination of precision and recall.
➢ Cohen’s Kappa performance metric
Classification metrics that are poor in the presence of imbalance include measures like
● Accuracy
● Misclassification rate
● Specificity
Bear in mind that AuC is not a perfect solution to imbalance because it is only defined for binary classification. However there are techniques that extend AuC to be the average of several one-versus-the-rest AuC measures.
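A minimal sketch (toy vectors, not lecture data) of computing the imbalance-friendly metrics with caret's confusionMatrix().

```r
library(caret)

obs  <- factor(c("pos","neg","neg","pos","neg","neg","neg","pos","neg","neg"),
               levels = c("pos","neg"))
pred <- factor(c("pos","neg","neg","neg","neg","pos","neg","pos","neg","neg"),
               levels = c("pos","neg"))

cm <- confusionMatrix(data = pred, reference = obs, positive = "pos")
cm$overall["Kappa"]                       # Cohen's Kappa

# Precision, Recall and F1 instead of Sensitivity/Specificity
confusionMatrix(data = pred, reference = obs, positive = "pos",
                mode = "prec_recall")
```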
#### Strategy

• Outcome class imbalance is a significant problem.
• Weighting observations is an easy fix but your methods need to honour weightings.
• Accuracy is not an appropriate performance metric for imbalanced problems.
• Rebalancing the data removes the problem.
• Never rebalance test data.
Do any rebalancing after the in-sample out-of-sample division. Do it only to the in-sample set.
Accuracy is not the only inappropriate metric.
## ENCODING
• Encoding is the transform that turns categorical data into numeric data
• Novel value is a level of a nominal predictor that has not been seen during training. It might arise for the first time in the test data or it might arise months later in some future unseen data.
• Cardinality is the count of distinct levels of a variable.
• High cardinality is when the cardinality exceeds a value somewhere between 15 and 50
• Nominal is a type of categorical data that has no order.
• Ordinal is a type of categorical data that has order
A cardinality > 50: definitely high. A cardinality < 15: definitely low. A cardinality between 15 and 50 could go either way.
### The Problem
➢ Most methods cannot cope with any categorical variables. A pre-processing encoding transform is needed. (Tree-based models can deal with categorical variables.)
➢ Those methods that can cope with categorical variables do not cope with high cardinality categorical variables. Random forest, for example, deals with categorical predictors but cannot handle more than about 55 levels.
➔ Some encodings generate a large number of extra variables when applied to high cardinality variables.
e.g. one-hot encoding, which creates one dummy variable per level.
➔ Some encoding do not deal well with novel values
Ordinal encoding - label encoding - does not deal with novel values.
Which methods do allow categorical predictors?
● Tree based methods
● Naive Bayes
### The encoding problem
Generating a large number of numeric variables (from encoding a high cardinality variable) is a problem because:
● Models end up with high complexity (hence potential to over-fit)
● Models require more memory
● Models require more observations (to avoid over-fitting)
● Models train more slowly (in a non-linear, roughly squared, fashion) as the number of variables and observations grows
High cardinality nominal predictors occur frequently in real-life datasets. For example:
Address (Street, City, postcode)
Product description
Tweet
Patient note
Email
### Response
1) Reduce the cardinality by coalescing levels
2) Choose an encoder that does not produce too many extra variables AND handles novel values acceptably
3) Discard the predictor (as a last resort)
## 1) Reduce cardinality
The process requires a threshold percentage below which levels are typically renamed to “Other”. Novel levels can be allocated to “Other” as well.
This is an unsupervised process.
The levels can also be coalesced manually especially on the basis of some hierarchy – for example substituting country names for city names.
Consider whether the new “Other” level is too frequent in the training data; if so, drop the threshold percentage.

Note that the new level called “other” (the first bar in the second chart) is way higher than any others. I would suggest it is too high.
A lower threshold will lower this bar to fit in better with the others.
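A minimal sketch, assuming a data frame `train_df` with a hypothetical high-cardinality nominal predictor `city`, of coalescing rare levels with recipes::step_other().

```r
library(recipes)

rec <- recipe(y ~ ., data = train_df) %>%
  step_other(city, threshold = 0.05, other = "Other")   # levels below 5% become "Other"

prepped <- prep(rec, training = train_df)
baked   <- bake(prepped, new_data = NULL)
table(baked$city)   # check that "Other" is not too dominant; lower the threshold if it is
```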
## 2)Choose an encoder that does not produce too many extra variables AND handles novel values acceptably
### Dummy encoding
➢ Converts nominal predictors using dummy encoding
➢ Converts ordinal predictors using a form of label encoding
* If you have categorical data, this is the implicit encoding for many methods.
* It uses orthogonal polynomials for label encoding, where the polynomial degree is given by the cardinality – see contrasts.pdf.
Dummy encoding is VERY common in R. It is the default for many methods that cannot deal with categorical predictors. Formula-based methods typically employ dummy encoding. Note that nominal and ordinal variables are treated differently. Learn more about orthogonal polynomials and “contrasts” by following the link.
## Lecture 2 - May 20 - Thursday
## LAB...
## Lecture 3 - May 21 - Friday
## LAB...
Dummy encoding gives us one less column compared with one-hot encoding.
### Binary dummy encoding
Why do you think dummy encoding removes one of the levels? Consider linear combinations of variables.

Not a good choice for high cardinality, as it gives us a huge number of columns.
### Label encoding
Note that the ordinal factor must be ordered in the correct manner for label encoding to work best. This can be used to encode nominal (i.e. non-ordinal) variables too, but this is not strongly recommended.

### One hot binary encoding

What situations would one hot encoding be a problem?
A poor choice with high cardinality.
Consider linear regression
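A minimal sketch (toy data) contrasting dummy encoding (c − 1 columns) with one-hot encoding (c columns) using recipes::step_dummy().

```r
library(recipes)

d <- data.frame(y   = rnorm(6),
                day = factor(c("Mon","Tue","Wed","Mon","Tue","Wed")))

# Dummy encoding: drops one level to avoid a linear combination of columns
rec_dummy  <- recipe(y ~ day, data = d) %>% step_dummy(day)

# One-hot encoding: keeps every level
rec_onehot <- recipe(y ~ day, data = d) %>% step_dummy(day, one_hot = TRUE)

bake(prep(rec_dummy,  training = d), new_data = NULL)   # day_Tue, day_Wed
bake(prep(rec_onehot, training = d), new_data = NULL)   # day_Mon, day_Tue, day_Wed
```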
### Target encoding (good for high card)
● Converts a nominal predictor to a single numeric predictor in a supervised manner.
● Each encoding is related to what outcome is predicted by that particular level alone.
● Novel levels are given an encoded value of zero.
● This is a good choice for high cardinality predictors.
● There are variations of this known as Mean encoding, Impact encoding, and Likelihood encoding.
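A minimal sketch, assuming a data frame `train_df` with outcome `y` and a hypothetical high-cardinality predictor `suburb`, of target/likelihood encoding using the embed package listed earlier.

```r
library(recipes)
library(embed)

rec <- recipe(y ~ ., data = train_df) %>%
  step_lencode_glm(suburb, outcome = vars(y))   # suburb becomes one numeric column

prepped <- prep(rec, training = train_df)
bake(prepped, new_data = NULL)
```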
### Hash encoding (repeatable) - good for high cardinality
Converts a nominal predictor (with cardinality c) to d numeric predictors in an unsupervised manner - where d is much less than c.
An optimum value for d needs to be arrived at.
Each encoding is related to the decomposition of the characters (via a hashing algorithm) into a set of numbers.
Since there is no “learning”, novel values are encoded without difficulty.
This is a good choice for high cardinality predictors
Hash encoding can generate one or more numeric variables that are either constant or Near-Zero-Variance. Look out for this as it might be a problem.
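A minimal sketch, assuming a data frame `train_df` with a hypothetical high-cardinality predictor `product_code`, of hash encoding via the textrecipes package listed earlier (the step name and arguments are my assumption of that package's API).

```r
library(recipes)
library(textrecipes)

rec <- recipe(y ~ ., data = train_df) %>%
  step_dummy_hash(product_code, num_terms = 16)   # d = 16 numeric columns, d << cardinality

prepped <- prep(rec, training = train_df)
baked   <- bake(prepped, new_data = NULL)

# Hashing can yield constant / near-zero-variance columns, so check for them
caret::nearZeroVar(baked, saveMetrics = TRUE)
```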
### Text features encoding
Converts a text predictor (with cardinality c) to d numeric predictors in an unsupervised manner - where d is much less than c.
Each encoding is related to the decomposition of the predictor into a set of numbers. For example, number of words, sentiment, number of characters etc
Since there is no “learning” novel values are encoded without difficulty.
This is a good choice for multi-word high cardinality predictors.
### Text embedding
Converts a text predictor (with cardinality c) to d numeric predictors in an **unsupervised** manner - where d is much less than c.
Each encoding is related to the decomposition of the predictor (via a high dimensional word embedding) into a set of numbers.
This is a **good choice for high cardinality predictors.**
An optimum value for d needs to be arrived at.
The word embedding can be pre-trained on a relevant corpus.
### Strategy
For a given problem, you will need to declare and implement a strategy for each high cardinality predictor.
➢The best strategy need not be the same for all categorical predictors in a data set.
➢Identify ordinal predictors – ensure their ranking is set correctly – medium, slow, fast is wrong
➢Identify low cardinality predictors. These (incl. ordinal) can be dealt with using Dummy encoding in R
➢Use some experimentation to determine the best strategy for each high cardinality predictor based on optimising the model metric
Understand your data, eliminate variables where appropriate, and make a judgement.
### Take home messages
• One-hot encoding includes all levels.
• Low cardinality ⇒ Dummy encoding ⇒ Binary + Label encoding.
• High cardinality ⇒ Coalesce levels?
• High cardinality ⇒ Impact encoding? or Hash encoding? or word encoding? etc.
• Novel levels are impacted by the type of encoding.
• Potentially employ a different encoding for each categorical variable.

## Novel levels
### Concepts
•Novel value is a level of a nominal predictor that has not been seen during training. It might arise for the first time in the test data or it might arise months later in some future unseen data.
Closed categorical variable: The set of levels that covers the days of the week are a fixed set. Monday to Sunday. We can argue that we should see no others.
Open categorical variable: The set of levels that covers the colours of the visible spectrum are an open set for practical purposes. We can speculate that we might get a colour between red and purple (e.g. burgundy) that does not exist in the training data.
If we find a day-of-the week called “Lundi” we need to reassign it to one of the existing ones.
### The problem
When “closed categorical” variables have values not seen during training, they must be mapped to the existing values. For example “Mon” maps to “Monday”. Typically these have cardinality less than ~15.
When “open categorical” variables have values not seen during training, there is a potential difficulty generating a prediction for these observations. We can anticipate this by using an encoding that makes novel data usable

We can hope that all the values that will ever occur will be present in the training data i.e. closed categories. Our degree of confidence is inversely related to the cardinality of the categorical variables. Or we can know in advance that the category is open through domain knowledge

### Response
1) Raise a warning or error and produce no prediction or a prediction of low quality.
2) Cope with the new level in an acceptable manner.
3) Cope with the new level without loss of prediction quality.

### Dummy encoding
If the nominal variable gets dummy encoded then a novel level cannot be assigned to a dummy variable.
Methods will typically warn but proceed. One-hot: ignores value. Dummy: assigns to the same as the first category.
For ordinal factors, novel levels are encoded to zero.
Either way novel values play no part in the prediction.
This is an example of type 1 response
### Target encoding
If the nominal variable gets target encoded (or similar) then a novel level cannot be assigned to a set of target values.
The novel level is assigned a target encoding of zero
Methods will typically warn but proceed.
This is an example of type 1 response
### Coalescing
If the nominal variable gets coalesced (maybe because of high cardinality) then it will have a level (typically called “other”) which represents the rare levels.
A novel level can be argued to be rare – since it did not occur in the train data.
Therefore any novel levels can be assigned to the “other” level.
This is an example of type 2 response
### Hash & text encoding
➢ Hash encoding can process a novel level without difficulty.
➢ Text-feature encoding can process a novel level without difficulty.
➢ Text embedding can process a novel level without difficulty.
These are examples of type 3 response


# Week 11
## Lecture 1 - Tues May 25
### Model Lifecycle
**Regenerate** – To repeat the training of a model using a documented procedure, allowing for different observations. It consists of updating the data followed by retraining.
**Deploy** – The process of exposing the model operationally to unseen data
**Process Control charts** – The control chart is a graph used to study how a process changes over time. Data are plotted in time order. A control chart typically has a central line for the average, an upper line for the upper control limit, and a lower line for the lower control limit. These lines are determined from historical data. By comparing current data to these lines, you can draw conclusions about whether the process variation is consistent (in control) or is unpredictable (out of control, affected by special causes of variation).
### Concepts
**Model Decay (Model Drift)** – Broadly, there are two ways a model can decay. Due to data drift or due to concept drift.
1) Data drift: the data evolves with time, potentially introducing a previously unseen variety of predictor data and new categories, but there is no impact on previously labelled data.
2) Concept drift: our interpretation of the outcome changes with time even while the general distribution of the predictor data does not. This causes the end user to interpret the model predictions as having deteriorated over time for the same/similar data.

### Data Drift
Data drift arises because the process that generates the predictor data is not stationary, i.e. it undergoes change. This is something we expressly look for in time-series modelling but not in regression or classification. Data drift occurs when the distribution of predictor values, their correlation structure or the frequency of missingness/outliers changes in a manner that is statistically significant. When time is not a significant predictor, that only proves the rate of change of the data is small enough to be hidden by noise (for the duration spanned by the training data), not that the data is stationary.
### Concept Drift
Concept drift arises when our interpretation of the data changes over time (even while the data may not have.)
**Classification**: What we agreed upon as belonging to class A in the past, we now claim should belong to class B, as our understanding of the properties of A and B change e.g. definition of a terrorist.
**Regression**: What we agreed as the outcome variable in the past that best represented the thing we wanted to predict, we now understand to be better represented by some alternative measurement e.g. predicting happiness in citizens.
### The problem
Once you have deployed your machine learning model, the realities of real life will result in model decay over time. Regenerating and redeploying will be required. In other words, model building should be treated as a repetitive process rather than a one-pass process.

### Response
Once the drift is detected, the model needs to be updated and retrained.
In general:
If we diagnose concept drift, the affected old data needs to be relabelled as well and the model re-trained.
If we diagnose data drift, enough of the new data needs to be introduced and the model re-trained. Maybe some old data is phased out in the process.
A combination of the above applies when we find that both data and concept have drifted.
### Detection - model decay
The best way to detect model decay is to make the effort to label at least some of the new data on a routine basis and look for degradation in the predictive ability of the model. A control chart of residuals/misclassifications could detect changes in the mean (i.e. model bias) and spread of the residuals (model variance). When we feel that the degradation is no longer tolerable we will need to rebuild the model.
### Detection - Data drift
A way to detect data drift is to look for heterogeneity between the train data and the recent data, i.e. see if recent predictors are different from past predictors. We can do many statistical tests to disprove the hypothesis that the data are from the same population.
For example using:
Feature means and spreads (ANOVA)
Hotelling’s T2
Correlation matrices
Missingness rates
Outlier rates
### Detection
Outlier differences: Are the rates of outliers statistically different between training and a set of recent observations? This is essentially an anomaly detection problem. Suppose the training data had 3% outliers but the recent 80 observations have 6% outliers (by the same methodology).
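A minimal sketch, assuming vectors `x_train` and `x_recent` holding one numeric predictor from the training and recent data, plus hypothetical outlier counts matching the example above.

```r
# Distributional difference in a single predictor (two-sample KS test):
ks.test(x_train, x_recent)          # a small p-value suggests the distributions differ

# Outlier-rate difference, e.g. 3% of 1000 train observations vs 6% of 80 recent ones:
prop.test(x = c(30, 5), n = c(1000, 80))   # hypothetical counts; tests equal outlier rates
```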

### Life cycle phases

### Deployment
Deployment typically involves a 3rd party; typically an IT department. They might request evidence of:
Unit tests
Scalability
Support documentation
It may be that the model was developed on a subset of the data. Then deployment also involves a full scale model build.
Deployment may also involve a rewrite, e.g. R to Python. If a rewrite is undertaken you need to prove that the two forms produce equivalent results. The prediction mode of a deployed model might have special requirements and implications regarding handling of outliers, novel values and missing values.
### Monitoring
What is valuable to monitor?
- Server (operational behaviour)
  - CPU load
  - Memory (in case of memory leaks)
  - Time per request (especially for lazy learning)
  - Server log events (especially server restarts)
- Model decay - ensure a proportion of future observations have known outcomes
  - Residual-bias control chart (regression)
  - Residual-variance control chart (regression)
  - Misclassification-rate control chart (classification)
- Data drift
  - Predictor outlier-rate control chart
  - Predictor missingness-rate control chart

### Planning for re-use
Plan your code to be reused in the future.
Think about:
Setting random seeds to allow precise repeatability
Change control of source code
The effects of Package updates (Docker deals with this in Python)
Highlighting your dubious assumptions so they can be reviewed in the future.
Your code should not be too tightly bound to the training data (esp. outlier treatment)
Remember that the scripts needed to query databases are also “source code”
Write notes for the person doing the regeneration in the future.
Archive the raw data (in case the query is not repeatable)
Try not to hard code the names of variables. For example, rather than name the date variables, use program code to identify date variables.
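A minimal sketch of two of the points above: fixing the random seed and identifying date variables by code rather than hard-coding their names (assuming a data frame `d`).

```r
set.seed(2021)   # makes the resampling and any random search precisely repeatable

# Find the date columns programmatically instead of naming them
date_vars <- names(d)[vapply(d, inherits, logical(1), what = "Date")]
date_vars
```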
### Strategies
Strategies for regenerating:
Do periodically to keep it relevant to new data (ensure neither too freq nor too infreq)
On demand when sufficient new labelled training data is available (ensure new data accumulates at the required rate)
On demand when monitoring shows drift (provided there is enough new data)
Before redeployment someone must formally validate the replacement model:
• Check any messages/warnings output during the running of the regeneration script.
• Check that the "test" summary statistics are the same (or better) than they were.
• Check the predictions of a set of testing cases are within a small margin of what they previously were.

Model building is not “one-pass.” If you have a monitoring strategy in place you will discover problems before it is too late.
All models decay – only the time frame changes.
Plan your code to be reused in the future.
Regenerating & redeploying needs careful management. It can go horribly wrong.
## Near Zero Variance
Near-zero-variance (NZV) is a situation in which random sampling from a population occasionally produces a sample that is constant (therefore zero variance) despite the population variable not being constant.
This is a feature of both numeric and nominal data.
### The Problem
Many common scenarios involve near-zero-variance predictors such as:
Var1: 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Var2: 7,7,7,7,7,7,7,7,7,7,7,7,7,3,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7
Var3: G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,F,G,G,G,G,G,G,G,G,G,G,G,G
If this dataset were used in cross validation (or similar) some of the folds would, statistically speaking, miss the 1, 3 and F values. This implies:
The train data would be constant (zero variance) for this variable (for a proportion of the CV folds)
The model assessment would see “F” as a novel value
The first variable contains a solitary “1”
The second variable contains a solitary “3”
The third variable contains a solitary “F”
The process of encoding categorical variables into numeric variables via dummy encoding creates sets of binary variables.
These binary variables lend themselves to near-zero-variance issues when the categories have low frequency levels present. This situation will generate a binary column of zeros with the occasional 1.
### Response
1) Avoid doing randomised resampling. (But we want to do hyper-parameter optimisation.)
2) Get more observations of the rare values. (But not through (late) up-sampling. Why?)
3) Coalesce rare categorical levels.
4) Avoid OLS methods that have a problem with zero variance. (But OLS methods are among the most transparent ones.)
5) Use methods & encodings that are tolerant of novel levels. (But these are for high cardinality variables.)
6) Discard the NZV predictors as a last resort. (A bit drastic.)
Cannot avoid resampling when you have hyper-parameters to optimise.
Up-sampling should be done within the resampling loop - by that point the problem has already happened.
Upsampling before the resampling will cause data leakage.
Coalescing is more difficult when cardinality is low.
We can avoid OLS provided transparency is not important.
Encodings that are tolerant of novel levels are typically the high cardinality ones.
Discarding will always work but it’s a bit drastic
### step_zv()
A recipe step that removes zero variance predictors from each of the resamples. That is, it removes constant variables from each of the resamples.
It does this in the context of resampling, so it may remove predictors that are not consistently constant in all of the resamples.
### step_nzv()
A recipe step that removes predictors whose variance is likely to drop to zero for any of the resamples.
That is, it removes variables that are likely to be constant for any of the resamples.
In contrast to step_zv(), this step will more aggressively drop predictors. It will typically reach the same conclusion for all resamples.
step_nzv() has parameters that tune how aggressive it will behave. For most purposes the defaults are fine.
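A minimal sketch, assuming a training data frame `train_df` with outcome `y`, of adding the near-zero-variance filter to a recipe.

```r
library(recipes)

rec <- recipe(y ~ ., data = train_df) %>%
  step_nzv(all_predictors())   # defaults: freq_cut = 95/5, unique_cut = 10

prepped <- prep(rec, training = train_df)
prepped   # printing the trained recipe lists any predictors that were removed
```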

## Feature engineering
### Concepts
Feature: A feature is an individual measurable property or characteristic of the phenomenon being observed, i.e. another word for a variable.
Feature engineering: The dual aspects of feature selection and feature extraction combined. Feature selection and feature extraction are intertwined. The overall goal of feature engineering is to concentrate importance, i.e. more importance and/or fewer variables.
Feature extraction: refers to creating new variables that are potentially more informative than the variables from which they were derived. Sometimes we immediately replace the old variable with the new one(s); sometimes we add the new variables. Feature extraction is about transforming or adding variables to the problem.
Feature selection: refers to the process of automatically or manually selecting those features which contribute most to your prediction of your output variable. Having irrelevant features in your data can decrease the accuracy of the models on unseen data and make your model learn based on irrelevant features. Some methods are able to do implicit feature selection. Feature selection is about removing unimportant variables from the problem.
### Iterative nature
The dual problems of feature selection and feature extraction are related. Some feature extraction mostly produces “unimportant” variables, e.g. hash encoding.
An “important” variable is more likely to be transformed into a more-important variable through feature extraction than an unimportant variable is. To manage a large number of predictors, feature extraction needs to focus on “important” variables.
Therefore feature selection might need to be applied before and after feature extraction.
**Take-home messages**
There is no prior-discernible “best method” of undertaking feature engineering. Be guided by business knowledge. Feature selection and extraction are iterative.
### Feature extraction strategies
### Concepts
Feature: A feature is an individual measurable property or characteristic of the phenomenon being observed i.e. another word for a variable.
Feature extraction: refers to creating new variables that are potentially more informative than the variables from which they were derived. Sometimes we immediately replace the old variable with the new one(s); sometimes we add the new variables.
Latent variables: are variables that are not directly observed as raw variables (i.e. lie hidden) but are inferred from raw variables. Their uncovering is typically a learned process rather than a manual one.