Sampling from a sample, Model Variance

# Sampling from a sample

slide: https://hackmd.io/@ccornwell/sampling

---

<h3>So you have data...</h3>

<font size=+2>How do you use it to make an ML model? Some rules:</font>

1. <font size=+2>**Split the data you have** into training data and test data.</font>
2. <font size=+2>**Don't touch the test data until parameters are set.**</font>

![](https://i.imgur.com/rKMV18f.png =x300)

----

<h3>So you have data...</h3>

<font size=+2>How do you use it to make an ML model? Some rules:</font>

1. <font size=+2>**Split the data you have** into training data and test data.</font>
2. <font size=+2>**Don't touch the test data until parameters are set.**</font>

![](https://i.imgur.com/cCV9mC6.png =x300)

---

<h3>How to split the training and test data?</h3>

- <font size=+2>Each data point in your set will go into either the *training* bin or the *test* bin. For each spot in *test*, you want each point to be equally likely to get that spot.</font>
- <font size=+2 style="color:#181818;">How to do this? Say you have 100 data points and want 10 of them in *test*.</font>
- <font size=+2 style="color:#181818;">Could go through each point and flip a (90/10)-coin to decide if it goes in the test data. (If you do it this way, you likely won't have exactly 10 in the *test* set.)</font>
- <font size=+2 style="color:#181818;">Could consider each of the 10 spots. For each, randomly choose a number 1-100 and put that point in that spot. Issues?</font>

----

<h3>How to split the training and test data?</h3>

- <font size=+2>Each data point in your set will go into either the *training* bin or the *test* bin. For each spot in *test*, you want each point to be equally likely to get that spot.</font>
- <font size=+2>How to do this? Say you have 100 data points and want 10 of them in *test*.</font>
- <font size=+2 style="color:#181818;">Could go through each point and flip a (90/10)-coin to decide if it goes in the test data. (If you do it this way, you likely won't have exactly 10 in the *test* set.)</font>
- <font size=+2 style="color:#181818;">Could consider each of the 10 spots. For each, randomly choose a number 1-100 and put that point in that spot. Issues?</font>

----

<h3>How to split the training and test data?</h3>

- <font size=+2>Each data point in your set will go into either the *training* bin or the *test* bin. For each spot in *test*, you want each point to be equally likely to get that spot.</font>
- <font size=+2>How to do this? Say you have 100 data points and want 10 of them in *test*.</font>
- <font size=+2>Could go through each point and flip a (90/10)-coin to decide if it goes in the test data. (If you do it this way, you likely won't have exactly 10 in the *test* set.)</font>
- <font size=+2 style="color:#181818;">Could consider each of the 10 spots. For each, randomly choose a number 1-100 and put that point in that spot. Issues?</font>

----

<h3>How to split the training and test data?</h3>

- <font size=+2>Each data point in your set will go into either the *training* bin or the *test* bin. For each spot in *test*, you want each point to be equally likely to get that spot.</font>
- <font size=+2>How to do this? Say you have 100 data points and want 10 of them in *test*.</font>
- <font size=+2>Could go through each point and flip a (90/10)-coin to decide if it goes in the test data. (If you do it this way, you likely won't have exactly 10 in the *test* set.)</font>
- <font size=+2>Could consider each of the 10 spots. For each, randomly choose a number 1-100 and put that point in that spot. Issues?</font>

---

<h3>How to split the training and test data?</h3>

- <font size=+2>Have 100 data points, want 10 of them in *test*.</font>
- <font size=+2>Pick a *permutation* of the numbers 1-100, at random. The images of the first 10 numbers go into the *test* set.</font>
- <font size=+2 style="color:#181818;">Knuth, Fisher, Yates: For each $i = 1,\ldots,99$, pick a random number $j$ in the range $i,\ldots,100$, and swap the point *currently* in the $i^{th}$ spot with the point in the $j^{th}$ spot. Each permutation is equally likely.</font>

<br /> <br /> <br /> <br />

----

<h3>How to split the training and test data?</h3>

- <font size=+2>Have 100 data points, want 10 of them in *test*.</font>
- <font size=+2>Pick a *permutation* of the numbers 1-100, at random. The images of the first 10 numbers go into the *test* set.</font>
- <font size=+2>Knuth, Fisher, Yates: For each $i = 1,\ldots,99$, pick a random number $j$ in the range $i,\ldots,100$, and swap the point *currently* in the $i^{th}$ spot with the point in the $j^{th}$ spot. Each permutation is equally likely.</font>

```python=
from sklearn.model_selection import train_test_split

# data, labels: your feature array and label array (same length).
# test_size is the fraction of points put in the test set
# (0.1 would match the 10-of-100 example above).
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.4)
```

---

<h3> Overfitting models, considering model variance. </h3>

- <font size=+2>Important: the choice of training data is a proxy for "just looking at available data".</font>
- <font size=+2 style="color:#181818;">What if your model has *too much* ability to respond to variations in the training data?</font>

<br /> <br /> <br /> <br />

----

<h3> Overfitting models, considering model variance. </h3>

- <font size=+2>Important: the choice of training data is a proxy for "just looking at available data".</font>
- <font size=+2>What if your model has *too much* ability to respond to variations in the training data?</font>

<br /> <br /> <br /> <br />
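
---

<h3>Sketch: model variance across training samples</h3>

<font size=+2>To make the question about responding to variations in the training data concrete, here is a rough sketch. Everything in it is an assumption chosen for illustration: the noisy sine data, polynomial models of degree 1 and 9, the query point `x_query`, and the helper name `prediction_spread`. It refits each model on many random training subsets and measures how much the prediction at one fixed input moves around.</font>

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "available data": 100 noisy samples of a sine curve.
x = rng.uniform(-1.0, 1.0, size=100)
y = np.sin(3.0 * x) + rng.normal(scale=0.3, size=100)

x_query = 0.9  # fixed input at which predictions are compared


def prediction_spread(degree, n_trials=200, n_train=20):
    """Standard deviation of the prediction at x_query over random training subsets."""
    preds = []
    for _ in range(n_trials):
        # A random permutation of the indices; its first n_train entries
        # play the role of the training set for this trial.
        idx = rng.permutation(len(x))[:n_train]
        coeffs = np.polyfit(x[idx], y[idx], degree)
        preds.append(np.polyval(coeffs, x_query))
    return float(np.std(preds))


spread_low = prediction_spread(degree=1)   # rigid model
spread_high = prediction_spread(degree=9)  # flexible model
print("degree 1 spread:", spread_low)
print("degree 9 spread:", spread_high)
```

<font size=+2>In this setup the degree-9 spread should come out noticeably larger than the degree-1 spread: the flexible model changes its answer more when the training sample changes.</font>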
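
---

<h3>Sketch: a Knuth-Fisher-Yates split</h3>

<font size=+2>The permutation approach from the splitting slides can be sketched in a few lines. This is a minimal sketch with 0-based indices; the names `fisher_yates_split`, `n_test`, and `seed` are hypothetical, and it is not a description of what `train_test_split` does internally.</font>

```python
import random


def fisher_yates_split(points, n_test, seed=None):
    """Shuffle a copy of points uniformly, then use the first n_test as the test set."""
    rng = random.Random(seed)
    shuffled = list(points)
    n = len(shuffled)
    for i in range(n - 1):
        # Pick j uniformly from i..n-1 (randint includes both endpoints),
        # then swap the point currently in spot i with the point in spot j.
        j = rng.randint(i, n - 1)
        shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
    return shuffled[n_test:], shuffled[:n_test]


# 100 data points labeled 1..100, with 10 of them going to the test set.
train, test = fisher_yates_split(range(1, 101), n_test=10, seed=0)
print(len(train), len(test))  # 90 10
```

<font size=+2>Unlike the coin-flip approach, the test set always has exactly 10 points, and unlike picking a number 1-100 for each spot, no point can land in the test set twice.</font>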