# Linear regression, and the Python packages ecosystem
A key strength of Python is its excellent ecosystem of packages. Beyond `numpy` and `matplotlib`, there are thousands of ready-written packages to make your tasks easier. A question to always ask is "has someone already written this?".
A powerful machine-learning package is `scikit-learn`. See https://scikit-learn.org/ for a full introduction to its capabilities.
The key point is that, unless we're writing novel machine learning methods, we don't need to write our own - `scikit-learn` will probably already implement it.
As a simple example, take the task of training a least-squares linear regression model. This is just about the simplest machine learning method out there.
It takes as input a set of x and y coordinates, and outputs the line that minimises the sum of squared vertical differences (the residuals) between the training points and the line.
There are two key steps: training on one dataset, and then testing its predictive power against another dataset.
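Before reaching for scikit-learn, it is worth seeing that the least-squares fit itself is a small calculation. As a minimal sketch (the toy data here, generated as `y = 2x + 1` plus noise, is purely illustrative), `numpy`'s `polyfit` with `deg=1` solves the least-squares problem directly:

```python
import numpy as np

# Toy data: y = 2x + 1 with a little Gaussian noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(scale=0.5, size=x.shape)

# polyfit with degree 1 fits a straight line by least squares
slope, intercept = np.polyfit(x, y, deg=1)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

The recovered slope and intercept land close to the true values of 2 and 1, which is the entire job of least squares. scikit-learn wraps the same idea in a richer train/predict/score interface, as the full example below shows.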
The full linear regression example is below:
```python
# Code source: Jaques Grobler
# License: BSD 3 clause
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
# Load the diabetes dataset
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
# Use only one feature
diabetes_X = diabetes_X[:, np.newaxis, 2]
# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
# Split the targets into training/testing sets
diabetes_y_train = diabetes_y[:-20]
diabetes_y_test = diabetes_y[-20:]
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)
# The coefficients
print("Coefficients: \n", regr.coef_)
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(diabetes_y_test, diabetes_y_pred))
# The coefficient of determination: 1 is perfect prediction
print("Coefficient of determination: %.2f" % r2_score(diabetes_y_test, diabetes_y_pred))
# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test, color="black")
plt.plot(diabetes_X_test, diabetes_y_pred, color="blue", linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
```
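The example above splits the data by hand with array slicing, which keeps the code transparent but is fragile if the data happen to be ordered. scikit-learn provides `train_test_split` in `sklearn.model_selection` for a randomised split; a sketch of the same analysis using it (the 80/20 split ratio and `random_state` value are arbitrary choices):

```python
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split

# All ten diabetes features this time, not just one
X, y = datasets.load_diabetes(return_X_y=True)

# Randomised 80/20 split; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)
print(f"Test R^2: {regr.score(X_test, y_test):.2f}")
```

Note that `score` on a regressor returns the coefficient of determination directly, so the separate call to `r2_score` is optional.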

This nicely demonstrates how linear regression finds the straight line that minimises the residual sum of squares between the observed responses in the dataset and the responses predicted by the linear approximation.
Note how the `scikit-learn` package provides high-level functions that match the logical steps that the analysis requires.
This frees us up from the pain and effort of thinking about the internals of the machinery (which, if we tried to write it
ourselves, would probably be error-strewn and far less performant), and we can instead focus on the more interesting questions.
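One concrete payoff of this design is that scikit-learn estimators share the same `fit`/`predict`/`score` interface, so swapping in a different method is often a one-line change. As an illustrative sketch, here `LinearRegression` is compared against `Ridge` (a regularised variant; the `alpha=0.5` penalty strength is an arbitrary choice for demonstration):

```python
from sklearn import datasets, linear_model

X, y = datasets.load_diabetes(return_X_y=True)
X_train, X_test = X[:-20], X[-20:]
y_train, y_test = y[:-20], y[-20:]

# Both estimators expose the identical fit/score interface
scores = []
for model in (linear_model.LinearRegression(), linear_model.Ridge(alpha=0.5)):
    model.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))
    print(f"{type(model).__name__}: R^2 = {scores[-1]:.2f}")
```

Because every estimator follows the same contract, trying a handful of models on the same data is cheap, and we can spend our time on the interesting question of which model suits the problem.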
# Exercise 1
Spend a couple of minutes browsing https://scikit-learn.org/ to see the vast array of functionality available.
Often, the hardest thing is knowing what's out there!
# Exercise 2
Think of a task that you would like to use Python for. Spend 2 minutes searching online for any Python packages that seem relevant.
Post in the Teams chat using this format:
```
Task: I want to manipulate image data
Possible packages: Pillow, matplotlib
```
and we'll talk through a couple of examples. If you can't find an appropriate package, say so in the chat, and we can help find one.