# Linear regression, and the Python packages ecosystem
A key strength of Python is its excellent ecosystem of packages. Beyond `numpy` and `matplotlib`, there are thousands of ready-written packages to make your tasks easier. A question to always ask is "has someone already written this?".
A powerful machine-learning package is `scikit-learn`. See https://scikit-learn.org/ for a full introduction to its capabilities.
The key point is that, unless we're writing novel machine learning methods, we don't need to write our own - `scikit-learn` will probably already implement it.
As a simple example, take the task of training a least-squares linear regression model. This is just about the simplest machine learning method out there.
It takes as input a set of x and y coordinates, and outputs the line that minimises the sum of squared vertical differences (the residuals) between the training points and the line.
There are two key steps: training on one dataset, and then testing its predictive power against another dataset.
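Before reaching for scikit-learn, it is worth seeing that the least-squares fit itself is a small calculation. As a minimal sketch (the toy data here, generated as `y = 2x + 1` plus noise, is purely illustrative), `numpy`'s `polyfit` with `deg=1` solves the least-squares problem directly:

```python
import numpy as np

# Toy data: y = 2x + 1 with a little Gaussian noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(scale=0.5, size=x.shape)

# polyfit with degree 1 fits a straight line by least squares
slope, intercept = np.polyfit(x, y, deg=1)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

The recovered slope and intercept land close to the true values of 2 and 1, which is the entire job of least squares. scikit-learn wraps the same idea in a richer train/predict/score interface, as the full example below shows.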
The full linear regression example is below:
```python
# Code source: Jaques Grobler
# License: BSD 3 clause
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
# Load the diabetes dataset
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
# Use only one feature
diabetes_X = diabetes_X[:, np.newaxis, 2]
# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
# Split the targets into training/testing sets
diabetes_y_train = diabetes_y[:-20]
diabetes_y_test = diabetes_y[-20:]
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)
# The coefficients
print("Coefficients: \n", regr.coef_)
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(diabetes_y_test, diabetes_y_pred))
# The coefficient of determination: 1 is perfect prediction
print("Coefficient of determination: %.2f" % r2_score(diabetes_y_test, diabetes_y_pred))
# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test, color="black")
plt.plot(diabetes_X_test, diabetes_y_pred, color="blue", linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
```
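The example above splits the data by hand with array slicing, which keeps the code transparent but is fragile if the data happen to be ordered. scikit-learn provides `train_test_split` in `sklearn.model_selection` for a randomised split; a sketch of the same analysis using it (the 80/20 split ratio and `random_state` value are arbitrary choices):

```python
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split

# All ten diabetes features this time, not just one
X, y = datasets.load_diabetes(return_X_y=True)

# Randomised 80/20 split; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)
print(f"Test R^2: {regr.score(X_test, y_test):.2f}")
```

Note that `score` on a regressor returns the coefficient of determination directly, so the separate call to `r2_score` is optional.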

This nicely demonstrates how linear regression finds the straight line that minimises the residual sum of squares between the observed responses in the dataset and the responses predicted by the linear approximation.
Note how the `scikit-learn` package provides high-level functions that match the logical steps that the analysis requires.
This frees us up from the pain and effort of thinking about the internals of the machinery (which, if we tried to write it
ourselves, would probably be error-strewn and far less performant), and we can instead focus on the more interesting questions.
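One concrete payoff of this design is that scikit-learn estimators share the same `fit`/`predict`/`score` interface, so swapping in a different method is often a one-line change. As an illustrative sketch, here `LinearRegression` is compared against `Ridge` (a regularised variant; the `alpha=0.5` penalty strength is an arbitrary choice for demonstration):

```python
from sklearn import datasets, linear_model

X, y = datasets.load_diabetes(return_X_y=True)
X_train, X_test = X[:-20], X[-20:]
y_train, y_test = y[:-20], y[-20:]

# Both estimators expose the identical fit/score interface
scores = []
for model in (linear_model.LinearRegression(), linear_model.Ridge(alpha=0.5)):
    model.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))
    print(f"{type(model).__name__}: R^2 = {scores[-1]:.2f}")
```

Because every estimator follows the same contract, trying a handful of models on the same data is cheap, and we can spend our time on the interesting question of which model suits the problem.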
# Exercise 1
Spend a couple of minutes browsing https://scikit-learn.org/ to see the vast array of functionality available.
Often, the hardest thing is knowing what's out there!
# Exercise 2
Think of a task that you would like to use Python for. Spend 2 minutes searching online for any Python packages that seem relevant.
Post in the Teams chat using this format:
```
Task: I want to manipulate image data
Possible packages: Pillow, matplotlib
```
and we'll talk through a couple of examples. If you can't find an appropriate package, say so in the chat, and we can help find one.