Note: See Sect. 3 of cl_en.pdf in the Computational Intelligence course for an introduction to Linear Regression.
Linear regression is a statistical method used in supervised learning where the goal is to predict the value of a dependent variable (the target) from one or more independent variables (the features).
The general form of a linear regression model is given by:
$$y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_D x_D + \varepsilon$$
where:
$y$ is the dependent (target) variable,
$x_1, \dots, x_D$ are the independent variables (features),
$w_0, \dots, w_D$ are the model weights ($w_0$ is the intercept or bias),
$\varepsilon$ is an error term capturing noise not explained by the model.
A design matrix, often used in statistics and machine learning, is a matrix of data in which rows represent individual observations and columns represent the features or predictors associated with those observations. This matrix is crucial in various modeling techniques, particularly in regression analysis, where it helps to organize the data for computational efficiency and clarity in the formulation of models. Here's a deeper look into its components and uses:
In regression analysis, the design matrix $\Phi$ for $N$ observations and $M$ basis functions (or features) is the $N \times M$ matrix with entries
$$\Phi_{nj} = \phi_j(x_n)$$
Where:
$x_n$ is the $n$-th observation,
$\phi_j$ is the $j$-th basis function (for plain linear regression, simply the $j$-th input feature),
each row $\phi(x_n)^T$ collects all feature or basis-function values of one observation.
Polynomial Basis: In a polynomial basis design matrix, each column represents a power of the input feature. For a single input feature $x$, the columns are $1, x, x^2, \dots, x^{M-1}$.
Sigmoidal Basis: A sigmoidal basis design matrix uses sigmoid functions, typically logistic functions, to transform the input feature. Each column in the design matrix corresponds to one sigmoid with its own center $\mu_j$ and a width $s$.
Polynomial Basis: $\phi_j(x) = x^j, \quad j = 0, 1, \dots, M-1$
Sigmoidal Basis: $\phi_j(x) = \sigma\!\left(\frac{x - \mu_j}{s}\right), \quad \text{with } \sigma(a) = \frac{1}{1 + e^{-a}}$
Polynomial Basis: While flexible, polynomials can become unwieldy and prone to overfitting as their degree increases, especially if the data does not inherently follow a polynomial trend.
Sigmoidal Basis: Offers fine-grained control over where the model should be sensitive to changes in the input features, allowing for more nuanced modeling of complex behaviors. However, determining the right placement and scale of sigmoid functions can be challenging and may require domain knowledge or experimentation.
Polynomial Basis: Easier to compute and interpret in terms of traditional curve fitting. The coefficients directly relate to the influence of powers of the input feature.
Sigmoidal Basis: Computation can be more intensive due to the nature of the exponential function in the sigmoid. The interpretation is more about the importance and impact of specific ranges of input values rather than overall trends.
In summary, the choice between a polynomial and a sigmoidal basis in constructing a design matrix largely depends on the nature of the data and the specific characteristics of the phenomenon being modeled. Each offers unique advantages and challenges that make them suited to different types of regression problems.
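As a concrete illustration of the two constructions, here is a minimal sketch (the inputs, centers, and width are made up for demonstration and are not taken from the course material) that builds a polynomial and a sigmoidal design matrix for the same 1-D data:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.array([0.0, 1.0, 2.0, 3.0])   # four 1-D observations

# Polynomial basis: columns 1, x, x^2, x^3
Phi_poly = np.vander(x, N=4, increasing=True)

# Sigmoidal basis: one column per sigmoid, each with its own center and a shared width
centers = np.array([0.5, 1.5, 2.5])
width = 0.5
Phi_sig = sigmoid((x[:, None] - centers[None, :]) / width)

print(Phi_poly.shape, Phi_sig.shape)   # (4, 4) and (4, 3)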
The optimal weight vector computed with regularization (often referred to in machine learning and statistics as ridge regression when using L2 regularization) differs from the one computed without regularization in how it manages the complexity and potential overfitting of the model. Let’s explore these differences and the role of the regularization parameter:
The optimal weight vector without regularization is computed using the simple least squares approach. This method minimizes the sum of the squared differences between the observed targets and the targets predicted by the linear model. Mathematically, it is computed as:
$$\mathbf{w} = \left(\Phi^T \Phi\right)^{-1} \Phi^T \mathbf{t}$$
where $\Phi$ is the design matrix and $\mathbf{t}$ is the vector of observed targets.
When regularization is introduced, the equation is modified to include a regularization term controlled by the parameter $\lambda$:
$$\mathbf{w} = \left(\Phi^T \Phi + \lambda I\right)^{-1} \Phi^T \mathbf{t}$$
The larger $\lambda$ is, the more strongly large weights are penalized.
Bias-Variance Trade-off: Regularization helps manage the bias-variance tradeoff. By introducing bias into the model (since it does not fit the training data perfectly), regularization helps to reduce variance and thus improves the model’s performance on unseen data.
Prevention of Overfitting: By penalizing large weights, regularization effectively limits the model's ability to fit the noise in the training data, focusing instead on capturing the underlying patterns.
Impact on Coefficients: Regularized models typically have smaller absolute values of coefficients compared to non-regularized models. This often results in a smoother model function that is less likely to capture high-frequency fluctuations (noise) in the data.
Regularization is a powerful concept in machine learning and statistics, helping to create more generalizable models, especially when dealing with high-dimensional data or situations where the number of features might approach or exceed the number of observations.
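To make the effect of the regularization parameter concrete, here is a small sketch (the synthetic data, the polynomial degree, and the value of $\lambda$ are illustrative assumptions) that computes both closed-form solutions and compares the size of the resulting weights:

import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data with a high-degree polynomial design matrix (prone to overfitting)
x = np.linspace(0, 1, 15)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)
Phi = np.vander(x, N=10, increasing=True)   # columns 1, x, x^2, ..., x^9

def fit(Phi, t, lam=0.0):
    # Closed-form least squares; lam > 0 adds the L2 (ridge) penalty
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ t)

w_ols = fit(Phi, t, lam=0.0)      # plain least squares
w_ridge = fit(Phi, t, lam=1e-3)   # regularized solution

print("||w|| without regularization:", np.linalg.norm(w_ols))
print("||w|| with regularization:   ", np.linalg.norm(w_ridge))

The regularized weight vector typically has a much smaller norm, reflecting the shrinkage of the coefficients described above.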
The primary objective in linear regression is to find the weight vector $\mathbf{w}$ that minimizes the sum of squared errors between the predictions and the observed targets:
$$E(\mathbf{w}) = \sum_{n=1}^{N} \left( t_n - \mathbf{w}^T \phi(x_n) \right)^2$$
where:
$t_n$ is the observed target for the $n$-th observation,
$\phi(x_n)$ is the vector of basis-function values for the $n$-th input,
$\mathbf{w}$ is the weight vector being optimized.
In many contexts within regression analysis, especially in the formulation of the error function, a factor of $\frac{1}{2}$ is included purely for mathematical convenience: it cancels the factor of 2 produced when the squared terms are differentiated.
If we incorporate the $\frac{1}{2}$ factor, the error function becomes
$$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left( t_n - \mathbf{w}^T \phi(x_n) \right)^2$$
Similarly, the regularized sum of squared errors with the $\frac{1}{2}$ factor is
$$\tilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left( t_n - \mathbf{w}^T \phi(x_n) \right)^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2$$
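For completeness, here is the standard gradient computation (not spelled out above) showing why the $\frac{1}{2}$ is convenient: differentiating the squared terms produces a factor of 2 that cancels it, and setting the gradient to zero yields the normal equations.
$$\nabla_{\mathbf{w}} E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} 2 \left(t_n - \mathbf{w}^T \phi(x_n)\right)\left(-\phi(x_n)\right) = -\sum_{n=1}^{N} \left(t_n - \mathbf{w}^T \phi(x_n)\right) \phi(x_n)$$
$$\nabla_{\mathbf{w}} E(\mathbf{w}) = 0 \;\Longrightarrow\; \Phi^T \Phi\, \mathbf{w} = \Phi^T \mathbf{t} \;\Longrightarrow\; \mathbf{w} = \left(\Phi^T \Phi\right)^{-1} \Phi^T \mathbf{t}$$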
From a probabilistic perspective, assuming the noise $\varepsilon$ is Gaussian with zero mean and variance $\sigma^2$, the target is modeled as $t = \mathbf{w}^T \phi(x) + \varepsilon$, so that
$$p(t \mid x, \mathbf{w}, \sigma^2) = \mathcal{N}\!\left(t \mid \mathbf{w}^T \phi(x), \sigma^2\right)$$
Here, maximizing the likelihood of the observed targets with respect to $\mathbf{w}$ is equivalent to minimizing the sum-of-squares error introduced above.
The weights that maximize the likelihood (equivalently, minimize the squared error) are given in closed form by
$$\mathbf{w}_{\mathrm{ML}} = \left(\Phi^T \Phi\right)^{-1} \Phi^T \mathbf{t}$$
This solution involves computing the pseudo-inverse of the design matrix, $\Phi^{\dagger} = \left(\Phi^T \Phi\right)^{-1} \Phi^T$, which generalizes the notion of a matrix inverse to non-square matrices.
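As a small numerical check (the data and variable names here are illustrative, not part of the course notes), the normal equations, the Moore-Penrose pseudo-inverse, and NumPy's least-squares solver all return the same weights:

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=20)
t = 1.5 * x - 0.3 + rng.normal(scale=0.1, size=x.shape)

# Design matrix with a bias column and the raw feature
Phi = np.column_stack([np.ones_like(x), x])

w_normal = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)    # normal equations
w_pinv = np.linalg.pinv(Phi) @ t                      # Moore-Penrose pseudo-inverse
w_lstsq, *_ = np.linalg.lstsq(Phi, t, rcond=None)     # least-squares solver

print(w_normal, w_pinv, w_lstsq)   # all three agree up to numerical precision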
Linear regression provides a straightforward method for modeling the linear relationships between variables. It is widely used due to its simplicity, interpretability, and the fact that it can be applied to a variety of real-world problems. However, it assumes that the relationship between the dependent and independent variables is linear, which might not always hold, potentially limiting its effectiveness in complex scenarios where relationships might be non-linear.
The sigmoid function is a mathematical function that produces a characteristic "S"-shaped curve, also known as a sigmoid curve. This function is widely used in various fields, especially in statistics, machine learning, and artificial intelligence, mainly because of its ability to map a wide range of input values to a small and specific range, typically between 0 and 1. This makes it very useful for tasks like transforming arbitrary real-valued numbers into probabilities, which is crucial in logistic regression and neural networks.
The basic form of the sigmoid function, often referred to as the logistic function, is defined as:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
Here’s what happens in this function:
As $x$ becomes large and positive, $e^{-x}$ approaches 0 and $\sigma(x)$ approaches 1; as $x$ becomes large and negative, $e^{-x}$ grows without bound and $\sigma(x)$ approaches 0; at $x = 0$, $\sigma(x) = 0.5$.
In the context of using sigmoid functions as basis functions in regression or neural networks, the sigmoid function can be adjusted to better fit specific data patterns through two parameters: center $c$ and width $s$:
$$\sigma\!\left(\frac{x - c}{s}\right)$$
Where:
$c$ shifts the curve along the x-axis, placing the transition point at $x = c$,
$s$ scales the steepness of the transition: smaller $s$ gives a sharper transition, larger $s$ a more gradual one.
import numpy as np
import matplotlib.pyplot as plt
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
# Range of x values from -10 to 10 for a clear view of the sigmoid behavior
x_values = np.linspace(-10, 10, 400)
# Different centers and widths for demonstration
centers = [0, 2, -2] # Shifts the sigmoid along the x-axis
widths = [1, 0.5, 2] # Changes the steepness of the curve
plt.figure(figsize=(10, 6))
for c, s in zip(centers, widths):
    y_values = sigmoid((x_values - c) / s)
    plt.plot(x_values, y_values, label=f'Center = {c}, Width = {s}')
plt.title('Effect of Center and Width on Sigmoid Function')
plt.xlabel('X')
plt.ylabel('Sigmoid(X)')
plt.legend()
plt.grid(True)
plt.show()
The choice of width ($s$) determines how steep each sigmoid's transition is; when several sigmoids are used as basis functions, the width is usually tied to the spacing between their centers.
Note: nbf refers to nr_basis_functions
This formula, width $= \frac{\overline{\Delta c}}{2}$ with $\overline{\Delta c}$ the average distance between adjacent centers, effectively sets the width of each sigmoid to half the spacing of the centers. The rationale behind dividing by 2 is that each sigmoid has already completed most of its transition by the time the next center is reached, so neighbouring basis functions stay relatively localized with limited overlap.
This version, width $= \overline{\Delta c}$, sets the width equal to the average distance between the centers of adjacent sigmoid functions. Here, each sigmoid's transition region extends roughly to the location of the next center, so neighbouring basis functions overlap around their steepest parts and the resulting fit is smoother but less localized.
The choice between these two (or other variations) depends on the specific requirements of the modeling task: narrower widths suit data with sharp, localized changes, while wider widths suit smoother underlying relationships and noisier data.
Ultimately, the width should be chosen based on experimental validation or domain-specific knowledge that informs how features are expected to influence the output. It's common to adjust these parameters during the model development process based on performance metrics on validation datasets.
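To see how the two conventions differ in practice, the following sketch (variable names and values are illustrative) computes both widths from equally spaced centers and evaluates a sigmoid at the location of its neighbouring center:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x_min, x_max, nbf = 0.0, 10.0, 5
centers = np.linspace(x_min, x_max, nbf)
spacing = np.mean(np.diff(centers))   # average distance between adjacent centers

width_half = spacing / 2              # "half the spacing" convention
width_full = spacing                  # "full spacing" convention

# Value of a sigmoid centered at c when evaluated at the next center c + spacing
print("half-spacing width:", sigmoid(spacing / width_half))   # ~0.88, transition mostly done
print("full-spacing width:", sigmoid(spacing / width_full))   # ~0.73, still transitioning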
Imagine you have data on how likely individuals are to respond to a particular stimulus, and you believe that the probability of a response changes sharply at a specific value of some variable (like dosage of a drug). Using a sigmoid function as a basis function allows you to model this behavior, capturing the probability transitioning from very low to very high near a particular dosage level.
This flexibility to fit complex, non-linear patterns in data makes sigmoid functions invaluable in fields that require detailed probability estimates and classifications based on real-world, continuous input data.
When constructing a design matrix using sigmoidal basis functions for linear regression or other machine learning models, the choice of the width function for each sigmoidal basis function is crucial for capturing the variability and structure of the data effectively. The width, often denoted as $s$, controls how quickly each basis function transitions from 0 to 1 and therefore how localized its influence is.
Below are a few approaches to defining the width of sigmoidal basis functions, which can be tailored to the specific characteristics of the dataset or the desired properties of the model:
A simple and straightforward approach is to use a fixed width for all sigmoid functions across the input space. This method is easy to implement and interpret but may not be flexible enough to model data with varying scales or complexities effectively.
Adjusting the width in proportion to the overall range of the input data ensures that the sigmoid functions are scaled appropriately relative to the variability in the data.
Another common approach is to set the width relative to the distance between the centers of adjacent sigmoid functions. This method helps to control the overlap between the functions, ensuring smooth transitions and sufficient coverage across the input domain.
In datasets with outliers or heavy tails, using percentile ranges (such as interquartile range) to determine the width can help in focusing the sigmoid functions on the most dense parts of the data distribution, thereby avoiding undue influence from extreme values.
Adaptive width strategies involve dynamically adjusting the width of each sigmoid function based on local data density or other criteria. This can be complex to implement but allows the model to adapt more finely to the underlying data structure.
Here is a Python function to set up a sigmoidal design matrix with widths proportional to the distance between centers:
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def design_matrix_sigmoid(data, nr_basis_functions, x_min, x_max, k=2):
    # Place the sigmoid centers evenly across the input range
    centers = np.linspace(x_min, x_max, nr_basis_functions)
    # Width proportional to the spacing between adjacent centers (k=2 gives half the spacing)
    width = np.min(np.diff(centers)) / k
    # One row per observation, one column per basis function
    Phi_sig = np.zeros((data.shape[0], nr_basis_functions))
    for i in range(nr_basis_functions):
        Phi_sig[:, i] = sigmoid((data[:, 0] - centers[i]) / width)
    return Phi_sig
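A quick usage sketch (the data and target below are invented for illustration) builds the sigmoidal design matrix with this function and fits the weights using the regularized closed-form solution discussed earlier:

import numpy as np

rng = np.random.default_rng(2)
X = np.sort(rng.uniform(0, 10, size=(50, 1)), axis=0)       # 50 one-dimensional inputs
t = np.tanh(X[:, 0] - 5) + rng.normal(scale=0.1, size=50)   # noisy S-shaped target

Phi = design_matrix_sigmoid(X, nr_basis_functions=8, x_min=0, x_max=10)

lam = 1e-3   # small L2 regularization for numerical stability
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ t)

predictions = Phi @ w
print("training RMSE:", np.sqrt(np.mean((predictions - t) ** 2)))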