Linear Regression and Regularized Linear Regression
Linear regression is generally the first step in anyone’s Data Science journey. When you hear the words Linear and Regression, something like this pops up in your mind:
Y’ = W1X1 + W2X2 + … + WnXn
X1, X2, …, Xn are the independent variables or features. W1, W2, …, Wn are the weights (learned by the model from the data). Y’ is the model prediction. For a set of, say, 1000 points, we have a table with 1000 rows, n columns for X and 1 column for Y. Our model learns the weights W from these 1000 points so that it can predict the dependent variable Y for an unseen point (a point for which the Xs are available but Y is not).
For these seen points (1000 in this case), the actual value of the dependent variable (Y) and the model’s prediction of the dependent variable (denoted by Y’) are related as:
Y = Y’ + e
Epsilon (denoted by e henceforth) is the residual error. It captures the difference between the model’s prediction and the actual value of Y. One thing to remember about e is that it is assumed to follow a normal distribution with mean 0.
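The weights are learned using OLS, or Ordinary Least Squares, fitting. This means that the cumulative squared error, defined as
SSE = e1² + e2² + … + eN², where ei is the residual of the i-th point and N is the number of points,
is minimized. Why do we square the errors and then add them? Two reasons.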
- If the residual error is -1 for one point and 1 for another, merely adding them gives 0 error, which would suggest that the line fits the points perfectly. This is not true.
- Squaring gives more importance to larger errors and less to smaller ones. Intuitively, the model weights are adjusted to reduce larger errors before smaller ones.
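To make the first reason concrete, here is a minimal sketch that fits a line to five made-up points using the closed-form OLS solution for a single feature with an intercept. The data values and the helper name fit_line are assumptions chosen for illustration, not something from the original experiment.

```python
# Toy data, made up for illustration: y is roughly 2*x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]


def fit_line(xs, ys):
    """Closed-form OLS for one feature: returns (weight, intercept)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    w = sxy / sxx
    return w, y_mean - w * x_mean


w, b = fit_line(xs, ys)
residuals = [y - (w * x + b) for x, y in zip(xs, ys)]

# Positive and negative residuals cancel, so the plain sum hides the error.
sum_of_residuals = sum(residuals)
# Squaring first keeps every residual's contribution positive.
sum_of_squares = sum(e ** 2 for e in residuals)

print("sum of residuals:        ", sum_of_residuals)
print("sum of squared residuals:", sum_of_squares)
```

The plain sum comes out at essentially zero even though no point lies exactly on the line, while the sum of squares stays clearly positive.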
Quite often, in the excitement of learning new and advanced models, we do not fully explore this model. In this blog, we’ll look at how to analyze and diagnose a linear regression model, as intuitively as possible. Let’s talk about the model. There are three main assumptions linear regression makes:
- The independent variables have a linear relationship with the dependent variable.
- The variance of the dependent variable is uniform across all combinations of Xs
- The error term e associated with Y and Y’ is independent and identically distributed.
Do not worry if the above lines are not entirely clear yet. I will simplify them below.
Linear Relationship
Seems quite intuitive. If the independent variables do not have a linear relationship with the dependent variable, there’s no point modeling them using LINEAR regression. So what should one do if that is not the case? Consider a dataset with one independent variable X and dependent variable Y. In this case, Y varies as the square of X (some random noise is added to Y to make it look more realistic).
If we fit and plot a linear regression line, we can see that it isn’t a good fit. The MSE (mean squared error) is 0.0102.
So what we do here is transform X such that Y and this transformed X follow a linear relationship. Take a look at the picture below. Y and X might not have a linear relation. However, Y and X² do have a linear relation.
Next, we build the model on the transformed X, generate the predictions, and map them back to the original X for plotting. Take a look at the code and the plots below to get an idea. The MSE here is 0.0053 (almost half the previous one).
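The original code for this experiment is not reproduced here, so the following is only a minimal sketch of the same idea. The data range, noise level, and seed are assumptions, so the printed errors will not match the 0.0102 and 0.0053 quoted above.

```python
import random

random.seed(0)


def fit_line(xs, ys):
    """Closed-form OLS for one feature: returns (weight, intercept)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    w = sxy / sxx
    return w, y_mean - w * x_mean


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Assumed data: Y varies as the square of X, plus a little Gaussian noise.
xs = [i / 99 for i in range(100)]
ys = [x ** 2 + random.gauss(0, 0.05) for x in xs]

# Fit 1: regress Y on X directly.
w1, b1 = fit_line(xs, ys)
mse_linear = mse(ys, [w1 * x + b1 for x in xs])

# Fit 2: regress Y on the transformed feature X squared.
xs_squared = [x ** 2 for x in xs]
w2, b2 = fit_line(xs_squared, ys)
# Predictions are indexed by the original X, so they can be plotted against it.
mse_transformed = mse(ys, [w2 * x ** 2 + b2 for x in xs])

print("MSE, Y against X:  ", mse_linear)
print("MSE, Y against X^2:", mse_transformed)
```

The model in the second fit is still linear in its weights. Only the feature fed to it has changed.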
Isn’t it evident which one fits better? I hope it is a bit clearer why linear relationships are needed. Let’s move on to the next assumption.
The variance of the dependent variable is uniform across all combinations of Xs
Formally speaking, we need something called homoscedasticity. In simple terms, it means that the residuals must have constant variance. Let’s visualize this. Later I’ll explain why it is essential.
If you look carefully, the variance among the Y values increases from left to right, like a trumpet. This means the Y values for lower X values do not vary much around the regression line, unlike the ones to the right. We call this heteroscedasticity, and it is something you want to avoid. Why? Well, if there is a pattern among the residuals, like this one (for the above plot)
It generally means that the model is too simple for the data. The model is unable to capture all the patterns present in the data. Another reason to avoid heteroscedasticity is to save us from biased results in significance tests. We’ll look at these tests in detail. When we achieve homoscedasticity, residuals look something like this.
As you can see, the residuals are entirely random. One can hardly see any pattern. Now comes the final assumption.
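Before moving on, the trumpet shape can also be checked numerically rather than by eye. The sketch below is an informal check under assumed data (noise whose standard deviation grows with X), not a formal significance test. It fits a line and compares the residual variance in the left and right halves of the X range.

```python
import random
import statistics

random.seed(1)


def fit_line(xs, ys):
    """Closed-form OLS for one feature: returns (weight, intercept)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    w = sxy / sxx
    return w, y_mean - w * x_mean


# Assumed data: the noise standard deviation is proportional to x.
xs = [10 * i / 199 for i in range(200)]
ys = [2 * x + random.gauss(0, 0.3 * x) for x in xs]

w, b = fit_line(xs, ys)
residuals = [y - (w * x + b) for x, y in zip(xs, ys)]

# xs is sorted, so the first half of the residuals belongs to the low X values.
half = len(xs) // 2
var_low = statistics.pvariance(residuals[:half])
var_high = statistics.pvariance(residuals[half:])

print("residual variance, low X: ", var_low)
print("residual variance, high X:", var_high)
```

With homoscedastic data the two variances would be roughly equal. Here the right half is several times larger.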
The error term e associated with Y and Y’ is independent and identically distributed
Sounds similar to the previous one? It kind of does. However, it is a little different. Previously, our residuals had growing variance, but they were still independent. One residual did not have anything to do with another. Here, we analyze what happens if one residual depends on another. Consider the following plot, whose data is generated by Y = X + noise (a random number). Now, this noise accumulates over successive values of X. That is, the noise for a given X is a new random number plus the noise of the previous X. We deliberately introduce this additive noise for the sake of our experiment.
A linear fit seems a good choice. Let’s check the residual errors.
They do not look entirely random. Is there some metric we can compute to validate our claim? It turns out we can calculate something called autocorrelation. What is autocorrelation? We know that correlation measures the degree of linear relationship between two variables, say A and B. Autocorrelation measures the correlation of a variable with itself. For example, we can measure how strongly a particular value of A correlates with the value of A some t steps back.
In our example, it turns out to be 0.945, which indicates a strong dependency. Now, why do we need errors to be independent? Again, dependent errors mean that the linear model fails to capture some pattern in the data. Such patterns occur frequently in time series data (where X is time and Y is a property that varies with time, stock prices for instance). The unaccounted patterns here could be seasonality or trends.
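Here is a minimal sketch of this experiment, assuming 200 points, unit-variance Gaussian noise, and a lag of one step. These choices are not from the original, so the value will differ from the 0.945 quoted above. It compares residuals under accumulated noise with residuals under independent noise.

```python
import random

random.seed(2)


def fit_line(xs, ys):
    """Closed-form OLS for one feature: returns (weight, intercept)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    w = sxy / sxx
    return w, y_mean - w * x_mean


def residuals_of(xs, ys):
    w, b = fit_line(xs, ys)
    return [y - (w * x + b) for x, y in zip(xs, ys)]


def lag1_autocorr(values):
    """Correlation of a series with itself shifted by one step."""
    n = len(values)
    mean = sum(values) / n
    num = sum((values[i] - mean) * (values[i - 1] - mean) for i in range(1, n))
    den = sum((v - mean) ** 2 for v in values)
    return num / den


n = 200
xs = [float(i) for i in range(n)]

# Independent noise: every point gets a fresh random number.
iid_noise = [random.gauss(0, 1) for _ in range(n)]

# Accumulated noise: a fresh random number plus the previous point's noise.
accumulated_noise = []
running = 0.0
for _ in range(n):
    running += random.gauss(0, 1)
    accumulated_noise.append(running)

r_iid = lag1_autocorr(residuals_of(xs, [x + e for x, e in zip(xs, iid_noise)]))
r_acc = lag1_autocorr(residuals_of(xs, [x + e for x, e in zip(xs, accumulated_noise)]))

print("lag-1 autocorrelation, independent noise:", r_iid)
print("lag-1 autocorrelation, accumulated noise:", r_acc)
```

A value close to 0 is what independent residuals look like. A value close to 1 means each residual largely repeats the one before it.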
I hope the three assumptions are a bit clear. Now how do we evaluate our model? Let’s take a look at some metrics.
Evaluating a Model
Previously, we defined MSE to calculate the errors committed by the model. However, suppose I tell you that for some data and some model the MSE is 23.223. Is this information alone enough to say something about the quality of our fit? How do we know if it’s the best our model can do? We need some benchmark to evaluate our model against. Hence, we have a metric called R squared (R²):
R² = 1 − MSE / TSE
Let’s get the terms right. We know MSE. However, what is TSE, or Total Squared Error? Suppose we had no X. We have only Y, and we are asked to fit a line to these Y values such that the MSE is minimized. Since we have no X, our line would be of the form Y’ = a, where a is a constant. If we substitute a for Y’ in the MSE equation and minimize it by differentiating with respect to a and setting the derivative equal to zero, it turns out that a = mean(Y) gives the least error. TSE is the error of this line, computed the same way as the MSE.
Think about this: the line Y’ = a can be understood as the baseline model for our data. Adding an independent variable X can only improve the fit on the training data or leave it unchanged. Our model cannot be worse than this baseline model, because if X didn’t help, its weight or coefficient would be 0 after MSE minimization. This baseline model provides a reference point. Now come back to R squared and take a look at the expression. If our model with all the X and all the Y produces the same error as the baseline model (TSE), R squared = 1 − 1 = 0. This is the worst case. At the other extreme, if MSE = 0, R squared = 1, which is the best case.
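The baseline argument can be checked with a few lines of code. This is a minimal sketch on made-up data. The helper names are assumptions, and TSE is computed as the mean squared error of the constant prediction mean(Y), so that it is on the same scale as the MSE.

```python
# Toy data, made up for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1]


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def fit_line(xs, ys):
    """Closed-form OLS for one feature: returns (weight, intercept)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    w = sxy / sxx
    return w, y_mean - w * x_mean


def r_squared(ys, preds):
    baseline = sum(ys) / len(ys)          # best constant prediction
    tse = mse(ys, [baseline] * len(ys))   # error of the baseline model
    return 1 - mse(ys, preds) / tse


baseline = sum(ys) / len(ys)
tse = mse(ys, [baseline] * len(ys))

# Moving the constant away from the mean in either direction raises the error.
error_above = mse(ys, [baseline + 0.5] * len(ys))
error_below = mse(ys, [baseline - 0.5] * len(ys))

w, b = fit_line(xs, ys)
r2_model = r_squared(ys, [w * x + b for x in xs])
r2_baseline = r_squared(ys, [baseline] * len(ys))

print("TSE:", tse, "| shifted constants:", error_above, error_below)
print("R squared of the fitted line:", r2_model)
print("R squared of the baseline:   ", r2_baseline)
```

The baseline scores exactly 0 by construction, and the fitted line lands between 0 and 1.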
Now let’s take a step back and think about the case when we add more independent variables to our data. How would the model respond to it? Suppose we are trying to predict house prices. If we add the area of the house to our model as an independent variable, our R squared could increase. That is expected, since the variable does affect house prices. Suppose we add another independent variable. Something garbage, say random numbers. Can our R squared increase? Can it decrease?
Now, if this garbage variable helps minimize MSE on our data, its weight or coefficient is nonzero. If it doesn’t, the weight is zero and we get back the previous model. We can conclude that adding a new independent variable at worst does nothing. It won’t degrade the model’s R squared on the training data. So if I keep adding new variables, I should get a better R squared. And I will. However, it doesn’t make sense. Those features aren’t reliable. If that set of random numbers were replaced by some other set of random numbers, our weights would change. You see, it is all up to chance. Remember that we have a sample of data points on which we build a model. It needs to be robust to new data points outside the sample. That’s why we introduce something called adjusted R squared. Adjusted R squared penalizes any addition of independent variables that does not add a significant improvement to the model. You usually use this metric to compare models after the addition of new features.
Adjusted R² = 1 − (1 − R²)(n − 1) / (n − k − 1)
Here n is the number of points and k is the number of independent variables. If you add features without a significant increase in R squared, the adjusted R squared decreases.
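As a quick illustration, the sketch below evaluates the standard adjusted R squared formula, 1 − (1 − R²)(n − 1)/(n − k − 1), for a hypothetical model before and after a near-useless feature is added. The R squared values 0.800 and 0.801 are invented for the example.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R squared for n points and k independent variables."""
    if n - k - 1 <= 0:
        raise ValueError("need more points than independent variables plus one")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


n_points = 50

# Hypothetical numbers: one useful feature gives R squared = 0.800.
before = adjusted_r2(0.800, n_points, k=1)

# Adding a garbage feature nudges R squared up to 0.801.
after = adjusted_r2(0.801, n_points, k=2)

print("adjusted R squared with 1 feature: ", before)
print("adjusted R squared with 2 features:", after)
```

Plain R squared went up slightly, but the adjusted value goes down because the gain does not justify the extra variable.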
So now we know something about linear regression. Now we look at regularization and assessment of coefficients.
Motivation
What exactly are we trying to improve? Linear regression doesn’t really need parameter tuning. Consider the following experiment. We generate 10 evenly spaced numbers from 0 to 10 and assign them to two variables, x and y. Consequently, the relationship between them is y = x, which is obviously linear. Now we add some random Gaussian noise (mean 0) to y for the sake of this experiment. So now we have
y = x + noise
The linearity between y and x is now disrupted. For this experiment, we know the process that generated our data (shown in the picture below). But this might not be the case most of the time.
Now, if you recall from above, a linear model would still fit well, because the random noise has a mean of zero and a constant standard deviation, so there is no heteroscedasticity. But what if we didn’t know any of this about our data? Let’s assume that we still go ahead and fit a line (shown in the figure below)
As we see, the line does not pass through the noisy points exactly. We go ahead and fit a polynomial of degree 2 (with a squared term in addition to the linear term). The fit looks something like this
For the above fit, our X data has two columns: x and x². We know that the R squared of this curve on the training data is at least as good as that of our linear fit, because if the squared term didn’t help, its coefficient would have been 0. Now let’s take a step ahead and fit a polynomial of degree 4.
This curve fits the data even better. It seems that the curve is able to capture the irregularities well.
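The three fits can be reproduced with a short sketch. It uses only the standard library and solves the least-squares normal equations with a small hand-written Gaussian elimination. The noise level, the seed, and the helper names are assumptions. As a numerical convenience, x is rescaled to the range 0 to 1 before fitting, which does not change the quality of the fit.

```python
import random

random.seed(3)


def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (M[i][n] - s) / M[i][i]
    return w


def fit_poly(ts, ys, degree):
    """Least-squares polynomial fit via the normal equations (X^T X) w = X^T y."""
    X = [[t ** p for p in range(degree + 1)] for t in ts]
    k = degree + 1
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xty = [sum(row[i] * y for row, y in zip(X, ys)) for i in range(k)]
    return solve(XtX, Xty)


def mse(ts, ys, w):
    preds = [sum(c * t ** p for p, c in enumerate(w)) for t in ts]
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# 10 evenly spaced numbers from 0 to 10, with y = x plus Gaussian noise.
xs = [10 * i / 9 for i in range(10)]
ys = [x + random.gauss(0, 1) for x in xs]
ts = [x / 10 for x in xs]  # rescaled copy of x used for fitting

errors = {}
for degree in (1, 2, 4):
    errors[degree] = mse(ts, ys, fit_poly(ts, ys, degree))
    print("degree", degree, "training MSE:", errors[degree])
```

The training error can only stay the same or drop as the degree grows, even though the data was generated by a straight line plus noise.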
It can be asserted that adding higher-degree polynomial terms will most of the time give you a better fit on the training data. If not better, it won’t degrade the metric you follow (Mean Squared Error). But what about generalization? Adding higher-degree polynomial terms can lead to overfitting. Many times, a particular feature (in this case, a particular power of X) dominates the model. The outcome then depends largely on this feature. Consequently, the model becomes too sensitive and does not generalize well.
Hence, we need a way to control the dominance of our input features. By control, we mean a way to control the coefficients. More specifically, the magnitude of these coefficients. This process is called regularization. In this article, we study two ways to achieve this goal: LASSO and Ridge.
These two are not very different except for the way they control the magnitude of the coefficients. Previously, we defined the loss function as follows:
MSE = (1/N) Σ (yi − ŷi)²
where N is the total number of training points, yi is the actual value, ŷi is the value predicted by the model, and the sum runs over all N points. In regularization, we add a term to our existing MSE. We will talk about this term in detail ahead.
Least Absolute Shrinkage and Selection Operator (LASSO)
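In LASSO, the loss looks like
Loss = (1/N) Σi (yi − Σj wj xij)² + λ Σj |wj|
where wj is the coefficient of the j-th feature and xij is the value of the j-th feature for the i-th point. It is the same expression as before, except that the predicted values are written out as the sum of the input features multiplied by their coefficients. The additional term is the sum of the magnitudes of the coefficients multiplied by lambda (λ), the strength of regularization. Let’s explore this term in more detail.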
Suppose we train the linear regression model using gradient descent. Previously, we tried to minimize only the squared difference term (the first term). Now, we also try to minimize the sum of the magnitudes of the coefficients. If you think carefully, the two terms roughly oppose each other. Consider the following: if some coefficients try to assume large values to minimize the sum of squares of the differences (the first term), the regularization term, or the sum of magnitudes of the coefficients, increases. The total loss therefore does not drop as much as it would without regularization, so large coefficients are discouraged. Ideally, we end up with coefficients that generalize better. Now, what’s the role of the strength of regularization, you may ask? It decides how much importance is given to the regularization term in comparison to the first term. If it is large, the model focuses more on reducing the sum of magnitudes of the coefficients.
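The tug of war between the two terms can be seen by evaluating the loss directly. The sketch below assumes the loss is the mean squared error plus lambda times the sum of absolute coefficients, with no intercept. The tiny dataset and both candidate weight vectors are made up for illustration.

```python
def lasso_loss(X, y, w, lam):
    """Mean squared error plus lam times the sum of absolute coefficients."""
    n = len(y)
    preds = [sum(wj * xj for wj, xj in zip(w, row)) for row in X]
    data_term = sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / n
    penalty = lam * sum(abs(wj) for wj in w)
    return data_term + penalty


# Made-up data that the weights [2, 3] reproduce exactly.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [2.0, 3.0, 5.0]

w_exact = [2.0, 3.0]    # zero squared error, larger coefficients
w_shrunk = [1.5, 2.5]   # some squared error, smaller coefficients

for lam in (0.1, 2.0):
    print("lambda =", lam,
          "| exact fit:", lasso_loss(X, y, w_exact, lam),
          "| shrunk fit:", lasso_loss(X, y, w_shrunk, lam))
```

With a small lambda the exact fit has the lower loss. With a large lambda the penalty dominates and the smaller coefficients win, even though they fit the data worse.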
Ridge
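Ridge differs in the way the regularization term is modeled. In LASSO, we summed the magnitudes of the coefficients. Here we sum the squares of the coefficients. The modified loss function looks like
Loss = (1/N) Σi (yi − Σj wj xij)² + λ Σj wj²
where wj is the coefficient of the j-th feature and xij is the value of the j-th feature for the i-th point. The high-level idea remains the same. Only the penalty changes, and with it the gradient that gradient descent follows.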
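To see what changes in gradient descent, here is a minimal sketch for a single feature with no intercept, assuming the loss is the mean squared error plus lambda times the squared weight. The squared penalty contributes 2·lambda·w to the gradient, a pull toward zero that is proportional to the weight itself. The data, learning rate, and step count are arbitrary choices for the example.

```python
# Made-up noiseless data: y = 3 * x.
xs = [i / 10 for i in range(1, 11)]
ys = [3 * x for x in xs]
n = len(xs)
lam = 0.5


def ridge_gradient(w):
    """Derivative of (1/n) * sum((y - w*x)^2) + lam * w^2 with respect to w."""
    data_grad = -2 / n * sum(x * (y - w * x) for x, y in zip(xs, ys))
    penalty_grad = 2 * lam * w
    return data_grad + penalty_grad


# Plain gradient descent from w = 0.
w = 0.0
learning_rate = 0.1
for _ in range(1000):
    w -= learning_rate * ridge_gradient(w)

# Setting the gradient to zero gives the same answer in closed form.
closed_form = sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + n * lam)

print("gradient descent:", w)
print("closed form:     ", closed_form)
```

Both routes agree, and the weight ends up well below the unregularized value of 3. For LASSO, the penalty’s contribution to the gradient would instead be lambda times the sign of w, a constant pull that does not weaken as the weight gets small.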
Analysis
One may ask, how do these two methods differ? The answer lies in the graph below
Image credits: http://freakonometrics.hypotheses.org/files/2018/07/lasso1.png
Consider the above plot. It shows the coefficient values against the regularization strength for LASSO. Beyond a certain lambda value, coefficients shrink to exactly zero and are thus eliminated from the model itself. But in the case of Ridge,
Image credits: https://i.stack.imgur.com/E7BHo.png
the coefficients do not shrink to exactly zero. They get smaller as lambda grows, but they stay nonzero.
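The same behaviour shows up in the simplest possible setting. The sketch below assumes a single feature, no intercept, and a loss of mean squared error plus the penalty. In that special case both minimizers have a closed form: Ridge divides by a larger number as lambda grows, while LASSO subtracts a fixed amount and clips at zero (soft thresholding). The data is made up.

```python
import math

# Made-up noiseless data: y = 3 * x.
xs = [i / 10 for i in range(1, 11)]
ys = [3 * x for x in xs]
n = len(xs)

a = sum(x * x for x in xs) / n               # mean of x squared
b = sum(x * y for x, y in zip(xs, ys)) / n   # mean of x times y


def ridge_weight(lam):
    """Minimizer of mean((y - w*x)^2) + lam * w^2."""
    return b / (a + lam)


def lasso_weight(lam):
    """Minimizer of mean((y - w*x)^2) + lam * |w| (soft thresholding)."""
    shrunk = max(abs(b) - lam / 2, 0.0)
    return math.copysign(shrunk, b) / a


for lam in (0.0, 0.5, 1.0, 2.0, 3.0, 5.0):
    print("lambda =", lam,
          "| lasso:", lasso_weight(lam),
          "| ridge:", ridge_weight(lam))
```

Once lambda is large enough, the LASSO weight is exactly 0, whereas the Ridge weight keeps shrinking but never reaches it.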
Conclusion
We have seen why we need regularization at all, what problems it solves, and the methods by which it is implemented. We also saw how it works.