Using Norms to Understand Linear Regression
Introduction
In my last post, I described how we can derive modes, medians and means as three natural solutions to the problem of summarizing a list of numbers, \((x_1, x_2, \ldots, x_n)\), using a single number, \(s\). In particular, we measured the quality of different potential summaries in three different ways, which led us to modes, medians and means respectively. Each of these quantities emerged from measuring the typical discrepancy between an element of the list, \(x_i\), and the summary, \(s\), using a formula of the form,
$$ \sum_i \lvert x_i - s \rvert ^p, $$
where \(p\) was either \(0\), \(1\) or \(2\).
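As a quick illustration, here is a small numerical sketch (in Python, with a made-up list and a brute-force grid of candidate summaries) that recovers each of the three summaries by minimizing this sum for \(p = 0, 1, 2\):

```python
import numpy as np

x = np.array([1.0, 2.0, 2.0, 3.0, 10.0])  # a small made-up list of numbers

def cost(s, p):
    """Sum_i |x_i - s|^p; for p = 0, count the nonzero discrepancies."""
    d = np.abs(x - s)
    return np.count_nonzero(d) if p == 0 else np.sum(d ** p)

# Brute-force search over a grid of candidate summaries; the data points
# themselves are included so that exact matches are possible when p = 0.
candidates = np.union1d(np.linspace(x.min(), x.max(), 1001), x)
for p in (0, 1, 2):
    best = min(candidates, key=lambda s: cost(s, p))
    print(f"p = {p}: s = {best:.2f}")
# p = 0 recovers the mode (2), p = 1 the median (2),
# and p = 2 a value close to the mean (3.6).
```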
The \(L_p\) Norms
In this post, I’d like to extend this approach to linear regression. The notion of discrepancies we used in the last post is very closely tied to the idea of measuring the size of a vector in \(\mathbb{R}^n\). Specifically, we were minimizing a measure of discrepancies that was almost identical to the \(L_p\) family of norms that can be used to measure the size of vectors. Understanding \(L_p\) norms makes it much easier to describe several modern generalizations of classical linear regression.
To extend our previous approach to the more standard notion of an \(L_p\) norm, we simply take the sum we used before and rescale things by taking a \(p^{th}\) root. This gives the formula for the \(L_p\) norm of any vector, \(v = (v_1, v_2, \ldots, v_n)\), as,
$$ \lvert v \rvert_p = (\sum_i \lvert v_i \rvert^p)^\frac{1}{p}. $$
When \(p = 2\), this formula reduces to the familiar formula for the length of a vector:
$$ \lvert v \rvert_2 = \sqrt{\sum_i v_i^2}. $$
In the last post, the vector we cared about was the vector of elementwise discrepancies, \(v = (x_1 - s, x_2 - s, \ldots, x_n - s)\). We wanted to minimize the overall size of this vector in order to make \(s\) a good summary of \(x_1, \ldots, x_n\). Because we were interested only in minimizing the size of this vector, it didn’t matter that we skipped taking the \(p^{th}\) root at the end: one vector has a smaller norm than another exactly when the \(p^{th}\) power of its norm is smaller than the \(p^{th}\) power of the other. What was essential wasn’t the scale of the norm, but rather the value of \(p\) that we chose. Here we’ll follow that approach again. Specifically, we’ll again be working consistently with the \(p^{th}\) power of an \(L_p\) norm:
$$ \lvert v \rvert_p^p = (\sum_i \lvert v_i \rvert^p). $$
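In code, both quantities are direct translations of these formulas. A minimal sketch (for \(p \ge 1\), the norm agrees with what `numpy.linalg.norm` computes):

```python
import numpy as np

def lp_norm(v, p):
    """The L_p norm of v: (sum_i |v_i|^p)^(1/p)."""
    return np.sum(np.abs(v) ** p) ** (1.0 / p)

def lp_norm_pth_power(v, p):
    """The p-th power of the L_p norm, which is what we'll actually minimize."""
    return np.sum(np.abs(v) ** p)

v = np.array([3.0, -4.0])
print(lp_norm(v, 2))          # 5.0, the familiar Euclidean length
print(lp_norm(v, 1))          # 7.0
print(np.linalg.norm(v, 1))   # numpy agrees for p >= 1
```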
The Regression Problem
Using \(L_p\) norms to measure the overall size of a vector of discrepancies extends naturally to other problems in statistics. In the previous post, we were trying to summarize a list of numbers by producing a simple summary statistic. In this post, we’re instead going to summarize the relationship between two lists of numbers in a form that generalizes traditional regression models.
Instead of a single list, we’ll now work with two vectors: \((x_1, x_2, \ldots, x_n)\) and \((y_1, y_2, \ldots, y_n)\). Because we like simple models, we’ll make the very strong (and very convenient) assumption that the second vector is, approximately, a linear function of the first vector, which gives us the formula:
$$ y_i \approx \beta_0 + \beta_1 x_i. $$
In practice, this linear relationship is never perfect, but only an approximation. As such, for any specific values we choose for \(\beta_0\) and \(\beta_1\), we have to compute a vector of discrepancies: \(v = (y_1 - (\beta_0 + \beta_1 x_1), \ldots, y_n - (\beta_0 + \beta_1 x_n))\). The question then becomes: how do we measure the size of this vector of discrepancies? By choosing different norms to measure its size, we arrive at several different forms of linear regression models. In particular, we’ll work with three norms: the \(L_0\), \(L_1\) and \(L_2\) norms.
As we did with the single vector case, here we’ll define discrepancies as,
$$ d_i = \lvert y_i - (\beta_0 + \beta_1 x_i) \rvert^p, $$
and the total error as,
$$ E_p = \sum_i \lvert y_i - (\beta_0 + \beta_1 x_i) \rvert^p, $$
which is just the \(p^{th}\) power of the \(L_p\) norm.
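Here is a minimal sketch of that total error in Python, on made-up data that lie near the line \(y = 1 + 2x\). For \(p = 0\), it counts the nonzero discrepancies (the convention we used for the mode in the previous post):

```python
import numpy as np

# Made-up data lying near the line y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

def total_error(beta0, beta1, x, y, p):
    """E_p: the p-th power of the L_p norm of the discrepancy vector."""
    d = np.abs(y - (beta0 + beta1 * x))
    # For p = 0 we count the nonzero discrepancies (the points off the line),
    # just as in the mode case from the previous post.
    return np.count_nonzero(d) if p == 0 else np.sum(d ** p)

for p in (0, 1, 2):
    print(f"E_{p} at (beta0, beta1) = (1, 2):", total_error(1.0, 2.0, x, y, p))
```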
Several Forms of Regression
In general, we want to estimate a set of regression coefficients that minimize this total error. Different forms of linear regression appear when we alter the value of \(p\). As before, let’s consider three settings:
$$ E_0 = \sum_i \lvert y_i - (\beta_0 + \beta_1 x_i) \rvert^0 $$
$$ E_1 = \sum_i \lvert y_i - (\beta_0 + \beta_1 x_i) \rvert^1 $$
$$ E_2 = \sum_i \lvert y_i - (\beta_0 + \beta_1 x_i) \rvert^2 $$
What happens in these settings? In the first case, we select regression coefficients so that the line passes through as many points as possible. Clearly we can always select a line that passes through any pair of points. And we can show that there are data sets in which we cannot do better. So the \(L_0\) norm doesn’t seem to provide a very useful form of linear regression, but I’d be interested to see examples of its use.
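To make the \(L_0\) case concrete, here is a brute-force sketch on made-up data: it tries the line through every pair of points and keeps the one that passes through (numerically) the most points, which is exactly what minimizing \(E_0\) asks for:

```python
import numpy as np
from itertools import combinations

# Made-up data: four points lie exactly on y = 1 + 2x, one does not.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 5.0, 7.0, 20.0])

best, most_hits = None, -1
for i, j in combinations(range(len(x)), 2):
    if x[i] == x[j]:
        continue  # skip vertical lines
    beta1 = (y[j] - y[i]) / (x[j] - x[i])
    beta0 = y[i] - beta1 * x[i]
    hits = int(np.sum(np.isclose(y, beta0 + beta1 * x)))  # points on this line
    if hits > most_hits:
        best, most_hits = (beta0, beta1), hits

print(best, most_hits)  # (1.0, 2.0), passing through 4 of the 5 points
```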
In contrast, minimizing \(E_1\) and \(E_2\) define quite interesting and familiar forms of linear regression. We’ll start with \(E_2\) because it’s the most familiar: it defines Ordinary Least Squares (OLS) regression, which is the one we all know and love. In the \(L_2\) case, we select \(\beta_0\) and \(\beta_1\) to minimize,
$$ E_2 = \sum_i (y_i - (\beta_0 + \beta_1 x_i))^2, $$
which is the summed squared error over all of the \((x_i, y_i)\) pairs. In other words, Ordinary Least Squares regression is just an attempt to find an approximating linear relationship between two vectors that minimizes the \(L_2\) norm of the vector of discrepancies.
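As a sanity check, minimizing \(E_2\) numerically should recover the same coefficients as a standard least-squares routine. Here is a sketch on made-up noisy data, using `scipy.optimize.minimize` and comparing against `numpy.polyfit`:

```python
import numpy as np
from scipy.optimize import minimize

# Made-up noisy data around y = 1 + 2x.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=x.size)

def E2(beta):
    """Summed squared error for the line beta[0] + beta[1] * x."""
    return np.sum((y - (beta[0] + beta[1] * x)) ** 2)

fit = minimize(E2, x0=[0.0, 0.0])
print(fit.x)                      # approximately [1, 2]
print(np.polyfit(x, y, 1)[::-1])  # polyfit returns (slope, intercept); reversed to match
```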
Although OLS regression is clearly king, the coefficients we get from minimizing \(E_1\) are also quite widely used: the \(L_1\) norm defines Least Absolute Deviations (LAD) regression, which is also sometimes called Robust Regression. This approach to regression is robust because large outliers, which would produce errors greater than \(1\), are not amplified by the squaring operation used to define OLS regression; only their absolute values are taken. This means that the resulting model will try to match the overall linear pattern in the data even when there are some very large outliers.
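Here is a sketch of that robustness on made-up data with a single large outlier: the \(L_1\) fit stays close to the bulk of the data, while the \(L_2\) fit gets dragged toward the outlier. (Since \(E_1\) is not differentiable everywhere, a derivative-free minimizer is used for it.)

```python
import numpy as np
from scipy.optimize import minimize

# Made-up data: nine points exactly on y = 1 + 2x, plus one large outlier.
x = np.arange(10, dtype=float)
y = 1.0 + 2.0 * x
y[-1] = 100.0

def E(beta, p):
    return np.sum(np.abs(y - (beta[0] + beta[1] * x)) ** p)

# E_2 is smooth, so the default gradient-based minimizer is fine.
ols = minimize(lambda b: E(b, 2), x0=[0.0, 0.0])
# E_1 is not differentiable everywhere, so use a derivative-free method,
# warm-started at the OLS solution.
lad = minimize(lambda b: E(b, 1), x0=ols.x, method="Nelder-Mead",
               options={"xatol": 1e-8, "fatol": 1e-8, "maxiter": 10000, "maxfev": 10000})

print("OLS:", np.round(ols.x, 2))  # dragged well away from [1, 2] by the outlier
print("LAD:", np.round(lad.x, 2))  # stays close to [1, 2]
```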
We can also relate these two approaches to the strategy employed in the previous post. When we use OLS regression (which would be better called \(L_2\) regression), we predict the mean of \(y_i\) given the value of \(x_i\). And when we use LAD regression (which would be better called \(L_1\) regression), we predict the median of \(y_i\) given the value of \(x_i\). Just as I said in the previous post, the core theoretical tool that we need to understand is the \(L_p\) norm. For single number summaries, it naturally leads to modes, medians and means. For simple regression problems, it naturally leads to LAD regression and OLS regression. But there’s more: it also leads naturally to the two most popular forms of regularized regression.
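A small way to see this, on a deliberately skewed, made-up data set with only two distinct values of \(x\): when the model can fit each group of \(y\) values separately, the \(L_2\) solution predicts each group’s mean and the \(L_1\) solution predicts each group’s median.

```python
import numpy as np
from scipy.optimize import minimize

# Made-up, deliberately skewed data with only two distinct values of x.
x = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
y = np.array([1.0, 2.0, 9.0, 3.0, 4.0, 11.0])

def E(beta, p):
    return np.sum(np.abs(y - (beta[0] + beta[1] * x)) ** p)

ols = minimize(lambda b: E(b, 2), x0=[1.0, 1.0])
lad = minimize(lambda b: E(b, 1), x0=[1.0, 1.0], method="Nelder-Mead",
               options={"xatol": 1e-8, "fatol": 1e-8, "maxiter": 10000, "maxfev": 10000})

# Predicted values at x = 0 and x = 1:
print("OLS predicts", np.round([ols.x[0], ols.x[0] + ols.x[1]], 2))  # ~[4, 6], the group means
print("LAD predicts", np.round([lad.x[0], lad.x[0] + lad.x[1]], 2))  # ~[2, 4], the group medians
```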
Regularization
If you’re not familiar with regularization, the central idea is that we don’t exclusively try to find the values of \(\beta_0\) and \(\beta_1\) that minimize the discrepancy between \(\beta_0 + \beta_1 x_i\) and \(y_i\), but also simultaneously try to satisfy a competing requirement that \(\beta_1\) not get too large. Note that we don’t try to control the size of \(\beta_0\) because it describes the overall scale of the data rather than the relationship between \(x\) and \(y\).
Because these objectives compete, we have to combine them into a single objective. We do that by taking a weighted sum of the two, with a parameter \(\lambda\) that controls how much weight we place on keeping the coefficients small. And because both the discrepancies and the size of the coefficients can be measured in terms of norms, we’ll assume that we want to minimize the \(L_p\) norm of the discrepancies and the \(L_q\) norm of the \(\beta\)’s. This means that we end up trying to minimize an expression of the form,
$$ (\sum_i \lvert y_i - (\beta_0 + \beta_1 x_i) \rvert^{p}) + \lambda (\lvert \beta_1 \rvert^q). $$
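Here is a sketch (in Python, on made-up data) of minimizing this penalized objective numerically for two choices of \(q\) and a few values of \(\lambda\), holding \(p = 2\); larger values of \(\lambda\) shrink \(\beta_1\) more aggressively:

```python
import numpy as np
from scipy.optimize import minimize

# Made-up noisy data around y = 1 + 2x.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 30)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=x.size)

def penalized_error(beta, p, q, lam):
    """E_p plus lambda times |beta_1|^q; the intercept beta_0 is not penalized."""
    fit_error = np.sum(np.abs(y - (beta[0] + beta[1] * x)) ** p)
    return fit_error + lam * np.abs(beta[1]) ** q

for q in (1, 2):
    for lam in (0.0, 50.0, 500.0):
        fit = minimize(lambda b: penalized_error(b, 2, q, lam), x0=[1.0, 1.0],
                       method="Nelder-Mead",
                       options={"xatol": 1e-8, "fatol": 1e-8, "maxiter": 10000})
        print(f"q = {q}, lambda = {lam:6.1f}: beta = {np.round(fit.x, 3)}")
# In both cases, larger lambda shrinks beta_1; the two penalties shrink it
# in characteristically different ways.
```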
In most regularized regression models that I’ve seen in the wild, people tend to use \(p = 2\) and \(q = 1\) or \(q = 2\). When \(q = 1\), this model is called the LASSO. When \(q = 2\), this model is called ridge regression. In a future post, I’ll try to describe why the LASSO and ridge regression produce such different patterns of coefficients.