The Normal Equation


What is Linear Regression?

Linear regression is a model-based machine learning method that predicts a target numerical value by computing a weighted sum of the input features, plus a constant called the bias term (also known as the intercept). Examples of problems that could be addressed with linear regression include:

  1. Predicting life satisfaction based on GDP per capita.
  2. Predicting housing prices based on features like square footage and the number of bedrooms.
  3. Predicting crop yields based on factors like rainfall and fertilizer use.
  4. Predicting a company's revenue next year based on previous performance metrics.

Linear regression is represented mathematically by:

\widehat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n

where ŷ is the predicted value, n is the number of features, x_i is the i-th feature value, and θ_j is the j-th model parameter, with θ_0 being the bias term.
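For instance, the following sketch (with made-up parameter values, not taken from any real model) shows that a prediction is just the dot product of the parameter vector with the feature vector once a constant 1 is prepended for the bias term:

import numpy as np

# Hypothetical parameters: bias theta_0 = 4.0 and feature weights theta_1 = 3.0, theta_2 = 0.5.
theta = np.array([4.0, 3.0, 0.5])

# One instance with n = 2 features, with a leading 1 so that theta_0 multiplies it.
x = np.array([1.0, 2.0, 10.0])

y_hat = theta @ x  # 4.0 + 3.0 * 2.0 + 0.5 * 10.0 = 15.0
print(y_hat)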

Finding Optimal Parameters

The goal of training a linear regression model is to find the optimal values of the parameter vector θ that minimize the difference between the model's predictions ŷ and the actual target values y in the training data.

Normal Equation

The Normal Equation is a mathematical formula used to find the optimal values for the parameters of a linear regression model. The optimal values are the ones that minimize the difference between the model's predictions and the actual target values in the training data.

\widehat{\theta} = \left(\mathbf{X}^{\mathsf{T}}\mathbf{X}\right)^{-1}\mathbf{X}^{\mathsf{T}}\mathbf{y}

We'd like to minimize the least-squares cost:

J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2

where x^{(i)} is the i-th sample (from a set of m samples) and y^{(i)} is the i-th target value. To proceed, we'll represent the problem in matrix notation; this is natural, since we essentially have a system of linear equations here.
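As a small sketch (the helper name and its inputs are made up for illustration), this cost can be computed directly from its definition, summing the squared errors over the samples:

import numpy as np


def least_squares_cost(theta: np.ndarray, X: np.ndarray, y: np.ndarray) -> float:
    # J(theta) = 1/2 * sum_i (h_theta(x^(i)) - y^(i))^2,
    # where each row of X is one sample with a leading 1 for the bias term.
    total = 0.0
    for x_i, y_i in zip(X, y):
        prediction = float(x_i @ theta)  # h_theta(x^(i)) = theta^T x^(i)
        total += (prediction - y_i) ** 2
    return 0.5 * total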

The regression coefficients θ we're looking for form the vector:

\theta = \begin{pmatrix}\theta_0\\\theta_1\\\vdots\\\theta_n\end{pmatrix}\in\mathbb{R}^{n+1}

Given a training set, define the design matrix X to be the m-by-n matrix (actually m-by-(n+1), if we include the intercept term) that contains the training examples' input values in its rows:

X = \begin{bmatrix} -(x^{(1)})^T- \\ -(x^{(2)})^T- \\ \vdots \\ -(x^{(m)})^T- \end{bmatrix}

Also, let \vec{y} be the m-dimensional vector containing all the target values from the training set:

\vec{y} = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix}
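In NumPy, a design matrix of this shape can be built by stacking the examples as rows and prepending a column of ones for the intercept (a minimal sketch with a made-up training set):

import numpy as np

# Hypothetical training set: m = 3 examples, each with n = 2 features.
X_raw = np.array([[2.0, 1.0],
                  [4.0, 3.0],
                  [6.0, 5.0]])
y_vec = np.array([5.0, 9.0, 13.0])

# m-by-(n+1) design matrix: each row is (x^(i))^T with a leading 1 for theta_0.
X = np.c_[np.ones((X_raw.shape[0], 1)), X_raw]
print(X.shape)  # (3, 3)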

Now, since h_\theta(x^{(i)}) = (x^{(i)})^T\theta, we can easily verify that

\begin{aligned}X\theta-\vec{y} &= \begin{bmatrix}(x^{(1)})^T\theta\\\vdots\\(x^{(m)})^T\theta\end{bmatrix}-\begin{bmatrix}y^{(1)}\\\vdots\\y^{(m)}\end{bmatrix}\\&= \begin{bmatrix}h_\theta(x^{(1)})-y^{(1)}\\\vdots\\h_\theta(x^{(m)})-y^{(m)}\end{bmatrix}\end{aligned}

Thus, using the fact that for a vector z we have z^T z = \sum_i z_i^2:

\begin{aligned}\frac{1}{2}(X\theta-\vec{y})^T(X\theta-\vec{y}) &= \frac{1}{2}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)^2\\&= J(\theta)\end{aligned}
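This identity is easy to check numerically; the sketch below uses small made-up arrays and an arbitrary θ:

import numpy as np

# Hypothetical design matrix (bias column included), targets, and parameters.
X = np.array([[1.0, 2.0],
              [1.0, 4.0],
              [1.0, 6.0]])
y = np.array([5.0, 9.0, 13.0])
theta = np.array([0.5, 1.5])

residuals = X @ theta - y
matrix_form = 0.5 * residuals.T @ residuals    # 1/2 (X*theta - y)^T (X*theta - y)
sum_form = 0.5 * np.sum((X @ theta - y) ** 2)  # 1/2 * sum_i (h_theta(x^(i)) - y^(i))^2
print(np.isclose(matrix_form, sum_form))       # True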

Finally, to minimize J, let's find its gradient with respect to θ. Hence,

\begin{aligned}\nabla_\theta J(\theta) &= \nabla_\theta\,\frac{1}{2}(X\theta-\vec{y})^T(X\theta-\vec{y})\\&= \frac{1}{2}\nabla_\theta\left((X\theta)^T X\theta-(X\theta)^T\vec{y}-\vec{y}^T(X\theta)+\vec{y}^T\vec{y}\right)\\&= \frac{1}{2}\nabla_\theta\left(\theta^T(X^TX)\theta-\vec{y}^T(X\theta)-\vec{y}^T(X\theta)\right)\\&= \frac{1}{2}\nabla_\theta\left(\theta^T(X^TX)\theta-2(X^T\vec{y})^T\theta\right)\\&= \frac{1}{2}\left(2X^TX\theta-2X^T\vec{y}\right)\\&= X^TX\theta-X^T\vec{y}\end{aligned}

(This derivation follows Andrew Ng's Stanford CS229 notes.)

In the third step, we used the fact that a^T b = b^T a, and in the fifth step we used the facts \nabla_x b^T x = b and \nabla_x x^T A x = 2Ax for a symmetric matrix A (for more details, see Section 4.3 of “Linear Algebra Review and Reference”).

To minimize J, we set its gradient to zero and obtain the normal equations:

X^T X \theta = X^T \vec{y}

Thus, the value of θ that minimizes J(θ) is given in closed form by the equation

\theta = (X^T X)^{-1} X^T \vec{y}
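As a quick numerical sanity check (using randomly generated data, so the exact numbers are illustrative), the closed-form θ makes the gradient X^T Xθ − X^T y vanish and lands close to the true coefficients used to generate the data:

import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: 100 examples of a single feature, with y = 4 + 3x + Gaussian noise.
m = 100
x = 2 * rng.random((m, 1))
y = 4 + 3 * x[:, 0] + rng.normal(size=m)

X = np.c_[np.ones((m, 1)), x]             # design matrix with a bias column
theta = np.linalg.inv(X.T @ X) @ X.T @ y  # closed-form solution

gradient = X.T @ X @ theta - X.T @ y      # should be ~0 at the minimum
print(theta)                              # approximately [4, 3]
print(np.allclose(gradient, 0.0))         # True, up to floating-point error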

Key points

  • Purpose: The Normal Equation provides a direct, closed-form solution to calculate the best parameters (weights) for a linear regression model.
The Normal Equation implemented in Python with NumPy and scikit-learn (a usage example follows this list):
from sklearn.preprocessing import add_dummy_feature
import numpy as np


def normal_equation(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    # Prepend a column of ones (x0 = 1) so that theta_0 acts as the bias term.
    X_b = add_dummy_feature(X)
    # Closed-form solution: theta = (X^T X)^-1 X^T y
    theta_best = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

    return theta_best
  • Cost Function: It works by finding the parameter values that minimize the Mean Squared Error (MSE) cost function. The MSE measures the average squared difference between the model's predictions and the actual target values.
\mathrm{MSE}(\mathbf{X}, h_{\theta}) = \frac{1}{m}\sum_{i=1}^{m}\left(\theta^{\mathsf{T}}\mathbf{x}^{(i)} - y^{(i)}\right)^2
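A possible usage sketch for the normal_equation function defined above (the synthetic data and the comparison with scikit-learn's LinearRegression are illustrative; both should yield essentially the same parameters):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = 2 * rng.random((100, 1))                # 100 examples, 1 feature
y = 4 + 3 * X[:, 0] + rng.normal(size=100)  # y = 4 + 3x + Gaussian noise

theta = normal_equation(X, y)               # [bias, weight], roughly [4, 3]

lin_reg = LinearRegression().fit(X, y)
print(theta)
print(lin_reg.intercept_, lin_reg.coef_)    # should closely match theta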

Advantages

  • Fast and efficient for smaller datasets with a limited number of features.
  • Direct solution without needing iterative optimization algorithms like gradient descent.

Disadvantages

  • Can become computationally very expensive when the number of features is large, since it requires inverting the (n+1)-by-(n+1) matrix X^T X, whose cost grows roughly between quadratically and cubically with the number of features.
  • Does not directly handle cases where certain features are redundant or where the number of features exceeds the number of training instances, because X^T X is then singular and cannot be inverted (see the sketch after this list for a workaround).
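When X^T X is singular or nearly so, a common workaround (sketched below; the function name is made up for illustration) is to use the Moore-Penrose pseudoinverse, e.g. via np.linalg.pinv, which still returns a least-squares solution:

import numpy as np


def normal_equation_pinv(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    # Prepend the bias column of ones.
    X_b = np.c_[np.ones((X.shape[0], 1)), X]
    # The pseudoinverse (computed via SVD) is defined even when X_b.T @ X_b is
    # singular, e.g. with redundant features or more features than instances.
    return np.linalg.pinv(X_b) @ y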

Alternative to the Normal Equation: For datasets with many features, or when computational efficiency is critical, iterative methods such as gradient descent are used instead to find the optimal parameters.
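For reference, a minimal batch gradient descent sketch for linear regression (the learning rate and iteration count are illustrative defaults, not tuned values):

import numpy as np


def batch_gradient_descent(X: np.ndarray, y: np.ndarray,
                           learning_rate: float = 0.1,
                           n_iterations: int = 1000) -> np.ndarray:
    # X is assumed to already include the bias column of ones.
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(n_iterations):
        # Gradient of the MSE cost: (2/m) * X^T (X*theta - y)
        gradients = (2 / m) * X.T @ (X @ theta - y)
        theta -= learning_rate * gradients
    return theta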

Check out Gradient Descent Here

Read More Here