Linear Regression - Minimization Problem

Understand the math behind linear regression.

Posted by Nivin Anton Alexis Lawrence on May 14, 2018


Prerequisites

  • Calculus
  • Linear Algebra
  • Haskell
  • Loss Function

Idea

Linear regression is a simple statistical model used to study the relationship between continuous variables. It can be categorized as simple or multiple based on the number of predictor variables in the linear equation: with one predictor variable it is simple linear regression, otherwise it is multiple linear regression. Say we have the function \(h(x) = \theta_1 x + \theta_0\). The goal of the regression model is to find values for \(\theta_1, \theta_0\) such that the cost function (described below) is minimized, so that the line fits the data. For simplicity, we will focus on simple linear regression and learn the related concepts. Note that the intuition and concepts hold true for the multiple linear regression model as well.
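For a concrete, made-up example: with \(\theta_0 = 1\) and \(\theta_1 = 2\), the hypothesis predicts \(h(3) = 2 \cdot 3 + 1 = 7\); regression's job is to pick the \(\theta\) values for which such predictions stay as close as possible to the observed \(y\) values.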

Code

module Regression where

import Data.List
import Types

-- calculate h(x) function 
hOfX :: Coefficients -> Point -> Float
hOfX thetas point = t0 + t1 * x
  where
    Coefficients (t0, t1) = thetas
    Point (x, _) = point
    
-- calculate the difference between h(x) and y
deltaT0 :: Coefficients -> Point -> Float
deltaT0 thetas point = hOfX thetas point - y
  where
    Point (_, y) = point

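-- multiply that difference by x, the gradient term for theta_1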
deltaT1 :: Coefficients -> Point -> Float
deltaT1 thetas point = (deltaT0 thetas point) * x
  where
    Point (x, _) = point

-- find the new theta value by applying gradient descent on cost function
newThetas :: DataSet -> Coefficients -> Float -> Coefficients
newThetas dataset theta alpha =
  let newt0 =
        t0 - alpha * m_reciprocal * (sum $ map (deltaT0 theta) training_data)
      newt1 =
        t1 - alpha * m_reciprocal * (sum $ map (deltaT1 theta) training_data)
  in Coefficients (newt0, newt1)
  where
    Coefficients (t0, t1) = theta
    DataSet training_data = dataset
    m_reciprocal = 1 / genericLength training_data

linearRegression :: DataSet -> Coefficients -> Float -> Int -> Coefficients
linearRegression dataset theta alpha iteration
  | iteration == 0 = theta
  | otherwise =
    let newtheta = newThetas dataset theta alpha
    in linearRegression dataset newtheta alpha (iteration - 1)
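
The exact contents of the imported Types module are not shown here. A minimal sketch that makes the code above compile could be (the deriving clauses are my addition):

module Types where

-- Intercept and slope (theta_0, theta_1) of the hypothesis.
newtype Coefficients = Coefficients (Float, Float) deriving (Show)

-- A single (x, y) observation.
newtype Point = Point (Float, Float) deriving (Show)

-- The full training set.
newtype DataSet = DataSet [Point] deriving (Show)

With those definitions in place, a hypothetical run on a tiny made-up dataset (y = 2x) could look like this, and should end up near Coefficients (0.0, 2.0):

-- Start from (0, 0), learning rate 0.01, 10000 iterations.
example :: Coefficients
example =
  linearRegression
    (DataSet [Point (1, 2), Point (2, 4), Point (3, 6)])
    (Coefficients (0, 0))
    0.01
    10000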

Concepts

Cost Function

Say we have live data in the form of Cartesian pairs (x, y), with x denoting the horsepower of a motorcycle and y denoting its cost. We want to find a line that best describes this data, and linear regression gives us a way to find it. But how will we know that the line we find is the best fit? The cost function comes to the rescue: it acts as a scale that tells us how close or far the predicted line is from the data. So the linear regression algorithm tries to minimize the value of the cost function in order to get the best fit.

The terms loss function and cost function are used interchangeably. The cost function acts as the steering signal that tells us how good or bad the predicted function is. In machine learning, the two well-known loss functions mentioned below are used in many models:

  • Least Absolute Deviations (L1)
  • Least Squares Error (LSE) (L2)

In linear regression we will be using the L2 norm as the loss function. You can read about these loss functions in detail in the references provided below.
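
To make the difference concrete, here is a small sketch of the two losses over a list of points, reusing hOfX and the types from the Code section (the function names are my own, not part of the original code):

-- Least Absolute Deviations: sum of |y - h(x)| over all points
l1Loss :: Coefficients -> [Point] -> Float
l1Loss thetas points =
  sum [abs (y - hOfX thetas p) | p@(Point (_, y)) <- points]

-- Least Squares Error: sum of (y - h(x))^2 over all points
l2Loss :: Coefficients -> [Point] -> Float
l2Loss thetas points =
  sum [(y - hOfX thetas p) ^ 2 | p@(Point (_, y)) <- points]

Squaring makes L2 penalize large residuals much more heavily than L1, and it also makes the loss smooth and easy to differentiate, which we will rely on below.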

Least Squares Error (LSE) (L2)

The L2 loss function is \(S = \sum_{i=1}^m (y_i - h(x_i))^2\), where \(y_i\) is the actual value and \(h(x_i)\) is the predicted value.
Here we will represent the cost function as
\(J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^m (h(x_i) - y_i)^2\)
The next step is to come up with a mathematical optimization for our cost function. Optimization is a separate field of study, applied in many areas to find the maximum or minimum of a function.
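
The original code listing does not include the cost function itself, but a direct translation of \(J(\theta_0, \theta_1)\) is useful for monitoring convergence. The sketch below is my own addition, reusing hOfX, genericLength, and the types from the Code section:

-- J(theta0, theta1) = (1 / (2 * m)) * sum of squared residuals
costFunction :: DataSet -> Coefficients -> Float
costFunction (DataSet points) thetas =
  sum [(hOfX thetas p - y) ^ 2 | p@(Point (_, y)) <- points]
    / (2 * genericLength points)

Printing costFunction dataset theta every few hundred iterations is a quick sanity check that gradient descent is actually driving the loss down.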

Gradient Descent

Take the partial derivative with respect to \(\theta_0\):
\(\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{\partial}{\partial \theta_0} \left( \frac{1}{2m} \sum_{i=1}^m (h(x_i) - y_i)^2 \right)\)
\(\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^m (h(x_i) - y_i) \cdot 1\)

Take the partial derivative with respect to \(\theta_1\):
\(\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{\partial}{\partial \theta_1} \left( \frac{1}{2m} \sum_{i=1}^m (h(x_i) - y_i)^2 \right)\)
\(\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^m (h(x_i) - y_i) \cdot x_i\)
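
These derivatives are exactly the sums that newThetas computes. Substituting them into the standard gradient descent update rule, with learning rate \(\alpha\) and both parameters updated simultaneously, gives:
\(\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^m (h(x_i) - y_i)\)
\(\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^m (h(x_i) - y_i) \cdot x_i\)
Repeating this update for a fixed number of iterations (or until the cost stops decreasing) is what linearRegression does.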

The hypothesis function can be generalized to the multivariable form:
\(h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + ... +\theta_n x_n\)
\(h_\theta(x) =\begin{bmatrix}\theta_0 \hspace{2em} \theta_1 \hspace{2em} ... \hspace{2em} \theta_n\end{bmatrix}\begin{bmatrix}x_0 \newline x_1 \newline \vdots \newline x_n\end{bmatrix}= \theta^T x\)
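
Here \(x_0\) is conventionally fixed at 1 so that \(\theta_0\) plays the role of the intercept. As a rough sketch of what this could look like in the same style as the earlier code (the function name and list-based representation are my own, not from the original post):

-- theta^T x as a dot product over plain lists,
-- assuming the caller prepends x0 = 1 to the feature list
hOfXMulti :: [Float] -> [Float] -> Float
hOfXMulti thetas xs = sum (zipWith (*) thetas xs)

For instance, hOfXMulti [1.0, 2.0] [1.0, 3.0] reproduces the simple case \(h(3) = 1 + 2 \cdot 3 = 7\).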

  • Normal Equation (a closed-form alternative to gradient descent for finding \(\theta\))
  • Regression Problem
  • The best way to understand this is to plot a set of points, then manually run through the algorithm and come up with a line that fits the data.

References:

  • https://onlinecourses.science.psu.edu/stat501/node/251/
  • http://mccormickml.com/2014/03/04/gradient-descent-derivation/
  • https://samcgardner.github.io/2018/10/06/linear-regression-in-haskell.html
  • http://www.chioka.in/differences-between-l1-and-l2-as-loss-function-and-regularization/