Prerequisites
- Calculus
- Linear Algebra
- Haskell
- Loss Function
Idea
Linear regression is a simple statistical model used to study the relationship between continuous variables. It is categorized as simple or multiple depending on the number of predictor variables in the linear equation: with a single predictor variable it is simple linear regression, otherwise it is multiple linear regression. Say we have the function \(h(x) = \theta_1 x + \theta_0\); the goal of the regression model is to find values for \(\theta_1\) and \(\theta_0\) such that the cost function (described below) is minimized and the line fits the data. For simplicity we will focus on simple linear regression and learn the related concepts; note that the intuition and concepts hold true for the multiple linear regression model as well.
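For example, with \(\theta_0 = 1\) and \(\theta_1 = 2\) (values picked purely for illustration), the hypothesis predicts \(h(3) = 2 \cdot 3 + 1 = 7\) for the input \(x = 3\).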
Code
module Regression where

import Data.List
import Types

-- calculate the hypothesis h(x) = theta1 * x + theta0 for a single point
hOfX :: Coefficients -> Point -> Float
hOfX thetas point = t0 + t1 * x
  where
    Coefficients (t0, t1) = thetas
    Point (x, _) = point

-- difference between the prediction h(x) and the actual value y,
-- i.e. the term summed in the theta0 update
deltaT0 :: Coefficients -> Point -> Float
deltaT0 thetas point = hOfX thetas point - y
  where
    Point (_, y) = point

-- the same difference scaled by x, i.e. the term summed in the theta1 update
deltaT1 :: Coefficients -> Point -> Float
deltaT1 thetas point = deltaT0 thetas point * x
  where
    Point (x, _) = point

-- find the new theta values by applying one gradient descent step on the cost function
newThetas :: DataSet -> Coefficients -> Float -> Coefficients
newThetas dataset theta alpha =
  let newt0 = t0 - alpha * m_reciprocal * (sum $ map (deltaT0 theta) training_data)
      newt1 = t1 - alpha * m_reciprocal * (sum $ map (deltaT1 theta) training_data)
  in Coefficients (newt0, newt1)
  where
    Coefficients (t0, t1) = theta
    DataSet training_data = dataset
    m_reciprocal = 1 / genericLength training_data

-- repeat the gradient descent step for the given number of iterations
linearRegression :: DataSet -> Coefficients -> Float -> Int -> Coefficients
linearRegression dataset theta alpha iteration
  | iteration == 0 = theta
  | otherwise =
      let newtheta = newThetas dataset theta alpha
      in linearRegression dataset newtheta alpha (iteration - 1)
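The code imports a Types module that is not shown. Below is a minimal sketch of what it might contain, inferred from how Coefficients, Point, and DataSet are pattern-matched above; the real definitions may differ.

module Types where

-- Assumed definitions, inferred from the pattern matches in Regression.hs;
-- the actual Types module may differ.
newtype Coefficients = Coefficients (Float, Float) deriving (Show) -- (theta0, theta1)
newtype Point = Point (Float, Float) deriving (Show)               -- (x, y)
newtype DataSet = DataSet [Point] deriving (Show)

With these assumed definitions in scope together with the Regression module, running the regression on a few points that lie exactly on the line \(y = 2x + 1\) could look like this; with learning rate 0.1 and enough iterations it should converge close to Coefficients (1.0, 2.0):

-- Example run (assumes the Types sketch above): five points on the
-- line y = 2x + 1, starting thetas (0, 0), learning rate 0.1, 5000 iterations.
example :: Coefficients
example = linearRegression dataset (Coefficients (0, 0)) 0.1 5000
  where
    dataset = DataSet [Point (x, 2 * x + 1) | x <- [0 .. 4]]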
Concepts
Cost Function
Say we have live data in the form of Cartesian values (x, y), where x denotes the horsepower of a motorcycle and y denotes its cost. Now we want a line that best describes the data, and linear regression gives us one. But how do we know that the line it finds is the best fit? The cost function comes to the rescue: it acts as a scale that tells us how close or far the predicted line is from the data. So the linear regression algorithm tries to minimize the value of the cost function in order to get the best fit.
The terms loss function and cost function are used interchangeably; the cost function steers the model by quantifying how good or bad the predicted function is. In machine learning, the two well-known loss functions listed below are used in many models:
- Least Absolute Deviations (L1)
- Least Squares Error (LSE) (L2)
In linear regression we will be using the L2 norm as the loss function; a small sketch comparing the two losses is given below, and you can read about cost functions in more detail in the references at the end.
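As a small, self-contained sketch of the difference (the function names are made up for illustration), both losses over a list of (predicted, actual) pairs look like this:

-- Sketch only: L1 and L2 loss over (predicted, actual) pairs.
l1Loss :: [(Float, Float)] -> Float
l1Loss pairs = sum [abs (p - y) | (p, y) <- pairs]

l2Loss :: [(Float, Float)] -> Float
l2Loss pairs = sum [(p - y) ^ 2 | (p, y) <- pairs]

For the pairs [(1.0, 1.5), (2.0, 0.5)] this gives an L1 loss of 0.5 + 1.5 = 2.0 and an L2 loss of 0.25 + 2.25 = 2.5; the squaring makes L2 penalize large deviations more heavily.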
Least Squares Error (LSE) (L2)
The L2 loss function is \(S = \sum_{i=1}^{m} (y_i - h(x_i))^2\), where \(y_i\) is the actual value and \(h(x_i)\) is the predicted value.
Here we will represent the cost function as
\(J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (h(x_i) - y_i)^2\)
where \(m\) is the number of training examples; the extra factor of 2 in the denominator is there only so that it cancels when we differentiate below.
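As a rough sketch of how this formula maps onto the code, a hypothetical cost helper (not part of the Regression module shown above, and assuming it lives alongside it so that hOfX, genericLength, and the assumed Types are in scope) could be written as:

-- Hypothetical helper: J(theta0, theta1) as defined above,
-- dividing the summed squared error by 2m, where m is the number of points.
cost :: Coefficients -> DataSet -> Float
cost thetas (DataSet points) =
  sum [(hOfX thetas p - y) ^ 2 | p@(Point (_, y)) <- points]
    / (2 * genericLength points)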
The next step is to apply a mathematical optimization technique to our cost function. Optimization is a separate field of study, applied in many areas to find the maximum or minimum of a function.
Gradient Descent
Solve for \(\theta_0 :\)
\(\frac{\partial }{\partial \theta_0} J(\theta_0, \theta_1) = \frac{\partial }{\partial \theta_0} \left( \frac{1}{2m} \sum_{i=1}^{m} (h(x_i) - y_i)^2\right)\)
\(\frac{\partial }{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} (h(x_i) - y_i) \cdot 1\)
Solve for \(\theta_1 :\)
\(\frac{\partial }{\partial \theta_1} J(\theta_0, \theta_1) = \frac{\partial }{\partial \theta_1} \left( \frac{1}{2m} \sum_{i=1}^{m} (h(x_i) - y_i)^2\right)\)
\(\frac{\partial }{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} (h(x_i) - y_i) \cdot x_i\)
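These derivatives give the gradient descent update rule, which is exactly what the newThetas function implements with learning rate \(\alpha\) (both thetas are updated simultaneously, using the old values):
\(\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} (h(x_i) - y_i)\)
\(\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} (h(x_i) - y_i) \, x_i\)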
For multiple linear regression, the hypothesis function generalizes to the multivariable form:
\(h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + ... +\theta_n x_n\)
\(h_\theta(x) =\begin{bmatrix}\theta_0 \hspace{2em} \theta_1 \hspace{2em} ... \hspace{2em} \theta_n\end{bmatrix}\begin{bmatrix}x_0 \newline x_1 \newline \vdots \newline x_n\end{bmatrix}= \theta^T x\)
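With the usual convention \(x_0 = 1\), this is just a dot product of the theta vector with the feature vector. A tiny Haskell sketch (the name hMulti is hypothetical, not part of the code above) could be:

-- Sketch of the multivariable hypothesis h_theta(x) = theta^T x;
-- the feature list is assumed to start with x0 = 1.
hMulti :: [Float] -> [Float] -> Float
hMulti thetas xs = sum (zipWith (*) thetas xs)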
- Normal Equation
- Regression Problem
- The best way to understand this is to plot a set of points, work through the algorithm by hand, and come up with a line that fits the data.
References:
- https://onlinecourses.science.psu.edu/stat501/node/251/
- http://mccormickml.com/2014/03/04/gradient-descent-derivation/
- https://samcgardner.github.io/2018/10/06/linear-regression-in-haskell.html
- http://www.chioka.in/differences-between-l1-and-l2-as-loss-function-and-regularization/