Linear regression is a supervised learning algorithm used to predict a real-valued output y based on the input features x.
We denote x as the feature vector and b as the bias. We want to find the weight vector w and the bias b that best fit the model:
$\hat{y}_i = w^T x_i + b = w_1 x_{i1} + w_2 x_{i2} + \dots + b$
In the formula, w is the vector of regression coefficients, which quantify how much each feature impacts the outcome. The bias b shows how far the regression line sits away from the origin, i.e., the predicted outcome when all features are 0. We can use gradient descent to solve for w and b.
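As a quick illustration with made-up numbers (the weights and features below are purely hypothetical), the prediction is just a dot product of weights and features plus the bias:
import numpy as np

# hypothetical weights, bias, and one feature vector, for illustration only
w = np.array([2.0, -1.0, 0.5])
b = 3.0
x = np.array([1.0, 4.0, 2.0])

y_hat = w.dot(x) + b  # 2*1 - 1*4 + 0.5*2 + 3 = 2.0
print(y_hat)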
Gradient Descent
The idea of gradient descent is as follows: we initialize values for w and b in the first iteration, then we gradually learn from the cost at each step and adjust the parameters to minimize the deviation. Below are the building blocks of a linear regressor:
Cost Function: We can use mean squared error (MSE) as the cost function: $J(w,b) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$. The cost function measures how "off" the model predictions are. $y_i$ is the observed value of observation i, $\hat{y}_i$ is the predicted value for observation i, and n is the number of observations in the dataset. The objective is to find values of w and b that minimize J(w,b).
Partial Derivative: The gradient is the vector of partial derivatives of the cost function with respect to each parameter; it tells us how the cost changes as each parameter changes (the exact gradients for MSE are written out after this list).
Stopping Criteria: At each iteration, we compute the gradient at the current position, scale it by a learning rate, and subtract the result from the current position (i.e., take a step). We subtract because we want to minimize the cost function. We stop when the gradient approaches zero (a global or local minimum), or after a fixed number of iterations.
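For the MSE cost above, the gradients work out to $\frac{\partial J}{\partial w_j} = -\frac{2}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)\,x_{ij}$ and $\frac{\partial J}{\partial b} = -\frac{2}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)$, and each step updates $w \leftarrow w - \eta\,\frac{\partial J}{\partial w}$ and $b \leftarrow b - \eta\,\frac{\partial J}{\partial b}$, where $\eta$ is the learning rate. These are exactly the expressions used in the from-scratch implementation further down.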
# SGDRegressor is a stochastic gradient descent regressor inside Scikit-Learn.
# Instead of using all observations to compute each gradient, SGD uses one randomly
# chosen observation per update, thus speeding up the learning process.
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDRegressor

# load the boston house pricing dataset for demonstration
X, y = datasets.load_boston(return_X_y=True)

# use 80% for training and 20% for test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)

# create a new model
lr = SGDRegressor(learning_rate="optimal", max_iter=10000)

# fit model to training data
lr.fit(X_train, y_train)

# use fitted model to predict test data
y_pred = lr.predict(X_test)
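One simple way to check the fit is to score the predictions, for example with Scikit-Learn's mean_squared_error (SGD is sensitive to feature scale, so standardizing X first usually improves the result):
from sklearn.metrics import mean_squared_error

# compare predictions against the held-out test labels
print("test MSE:", mean_squared_error(y_test, y_pred))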
Ordinary Least Squares (OLS)
Instead of gradient descent, OLS is another method used to fit a linear regression. The idea is the same: minimize the prediction error.
Formula: $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip}$, where p is the number of features. Therefore, we want to minimize the residual error $\lVert y - X\beta \rVert^2$. We present the result directly here: $\hat{\beta} = (X^T X)^{-1} X^T y$.
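A brief sketch of where this result comes from: setting the gradient of $\lVert y - X\beta \rVert^2$ with respect to $\beta$ to zero gives the normal equations $X^T X \beta = X^T y$, and solving (assuming $X^T X$ is invertible) yields $\hat{\beta} = (X^T X)^{-1} X^T y$. Here X carries a leading column of ones so that the intercept $\beta_0$ is estimated along with the other coefficients, which is exactly what the from-scratch OLS code below does.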
# Gradient descent code from scratch, without Scikit-Learn.
import numpy as np

class LinearRegression:
    def __init__(self, lr=0.01, epoch=10):
        self.lr = lr        # hyperparameter: learning rate
        self.epoch = epoch  # hyperparameter: number of passes over the data

    def fit(self, X, y):
        self.n_obs, self.n_feature = X.shape
        # weight initialization
        self.W = np.zeros(self.n_feature)
        self.b = 0
        self.X = X
        self.Y = y
        # gradient descent learning steps
        for i in range(self.epoch):
            self.update_weights()
        return self

    def update_weights(self):
        Y_pred = self.predict(self.X)
        # calculate gradients of the MSE cost
        partial_W = -(2 * (self.X.T).dot(self.Y - Y_pred)) / self.n_obs
        partial_b = -2 * np.sum(self.Y - Y_pred) / self.n_obs
        # update weights by subtracting partial derivative times learning rate
        self.W = self.W - self.lr * partial_W
        self.b = self.b - self.lr * partial_b
        return self

    def predict(self, X):
        return X.dot(self.W) + self.b
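A minimal usage sketch for the class above, on a small synthetic dataset with known weights (the numbers are made up purely for illustration; gradient descent converges more reliably when features are on a similar scale, so a toy problem keeps the example self-contained):
# generate a small synthetic regression problem with known weights and bias
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = X_toy.dot(np.array([3.0, -2.0])) + 1.0

model = LinearRegression(lr=0.1, epoch=1000)
model.fit(X_toy, y_toy)
print(model.W, model.b)  # should be close to [3, -2] and 1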
# OLS with Scikit-Learn
from sklearn.linear_model import LinearRegression
from sklearn import datasets
from sklearn.model_selection import train_test_split
# load the boston house pricing dataset for demonstration
X, y = datasets.load_boston(return_X_y=True)
# use 80% for training and 20% for test
X_train, X_test, y_train, y_test = train_test_split(
X, y, train_size=0.8, random_state=42)
# LinearRegression uses the OLS method by default
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
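After fitting, Scikit-Learn exposes the learned parameters as attributes on the estimator; coef_ and intercept_ play the role of the w and b discussed earlier:
print(model.coef_)       # fitted weight vector w
print(model.intercept_)  # fitted bias b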
# OLS code from scratch, without Scikit-Learn.
import numpy as np

class LinearRegression:
    def __init__(self):
        # no hyperparameters
        self.w = None
        self.b = None

    def fit(self, X, y):
        self.X = X
        self.y = y
        # prepend a column of ones so the intercept is estimated as beta_0
        X = np.concatenate((np.ones((X.shape[0], 1)), X), axis=1)
        # implement the closed-form formula: beta = (X^T X)^-1 X^T y
        betas = np.linalg.inv(X.transpose().dot(X)).dot(X.transpose()).dot(y)
        self.b = betas[0]
        self.w = betas[1:]
        return self

    def predict(self, X):
        return X.dot(self.w) + self.b
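Again, a quick toy check with made-up data (hypothetical coefficients chosen for illustration); on the same inputs this closed-form solution should recover essentially the same parameters as Scikit-Learn's LinearRegression:
# noiseless synthetic data, so the exact coefficients should be recovered
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy.dot(np.array([1.5, 0.0, -4.0])) + 2.0

ols = LinearRegression()
ols.fit(X_toy, y_toy)
print(ols.w, ols.b)  # should be close to [1.5, 0, -4] and 2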