💡Linear Regression

Linear regression is a supervised learning algorithm. It is used to predict a real-valued output y based on the input features x.

We denote x as the feature vector and b as the bias. Here, we want to find the weight vector w and bias b that best fit the model.

\hat{y_i} = \mathbf{w}^T \mathbf{x}_i + b = w_1 x_{i1} + w_2 x_{i2} + ... + b

In the formula, w is the vector of regression coefficients that quantify how much each feature impacts the outcome. The bias b shows how far the regression line is offset from the origin, i.e., the predicted outcome when all features are 0. Here, we can use gradient descent to solve for w and b.
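As a quick sanity check of this formula, here is a minimal sketch (toy numbers of my own, not from any dataset used below) of computing a single prediction as a dot product plus the bias:

import numpy as np

# hypothetical fitted parameters for a model with 3 features
w = np.array([2.0, -1.0, 0.5])
b = 4.0

# one observation with 3 feature values (made-up numbers)
x = np.array([1.0, 3.0, 2.0])

# y_hat = w^T x + b = 2*1 - 1*3 + 0.5*2 + 4 = 4.0
y_hat = np.dot(w, x) + b
print(y_hat)  # 4.0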

Gradient Descent

The idea of gradient descent is as follows: we initialize values for w and b in the first iteration, then gradually learn from the cost and try to minimize the deviation. Below are the building blocks of a linear regressor:

  • Cost Function: We can use mean squared error (MSE) as the cost function: J(\mathbf{w}, b) = \frac{\sum_{i=1}^{n} (y_i - \hat{y_i})^2}{n}. The cost function measures how “off” the model's predictions are. Here, y_i is the observed value of observation i, \hat{y_i} is the predicted value for observation i, and n is the number of observations in the dataset. The objective is to find values of \mathbf{w} and b that minimize J(\mathbf{w}, b).

  • Partial Derivative: The gradient collects the partial derivatives of the cost function with respect to each parameter:

    • \frac{\partial J(\mathbf{w}, b)}{\partial \mathbf{w}} = \frac{-2 \mathbf{X}^T \cdot (\mathbf{y}-\hat{\mathbf{y}})}{n}

    • \frac{\partial J(\mathbf{w}, b)}{\partial b} = \frac{-2 \sum_{i=1}^{n} (y_i-\hat{y_i})}{n}

  • Stopping Criteria: In each iteration, we compute the gradient at the current position, scale it by a learning rate, and subtract the obtained value from the current parameters (take a step). We subtract because we want to minimize the cost function. We stop when the gradient approaches zero (at a global or local minimum), or after a fixed number of iterations. A short sketch of one such loop follows this list.
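Putting these building blocks together, here is a minimal sketch of the full loop on a tiny made-up dataset; the data, learning rate, and tolerance are assumptions chosen only for illustration:

import numpy as np

# toy data: 4 observations, 2 features (made-up numbers)
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([5.0, 4.0, 11.0, 10.0])

w, b, lr = np.zeros(2), 0.0, 0.01
n = X.shape[0]

for i in range(10000):
    y_pred = X.dot(w) + b                      # current predictions
    cost = np.mean((y - y_pred) ** 2)          # MSE cost J(w, b), could be logged to monitor convergence
    grad_w = -2 * X.T.dot(y - y_pred) / n      # partial derivative w.r.t. w
    grad_b = -2 * np.sum(y - y_pred) / n       # partial derivative w.r.t. b
    # stopping criterion: gradient (almost) zero
    if np.sqrt(np.sum(grad_w ** 2) + grad_b ** 2) < 1e-6:
        break
    w, b = w - lr * grad_w, b - lr * grad_b    # take a step

print(w, b)  # approaches [1, 2] and 0 for this exactly linear toy data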

# SGDRegressor is a stochastic gradient descent implementation inside Scikit-Learn.
# Instead of using all observations for every update, SGD uses one observation
# at a time to compute gradients, thus speeding up the learning process.
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor
# load the California housing dataset for demonstration
# (load_boston was removed from recent scikit-learn versions)
X, y = datasets.fetch_california_housing(return_X_y=True)

# use 80% for training and 20% for test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=42)

# SGD is sensitive to feature scale, so standardize the features first
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# create a new model
lr = SGDRegressor(learning_rate="optimal", max_iter=10000)
# fit model to training data
lr.fit(X_train, y_train)
# use fitted model to predict test data
y_pred = lr.predict(X_test)
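To see how well the fitted model does on the held-out data, one simple check (my addition, not part of the original snippet) is scikit-learn's mean_squared_error, the same quantity the MSE cost measures on the training set:

from sklearn.metrics import mean_squared_error

# compare predictions with the true test labels
print("test MSE:", mean_squared_error(y_test, y_pred))
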
# Gradient descent code from scratch. No use of Scikit-Learn.
import numpy as np

class LinearRegression:
    def __init__(self, lr=0.01, epoch=10):
        self.lr=lr # set hyperparameter learning rate
        self.epoch=epoch # set hyperparameter epoch
    
    def fit(self, X, y):
        self.n_obs, self.n_feature=X.shape
        # weight initialization
        self.W=np.zeros(self.n_feature)
        self.b=0
        self.X=X
        self.Y=y
        
        # gradient descent learning step
        for i in range(self.epoch):
            self.update_weights()
        return self
    
    def update_weights(self):
        Y_pred=self.predict(self.X)
        #calculate gradients
        partial_W=-(2*(self.X.T).dot(self.Y-Y_pred))/self.n_obs
        partial_b=-2*np.sum(self.Y-Y_pred)/self.n_obs
        #update weights by subtracting partial derivative times learning rate
        self.W=self.W-self.lr*partial_W
        self.b=self.b-self.lr*partial_b
        return self
        
    def predict(self, X):
        return X.dot(self.W)+self.b
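Below is a brief usage sketch for the class above; the synthetic data, learning rate, and epoch count are my own choices for illustration:

import numpy as np

# synthetic data following y = 3*x1 - 2*x2 + 1 plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + 1 + rng.normal(scale=0.1, size=200)

model = LinearRegression(lr=0.05, epoch=1000)
model.fit(X, y)
print(model.W, model.b)  # should be close to [3, -2] and 1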

Code snippets are adapted from GeeksforGeeks.

Ordinary Least Squares (OLS)

OLS is another method used in linear regression, instead of gradient descent. The idea is the same: minimize the prediction error.

Formula: y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + ... + \beta_n x_{in}. Therefore, we want to minimize the residual error \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2. Setting its gradient with respect to \boldsymbol{\beta} to zero gives the normal equations \mathbf{X}^T \mathbf{X} \boldsymbol{\beta} = \mathbf{X}^T \mathbf{y}, so we present the result directly here: \hat{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}

# OLS with Scikit-Learn
from sklearn.linear_model import LinearRegression
from sklearn import datasets
from sklearn.model_selection import train_test_split

# load the California housing dataset for demonstration
# (load_boston was removed from recent scikit-learn versions)
X, y = datasets.fetch_california_housing(return_X_y=True)

# use 80% for training and 20% for test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=42)

# LinearRegression uses the OLS method by default
model=LinearRegression()
model.fit(X_train, y_train)

y_pred=model.predict(X_test)
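As a small addition, the fitted coef_ and intercept_ attributes of LinearRegression correspond to the weight vector w and bias b in the earlier notation:

# regression coefficients (w) and bias (b) estimated by OLS
print("w:", model.coef_)
print("b:", model.intercept_)
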
# OLS implemented from scratch with NumPy. No use of Scikit-Learn.
import numpy as np
import copy

class LinearRegression:
    def __init__(self):
        #no hyperparameter
        self.w=None
        self.b=None
    
    def fit(self, X, y):
        self.X=X
        self.y=y
        X=copy.deepcopy(X)
        # prepend a column of ones so the first coefficient acts as the bias
        X=np.concatenate((np.ones((X.shape[0],1)),X),axis=1)
        # implement the formula
        betas=np.linalg.inv(X.transpose().dot(X)).dot(X.transpose()).dot(y)
        
        self.b=betas[0]
        self.w=betas[1:]
        return self
    
    def predict(self,X):
        return X.dot(self.w)+self.b
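A brief usage sketch for this class on made-up data (my own example); note that np.linalg.lstsq solves the same least-squares problem more stably than explicitly inverting X^T X:

import numpy as np

# toy data generated exactly from y = 2*x1 + 3*x2 + 5
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 5.0]])
y = 2 * X[:, 0] + 3 * X[:, 1] + 5

ols = LinearRegression()
ols.fit(X, y)
print(ols.w, ols.b)  # recovers approximately [2, 3] and 5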

Code snippets are adapted from IBM Developer.
