💡Linear Regression

Linear regression is a supervised learning algorithm. It is used to predict a real-valued output y based on the input features x.

We denote x as the feature vector and b as the bias. Here, we want to find the weight vector w and bias b that best fit the model.

\hat{y_i} = \mathbf{w}^T \mathbf{x}_i + b = w_1 x_{i1} + w_2 x_{i2} + ... + b

In the formula, w is the vector of regression coefficients that quantify how much each feature impacts the outcome. The bias b shows how far the regression line is offset from the origin, i.e., the predicted outcome when all features are 0. Here, we can use gradient descent to solve for w and b.
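As a quick sanity check of this formula, here is a minimal sketch (toy numbers of my own, not from any dataset used below) of computing a single prediction as a dot product plus the bias:

import numpy as np

# hypothetical fitted parameters for a model with 3 features
w = np.array([2.0, -1.0, 0.5])
b = 4.0

# one observation with 3 feature values (made-up numbers)
x = np.array([1.0, 3.0, 2.0])

# y_hat = w^T x + b = 2*1 - 1*3 + 0.5*2 + 4 = 4.0
y_hat = np.dot(w, x) + b
print(y_hat)  # 4.0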

Gradient Descent

The idea of gradient descent is as follows: we initialize values for w and b in the first iteration, then gradually learn from the cost and try to minimize the deviation. Below are the building blocks of a linear regressor:

  • Cost Function: We can use mean squared error (MSE) as the cost function: J(\mathbf{w}, b) = \frac{\sum_{i=1}^{n} (y_i - \hat{y_i})^2}{n}. The cost function measures how “off” the model's predictions are. Here, y_i is the observed value of observation i, \hat{y_i} is the predicted value for observation i, and n is the number of observations in the dataset. The objective is to find values of \mathbf{w} and b that minimize J(\mathbf{w}, b).

  • Partial Derivative: The gradient collects the partial derivatives of the cost function with respect to each parameter:

    • \frac{\partial J(\mathbf{w}, b)}{\partial \mathbf{w}} = \frac{-2 \mathbf{X}^T \cdot (\mathbf{y}-\hat{\mathbf{y}})}{n}

    • \frac{\partial J(\mathbf{w}, b)}{\partial b} = \frac{-2 \sum_{i=1}^{n} (y_i-\hat{y_i})}{n}

  • Stopping Criteria: In each iteration, we compute the gradient at the current position, scale it by a learning rate, and subtract the obtained value from the current parameters (take a step). We subtract because we want to minimize the cost function. We stop when the gradient approaches zero (at a global or local minimum), or after a fixed number of iterations. A short sketch of one such loop follows this list.
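Putting these building blocks together, here is a minimal sketch of the full loop on a tiny made-up dataset; the data, learning rate, and tolerance are assumptions chosen only for illustration:

import numpy as np

# toy data: 4 observations, 2 features (made-up numbers)
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([5.0, 4.0, 11.0, 10.0])

w, b, lr = np.zeros(2), 0.0, 0.01
n = X.shape[0]

for i in range(10000):
    y_pred = X.dot(w) + b                      # current predictions
    cost = np.mean((y - y_pred) ** 2)          # MSE cost J(w, b), could be logged to monitor convergence
    grad_w = -2 * X.T.dot(y - y_pred) / n      # partial derivative w.r.t. w
    grad_b = -2 * np.sum(y - y_pred) / n       # partial derivative w.r.t. b
    # stopping criterion: gradient (almost) zero
    if np.sqrt(np.sum(grad_w ** 2) + grad_b ** 2) < 1e-6:
        break
    w, b = w - lr * grad_w, b - lr * grad_b    # take a step

print(w, b)  # approaches [1, 2] and 0 for this exactly linear toy data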

# SGDRegressor is a stochastic gradient descent implementation inside Scikit-Learn.
# Instead of using all observations for every update, SGD uses one observation
# at a time to compute gradients, thus speeding up the learning process.
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor
# load the California housing dataset for demonstration
# (load_boston was removed from recent scikit-learn versions)
X, y = datasets.fetch_california_housing(return_X_y=True)

# use 80% for training and 20% for test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=42)

# SGD is sensitive to feature scale, so standardize the features first
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# create a new model
lr = SGDRegressor(learning_rate="optimal", max_iter=10000)
# fit model to training data
lr.fit(X_train, y_train)
# use fitted model to predict test data
y_pred = lr.predict(X_test)
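To see how well the fitted model does on the held-out data, one simple check (my addition, not part of the original snippet) is scikit-learn's mean_squared_error, the same quantity the MSE cost measures on the training set:

from sklearn.metrics import mean_squared_error

# compare predictions with the true test labels
print("test MSE:", mean_squared_error(y_test, y_pred))
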
# Gradient descent code from scratch. No use of Scikit-Learn.
import numpy as np

class LinearRegression:
    def __init__(self, lr=0.01, epoch=10):
        self.lr=lr # set hyperparameter learning rate
        self.epoch=epoch # set hyperparameter epoch
    
    def fit(self, X, y):
        self.n_obs, self.n_feature=X.shape
        # weight initialization
        self.W=np.zeros(self.n_feature)
        self.b=0
        self.X=X
        self.Y=y
        
        # gradient descent learning step
        for i in range(self.epoch):
            self.update_weights()
        return self
    
    def update_weights(self):
        Y_pred=self.predict(self.X)
        #calculate gradients
        partial_W=-(2*(self.X.T).dot(self.Y-Y_pred))/self.n_obs
        partial_b=-2*np.sum(self.Y-Y_pred)/self.n_obs
        #update weights by subtracting partial derivative times learning rate
        self.W=self.W-self.lr*partial_W
        self.b=self.b-self.lr*partial_b
        return self
        
    def predict(self, X):
        return X.dot(self.W)+self.b
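Below is a brief usage sketch for the class above; the synthetic data, learning rate, and epoch count are my own choices for illustration:

import numpy as np

# synthetic data following y = 3*x1 - 2*x2 + 1 plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + 1 + rng.normal(scale=0.1, size=200)

model = LinearRegression(lr=0.05, epoch=1000)
model.fit(X, y)
print(model.W, model.b)  # should be close to [3, -2] and 1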

Code snippets are adapted from GeeksforGeeks.

Ordinary Least Squares (OLS)

OLS is another method used in linear regression, instead of gradient descent. The idea is the same: minimize the prediction error.

Formula: y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + ... + \beta_n x_{in}. Therefore, we want to minimize the residual error \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2. Setting its gradient with respect to \boldsymbol{\beta} to zero gives the normal equations \mathbf{X}^T \mathbf{X} \boldsymbol{\beta} = \mathbf{X}^T \mathbf{y}, so we present the result directly here: \hat{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}

# OLS with Scikit-Learn
from sklearn.linear_model import LinearRegression
from sklearn import datasets
from sklearn.model_selection import train_test_split

# load the California housing dataset for demonstration
# (load_boston was removed from recent scikit-learn versions)
X, y = datasets.fetch_california_housing(return_X_y=True)

# use 80% for training and 20% for test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=42)

# LinearRegression uses the OLS method by default
model=LinearRegression()
model.fit(X_train, y_train)

y_pred=model.predict(X_test)
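As a small addition, the fitted coef_ and intercept_ attributes of LinearRegression correspond to the weight vector w and bias b in the earlier notation:

# regression coefficients (w) and bias (b) estimated by OLS
print("w:", model.coef_)
print("b:", model.intercept_)
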
# OLS implemented from scratch with NumPy. No use of Scikit-Learn.
import numpy as np
import copy

class LinearRegression:
    def __init__(self):
        #no hyperparameter
        self.w=None
        self.b=None
    
    def fit(self, X, y):
        self.X=X
        self.y=y
        X=copy.deepcopy(X)
        # prepend a column of ones so the first coefficient acts as the bias
        X=np.concatenate((np.ones((X.shape[0],1)),X),axis=1)
        # implement the formula
        betas=np.linalg.inv(X.transpose().dot(X)).dot(X.transpose()).dot(y)
        
        self.b=betas[0]
        self.w=betas[1:]
        return self
    
    def predict(self,X):
        return X.dot(self.w)+self.b
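A brief usage sketch for this class on made-up data (my own example); note that np.linalg.lstsq solves the same least-squares problem more stably than explicitly inverting X^T X:

import numpy as np

# toy data generated exactly from y = 2*x1 + 3*x2 + 5
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 5.0]])
y = 2 * X[:, 0] + 3 * X[:, 1] + 5

ols = LinearRegression()
ols.fit(X, y)
print(ols.w, ols.b)  # recovers approximately [2, 3] and 5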

Code snippets are adapted from IBM Developer.
