Simple linear regression vs multiple linear regression

If you’re learning about regression, or just starting out in machine learning, you might be wondering how simple linear regression differs from multiple linear regression. This post explains the difference, shows when to use each, and walks through implementing both in Python.

Simple Linear Regression is used to predict continuous outputs where there is a linear relationship between one feature and the output variable. An example could be predicting the price of a house using only the median income in the area.

Multiple Linear Regression is used to predict continuous outputs where there is a linear relationship between more than one feature and the output variable. An example would be predicting the price of a house using the median income in the area and the number of rooms in the house.

There are nuances to consider with both simple linear regression and multiple linear regression, and a number of things you can do to get them to perform better, which I’ll cover below.

Simple linear regression

When to use it

Simple linear regression can be used when the independent variable (the factor you are using to predict with) has a linear relationship with the output variable (what you want to predict).

So, the equation between the independent variable (the X value) and the output variable (the Y value) is of the form Y = θ0 + θ1X1 (linear), and it is not of the form Y = θ0 + θ1e^X1 or Y = θ0 + θ1X1X2 (non-linear).

If there is a non-linear relationship, you could potentially use polynomial regression, which adapts linear regression to handle non-linear relationships between the features and the output, or you could use an entirely different algorithm such as decision trees.

Examples could be predicting the price of a house based on the median income in the area, the number of expected sales on a particular day based on the temperature, or the number of tickets that will be sold based on the price.

You can read more about when linear regression is appropriate in this post.


How to optimize linear regression

Linear regression can be prone to underfitting the data. If you build a model using linear regression and find that both the training accuracy and the test accuracy are low, this is likely due to underfitting. In this case, it would likely help to switch to polynomial regression, which expands the features into polynomial terms (powers of the features and their interactions) up to some degree n. To do this, you can use the PolynomialFeatures class from sklearn, as sketched below.
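Here is a minimal sketch of that pattern; the degree of 2 is an arbitrary illustrative choice, and the toy data is made up just for the example:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Toy data with a quadratic (non-linear) relationship
X = np.arange(0, 5, 0.1).reshape(-1, 1)
y = 2 + 3 * X**2 + np.random.rand(len(X), 1)

# Expand the features into degree-2 polynomial terms, then fit a linear model
poly_reg = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LinearRegression())
poly_reg.fit(X, y)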

If you find that the test accuracy is much lower than the training accuracy, then you can use regularization to reduce the variance (overfitting) of the model. To do this you can use Ridge Regression, Lasso Regression, Elastic-net, or early stopping.
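The first three are available in sklearn and are used in exactly the same way as LinearRegression. A minimal sketch; the alpha values here are arbitrary and would normally be tuned, for example with cross-validation:

from sklearn.linear_model import Ridge, Lasso, ElasticNet

# alpha controls the regularization strength: larger alpha means more shrinkage
ridge = Ridge(alpha=1.0)                       # L2 penalty
lasso = Lasso(alpha=0.1)                       # L1 penalty (can zero out coefficients)
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)  # mix of L1 and L2

# Each is used like LinearRegression, e.g. ridge.fit(X_train, y_train)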

Is feature scaling required?

Feature scaling is not generally required for linear, multiple, or polynomial regression. However, there are some reasons why you might want to scale and normalize the data, which are explained in this StackExchange question.
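If you do decide to scale, one common pattern is to put a scaler in front of the model in a pipeline, so the scaling parameters are learned from the training data only. A minimal sketch; StandardScaler is just one possible choice of scaler:

from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize each feature to zero mean and unit variance before fitting
scaled_reg = make_pipeline(StandardScaler(), LinearRegression())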

Alternative algorithms to linear regression

  • Polynomial regression
  • Decision tree regression
  • Random forest regression
  • Support vector regression

How to implement it using Python

Here is how to implement simple linear regression in Python using sklearn:

import numpy as np
import matplotlib.pyplot as plt

# Generate toy data: y = 4 + 5x plus uniform random noise
X_values = np.arange(0, 5.1, 0.1).reshape(-1, 1)
Y_values = 4 + 5 * X_values + 8 * np.random.rand(51, 1)

# Plot the raw data
plt.scatter(X_values, Y_values)
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.show()

from sklearn.linear_model import LinearRegression

# Fit an ordinary least squares model to the data
reg = LinearRegression().fit(X_values, Y_values)

reg.intercept_, reg.coef_

(array([7.37957952]), array([[4.78145655]]))

This represents the equation Y_hat = 7.379 + 4.78*X. (Your exact numbers will differ slightly because the noise is randomly generated.)

# Plot the fitted regression line over the data
plt.scatter(X_values, Y_values)
plt.plot(X_values, reg.predict(X_values), color="r")
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.show()
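Once fitted, the model can also predict for new inputs. Note that sklearn expects a 2-D array even for a single value; the 2.0 here is just an illustrative input:

reg.predict(np.array([[2.0]]))  # approximately intercept_ + coef_ * 2.0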

Multiple linear regression

When to use it

Multiple linear regression can be used when multiple independent variables (the factors you are using to predict with) each have a linear relationship with the output variable (what you want to predict).

So, the equation between the independent variables (the X values) and the output variable (the Y value) is of the form Y = θ0 + θ1X1 + θ2X2 + … + θnXn (linear), and it is not of the form Y = θ0 + θ1e^X1 + … or Y = θ0 + θ1X1X2 + … (non-linear).

Below are some possible examples where multiple linear regression might be appropriate:

  1. Predicting the price of a house based on the median income in the area and the number of rooms in the house.
  2. Predicting the number of ice cream sales based on the price and the temperature.


How to optimize multiple linear regression

Multiple linear regression can also be prone to underfitting the data, and polynomial regression can again be used to make the model more complex.

You can also use Ridge Regression, Lasso Regression, Elastic-net, or early stopping to deal with overfitting, just as with simple linear regression.
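Early stopping is not available on LinearRegression itself, but sklearn’s gradient-descent-based SGDRegressor supports it. A minimal sketch; the numeric values shown match sklearn’s defaults rather than tuned choices:

from sklearn.linear_model import SGDRegressor

# Stop training once the score on an internal validation set stops improving
sgd = SGDRegressor(max_iter=1000,
                   early_stopping=True,      # hold out part of the training data
                   validation_fraction=0.1,  # fraction held out for validation
                   n_iter_no_change=5)       # epochs with no improvement before stopping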

Alternative algorithms to multiple linear regression

  • Polynomial regression
  • Random forests
  • Decision trees
  • Support vector regression

How to implement it using Python

Here is how to implement multiple linear regression in Python using sklearn:

import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Load the diabetes dataset (10 features, continuous target)
X, y = datasets.load_diabetes(return_X_y=True)

# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Fit an ordinary least squares model on the training data
reg = LinearRegression().fit(X_train, y_train)

reg.intercept_, reg.coef_

(151.06925805841755, array([ 54.52885236, -287.65356694, 546.96003617, 337.11892523, -749.7977235 , 403.8751115 , 109.85313395, 218.18005015, 722.75246222, 64.94051697]))

This represents the equation (coefficients rounded; your exact values will differ because the train/test split is random):
Y_hat = 151 + 54*X1 - 287*X2 + 546*X3 + 337*X4 - 749*X5 + 403*X6 + 109*X7 + 218*X8 + 722*X9 + 64*X10

# Evaluate the model on the held-out test set
preds = reg.predict(X_test)

print('Mean squared error: %.2f' % mean_squared_error(y_test, preds))
print('Coefficient of determination: %.2f' % r2_score(y_test, preds))

Mean squared error: 2899.82
Coefficient of determination: 0.43
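These numbers will differ from run to run because train_test_split shuffles the data randomly. If you want a reproducible split, you can pass a fixed seed (the 42 here is arbitrary):

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)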