What is the difference between linear and logistic regression?

If you’re just starting out in machine learning, two of the first things you’ll likely learn about are linear and logistic regression.

This post will show you how linear and logistic regression are different and when to use each of them.

So, what is the difference between linear regression and logistic regression? Linear regression gives you a continuous output that is used to predict something with infinite possible answers such as the price of a house. Logistic regression gives you an output between 0 and 1 and is used to classify things such as whether or not a tumor is malignant.

There are actually a number of similarities between linear and logistic regression, and they can both be used to predict similar things, with some nuances between them.

Differences between linear and logistic regression

Here are the key differences between the two:

Output and when to use them

Linear regression

Linear regression gives a continuous output that can be greater than 1 or less than 0. It is used to predict things that have infinite possible answers. It does not tell you the probability of a particular outcome.

It can be used when the independent variables (the factors you want to predict with) have a linear relationship with the output variable (what you want to predict), i.e. the relationship is of the form Y = C + aX1 + bX2 (linear) and not of the form Y = C + aX1X2 (non-linear).

Examples could be predicting the number of ice cream sales expected on a particular day based on the temperature, or the number of tickets that will be sold for an event based on the price.
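Here is a minimal sketch of the ice cream example using scikit-learn; the numbers are made up and purely illustrative.

import numpy as np
from sklearn.linear_model import LinearRegression

# Independent variable: temperature in degrees C (one column per feature).
X = np.array([[15], [18], [21], [24], [27], [30]])
# Output variable: number of ice creams sold that day (made-up numbers).
y = np.array([120, 150, 195, 240, 280, 330])

model = LinearRegression().fit(X, y)

# The fitted line has the form Y = C + aX1.
print("intercept C:", model.intercept_)
print("slope a:", model.coef_[0])

# The prediction is continuous, not limited to the range 0 to 1.
print("predicted sales at 25 degrees:", model.predict([[25]])[0])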

Using linear regression for classification is problematic, since it doesn’t give you an output in the range of 0 to 1 and it is also sensitive to outliers.

If there are multiple independent variables then it becomes multiple linear regression.

Logistic regression

Logistic regression gives an output between 0 and 1, which tells you the probability of something happening. With the usual cutoff of 0.5, an output below 0.5 means the event is not likely to occur, and an output above 0.5 means it is likely to occur.

Logistic regression is also used when there is a linear relationship between the factors and the output (strictly speaking, between the factors and the log-odds of the outcome). The difference is that logistic regression will give you a yes or no type of answer.

Examples could be predicting whether a tumor is malignant, whether the price of a house will be greater or less than $300,000, or whether or not someone will win an election.
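Here is a minimal sketch of the tumor example using scikit-learn; again, the data is made up and purely illustrative.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Independent variable: tumor size in mm; output: malignant (1) or benign (0).
X = np.array([[2], [5], [8], [12], [15], [20], [24], [30]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

# predict_proba gives a probability between 0 and 1 for each class;
# predict applies the 0.5 cutoff to turn that into a yes/no answer.
print("P(malignant) for a 10 mm tumor:", clf.predict_proba([[10]])[0, 1])
print("classified as:", clf.predict([[10]])[0])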

How they are calculated

Linear regression

The equations used for linear regression are:

Hypothesis: h(x) = theta0 + theta1*X1 + theta2*X2 + ... (i.e. theta^T x)

Cost function: J(theta) = (1/2m) * sum over the m training examples of (h(x) - y)^2

Gradient descent update: theta_j := theta_j - alpha * (1/m) * sum of (h(x) - y) * x_j

Theta represents the coefficients that you are trying to find that will allow you to predict the output variable, and alpha is the learning rate, which controls how big a step each update takes.
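Written out in plain NumPy, the gradient descent update above might look like this (a sketch, assuming X already has a leading column of ones so that theta[0] plays the role of the intercept):

import numpy as np

def gradient_descent(X, y, alpha=0.01, iterations=1000):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        predictions = X @ theta           # h(x) = theta^T x for every example
        gradient = (X.T @ (predictions - y)) / m
        theta -= alpha * gradient         # theta_j := theta_j - alpha * dJ/dtheta_j
    return theta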

By doing stochastic gradient descent, where you select only one training example for each iteration instead of summing over the whole training set, you are able to get to a solution much more quickly.

By setting alpha = 1/(1+k), where k is the number of iterations that you have done so far, you can get to a solution more quickly, since you’ll take larger steps at first and then smaller steps as theta gets close to the optimal values.
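A sketch of the stochastic variant with that decaying learning rate (the schedule here is just the one from the paragraph above; in practice you would tune it):

import numpy as np

def sgd(X, y, iterations=10000):
    m, n = X.shape
    theta = np.zeros(n)
    rng = np.random.default_rng(0)
    for k in range(iterations):
        alpha = 1.0 / (1.0 + k)   # larger steps at first, smaller steps later
        i = rng.integers(m)       # one randomly chosen training example per step
        error = X[i] @ theta - y[i]
        theta -= alpha * error * X[i]
    return theta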

A closed-form solution (the normal equation, theta = (X^T X)^-1 X^T y) also exists, and because the cost function is convex, setting its gradient to zero in this way gives the global minimum.
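In NumPy the normal equation is one line; np.linalg.solve is used here rather than explicitly inverting X^T X, since it is more numerically stable:

import numpy as np

def normal_equation(X, y):
    # Solve (X^T X) theta = X^T y for theta.
    return np.linalg.solve(X.T @ X, X.T @ y)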

It is possible to overfit when using linear regression, which is why you might want to use regularization. You can use ridge regression to avoid overfitting, which gets explained in this video: https://www.youtube.com/watch?v=qbvRdrd0yJ8.
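A minimal sketch of ridge regression with scikit-learn, reusing the made-up ice cream data from earlier; note that the alpha parameter here is the regularization strength, not the learning rate from the gradient descent section.

import numpy as np
from sklearn.linear_model import Ridge

X = np.array([[15], [18], [21], [24], [27], [30]])
y = np.array([120, 150, 195, 240, 280, 330])

# Larger alpha shrinks the coefficients more, which reduces overfitting.
ridge = Ridge(alpha=1.0).fit(X, y)
print("regularized slope:", ridge.coef_[0])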

Logistic regression

The formulas used for logistic regression are:

Hypothesis: h(x) = g(theta^T x), where g(z) = 1 / (1 + e^-z)

This looks very similar to linear regression; the only difference in the hypothesis is that with logistic regression you pass theta^T x through a function that makes the predicted value of y fall between 0 and 1. There are many different functions that can do this, and a common one is the sigmoid function g shown above. (The cost function also changes, to the log loss, so that it stays convex.)
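A sketch of the sigmoid function and the logistic hypothesis in NumPy:

import numpy as np

def sigmoid(z):
    # Squashes any real number into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def predict_probability(theta, x):
    # Logistic regression's hypothesis: the sigmoid applied to the same
    # theta^T x term that linear regression uses directly.
    return sigmoid(x @ theta)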