Linear regression vs decision trees

If you are learning machine learning, you might be wondering what the differences are between linear regression and decision trees and when to use them. This post will show you how they differ, how they work and when to use each of them.

So, what is the difference between linear regression and decision trees?

Linear regression is used to predict continuous outputs where there is a linear relationship between the features of the dataset and the output variable. It is used for regression problems, where you are trying to predict something that can take any value in a continuous range, such as the price of a house.

Decision trees can be used for either classification or regression problems and are useful for complex datasets. They work by repeatedly splitting the dataset into smaller and smaller subsets, forming a tree-like structure, and then making predictions based on which subset a new example falls into.

There are many nuances to consider with both linear regression and decision trees and there are a number of things you can do to get them to perform better.

How linear regression works

Linear regression gives a continuous output and is used for regression tasks. It can be used when the independent variables (the factors you use to make the prediction) have a linear relationship with the output variable (what you want to predict), i.e. the model is of the form Y = C + aX1 + bX2 (linear) rather than something like Y = C + aX1X2 (non-linear).
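
As a minimal sketch of what this looks like in practice, assuming made-up housing data and using scikit-learn (which is used later in this post):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: house size (m^2) and number of bedrooms as features,
# price as the continuous target we want to predict.
X = np.array([[50, 1], [80, 2], [120, 3], [150, 4]])
y = np.array([150_000, 230_000, 330_000, 400_000])

model = LinearRegression()
model.fit(X, y)

# The fitted model has the form y = C + a*X1 + b*X2
print(model.intercept_, model.coef_)
print(model.predict([[100, 2]]))  # predict the price of a new house
```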

Examples include predicting the price of a house, the expected number of sales on a particular day based on the temperature, or the number of tickets that will be sold based on the price.

Using linear regression for classification is problematic since it doesn’t give you an output in the range of 0 to 1 and it is also susceptible to outliers. Instead, logistic regression is used for classification.

Also, if there is more than one feature, multiple linear regression can be used, and if the relationship between the features and the output is not linear, polynomial regression can be used.

When to use linear regression

Linear regression is appropriate for datasets where there is a linear relationship between the features and the output variable. Polynomial regression can also be used when there is a non-linear relationship between the features and the output.

You can read more about when linear regression is appropriate in this post.

How to optimize linear regression

Linear regression can be prone to underfitting the data. If you build a model using linear regression and you find that both the test accuracy and the training accuracy are low, this is likely due to underfitting. In this case, it would likely help to switch to polynomial regression, which involves expanding the features with powers and interaction terms up to an nth-degree polynomial. To do this, you can use the PolynomialFeatures class from sklearn.
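
A rough sketch of what this could look like with sklearn, assuming made-up data and an illustrative degree of 2:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical non-linear data: y roughly follows 2 + 3x + 0.5x^2 plus noise.
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = 2 + 3 * X[:, 0] + 0.5 * X[:, 0] ** 2 + rng.normal(scale=0.5, size=100)

# PolynomialFeatures(degree=2) adds squared terms (and interaction terms when
# there are multiple features), then plain linear regression is fitted on the
# expanded features.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print(model.predict([[1.5]]))
```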

If you find that the test accuracy is much lower than the training accuracy, you can use regularization to reduce the variance (overfitting) of the model. To do this you can use Ridge regression, Lasso regression, Elastic Net or early stopping.
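
For example, Ridge, Lasso and Elastic Net are all available in sklearn as drop-in replacements for LinearRegression; the alpha values below are just placeholders that you would tune:

```python
from sklearn.linear_model import ElasticNet, Lasso, Ridge

# alpha controls the strength of the penalty; larger values mean more
# regularization. l1_ratio mixes the L1 and L2 penalties for Elastic Net.
ridge = Ridge(alpha=1.0)                        # L2 penalty
lasso = Lasso(alpha=0.1)                        # L1 penalty (can zero out coefficients)
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)   # mix of L1 and L2

# Each is used the same way as LinearRegression:
# model.fit(X_train, y_train); model.predict(X_test)
```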

Does it require feature scaling?

Feature scaling is not generally required for linear, multiple or polynomial regression. However, there are some reasons why you might still want to scale and normalize the data, which are explained in this StackExchange question.

Alternative algorithms to linear regression

  • Polynomial regression
  • Decision tree regression
  • Random forest regression
  • Support vector regression

Decision trees

Decision trees are a powerful machine learning algorithm that can be used for classification and regression tasks. They work by splitting the data multiple times on feature values, choosing splits that best separate the categories in the case of classification, or that group together similar values of the continuous output in the case of regression.

Decision trees for regression

In the case of regression, decision trees learn by splitting the training examples in a way that minimizes the sum of squared residuals. The tree then predicts the output value for a new example by taking the average of all of the training examples that fall into the same leaf of the tree.

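As a minimal sketch of a regression tree with sklearn (the temperature/sales data here is made up for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: temperature as the feature, ice cream sales as the target.
X = np.array([[15], [18], [21], [24], [27], [30], [33]])
y = np.array([120, 150, 200, 260, 310, 400, 420])

# Each split is chosen to minimize the squared error within the resulting
# subsets; each leaf predicts the mean target of the training examples in it.
tree = DecisionTreeRegressor(max_depth=2)
tree.fit(X, y)
print(tree.predict([[25]]))
```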

Decision trees for classification

For classification, decision trees learn by splitting the training examples into subsets that separate the categories the data points fall into as cleanly as possible. They then predict the most common category in the subset that a new example would fall into.

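As a small illustration of a classification tree with sklearn, using its built-in iris dataset (the max_depth value is arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Splits are chosen to make each subset as "pure" as possible (Gini impurity
# by default); a leaf predicts its most common class.
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on held-out data
```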

When to use decision trees

Decision trees are useful when there are complex relationships between the features and the output variables. They also work well compared to other algorithms when there are missing values, when there is a mix of categorical and numerical features, and when features are on very different scales.

How to optimize decision trees

Decision trees can be prone to overfitting. To prevent overfitting, it helps to restrict the maximum depth of the tree, to set a minimum number of samples a node must have before it is split, to set a minimum number of samples that a leaf must contain, and to set a maximum number of leaves that the tree can contain. All of these things can be done in sklearn.
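
In sklearn these restrictions correspond to the max_depth, min_samples_split, min_samples_leaf and max_leaf_nodes parameters. A sketch with illustrative values:

```python
from sklearn.tree import DecisionTreeClassifier

# Illustrative values only; in practice you would tune these, for example
# with cross-validation or GridSearchCV.
tree = DecisionTreeClassifier(
    max_depth=5,           # restrict the maximum depth of the tree
    min_samples_split=10,  # minimum samples a node needs before it can split
    min_samples_leaf=5,    # minimum samples each leaf must contain
    max_leaf_nodes=20,     # maximum number of leaves in the tree
)
# tree.fit(X_train, y_train)
```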

It would also likely help to use a random forest instead, which reduces overfitting by aggregating the predictions of many decision trees trained on random subsets of the same dataset.
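
A minimal random forest sketch with sklearn (the number of trees is just an illustrative default):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# n_estimators is the number of trees whose predictions are aggregated.
forest = RandomForestClassifier(n_estimators=100, random_state=42)
print(cross_val_score(forest, X, y, cv=5).mean())
```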

Does it require feature scaling?

Decision trees do not require feature scaling.

Alternative algorithms to decision trees

In the case of classification:

  • Random forests
  • Support vector machines
  • K-nearest neighbors
  • Logistic regression
  • Naive Bayes
  • Neural networks

For regression:

  • Linear regression
  • Polynomial regression
  • Support vector regression
  • Random forest regression