If you are just starting out in machine learning, you might be wondering what the difference is between regression and classification. This post will show you how they differ and how they work.
So, what is the difference between regression and classification? Regression is used when you are trying to predict a continuous output variable, whereas classification is used when you are trying to predict which class a set of features falls into.
For example, regression might be used to predict the price of a house, whereas classification would be used to predict whether the price of the house is above or below a certain value, or whether it will rise or fall.
Both of them fall under the umbrella of supervised learning algorithms. This means that the training data they learn from includes the output values they are trying to learn to predict. If the data is not labeled, it is an unsupervised machine learning problem, and an algorithm such as clustering is used instead.
What regression is
Regression, in machine learning, is where you train an algorithm to predict a continuous (y) output based on a set of features (X). A short code sketch follows the examples below.
Examples of regression problems could include:
- Predicting the price of houses based on data such as the quality of schools in the area, the number of bedrooms in the house and the location of the house.
- Predicting the sales revenue of a company based on data such as the previous sales of the company.
- Predicting how much a customer will cost a company, based on data from previous customers.
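Here is a minimal sketch of what that looks like in practice, using scikit-learn (one common library choice); the house features and prices are made up purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data: [school quality score, number of bedrooms] -> price in $1000s
X = np.array([[7, 3], [5, 2], [9, 4], [6, 3], [8, 5]])
y = np.array([350, 220, 510, 300, 460])

model = LinearRegression().fit(X, y)

# The prediction is a continuous value, not a class label.
print(model.predict([[7, 4]]))
```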
Popular algorithms used for regression
Below are some popular machine learning algorithms that are used for regression problems.
Linear Regression and Polynomial Regression
Linear regression is used to find a linear relationship between the target and one or more predictor variables. An example of when linear regression would be used could be to predict someone’s height based on their age, gender and weight.
Polynomial regression allows a non-linear relationship to be found. It is used when the relationship between the y values and the X values is not linear.
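As a rough sketch of the difference, here are both fit to the same synthetic curved data with scikit-learn; the quadratic data-generating function is just an assumption for the demo:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Synthetic data with a non-linear (quadratic) relationship plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + rng.normal(0, 0.3, size=100)

linear = LinearRegression().fit(X, y)

# Polynomial regression: expand the features to [x, x^2], then fit linearly.
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

print("linear R^2:    ", linear.score(X, y))   # poor fit on curved data
print("polynomial R^2:", poly.score(X, y))     # much closer to 1
```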
Pros: Simple to implement, works well without a lot of data and easy to interpret.
Cons: It cannot capture more complex relationships on its own, and it is susceptible to outliers.
Decision Tree Regression
A decision tree can be used for either regression or classification. It works by splitting the data up in a tree-like pattern into smaller and smaller subsets. Then, when predicting the output value for a set of features, it predicts based on the subset that the example falls into; for regression, this is typically the mean of the training targets in that leaf.
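A minimal sketch with scikit-learn's DecisionTreeRegressor on synthetic data; the sine-wave target and the max_depth value are just illustrative choices:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear data: y follows a sine wave plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

# max_depth limits how far the data is split, which helps against overfitting.
tree = DecisionTreeRegressor(max_depth=4).fit(X, y)

# The prediction is the mean target of the leaf that 2.5 falls into.
print(tree.predict([[2.5]]))
```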
Pros: It can find complex patterns in the data, doesn’t require a lot of preprocessing, easy to understand.
Cons: It can be prone to overfitting.
Random Forest Regression
Random forests are very similar to decision trees and can be used for classification or regression. The difference is that random forests build many decision trees on random subsets of the data and then average their predictions. This often gives much more accurate predictions on unseen data than a single decision tree.
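A brief sketch with scikit-learn's RandomForestRegressor on synthetic data; the dataset shape and number of trees are arbitrary choices for the demo:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression problem, split into train and test sets.
X, y = make_regression(n_samples=500, n_features=8, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees, each trained on a bootstrap sample; predictions are averaged.
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

print("test R^2:", forest.score(X_test, y_test))
print("feature importances:", forest.feature_importances_)
```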
Pros: More accurate on unseen data than decision trees, outputs the importance of variables and can find complex patterns in the data.
Cons: Its predictions are averages of training values, so it does not give a precise prediction for regression and it cannot extrapolate beyond the range seen in the training data.
Neural Networks
Neural networks are loosely based on the neurons in the brain. They can be used for many different tasks, including regression and classification, and they tend to be the best algorithms for very large datasets.
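As a sketch, scikit-learn's MLPRegressor (a small feed-forward network) can be used for regression like this; the layer sizes and synthetic data are illustrative, and the inputs are scaled first because neural networks are sensitive to feature scale:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=1000, n_features=10, noise=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale the inputs, then fit a network with two hidden layers.
net = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=2000, random_state=0),
)
net.fit(X_train, y_train)

print("test R^2:", net.score(X_test, y_test))
```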
Pros: Can be used for classification or regression, work well with large amounts of data, work well with non-linear data and can make predictions quickly after being trained.
Cons: They are often described as being “black boxes”, meaning that it is difficult to understand why they are weighted the way they are, and they require a lot of data to train on.
What classification is
Classification algorithms are used to assign labels to unlabeled examples. They work by learning from training data that contains the labels alongside the features, then using the patterns they find in that data to predict which class a new example falls into. An example would be predicting whether a house will sell for more than a certain price, or whether an email is spam.
Binary classification is where there are only two output classes (e.g. spam or not spam), whereas multiclass classification is where there are more than two output classes (e.g. predicting the breed of the dog in a picture).
Popular classification algorithms
Below are some algorithms that are commonly used for classification.
Logistic Regression
Despite its name, Logistic Regression is a classification algorithm: it outputs the probability that an example falls into a certain class. It is often a good algorithm for binary classification, but it can also be used for multiclass classification.
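A minimal sketch with scikit-learn on its bundled breast-cancer dataset, a standard binary-classification demo; the features are scaled first so the solver converges quickly:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)

# predict_proba gives the probability of each class, not just a hard label.
print(clf.predict_proba(X_test[:3]))
print("test accuracy:", clf.score(X_test, y_test))
```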
Pros: Outputs the probabilities of a classification, easy to interpret and works well for binary classification.
Cons: Cannot model non-linear decision boundaries unless you engineer the features yourself.
Random Forests
As mentioned above, Random Forests can be used for regression and classification. They are often one of the best algorithms for classification problems.
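For classification, here is a rough sketch with scikit-learn's RandomForestClassifier on the bundled iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each tree votes for a class; the forest predicts the majority vote.
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))
```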
Support Vector Machines (SVM)
Support Vector Machines learn which class examples belong to by fitting a decision boundary between the data points and maximizing the margin on either side of that boundary, based on their y-labels. SVMs can use either a “hard margin” or a “soft margin”. Hard-margin SVMs do not allow any data points to fall within the margin, but soft-margin SVMs do. Allowing some data points to fall within the margin helps to avoid overfitting.
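A short sketch with scikit-learn's SVC on synthetic data; the regularization parameter C controls how soft the margin is, and scaling the features first is generally recommended:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Smaller C tolerates more points inside the margin (a softer margin);
# a very large C approximates a hard margin.
clf = make_pipeline(StandardScaler(), SVC(C=1.0)).fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))
```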
Pros: They work well for datasets with lots of features and when the classes are separable.
Cons: They do not provide probability estimates by default, and they are slow to train on large datasets.
Naive Bayes
The Naive Bayes classifier uses Bayes' theorem, together with a strong independence assumption between features, to classify data.
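A minimal sketch with scikit-learn's GaussianNB, one of several Naive Bayes variants and a reasonable default for continuous features, on the bundled wine dataset:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# GaussianNB models each feature with a per-class normal distribution and
# combines them under the "naive" assumption that features are independent.
clf = GaussianNB().fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))
```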
Pros: Makes predictions quickly, works well on categorical data, fast training time and easy to interpret.
Cons: It assumes that each feature is independent of the others, which is rarely true in practice.
K-Nearest Neighbours (KNN)
K-Nearest Neighbours classifies an example by looking at the classes of the k nearest data points to that example and taking a majority vote among them.
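A quick sketch with scikit-learn's KNeighborsClassifier; the scaler is included because KNN relies on distances (see the cons below), and k=5 is just a common starting point:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling matters here: unscaled features with large ranges would otherwise
# dominate the distance calculation.
clf = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
clf.fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))
```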
Pros: Easy to understand, quickly adapts to new training data and useful for multiclass classification.
Cons: Becomes less effective as the number of features increases, requires feature scaling, sensitive to outliers.
Neural Networks
Neural networks can also be used for classification problems, such as classifying images of handwritten digits.
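As a closing sketch, here is scikit-learn's MLPClassifier on the bundled 8x8 handwritten-digits dataset, a small stand-in for larger image tasks; the hidden-layer size is an arbitrary choice:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 8x8 grayscale images of handwritten digits, flattened to 64 features.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000, random_state=0),
)
clf.fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))
```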