If you are just starting out in machine learning, you might be wondering what the differences are between classification and clustering. This post will show you what the differences are, the popular algorithms used in Scikit-Learn for classification and clustering and what their advantages and disadvantages are.
So, what is the difference between classification and clustering? Classification is a supervised learning approach that learns to figure out what class a new example should fit in by learning from training data that contains the class labels for the data points. Clustering is an unsupervised learning approach which tries to cluster similar examples together without knowing what their labels are.
For example, classification might be used to determine if an email is spam or not. The algorithm would learn by detecting patterns in training examples that contain spam or not spam labels.
Clustering can also be applied to this problem, by grouping emails together based on the heading, the body and who the sender is, but without the spam or not spam labels.
Clustering could also be used to segment a company’s customers based on their purchasing history.
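The difference shows up directly in how the two kinds of model are fitted. The sketch below assumes Scikit-Learn is installed and uses synthetic data from make_blobs; the dataset, the choice of LogisticRegression and KMeans, and the variable names are illustrative assumptions, not something prescribed by this post.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Toy data: two well-separated groups of points (an illustrative assumption).
X, y = make_blobs(n_samples=200, centers=[[-5, -5], [5, 5]],
                  cluster_std=1.0, random_state=0)

# Classification (supervised): the labels y are passed to fit().
classifier = LogisticRegression().fit(X, y)
predicted_classes = classifier.predict(X)

# Clustering (unsupervised): only the features X are passed to fit().
clusterer = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_ids = clusterer.labels_
```

Note that the cluster ids are arbitrary: cluster 0 is not guaranteed to correspond to class 0, because the clustering algorithm never saw the labels.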
Classification algorithms are used to assign a class to a new example. They work by learning from training data that contains the labels to go along with the features and then use the patterns they find in the data to predict what class a new example should fall into.
An example would be predicting if a house will sell for more than a certain price (note that predicting the exact price the house will sell for would be a regression problem and grouping similar houses together would be clustering).
Binary classification is where there are only two output classes (e.g. spam or not spam), whereas multiclass classification is where there are more than two output classes (e.g. predicting what breed of dog is shown in a picture).
Popular classification algorithms
Below are some algorithms that are commonly used for classification.
Logistic Regression outputs the probability that an example falls into a certain class. It is often a good algorithm for binary classification but it can also be used for multiclass classification.
Pros: Outputs the probabilities of a classification, easy to interpret and works well for binary classification.
Cons: Learns a linear decision boundary, so it struggles with non-linear data unless the features are transformed first.
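As a rough illustration of the probability output, here is a minimal sketch using Scikit-Learn's LogisticRegression. The synthetic dataset from make_classification and the variable names are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (an illustrative assumption).
X, y = make_classification(n_samples=300, n_features=4, random_state=0)

model = LogisticRegression().fit(X, y)

# One row per example, one column per class; each row sums to 1.
probabilities = model.predict_proba(X[:3])
# predict() returns the class with the highest probability.
predictions = model.predict(X[:3])

print(probabilities)
print(predictions)
```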
A decision tree can be used for either regression or classification. It works by splitting the data up in a tree-like pattern into smaller and smaller subsets. Then, when predicting the output value of a set of features, it will predict the output based on the subset that the set of features falls into. You can look here for a more detailed explanation.
Pros: It can find complex patterns in the data, doesn’t require a lot of preprocessing, easy to understand.
Cons: It can be prone to overfitting.
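The overfitting risk can be seen by comparing an unconstrained tree with a depth-limited one. This is a sketch under assumptions: the noisy synthetic dataset, the max_depth value of 3 and the variable names are all chosen for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% of labels flipped at random (an illustrative assumption).
X, y = make_classification(n_samples=500, n_features=10, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree keeps splitting until it fits the training set, noise included.
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# Limiting the depth is one common way to reduce overfitting.
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

for name, tree in [("deep", deep_tree), ("shallow", shallow_tree)]:
    print(name, tree.get_depth(),
          tree.score(X_train, y_train), tree.score(X_test, y_test))
```

A perfect training score combined with a noticeably lower test score is the usual sign that a tree has memorised noise.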
Random forests are very similar to decision trees and can be used for classification or regression. The difference is that random forests build multiple decision trees on random subsets of the data and then average the results. This often allows for much more accurate predictions on unseen data than with decision trees. You can read more about how random forests are used for classification here.
Pros: More accurate on unseen data than decision trees, outputs the importance of variables and can find complex patterns in the data.
Cons: For regression, its predictions are averages of values seen in training, so they can be coarse and they cannot extend beyond the range seen in the training data.
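The range limitation for regression is easy to demonstrate. In the sketch below, the straight-line data y = 2x, the query point and the variable names are assumptions made for illustration; RandomForestRegressor is the Scikit-Learn estimator.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Training data on a straight line y = 2x, with x between 0 and 9.9.
X_train = np.arange(0, 10, 0.1).reshape(-1, 1)
y_train = 2.0 * X_train.ravel()

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

# The true value at x = 100 would be 200, but every tree can only return
# an average of training targets, so the prediction stays within their range.
far_prediction = forest.predict([[100.0]])[0]
print(far_prediction, y_train.max())
```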
Support Vector Machines (SVM)
Support vector machines learn what class examples belong to by fitting a boundary between the classes (a line in two dimensions, a hyperplane in general) and maximizing the margin on either side of that boundary. SVMs can either use a “hard margin” or a “soft margin”. Hard margin SVMs do not allow any data points to fall within the margin but soft margin SVMs do. Allowing some data points to fall within the margin helps to avoid overfitting.
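Pros: They work well for datasets with lots of features and they work well when the classes are separable.
Cons: They do not provide probability estimates directly (in Scikit-Learn, SVC can produce them with probability=True, at extra training cost) and they don’t work well with large datasets.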
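In Scikit-Learn, how soft the margin is can be controlled with the C parameter of SVC. The sketch below is an illustration under assumptions: the overlapping synthetic blobs, the linear kernel and the two C values are arbitrary choices.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping groups of points (an illustrative assumption).
X, y = make_blobs(n_samples=200, centers=[[-2, -2], [2, 2]],
                  cluster_std=1.5, random_state=0)

# Small C: margin violations are cheap, so the margin is wide and soft.
wide_margin = SVC(kernel="linear", C=0.01).fit(X, y)
# Large C: margin violations are expensive, which approaches a hard margin.
narrow_margin = SVC(kernel="linear", C=100.0).fit(X, y)

# Points on or inside the margin become support vectors,
# so a wider margin usually means more of them.
print(wide_margin.n_support_.sum(), narrow_margin.n_support_.sum())
```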
The Naive Bayes classifier uses Bayes' theorem to classify data. You can look here for more info about how it works.
Pros: Makes predictions quickly, works well on categorical data, fast training time and easy to interpret.
Cons: It considers each feature as being independent which is not always true.
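To make the independence assumption concrete, here is a tiny hand-rolled calculation in plain Python. All of the probabilities, the two example words and the function name posterior are hypothetical values invented for illustration; this is not how Scikit-Learn implements its Naive Bayes classifiers.

```python
# Hypothetical probabilities, chosen only for illustration.
prior = {"spam": 0.4, "not_spam": 0.6}
word_likelihood = {
    "spam": {"free": 0.30, "meeting": 0.02},
    "not_spam": {"free": 0.03, "meeting": 0.20},
}

def posterior(words):
    """Return P(class | words) using Bayes' theorem.

    The "naive" part: the likelihoods of the individual words are simply
    multiplied together, as if the words were independent given the class.
    """
    unnormalised = {}
    for label in prior:
        score = prior[label]
        for word in words:
            score *= word_likelihood[label][word]
        unnormalised[label] = score
    total = sum(unnormalised.values())
    return {label: score / total for label, score in unnormalised.items()}

result = posterior(["free"])
print(result)  # spam: 0.12 / 0.138, about 0.87
```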
K-Nearest Neighbours (KNN)
K-nearest Neighbours classifies examples by looking at the classes of the k-nearest data points to that example.
Pros: Easy to understand, quickly adapts to new training data and useful for multiclass classification.
Cons: Becomes less effective as the number of features increases, requires feature scaling, sensitive to outliers.
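The idea is simple enough to sketch from scratch in plain Python. The function name knn_predict, the Euclidean distance and the toy points are assumptions made for this example; Scikit-Learn's KNeighborsClassifier is the practical choice for real data.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    # Pair every training point's distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    # The most common label among the k nearest neighbours wins.
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy data: two small groups of 2D points (an illustrative assumption).
train_points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_points, train_labels, (2, 2)))  # a
print(knn_predict(train_points, train_labels, (7, 7)))  # b
```

Because the vote depends on raw distances, a feature measured on a much larger scale than the others would dominate, which is why feature scaling matters here.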
Neural networks are loosely based on the neurons in the brain. They can be used for many different tasks including regression and classification. They tend to be the best algorithms for very large datasets. You can read more about neural networks here and you can read more about using Neural Networks to classify images of numbers here.
Pros: Can be used for classification or regression, work well with large amounts of data, work well with non-linear data and can make predictions quickly after being trained.
Cons: They are often described as “black boxes”, meaning that it is difficult to understand why they make the predictions they do, and they require a lot of data to train on.
The goal of clustering algorithms is to group similar data points together. They are unsupervised algorithms, meaning that the data they are used with does not contain output labels.
So, if the goal is to group emails in order to find spam, the dataset wouldn’t say whether each email is spam or not; it would just contain data about the emails such as the body text and the subject line.
Popular clustering algorithms
Below are the clustering algorithms that you can currently implement using Scikit-Learn, their advantages and their disadvantages.
K-means clustering is used to segment data into k clusters. It works by initially placing k centroids among the data. Each data point is then assigned to its closest centroid, and each centroid is moved to the mean of the points assigned to it. These two steps are repeated, which iteratively reduces the sum of the squared distances between the data points and their nearest centroid. You can also read more about it here.
Pros: Scales well to large datasets and easily adapts to new examples.
Cons: You must choose k manually, it is heavily impacted by outliers, and it doesn’t do well when the clusters have different sizes and densities or are not roughly round in shape.
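The assign-then-update loop can be sketched from scratch with NumPy. This is a minimal illustration rather than Scikit-Learn's implementation; the function name k_means, the random initialisation from existing data points, the stopping rule and the toy data are all assumptions.

```python
import numpy as np

def k_means(points, k, n_iterations=100, seed=0):
    """Plain k-means: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iterations):
        # Assignment step: link every point to its closest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Toy data: two groups of 2D points (an illustrative assumption).
rng = np.random.default_rng(1)
points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(10, 10), scale=0.5, size=(50, 2)),
])
centroids, labels = k_means(points, k=2)
print(centroids)
```

The result depends on where the centroids start, which is why library implementations typically run several initialisations and keep the best one.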
Affinity propagation is similar to k-means clustering in that each cluster is represented by a central point, although here that point is an actual example from the data (an exemplar). Unlike k-means, it does not require you to set the number of clusters beforehand. The trade-off is a higher time complexity. You can read more about the algorithm here.
Pros: Determines the number of clusters for you, works well when there are lots of clusters, works if there are uneven cluster sizes or non-flat geometry.
Cons: It has a worse time complexity than k-means clustering.
Mean-shift is a centroid-based clustering algorithm that looks for dense regions in the data. It works by placing a circle around each of the examples, repeatedly shifting each circle to the mean of the points that fall inside it, and then combining examples into the same cluster if their circles end up in the same place.
Pros: It figures out how many clusters there should be for you, works well when there are lots of clusters, works if there are uneven cluster sizes or non-flat geometry.
Cons: Does not scale well.
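Here is a short sketch of mean-shift finding the number of clusters by itself, using Scikit-Learn's MeanShift. The bandwidth (the radius of the circle placed around each example) is set by hand here; that value, the synthetic data and the variable names are assumptions made for illustration.

```python
from sklearn.cluster import MeanShift
from sklearn.datasets import make_blobs

# Two well-separated groups of points (an illustrative assumption).
X, _ = make_blobs(n_samples=100, centers=[[0, 0], [10, 10]],
                  cluster_std=0.5, random_state=0)

# No number of clusters is given; only the size of the circle is.
mean_shift = MeanShift(bandwidth=2.0).fit(X)

n_clusters_found = len(mean_shift.cluster_centers_)
print(n_clusters_found)
```

The bandwidth still has to be chosen sensibly: a very large circle would merge the groups and a very small one would split them.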
Spectral clustering is a very useful algorithm for situations where the desired clusters have very abnormal shapes. This Quora post explains very well why you would want to use the algorithm and this post explains how it works.
Pros: Good for data with dense data points and abnormal shapes.
Cons: Doesn’t work well with large datasets or datasets with lots of clusters.
Hierarchical clustering is useful for grouping correlated features on heatmaps. You can read more about how it works and its applications here.
Pros: Hierarchical clustering gives you a dendrogram, which is useful for visualizing the relationships in the data.
Cons: Doesn’t scale well and it is not as good as other clustering algorithms at accurately clustering the data (source).
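As a sketch of where the dendrogram comes from, the example below uses SciPy's hierarchy module rather than Scikit-Learn, since it exposes the merge tree directly. The Ward linkage method, the toy data and the choice to cut the tree into two clusters are assumptions made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Toy data: two groups of ten 2D points (an illustrative assumption).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(10, 2)),
    rng.normal(loc=(6, 6), scale=0.5, size=(10, 2)),
])

# Each row of Z records one merge: the two clusters joined,
# the distance between them, and the size of the new cluster.
Z = linkage(X, method="ward")

# Cut the tree so that at most two clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")

# no_plot=True returns the dendrogram layout without drawing it.
tree = dendrogram(Z, no_plot=True)
print(labels)
```

Calling dendrogram(Z) without no_plot=True draws the tree, provided a plotting library such as matplotlib is installed.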
DBSCAN is useful when there are dense clusters that are separated by less dense regions. It also scales well with the amount of data there is so it is preferred over many of the other algorithms that try to separate dense regions and non-dense regions.
Pros: Ignores outliers, scales well.
Cons: You need to set the neighborhood parameters yourself (in Scikit-Learn, eps and min_samples).
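The sketch below shows DBSCAN labelling an isolated point as noise. It uses Scikit-Learn's DBSCAN; the eps and min_samples values, the toy data and the single far-away outlier are assumptions chosen to make the behaviour easy to see.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups plus one isolated point (an illustrative assumption).
rng = np.random.default_rng(0)
dense_a = rng.normal(loc=(0, 0), scale=0.2, size=(30, 2))
dense_b = rng.normal(loc=(5, 5), scale=0.2, size=(30, 2))
outlier = np.array([[20.0, 20.0]])
X = np.vstack([dense_a, dense_b, outlier])

# eps is the neighborhood radius; min_samples is how many points
# must fall within it for a region to count as dense.
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)

# Points that belong to no dense region get the label -1.
print(labels[-1])
print(sorted(set(labels.tolist())))
```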
OPTICS is similar to DBSCAN in that it is useful for separating dense clusters from less dense regions, and it works in a similar way. However, it also computes a core distance for each core point (the smallest radius at which the point would count as a core point) and reachability distances between points. You can look here and here for more details.
Pros: Scales well, good for detecting clusters of differing densities.
Cons: Slower than DBSCAN.
Gaussian mixture models assume that the data points have been generated from a mixture of a finite number of Gaussian distributions. They are useful when you want to estimate densities or if you know that the data has come from Gaussian distributions. You can look here for more information.
Pros: More flexible than K-Means, good for estimating densities.
Cons: Doesn’t scale well.
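As a rough sketch of the extra flexibility and the density estimate, the example below fits Scikit-Learn's GaussianMixture to synthetic data. The two-blob dataset, the number of components and the variable names are assumptions made for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Two groups of points drawn from Gaussians (an illustrative assumption).
X, _ = make_blobs(n_samples=300, centers=[[-4, 0], [4, 0]],
                  cluster_std=1.0, random_state=0)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Hard assignment: the most likely component for each point.
hard_labels = gmm.predict(X)
# Soft assignment: the probability of each component for each point.
soft_labels = gmm.predict_proba(X)
# Density estimate: the log-likelihood of each point under the fitted mixture.
log_density = gmm.score_samples(X)

print(gmm.means_)
```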
Birch is a hierarchical algorithm that is useful when the dataset is large and there are a lot of clusters within the dataset. It can be used as an alternative to mini-batch k-means. You can read more about it here and here.
Pros: Works well on large datasets with lots of clusters and it is useful for data reduction.
Cons: Not as accurate as alternative models.