Introduction
Clustering in Machine Learning is a type of unsupervised learning method. Clustering is used to find structure, patterns, and collective similarity within the observations of a dataset. It does this by grouping data points together: data points within a group are similar to each other with respect to the features used in the analysis, while the groups themselves differ from one another with respect to those same features. When using the clustering technique, it should be ensured that the dataset is free of labels. The goal of clustering is to determine the intrinsic grouping of the data, and this grouping can be useful for finding similarities between data points. In a two-dimensional or three-dimensional space, it is easy to visualize the data and understand it. In a high-dimensional space, however, it is hard to visualize the data and find groups by visual inspection. Clustering is an effective way to work with high-dimensional data and find groups of data points that are similar to each other. The closeness of data points is measured using a distance metric.
Distance Metrics
There are numerous distance metrics that can be used to measure the distance between data points. Some of the most prominent distance metrics used in the industry are:
- Euclidean distance
- Manhattan distance
- Minkowski distance
- Jaccard distance
- Cosine similarity
This closeness or farness, however, is relative: there is no fixed threshold in Machine Learning that dictates whether data points are close or far.
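As a quick illustration (not part of the original analysis), the sketch below computes each of the metrics listed above for a pair of made-up vectors using scipy:

```python
# Distance metrics from the list above, computed for two illustrative vectors.
from scipy.spatial import distance

x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]

print("Euclidean:", distance.euclidean(x, y))          # straight-line distance
print("Manhattan:", distance.cityblock(x, y))           # sum of absolute differences
print("Minkowski (p=3):", distance.minkowski(x, y, p=3))
print("Cosine distance:", distance.cosine(x, y))        # 1 - cosine similarity

# Jaccard distance is typically used for boolean / set-like features.
a = [1, 0, 1, 1]
b = [1, 1, 0, 1]
print("Jaccard:", distance.jaccard(a, b))
```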
Types of clustering
Density-Based Methods: Density-based clustering methods use the idea of density to form clusters. In physics, density is defined as mass divided by volume. In clustering, where we deal with multidimensional spaces and large numbers of data points, density is instead the number of points that fall inside the region of radius epsilon drawn around a point, divided by the volume of that region. Epsilon must be provided as an argument to a density-based algorithm; it determines which neighboring points count as the border points of a core point, namely those that fall within the epsilon-defined region. Density-based clustering uses a "friend of a friend" idea: if a point can be reached by jumping from neighborhood to neighborhood, the point belongs to that cluster. Points outside the density-based clusters are usually categorized as noise or outliers. The figure and the example below will provide a clearer understanding of the terms neighborhood, border point, and core point.
The core points in this particular example can be assumed to be point p and point q. Point p's border points fall within the greenish-blue area (p's neighborhood), and point q's border points fall within the orangish area (q's neighborhood). Points p and q in this example have two and three border points respectively. Point r is a border point of point q. Since point r can be reached by jumping from point p's neighborhood to q's neighborhood, all of these points belong to the same cluster. Please note that the clusters have not been completely formed in this example; only the first two steps of the cluster formation are shown. An example of a density-based clustering algorithm is DBSCAN.
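As a hedged sketch of how a density-based algorithm behaves, the snippet below runs scikit-learn's DBSCAN on a small made-up 2-D dataset; the eps and min_samples values are illustrative choices, not values from this project:

```python
# Minimal DBSCAN sketch: two dense blobs plus a couple of far-away points.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
blob_1 = rng.normal(loc=[0, 0], scale=0.3, size=(50, 2))
blob_2 = rng.normal(loc=[3, 3], scale=0.3, size=(50, 2))
outliers = np.array([[10.0, 10.0], [-8.0, 9.0]])
X = np.vstack([blob_1, blob_2, outliers])

# eps is the neighborhood radius; min_samples is how many neighbors a point
# needs to count as a core point. Both are made-up values for this sketch.
db = DBSCAN(eps=0.5, min_samples=5).fit(X)

# A label of -1 marks points that could not be reached from any core point,
# i.e. the noise / outliers mentioned above.
print("cluster labels found:", set(db.labels_))
print("number of noise points:", int(np.sum(db.labels_ == -1)))
```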
Hierarchical Clustering Methods: Hierarchical clustering is also known as connectivity-based clustering. It is based on the principle that every object is connected to its neighbors depending on their proximity (distance). Clusters are therefore represented in a hierarchical structure called a dendrogram. The figure below is an example of a dendrogram. The data points are divided and then subdivided into groups. The sooner two data points split up, the larger the distance between them.
For example, in this particular dendrogram, the distance between 'a' and 'n' is greater than the distance between 'a' and 'f', and the distance between points 'a' and 'e' is greater than the distance between points 'a' and 'c'.
Several different clusterings can be read off the dendrogram above. One could consider 'a' and 'b' as one cluster, or 'a', 'b', 'c', 'd', and 'e' could be considered one cluster. The final clustering is user dependent and case dependent.
Hierarchical clustering can be agglomerative (bottom-up), where each observation starts in its own cluster and pairs of clusters are merged as one moves up the hierarchy, or divisive (top-down), where all observations start out in one cluster that is split recursively as one moves down the hierarchy.
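A minimal sketch of the agglomerative approach using scipy is shown below; the tiny 2-D dataset and the choice of Ward linkage are made up purely for illustration:

```python
# Agglomerative hierarchical clustering on a toy dataset.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0],
              [5.1, 4.8], [9.0, 1.0], [9.2, 1.1]])

# Each observation starts in its own cluster; pairs of clusters are merged
# step by step (Ward linkage here) until a single cluster remains.
Z = linkage(X, method="ward")

# The dendrogram records the merge order and the merge heights.
dendrogram(Z, labels=["a", "b", "c", "d", "e", "f"])
plt.show()

# Cutting the tree at a chosen number of clusters gives flat labels.
print(fcluster(Z, t=3, criterion="maxclust"))
```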
Partitional Clustering Methods: Centroid-based or partitional clustering is a very common type of clustering. In partitional clustering, data points are grouped based on their closeness to a chosen centroid or central value. The first step is to determine how many clusters are wanted. Depending on the number of clusters, the locations of the centroids are either provided manually or default to the centroids the algorithm chooses. Choosing the number of clusters and the initial centroid locations is often the most complicated step in partitional clustering. Below is a visual representation of partitional k-means clustering.
The distances between the data points and between each data point and the centroid are measured using distance metrics such as Euclidean distance, Manhattan distance, or Minkowski distance. Some examples of partitional clustering are k-means clustering, quality threshold clustering, expectation maximization clustering, mean shift, locality sensitive hashing based clustering, and k-way spectral clustering.
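To make the centroid idea concrete, here is a rough numpy sketch of a single k-means update step (assign each point to its nearest centroid, then recompute the centroids); the data points and starting centroids are made up, and in practice these two steps are repeated until the assignments stop changing:

```python
# One k-means iteration on a toy dataset with user-chosen starting centroids.
import numpy as np

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0],
              [8.5, 8.0], [9.0, 7.5], [1.0, 0.5]])
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])   # illustrative starting centroids

# Step 1: assign every point to its nearest centroid (Euclidean distance here).
distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
labels = distances.argmin(axis=1)

# Step 2: move each centroid to the mean of the points assigned to it.
centroids = np.array([X[labels == k].mean(axis=0) for k in range(len(centroids))])

print(labels)      # cluster assignment of each point
print(centroids)   # updated centroid locations
```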
Source: www.analytixlabs.co.in, en.wikipedia.org
Application of clustering in this project
For the purpose of this project, where the goal is to predict the number of Electric Vehicle registrations in 2030 and to predict the distribution of renewable and non-renewable sources of energy for electricity, both hierarchical and partitional clustering methods are used. Using two different methods of clustering provides insight into the different clusters that can be formed and why particular states in particular years were clustered together. Hierarchical clustering is performed first to gauge the appropriate number of clusters to use for partitional clustering. Methods such as the silhouette, elbow, and gap statistic are also used to determine the appropriate number of clusters. States that fall in the same cluster as states that currently have a high population of Electric Vehicles might be the new hotspots for Electric Vehicles. For the hierarchical clustering method, it is more sensible to use the divisive (top-down) approach. For partitional clustering, k-means clustering is used.
Hierarchical clustering using cosine similarity – R
1. Data Prep for hierarchical clustering:
Since clustering is an unsupervised learning technique, the data needs to be unlabeled. The data also needs to be strictly numerical. It is also beneficial to normalize the data so that certain features do not skew the clusters because of their extremely high or low relative values. The images below show the transformation of the original data to the standardized form: the data was converted so that the mean of each column is zero and the standard deviation of each column is 1.
The data used for this analysis can be found here.
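The original data prep was done in R; the sketch below shows the same standardization idea in Python, using hypothetical column names rather than the actual columns of the linked dataset:

```python
# Standardize each column to mean 0 and standard deviation 1.
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up stand-in for the real (numeric, unlabeled) data.
df = pd.DataFrame({
    "ev_registrations": [1200, 45000, 800, 16000],
    "population":       [1.1e6, 39.5e6, 0.7e6, 8.9e6],
})

scaled = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)

# Each column now has mean 0 and (population) standard deviation 1.
print(scaled.mean().round(2))
print(scaled.std(ddof=0).round(2))
```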
2. Process and Analysis:
Hierarchical clustering is very flexible and easy to use in R. Using Dr. Amy Gate's function to create a distance metric, the distances between observations were converted into a distance matrix and fed into hclust to produce a dendrogram. The dendrogram for hclust using cosine similarity on the standardized data is shown below; the value of k used to draw rectangles around the clusters is set to 4.
While using hclust to cluster the data, it is helpful to compare the agglomerative coefficient of each available linkage method. The agglomerative coefficient ranges from 0 to 1: values closer to 1 indicate a stronger, tighter clustering structure, while values closer to 0 indicate a looser, less well-defined structure. There are several linkage methods that can be used with hclust, such as average, single, complete, and ward. The agglomerative coefficients for each of the methods are below:
The agglomerative coefficient for the single method is the lowest, which suggests that, for this data, it produces the weakest clustering structure. To see what that looks like in practice, let's try the single method with cosine similarity as the distance metric.
The clusters in the above figure look very disproportionate, which is consistent with the single method's low agglomerative coefficient. Even so, the coefficient should not be the only criterion for choosing a method; the dendrograms themselves need to be inspected. The other methods and distances yielded the following dendrograms.
3. Code for above analysis:
The code for generating the above figures and performing the above analysis can be found here.
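The original analysis was done in R (hclust, with agnes-style agglomerative coefficients). For readers who just want the gist, the sketch below is a rough Python approximation of the same comparison: it builds a cosine distance matrix, runs several linkage methods, and computes an agglomerative coefficient from each linkage matrix. The input array here is a random placeholder standing in for the standardized project data:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

def agglomerative_coefficient(Z, n_obs):
    """agnes-style coefficient: average of 1 - m(i), where m(i) is the height
    at which observation i is first merged divided by the final merge height."""
    first_merge_height = np.zeros(n_obs)
    for left, right, height, _ in Z:
        for cluster_id in (int(left), int(right)):
            if cluster_id < n_obs:          # ids below n_obs are original observations
                first_merge_height[cluster_id] = height
    return float(np.mean(1.0 - first_merge_height / Z[-1, 2]))

# Placeholder standing in for the standardized data described above.
X_scaled = np.random.default_rng(1).normal(size=(30, 5))

# Cosine distance between every pair of observations (condensed form).
cosine_distances = pdist(X_scaled, metric="cosine")

# Compare linkage methods (note: scipy's ward formally assumes Euclidean
# distances, so the ward + cosine combination only loosely mirrors the R run).
for method in ["single", "average", "complete", "ward"]:
    Z = linkage(cosine_distances, method=method)
    print(method, round(agglomerative_coefficient(Z, X_scaled.shape[0]), 3))
```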
4. Results of Hierarchical clustering:
Analyzing the dendrograms for the different methods and distances, it can be seen that the ward method with cosine similarity and the ward method with Manhattan distance yield sensible clusters. We can also see a good distribution of three to four clusters across the different dendrograms.
K Means clustering – Python
1. Data Prep for K-Means clustering in python:
For partitional clustering, the K-Means algorithm is used. To apply K-Means, the data needs to be in record format, strictly numerical, and quantitative. The numerical columns should also be min-max normalized to ensure that the features are on the same range. Below are figures showing the range of the numerical columns before and after normalization.
Data for this analysis can be found here.
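A minimal sketch of the min-max normalization step, assuming hypothetical column names rather than the actual columns of the linked data:

```python
# Scale every numeric column to the [0, 1] range.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up stand-in for the real numeric columns.
df = pd.DataFrame({
    "ev_registrations": [1200, 45000, 800, 16000],
    "median_income":    [52000, 78000, 61000, 70000],
})

normalized = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)

# Every column now lies in [0, 1].
print(normalized.min())
print(normalized.max())
```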
2. Process and Analysis:
Now that the data is normalized and ready to be used for clustering, the number of clusters needs to be determined, the dimensionality possibly needs to be reduced using PCA, and then clustering can be performed. Further, the results of the clustering need to be visualized. In this part of the project, setting centroids based on the number of clusters will also be tested.
PCA was performed to determine how many components contribute to over 80% of variance in the data. Following are the results of PCA:
From the above figure, it is safe to say that 2 components contribute to more than 80% of explained variance in the data. Hence the dimensionality can be reduced from 11 to 2.
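A rough sketch of this PCA step (the 11-column input here is a random placeholder; the real analysis used the normalized project data):

```python
# Fit PCA, inspect cumulative explained variance, keep enough components
# to pass the 80% threshold, and project the data onto them.
import numpy as np
from sklearn.decomposition import PCA

X_normalized = np.random.default_rng(2).random((50, 11))   # placeholder data

pca = PCA().fit(X_normalized)
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(cumulative)

n_components = int(np.argmax(cumulative >= 0.80) + 1)      # first k reaching 80%
X_reduced = PCA(n_components=n_components).fit_transform(X_normalized)
print(n_components, X_reduced.shape)
```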
Next, the appropriate number of clusters needs to be determined. For this, the silhouette and elbow methods are used.

The silhouette method scores each point i by comparing a(i), the average distance from the point to the other points in its own cluster, with b(i), the average distance from the point to the points in the nearest other cluster: s(i) = (b(i) - a(i)) / max(a(i), b(i)). Scores close to 1 mean the point sits well inside its cluster, while negative scores mean it is probably in the wrong cluster. The results of the silhouette method are below:
From the above results, it can be seen that K-Means with 3, 4, and 5 clusters has a higher silhouette score than the average silhouette score (the red dotted line). It is important to note that K-Means with 3, 4, and 5 clusters all have observations that appear to fall in the wrong cluster; this can be seen from the cluster tails that dip below 0. From the figure above, 4 or 5 clusters look like the appropriate number. The elbow method can help either confirm or challenge this conclusion. Below are the results of using the within-cluster sum of squared errors and the elbow technique to determine the appropriate number of clusters.
The elbow method also shows that beyond 4 clusters, adding more clusters does not reduce the error significantly. Using k = 4 and the PCA-reduced data, the resulting clusters look like:
Looking at the clusters in Fig 17, where k = 4, the bottom part of the figure looks strange in that one would expect the bottom half of the figure to form a single cluster. Considering the way distances are measured for clustering, however, the results look appropriate.
Fig 18 and Fig 19 show the clusters when k = 3 and k = 5 respectively. These results suggest that k = 3 is not a good choice, while, surprisingly, K-Means with k = 5 does show more distinct and intuitive clusters. A density-based method might work better for this dataset.
3. Code for above analysis:
The code for above figures and analysis can be found here.
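Since the full code is linked above, here is only a hedged sketch of the cluster-count selection and final fit; X_reduced stands in for the 2-component PCA output, and the placeholder data, random seeds, and k range are illustrative choices:

```python
# Silhouette scores and within-cluster sum of squares (elbow curve) for a
# range of k, followed by the final K-Means fit with the chosen k.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X_reduced = np.random.default_rng(3).random((50, 2))   # placeholder PCA output

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_reduced)
    print(k,
          "silhouette:", round(silhouette_score(X_reduced, km.labels_), 3),
          "WSS:", round(km.inertia_, 3))

# Fit the final model with the chosen number of clusters (k = 4 here).
final = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_reduced)
print(final.labels_[:10])
print(final.cluster_centers_)
```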
4. Results of K-Means clustering
The results from K-Means led to unexpected conclusions. Initially, the data was believed to be divided into three categories: Low, Medium, and High numbers of electric vehicles. Clustering suggested that 4 clusters divide the data best. This should be taken with a grain of salt, since the clusters in Fig 17 are not traditional cluster patterns. However, it does provide a basis for revising the labels: perhaps instead of using Low, Medium, and High numbers of electric vehicles in each state for a given year, the labels could be Low, Slow Growth, Rapid Growth, and High.
K Means clustering – R
1. Data Prep for K-Means clustering in R:
2. Code for above analysis:
3. Results of K-Means clustering:
Results and Conclusion
Hierarchical clustering and K-Means yielded some similar results and some differing results. Some hierarchical clustering methods suggested that the data divides into 2 to 4 clusters. K-Means suggested that the best value of k is 4, supported by the elbow method and the silhouette method in Fig 14 and Fig 15 respectively. Clustering can be very useful for getting a sense of the data and finding whether any groups of observations exist. Clustering the states after PCA provided insight into the relevance of some variables. There is also room to revise the labels to four or two categories instead of three.