Support Vector Machines (SVMs) are a powerful class of machine learning algorithms that have become increasingly popular in recent years. They are widely used in fields such as image recognition, text classification, and bioinformatics. SVMs are linear classifiers that separate data into different classes using a hyperplane. In this article, we will delve into the inner workings of SVMs, focusing on their linear separation capabilities, the role of kernels, and the use of the dot product in SVMs. Additionally, we will examine the polynomial and radial basis function kernels in detail.

Introduction

SVMs are a class of supervised learning algorithms that can be used for both classification and regression tasks. They work by finding the optimal hyperplane that maximally separates the data points into different classes. This hyperplane is selected to maximize the margin between the different classes, i.e., the distance between the hyperplane and the closest data points from each class. SVMs can handle high-dimensional datasets and are particularly useful in cases where the number of features is much larger than the number of samples.

Why are they linear separators?

Fig 1: Maximum-margin hyperplane and margins for an SVM trained with samples from two classes. Samples on the margin are called the support vectors. Source: Wikipedia

SVMs are linear classifiers because they find a hyperplane that separates the data points into different classes. A hyperplane is a linear decision boundary that divides the feature space into two regions corresponding to the different classes. The equation of a hyperplane in a d-dimensional feature space is given by:

w_0 + w_1*x_1 + w_2*x_2 + … + w_d*x_d = 0

where w_0 is the bias term and w_1, w_2, …, w_d are the weights associated with each feature x_1, x_2, …, x_d. If a data point x lies on one side of the hyperplane, it is classified as belonging to one class, while if it lies on the other side, it is classified as belonging to the other class.
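To make this concrete, here is a minimal sketch (using numpy and made-up weights, not values from any particular dataset) of how the sign of w_0 + w·x determines which side of the hyperplane a point falls on:

```python
import numpy as np

# Hypothetical weights and bias for a 3-feature problem
w = np.array([0.4, -1.2, 0.7])   # w_1, ..., w_d
w0 = 0.5                          # bias term w_0

def classify(x):
    """Classify a point by which side of the hyperplane it falls on."""
    score = w0 + np.dot(w, x)
    return +1 if score >= 0 else -1

print(classify(np.array([1.0, 0.2, -0.5])))  # score 0.31 -> +1
print(classify(np.array([0.0, 2.0,  0.0])))  # score -1.9 -> -1
```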

What is a kernel and how does it work?

A kernel is a function that transforms the data points into a higher-dimensional space, where the data points can be linearly separated by a hyperplane. In other words, a kernel allows us to find a nonlinear decision boundary in the original feature space by mapping the data points into a higher-dimensional space. This is called the kernel trick, and it is a powerful technique that allows us to use linear classifiers such as SVMs to solve nonlinear classification problems.

Kernels work by computing the dot product between the transformed data points in the higher-dimensional space. The dot product is a measure of the similarity between two vectors, and it plays a critical role in the use of kernels in SVMs.

Why is the dot product critical to the use of the kernel?

The dot product between two vectors x and y is defined as:

x . y = ||x|| * ||y|| * cos(theta)

where ||x|| and ||y|| are the lengths of the vectors, and theta is the angle between them. The dot product is a measure of the similarity between two vectors, and it is equal to zero if the vectors are orthogonal (i.e., they are perpendicular to each other). The dot product is also commutative, meaning that x . y = y . x.
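As a quick illustration, the sketch below (assuming numpy) computes the dot product both component-wise and through the ||x|| * ||y|| * cos(theta) form, showing that the two definitions agree:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, -1.0, 2.0])

# Component-wise definition: sum of element-by-element products
dot_direct = np.dot(x, y)

# Geometric definition: ||x|| * ||y|| * cos(theta)
cos_theta = dot_direct / (np.linalg.norm(x) * np.linalg.norm(y))
dot_geometric = np.linalg.norm(x) * np.linalg.norm(y) * cos_theta

print(dot_direct, dot_geometric)  # both 8.0
```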

In SVMs, the dot product is used to compute the similarity between two data points in the higher-dimensional space defined by the kernel. The SVM algorithm tries to find the hyperplane that maximizes the margin between the different classes, and this margin is defined in terms of the dot product between the data points. By using a kernel function that maps the data points into a higher-dimensional space, we can find a decision boundary that is nonlinear in the original feature space, but linear in the higher-dimensional space.

Polynomial and Radial Basis Function kernels

There are many types of kernel functions that can be used in SVMs, but two of the most common ones are the polynomial kernel and the radial basis function (RBF) kernel.

Fig 2: Polynomial kernel can be used to cast data points into a higher dimensional space to make them linearly separable. Source: https://www.pycodemates.com/2022/10/svm-kernels-polynomial-kernel.html

For degree d polynomials, the polynomial kernel is defined as:

K(x1, x2) = (x1 . x2 + c)^d

where c is a constant and x1 and x2 are vectors in the original space.

The polynomial kernel implicitly maps the data points into a higher-dimensional space whose features are all monomials of the original features up to degree d. The degree d controls the complexity of the decision boundary, with higher values of d resulting in more complex boundaries. The constant c controls the influence of the lower-order terms relative to the highest-order ones.
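To see why the kernel trick works, here is a minimal sketch for d = 2 (assuming numpy; the explicit feature map phi is written out only for illustration). The kernel value (x1 . x2 + c)^d equals the dot product of the explicitly mapped points, so the higher-dimensional mapping never has to be computed in practice:

```python
import numpy as np

def poly_kernel(x1, x2, c=1.0, d=2):
    """Polynomial kernel: K(x1, x2) = (x1 . x2 + c)^d."""
    return (np.dot(x1, x2) + c) ** d

def phi(x, c=1.0):
    """Explicit degree-2 feature map for a 2D point (for illustration only)."""
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1,
                     np.sqrt(2 * c) * x2,
                     c])

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

# The kernel value equals the dot product of the explicitly mapped points
print(poly_kernel(a, b))        # (1*3 + 2*(-1) + 1)^2 = 4.0
print(np.dot(phi(a), phi(b)))   # also 4.0
```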

The RBF kernel is defined as:

K(x, y) = exp(-gamma * ||x – y||^2)

where gamma is a parameter that controls the width of the kernel. The RBF kernel corresponds to an implicit mapping into an infinite-dimensional space, and the similarity between two points decays with their squared distance according to a Gaussian function. The RBF kernel is a popular choice for SVMs because it is capable of modeling highly nonlinear decision boundaries.
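A minimal sketch of the RBF kernel (assuming numpy and an arbitrary gamma of 0.5): identical points have similarity 1, and the similarity decays toward 0 as the squared distance between the points grows.

```python
import numpy as np

def rbf_kernel(x, y, gamma=0.5):
    """RBF kernel: K(x, y) = exp(-gamma * ||x - y||^2)."""
    diff = x - y
    return np.exp(-gamma * np.dot(diff, diff))

a = np.array([1.0, 2.0])
b = np.array([2.0, 0.0])

print(rbf_kernel(a, a))  # 1.0  -> identical points are maximally similar
print(rbf_kernel(a, b))  # ~0.082 -> similarity decays with squared distance
```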

Fig 3: Radial kernels are curved in nature. Source: Gogas, Periklis & Papadimitriou, Theophilos. (2015). Emerging Methodologies in Economics and Finance. 10.13140/RG.2.1.3426.1849.

Casting 2D points using a kernel function
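As an illustration of this idea, the sketch below (assuming numpy and matplotlib, with synthetic data rather than the dataset used later in this post) casts 2D points that are not linearly separable, points inside versus outside a circle, into 3D with a degree-2 feature map, where a plane can separate the two classes:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic example: points inside a circle vs. outside are not linearly
# separable in 2D, but become separable after casting into 3D.
rng = np.random.default_rng(0)
points = rng.uniform(-2, 2, size=(200, 2))
labels = (points[:, 0]**2 + points[:, 1]**2 > 1.5).astype(int)

# Degree-2 polynomial-style map: (x, y) -> (x^2, sqrt(2)*x*y, y^2)
x, y = points[:, 0], points[:, 1]
cast = np.column_stack([x**2, np.sqrt(2) * x * y, y**2])

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(cast[:, 0], cast[:, 1], cast[:, 2], c=labels)
ax.set_xlabel("x^2"); ax.set_ylabel("sqrt(2)*x*y"); ax.set_zlabel("y^2")
plt.show()
```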

In conclusion, SVMs are a powerful class of machine learning algorithms that can be used for both classification and regression tasks. They are linear classifiers that can separate data into different classes using a hyperplane, and they are particularly useful in cases where the number of features is much larger than the number of samples. Kernels are a critical component of SVMs, as they allow us to find a nonlinear decision boundary in the original feature space by mapping the data points into a higher-dimensional space. The polynomial and RBF kernels are two of the most commonly used kernel functions in SVMs, and they can model highly nonlinear decision boundaries.

Data Prep

From previous classification models it was learned that the features were not great representatives of the label. Hence, the features were modified to give a better idea of the growth of electric vehicles. Another major change made to the data was including the year as a feature in this analysis. The label was also adjusted for this analysis. The data used in this analysis can be found here.

If you want to skip to the code for the below analysis, the code can be found here.

The data was split into training and testing sets which looked like:

Fig 4: Training features
Fig 5: Training label
Fig 6: Testing labels
Fig 7: Testing features

The above datasets were used to fit various SVM models. SVM models can be linear, polynomial, or radial. The regularization parameter is also adjusted for each type of SVM model.
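The exact preprocessing and model code is in the linked repository; the sketch below (using scikit-learn, where X and y are placeholders for the prepared features and label described above) illustrates how linear, polynomial, and RBF SVMs with different regularization values might be fit and scored:

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# X and y stand in for the modified EV features and label described above;
# the real preprocessing lives in the linked code.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "linear, c=30":   SVC(kernel="linear", C=30),
    "poly d=2, c=50": SVC(kernel="poly", degree=2, C=50),
    "rbf, c=20":      SVC(kernel="rbf", C=20),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: accuracy = {acc:.2f}")
```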

Linear SVMs

Fig 8: Linear SVM with c = 10, accuracy = 98%
Fig 9: Linear SVM with c = 20, accuracy = 98%
Fig 10: Linear SVM with c = 30, accuracy = 99%

The linear SVM does a fantastic job classifying the data. The linear SVM with c = 30 has the best performance out of all the linear SVMs.

Polynomial SVM

Fig 11: Polynomial SVM with d=2, c=2, accuracy = 85%
Fig 12: Polynomial SVM with d = 2, c = 20, accuracy = 86%
Fig 13: Polynomial SVM with d = 2, c = 50, accuracy = 87%

The polynomial SVMs with degree = 2 provided only average accuracy. Using a polynomial kernel for this dataset is unwise; keep it simple if possible.

Radial Basis Function SVM

Fig 14: RBF kernel with c = 1, accuracy = 94%
Fig 15: RBF kernel with c = 20, accuracy = 94%
Fig 16: RBF kernel with c = 200, accuracy = 94%

RBF kernels yield good accuracy, but still lower than the accuracy achieved with the linear kernel.

Using PCA, it was determined that the percentage of the population with a bachelor's degree and the year contribute the most variance in the data, as can be seen below in Fig 17. Using only these two features and sampling 30 data points from the data, the support vectors for a linear model fit on the sampled data can be seen in Fig 18.

Fig 17: Cumulative explained variance contributed by the features in the dataset
Fig 18: Visualization of the sampled data: members of one class (red), the other class (yellow), and the support vectors (blue)
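A sketch of how such an analysis might be done with scikit-learn (the column names bachelors_pct and year are assumed placeholders, and X_train / y_train are assumed to be the pandas DataFrame and Series from the split above):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# Cumulative explained variance of each principal component
pca = PCA().fit(X_train)
print(np.cumsum(pca.explained_variance_ratio_))

# Keep the two most informative features and sample 30 points
X_two = X_train[["bachelors_pct", "year"]]   # assumed column names
sample = X_two.sample(30, random_state=0)
labels = y_train.loc[sample.index]

# Fit a linear SVM on the sample and highlight its support vectors
svm = SVC(kernel="linear", C=30).fit(sample, labels)
plt.scatter(sample.iloc[:, 0], sample.iloc[:, 1], c=labels)
plt.scatter(*svm.support_vectors_.T, facecolors="none", edgecolors="b", s=120)
plt.xlabel("bachelors_pct"); plt.ylabel("year")
plt.show()
```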

For this dataset and analysis, the linear kernel provided the best results. SVM provided the highest classification accuracy of all the models used.

Results and conclusions

SVM yielded great results; it would be interesting to see the predictions of SVM applied to the predicted features for 2021, 2022, and 2023. A linear model achieved a staggering accuracy of 99%. However, it must be noted that the dataset is small, and such extremely high accuracies may simply reflect how small the test set is.