Support Vector Machines

Classification of Incarceration using Support Vector Machines 


A Comprehensive Intuition of SVM

Support Vector Machines (SVMs) are a powerful machine learning algorithm used for classification and regression analysis. Developed in the 1990s by Vladimir Vapnik and his colleagues, SVMs are used to solve a wide range of real-world problems in fields including bioinformatics, image recognition, and text classification.

In practice, the SVM algorithm is implemented with a kernel. The hyperplane in a linear SVM is learned by transforming the optimization problem using linear algebra; the details of that transformation are beyond the scope of this introduction. The key observation is that the linear SVM can be reformulated so that it depends only on the inner products between pairs of observations, rather than on the observations themselves. The inner product of two vectors is computed by summing the products of each corresponding pair of input values.
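
As a sketch of that reformulation (following the notation of ISLR), the linear SVM decision function for a new point $x$ can be written entirely in terms of inner products with the training observations:

$$ f(x) = \beta_0 + \sum_{i=1}^{n} \alpha_i \langle x, x_i \rangle, \qquad \text{where} \quad \langle x_i, x_{i'} \rangle = \sum_{j=1}^{p} x_{ij}\, x_{i'j}. $$

The coefficient $\alpha_i$ is nonzero only for the support vectors, so the sum effectively involves only a small subset of the training data.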


Idea of Separation

SVMs are based on the idea of finding a hyperplane that best separates data into different classes. The algorithm aims to find the hyperplane that maximizes the margin between the two classes, i.e., the distance between the hyperplane and the closest data points in each class. Combined with the extensions described below, this approach also makes SVMs effective on datasets that are not linearly separable.

SVMs can also be extended to handle non-linearly separable data by using kernel functions that map the input data into a higher-dimensional feature space, where it is more likely to be linearly separable. This technique is known as the kernel trick and has made SVMs one of the most versatile and widely used machine learning algorithms.
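
The following is a minimal sketch of the kernel trick in scikit-learn; the dataset, parameters, and variable names are illustrative assumptions rather than anything from this page. It compares a linear kernel with an RBF kernel on data that no straight line can separate.

```python
# Kernel trick sketch: concentric circles cannot be separated by a straight line,
# but an RBF kernel implicitly maps them into a space where a linear separator exists.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=500, factor=0.3, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    print(kernel, "test accuracy:", round(clf.score(X_test, y_test), 3))

# Expected behaviour: the linear kernel scores near chance, the RBF kernel near 1.0.
```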

Margin separating two classes in feature space

Margin

In the context of Support Vector Machines, the margin is the distance between the separating hyperplane and the nearest data points from each class. The goal of the SVM is to find the hyperplane that maximizes this margin. The margin matters because it governs the classifier's generalization performance: a wider margin makes the classifier less likely to overfit the training data and therefore more likely to generalize well to new data. The data points that lie closest to the hyperplane, and that determine its position and orientation, are called support vectors. The SVM finds the optimal hyperplane by solving an optimization problem that maximizes the margin while keeping the classification error small, subject to a set of constraints. This is a convex optimization problem, and many algorithms have been developed to solve it efficiently.
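
For reference, the support vector classifier can be written as the following convex optimization problem (the standard soft-margin formulation, in the notation of ISLR, from which this page's equations are taken):

$$
\begin{aligned}
& \underset{\beta_0, \ldots, \beta_p,\; \epsilon_1, \ldots, \epsilon_n,\; M}{\text{maximize}} \;\; M \\
& \text{subject to} \;\; \sum_{j=1}^{p} \beta_j^2 = 1, \\
& \;\; y_i \bigl( \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} \bigr) \ge M (1 - \epsilon_i), \\
& \;\; \epsilon_i \ge 0, \;\; \sum_{i=1}^{n} \epsilon_i \le C,
\end{aligned}
$$

where $M$ is the width of the margin, the slack variables $\epsilon_i$ allow individual observations to be on the wrong side of the margin, and $C$ is the budget for such violations (the tuning parameter discussed later on this page).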

The Kernels

Linear Kernel

The linear kernel is a type of kernel function that is used in SVM. It is often used for linearly separable data, which means that the two classes can be separated by a straight line or hyperplane.

The linear kernel computes the dot product between the feature vectors of the input data points. In other words, it measures how similar two data points are based on the angle between their feature vectors: if the angle is small, the vectors point in nearly the same direction and the points are treated as very similar; if the angle is close to 90 degrees, the vectors are nearly orthogonal and the points are treated as very different.

$$ K(x_i, x_{i'}) = \langle x_i, x_{i'} \rangle = \sum_{j=1}^{p} x_{ij}\, x_{i'j}, $$

where $x_{ij}$ and $x_{i'j}$ are the j-th features of the two data points $x_i$ and $x_{i'}$. The dot product is simply the sum of the products of the corresponding elements of the two vectors.

Because it works directly with these dot products, the linear kernel does not map the data into a higher-dimensional space: the decision boundary it produces is a hyperplane in the original feature space. This works well for simple classification problems where the data is (at least approximately) linearly separable, but it may not work well for more complex problems where the data is nonlinearly separable. In those cases, other types of kernel functions, such as the polynomial kernel or the RBF kernel, may be more appropriate.
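
As a quick, illustrative check (not part of the original page), the linear kernel matrix in scikit-learn is exactly the matrix of pairwise dot products:

```python
# Verify that the linear kernel is just the pairwise dot product.
import numpy as np
from sklearn.metrics.pairwise import linear_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))             # 5 points, 3 features each

K_manual = X @ X.T                      # dot product of every pair of rows
K_sklearn = linear_kernel(X)            # scikit-learn's linear kernel
print(np.allclose(K_manual, K_sklearn)) # True
```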

The graph depicts a support vector classifier with a linear boundary, which consequently performs very poorly.

Linear kernel

Polynomial Kernel

The polynomial kernel is a type of kernel function that is used in SVM. It is often used for data that is not linearly separable, meaning that the classes cannot be separated by a straight line or hyperplane in the original feature space.

The polynomial kernel computes the similarity between the feature vectors of two data points by raising their dot product to a certain power. In other words, it maps the data to a higher-dimensional space where it is easier to find a decision boundary that separates the classes.

$$ K(x_i, x_{i'}) = \left( c + \sum_{j=1}^{p} x_{ij}\, x_{i'j} \right)^{d}, $$

where $x_{ij}$ and $x_{i'j}$ are the j-th features of the two data points, c is a constant, and d is the degree of the polynomial. The dot product is the sum of the products of the corresponding elements of the two vectors, as in the linear kernel. The constant c and the degree d are hyperparameters that can be adjusted to control the complexity of the decision boundary.

The polynomial kernel can transform the data into a higher-dimensional space, where it is easier to find a decision boundary that separates the classes. The degree d determines the degree of nonlinearity of the decision boundary, and the constant c determines the offset of the decision boundary from the origin.

The polynomial kernel can be useful for classification problems where the data is not linearly separable, but it may not work well for very high-dimensional data or for data with many outliers. In these cases, other types of kernel functions, such as the RBF kernel, may be more appropriate.
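
Below is a minimal sketch of fitting a degree-3 polynomial kernel with scikit-learn; the dataset and parameter values are illustrative assumptions. The `degree` argument plays the role of d and `coef0` the role of c in the formula above.

```python
# Polynomial-kernel SVM on data that a straight line cannot separate.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.2, random_state=0)

clf = SVC(kernel="poly", degree=3, coef0=1.0, C=1.0).fit(X, y)
print("training accuracy:", round(clf.score(X, y), 3))
print("support vectors per class:", clf.n_support_)
```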

The graph depicts an SVM with a polynomial kernel of degree 3 applied to the non-linear data from the previous figure.

Radial Basis Function as Kernel

The RBF (Radial Basis Function) kernel is a type of kernel function that is used in SVM. It is often used for data that is not linearly separable, meaning that the classes cannot be separated by a straight line or hyperplane in the original feature space.

The RBF kernel computes the similarity between the feature vectors of two data points based on their distance from each other. In other words, it measures how close the data points are to each other in the feature space. The RBF kernel is defined as:

$$ K(x_i, x_{i'}) = \exp\!\left( -\gamma \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 \right) = \exp\!\left( -\gamma \, \lVert x_i - x_{i'} \rVert^2 \right), $$

where $x_i$ and $x_{i'}$ are the feature vectors of two data points, $\lVert x_i - x_{i'} \rVert$ is the Euclidean distance between them, and gamma ($\gamma$) is a hyperparameter that determines the width of the kernel.

The RBF kernel can transform the data into a higher-dimensional space, where it is easier to find a decision boundary that separates the classes. The hyperparameter gamma determines the width of the kernel, which affects the smoothness of the decision boundary. A small gamma will result in a smoother decision boundary, while a large gamma will result in a more complex decision boundary that is sensitive to noise and outliers.

The RBF kernel is a popular choice for SVM because it is flexible and can work well for a wide range of classification problems. However, it can be computationally expensive for large datasets or high-dimensional data, as it requires computing the distance between all pairs of data points. In these cases, other types of kernel functions, such as the linear or polynomial kernel, may be more appropriate.
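
One way to see the effect of gamma is to compare a few values on the same data, as in the sketch below (the dataset and values are assumptions chosen only for illustration):

```python
# Effect of gamma on an RBF-kernel SVM, measured by cross-validated accuracy.
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.3, random_state=1)

for gamma in (0.1, 1.0, 100.0):
    scores = cross_val_score(SVC(kernel="rbf", gamma=gamma), X, y, cv=5)
    # Very large gamma tends to overfit: the boundary gets wiggly and CV accuracy drops.
    print(f"gamma={gamma:>6}: CV accuracy = {scores.mean():.3f}")
```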

The graph depicts an SVM with a radial kernel.

Radial Basis Function Kernel

Play with Kernels 

More such examples will be placed under the code page.

If the Sandbox Simulator above shows that it is under maintenance, here is a preview of what exactly it does.

The Tuning Parameter C

In practice, C is treated as a tuning parameter that is generally chosen via cross-validation. As with the tuning parameters that we have seen throughout this book, C controls the bias-variance trade-off of the statistical learning technique. When C is small, we seek narrow margins that are rarely violated; this amounts to a classifier that is highly fit to the data, which may have low bias but high variance. On the other hand, when C is larger, the margin is wider and we allow more violations to it; this amounts to fitting the data less hard and obtaining a classifier that is potentially more biased but may have lower variance.
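
A sketch of choosing C by cross-validation with scikit-learn follows; the data and the grid of values are illustrative assumptions. Note that scikit-learn's C parameter is a penalty on margin violations, so it behaves roughly as the inverse of the "budget" C described in the passage above: large values fit the training data harder.

```python
# Choosing the regularization parameter C by cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

grid = GridSearchCV(
    SVC(kernel="linear"),
    param_grid={"C": [0.01, 0.1, 1, 10, 100]},  # candidate values of C
    cv=5,
)
grid.fit(X, y)
print("best C:", grid.best_params_["C"])
print("cross-validated accuracy:", round(grid.best_score_, 3))
```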

The optimization problem has a very interesting property: it turns out that only observations that either lie on the margin or that violate the margin will affect the hyperplane, and hence the classifier obtained. In other words, an observation that lies strictly on the correct side of the margin does not affect the support vector classifier! Changing the position of that observation would not change the classifier at all, provided that its position remains on the correct side of the margin. Observations that lie directly on the margin, or on the wrong side of the margin for their class, are known as support vectors. These observations do affect the support vector classifier.

The fact that only support vectors affect the classifier is in line with our previous assertion that C controls the bias-variance trade-off of the support vector classifier. When the tuning parameter C is large, then the margin is wide, many observations violate the margin, and so there are many support vectors.

Small C value (left) and large C value (right). Plots by Cross Validated.

The Significance of the Dot Product

The dot product is a fundamental operation in SVM kernel functions, as it forms the basis for measuring the similarity between pairs of data points. In SVM, the kernel function is used to transform the input data into a higher-dimensional space, where it is easier to find a linear or nonlinear decision boundary that separates the classes. The dot product between the feature vectors of two data points is used to measure their similarity in the transformed space.

The dot product is used directly in the linear kernel, which is the simplest type of kernel function in SVM. The linear kernel computes the dot product between the feature vectors of two data points in the original feature space, which amounts to using the original features without any transformation. The linear kernel is often used for linearly separable data, where the classes can be separated by a straight line or hyperplane.

In other kernel functions, such as the polynomial and RBF kernels, the dot product (or the related Euclidean distance) is still the basic ingredient for measuring similarity, but the result corresponds to similarity in an implicit higher-dimensional space. The polynomial kernel raises the dot product (plus a constant) to a certain power, while the RBF kernel is a function of the distance between pairs of data points. This is what allows SVM to find nonlinear decision boundaries, which can improve classification accuracy for data that is not linearly separable.
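
One way to make this concrete (a sketch that assumes scikit-learn is being used) is to build a kernel matrix from dot products by hand and pass it to the SVM as a precomputed kernel:

```python
# Passing a hand-built kernel matrix (built from dot products) to an SVM.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Polynomial kernel of degree 2 built directly from the dot-product (Gram) matrix.
gram_train = (1.0 + X @ X.T) ** 2
clf = SVC(kernel="precomputed").fit(gram_train, y)

# Predicting on new points also needs only their dot products with the training points.
gram_new = (1.0 + X[:10] @ X.T) ** 2
print(clf.predict(gram_new))
```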

Source Code

Attributions: The plots and the kernel and tuning-parameter equations are from the book An Introduction to Statistical Learning with Applications in R by Gareth James et al.