Clustering

Comprehensive Review of Clustering & Applications

Clustering is a machine learning approach that groups related data points into clusters, or segments, based on their traits or properties. As an unsupervised learning technique, it does not depend on labeled data.

The purpose of clustering a dataset is to divide it into informative, logical subgroups, or clusters. These clusters are based on the similarity of the data points within them: points in the same cluster are more similar to one another than to points in other clusters.

An Assortment of Clustering Techniques 

Partitional Clustering: Partitional clustering is an unsupervised learning method that divides a set of data points into non-overlapping groups. The goal is to group similar data points together and keep dissimilar data points in separate clusters. This is typically accomplished by iteratively optimizing an objective function, for example, minimizing the sum of squared distances between data points and their cluster centroids. K-means and its variants are popular partitional clustering methods used in a wide range of applications.
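The objective mentioned above can be made concrete. As a minimal sketch (the function name `inertia` and the toy values are illustrative, not from the original), the within-cluster sum of squared distances that k-means minimizes can be computed as:

```python
import numpy as np

def inertia(X, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(((X[labels == j] - c) ** 2).sum()
               for j, c in enumerate(centroids))

# Two points assigned to one centroid at their midpoint:
X = np.array([[0.0, 0.0], [2.0, 0.0]])
print(inertia(X, np.array([0, 0]), np.array([[1.0, 0.0]])))  # 2.0
```

A smaller inertia means points sit closer to their centroids; partitional methods search for assignments and centroids that drive this value down.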

Hierarchical Clustering: This method organizes clusters into a hierarchy, built either agglomeratively (bottom-up) or divisively (top-down). In agglomerative clustering, each data point begins as its own cluster, and the most similar clusters are repeatedly merged until all of the data points belong to a single cluster. In divisive clustering, all of the data points start in a single cluster, which is then recursively split into smaller clusters.
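The agglomerative (bottom-up) merging described above can be sketched in a few lines. This is a naive, illustrative implementation using single linkage (distance between the closest members of two clusters), stopped once a desired number of clusters remains; the function name and parameters are my own choices, not from the original.

```python
import numpy as np

def agglomerative(X, n_clusters):
    """Bottom-up: start with singleton clusters, repeatedly merge the
    two closest clusters (single linkage) until n_clusters remain."""
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > n_clusters:
        best, best_d = (0, 1), np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best_d:
                    best_d, best = d, (a, b)
        a, b = best
        clusters[a].extend(clusters.pop(b))  # merge the closest pair
    return clusters

# Two tight pairs on a line merge into two clusters:
X = np.array([[0.0], [0.1], [5.0], [5.1]])
print(agglomerative(X, 2))  # [[0, 1], [2, 3]]
```

Running the merge loop all the way down to a single cluster, and recording each merge distance, yields the full hierarchy (dendrogram) that hierarchical methods are known for.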

Density-Based Clustering: This type of clustering groups data points based on their density. DBSCAN, the most popular density-based clustering technique, clusters points that lie close to one another and separates out points that are far apart.

Fuzzy Clustering: This kind of clustering allows data points to have varying degrees of membership in multiple clusters. Fuzzy C-Means is the most widely used fuzzy clustering method.
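The "degrees of membership" idea can be illustrated with a minimal Fuzzy C-Means sketch: each point receives a membership in [0, 1] for every cluster, and its memberships sum to 1. For reproducibility this sketch uses a deterministic farthest-point initialization of the centroids rather than a random one; the function name and parameter defaults are illustrative assumptions.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100):
    """Fuzzy C-Means: soft memberships u (n_points x c) and centroids."""
    # Deterministic farthest-point initialization (an implementation
    # choice for this sketch; random initialization is also common).
    centroids = [X[0]]
    for _ in range(c - 1):
        d2 = np.min([((X - v) ** 2).sum(axis=1) for v in centroids], axis=0)
        centroids.append(X[d2.argmax()])
    centroids = np.array(centroids)
    for _ in range(n_iter):
        # Membership update: u_ik proportional to d_ik^(-2/(m-1)).
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        d = np.fmax(d, 1e-10)                 # avoid division by zero
        inv = d ** (-2.0 / (m - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True)
        # Centroid update: weighted mean with weights u^m.
        w = u ** m
        centroids = (w.T @ X) / w.sum(axis=0)[:, None]
    return u, centroids
```

The fuzzifier `m` controls how soft the memberships are: values near 1 approach hard (k-means-like) assignments, while larger values blur the cluster boundaries.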

Model-Based Clustering: This type of clustering assumes that the data points were generated by a statistical model, and it searches for the model parameters that best fit the data. The Gaussian Mixture Model is the most commonly used model-based clustering algorithm.
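Gaussian Mixture Models are usually fit with the Expectation-Maximization (EM) algorithm. The following is a hedged, simplified sketch assuming spherical Gaussians (one scalar variance per component) and a deterministic farthest-point initialization of the means; a full implementation would handle general covariance matrices.

```python
import numpy as np

def gmm_em(X, k, n_iter=50):
    """EM for a mixture of spherical Gaussians; returns hard labels, means."""
    n, d = X.shape
    # Farthest-point initialization of the means (sketch choice;
    # random initialization is also common).
    mu = [X[0]]
    for _ in range(k - 1):
        d2 = np.min([((X - m) ** 2).sum(axis=1) for m in mu], axis=0)
        mu.append(X[d2.argmax()])
    mu = np.array(mu)
    var = np.full(k, X.var())        # shared initial variance estimate
    pi = np.full(k, 1.0 / k)         # uniform initial mixing weights
    for _ in range(n_iter):
        # E-step: responsibilities = posterior probability of each component.
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        log_p = np.log(pi) - 0.5 * d * np.log(2 * np.pi * var) - sq / (2 * var)
        log_p -= log_p.max(axis=1, keepdims=True)   # numerical stability
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixing weights, means, and variances.
        nk = r.sum(axis=0)
        pi = nk / n
        mu = (r.T @ X) / nk[:, None]
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        var = np.fmax((r * sq).sum(axis=0) / (d * nk), 1e-6)
    return r.argmax(axis=1), mu
```

Unlike k-means, the E-step produces soft assignments (responsibilities); taking the argmax at the end converts them to hard cluster labels.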

Each type of clustering algorithm has its strengths and weaknesses and is better suited to different types of data and clustering problems. The choice of clustering algorithm depends on the data characteristics and the specific problem at hand.

Clustering 101 - From Google Developers - Centroid and Density Based Clusters

Centroid-based Clustering

Centroid-based clustering is a type of clustering algorithm that partitions the data into non-hierarchical clusters based on the similarity between the data points. The goal of centroid-based clustering is to group the data points into k clusters, where k is a pre-defined number of clusters.

The most widely used centroid-based clustering algorithm is k-means. In the k-means algorithm, k initial centroids are chosen (often at random), and each data point is assigned to its nearest centroid. The algorithm then iteratively updates the centroids to minimize the sum of squared distances between the data points and their assigned centroids, reassigning points as the centroids move. This process continues until the cluster assignments converge and no longer change.
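The assign-then-update loop described above can be sketched as follows. For reproducibility, this sketch uses a deterministic farthest-point initialization rather than the random initialization described in the text; the function name and defaults are illustrative.

```python
import numpy as np

def kmeans(X, k, n_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    # Farthest-point initialization (sketch choice; random init is common).
    centroids = [X[0]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(X[d2.argmax()])
    centroids = np.array(centroids)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its points.
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # assignments have converged
        centroids = new_centroids
    return labels, centroids
```

Each iteration can only decrease (or leave unchanged) the sum of squared distances, which is why the loop is guaranteed to converge to a local optimum.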

Centroid-based algorithms are efficient because they do not require storing a hierarchy of clusters, unlike hierarchical clustering. However, they can be sensitive to initial conditions and outliers. The choice of the initial cluster centroids can impact the final clustering result, and the presence of outliers can distort the centroid positions and hence the cluster boundaries.

Despite these limitations, k-means is a popular and effective clustering algorithm because it is efficient and can handle large datasets. It is also relatively simple to implement and has a clear objective function, making it easy to interpret and evaluate the clustering results.


Example of centroid-based clustering.

Density-based Clustering

Density-based clustering groups data points based on their closeness and density. Its fundamental premise is that clusters are regions of high density separated by regions of low density. This permits clusters of arbitrary shape, as long as the dense regions can be connected.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is the most popular density-based clustering method. In DBSCAN, a cluster is a collection of closely packed data points; points that do not belong to any cluster are treated as outliers, or noise.
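A minimal DBSCAN sketch makes the idea concrete: points with at least `min_pts` neighbors within radius `eps` are "core" points, clusters grow outward from core points, and everything left over is labeled noise (-1). This naive version computes all pairwise distances, so it is illustrative rather than efficient; the function name and parameters are assumptions of this sketch.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Label each point with a cluster id; -1 marks noise."""
    n = len(X)
    labels = np.full(n, -1)              # -1 = noise until proven otherwise
    visited = np.zeros(n, dtype=bool)
    # Neighbors within eps (a point counts as its own neighbor here).
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    cluster = -1
    for i in range(n):
        if visited[i] or len(neighbors[i]) < min_pts:
            continue                     # skip expanded or non-core points
        cluster += 1                     # grow a new cluster from core point i
        visited[i] = True
        labels[i] = cluster
        queue = [i]
        while queue:
            j = queue.pop()
            for q in neighbors[j]:
                if labels[q] == -1:
                    labels[q] = cluster  # border/core point joins the cluster
                if not visited[q] and len(neighbors[q]) >= min_pts:
                    visited[q] = True
                    queue.append(q)      # only core points keep expanding
    return labels
```

Because clusters grow by chaining dense neighborhoods, the resulting clusters can take any shape, which is exactly the property the text highlights.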

Density-based clustering techniques do have drawbacks, however. A significant limitation is that they can struggle with data of varying density: the algorithm may fail to recognize clusters in regions where the point density is very low or very high. High-dimensional data also poses difficulties, since the concept of density becomes less meaningful as the number of dimensions grows.

Another characteristic of density-based clustering is that these techniques are not designed to assign outliers to clusters; outliers are instead flagged as isolated noise points that belong to no cluster. This means density-based clustering may not be suitable for applications that require every data point to be assigned to a cluster.


Example of Density-based clustering.