KMeans Clustering

The K-Means Algorithm

K-means is a clustering algorithm used in unsupervised machine learning. Clustering is a type of exploratory data analysis that involves grouping similar data points together based on some similarity metric. The goal of K-means is to partition a given set of data points into K clusters, where K is a pre-specified number of clusters that the user wants to identify in the data.

The K-means algorithm works as follows:

Step 1: Initialization

The algorithm starts by randomly selecting K initial centroids from the feature space of the data. The initial centroids can be selected in various ways, such as uniformly at random or based on some prior knowledge of the data.

Step 2: Assignment

Each data point is assigned to the nearest centroid based on a distance metric such as Euclidean distance. The distance metric is a measure of dissimilarity between two data points. Euclidean distance is a popular choice for continuous data, but other distance metrics such as Manhattan distance or cosine similarity may be more appropriate for different types of data.

After each data point is assigned to its nearest centroid, K clusters of data points are formed.

Step 3: Recalculation

The mean of each cluster is calculated, which becomes the new centroid for that cluster. This recalculation step moves the centroid of each cluster to the center of the data points in that cluster.

Step 4: Convergence

The algorithm repeats steps 2 and 3 until the assignment of data points to centroids no longer changes, or a maximum number of iterations is reached. The algorithm is said to have converged when the assignment of data points to clusters no longer changes between iterations. This means that each data point is assigned to the same cluster in consecutive iterations.
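The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration of the algorithm, not a production implementation (no k-means++ seeding, no restarts):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain K-means: initialize, assign, recalculate, repeat until stable."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Step 1: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: converged once the assignments stop changing
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids
```

In practice one would use `sklearn.cluster.KMeans`, which adds k-means++ initialization and multiple restarts; this sketch only mirrors the four steps described above.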


KMeans Convergence

Evaluating the Quality of Clustering

Once the K-means algorithm has converged, the quality of the clustering solution can be evaluated using various metrics, such as the sum of squared distances between each point and its assigned centroid. This metric is called the within-cluster sum of squares (WCSS). The lower the WCSS, the better the clustering solution. However, it's important to note that a low WCSS does not necessarily mean a good clustering solution, as the algorithm may have converged to a local minimum rather than the global minimum.

Limitations of K-means

K-means is a simple and computationally efficient algorithm, but it does have some limitations. One limitation is that it is sensitive to the initial choice of centroids. If the initial centroids are chosen poorly, the algorithm may converge to a suboptimal clustering solution. Another limitation is that K-means tends to converge to local optima rather than the global optimum, which may not be the best clustering solution. Various modifications and extensions to the algorithm have been proposed to address these issues, such as k-means++, which uses a smarter initialization scheme to choose the initial centroids.
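The sensitivity to initialization and the local-minimum issue can be mitigated in scikit-learn by combining k-means++ seeding with multiple restarts. A minimal sketch on synthetic data (the three-blob dataset here is illustrative, not from the Chicago data):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated synthetic blobs
X = np.vstack([rng.normal(c, 0.3, (50, 2)) for c in (0.0, 4.0, 8.0)])

# k-means++ spreads the initial centroids apart before Lloyd's iterations run;
# n_init restarts keep the run with the lowest WCSS (inertia_)
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)

print(km.inertia_)  # WCSS: sum of squared distances to assigned centroids
```

A lower `inertia_` across restarts is what guards against converging to a poor local minimum, though, as noted above, a low WCSS alone does not guarantee a meaningful clustering.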


Leveraging KMeans for Chicago Crime

The Chicago crime dataset covers the years 2001 through 2021 and records detailed information about criminal incidents in the city, including the type of crime, when and where it occurred, and more. K-means clustering, an unsupervised machine learning technique, can organize this data into coherent groups of similar records. Applying K-means clustering to the Chicago crime dataset may help authorities identify trends and patterns that can guide their work.

Loading and preprocessing the data is the first stage in applying K-means clustering to the Chicago crime dataset. To prepare the data for clustering, first import it into a pandas DataFrame. This may require cleaning steps such as removing duplicates and missing values, converting categorical variables into numerical ones, and scaling the features appropriately.
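The preprocessing steps might look like the following sketch. The column names (`Primary Type`, `Latitude`, `Longitude`) follow the public Chicago crime export, and a tiny inline DataFrame stands in for the real file, which would normally be loaded with `pd.read_csv("chicago_crimes.csv")`:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Tiny stand-in for the real dataset; in practice: pd.read_csv("chicago_crimes.csv")
df = pd.DataFrame({
    "Primary Type": ["THEFT", "BATTERY", "THEFT", "THEFT", None],
    "Latitude":     [41.88,   41.76,     41.88,   41.90,   41.80],
    "Longitude":    [-87.63,  -87.62,    -87.63,  -87.65,  -87.60],
})

df = df.drop_duplicates()                 # remove exact duplicate rows
df = df.dropna(subset=["Primary Type"])   # drop rows missing key fields

# Convert the categorical crime type into a numeric code
df["crime_code"] = df["Primary Type"].astype("category").cat.codes

# Put the clustering features on a comparable scale
features = ["Latitude", "Longitude", "crime_code"]
X = StandardScaler().fit_transform(df[features])
```

`StandardScaler` centers each feature at zero with unit variance, so no single column (e.g. the large-magnitude coordinates) dominates the Euclidean distances.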

Choosing the number of clusters is the next step. Techniques such as the elbow method and silhouette analysis can be used to narrow down the best number of clusters. With the elbow method, one plots the within-cluster sum of squares (WCSS) against the number of clusters and chooses the point at which the WCSS begins to decrease more slowly.
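The elbow curve can be computed by fitting K-means for a range of K values and recording the WCSS (scikit-learn's `inertia_`) of each fit; the data here is synthetic for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic data with three well-separated groups
X = np.vstack([rng.normal(c, 0.3, (40, 2)) for c in (0.0, 3.0, 6.0)])

# WCSS (inertia_) for each candidate K; plotting wcss vs. ks shows the elbow
ks = list(range(1, 9))
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        for k in ks]

for k, w in zip(ks, wcss):
    print(f"K={k}: WCSS={w:.1f}")
```

Plotting `wcss` against `ks` (e.g. with `matplotlib`) should show a sharp drop up to K=3 and only marginal improvement afterwards, which is the "elbow".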

After the number of clusters has been decided, a distance metric must be selected to determine how closely individual data points relate to cluster centers. Euclidean distance is the most common choice, but alternative measures such as Manhattan distance, cosine similarity, or Mahalanobis distance may be more appropriate for your data.

K-means clustering may then be applied to the cleaned data with the chosen number of clusters and distance metric. The scikit-learn package offers a Python implementation of K-means. In addition to the maximum number of iterations and the convergence tolerance, a variety of other hyperparameters can be set.
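A sketch of the scikit-learn call with its main hyperparameters spelled out; the random array stands in for the preprocessed feature matrix:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_scaled = rng.normal(size=(300, 3))  # stand-in for the scaled features

km = KMeans(
    n_clusters=5,      # chosen via the elbow or silhouette analysis
    init="k-means++",  # smarter seeding (the scikit-learn default)
    n_init=10,         # restarts; the run with the lowest WCSS is kept
    max_iter=300,      # cap on assignment/recalculation iterations
    tol=1e-4,          # convergence tolerance on centroid movement
    random_state=42,   # reproducibility
)
labels = km.fit_predict(X_scaled)
print(km.cluster_centers_.shape)  # (5, 3)
```

Note that scikit-learn's `KMeans` always uses Euclidean distance internally; using a different metric requires either transforming the features beforehand or switching to an algorithm that accepts custom metrics.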

Once the data has been clustered, it is important to assess whether the resulting insights are meaningful and useful. Scatter plots or heat maps can be used to visualize the clusters, and their individual properties, such as the distribution of crime categories or the locations of crimes, can then be analyzed. The results of K-means clustering can be validated by comparing them to those of other clustering methods or against domain expertise.


Data Preparation

KMeans for Spatial Clustering: In order to use the spatial data in the crime dataset with a meaningful characterization, the demographic/socio-economic dataset is used to categorize community areas into three categories: Low Income, Moderate Income, and High Income. These serve as the ground-truth labels for the dataset to be clustered with KMeans.

Data for 2-Dimensional KMeans ( Spatial Clustering )

Demographic/Socio-Economic Data

Selected Spatial Data with Labels

KMeans Clustering on the Above Data

KMeans with Initial Clusters = 3

The run starts with 3 clusters because the ground truth for the dataset is known in advance: the labels are the demographic status categories of each community area, namely Low Income, Moderate Income, and High Income.

KMeans with Initial Clusters = 2

KMeans with Initial Clusters = 4

Silhouette Method to Find the Optimal Clusters

The Silhouette method is a popular technique for determining the optimal number of clusters in a clustering analysis. The method evaluates how well each data point fits within its assigned cluster and compares it to the fit of the same point in the nearest neighboring cluster. The Silhouette score ranges from -1 to 1, with higher scores indicating a better fit of the data point within its assigned cluster and worse fit within other clusters. A score of 0 indicates that the point is equally distant to two clusters and could belong to either of them.

To use the Silhouette method, one needs to calculate the Silhouette score for each data point in the dataset for different numbers of clusters, typically ranging from 2 to a maximum value determined by the problem domain. The average Silhouette score across all data points is then calculated for each number of clusters. The optimal number of clusters is chosen as the one with the highest average Silhouette score, indicating that the clustering solution provides the best fit of the data points within clusters and differentiation between clusters.
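The procedure described above maps directly onto scikit-learn's `silhouette_score`; this sketch uses synthetic three-blob data for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Synthetic data with three well-separated groups
X = np.vstack([rng.normal(c, 0.3, (40, 2)) for c in (0.0, 3.0, 6.0)])

# Average silhouette score across all points, for each candidate K
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)
```

The optimal K is the one maximizing the average score, here expected to be 3 since the synthetic data has three well-separated groups.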


A Partial Summary of the Above KMeans Modeling

From the comparative plots above, it is evident that the KMeans algorithm is not able to correctly group the spatial data points according to the given labels. This may be due to the distance metric used, Euclidean distance. When dealing with spatial data points such as longitude and latitude, distance is better measured with the Haversine formula. In this case, using the Silhouette method to obtain the optimal K is also not reliable: the Silhouette method for the runs above selected a K value of 2 as optimal, while in reality there are 3 clusters. To confirm this failure to cluster accurately, a density-based clustering was performed, which is a reasonable option for such a scenario.
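The Haversine formula mentioned above computes the great-circle distance between two latitude/longitude points, rather than treating degrees as flat Euclidean coordinates; a small self-contained sketch:

```python
import numpy as np

def haversine(lat1, lon1, lat2, lon2, r=6371.0):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2)
    return 2 * r * np.arcsin(np.sqrt(a))

# Roughly the north-south extent of Chicago (illustrative coordinates)
print(haversine(41.64, -87.62, 42.02, -87.67))
```

Standard K-means cannot simply swap in this metric (the centroid update assumes Euclidean geometry), which is one more reason a density-based method is attractive for raw coordinates.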

DBSCAN on the Same Data and Labels with eps=0.01, min_samples=4
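A sketch of the DBSCAN run with the stated parameters (eps=0.01, min_samples=4), using synthetic stand-in coordinates rather than the actual community-area data:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Stand-in lat/lon points forming three tight spatial groups near Chicago
coords = np.vstack([rng.normal(c, 0.002, (30, 2)) + [41.80, -87.60]
                    for c in (0.0, 0.05, 0.10)])

db = DBSCAN(eps=0.01, min_samples=4).fit(coords)

# DBSCAN labels noise points as -1, so exclude them when counting clusters
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print(n_clusters)
```

Unlike K-means, DBSCAN does not require choosing K up front, and scikit-learn's implementation also accepts `metric="haversine"` on radian coordinates, which suits spatial data better than Euclidean distance on raw degrees.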

Data for 3-Dimensional KMeans

3 Dimensional Data

Ward -  "Ward" refers to the electoral district used for the election of members of the Chicago City Council, which is the legislative branch of the government of the City of Chicago. The city is divided into 50 wards, each represented by an alderman who is elected to serve a four-year term.

IUCR - "IUCR" stands for "Illinois Uniform Crime Reporting". It is a coding system used by law enforcement agencies in the state of Illinois, including the Chicago Police Department, to classify criminal offenses and report crime statistics to the Illinois State Police.

District -  "District" refers to the police district in which a particular crime occurred. The City of Chicago is divided into 22 police districts, each with its own police station and designated police officers responsible for patrolling the district and responding to incidents. The districts are further divided into beats, which are smaller geographic areas that police officers patrol on foot or by car.


Results and Discussion for 3 Dimensional Clustering

In this analysis, Ward, District, and IUCR codes were used for K-means clustering on the Chicago Crime dataset. Based on the three-dimensional charts, there is a distinct divergence between clusters 0 and 2, while clusters 1 and 2 appear to be forced clusters that K-means split apart according to the threshold.

The findings suggest that K-means clustering with a k value of 3 is a reasonable choice for this dataset, since it creates clear distinctions between most groups. In particular, the obvious separation between clusters 0 and 2 shows that the crime patterns of these two groups differ significantly, while the forced distinction between clusters 1 and 2 suggests that those clusters may not be sufficiently dissimilar to warrant splitting them.

K-means clustering does, however, have limits, especially when the underlying data is complicated or the clusters have irregular sizes or shapes. To validate the ideal number of clusters and assess the quality of individual clusters, it is advisable to use K-means clustering in combination with other clustering validation approaches, such as the Silhouette method.

Data for n-Dimensional KMeans

Crime Counts Per Hour

This data was aggregated by each hour of the day.

Elbow Method to Find the Optimal Clusters

Principal Component Analysis to Visualise the Clusters

Inference for the above

In the n-dimensional KMeans above, the data is aggregated as counts of each crime type per hour of the day. Here the cluster labels are the crime types, and since the data has more than 3 dimensions, the clusters cannot be inspected visually. Principal Component Analysis is therefore performed to visualise the clusters on a 2-dimensional plane. Looking at the plot, the model has performed decently, producing some separation in the data points. The far-right clusters are correctly aggregated as separate clusters because they do not have close neighbours. The rest of the plot shows clear centroids with data points gathered around them.
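The cluster-then-project workflow can be sketched as follows; the 24-dimensional random data stands in for the hourly crime-count profiles:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for the 24-dimensional hourly crime-count profiles (one row per type)
X = np.vstack([rng.normal(c, 1.0, (25, 24)) for c in (0.0, 10.0, 20.0)])

# Cluster in the full 24-dimensional space...
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# ...then project to 2 principal components purely for visualisation
X2 = PCA(n_components=2).fit_transform(X)
print(X2.shape)  # (75, 2)
# e.g. plt.scatter(X2[:, 0], X2[:, 1], c=labels) to inspect the clusters
```

Note that the clustering happens in the original high-dimensional space; PCA is used only afterwards, to give the clusters a 2-D plane on which they can be eyeballed.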

Conclusion

In conclusion, while K-means clustering can be a useful technique for identifying patterns and relationships in the Chicago Crime dataset, it may only be partially helpful here because few variables in the data fit the conditions required for clustering. In particular, clustering requires that data points be similar to each other within clusters and different from data points in other clusters. In the Chicago Crime dataset, however, few variables have enough spread for KMeans to perform at optimum efficiency.

Furthermore, K-means clustering may not always be the best choice for analyzing crime patterns in the city, as it may not account for the nuances and complexities of crime data, such as differences in crime patterns within individual Wards or the unique characteristics and challenges of different communities within a given district.

Therefore, while K-means clustering can provide some insights into crime patterns in the Chicago Crime dataset, it should be used with caution and in conjunction with other clustering validation techniques, such as the Silhouette method or hierarchical clustering, to ensure that the resulting clusters are meaningful and can be used to inform policy decisions and improve crime prevention strategies in the future. Additionally, it is important to carefully consider the variables used for clustering and to include relevant variables that provide a more comprehensive and nuanced understanding of crime patterns in the city.

Source Code