Probabilistic Classification of Incarceration Using Naive Bayes

Naive Bayes may be used to categorize incarcerations/arrests in the Chicago crime dataset in the context of criminal justice. The Chicago crime dataset is a large and intricate dataset that includes details on crimes that took place in Chicago between 2018 and the present. The dataset contains a large number of different characteristics, such as each crime's date and time, location, type, and resolution.

Finding patterns and trends in the data is one of the main problems of studying the Chicago crime dataset. Based on the factors in the dataset, such as the kind of crime, the location, and the outcome of the case, Naive Bayes may be used to categorize incarcerations. By evaluating these factors to detect patterns and trends in the data, Naive Bayes may help forecast the likelihood of imprisonment from the characteristics of the crime.

The Application

Data Preparation 

Similar to the application of decision trees, numerical data is mandatory in order to perform Naive Bayes in Python. However, using pd.factorize as the encoding method didn't seem to yield good results. An alternative is to encode the data with the pd.get_dummies function, which one-hot encodes all the categorical variables into a binary format (0 or 1). Considering Naive Bayes's probabilistic nature, this representation is ideal for the model to build a probability matrix out of the data.
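For illustration, here is a minimal sketch of the encoding step. The DataFrame df and its contents are hypothetical stand-ins modeled on the Chicago crime dataset schema, not the exact code used here:

```python
import pandas as pd

# Minimal sketch of the encoding step; `df` is a stand-in for the
# Chicago crime DataFrame (column names modeled on the dataset schema).
df = pd.DataFrame({
    "Primary Type": ["THEFT", "BATTERY", "THEFT"],
    "Location Description": ["STREET", "APARTMENT", "STREET"],
    "Domestic": [False, True, False],
})

# pd.factorize maps each category to an arbitrary integer, which
# imposes a spurious ordering on nominal data:
codes, uniques = pd.factorize(df["Primary Type"])

# pd.get_dummies instead one-hot encodes each categorical column into
# separate 0/1 indicator columns:
encoded = pd.get_dummies(df, columns=["Primary Type", "Location Description"])
print(encoded)
```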

In contrast to the decision tree model, the features for Naive Bayes are again selectively chosen to yield better performance. The chosen features are listed below:

Features - Month, Day, Community Area, Location Description, Domestic, Primary Type, SocioEconomic-Status

Target - Arrest

Considering that the features "Location Description" and "Primary Type" take on a high variety of values, the top 3 most frequently occurring candidates of each were picked, and the data was sliced down to those candidates. This, along with downsampling on the class, led to a significant decrease in the data size, as sketched below.
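A minimal sketch of this step, continuing with the hypothetical df from the encoding sketch above (this filtering would happen before the get_dummies encoding). "Arrest" is assumed to be a boolean column whose True class is the minority, as in the original data:

```python
# Keep only rows whose "Primary Type" and "Location Description" fall
# among the top 3 most frequent values of each column.
top_types = df["Primary Type"].value_counts().nlargest(3).index
top_locs = df["Location Description"].value_counts().nlargest(3).index
df = df[df["Primary Type"].isin(top_types) & df["Location Description"].isin(top_locs)]

# Downsample the majority (Arrest == False) class to the minority size,
# then shuffle, so both classes are equally represented.
minority = df[df["Arrest"]]
majority = df[~df["Arrest"]].sample(n=len(minority), random_state=42)
balanced = pd.concat([minority, majority]).sample(frac=1, random_state=42)
```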

Figures: class imbalance in the original dataset, and balanced classes after downsampling.

Training and Testing Distributions

The data was split into an 80% train set and a 20% test set.

Figures: sample illustrations of the train and test sets, showing the original dataset with its feature columns and the encoded dataset with feature columns.

X ~ y (Features ~ Target)
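In code, the split might look like the following sketch, where balanced carries over from the downsampling step (with pd.get_dummies assumed to have been applied to the filtered frame):

```python
from sklearn.model_selection import train_test_split

# Features (X) and target (y); `balanced` is the encoded, downsampled
# DataFrame from the sketches above.
X = balanced.drop(columns=["Arrest"])
y = balanced["Arrest"]

# 80% train / 20% test; stratification keeps both splits class-balanced
# (an assumption, not confirmed by the original write-up).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)
```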

Naive Bayes & Performance Analysis

Model ~ GaussianNB()

Train Accuracy: 67.63%

Test Accuracy: 67.28%
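A sketch of the model fit and evaluation; variable names carry over from the split sketch, and the exact reported numbers come from the original run, not this snippet:

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, classification_report

# Fit Gaussian Naive Bayes on the training split.
model = GaussianNB()
model.fit(X_train, y_train)

# Accuracy on both splits (the post reports 67.63% / 67.28%).
print(f"Train Accuracy: {model.score(X_train, y_train):.2%}")
print(f"Test Accuracy:  {model.score(X_test, y_test):.2%}")

# Confusion matrix and per-class precision/recall/F1 for the test set.
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```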

Figures: confusion matrix and classification report for the test set.

Results & Conclusion

Based on the quantified performance metrics above: first, there is no significant difference between the train and test accuracies, so high variance (overfitting) is not a concern in the bias-variance tradeoff. However, since the accuracies are low for both distributions, the model is underfitting the data. Conventionally, in this scenario, we opt to get more data to fit the model.

Inferences from the Classification Report

For the "False" class, the model has a precision of 0.50, which means that when the model predicts a sample as "False", it is correct only 50% of the time. The recall for this class is 0.76, meaning the model correctly identifies 76% of the actual "False" samples. The F1 score, the harmonic mean of precision and recall, is 0.60, a moderate score.

For the "True" class, the model has a precision of 0.85, which means that when the model predicts a sample as "True", it is correct 85% of the time. The recall for this class is 0.63, which means that the model correctly identifies 63% of the actual "True" samples. The F1 score for this class is 0.72, which is a relatively high score.

In summary, the model performs better for the "True" class than for the "False" class. The precision for the "True" class is relatively high, indicating that the model is usually correct when it predicts an arrest, while its recall of 0.63 is lower, meaning it misses a fair share of the actual "True" samples. For the "False" class, recall is reasonable but precision is only 0.50, indicating that many "True" samples are misclassified as "False"; overall, the model struggles to cleanly separate the two classes.

Potential Reasons for the Inaccuracy of Naive Bayes

One challenge is that Naive Bayes assumes all features are equally important in determining the class label. However, in the Chicago crime dataset, some features may be more informative than others. For example, the type of crime committed may be more indicative of the class label than the time of day or the location of the crime. Naive Bayes may not be able to effectively weigh the importance of different features, which can lead to suboptimal performance.

Another issue is that Naive Bayes can only express a relatively simple relationship (linear in the log-probability domain) between the features and the class label. In the Chicago crime dataset, the relationship may be more complex and nonlinear. For example, the type of crime committed may interact with the location of the crime in a nonlinear way. Naive Bayes cannot capture these complex interactions, which can lead to poor performance.
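As a toy illustration (not from the original analysis), a label that depends on the interaction of two features, such as an XOR pattern, defeats Naive Bayes entirely:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# XOR toy data: the label depends on the *combination* of the two
# features; each feature alone carries no information about the label.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 2))
y = X[:, 0] ^ X[:, 1]

model = GaussianNB().fit(X, y)
# Both class-conditional feature distributions are identical, so the
# model can do no better than chance (accuracy near 0.5).
print(model.score(X, y))
```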

In addition, Naive Bayes assumes that the features are conditionally independent given the class label. This means that the occurrence of one feature does not affect the occurrence of any other feature, given the class label. However, in the Chicago crime dataset, there may be dependencies between features even after conditioning on the class label. For example, the location of a crime may be dependent on the time of day, even after conditioning on the type of crime committed. Naive Bayes may not be able to effectively model these dependencies, which can lead to suboptimal performance.
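Concretely, the conditional independence assumption is the factorization that gives Naive Bayes its name: given the class, the joint likelihood of the features is treated as a product of per-feature likelihoods,

```latex
P(y \mid x_1, \ldots, x_n) \;\propto\; P(y)\,\prod_{i=1}^{n} P(x_i \mid y)
```

If features such as location and time of day remain correlated within a class, this product double-counts (or under-counts) their shared evidence, skewing the predicted probabilities.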

In conclusion, while Naive Bayes can be effective for many classification tasks, it may not be the best choice for the Chicago crime dataset, due to the conditional independence assumption, the equal weighting of features, and the model's limited ability to capture feature interactions. Based on these results, it may be worth exploring other machine learning models or feature engineering techniques to improve the overall performance.

Insights and Takeaways

From the whole quest of applying Naive Bayes to the Chicago crime dataset, it is evident that the circumstances around the data don't allow it to be a good fit for the model. Perhaps considering the entirety of the dataset, from the year 2001 to the present, would leave us with more data for modeling and help deal with the underfitting scenario. In summation, the general takeaway is that this problem is ill suited to an algorithm like Naive Bayes, considering the complexity of the data.

Source Code