Naïve Bayes

Probabilistic Classification of Incarceration Using Naive Bayes

A Comprehensive Intuition of Naive Bayes

Defining Naive Bayes in Layman's Terms

" Imagine you have a bag of different colored marbles, and you want to know the probability of picking a blue marble. Naive Bayes uses similar probabilities to predict the likelihood of an event happening, based on the probability of similar events that have happened in the past. For example, it can predict the likelihood of a customer buying a certain product based on their age, gender, and past purchase history. The "naive" part of the name refers to the assumption that all of the features (like age, gender, and past purchase history) are independent of each other, even if they may actually be related. Despite this simplification, Naive Bayes is often a very accurate and useful tool for prediction in many real-world situations. "

An Illustration of Naive Bayes for Spam Classification ~ By Nozzman

Introduction - The Math

As a machine learning practitioner, one of my primary objectives is to select the best hypothesis given the available data. For instance, in a classification task, I need to determine the most suitable class to assign to a new data instance based on the information available to me. Bayes' Theorem makes this selection easier: it lets me calculate the probability of a hypothesis from my prior knowledge of the problem, so I can make more informed decisions and select the most probable hypothesis given the data.

Bayes' Theorem

P(h|d) = (P(d|h) × P(h)) / P(d)

Where,

P(h|d) is the probability of hypothesis h given the data d. This is called the posterior probability.

P(d|h) is the probability of data d given that the hypothesis h was true.

P(h) is the probability of hypothesis h being true (regardless of the data). This is called the prior probability of h.

P(d) is the probability of the data (regardless of the hypothesis). 

In supervised learning, the general objective is to pick the best hypothesis (h) for the given data (d). In a classification problem, the hypothesis (h) could represent the class assigned to a new data instance (d). Bayes' Theorem gives us a way to determine the most probable hypothesis using prior knowledge about the problem: the posterior probability P(h|d) is computed from the prior probability P(h), the likelihood P(d|h), and the evidence P(d). By computing the posterior probability for several candidate hypotheses, we can select the one with the highest probability, known as the Maximum a Posteriori (MAP) hypothesis. This can be expressed as:

MAP(h) = argmax_h P(h|d) = argmax_h P(d|h) × P(h) / P(d)

Since P(d) is the same for every hypothesis, it can be dropped when only the most probable hypothesis is needed, giving MAP(h) = argmax_h P(d|h) × P(h).
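To make the MAP selection concrete, here is a minimal Python sketch. The priors and likelihoods below are made-up values for two hypothetical classes, "spam" and "not_spam"; because P(d) is shared by every hypothesis, it is dropped and only the unnormalized posteriors are compared.

```python
# Made-up prior and likelihood values, for illustration only.
priors = {"spam": 0.4, "not_spam": 0.6}          # P(h)
likelihoods = {"spam": 0.7, "not_spam": 0.1}     # P(d|h) for the observed data d

# Unnormalized posteriors: P(d|h) * P(h). P(d) is identical for every
# hypothesis, so it can be ignored when we only need the argmax.
posteriors = {h: likelihoods[h] * priors[h] for h in priors}

map_hypothesis = max(posteriors, key=posteriors.get)
print(posteriors)        # {'spam': 0.28, 'not_spam': 0.06}
print(map_hypothesis)    # 'spam' is the MAP hypothesis
```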

Types of Naïve Bayes Model: 

There are three main types of Naive Bayes models available:

Gaussian Naive Bayes:

This model assumes that the continuous values associated with each class are distributed according to a Gaussian distribution. It is suitable for continuous data.

Multinomial Naive Bayes:

This model is used for discrete count data. For example, let's say we have a dataset of text documents, and we want to classify them by topic. The features (words) are likely to appear multiple times in a single document, so this model considers the frequency of each word in a document.

Bernoulli Naive Bayes:

This model is similar to Multinomial Naive Bayes, but it is used for discrete data that follows a Bernoulli distribution. It is suitable for binary or boolean features, such as the presence or absence of a certain word in a text document.
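As a rough sketch of how these three variants are typically used in practice, the snippet below fits scikit-learn's GaussianNB, MultinomialNB, and BernoulliNB on tiny invented arrays that stand in for continuous, count, and binary features respectively.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 0, 1, 1])  # two toy classes

# Continuous features -> Gaussian Naive Bayes
X_cont = np.array([[5.1, 3.5], [4.9, 3.0], [6.2, 2.9], [5.9, 3.0]])
print(GaussianNB().fit(X_cont, y).predict([[5.0, 3.4]]))

# Word-count features -> Multinomial Naive Bayes
X_counts = np.array([[2, 0, 1], [3, 1, 0], [0, 2, 4], [1, 3, 5]])
print(MultinomialNB().fit(X_counts, y).predict([[2, 0, 1]]))

# Binary presence/absence features -> Bernoulli Naive Bayes
X_bin = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 1, 1]])
print(BernoulliNB().fit(X_bin, y).predict([[1, 0, 1]]))
```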

Laplace Smoothing

What is smoothing in the context of the Naïve Bayes classifier? 

Smoothing is a technique for adjusting the probability estimates of the features in a dataset when a feature never occurs with a particular class. If a feature is absent from the training dataset for a given class, the estimated conditional probability of that feature given the class is 0, which in turn forces the posterior probability of the class to 0 for any instance containing that feature. By adding a small amount, known as a smoothing factor or regularization parameter, to the probability estimates, smoothing produces non-zero probabilities for every class.

Smoothing prevents zero probabilities, which can cause models to perform poorly on unseen data. Because smoothing ensures that every class receives a non-zero probability estimate, the Naive Bayes classifier can assign a probability score to each class even when a particular feature is absent from the training dataset for that class. Several smoothing approaches can be used, including Laplace smoothing, Lidstone smoothing, and other forms of additive smoothing, depending on the characteristics of the data and the specific requirements of the problem.

Laplace smoothing is used to smooth probability estimates over categorical data. It is a form of regularization that helps prevent zero probabilities, which can cause difficulties during classification.

Laplace smoothing adds a small constant, usually 1, to each count in the numerator and increases the denominator by the number of possible values (or classes) of the target attribute. This "smooths" out the estimates and reduces the weight given to rare events. The constant is typically chosen to be small enough not to distort the overall distribution of the data, yet large enough to have an effect on low counts.
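A minimal sketch of this calculation for a word-likelihood estimate, assuming hypothetical counts and vocabulary size:

```python
def smoothed_likelihood(word_count, total_words_in_class, vocab_size, alpha=1.0):
    """P(word | class) with additive (Laplace when alpha=1) smoothing."""
    return (word_count + alpha) / (total_words_in_class + alpha * vocab_size)

# A word never seen with the "spam" class still gets a small non-zero probability.
# The counts below are invented for illustration.
print(smoothed_likelihood(word_count=0, total_words_in_class=120, vocab_size=5000))
# Without smoothing this would be 0 / 120 = 0, zeroing out the whole product.
```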

Naive Bayes Classifier

Naive Bayes is a classification algorithm for binary (two-class) and multiclass classification problems. The technique is easiest to understand when described using binary or categorical input values. It is called naive Bayes (or, sometimes, idiot Bayes) because the calculation of the probabilities for each hypothesis is simplified to make it tractable. Rather than attempting to calculate the joint probability of all attribute values, P(d1, d2, d3|h), the attributes are assumed to be conditionally independent given the target value, so the probability is calculated as P(d1|h) × P(d2|h) × P(d3|h). This is a very strong assumption that is unlikely to hold in real data, i.e. that the attributes do not interact. Nevertheless, the approach performs surprisingly well on data where the assumption does not hold.

How a Naive Bayes Model Is Represented

The representation for naive Bayes is probabilities. A list of probabilities is stored to file for a learned naive Bayes model. This includes:

Class Probabilities: The probabilities of each class in the training dataset.

Conditional Probabilities: The conditional probabilities of each input value given each class value.

Create a Data-Driven Naive Bayes Model

Learning a naive Bayes model from the training data is a fast process as it only requires calculating the probability of each class and the probability of each class given different input values. This method does not involve fitting coefficients using optimization procedures, which further adds to its efficiency.

Calculating Class Probabilities

The class probabilities are simply the frequency of instances that belong to each class divided by the total number of instances. In the simplest case, each class would have a probability of 0.5 or 50% for a binary classification problem with the same number of instances in each class.
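A small sketch of this calculation, using an invented list of class labels:

```python
from collections import Counter

# Hypothetical training labels for a binary problem.
labels = ["spam", "not_spam", "spam", "not_spam", "not_spam", "spam"]

counts = Counter(labels)
priors = {c: n / len(labels) for c, n in counts.items()}
print(priors)  # {'spam': 0.5, 'not_spam': 0.5}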

Calculating Conditional Probabilities

The conditional probabilities are the frequency of each attribute value for a given class value divided by the frequency of instances with that class value. 
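A small sketch of this calculation for one categorical attribute, using a hypothetical weather-versus-play toy dataset:

```python
from collections import defaultdict

# Each pair is (attribute value, class label); the data is invented.
data = [("sunny", "yes"), ("rainy", "no"), ("sunny", "yes"),
        ("overcast", "yes"), ("rainy", "no"), ("sunny", "no")]

class_counts = defaultdict(int)
joint_counts = defaultdict(int)
for value, label in data:
    class_counts[label] += 1
    joint_counts[(value, label)] += 1

# P(weather = sunny | play = yes) = count(sunny, yes) / count(yes)
print(joint_counts[("sunny", "yes")] / class_counts["yes"])  # 2 / 3 ≈ 0.667
```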

Make Predictions with Naive Bayes

Given a naive Bayes model, you can make predictions for new data using Bayes' theorem. For each class, multiply the class probability by the conditional probabilities of the observed attribute values; the class with the highest resulting product is chosen as the predicted target class.
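A minimal sketch of this prediction step, with all probabilities made up for illustration:

```python
# Hypothetical class priors and conditional probabilities learned from training data.
priors = {"yes": 0.6, "no": 0.4}
conditionals = {
    ("sunny", "yes"): 0.5, ("hot", "yes"): 0.3,
    ("sunny", "no"): 0.2,  ("hot", "no"): 0.6,
}

observation = ["sunny", "hot"]  # attribute values of the new instance

# Score each class: P(class) * product of P(value | class) over observed values.
scores = {}
for c, prior in priors.items():
    score = prior
    for value in observation:
        score *= conditionals[(value, c)]
    scores[c] = score

print(scores)                       # {'yes': 0.09, 'no': 0.048}
print(max(scores, key=scores.get))  # 'yes' is the predicted class
```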