Decision Trees

Application of Decision Trees on Chicago Crime Data → 

A Comprehensive Intuition of Decision Trees

Defining Decision Trees in Layman's Terms

"Think of a decision tree as a flowchart that helps you make decisions based on a set of rules. For example, imagine you want to decide what to wear today. A decision tree for this task might have questions like "Is it sunny outside?" or "Is it cold outside?" Each question leads to a different path in the flowchart with additional questions or decisions, until you finally reach a leaf node with a final decision, such as "Wear a t-shirt and shorts" or "Wear a jacket and boots".

In machine learning, a decision tree works in a similar way to make predictions about data. It starts with the full dataset and selects the best feature to split the data into smaller, more homogeneous groups. This process is repeated recursively until the subsets cannot be divided any further or a stopping criterion is met. The resulting tree can then be used to make predictions about new data by following a path down the tree until a leaf with a final decision or predicted outcome is reached.
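As a concrete illustration of this workflow, here is a minimal sketch using scikit-learn's DecisionTreeClassifier. The Iris dataset and the max_depth value are illustrative choices for this sketch, not something taken from the article:

```python
# Minimal sketch: fit a decision tree and use it to classify new data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The tree is grown by recursively splitting on the most informative feature.
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train)

# New samples are classified by walking from the root down to a leaf.
print("Test accuracy:", clf.score(X_test, y_test))
print(export_text(clf, feature_names=list(load_iris().feature_names)))
```

Printing the tree with export_text makes the flowchart analogy explicit: each line is a question about one feature, and each terminal line is a predicted class.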

Figure: A Simple Decision Tree (interactive D3.js visualization)

Introduction - The Technical Stuff

In the realm of machine learning, decision trees are a powerful and widely used tool for classification and prediction problems. They are popular because they are easy to interpret and can be applied to a wide range of tasks, from medical diagnosis to financial forecasting. This article examines how decision trees work and how they are trained.

A decision tree is a tree-like structure used to classify data based on a set of features. The tree is constructed by recursively splitting the data on the value of one of the features until every data point has been classified. The branches of the tree represent the possible outcomes of each decision, while the leaves represent the final classifications.

To construct a decision tree, we first choose a feature that we expect to be useful for classification. The data is then split into two groups based on that feature's value. We repeat this process for each of the resulting groups until all of the data is classified. The best feature to split on at each step is chosen using a criterion called "information gain." A short sketch of this recursive procedure is given below.
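The following sketch shows the shape of that recursive procedure in Python. It assumes rows are dicts mapping feature names to values, uses a simple dict-based node representation, and relies on an information_gain helper that is sketched in the Entropy section further down; all of these are assumptions of this sketch rather than the article's own source code:

```python
# Illustrative sketch of greedy, recursive tree construction (ID3-style).
# `information_gain(X, y, feature)` is sketched in the Entropy section below.
def build_tree(X, y, features):
    # Stop when the node is pure or no features remain: return a leaf
    # labeled with the majority class.
    if len(set(y)) == 1 or not features:
        return {"leaf": True, "label": max(set(y), key=y.count)}

    # Greedily pick the feature whose split gives the highest information gain.
    best = max(features, key=lambda f: information_gain(X, y, f))
    node = {"leaf": False, "feature": best, "children": {}}
    remaining = [f for f in features if f != best]

    # Recurse on each subset induced by the chosen feature's values.
    for value in set(row[best] for row in X):
        rows = [(row, label) for row, label in zip(X, y) if row[best] == value]
        sub_X = [r for r, _ in rows]
        sub_y = [l for _, l in rows]
        node["children"][value] = build_tree(sub_X, sub_y, remaining)
    return node
```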

Information Gain

Information gain measures how much "information" a feature provides for classification. The notion of "information" here is tied to the information-theoretic concept of entropy. Entropy quantifies the degree of uncertainty in a collection of data: if all the instances in the data belong to the same class, there is no uncertainty and the entropy is 0; if the data is split evenly between two classes, uncertainty is at its maximum and the entropy is 1. We can compute the entropy of a collection of data as follows:
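Using the standard definitions from information theory (the notation below is the usual convention rather than the article's own):

$$H(S) = -\sum_{i=1}^{k} p_i \log_2 p_i$$

where $p_i$ is the proportion of examples in $S$ that belong to class $i$. The information gain of splitting $S$ on a feature $A$ is then

$$IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v)$$

where $S_v$ is the subset of $S$ for which feature $A$ takes the value $v$. The best feature to split on is the one with the largest information gain.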

Entropy

Entropy is a measure of uncertainty or randomness in a dataset. In machine learning, entropy is used to quantify the degree of impurity, or mixing of class labels, in a set of data. A dataset with a perfect balance of classes, i.e., an equal number of examples for each label, has maximum entropy: we are maximally uncertain about the label of a randomly drawn example. At the other extreme, a set in which every example belongs to a single class, such as each of the two groups produced by a perfect split, is pure and has minimum entropy (zero). The concept of entropy is central to many machine learning algorithms, such as decision trees built with information gain, and helps to create efficient and accurate models for classification and regression tasks.
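As a concrete sketch, entropy and the information gain of a candidate split can be computed as follows. The dict-based row representation is an assumption carried over from the construction sketch above; these are the helpers that sketch refers to:

```python
# Sketch of entropy and information gain for lists of labels and dict-based rows.
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(X, y, feature):
    """Entropy of y minus the weighted entropy of y after splitting on `feature`."""
    n = len(y)
    weighted = 0.0
    for value in set(row[feature] for row in X):
        subset = [label for row, label in zip(X, y) if row[feature] == value]
        weighted += len(subset) / n * entropy(subset)
    return entropy(y) - weighted

# Examples: a 3:1 class mix has ~0.811 bits of entropy; a 50/50 mix has 1 bit.
print(entropy(["yes", "yes", "yes", "no"]))  # ~0.811
print(entropy(["yes", "no", "yes", "no"]))   # 1.0

# Toy split: "outlook" perfectly separates the labels, so the gain is 1 bit.
X = [{"outlook": "sunny"}, {"outlook": "sunny"}, {"outlook": "rainy"}, {"outlook": "rainy"}]
y = ["no", "no", "yes", "yes"]
print(information_gain(X, y, "outlook"))  # 1.0
```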

Gini Index/Impurity

The Gini index, also known as Gini impurity, is a metric used to measure the degree of impurity, or mixing of class labels, in a set of data. It can be read as the probability of incorrectly classifying a randomly chosen instance if it were labeled at random according to the class distribution of the set. A Gini index of 0 indicates that the set is perfectly pure, with all instances belonging to the same class; the index reaches its maximum (1 − 1/k for k classes, e.g. 0.5 for two classes) when instances are spread evenly across all classes. The Gini index is used in decision tree algorithms, alongside measures such as information gain, to select the best feature to split on at each node. By choosing the split that yields the lowest weighted Gini impurity in the resulting subsets, decision trees can efficiently partition the data and create simple yet powerful models for classification and regression tasks.
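In the same notation as the entropy formula above, the standard definition is:

$$\mathrm{Gini}(S) = 1 - \sum_{i=1}^{k} p_i^2$$

where $p_i$ is again the proportion of examples in $S$ belonging to class $i$; the value is 0 for a pure set and $1 - 1/k$ for a perfectly even mix of $k$ classes.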

The Fundamental Differences between Gini and Entropy

The main difference between the two lies in their mathematical formulation and computational cost. The Gini index estimates the probability of misclassifying a randomly chosen instance, while entropy measures the degree of uncertainty or randomness in the data. In practice, the two criteria usually produce very similar trees and can often be used interchangeably in decision tree algorithms to select the best feature to split on. However, the Gini index is often preferred over entropy because it is cheaper to compute, since no logarithm is required, which matters for large datasets with many features.
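In scikit-learn, for instance, the impurity measure is a single constructor argument, so switching between the two is trivial. A brief illustrative sketch (again using the Iris dataset as a stand-in):

```python
# Compare the trees grown with the two impurity criteria on the same data.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

for criterion in ("gini", "entropy"):
    clf = DecisionTreeClassifier(criterion=criterion, random_state=0).fit(X, y)
    print(criterion, "-> depth:", clf.get_depth(), "leaves:", clf.get_n_leaves())
```

On most datasets the two runs produce trees of similar size and accuracy, which is why the cheaper Gini criterion is the common default.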

Source Code