Supervised Learning
Navigate to any one of the following approaches to learn more.
The Convention of Supervised Machine Learning
The Dataset
In supervised learning, the objective is to train a machine learning model to classify data correctly, given examples that pair inputs with the characteristics of each class. In effect, the model learns to spot patterns in the data it is given. By convention, the dataset fed to the model during learning is called the "training set", and it is usually accompanied by two other portions of the data called the "validation" and "test" sets. The validation and test sets are used to estimate, or quantify, the performance of the trained model. The test set is entirely unseen and new to the trained model; hence, achieving comparable accuracy on both the training and test sets is generally considered a sign of a good fit.
The Requirement of Disjoint Datasets
It is important to ensure that the training and test datasets are disjoint: no example in the test dataset should also be present in the training dataset, and vice versa. The reason is that the model should be evaluated on its ability to generalize to new, unseen data, even though both sets are drawn from the same underlying distribution. If the test dataset includes examples that also appear in the training dataset, the model may perform well on the test dataset simply because it has memorized those examples during training. This leads to an overestimate of the model's performance and poor generalization to new data. Keeping the training and test sets disjoint is therefore a critical aspect of building a robust and accurate machine learning model.
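As a concrete illustration, the sketch below shows one common way to produce disjoint training, validation, and test splits. It assumes scikit-learn and its bundled Iris dataset; the 60/20/20 ratios and random seeds are arbitrary choices made for the example, not a recommendation.

```python
# A minimal sketch of a disjoint train/validation/test split using
# scikit-learn. The 60/20/20 ratios and random_state values are arbitrary.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First split off the test set (20% of the data), then carve a
# validation set out of the remaining 80%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42, stratify=y_train
)  # 0.25 of the remaining 80% == 20% of the original data

# Each example is assigned to exactly one split, so the three sets are
# disjoint by construction: the model never sees validation/test rows.
print(len(X_train), len(X_val), len(X_test))  # 90 30 30
```

Stratifying on the labels keeps the class proportions roughly equal across the three splits, so all of them reflect the same underlying distribution.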
Enter the Problem → "Bias/Variance Tradeoff"
The bias-variance tradeoff is a machine learning concept that describes the balance between a model's ability to fit the training data well (low bias) and its ability to generalize to new, unseen data (low variance). The objective is to find a model with both low bias and low variance, since this yields high performance on both the training and test datasets.
When a model has high bias, it is too simplistic to represent the underlying patterns in the data. In this case the model performs badly on both the training and test datasets; this is referred to as underfitting. Underfitting happens when the model is not expressive enough to capture the data's nuances, and as a consequence it cannot reliably predict the output for new samples.
Conversely, when a model has high variance, it is overly complicated and is overfitting the training data: it performs very well on the training data but badly on the test data. Overfitting happens when a model is so complex that it captures the noise in the training data instead of the underlying patterns, which results in poor generalization to new, unseen data. Achieving the right balance between bias and variance requires tuning the parameters and architecture of the model.
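To see the tradeoff concretely, the sketch below fits polynomials of increasing degree to noisy synthetic data and compares training and test error. The sine curve, noise level, and chosen degrees are illustrative assumptions, not part of any particular recipe.

```python
# A minimal sketch of the bias/variance tradeoff: fitting polynomials of
# increasing degree to noisy sine data. A low degree underfits (high bias),
# a very high degree overfits (high variance).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = rng.uniform(0, 1, size=(60, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=60)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    # Underfitting: both errors high.  Overfitting: tiny training error,
    # much larger test error.  A middle degree balances the two.
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```

The gap between training and test error is the practical signal: a large gap points to high variance, while two similarly high errors point to high bias.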
The infamous "Class Imbalance" Problem
Class imbalance is a common situation in which the distribution of classes in the training data is heavily skewed, with one or more classes having far fewer instances than the rest. This arises in many real-world applications, such as fraud detection and disease diagnosis, where the positive class (i.e., the class of interest) is very rare compared to the negative class. Under such circumstances, a machine learning algorithm may learn to predict the majority class effectively while performing badly on the minority class, resulting in a biased and misleading model: because the model leans towards the majority class, its overall accuracy can look high even though its classification performance on the minority class, which is usually the class we care about, is poor. Handling class imbalance is therefore essential for building a robust and accurate machine learning model.
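The sketch below illustrates why plain accuracy hides this failure mode: on a synthetic dataset with roughly 1% positives, a baseline that always predicts the majority class reaches about 99% accuracy while never detecting a single minority example. The dataset and class ratio are assumptions made purely for illustration.

```python
# A minimal sketch of why accuracy is misleading under class imbalance:
# a classifier that always predicts the majority class scores ~99%
# accuracy yet never detects the minority class.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score

# Synthetic data with ~1% positive class, mimicking e.g. fraud detection.
X, y = make_classification(
    n_samples=10_000, weights=[0.99, 0.01], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

majority = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
y_pred = majority.predict(X_test)

print("accuracy:", accuracy_score(y_test, y_pred))       # roughly 0.99
print("minority recall:", recall_score(y_test, y_pred))  # 0.0
```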
Mitigation
Having stated these limitations and pitfalls of supervised machine learning, they can be mitigated. For the class imbalance problem, if you have a very large amount of data, it is acceptable to undersample the majority classes down to the size of the smallest class, provided that class is not vanishingly small relative to the dataset. Otherwise, methods such as SMOTE can balance the data artificially by synthesizing minority-class examples, though these come with their own advantages and disadvantages. Addressing the bias/variance tradeoff requires determining the right amount of model complexity for the quantity and complexity of the training data. Regularization methods such as dropout, L1/L2 regularization, and early stopping can also be used to reduce overfitting and improve generalization, and collecting more data is often necessary as well.
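As a hedged sketch of the imbalance mitigations mentioned above, the example below oversamples the minority class with SMOTE (which requires the third-party imbalanced-learn package) and, alternatively, re-weights the loss via an estimator's class_weight option. The synthetic dataset and parameters are illustrative assumptions, not a fixed recipe.

```python
# A minimal sketch of two common class-imbalance mitigations, assuming the
# third-party imbalanced-learn package is installed for SMOTE.
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=5_000, weights=[0.95, 0.05], random_state=0)
print("original class counts:", Counter(y))

# Option 1: oversample the minority class with synthetic examples (SMOTE).
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after SMOTE:", Counter(y_res))

# Option 2: keep the data as-is and re-weight the loss instead, which many
# scikit-learn estimators expose through the class_weight parameter.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
```

Which option is preferable depends on the dataset: resampling changes the training distribution itself, while class weighting leaves the data untouched and only changes how errors are penalized.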