In Machine Learning, the problem of classification involves predicting the categorical class label to which the query data point belongs. The confusion matrix is a tabular representation of a classification model's performance.
This tutorial will help you understand the confusion matrix and the various metrics that you can calculate from the confusion matrix.
We'll start by explaining what classification is, the types of classification problems, and how to interpret the confusion matrix for a binary classification problem.
Let's get started.
Table of Contents
1. What is Classification?
2. Types of Classification
   2.1. Binary Classification
   2.2. Multiclass Classification
3. General Structure of the Confusion Matrix
4. How to Calculate Evaluation Metrics from Confusion Matrix
   4.1. Accuracy
   4.2. Recall
   4.3. Precision
5. High Precision vs High Recall - When to Choose What?
6. Generating the Confusion Matrix in scikit-learn
7. Generating the Classification Report in scikit-learn
What is Classification?
In essence, classification algorithms aim at answering the question:
"Given labeled training data points, what's the class label of a previously unseen test, or query, data point?"
A classification problem could be as simple as classifying a given image as that of a cat or a dog.
Or it could be as complex as examining brain scans to detect the presence or absence of tumors.
Types of Classification
Binary Classification
In this tutorial, we'll focus on the binary classification problem. In binary classification, the class labels 1 and 0 are used.
Suppose you're given a large dataset of student loans containing features such as the name of the university, tuition, and employment details.
You'd like to predict whether or not a new student with a specific tuition fee and employment status will default on the student loan. Notice how you're trying to answer the question "Will the student default on the loan?", and the answer is either a "Yes" or a "No".
You can think of other examples too, say, identifying spam emails; the answers in this case are "Spam" or "Not Spam".
In these examples:
- the answers "Yes" and "Spam" indicate the relevant (positive) class and in practice are encoded as class 1, and
- the answers "No" and "Not Spam" are encoded as class 0.
Using disease diagnosis as another example, if the problem is to detect the presence of a disease: label 1 indicates that the patient has the disease, and label 0 indicates the absence of the disease.
This type of classification problem, where each data point belongs to one of two classes, is called binary classification, and we'll build on it in this tutorial.
Multiclass Classification
You can also have classification problems where you have more than two classes, called multiclass classification.
For instance, classifying an email as âSpamâ or âNot Spamâ is a binary classification problem, whereas, categorizing emails as âSchoolâ, âWorkâ or âPersonalâ is a multiclass classification problem.
Now that you've gained an understanding of the types of classification, let's proceed to understand the confusion matrix.
General Structure of the Confusion Matrix
The general structure of the confusion matrix for binary classification is shown below:
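|               | Predicted: 1        | Predicted: 0        |
|---------------|---------------------|---------------------|
| **Actual: 1** | True Positive (TP)  | False Negative (FN) |
| **Actual: 0** | False Positive (FP) | True Negative (TN)  |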
Let's now define a few terms:
- True Positive (TP): When the actual label is 1, and the classifier also predicted the label to be 1
- False Positive (FP): When the actual label is 0, but the classifier falsely predicted it to be 1
- True Negative (TN): When the actual label is 0, and the classifier also predicted it to be 0
- False Negative (FN): When the actual label is 1, but the classifier predicted the label to be 0
Let's now head over to the next section to understand the evaluation metrics for classification.
You'll learn them by asking a question and following up with an answer that explains what the metric signifies.
How to Calculate Evaluation Metrics from Confusion Matrix
Accuracy
Accuracy answers the question:
"How often is the model correct?"
It's the number of times the classifier correctly predicted class 1, plus the number of times it correctly predicted class 0.
Looking at the matrix above, that's the True Positive (TP) count plus the True Negative (TN) count. And the total number of predictions is the sum of the counts in all four quadrants.
This leads to the formula for accuracy given below:
Accuracy = (TP + TN) / Total Predictions
where Total Predictions = TP + TN + FP + FN
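For instance, here's a quick sketch of computing accuracy from the four counts (the numbers are made up purely for illustration):

```python
# Hypothetical confusion-matrix counts, chosen only for illustration
tp, tn, fp, fn = 40, 45, 10, 5

total_predictions = tp + tn + fp + fn   # 100
accuracy = (tp + tn) / total_predictions
print(accuracy)  # 0.85
```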
At the outset, accuracy may seem like a good metric for evaluation. However, it is not a reliable metric when you have an imbalance in the class labels.
Suppose you're designing a model to predict if a person has a particular medical condition that is rare, say, one that affects only 0.5% of the population.
So in a population of 1000 people, about 5 people will likely have the disease. You clearly have a class imbalance in this case! The majority class is class 0, indicating that the person doesn't have that particular medical condition.
In this case, a naïve model that predicts the majority class all the time will be 99.5% accurate. However, such a model clearly isn't very helpful.
Can you see why this is the case? The confusion matrix for this example will look like this:
- You're making 1000 predictions, and for all of them the predicted label is class 0.
- 995 of them are actually correct (True Negatives!).
- 5 of them are wrong (those are False Negatives).
- The accuracy score still works out to 995/1000 = 0.995.
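If you want to verify this, here's a minimal scikit-learn sketch (assuming numpy and scikit-learn are installed) that scores a naive model which always predicts class 0 on 1000 points, 5 of which are actually positive:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# 995 actual negatives and 5 actual positives, mirroring the rare-disease example
y_true = np.array([0] * 995 + [1] * 5)

# A naive "model" that predicts the majority class (0) every time
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))  # 0.995
```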
To sum up, imbalanced class labels distort accuracy scores, making the model appear to perform better than it truly does.
Examples include problems like:
- Credit card transactions that are potentially fraudulent
- A medical condition that affects a very small fraction of the total population
If the fraction of data points in the minority class is p, a model that predicts the majority class all the time will have an accuracy score of 1 - p.
As you might have guessed by now, the error rate is 1 - accuracy score.
Instead of saying "My model is correct 98% of the time", if you'd like to say "My model is wrong 2% of the time", then you're talking error rates!
So it's now time to learn about other metrics that are more useful in measuring a model's performance.
Recall
Recall answers the question:
"When a case is actually positive, how often is the model correct? Or, what fraction of the positive labels does the model predict correctly?"
In essence, it's the fraction of the relevant cases that the model manages to find.
Now, go back to the confusion matrix and look along the Actual row for class 1 to identify which predictions correspond to an actually positive label.
As you can see, it's the TP + FN count.
And the number of times the model got it right is equal to the TP count. So here's our formula for recall:
Recall = TP / (TP + FN)
Our previous model for disease detection did not identify any positive cases, so the TP count is 0, which leaves us with a recall of 0, even though the accuracy score is 0.995.
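Continuing the same rare-disease sketch from above, scikit-learn's recall_score makes this concrete:

```python
import numpy as np
from sklearn.metrics import recall_score

# Same illustrative setup as before: 995 actual negatives, 5 actual positives,
# and a naive model that always predicts class 0
y_true = np.array([0] * 995 + [1] * 5)
y_pred = np.zeros_like(y_true)

print(recall_score(y_true, y_pred))  # 0.0 -- none of the 5 positives were found
```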
Precision
Precision answers the following question:
"When the prediction is positive, how often is it correct?"
Once again, go back to the confusion matrix and look down the Predicted column for class 1 to identify which predictions correspond to a predicted positive label. That's the TP + FP count.
Here's our formula for precision:
Precision = TP / (TP + FP)
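As a quick sketch, here's precision_score (alongside recall_score) on a small set of made-up labels, chosen only for illustration:

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels: 6 actual positives, 4 actual negatives
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]

# TP = 4, FP = 1  ->  precision = 4 / (4 + 1) = 0.8
print(precision_score(y_true, y_pred))  # 0.8

# TP = 4, FN = 2  ->  recall = 4 / (4 + 2) = 0.666...
print(recall_score(y_true, y_pred))     # 0.666...
```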
In practice, you'll often hear people talk about the Precision-Recall Trade-off.
This means you typically cannot maximize both precision and recall at the same time, and will have to favor one over the other, depending on the problem at hand.
Let's discuss that in the next section.
High Precision vs High Recall - When to Choose What?
For the problem that youâre solving, ask yourself the question: Which is worse - a False Positive (FP) or a False Negative (FN)?
If you cannot have a False Negative (FN), maximize recall.
If you cannot have a False Positive (FP), maximize precision.
Let's revisit the previous examples of disease detection and spam detection.
In which of the above cases would you prefer a higher recall?
Well, you probably guessed it right. It's in the case of disease detection that you cannot afford to have a False Negative; therefore, you'll need a high recall.
Why?
You would rather misclassify a healthy patient as having the disease (a False Positive), follow up with additional medical examinations, and be extra cautious, than misclassify someone with the disease as healthy; in the worst case, that could cost the person's life.
Let us now look at the example of spam detection. Here, False Positives (FP) can be dangerous.
- Recall that in the problem of spam classification, tagging an email as spam means predicting a positive label.
- A spam email or two in your inbox does not cost you much, but what if an email from a recruiter were misclassified as spam, and you never looked at it?
- You'd lose a potential job opportunity. Here's where you should maximize precision.
Not detecting a spam email (False Negative) is not as impactful as predicting a recruiter's email to be spam (False Positive). So remember to ask yourself the above questions, and choose accordingly.
That concludes our discussion of the metrics. Now it's time to write some code.
Generating the Confusion Matrix in scikit-learn
Download the code used in this tutorial from my GitHub repo.
Now, let's see how you can generate the confusion matrix in scikit-learn.
- You'll have the ground truth labels y_true.
- And you'll have the predicted labels y_pred.
Here are the steps:
- Generate the arrays
- Generate the confusion matrix
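Here's a minimal sketch of those two steps, with made-up label values used purely for illustration:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Step 1: generate the arrays of ground truth and predicted labels
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])

# Step 2: generate the confusion matrix
# Note: scikit-learn orders the rows/columns by label value (0, then 1),
# so the layout here is [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[4 1]
#  [1 4]]
```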
To know more about the implementation of the confusion matrix in sklearn, read the docs here.
Generating the Classification Report in scikit-learn
▶️ Here's how you can generate the classification report with metrics like accuracy, precision, recall, and F1-score in scikit-learn.
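Here's a minimal sketch, reusing the same made-up labels as in the confusion matrix example above:

```python
from sklearn.metrics import classification_report

# The same illustrative labels as before
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# Prints per-class precision, recall, F1-score, and support, plus overall accuracy
print(classification_report(y_true, y_pred))
```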
To summarize
In this tutorial, you've learned:
- What classification is and its types,
- The general structure of the confusion matrix,
- How to calculate various metrics from the confusion matrix, and
- When to choose precision over recall and vice-versa.
Congratulations on making it this far!
Related Posts
If you liked this post, here are a few other posts you may enjoy reading:
▶️ Document-Term Matrix in NLP: Count and TF-IDF Scores Explained
▶️ Learn K-Means Clustering by Quantizing Color Images in Python
▶️ 9 Best Data Engineering Courses You Should Take in 2022
If you're looking to get started with Machine Learning, I wish you the very best in your journey! Happy learning and coding.