Evaluation Metrics for Classification.
Evaluating machine learning models is a fundamental part of building reliable models. Evaluation ensures that a model effectively does the job it was built to do and also checks for possible biases. Without proper evaluation, we wouldn't know how good or bad our models are prior to deployment.
Today we are looking at some evaluation metrics you could use for classification models.
1. Accuracy Score.
Accuracy Score is the simplest and most frequently used evaluation metric for classification models. It’s what we are referring to when talking about the accuracy of a machine learning model.
Accuracy score is the ratio of the total correct predictions to the total number of samples in the data.
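To make this concrete, here's a minimal sketch using scikit-learn's accuracy_score (the labels below are made up purely for illustration):

```python
from sklearn.metrics import accuracy_score

# Hypothetical true labels and model predictions, for illustration only
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Accuracy = correct predictions / total samples
print(accuracy_score(y_true, y_pred))  # 6 correct out of 8 -> 0.75
```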
As simple as the accuracy score is, it can be misleading when your data contains imbalanced target classes.
Assume you are working on a fraud detection model for bank transactions where 98% of all transactions are non-fraudulent and only 2% are fraudulent. Your model could have a 98% accuracy score by simply predicting all transactions as non-fraudulent. This can be an awfully expensive affair if a fraudulent transaction is classified as non-fraudulent when the model is in production.
In such cases we need to look at other evaluation metrics that are better suited to imbalanced data.
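To see the problem in code, here's a small sketch of that fraud scenario with made-up labels, where a do-nothing model still scores 98% accuracy:

```python
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced data: 98 legitimate transactions (0), 2 fraudulent (1)
y_true = [0] * 98 + [1] * 2

# A "model" that blindly predicts every transaction as non-fraudulent
y_pred = [0] * 100

# Accuracy looks great even though every fraudulent transaction was missed
print(accuracy_score(y_true, y_pred))  # 0.98
```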
2. Precision
Precision is the ratio of true positives to all predicted positives. Precision answers the question:
Of all the cases the model predicted to be positive, how many were actually positive?
To explain this further we’ll need a confusion matrix (there's a quick code sketch after the definitions below).
True Negative — Observation is negative and predicted to be negative.
False Negative — Observation is positive but predicted to be negative.
False Positive — Observation is negative but predicted to be positive.
True Positive — Observation is positive and predicted to be positive.
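Here's a quick sketch of how you could get these four counts with scikit-learn's confusion_matrix (again with made-up labels):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions, for illustration only
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, scikit-learn orders the matrix as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 3 1 1 3
```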
Precision is a good metric to use when the cost of a False Positive outweighs the cost of a False Negative.
A good example is Email Spam Detection, where a False Positive means an important legitimate email (like a job offer) gets flagged as spam and missed, which is costlier than a few False Negatives (spam emails slipping into your inbox). In this case you should use precision to evaluate your model.
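In code, precision is simply TP / (TP + FP); here's a minimal sketch with scikit-learn's precision_score, using the same made-up labels as above:

```python
from sklearn.metrics import precision_score

# Same hypothetical labels as before
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Precision = TP / (TP + FP)
print(precision_score(y_true, y_pred))  # 3 / (3 + 1) = 0.75
```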
3. Recall
Recall is the ratio of true positives to the total actual positives. Recall answers the question:
Out of all the actual positive cases, how many did the model predict as positive?
Recall is a good measure to evaluate your model when the cost of a False Negative is higher than that of a False Positive.
A good example is a COVID-19 prediction model. Having an infected person predicted as healthy (False Negative) could be more detrimental than running a few more tests (False Positive). In such a scenario, Recall is a better evaluation metric.
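A minimal sketch with scikit-learn's recall_score, again using the same made-up labels:

```python
from sklearn.metrics import recall_score

# Same hypothetical labels as before
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Recall = TP / (TP + FN)
print(recall_score(y_true, y_pred))  # 3 / (3 + 1) = 0.75
```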
4. F1 Score
We use the F1 Score when we want to strike a balance between the Precision and Recall of a model. The F1 Score is the harmonic mean of Precision and Recall.
A good example of this would be a Credit Scoring model for a financial institution like a bank. In this case, if bad loans are predicted as good loans (False Positive), the bank might lose a lot of money from defaulted loans. On the other hand, if good loans are predicted as bad loans (False Negative), then the bank will lose additional income from those loans.
In this scenario, we might want to use a metric that is a blend of both precision and recall and that is the F1 Score.
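Since F1 is the harmonic mean of precision and recall, here's a short sketch with scikit-learn's f1_score (same made-up labels as above):

```python
from sklearn.metrics import f1_score

# Same hypothetical labels as before
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# F1 = 2 * (precision * recall) / (precision + recall)
print(f1_score(y_true, y_pred))  # precision = recall = 0.75 here, so F1 = 0.75
```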
Summary
In this article we have looked at 4 classification metrics, namely: Accuracy, Precision, Recall and F1 Score. There are other metrics that you can use to score your models, among them the Receiver Operating Characteristic (ROC) curve. In the next article we’ll look at ROC in depth. See you then.