Balancing with Binary Classification, Part 1


Evaluating a trained machine learning model is not an easy task. Even in a rather simple case, classifying a dataset into two classes, deciding whether the model is actually good requires careful attention. In this blog we present two evaluation metrics for binary classification models, precision and recall, along with some examples. In the second part of this blog we will see how bringing probabilities into play adds a whole new level to the classification game.

Generally speaking, in classification our aim is to infer the classes of objects based on their characteristics. For instance, you could classify portrait photos as happy, neutral, angry or sad according to the facial expressions in them. We talk about binary classification when there are exactly two groups we want our objects to be classified into. For example, emails could be classified into spam and non-spam.

Large numbers of objects with numerous features easily exceed the computational capacity of a human brain. Luckily, in such tasks machine learning can reveal hidden patterns that the human eye cannot see. Classical algorithms suitable for classification include Decision Trees and Logistic Regression. As with all trained machine learning models, it is essential to evaluate the model’s reliability and improve (or change) the model if necessary. Often the first natural metric for classification models is accuracy, the share of all predictions that are correct. In the real world, however, classes are rarely evenly represented in a dataset, and accuracy might not be well suited for evaluating the classification model.

Example 1. We’ll take credit card fraud detection as our first example. Let us assume that there are altogether 1000 credit card transactions, of which 10 are frauds and the remaining 990 are honest. Just by classifying all transactions as honest, we have a model with 99% accuracy! Unfortunately this “model” doesn’t succeed at all in recognizing the frauds, which is the reason we wanted to have the model in the first place.

Example 2. Let us assume that we have a model which classifies people into healthy (negative diagnosis) and ill (positive diagnosis). The model is tested on 100 people, of whom 10 are known to be ill. The model gives the following predictions:

                      Predicted condition
                      Ill (positive)    Healthy (negative)
Actually ill                5                   5
Actually healthy            0                  90

In this example, among the 10 ill people the model correctly predicted five as ill and falsely predicted five as healthy. All 90 healthy people are also healthy according to our model. So the model is right in 95% of the cases, i.e. it has pretty good accuracy. However, 50% of the ill get a wrong diagnosis from this model, so it cannot be considered a good one.
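As a quick check of these figures, the counts from Example 2 can be plugged in directly. The variable names below are illustrative assumptions, not part of any library.

```python
# Counts from Example 2 (names are hypothetical, chosen for readability).
ill_found = 5        # ill people predicted ill
ill_missed = 5       # ill people predicted healthy
healthy_correct = 90 # healthy people predicted healthy
healthy_wrong = 0    # healthy people predicted ill

total = ill_found + ill_missed + healthy_correct + healthy_wrong

# Share of all predictions that are correct.
accuracy = (ill_found + healthy_correct) / total

# Share of the ill who received a wrong (healthy) diagnosis.
missed_share = ill_missed / (ill_found + ill_missed)

print(accuracy)      # 0.95
print(missed_share)  # 0.5
```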

Precision and Recall

In binary classification we can divide the predictions into four groups according to their true and predicted labels:

  • True positives, i.e. positives that the model also predicted to be positive
  • False positives, i.e. negatives that the model falsely predicted to be positive
  • True negatives, i.e. negatives that the model also predicted to be negative
  • False negatives, i.e. positives that the model falsely predicted to be negative
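The four groups can be counted from paired lists of true and predicted labels. The following is a minimal sketch: the helper `confusion_counts` and the 1/0 label encoding are assumptions made for this illustration, and the data reproduces Example 2.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for binary labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1  # positive, predicted positive
            else:
                fp += 1  # negative, predicted positive
        else:
            if truth == positive:
                fn += 1  # positive, predicted negative
            else:
                tn += 1  # negative, predicted negative
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}


# Example 2 with an assumed encoding: 1 = ill (positive), 0 = healthy (negative).
y_true = [1] * 10 + [0] * 90
y_pred = [1] * 5 + [0] * 5 + [0] * 90

counts = confusion_counts(y_true, y_pred)
print(counts)  # {'tp': 5, 'fp': 0, 'tn': 90, 'fn': 5}
```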

For imbalanced datasets (as in the examples above), there are metrics which describe the model’s reliability better than accuracy. Here we will take a closer look at two of those, namely precision and recall.


Precision is the proportion of true positives among all data points predicted to be positive: precision = TP / (TP + FP), where TP and FP are the numbers of true positives and false positives. In cases where false positive predictions have serious consequences, we want the precision to be high, i.e. close to 1. For instance, you don’t want important emails (negative) to be classified as spam (positive).


Recall is the proportion of truly positive data points that the model recognized as positive: recall = TP / (TP + FN), where FN is the number of false negatives. A false negative prediction can be fatal, for instance, when diagnosing deadly diseases or severe illnesses; in such cases it is vital that the model has high recall.
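Both metrics follow directly from the counts of true positives, false positives and false negatives. The sketch below applies them to the two examples above; the function names are hypothetical, and returning 0.0 when the denominator is zero is a convention chosen here, since the ratio is undefined in that case.

```python
def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    predicted_positive = tp + fp
    # Assumed convention: 0.0 when the model predicted no positives at all.
    return tp / predicted_positive if predicted_positive > 0 else 0.0


def recall(tp, fn):
    """Share of truly positive data points that the model found."""
    actual_positive = tp + fn
    # Assumed convention: 0.0 when the data contains no positives.
    return tp / actual_positive if actual_positive > 0 else 0.0


# Example 2 (diagnosis): 5 true positives, 0 false positives, 5 false negatives.
print(precision(5, 0))  # 1.0
print(recall(5, 5))     # 0.5

# Example 1 (all transactions labelled honest): 0 TP, 0 FP, 10 FN.
print(precision(0, 0))  # 0.0 (by the convention above)
print(recall(0, 10))    # 0.0
```

In Example 2 every positive diagnosis is correct, so precision is perfect, yet recall shows that half of the ill were missed.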

In the next part we will see that increasing both precision and recall simultaneously can be challenging. What if we need to avoid both false positives and false negatives? How do we find a balance between precision and recall, i.e. a situation where both are good enough? A common way to combine recall and precision into one metric is their harmonic mean, called the F1 score:

F1 = 2 · precision · recall / (precision + recall)

The higher the F1 score is, the better the model when measured by both precision and recall. The second part of the blog covers the trade-off between precision and recall as well as the use of probabilities in the predictions.
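To see how the harmonic mean behaves, here is a small sketch. The helper name `harmonic_f1` is hypothetical, and returning 0.0 when both inputs are zero is an assumed convention. Example 2 has five true positives, no false positives and five false negatives, which gives precision 1.0 and recall 0.5.

```python
def harmonic_f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumed convention when both metrics are zero
    return 2 * precision * recall / (precision + recall)


# Example 2: precision 1.0, recall 0.5.
example_2 = harmonic_f1(1.0, 0.5)
print(round(example_2, 3))  # 0.667

# When precision and recall are equal, F1 equals that shared value.
balanced = harmonic_f1(0.75, 0.75)
print(balanced)  # 0.75
```

Note that the F1 score of Example 2 is lower than the arithmetic mean of 1.0 and 0.5 (0.75): the harmonic mean is pulled towards the weaker of the two metrics.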