Know what you don’t know – allowing classifiers to reject input samples

Daniel Gerstgrasser, Signal Processing Engineer
Reading time: 4 min

Insight in Brief

Typical classification algorithms assign a class to every sample, even though that sample may be far off / dissimilar to the training data or the type of thing you want classify. This is especially problematic for safety critical systems. Therefore, in this article the four methods

  • Rejection samples for neural networks
  • Sigma-Ellipses for Bayesian classifiers with gaussian class densities
  • One-Class SVMs with radial basis functions
  • Isolation forests


to integrate novelty detection into classification algorithms are presented and compared using a toy example. Which method suits best – as always – depends on the application.


Many classification algorithms assign one class to every sample (feature vector), even if that sample is far from the training data. This property is problematic for practical applications, especially if they are safety critical. Therefore, it is desirable to allow classification algorithms to respond with “I don’t know” for samples dissimilar to the training data. In this article, methods to achieve this goal are presented and are compared using a toy example.

Toy example

Suppose a fruit company wants to build a machine that packs apples and pears coming along a conveyer belt into boxes for shipping. The machine needs to distinguish between apples and pears to pack them into the right box. The properties measured for classification are the weight and color of the fruits. The company decides to train a linear classifier with training data obtained from measuring a large sample of apples and pears from its fields (the training data is visualized in Figure 1 on the right) The company brings the classifier into production and the system seem to work fine.

Know what you don't know - EN1

Figure 1: Classifier (left) and Features of training examples (right)


After some weeks, customers start complaining about having tennis balls in their apple boxes. The company investigates the problem. They know that there is a tennis court near their fields, and they find out that the trained classifier classifies the incoming tennis balls as apples – although they are dissimilar to the training data (see Figure 2 on the right). How can the company solve this problem?

Know what you don't know - EN2

Figure 2: Classifier with tennis ball (left) and Linear classifiers with decision regions (right)



Possible solutions

1. Neural networks

“Classic” neural networks are not inherently able to reject out of distribution samples. One way to solve this problem is to introduce an additional rejection class. To generate training samples for that rejection class, a smallest surrounding hypercube for the data of each class is constructed. In the non-overlapping regions of the hypercubes of all classes, training samples are generated and labeled as “rejection” class. Care must be taken that the number or rejection samples is in a sound ratio to the other training samples to avoid a skewed distribution. With the generated rejection samples, the neural network can be trained normally (the networks capacity may need to be increased). You can see in Figure 3 that that tennis ball would NOT be classified as apple or pear.

Know what you don't know - EN3

Figure 3: Neural network classifier with decision regions and rejection. Black dots are generated rejection samples.


2. Bayesian classifiers with gaussian class densities

Assume that the classifier consists of a generative model that has normally distributed (or Gaussian mixture) class densities p\left(\mathbf{x}\middle| C_i\right)~N\left(\mathbf{\mu}_\mathbf{i},\mathbf{\Sigma}_\mathbf{i}\right). Then one could assume that the evidence term (calculated by the law of total probability),

    \[ p\left(\mathbit{x}\right)=\ \sum{p\left(\mathbit{x}\middle| C_i\right)p\left(C_i\right),} \]

could be used to reject “unlikely” samples with respect to the model. However, p\left(\mathbit{x}\right) is a probability density and has no absolute meaning. Its values dependent on the scaling of the features. Therefore, a more reliable measure is needed. One idea is to use the Mahalanobis distance of the feature vector to the mean of a certain class

    \[ {d_i(\mathbit{x})}^2=\left(\mathbf{x}-\mathbf{\mu}_\mathbf{i}\right)^\mathbf{T}{\mathbf{\Sigma}_\mathbf{i}}^{-\mathbf{1}}\left(\mathbf{x}-\mathbf{\mu}_\mathbf{i}\right). \]


This distance forms ellipses (at least in 2d) around the distribution mean whose axes are scaled by the distribution variance. One can then calculate the probability of observing a distance greater than a certain value by

    \[ p\left(D_i>d\right)=\frac{1}{2}\left(1-erf\left(\frac{1}{\sqrt2}d\right)\right). \]

As such, samples can be rejected if the probability of observing a certain distance for that sample is below a threshold

    \[ p\left(D_i>d_i\left(x\right)\right)<\delta$ \]

These thresholds can correspond to the well-known three-sigma rule – for example, choosing a threshold of \delta= 0.27 means that 99.73 (3 sigma) of the data generated by the model would lie within this region. To extend the criterion to multiple classes the following relation can be used

    \[ p_{\sigma\left(x\right)}=\frac{\underset i \max \left{p\left(D_i>d_i\left(x\right)\right)p\left(C_i\right)}}{\underset i \max \left{p\left(C_i\right)}} \]


You can see in Figure 4 that that tennis ball would NOT be classified as apple or pear and that the decision regions have an elliptic shape.

Know what you don't know - EN4

Figure 4: Bayesian classifier with decision regions and rejection


3. Two stages (two models) approaches

In general, it is also possible to follow a two-stage approach to reject samples. In this case, two models are trained. The first model detects samples that are not part of the training data and does the rejection. The second model is then any other classifier (as the linear classifier in the initial toy example). In the following the two possible rejection models one-class SVM and isolation forests are shown. Nevertheless, going this direction, nearly any model that can perform outlier detection can be used.


3.1. One-class SVM with radial basis function kernel

One possibility to train a rejection model would be to use a one-class support vector machine. To capture arbitrary data distributions a radial basis function kernel can be used. Figure 5 shows the result of applying the one-class SVM to the toy example. Again, the tennis ball would be rejected.

Know what you don't know - EN5

Figure 5: Rejection based on a one-class SVM with RBF kernel


3.2. Isolation forest

As a second example for a rejection model, Figure 6 shows the result of training an isolation forest for the toy example. The key idea about isolation forests or trees is that anomalies can be isolated from the data with only a few decisions and have therefore a low depth in the trained isolation trees.

Know what you don't know - EN6

Figure 6: Rejection based on isolation forest


In this article, three different approaches to integrate “I don’t know” into classification algorithms have been shown. Which method works best (in case of accuracy, number of dimensions, computational performance or memory requirements, overall system design etc.) depends on the application. In general, it is desirable to have to train only one model to keep the overhead and the computational complexity low. However, if the model is not able to reject samples a two-stage procedure can be used (e.g. class densities are not normally distributed). In general, one should always be aware of the consequences of a misclassification for a particular application.


Are you interested in reading further articles?

Leave your e-mail address here and we will inform you as soon as we publish new contributions.

This might also interest you

Reading time: 3 min.

Ensuring successful outsourced engineering projects

How do you ensure success of your outsourced engineering project from the start?
Usability study for medical devices
Reading time: 3 min.

Usability study for medical devices

Preventing safety related use-errors or harm to the user using usability studies.
Reading time: 4 min.

Why event-based software architecture?

This article explains the event-based software architecture, its advantages, & possible disadvantages for the development...
Expert Blog Architekturprozess
Reading time: 4 min.

The systematic creation of a system and software architecture

This article deals with the development of a good system architecture...
Reading time: 5 min.

Additive manufacturing - reliable enough for medical technology?

For a customer's device, a dynamically rotating system was developed within a short period of time...
Reading time: 4 min.

Know what you don't know

Typical classification algorithms assign a class to every sample, even though that sample may be far off...