Know what you don't know - allowing classifiers to reject input samples

Insight in Brief

Typical classification algorithms assign a class to every sample, even though that sample may be far off / dissimilar to the training data or the type of thing you want to classify. This is especially problematic for safety-critical systems. Therefore, in this article the four methods

Rejection samples for neural networks
Sigma-Ellipses for Bayesian classifiers with Gaussian class densities
One-class SVMs with radial basis functions
Isolation forests

to integrate novelty detection into classification algorithms are presented and compared using a toy example. Which method suits you best – as always – depends on the application.

Introduction

Many classification algorithms assign one class to every sample (feature vector), even if that sample is far from the training data. This property is problematic for practical applications, especially if they are safety-critical. Therefore, it is desirable to allow classification algorithms to respond with “I don’t know” for samples dissimilar to the training data. In this article, methods to achieve this goal are presented and are compared using a toy example.

Toy example

Suppose a fruit company wants to build a machine that packs apples and pears coming along a conveyor belt into boxes for shipping. The machine needs to distinguish between apples and pears to pack them into the right box. The properties measured for classification are the weight and color of the fruits. The company decides to train a linear classifier with training data obtained from measuring a large sample of apples and pears from its fields (the training data is visualized in Figure 1 on the right) The company brings the classifier into production and the system seems to work fine.

Figure 1: Classifier (left) and Features of training examples (right)

After some weeks, customers started complaining about having tennis balls in their apple boxes. The company investigates the problem. They know that there is a tennis court near their fields, and they find out that the trained classifier classifies the incoming tennis balls as apples – although they are dissimilar to the training data (see Figure 2 on the right). How can the company solve this problem?

Figure 2: Classifier with tennis ball (left) and Linear classifiers with decision regions (right)

Possible solutions

1. Neural networks

“Classic” neural networks are not inherently able to reject out-of-distribution samples. One way to solve this problem is to introduce an additional rejection class. To generate training samples for that rejection class, the smallest surrounding hypercube for the data of each class is constructed. In the non-overlapping regions of the hypercubes of all classes, training samples are generated and labeled as “rejection” classes. Care must be taken that the number of rejection samples is in a sound ratio to the other training samples to avoid a skewed distribution. With the generated rejection samples, the neural network can be trained normally (the network capacity may need to be increased). You can see in Figure 3 that that tennis ball would NOT be classified as apple or pear.

Figure 3: Neural network classifier with decision regions and rejection. Black dots are generated rejection samples.

2. Bayesian classifiers with Gaussian class densities

Assume that the classifier consists of a generative model that has normally distributed (or Gaussian mixture) class densities \( p\left(\mathbf{x}\middle| C_i\right) \) ~ \( N\left(\mathbf{\mu}_\mathbf{i},\mathbf{\Sigma}_\mathbf{i}\right) \). Then one could assume that the evidence term (calculated by the law of total probability),

\( p(x)= \sum{p\left({x}\middle| C_i\right)p\left(C_i\right),} \)

could be used to reject “unlikely” samples with respect to the model. However, \( p(x) \) is a probability density and has no absolute meaning. Its values are dependent on the scaling of the features. Therefore, a more reliable measure is needed. One idea is to use the Mahalanobis distance of the feature vector to the mean of a certain class

\( {d_i(x)}^2=\left(\mathbf{x}-\mathbf{\mu}_\mathbf{i}\right)^\mathbf{T}{\mathbf{\Sigma}_\mathbf{i}}^{-\mathbf{1}}\left(\mathbf{x}-\mathbf{\mu}_\mathbf{i}\right). \)

This distance forms ellipses (at least in 2d) around the distribution mean whose axes are scaled by the distribution variance. One can then calculate the probability of observing a distance greater than a certain value by

\( p\left(D_i>d\right)=\frac{1}{2}\left(1-erf\left(\frac{1}{\sqrt2}d\right)\right). \)

As such, samples can be rejected if the probability of observing a certain distance for that sample is below a threshold

\( p\left(D_i>d_i\left(x\right)\right)<\delta \)

These thresholds can correspond to the well-known three-sigma rule – for example, choosing a threshold of \( \delta \) = 0.27 means that 99.73 (3 sigma) of the data generated by the model would lie within this region. To extend the criterion to multiple classes the following relation can be used

\( p_{\sigma(x)}=\frac{\underset i \max {p(D_i>d_i(x))p(C_i)}}{\underset i \max {p\left(C_i\right)}} \)

You can see in Figure 4 that that tennis ball would NOT be classified as apple or pear and that the decision regions have an elliptic shape.

Figure 4: Bayesian classifier with decision regions and rejection

3. Two stages (two models) approaches

In general, it is also possible to follow a two-stage approach to reject samples. In this case, two models are trained. The first model detects samples that are not part of the training data and does the rejection. The second model is then any other classifier (as the linear classifier in the initial toy example). In the following the two possible rejection models one-class SVM and isolation forests are shown. Nevertheless, going in this direction, nearly any model that can perform outlier detection can be used.

3.1. One-class SVM with radial basis function kernel

One possibility to train a rejection model would be to use a one-class support vector machine. To capture arbitrary data distributions a radial basis function kernel can be used. Figure 5 shows the result of applying the one-class SVM to the toy example. Again, the tennis ball would be rejected.

Figure 5: Rejection based on a one-class SVM with RBF kernel

3.2. Isolation forest

As a second example of a rejection model, Figure 6 shows the result of training an isolation forest for the toy example. The key idea about isolation forests or trees is that anomalies can be isolated from the data with only a few decisions and have therefore a low depth in the trained isolation trees.

Figure 6: Rejection based on isolation forest

Summary

In this article, three different approaches to integrating “I don’t know” into classification algorithms have been shown. Which method works best (in case of accuracy, number of dimensions, computational performance or memory requirements, overall system design, etc.) depends on the application. In general, it is desirable to have to train only one model to keep the overhead and the computational complexity low. However, if the model is not able to reject samples a two-stage procedure can be used (e.g. class densities are not normally distributed). In general, one should always be aware of the consequences of misclassification for a particular application.