Suppose a fruit company wants to build a machine that packs apples and pears coming along a conveyor belt into boxes for shipping. The machine needs to distinguish between apples and pears to pack them into the right box. The properties measured for classification are the weight and color of the fruits. The company decides to train a linear classifier with training data obtained from measuring a large sample of apples and pears from its fields (the training data is visualized in Figure 1 on the right). The company brings the classifier into production and the system seems to work fine.
Figure 1: Classifier (left) and Features of training examples (right)
After some weeks, customers start complaining about having tennis balls in their apple boxes. The company investigates the problem. They know that there is a tennis court near their fields, and they find out that the trained classifier classifies the incoming tennis balls as apples – although they are dissimilar to the training data (see Figure 2 on the right). How can the company solve this problem?
Figure 2: Classifier with tennis ball (left) and Linear classifiers with decision regions (right)
1. Neural networks
“Classic” neural networks are not inherently able to reject out-of-distribution samples. One way to solve this problem is to introduce an additional rejection class. To generate training samples for that rejection class, a smallest surrounding hypercube for the data of each class is constructed. In the non-overlapping regions of the hypercubes of all classes, training samples are generated and labeled as the “rejection” class. Care must be taken that the number of rejection samples is in a sound ratio to the other training samples to avoid a skewed class distribution. With the generated rejection samples, the neural network can be trained normally (the network's capacity may need to be increased). You can see in Figure 3 that the tennis ball would NOT be classified as apple or pear.
Figure 3: Neural network classifier with decision regions and rejection. Black dots are generated rejection samples.
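The sample-generation step can be sketched in Python with NumPy. This is only one possible reading of the procedure: it assumes two features (weight in grams and a color value), synthetic toy data, and that rejection samples are drawn uniformly from an enlarged box around all data and kept only if they fall outside every class's bounding box. The function name `generate_rejection_samples`, the `margin` parameter and all numbers are illustrative assumptions.

```python
import numpy as np

def generate_rejection_samples(X, y, n_samples, margin=0.5, seed=0):
    """Sample uniformly around the data; keep points outside all class boxes."""
    rng = np.random.default_rng(seed)
    # Axis-aligned bounding box (lower corner, upper corner) for each class.
    boxes = [(X[y == c].min(axis=0), X[y == c].max(axis=0)) for c in np.unique(y)]
    # Sampling region: box around all data, enlarged by `margin` times its extent.
    lo, hi = X.min(axis=0), X.max(axis=0)
    extent = hi - lo
    lo, hi = lo - margin * extent, hi + margin * extent

    kept, n_kept = [], 0
    while n_kept < n_samples:
        candidates = rng.uniform(lo, hi, size=(n_samples, X.shape[1]))
        inside_any = np.zeros(len(candidates), dtype=bool)
        for box_lo, box_hi in boxes:
            inside_any |= np.all((candidates >= box_lo) & (candidates <= box_hi), axis=1)
        outside = candidates[~inside_any]
        kept.append(outside)
        n_kept += len(outside)
    return np.vstack(kept)[:n_samples]

# Hypothetical toy data: columns are (weight in grams, color value).
rng = np.random.default_rng(1)
apples = rng.normal([150.0, 0.8], [12.0, 0.05], size=(200, 2))
pears = rng.normal([190.0, 0.4], [15.0, 0.05], size=(200, 2))
X = np.vstack([apples, pears])
y = np.array([0] * 200 + [1] * 200)

# As many rejection samples as there are samples per class keeps the classes balanced.
X_reject = generate_rejection_samples(X, y, n_samples=200)
X_train = np.vstack([X, X_reject])
y_train = np.concatenate([y, np.full(len(X_reject), 2)])  # label 2 = rejection
```

`X_train` and `y_train` can then be passed to any three-class classifier. The ratio of rejection samples to real samples is set through `n_samples`.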
2. Bayesian classifiers with Gaussian class densities
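Assume that the classifier consists of a generative model that has normally distributed (or Gaussian mixture) class densities $p(x \mid C_k) = \mathcal{N}(x \mid \mu_k, \Sigma_k)$. Then one could assume that the evidence term (calculated by the law of total probability),

$$p(x) = \sum_k p(x \mid C_k)\, P(C_k),$$

could be used to reject “unlikely” samples with respect to the model. However, $p(x)$ is a probability density and has no absolute meaning: its values depend on the scaling of the features. Therefore, a more reliable measure is needed. One idea is to use the Mahalanobis distance of the feature vector $x$ to the mean of a certain class $C_k$:

$$d_k(x) = \sqrt{(x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k)}.$$

The contours of constant distance form ellipses (at least in 2D) around the distribution mean, with axes scaled according to the covariance of the distribution. For a sample drawn from class $C_k$ with $n$ features, the squared distance follows a chi-squared distribution with $n$ degrees of freedom. One can then calculate the probability of observing a distance greater than a certain value $d$ by

$$P(D_k > d) = 1 - F_{\chi^2_n}(d^2),$$

where $F_{\chi^2_n}$ is the cumulative distribution function of that chi-squared distribution. As such, samples can be rejected if the probability of observing a certain distance for that sample is below a threshold $\alpha$:

$$P(D_k > d_k(x)) < \alpha.$$

These thresholds can correspond to the well-known three-sigma rule: for example, choosing a threshold of $\alpha = 0.27\%$ means that 99.73% (3 sigma) of the data generated by the model would lie within this region. To extend the criterion to multiple classes, the following relation can be used: a sample is rejected only if it falls outside the acceptance region of every class,

$$\max_k P(D_k > d_k(x)) < \alpha.$$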
You can see in Figure 4 that the tennis ball would NOT be classified as apple or pear and that the decision regions have an elliptic shape.
Figure 4: Bayesian classifier with decision regions and rejection
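A minimal sketch of Mahalanobis-based rejection, restricted to the two-feature toy case. For two features the chi-squared tail probability has the closed form exp(-d²/2), so only NumPy is needed; for other dimensions one would use a general chi-squared survival function such as `scipy.stats.chi2.sf(d2, df=n)`. The toy data, the equal priors, the feature values for the tennis ball and the helper names (`fit_class`, `classify_with_rejection`) are all assumptions made for illustration.

```python
import numpy as np

def fit_class(X, prior):
    """Estimate one Gaussian class density from samples."""
    cov = np.cov(X, rowvar=False)
    return {"mean": X.mean(axis=0), "cov_inv": np.linalg.inv(cov),
            "logdet": np.linalg.slogdet(cov)[1], "prior": prior}

def squared_mahalanobis(x, model):
    diff = x - model["mean"]
    return float(diff @ model["cov_inv"] @ diff)

def classify_with_rejection(x, models, alpha=0.0027):
    """Return the most probable class index, or -1 if x is rejected."""
    d2 = np.array([squared_mahalanobis(x, m) for m in models])
    # 2-D case only: P(D > d) = exp(-d^2 / 2) for chi-squared with 2 dof.
    tail = np.exp(-0.5 * d2)
    if tail.max() < alpha:          # outside the acceptance ellipse of every class
        return -1
    # Log posterior up to a constant shared by all classes.
    log_post = [-0.5 * d - 0.5 * m["logdet"] + np.log(m["prior"])
                for d, m in zip(d2, models)]
    return int(np.argmax(log_post))

# Hypothetical toy data: columns are (weight in grams, color value).
rng = np.random.default_rng(0)
apples = rng.normal([150.0, 0.8], [12.0, 0.05], size=(200, 2))
pears = rng.normal([190.0, 0.4], [15.0, 0.05], size=(200, 2))
models = [fit_class(apples, prior=0.5), fit_class(pears, prior=0.5)]

apple_label = classify_with_rejection(np.array([150.0, 0.8]), models)
pear_label = classify_with_rejection(np.array([190.0, 0.4]), models)
ball_label = classify_with_rejection(np.array([58.0, 0.3]), models)
```

With these assumed values, the tennis ball lies many standard deviations from both class means in weight, so its tail probability is far below the threshold for both classes and it is rejected (`-1`).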
3. Two-stage (two-model) approaches
In general, it is also possible to follow a two-stage approach to reject samples. In this case, two models are trained. The first model detects samples that do not resemble the training data and performs the rejection. The second model can then be any other classifier (such as the linear classifier in the initial toy example). In the following, two possible rejection models are shown: the one-class SVM and the isolation forest. However, nearly any model that can perform outlier detection can be used.
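The two-stage structure itself can be sketched independently of the concrete models. In the following Python sketch, the `TwoStageClassifier` wrapper, the crude range-based detector and the nearest-centroid classifier (which has a linear decision boundary for two classes) are hypothetical stand-ins chosen to keep the example self-contained; any outlier detector and any classifier could take their place.

```python
import numpy as np

class TwoStageClassifier:
    """Stage 1 accepts or rejects; stage 2 classifies accepted samples only."""
    REJECT = -1

    def __init__(self, is_inlier, classify):
        self.is_inlier = is_inlier  # callable: (n, d) array -> boolean mask
        self.classify = classify    # callable: (n, d) array -> class labels

    def predict(self, X):
        X = np.atleast_2d(X)
        labels = np.full(len(X), self.REJECT)
        accepted = self.is_inlier(X)
        if accepted.any():
            labels[accepted] = self.classify(X[accepted])
        return labels

# Hypothetical toy data: columns are (weight in grams, color value).
rng = np.random.default_rng(0)
apples = rng.normal([150.0, 0.8], [12.0, 0.05], size=(200, 2))
pears = rng.normal([190.0, 0.4], [15.0, 0.05], size=(200, 2))
X = np.vstack([apples, pears])
mean, std = X.mean(axis=0), X.std(axis=0)
centroids = np.stack([apples.mean(axis=0), pears.mean(axis=0)])

def nearest_centroid(X_new):
    # Stand-in linear classifier: closest class centroid in standardized space.
    z = (X_new - mean) / std
    zc = (centroids - mean) / std
    dists = np.linalg.norm(z[:, None, :] - zc[None, :, :], axis=2)
    return dists.argmin(axis=1)  # 0 = apple, 1 = pear

def inside_training_range(X_new):
    # Stand-in detector: accept only points within the observed feature ranges.
    return np.all((X_new >= X.min(axis=0)) & (X_new <= X.max(axis=0)), axis=1)

model = TwoStageClassifier(inside_training_range, nearest_centroid)
tennis_ball = np.array([58.0, 0.3])
typical_apple = np.array([150.0, 0.8])

stage2_only = nearest_centroid(tennis_ball[None, :])[0]
two_stage = model.predict(np.vstack([tennis_ball, typical_apple]))
```

On this assumed data, the classifier alone assigns the tennis ball to the apple class, while the two-stage model rejects it and still labels the typical apple as an apple.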
3.1. One-class SVM with radial basis function kernel
One possibility for training a rejection model is to use a one-class support vector machine. To capture arbitrary data distributions, a radial basis function (RBF) kernel can be used. Figure 5 shows the result of applying the one-class SVM to the toy example. Again, the tennis ball would be rejected.
Figure 5: Rejection based on a one-class SVM with RBF kernel
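A rejection model of this kind could look roughly as follows with scikit-learn's `OneClassSVM`. The toy data, the feature values for the tennis ball, the standardization step and the hyperparameters (`nu=0.05`, `gamma="scale"`) are assumptions for illustration and would need tuning on real measurements.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# Hypothetical toy data: columns are (weight in grams, color value).
rng = np.random.default_rng(0)
apples = rng.normal([150.0, 0.8], [12.0, 0.05], size=(200, 2))
pears = rng.normal([190.0, 0.4], [15.0, 0.05], size=(200, 2))
X = np.vstack([apples, pears])

# Standardize first: the RBF kernel is sensitive to the scale of each feature.
# nu upper-bounds the fraction of training samples treated as outliers.
rejector = make_pipeline(
    StandardScaler(),
    OneClassSVM(kernel="rbf", gamma="scale", nu=0.05),
)
rejector.fit(X)  # trained on all fruit samples, class labels are not needed

tennis_ball = np.array([[58.0, 0.3]])
ball_label = rejector.predict(tennis_ball)[0]          # +1 = accept, -1 = reject
train_accept_rate = np.mean(rejector.predict(X) == 1)  # share of fruit accepted
```

Samples with label `+1` would be passed on to the second-stage classifier; samples with label `-1` would be rejected.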
3.2. Isolation forest
As a second example of a rejection model, Figure 6 shows the result of training an isolation forest on the toy example. The key idea behind isolation forests (or isolation trees) is that anomalies can be isolated from the rest of the data with only a few random splits and therefore end up at a low depth in the trained isolation trees.
Figure 6: Rejection based on isolation forest
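The same stage-one role can be sketched with scikit-learn's `IsolationForest`. The toy data, the feature values for the tennis ball and the settings (`n_estimators=100`, default `contamination`) are assumptions for illustration. Whether a given sample is flagged depends on the score threshold, so the sketch exposes the raw anomaly scores as well as the thresholded prediction.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical toy data: columns are (weight in grams, color value).
rng = np.random.default_rng(0)
apples = rng.normal([150.0, 0.8], [12.0, 0.05], size=(200, 2))
pears = rng.normal([190.0, 0.4], [15.0, 0.05], size=(200, 2))
X = np.vstack([apples, pears])

# Tree-based splits do not depend on feature scale, so no standardization here.
forest = IsolationForest(n_estimators=100, random_state=0)
forest.fit(X)

tennis_ball = np.array([[58.0, 0.3]])
# score_samples: lower values mean shorter average path length, i.e. more anomalous.
ball_score = forest.score_samples(tennis_ball)[0]
train_scores = forest.score_samples(X)
# predict applies the fitted threshold: +1 = accept, -1 = reject.
ball_label = forest.predict(tennis_ball)[0]
```

Because the tennis ball is much lighter than any fruit in the assumed data, it is separated from the training samples after few splits and receives a lower score than typical fruit. The `contamination` parameter controls where the rejection threshold is placed.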