What connections are there between data augmentation and out-of-distribution data?

July 2025 · 15 min read · ~2,600 words
Diego Huerfano

CEO & Co-founder
Nabla Labs

Discover how data augmentation during training shapes what counts as an out-of-distribution sample.

What are the Data and Empirical Distributions?

The framework for machine learning is to assume the data is drawn from a hypothetical distribution that is mainly defined by the application domain and by how the data is collected, the data generating process. This process is specified by a random variable $X$ that follows the data distribution, $X \sim P$, and the data collection process consists of drawing a sequence $X_1, X_2, \ldots, X_n$ of length $n$ so that the variables are independent and identically distributed (IID), i.e. $X_i \sim P$ for all $i$. After data collection, we obtain a sample $\{x_1, \ldots, x_n\}$ where, in the case of images for example, each image $x_i$ is a realization of the random variable $X_i$. Although the data distribution $P$ is almost always unknown and complex, it's assumed that it exists and that it drives the data as just described. It's how we link reality to the theory.
It's useful to know what assumptions are made within this framework, as this allows us to understand its limitations, how much of the theory applies to our case, and how best to conduct data collection, for example by keeping in mind the ideal independence of realizations.
On the other hand, the more tangible empirical data distribution [1] is the observed distribution of a sample drawn from the data distribution. Given a sample $D = \{x_1, \ldots, x_n\}$ (which can be, e.g., our training or test set), its empirical distribution is defined by the density
$$\hat{p}(x) = \frac{1}{n} \sum_{i=1}^{n} \delta(x - x_i),$$
where $\delta$ is the Dirac delta function. As simple as it is, being just an average of individual single-point densities (the Dirac delta functions) centered at each data point $x_i$, it is a good approximation of the true data density $p$ of the distribution $P$ when the sample size $n$ is large enough. It will also allow us to gain some insights about the relation between data augmentation and out-of-distribution data.
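A quick way to see what this definition means in practice (my own illustrative sketch, not from the original post): since all the probability mass sits on the Dirac deltas, sampling from the empirical distribution is nothing more than drawing observed points uniformly at random with replacement.

```python
import numpy as np

def sample_empirical(data, k, seed=None):
    """Draw k samples from the empirical distribution of `data`.

    The empirical density is an average of Dirac deltas centered at the
    observed points, so sampling from it just means picking observed
    points uniformly at random with replacement.
    """
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(data), size=k)
    return data[idx]

# Toy usage with made-up data: 1000 points in a 64-dimensional feature space.
data = np.random.default_rng(0).normal(size=(1000, 64))
print(sample_empirical(data, k=256, seed=1).shape)  # (256, 64)
```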

What does Data Augmentation do to the Empirical Data Distribution?

If our training dataset is $D_{\text{train}} = \{x_1, \ldots, x_n\}$, during training we apply a (random) data augmentation transformation $T_\theta$, parametrized by $\theta$, to each data point $x_i$, which results in a random variable $\tilde{X}_i = T_\theta(x_i)$. The nature of this transformation depends on the domain and the task we are trying to solve, but it often involves perturbing each $x_i$ in a way that $T_\theta(x_i)$ is equivalent to $x_i$ for the task, e.g. the image class of both the augmented and the original data point is the same in an image classification task. This is important because it restricts how far the augmented points can end up from the original points, so that the distance $\|T_\theta(x_i) - x_i\|$ is not so large that $T_\theta(x_i)$ lies closer to a data point of a different class, or that the features of $T_\theta(x_i)$ become meaningless for the task, for example.
Figure: Visualization of data augmentation in machine learning. Black dots represent original data points $x_i$, while blue points and contours show the effect of the augmentation densities $q_\theta(x \mid x_i)$ in feature space.
The augmentation $T_\theta$ then induces a probability density $q_\theta(x \mid x_i)$ of augmented samples around each $x_i$. This density is in principle known to us, since we have control over $T_\theta$, and we want $q_\theta(x \mid x_i)$ to be high around $x_i$ and low far from $x_i$ (see Figure). For example, if the augmentation consists of adding random Gaussian noise to the image $x_i$, then $q_\theta(x \mid x_i)$ is a Gaussian density centered at $x_i$ with a standard deviation defined by $\theta$. In the end, we obtain the augmented empirical density
$$\hat{p}_\theta(x) = \frac{1}{n} \sum_{i=1}^{n} q_\theta(x \mid x_i),$$
which, similarly to the empirical density, is an average of densities around each data point, where the Dirac delta functions have been spread out to $q_\theta(x \mid x_i)$ by means of the augmentation.
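To make the Gaussian example concrete, here is a minimal sketch (my own illustration; the noise scale `sigma` plays the role of the augmentation parameter $\theta$) that evaluates the augmented empirical density, which for Gaussian-noise augmentation coincides with a Gaussian kernel density estimate centered at the training points:

```python
import numpy as np

def gaussian_augment(x, sigma, seed=None):
    """One draw of the augmentation T_theta: add isotropic Gaussian noise."""
    rng = np.random.default_rng(seed)
    return x + rng.normal(scale=sigma, size=x.shape)

def log_augmented_density(x, train, sigma):
    """log of p_hat_theta(x) = (1/n) * sum_i N(x; x_i, sigma^2 I).

    With Gaussian-noise augmentation, each q_theta(. | x_i) is a Gaussian
    centered at x_i, so the augmented empirical density is a Gaussian
    kernel density estimate over the training points.
    """
    n, d = train.shape
    sq_dists = np.sum((train - x) ** 2, axis=1)  # ||x - x_i||^2 for every i
    log_kernels = -sq_dists / (2 * sigma ** 2) - 0.5 * d * np.log(2 * np.pi * sigma ** 2)
    m = log_kernels.max()                        # log-mean-exp for numerical stability
    return m + np.log(np.mean(np.exp(log_kernels - m)))
```

Here `x` is a single query point and `train - x` broadcasts over the training set; in high-dimensional pixel space such a density is extremely peaked, so in practice one would evaluate it in a feature space, but the structure of the formula is the same.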

What does the Divergence tell us?

After training, we have the inference dataset $D_{\text{inf}} = \{y_1, \ldots, y_m\}$ with empirical density $\hat{p}_{\text{inf}}$. The Kullback–Leibler divergence [2] between the inference data distribution $\hat{p}_{\text{inf}}$ and the augmented training data distribution $\hat{p}_\theta$, given by
$$D_{\mathrm{KL}}\!\left(\hat{p}_{\text{inf}} \,\middle\|\, \hat{p}_\theta\right) = \int \hat{p}_{\text{inf}}(x) \log \frac{\hat{p}_{\text{inf}}(x)}{\hat{p}_\theta(x)}\, dx = \frac{1}{m} \sum_{j=1}^{m} \log \frac{\hat{p}_{\text{inf}}(y_j)}{\hat{p}_\theta(y_j)}$$
(using the fact that the Dirac delta functions simplify the integral to an evaluation of the integrand at the inference points $y_j$) gives us the following insights:

  • For out-of-distribution points $y_j$, i.e. points that are far from all the training data points, $\hat{p}_\theta(y_j)$ is close to zero, which pushes the divergence towards infinity, as expected. This means that we are using the trained model on a dataset that is not representative of the training data distribution and the augmentation transformation, which will result in poor performance.
  • The opposite is also true: a large divergence only occurs when there is at least one $y_j$ far enough from the training data.
  • The augmentation $T_\theta$ controls the spread of $q_\theta(x \mid x_i)$ around each training data point $x_i$ and in turn defines what is considered out-of-distribution: if the augmentation is aggressive, the spread of $q_\theta(x \mid x_i)$ will be large and the only out-of-distribution points will be the ones that are very far from the training data. If the augmentation is conservative, $q_\theta(x \mid x_i)$ will be concentrated around each $x_i$, and even points lying only slightly away from the training data will be out-of-distribution.
  • This analysis supports out-of-distribution detection methods [3] that are based on the direct relationship between the distance to in-distribution points and the probability of being out-of-distribution (a minimal numerical sketch of this idea follows below).
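As a rough sketch of how these observations could translate into a density-based out-of-distribution score (my own illustration, reusing the hypothetical `log_augmented_density` helper from the previous sketch; this is not the nearest-neighbor method of [3]), each inference point can be scored by its negative log augmented density, which is exactly the term that blows up the divergence:

```python
import numpy as np

def ood_scores(queries, train, sigma):
    """Score each inference point y_j by -log p_hat_theta(y_j).

    Points far from every training point (relative to the augmentation
    spread sigma) get a very large score: these are the points that push
    the KL divergence towards infinity.
    Assumes `log_augmented_density` from the previous sketch is in scope.
    """
    return np.array([-log_augmented_density(y, train, sigma) for y in queries])

# Toy usage with made-up 2-D data: the second query point is far from the training data.
rng = np.random.default_rng(0)
train = rng.normal(size=(500, 2))
queries = np.array([[0.1, -0.2],   # close to the training data -> small score
                    [8.0, 8.0]])   # far away -> much larger score
print(ood_scores(queries, train, sigma=0.5))
```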

Key Takeaways

  • Empirical vs. Augmented Distributions: In machine learning, the empirical distribution represents the observed data, while data augmentation transforms this into a richer, smoother distribution by spreading each data point into a local region. This process helps models generalize better by simulating new, plausible data points.
  • Why Data Augmentation Matters: Augmentation increases the diversity of the training set without requiring new data collection. By simulating variations (e.g., noise, transformations), it helps models become robust to real-world variability and reduces overfitting.
  • Mathematical Foundation: The empirical distribution is an average of Dirac delta functions centered at the data points. Augmentation replaces these with continuous densities (e.g., Gaussians), resulting in an augmented empirical distribution that lets us better understand the relationship between out-of-distribution data and augmentation.
  • Measuring Distribution Shift: The Kullback–Leibler (KL) divergence quantifies how much the inference (test) data distribution differs from the (augmented) training data distribution. Large divergence indicates out-of-distribution data, which can lead to poor model performance.
  • Practical Implications: The choice and strength of augmentation directly affect what is considered in-distribution. Aggressive augmentation broadens the model’s understanding, while conservative augmentation keeps it focused on the original data regions.

References

  1. Empirical distribution function (Wikipedia)
  2. Kullback–Leibler divergence (Wikipedia)
  3. Sun et al., "Out-of-Distribution Detection with Deep Nearest Neighbors" (ICML 2022)