Neural Networks as Statistical Models — Part 1
The conditional probability distribution induced by a given network and loss in supervised learning
How a network and a loss function induce a conditional probability distribution, and why this probabilistic view is beneficial.
An artificial neural network can be described as a function $f : \mathcal{X} \to \mathcal{Y}$ over some input set $\mathcal{X}$ with values in some output set $\mathcal{Y}$. In supervised learning, $\mathcal{Y}$ is different from $\mathcal{X}$, while in unsupervised learning they are the same. A network computes its output through many layers of activation units connected by weights (Goodfellow et al., 2016). Denote the set of all possible weights the network can take at any one time by $\mathcal{W}$. In large models, $w \in \mathcal{W}$ is a very large tuple. To reflect that for different $w$ we have different network realizations, we will adopt the notation $f_w : \mathcal{X} \to \mathcal{Y}$.
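To make this notation concrete, here is a minimal sketch in NumPy (my own illustration, not code from any particular library): a two-layer network written explicitly as a function of the weight tuple $w$ and the input $x$. The names `f_w`, `W1`, `b1`, and the layer sizes are made up for the example.

```python
import numpy as np

def f_w(w, x):
    """Evaluate the network realization f_w at input x.

    w = (W1, b1, W2, b2) is one point in the weight space W; changing w
    changes the function f_w, exactly as in the notation above.
    """
    W1, b1, W2, b2 = w
    h = np.tanh(x @ W1 + b1)   # hidden layer of activation units
    return h @ W2 + b2         # linear output layer (e.g., for regression)

# One particular weight tuple w, drawn at random purely for illustration.
rng = np.random.default_rng(0)
w = (rng.normal(size=(3, 16)), np.zeros(16),
     rng.normal(size=(16, 2)), np.zeros(2))

x = rng.normal(size=(5, 3))    # a batch of 5 inputs from X ⊂ R^3
print(f_w(w, x).shape)         # (5, 2): outputs in Y ⊂ R^2
```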
In this post, I show in detail how a neural network $f_w$ (together with an associated loss function $\ell$) induces a conditional probability distribution $p_w(y \mid x)$, and I briefly discuss the benefits of this view. In future posts, I extend this probabilistic view to other elements of supervised learning and explain why it is important for understanding deep learning.
1. The induced conditional probability distribution
To see how any neural network $f_w$ also defines a conditional probability density or mass function $p_w(y \mid x)$, suppose a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ is given. We can define the following probability density (or mass) function:

$$p_w(y \mid x) \;=\; \frac{\exp\!\big(-\ell(y, f_w(x))\big)}{Z(w, x)}, \qquad Z(w, x) \;=\; \int_{\mathcal{Y}} \exp\!\big(-\ell(y, f_w(x))\big)\, dy, \tag{1}$$

where $Z(w, x)$ is a normalizing constant (an integral in the continuous case, a sum in the discrete case) chosen so that $p_w(\cdot \mid x)$ integrates to 1 for any network realization $f_w$ and any input $x$.
Equation (1) above is the well-known Gibbs density, where $\ell(y, f_w(x))$ plays the role of the energy of the configuration $(x, y)$. For this to define a bona fide probability density (or mass) function, the normalizing constant $Z(w, x)$ must be finite for every $w$ and $x$. It is standard to assume that the loss $\ell$ satisfies:
- $\ell(y, \hat{y}) \ge 0$ for all $y, \hat{y} \in \mathcal{Y}$, with equality if $y = \hat{y}$.
- $\ell$ is integrable with respect to the (unknown) true conditional distribution $p^*(y \mid x)$ that generated the data, i.e. $\mathbb{E}_{y \sim p^*(\cdot \mid x)}\big[\ell(y, f_w(x))\big] < \infty$ for all $w$ and $x$. Here $\mathbb{E}_{y \sim p^*(\cdot \mid x)}\big[\ell(y, f_w(x))\big]$ is the expected loss (or risk) in decision-theoretic terms.
On their own, conditions 1–2 do not guarantee that $Z(w, x)$ is finite; one also needs a mild tail-growth condition on $\ell$ so that $\exp\!\big(-\ell(y, f_w(x))\big)$ is integrable in $y$.
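As a quick numerical sanity check of this normalization requirement, here is a sketch for a scalar output with made-up numbers ($\sigma = 0.7$, a fixed prediction standing in for $f_w(x)$) and the scaled squared-error loss of Example 1.1.1 below; the variable names are mine.

```python
import numpy as np
from scipy.integrate import quad

sigma = 0.7
yhat = 1.3                       # stands in for the network output f_w(x) at some fixed x

loss = lambda y: (y - yhat) ** 2 / (2 * sigma ** 2)

# Z(w, x) = integral of exp(-loss(y)) over y in R; finite, so Eq. (1) is well defined.
Z, _ = quad(lambda y: np.exp(-loss(y)), -np.inf, np.inf)
print(Z, np.sqrt(2 * np.pi) * sigma)    # both ≈ 1.7547

# The induced Gibbs density p_w(y | x) = exp(-loss(y)) / Z integrates to 1.
total, _ = quad(lambda y: np.exp(-loss(y)) / Z, -np.inf, np.inf)
print(total)                             # ≈ 1.0
```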
In practice, we cannot verify these conditions directly since $p^*$ is unknown. However, many commonly used losses do satisfy such conditions. For example, squared error loss (Legendre, 1805), absolute error loss (Koenker & Bassett, 1978), and Huber loss (Huber, 1964) all lead to well-defined conditional distributions under reasonable assumptions on the output space $\mathcal{Y}$. Let's work out what happens in these important examples.
1.1 Examples
In supervised learning, one is given a data set $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$ and the objective is to construct a neural network $f_w$ to predict a future value $y$ from an input $x$ (Friedman, 1994). Based on the nature of the observables $(x, y)$, one chooses an appropriate network architecture and a loss function deemed suitable for the task. In the following examples, we derive the conditional probability densities (or mass functions) associated with a given network and loss function.
Example 1.1.1: Squared error loss
When $y$ is on a continuous scale (e.g., stock price, air temperature, etc.), modelled as a subset $\mathcal{Y} \subseteq \mathbb{R}^m$, we could use the (scaled) squared Euclidean norm as a loss function

$$\ell(y, \hat{y}) \;=\; \frac{1}{2\sigma^2}\,\lVert y - \hat{y} \rVert_2^2 \;=\; \frac{1}{2\sigma^2} \sum_{k=1}^{m} \big(y_k - \hat{y}_k\big)^2,$$

with fixed scale $\sigma > 0$, and $y_k$ is the $k$-th entry of $y$. The neural network $f_w$ and the loss $\ell$ induce a parametric conditional probability density

$$p_w(y \mid x) \;=\; \frac{1}{(2\pi)^{m/2} \lvert \Sigma \rvert^{1/2}} \exp\!\left( -\tfrac{1}{2} \big(y - f_w(x)\big)^{\top} \Sigma^{-1} \big(y - f_w(x)\big) \right), \qquad \Sigma = \sigma^2 I_m,$$

which one can immediately recognize as the $m$-dimensional normal distribution with mean $f_w(x)$ and covariance matrix $\Sigma = \sigma^2 I_m$, where $v^{\top}$ is the transpose of the column vector $v$, and $\lvert \Sigma \rvert$ is the determinant of $\Sigma$.
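As a quick check of this identification, here is a sketch with made-up numbers ($m = 3$, $\sigma = 0.5$, random vectors standing in for $f_w(x)$ and $y$) comparing the Gibbs construction against the Gaussian density from SciPy.

```python
import numpy as np
from scipy.stats import multivariate_normal

m, sigma = 3, 0.5
rng = np.random.default_rng(1)
fx = rng.normal(size=m)              # stands in for the network output f_w(x)
y = rng.normal(size=m)               # a candidate label

loss = np.sum((y - fx) ** 2) / (2 * sigma ** 2)
Z = (2 * np.pi * sigma ** 2) ** (m / 2)          # normalizing constant of Eq. (1)
gibbs = np.exp(-loss) / Z

gauss = multivariate_normal.pdf(y, mean=fx, cov=sigma ** 2 * np.eye(m))
print(gibbs, gauss)                  # identical up to floating-point error
```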
Example 1.1.2: Absolute error loss
Another loss function used in practice is the (scaled) $\ell_1$ norm

$$\ell(y, \hat{y}) \;=\; \frac{1}{b}\,\lVert y - \hat{y} \rVert_1 \;=\; \frac{1}{b} \sum_{k=1}^{m} \lvert y_k - \hat{y}_k \rvert, \qquad b > 0.$$

The induced conditional probability density is

$$p_w(y \mid x) \;=\; \prod_{k=1}^{m} \frac{1}{2b} \exp\!\left( -\frac{\lvert y_k - f_w(x)_k \rvert}{b} \right),$$

which is the product of $m$ independent Laplace probability densities with scale parameter $b$.
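The same kind of check works here; the numbers below ($m = 3$, $b = 0.8$) are again made up for illustration.

```python
import numpy as np
from scipy.stats import laplace

m, b = 3, 0.8
rng = np.random.default_rng(2)
fx, y = rng.normal(size=m), rng.normal(size=m)   # f_w(x) and a candidate label y

loss = np.sum(np.abs(y - fx)) / b
Z = (2 * b) ** m                                  # normalizing constant of Eq. (1)
gibbs = np.exp(-loss) / Z

lap = np.prod(laplace.pdf(y, loc=fx, scale=b))    # product of m Laplace densities
print(gibbs, lap)                                  # agree up to rounding
```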
Example 1.1.3: Cross entropy loss
When $y$ is on a categorical scale (e.g., dog vs cat vs bird; happy vs sad; a digit in the set $\{0, 1, \dots, 9\}$), one typically uses a network with a softmax output layer whose number of output units matches the cardinality of $\mathcal{Y}$, together with the cross-entropy loss

$$\ell\big(y, f_w(x)\big) \;=\; -\sum_{c \in \mathcal{Y}} \delta_{y c} \log f_w(x)_c \;=\; -\log f_w(x)_y,$$

where $\delta_{y c}$ is the Kronecker delta (equal to 1 when $y = c$ and 0 otherwise), and $f_w(x)_c$ is the $c$-th component of $f_w(x)$. Because the softmax outputs form a probability vector, $\sum_{c \in \mathcal{Y}} f_w(x)_c = 1$, the normalizing constant is $Z(w, x) = \sum_{c \in \mathcal{Y}} \exp\!\big(\log f_w(x)_c\big) = 1$, and the associated conditional probability mass function is in fact explicit:

$$p_w(y \mid x) \;=\; \exp\!\big(\log f_w(x)_y\big) \;=\; f_w(x)_y.$$

When the cardinality of $\mathcal{Y}$ is 2, the cross-entropy reduces to the binary cross-entropy.
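A small sketch with made-up logits makes the $Z(w, x) = 1$ claim tangible: the Gibbs construction simply returns the softmax output itself.

```python
import numpy as np

logits = np.array([2.0, -1.0, 0.5])              # stands in for the last-layer pre-activations
probs = np.exp(logits) / np.exp(logits).sum()    # softmax output f_w(x); sums to 1

losses = -np.log(probs)                          # cross-entropy loss for each possible label y
Z = np.exp(-losses).sum()                        # normalizing constant over the finite set Y
p = np.exp(-losses) / Z                          # induced conditional pmf p_w(y | x)

print(Z)                                         # 1.0 (up to rounding)
print(np.allclose(p, probs))                     # True: p_w(y | x) = f_w(x)_y
```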
2. Why should we care?
Now that we have a good handle on the conditional distribution $p_w(y \mid x)$ induced by our choice of loss function and the nature of the output layer of the network, we can apply the tools of frequentist statistics, such as maximum likelihood, hypothesis testing, and asymptotic theory, to analyze supervised learning methods.
For instance, under the assumption that the pairs $(x_i, y_i)$, $i = 1, \dots, n$, are independent, we can write the log-likelihood of the dataset $\mathcal{D}$ (under our model) for different parameters $w$ as

$$L_n(w) \;=\; \sum_{i=1}^{n} \log p_w(y_i \mid x_i) \;=\; -\sum_{i=1}^{n} \Big[ \ell\big(y_i, f_w(x_i)\big) + \log Z(w, x_i) \Big], \tag{3}$$

yielding the following maximum likelihood estimate that many readers will recognize from undergraduate statistics:

$$\hat{w}_n \;=\; \operatorname*{arg\,max}_{w \in \mathcal{W}} \; L_n(w).$$
In the case that $Z(w, x)$ is independent of $w$ (which is the case in the squared-error and absolute-error examples above, with fixed variance/scale), maximizing the likelihood in equation (3) above is equivalent to minimizing the empirical loss

$$\hat{R}_n(w) \;=\; \frac{1}{n} \sum_{i=1}^{n} \ell\big(y_i, f_w(x_i)\big).$$

This equivalence is a result of the identity $\log p_w(y \mid x) = -\ell\big(y, f_w(x)\big) - \log Z(w, x)$, which makes clear where one requires that $Z(w, x)$ is independent of $w$: the $\log Z$ terms then contribute only an additive constant that does not affect the maximizer.
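Here is a sketch of that equivalence on a toy one-parameter "network" $f_w(x) = w \cdot x$ with synthetic data (all numbers made up): the log-likelihood under the induced Gaussian model and the negative empirical squared-error loss differ by a constant in $w$, so they share the same maximizer.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(scale=0.5, size=50)   # toy data; true slope is 2

sigma = 0.5
Z = np.sqrt(2 * np.pi) * sigma                 # normalizing constant, independent of w

def log_likelihood(w):
    loss = (y - w * x) ** 2 / (2 * sigma ** 2)
    return np.sum(-loss - np.log(Z))           # sum_i log p_w(y_i | x_i), as in Eq. (3)

def neg_empirical_loss(w):
    return -np.sum((y - w * x) ** 2 / (2 * sigma ** 2))

ws = np.linspace(0.0, 4.0, 2001)
print(ws[np.argmax([log_likelihood(w) for w in ws])])      # ≈ 2.0
print(ws[np.argmax([neg_empirical_loss(w) for w in ws])])  # the same maximizer
```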
One might argue that we did not gain much by characterizing the conditional probability density (or mass function) associated with a given loss and network. This is not exactly true. First, making the nature of the assumed noise in the model explicit provides us with more information about our model and suggests ways to change it. For instance, in Example 1.1.1 above, one could choose a non-diagonal covariance matrix $\Sigma$ to represent known correlations in the noise. Moreover, one could compute the observed errors (the residuals $y_i - f_{\hat{w}_n}(x_i)$) and check whether they conform to the assumed noise structure, a standard statistical technique for assessing model fit.
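As an illustration of that residual check, here is a sketch on synthetic data (the linear "network", the noise level, and the use of the Shapiro-Wilk test are choices made for this example, not prescriptions): if the squared-error/Gaussian assumption is adequate, the residuals should look like draws from $N(0, \sigma^2)$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.normal(size=200)
y = 1.5 * x + rng.normal(scale=0.3, size=200)   # synthetic data with Gaussian noise

w_hat = np.sum(x * y) / np.sum(x * x)           # least-squares fit of f_w(x) = w * x
residuals = y - w_hat * x

# Shapiro-Wilk normality test: a large p-value means no evidence against
# the assumed Gaussian noise structure.
stat, pvalue = stats.shapiro(residuals)
print(w_hat, pvalue)
```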
Second, one can apply tools of information theory to rigorously characterize what it means to be surprised when making predictions using the learned model. Using the well-known notion of Shannon surprise, the average excess Shannon surprise incurred when using the learned model $p_w(y \mid x)$ to predict $y$ given $x$, when the true (unknown) conditional distribution is $p^*(y \mid x)$, is defined as

$$S(w) \;:=\; \mathbb{E}_{x} \, \mathbb{E}_{y \sim p^*(\cdot \mid x)}\!\left[ \log \frac{p^*(y \mid x)}{p_w(y \mid x)} \right] \;=\; \mathbb{E}_{x} \Big[ D_{\mathrm{KL}}\big( p^*(\cdot \mid x) \,\big\|\, p_w(\cdot \mid x) \big) \Big],$$

where $D_{\mathrm{KL}}$ is the Kullback-Leibler divergence (Kullback & Leibler, 1951), which one can rearrange as

$$S(w) \;=\; H\big(p^*, p_w\big) \;-\; H\big(Y \mid X\big), \tag{6}$$

where

$$H\big(p^*, p_w\big) \;:=\; \mathbb{E}_{x} \, \mathbb{E}_{y \sim p^*(\cdot \mid x)}\big[ -\log p_w(y \mid x) \big]$$

is the average surprise (the cross-entropy) when using $p_w$ to predict samples drawn from $p^*$, and

$$H\big(Y \mid X\big) \;:=\; \mathbb{E}_{x} \, \mathbb{E}_{y \sim p^*(\cdot \mid x)}\big[ -\log p^*(y \mid x) \big]$$

is the conditional entropy of $y$ given $x$: the minimum achievable average surprise when predicting $y$ from $x$ using the true conditional $p^*$.
Equation (6) above is one important reason why machine learning minimizes the empirical cross-entropy, which is a Monte Carlo estimate of $H(p^*, p_w)$ when the samples are independent. Since the conditional entropy $H(Y \mid X)$ does not depend on $w$, minimizing $H(p^*, p_w)$ is equivalent to minimizing the KL divergence from $p^*$ to $p_w$. Effectively, if our model is any good, $p_w$ should stand in for $p^*$ when making decisions related to our observables $x$ and $y$.
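The decomposition in equation (6) is easy to verify numerically. The sketch below builds a small made-up discrete problem (4 inputs, 3 labels, random conditional distributions) and checks that the cross-entropy equals the KL divergence plus the conditional entropy.

```python
import numpy as np

rng = np.random.default_rng(5)
n_x, n_y = 4, 3
p_x = np.full(n_x, 1.0 / n_x)                          # marginal distribution over inputs
p_true = rng.dirichlet(np.ones(n_y), size=n_x)         # true p*(y | x), one row per x
p_model = rng.dirichlet(np.ones(n_y), size=n_x)        # model p_w(y | x)

cross_entropy = -np.sum(p_x[:, None] * p_true * np.log(p_model))   # H(p*, p_w)
cond_entropy = -np.sum(p_x[:, None] * p_true * np.log(p_true))     # H(Y | X)
kl = np.sum(p_x[:, None] * p_true * np.log(p_true / p_model))      # S(w)

print(np.isclose(cross_entropy, kl + cond_entropy))    # True: Eq. (6)
```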
Another reason the probabilistic view is valuable is that it provides natural criteria for model selection and for comparing different training algorithms. It is standard practice to split the data into training and validation sets: one searches for an optimal parameter $\hat{w}$ on the training set, trying different training regimes or architectures, and selects the model that scores best on the validation set. Probability and statistical analysis give us principled ways to think about how to split the data, why splitting is needed, and how to interpret the results.
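A minimal sketch of that procedure on synthetic data, where (purely for illustration) the candidate "architectures" are polynomial fits of different degrees and the validation score is the average log-likelihood under the induced Gaussian model:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(-2, 2, size=200)
y = np.sin(2 * x) + rng.normal(scale=0.2, size=200)

idx = rng.permutation(200)
train, valid = idx[:150], idx[150:]           # random train/validation split

def fit_poly(degree):                         # candidate models stand in for architectures
    return np.polyfit(x[train], y[train], degree)

def valid_log_lik(coefs, sigma=0.2):          # average log p_w(y | x) on held-out data
    resid = y[valid] - np.polyval(coefs, x[valid])
    return np.mean(-resid ** 2 / (2 * sigma ** 2) - 0.5 * np.log(2 * np.pi * sigma ** 2))

scores = {d: valid_log_lik(fit_poly(d)) for d in (1, 3, 5, 9)}
print(max(scores, key=scores.get))            # degree with the best validation score
```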
If we want to build AI systems that are not just powerful but trustworthy, we need learning machines with provable consistency: systems that converge to the truth, or as close to it as the model allows, as evidence accumulates. Without such guarantees, the learning machine cannot be fully trusted. The probabilistic view is not merely convenient; it may be essential for understanding how to build machines we can trust.
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
- Legendre, A.-M. (1805). Nouvelles méthodes pour la détermination des orbites des comètes. Courcier.
- Koenker, R. W., & Bassett, G., Jr. (1978). Regression Quantiles. Econometrica, 46(1), 33–50.
- Huber, P. J. (1964). Robust Estimation of a Location Parameter. The Annals of Mathematical Statistics, 35(1), 73–101. https://doi.org/10.1214/aoms/1177703732
- Friedman, J. H. (1994). An Overview of Predictive Learning and Function Approximation. In V. Cherkassky, J. H. Friedman, & H. Wechsler (Eds.), From Statistics to Neural Networks (pp. 1–61). Springer Berlin Heidelberg.
- Kullback, S., & Leibler, R. A. (1951). On Information and Sufficiency. The Annals of Mathematical Statistics, 22(1), 79–86. https://doi.org/10.1214/aoms/1177729694