Neural Networks as Statistical Models — Part 1

The conditional probability distribution induced by a given network and loss in supervised learning

How a network and a loss function induce a conditional probability distribution, and why this probabilistic view is beneficial.

An artificial neural network can be described as a function $f(x)$ over some input set $\mathcal{X}$ with values in some output set $\mathcal{Y}$. In supervised learning, $\mathcal{X}$ is typically different from $\mathcal{Y}$, while in unsupervised learning they coincide. A network computes its output through many layers of activation units connected by weights (Goodfellow et al., 2016). Denote the set of all possible weights the network can take at any one time by $\mathcal{W}$. In large models, $w\in\mathcal{W}$ is a very large tuple. To reflect that different $w\in\mathcal{W}$ give different network realizations $f(x)$, we adopt the notation $f(x|w)$.
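To make the notation concrete, here is a minimal NumPy sketch of a network viewed as a parameterized function $f(x|w)$; the two-layer architecture, the tanh activation, and the parameter layout are illustrative assumptions, not anything prescribed above.

```python
import numpy as np

def mlp(x, w):
    """A tiny two-layer network f(x|w): the weights w are a tuple
    (W1, b1, W2, b2), making explicit that different w give different
    realizations of the same architecture."""
    W1, b1, W2, b2 = w
    h = np.tanh(x @ W1 + b1)   # hidden layer
    return h @ W2 + b2         # output layer (k outputs)

rng = np.random.default_rng(0)
w = (rng.normal(size=(3, 8)), np.zeros(8),
     rng.normal(size=(8, 2)), np.zeros(2))
x = rng.normal(size=(5, 3))    # a batch of 5 inputs from X, here a subset of R^3
print(mlp(x, w).shape)         # (5, 2): outputs in Y, here a subset of R^2
```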

In this post, I show in detail how a neural network $f(\cdot|w)$ (and an associated loss function $\ell$) induces a conditional probability distribution and briefly discuss the benefits of this view. In future posts, I extend this probabilistic view to other elements of supervised learning and show in detail why it is important for understanding deep learning.

1. The induced conditional probability distribution

To see how any neural network $f(\cdot|w)$ also defines a conditional probability density or mass function $p(y|x,w)$, suppose a loss function $\ell(y,f(x|w))$ is given. We can define the following probability density (or mass) function:

$$p(y|x,w) = \frac{e^{-\ell(y,f(x|w))}}{Z} \tag{1}$$

where $Z$ is a normalizing constant such that $p(y|x,w)$ integrates to 1 for any network realization $f(x|w)$.

Equation (1) above is the well-known Gibbs density, where $\ell$ plays the role of the energy of the configuration $y$. For this to define a bona fide probability density (or mass) function, the normalizing constant
$$Z(w) = \int_{\mathcal{Y}} e^{-\ell(y,f(x|w))}\,dy$$
must be finite for every $x$ and $w\in\mathcal{W}$. It is standard to assume that the loss satisfies:

  1. $\ell(y, f(x|w)) \geq 0$ for all $y, f(x|w)\in\mathcal{Y}$, with equality if $y=f(x|w)$.
  2. $\ell$ is integrable with respect to the (unknown) true distribution $q(y|x)q(x)$ that generated the data, i.e.
  $$L(w) = \int_{\mathcal{X}\times\mathcal{Y}} \ell(y,f(x|w))\,q(y|x)\,q(x)\,dy\,dx < \infty \tag{2}$$
  for all $w$. Here $L(w)$ is the expected loss (or risk) in decision-theoretic terms.

On their own, conditions 1–2 do not guarantee that $Z(w)$ is finite; one also needs a mild tail-growth condition on $\ell$ so that $e^{-\ell(y,f(x|w))}$ is integrable in $y$.

In practice, we cannot verify these conditions directly since $q(y|x)$ is unknown. However, many commonly used losses do satisfy such conditions. For example, squared error loss (Legendre, 1805), absolute error loss (Koenker & Bassett, 1978), and Huber loss (Huber, 1964) all lead to well-defined conditional distributions under reasonable assumptions on $\mathcal{Y}$. Let’s work out what happens in these important examples.
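As a quick numerical sanity check of the construction in equation (1), the sketch below builds the Gibbs density from the Huber loss for a scalar output and verifies that it integrates to 1; the particular prediction value and the use of SciPy quadrature are illustrative assumptions.

```python
import numpy as np
from scipy.integrate import quad

def huber(r, delta=1.0):
    """Huber loss of a residual r (Huber, 1964)."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta))

f_xw = 0.7  # stands in for the network's prediction f(x|w) at some fixed x, w

# Z(w) = integral of exp(-loss(y, f(x|w))) over Y = R
Z, _ = quad(lambda y: np.exp(-huber(y - f_xw)), -np.inf, np.inf)

# the induced Gibbs density p(y|x,w) = exp(-loss)/Z integrates to 1
total, _ = quad(lambda y: np.exp(-huber(y - f_xw)) / Z, -np.inf, np.inf)
print(Z, total)  # total is approximately 1.0
```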

1.1 Examples

In supervised learning, one is given a data set $\{(x_1, y_1),\cdots, (x_n, y_n)\}$ and the objective is to construct a neural network that predicts a future value $y_{n+1}$ from an input $x_{n+1}$ (Friedman, 1994). Based on the nature of the observables $x, y$, one constructs an appropriate neural network and chooses a loss function $\ell$ deemed appropriate. In the following examples, we derive the conditional probability densities (or mass functions) associated with a given network and loss function.

Example 1.1.1: Squared error loss

When $y$ is on a continuous scale (e.g., stock price, air temperature, etc.) modelled as a subset of $\mathbb{R}^k$, we could use the squared Euclidean norm $|\cdot|^2_2$ as a loss function.

$$\begin{aligned} \ell_{\text{SE}}(y, f(x|w)) &:= |y-f(x|w)|_2^2\\ &= \sum_{i=1}^k(y_i-f_i(x|w))^2 \end{aligned}$$
with $y=(y_1, \cdots, y_k)$, and $f_i(x|w)$ the $i$-th entry of $f(x|w)$. The neural network $f(\cdot|w)$ and the loss $\ell_{\text{SE}}(y,f(x|w))$ induce a parametric conditional probability density $\frac{1}{Z}e^{-|y-f(x|w)|_2^2}$, which one can immediately recognize as the $k$-dimensional normal distribution
$$p(y|x,w) = \frac{1}{(2\pi)^{k/2}|\Sigma|^{1/2}}e^{-\frac{1}{2}(y-f(x|w))^T\Sigma^{-1} (y-f(x|w))}$$
with mean $f(x|w)$ and covariance $\Sigma=\frac{1}{2}\mathbb{I}_{k\times k}$, where $(y-f(x|w))^T$ is the transpose of the column vector $(y-f(x|w))$, and $|\Sigma|$ is the determinant of $\Sigma$.
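A short numerical check of this correspondence, with a random vector standing in for $f(x|w)$: the negative Gaussian log-density with covariance $\Sigma=\frac{1}{2}\mathbb{I}$ differs from $\ell_{\text{SE}}$ only by a constant (the $\log Z$ term).

```python
import numpy as np

rng = np.random.default_rng(1)
k = 3
y = rng.normal(size=k)
f_xw = rng.normal(size=k)            # stands in for the network output f(x|w)

loss = np.sum((y - f_xw) ** 2)       # squared error loss

# Gaussian log-density with mean f(x|w) and covariance Sigma = (1/2) I
Sigma = 0.5 * np.eye(k)
diff = y - f_xw
log_p = (-0.5 * diff @ np.linalg.inv(Sigma) @ diff
         - 0.5 * (k * np.log(2 * np.pi) + np.log(np.linalg.det(Sigma))))

# -log p(y|x,w) = loss + log Z, with Z independent of w
print(-log_p - loss)                 # a constant: (k/2) * log(pi)
```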

Example 1.1.2: Absolute error loss

Another loss function used in practice is the $L^1$ norm.

$$\begin{aligned} \ell_{\text{AE}}(y, f(x|w)) &:= L^1(y, f(x|w))\\ &=\sum_{i=1}^k|y_i-f_i(x|w)| \end{aligned}$$

The induced conditional probability density is

$$\begin{aligned} p(y|x,w) &= \frac{1}{Z}e^{-\sum_{i=1}^k|y_i-f_i(x|w)|}\\ &= \frac{1}{Z}\prod_{i=1}^k e^{-|y_i-f_i(x|w)|} \end{aligned}$$
which is the product of $k$ independent Laplace probability densities with location $f_i(x|w)$ and scale parameter $1$.
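The same kind of check for the absolute error loss, again with a random vector standing in for $f(x|w)$: the negative log-density of $k$ independent Laplace distributions with location $f_i(x|w)$ and scale 1 equals $\ell_{\text{AE}}$ plus the constant $\log Z = k\log 2$.

```python
import numpy as np
from scipy.stats import laplace

rng = np.random.default_rng(2)
k = 4
y = rng.normal(size=k)
f_xw = rng.normal(size=k)                  # stands in for f(x|w)

loss = np.sum(np.abs(y - f_xw))            # absolute error loss

# sum of k independent Laplace log-densities, location f_i(x|w), scale 1
log_p = laplace.logpdf(y, loc=f_xw, scale=1.0).sum()

# again -log p = loss + log Z, with Z = 2^k independent of w
print(-log_p - loss)                       # k * log(2)
```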

Example 1.1.3: Cross entropy loss

When $y$ is on a categorical scale (e.g., dog vs cat vs bird; happy vs sad; a number in the set $\{0, \cdots, 9\}$), one typically uses a network with a number of output units matching the cardinality of $\mathcal{Y}$ and the cross-entropy loss

$$\ell_{\text{CE}}(y, f(x|w)) = -\sum_{i\in \mathcal{Y}} \big\{\delta[y=i]\log{\text{softmax}[f(x|w)]}_i\big\}$$
where $\delta[y=i]$ is the Kronecker delta, and $\text{softmax}[f(x|w)]_i$ is the $i$-th component of $\text{softmax}[f(x|w)]$. The associated conditional probability mass function is in fact explicit:

$$P(y=i|x,w) = \text{softmax}[f(x|w)]_i, \quad i\in\mathcal{Y}$$
When the cardinality of $\mathcal{Y}$ is $2$, the cross-entropy reduces to the binary cross-entropy.
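A small sketch making this explicit for a three-class problem; the logits are arbitrary numbers standing in for $f(x|w)$.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, -1.0, 0.5])      # f(x|w) for a 3-class output
y = 1                                    # observed class label

probs = softmax(logits)                  # P(y = i | x, w) for each class i
ce_loss = -np.log(probs[y])              # cross-entropy picks out -log of the true class probability

print(probs.sum(), ce_loss)              # probs sum to 1; loss = -log P(y=1|x,w)
```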

2. Why should we care?

Now that we have a good handle on the conditional distribution induced by our choice of loss function and the nature of the network's output layer, we can apply the tools of frequentist statistics, such as maximum likelihood, hypothesis testing, and asymptotic theory, to analyze supervised learning methods.

For instance, under the assumption that any pair $(x_i,y_i), (x_j,y_j)$, $i\neq j$ is independent, we can write the log-likelihood of the dataset (under our model) for different parameters $w$ as

$$\log{\prod_{i=1}^n p(y_i|x_i,w)} = \sum_{i=1}^n \log{p(y_i|x_i,w)} \tag{3}$$

yielding the following maximum likelihood estimate that many readers will recognize from undergraduate statistics.

$$\hat{w}_{\text{mle}} = \underset{w\in \mathcal{W}}{\text{argmax}}\bigg\{\sum_{i=1}^n \log{p(y_i|x_i,w)}\bigg\}$$

When $Z(w)$ is independent of $w\in\mathcal{W}$ (as in the squared-error and absolute-error examples above, with fixed variance/scale), maximizing the log-likelihood in equation (3) is equivalent to minimizing the empirical loss

$$L_n(w) := \frac{1}{n}\sum_{i=1}^n \ell(y_i, f(x_i|w)) \tag{4}$$

This equivalence is a result of the identity $\log{p(y|x, w)} = -\ell(y,f(x|w))-\log{Z(w)}$, which makes clear where one requires that $Z(w)$ is independent of $w$.
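The sketch below illustrates this equivalence on a toy one-parameter model $f(x|w)=wx$ with squared-error loss (so $p(y|x,w)$ is the Gaussian of Example 1.1.1): scanning a grid of $w$ values, the empirical-loss minimizer and the log-likelihood maximizer coincide. The synthetic data and the grid search are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=20)
y = 2.0 * x + rng.normal(scale=0.5, size=20)

# a one-parameter "network" f(x|w) = w*x with squared-error loss,
# so p(y|x,w) is Gaussian with mean w*x and covariance 1/2 (Example 1.1.1)
ws = np.linspace(0.0, 4.0, 401)
emp_loss = np.array([np.mean((y - w * x) ** 2) for w in ws])          # L_n(w)
log_lik = np.array([np.sum(-(y - w * x) ** 2 - 0.5 * np.log(np.pi))   # sum of log p(y_i|x_i,w)
                    for w in ws])

# because Z(w) is constant in w, the two criteria pick the same parameter
print(ws[np.argmin(emp_loss)], ws[np.argmax(log_lik)])
```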

One might argue that we did not gain much by characterizing the conditional probability density (or mass) associated with a given loss and network. This is not exactly true. First, making the assumed noise in the model explicit gives us more information about our model and ways to change it. For instance, in Example 1.1.1 above, one could choose a non-diagonal covariance matrix $\Sigma$ to represent known correlations in the noise. Moreover, one could compute the observed errors and check whether they conform to the assumed noise structure, a standard statistical technique for assessing model fit.

Second, one can apply tools of information theory to rigorously characterize what it means to be surprised when making predictions with the learned model. Using the well-known notion of Shannon surprise, the average Shannon surprise when using the learned model $p(y|x, \hat{w}_{\text{mle}})$ to predict $y$ given $x$, when the true unknown conditional distribution is $q(y|x)$, is defined as:

$$D(q,p(\cdot|\cdot,\hat{w}_{\text{mle}})) := \int_{\mathcal{X}\times\mathcal{Y}} q(y|x)q(x)\log\frac{q(y|x)}{p(y|x,\hat{w}_{\text{mle}})}\,dy\,dx \tag{5}$$

$D(q,p)$ is the Kullback-Leibler divergence (Kullback & Leibler, 1951), which one can rearrange as:

$$D(q,p(\cdot|\cdot,w)) = H(q,p(\cdot|\cdot,w)) - H(q(Y|X)) \tag{6}$$
where

$$H(q,p(\cdot|\cdot,w)):=-\int_{\mathcal{X}\times\mathcal{Y}}q(y|x)q(x)\log{p(y|x,w)}\,dx\,dy,$$
is the average surprise when using $p$ to predict samples drawn from $q$, and

$$H(q(Y|X)) := -\int_{\mathcal{X}\times\mathcal{Y}}q(y|x)q(x)\log{q(y|x)}\,dx\,dy,$$
is the conditional entropy of $Y$ given $X$, the minimum achievable average surprise when predicting $Y$ from $X$ using the true conditional $q$.

Equation (6) above is one important reason why machine learning minimizes the empirical cross entropy, which is a Monte Carlo estimate of $H(q,p)$ when the samples are independent. Since $H(q(Y|X))$ does not depend on the model, minimizing $H(q,p)$ is equivalent to minimizing the KL divergence from $q$ to $p$. Effectively, if our model is any good, it should stand in for $q$ when making decisions related to our observables $x$ and $y$.
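The decomposition in equation (6) is easy to verify numerically on a toy discrete example; the distributions $q$ and $p$ below are arbitrary choices for a single fixed $x$.

```python
import numpy as np

# a toy discrete Y with true conditional q and model p (for one fixed x)
q = np.array([0.6, 0.3, 0.1])
p = np.array([0.5, 0.4, 0.1])

cross_entropy = -np.sum(q * np.log(p))      # H(q, p): average surprise under the model
entropy = -np.sum(q * np.log(q))            # H(q): minimum achievable average surprise
kl = np.sum(q * np.log(q / p))              # D(q, p)

print(np.isclose(kl, cross_entropy - entropy))   # True: the decomposition in equation (6)
```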

Another reason the probabilistic view is valuable is that it provides natural criteria for model selection and for comparing different training algorithms. It is standard practice to split data into training and validation sets: one searches for an optimal parameter $w_*$ on the training set, trying different training regimes or architectures, and selects the model that scores best on the validation set. Probability and statistical analysis give us principled ways to think about how to split the data, why splitting is needed, and how to interpret the results.

If we want to build AI systems that are not just powerful but trustworthy, we need learning machines with provable consistency: systems that converge to the truth, or as close to it as the model allows, as evidence accumulates. Without such guarantees, the learning machine cannot be fully trusted. The probabilistic view is not merely convenient; it may be essential for understanding how to build machines we can trust.

  1. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
  2. Legendre, A.-M. (1805). Nouvelles méthodes pour la détermination des orbites des comètes. Courcier.
  3. Koenker, R., & Bassett, G., Jr. (1978). Regression Quantiles. Econometrica, 46(1), 33–50.
  4. Huber, P. J. (1964). Robust Estimation of a Location Parameter. The Annals of Mathematical Statistics, 35(1), 73–101. https://doi.org/10.1214/aoms/1177703732
  5. Friedman, J. H. (1994). An Overview of Predictive Learning and Function Approximation. In V. Cherkassky, J. H. Friedman, & H. Wechsler (Eds.), From Statistics to Neural Networks (pp. 1–61). Springer Berlin Heidelberg.
  6. Kullback, S., & Leibler, R. A. (1951). On Information and Sufficiency. The Annals of Mathematical Statistics, 22(1), 79–86. https://doi.org/10.1214/aoms/1177729694