Neural Networks as Statistical Models — Part 1

The conditional probability distribution induced by a given network and loss in supervised learning

How a network and a loss function induce a conditional probability distribution, and why this probabilistic view is beneficial.

An artificial neural network can be described as a function $f(x)$ over some input set $\mathcal{X}$ with values in some output set $\mathcal{Y}$. In supervised learning, $\mathcal{X}$ is typically different from $\mathcal{Y}$, while in unsupervised learning they coincide. A network computes its output through many layers of activation units connected by weights (Goodfellow et al., 2016). Denote the set of all possible weights the network can take at any one time by $\mathcal{W}$. In large models, $w\in\mathcal{W}$ is a very large tuple. To reflect that different $w\in\mathcal{W}$ give different network realizations $f(x)$, we adopt the notation $f(x|w)$.
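To make the notation concrete, here is a minimal NumPy sketch of a network viewed as a parameterized function $f(x|w)$; the two-layer architecture, the tanh activation, and the parameter layout are illustrative assumptions, not anything prescribed above.

```python
import numpy as np

def mlp(x, w):
    """A tiny two-layer network f(x|w): the weights w are a tuple
    (W1, b1, W2, b2), making explicit that different w give different
    realizations of the same architecture."""
    W1, b1, W2, b2 = w
    h = np.tanh(x @ W1 + b1)   # hidden layer
    return h @ W2 + b2         # output layer (k outputs)

rng = np.random.default_rng(0)
w = (rng.normal(size=(3, 8)), np.zeros(8),
     rng.normal(size=(8, 2)), np.zeros(2))
x = rng.normal(size=(5, 3))    # a batch of 5 inputs from X, here a subset of R^3
print(mlp(x, w).shape)         # (5, 2): outputs in Y, here a subset of R^2
```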

In this post, I show in detail how a neural network $f(\cdot|w)$ (and an associated loss function $\ell$) induces a conditional probability distribution and briefly discuss the benefits of this view. In future posts, I extend this probabilistic view to other elements of supervised learning and show in detail why it is important for understanding deep learning.

1. The induced conditional probability distribution

To see how any neural network $f(\cdot|w)$ also defines a conditional probability density or mass function $p(y|x,w)$, suppose a loss function $\ell(y,f(x|w))$ is given. We can define the following probability density (or mass) function:

$$p(y|x,w) = \frac{e^{-\ell(y,f(x|w))}}{Z} \tag{1}$$

where $Z$ is a normalizing constant such that $p(y|x,w)$ integrates to 1 for any network realization $f(x|w)$.

Equation (1) above is the well-known Gibbs density, where $\ell$ plays the role of the energy of the configuration $y$. For this to define a bona fide probability density (or mass) function, the normalizing constant
$$Z(w) = \int_{\mathcal{Y}} e^{-\ell(y,f(x|w))}\,dy$$
must be finite for every $x$ and $w\in\mathcal{W}$. It is standard to assume that the loss satisfies:

  1. $\ell(y, f(x|w)) \geq 0$ for all $y, f(x|w)\in\mathcal{Y}$, with equality if $y=f(x|w)$.
  2. $\ell$ is integrable with respect to the (unknown) true distribution $q(y|x)q(x)$ that generated the data, i.e.
  $$L(w) = \int_{\mathcal{X}\times\mathcal{Y}} \ell(y,f(x|w))\,q(y|x)\,q(x)\,dy\,dx < \infty \tag{2}$$
  for all $w$. Here $L(w)$ is the expected loss (or risk) in decision-theoretic terms.

On their own, conditions 1–2 do not guarantee that $Z(w)$ is finite; one also needs a mild tail-growth condition on $\ell$ so that $e^{-\ell(y,f(x|w))}$ is integrable in $y$.

In practice, we cannot verify these conditions directly since $q(y|x)$ is unknown. However, many commonly used losses do satisfy such conditions. For example, squared error loss (Legendre, 1805), absolute error loss (Koenker & Bassett, 1978), and Huber loss (Huber, 1964) all lead to well-defined conditional distributions under reasonable assumptions on $\mathcal{Y}$. Let’s work out what happens in these important examples.
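As a quick numerical sanity check of the construction in equation (1), the sketch below builds the Gibbs density from the Huber loss for a scalar output and verifies that it integrates to 1; the particular prediction value and the use of SciPy quadrature are illustrative assumptions.

```python
import numpy as np
from scipy.integrate import quad

def huber(r, delta=1.0):
    """Huber loss of a residual r (Huber, 1964)."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta))

f_xw = 0.7  # stands in for the network's prediction f(x|w) at some fixed x, w

# Z(w) = integral of exp(-loss(y, f(x|w))) over Y = R
Z, _ = quad(lambda y: np.exp(-huber(y - f_xw)), -np.inf, np.inf)

# the induced Gibbs density p(y|x,w) = exp(-loss)/Z integrates to 1
total, _ = quad(lambda y: np.exp(-huber(y - f_xw)) / Z, -np.inf, np.inf)
print(Z, total)  # total is approximately 1.0
```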

1.1 Examples

In supervised learning, one is given a data set $\{(x_1, y_1),\cdots, (x_n, y_n)\}$ and the objective is to construct a neural network that predicts a future value $y_{n+1}$ from an input $x_{n+1}$ (Friedman, 1994). Based on the nature of the observables $x, y$, one constructs an appropriate neural network and chooses a loss function $\ell$ deemed appropriate. In the following examples, we derive the conditional probability densities (or mass functions) associated with a given network and loss function.

Example 1.1.1: Squared error loss

When $y$ is on a continuous scale (e.g., stock price, air temperature, etc.) modelled as a subset of $\mathbb{R}^k$, we could use the squared Euclidean norm $|\cdot|^2_2$ as a loss function.

$$\begin{aligned} \ell_{\text{SE}}(y, f(x|w)) &:= |y-f(x|w)|_2^2\\ &= \sum_{i=1}^k(y_i-f_i(x|w))^2 \end{aligned}$$
with $y=(y_1, \cdots, y_k)$, and $f_i(x|w)$ the $i$-th entry of $f(x|w)$. The neural network $f(\cdot|w)$ and the loss $\ell_{\text{SE}}(y,f(x|w))$ induce a parametric conditional probability density $\frac{1}{Z}e^{-|y-f(x|w)|_2^2}$, which one can immediately recognize as the $k$-dimensional normal distribution
$$p(y|x,w) = \frac{1}{(2\pi)^{k/2}|\Sigma|^{1/2}}e^{-\frac{1}{2}(y-f(x|w))^T\Sigma^{-1} (y-f(x|w))}$$
with mean $f(x|w)$ and covariance $\Sigma=\frac{1}{2}\mathbb{I}_{k\times k}$, where $(y-f(x|w))^T$ is the transpose of the column vector $(y-f(x|w))$, and $|\Sigma|$ is the determinant of $\Sigma$.
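A short numerical check of this correspondence, with a random vector standing in for $f(x|w)$: the negative Gaussian log-density with covariance $\Sigma=\frac{1}{2}\mathbb{I}$ differs from $\ell_{\text{SE}}$ only by a constant (the $\log Z$ term).

```python
import numpy as np

rng = np.random.default_rng(1)
k = 3
y = rng.normal(size=k)
f_xw = rng.normal(size=k)            # stands in for the network output f(x|w)

loss = np.sum((y - f_xw) ** 2)       # squared error loss

# Gaussian log-density with mean f(x|w) and covariance Sigma = (1/2) I
Sigma = 0.5 * np.eye(k)
diff = y - f_xw
log_p = (-0.5 * diff @ np.linalg.inv(Sigma) @ diff
         - 0.5 * (k * np.log(2 * np.pi) + np.log(np.linalg.det(Sigma))))

# -log p(y|x,w) = loss + log Z, with Z independent of w
print(-log_p - loss)                 # a constant: (k/2) * log(pi)
```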

Example 1.1.2: Absolute error loss

Another loss function used in practice is the $L^1$ norm.

$$\begin{aligned} \ell_{\text{AE}}(y, f(x|w)) &:= L^1(y, f(x|w))\\ &=\sum_{i=1}^k|y_i-f_i(x|w)| \end{aligned}$$

The induced conditional probability density is

$$\begin{aligned} p(y|x,w) &= \frac{1}{Z}e^{-\sum_{i=1}^k|y_i-f_i(x|w)|}\\ &= \frac{1}{Z}\prod_{i=1}^k e^{-|y_i-f_i(x|w)|} \end{aligned}$$
which is the product of $k$ independent Laplace probability densities with location $f_i(x|w)$ and scale parameter $1$.
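The same kind of check for the absolute error loss, again with a random vector standing in for $f(x|w)$: the negative log-density of $k$ independent Laplace distributions with location $f_i(x|w)$ and scale 1 equals $\ell_{\text{AE}}$ plus the constant $\log Z = k\log 2$.

```python
import numpy as np
from scipy.stats import laplace

rng = np.random.default_rng(2)
k = 4
y = rng.normal(size=k)
f_xw = rng.normal(size=k)                  # stands in for f(x|w)

loss = np.sum(np.abs(y - f_xw))            # absolute error loss

# sum of k independent Laplace log-densities, location f_i(x|w), scale 1
log_p = laplace.logpdf(y, loc=f_xw, scale=1.0).sum()

# again -log p = loss + log Z, with Z = 2^k independent of w
print(-log_p - loss)                       # k * log(2)
```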

Example 1.1.3: Cross entropy loss

When $y$ is on a categorical scale (e.g., dog vs cat vs bird; happy vs sad; a number in the set $\{0, \cdots, 9\}$), one typically uses a network with a number of output units matching the cardinality of $\mathcal{Y}$ and the cross-entropy loss

$$\ell_{\text{CE}}(y, f(x|w)) = -\sum_{i\in \mathcal{Y}} \big\{\delta[y=i]\log{\text{softmax}[f(x|w)]}_i\big\}$$
where $\delta[y=i]$ is the Kronecker delta, and $\text{softmax}[f(x|w)]_i$ is the $i$-th component of $\text{softmax}[f(x|w)]$. The associated conditional probability mass function is in fact explicit:

$$P(y=i|x,w) = \text{softmax}[f(x|w)]_i, \quad i\in\mathcal{Y}$$
When the cardinality of $\mathcal{Y}$ is $2$, the cross-entropy reduces to the binary cross-entropy.
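A small sketch making this explicit for a three-class problem; the logits are arbitrary numbers standing in for $f(x|w)$.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, -1.0, 0.5])      # f(x|w) for a 3-class output
y = 1                                    # observed class label

probs = softmax(logits)                  # P(y = i | x, w) for each class i
ce_loss = -np.log(probs[y])              # cross-entropy picks out -log of the true class probability

print(probs.sum(), ce_loss)              # probs sum to 1; loss = -log P(y=1|x,w)
```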

2. Why should we care?

Now that we have a good handle on the conditional distribution induced by our choice of loss function and the nature of the network's output layer, we can apply the tools of frequentist statistics, such as maximum likelihood, hypothesis testing, and asymptotic theory, to analyze supervised learning methods.

For instance, under the assumption that any pair $(x_i,y_i), (x_j,y_j)$, $i\neq j$ is independent, we can write the log-likelihood of the dataset (under our model) for different parameters $w$ as

$$\log{\prod_{i=1}^n p(y_i|x_i,w)} = \sum_{i=1}^n \log{p(y_i|x_i,w)} \tag{3}$$

yielding the following maximum likelihood estimate that many readers will recognize from undergraduate statistics.

$$\hat{w}_{\text{mle}} = \underset{w\in \mathcal{W}}{\text{argmax}}\bigg\{\sum_{i=1}^n \log{p(y_i|x_i,w)}\bigg\}$$

When $Z(w)$ is independent of $w\in\mathcal{W}$ (as in the squared-error and absolute-error examples above, with fixed variance/scale), maximizing the log-likelihood in equation (3) is equivalent to minimizing the empirical loss

$$L_n(w) := \frac{1}{n}\sum_{i=1}^n \ell(y_i, f(x_i|w)) \tag{4}$$

This equivalence is a result of the identity $\log{p(y|x, w)} = -\ell(y,f(x|w))-\log{Z(w)}$, which makes clear where one requires that $Z(w)$ is independent of $w$.
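The sketch below illustrates this equivalence on a toy one-parameter model $f(x|w)=wx$ with squared-error loss (so $p(y|x,w)$ is the Gaussian of Example 1.1.1): scanning a grid of $w$ values, the empirical-loss minimizer and the log-likelihood maximizer coincide. The synthetic data and the grid search are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=20)
y = 2.0 * x + rng.normal(scale=0.5, size=20)

# a one-parameter "network" f(x|w) = w*x with squared-error loss,
# so p(y|x,w) is Gaussian with mean w*x and covariance 1/2 (Example 1.1.1)
ws = np.linspace(0.0, 4.0, 401)
emp_loss = np.array([np.mean((y - w * x) ** 2) for w in ws])          # L_n(w)
log_lik = np.array([np.sum(-(y - w * x) ** 2 - 0.5 * np.log(np.pi))   # sum of log p(y_i|x_i,w)
                    for w in ws])

# because Z(w) is constant in w, the two criteria pick the same parameter
print(ws[np.argmin(emp_loss)], ws[np.argmax(log_lik)])
```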

One might argue that we did not gain much by characterizing the conditional probability density (or mass) associated with a given loss and network. This is not exactly true. First, making the assumed noise in the model explicit gives us more information about our model and ways to change it. For instance, in Example 1.1.1 above, one could choose a non-diagonal covariance matrix $\Sigma$ to represent known correlations in the noise. Moreover, one could compute the observed errors and check whether they conform to the assumed noise structure, a standard statistical technique for assessing model fit.

Second, one can apply tools of information theory to rigorously characterize what it means to be surprised when making predictions with the learned model. Using the well-known notion of Shannon surprise, the average Shannon surprise when using the learned model $p(y|x, \hat{w}_{\text{mle}})$ to predict $y$ given $x$, when the true unknown conditional distribution is $q(y|x)$, is defined as:

$$D(q,p(\cdot|\cdot,\hat{w}_{\text{mle}})) := \int_{\mathcal{X}\times\mathcal{Y}} q(y|x)q(x)\log\frac{q(y|x)}{p(y|x,\hat{w}_{\text{mle}})}\,dy\,dx \tag{5}$$

$D(q,p)$ is the Kullback-Leibler divergence (Kullback & Leibler, 1951), which one can rearrange as:

$$D(q,p(\cdot|\cdot,w)) = H(q,p(\cdot|\cdot,w)) - H(q(Y|X)) \tag{6}$$
where

$$H(q,p(\cdot|\cdot,w)):=-\int_{\mathcal{X}\times\mathcal{Y}}q(y|x)q(x)\log{p(y|x,w)}\,dx\,dy,$$
is the average surprise when using $p$ to predict samples drawn from $q$, and

$$H(q(Y|X)) := -\int_{\mathcal{X}\times\mathcal{Y}}q(y|x)q(x)\log{q(y|x)}\,dx\,dy,$$
is the conditional entropy of $Y$ given $X$, the minimum achievable average surprise when predicting $Y$ from $X$ using the true conditional $q$.

Equation (6) above is one important reason why machine learning minimizes the empirical cross entropy, which is a Monte Carlo estimate of $H(q,p)$ when the samples are independent. Since $H(q(Y|X))$ does not depend on the model, minimizing $H(q,p)$ is equivalent to minimizing the KL divergence from $q$ to $p$. Effectively, if our model is any good, it should stand in for $q$ when making decisions related to our observables $x$ and $y$.
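The decomposition in equation (6) is easy to verify numerically on a toy discrete example; the distributions $q$ and $p$ below are arbitrary choices for a single fixed $x$.

```python
import numpy as np

# a toy discrete Y with true conditional q and model p (for one fixed x)
q = np.array([0.6, 0.3, 0.1])
p = np.array([0.5, 0.4, 0.1])

cross_entropy = -np.sum(q * np.log(p))      # H(q, p): average surprise under the model
entropy = -np.sum(q * np.log(q))            # H(q): minimum achievable average surprise
kl = np.sum(q * np.log(q / p))              # D(q, p)

print(np.isclose(kl, cross_entropy - entropy))   # True: the decomposition in equation (6)
```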

Another reason the probabilistic view is valuable is that it provides natural criteria for model selection and for comparing different training algorithms. It is standard practice to split data into training and validation sets: one searches for an optimal parameter $w_*$ on the training set, trying different training regimes or architectures, and selects the model that scores best on the validation set. Probability and statistical analysis give us principled ways to think about how to split the data, why splitting is needed, and how to interpret the results.

If we want to build AI systems that are not just powerful but trustworthy, we need learning machines with provable consistency: systems that converge to the truth, or as close to it as the model allows, as evidence accumulates. Without such guarantees, the learning machine cannot be fully trusted. The probabilistic view is not merely convenient; it may be essential for understanding how to build machines we can trust.

  1. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
  2. Legendre, A.-M. (1805). Nouvelles méthodes pour la détermination des orbites des comètes. Courcier.
  3. Koenker, R., & Bassett, G., Jr. (1978). Regression Quantiles. Econometrica, 46(1), 33–50.
  4. Huber, P. J. (1964). Robust Estimation of a Location Parameter. The Annals of Mathematical Statistics, 35(1), 73–101. https://doi.org/10.1214/aoms/1177703732
  5. Friedman, J. H. (1994). An Overview of Predictive Learning and Function Approximation. In V. Cherkassky, J. H. Friedman, & H. Wechsler (Eds.), From Statistics to Neural Networks (pp. 1–61). Springer Berlin Heidelberg.
  6. Kullback, S., & Leibler, R. A. (1951). On Information and Sufficiency. The Annals of Mathematical Statistics, 22(1), 79–86. https://doi.org/10.1214/aoms/1177729694