Cross Entropy

The cross entropy between two distributions \(p(x)\) and \(q(x)\) is given by:

\[\xH{p || q} = -\sum_{x \in \mathcal{X}} p(x) \log_2 q(x)\]
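To make the sum concrete, it can be computed directly from two probability vectors. The cross_entropy_bits helper below is a hypothetical NumPy illustration, not part of dit:

import numpy as np

def cross_entropy_bits(p, q):
    """Return -sum_x p(x) * log2(q(x)) for two probability vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    nonzero = p > 0  # terms with p(x) == 0 contribute nothing, by convention
    return -np.sum(p[nonzero] * np.log2(q[nonzero]))

print(cross_entropy_bits([1/2, 1/2], [3/4, 1/4]))  # ~1.2075 bits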

This quantifies the average cost of representing a distribution defined by the probabilities \(p(x)\) using the probabilities \(q(x)\). For example, the cross entropy of a distribution with itself is the entropy of that distribution, because the entropy quantifies the average cost of representing a distribution using its own probabilities:

In [1]: import dit

In [2]: from dit.divergences import cross_entropy

In [3]: p = dit.Distribution(['0', '1'], [1/2, 1/2])

In [4]: cross_entropy(p, p)
Out[4]: 1.0

If, however, we attempted to model a fair coin with a biased one, we could compute this mismatch with the cross entropy:

In [5]: q = dit.Distribution(['0', '1'], [3/4, 1/4])

In [6]: cross_entropy(p, q)
Out[6]: 1.207518749639422

This means we will, on average, use about \(1.2\) bits to represent the flips of a fair coin. Turning things around, what if we had a biased coin that we attempted to represent with a fair coin:

In [7]: cross_entropy(q, p)
Out[7]: 1.0

So although the entropy of \(q\) is less than \(1\), we will use a full bit to represent its outcomes. Both of these results can easily be seen by considering the following identity:

\[\xH{p || q} = \H{p} + \DKL{p || q}\]

So in representing \(p\) using \(q\), we must of course use at least \(\H{p}\) bits, the minimum required to represent \(p\), plus the Kullback-Leibler divergence of \(q\) from \(p\).
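This identity is easy to check numerically. The following is a sketch reusing the \(p\) and \(q\) defined above, assuming dit's kullback_leibler (a sibling of cross_entropy in dit.divergences) and dit.shannon.entropy:

In [8]: from dit.divergences import kullback_leibler

In [9]: dit.shannon.entropy(p) + kullback_leibler(p, q)
Out[9]: 1.207518749639422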

API

cross_entropy(dist1, dist2, rvs=None, crvs=None, rv_mode=None)

The cross entropy between dist1 and dist2.

Parameters
  • dist1 (Distribution) – The first distribution in the cross entropy.

  • dist2 (Distribution) – The second distribution in the cross entropy.

  • rvs (list, None) – The indexes of the random variables used to calculate the cross entropy. If None, then the cross entropy is calculated over all random variables.

  • rv_mode (str, None) – Specifies how to interpret rvs and crvs. Valid options are: {‘indices’, ‘names’}. If equal to ‘indices’, then the elements of crvs and rvs are interpreted as random variable indices. If equal to ‘names’, then the elements are interpreted as random variable names. If None, then the value of dist._rv_mode is consulted, which defaults to ‘indices’.

Returns

xh – The cross entropy between dist1 and dist2.

Return type

float

Raises

ditException – Raised if either dist1 or dist2 doesn’t have rvs or, if rvs is None, if dist2 has an outcome length different from dist1.
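As a usage sketch for the rvs argument (our own construction, not from the dit documentation): restricting the calculation to the first index of a pair of joint distributions should reduce to the cross entropy between their marginals on that index, which here are the fair and biased coins from above:

In [10]: d1 = dit.Distribution(['00', '01', '10', '11'], [1/4]*4)

In [11]: d2 = dit.Distribution(['00', '01', '10', '11'], [3/8, 3/8, 1/8, 1/8])

In [12]: cross_entropy(d1, d2, rvs=[0])
Out[12]: 1.207518749639422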