Partial Information Decomposition
The partial information decomposition (PID), put forth by Williams & Beer [WB10], is a framework for decomposing the information shared between a set of variables we will refer to as inputs, \(X_0, X_1, \ldots\), and another random variable we will refer to as the output, \(Y\). This decomposition seeks to partition the information \(\I{X_0,X_1,\ldots : Y}\) among the antichains of the inputs.
Background
It is often desirable to determine how a set of inputs influences the behavior of an output. Consider the exclusive or logic gate, for example:
In [1]: from dit.pid.distributions import bivariates, trivariates
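The xor gate is included among dit's example bivariate distributions; assuming it is stored under the 'synergy' key, we can construct and display it:

In [2]: xor = bivariates['synergy']

In [3]: print(xor)
Class:          Distribution
Alphabet:       ('0', '1') for all rvs
Base:           linear
Outcome Class:  str
Outcome Length: 3
RV Names:       None

x     p(x)
000   1/4
011   1/4
101   1/4
110   1/4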
We can see from inspection that either input (the first two indexes) is independent of the output (the final index), yet the two inputs together determine the output. One could call this “synergistic” information. Next, consider the giant bit distribution:
In [4]: gb = bivariates['redundant']

In [5]: print(gb)
Class:          Distribution
Alphabet:       ('0', '1') for all rvs
Base:           linear
Outcome Class:  str
Outcome Length: 3
RV Names:       None

x     p(x)
000   1/2
111   1/2
Here, we see that either input informs us of exactly what the output is. One could call this “redundant” information. Furthermore, consider the coinformation of these distributions:
In [6]: from dit.multivariate import coinformation as I
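Applying it to the two distributions above (assuming xor and gb as constructed earlier), the xor gate has negative coinformation while the giant bit's is positive:

In [7]: I(xor)
Out[7]: -1.0

In [8]: I(gb)
Out[8]: 1.0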
This could lead one to intuit that negative values of the coinformation correspond to synergistic effects in a distribution, while positive values correspond to redundant effects. This intuition, however, is at best misleading: the coinformations of the 4-variable giant bit and 4-variable parity distributions are both positive:
In [9]: import dit

In [10]: I(dit.example_dists.giant_bit(4, 2))
Out[10]: 1.0

In [11]: I(dit.example_dists.n_mod_m(4, 2))
Out[11]: 1.0
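The parity value can be checked by hand. Any three or fewer of the four bits are independent and uniform, so every proper subset of size \(k\) has entropy \(k\) bits, while the full joint entropy is \(3\) bits. The inclusion-exclusion expansion of the coinformation, summed over the \(\binom{4}{k}\) subsets of each size, then gives \(4 \cdot 1 - 6 \cdot 2 + 4 \cdot 3 - 3 = 1\) bit.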
This, as well as other issues, led Williams & Beer [WB10] to propose the partial information decomposition.
Framework
The goal of the partial information decomposition is to assign some non-negative portion of \(\I{\{X_i\} : Y}\) to each antichain over the inputs. An antichain over the inputs is a set of sets, none of which is a subset of any other. For example, \(\left\{ \left\{X_0, X_1\right\}, \left\{X_1, X_2\right\} \right\}\) is an antichain, but \(\left\{ \left\{X_0, X_1\right\}, \left\{X_0, X_1, X_2\right\} \right\}\) is not.
The antichains form a lattice based on this partial order:

\[\alpha \preceq \beta \iff \forall \mathbf{b} \in \beta, \exists \mathbf{a} \in \alpha, \mathbf{a} \subseteq \mathbf{b}\]
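To make the antichains concrete, here is a small self-contained sketch (plain Python, not part of dit; the function names are ours) that enumerates them by brute force:

from itertools import combinations

def nonempty_subsets(items):
    """All non-empty subsets of items, as frozensets."""
    items = list(items)
    return [frozenset(c)
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)]

def antichains(inputs):
    """Yield each antichain over the inputs: a collection of input
    subsets in which no member is a proper subset of another."""
    for collection in nonempty_subsets(nonempty_subsets(inputs)):
        if not any(a < b for a in collection for b in collection):
            yield collection

print(sum(1 for _ in antichains([0, 1])))     # 4: the bivariate lattice
print(sum(1 for _ in antichains([0, 1, 2])))  # 18: the trivariate lattice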
From here, we wish to find a redundancy measure, \(\Icap{\bullet}\), which assigns to each antichain some portion of \(\I{\{X_i\} : Y}\), intuitively quantifying what portion of the information in the output could be learned by observing any of the sets of variables within the antichain. There are several axioms such a measure must satisfy in order to be a viable measure of redundancy.
Bivariate Lattice
Let us consider the special case of two inputs. The lattice consists of four elements: \(\left\{\left\{X_0\right\}, \left\{X_1\right\}\right\}\), \(\left\{\left\{X_0\right\}\right\}\), \(\left\{\left\{X_1\right\}\right\}\), and \(\left\{\left\{X_0, X_1\right\}\right\}\). We can interpret these elements as the redundancy \(R\) provided by both inputs, the information \(U_0\) uniquely provided by \(X_0\), the information \(U_1\) uniquely provided by \(X_1\), and the information \(S\) synergistically provided only by both inputs together. Together these four elements decompose the input-output mutual information:

\[\I{X_0 X_1 : Y} = R + U_0 + U_1 + S\]
Furthermore, due to the self-redundancy axiom (described below), the single-input mutual informations decompose in the following way:

\[\I{X_0 : Y} = R + U_0\]
\[\I{X_1 : Y} = R + U_1\]
Colloquially, from input \(X_0\) one can learn what is redundantly provided by either input, plus what is uniquely provided by \(X_0\), but not what is uniquely provided by \(X_1\) or what can only be learned synergistically from both inputs.
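For example, for the xor gate above \(\I{X_0 X_1 : Y} = 1\) bit while \(\I{X_0 : Y} = \I{X_1 : Y} = 0\); solving the three equations gives \(R = U_0 = U_1 = 0\) and \(S = 1\), so the entire bit is synergistic. For the giant bit, all three mutual informations equal \(1\) bit, giving \(R = 1\) and \(U_0 = U_1 = S = 0\).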
Axioms
The following three axioms were provided by Williams & Beer.
Symmetry
The redundancy \(\Icap{X_{0:n} : Y}\) is invariant under reorderings of \(X_i\).
Self-Redundancy
The redundancy of a single input is its mutual information with the output:

\[\Icap{X_i : Y} = \I{X_i : Y}\]
Monotonicity
The redundancy should only decrease with the inclusion of more inputs:

\[\Icap{\mathcal{A}_1, \ldots, \mathcal{A}_{k-1}, \mathcal{A}_k : Y} \leq \Icap{\mathcal{A}_1, \ldots, \mathcal{A}_{k-1} : Y}\]
with equality if \(\mathcal{A}_{k-1} \subseteq \mathcal{A}_k\).
There have been other axioms proposed following from those of Williams & Beer.
Measures
We now turn our attention to a variety of methods proposed to flesh out this partial information decomposition.
In [12]: from dit.pid import *
\(\Imin{\bullet}\)
\(\Imin{\bullet}\) [WB10] was Williams & Beer's initial proposal for a redundancy measure. It is given by:

\[\Imin{\mathcal{A}_1, \ldots, \mathcal{A}_k : Y} = \sum_{y \in Y} p(y) \min_{\mathcal{A}_i} \I{\mathcal{A}_i : Y = y}\]

where \(\I{\mathcal{A}_i : Y = y}\) is the specific information that \(\mathcal{A}_i\) carries about the particular outcome \(Y = y\):

\[\I{\mathcal{A}_i : Y = y} = \sum_{a} p(a | y) \left[ \log_2 \frac{1}{p(y)} - \log_2 \frac{1}{p(y | a)} \right]\]
However, this measure has been criticized for acting in an unintuitive manner [GK14]. In the following distribution the output is simply a copy of two independent input bits, yet \(\Imin{\bullet}\) attributes a full bit of redundancy to them:
In [13]: d = dit.Distribution(['000', '011', '102', '113'], [1/4]*4)

In [14]: PID_WB(d)
Out[14]:
╔════════╤════════╤════════╗
║ I_min  │  I_r   │   pi   ║
╟────────┼────────┼────────╢
║ {0:1}  │ 2.0000 │ 1.0000 ║
║  {0}   │ 1.0000 │ 0.0000 ║
║  {1}   │ 1.0000 │ 0.0000 ║
║ {0}{1} │ 1.0000 │ 1.0000 ║
╚════════╧════════╧════════╝
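To see where the \(\left\{0\right\}\left\{1\right\}\) entry comes from, here is a minimal pure-Python sketch of \(\Imin{\bullet}\) restricted to singleton antichains (the function i_min and the dictionary encoding of the distribution are ours for illustration, not dit's API):

from collections import defaultdict
from math import log2

def i_min(pmf, sources, target):
    """I_min over singleton antichains: the average over outcomes y of
    the smallest specific information any single source provides."""
    # marginal distribution of the output, p(y)
    p_y = defaultdict(float)
    for outcome, p in pmf.items():
        p_y[outcome[target]] += p

    def specific_info(src, y):
        # I(X_src : Y=y) = sum_a p(a|y) [log2 1/p(y) - log2 1/p(y|a)]
        p_a = defaultdict(float)   # p(a)
        p_ay = defaultdict(float)  # p(a, y)
        for outcome, p in pmf.items():
            p_a[outcome[src]] += p
            if outcome[target] == y:
                p_ay[outcome[src]] += p
        return sum((pay / p_y[y]) * (log2(pay / p_a[a]) - log2(p_y[y]))
                   for a, pay in p_ay.items())

    return sum(p * min(specific_info(src, y) for src in sources)
               for y, p in p_y.items())

# The distribution from the example above: two independent input bits,
# with the output a faithful copy of both.
d = {('0', '0', '0'): 0.25, ('0', '1', '1'): 0.25,
     ('1', '0', '2'): 0.25, ('1', '1', '3'): 0.25}
print(i_min(d, sources=[0, 1], target=2))  # 1.0, matching the {0}{1} row

Each input tells us one full bit about every output symbol, so \(\Imin{\bullet}\) counts that bit as redundant even though the two inputs inform us about different halves of the output.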