Multiscale Structure, Information Theory, Explosions

I’d like to talk a bit about using information theory to quantify the intuition that a complex system exhibits structure at multiple scales of organization. My friend and colleague Ben Allen wrote an introduction to this a while ago:

Ben’s blog post is a capsule introduction to this article that he and I wrote with Yaneer Bar-Yam:

I also cover this topic, as well as a fair bit of background on how to relate probability and information, in my PhD thesis:

In this post, I’ll carry the ideas laid out in these sources a little bit farther in a particular direction.

Imagine a system that is made up of a large number of pieces. A theme one encounters in many areas of science is that such a system is simpler to understand when the pieces are independent of one another, each one doing its own thing. In thermodynamics and statistical physics, for example, we spend a lot of time studying gases, and the easiest conceptual model of a gas to work with is an ideal gas, in which the atoms sail past each other without interacting. On the other hand, a system can also be simple to understand if all the pieces are so tightly correlated or so strongly bound together that they essentially move as a unit. Then, instead of having to understand lots of little things, we just have to understand one big thing, because knowing what any one part is doing tells us most of what we need to know about the whole.

A similar high-level intuition is helpful in pure mathematics, as Terry Tao once explained. In mathematics, one finds structured collections, where the relations among the individual elements have a high degree of predictability, and random or pseudo-random collections, where the status of one element is not informative about the status of another. Moreover, one also encounters hybrid collections, which combine features of both.

The general approach of the references listed above is to use information theory to make these heuristic discussions quantitative. One conclusion is that we should not aim to quantify a system’s “complicatedness” by a single number. Instead, it is more illuminating to devise curves that indicate how much structure is present at different scales.

We start by saying that a system $\mathcal{S}$ is composed of pieces, or components. We’ll denote the set of components of the system $\mathcal{S}$ by $S$. It is important to distinguish the two, because we could take the same components and arrange them in a different way to create a different system. We express the arrangement or the patterning of the components by defining an information function. For each subset $T \subset S$, the information function $H$ assigns a nonnegative number $H(T)$, which expresses how much information is necessary to say what all the pieces in $T$ are doing. One can actually prove quite a lot starting from the requirement that if $H$ is to be an information function, it must satisfy a few basic axioms.

First, as we already said, $H(T) \geq 0$ for any $T \subset S$.

Second, if $T \subset V \subset S$, then $H(T) \leq H(V)$. We can call this property monotonicity.

Third, if we have two subsets $T \subset S$ and $V \subset S$, not necessarily nested in one another or anything like that, the total information assigned to their union, $H(T \cup V)$, must be limited. Information pertinent to the components in $T$ can also be pertinent to the components in $V$. For one thing, $T$ and $V$ might have some components in common. And even those components that are not shared between the two sets might be correlated in such a way that the amount of information necessary to describe the whole lot is reduced. So, we require that
\[
H(T \cup V) \leq H(T) + H(V) - H(T \cap V),\] which we call strong subadditivity. Given two components $s$ and $t$ within $S$, the shared information expresses the difference between what is required to describe them separately versus describing them together:
\[
I(s;t) = H(\{s\}) + H(\{t\}) - H(\{s,t\}).\] It follows from strong subadditivity that the shared information is always nonnegative.
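
To make these axioms concrete, here is a minimal Python sketch of the special case we will use later, where the information function is the Shannon entropy of a uniform distribution over joint states. The toy system and all the names in it are my own choices for illustration, not anything from the references above.

```python
import itertools
from collections import Counter
from math import log2

# A toy system of three components: components 0 and 1 are copies of one
# fair coin, and component 2 is an independent fair coin.  The joint
# states below are equally probable.
STATES = [(x, x, z) for x in (0, 1) for z in (0, 1)]

def H(subset):
    """Shannon entropy, in bits, of the components listed in `subset`."""
    if not subset:
        return 0.0
    marginal = Counter(tuple(s[i] for i in sorted(subset)) for s in STATES)
    n = sum(marginal.values())
    return -sum((k / n) * log2(k / n) for k in marginal.values())

def shared_information(i, j):
    """I(i; j) = H({i}) + H({j}) - H({i, j})."""
    return H({i}) + H({j}) - H({i, j})

print(shared_information(0, 1))  # 1.0: components 0 and 1 are copies
print(shared_information(0, 2))  # 0.0: component 2 is independent
for i, j in itertools.combinations(range(3), 2):
    assert shared_information(i, j) >= 0.0  # nonnegativity, as promised
```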

A descriptor of a system is an entity outside that system which tells us about it in some way. Mathematically speaking, we take the system $\mathcal{S}$ and augment it with a new component that we can call $d$, to form a new system whose component set is $S \cup \{d\}$. The information function of the augmented system reduces to that of the original system $\mathcal{S}$ when applied to subsets of $S$. The values of the augmented system’s information function on subsets that include the descriptor $d$ tell us how $d$ shares information with the original components of $\mathcal{S}$.

The utility of a descriptor $d$ is
\[ U(d) = \sum_{s \in S} I(d;s). \] Given the basic axioms of information functions that we listed above, we can define an optimal descriptor as the one which has the largest possible utility, given its own amount of information. That is, if we invest an amount of information $y$ in describing the system $\mathcal{S}$, then an optimal descriptor has $H(d) = y$, and it relates to $\mathcal{S}$ in such a way that $U(d)$ is as large as the basic axioms of information functions allow. This defines a linear programming problem whose solution is the optimal utility, and so the theory of linear programming lets us prove helpful results about how the optimal utility varies as a function of $y$. Taking the derivative of the optimal utility gives the marginal utility of information, or MUI.
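
Since the optimal utility is defined by a linear program, we can also compute it numerically for small systems. The sketch below is my own reconstruction under the stated assumptions, not code from the paper: it assumes SciPy is available, treats the values of the augmented information function as the LP variables, imposes monotonicity and strong subadditivity as the constraints, and maximizes the utility of the descriptor.

```python
import itertools

import numpy as np
from scipy.optimize import linprog

def optimal_utility(H0, n, y):
    """Largest utility U(d) permitted by the axioms for a descriptor d
    with H(d) = y, attached to a system with components 0..n-1 whose
    information function on subsets is given by H0(frozenset)."""
    d = n  # index used for the descriptor component
    ground = list(range(n)) + [d]
    subsets = [frozenset(c) for r in range(len(ground) + 1)
               for c in itertools.combinations(ground, r)]
    idx = {A: k for k, A in enumerate(subsets)}
    nv = len(subsets)

    # Equalities: the information function is fixed on subsets of the
    # original system (including H(empty) = 0), and H({d}) = y.
    A_eq, b_eq = [], []
    for A in subsets:
        if d not in A:
            row = np.zeros(nv); row[idx[A]] = 1.0
            A_eq.append(row); b_eq.append(H0(A))
    row = np.zeros(nv); row[idx[frozenset([d])]] = 1.0
    A_eq.append(row); b_eq.append(y)

    # Inequalities from the axioms.
    A_ub, b_ub = [], []
    for A in subsets:                     # monotonicity: h[A] <= h[A + {x}]
        for x in ground:
            if x not in A:
                row = np.zeros(nv)
                row[idx[A]] += 1.0
                row[idx[A | {x}]] -= 1.0
                A_ub.append(row); b_ub.append(0.0)
    for A, B in itertools.combinations(subsets, 2):  # strong subadditivity
        row = np.zeros(nv)
        row[idx[A | B]] += 1.0
        row[idx[A & B]] += 1.0
        row[idx[A]] -= 1.0
        row[idx[B]] -= 1.0
        A_ub.append(row); b_ub.append(0.0)

    # U(d) = sum_s [H({d}) + H({s}) - H({s,d})]; only the H({s,d}) terms
    # are free, so maximizing U(d) means minimizing their sum.
    c = np.zeros(nv)
    for s in range(n):
        c[idx[frozenset([s, d])]] = 1.0
    res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub,
                  A_eq=np.array(A_eq), b_eq=b_eq, bounds=(0, None))
    assert res.success
    return sum(y + H0(frozenset([s])) for s in range(n)) - res.fun

# Ideal-gas-like system (three independent bits) versus a fully
# correlated block (one bit shared by all three components):
gas = lambda T: float(len(T))
block = lambda T: 1.0 if T else 0.0
for y in (0.5, 1.0, 1.5):
    print(y, optimal_utility(gas, 3, y), optimal_utility(block, 3, y))
# Expected: utility y for the gas, min(3 * y, 3) for the block.
```

Differencing `optimal_utility` with respect to $y$ then gives a numerical estimate of the MUI.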

We prove a fair bit about the MUI in the paper where we introduced it. For systems of the ideal-gas type, where information applies to one component at a time, the MUI is a low and flat function: Investing one bit of description buys one unit of utility, until the whole system is described. On the other hand, if all the components are bound together and “move as one,” then investing a small amount of information buys us utility on a large scale, because that small amount applies across the board. In this case, the MUI starts high and falls sharply.

When the construction of a system is specified in detail, it is sometimes possible to carry out a finer-grained analysis, which reveals features that a first application of a structure index like the MUI can overlook. For example, take the two systems defined by James and Crutchfield (2016). These are three-component systems composed of random variables whose joint states are chosen by picking a row at random from a table, with uniform probability. When systems are defined as collections of random variables with some joint probability distribution, we can use the Shannon information (a.k.a. the Shannon entropy or Shannon index) as our information function $H$. The dyadic and triadic systems are defined respectively by the tables
\[
D = \left(\begin{array}{ccc}
0 & 0 & 0 \\
0 & 2 & 1 \\
1 & 0 & 2 \\
1 & 2 & 3 \\
2 & 1 & 0 \\
2 & 3 & 1 \\
3 & 1 & 2 \\
3 & 3 & 3
\end{array}\right);
\ T =
\left(\begin{array}{ccc}
0 & 0 & 0 \\
1 & 1 & 1 \\
0 & 2 & 2 \\
1 & 3 & 3 \\
2 & 0 & 2 \\
3 & 1 & 3 \\
2 & 2 & 0 \\
3 & 3 & 1
\end{array}\right).
\]
If we compute the MUI for these two systems in the manner described above, we find that the MUI for the dyadic system is the same as for the triadic. In fact, the information functions for the two systems, computed according to the Shannon scheme, agree on every subset. However, we can detect a difference between their structures by augmenting them. Let the three components of the dyadic system be written $d_1$, $d_2$ and $d_3$, and introduce a new variable $\Delta_{12}$ which takes the value 1 when the states of $d_1$ and $d_2$ are the same, and is 0 otherwise. This new ancilla variable is determined completely by the original system, and it is sensitive to the particular values taken by the original system components $d_1$ and $d_2$. We can define two other ancillae in the same way, $\Delta_{13}$ and $\Delta_{23}$. Carrying out this construction for the dyadic example system, we find that the three ancillae form a completely correlated block system, one that is biased towards the joint state 000.
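
The claimed agreement of the two information functions is easy to verify directly. Here is a quick sketch (the table names are mine), treating the rows as equally probable and computing the Shannon entropy of every marginal:

```python
import itertools
from collections import Counter
from math import log2

DYADIC  = [(0,0,0), (0,2,1), (1,0,2), (1,2,3), (2,1,0), (2,3,1), (3,1,2), (3,3,3)]
TRIADIC = [(0,0,0), (1,1,1), (0,2,2), (1,3,3), (2,0,2), (3,1,3), (2,2,0), (3,3,1)]

def shannon(rows, cols):
    """Entropy, in bits, of the columns `cols`, with rows equally probable."""
    counts = Counter(tuple(row[c] for c in cols) for row in rows)
    n = sum(counts.values())
    return -sum((k / n) * log2(k / n) for k in counts.values())

for r in (1, 2, 3):
    for cols in itertools.combinations(range(3), r):
        print(cols, shannon(DYADIC, cols), shannon(TRIADIC, cols))
# Both systems give the same answers: 2.0 bits for each single component,
# 3.0 bits for each pair, and 3.0 bits for all three components together.
```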

In contrast, for the triadic example, defining three ancillae in the same way, we find that they form a parity-bit system (with odd parity). The ancillary system has four possible joint states, all of which are equally probable, and so it has two bits of information overall. For each possible joint state of the ancillary system, the original system can be in one of two joint states, with equal probability. Therefore, we see that the three bits of information necessary to specify the state of the original system break down into a pair of bits for describing the ancillae, plus one more bit of additional detail.
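
Both ancilla constructions can be checked in a few lines as well; the tables are redefined here so that the snippet stands alone:

```python
from collections import Counter

DYADIC  = [(0,0,0), (0,2,1), (1,0,2), (1,2,3), (2,1,0), (2,3,1), (3,1,2), (3,3,3)]
TRIADIC = [(0,0,0), (1,1,1), (0,2,2), (1,3,3), (2,0,2), (3,1,3), (2,2,0), (3,3,1)]

def ancillae(rows):
    """Joint states of (Delta_12, Delta_13, Delta_23), with multiplicities."""
    return Counter((int(a == b), int(a == c), int(b == c))
                   for a, b, c in rows)

print(ancillae(DYADIC))   # {(0,0,0): 6, (1,1,1): 2}: a biased block system
print(ancillae(TRIADIC))  # four equiprobable states, each with odd parity
```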

The logical extreme of this approach is to define an ancilla for each possible value of each component of a system. I like to call this “exploding” the original system. Applying our theory to an exploded system can reveal new details of organization, at the price of increasing the number of components one must consider. For example, consider a two-component system defined by picking two successive characters at random out of a large corpus of English text. The shared information between the two components quantifies how much knowing the value of the first character helps us predict the value of the second. However, knowing that the first character is a $Q$ is a stronger constraint than knowing that it is, say, a $T$, because fewer characters can follow a $Q$. We can express this in our theory by exploding the two-component system, defining new components that represent the events of the first character being a $Q$ and the first character being a $T$. This is the same basic idea as that of the partial information decomposition; however, our axioms do not force the user to introduce a distinction between “input” and “output” variables as the PID framework does.
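
To illustrate, here is a sketch of the exploding construction for the two-character system. The toy corpus below is a stand-in for a large corpus of English text, chosen so that a $q$ determines its successor completely while a $t$ does not; genuine numbers would of course require genuine text, and all the names here are mine.

```python
from collections import Counter
from math import log2

# Toy corpus: every 'q' is followed by 'u', while 't' can be followed
# by 'h' or by a space.
CORPUS = "the quick brown fox jumps over the lazy dog and the cat sat on the mat " * 100

def entropy(counts):
    n = sum(counts.values())
    return -sum((k / n) * log2(k / n) for k in counts.values())

pairs = [CORPUS[i : i + 2] for i in range(len(CORPUS) - 1)]

def exploded_shared_info(char):
    """Shared information between the exploded component 'the first
    character equals `char`' and the second character."""
    first  = Counter(int(a == char) for a, _ in pairs)
    second = Counter(b for _, b in pairs)
    joint  = Counter((int(a == char), b) for a, b in pairs)
    return entropy(first) + entropy(second) - entropy(joint)

print(exploded_shared_info("q"), exploded_shared_info("t"))
```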