John Baez
Entropy as a functor

Idea

These are notes by John Baez, Tobias Fritz and Tom Leinster. The idea is to develop a deeper understanding of entropy, free energy, the partition function and related ideas in probability theory and statistical mechanics using the tools of modern algebra: categories, monads, algebraic theories, operads and the like.

Some of these notes were turned into a paper:

  • John Baez, Tobias Fritz and Tom Leinster, A characterization of entropy in terms of information loss, on the arXiv or free online at Entropy 13 (2011), 1945-1957.

This paper was developed in a series of blog conversations, in this order:

The paper was written up on the nLab here:

and there is also a draft of a paper that never got finished, here:


Entropy as a functor

In our paper A characterization of entropy in terms of information loss we gave a characterization of Shannon and more generally Tsallis entropy, not just for probability measures on finite sets, but for arbitrary (nonnegative) measures on these sets. This characterization uses a certain category:

Definition

Let FinMeas be the category whose objects are finite sets equipped with measures and whose morphisms are measure-preserving functions.

We noted that one can take arbitrary nonnegative linear combinations of objects and morphisms in FinMeas. These can be built up from direct sums and scalar multiplication by nonnegative real numbers, defined as follows.

  • For ‘direct sums’, first note that the disjoint union of two finite sets equipped with measures is another thing of the same sort. We write the disjoint union of p, q ∈ FinMeas as p ⊕ q. Then, given morphisms f: p → p′, g: q → q′, there is a unique morphism f ⊕ g: p ⊕ q → p′ ⊕ q′ that restricts to f on the measure space p and g on the measure space q.

  • For ‘scalar multiplication’, first note that we can multiply a measure by a nonnegative real number and get a new measure. So, given an object p ∈ FinMeas and a number λ ≥ 0 we obtain an object λp ∈ FinMeas with the same underlying set and with (λp) i = λ p i. Then, given a morphism f: p → q, there is a unique morphism λf: λp → λq that has the same underlying function as f.

Tsallis entropy can be seen as the unique invariant of morphisms in FinMeas that is additive under composition, additive under direct sums, homogeneous with respect to scalar multiplication and also continuous:

Theorem

Let α ∈ (0,∞). Suppose F is any map sending morphisms in FinMeas to numbers in [0,∞), and obeying these four properties:

  1. Functoriality:

    F(f \circ g) = F(f) + F(g)

    whenever f,g are composable morphisms.

  2. Additivity:

    F(f \oplus g) = F(f) + F(g)

    for all morphisms f,g.

  3. Homogeneity of degree α:

    F(\lambda f) = \lambda^\alpha F(f)

    for all morphisms f and all λ ∈ [0,∞).

  4. Continuity: F is continuous.

Then there exists a constant c ≥ 0 such that for any morphism f: p → q in FinMeas,

F(f) = c(H_\alpha(p) - H_\alpha(q))

where H α is the Tsallis entropy of order α. Conversely, for any constant c ≥ 0, this formula determines a map F obeying conditions 1-4.

In the simplest case, namely homogeneity of degree α=1, the Tsallis entropy H α reduces to the Shannon entropy.
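As a quick numerical illustration of the quantities involved, here is a minimal Python sketch (the helper names are ours, not notation from the paper) of the Shannon and Tsallis entropies of a probability distribution, using the standard formulas; it shows the Tsallis entropy approaching the Shannon entropy as α → 1.

```python
import math

def shannon_entropy(p):
    """Shannon entropy H(p) = -sum_i p_i ln p_i of a probability distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def tsallis_entropy(p, alpha):
    """Tsallis entropy of order alpha: (1 - sum_i p_i^alpha) / (alpha - 1)."""
    if alpha == 1:
        return shannon_entropy(p)
    return (1.0 - sum(x ** alpha for x in p)) / (alpha - 1.0)

p = [0.5, 0.25, 0.25]
for alpha in [0.5, 0.9, 0.99, 1.01, 2.0]:
    print(alpha, tsallis_entropy(p, alpha))
print("Shannon:", shannon_entropy(p))
```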

In fact, such an F really defines a functor from FinMeas to a category with one object and with nonnegative real numbers as morphisms. So we now go ahead and formulate our results in the language of category theory. For this we need to pick a way to view the nonnegative real numbers as a category:

Definition

Let ℝ+ be the category with a single object * and [0,∞) as the set of morphisms from this object to itself, with addition as composition.

The categories ℝ+ and FinMeas have much in common, which is why it is reasonable to assign an entropy change, namely a morphism in ℝ+, to each morphism in FinMeas. Here we focus on three similarities between these categories.

First, both ℝ+ and FinMeas are topological categories. Recall that a topological category is a category where the set of objects and the set of morphisms form topological spaces, and all the category operations are continuous. A continuous functor is a functor between topological categories that maps objects to objects and morphisms to morphisms in a continuous way.

To make ℝ+ into a topological category, we use the only possible topology on the set of objects, and the standard topology on the set of morphisms [0,∞). To make FinMeas into a topological category, we should say not only when a sequence of objects or morphisms converges, but also when any net converges. A net of objects p β ∈ FinMeas converges to p ∈ FinMeas if eventually all the p β have the same underlying finite set X, and then (p β) i → p i in the usual topology on [0,∞) for all i ∈ X. We say a net of morphisms f β: p β → q β converges if p β → p and q β → q for some p, q ∈ FinMeas and for large enough β the underlying function of f β equals a fixed function f: X → Y. The observant reader will note that FinMeas is a ‘large’ topological category, meaning that its set of objects lives in a larger Grothendieck universe than that containing the finite sets we use here. This will not present a problem.

Second, both ℝ+ and FinMeas come equipped with a notion of ‘addition’ making them into symmetric monoidal categories. For ℝ+ this addition is trivial on objects, and the usual addition of nonnegative real numbers on morphisms. The resulting functor

+ : \mathbb{R}_+ \times \mathbb{R}_+ \to \mathbb{R}_+

makes ℝ+ into a symmetric monoidal category. On the other hand, we have already seen how disjoint union of finite sets equipped with measures defines a ‘direct sum’ operation on FinMeas. This gives a functor

\oplus : \Fin\Meas \times \Fin\Meas \to \Fin\Meas

which makes FinMeas into a symmetric monoidal category.

Third, both ℝ+ and FinMeas can be equipped with a notion of ‘scalar multiplication’ by any number λ ∈ [0,∞). This gives an action of the multiplicative monoid [0,∞) on these categories, by which we mean monoid homomorphisms

[0,\infty) \rightarrow \End(\mathbb{R}_+) \: , \qquad [0,\infty) \rightarrow \End(\Fin\Meas)

where for a category C, the monoid End(C) consists of all functors C → C with functor composition as monoid multiplication.

Now let us think a bit about how the topological, additive and multiplicative structures on FinMeas and ℝ+ interact with each other. Note that while in general on FinMeas,

(\lambda + \lambda')(p) \ncong \lambda p \oplus \lambda' p

and

0 p \ncong \emptyset,

we do have canonical isomorphisms

\lambda (p \oplus q) \cong \lambda p \oplus \lambda q

and

\lambda \emptyset \cong \emptyset .

So [0,∞) acts not just by endofunctors on FinMeas, but by symmetric monoidal endofunctors. Writing End SymMonCat(C) for the monoid of symmetric monoidal endofunctors of a symmetric monoidal category C, we hence have a monoid homomorphism

[0,\infty) \rightarrow \End_{\Sym\Mon\Cat}(\Fin\Meas)

So, given a topological monoid M, let us call a topological symmetric monoidal category (X, ⊕, 0) equipped with a monoidal functor

M \to End_{\Sym\Mon\Cat}(X)

providing a continuous action of M on X an M-module category. We can summarize the previous two paragraphs by saying that FinMeas is a [0,∞)-module category, and for each α > 0 there is a [0,∞)-module category ℝ+^α.

By a morphism of M-module categories, we mean a symmetric monoidal functor which is compatible with the action of M. In this way, we obtain a category of M-module categories; this is the coslice category over M in the category of symmetric monoidal categories with strict symmetric monoidal functors.

Using these concepts, we see that Theorem 2 implies a nice characterization of Shannon entropy:

Theorem

Suppose F: FinMeas → ℝ+ is a morphism of [0,∞)-module categories. Then for any morphism f: p → q in FinMeas,

F(f) = c(H(p) - H(q))

for some c ≥ 0, where H is Shannon entropy. Conversely, for any constant c ≥ 0 this equation defines a morphism of [0,∞)-module categories.

What about the Tsallis entropy for α ≠ 1? In fact, for each number α > 0 there is a way to make ℝ+ into a [0,∞)-module category, by letting λ ∈ [0,∞) act trivially on objects and as multiplication by λ^α on morphisms. We write ℝ+^α for ℝ+ equipped with this action of [0,∞).

The action of [0,∞) on the category ℝ+^α has the properties already mentioned in the case α=1. Namely, there are equations

\lambda^\alpha(x+x') = \lambda^\alpha x + \lambda^\alpha x'

and

0^\alpha x = 0 \:,

and so this action is again by symmetric monoidal endofunctors, giving a monoid homomorphism

A_\alpha : [0,\infty) \to End_{\Sym\Mon\Cat}(\mathbb{R}_+^\alpha)

This homomorphism A α is also continuous. So, for each α > 0 we see that ℝ+^α is a [0,∞)-module category. Theorem 2 implies:

Theorem

Let α ∈ (0,∞). Suppose F: FinMeas → ℝ+^α is a morphism of [0,∞)-module categories. Then for any morphism f: p → q in FinMeas,

F(f) = c(H_\alpha(p) - H_\alpha(q))

for some c ≥ 0, where H α is the Tsallis entropy of order α. Conversely, for any constant c ≥ 0 this equation defines a morphism of [0,∞)-module categories.

The partition function

Definition

Let FinMeas >0 be the category of strictly positive finite measure spaces, where:

• an object is a finite set S equipped with a strictly positive measure: that is, a measure μ for which every set has nonnegative measure, and the only set of measure zero is the empty set.

• a morphism f: (S,μ) → (S′,μ′) is a function such that the pushforward of μ is μ′.

We define a topological category to be a category internal to Top, and a continuous functor to be a functor internal to Top. There is a category Cat(Top) with topological categories as objects, and continuous functors as morphisms.

We shall make FinMeas >0 into a topological category as follows. The topology on objects and morphisms is the obvious one except that I’m tempted to do the following: when a sequence of measures μ j on a set S is trying to converge (in the obvious sense!) to a measure μ for which only points in some subset T ⊆ S have nonzero measure, we say that

(S,\mu_j) \to (T,\mu|_T)

In simple heuristic terms: call the points in a finite set ‘states’. Then when the probability of certain possible states goes to zero, these states effectively ‘go away’.

Note that one interesting feature of this definition is that the space of objects of FinMeas >0 is connected.

Definition

Let ℝ be the topological category where objects are real numbers and there is a unique morphism f: x → y whenever x ≥ y (and none otherwise).

Let us classify all continuous functors

f: \Fin\Meas_{\gt 0} \to \mathbb{R}

with the property that

f(X \oplus Y) = f(X) + f(Y)

Here ⊕ at left denotes the ‘direct sum’ of measure spaces, while + at right denotes the usual sum of real numbers.

Every functor

F: \Fin\Meas_{\gt 0} \to \mathbb{R}

restricts to a continuous functor

G: core(\Fin\Meas_{\gt 0}) \to \mathbb{R}

Here core(FinMeas >0) is the core of the category FinMeas >0, meaning that we keep the same objects but include only the isomorphisms. Note that this restriction loses no information. So, if we classify all continuous functors

G: core(\Fin\Meas_{\gt 0}) \to \mathbb{R}

obeying the equation G(X ⊕ Y) = G(X) + G(Y), the answer to our problem will be easy.

Proposition

Continuous functors

G: core(\Fin\Meas_{\gt 0}) \to \mathbb{R}

obeying the equation G(X ⊕ Y) = G(X) + G(Y) correspond to continuous functions

g: [0,\infty) \to \mathbb{R}

with the property that g(0)=0.

Proof Sketch - Since every object in FinMeas >0 is a direct sum of one-element measure spaces, the functor G is determined by its values on those. A one-element measure space is just a nonnegative real number, so G is determined by a function

g: [0,\infty) \to \mathbb{R}

Continuity of the functor G implies that g is continuous. Note that G applied to the empty set must give 0. Thanks to the funny topology on FinMeas and the continuity of G, this forces g(0)=0. Conversely, given any such function g we can define G on the finite measure space (S,μ) by

G(S, \mu) = \sum_{i \in S} g(\mu_i)

where μ i is the measure μ of the point i ∈ S.

Proposition

Continuous functors

F: \Fin\Meas_{\gt 0} \to \mathbb{R}

obeying the equation F(X ⊕ Y) = F(X) + F(Y) correspond to continuous functions

f: [0,\infty) \to \mathbb{R}

such that f(0)=0 and f(x+y) ≤ f(x)+f(y).

Proof Sketch - Use the previous proposition and show that for G: core(FinMeas >0) → ℝ to extend to some (necessarily unique) F: FinMeas >0 → ℝ, it is necessary and sufficient that the corresponding function g obeys g(x+y) ≤ g(x)+g(y). We need this condition so that any map from the 2-point measure space to a 1-point measure space gets sent to a morphism in ℝ. Every map in FinMeas >0 is surjective, and every surjection is a composite of coproducts of copies of the identity and these 2-points-to-1 maps, so this is all we need.

Corollary

Continuous functors

G: core(\Fin\Meas_{\gt 0}) \to \mathbb{R}

obeying the equations G(X ⊕ Y) = G(X) + G(Y) and G(X × Y) = G(X) × G(Y) correspond to continuous functions

g: [0,\infty) \to \mathbb{R}

of the form g(x) = x^α with α ≥ 0.

Proof - This should follow from our earlier results.

Corollary

Continuous functors

F: \Fin\Meas_{\gt 0} \to \mathbb{R}

obeying the equations F(X ⊕ Y) = F(X) + F(Y) and F(X × Y) = F(X) × F(Y) correspond to continuous functions

f: [0,\infty) \to \mathbb{R}

of the form f(x) = x^α with α ≤ 1.

Proof - This should follow from our earlier results.

Suppose (S,μ) is an object in FinMeas. The continuous functor

G: core(\Fin\Meas_{\gt 0}) \to \mathbb{R}

corresponding to the function g(x) = x^α is then equal to

G(S,\mu) = \sum_{i \in S} \mu_i^\alpha

We may fix (S,μ) and think of G(S,μ) as a function of α ∈ [0,∞). This function plays an important role in physics, where it is called the partition function of X=(S,μ). We denote this by

Z(X) : [0,\infty) \to [0,\infty),

so that

Z(X)(\alpha) = \sum_{i \in S} \mu_i^\alpha

The following proposition says the partition function is a ‘complete invariant’ of positive finite measure spaces:

Proposition

Two objects of FinMeas >0 have the same partition function if and only if they are isomorphic.

Proof - It is obvious that isomorphic objects of FinMeas >0 have the same partition function. So, we need to show that we can recover X=(S,μ) from its partition function. To see this, note that Z(X)(α) is the Laplace transform of the following measure on the real line (where P x denotes the unit point mass at the point x ∈ ℝ):

\nu = \sum_{i \in S} \mathbf{P}_{\ln \mu_i}

Namely:

\begin{array}{ccl} \int_{-\infty}^\infty e^{\alpha s} \, d\nu(s) &=& \sum_{i \in S} \int_{-\infty}^\infty e^{\alpha s} \, d\mathbf{P}_{\ln \mu_i}(s) \\ &=& \sum_{i \in S} e^{\alpha \ln \mu_i} \\ &=& \sum_{i \in S} \mu_i^\alpha \\ &=& Z(X)(\alpha) \end{array}

So, we can recover the measure ν from the partition function Z(X) by taking its inverse Laplace transform. Furthermore, we can recover the multiset of numbers μ i from the measure ν.

We can restate this proposition in an even more complicated way. For any (essentially small) category C let C̲ be the decategorification of C: that is, the set of isomorphism classes of objects of C. If C is a symmetric rig category then C̲ will naturally become a commutative rig.

Definition

Let the rig of partition functions be the set of functions f: [0,∞) → [0,∞) of the form

Z(\alpha) = \sum_{i \in S} p_i^\alpha

where S is an arbitrary finite set and the p i are arbitrary positive real numbers. This set is closed under pointwise addition and multiplication, making it into a commutative rig.

Proposition

The map Z from FinMeas >0̲ to the rig of partition functions, sending each isomorphism class of strictly positive finite measure spaces to its partition function, is an isomorphism of commutative rigs.

Proof - When you penetrate the jargon, the only nonobvious part here is that Z is a bijection, which was shown in the previous proposition.
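Here is a small numerical sketch of this rig homomorphism property (Python; representing a strictly positive finite measure space simply as the list of its point masses is our own ad-hoc encoding, and we take × to be the product set with the product measure): Z sends direct sums to pointwise sums and products to pointwise products.

```python
def Z(mu, alpha):
    """Partition function Z(X)(alpha) = sum_i mu_i^alpha."""
    return sum(m ** alpha for m in mu)

def direct_sum(mu, nu):
    """Disjoint union of measure spaces: concatenate the lists of point masses."""
    return mu + nu

def product(mu, nu):
    """Product set with the product measure: all pairwise products of masses."""
    return [a * b for a in mu for b in nu]

X, Y = [0.5, 1.5, 2.0], [0.25, 0.75]
for alpha in [0.0, 0.5, 1.0, 2.0, 3.7]:
    assert abs(Z(direct_sum(X, Y), alpha) - (Z(X, alpha) + Z(Y, alpha))) < 1e-9
    assert abs(Z(product(X, Y), alpha) - Z(X, alpha) * Z(Y, alpha)) < 1e-9
print("Z turns direct sums into sums and products into products.")
```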

Convex algebras

Definition

The monad for convex sets is a monad on Set sending any set X to the set of finitely-supported probability distributions on X.

For example, this monad sends {1,…,n} to a set P n, which can be identified with the (n−1)-simplex. This monad is a finitary monad, so can be thought about in a few different ways.

A finitary monad is the same thing as a finitary algebraic theory. This one can be presented by a family (∗ t) t∈[0,1] of binary operations, subject to some equations. They’re probably exactly those equations that hold in a convex subset of ℝ^n if we put x ∗ t y = (1−t)x + ty.

(I suspect there’s a theorem here: that for n ≥ 1, ℝ^n “satisfies no laws”. This means there are no equations between the various operations (x,y) ↦ (1−t)x + ty beyond those forced by its being an algebra for this theory.)

A finitary algebraic theory can also be thought of as a kind of souped-up operad. In a symmetric operad P, one has for each bijection σ: {1,…,n} → {1,…,n} an induced map σ *: P n → P n. In a finitary algebraic theory, one has the same thing for arbitrary functions between finite sets, not just bijections. In other words, a finitary algebraic theory amounts to a non-symmetric operad P together with, for each function θ: {1,…,m} → {1,…,n} between finite sets, an induced map θ *: P m → P n, satisfying suitable axioms.

Definition

The underlying symmetric operad for the convex monad is called the operad for convex algebras, and is denoted P. An algebra of P is called a convex algebra.

The space of n-ary operations for this operad is P n, the space of probability distributions on {1,…,n}. The composition of operations is supposed to be obvious, but we should probably write down a formula for it. The maps “θ *: P m → P n” can be defined by pushforward of measures. An algebra for the algebraic theory of convex algebras is an algebra X for the operad with the further property that the square

\begin{matrix} \mathbf{P}_m \times X^n &\stackrel{1 \times \theta^*}{\to} &\mathbf{P}_m \times X^m \\ \theta_*\times 1 \downarrow & &\downarrow \\ \mathbf{P}_n \times X^n &\to &X \end{matrix}

commutes for all θ: {1,…,m} → {1,…,n}.

Note that P is naturally a topological operad, where the topology on P n is the usual topology on the (n−1)-simplex. In our applications, we focus on algebras of P in topological categories 𝒱 with finite products. We call these convex algebras in 𝒱. Here are some examples:

  • Any convex subset of ℝ^n is naturally a convex algebra in Top. In particular, [0,∞) is such.

  • Moreover, if we regard the topological monoid ([0,∞),+,0) as a one-object topological category, then it is a convex algebra in Cat(Top).

  • FinMeas is also a convex algebra in Cat(Top).

  • FinProb is also a convex algebra in Cat(Top).

Shannon entropy

Fadeev’s characterization

Rényi presents a nice characterization of Shannon entropy due to Fadeev here:

It is the first result mentioned in the paper. To translate into our framework, it helps to write a probability measure on the finite set {1,…,n} as an n-tuple of numbers p i ∈ [0,1] with p 1 + ⋯ + p n = 1.

Theorem

(Fadeev) Suppose F: FinProb 0 → [0,∞) is a continuous functor obeying the magical property:

F(t p_1, (1-t) p_1, p_2, \dots, p_n) = F(p_1, \dots, p_n) + p_1 F(t, 1-t)

for all t ∈ [0,1]. Then F is a nonnegative multiple of the Shannon entropy H 1.

An operadic formulation

One may reformulate Fadeev’s theorem in terms of operads. The details are written out here:

What follows is a sketch.

Let P be an operad in a finite product category 𝒱. We may speak of (strict) P-algebras in Cat(𝒱), the category of internal categories in 𝒱. There is a notion of lax map between P-algebras in Cat(𝒱).

Briefly put: if X and Y are P-algebras, a lax map from X to Y is a functor X → Y together with, for each n, a natural transformation

\begin{matrix} P_n \times X^n &\to &P_n \times Y^n \\ \downarrow &\Leftarrow &\downarrow \\ X &\to &Y \end{matrix}

satisfying one axiom involving composition in P, one involving the unit of P, and, if P is symmetric, one involving the symmetric group action on P.

Now take X=1. Let us call a lax map 1 → Y a lax point of Y. Batanin has called them “internal P-algebras” in Y, and maybe we’ll end up preferring that name.

Examples

  • If 𝒱 = Set and P is the terminal non-symmetric operad then Y is a monoidal category and a lax point of Y is a monoid in Y.

  • If 𝒱 = Set and P is the non-symmetric operad for M-sets, where M is a monoid, then Y is a category with an M-action and a lax point of Y is an object y ∈ Y together with a map m⋅y → y for each m ∈ M, satisfying the obvious action-type axioms.

In what follows we apply this idea to the case where 𝒱 = Top and P is the symmetric, topological operad in which P n is the space of probability distributions on {1,…,n} (otherwise known as the (n−1)-simplex).

Proposition

The lax points of the P-algebra (ℝ,+,0) are precisely the scalar multiples of Shannon entropy.

Sketch of proof - Write out the axioms in the definition of lax map and unwind them in this particular case. One concerns composition, and this becomes the most substantial of the Fadeev axioms on Shannon entropy: the so-called ‘magical property’ mentioned above. One concerns units, and is redundant. One concerns symmetry. Then because we’re entirely in Top, you’ve got one concerning continuity. So we’ve got all the Fadeev axioms except for normalization, and the result follows from (Rényi’s statement of) Fadeev’s theorem.

There’s a very similar characterization of Shannon extropy, or rather the scalar (real) powers of Shannon extropy. There we simply use the P-algebra ((0,∞),⋅,1), which is isomorphic to (ℝ,+,0) by taking exponentials.

We might also want to get rid of the “scalar multiples/powers” part and hit it on the nose. This seems more feasible with extropy rather than entropy, since there’s no choice of base. Still, I don’t see a nice way to do it.

Another operadic formulation

This section is in a more informal style…

A math puzzle

We begin with a sadly familiar problem:

Suppose you live in a town with a limited number of tolerable restaurants. Every Friday you go out for dinner. You randomly choose a restaurant according to a certain probability distribution $P$. If you go to the $i$th restaurant, you then choose a dish from the menu according to some probability distribution $Q_i$. How surprising will your choice be, on average?

Note: I’m only talking about how surprised you are by the choice itself—not about any additional surprise due to the salad being exceptionally tasty, or the cook having dropped her wedding ring into your soup, or something like that. So if there’s only one restaurant in town and they only serve one dish, you won’t be at all surprised by your choice. But if there were thousands of restaurants serving hundreds of dishes each, there’d be room for a lot of surprise.

This still sounds like an impossibly vague puzzle. So, I should admit that while ‘surprise’ sounds psychological and very complicated, I’m really just using it as a cute synonym for Shannon entropy. Shannon entropy can be thought of as a measure of ‘expected surprise’.

Now, ‘expected surprise’ sounds like an oxymoron: how can something expected be a surprise? But in this context ‘expected’ means ‘average’. The idea is that your ‘surprise’ at an outcome with probability p is defined to be −ln(p). Then, if there are a bunch of possible outcomes distributed according to some probability distribution P, the ‘expected’ surprise or Shannon entropy is:

(1) S(P) = - \sum_i p_i \ln(p_i)

where p i is the probability of the ith outcome. This is a weighted average where each surprise is weighted by its probability of happening.

So, we really do have a well-defined math puzzle, and it’s not even very hard.

Glomming together probabilities

But the interesting thing about this problem is that it involves an operation which I’ll call ‘glomming together’ probability distributions. First you choose a restaurant according to some probability distribution P on the set of restaurants. Then you choose a meal according to some probability distribution Q i. If there are n restaurants in town, you wind up eating meals in a way described by some probability distribution we’ll call

P \circ (Q_1, \dots, Q_n )

A bit more formally:

Suppose P is a probability distribution on the set {1,…,n} and Q i are probability distributions on finite sets X i, where i=1,…,n. Suppose the probability distribution P assigns a probability p i to each element i ∈ {1,…,n}, and suppose the distribution Q i assigns a probability q ij to each element j ∈ X i.

Then we can glom together the probability distributions Q i using P and get a new probability distribution P∘(Q 1,…,Q n) on the disjoint union of all the sets X i. The resulting probability for each element (i,j) in the disjoint union X 1 ⊔ ⋯ ⊔ X n is just p i q ij.

These probabilities sum to 1 as they should:

\sum_{i, j} p_i q_{i j} = \sum_i \left( p_i \sum_j q_{i j} \right) = \sum_i p_i = 1

Note: what j ranges over depends on i! But it’s all kosher.
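Here is a minimal Python sketch of this glomming operation (the function name glom is ours); it flattens P∘(Q 1,…,Q n) into a single list of probabilities p i q ij and checks that they sum to 1.

```python
def glom(P, Qs):
    """Glom Q_1,...,Q_n together using P: outcome (i, j) gets probability p_i * q_ij."""
    assert len(P) == len(Qs)
    return [p * q for p, Q in zip(P, Qs) for q in Q]

P = [0.3, 0.7]                      # choice of restaurant
Qs = [[0.5, 0.5], [0.2, 0.3, 0.5]]  # choice of dish at each restaurant
PQ = glom(P, Qs)
print(PQ, sum(PQ))                  # the glommed probabilities sum to 1
```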

Glomming together numbers

There’s something even simpler than glomming together probability measures: we can glom together numbers!

Suppose we have a probability measure P on the set {1,…,n} and for each element i ∈ {1,…,n} we have a number x i. Then we can define a new number P(x 1,…,x n) by

P(x_1, \dots, x_n) = \sum_{i = 1}^n p_i x_i

This is just the ‘weighted average’ of the numbers x i. We have already seen a weighted average in the definition of Shannon entropy, equation (1).

Glomming together entropies

Now we’re ready to answer the math puzzle:

(2) S(P\circ(Q_1, \ldots, Q_n)) = S(P) + P(S(Q_1), \ldots, S(Q_n))

On the left we’re glomming together a bunch of probability distributions using P and then taking the entropy. On the right we’re taking the entropies of those probability distributions and then glomming the resulting numbers together using P. But there’s also an extra term: the entropy of P!

In other words: your expected surprise won’t be just the weighted average of your expected surprises at the various different restaurants, namely P(S(Q 1),,S(Q n)). It’ll be that plus the expected surprise of your choice of restaurant, namely S(P).

You might not have expected such a simple formula. But it’s easy to prove:

\begin{aligned} S(P\circ(Q_1, \ldots, Q_n)) &= -\sum_{i, j} p_i q_{i j} \ln(p_i q_{i j}) \\ &= - \sum_{i, j} p_i q_{i j} \ln(p_i) - \sum_{i, j} p_i q_{i j} \ln(q_{i j}) \\ &= - \sum_{i} p_i \ln(p_i) + \sum_{i} p_i S(Q_i) \\ &= S(P) + P(S(Q_1), \dots, S(Q_n)) \end{aligned}
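The identity is also easy to confirm numerically; a quick self-contained Python check of equation (2) on a small example (helper names ours):

```python
import math

def shannon(P):
    return -sum(p * math.log(p) for p in P if p > 0)

def glom(P, Qs):
    return [p * q for p, Q in zip(P, Qs) for q in Q]

P = [0.3, 0.7]
Qs = [[0.5, 0.5], [0.2, 0.3, 0.5]]
lhs = shannon(glom(P, Qs))
rhs = shannon(P) + sum(p * shannon(Q) for p, Q in zip(P, Qs))
print(lhs, rhs)
assert abs(lhs - rhs) < 1e-12
```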

Fadeev’s theorem

The formula for glomming together entropies is more important than you might think: it comes very close to characterizing Shannon entropy! We can restate Fadeev’s theorem (mentioned above) as follows:

Theorem

(Fadeev) Suppose F is a function sending probability measures on finite sets to real numbers. Suppose that:

  1. F is invariant under bijections.

  2. F is continuous.

  3. F(P∘(Q 1,…,Q n)) = F(P) + P(F(Q 1),…,F(Q n)).

Then F is a real multiple of Shannon entropy.

In item 1 we’re using the fact that a bijection f: X → X′ between finite sets together with a probability distribution on X gives a probability distribution on X′; we want these to have the same entropy. In item 2 we’re using the standard topology on ℝ^n to put a topology on the set of probability distributions on any n-element set. For a proof of this theorem, see the beginning of Rényi's 1961 paper.

What’s going on?

While this equation is cute:

S(P\circ(Q_1, \ldots, Q_n)) = S(P) + P(S(Q_1), \ldots, S(Q_n))

it’s a bit tricky to find its proper place in the world of abstract algebra. A probability distribution P can act either on a list of probability distributions or a list of numbers. Shannon entropy gives a map S from probability distributions to numbers. So, if you’re algebraically inclined, you would want S to be a ‘homomorphism’, obeying a law like this:

S(P\circ(Q_1, \ldots, Q_n)) = P(S(Q_1), \ldots, S(Q_n)) \qquad (sorry, \; not \; true)

We see laws of this sort all over math. But the true law has an extra term. What’s going on?

The ‘glomming’ business can be formalized using operads, and Tom’s answer is roughly: Shannon entropy is not a ‘homomorphism’ of operad algebras, but only a ‘lax homomorphism’. For an explanation of what this means, go here.

Right now I want to propose an alternative answer. I hope we can combine it with Tom’s answer and reach a deeper understanding of what’s going on.

For starters, consider another law that has an ‘extra term’:

\frac{d} {d x} (f g) = (\frac{d}{d x} f )g + f (\frac{d}{d x} g)

In math jargon, the product rule says that taking the derivative of a function at a point is not a homomorphism from smooth functions to real numbers: it’s a ’derivation’. We can get a derivation on an algebra by differentiating a family of algebra homomorphisms. Similarly, we can get the funny rule obeyed by entropy by differentiating a family of operad algebra homomorphisms.

Let’s see how.

Glomming together partition functions

There’s something more fundamental than the Shannon entropy of a probability distribution: namely, its partition function. At least that’s how physicists think. Let me explain.

Suppose P is a probability measure on a finite set X. Let p i be the probability of the element i ∈ X. We can always write

p_i = e^{-H_i}

where

H : X \to (-\infty, \infty]

is a function called the energy or Hamiltonian. The idea is that the probability of a system being in some state decreases exponentially with the energy of that state; we allow the energy +∞ to account for zero probabilities.

But physicists always go further and introduce a parameter β ∈ (0,∞) which stands for the reciprocal of temperature, to capture the fact that states of high energy are even less likely to be occupied when it’s cold. Now we make p i into a function of β:

p_i(\beta) = e^{-\beta H_i}

or if you prefer, simply

p_i(\beta) = p_i^\beta

When β=1, these numbers are just the probabilities p i. But when β ≠ 1, there’s no reason for them to sum to 1. To get a probability measure, we’d have to divide by a fudge factor called the partition function:

Z(P) = \sum_{i \in X} e^{-\beta H_i}

But I won’t do that just now: I’ll let P(β) be the measure on the set X that assigns the measure p i(β) to the point i ∈ X.

Now everything is a function of β! But everything reduces to something familiar at β=1, so we can stretch our old notation without breaking it. We now have functions p i sending numbers β ∈ (0,∞) to numbers, and a function P sending numbers β to measures on X. These reduce to our original p i and P at β=1. The partition function is also a function of β, and this equals 1 at β=1.

But here’s the important thing: the partition function obeys a simpler law than entropy when we glom probability measures together! Suppose P is a probability distribution on the set {1,…,n} and Q i are probability distributions on finite sets X i, where i=1,…,n. Then

(3) Z(P \circ (Q_1, \dots, Q_n)) = P (Z(Q_1), \dots, Z(Q_n))

So, in fancy jargon, Z is an operad algebra homomorphism!

But I need to explain what this equation means. First of all, everything is a function of β. Second, while P and Q i are only probability measures at β=1, they’re perfectly fine measures at other values of β, so all our ‘glomming’ formulas still make sense.

Let’s check to make sure we know what’s going on. An expression like P∘(Q 1,…,Q n) could be ambiguous. We have a recipe for taking a probability measure on a finite set and extending it to a function from numbers β ∈ (0,∞) to measures on that set. So, we can extend P and Q i this way and then ‘glom’ them to get a measure-valued function of β. On the other hand, we can glom them and then extend them. Luckily, the results agree.

Let’s see why. Suppose P assigns the point i ∈ {1,…,n} the probability p i. When we extend this to a function of β we get p i^β. Similarly, suppose Q i assigns the point j ∈ X i the probability q ij. When we extend this to a function of β we get q ij^β. When we glom these, the measure of the point (i,j) in the disjoint union of the X i will be this function of β:

p_i^\beta q_{i j}^\beta

But this equals

(p_i q_{i j})^\beta

which is the result of glomming and then extending.

The right-hand side of equation (3) is also unambiguous… but I’m wasting time: if I were a physicist I’d be done proving this equation by now, instead of stewing around explaining what it means. It’s incredibly easy to prove. From what I’ve said,

Z(P \circ (Q_1, \dots, Q_n)) = \sum_{i, j} p_i^\beta q_{i j}^\beta = \sum_i p_i^\beta Z(Q_i) = P (Z(Q_1), \dots, Z(Q_n))
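Again this is easy to confirm numerically. In the sketch below (Python, names ours), ‘P applied to the numbers Z(Q i)’ is computed with P itself extended to the measure p i^β, exactly as in the calculation above.

```python
def glom(P, Qs):
    """Glom measures: the point (i, j) gets mass p_i * q_ij."""
    return [p * q for p, Q in zip(P, Qs) for q in Q]

def Z(P, beta):
    """Partition function of a measure, as a function of beta."""
    return sum(p ** beta for p in P)

P = [0.3, 0.7]
Qs = [[0.5, 0.5], [0.2, 0.3, 0.5]]
for beta in [0.5, 1.0, 2.0, 3.0]:
    lhs = Z(glom(P, Qs), beta)
    rhs = sum(p ** beta * Z(Q, beta) for p, Q in zip(P, Qs))
    assert abs(lhs - rhs) < 1e-12
print("Z(P o (Q_1,...,Q_n)) = P(Z(Q_1),...,Z(Q_n)) for every beta tested.")
```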

From partition function to entropy

How is the Shannon entropy of a probability distribution related to its partition function? Simple:

(4) S(P) = - \left. \frac{d}{d \beta} Z(P) \right|_{\beta = 1}

Here I must emphasize that I’m only talking about the Shannon entropy S(P) of the original probability distribution, ‘at β=1’. Physicists extend S(P) to a function of β, along with everything else. That would be very good to do, but it goes beyond our goal now, and it would make the formula relating S and Z a bit more complicated, so let’s not.

Our goal now is simply to get the rule for glomming entropies by differentiating the rule for glomming partition functions. So, let’s do that using equation (4). Later I’ll show you why (4) is true.

We start with the rule for glomming partition functions:

Z(P \circ (Q_1, \dots, Q_n)) = P (Z(Q_1), \dots, Z(Q_n))

We differentiate with respect to β and use the product rule, taking advantage of the fact that P(Z(Q 1),…,Z(Q n)) is linear in P but also linear in each of the functions Z(Q i):

\frac{d}{d \beta} Z(P \circ (Q_1, \dots, Q_n)) = \frac{d P}{d \beta} (Z(Q_1), \dots, Z(Q_n)) + P (\frac{d}{d \beta} Z(Q_1), \dots, \frac{d}{d \beta} Z(Q_n))

Now set β=1. We can use equation (4) and the fact that all our partition functions equal 1 at β=1:

- S(P \circ (Q_1, \dots, Q_n)) = \left. \frac{d P}{d \beta} (1,\dots, 1) \right|_{\beta = 1} - P(S(Q_1), \dots, S(Q_n))

Hey, it looks like we’re almost done! As you can see, the product rule did most of the work, so we’re really saying that S is like a ‘derivation’. We just need to work out that funny-looking first term on the right-hand side. It amounts to taking a weighted sum of 1’s, which is just a sum:

\left. \frac{d P}{d \beta} (1,\dots, 1)\right|_{\beta = 1} = \sum_i \left. \frac{d p_i(\beta)}{d \beta} \right|_{\beta = 1}

and we have

\frac{d p_i(\beta)} {d\beta} = \frac{d}{d\beta} e^{-\beta H_i} = -H_i e^{-\beta H_i}

so

\left. \frac{d p_i(\beta)} {d\beta} \right|_{\beta = 1} = - H_i e^{-H_i} = \ln(p_i) p_i

so

\left. \frac{d P}{d \beta} (1,\dots, 1)\right|_{\beta = 1} = \sum_i p_i \ln(p_i) = - S(P)

So, we’ve got it!

S(P \circ (Q_1, \dots, Q_n)) = S(P) + P(S(Q_1), \dots, S(Q_n))

The funny-looking rule for glomming entropies is just the derivative of the rule for glomming partition functions!

But before we go further, let’s check rule (4). For starters,

\left. \frac{d}{d \beta} Z(P) \right|_{\beta = 1} = \left. \frac{d}{d \beta} \sum_i e^{-\beta H_i} \right|_{\beta = 1}

But we’ve just seen that

\left. \frac{d}{d\beta} e^{-\beta H_i} \right|_{\beta = 1} = p_i \ln(p_i)

so

\left. \frac{d}{d \beta} Z(P) \right|_{\beta = 1} = \sum_i p_i \ln(p_i) = - S(P)
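Equation (4) can also be checked with a finite difference; a small Python sketch:

```python
import math

def shannon(P):
    return -sum(p * math.log(p) for p in P if p > 0)

def Z(P, beta):
    return sum(p ** beta for p in P)

P = [0.1, 0.2, 0.3, 0.4]
h = 1e-6
dZ = (Z(P, 1 + h) - Z(P, 1 - h)) / (2 * h)   # centred difference for dZ/dbeta at beta = 1
print(-dZ, shannon(P))                        # agree up to discretization error
```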

The upshot

As we’ve seen already, there is a deeper reason for being interested in the partition function: it’s actually a form of ‘decategorification’!

Let FinProb 0 be the core of FinProb, as defined above. Then

Z : FinProb_0 \to R

can be thought of as a functor from FinProb 0 to its set of isomorphism classes, viewed as a category with only identity morphisms. In other words, Z assigns to each object P ∈ FinProb 0 its partition function Z(P)… but we can recover P up to isomorphism from Z(P).

Now, FinProb 0 is an algebra of a certain operad P whose n-ary operations are the probability measures on the set {1,…,n}. This is just a fancy way of talking about ‘glomming probability measures’. As a kind of consequence, the set of partition functions also becomes a P-algebra. Furthermore, Z becomes a homomorphism of P-algebras. This last statement mostly just says that

Z(P \circ (Q_1, \dots, Q_n)) = P (Z(Q_1), \dots, Z(Q_n))

But then we can take Z, differentiate it with respect to β, and set β=1. By abstract nonsense, the resulting functor should be some sort of ‘derivation’ of the Conv-algebra FinProb 0. Concretely, we’ve seen this mostly says that

S(P\circ(Q_1, \ldots, Q_n)) = S(P) + P(S(Q_1), \ldots, S(Q_n))

But there should also be an abstract-nonsense theory of derivations of operad algebras. (This was discussed a bit back in week299, but only a little). So, an interesting question presents itself:

How does the ‘derivation’ way of thinking about the law S(P∘(Q 1,…,Q n)) = S(P) + P(S(Q 1),…,S(Q n)) relate to Tom Leinster’s interpretation of it in terms of lax operad homomorphisms, or more precisely ‘lax points’?

If we get this straightened out, it will also be tempting to extend the story to Rényi entropy, using the fact that Rényi entropy is a kind of $q$-derivative of minus the logarithm of the partition function. The q-derivative is a kind of ‘twisted derivation’, but a bunch of the same calculations may work in some modified form.

Power means

There is a categorical characterization of the power means (of order ≥ 1), similar in spirit to some of the other results on this page. It follows from the characterization in

  • Tom Leinster, A multiplicative characterization of the power means, arXiv:1103.2574.

The real half-line as a convex algebra

Write ℝ+ = [0,∞). For each t ≥ 0, there is a convex algebra structure on the set ℝ+, given by taking the power mean of order t. The action

M_t: \mathbf{P}_n \times \mathbb{R}_+^n \to \mathbb{R}_+

is given by M t(p,x) = (∑ i p i x i^t)^{1/t} if 0 < t < ∞, by ∏ i x i^{p i} if t = 0, and by max i: p i>0 x i if t = ∞.
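Here is a small Python sketch of these power means (the function name is ours), handling the cases t = 0 and t = ∞ separately as in the definition above, and also allowing negative t in the standard way; the printed values illustrate that M t(p,x) is increasing in t.

```python
import math

def power_mean(t, p, x):
    """Power mean M_t(p, x) of the numbers x_i with weights p_i (assumed to sum to 1)."""
    pairs = [(pi, xi) for pi, xi in zip(p, x) if pi > 0]
    if t == 0:                                  # geometric mean
        return math.prod(xi ** pi for pi, xi in pairs)
    if t == math.inf:                           # max over outcomes of positive weight
        return max(xi for _, xi in pairs)
    return sum(pi * xi ** t for pi, xi in pairs) ** (1.0 / t)

p = [0.2, 0.5, 0.3]
x = [1.0, 4.0, 9.0]
print([round(power_mean(t, p, x), 4) for t in [-1, 0, 0.5, 1, 2, math.inf]])
```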

Now put the order on ℝ+ to make it into a category. Since M t(p,−) is increasing (for each t and p), the power mean of any order t ≥ 0 makes ℝ+ into a convex algebra in Cat.

Now put the additive structure (+,0) on ℝ+, to make it into a monoidal category. For t ≥ 1 we have

M_t(p, x) + M_t(p, y) \geq M_t(p, x + y)

and M t(p,0) = 0, so M t(p,−) is a lax monoidal functor. So the power mean of any order t ≥ 1 makes ℝ+ into a convex algebra in MonCat, the category of monoidal categories and lax monoidal functors.

This structure is, moreover, homogeneous:

M_t(p, c x) = c M_t(p, x)

whenever c ∈ ℝ+ and (p,x) ∈ P n × ℝ+^n.

The characterization

Proposition

The convex algebra structures on ℝ+ ∈ MonCat satisfying the homogeneity axiom are precisely the power means M t (1 ≤ t ≤ ∞).

Sketch proof - We show that any homogeneous convex algebra structure on ℝ+ satisfies the hypotheses of the characterization theorem in arXiv:1103.2574, and is therefore one of the power means. This is elementary manipulation. The axiom called “functoriality” in that paper is exactly the commutativity of the square drawn above. “Consistency” is the unit axiom for an operad action. “Multiplicativity” follows from the composition axiom for an operad action, together with homogeneity.

I wouldn’t call this result a rephrasing of the result in my paper; it’s probably rather weaker (i.e. easier to prove), because the all-important multiplicativity axiom involves only a very restricted kind of operadic composition.

Remarks on homogeneity

The proposition would be nicer if the homogeneity axiom could be either dropped or turned into something more categorical. I don’t know what to do about this.

One possibility is to observe that the monoid (ℝ+,⋅,1) acts on the monoidal category (ℝ+,≤,+,0), and that homogeneity amounts to preservation of that action. So we could think about convex algebras in ℝ+-MonCat, the category of monoidal categories equipped with an action of the multiplicative monoid ℝ+. That should work, but seems clumsy.

Another possibility is to use the following strange kind of monoidal functor. Let’s say that a lax monoidal functor F: A → B is homogeneous if for all a ∈ A and all n, the coherence map

F(a)^{\otimes n} \to F(a^{\otimes n})

is invertible. The composite of homogeneous functors is again homogeneous. Let MonCat h be the category of monoidal categories and homogeneous lax functors. Then the convex algebra structures on ℝ+ ∈ MonCat h are exactly the power means of order t ∈ [1,∞]. (Homogeneity of this kind only tells us that M(p,cx) = c M(p,x) when c is a natural number, but we can deduce it for all positive reals.) That’s a reasonably clean statement, but the concept of homogeneous lax functor seems strange, and was invented only for this purpose. I’ve never seen it before.

I don’t know whether the homogeneity axiom is simply redundant. I’ve tried to prove that it is, and I’ve tried to prove that it isn’t, without success.

Pairing measures with functions

This is Tom, still.

To study quadratic forms, you really need to study bilinear forms. Similarly, I wonder whether here, it would help to look not just at expressions such as ∑ i p i^α, but also at the associated expressions ∑ i p i x i^{α−1}. (You see such expressions in the definition of power mean; I have in the back of my mind the kind of thing in arXiv:1103.2574, and closely related things in my work on diversity.)

In any case, here’s a theorem in the same spirit as some of those above.

Let FinMeasFun be the topological category in which:

  • an object is a finite set S equipped with a (nonnegative) measure μ and a function ϕ: S → [0,∞)

  • a map (S,μ,ϕ) → (S′,μ′,ϕ′) is a map f: S → S′ of finite sets such that the pushforward of μ along f is μ′ and ϕ′∘f = ϕ.

Since the base space is a mere finite set, μ and ϕ are actually of the same type. The fancy terminology hides this, and maybe we’re embarrassed to admit it – we’re probably dreaming of higher things. Nevertheless, we’ll take advantage of it later.

FinMeasFun is a topological rig category, in an obvious way. So too is ℝ 0, the topological rig of all real numbers regarded as a rig category with no maps other than identities. (The 0 subscript is supposed to mean “the only maps are isos”, as in what John wrote.)

A rig functor

\Fin\Meas\Fun \to \mathbb{R}_0

is determined by its effect on objects, since ℝ 0 is categorically discrete. It’s the same thing as a homomorphism from Π 0(FinMeasFun), the rig of connected-components of FinMeasFun, to ℝ. For each t ≥ 0, there’s a continuous rig functor J t: FinMeasFun → ℝ 0 given by

J_t(S, \mu, \phi) = \sum_{s \in S} \mu_s \phi(s)^t.

(Or more fancily: J t(S,μ,ϕ) = ∫ S ϕ^t dμ. In the case t=0 and ϕ(s)=0, continuity forces us to put 0^0 = 1.)

Proposition

The continuous rig functors FinMeasFun → ℝ 0 are precisely the functors J t (t ≥ 0).

Sketch proof - Let J be a continuous rig functor. As usual, J is determined by what it does to singletons. Multiplicativity and continuity imply that J({⋆},μ,ϕ) = μ(⋆)^u ϕ(⋆)^t for some constants u and t. Then functoriality (i.e. constancy of J on connected-components) implies that u=1.

Remark: I know 2×2=4 versions of this proposition, including the one just stated. The first choice is whether to “allow zero” or not. By “allowing zero” I mean that in the definition of FinMeasFun, μ s ≥ 0 and ϕ(s) ≥ 0 (as above); by “not allowing zero” I mean that we demand μ s > 0 and ϕ(s) > 0.

The second choice is whether to work internally to Top or Poset. (The point here is that in order to exclude the non-obvious solutions to the Cauchy functional equation ψ(a+b)=ψ(a)+ψ(b), you need some extra hypothesis such as continuity or order-preservation.) The order on the objects of FinMeasFun is: (S,μ,ϕ) ≤ (S′,μ′,ϕ′) iff S = S′, μ s ≤ μ′ s for all s ∈ S, and ϕ(s) ≤ ϕ′(s) for all s ∈ S. The order on the maps of FinMeasFun is: f ≤ g iff dom(f) ≤ dom(g), cod(f) ≤ cod(g), and f = g as functions.

Here are the four versions:

  1. Allowing 0 and working in Top. This is the proposition above, where the continuous rig functors are J t for t ≥ 0.

  2. Not allowing 0 and working in Top. Then we get J t for all t.

  3. Allowing 0 and working in Poset. Then we get J t for t ≥ 0, together with another rig functor J ε. This new functor J ε is almost the same as J 0, but whereas in the definition of J 0 we take 0^0 = 1, in the definition of J ε we take 0^0 = 0. Explicitly, J ε(S,μ,ϕ) = lim_{t→0+} J t(S,μ,ϕ) = μ(supp ϕ) (whereas J 0(S,μ,ϕ) = μ(S)).

  4. Not allowing 0 and working in Poset. Then we get J t for t ≥ 0 (but not J ε; or rather, we don’t see the difference between J ε and J 0).

Now I’ll take advantage of the fact that a measure on a finite set is the same thing as a function: if (S,μ) ∈ FinMeas then (S,μ,μ) ∈ FinMeasFun. (I’m going to use FinMeas in a slightly different sense to John, dropping his assumption that when (S,μ) is an object of FinMeas, the only subset of S to have measure zero is the empty set. So an object of my FinMeas is just a finite set equipped with a measure.)

Digression: Actually, what I’m really taking advantage of is the fact that any measure on a finite set gives rise to a function. This is, if you like, its Radon-Nikodym derivative with respect to counting measure. So counting measure is playing a silent role as the canonical reference measure on a finite set.

I’ll invent some terminology. A map f: (S,μ) → (S′,μ′) in FinMeas is an equivalence if there exist subsets A ⊆ S and A′ ⊆ S′ such that f^{-1}(A′) = A, the induced map A → A′ is a bijection, and μ(S∖A) = 0 = μ′(S′∖A′). I think this is a standard concept in measure theory, but I forget the standard name; maybe just “measure-isomorphism”?

Lemma

The maps (S,μ,μ) → (S′,μ′,μ′) in FinMeasFun are the equivalences (S,μ) → (S′,μ′).

Proof - straightforward manipulation of the definitions.

Let FinMeas 1 be the category whose objects are those of FinMeas and whose maps are the equivalences. Thus, FinMeas 1 embeds as a full subcategory of FinMeasFun:

\Fin\Meas_1 \hookrightarrow \Fin\Meas\Fun

via (S,μ) ↦ (S,μ,μ). Also, FinMeas 1 is closed under the rig category operations on FinMeasFun, so it’s a sub-rig-category.

John’s remarks on convergence in FinMeas suggest that perhaps he really wants to be using FinMeas 1 instead of FinMeas 0…? In any case, FinMeas 0 is a sub-rig-category of FinMeas 1, and it’s easy to see that the continuous rig functors FinMeas 1 → ℝ 0 are exactly those of the form G α (α ∈ ℝ), where

G_\alpha (S, \mu) = \sum_{s \in S} \mu_s^\alpha.

This is basically Corollary 1 above.

We now have a commutative triangle, or we would if I knew how to draw one. The composite

\Fin\Meas_1 \hookrightarrow \Fin\Meas\Fun \stackrel{J_{\alpha - 1}}{\to} \mathbb{R}_0

is equal to

\Fin\Meas_1 \stackrel{G_\alpha}{\to} \mathbb{R}_0,

for every α. Algebraically, this just says that

\sum_s \mu_s \mu_s^{\alpha - 1} = \sum_s \mu_s^\alpha.

But it seems to me to be noteworthy, because we naturally see the appearance of α1 (as opposed to α), which is a recurrent player in this drama.

The basic inequalities of information theory

For random variables X, Y, let us write (X,Y) for the ‘joint random variable’ whose distribution is the joint distribution of X and Y. Its entropy is the joint entropy H((X,Y)), which we also write as H(X,Y).

The basic inequalities of information theory are the following inequalities for Shannon entropy:

  • Non-negativity of entropy: H(X) ≥ 0 for all random variables X.
  • Non-negativity of conditional entropy: H(X|Y) = H(X,Y) − H(Y) ≥ 0 for all X, Y.
  • Non-negativity of mutual information: I(X:Y) = H(X) + H(Y) − H(X,Y) ≥ 0 for all X, Y.
  • Non-negativity of conditional mutual information: I(X:Y|Z) = −H(Z) + H(X,Z) + H(Y,Z) − H(X,Y,Z) ≥ 0 for all X, Y, Z.

Some of these inequalities can be generalized to relative Shannon entropy, Rényi entropy, or (combining both generalizations) relative Rényi entropy, also known as ‘Rényi divergence’. For a fairly thorough introduction to inequalities obeyed by Rényi divergence, see:

• T. A. L. van Erven, When Data Compression and Statistics Disagree: Two Frequentist Challenges For the Minimum Description Length Principle, Chap. 6: Rényi Divergence, Ph.D. thesis, Leiden University, 2010.

Here however we will start by studying these inequalities for the Shannon entropy of probability measures on finite sets.

The first three types of inequalities can be seen as degenerate cases of the fourth, obtained by taking one of the variables to be deterministic.

We have already seen how to formulate the first two properties categorically: for the Shannon entropy functor

H : \Fin\Prob \rightarrow \left(\mathbb{R}_+,\leq\right)^{\mathrm{op}} \:,

the non-negativity of entropy is clear by the definition of the target category, while the non-negativity of conditional entropy follows by considering H on the projection map (X,Y) → Y.

But what about the other two types of basic inequalities?

Coslice categories and mutual information

Here’s how to do it for the non-negativity of mutual information. First of all, we need to make sense of the concept of joint distribution. Regarding random variables X and Y as objects of FinProb does not yet specify a joint distribution; in general, there will be many joint distributions (X,Y) compatible with the given distributions of X and Y. What we need is to assume the existence of a sample space on which X and Y live.

So consider an object Ω ∈ FinProb; we think of it as the sample space relative to which all other random variables will be considered: X is a random variable on Ω if it comes equipped with a morphism Ω → X. In other words, a random variable on Ω is an object in the coslice category Ω/FinProb.

Proposition

If X and Y are random variables over Ω, then their joint distribution (X,Y) is equal to the categorical product X× ΩY in Ω/FinProb.

Proof - Straightforward.(?)

As a side remark, this also shows the following:

Corollary

The category FinProb does not have products.

Proof - If FinProb had products, then any coslice category Ω/FinProb would inherit these products. However, since the joint distribution of two random variables with given distributions does in general depend on the sample space on which these random variables live, the product in Ω/FinProb does depend on Ω.

We resume the main line of thought. By composing the functor H with the natural functor Ω/FinProb → FinProb, we can also consider Shannon entropy as a functor on Ω/FinProb, which is, by abuse of notation, also denoted by H.

Proposition

Regarding the categorical product as the monoidal structure on Ω/FinProb, Shannon entropy becomes a lax monoidal functor

H:\Omega/\Fin\Prob\rightarrow\left(\mathbb{R}_+,\leq\right)^{\mathrm{op}}

Proof - The monoidal unit of Ω/FinProb is given by the terminal object, which is any single-outcome random variable *, while in (ℝ+,≤)^op the monoidal unit is 0. Thus the compatibility of H with the monoidal unit states that

0 \geq H(\ast)

which is equivalent to the condition that the one-outcome random variable * has vanishing entropy, H(*)=0.

The compatibility of H with the monoidal product requires

H(X) + H(Y) \geq H(X\times_{\Omega}Y)

which is precisely the non-negativity of mutual information since H(X× ΩY)=H(X,Y).

Slice categories and conditional entropy

We have just seen that coslice categories under objects in FinProb are relevant for defining joint distributions of random variables. Surprisingly, slice categories over objects in FinProb also play an important role. Let us fix an object Z ∈ FinProb and consider the slice category FinProb/Z. An object (X,f X) ∈ FinProb/Z is a finite probability space X together with a measure-preserving function f X: X → Z. We may think of f X as defining a partitioning of the domain of X, or as a fibration with base Z.

Given an object (X,f X) ∈ FinProb/Z, we define the conditional entropy by

H(X/Z) \equiv H(X) - H(Z)

(Since this is not exactly the usual definition of conditional entropy, we use a slightly different notation.) This quantity measures the information content of X conditional to Z: given that we know the value of Z, how much more information do we expect to get when learning the value of X?

This reduces to the usual concept of conditional Shannon entropy as follows: given random variables A and B, the conditional entropy is given by

H(A|B) = H(A,B) - H(B)

which is of the above form if we take Z=B, X=(A,B) and f: X → Z to be the projection onto the second component.

On the category FinProb/Z, there is a monoidal product defined as follows: for objects (X,f X) and (Y,f Y) the product ‘over Z’ is given by the pullback

X\times^Z Y = \left\{(x,y)\:|\:f_X(x)=f_Y(y)\right\}

together with the measure defined as

P_{X\times^Z Y}(x,y) = \frac{P_X(x)\cdot P_Y(y)}{P_Z(f_X(x))}

which just states that X and Y are taken to be ‘fiberwise independent’ random variables. (When the denominator of this expression vanishes, its numerator also vanishes automatically, and we take the whole expression to be 0.)

It should be clear how this monoidal product operates on morphisms. Its unit is given by the object (Z,id Z).

Proposition

With these definitions, the conditional Shannon entropy H(⋅/Z) is a monoidal functor

H(\cdot/Z):\Fin\Prob/Z\rightarrow\left(\mathbb{R}_+,\leq\right)^{\mathrm{op}}

Proof - Functoriality is clear by the previous results. That H(⋅/Z) preserves the monoidal unit means that

H(Z/Z) = H(Z) - H(Z) = 0\:.

So it remains to check that H(⋅/Z) preserves the monoidal product. This follows from a straightforward calculation:

\begin{aligned} H(X\times^Z Y/Z) &= -\sum_{z\in Z}\ \sum_{(x,y):\,f_X(x)=f_Y(y)=z}\frac{P_X(x)P_Y(y)}{P_Z(z)}\log\frac{P_X(x)P_Y(y)}{P_Z(z)} \; - \; H(Z) \\ &= -\sum_{x\in X}P_X(x)\log P_X(x)-\sum_{y\in Y}P_Y(y)\log P_Y(y) - 2H(Z) \\ &= H(X/Z)+H(Y/Z) \end{aligned}
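A small numerical sketch of this (Python; the representation of an object of FinProb/Z as a list of (probability, fibre) pairs is our own ad-hoc encoding): we build the fibrewise-independent product and check that H(⋅/Z) adds up.

```python
import math

def H(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

# Base space Z, and spaces over Z given as lists of (probability, fibre) pairs,
# where 'fibre' records the value of the measure-preserving map to Z.
Z = {'a': 0.4, 'b': 0.6}
X = [(0.1, 'a'), (0.3, 'a'), (0.2, 'b'), (0.4, 'b')]
Y = [(0.4, 'a'), (0.5, 'b'), (0.1, 'b')]

# Fibrewise-independent product X x^Z Y: pairs lying over the same z,
# with probability P_X(x) * P_Y(y) / P_Z(z).
prod = [(px * py / Z[zx], zx) for px, zx in X for py, zy in Y if zx == zy]

def H_given_Z(space):        # H(X/Z) := H(X) - H(Z)
    return H([p for p, _ in space]) - H(Z.values())

print(H_given_Z(prod), H_given_Z(X) + H_given_Z(Y))
assert abs(H_given_Z(prod) - (H_given_Z(X) + H_given_Z(Y))) < 1e-12
```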

Bislice categories and conditional mutual information

Now it is time to combine these observations in order to find a categorical formulation of the non-negativity of conditional mutual information, which subsumes all the other basic information inequalities as degenerate cases.

So let us fix a sample space Ω ∈ FinProb and a ‘cosample space’ Z ∈ FinProb. We also need to fix a measure-preserving function g: Ω → Z which anchors Z as a random variable on Ω. We now consider the category of probability spaces ‘between’ Ω and Z. More precisely, we take Ω/FinProb/Z to be the category where an object is a triple (X,i X,f X) consisting of a probability space X ∈ FinProb and measure-preserving functions i X: Ω → X and f X: X → Z such that f X ∘ i X = g. As morphisms of Ω/FinProb/Z, we take those measure-preserving functions which are compatible with the given maps i and f.

This is not a comma category, is it? Should it still be called ‘bislice category’?

Proposition

If (X,i X,f X), (Y,i Y,f Y) ∈ Ω/FinProb/Z, then the categorical product, denoted by X × Ω^Z Y, is the joint distribution of X and Y on the sample space Ω.

Proof - Straightforward.(?)

As above, this gives the same consequence:

Corollary

The category Ω/FinProb does not have products.

The object (Z,g,id Z) is the unit of this monoidal structure.

Proposition

With respect to this monoidal structure, the conditional entropy H(⋅/Z) is a lax monoidal functor

H(\cdot/Z):\Omega/\Fin\Prob/Z\rightarrow\left(\mathbb{R}_+,\leq\right)^{\mathrm{op}}

The compatibility with the monoidal product is equivalent to the non-negativity of conditional mutual information.

Proof - Again functoriality is clear. For the unit, being lax monoidal means that

H(Z/Z) \leq 0

which holds true since H(Z/Z) = H(Z) − H(Z) = 0. The compatibility with the monoidal product states that

(5) H(X\times^Z_\Omega Y/Z)\leq H(X/Z)+H(Y/Z)

That this inequality holds for Shannon entropy follows from the second assertion, which states that it is equivalent to the non-negativity of conditional mutual information. This equivalence still remains to be proven. To see this, note that X is isomorphic to the joint variable (X,Z), and similarly Y is isomorphic to (Y,Z). With this, (5) reads

H(X,Y,Z|Z)\leq H(X,Z|Z)+H(Y,Z|Z)

Upon expanding the conditional entropies in terms of joint entropies, this becomes

(6) H(X,Y,Z)+H(Z)\leq H(X,Z)+H(Y,Z) ,\quad i.e. \quad I(X:Y|Z) \ge 0\;.

Therefore (5) follows from the non-negativity of conditional mutual information.

The converse direction works similarly: given a triple of random variables X, Y, Z on Ω, we can consider (X,Z) and (Y,Z), which are also random variables on Ω and come equipped with the projection to Z, as objects in Ω/FinProb/Z. Then again, (5) is equivalent to (6).

Rényi entropy

The logarithm of Z has the physical meaning of free energy. The logarithm of Z divided by 1−α is called the Rényi entropy:

H_\alpha(S,\mu) = \frac{1}{1-\alpha} \ln \sum_{i \in S} {\mu_i}^\alpha

This can be seen as roughly a q-derivative of the free energy; see

for details.

For α > 0 with α ≠ 1, H α defines a continuous functor

H_\alpha : \Fin\Meas_0 \to \mathbb{R}

The case α=1 can probably be taken care of via a limit; for now let’s assume so.

Thanks to

Z(X \times Y) = Z(X) \times Z(Y)

the Rényi entropy obeys

H_\alpha(X \times Y) = H_\alpha(X) + H_\alpha(Y)

For probability measures the Rényi entropy also obeys

H_\alpha(X) \ge 0

Thus, if we treat [0,∞) as a topological subcategory of ℝ, the Rényi entropy restricts to a continuous functor

H_\alpha : \Fin\Prob_0 \to [0,\infty)

where we treat probability measure spaces where the only set of measure zero is the empty set as forming a topological subcategory FinProb of FinMeas.

Furthermore

H_\alpha(p_1, \dots, p_k) \ge H_\alpha(p_1, \dots, p_{k-1} + p_k)

where we write a probability measure on a set of the form {1,…,k} as a list of nonnegative numbers that sum to 1. Thus we have:

Proposition

For α ≠ 1, the Rényi entropy H α: FinProb 0 → [0,∞) extends to a continuous functor

H_\alpha : \Fin\Prob \to [0,\infty)
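A quick Python sketch (helper names ours) of the two facts just used: additivity of H α under products of probability spaces, and monotonicity under coarse-graining.

```python
import math

def renyi(p, alpha):
    """Renyi entropy H_alpha(p) = ln(sum_i p_i^alpha) / (1 - alpha), for alpha != 1."""
    return math.log(sum(x ** alpha for x in p)) / (1.0 - alpha)

def product(p, q):
    return [a * b for a in p for b in q]

def coarse_grain(p):
    """Merge the last two outcomes into one."""
    return p[:-2] + [p[-2] + p[-1]]

p = [0.5, 0.3, 0.2]
q = [0.6, 0.4]
for alpha in [0.5, 2.0, 5.0]:
    assert abs(renyi(product(p, q), alpha) - (renyi(p, alpha) + renyi(q, alpha))) < 1e-9
    assert renyi(coarse_grain(p), alpha) <= renyi(p, alpha) + 1e-12
print("H_alpha is additive under products and does not increase under coarse-graining.")
```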

Relative Rényi entropy

This is Tobias. I just want to write things down in little detail to at least have a record. So this section needs much more care. In particular, if we decide to only consider cases where all probabilities are strictly positive, then one can apply continuity arguments.

Let p = (p 1,…,p n) and r = (r 1,…,r n) be two probability distributions on an n-element set. Then the relative Rényi entropy of order q is given by

D_q(p||r) = \frac{1}{q-1} \log \sum_i \frac{p_i^q}{r_i^{q-1}}

Here, we set D q(p||r) to be ∞ whenever there is an index i with r i = 0 but p i > 0. In terms of power means,

D_q(p||r) = \log M_{q-1} \left( p,\frac{p}{r} \right)

where (p/r)(i) = p(i)/r(i) is the Radon-Nikodym derivative (we set 0/0 = 0 here). D q(p||r) is non-negative for q ∈ [0,∞].

To Do?: see whether the notation is good and add some more introductory bla-bla

What are good properties of relative Rényi entropy? It is clearly invariant under permutations of the indices when these are applied to p and r at the same time. Furthermore, it is also invariant under adding additional randomness: we can add additional randomness to p and r by choosing some λ ∈ [0,1] and considering the (n+1)-outcome distributions p′ = (λp 1, (1−λ)p 1, p 2, …, p n) and r′ = (λr 1, (1−λ)r 1, r 2, …, r n). Then,

D_q(p'||r') = \frac{1}{q-1} \log \left( \lambda\frac{p_1^q}{r_1^{q-1}}+(1-\lambda)\frac{p_1^q}{r_1^{q-1}} + \sum_{i=2}^n \frac{p_i^q}{r_i^{q-1}} \right) = D_q(p||r)

So under which kind of operation does D q change at all? One thing we have not considered so far is coarse-graining, i.e. the identification of several outcomes with a single one. Again applying this to both distributions at the same time, we obtain e.g. the distributions p̂ = (p 1+p 2, p 3, …, p n) and r̂ = (r 1+r 2, r 3, …, r n).

Proposition

In this notation,

D_q(\hat{p}||\hat{r})\leq D_q(p||r)

for all q ∈ [0,∞].

Proof - For q ≥ 1, this inequality states that

(p_1 + p_2) \left( \frac{p_1+p_2}{r_1+r_2} \right)^{q-1} \leq p_1 \left( \frac{p_1}{r_1} \right)^{q-1} + p_2 \left( \frac{p_2}{r_2} \right)^{q-1}

which is equivalent to

M_{-1}(w,x)^{q-1} \leq M_{q-1}(w,x)^{q-1}

with w = (p 1/(p 1+p 2), p 2/(p 1+p 2)) and x i = p i/r i. This in turn always holds since the power means M t are increasing functions of the parameter t. Allowing the second argument of the power means to also take on the value ∞ (which means, due to the particular form of the arguments, that the corresponding weight will not be 0), this still holds in the degenerate case where some r i vanishes.
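A quick numerical sanity check of the two properties discussed in this section, invariance of D q under splitting an outcome and monotonicity under coarse-graining (Python sketch, assuming strictly positive probabilities; helper names ours):

```python
import math

def renyi_div(p, r, q):
    """Relative Renyi entropy D_q(p||r), all entries strictly positive, q != 1."""
    return math.log(sum(pi ** q / ri ** (q - 1) for pi, ri in zip(p, r))) / (q - 1)

p = [0.5, 0.2, 0.3]
r = [0.4, 0.4, 0.2]
p_hat = [p[0] + p[1]] + p[2:]     # identify the first two outcomes
r_hat = [r[0] + r[1]] + r[2:]
lam = 0.3
p_split = [lam * p[0], (1 - lam) * p[0]] + p[1:]   # split the first outcome
r_split = [lam * r[0], (1 - lam) * r[0]] + r[1:]
for q in [0.5, 2.0, 3.0]:
    assert renyi_div(p_hat, r_hat, q) <= renyi_div(p, r, q) + 1e-12
    assert abs(renyi_div(p_split, r_split, q) - renyi_div(p, r, q)) < 1e-9
print("D_q is invariant under splitting and does not increase under coarse-graining.")
```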

(Under construction…)

References

  • D. K. Faddeev, On the concept of entropy of a finite probabilistic scheme, Uspehi Mat. Nauk (N.S.) 11 (1956), no. 1(67), pp. 227–231. (In Russian.) German Translation: Zum Begriff der Entropie eines endlichen Wahrscheinlichkeitsschemas, Arbeiten zur Informationstheorie I, Berlin, Deutscher Verlag der Wissenschaften, 1957, pp. 85-90.
  • J. Havrda and F. Charvát, Quantification method of classification processes: concept of structural α-entropy, Kybernetika 3 (1967), 30–35.
  • S. Mac Lane, Categories for the Working Mathematician, Graduate Texts in Mathematics 5, Springer, 1971.
  • G. P. Patil and C. Taillie, Diversity as a concept and its measurement, Journal of the American Statistical Association 77 (1982), 548–561.
  • C. Tsallis, Possible generalization of Boltzmann–Gibbs statistics, Journal of Statistical Physics 52 (1988), 479–487.