These are notes by John Baez, Tobias Fritz and Tom Leinster. The idea is to develop a deeper understanding of entropy, free energy, the partition function and related ideas in probability theory and statistical mechanics using the tools of modern algebra: categories, monads, algebraic theories, operads and the like.
Some of these notes were turned into a paper:
This paper was developed in a series of blog conversations, in this order:
John Baez, Category-theoretic characterizations of entropy.
Tom Leinster, Entropies vs. means.
Tom Leinster, An operadic introduction to entropy.
John Baez, A characterization of entropy.
The paper was written up on the nLab here:
and there is also a draft of a paper that never got finished, here:
In our paper A characterization of entropy in terms of information loss we gave a characterization of Shannon and more generally Tsallis entropy, not just for probability measures on finite sets, but for arbitrary (nonnegative) measures on these sets. This characterization uses a certain category:
Let be the category whose objects are finite sets equipped with measures and whose morphisms are measure-preserving functions.
We noted that one can take arbitrary nonnegative linear combinations of objects and morphisms in . These can be built up from direct sums and scalar multiplication by nonnegative real numbers, defined as follows.
For ‘direct sums’, first note that the disjoint union of two finite sets equipped with measures is another thing of the same sort. We write the disjoint union of as . Then, given morphisms , there is a unique morphism that restricts to on the measure space and on the measure space .
For ‘scalar multiplication’, first note that we can multiply a measure by a nonnegative real number and get a new measure. So, given an object and a number we obtain an object with the same underlying set and with . Then, given a morphism , there is a unique morphism that has the same underlying function as .
Tsallis entropy can be seen as the unique invariant of morphisms in that is additive under composition, additive under direct sums, homogeneous with respect to scalar multiplication and also continuous:
Let . Suppose is any map sending morphisms in to numbers in , and obeying these four properties:
whenever are composable morphisms.
for all morphisms .
Homogeneity of degree :
for all morphisms and all .
Continuity: is continuous.
Then there exists a constant such that for any morphism in ,
where is the Tsallis entropy of order . Conversely, for any constant , this formula determines a map obeying conditions 1-4.
In the simplest case, namely homogeneity of degree , the Tsallis entropy reduces to the Shannon entropy.
In fact, such an really defines a functor from to a category with one object and with nonnegative real numbers as morphisms. So we now go ahead and formulate our results in the language of category theory. For this we need to pick a way to view the nonnegative real numbers as a category:
Let be the category with a single object and as the set of morphisms from this object to itself, with addition as composition.
The categories and have much in common, which is why it is reasonable to assign an entropy change, namely a morphism in , to each morphism in . Here we focus on three similarities between these categories.
First, both and are topological categories. Recall that a topological category is a category where the set of objects and the set of morphisms form topological spaces, and all the category operations are continuous. A continuous functor is a functor between topological categories that maps objects to objects and morphisms to morphisms in a continuous way.
To make into a topological category, we use the only possible topology on the set of objects, and the standard topology on the set of morphisms . To make into a topological category, we should say not only when a sequence of objects or morphisms converges, but also when any net converges. A net of objects converges to if eventually all the have the same underlying finite set , and then in the usual topology on for all . We say a net of morphisms converges if and for some and for large enough the underlying function of equals a fixed function . The observant reader will note that is a ‘large’ topological category, meaning that its set of objects lives in a larger Grothendieck universe than that containing the finite sets we use here. This will not present a problem.
Second, both and come equipped with a notion of ‘addition’ making them into symmetric monoidal categories. For this addition is trivial on objects, and the usual addition of obvious nonnegative real numbers on morphisms. The resulting functor
makes into a symmetric monoidal category. On the other hand, we have already seen how disjoint union of finite sets equipped with measures defines a ‘direct sum’ operation on . This gives a functor
which makes into a symmetric monoidal category.
Third, both and can be equipped with a notion of ‘scalar multiplication’ by any number . This gives an action of the multiplicative monoid on these categories, by which we mean monoid homomorphisms
where for a category , the monoid consists of all functors with functor composition as monoid multiplication.
Now let us think a bit about how the topological, additive and multiplicative structures on and interact with each other. Note that while in general on ,
we do have canonical isomorphisms
So acts not just as endofunctors on , but as symmetric monoidal endofunctors. Writing for the monoid of symmetric monoidal endofunctors of a symmetric monoidal category , we hence have a monoid homomorphism
So, given a topological monoid , let us call a topological symmetric monoidal category equipped with a monoidal functor
providing a continuous action of on an -module category. We can summarize the previous two paragraphs by saying that is a -module category, and for each there is an -module category .
By a morphism of -module categories, we mean a symmetric monoidal functor which is compatible with the action of . In this way, we obtain a category of -module categories; this is the coslice category over in the category of symmetric monoidal categories with strict symmetric monoidal functors.
Using these concepts, we see that Theorem 2 implies a nice characterization of Shannon entropy:
Suppose is a morphism of -module categories. Then for any morphism in ,
for some , where is Shannon entropy. Conversely, for any constant this equation defines a morphism of -module categories.
What about the Tsallis entropy for ? In fact, for each number there is a way to make into a -module category, by letting act trivially on objects and as multiplication by on morphisms. We write for equipped with this action of .
The action of on the category has properties already mentioned in the case . Namely, there are equations
and so the action actually also operates in terms of symmetric monoidal endofunctors,
This homomorphism is also continuous. So, for each we see that is a -module. Theorem Theorem 2 implies:
Let . Suppose is a morphism of -module categories. Then for any morphism in ,
for some , where is the Tsallis entropy of order . Conversely, for any constant this equation defines a morphism of -module categories.
Let be the category of strictly positive finite measure spaces, where:
• an object is a finite set equipped with a strictly positive measure: that is, a measure for which every set has nonnegative measure, and the only set of measure zero is the empty set.
• a morphism is a function such that the pushforward of is .
We define a topological category is a category internal to , and a continuous functor is a functor internal to . There is a category with topological categories as objects, and continuous functors as morphisms.
We shall make into a topological category as follows. The topology on objects and morphisms is the obvious one except that I’m tempted to do the following: when a sequence of measures on a set is trying to converge (in the obvious sense!) to a measure for which only points in some subset have nonzero measure, we say that
In simple heuristic terms: call the points in a finite set ‘states’. Then when the probability of certain possible states goes to zero, these states effectively ‘go away’.
Note that one interesting feature of this definition is that the space of objects of is connected.
Let be the topological category where objects are real numbers and there is a unique morphism whenever (and none otherwise).
Let us classify all continuous functors
with the property that
Here at left denotes the ‘direct sum’ of measure spaces, while at right it denotes the usual sum of real numbers.
restricts to a continuous functor
Here is the core of the category , meaning that we keep the same objects but include only the isomorphisms. Note that this restriction loses no information. So, if we classify all continuous functors
obeying the equation , the answer to our problem will be easy.
obeying the equation correspond to continuous functions
with the property that .
Proof Sketch - Since every object in is a direc sum of one-element measure spaces, the functor is determined by its value on those. A one-element measure space is just a nonnegative real number, so is determined by a function
Continuity of the functor implies that is continuous. Note that applied to the empty set must give 0. Thanks to the funny topology on and the continuity of , this forces . Conversely, given any such function we can define on the finite measure space by
where is of the point .
obeying the equation correspond to continuous functions
such that and and .
Proof Sketch - Use the previous proposition and show that for to extend to some (necessarily unique) , it is necessary and sufficient that the corresponding function obeys . We need this condition so that any map from the 2-point measure space to a 1-point measure space gets sent to a morphism in . Every map in is surjective and every surjection is a composite of coproducts of copies of the identity and these 2-points-to-1 maps, so this is all we need.
obeying the equations and correspond to continuous functions
of the form with .
Proof - This should follow from our earlier results.
obeying the equation and correspond to continuous functions
of the form with .
Proof - This should follow from our earlier results.
Suppose is an object in . The continuous functor
corresponding to the function is then equal to
We may fix and think of as a function of . This function plays an important role in physics, where it is called the partition function of . We denote this by
The following proposition says the partition function is a ‘complete invariant’ of positive finite measure spaces:
Two objects of have the same partition function if and only if they are isomorphic.
Proof - It is obvious that isomorphic objects of have the same partition function. So, we need to show that we can recover from its partition function. To see this, note that is the Laplace transform of this measure on the real line:
So, we can recover the measure from the partition function by taking its inverse Laplace transform. Furthermore, we can recover the multiset of numbers from the measure .
We can restate this proposition in an even more complicated way. For any (essentially small) category let be the decategorification of : that is, the set of isomorphism classes of objects of . If is a symmetric rig category then will naturally become a commutative rig.
Let , the rig of partition functions, be the set of functions of the form
where is an arbitrary finite set and are arbitrary positive real numbers. is a closed under pointwise addition and multiplication, making it into commutative rig.
The map sending each each isomorphism class of strictly positive finite measure space to its partition function is an isomorphism of commutative rigs.
Proof - When you penetrate the jargon, the only nonobvious part here is that is a bijection, which was shown in the previous proposition.
The monad for convex sets is a monad on sending any set to the set of finitely-supported probability distributions on .
For example, this monad sends to a set , which can be identified with the -simplex. This monad is a finitary monad, so can be thought about in a few different ways.
A finitary monad is the same thing as a finitary algebraic theory. This one can be presented by a family of binary operations, subject to some equations. They’re probably exactly those equations that hold in a convex subset of if we put .
(I suspect there’s a theorem here: that for , “satisfies no laws”. This means there are no equations between the various operations beyond those forced by its being an algebra for this theory.)
A finitary algebraic theory can also be thought of as a kind of souped-up operad. In a symmetric operad , one has for each bijection an induced map . In a finitary algebraic theory, one has the same thing for arbitrary functions between finite sets, not just bijections. In other words, a finitary algebraic theory amounts to a non-symmetric operad together with, for each function between finite sets, an induced map , satisfying suitable axioms.
The underlying symmetric operad for the convex monad is called the operad for convex algebras, and denote . An algebra of is called a convex algebra.
The space of -ary operations for this operad is , the space of probability distributions on . The composition of operations is supposed to be obvious, but we should probably write down a formula for it. The maps “” can be defined by pushforward of measures. An algebra for the algebraic theory of convex algebras is an algebra for the operad with the further property that the square
commutes for all .
Note that is naturally a topological operad, where the topology on is the usual topology on the -simplex. In our applications, we focus on algebras of in topological categories with finite products. We call these convex algebras in . Here are some examples:
Any convex subset of is naturally a convex algebra in . In particular, is such.
Moreover, if we regard the topological monoid as a one-object topological category, then it is a convex algebra in .
is also a convex algebra in .
is also a convex algebra in .
Rényi presents a nice characterization of Shannon entropy due to Fadeev here:
It is the first result mentioned in the paper. To translate into our framework, it helps to write a probability measure on the finite set as an -tuple of numbers with .
(Fadeev) Suppose is a continuous functor obeying the magical property:
for all . Then is a nonnegative multiple of the Shannon entropy .
One may reformulate Fadeev’s theorem in terms of operads. The details are written out here:
What follows is a sketch.
Let be an operad in a finite product category . We may speak of (strict) -algebras in , the category of internal categories in . There is a notion of lax map between -algebras in .
Briefly put: if and are -algebras, a lax map from to is a functor together with, for each , a natural transformation
satisfying one axiom involving composition in , one involving the unit of , and, if is symmetric, one involving the symmetric group action on .
Now take . Let us call a lax map a lax point of . Batanin has called them “internal -algebras” in , and maybe we’ll end up preferring that name.
If and is the terminal non-symmetric operad then is a monoidal category and a lax point of is a monoid in .
If and is the non-symmetric operad for -sets, where is a monoid, then is a category with an -action and a lax point of is an object together a map for each , satisfying the obvious action-type axioms.
In what follows we apply this idea to the case where and is the symmetric, topological operad in which is the space of probability distributions on (otherwise known as the -simplex).
The lax points of the -algebra are precisely the scalar multiples of Shannon entropy.
Sketch of proof - Write out the axioms in the definition of lax map and unwind them in this particular case. One concerns composition, and this becomes the most substantial of the Fadeev axioms on Shannon entropy: the so-called ‘magical property’ mentioned above. One concerns units, and is redundant. One concerns symmetry. Then because we’re entirely in , you’ve got one concerning continuity. So we’ve got all the Fadeev axioms except for normalization, and the result follows from (Rényi’s statement of) Fadeev’s theorem.
There’s a very similar characterization of Shannon extropy, or rather the scalar (real) powers of Shannon extropy. There we simply use the -algebra , which is isomorphic to by taking exponentials.
We might also want to get rid of the “scalar multiples/powers” part and hit it on the nose. This seems more feasible with extropy rather than entropy, since there’s no choice of base. Still, I don’t see a nice way to do it.
This section is in a more informal style…
We begin with a sadly familiar problem:
Suppose you live in a town with a limited number of tolerable restaurants. Every Friday you go out for dinner. You randomly choose a restaurant according to a certain probability distribution $P$. If you go to the $i$th restaurant, you then choose a dish from the menu according to some probability distribution $Q_i$. How surprising will your choice be, on average?
Note: I’m only talking about how surprised you are by the choice itself—not about any additional surprise due to the salad being exceptionally tasty, or the cook having dropped her wedding ring into your soup, or something like that. So if there’s only one restaurant in town and they only serve one dish, you won’t be at all surprised by your choice. But if there were thousands of restaurants serving hundreds of dishes each, there’d be room for a lot of surprise.
This still sounds like an impossibly vague puzzle. So, I should admit that while ‘surprise’ sounds psychological and very complicated, I’m really just using it as a cute synonym for Shannon entropy. Shannon entropy can be thought of as a measure of ‘expected surprise’.
Now, ‘expected surprise’ sounds like an oxymoron: how can something expected be a surprise? But in this context ‘expected’ means ‘average’. The idea is that your ‘surprise’ at an outcome with probability is defined to be . Then, if there are a bunch of possible outcomes distributed according to some probability distribution , the ‘expected’ surprise or Shannon entropy is:
where is the probability of the th outcome. This is a weighted average where each surprise is weighted by its probability of happening.
So, we really do have a well-defined math puzzle, and it’s not even very hard.
But the interesting thing about this problem is that it involves an operation which I’ll call ‘glomming together’ probability distributions. First you choose a restaurant according to some probability distribution on the set of restaurants. Then you choose a meal according to some probability distribution . If there are restaurants in town, you wind up eating meals in a way described by some probability distribution we’ll call
A bit more formally:
Suppose is a probability distribution on the set and are probability distributions on finite sets , where . Suppose the probability distribution assigns a probability to each element , and suppose the distribution assigns a probability to each element .
Then we can glom together the probability distributions using and get a new probability distribution on the disjoint union of all the sets . The resulting probability for each element in the disjoint union is just .
These probabilities sum to 1 as they should:
Note: what ranges over depends on ! But it’s all kosher.
There’s something even simpler than glomming together probability measures: we can glom together numbers!
Suppose we have a probability measure on the set and for each element we have a number . Then we can define a new number by
This is just the ‘weighted average’ of the numbers . We have already seen a weighted average in the definition of Shannon entropy, equation (2).
Now we’re ready to answer the math puzzle:
On the left we’re glomming together a bunch of probability distributions using and then taking the entropy. On the right we’re taking the entropies of those probability distributions and then glomming the resulting numbers together using . But there’s also an extra term: the entropy of !
In other words: your expected surprise won’t be just the weighted average of your expected surprises at the various different restaurants, namely . It’ll be that plus the expected surprise of your choice of restaurant, namely .
You might not have expected such a simple formula. But it’s easy to prove:
The formula for glomming together entropies is more important than you might think: it comes very close to characterizing Shannon entropy! We can restate Fadeev’s theorem (mentioned above) as follows:
(Fadeev) Suppose is a function sending probability measures on finite sets to real numbers. Suppose that:
is invariant under bijections.
Then is a real multiple of Shannon entropy.
In item 1 we’re using the fact that a bijection between finite sets together with a probability distribution on gives a probability distribution on ; we want these to have the same entropy. In item 2 we’re using the standard topology on to put a topology on the set of probability distributions on any -element set. For a proof of this theorem, see the beginning of Rényi's 1961 paper.
While this equation is cute:
it’s a bit tricky to find its proper place in the world of abstract algebra. A probability distribution can act either on a list of probability distributions or a list of numbers. Shannon entropy gives a map from probability distributions to numbers. So, if you’re algebraically inclined, you would want to be a ‘homomorphism’, obeying a law like this:
We see laws of this sort all over math. But the true law has an extra term. What’s going on?
The ‘glomming’ business can be formalized using operads, and Tom’s answer is roughly: Shannon entropy is not a ‘homomorphism’ of operad algebras, but only a ‘lax homomorphism’. For an explanation of what this means, go here.
Right now I want to propose an alternative answer. I hope we can combine it with Tom’s answer and reach a deeper understanding of what’s going on.
For starters, consider another law that has an ‘extra term’:
In math jargon, the product rule says that taking the derivative of a function at a point is not a homomorphism from smooth functions to real numbers: it’s a ‘derivation’. We can get a derivation on an algebra by differentiating a family of algebra homomorphisms. Similarly, we can get the funny rule obeyed by entropy by differentiating a family of operad algebra homomorphisms.
Let’s see how.
There’s something more fundamental than the Shannon entropy of a probability distribution: namely, it’s partition function. At least that’s how physicists think. Let me explain.
Suppose is a probability measure on a finite set . Let be the probability of the element . We can always write
is a function called the energy or Hamiltonian. The idea is that the probability of a system being in some state decreases exponentially with the energy of that state; we allow the energy to account for zero probabilities.
But physicists always go further and introduce a parameter which stands for the reciprocal of temperature, to capture the fact that states of high energy are even less likely to be occupied when it’s cold. Now we make into a function of :
or if you prefer, simply
When , these numbers are just the probabilities . But when , there’s no reason for them to sum to 1. To get a probability measure, we’d have to divide by a fudge factor called the partition function:
But I won’t do that just now: I’ll let be the measure on the set that assigns the measure to the point .
Now everything is a function of ! But everything reduces to something familar at , so we can stretch our old notation without breaking it. We now have functions sending numbers to numbers, and a function sending numbers to measures on . These reduce to our original and at . The partition function is also a function of , and this equals 1 at .
But here’s the important thing: the partition function obeys a simpler law than entropy when we glom probability measures together! Suppose is a probability distribution on the set and are probability distributions on finite sets , where . Then
So, in fancy jargon, is an operad algebra homomorphism!
But I need to explain what this equation means. First of all, everything is a function of . Second, while and are only probability measures at , they’re perfectly fine measures at other values of , so all our ‘glomming’ formulas still make sense.
Let’s check to make sure we know what’s going on. An expression like could be ambiguous. We have a recipe for taking a probability measure on a finite set and extending it to a function from numbers to measures on that set. So, we can extend and this way and then ‘glom’ them to get a measure-valued function of . On the other hand, we can glom them and then extend them. Luckily, the results agree.
Let’s see why. Suppose assigns the point the probability . When we extend this to a function of we get . Similarly, suppose assigns the point the probability . When we extend this to a function of we get . When we glom these, the measure of the point will be this function of :
But this equals
which is the result of glomming and then extending.
The right-hand side of equation (3) is also unambiguous… but I’m wasting time: if I were a physicist I’d be done proving this equation by now, instead of stewing around explaining what it means. It’s incredibly easy to prove. From what I’ve said,
How is the Shannon entropy of a probability distribution related to its partition function? Simple:
Here I must emphasize that I’m only talking about the Shannon entropy of the original probability distribution, ‘at ’. Physicists extend to a function of , along with everything else. That would be very good to do, but it goes beyond our goal now, and it would make the formula relating and a bit more complicated, so let’s not.
Our goal now is simply to get the rule for glomming entropies by differentiating the rule for glomming partition functions. So, let’s do that using equation (4). Later I’ll show you why (4) is true.
We start with the rule for glomming partition functions:
We differentiate with respect to and use the product rule, taking advantage of the fact that is linear in but also linear in each of the functions :
Now set . We can use equation (4) and the fact that all our partition functions equal 1 at :
Hey, it looks like we’re almost done! As you can see, the product rule did most of the work, so we’re really saying that is like a ‘derivation’. We just to work out that funny-looking first term on the right-hand side. It amounts to taking a weighted sum of 1’s, which is just a sum:
and we have
So, we’ve got it!
The funny-looking rule for glomming entropies is just the derivative of the rule for glomming partition functions!
But before we go further, let’s check rule (4). For starters,
But we’ve just seen that
As we’ve seen already, there is deeper reason for being interested in the partition function: it’s actually a form of ‘decategorification’!
Let be the core of , as defined above. Then
can be thought of as a functor from to its set of isomorphism classes, viewed as a category with only identity morphisms. In other words, assigns to each object its partition function … but we can recover up to isomorphism from .
Now, is an algebra of a certain operad whose -ary operations are the probability measures on the set . This is just a fancy way of talking about ‘glomming probability measures’. As a kind of consequence, the set of partition functions also becomes a -algebra. Furthermore, becomes a homomorphism of -algebras. This last statement mostly just says that
But then we can take , differentiate it with respect to , and set . By abstract nonsense, the resulting functor should be some sort of ‘derivation’ of the -algebra . Concretely, we’ve seen this mostly says that
But there should also be an abstract-nonsense theory of derivations of operad algebras. (This was discussed this a bit back in week299, but only a little). So, an interesting question presents itself:
How does the ‘derivation’ way of thinking about the law relate to Tom Leinster’s interpretation of it in terms of lax operad homomorphisms, or more precisely ‘lax points’?
If we get this straightened out, it will also be tempting to extend the story to Rényi entropy, using the fact that Rényi entropy is a kind of -derivative of minus the logarithm of the partition function. The -derivative is a kind of ‘twisted derivation’, but a bunch of the same calculations may work in some modified form.
There is a categorical characterization of the power means (of order ), similar in spirit to some of the other results on this page. It follows from the characterization in
Write . For each , there is a convex algebra structure on the set , given by taking the power mean of order . The action
is given by if , or if , or if .
Now put the order on to make it into a category. Since is increasing (for each and ), power mean of any order makes into a convex algebra in .
Now put the additive structure on , to make it into a monoidal category. For we have
and , so is a lax monoidal functor. So power mean of any order makes into a convex algebra in , the category of monoidal categories and lax monoidal functors.
This structure is, moreover, homogeneous:
whenever and .
The convex algebra structures on satisfying the homogeneity axiom are precisely the power means ().
Sketch proof - We show that any homogeneous convex algebra structure on satisfies the hypotheses of the characterization theorem in arXiv:1103.2574, and is therefore one of the power means. This is elementary manipulation. The axiom called “functoriality” in that paper is exactly the commutativity of the square drawn above. “Consistency” is the unit axiom for an operad action. “Multiplicativity” follows from the composition axiom for an operad action, together with homogeneity.
I wouldn’t call this result a rephrasing of the result in my paper; it’s probably rather weaker (i.e. easier to prove), because the all-important multiplicativity axiom involves only a very restricted kind of operadic composition.
The proposition would be nicer if the homogeneity axiom could be either dropped or turned into something more categorical. I don’t know what to do about this.
One possibility is to observe that the monoid acts on the monoidal category , and that homogeneity amounts to preservation of that action. So we could think about convex algebras in , the category of monoidal categories equipped with an action of the multiplicative monoid . That should work, but seems clumsy.
Another possibility is to use the following strange kind of monoidal functor. Let’s say that a lax monoidal functor is homogeneous if for all and , the coherence map
is invertible. The composite of homogeneous functors is again homogeneous. Let be the category of monoidal categories and homogeneous lax functors. Then the convex algebra structures on are exactly the power means of order . (Homogeneity of this kind only tells us that when is a natural number, but we can deduce it for all positive reals.) That’s a reasonably clean statement, but the concept of homogeneous lax functor seems strange, and was invented only for this purpose. I’ve never seen it before.
I don’t know whether the homogeneity axiom is simply redundant. I’ve tried to prove that it is, and I’ve tried to prove that it isn’t, without success.
This is Tom, still.
To study quadratic forms, you really need to study bilinear forms. Similarly, I wonder whether here, it would help to look not just at expressions such as , but also at the associated expressions . (You see such expressions in the definition of power mean; I have in the back of my mind the kind of thing in arXiv:1103.2574, and closely related things in my work on diversity.)
In any case, here’s a theorem in the same spirit as some of those above.
Let be the topological category in which:
an object is a finite set equipped with a (nonnegative) measure and a function
a map is a map of finite sets such that and (i.e. ).
Since the base space is a mere finite set, and are actually of the same type. The fancy terminology hides this, and maybe we’re embarrassed to admit it – we’re probably dreaming of higher things. Nevertheless, we’ll take advantage of it later.
is a topological rig category, in an obvious way. So too is , the topological rig of all real numbers regarded as a rig category with no maps other than identities. (The subscript is supposed to mean “the only maps are isos”, as in what John wrote.)
A rig functor
is determined by its effect on objects, since is categorically discrete. It’s the same thing as a homomorphism from , the rig of connected-components of , to . For each , there’s a continuous rig functor given by
(Or more fancily: . In the case and , continuity forces us to put .)
The continuous rig functors are precisely the functors ().
Remark: I know versions of this proposition, including the one just stated. The first choice is whether to “allow zero” or not. By “allowing zero” I mean that in the definition of , and (as above); by “not allowing zero” I mean that we demand and .
The second choice is whether to work internally to or . (The point here is that in order to exclude the non-obvious solutions to the Cauchy functional equation , you need some extra hypothesis such as continuity or order-preservation.) The order on the objects of is: iff , for all , and for all . The order on the maps of is: iff , , and as functions.
Here are the four versions:
Allowing 0 and working in . This is the proposition above, where the continuous rig functors are for .
Not allowing 0 and working in . Then we get for all .
Allowing 0 and working in . Then we get for , together with another rig functor . This new functor is almost the same as , but whereas in the definition of we take , in the definition of we take . Explicitly, (whereas ).
Not allowing 0 and working in . Then we get for (but not ; or rather, we don’t see the difference between and ).
: I’ll continue to use the same setup as in the Proposition (allowing and working in ), for no other reason than that it’s the one I first thought of.
Digression: Actually, what I’m really taking advantage of is the fact that any measure on a finite set gives rise to a function. This is, if you like, its Radon-Nikodym derivative with respect to counting measure. So counting measure is playing a silent role as the canonical reference measure on a finite set.
I’ll invent some terminology. A map in is an equivalence if there exist subsets and such that , the induced map is a bijection, and . I think this is a standard concept in measure theory, but I forget the standard name; maybe just “measure-isomorphism”?
The maps in are the equivalences .
Proof - straightforward manipulation of the definitions.
Let be the category whose objects are those of and whose maps are the equivalences. Thus, embeds as a full subcategory of :
via . Also, is closed under the rig category operations on , so it’s a sub-rig-category.
John’s remarks on convergence in suggest that perhaps he really wants to be using instead of …? In any case, is a sub-rig-category of , and it’s easy to see that the continuous rig functors are exactly those of the form (), where
This is basically Corollary 1 above.
We now have a commutative triangle, or we would if I knew how to draw one. The composite
is equal to
for every . Algebraically, this just says that
But it seems to me to be noteworthy, because we naturally see the appearance of (as opposed to ), which is a recurrent player in this drama.
For random variables , , let us write for the ‘joint random variable’ whose distribution is the joint distribution of and . Its entropy is the joint entropy , which we also write as .
The basic inequalities of information theory are the following inequalities for Shannon entropy:
Some of these inequalities can be generalized to relative Shannon entropy, Rényi entropy, or (combining both generalizations) relative Rényi entropy, also known as ‘Rényi divergence’. For a fairly thorough introduction to inequalities obeyed by Rényi divergence, see:
• T. A. L. van Even, When Data Compression and Statistics Disagree: Two Frequentist Challenges For the Minimum Description Length Principle, Chap. 6: Rényi Divergence, Ph.D. thesis, Leiden University, 2010.
Here however we will start by studying these inequalities for the Shannon entropy of probability measures on finite sets.
The first three types of inequalities can be seen as degenerate cases of the fourth, when taking one of the variables being deterministic.
We have already seen how to formulate the first two properties categorically: for the Shannon entropy functor
the non-negativity of entropy is clear by the definition of the target category, while the non-negativity of conditional entropy follows by considering on the projection map .
But what about the other two types of basic inequalities?
Here’s how to do it for the non-negativity of mutual information. First of all, we need to make sense of the concept of joint distribution. Regarding random variables and as objects of does not yet specify a joint distribution; in general, there will be many joint ddistributions compatible with the given distributions of and . What we need is to assume the existence of a sample space on which and live.
So consider an object ; we think of it as the sample space conditional to which all other random variables will be considered: is a random variable on if it comes equipped with a morphism . In other words, a random variable on is an object in the coslice category .
If and are random variables over , then their joint distribution is equal to the categorical product in .
Proof - Straightforward.(?)
As a side remark, this also shows the following:
The category does not have products.
Proof - If would have products, then any coslice category would inherit these products. However since the joint distribution of two random variables with given distribution does in general depend on the sample space on which these random variables live, the product in does depend on .
We resume the main line of thought. By composing the functor with the natural functor , we can also consider Shannon entropy as a functor on , which is, by abuse of notation, also denoted by .
Regarding the categorical product as the monoidal structure on , Shannon entropy becomes a lax monoidal functor
Proof - The monoidal unit of is given by the terminal object, which is any single-outcome random variable , while in the monoidal unit is . Thus the compatibility of with the monoidal unit states that
which is equivalent to the condition that the one-outcome random variable has vanishing entropy, .
The compatibility of with the monoidal product requires
which is precisely the non-negativity of mutual information since .
We have just seen that coslice categories under objects in are relevant for defining joint distributions of random variables. Surprisingly, also slice categories over objects in play an important role. Let us fix an object and consider the slice category . An object is a finite probability space together with a measure-preserving function . We may think of as defining a partitioning of the domain of , or as a fibration with base .
Given an object , we define the conditional entropy by
(Since this is not exactly the usual definition of conditional entropy, we use a slightly different notation.) This quantity measures the information content of conditional to : given that we know the value of , how much more information do we expect to get when learning the value of ?
This reduces to the usual concept of conditional Shannon entropy as follows: given random variables and , the conditional entropy is given by
which is of the above form if we take , and to be the projection onto the second component.
On the category , there is a monoidal product defined as follows: for objects and the product ‘over ’ is given by the pullback
together with the measure defined as
which just states that and are taken to be ‘fiberwise independent’ random variables. (When the denominator of this expression vanishes, its numerator also vanishes automatically, and we take the whole expression to be .)
It should be clear how this monoidal product operates on morphisms. Its unit is given by the object .
With these definitions, the conditional Shannon entropy is a monoidal functor
Proof - Functoriality is clear by the previous results. That preserves the monoidal unit means that
So it remains to check that preserves the monoidal product. This follows from a straightforward calculation:
Now it is time to combine these observations in order to find a categorical formulation of the non-negativity of conditional mutual information, which subsumes all the other basic information inequalities as degenerate cases.
So let us fix a sample space and a ‘cosample space’ . We also need to fix a measure-preserving function which anchors as a random variable on . We now consider the category of probability spaces ‘between’ and . More precisely, we take to be the category where an object is a triple consisting of a probability space and measure-preserving functions and such that . As morphisms of , we take those measure-preserving functions which are compatible with the given data and .
This is not a comma category, is it? Should it still be called ‘bislice category’?
If , then the categorical product, denoted by , is the joint distribution of and on the sample space .
Proof - Straightforward.(?)
As above, this gives the same consequence:
The category does not have products.
The object is the unit of this monoidal structure.
With respect to this monoidal structure, the conditional entropy is a lax monoidal functor
The compatibility with the monoidal product is equivalent to the non-negativity of conditional mutual information.
Proof - Again functoriality is clear. For the unit, being lax monoidal means that
which holds true by . The compatibility of the monoidal products states that
That this inequality holds for Shannon entropy follows from second the assertion, stating that it is equivalent to the non-negativity of conditional mutual information. This equivalence still remains to be proven. To see this, note that is isomorphic to the joint variable , and similarly is isomorphic to . With this, (5) reads
Upon expanding the conditional entropies in terms of joint entropies, this becomes
Therefore (5) follows from the non-negativity of conditional mutual information.
The converse direction works similarly: given a triple of random variables , , on , we can consider and , which are also random variables on and come equipped with the projection to , as objects in . Then again, (5) is equivalent to (6).
The logarithm of has the physical meaning of free energy. The logarithm of divided by is called the Rényi entropy:
This can be seen as roughly a -derivative of the free energy; see
For with , defines a continuous functor
The case can probably be taken care of via a limit; for now let’s assume so.
the Rényi entropy obeys
For probability measures the Rényi entropy also obeys
Thus, if we treat as a topological subcategory of , the Rényi entropy restricts to a continuous functor
where we treat probability measure spaces where the only set of measure zero is the empty set as forming a topological subcategory of .
where we write a probability measure on a set of the form as a list of nonnegative numbers that sum to 1. Thus we have:
For , the Rényi entropy extends to a continuous functor
This is Tobias. I just want to write things down in little detail to at least have a record. So this section needs much more care. In particular, if we decide to only consider cases where are probalities are strictly positive, then one can apply continuity arguments.
Let and be two probability distributions on an -element set. Then the relative Rényi entropy of order is given by
Here, we set to be whenever there is an index with , but . In terms of power means,
where is the Radon-Nikodym derivative (we set here). is non-negative for .
To Do?: see whether the notation is good and add some more introductory bla-bla
What are good properties of Rényi entropy? It is clearly invariant under permutations of the of the indices when these are applied to and at the same time. Furthermore, it is also invariant under adding additional randomness: we can add additional randomness to and by choosing some and considering the -outcome distributions and . Then,
So under which kind of operation does change at all? One thing we have not considered so far is coarse-graining, i.e. the identification of some outcomes to a single one. Again applying this to both distributions at the same time, we obtain e.g. the distributions and .
In this notation,
for all .
Proof - For , this inequality states that
which is equivalent to
with and . This in turn always holds since the power means are increasing functions of the parameter . Allowing the second argument of the power means to also take on the value (which means that, due to the particular form of the arguments, that the corresponding weight will not be ), this also still holds in the degenerate case where some vanishes.