nLab Markov category

Contents

Context

Measure and probability theory

Monoidal categories

monoidal categories

With braiding

With duals for objects

With duals for morphisms

With traces

Closed structure

Special sorts of products

Semisimplicity

Morphisms

Internal monoids

Examples

Theorems

In higher category theory

Contents

Idea

The formalism of Markov categories and copy-discard categories is one of the most recent categorical approaches to probability theory.

Intuitively, a Markov category is a category where morphisms behave like “random functions”, for example, Markov kernels form such a category (hence the name). Canonical examples are Kleisli categories of probability monads. The formalism is however far more general.

Markov categories have a graphical formalism, based on string diagrams, that allows to easily keep track of the stochastic dependencies, as well as of randomness and determinism. Using this formalism one can interpret, prove and even generalize several results of probability and related fields (see Categorical probability – results for more details).

Definitions

A concise definition, which we interpret below, can be given as follows: A Markov category is a semicartesian symmetric monoidal category (C,,1)(C,\otimes,1) which supplies commutative comonoids, i.e. in which every object XX is equipped with the structure of a commutative internal comonoid compatible with the tensor product.

The comultiplication and counit maps are usually denoted by copy:XXXcopy: X \to X \otimes X and del:X1del: X\to 1.

Let’s now restate the definition in a more intuitive, mathematically equivalent way, by making use of string diagrams.

String diagrams for monoidal categories

It is useful to introduce a string diagram notation for Markov categories, which is used throughout the literature, and which makes stochastic interactions particularly intuitive.

First of all, we represent a morphism f:XYf:X\to Y in a monoidal category as follows, which we read from bottom to top:

Notice that, across the literature, convention vary (for example, also left-to-right is used.) When domain and codomain are clear, we will omit them:

We represent the object XX, as well as its identity morphism, simply as a wire:

The (sequential) composition of morphisms f:XYf:X\to Y and g:YZg:Y\to Z is drawn as follows:

The tensor product of the monoidal category, or parallel composition, is represented as follows. First of all, given objects AA, BB, XX and YY, a morphism h:ABXYh:A\otimes B\to X\otimes Y is depicted as a box with multiple input and output wires:

Now given f:AXf:A\to X and g:BYg:B\to Y, the parallel composition fg:ABXYf\otimes g:A\otimes B\to X\otimes Y is represented by juxtaposition:

This highlights that this situation is non-signalling: here, XX depends on AA only, and not on BB, while in the generic case of hh above, XX may depend both on AA and BB. This will be particularly relevant when we discuss stochastic interactions (see below).

Similarly, one can form the product of three or more objects and morphisms. Note that in the diagrams, the product is strictly associative: string diagrams, rather than the original category, technically represent a strict monoidal category which is monoidally equivalent? to the original one. (Such a category always exists thanks to the coherence theorem.)

In a similar vein, the monoidal unit is usually not appearing in string diagrams, whenever this does not cause confusion. This way, for example, the diagram

represents the object XX, but also XIX\otimes I, IXI\otimes X, and so on.

Particular importance is given to morphisms from the monoidal unit, in the form p:IXp:I\to X, and which here we call states, and represent as follows.

These will model probabilistic states, such as probability measures (see below). In the literature, states are also depicted as triangles.

To conclude this section, the braiding of the monoidal category is represented as follows:

The unitors and associators are similarly omitted from the diagrams, except when it causes ambiguity.

Since we work in a symmetric monoidal category, the braiding is its own inverse.

Commutative comonoids

Let XX be an object of a symmetric monoidal category. A commutative comonoid structure on XX consists first of all of maps XXXX\mapsto X\otimes X and XIX\to I, called the comultiplication and counit, or copy and discard. For the purposes of Markov categories, we denote them by copy:XXXcopy: X \to X \otimes X and del:X1del: X\to 1

In terms of string diagrams, we represent these maps as follows: which we can think of as “copying the state of XX” and “discarding it”.

These maps need to satisfy the following conditions (commutativity, counitality, coassociativity):

which we can interpret as the facts that

  • The two copies are interchangeable;
  • Copying once and discarding one of the two copies is the same as doing nothing;
  • Copying once and then copying one of the copies is the same as copying the other copy.

In other words, these conditions assure that the interpretation as “copying and discarding information” is consistent.

Markov and CD categories

A copy-discard (CD) category or garbage-share (GS) monoidal category is a symmetric monoidal category where

  • Every object is equipped with a distinguished commutative comonoid structure;

  • The comonoid structures are compatible with the tensor products in the following ways.

First for all, for all objects XX and YY,

copy XY=(id Xb Y,Xid Y)(copy Xcopy Y), copy_{X\otimes Y} \; =\; (id_X\otimes b_{Y,X} \otimes id_Y) ( copy_X \otimes copy_Y ),

where bb denotes the braiding. In terms of string diagrams:

Similarly, we also have the following conditions (recall that the monoidal unit, unitors and associators are not depicted):

A Markov category or affine CD category is a CD category where any of these equivalent conditions hold:

  • The monoidal unit II is a terminal object (i.e. the category is semicartesian? monoidal);
  • The discard (counit) maps del:XIdel:X\to I are part of a natural transformation idIid\Rightarrow I;
  • For every morphism f:XYf:X\to Y, we have that del Yf=del Xdel_Y \circ f = \del_X.

The last equation can be written in terms of string diagrams as follows:

This condition can be interpreted as follows:

  • From the point of view of information theory, it says that processing any information and discarding the result is the same as discarding the input altogether.
  • From the point of view of probability theory, it says that probabilities are normalized to one. (More on that below.)

Notice also that if we assume that the monoidal unit is terminal, all compatibility conditions between the comonoid structures and the tensor product except the first one are satisfied.

Examples

Let’s start with a simple example where randommess is not involved, the category of sets. We then look at more interesting cases where randomness is involved, FinStoch and Stoch.

See also the detailed list below.

The copy-discard structure of Set

Consider the category Set of sets and functions, with its usual cartesian monoidal structure (given by the cartesian product of sets).

Every set XX is equipped with canonical maps

which can be interpreted as “copying the value of xx” and “discarding it” (we denote by ** the unique element of the terminal set 11). It can be easily checked that these maps make XX a commutative comonoid, and that these maps are compatible with the tensor product. Since the monoidal unit is terminal, we have a Markov category.

The same can be done in every cartesian monoidal category:

Proposition

In a cartesian monoidal category every object is equipped with a unique commutative comonoid structure, defined in terms of the universal property of the product.

In cartesian monoidal categories, no randomness is present (as defined below), and so the Markov structure is purely about manipulating information deterministically: information can be copied, discarded, and processed deterministically (through arbitrary morphisms).

Remark

Note that, while the copy-discard structure of Set and of other cartesian monoidal categories may seem trivial, it is not. The copy map is implicitly used whenever we are reusing a variable in an expression. For example, every function f:X×XYf:X\times X\to Y also defines a function f:XYf':X\to Y by just “pluggin in xx twice”:

f(x)=f(x,x). f'(x) = f(x,x) .

What happened under the hood is the following composition: In string diagrams, we write this as follows.

Similarly, the discard map is used whenever we are not using a variable in an expression. For example, every function g:XYg:X\to Y also defines a function g:X×AYg':X\times A\to Y which “does not really depend on AA”:

g(x,a)=g(x). g'(x,a) = g(x) .

What happened under the hood is the following composition, which in string diagrams we write as follows.

In particular, every element yYy\in Y can be seen as a constant function XYX\to Y which does not really depend on XX.

As we will see, once probabilities are involved, copying and discarding will require particular care, due to the presence of possible correlation. Markov and CD categories allow us to keep track of these operations graphically, avoiding many possible pitfalls, and even using them to our advantage.

The category FinStoch of stochastic matrices

Let’s now look at a category where randomness is involved. Let XX and YY be finite sets. Recall that a stochastic matrix, or stochastic map, between them is a function such that

(1) yYf(y|x)=1 \sum_{y\in Y} f(y|x) = 1

for every xXx\in X.

We can interpret the number f(y|x)f(y|x) as the probability of transitioning from xx to yy, in a stochastic way. Note that in particular, every deterministic function defines a stochastic matrix: if f:XYf:X\to Y is an ordinary function, we can write a stochastic matrix δ f\delta_f from XX to YY by saying that δ f(y|x)=1\delta_f(y|x)=1 if y=f(x)y=f(x), and 00 otherwise. Therefore these random transitions include the deterministic ones as a special case.

See stochastic map for more information, and for how they compose.

Similarly to what we did for Set, we can assign copy and discard maps to our object, which intuitively copy and discard information deterministically. First of all, as monoidal structure we again take the cartesian product of sets. (In this category, it is not a categorical product, but it’s still a monoidal one.)

The tensor product of morphisms f:AXf:A\to X and g:BYg:B\to Y

is the stochastic matrix ABXYA\otimes B\to X\otimes Y of entries

(fg)(x,y|a,b)=f(x|a)g(y|b), (f\otimes g) (x,y|a,b) = f(x|a) \, g(y|b) ,

which says that the two transitions are happening independently. In particular, for the case of states, the product of p:1Xp:1\to X and q:1Xq:1\to X,

is the product probability

(pq)(x,y)=p(x)q(y), (p\otimes q)(x,y) = p(x)\,q(y) ,

which as usual denotes independent events.

We define the copy and discard maps as the maps δ copy\delta_copy and δ del\delta_del induced by the ones of Set (see the example above). Explicitly, the stochastic map copy:XXXcopy:X\to X\otimes X is defined by

copy(x,x|x)={1 x=x=x; 0 otherwise. copy(x',x''|x) = \begin{cases} 1 & x'=x''=x ; \\ 0 & otherwise. \end{cases}

The stochastic map del:X1del:X\to 1 satisfies

del(*|x)=1, del(*|x) = 1 ,

where ** is the unique element of the terminal set 11.

Again, one can check that these maps make XX a commutative comonoid, and that these maps are compatible with the tensor product.

The reason why we have a Markov category and not just a CD category is exactly the normalization condition (1): given a stochastic map f:XYf:X\to Y, the condition del Yf=del Xdel_Y\circ f = del_X reads exactly as

(del Yf)(*|x)= yYdel Y(*|y)f(y|x)= yY1f(y|x)=1=del X(*|x) (del_Y\circ f) (*|x) = \sum_{y\in Y} del_Y(*|y) \, f(y|x) = \sum_{y\in Y} 1\, f(y|x) = 1 = \del_X(*|x)

for every xXx\in X.

The category Stoch of Markov kernels

Recall that given measurable spaces (X,𝒜)(X,\mathcal{A}) and (Y,)(Y,\mathcal{B}), a Markov kernel between them is a function which is measurable in xx and which is a probability measure in BB. We can view it as a generalization of a stochastic matrix outside the discrete case, where the number k(B|x)k(B|x) is the probability of transitioning into BB if we are in xx. (See Markov kernel for more on this.)

The category Stoch of measurable spaces and Markov kernels has a Markov category structure similar to the one of FinStoch. As for FinStoch, as monoidal structure we take the one induced by the cartesian product of measurable spaces. Given measurable spaces (X,𝒜)(X,\mathcal{A}) and (Y,)(Y,\mathcal{B}) (or more briefly, XX and YY), we take XYX\otimes Y to be the cartesian product of sets X×YX\times Y, equipped with the tensor product sigma-algebra 𝒜\mathcal{A}\otimes\mathcal{B}, the one generated by the sets A×BX×YA\times B\subseteq X\times Y for A𝒜A\in\mathcal{A} and BB\in\mathcal{B}. Just as in FinStoch, on morphisms we have the product of kernels, and the same is true for probability measures.

The copy and discard maps XXXX\to X\otimes X and X1X\to 1, just as for FinStoch (see above), are inherited from the ones of Set. Explicitly, for each object (X,𝒜)(X,\mathcal{A}), we have that for all xXx\in X and for all A,A𝒜A,A'\in\mathcal{A},

copy(A×A|x)=δ(AA|x)={1 xAA; 0 xAA copy(A\times A'|x) = \delta(A\cap A'|x) = \begin{cases} 1 & x\in A\cap A' ; \\ 0 & x\notin A\cap A' \end{cases}

and

del({*}|x)=1, del(\{*\}|x) = 1 ,

where ** is the unique element of the terminal measurable space 11, the one-point space.

These again equip each object with a commutative comonoid structure compatible with the tensor product.

Just as for FinStoch, the fact that 11 is terminal in Stoch reflects the fact that Markov kernels are normalized: for every kernel f:XYf:X\to Y and for every xXx\in X,

(del Yf)({*}|x)= Ydel Y({*}|y)f(dy|x)= Y1f(dy|x)=1=del X({*}|x). (del_Y\circ f) (\{*\}|x) = \int_Y del_Y(\{*\}|y) \, f(dy|x) = \int_Y 1\, f(dy|x) = 1 = \del_X(\{*\}|x) .

Kleisli categories of probability monads

As explained in probability monad, often monads can be used to “add extra morphisms” to a category, to model behaviors that were not present in the original category. These new morphisms form the Kleisli category, and are used, for example, in computer science to model computational effects. In particular, monads can be used to add stochastic maps to deterministic ones (more at probability monad), and very often, the resulting Kleisli category is a Markov category.

Let CC be a cartesian monoidal category, which we can interpret as encoding “no randomness” (more on that below). For example, Set. As we have seen above, CC has a canonical copy and discard structure.

The Kleisli category of a commutative monad (or monoidal monad) PP on CC, for example of most probability monads, has a canonical copy-discard structure inherited from the one of CC. First of all, for every monoidal monad, the Kleisli category is canonically monoidal again. To define a copy-discard structure, we can use the one of CC and the unit of the monad:

If moreover we have that PP preserves the monoidal unit (P11P1\cong 1, i.e. the monad is affine), we have that 11 is terminal in the Kleisli category as well, and so we have a Markov category.

Several Markov categories are obtained in this way, where the randomness is, in some sense, generated by the monad. For example, Stoch is the Kleisli category of the Giry monad.

Core concepts

States, channels, joints, marginals

The Markov category structures can be interpreted in terms of probability, statistics, and information theory in these ways.

First of all, a state on XX

has the interpretation of a probability measure on XX, or in terms of information theory, a noisy source with alphabet XX.

This reflect the intuition in our main examples:

  • In FinStoch, a state is exactly a n×1n\times 1 stochastic matrix, i.e.~a column vector whose entries sum to one. In other words, it is exactly a (discrete) probability distribution on XX, of entries p(x)p(x).
  • In Stoch, a state es exactly a probability measure on a measurable space (X,𝒜)(X,\mathcal{A}), of values p(A)p(A) for A𝒜A\in\mathcal{A}.
  • In Set (no randomness!), a state is exactly a function 1X1\to X, i.e. a point xx of XX.

Let’s now turn to morphisms:

As we have already remarked, these can be thought of as functions with possibly random outcomes, or, from the point of view of information theory potentially noisy channels. In our basic examples:

Morphisms in a Markov category have an additional useful interpretation: a morphism f:XYf:X\to Y can also be seen as a state on YY which depends by a parameter XX. Indeed:

  • In FinStoch, a stochastic map f:XYf:X\to Y, of entries f(y|x)f(y|x), can be seen as a probability distribution on YY which depends by a parameter xXx\in X;
  • In Stoch, a Markov kernel f:(X,𝒜)(Y,)f:(X,\mathcal{A})\to (Y,\mathcal{B}) of values f(B|x)f(B|x) can be seen as a probability measure on YY which depends measurably on xXx\in X, sometimes called a regular conditional distribution.
  • In Set (no randomness!), a function f:XYf:X\to Y can be seen as a point of YY parametrized by XX.

This idea can be seen as a stochastic version of generalized elements.

Let’s now turn to slightly more complex objects. States in the form p:IXYp:I\to X\otimes Y are called joint states.

This has the interpretation of “shared randomness between XX and YY, possibly with correlation or other interactions”.

  • In FinStoch, a joint state on XYX\otimes Y is a joint probability distribution on X×YX\times Y, of entries p(x,y)p(x,y).
  • In Stoch, a joint state on XYX\otimes Y is a joint probability measure on (X×Y,𝒜)(X\times Y,\mathcal{A}\otimes\mathcal{B}), specified uniquely by its values p(A×B)p(A\times B) for A𝒜A\in\mathcal{A} and BB\in\mathcal{B}.
  • In Set, a joint state on XYX\otimes Y is an ordered pair (x,y)(x,y).

Given a joint state pp on XYX\otimes Y, its marginals are the states p Xp_X and p Yp_Y, on XX and YY respectively, given by discarding the unobserved variable:

This formalizes both the idea of “not observing” as well as the one of averaging or integrating over something:

  • In FinStoch, the marginal distribution p Xp_X on XX formed by the joint distribution pp on X×YX\times Y is given by summing over YY:
    p X(x)= yYp(x,y) p_X(x) = \sum_{y\in Y} p(x,y)
  • In Stoch, the marginal distribution p Xp_X on (X,𝒜)(X,\mathcal{A}) formed by the joint distribution pp on X×YX\times Y is given by integrating over YY:
    p X(A)=p(A×Y)= Y1 A(x)p(dxdy) p_X(A) = p(A\times Y) = \int_Y 1_A(x)\,p(d x\,d y)

    for all A𝒜A\in\mathcal{A}.

  • In Set, the marginal state xXx\in X formed by the joint state (x,y)XY(x,y)\in X\otimes Y is simply given by discarding the second component.

Similarly, we call a joint morphism a morphism in the form p:AXYp:A\to X\otimes Y.

We can view this equivalently as a joint state which depends on a parameter AA. We form its marginal morphisms p X:AXp_X:A\to X and p X:AYp_X:A\to Y by discarding the unobserved variables:

This way, also the marginals will depend on the parameter AA.

Stochastic independence

A joint state pp on XYX\otimes Y is said to exhibit independence of XX and YY if and only if the following equation holds.

In other words, pp denotes independendence if and only if it is the product of its marginals. In particular:

  • In FinStoch, the joint distribution p(x,y)p(x,y) denotes independence if it is in the form p X(x)p Y(y)p_X(x)\,p_Y(y) ;
  • In Stoch, pp on (X×Y,𝒜)(X\times Y, \mathcal{A}\otimes\mathcal{B}) denotes independence if for every A𝒜A\in\mathcal{A} and BB\in\mathcal{B}, p(A×B)=p X(A)p Y(B)p(A\times B) = p_X(A)\,p_Y(B);
  • In Set, every joint state (x,y)(x,y) is trivially the product of its marginals xx and yy: no nontrivial correlation can happen when there is no randomness (see Proposition below).

We can interpret the string diagram as follows: when pp denotes independence of XX and YY, then pp “splits” and leaves them disconnected. This denotes the fact that no information flow from XX to YY is possible, i.e. observing the value of XX does not increase our knowledge over YY. For example, knowing the last name of a person does not tell us anything about their age (approximately). Instead, in case of correlation (including nonlinear), knowing XX can tell us something about YY: for example, knowing the height of a person can tell us something (not everything) about their age. In that case, the diagram remains connected. This correspondence between inferential information flow and topological connectedness of the string diagram is not a coincidence, it holds in general and is very deep: Markov categories have been proven to be probabilistic graphical models? (Fritz-Klingler’23).

More generally, a joint morphism p:WXYp:W\to X\otimes Y is said to exhibit conditional independence of XX and YY given WW if and only if the following holds.

This can be interpreted as the fact that XX and YY are independent for each choice of fixed input WW. For example,

  • In FinStoch, this says exactly that pp is in the form

    p(x,y|w)=p X(x|w)p Y(y|w), p(x,y|w) = p_X(x|w)\,p_Y(y|w) ,

    which recovers the usual conditional independence in the discrete case.

  • In Stoch, this says exactly that for all measurable sets AXA\subseteq X and BYB\subseteq Y,

    p(A×B|w)=p X(A|w)p Y(B|w), p(A\times B|w) = p_X(A|w)\,p_Y(B|w) ,

    which recovers the usual conditional independence in the measure-theoretic case.

  • In Set, once again conditional independence always holds.

We can interpret the string diagram as follows. In the case of conditional independence, it is not true anymore, in general, that XX and YY are fully disconnected. However, they are connected only through WW. That is, observing XX may tell us something about YY, but none of that information is new to us if we already know WW.

For example, XX could be the number of mathematicians in a random city, and YY could be annual consumption of coffee in the same city. We might observe a correlation between the two. However, both the amount of mathematicians and the consumption of coffee correlate with the population size of the city, WW. It could be that given WW, that is, only looking at cities with similar population size, we see no real correlation between the amount of coffee consumed and the number of mathematicians. In this case,

  • If we don’t know the population size, knowing the coffee consumption may help us estimating the number of mathematicians;
  • If we already know the population size, knowing the coffee consumption does not give us a more accurate estimate about the number of mathematicians. One says that the knowledge of WW explains away the correlation between XX and YY.

(This is just an example, real data about mathematicians and coffee may differ.)

Once again, the inferential information flow is perfectly reflected by the topological properties of the string diagram. This is one of the main assets of Markov categories: no matter how complex the model is, there always a perfect correspondence between the diagram topology and the interaction structure.

Let’s now consider the case of Set: why is every morphism displaying conditional independence? The reason is that if p:WX×Yp:W\to X\times Y is a (deterministic) function, then knowing WW will already tell us everything about YY (and XX). Therefore there is no additional knowledge than observing XX can give us, if we know WW. (See also Proposition below.)

Let’s now make precise the idea of determinism.

Deterministic morphisms

We have said that Markov categories are about describing morphisms which we can think of as “behaving randomly”. Let’s make this precise, or rather, let’s define precisely what it means to have no randomness.

A state p:IXp:I\to X in a Markov category is called deterministic if and only if copyp=pp\mathrm{copy}\circ p = p\otimes p, i.e. if the following holds.

This reflect the traditional idea of probability theory of a zero-one law, a random variable which is independent of itself. In particular:

  • In FinStoch, a distribution pp is deterministic if and only if for all x,xXx,x'\in X,

    p(x)δ x,x=p(x)p(x), p(x)\,\delta_{x,x'} = p(x)\,p(x') ,

    i.e. if and only if for all xXx\in X,

    p(x)=p(x) 2. p(x) = p(x)^2 .

    This means precisely that for all xx, either p(x)=0p(x)=0 or p(x)=1p(x)=1, i.e. that the probability is concentrated on a single point.

  • In Stoch, a similar calculation shows that pp is deterministic if and only if for every measurable subset AXA\subseteq X, p(A)=0p(A)=0 or p(A)=1p(A)=1. That is, we are (almost) certain about which events happen and which do not. On well-behaved measurable spaces, such as standard Borel spaces, this means exactly that pp is a Dirac delta, a measure concentrated at a single point.

  • In Set, every state (i.e. single point) is deterministic.

More generally, a morphism f:XYf:X\to Y in a Markov category is called deterministic if it commutes with the copy map,

copyf=(ff)copy. copy \circ f \;=\; (f\otimes f) \circ copy .

In terms of string diagrams:

This formalizes the idea of a deterministic transition, or function. In particular:

  • In FinStoch, by means of a similar calculation as above, a stochastic matrix k:XYk:X\to Y is deterministic if and only if for all x,yXx,y\in X, k(y|x)=0k(y|x)=0 or k(y|x)=1k(y|x)=1. This means precisely that kk is in the form δ f\delta_f for a set-theoretic function ff (see here).

  • In Stoch, a Markov kernel k:XYk:X\to Y is deterministic if and only if for every xXx\in X and every measurable BYB\subseteq Y, k(B|x)=0k(B|x)=0 or k(B|x)=1k(B|x)=1. On well-behaved measurable spaces, such as standard Borel spaces, this means exactly that kk is in the form δ f\delta_f for a measurable function ff (see here).

  • In Set, every morphism is deterministic: for every xXx\in X both side of the equations yield

    (f(x),f(x))=(f(x),f(x)), \big(f(x),f(x)\big) = \big(f(x),f(x)\big) ,

    where on the left we have applied ff to xx and copied the result, while on the right we have copied xx and applied ff to both copies.

Let’s now see where the generic case differs from Set, and further motivate this definition. Suppose that ff is a “random” function between real numbers, which adds to the input the result of the roll of a die. Given a number xx, we can roll a die, add the resulting value (say, nn) to xx, and then copy the result, to get (x+n,x+n)(x+n,x+n). Or, we could copy the value xx, roll two dice (or roll the die twice), and add the two resulting values (say, mm and nn) to the two copies of xx, obtaining (x+m,x+n)(x+m,x+n). The two results are likely to differ, and in particular will follow different distributions: one perfectly correlated, one with conditional independence of the outputs given the inputs. One can take this as a definition of randomness: it’s a process that may give a different result if you do it twice. Or equivalently, that copying the information before or after the process has taken place gives different results.

A deterministic morphism is one that does not exhibit this behavior, i.e. that commutes with the copy map.

Deterministic morphisms are closed under composition. One usually denotes the wide subcategory of deterministic morphisms of a Markov category CC by C detC_det.

We have seen that in a Markov category, every morphism commutes with the discard map. Therefore, from the category-theoretic perspective, we can equivalently see deterministic morphisms exactly as the morphisms of comonoids.

The following statement makes precise the idea, sketched in the previous sections, that no inference is possible without randomness:

Proposition

For a Markov category, the following conditions are equivalent.

  1. Every morphism is deterministic;
  2. The copy maps are natural;
  3. The category is cartesian monoidal;
  4. Every morphism displays conditional independence of its outputs given its inputs.

Therefore, we can consider Markov categories as a generalization of cartesian monoidal categories to the case where “randomness is present”.

(However, using Markov categories to study cartesian categories, while technically possible, usually yields trivial results, since all the correlations are trivial and no inference is possible.)

Almost sure equality

Another concept of probability and measure theory that Markov categories formalize is almost sure equality. The fact that certain propositions, in probability theory, are only true up to a null set, is sometimes considered an unavoidable but confusing technicality of the theory. In Markov categories this concept is equally inevitable, but it acquires a precise conceptual meaning, and has very deep-reaching consequences.

Let f,g:XYf,g:X\to Y be morphisms in a Markov category, and let pp be a state on XX. We say that ff and gg are pp-almost surely equal if and only if the following equation holds.

Similar definitions can be given even if ff and pp depend on additional parameters, see (Fritz et al’23) for the details.

In Stoch and FinStoch, this definition recovers the traditional notions of almost sure equality:

  • In FinStoch, the equation reads

    p(x)f(y|x)=p(x)g(y|x) p(x)\,f(y|x) = p(x)\,g(y|x)

    for all xXx\in X and yYy\in Y. This means exactly that for all yYy\in Y, and for all those xXx\in X with p(x)0p(x) \ne 0, f(y|x)=g(y|x)f(y|x)=g(y|x). That is, ff and gg agree on the support of pp.

  • In Stoch, the equation reads

    Af(B|x)p(dx)= Ag(B|x)p(dx) \int_A f(B|x)\,p(d x) = \int_A g(B|x)\,p(d x)

    for all measurable AXA\subseteq X and BYB\subseteq Y. That is, for all BB, the quantities f(B|x)f(B|x) and g(B|x)g(B|x) agree on a measure-one subset of XX (which may depend on BB in general). This recovers the notion of almost sure equality for Markov kernels.

In the same vein, given a morphism f:XYf:X\to Y and a state pp on XX we say that ff is pp-almost surely deterministic if and only if the determinism equation for ff holds pp-almost surely. In string diagrams:

  • In FinStoch, this says exactly that f(y|x)f(y|x) is either zero or one for all xx such that p(x)0p(x) \ne 0;

  • In Stoch, this says that for all measurable BYB\subseteq Y, for xx on a measure-one set, f(B|x)f(B|x) is either zero or one.

Conditionals

One of the most powerful concepts in the theory of Markov categories is the idea of conditioning. We saw, when talking about stochastic independence, that given a joint state on X×YX\times Y,

one can try, from observing XX, to infer something about YY. This idea of information flowing from XX to YY, often, can be modelled as a morphism XYX\to Y, in general not deterministic, since XX may not determine YY completely. Let’s define this precisely.

Let pp be a joint state on XYX\otimes Y. A conditional distribution on YY given XX is a morphism p| X:XYp|_X:X\to Y such that the following equation holds:

  • In FinStoch, this means a stochastic map of entries p| X(y|x)p|_X(y|x) such that for all xXx\in X and yYy\in Y,

    p(x,y)=p X(x)p| X(y|x). p(x,y) = p_X(x)\,p|_X(y|x) .

    (Note the difference between the marginal, p Xp_X, and the conditional, p| Xp|_X.) This is the usual notion of conditional distribution in the discrete case, usually obtained by

    p| X(y|x)=p(x,y)p(x) p|_X(y|x) = \frac{p(x,y)}{p(x)}

    when p(x)0p(x)\ne 0, and arbitrary otherwise.

  • In Stoch, this means a Markov kernel p| X:XYp|_X:X\to Y such that for all measurable AXA\subseteq X and BYB\subseteq Y,

    p(AB)= Ap| X(B|x)p X(dx). p(A\otimes B) = \int_A p|_X(B|x)\,p_X(d x) .

    This is known in probability theory as a regular conditional distribution.

Note that, if a conditional distribution exists, it is unique only up to p Xp_X-almost sure equality. Moreover,

It is helpful to think of a conditional distribution as of turning the output XX into an input. In terms of string diagram, it is sometimes helpful to represent this as follows,

which says that “in the inside of p| Xp|_X, a wire has been bent to turn an output into an input”.

A particularly important conditional distribution is given by Bayesian inverses.

Let pp be a state on XX and let f:XYf:X\to Y be a morphisms. A Bayesian inverse of ff with respect to pp is a morphism f :YXf^\dagger:Y\to X such that the following equation holds, where q=fpq=f\circ p.

Note that this is the same as a conditional given YY for the joint on the right-hand side. The equation can be seen as a version of Bayes' formula:

  • In FinStoch, it says that

    q(y)f p (x|y)=p(x)f(y|x), q(y) \,f^\dagger_p(x|y) = p(x) \,f(y|x) ,

    which is in the usual form

    P(y)P(x|y)=P(x)P(y|x), P(y) \, P(x|y) = P(x) \, P(y|x) ,

    which recovers the usual Bayesian version for stochastic maps.

  • In Stoch, similarly, it says that for all measurable AXA\subseteq X and BYB\subseteq Y,

    Bf p (A|y)q(dy)= Af(B|x)p(dx). \int_B f^\dagger_p(A|y) \, q(d y) = \int_A f(B|x)\,p(d x) .

    This recovers the Bayesian inversion of Markov kernels.

Once again, Bayesian inversions exist for all standard Borel spaces, and in general, if they exist, they are unique qq-almost surely.

To conclude this section, let’s give the most general definition of conditional. Let h:AXYh:A\to X\otimes Y be a joint morphism. A conditional of hh given XX is a morphism h| X:XAYh|_X:X\otimes A\to Y such that the following equation holds.

A Markov category is said to have conditionals if every conditional in the form above exists.

Both FinStoch and BorelStoch have conditionals.

Further structures and properties

Representable Markov categories

As we explained above, a probability monad often gives rise to a Markov category.

Conversely, consider a Markov category CC.

  • The category C detC_det of deterministic morphisms has no randomness;
  • There may be have a monad on C detC_\det which encodes randomness (see probability monad);
  • The Markov category CC may be the Kleisli category of this monad.

Let’s make this precise in terms of a universal property.

Let XX be an object in a Markov category CC. A distribution object for XX is an object, which we denote by PXP X, together with a bijection

C det(A,PX)C(A,X) C_det(A,P X) \cong C(A,X)

for all objects AA, natural in AA against deterministic morphisms.

The object PXP X has the interpretation of being a space of probability measures over XX. Setting A=IA=I in the definition, we see in particular that the deterministic states (or “points”) of PXP X are exactly the (generic, random) states on XX. For Stoch, this says that the points of PXP X should be exactly the probability measures on XX. Indeed, in Stoch every object has a distribution object, given by the Giry monad.

More generally:

Theorem

For a Markov category CC, the following conditions are equivalent.

  1. Every object of CC has a distribution object.
  2. The inclusion functor C detCC_det\to C has a right adjoint (which we denote by PP).
  3. CC is isomorphic to the Kleisli category of a commutative affine monad on C detC_det.

A Markov category CC is called representable if any of the conditions of the theorem above holds.

The category FinStoch is not representable, since the objects are finite sets, and the set of distribution PXP X over a finite set XX (with more than two elements) would be infinite. It is however a subcategory of the Kleisli category of the distribution monad, of all sets and all stochastic maps between them.

By means of the Yoneda lemma, we can equivalently define a distribution object as an object PXP X together with a distinguished morphism PXXP X\to X, which we denote by “sampsamp”, such that the map is a bijection for all objects AA.

This map can be thought of as being the universal random map, in the sense that by the bijection above, every Kleisli morphism can be seen as the composition of sampsamp with a deterministic map.

From the point of view of probability theory, in Stoch, we can view this map PXXP X\to X as taking a distribution pPXp\in P X, and returning a random point of XX distributed according to pp. In other words, it is sampling from pp.

This also corresponds to the idea of sampling in probabilistic programming?, (see monad (in computer science)).

Kolmogorov products

Kolmogorov products are a way to equip a Markov category with an infinitary tensor product. The idea can be seen as an abstraction of the Kolmogorov extension theorem for probability measures.

Let CC be a Markov category, and let (X j) jJ(X_j)_{j\in J} be a family of objects. A Kolmogorov product of them is an infinite tensor product

X J= jJX j X_J=\bigotimes_{j\in J} X_j

such that all finite projections (marginalizations) X JX FX_J\to X_F, with FJF\subseteq J finite, are deterministic morphisms.

Using the universal property of limits, this says that a joint distribution on X JX_J is equivalently given by a family of joint distributions on all finite subsets of JJ, compatible with the restrictions.

Because of this, a Kolmogorov product on CC restricts to a cartesian product in the category C detC_det of deterministic morphisms.

Positivity and causality

Positivity and causality are two additional axioms for Markov categories, introduced in Fritz’20, which hold in categories of kernels such as Stoch.

A Markov category is called positive if and only if whenever two morphisms f:XYf:X\to Y and g:YZg:Y\to Z are such that gf:XZg\circ f:X\to Z is deterministic, then the following holds,

i.e. the resulting joint morphism on the left makes YY and ZZ conditionally independent given XX.

Equivalently, a Markov category is positive if and only if for every joint morphism h:AXYh:A\to X\otimes Y such that one of its marginals is deterministic, hh exhibits conditional independents of its outputs given its input.

In a positive Markov category, every isomorphism is deterministic (Fritz’20, Remark 11.28).

A Markov category is called causal if, whenever the following equation holds,

then the following, stronger condition also holds.

Theorem

Every Markov category with all conditionals is causal (Fritz’20, Proposition 11.34), and every causal Markov category is positive (Fritz et al’23, Theorem 2.24). Both converses are false.

The category Stoch is causal (but ot doesn’t have all conditionals).

The ProbStoch construction

The ProbStoch construction on a Markov category formalizes the idea of a category of probability spaces. Morphisms are identified when they agree almost surely, so that universal properties in that category tend to encode universal properties that hold “up to probability zero”. Moreover, if conditionals exist, the ProbStoch construction gives an abstraction of the category of couplings.

Let CC be a causal Markov category. The category ProbStoch(C)ProbStoch(C) is defined as follows.

  • Its objects are pairs (X,p)(X,p) where XX is an object of CC, and p:IXp:I\to X is a state on XX.
  • Its morphisms (X,p)(Y,q)(X,p)\to(Y,q) are equivalence classes of morphisms f:XYf:X\to Y of CC under pp-almost sure equality, and such that fp=qf\circ p=q.

The causality condition is required in order for composition to be well defined on the equivalence classes.

In general ProbStoch(C)ProbStoch(C) is not a Markov category, but only a semicartesian monoidal category.

Proposition

If a Markov category CC has all conditionals, ProbStoch(C)ProbStoch(C) is a dagger category, with the daggers given by Bayesian inverses.

For example, BorelStoch has all conditionals, and ProbStoch(BorelStoch)ProbStoch(BorelStoch) is the category Krn.

Quantum versions

(…)

History

In the history of Markov categories, and of categorical probability in general, it happened very often that a number of independent researchers developed similar ideas unaware of each other’s work.

The first study of a category of Markov kernels (here called Stoch) seems to be due to Lawvere, and recorded in unpublished notes (Lawvere’62, see also the explanation by Lawvere himself in this video, time 1:00:21.) The same category was studied independently by Chentsov (Chentsov’65), who wrote in Russian. It is likely that Lawvere and Chentsov remained unaware of each other’s work for several decades.

Giry (Giry’82) first described Stoch as the Kleisli category of the monad which brings her name. Shortly after, computer scientists such as Moggi and Plotkin started considering monads in the context of effectful programming (Moggi’89), with probabilistic computation as a special case (Jones-Plotkin’89). (More details in monad (in computer science).) This is also the time when a number of probability monads were introduced, see there for more.

The first definition of a garbage-share (a.k.a. copy-discard) category seems to be due to Gadducci (Gadducci’96), who in his PhD thesis introduced such a structure (for strict monoidal categories), with in mind applications to formal languages (in particular graph term rewriting). In subsequent papers together with Andrea Corradini some further properties of these construction were studied, including higher-categorical ones (Corradini-Gadducci’97, Corradini-Gadducci’99a, Corradini-Gadducci’99b, Corradini-Gadducci’02).

The first mention of Markov and CD-categories used for the purpose of probability seems to have first appeared in the work of Golubtsov (Golubtsov’99), who named the structure “category of information converters” or “information transformers”, likely unaware of Gadducci’s work, and first writing in Russian. In subsequent work (Golubtsov’02, Golubtsov-Mosaliuk’02, Golubtsov’04) these structures were connected to Kleisli categories, introducing what now we call a representable Markov category.

In parallel, and seemingly unaware of Golubtsov’s work, Coecke and Spekkens developed a string-diagrammatic framework for Bayesian inference, for the purpose of generalizing these concept from classical to quantum (Coecke-Spekkens’11). The resulting framework is somewhat different from the one of Markov categories, but very similar in aims and philosophy.

Once again seemingly unaware of Golubtsov’s work, Fong studied categorical structures to model Bayesian networks in his master’s thesis (Fong’12), in particular by means of equipping the objects of a category with comonoid structures.

Finally, the current form of Markov and copy-discard categories, which is also the one presented in this article, was developed in the works of Cho and Jacobs (Cho-Jacobs’19) and of Fritz (Fritz’20).

Detailed list

Category Probability monad Conditionals Positivity Kolmogorov products Further references
Stoch Giry monad on Meas No (Fritz'20, Example 11.3) Yes (Fritz'20, Example 11.25) Not in general Fritz'20
BorelStoch Giry monad on Pol Yes (Kallenberg '17, B-M'19) Yes (Fritz'20, Section 11) Countable (Fritz-Rischel'19) Fritz'20
FinStoch Not representable (closely related to the distribution monad) Yes (Fritz'20, Example 11.6) Yes (Fritz'20, Section 11) No Fritz'20
TychStoch probability monad on topological spaces (here, Tychonoff) No Yes (Fritz et al'23) ? Fritz et al'23
QBStoch probability monad on quasi-Borel spaces No (Sabok et al'20) No (Sabok et al'20) ? Fritz et al'23
Gauss ? Yes (Fritz'20) Yes (Fritz'20) No Stein et al.'23
FinSetMulti Finite nonempty power set monad? Yes (Fritz'20) Yes (Fritz'20) No (Fritz-Rischel'19) Stein et al.'23

(…to be expanded…)

See also

References

Main sources

The main definitions and results in this article can be found in

  • Tobias Fritz, A synthetic approach to Markov kernels, conditional independence and theorems on sufficient statistics, Advances of Mathematics 370, 2020. (arXiv:1908.07021)

and in the following reference (where Markov categories are called affine CD categories):

  • Kenta Cho, Bart Jacobs, Disintegration and Bayesian Inversion via String Diagrams, Mathematical Structures of Computer Science 29, 2019. (arXiv:1709.00322)

The graphical model (separation and d-separation) aspects of conditional independence for Markov categories have been studied in

  • Tobias Fritz and Andreas Klingler, The d-separation criterion in Categorical Probability, Journal of Machine Learning Research 24(46), 2023. (arXiv)

Representable Markov categories have been studied in

  • Tobias Fritz, Tomáš Gonda, Paolo Perrone, Eigil Fjeldgren Rischel, Representable Markov categories and comparison of statistical experiments in categorical probability, Theoretical Computer Science 961, 2023. (arXiv:2010.07416)

as well as in the following work, together with an in-depth treatment of positivity and causality:

  • Tobias Fritz, Tomáš Gonda, Nicholas Gauguin Houghton-Larsen, Paolo Perrone and Dario Stein, Dilations and information flow axioms in categorical probability. Mathematical Structures in Computer Science, 2023. (arXiv:2211.02507).

Kolmogorov products have been studied in

  • Tobias Fritz and Eigil Fjeldgren Rischel, Infinite products and zero-one laws in categorical probability, Compositionality 2(3) 2020. (arXiv:1912.02769)

Introductory material

Some of the introductory content on this page is taken from the following reference:

  • Paolo Perrone, Markov Categories and Entropy, IEEE Transactions on Information Theory 70(3), 2024. (arXiv)

Video lectures suitable for beginners:

  • Paolo Perrone, Tutorial on Markov categories, Applied Category Theory 2023, University of Maryland, USA. (YouTube)

  • Paolo Perrone, Universal properties in probability theory, Category Theory 2023, Louvain-la-Neuve, Belgium. (YouTube)

  • (in Spanish) Paolo Perrone, Categorical Probability and Information Theory, Universidad Carlos III Madrid, Spain (video here)

  • Eigil Fjeldgren Rischel, Introduction to Markov categories, Categorical Probability and Statistics Workshop, 2020. (YouTube)

Slides from talks:

  • Tobias Fritz, Categorical Probability and the de Finetti theorem, New York City Category Seminar, 2021. (pdf)

  • Tobias Fritz, Probability and Statistics as a Theory of Information Flow, 2020 (pdf)

  • Tobias Fritz, A synthetic introduction to probability and statistics, 2019 (pdf)

Historical references (see History):

  • W. Lawvere, The category of probabilistic mappings, ms. 12 pages, 1962

    (Lawvere Probability 1962)

  • N. N. Chentsov, The categories of mathematical statistics, Dokl. Akad. SSSR 164, 1965.

  • Michèle Giry, A categorical approach to probability theory, Categorical aspects of topology and analysis, Lecture notes in Mathematics 915, Springer, 1982.

  • Eugenio Moggi, Computational lambda-calculus and monads, Proceedings of LICS, 1989. (doi)

  • Claire Jones and Gordon Plotkin, A probabilistic powerdomain of evaluations, Proceedings of LICS, 1989. (doi)

  • Fabio Gadducci, On the Algebraic Approach to Concurrent Term Rewriting, PhD thesis, University of Pisa, 1996. (available here)

  • Andrea Corradini and Fabio Gadducci, A 2-categorical presentation of term graph rewriting. In CTCS, LNCS 1290, 1997.

  • Andrea Corradini and Fabio Gadducci, An algebraic presentation of term graphs, via gs-monoidal categories, Applied Categorical Structures 7(4), 1999

  • Andrea Corradini and Fabio Gadducci, Rewriting on cyclic structures: equivalence between the operational and the categorical description, RAIRO Theoretical Informatics and Applications 33(4-5), 1999.

  • Andrea Corradini and Fabio Gadducci, A functorial semantics for multi-algebras and partial algebras, with applications to syntax, Theoretical Computer Science 286(2), 2002.

  • Peter V. Golubtsov, Axiomatic description of categories of information converters, Problemy Peredachi Informatsii (Problems in Information Transmission) 35(3), 1999.

  • Peter V. Golubtsov, Monoidal Kleisli category as a background for information transformers theory. Information Theory and Information Processing 2, 2002.

  • Peter V. Golubtsov and Stepan S. Moskaliuk, Method of additional structures on the objects of a monoidal Kleisli category as a background for information transformers theory, Hadronic Journal 25(2), 2002.

  • Peter V. Golubtsov, Information transformers: category-theoretical structure, informativeness, decision-making problems, Hadronic Journal Supplements 19(3), 2004.

  • Bob Coecke and Robert W. Spekkens, Picturing classical and quantum Bayesian inference, Synthese, 186(3), 2012. (arXiv)

  • Brendan Fong, Causal theories: A categorical prespective on Bayesian networks, master thesis, University of Oxford, 2012. (arXiv)

Further references

  • Tobias Fritz, Tomáš Gonda, Paolo Perrone, De Finetti’s theorem in categorical probability. Journal of Stochastic Analysis, 2021. (arXiv:2105.02639)

  • Tobias Fritz, Wendong Liang, Free gs-monoidal categories and free Markov categories, Applied Categorical Structures 31, 21, 2023. (arXiv:2204.02284)

  • Sean Moss, Paolo Perrone, Probability monads with submonads of deterministic states, Proceedings of LICS, 2022. (arXiv:2204.07003)

  • Sean Moss and Paolo Perrone, A category-theoretic proof of the ergodic decomposition theorem, Ergodic Theory and Dynamical Systems, 2023. (arXiv:2207.07353)

  • Marcin Sabok, Sam Staton, Dario Stein, Michael Wolman, Probabilistic Programming Semantics for Name Generation, Proceedings POPL, 2021. (arXiv)

  • Elena Di Lavore and Mario Román, Evidential Decision Theory via Partial Markov Categories, Proceedings of LICS, 2023. (arXiv)

  • Bart Jacobs, Multinomial and Hypergeometric Distributions in Markov Categories, Proceedings of MFPS 2021. (arXiv)

  • Tobias Fritz, Fabio Gadducci, Davide Trotta, Andrea Corradini, From gs-monoidal to oplax-cartesian categories: constructions and functorial completeness, Applied Categorical Structures 31, 42, 2023. (arXiv)

  • Tobias Fritz, Fabio Gadducci, Paolo Perrone and Davide Trotta, Weakly Markov categories and weakly affine monads, Proceedings of CALCO, LIPIcs 10, 2023. (arXiv)

  • Dario Stein and Richard Samuelson, A Category for unifying Gaussian Probability and Nondeterminism, Proceedings of CALCO, LIPIcs 10, 2023. (arXiv)

  • Dario Stein and Sam Staton, Probabilistic Programming with Exact Conditions, JACM, 2023. (arXiv)

  • Tobias Fritz, Tomáš Gonda, Antonio Lorenzin, Paolo Perrone and Dario Stein, Absolute continuity, supports and idempotent splitting in categorical probability, 2023. (arXiv:2308.00651)

  • Dario Stein and Richard Samuelson, Towards a Compositional Framework for Convex Analysis (with Applications to Probability Theory), 2023. (arXiv)

  • Nathaniel Virgo, Unifilar Machines and the Adjoint Structure of Bayesian Filtering, Proceedings of ACT, 2023. (arXiv)

  • Noé Ensarguet and Paolo Perrone, Categorical probability spaces, ergodic decompositions, and transitions to equilibrium, 2023. (arXiv:2310.04267)

  • Tobias Fritz and Antonio Lorenzin, Involutive Markov categories and the quantum de Finetti theorem, 2023. (arXiv)

  • Tobias Fritz, Andreas Klingler, Drew McNeely, Areeb Shah-Mohammed and Yuwen Wang, Hidden Markov models and the Bayes filter in categorical probability, 2024. (arXiv)

  • Arthur J. Parzygnat, Inverses, disintegrations, and Bayesian inversion in quantum Markov categories, 2020. (arXiv)

  • Arthur J. Parzygnat and Benjamin P. Russo, A noncommutative Bayes theorem, Linear Algebra Applications 644, 2022. (arXiv)

  • Arthur J. Parzygnat, Conditional distributions for quantum systems, EPTCS 343, 2021. (arXiv)

  • James Fullwood and Arthur J. Parzygnat, On quantum states over time, Proceedings of the Royal Society A 478, 2022. (arXiv)

  • Arthur J. Parzygnat, Francesco Buscemi, Axioms for retrodiction: achieving time-reversal symmetry with a prior, Quantum 7(1013), 2023. arXiv

  • James Fullwood and Arthur J. Parzygnat, From time-reversal symmetry to quantum Bayes rules, PRX Quantum 4, 2023. (arXiv)

The research on Markov categories is still largely in progress, so this list will keep growing.

References about Markov kernels

  • Olaf Kallenberg, Random Measures, Theory and Applications, Springer, 2017.

  • Vladimir Bogachev and Il’ya Malofeev, Kantorovich problems and conditional measures depending on a parameter. (arXiv)

See also the references at Markov kernel.

category: probability

Last revised on February 20, 2024 at 17:16:28. See the history of this page for a list of all contributions to it.