# nLab Markov category

Contents

### Context

#### Measure and probability theory

measure theory

probability theory

## Applications

#### Monoidal categories

monoidal categories

With braiding

With duals for objects

With duals for morphisms

With traces

Closed structure

Special sorts of products

Semisimplicity

Morphisms

Internal monoids

Examples

Theorems

In higher category theory

# Contents

## Idea

The formalism of Markov categories and copy-discard categories is one of the most recent categorical approaches to probability theory.

Intuitively, a Markov category is a category where morphisms behave like “random functions”, for example, Markov kernels form such a category (hence the name). Canonical examples are Kleisli categories of probability monads. The formalism is however far more general.

Markov categories have a graphical formalism, based on string diagrams, that allows to easily keep track of the stochastic dependencies, as well as of randomness and determinism. Using this formalism one can interpret, prove and even generalize several results of probability and related fields (see Categorical probability – results for more details).

## Definitions

A concise definition, which we interpret below, can be given as follows: A Markov category is a semicartesian symmetric monoidal category $(C,\otimes,1)$ which supplies commutative comonoids, i.e. in which every object $X$ is equipped with the structure of a commutative internal comonoid compatible with the tensor product.

The comultiplication and counit maps are usually denoted by $copy: X \to X \otimes X$ and $del: X\to 1$.

Let’s now restate the definition in a more intuitive, mathematically equivalent way, by making use of string diagrams.

### String diagrams for monoidal categories

It is useful to introduce a string diagram notation for Markov categories, which is used throughout the literature, and which makes stochastic interactions particularly intuitive.

First of all, we represent a morphism $f:X\to Y$ in a monoidal category as follows, which we read from bottom to top:

Notice that, across the literature, convention vary (for example, also left-to-right is used.) When domain and codomain are clear, we will omit them:

We represent the object $X$, as well as its identity morphism, simply as a wire:

The (sequential) composition of morphisms $f:X\to Y$ and $g:Y\to Z$ is drawn as follows:

The tensor product of the monoidal category, or parallel composition, is represented as follows. First of all, given objects $A$, $B$, $X$ and $Y$, a morphism $h:A\otimes B\to X\otimes Y$ is depicted as a box with multiple input and output wires:

Now given $f:A\to X$ and $g:B\to Y$, the parallel composition $f\otimes g:A\otimes B\to X\otimes Y$ is represented by juxtaposition:

This highlights that this situation is non-signalling: here, $X$ depends on $A$ only, and not on $B$, while in the generic case of $h$ above, $X$ may depend both on $A$ and $B$. This will be particularly relevant when we discuss stochastic interactions (see below).

Similarly, one can form the product of three or more objects and morphisms. Note that in the diagrams, the product is strictly associative: string diagrams, rather than the original category, technically represent a strict monoidal category which is monoidally equivalent? to the original one. (Such a category always exists thanks to the coherence theorem.)

In a similar vein, the monoidal unit is usually not appearing in string diagrams, whenever this does not cause confusion. This way, for example, the diagram

represents the object $X$, but also $X\otimes I$, $I\otimes X$, and so on.

Particular importance is given to morphisms from the monoidal unit, in the form $p:I\to X$, and which here we call states, and represent as follows.

These will model probabilistic states, such as probability measures (see below). In the literature, states are also depicted as triangles.

To conclude this section, the braiding of the monoidal category is represented as follows:

The unitors and associators are similarly omitted from the diagrams, except when it causes ambiguity.

Since we work in a symmetric monoidal category, the braiding is its own inverse.

### Commutative comonoids

Let $X$ be an object of a symmetric monoidal category. A commutative comonoid structure on $X$ consists first of all of maps $X\mapsto X\otimes X$ and $X\to I$, called the comultiplication and counit, or copy and discard. For the purposes of Markov categories, we denote them by $copy: X \to X \otimes X$ and $del: X\to 1$

In terms of string diagrams, we represent these maps as follows: which we can think of as “copying the state of $X$” and “discarding it”.

These maps need to satisfy the following conditions (commutativity, counitality, coassociativity):

which we can interpret as the facts that

• The two copies are interchangeable;
• Copying once and discarding one of the two copies is the same as doing nothing;
• Copying once and then copying one of the copies is the same as copying the other copy.

In other words, these conditions assure that the interpretation as “copying and discarding information” is consistent.

### Markov and CD categories

A copy-discard (CD) category or garbage-share (GS) monoidal category is a symmetric monoidal category where

• Every object is equipped with a distinguished commutative comonoid structure;

• The comonoid structures are compatible with the tensor products in the following ways.

First for all, for all objects $X$ and $Y$,

$copy_{X\otimes Y} \; =\; (id_X\otimes b_{Y,X} \otimes id_Y) ( copy_X \otimes copy_Y ),$

where $b$ denotes the braiding. In terms of string diagrams:

Similarly, we also have the following conditions (recall that the monoidal unit, unitors and associators are not depicted):

A Markov category or affine CD category is a CD category where any of these equivalent conditions hold:

• The monoidal unit $I$ is a terminal object (i.e. the category is semicartesian monoidal);
• The discard (counit) maps $del:X\to I$ are part of a natural transformation $id\Rightarrow I$;
• For every morphism $f:X\to Y$, we have that $del_Y \circ f = \del_X$.

The last equation can be written in terms of string diagrams as follows:

This condition can be interpreted as follows:

• From the point of view of information theory, it says that processing any information and discarding the result is the same as discarding the input altogether.
• From the point of view of probability theory, it says that probabilities are normalized to one. (More on that below.)

Notice also that if we assume that the monoidal unit is terminal, all compatibility conditions between the comonoid structures and the tensor product except the first one are satisfied.

## Examples

Let’s start with a simple example where randommess is not involved, the category of sets. We then look at more interesting cases where randomness is involved, FinStoch and Stoch.

### The copy-discard structure of Set

Consider the category Set of sets and functions, with its usual cartesian monoidal structure (given by the cartesian product of sets).

Every set $X$ is equipped with canonical maps

which can be interpreted as “copying the value of $x$” and “discarding it” (we denote by $*$ the unique element of the terminal set $1$). It can be easily checked that these maps make $X$ a commutative comonoid, and that these maps are compatible with the tensor product. Since the monoidal unit is terminal, we have a Markov category.

The same can be done in every cartesian monoidal category:

###### Proposition

In a cartesian monoidal category every object is equipped with a unique commutative comonoid structure, defined in terms of the universal property of the product.

In cartesian monoidal categories, no randomness is present (as defined below), and so the Markov structure is purely about manipulating information deterministically: information can be copied, discarded, and processed deterministically (through arbitrary morphisms).

###### Remark

Note that, while the copy-discard structure of Set and of other cartesian monoidal categories may seem trivial, it is not. The copy map is implicitly used whenever we are reusing a variable in an expression. For example, every function $f:X\times X\to Y$ also defines a function $f':X\to Y$ by just “pluggin in $x$ twice”:

$f'(x) = f(x,x) .$

What happened under the hood is the following composition: In string diagrams, we write this as follows.

Similarly, the discard map is used whenever we are not using a variable in an expression. For example, every function $g:X\to Y$ also defines a function $g':X\times A\to Y$ which “does not really depend on $A$”:

$g'(x,a) = g(x) .$

What happened under the hood is the following composition, which in string diagrams we write as follows.

In particular, every element $y\in Y$ can be seen as a constant function $X\to Y$ which does not really depend on $X$.

As we will see, once probabilities are involved, copying and discarding will require particular care, due to the presence of possible correlation. Markov and CD categories allow us to keep track of these operations graphically, avoiding many possible pitfalls, and even using them to our advantage.

### The category FinStoch of stochastic matrices

Let’s now look at a category where randomness is involved. Let $X$ and $Y$ be finite sets. Recall that a stochastic matrix, or stochastic map, between them is a function such that

(1)$\sum_{y\in Y} f(y|x) = 1$

for every $x\in X$.

We can interpret the number $f(y|x)$ as the probability of transitioning from $x$ to $y$, in a stochastic way. Note that in particular, every deterministic function defines a stochastic matrix: if $f:X\to Y$ is an ordinary function, we can write a stochastic matrix $\delta_f$ from $X$ to $Y$ by saying that $\delta_f(y|x)=1$ if $y=f(x)$, and $0$ otherwise. Therefore these random transitions include the deterministic ones as a special case.

Similarly to what we did for Set, we can assign copy and discard maps to our object, which intuitively copy and discard information deterministically. First of all, as monoidal structure we again take the cartesian product of sets. (In this category, it is not a categorical product, but it’s still a monoidal one.)

The tensor product of morphisms $f:A\to X$ and $g:B\to Y$

is the stochastic matrix $A\otimes B\to X\otimes Y$ of entries

$(f\otimes g) (x,y|a,b) = f(x|a) \, g(y|b) ,$

which says that the two transitions are happening independently. In particular, for the case of states, the product of $p:1\to X$ and $q:1\to X$,

is the product probability

$(p\otimes q)(x,y) = p(x)\,q(y) ,$

which as usual denotes independent events.

We define the copy and discard maps as the maps $\delta_copy$ and $\delta_del$ induced by the ones of Set (see the example above). Explicitly, the stochastic map $copy:X\to X\otimes X$ is defined by

$copy(x',x''|x) = \begin{cases} 1 & x'=x''=x ; \\ 0 & otherwise. \end{cases}$

The stochastic map $del:X\to 1$ satisfies

$del(*|x) = 1 ,$

where $*$ is the unique element of the terminal set $1$.

Again, one can check that these maps make $X$ a commutative comonoid, and that these maps are compatible with the tensor product.

The reason why we have a Markov category and not just a CD category is exactly the normalization condition (1): given a stochastic map $f:X\to Y$, the condition $del_Y\circ f = del_X$ reads exactly as

$(del_Y\circ f) (*|x) = \sum_{y\in Y} del_Y(*|y) \, f(y|x) = \sum_{y\in Y} 1\, f(y|x) = 1 = \del_X(*|x)$

for every $x\in X$.

### The category Stoch of Markov kernels

Recall that given measurable spaces $(X,\mathcal{A})$ and $(Y,\mathcal{B})$, a Markov kernel between them is a function which is measurable in $x$ and which is a probability measure in $B$. We can view it as a generalization of a stochastic matrix outside the discrete case, where the number $k(B|x)$ is the probability of transitioning into $B$ if we are in $x$. (See Markov kernel for more on this.)

The category Stoch of measurable spaces and Markov kernels has a Markov category structure similar to the one of FinStoch. As for FinStoch, as monoidal structure we take the one induced by the cartesian product of measurable spaces. Given measurable spaces $(X,\mathcal{A})$ and $(Y,\mathcal{B})$ (or more briefly, $X$ and $Y$), we take $X\otimes Y$ to be the cartesian product of sets $X\times Y$, equipped with the tensor product sigma-algebra $\mathcal{A}\otimes\mathcal{B}$, the one generated by the sets $A\times B\subseteq X\times Y$ for $A\in\mathcal{A}$ and $B\in\mathcal{B}$. Just as in FinStoch, on morphisms we have the product of kernels, and the same is true for probability measures.

The copy and discard maps $X\to X\otimes X$ and $X\to 1$, just as for FinStoch (see above), are inherited from the ones of Set. Explicitly, for each object $(X,\mathcal{A})$, we have that for all $x\in X$ and for all $A,A'\in\mathcal{A}$,

$copy(A\times A'|x) = \delta(A\cap A'|x) = \begin{cases} 1 & x\in A\cap A' ; \\ 0 & x\notin A\cap A' \end{cases}$

and

$del(\{*\}|x) = 1 ,$

where $*$ is the unique element of the terminal measurable space $1$, the one-point space.

These again equip each object with a commutative comonoid structure compatible with the tensor product.

Just as for FinStoch, the fact that $1$ is terminal in Stoch reflects the fact that Markov kernels are normalized: for every kernel $f:X\to Y$ and for every $x\in X$,

$(del_Y\circ f) (\{*\}|x) = \int_Y del_Y(\{*\}|y) \, f(dy|x) = \int_Y 1\, f(dy|x) = 1 = \del_X(\{*\}|x) .$

### Kleisli categories of probability monads

As explained in probability monad, often monads can be used to “add extra morphisms” to a category, to model behaviors that were not present in the original category. These new morphisms form the Kleisli category, and are used, for example, in computer science to model computational effects. In particular, monads can be used to add stochastic maps to deterministic ones (more at probability monad), and very often, the resulting Kleisli category is a Markov category.

Let $C$ be a cartesian monoidal category, which we can interpret as encoding “no randomness” (more on that below). For example, Set. As we have seen above, $C$ has a canonical copy and discard structure.

The Kleisli category of a commutative monad (or monoidal monad) $P$ on $C$, for example of most probability monads, has a canonical copy-discard structure inherited from the one of $C$. First of all, for every monoidal monad, the Kleisli category is canonically monoidal again. To define a copy-discard structure, we can use the one of $C$ and the unit of the monad:

If moreover we have that $P$ preserves the monoidal unit ($P1\cong 1$, i.e. the monad is affine), we have that $1$ is terminal in the Kleisli category as well, and so we have a Markov category.

Several Markov categories are obtained in this way, where the randomness is, in some sense, generated by the monad. For example, Stoch is the Kleisli category of the Giry monad.

## Core concepts

### States, channels, joints, marginals

The Markov category structures can be interpreted in terms of probability, statistics, and information theory in these ways.

First of all, a state on $X$

has the interpretation of a probability measure on $X$, or in terms of information theory, a noisy source with alphabet $X$.

This reflect the intuition in our main examples:

• In FinStoch, a state is exactly a $n\times 1$ stochastic matrix, i.e.~a column vector whose entries sum to one. In other words, it is exactly a (discrete) probability distribution on $X$, of entries $p(x)$.
• In Stoch, a state es exactly a probability measure on a measurable space $(X,\mathcal{A})$, of values $p(A)$ for $A\in\mathcal{A}$.
• In Set (no randomness!), a state is exactly a function $1\to X$, i.e. a point $x$ of $X$.

Let’s now turn to morphisms:

As we have already remarked, these can be thought of as functions with possibly random outcomes, or, from the point of view of information theory potentially noisy channels. In our basic examples:

Morphisms in a Markov category have an additional useful interpretation: a morphism $f:X\to Y$ can also be seen as a state on $Y$ which depends by a parameter $X$. Indeed:

• In FinStoch, a stochastic map $f:X\to Y$, of entries $f(y|x)$, can be seen as a probability distribution on $Y$ which depends by a parameter $x\in X$;
• In Stoch, a Markov kernel $f:(X,\mathcal{A})\to (Y,\mathcal{B})$ of values $f(B|x)$ can be seen as a probability measure on $Y$ which depends measurably on $x\in X$, sometimes called a regular conditional distribution.
• In Set (no randomness!), a function $f:X\to Y$ can be seen as a point of $Y$ parametrized by $X$.

This idea can be seen as a stochastic version of generalized elements.

Let’s now turn to slightly more complex objects. States in the form $p:I\to X\otimes Y$ are called joint states.

This has the interpretation of “shared randomness between $X$ and $Y$, possibly with correlation or other interactions”.

• In FinStoch, a joint state on $X\otimes Y$ is a joint probability distribution on $X\times Y$, of entries $p(x,y)$.
• In Stoch, a joint state on $X\otimes Y$ is a joint probability measure on $(X\times Y,\mathcal{A}\otimes\mathcal{B})$, specified uniquely by its values $p(A\times B)$ for $A\in\mathcal{A}$ and $B\in\mathcal{B}$.
• In Set, a joint state on $X\otimes Y$ is an ordered pair $(x,y)$.

Given a joint state $p$ on $X\otimes Y$, its marginals are the states $p_X$ and $p_Y$, on $X$ and $Y$ respectively, given by discarding the unobserved variable:

This formalizes both the idea of “not observing” as well as the one of averaging or integrating over something:

• In FinStoch, the marginal distribution $p_X$ on $X$ formed by the joint distribution $p$ on $X\times Y$ is given by summing over $Y$:
$p_X(x) = \sum_{y\in Y} p(x,y)$
• In Stoch, the marginal distribution $p_X$ on $(X,\mathcal{A})$ formed by the joint distribution $p$ on $X\times Y$ is given by integrating over $Y$:
$p_X(A) = p(A\times Y) = \int_Y 1_A(x)\,p(d x\,d y)$

for all $A\in\mathcal{A}$.

• In Set, the marginal state $x\in X$ formed by the joint state $(x,y)\in X\otimes Y$ is simply given by discarding the second component.

Similarly, we call a joint morphism a morphism in the form $p:A\to X\otimes Y$.

We can view this equivalently as a joint state which depends on a parameter $A$. We form its marginal morphisms $p_X:A\to X$ and $p_X:A\to Y$ by discarding the unobserved variables:

This way, also the marginals will depend on the parameter $A$.

### Stochastic independence

A joint state $p$ on $X\otimes Y$ is said to exhibit independence of $X$ and $Y$ if and only if the following equation holds.

In other words, $p$ denotes independendence if and only if it is the product of its marginals. In particular:

• In FinStoch, the joint distribution $p(x,y)$ denotes independence if it is in the form $p_X(x)\,p_Y(y)$ ;
• In Stoch, $p$ on $(X\times Y, \mathcal{A}\otimes\mathcal{B})$ denotes independence if for every $A\in\mathcal{A}$ and $B\in\mathcal{B}$, $p(A\times B) = p_X(A)\,p_Y(B)$;
• In Set, every joint state $(x,y)$ is trivially the product of its marginals $x$ and $y$: no nontrivial correlation can happen when there is no randomness (see Proposition below).

We can interpret the string diagram as follows: when $p$ denotes independence of $X$ and $Y$, then $p$ “splits” and leaves them disconnected. This denotes the fact that no information flow from $X$ to $Y$ is possible, i.e. observing the value of $X$ does not increase our knowledge over $Y$. For example, knowing the last name of a person does not tell us anything about their age (approximately). Instead, in case of correlation (including nonlinear), knowing $X$ can tell us something about $Y$: for example, knowing the height of a person can tell us something (not everything) about their age. In that case, the diagram remains connected. This correspondence between inferential information flow and topological connectedness of the string diagram is not a coincidence, it holds in general and is very deep: Markov categories have been proven to be probabilistic graphical models? (Fritz-Klingler’23).

More generally, a joint morphism $p:W\to X\otimes Y$ is said to exhibit conditional independence of $X$ and $Y$ given $W$ if and only if the following holds.

This can be interpreted as the fact that $X$ and $Y$ are independent for each choice of fixed input $W$. For example,

• In FinStoch, this says exactly that $p$ is in the form

$p(x,y|w) = p_X(x|w)\,p_Y(y|w) ,$

which recovers the usual conditional independence in the discrete case.

• In Stoch, this says exactly that for all measurable sets $A\subseteq X$ and $B\subseteq Y$,

$p(A\times B|w) = p_X(A|w)\,p_Y(B|w) ,$

which recovers the usual conditional independence in the measure-theoretic case.

• In Set, once again conditional independence always holds.

We can interpret the string diagram as follows. In the case of conditional independence, it is not true anymore, in general, that $X$ and $Y$ are fully disconnected. However, they are connected only through $W$. That is, observing $X$ may tell us something about $Y$, but none of that information is new to us if we already know $W$.

For example, $X$ could be the number of mathematicians in a random city, and $Y$ could be annual consumption of coffee in the same city. We might observe a correlation between the two. However, both the amount of mathematicians and the consumption of coffee correlate with the population size of the city, $W$. It could be that given $W$, that is, only looking at cities with similar population size, we see no real correlation between the amount of coffee consumed and the number of mathematicians. In this case,

• If we don’t know the population size, knowing the coffee consumption may help us estimating the number of mathematicians;
• If we already know the population size, knowing the coffee consumption does not give us a more accurate estimate about the number of mathematicians. One says that the knowledge of $W$ explains away the correlation between $X$ and $Y$.

(This is just an example, real data about mathematicians and coffee may differ.)

Once again, the inferential information flow is perfectly reflected by the topological properties of the string diagram. This is one of the main assets of Markov categories: no matter how complex the model is, there always a perfect correspondence between the diagram topology and the interaction structure.

Let’s now consider the case of Set: why is every morphism displaying conditional independence? The reason is that if $p:W\to X\times Y$ is a (deterministic) function, then knowing $W$ will already tell us everything about $Y$ (and $X$). Therefore there is no additional knowledge than observing $X$ can give us, if we know $W$. (See also Proposition below.)

Let’s now make precise the idea of determinism.

### Deterministic morphisms

We have said that Markov categories are about describing morphisms which we can think of as “behaving randomly”. Let’s make this precise, or rather, let’s define precisely what it means to have no randomness.

A state $p:I\to X$ in a Markov category is called deterministic if and only if $\mathrm{copy}\circ p = p\otimes p$, i.e. if the following holds.

This reflect the traditional idea of probability theory of a zero-one measure, a random variable which is independent of itself. In particular:

• In FinStoch, a distribution $p$ is deterministic if and only if for all $x,x'\in X$,

$p(x)\,\delta_{x,x'} = p(x)\,p(x') ,$

i.e. if and only if for all $x\in X$,

$p(x) = p(x)^2 .$

This means precisely that for all $x$, either $p(x)=0$ or $p(x)=1$, i.e. that the probability is concentrated on a single point.

• In Stoch, a similar calculation shows that $p$ is deterministic if and only if for every measurable subset $A\subseteq X$, $p(A)=0$ or $p(A)=1$ (zero-one measure). That is, we are (almost) certain about which events happen and which do not. On well-behaved measurable spaces, such as standard Borel spaces, this means exactly that $p$ is a Dirac delta, a measure concentrated at a single point.

• In Set, every state (i.e. single point) is deterministic.

More generally, a morphism $f:X\to Y$ in a Markov category is called deterministic if it commutes with the copy map,

$copy \circ f \;=\; (f\otimes f) \circ copy .$

In terms of string diagrams:

This formalizes the idea of a deterministic transition, or function. In particular:

• In FinStoch, by means of a similar calculation as above, a stochastic matrix $k:X\to Y$ is deterministic if and only if for all $x,y\in X$, $k(y|x)=0$ or $k(y|x)=1$. This means precisely that $k$ is in the form $\delta_f$ for a set-theoretic function $f$ (see here).

• In Stoch, a Markov kernel $k:X\to Y$ is deterministic if and only if for every $x\in X$ and every measurable $B\subseteq Y$, $k(B|x)=0$ or $k(B|x)=1$ (zero-one kernels). On well-behaved measurable spaces, such as standard Borel spaces, this means exactly that $k$ is in the form $\delta_f$ for a measurable function $f$ (see here).

• In Set, every morphism is deterministic: for every $x\in X$ both side of the equations yield

$\big(f(x),f(x)\big) = \big(f(x),f(x)\big) ,$

where on the left we have applied $f$ to $x$ and copied the result, while on the right we have copied $x$ and applied $f$ to both copies.

Let’s now see where the generic case differs from Set, and further motivate this definition. Suppose that $f$ is a “random” function between real numbers, which adds to the input the result of the roll of a die. Given a number $x$, we can roll a die, add the resulting value (say, $n$) to $x$, and then copy the result, to get $(x+n,x+n)$. Or, we could copy the value $x$, roll two dice (or roll the die twice), and add the two resulting values (say, $m$ and $n$) to the two copies of $x$, obtaining $(x+m,x+n)$. The two results are likely to differ, and in particular will follow different distributions: one perfectly correlated, one with conditional independence of the outputs given the inputs. One can take this as a definition of randomness: it’s a process that may give a different result if you do it twice. Or equivalently, that copying the information before or after the process has taken place gives different results.

A deterministic morphism is one that does not exhibit this behavior, i.e. that commutes with the copy map.

Deterministic morphisms are closed under composition. One usually denotes the wide subcategory of deterministic morphisms of a Markov category $C$ by $C_det$.

We have seen that in a Markov category, every morphism commutes with the discard map. Therefore, from the category-theoretic perspective, we can equivalently see deterministic morphisms exactly as the morphisms of comonoids.

The following statement makes precise the idea, sketched in the previous sections, that no inference is possible without randomness:

###### Proposition

For a Markov category, the following conditions are equivalent.

1. Every morphism is deterministic;
2. The copy maps are natural;
3. The category is cartesian monoidal;
4. Every morphism displays conditional independence of its outputs given its inputs.

Therefore, we can consider Markov categories as a generalization of cartesian monoidal categories to the case where “randomness is present”.

(However, using Markov categories to study cartesian categories, while technically possible, usually yields trivial results, since all the correlations are trivial and no inference is possible.)

### Almost sure equality

Another concept of probability and measure theory that Markov categories formalize is almost sure equality. The fact that certain propositions, in probability theory, are only true up to a null set, is sometimes considered an unavoidable but confusing technicality of the theory. In Markov categories this concept is equally inevitable, but it acquires a precise conceptual meaning, and has very deep-reaching consequences.

Let $f,g:X\to Y$ be morphisms in a Markov category, and let $p$ be a state on $X$. We say that $f$ and $g$ are $p$-almost surely equal if and only if the following equation holds.

Similar definitions can be given even if $f$ and $p$ depend on additional parameters, see (Fritz et al’23) for the details.

In Stoch and FinStoch, this definition recovers the traditional notions of almost sure equality:

• In FinStoch, the equation reads

$p(x)\,f(y|x) = p(x)\,g(y|x)$

for all $x\in X$ and $y\in Y$. This means exactly that for all $y\in Y$, and for all those $x\in X$ with $p(x) \ne 0$, $f(y|x)=g(y|x)$. That is, $f$ and $g$ agree on the support of $p$.

• In Stoch, the equation reads

$\int_A f(B|x)\,p(d x) = \int_A g(B|x)\,p(d x)$

for all measurable $A\subseteq X$ and $B\subseteq Y$. That is, for all $B$, the quantities $f(B|x)$ and $g(B|x)$ agree on a measure-one subset of $X$ (which may depend on $B$ in general). This recovers the notion of almost sure equality for Markov kernels.

In the same vein, given a morphism $f:X\to Y$ and a state $p$ on $X$ we say that $f$ is $p$-almost surely deterministic if and only if the determinism equation for $f$ holds $p$-almost surely. In string diagrams:

• In FinStoch, this says exactly that $f(y|x)$ is either zero or one for all $x$ such that $p(x) \ne 0$;

• In Stoch, this says that for all measurable $B\subseteq Y$, for $x$ on a measure-one set, $f(B|x)$ is either zero or one.

### Conditionals

One of the most powerful concepts in the theory of Markov categories is the idea of conditioning. We saw, when talking about stochastic independence, that given a joint state on $X\times Y$,

one can try, from observing $X$, to infer something about $Y$. This idea of information flowing from $X$ to $Y$, often, can be modelled as a morphism $X\to Y$, in general not deterministic, since $X$ may not determine $Y$ completely. Let’s define this precisely.

Let $p$ be a joint state on $X\otimes Y$. A conditional distribution on $Y$ given $X$ is a morphism $p|_X:X\to Y$ such that the following equation holds:

• In FinStoch, this means a stochastic map of entries $p|_X(y|x)$ such that for all $x\in X$ and $y\in Y$,

$p(x,y) = p_X(x)\,p|_X(y|x) .$

(Note the difference between the marginal, $p_X$, and the conditional, $p|_X$.) This is the usual notion of conditional distribution in the discrete case, usually obtained by

$p|_X(y|x) = \frac{p(x,y)}{p(x)}$

when $p(x)\ne 0$, and arbitrary otherwise.

• In Stoch, this means a Markov kernel $p|_X:X\to Y$ such that for all measurable $A\subseteq X$ and $B\subseteq Y$,

$p(A\otimes B) = \int_A p|_X(B|x)\,p_X(d x) .$

This is known in probability theory as a regular conditional distribution.

Note that, if a conditional distribution exists, it is unique only up to $p_X$-almost sure equality. Moreover,

It is helpful to think of a conditional distribution as of turning the output $X$ into an input. In terms of string diagram, it is sometimes helpful to represent this as follows,

which says that “in the inside of $p|_X$, a wire has been bent to turn an output into an input”.

A particularly important conditional distribution is given by Bayesian inverses.

Let $p$ be a state on $X$ and let $f:X\to Y$ be a morphisms. A Bayesian inverse of $f$ with respect to $p$ is a morphism $f^\dagger:Y\to X$ such that the following equation holds, where $q=f\circ p$.

Note that this is the same as a conditional given $Y$ for the joint on the right-hand side. The equation can be seen as a version of Bayes' formula:

• In FinStoch, it says that

$q(y) \,f^\dagger_p(x|y) = p(x) \,f(y|x) ,$

which is in the usual form

$P(y) \, P(x|y) = P(x) \, P(y|x) ,$

which recovers the usual Bayesian version for stochastic maps.

• In Stoch, similarly, it says that for all measurable $A\subseteq X$ and $B\subseteq Y$,

$\int_B f^\dagger_p(A|y) \, q(d y) = \int_A f(B|x)\,p(d x) .$

This recovers the Bayesian inversion of Markov kernels.

Once again, Bayesian inversions exist for all standard Borel spaces, and in general, if they exist, they are unique $q$-almost surely.

To conclude this section, let’s give the most general definition of conditional. Let $h:A\to X\otimes Y$ be a joint morphism. A conditional of $h$ given $X$ is a morphism $h|_X:X\otimes A\to Y$ such that the following equation holds.

A Markov category is said to have conditionals if every conditional in the form above exists.

Both FinStoch and BorelStoch have conditionals.

## Further structures and properties

### Representable Markov categories

As we explained above, a probability monad often gives rise to a Markov category.

Conversely, consider a Markov category $C$.

• The category $C_det$ of deterministic morphisms has no randomness;
• There may be have a monad on $C_\det$ which encodes randomness (see probability monad);
• The Markov category $C$ may be the Kleisli category of this monad.

Let’s make this precise in terms of a universal property.

Let $X$ be an object in a Markov category $C$. A distribution object for $X$ is an object, which we denote by $P X$, together with a bijection

$C_det(A,P X) \cong C(A,X)$

for all objects $A$, natural in $A$ against deterministic morphisms.

The object $P X$ has the interpretation of being a space of probability measures over $X$. Setting $A=I$ in the definition, we see in particular that the deterministic states (or “points”) of $P X$ are exactly the (generic, random) states on $X$. For Stoch, this says that the points of $P X$ should be exactly the probability measures on $X$. Indeed, in Stoch every object has a distribution object, given by the Giry monad.

More generally:

###### Theorem

For a Markov category $C$, the following conditions are equivalent.

1. Every object of $C$ has a distribution object.
2. The inclusion functor $C_det\to C$ has a right adjoint (which we denote by $P$).
3. $C$ is isomorphic to the Kleisli category of a commutative affine monad on $C_det$.

A Markov category $C$ is called representable if any of the conditions of the theorem above holds.

The category FinStoch is not representable, since the objects are finite sets, and the set of distribution $P X$ over a finite set $X$ (with more than two elements) would be infinite. It is however a subcategory of the Kleisli category of the distribution monad, of all sets and all stochastic maps between them.

By means of the Yoneda lemma, we can equivalently define a distribution object as an object $P X$ together with a distinguished morphism $P X\to X$, which we denote by “$samp$” and call the sampling map, such that the assignment is a bijection for all objects $A$.

The map $samp$ can be thought of as being the universal random map, in the sense that by the bijection above, every Kleisli morphism can be seen as the composition of $samp$ with a deterministic map.

From the point of view of probability theory, in Stoch, we can view this map $P X\to X$ as taking a distribution $p\in P X$, and returning a random point of $X$ distributed according to $p$. In other words, it is sampling from $p$.

This also corresponds to the idea of sampling in probabilistic programming?, (see monad (in computer science)).

### Kolmogorov products

Kolmogorov products are a way to equip a Markov category with an infinitary tensor product. The idea can be seen as an abstraction of the Kolmogorov extension theorem for probability measures.

Let $C$ be a Markov category, and let $(X_j)_{j\in J}$ be a family of objects. A Kolmogorov product of them is an infinite tensor product

$X_J=\bigotimes_{j\in J} X_j$

such that all finite projections (marginalizations) $X_J\to X_F$, with $F\subseteq J$ finite, are deterministic morphisms.

Using the universal property of limits, this says that a joint distribution on $X_J$ is equivalently given by a family of joint distributions on all finite subsets of $J$, compatible with the restrictions.

Because of this, a Kolmogorov product on $C$ restricts to a cartesian product in the category $C_det$ of deterministic morphisms.

### Positivity and causality

Positivity and causality are two additional axioms for Markov categories, introduced in Fritz’20, which hold in categories of kernels such as Stoch.

A Markov category is called positive if and only if whenever two morphisms $f:X\to Y$ and $g:Y\to Z$ are such that $g\circ f:X\to Z$ is deterministic, then the following holds,

i.e. the resulting joint morphism on the left makes $Y$ and $Z$ conditionally independent given $X$.

Equivalently, a Markov category is positive if and only if for every joint morphism $h:A\to X\otimes Y$ such that one of its marginals is deterministic, $h$ exhibits conditional independents of its outputs given its input.

In a positive Markov category, every isomorphism is deterministic (Fritz’20, Remark 11.28).

A Markov category is called causal if, whenever the following equation holds,

then the following, stronger condition also holds.

###### Theorem

Every Markov category with all conditionals is causal (Fritz’20, Proposition 11.34), and every causal Markov category is positive (Fritz et al’23, Theorem 2.24). Both converses are false.

The category Stoch is causal (but ot doesn’t have all conditionals).

### The ProbStoch construction

The ProbStoch construction on a Markov category formalizes the idea of a category of probability spaces. Morphisms are identified when they agree almost surely, so that universal properties in that category tend to encode universal properties that hold “up to probability zero”. Moreover, if conditionals exist, the ProbStoch construction gives an abstraction of the category of couplings.

Let $C$ be a causal Markov category. The category $ProbStoch(C)$ is defined as follows.

• Its objects are pairs $(X,p)$ where $X$ is an object of $C$, and $p:I\to X$ is a state on $X$.
• Its morphisms $(X,p)\to(Y,q)$ are equivalence classes of morphisms $f:X\to Y$ of $C$ under $p$-almost sure equality, and such that $f\circ p=q$.

The causality condition is required in order for composition to be well defined on the equivalence classes.

In general $ProbStoch(C)$ is not a Markov category, but only a semicartesian monoidal category.

###### Proposition

If a Markov category $C$ has all conditionals, $ProbStoch(C)$ is a dagger category, with the daggers given by Bayesian inverses.

For example, BorelStoch has all conditionals, and $ProbStoch(BorelStoch)$ is the category Krn.

## Quantum versions

(For now, see the references.)

## History

In the history of Markov categories, and of categorical probability in general, it happened very often that a number of independent researchers developed similar ideas unaware of each other’s work.

The first study of a category of Markov kernels (here called Stoch) seems to be due to Lawvere, and recorded in unpublished notes (Lawvere’62, see also the explanation by Lawvere himself in this video, time 1:00:21.) The same category was studied independently by Chentsov (Chentsov’65), who wrote in Russian. It is likely that Lawvere and Chentsov remained unaware of each other’s work for several decades.

Giry (Giry’82) first described Stoch as the Kleisli category of the monad which brings her name. Shortly after, computer scientists such as Moggi and Plotkin started considering monads in the context of effectful programming (Moggi’89), with probabilistic computation as a special case (Jones-Plotkin’89). (More details in monad (in computer science).) This is also the time when a number of probability monads were introduced, see there for more.

The first mention of Markov and CD-categories used for the purpose of probability seems to have first appeared in the work of Golubtsov (Golubtsov’99), who named the structure “category of information converters” or “information transformers”, likely unaware of Gadducci’s work, and first writing in Russian. In subsequent work (Golubtsov’02, Golubtsov-Mosaliuk’02, Golubtsov’04) these structures were connected to Kleisli categories, introducing what now we call a representable Markov category.

In parallel, and seemingly unaware of Golubtsov’s work, Coecke and Spekkens developed a string-diagrammatic framework for Bayesian inference, for the purpose of generalizing these concept from classical to quantum (Coecke-Spekkens’11). The resulting framework is somewhat different from the one of Markov categories, but very similar in aims and philosophy.

Once again seemingly unaware of Golubtsov’s work, Fong studied categorical structures to model Bayesian networks in his master’s thesis (Fong’12), in particular by means of equipping the objects of a category with comonoid structures.

Finally, the current form of Markov and copy-discard categories, which is also the one presented in this article, was developed in the works of Cho and Jacobs (Cho-Jacobs’19) and of Fritz (Fritz’20).

## Detailed list

Markov category Probability monad Conditionals Positivity Kolmogorov products Further references
Stoch Giry monad on SoberMeas No (Fritz'20, Example 11.3) Yes (Fritz'20, Example 11.25) Not in general Fritz'20
BorelStoch Giry monad on BorelMeas Yes (Kallenberg '17, B-M'19) Yes (Fritz'20, Section 11) Countable (Fritz-Rischel'19) Fritz'20
FinStoch Not representable (closely related to the distribution monad) Yes (Fritz'20, Example 11.6) Yes (Fritz'20, Section 11) No Fritz'20
TychStoch probability monad on topological spaces (here, Tychonoff) No Yes (Fritz et al'23) ? Fritz et al'23
QBStoch probability monad on quasi-Borel spaces No (Sabok et al'20) No (Sabok et al'20) ? Fritz et al'23
Gauss ? Yes (Fritz'20) Yes (Fritz'20) No Stein et al.'23
FinSetMulti Finite nonempty power set monad Yes (Fritz'20) Yes (Fritz'20) No (Fritz-Rischel'19) Stein et al.'23

(…to be expanded…)

## References

### Main sources

The main definitions and results in this article can be found in

• Tobias Fritz, A synthetic approach to Markov kernels, conditional independence and theorems on sufficient statistics, Advances of Mathematics 370, 2020. (arXiv:1908.07021)

and in the following reference (where Markov categories are called affine CD categories):

• Kenta Cho, Bart Jacobs, Disintegration and Bayesian Inversion via String Diagrams, Mathematical Structures of Computer Science 29, 2019. (arXiv:1709.00322)

The graphical model (separation and d-separation) aspects of conditional independence for Markov categories have been studied in

• Tobias Fritz and Andreas Klingler, The d-separation criterion in Categorical Probability, Journal of Machine Learning Research 24(46), 2023. (arXiv)

Representable Markov categories have been studied in

• Tobias Fritz, Tomáš Gonda, Paolo Perrone, Eigil Fjeldgren Rischel, Representable Markov categories and comparison of statistical experiments in categorical probability, Theoretical Computer Science 961, 2023. (arXiv:2010.07416)

as well as in the following work, together with an in-depth treatment of positivity and causality:

• Tobias Fritz, Tomáš Gonda, Nicholas Gauguin Houghton-Larsen, Paolo Perrone and Dario Stein, Dilations and information flow axioms in categorical probability. Mathematical Structures in Computer Science, 2023. (arXiv:2211.02507).

Kolmogorov products have been studied in

• Tobias Fritz and Eigil Fjeldgren Rischel, Infinite products and zero-one laws in categorical probability, Compositionality 2(3) 2020. (arXiv:1912.02769)

### Introductory material

Some of the introductory content on this page is taken from the following reference:

• Paolo Perrone, Markov Categories and Entropy, IEEE Transactions on Information Theory 70(3), 2024. (arXiv)

Video lectures suitable for beginners:

• Paolo Perrone, Tutorial on Markov categories, Applied Category Theory 2023, University of Maryland, USA. (YouTube)

• Paolo Perrone, Universal properties in probability theory, Category Theory 2023, Louvain-la-Neuve, Belgium. (YouTube)

• (in Spanish) Paolo Perrone, Categorical Probability and Information Theory, Universidad Carlos III Madrid, Spain (video here)

• Eigil Fjeldgren Rischel, Introduction to Markov categories, Categorical Probability and Statistics Workshop, 2020. (YouTube)

Slides from talks:

• Tobias Fritz, Categorical Probability and the de Finetti theorem, New York City Category Seminar, 2021. (pdf)

• Tobias Fritz, Probability and Statistics as a Theory of Information Flow, 2020 (pdf)

• Tobias Fritz, A synthetic introduction to probability and statistics, 2019 (pdf)

### Historical references (see History):

• W. Lawvere, The category of probabilistic mappings, ms. 12 pages, 1962

• N. N. Chentsov, The categories of mathematical statistics, Dokl. Akad. SSSR 164, 1965.

• Michèle Giry, A categorical approach to probability theory, Categorical aspects of topology and analysis, Lecture notes in Mathematics 915, Springer, 1982.

• Eugenio Moggi, Computational lambda-calculus and monads, Proceedings of LICS, 1989. (doi)

• Claire Jones and Gordon Plotkin, A probabilistic powerdomain of evaluations, Proceedings of LICS, 1989. (doi)

• Fabio Gadducci, On the Algebraic Approach to Concurrent Term Rewriting, PhD thesis, University of Pisa, 1996. (available here)

• Andrea Corradini and Fabio Gadducci, A 2-categorical presentation of term graph rewriting. In CTCS, LNCS 1290, 1997.

• Andrea Corradini and Fabio Gadducci, An algebraic presentation of term graphs, via gs-monoidal categories, Applied Categorical Structures 7(4), 1999

• Andrea Corradini and Fabio Gadducci, Rewriting on cyclic structures: equivalence between the operational and the categorical description, RAIRO Theoretical Informatics and Applications 33(4-5), 1999.

• Andrea Corradini and Fabio Gadducci, A functorial semantics for multi-algebras and partial algebras, with applications to syntax, Theoretical Computer Science 286(2), 2002.

• Peter V. Golubtsov, Axiomatic description of categories of information converters, Problemy Peredachi Informatsii (Problems in Information Transmission) 35(3), 1999.

• Peter V. Golubtsov, Monoidal Kleisli category as a background for information transformers theory. Information Theory and Information Processing 2, 2002.

• Peter V. Golubtsov and Stepan S. Moskaliuk, Method of additional structures on the objects of a monoidal Kleisli category as a background for information transformers theory, Hadronic Journal 25(2), 2002.

• Peter V. Golubtsov, Information transformers: category-theoretical structure, informativeness, decision-making problems, Hadronic Journal Supplements 19(3), 2004.

• Bob Coecke and Robert W. Spekkens, Picturing classical and quantum Bayesian inference, Synthese, 186(3), 2012. (arXiv)

• Brendan Fong, Causal theories: A categorical prespective on Bayesian networks, master thesis, University of Oxford, 2012. (arXiv)

### Further references

• Tobias Fritz, Tomáš Gonda, Paolo Perrone, De Finetti’s theorem in categorical probability. Journal of Stochastic Analysis, 2021. (arXiv:2105.02639)

• Tobias Fritz, Wendong Liang, Free gs-monoidal categories and free Markov categories, Applied Categorical Structures 31, 21, 2023. (arXiv:2204.02284)

• Sean Moss, Paolo Perrone, Probability monads with submonads of deterministic states, Proceedings of LICS, 2022. (arXiv:2204.07003)

• Sean Moss and Paolo Perrone, A category-theoretic proof of the ergodic decomposition theorem, Ergodic Theory and Dynamical Systems, 2023. (arXiv:2207.07353)

• Marcin Sabok, Sam Staton, Dario Stein, Michael Wolman, Probabilistic Programming Semantics for Name Generation, Proceedings POPL, 2021. (arXiv)

• Elena Di Lavore and Mario Román, Evidential Decision Theory via Partial Markov Categories, Proceedings of LICS, 2023. (arXiv)

• Bart Jacobs, Multinomial and Hypergeometric Distributions in Markov Categories, Proceedings of MFPS 2021. (arXiv)

• Tobias Fritz, Fabio Gadducci, Davide Trotta, Andrea Corradini, From gs-monoidal to oplax-cartesian categories: constructions and functorial completeness, Applied Categorical Structures 31, 42, 2023. (arXiv)

• Tobias Fritz, Fabio Gadducci, Paolo Perrone and Davide Trotta, Weakly Markov categories and weakly affine monads, Proceedings of CALCO, LIPIcs 10, 2023. (arXiv)

• Dario Stein and Richard Samuelson, A Category for unifying Gaussian Probability and Nondeterminism, Proceedings of CALCO, LIPIcs 10, 2023. (arXiv)

• Dario Stein and Sam Staton, Probabilistic Programming with Exact Conditions, JACM, 2023. (arXiv)

• Tobias Fritz, Tomáš Gonda, Antonio Lorenzin, Paolo Perrone and Dario Stein, Absolute continuity, supports and idempotent splitting in categorical probability, 2023. (arXiv:2308.00651)

• Dario Stein and Richard Samuelson, Towards a Compositional Framework for Convex Analysis (with Applications to Probability Theory), 2023. (arXiv)

• Nathaniel Virgo, Unifilar Machines and the Adjoint Structure of Bayesian Filtering, Proceedings of ACT, 2023. (arXiv)

• Noé Ensarguet and Paolo Perrone, Categorical probability spaces, ergodic decompositions, and transitions to equilibrium, 2023. (arXiv:2310.04267)

• Tobias Fritz and Antonio Lorenzin, Involutive Markov categories and the quantum de Finetti theorem, 2023. (arXiv)

• Tobias Fritz, Andreas Klingler, Drew McNeely, Areeb Shah-Mohammed and Yuwen Wang, Hidden Markov models and the Bayes filter in categorical probability, 2024. (arXiv)

• Arthur J. Parzygnat, Inverses, disintegrations, and Bayesian inversion in quantum Markov categories, 2020. (arXiv)

• Arthur J. Parzygnat and Benjamin P. Russo, A noncommutative Bayes theorem, Linear Algebra Applications 644, 2022. (arXiv)

• Arthur J. Parzygnat, Conditional distributions for quantum systems, EPTCS 343, 2021. (arXiv)

• James Fullwood and Arthur J. Parzygnat, On quantum states over time, Proceedings of the Royal Society A 478, 2022. (arXiv)

• Arthur J. Parzygnat, Francesco Buscemi, Axioms for retrodiction: achieving time-reversal symmetry with a prior, Quantum 7(1013), 2023. arXiv

• James Fullwood and Arthur J. Parzygnat, From time-reversal symmetry to quantum Bayes rules, PRX Quantum 4, 2023. (arXiv)

The research on Markov categories is still largely in progress, so this list will keep growing.