A Markov kernel (also called transition kernel, stochastic kernel, or probability kernel) is a mathematical formalization of a “function with random outcomes”, or “random transition”. (Sometimes the term transition kernel denotes a more general, non-normalized mapping.)
It can be thought of as a generalization of a stochastic map beyond the finite discrete case. (Sometimes the term “stochastic map” is itself used to denote a Markov kernel.)
Markov kernels, in the form of regular conditional distributions (see below), are also the standard setting for talking about conditional probability outside of the discrete setting.
Markov kernels and their categories are among the basic building blocks of categorical probability.
Let $(X,\mathcal{A})$ and $(Y,\mathcal{B})$ be measurable spaces, i.e. sets each equipped with a sigma-algebra. A Markov kernel $(X,\mathcal{A})\to(Y,\mathcal{B})$ is an assignment $(x,B)\mapsto k(B|x)$, for $x\in X$ and $B\in\mathcal{B}$, which is a probability measure in the argument $B$ for each fixed $x$, and a measurable function in the argument $x$ for each fixed $B$.
The quantity $k(B|x)$ can be thought of as the probability of transitioning to $B$ if we are in $x$, or of the conditional probability of the event $B$ given $x$.
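In the finite discrete case, a Markov kernel is just a row-stochastic matrix: one probability distribution on $Y$ for each point of $X$, with measurability automatic. A minimal sketch with made-up sets and numbers:

```python
# A finite Markov kernel k : X -> Y, stored as k[x][y] = k({y}|x).
# Each row must be a probability distribution on Y (nonnegative, summing to 1).
X = ["rain", "sun"]
Y = ["umbrella", "no_umbrella"]

k = {
    "rain": {"umbrella": 0.9, "no_umbrella": 0.1},
    "sun":  {"umbrella": 0.2, "no_umbrella": 0.8},
}

def is_kernel(k, X, Y, tol=1e-12):
    """Check the two defining conditions in the discrete setting."""
    return all(
        abs(sum(k[x][y] for y in Y) - 1.0) < tol
        and all(k[x][y] >= 0 for y in Y)
        for x in X
    )

def prob(k, x, B):
    """k(B|x): probability of transitioning into the event B ⊆ Y from x."""
    return sum(k[x][y] for y in B)

print(is_kernel(k, X, Y))             # True
print(prob(k, "rain", {"umbrella"}))  # 0.9
```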
A Markov kernel $k:(X,\mathcal{A})\to(Y,\mathcal{B})$ can be equivalently expressed as a measurable map $(X,\mathcal{A})\to P(Y,\mathcal{B})$, where $P$ denotes the Giry monad on Meas, with $P(Y,\mathcal{B})$ carrying the Giry sigma-algebra. In other words, a Markov kernel is exactly a Kleisli morphism for the Giry monad. Particular Markov kernels are also Kleisli morphisms for other probability monads.
Consider two kernels $k,h:(X,\mathcal{A})\to(Y,\mathcal{B})$ and a measure $p$ on $(X,\mathcal{A})$. The kernels $k$ and $h$ are said to be $p$-almost surely equal or $p$-almost everywhere equal if for every $B\in\mathcal{B}$, the quantities $k(B|x)$ and $h(B|x)$ only differ on a set of measure zero. Note that in general this set depends on the choice of $B$, but in some cases, such as when $(Y,\mathcal{B})$ is standard Borel, it can be taken independently of $B$.
Let $(X,\mathcal{A},p)$ and $(Y,\mathcal{B},q)$ be probability spaces, i.e. measurable spaces each equipped with a probability measure. A measure-preserving kernel $(X,\mathcal{A},p)\to(Y,\mathcal{B},q)$ is a kernel between the underlying measurable spaces $k:(X,\mathcal{A})\to(Y,\mathcal{B})$ such that for every $B\in\mathcal{B}$,
$$ \int_X k(B|x)\,p(dx) = q(B) . $$
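In discrete terms, measure-preservation says that the pushforward of $p$ along $k$ equals $q$: for every $y$, $q(y)=\sum_x p(x)\,k(y|x)$. A sketch with illustrative numbers:

```python
# Pushing a measure p on X forward along a kernel k : X -> Y.
# k is measure-preserving (X, p) -> (Y, q) exactly when pushforward(p, k) == q.
p = {"rain": 0.5, "sun": 0.5}
k = {
    "rain": {"umbrella": 0.9, "no_umbrella": 0.1},
    "sun":  {"umbrella": 0.2, "no_umbrella": 0.8},
}

def pushforward(p, k):
    """q(y) = sum over x of p(x) * k(y|x)."""
    q = {}
    for x, row in k.items():
        for y, v in row.items():
            q[y] = q.get(y, 0.0) + p[x] * v
    return q

q = pushforward(p, k)
print(q)  # {'umbrella': 0.55, 'no_umbrella': 0.45} (up to rounding)
```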
The identity function on a measurable space $(X,\mathcal{A})$ defines a kernel as follows,
$$ \mathrm{id}(A|x) = 1_A(x) = \begin{cases} 1 & \text{if } x\in A , \\ 0 & \text{if } x\notin A , \end{cases} $$
for every $x\in X$ and $A\in\mathcal{A}$. This gives the identity morphisms in the categories of kernels below.
More generally, let $f:(X,\mathcal{A})\to (Y,\mathcal{B})$ be a measurable function. We can define the kernel $\delta_f:(X,\mathcal{A})\to (Y,\mathcal{B})$ as follows,
$$ \delta_f(B|x) = 1_B(f(x)) = \begin{cases} 1 & \text{if } f(x)\in B , \\ 0 & \text{if } f(x)\notin B , \end{cases} $$
for every $x\in X$ and $B\in\mathcal{B}$. Intuitively, this kernel represents the deterministic transition from $x$ to $f(x)$, reached with probability one. This construction induces a functor from the category Meas to the categories of kernels below. (Compare with the analogous construction for stochastic maps.)
Note that if $f:(X,\mathcal{A},p)\to(Y,\mathcal{B},q)$ is a measure-preserving function, then $\delta_f$ is a measure-preserving kernel in the sense defined above.
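The kernel $\delta_f$ assigns to each $x$ the Dirac measure at $f(x)$; in the discrete case this can be sketched directly (the function $f$ below is made up):

```python
# Deterministic kernel delta_f induced by a function f : X -> Y:
# delta_f(B|x) = 1 if f(x) is in B, else 0 (the Dirac measure at f(x)).
def delta(f, x, B):
    return 1.0 if f(x) in B else 0.0

f = lambda n: n % 2       # an illustrative function X = {0,...,5} -> Y = {0, 1}

print(delta(f, 3, {1}))   # 1.0: f(3) = 1 lands in B with probability one
print(delta(f, 3, {0}))   # 0.0
```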
Given a measure-preserving Markov kernel $k:(X,\mathcal{A},p)\to(Y,\mathcal{B},q)$, a Bayesian inverse of $k$ is a Markov kernel $k^\dagger:(Y,\mathcal{B},q)\to(X,\mathcal{A},p)$ such that for all $A\in\mathcal{A}$ and $B\in\mathcal{B}$,
$$ \int_A k(B|x)\,p(dx) = \int_B k^\dagger(A|y)\,q(dy) . $$
The last formula can be seen as an instance of Bayes' formula, as it generalizes the usual discrete formula
$$ k^\dagger(x|y)\,q(y) = k(y|x)\,p(x) , \qquad \text{i.e.} \qquad k^\dagger(x|y) = \frac{k(y|x)\,p(x)}{q(y)} \quad \text{whenever } q(y) > 0 . $$
We have that
Every measure-preserving kernel between standard Borel probability spaces admits a Bayesian inverse;
If a kernel $k:(X,\mathcal{A},p)\to(Y,\mathcal{B},q)$ admits a Bayesian inverse $k^\dagger:(Y,\mathcal{B},q)\to(X,\mathcal{A},p)$, then any kernel $h:(Y,\mathcal{B},q)\to(X,\mathcal{A},p)$ is a Bayesian inverse of $k$ if and only if it is $q$-almost surely equal to $k^\dagger$.
Bayesian inverses, when they exist, depend crucially on the measure $p$.
Bayesian inversion makes the category Krn (see below) a dagger category.
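In the finite discrete case, a Bayesian inverse can be computed directly from the discrete Bayes formula, wherever the output marginal is nonzero. A sketch with made-up numbers:

```python
# Discrete Bayesian inversion: given p on X and a kernel k : X -> Y,
# form q as the pushforward of p along k, then
# k_dagger(x|y) = k(y|x) * p(x) / q(y)  (defined where q(y) > 0).
p = {"rain": 0.5, "sun": 0.5}
k = {
    "rain": {"umbrella": 0.9, "no_umbrella": 0.1},
    "sun":  {"umbrella": 0.2, "no_umbrella": 0.8},
}

q = {}
for x in p:
    for y, v in k[x].items():
        q[y] = q.get(y, 0.0) + p[x] * v

k_dagger = {y: {x: k[x][y] * p[x] / q[y] for x in p} for y in q}

# Verify the defining equation: k(y|x) p(x) = k_dagger(x|y) q(y) for all x, y.
assert all(
    abs(k[x][y] * p[x] - k_dagger[y][x] * q[y]) < 1e-12
    for x in p for y in q
)
print(round(k_dagger["umbrella"]["rain"], 4))  # 0.8182, i.e. 0.45 / 0.55
```

Note how $k^\dagger$ depends on $p$ and not just on $k$, matching the remark above.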
A Markov kernel can be seen as a probability measure which depends measurably on a parameter. Therefore, together with the idea of conditional expectation, Markov kernels are a very useful tool for talking about conditional probability without running into paradoxes. Let’s see how.
Let $(X,\mathcal{A},p)$ be a probability space, and consider a sub-sigma-algebra $\mathcal{B}\subseteq\mathcal{A}$. Recall that
the conditional expectation of an integrable function $f:(X,\mathcal{A},p)\to\mathbb{R}$ given $\mathcal{B}$ is a function $\mathbb{E}[f|\mathcal{B}]:X\to\mathbb{R}$ which is $\mathcal{B}$-measurable (“coarser” than $f$), and such that for every $B\in\mathcal{B}$,
$$ \int_B \mathbb{E}[f|\mathcal{B}]\,dp = \int_B f\,dp . $$
(Such a function always exists by the Radon-Nikodym theorem.)
the conditional probability of an event (measurable subset) $A\in\mathcal{A}$ is the conditional expectation of its indicator function:
$$ \mathbb{P}[A|\mathcal{B}] = \mathbb{E}[1_A|\mathcal{B}] . $$
Note that $\mathbb{P}[A|\mathcal{B}]$ is a function $x\mapsto\mathbb{P}[A|\mathcal{B}](x)$,
which is $\mathcal{B}$-measurable for every $A\in\mathcal{A}$. Therefore, in order for the assignment $(x,A)\mapsto\mathbb{P}[A|\mathcal{B}](x)$ to be a Markov kernel $(X,\mathcal{B})\to(X,\mathcal{A})$ (see the definition above), we moreover need that for every $x\in X$ the assignment
$$ A \mapsto \mathbb{P}[A|\mathcal{B}](x) $$
be a probability measure on $(X,\mathcal{A})$. This is not always the case. When this happens, we call the resulting kernel $(X,\mathcal{B})\to(X,\mathcal{A})$ a regular conditional distribution given $\mathcal{B}$. We can view a regular conditional distribution as a probability measure on $(X,\mathcal{A})$ which is parametrized in a measurable way by $(X,\mathcal{B})$.
The disintegration theorem says that when $(X,\mathcal{A})$ is standard Borel, regular conditional distributions in the form above always exist.
Using the disintegration theorem we can obtain conditional probability distributions in a way that generalizes the discrete formula
$$ p(y|x) = \frac{p(x,y)}{p(x)} $$
to the non-discrete setting.
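In the discrete case, disintegrating a joint distribution amounts to dividing by the marginal wherever it is nonzero. A sketch with illustrative numbers:

```python
# Discrete conditioning: from a joint distribution r on X x Y, recover the
# marginal p(x) = sum over y of r(x, y) and the conditional kernel
# r(y|x) = r(x, y) / p(x), defined wherever p(x) > 0.
r = {
    ("rain", "umbrella"): 0.45, ("rain", "no_umbrella"): 0.05,
    ("sun",  "umbrella"): 0.10, ("sun",  "no_umbrella"): 0.40,
}

p = {}
for (x, y), v in r.items():
    p[x] = p.get(x, 0.0) + v

cond = {x: {} for x in p}
for (x, y), v in r.items():
    cond[x][y] = v / p[x]

print(round(cond["rain"]["umbrella"], 6))  # 0.9, i.e. 0.45 / 0.5
```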
Given standard Borel spaces $(X,\mathcal{A})$ and $(Y,\mathcal{B})$, their product $(X\times Y,\mathcal{A}\otimes\mathcal{B})$ is again standard Borel. The projection map $\pi_1:(X\times Y,\mathcal{A}\otimes\mathcal{B})\to (X,\mathcal{A})$ induces a sub-sigma-algebra $\pi_1^{-1}(\mathcal{A})$ on $X\times Y$ (which makes it isomorphic to $(X,\mathcal{A})$ in both Stoch and Krn, see Section 2.1 of this paper).
Given a joint distribution $r$ on $(X\times Y,\mathcal{A}\otimes\mathcal{B})$, we can then form the regular conditional $(X\times Y,\pi_1^{-1}(\mathcal{A}))\to(X\times Y,\mathcal{A}\otimes\mathcal{B})$ (or equivalently, up to isomorphism, $(X,\mathcal{A})\to(X\times Y,\mathcal{A}\otimes\mathcal{B})$). Further composing with the projection $\pi_2:(X\times Y,\mathcal{A}\otimes\mathcal{B})\to(Y,\mathcal{B})$ (or equivalently restricting the kernel to the sub-sigma-algebra $\pi_2^{-1}(\mathcal{B})$) we then get a kernel sometimes denoted simply $r(B|x)$ or $r'(B|x)$.
This kernel satisfies the property that for every $A\in\mathcal{A}$ and $B\in\mathcal{B}$,
$$ r(A\times B) = \int_A r(B|x)\,p(dx) $$
(where $p$ denotes the marginal of $r$ on $(X,\mathcal{A})$),
and is therefore a generalization of conditional probability to the non-discrete setting: compare it to the discrete formula
$$ p(x,y) = p(y|x)\,p(x) . $$
Stoch is the category of measurable spaces and Markov kernels between them.
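Composition in Stoch is given by the Chapman-Kolmogorov formula, $(h\circ k)(C|x) = \int_Y h(C|y)\,k(dy|x)$; in the finite discrete case this is just multiplication of transition matrices. A sketch with made-up kernels:

```python
# Composing kernels k : X -> Y and h : Y -> Z in the discrete case:
# (h ∘ k)(z|x) = sum over y of h(z|y) * k(y|x)  (Chapman-Kolmogorov).
k = {"a": {"y1": 0.5, "y2": 0.5}, "b": {"y1": 1.0, "y2": 0.0}}
h = {"y1": {"z1": 0.3, "z2": 0.7}, "y2": {"z1": 1.0, "z2": 0.0}}

def compose(h, k):
    """Return the composite kernel h ∘ k : X -> Z."""
    zs = next(iter(h.values())).keys()
    return {
        x: {z: sum(k[x][y] * h[y][z] for y in k[x]) for z in zs}
        for x in k
    }

hk = compose(h, k)
print({z: round(v, 2) for z, v in hk["a"].items()})  # {'z1': 0.65, 'z2': 0.35}
```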
One can also form the category of probability spaces and equivalence classes of Markov kernels between them (under almost sure equality).
One can restrict the category above to standard Borel probability spaces, obtaining a dagger category sometimes denoted by Krn, which is equivalent to the category of couplings.
See also:
Discussion via category theory:
F. W. Lawvere, The Category of Probabilistic Mappings, 1962, PDF.
Fredrik Dahlqvist, Vincent Danos, Ilias Garnier, and Alexandra Silva, Borel kernels and their approximation, categorically. In MFPS 34: Proceedings of the Thirty-Fourth Conference on the Mathematical Foundations of Programming Semantics, volume 341, 91–119, 2018. arXiv.
Tobias Fritz, A synthetic approach to Markov kernels, conditional independence and theorems on sufficient statistics. Adv. Math., 370:107239, 2020. arXiv:1908.07021.
Noé Ensarguet, Paolo Perrone, Categorical probability spaces, ergodic decompositions, and transitions to equilibrium. arXiv.
Last revised on February 7, 2024 at 16:58:40.