nLab Markov kernel

Contents

Idea

In probability theory, Markov kernels (also called transition kernels, stochastic kernels, or probability kernels) describe random state transitions or other forms of stochastic dependence by assigning to each input a probability law for the possible outputs. (Sometimes the term transition kernel denotes a more general, non-normalized mapping.)

This may be thought of as a generalization of stochastic maps outside the finite discrete case. (Sometimes the term β€œstochastic maps” is itself used to denote Markov kernels.)

Markov kernels, in the form of regular conditional distributions (see below) are also the standard setting for talking about conditional probability outside of the discrete case (and without incurring in paradoxes).

Markov kernels and the categories which they fir. are among the basic building blocks of categorical probability theory.

Definition

Let (X,π’œ)(X,\mathcal{A}) and (Y,ℬ)(Y,\mathcal{B}) be measurable spaces, i.e. sets equipped each with a sigma-algebra. A Markov kernel (X,π’œ)β†’(Y,ℬ)(X,\mathcal{A})\to(Y,\mathcal{B}) is an assignment which is

The quantity k(B|x)k(B|x) can be thought of as the probability of transitioning to BB if we are in xx, or as the conditional probability of the event BB given xx.

Basic structures and properties

As Kleisli morphisms

A Markov kernel k:(X,π’œ)β†’(Y,ℬ)k:(X,\mathcal{A})\to(Y,\mathcal{B}) can be equivalently expressed as a measurable map where PP denotes the Giry monad on Meas, with the Giry sigma-algebra. In other words, a Markov kernel is exactly a Kleisli morphism for the Giry monad. Particular Markov kernels are also Kleisli morphisms for other probability monads.

Almost sure equality

Consider two k,h:(X,π’œ)β†’(Y,ℬ)k,h:(X,\mathcal{A})\to(Y,\mathcal{B}) and a measure pp on (X,π’œ)(X,\mathcal{A}). The kernels kk and hh are said to be pp-almost surely equal or pp-almost everywhere equal if for every Bβˆˆβ„¬B\in\mathcal{B}, the quantities k(B|x)k(B|x) and h(B|x)h(B|x) only differ on a set of measure zero. Note that in general this set depends on the choice of BB, but in some cases, such as when (Y,ℬ)(Y,\mathcal{B}) is standard Borel, it can be taken independently of BB.

Measure-preserving kernels

Let (X,π’œ,p)(X,\mathcal{A},p) and (Y,ℬ,q)(Y,\mathcal{B},q) be probability spaces, i.e. measurable spaces equipped each with a probability measure. A measure-preserving kernel (X,π’œ,p)β†’(Y,ℬ,q)(X,\mathcal{A},p)\to(Y,\mathcal{B},q) is a kernel between the underlying measurable spaces k:(X,π’œ)β†’(Y,ℬ)k:(X,\mathcal{A})\to(Y,\mathcal{B}) such that for every Bβˆˆβ„¬B\in\mathcal{B},

∫ Xk(B|x)p(dx)=q(B). \int_X k(B|x)\,p(d x) = q(B) .

Kernels from deterministic functions

The identity function on a measurable set (X,π’œ)(X,\mathcal{A}) defines a kernel as follows,

Ξ΄(A|x)=Ξ΄ x(A)=1 A(x)={1 x∈A; 0 xβˆ‰A \delta(A|x) = \delta_x(A) = 1_A(x) = \begin{cases} 1 & x\in A ; \\ 0 & x\notin A \end{cases}

for every x∈Xx\in X and Aβˆˆπ’œA\in\mathcal{A}. This gives the identity morphisms in the categories of kernels below.

More generally, let f:(X,π’œ)β†’(Y,ℬ)f:(X,\mathcal{A})\to (Y,\mathcal{B}) be a measurable function. We can define the kernel Ξ΄ f:(X,π’œ)β†’(Y,ℬ)\delta_f:(X,\mathcal{A})\to (Y,\mathcal{B}) as follows,

Ξ΄ f(B|x)=Ξ΄ f(x)(B)=1 f βˆ’1(B)(x)={1 f(x)∈B; 0 f(x)βˆ‰B \delta_f(B|x) = \delta_f(x)(B) = 1_{f^{-1}(B)}(x) = \begin{cases} 1 & f(x)\in B ; \\ 0 & f(x)\notin B \end{cases}

for every x∈Xx\in X and Bβˆˆβ„¬B\in\mathcal{B}. Intuitively, this kernel represents the deterministic transition, from xx to f(x)f(x) with probability one. This construction induces a functor from the category Meas to the categories of kernels below. (Compare with the analogous construction for stochastic maps.)

Every kernel in the form Ξ΄ f\delta_f is a zero-one kernel, but the converse is not true in general.

Note that if f:(X,π’œ,p)β†’(Y,ℬ,q)f:(X,\mathcal{A},p)\to(Y,\mathcal{B},q) is a measure-preserving function, then Ξ΄ f\delta_f is a measure-preserving kernel in the sense defined above.

Bayesian inversion

Given a measure-preserving Markov kernel k:(X,π’œ,p)β†’(Y,ℬ,q)k:(X,\mathcal{A},p)\to(Y,\mathcal{B},q), a Bayesian inverse of kk is a Markov kernel k †:(Y,ℬ,q)β†’(X,π’œ,p)k^\dagger:(Y,\mathcal{B},q)\to(X,\mathcal{A},p) such that for all Aβˆˆπ’œA\in\mathcal{A} and Bβˆˆβ„¬B\in\mathcal{B},

∫ Ak(B|x)p(dx)=∫ Bk †(A|y)q(dy). \int_A k(B|x)\,p(d x) = \int_B k^\dagger(A|y)\,q(d y) .

The last formula can be seen as an instance of Bayes' formula, as it generalize the usual discrete formula

P(b|a)P(a)=P(a|b)P(b). P(b|a)\,P(a) = P(a|b)\,P(b) .

We have that

  • Every kernel between standard Borel spaces admits a Bayesian inverse;

  • If a kernel k:(X,π’œ,p)β†’(Y,ℬ,q)k:(X,\mathcal{A},p)\to(Y,\mathcal{B},q) admits a Bayesian inverse k †:(Y,ℬ,q)β†’(X,π’œ,p)k^\dagger:(Y,\mathcal{B},q)\to(X,\mathcal{A},p), then any kernel h:(Y,ℬ,q)β†’(X,π’œ,p)h:(Y,\mathcal{B},q)\to(X,\mathcal{A},p) is a Bayesian inverse of kk if and only if it is qq-almost surely equal to k †k^\dagger.

  • Bayesian inverses, when they exist, depend crucially on the measure pp.

  • Bayesian inversion makes the category Krn (see below) a dagger category.

Regular conditional distributions

A Markov kernel can be seen as a probability measure which depends measurably on a parameter. Therefore, together with the idea of conditional expectation, it’s a very useful tool to talk about conditional probability without incurring paradoxes. Let’s see how.

Let (X,π’œ,p)(X,\mathcal{A},p) be a probability space, and consider a sub-sigma-algebra β„¬βŠ†π’œ\mathcal{B}\subseteq\mathcal{A}. Recall that

  • the conditional expectation of an integrable function f:(X,π’œ,p)→ℝf:(X,\mathcal{A},p)\to\mathbb{R} given ℬ\mathcal{B} is a function 𝔼[f|ℬ]:X→ℝ\mathbb{E}[f|\mathcal{B}]:X\to\mathbb{R} which is ℬ\mathcal{B}-measurable (β€œcoarser” than ff), and such that for every Bβˆˆβ„¬B\in\mathcal{B},

    ∫ Bfdp=∫ B𝔼[f|ℬ]dp. \int_B f \, d p = \int_B \mathbb{E}[f|\mathcal{B}] \, d p .

    (Such a function always exists by the Radon-Nikodym theorem.)

  • the conditional probability of an event (measurable subset) Aβˆˆπ’œA\in\mathcal{A} is the conditional expectation of its indicator function:

    β„™[A|ℬ]=𝔼[1 A|ℬ]. \mathbb{P}[A|\mathcal{B}] = \mathbb{E}[1_A|\mathcal{B}] .

Note that β„™[A|ℬ]\mathbb{P}[A|\mathcal{B}] is a function,

x↦ℙ[A|ℬ](x) x\mapsto \mathbb{P}[A|\mathcal{B}](x)

which is ℬ\mathcal{B}-measurable for every Aβˆˆπ’œA\in\mathcal{A}. Therefore, in order for the assignment to be a Markov kernel (X,ℬ)β†’(X,π’œ)(X,\mathcal{B})\to(X,\mathcal{A}) (see the definition above), we moreover need that for every x∈Xx\in X the assignment

A↦ℙ[A|ℬ](x) A \mapsto \mathbb{P}[A|\mathcal{B}](x)

be a probability measure on (X,π’œ)(X,\mathcal{A}). This is not always the case. When this happens, we call the resulting kernel (X,ℬ)β†’(X,π’œ)(X,\mathcal{B})\to(X,\mathcal{A}) a regular conditional distribution given ℬ\mathcal{B}. We can view a regular conditional distribution as a probability measure on (X,π’œ)(X,\mathcal{A}) which is parametrized in a measurable way by (X,ℬ)(X,\mathcal{B}).

The disintegration theorem says that when (X,π’œ)(X,\mathcal{A}) is standard Borel, regular conditional distributions in the form above always exist.

Regular conditionals from a joint distribution

Using the disintegration theorem, we can obtain conditional probability distributions in a way that generalizes the discrete formula

p(b|a)=p(a,b)p(a) p(b|a) = \frac{p(a,b)}{p(a)}

to the non-discrete setting.

Given standard Borel spaces (X,π’œ)(X,\mathcal{A}) and (Y,ℬ)(Y,\mathcal{B}), their product (XΓ—Y,π’œβŠ—β„¬)(X\times Y,\mathcal{A}\otimes\mathcal{B}) is again standard Borel. The projection map Ο€ 1:(XΓ—Y,π’œβŠ—β„¬)β†’(X,π’œ)\pi_1:(X\times Y,\mathcal{A}\otimes\mathcal{B})\to (X,\mathcal{A}) induces a sub-sigma-algebra Ο€ 1 βˆ’1(π’œ)\pi_1^{-1}(\mathcal{A}) on XΓ—YX\times Y (which makes it isomorphic to (X,π’œ)(X,\mathcal{A}) in both Stoch and Krn, see Section 2.1 of this paper).

Given a joint distribution rr on (XΓ—Y,π’œβŠ—β„¬)(X\times Y,\mathcal{A}\otimes\mathcal{B}), we can form the regular conditional (XΓ—Y,Ο€ 1 βˆ’1(π’œ))β†’(XΓ—Y,π’œβŠ—β„¬)(X\times Y,\pi_1^{-1}(\mathcal{A}))\to(X\times Y,\mathcal{A}\otimes\mathcal{B}) (or equivalently, up to isomorphism, (X,π’œ)β†’(XΓ—Y,π’œβŠ—β„¬)(X,\mathcal{A})\to(X\times Y,\mathcal{A}\otimes\mathcal{B})). Further composing with the projection Ο€ 2:(XΓ—Y,π’œβŠ—β„¬)β†’(Y,ℬ)\pi_2:(X\times Y,\mathcal{A}\otimes\mathcal{B})\to(Y,\mathcal{B}) (or equivalently restricting the kernel to the sub-sigma-algebra Ο€ 2 βˆ’1(ℬ)\pi_2^{-1}(\mathcal{B})), we get a kernel sometimes denoted simply by r(B|x)r(B|x) or rβ€²(B|x)r'(B|x).

This kernel satisfies the property that for every Aβˆˆπ’œA\in\mathcal{A} and Bβˆˆβ„¬B\in\mathcal{B},

∫ Arβ€²(B|x)p(dx)=r(AΓ—B), \int_A r'(B|x) \, p(d x) = r(A\times B) ,

and is therefore a generalization of conditional probability to the non-discrete setting. Compare it to the formula

P(b|a)P(a)=P(a,b). P(b|a)\,P(a) = P(a,b) .

Categories of Markov kernels

References

See also:

Discussion via category theory:

  • F. W. Lawvere, The Category of Probabilistic Mappings, 1962, PDF.

  • Fredrik Dahlqvist, Vincent Danos, Ilias Garnier, and Alexandra Silva, Borel kernels and their approximation, categorically. In MFPS 34: Proceedings of the Thirty-Fourth Conference on the Mathematical Foundations of Programming Semantics, volume 341, 91–119, 2018. arXiv.

  • Tobias Fritz, A synthetic approach to Markov kernels, conditional independence and theorems on sufficient statistics. Adv. Math., 370:107239, 2020. arXiv:1908.07021.

  • NoΓ© Ensarguet, Paolo Perrone, Categorical probability spaces, ergodic decompositions, and transitions to equilibrium. arXiv.

category: probability

Last revised on June 29, 2026 at 11:37:16. See the history of this page for a list of all contributions to it.