nLab empirical distribution

Contents

Idea

In probability theory, the empirical distribution is the probability distribution formed by counting how often each outcome of a phenomenon is observed, and dividing by the total number of observations.

For example, if we flip a coin 5 times, the empirical distribution is the probability distribution on the space $\{heads, tails\}$ given by

$$ p(heads) \;=\; \frac{\# heads}{5}, \qquad p(tails) \;=\; \frac{\# tails}{5} . $$

For instance, if we have obtained “heads” 3 times and “tails” 2 times, we have

$$ p(heads) \;=\; \frac{3}{5} \;=\; 0.6, \qquad p(tails) \;=\; \frac{2}{5} \;=\; 0.4. $$
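The coin example can be reproduced with a minimal Python sketch. The function name `empirical_distribution` and the list `flips` are illustrative choices, not part of any library; exact fractions are used so that the values match the formula above.

```python
from collections import Counter
from fractions import Fraction


def empirical_distribution(samples):
    """Map each observed outcome to its relative frequency."""
    samples = list(samples)
    if not samples:
        raise ValueError("need at least one observation")
    counts = Counter(samples)  # empirical frequencies
    n = len(samples)           # total number of observations
    return {outcome: Fraction(count, n) for outcome, count in counts.items()}


# Hypothetical data: 3 heads and 2 tails, as in the example above.
flips = ["heads", "tails", "heads", "heads", "tails"]
p = empirical_distribution(flips)
print(p["heads"], p["tails"])  # 3/5 2/5
```

Outcomes that were never observed do not appear as keys; they implicitly receive probability 0.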

The name empirical distribution denotes both the distribution obtained from a finite sample of data and the limit (when it exists) obtained from an infinite sequence of observations, usually generated by a stochastic process.

In statistics it is used as an estimator of the distribution of a random variable whenever it is possible to take i.i.d. samples.

In measure-theoretic probability

Let $X$ be a measurable space. For each $x\in X$, denote by $\delta_x$ the Dirac delta distribution given by

$$ \delta_x(A) \;=\; 1_A(x) \;=\; \begin{cases} 1 & x\in A ; \\ 0 & x\notin A \end{cases} $$

for all measurable $A\subseteq X$.

Let now $N$ be a finite set of cardinality $n \ge 1$. We can view the product space $X^N$ as the space of finite sequences $(x_1,\dots,x_n)$ of elements of $X$. The empirical distribution of a finite sequence $(x_1,\dots,x_n)\in X^N$ is the probability measure on $X$ given by

$$ \frac{\delta_{x_1}+\dots+\delta_{x_n}}{n} , $$

meaning that it assigns to each measurable $A\subseteq X$ the value

$$ \frac{1_A(x_1)+\dots+1_A(x_n)}{n} \;=\; \frac{\#\{i : x_i \in A\}}{n} . $$
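As a sketch of this definition, the following Python code represents a measurable set $A$ by its indicator function and returns the fraction of indices $i$ with $x_i \in A$. The name `empirical_measure` and the sample values are assumptions made for illustration only.

```python
from fractions import Fraction


def empirical_measure(xs):
    """Return the map A -> #{i : x_i in A} / n, with A given by its indicator."""
    xs = list(xs)
    n = len(xs)
    if n == 0:
        raise ValueError("the sequence must be non-empty")

    def mu(indicator):
        # indicator(x) plays the role of 1_A(x); each index i is counted once,
        # so repeated values contribute with their multiplicity.
        return Fraction(sum(1 for x in xs if indicator(x)), n)

    return mu


# Hypothetical sequence in X = real numbers; the value 0.7 occurs twice.
mu = empirical_measure([0.2, 0.7, 0.7, 1.5])
print(mu(lambda x: x <= 1))    # 3/4
print(mu(lambda x: x == 0.7))  # 1/2
print(mu(lambda x: True))      # 1
```

The last line checks that the whole space receives mass 1, as it should for a probability measure.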

Similarly, we can view the countable product $X^\mathbb{N}$ as the space of infinite sequences $(x_1,x_2,x_3,\dots)$ of elements of $X$. The empirical distribution of a sequence $(x_1,x_2,x_3,\dots)\in X^\mathbb{N}$ is the probability measure on $X$ given by the limit, if it exists,

$$ \lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^n \delta_{x_i} . $$
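To illustrate the limit, here is a small Python sketch that evaluates the partial averages on a single event for a deterministic periodic sequence. The sequence (heads, heads, tails, repeated forever) and the helper names are hypothetical; for this particular sequence the mass given to heads approaches 2/3.

```python
from fractions import Fraction
from itertools import cycle, islice


def running_frequency(sequence, outcome, n):
    """Mass that the empirical distribution of the first n terms gives to {outcome}."""
    first_n = list(islice(sequence, n))
    return Fraction(sum(1 for x in first_n if x == outcome), len(first_n))


def periodic():
    # Hypothetical infinite sequence: heads, heads, tails, repeated forever.
    return cycle(["heads", "heads", "tails"])


for n in (1, 10, 100, 1000):
    print(n, running_frequency(periodic(), "heads", n))
# 1 1
# 10 7/10
# 100 67/100
# 1000 667/1000
```

Only finitely many terms can be inspected in code, so this shows the trend of the partial averages rather than proving that the limit exists.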

If the $x_i$ are random variables, so that they form a stochastic process (for example, if they are coin flips), then the empirical distribution, if it exists, is a random variable as well, with values in the probability measures on $X$.
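The randomness of the empirical distribution can be seen in a short simulation. The sketch below assumes fair coin flips generated with Python's standard `random` module; the function name `empirical_from_flips` and the seeds are illustrative. Each seed stands for one realization of the process and may yield a different distribution.

```python
import random
from collections import Counter


def empirical_from_flips(n, seed):
    """Empirical distribution of n simulated fair coin flips for a given seed."""
    rng = random.Random(seed)  # a seeded generator makes each realization reproducible
    flips = [rng.choice(["heads", "tails"]) for _ in range(n)]
    counts = Counter(flips)    # missing outcomes count as 0
    return {side: counts[side] / n for side in ("heads", "tails")}


# Three realizations of the same experiment.
for seed in (0, 1, 2):
    print(seed, empirical_from_flips(5, seed))
```

Increasing `n` makes the realizations cluster around (0.5, 0.5), which is the behaviour that justifies using the empirical distribution as an estimator for i.i.d. samples.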

In categorical probability

(…)

Properties

See also

category: probability

Last revised on July 15, 2024 at 16:47:57.