Notice that, despite the minus sign in this formula, $s(A) = -\log \mu(A)$ is a nonnegative function (since $\log x \leq 0$ for $0 \leq x \leq 1$); more precisely, $s$ takes values in $[0, \infty]$. The term ‘surprisal’ is intended to suggest how surprised one ought to be upon learning that the event modelled by $A$ is true: from no surprise for an event with probability $1$ to infinite surprise for an event with probability $0$.
The expected surprisal of $A$ is then
$$ h(A) = \mu(A) \, s(A) = -\mu(A) \log \mu(A) $$
(with $h(A) = 0$ when $\mu(A) = 0$). Like $s$, $h$ is a nonnegative function; it is also important that $h$ is concave. Both $h(0)$ and $h(1)$ are $0$, but for different reasons: $h(1) = 0$ because, upon observing an event with probability $1$, one gains no information; while $h(0) = 0$ because one expects never to observe an event with probability $0$. The maximum possible value of $h$ is $e^{-1} \log e$ (so $1/e$ if we use natural logarithms), which occurs when $\mu(A) = 1/e$.
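As a quick illustration, here is a Python sketch of these two functions, working in nats (natural logarithms); the names `surprisal` and `expected_surprisal` are ours, not standard:

```python
import math

def surprisal(p):
    """Surprisal -log p of an event with probability p, in nats.

    Returns math.inf for p == 0 (infinite surprise) and 0.0 for p == 1.
    """
    if p == 0:
        return math.inf
    return -math.log(p)

def expected_surprisal(p):
    """Expected surprisal h(p) = -p log p, with h(0) = 0 by convention."""
    if p == 0:
        return 0.0
    return -p * math.log(p)

# h vanishes at both endpoints (for the two different reasons noted above)
# and peaks at p = 1/e, where its value is 1/e.
```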
We have not specified the base of the logarithm, which amounts to a constant factor (proportional to the logarithm of the base), which we think of as specifying the unit of measurement of entropy. Common choices for the base are $2$ (whose unit is the bit, originally a unit of memory in computer science), $2^8 = 256$ (byte: $8$ bits), $3$ (trit), $e$ (nat or neper), $10$ (bel, originally a unit of relative power intensity in telegraphy, or ban, dit, or hartley), and $10^{1/10}$ (decibel: $1/10$ of a bel). In applications to statistical physics, common bases are approximately $e^{1/k} \approx e^{7.24 \times 10^{22}}$ (joule per kelvin, where $k$ is Boltzmann's constant), $e^{1/R} \approx 1.65$ (calorie per mole-kelvin, where $R$ is the molar gas constant), etc.
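Since changing the base only rescales entropy by a constant, converting between units is a single division by the logarithm of the target base; a short Python sketch (the function name `convert` is our own):

```python
import math

def convert(nats, base):
    """Convert an entropy measured in nats to the unit given by `base`."""
    return nats / math.log(base)

# A bit is log 2 nats, a trit is log 3 nats, a ban (hartley) is log 10 nats.
one_bit_in_nats = math.log(2)
```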
Recall that a partition of a set $X$ is a family $\mathcal{P}$ of subsets of $X$ (the parts of the partition) such that $X$ is the union of the parts and any two distinct parts are disjoint (or better, for constructive mathematics, such that two parts are equal if their intersection is inhabited).
When $(X, \mu)$ is a probability space, we may relax both conditions: for the union of $\mathcal{P}$, we require only that it be a full set; for the intersections of pairs of elements of $\mathcal{P}$, we require only that they be null sets (or better, for constructive mathematics, that $A = B$ when $\mu^*(A \cap B) > 0$, where $\mu^*$ is the outer measure? corresponding to $\mu$).
For definiteness, call such a collection of subsets a $\mu$-almost partition; a $\mu$-almost partition is measurable if each of its parts is measurable (in which case we can use $\mu$ instead of $\mu^*$).
Given a probability measure space $(X, \mu)$ and a $\sigma$-algebra $\mathcal{M}$ of measurable subsets of $X$, the entropy of $\mathcal{M}$ with respect to $\mu$ is
$$ H_\mu(\mathcal{M}) = \sup \Big\{ \sum_{A \in \mathcal{F}} h(A) \;\Big|\; \mathcal{F} \subseteq \mathcal{M}, \ \mathcal{F} \text{ a finite partition of } X \Big\} \qquad (1) $$
This is a general mathematical definition of entropy.
In words, the entropy is the supremum, over all ways of expressing $X$ as an internal disjoint union of finitely many elements of the $\sigma$-algebra $\mathcal{M}$, of the sum, over these measurable sets, of the expected surprisals of these sets. This supremum can also be expressed as a limit as we take $\mathcal{F}$ to be finer and finer, since $h$ is concave and the partitions are directed.
We have written this so that $\mathcal{F}$ is a finite partition of $X$; without loss of generality, we may require only that $\mathcal{F}$ be a finite $\mu$-almost partition. In constructive mathematics, it seems that we must use this weakened condition, at least the part that allows $\bigcup \mathcal{F}$ merely to be full.
This definition is very general, and it is instructive to look at special cases.
Given a probability space $(X, \mu)$, the entropy of this probability space is the entropy, with respect to $\mu$, of the $\sigma$-algebra of all measurable subsets of $X$.
Every measurable almost-partition of a measure space (indeed, any family of measurable subsets) generates a $\sigma$-algebra. The entropy of a measurable almost-partition $\mathcal{P}$ of a probability measure space $(X, \mu)$ is the entropy, with respect to $\mu$, of the $\sigma$-algebra generated by $\mathcal{P}$. The formula (1) may then be written
$$ H_\mu(\mathcal{P}) = \sum_{A \in \mathcal{P}} h(A) = -\sum_{A \in \mathcal{P}} \mu(A) \log \mu(A) \qquad (2) $$
since an infinite sum (of nonnegative terms) may also be defined as a supremum. (Actually, the supremum in the infinite sum does not quite match the supremum in (1), so there is a bit of a theorem to prove here.)
In most of the following special cases, we will consider only partitions, although it would be possible also to consider more general $\sigma$-algebras.
Recall that a discrete probability space is a set $X$ equipped with a function $\mu\colon X \to \left]0, 1\right]$ such that $\sum_{i \in X} \mu(i) = 1$; since $\mu(i) > 0$ is possible for only countably many $i$, $X$ must be countable. We make $X$ into a measure space (with every subset measurable) by defining $\mu(A) = \sum_{i \in A} \mu(i)$. Since every inhabited set has positive measure, every almost-partition of $X$ is a partition; since every set is measurable, any partition is measurable.
Given a discrete probability space $(X, \mu)$ and a partition $\mathcal{P}$ of $X$, the entropy of $\mathcal{P}$ with respect to $\mu$ is defined to be the entropy of $\mathcal{P}$ with respect to the probability measure induced by $\mu$. Simplifying (2), we find
$$ H_\mu(\mathcal{P}) = -\sum_{A \in \mathcal{P}} \Big( \sum_{i \in A} \mu(i) \Big) \log \Big( \sum_{i \in A} \mu(i) \Big) $$
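A Python sketch of this simplified formula (the four-point space and the names `mu` and `partition_entropy` are illustrative inventions, not from the text). Coarsening a partition can only decrease the entropy, by concavity of $h$:

```python
import math

def h(p):
    """Expected surprisal h(p) = -p log p, with h(0) = 0."""
    return 0.0 if p == 0 else -p * math.log(p)

def partition_entropy(mu, partition):
    """Entropy of a partition of a discrete probability space.

    mu: dict mapping points to probabilities; partition: iterable of
    iterables of points.  Each part's measure is the sum over its points.
    """
    return sum(h(sum(mu[i] for i in part)) for part in partition)

# A hypothetical four-point space; lumping points together loses information,
# so the coarser partition has lower entropy.
mu = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}
fine = [['a'], ['b'], ['c'], ['d']]
coarse = [['a', 'b'], ['c', 'd']]
```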
More specially, the entropy of the discrete probability space $(X, \mu)$ is the entropy of the partition of $X$ into singletons; we find
$$ H_\mu(X) = -\sum_{i \in X} \mu(i) \log \mu(i) $$
This is actually a special case of the entropy of a probability space, since the $\sigma$-algebra generated by the singletons is the power set of $X$.
Yet more specially, the entropy of a finite set $X$ is the entropy of $X$ equipped with the uniform discrete probability measure $\mu(i) = 1/|X|$; we find
$$ H(X) = \log |X| \qquad (3) $$
Of all probability measures on $X$, the uniform measure has the maximum entropy?.
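Both Boltzmann's formula and the maximality of the uniform measure can be checked numerically; a hedged Python sketch (`shannon_entropy` and the perturbed distribution are our own illustrative choices):

```python
import math
import random

def shannon_entropy(probs):
    """H = -sum p log p in nats, skipping zero-probability points."""
    return -sum(p * math.log(p) for p in probs if p > 0)

n = 5
uniform = [1 / n] * n  # entropy should be log n, by Boltzmann's formula

# Any other distribution on the same 5 points should have lower entropy.
random.seed(0)
weights = [random.random() for _ in range(n)]
other = [w / sum(weights) for w in weights]
```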
Recall that a Borel measure? $\mu$ on an interval $D$ in the real line is absolutely continuous if $\mu(A) = 0$ whenever $A$ is a null set (with respect to Lebesgue measure), or better such that the Lebesgue measure of $A$ is positive whenever $\mu(A) > 0$. In this case, we can take the Radon–Nikodym derivative of $\mu$ with respect to Lebesgue measure, to get an integrable function $f$, called the probability distribution function; we recover $\mu$ by
$$ \mu(A) = \int_A f(x) \,\mathrm{d}x \qquad (4) $$
where the integral is taken with respect to Lebesgue measure.
If $\mathcal{P}$ is a partition (or a Lebesgue-almost-partition) of an interval $D$ into Borel sets, then the entropy of $\mathcal{P}$ with respect to an integrable function $f$ is the entropy of $\mathcal{P}$ with respect to the measure induced by $f$ using the integral formula (4); we find
$$ H_f(\mathcal{P}) = -\sum_{A \in \mathcal{P}} \Big( \int_A f(x) \,\mathrm{d}x \Big) \log \Big( \int_A f(x) \,\mathrm{d}x \Big) $$
On the other hand, the entropy of the probability distribution space $(D, f)$ is the entropy of the entire $\sigma$-algebra of all Borel sets (which is not generated by a partition) with respect to $f$; we find
$$ H(f) = -\int_D f(x) \log f(x) \,\mathrm{d}x $$
by a fairly complicated argument.
I haven't actually managed to check this argument yet, although my memory tags it as a true fact. —Toby
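The displayed integral can at least be checked numerically in simple cases; a Python sketch using a midpoint Riemann sum (the function name `differential_entropy` is ours). Note that, unlike the discrete case, this entropy can be negative: a uniform density on an interval of length $\ell$ has entropy $\log \ell$.

```python
import math

def differential_entropy(f, a, b, n=100_000):
    """Approximate -∫_a^b f(x) log f(x) dx by a midpoint Riemann sum."""
    dx = (b - a) / n
    total = 0.0
    for k in range(n):
        x = a + (k + 0.5) * dx
        fx = f(x)
        if fx > 0:
            total -= fx * math.log(fx) * dx
    return total

# Uniform density on [0, 2]: f = 1/2, so the entropy should be log 2;
# on [0, 1/2]: f = 2, so the entropy should be -log 2.
```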
So just as the entropy of a probability distribution $f$ is given by $-\int f \log f$, so the entropy of a density operator $\rho$ is
$$ H(\rho) = -\operatorname{tr}(\rho \log \rho) $$
using the functional calculus.
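In finite dimensions, the functional calculus amounts to applying $-x \log x$ to the eigenvalues of $\rho$. A stdlib-only Python sketch for the $2 \times 2$ real symmetric case (the helper name and examples are ours):

```python
import math

def von_neumann_entropy_2x2(rho):
    """Entropy -tr(ρ log ρ) of a 2x2 real symmetric density matrix.

    The functional calculus reduces log of a Hermitian matrix to log of
    its eigenvalues, computed here in closed form.
    """
    (a, b), (_, d) = rho
    m = (a + d) / 2
    r = math.sqrt(((a - d) / 2) ** 2 + b * b)
    eigs = [m + r, m - r]
    # zero eigenvalues contribute nothing, by the convention h(0) = 0
    return -sum(p * math.log(p) for p in eigs if p > 1e-15)

# A pure state (rank-1 projection) has zero entropy; the maximally
# mixed state I/2 has entropy log 2.
pure = [[1.0, 0.0], [0.0, 0.0]]
mixed = [[0.5, 0.0], [0.0, 0.5]]
```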
There is a way to fit this into the framework given by (1), but I don't remember it (and never really understood it).
For two finite probability distributions $p$ and $q$, their relative entropy is
$$ S(p \,\|\, q) = \sum_i p_i \, (\log p_i - \log q_i) $$
Or alternatively, for two density matrices $\rho$ and $\sigma$, their relative entropy is
$$ S(\rho \,\|\, \sigma) = \operatorname{tr}\big(\rho \, (\log \rho - \log \sigma)\big) $$
For more on this see relative entropy.
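A direct computation of the discrete formula (a Python sketch; the helper name is ours). Relative entropy is nonnegative, vanishes exactly when the distributions agree, and is infinite when $p$ charges a set that $q$ does not:

```python
import math

def relative_entropy(p, q):
    """Discrete relative entropy sum_i p_i (log p_i - log q_i), in nats."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf  # p charges a q-null point
            total += pi * (math.log(pi) - math.log(qi))
    return total
```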
As hinted above, any probability distribution on a phase space in classical physics has an entropy, and any density matrix on a Hilbert space in quantum physics has an entropy. However, this is microscopic entropy, which is not the usual entropy in thermodynamics and most other branches of physics. (In particular, microscopic entropy is conserved, rather than increasing with time.)
Instead, physicists use coarse-grained entropy, which corresponds mathematically to taking the entropy of a $\sigma$-algebra much smaller than the $\sigma$-algebra of all measurable sets. Given a classical system with $N$ microscopic degrees of freedom, we identify $n$ macroscopic degrees of freedom that we can reasonably expect to measure, giving a map from $\mathbb{R}^N$ to $\mathbb{R}^n$ (or more generally, a map from an $N$-dimensional microscopic phase space to an $n$-dimensional macroscopic phase space). Then the $\sigma$-algebra of all measurable sets in $\mathbb{R}^n$ pulls back to a $\sigma$-algebra on $\mathbb{R}^N$, and the macroscopic entropy of a statistical state is the conditional entropy? of this $\sigma$-algebra. (Typically, $N$ is on the order of Avogadro's number, while $n$ is rarely more than half a dozen, and often as small as $1$.)
If we specify a state by a point in $\mathbb{R}^n$, a macroscopic pure state, and assume a uniform probability distribution on its fibre in $\mathbb{R}^N$, then this results in the maximum entropy?. If this fibre were a finite set, then we would recover Boltzmann's formula (3). This is never exactly true in classical statistical physics, but it is often nevertheless a very good approximation. (Boltzmann's formula actually makes better physical sense in quantum statistical physics, even though Boltzmann himself did not live to see this.)
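A toy discrete analogue of this coarse-graining (all names here are illustrative, not from the text): eight equiprobable microstates observed only through a single parity bit. The macroscopic map pulls the measurable sets of the macro space back to the partition of the micro space into fibres, whose entropy is much smaller than that of the full partition:

```python
import math

def h(p):
    """Expected surprisal h(p) = -p log p, with h(0) = 0."""
    return 0.0 if p == 0 else -p * math.log(p)

# Eight equiprobable microstates, one coarse observable (a parity bit).
micro = {i: 1 / 8 for i in range(8)}

def macro(i):
    return i % 2  # the hypothetical macroscopic degree of freedom

# Push the measure forward along macro: each fibre gets the total
# probability of the microstates lying over it.
fibres = {}
for i, p in micro.items():
    fibres[macro(i)] = fibres.get(macro(i), 0.0) + p

coarse_entropy = sum(h(p) for p in fibres.values())  # log 2
fine_entropy = sum(h(p) for p in micro.values())     # log 8
```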
A more sophisticated approach (pioneered by Josiah Gibbs?) is to consider all possible mixed microstates (that is, all possible probability distributions on the space of pure microstates) whose expectation values of total energy and other extensive quantities (among those that are functions of the macrostate) match the given pure macrostate (point in $\mathbb{R}^n$). We pick the mixed microstate with the maximum entropy?. If this is a thermal state?, then we say that the macrostate has a temperature, but it has an entropy in any case.
The concept of entropy was introduced, by Rudolf Clausius in 1865, in the context of physics, and then adapted to information theory by Claude Shannon in 1948, to quantum mechanics by John von Neumann in 1955, to ergodic theory by Andrey Kolmogorov and Yakov Sinai in 1958, and to topological dynamics by Adler, Konheim and McAndrew in 1965.
A survey of entropy in operator algebras is in
A large collection of references on quantum entropy is in
After the concept of entropy proved enormously useful in practice, many people have tried to find a more abstract foundation for the concept (and its variants) by characterizing it as the unique measure satisfying some list of plausible-sounding axioms.
Entropy-like quantities appear in the study of many PDEs, with entropy estimates. For an intro see