The degree of randomness in a discrete probability distribution P = (p1, p2, …, pn) can be measured in terms of Shannon entropy [106], defined as:

H(P) = −∑kpklog(pk).
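As a concrete numerical sketch (illustrative, not from the text), the entropy −∑kpklog(pk) of a discrete distribution can be computed directly, with the usual convention that terms with pk = 0 contribute nothing:

```python
import math

def shannon_entropy(p):
    """Shannon entropy H(P) = -sum_k p_k log(p_k), using the natural log.

    Terms with p_k = 0 are skipped, since p log(p) -> 0 as p -> 0.
    """
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

# A fair coin has entropy log(2); a certain outcome has zero entropy.
print(shannon_entropy([0.5, 0.5]))  # log(2) ≈ 0.6931
print(shannon_entropy([1.0, 0.0]))  # zero entropy
```
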
Shannon entropy appears in fundamental contexts in communication theory and in statistical physics [100]. Early attempts to derive Shannon entropy from some deeper theory produced axiomatic derivations, the most popular of which, due to Khinchin, is given in the next section. The axiomatic approach is limited by the assumptions built into its axioms, however, so it was not until the fundamental role of relative entropy was established in an “information geometry” context [113–115] that a path to showing Shannon entropy is uniquely qualified as a measure was established (c. 1999). The fundamental (extremal optimum) aspect of relative entropy (with Shannon entropy as a special case) is found by differential-geometry arguments akin to those of Einstein on Riemannian spaces (here involving spaces defined by the family of exponential distributions). Whereas the “natural” local notions of metric and distance are given by the Minkowski metric and Euclidean distance, a similar analysis comparing distributions (evaluating their “distance” from each other) indicates that the natural measure is relative entropy (which reduces to Shannon entropy in variational contexts when the relative entropy is taken relative to the uniform distribution). Further details on this derivation are given in Chapter 8.
3.1.1 The Khinchin Derivation
In his now-famous 1948 paper [106], Claude Shannon provided a quantitative measure for entropy in connection with communication theory. The Shannon entropy measure was later put on a more formal footing by A. I. Khinchin in an article where he proves that, under certain assumptions, the Shannon entropy is unique [107]. (Dozens of similar axiomatic proofs have since been made.) A statement of the theorem is as follows:
Khinchin Uniqueness Theorem: Let H(p1, p2, …, pn) be a function defined for any integer n and for all values p1, p2, …, pn such that pk ≥ 0 (k = 1, 2, …, n) and ∑kpk = 1. If for any n this function is continuous with respect to its arguments, and if it obeys the three properties listed below, then H(p1, p2, …, pn) = −λ∑kpklog(pk), where λ is a positive constant (with Shannon entropy recovered for the convention λ = 1). The three properties are:
1 For given n and for ∑kpk = 1, the function takes its largest value for pk = 1/n (k = 1, 2, …, n). This is equivalent to Laplace’s principle of insufficient reason, which says that if you know nothing about the outcomes, assume the uniform distribution (this also agrees with Occam’s razor in assuming minimum structure).
2 H(ab) = H(a) + Ha(b), where Ha(b) = −∑a,bp(a)p(b|a)log(p(b|a)) is the conditional entropy. This is consistent with H(ab) = H(a) + H(b) when a and b are independent, with the conditional-probability form taking over when they are not.
3 H(p1, p2, …, pn, 0) = H(p1, p2, …, pn). This reductive relationship, or something like it, is implicitly assumed when describing any system in “isolation.”
Note that the above axiomatic derivation is still “weak” in that it assumes the existence of the conditional entropy in property (2).
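Property (2), the chain rule for entropy, is easy to verify numerically for any joint distribution; the sketch below uses arbitrary illustrative values for a 2×2 joint distribution p(a, b):

```python
import math

def H(probs):
    """Shannon entropy (natural log); zero-probability terms are skipped."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Arbitrary 2x2 joint distribution p(a, b), for illustration only.
p_joint = {("a1", "b1"): 0.3, ("a1", "b2"): 0.2,
           ("a2", "b1"): 0.1, ("a2", "b2"): 0.4}

# Marginal p(a).
p_a = {}
for (a, _), p in p_joint.items():
    p_a[a] = p_a.get(a, 0.0) + p

H_ab = H(p_joint.values())
H_a = H(p_a.values())

# Conditional entropy H_a(b) = -sum_{a,b} p(a,b) log p(b|a),
# where p(b|a) = p(a,b) / p(a).
H_b_given_a = -sum(p * math.log(p / p_a[a])
                   for (a, _), p in p_joint.items())

# Chain rule: H(ab) = H(a) + H_a(b).
assert abs(H_ab - (H_a + H_b_given_a)) < 1e-12
```

The identity holds exactly because log p(a, b) = log p(a) + log p(b|a); the numerical check only confirms it to floating-point precision.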
3.1.2 Maximum Entropy Principle
The law of large numbers (Section 2.6.1) and the related central limit theorem explain the ubiquitous appearance of the Gaussian (a.k.a. Normal) distribution in Nature and in statistical analysis. Even when speaking of probability distributions purely in the abstract, the Gaussian distribution still stands out in a singular way. This is revealed when seeking the discrete probability distribution that maximizes the Shannon entropy subject to constraints. The Lagrangian optimization method is a mathematical formalism for solving problems of this type, where you want to optimize something but must do so subject to constraints. Lagrangians are described in detail in Chapters 6, 7, and 10. For our purposes here, once you know how to group the terms to create the Lagrangian expression appropriate to your problem, the problem is reduced to simple differential calculus and algebra (you take the derivative of the Lagrangian and set it to zero, the classic way to find an extremum in calculus). I will skip most of the math here and simply state the Lagrangians and their solutions in the small examples that follow.
If there is no constraint on the probabilities other than that they sum to 1, the Lagrangian form for the optimization is as follows:
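A Lagrangian consistent with the solution quoted next (the entropy plus a multiplier λ enforcing normalization) is:

```latex
L(p_1,\dots,p_n,\lambda) \;=\; -\sum_k p_k \log(p_k) \;-\; \lambda\Big(\sum_k p_k - 1\Big)
```

so that ∂L/∂pk = −log(pk) − 1 − λ.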
where ∂L/∂pk = 0 → pk = e−(1 + λ) for all k, and thus pk = 1/n for a system with n outcomes. The maximum entropy hypothesis in this circumstance therefore recovers Laplace’s principle of insufficient reason, a.k.a. the principle of indifference: if you do not know any better, use the uniform distribution.
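This extremal property is easy to spot-check numerically; the sketch below (illustrative, not from the text) compares the uniform distribution's entropy, which equals log(n), against randomly drawn normalized distributions:

```python
import math
import random

def H(p):
    """Shannon entropy with natural log; zero terms skipped."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

n = 5
uniform = [1.0 / n] * n

# Any other normalized distribution over n outcomes has lower entropy.
random.seed(0)
for _ in range(1000):
    w = [random.random() for _ in range(n)]
    total = sum(w)
    p = [wi / total for wi in w]
    assert H(p) <= H(uniform) + 1e-12

print(H(uniform), math.log(n))  # maximum entropy equals log(n)
```
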
If you have as prior information the existence of the mean, μ, of some quantity x, then you have the Lagrangian:
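With the mean constraint added via a second multiplier δ, a Lagrangian consistent with the exponential solution quoted next reads:

```latex
L \;=\; -\sum_k p_k \log(p_k) \;-\; \lambda\Big(\sum_k p_k - 1\Big) \;-\; \delta\Big(\sum_k p_k x_k - \mu\Big)
```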
where ∂L/∂pk = 0 → pk = A exp(−δxk), leading to the exponential distribution. If instead we had as prior information the mean of a function f(xk) of some random variable X, a similar derivation would again yield an exponential form, pk = A exp(−δf(xk)), where now 1/A is not simply a normalization factor but is known as the partition function, which has a variety of generative properties vis‐à‐vis statistical mechanics and thermal physics.
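A minimal numerical sketch of this result (the function name, outcome values, and target mean are illustrative assumptions): given discrete outcomes xk and a target mean μ, the multiplier δ can be found by bisection, since the mean of pk = A exp(−δxk) decreases monotonically in δ; the resulting distribution is exponential in xk and reproduces μ:

```python
import math

def maxent_exponential(xs, mu, lo=-50.0, hi=50.0, tol=1e-12):
    """Return the maximum-entropy distribution p_k = A*exp(-delta*x_k)
    over outcomes xs with prescribed mean mu, found by bisection on delta."""
    def mean_for(delta):
        ws = [math.exp(-delta * x) for x in xs]
        Z = sum(ws)  # 1/A: the normalizing (partition) sum
        return sum(w * x for w, x in zip(ws, xs)) / Z

    # The induced mean decreases as delta increases, so bisect on it.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_for(mid) > mu:
            lo = mid
        else:
            hi = mid
    delta = 0.5 * (lo + hi)
    ws = [math.exp(-delta * x) for x in xs]
    Z = sum(ws)
    return [w / Z for w in ws], delta

# Outcomes 0..9 with target mean 3.0 (illustrative values).
p, delta = maxent_exponential(list(range(10)), 3.0)
print(sum(pk * k for k, pk in enumerate(p)))  # ≈ 3.0
```

Successive ratios p[k+1]/p[k] come out constant, confirming the exponential form.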
If you have as prior information the existence of the mean and variance of some quantity (the first and second statistical moments), then you have the Lagrangian:
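With multipliers δ1 and δ2 for the two moment constraints, a Lagrangian of the same pattern as the previous examples takes the form:

```latex
L \;=\; -\sum_k p_k \log(p_k) \;-\; \lambda\Big(\sum_k p_k - 1\Big)
\;-\; \delta_1\Big(\sum_k p_k x_k - \mu\Big) \;-\; \delta_2\Big(\sum_k p_k (x_k - \mu)^2 - \sigma^2\Big)
```

Setting ∂L/∂pk = 0 then gives pk = A exp(−δ1xk − δ2(xk − μ)²), a (discretized) Gaussian form, which is the singular status of the Gaussian anticipated at the start of this section.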