
3.1.2 Maximum Entropy Principle


The law of large numbers (Section 2.6.1) and the related central limit theorem explain the ubiquitous appearance of the Gaussian (a.k.a. Normal) distribution in Nature and in statistical analysis. Even when probability distributions are considered purely in the abstract, the Gaussian still stands out among them in a singular way. This is revealed when seeking the discrete probability distribution that maximizes the Shannon entropy subject to constraints. The Lagrangian optimization method is a mathematical formalism for solving problems of this type, where you want to optimize something but must do so subject to constraints. Lagrangians are described in detail in Chapters 6, 7, and 10. For our purposes here, once you know how to group the terms to create the Lagrangian expression appropriate to your problem, the problem reduces to simple differential calculus and algebra (you take the derivative of the Lagrangian and solve for where it is zero, the classic way to find an extremum in calculus). I will skip most of the math here and just state the Lagrangians and their solutions in the small examples that follow.

If there is no constraint on the probabilities other than that they sum to 1, the Lagrangian for the optimization is as follows:

L = −∑k pk ln(pk) − λ(∑k pk − 1)
where ∂L/∂pk = 0 → pk = exp(−(1 + λ)), a constant for all k, thus pk = 1/n for a system with n outcomes. The maximum entropy hypothesis in this circumstance thus recovers Laplace’s Principle of Insufficient Reason, a.k.a. the principle of indifference: if you do not know any better, use the uniform distribution.
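As a quick numerical sanity check of this result, the constrained maximization can be run directly. The following is a minimal sketch (my own illustration, not from the text) using scipy’s general-purpose constrained optimizer; the variable names are illustrative.

# Numerical check: maximize Shannon entropy subject only to normalization;
# the optimum should be the uniform distribution pk = 1/n.
import numpy as np
from scipy.optimize import minimize

n = 5

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)            # guard against log(0)
    return np.sum(p * np.log(p))           # minimizing -H(p) maximizes H(p)

constraints = [{"type": "eq", "fun": lambda p: np.sum(p) - 1.0}]
p0 = np.random.dirichlet(np.ones(n))       # random normalized starting point
res = minimize(neg_entropy, p0, bounds=[(0.0, 1.0)] * n, constraints=constraints)
print(res.x)                               # approximately [0.2, 0.2, 0.2, 0.2, 0.2]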

If you have as prior information the existence of the mean, μ, of some quantity x, then you have the Lagrangian:

L = −∑k pk ln(pk) − λ(∑k pk − 1) − δ(∑k pk xk − μ)
where ∂L/∂pk = 0 → pk = A exp(−δxk), leading to the exponential distribution. If instead the prior information were the mean of a function f(xk) of some random variable X, a similar derivation would again yield an exponential form, pk = A exp(−δ f(xk)), where now A is not simply a normalization factor but is known as the partition function, which has a variety of generative properties vis‐à‐vis statistical mechanics and thermal physics.
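The mean-constrained case can be checked numerically as well. The sketch below (again my own illustration, not from the text) uses a die-like example with outcomes xk = 1, ..., 6 and an assumed prior mean of 4.5; the log-probabilities of the solution should come out linear in xk, as expected for pk = A exp(−δxk).

# Numerical check: maximize entropy subject to normalization and a mean constraint;
# the optimum should have the exponential form pk = A*exp(-delta*xk).
import numpy as np
from scipy.optimize import minimize

x = np.arange(1, 7, dtype=float)           # outcomes 1..6
mu = 4.5                                   # assumed prior mean (> 3.5, so tilted upward)

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return np.sum(p * np.log(p))

constraints = [{"type": "eq", "fun": lambda p: np.sum(p) - 1.0},
               {"type": "eq", "fun": lambda p: np.sum(p * x) - mu}]
p0 = np.ones(len(x)) / len(x)
res = minimize(neg_entropy, p0, bounds=[(0.0, 1.0)] * len(x), constraints=constraints)
print(res.x)                               # probabilities grow geometrically with x
print(np.diff(np.log(res.x)))              # roughly constant: log pk is linear in xk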

If you have as prior information the existence of the mean, μ, and variance, σ², of some quantity (its first moment and second central moment), then you have the Lagrangian:

L = −∑k pk ln(pk) − λ(∑k pk − 1) − δ(∑k pk xk − μ) − γ(∑k pk (xk − μ)² − σ²)
where ∂L/∂pk = 0 → pk = A exp(−δxk − γ(xk − μ)²), which is the Gaussian distribution (see Exercise 3.3).
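As a hint of the step behind this claim (the full derivation is left to Exercise 3.3), completing the square in the exponent gives

−δxk − γ(xk − μ)² = −γ(xk − μ′)² + constant, with μ′ = μ − δ/(2γ),

so pk is proportional to exp(−γ(xk − μ′)²), a Gaussian in xk with effective variance 1/(2γ). The notation μ′ for the shifted center is mine, not the book’s.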

With the introduction of Shannon entropy above, c. 1948, a reformulation of statistical mechanics was indicated (Jaynes [112]), whereby entropy could be made the starting point for the entire theory: maximizing entropy under whatever constraints a system imposes immediately gives rise to the classic distributions seen in Nature for various systems (an alternate starting point for deriving statistical mechanics already noted by Maxwell over 100 years ago). So instead of introducing other statistical mechanics concepts (ergodicity, equal a priori probabilities, etc.) and matching the resulting derivations to phenomenological thermodynamics equations to obtain entropy, with the Jaynes derivation we start with entropy and maximize it directly to obtain the rest of the theory.
