
2.3 Big M


Global optimization, or the problem of finding the minimum of a function with arbitrarily many local minima, is NP-complete in general [30], meaning – in layman's terms – that it is impossibly hard. In the absence of a tractable theory by which one might prove that a global optimization procedure works, brute-force grid and random searches and heuristic methods such as particle swarm optimization [31] and genetic algorithms [32] have been popular. Due to the overwhelming difficulty of global optimization, a large portion of the optimization literature has focused on the particularly well-behaved class of convex functions [33, 34], which do not admit multiple local minima. Since Fisher introduced his "maximum likelihood" in 1922 [35], statisticians have thought in terms of maximization, but convexity theory still applies after a trivial negation of the objective function. Nonetheless, most statisticians could safely ignore questions of concavity during the twentieth century: exponential family log-likelihoods are concave, so Newton–Raphson and Fisher scoring are guaranteed to find the global optimum in the context of GLMs [12, 34].
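
To make the GLM claim concrete, the sketch below fits a logistic regression by Fisher scoring, which coincides with Newton–Raphson under the canonical logit link; because the log-likelihood is concave, the iteration cannot be trapped in a spurious local optimum. The function name and simulated data are illustrative assumptions, not taken from the chapter.

```python
import numpy as np

def fisher_scoring_logistic(X, y, n_iter=25, tol=1e-10):
    """Fit a logistic regression by Fisher scoring, which coincides with
    Newton-Raphson under the canonical logit link. Concavity of the
    log-likelihood means any stationary point is the global maximum."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))   # fitted probabilities
        w = mu * (1.0 - mu)                    # GLM iterative weights
        score = X.T @ (y - mu)                 # gradient of the log-likelihood
        info = X.T @ (X * w[:, None])          # expected (= observed) information
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:         # convergence check
            break
    return beta

# Illustrative simulated data
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))
print(fisher_scoring_logistic(X, y))   # recovers values close to beta_true
```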

Nearing the end of the twentieth century, multimodality and nonconvexity became more important for statisticians considering high-dimensional regression, that is, regression with many covariates (big $P$). Here, for purposes of interpretability and variance reduction, one would like to induce sparsity on the weights vector $\boldsymbol{\beta}$ by performing best subset selection [36, 37]:

$$\hat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta} \in \mathbb{R}^P}{\arg\min} \; \| \mathbf{y} - \mathbf{X}\boldsymbol{\beta} \|_2^2 \quad \text{subject to} \quad \| \boldsymbol{\beta} \|_0 \leq k \qquad (3)$$

where $\mathbf{y}$ is the $N$-vector of responses, $\mathbf{X}$ is the $N \times P$ design matrix, and $\|\cdot\|_0$ denotes the $\ell_0$-norm, that is, the number of nonzero elements. Because best subset selection requires an immensely difficult nonconvex optimization, Tibshirani [38] famously replaces the $\ell_0$-norm with the $\ell_1$-norm, thereby providing sparsity while nonetheless maintaining convexity.
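
To illustrate the computational gap between the two formulations, here is a minimal sketch with illustrative function names and simulated data: an exact solver for the $\ell_0$-constrained problem in Equation (3) that enumerates every candidate support, and a coordinate-descent lasso that solves the convex $\ell_1$ relaxation by soft-thresholding. The brute-force enumeration is feasible only for very small $P$.

```python
import numpy as np
from itertools import combinations

def best_subset(X, y, k):
    """Exact solution of the l0-constrained least-squares problem in
    Equation (3) by brute force: enumerate every support of size <= k.
    The number of candidate supports grows combinatorially in P."""
    n, p = X.shape
    best_rss, best_beta = np.inf, np.zeros(p)
    for size in range(k + 1):
        for support in combinations(range(p), size):
            beta = np.zeros(p)
            if support:
                cols = list(support)
                beta[cols] = np.linalg.lstsq(X[:, cols], y, rcond=None)[0]
            rss = np.sum((y - X @ beta) ** 2)
            if rss < best_rss:
                best_rss, best_beta = rss, beta
    return best_beta

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Tibshirani's convex l1 relaxation, minimizing
    0.5 * ||y - X beta||_2^2 + lam * ||beta||_1
    by cyclic coordinate descent with soft-thresholding."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)
    for _ in range(n_iter):
        for j in range(p):
            resid_j = y - X @ beta + X[:, j] * beta[j]   # partial residual
            rho = X[:, j] @ resid_j
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return beta

# Illustrative comparison on simulated data
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 6))
beta_true = np.array([2.0, 0.0, 0.0, -1.5, 0.0, 0.0])
y = X @ beta_true + 0.1 * rng.normal(size=60)
print(best_subset(X, y, k=2))                   # sparse, exact, exponential in P
print(lasso_coordinate_descent(X, y, lam=5.0))  # sparse, convex, scalable
```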

Historically, Bayesians have paid much less attention to convexity than have optimization researchers. This is most likely because the basic theory [13] of MCMC does not require such restrictions: even if a target distribution has one million modes, the well-constructed Markov chain explores them all in the limit. Despite these theoretical guarantees, a small literature has developed to tackle multimodal Bayesian inference [39–42] because multimodal target distributions do present a challenge in practice. In analogy with Equation (3), Bayesians seek to induce sparsity by specifying priors such as the spike-and-slab [43–45], for example,

$$\beta_p \mid \gamma_p \sim \gamma_p \, \mathrm{N}(0, \sigma_\beta^2) + (1 - \gamma_p) \, \delta_0, \qquad \gamma_p \sim \mathrm{Bernoulli}(\pi), \qquad p = 1, \dots, P$$

As with the best subset selection objective function, the spike-and-slab target distribution becomes heavily multimodal as $P$ grows and the support of $\boldsymbol{\gamma}$'s discrete distribution grows to $2^P$ potential configurations.
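
The combinatorial growth is easy to see in a toy sketch. Assuming a normal slab with variance slab_var, a Bernoulli($\pi$) inclusion prior on each $\gamma_p$, and known noise variance (all illustrative choices, not the chapter's model), the code below integrates out the slab coefficients and enumerates the posterior weight of every one of the $2^P$ inclusion patterns; already at $P = 4$ there are 16 configurations, and exhaustive enumeration becomes hopeless as $P$ grows.

```python
import numpy as np
from itertools import product
from scipy.stats import multivariate_normal

def log_marginal(y, X, gamma, sigma2=1.0, slab_var=10.0):
    """log p(y | gamma) with the slab coefficients integrated out:
    y ~ N(0, sigma2 * I + slab_var * X_g X_g^T)."""
    n = len(y)
    cov = sigma2 * np.eye(n)
    active = np.flatnonzero(gamma)
    if active.size:
        Xg = X[:, active]
        cov += slab_var * Xg @ Xg.T
    return multivariate_normal.logpdf(y, mean=np.zeros(n), cov=cov)

# Simulated data with two truly active covariates
rng = np.random.default_rng(1)
P, N, pi = 4, 50, 0.5
X = rng.normal(size=(N, P))
y = X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=N)

# Enumerate all 2^P inclusion patterns -- feasible only for tiny P
configs = list(product([0, 1], repeat=P))
log_post = []
for g in configs:
    k = sum(g)
    log_prior = k * np.log(pi) + (P - k) * np.log(1.0 - pi)
    log_post.append(log_marginal(y, X, np.array(g)) + log_prior)
log_post = np.array(log_post)
post = np.exp(log_post - log_post.max())
post /= post.sum()

# Print the three highest-probability configurations of gamma
for g, w in sorted(zip(configs, post), key=lambda t: -t[1])[:3]:
    print(g, round(float(w), 3))
```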

In the following section, we present an alternative Bayesian sparse regression approach that mitigates the combinatorial problem, along with a state-of-the-art computational technique that scales well in both $N$ and $P$.

