2 Preliminaries

Notation

We use 0^n (respectively 1^n) to denote the length-n vector whose value is 0 (respectively 1) in every coordinate. For vectors x, y ∈ ℝ^n, we let

$$\|x - y\|_2 \triangleq \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - y_i)^2}$$

denote the normalized 2-norm. Unless specified otherwise, when we say that x and y are Δ-close (or Δ-far), we mean Δ-close in normalized 2-norm. Similarly, for a binary string π ∈ {0, 1}^n, we denote

$$\|\pi\|_2 \triangleq \sqrt{\frac{1}{n}\sum_{i=1}^{n}\pi_i^2}.$$

For a graph G = (V, E) and S ⊆ V, we use den(S) ∈ [0, 1] to denote the density of the subgraph induced by S,

$$\operatorname{den}(S) \triangleq \frac{|E \cap (S \times S)|}{|S|^2}.$$
2.1 Nash Equilibrium and Relaxations

A mixed strategy of player i is a distribution x_i over i's set of actions, A_i. We say that a vector of mixed strategies x ∈ ×_j Δ(A_j) is a Nash equilibrium if every strategy a_i in the support of every x_i is a best response to the actions of the mixed strategies of the rest of the players, x_{−i}. Formally, for every a_i ∈ supp(x_i),

$$\mathbb{E}_{a_{-i} \sim x_{-i}}\left[u_i(a_i, a_{-i})\right] = \max_{a_i' \in A_i} \mathbb{E}_{a_{-i} \sim x_{-i}}\left[u_i(a_i', a_{-i})\right].$$

Equivalently, x is a Nash equilibrium if each mixed strategy x_i is a best response to x_{−i}:

$$\mathbb{E}_{a \sim x}\left[u_i(a)\right] = \max_{x_i' \in \Delta(A_i)} \mathbb{E}_{a \sim (x_i';\, x_{-i})}\left[u_i(a)\right].$$
Each of those equivalent definitions can be generalized to include approximation in a different way.

Definition 2.1

ϵ-approximate Nash Equilibrium. We say that x is an ϵ-Approximate Nash Equilibrium (ϵ-ANE) if each x_i is an ϵ-best response to x_{−i}:

$$\mathbb{E}_{a \sim x}\left[u_i(a)\right] \geq \max_{x_i' \in \Delta(A_i)} \mathbb{E}_{a \sim (x_i';\, x_{-i})}\left[u_i(a)\right] - \epsilon.$$
On the other hand, generalizing the first definition of Nash equilibrium yields the following stricter notion:

Definition 2.2

ϵ-well-supported Nash Equilibrium. x is an ϵ-Well-Supported Nash Equilibrium (ϵ-WSNE) if every a_i in the support of x_i is an ϵ-best response to x_{−i}: for every a_i ∈ supp(x_i),

$$\mathbb{E}_{a_{-i} \sim x_{-i}}\left[u_i(a_i, a_{-i})\right] \geq \max_{a_i' \in A_i} \mathbb{E}_{a_{-i} \sim x_{-i}}\left[u_i(a_i', a_{-i})\right] - \epsilon.$$
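For two players these definitions are easy to operationalize. Below is a minimal NumPy sketch (our own illustration, not code from the text) that checks both conditions for a bimatrix game; since a best response among mixed strategies is always attained at a pure strategy, both checks only need the pure-action payoff vectors. Note that every ϵ-WSNE is in particular an ϵ-ANE.

```python
import numpy as np

def is_approx_nash(R, C, x, y, eps):
    """Check the eps-ANE condition (Definition 2.1) for a bimatrix game.

    R[a, b] / C[a, b]: payoffs of the row / column player under pure
    actions (a, b); x, y: mixed strategies as probability vectors.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    row_payoff = x @ R @ y            # row player's expected payoff
    col_payoff = x @ C @ y            # column player's expected payoff
    # The max over mixed responses is attained at a pure strategy:
    return (row_payoff >= (R @ y).max() - eps and
            col_payoff >= (x @ C).max() - eps)

def is_well_supported(R, C, x, y, eps):
    """Check the stricter eps-WSNE condition (Definition 2.2): every pure
    action in the support must itself be an eps-best response."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    row_vals = R @ y                  # row player's payoff per pure action
    col_vals = x @ C                  # column player's payoff per pure action
    return (all(row_vals[a] >= row_vals.max() - eps for a in np.flatnonzero(x)) and
            all(col_vals[b] >= col_vals.max() - eps for b in np.flatnonzero(y)))

# Matching Pennies: the uniform profile is an exact (eps = 0) equilibrium.
R = np.array([[1, -1], [-1, 1]]); C = -R
assert is_approx_nash(R, C, [0.5, 0.5], [0.5, 0.5], 0.0)
assert is_well_supported(R, C, [0.5, 0.5], [0.5, 0.5], 0.0)
```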
WeakNash

We can further relax the (already more lenient) notion of ϵ-ANE by requiring that the ϵ-best response condition only hold for most of the players (rather than all of them).

Definition 2.3

(ϵ, δ)-WeakNash [Babichenko et al. 2016]. We say that x is an (ϵ, δ)-WeakNash if for a (1 − δ)-fraction of i's, x_i is an ϵ-best mixed response to x_{−i}:

$$\Pr_{i}\left[\mathbb{E}_{a \sim x}\left[u_i(a)\right] \geq \max_{x_i' \in \Delta(A_i)} \mathbb{E}_{a \sim (x_i';\, x_{-i})}\left[u_i(a)\right] - \epsilon\right] \geq 1 - \delta.$$
Definition 2.4

(ϵ, δ)-well-supported WeakNash. x is an (ϵ, δ)-well-supported WeakNash if for a (1 − δ)-fraction of i's, every a_i in the support of x_i is an ϵ-best response to x_{−i}:

$$\Pr_{i}\left[\forall a_i \in \operatorname{supp}(x_i):\; \mathbb{E}_{a_{-i} \sim x_{-i}}\left[u_i(a_i, a_{-i})\right] \geq \max_{a_i' \in A_i} \mathbb{E}_{a_{-i} \sim x_{-i}}\left[u_i(a_i', a_{-i})\right] - \epsilon\right] \geq 1 - \delta.$$
2.2 PPAD and END-OF-A-LINE

The END-OF-A-LINE problem considers an implicitly represented, exponential-size, directed graph whose vertices have in- and out-degree at most 1 (this is without loss of generality). The special vertex 0^n has in-degree 0, and the goal is to find another odd-degree vertex. The graph is a union of lines and cycles, so in particular the line starting at 0^n ends with another odd-degree vertex.

The graph is implicitly defined with functions S, P that, for each vertex, give its Successor (outgoing neighbor) and Predecessor (incoming neighbor). In the computational variant of END-OF-A-LINE, S, P are given as explicit circuits, whereas in the query complexity variant they are given as oracles. There is also a communication complexity variant, whose definition is more involved and is deferred to Section 3.4.

In the formal definition of the computational variant, we also have to consider the case that S, P are inconsistent, i.e., for some u, v we have S(u) = v but P(v) ≠ u; we also allow the algorithm to succeed by finding such an inconsistency. (In the oracle model we can explicitly require that there are no inconsistencies.)

Definition 2.5

END-OF-A-LINE. The input to the problem is functions S, P : {0, 1}^n → {0, 1}^n, such that S(0^n) ≠ 0^n = P(0^n). The output is another odd-degree vertex 0^n ≠ v ∈ {0, 1}^n such that P(S(v)) ≠ v or S(P(v)) ≠ v.

The computational complexity class PPAD is defined as the class of all total search problems reducible to END-OF-A-LINE.
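The totality of the problem suggests an (exponentially slow) algorithm: follow successor pointers from 0^n until the line ends. The sketch below (our own) assumes oracle access to S and P as Python functions on n-bit strings; on circuit inputs this walk can take 2^n steps, which is exactly the regime the hardness results target.

```python
def end_of_line(S, P, n):
    """Follow the line that starts at the source 0^n to its other endpoint.
    S, P map n-bit strings to n-bit strings; the edge u -> S(u) is real
    only if P(S(u)) == u. The walk takes time linear in the length of the
    line, i.e., exponential in n in the worst case."""
    v = "0" * n
    while True:
        succ = S(v)
        if succ == v or P(succ) != v:
            return v                  # no consistent outgoing edge: an end
        v = succ

# Toy instance on 2-bit strings: a single line 00 -> 01 -> 11.
succ = {"00": "01", "01": "11", "11": "11", "10": "10"}
pred = {"00": "00", "01": "00", "11": "01", "10": "10"}
print(end_of_line(succ.get, pred.get, 2))  # -> "11"
```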

MEMBERSHIP END-OF-A-LINE

The following variant of END-OF-A-LINE is equivalent and more convenient for some of our reductions. In particular, the problem is restricted to a subset of the vertices. The restricted vertex set is defined implicitly via a membership function T : {0, 1}^n → {0, 1}. Abusing notation, we let T also denote the restricted set of vertices whose T-value is 1. We think of S and P as only applying to vertices in T, and connecting them to other vertices in T. Formally, we also allow the algorithm to return any violation of these conditions.

Definition 2.6

MEMBERSHIP END-OF-A-LINE. The input to the problem is functions S, P : {0, 1}^n → {0, 1}^n and T : {0, 1}^n → {0, 1}, such that S(0^n) ≠ 0^n = P(0^n) and T(0^n) = T(S(0^n)) = 1. The output is another odd-degree vertex 0^n ≠ v ∈ {0, 1}^n such that T(v) = 1 and v satisfies at least one of the following:

End-of-a-line. P(S(v)) ≠ v or S(P(v)) ≠ v; or

Boundary condition. T(S(v)) = 0 or T(P(v)) = 0.

Lemma 2.1

END-OF-A-LINE is equivalent to MEMBERSHIP END-OF-A-LINE.

Proof

Given an instance of END-OF-A-LINE, we can construct an equivalent instance of MEMBERSHIP END-OF-A-LINE by setting T ≡ 1. In the other direction, we can add self-loops to every vertex v such that T(v) = 0 (i.e., P(v) = S(v) = v); this guarantees that v is never a solution to the new END-OF-A-LINE instance.
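The self-loop argument in the second direction translates directly into code. A sketch (our own), assuming S, P, T are given as black-box Python functions; the circuit-level bookkeeping is elided:

```python
def membership_to_plain(S, P, T):
    """Lemma 2.1, second direction: give every vertex outside T a
    self-loop. Such vertices can never be solutions of the resulting
    plain END-OF-A-LINE instance, while edges that cross out of T become
    inconsistent pointers, so their T-side endpoints remain solutions."""
    def S2(v):
        return v if not T(v) else S(v)
    def P2(v):
        return v if not T(v) else P(v)
    return S2, P2
```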

We will be interested in restricted (but equally hard) variants of MEMBERSHIP END-OF-A-LINE. For example, in Section 16.2 we define LOCAL END-OF-A-LINE where, among other restrictions, T, S, P are AC0 circuits. Similarly, in Chapter 3, we consider a variant where each vertex is a priori restricted to have at most two potential incoming/outgoing neighbors, and the functions S, P merely specify which of the two neighbors is chosen. We then abuse notation and let S, P output just a single bit.

2.3 Exponential Time Hypotheses

Our quasi-polynomial hardness results are conditional on the following hypotheses. We begin with the “plain” ETH:

Hypothesis 2.1

Exponential time hypothesis (ETH) [Impagliazzo et al. 2001]. 3SAT takes time 2^{Ω(n)}.

Since a Nash equilibrium always exists, we are unlikely to have a reduction (even of subexponential size) from 3SAT to Nash equilibrium. Instead, we need to assume the following analogue of ETH for the total class PPAD:

Hypothesis 2.2

ETH for PPAD [Babichenko et al. 2016]. Solving END-OF-A-LINE requires time 2^{Ω̃(n)}.¹

In Section 13.3 we will prove a quasi-polynomial lower bound on the running time for counting the number of communities in a social network. This result is also conditional, but requires the following much weaker #ETH assumption:

Hypothesis 2.3

#ETH [Dell et al. 2014]. Given a 3SAT formula, counting the number of satisfying assignments takes time 2^{Ω(n)}.

2.4 PCP Theorems

2.4.1 2CSP and the PCP Theorem

In the 2CSP problem (Constraint Satisfaction Problem), we are given a graph G = (V, E) on |V| = n vertices, where each of the edges (u, v) ∈ E is associated with some constraint function ψ_{u,v} : Σ × Σ → {0, 1} that specifies a set of legal "colorings" of u and v, from some finite alphabet Σ (the 2 in the term "2CSP" stands for the "arity" of each constraint, which always involves two variables). We denote by ψ the entire 2CSP instance, and define OPT(ψ) as the optimum (maximum) fraction of satisfied constraints in the associated graph G, over all possible assignments (colorings) of V.
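As a concrete reference point, OPT(ψ) can be computed by exhaustive search over colorings in time roughly |Σ|^n — the baseline that the hardness results below address. The encoding of constraints as sets of allowed color pairs is our own illustrative choice:

```python
from itertools import product

def csp_value(n, alphabet_size, constraints):
    """OPT(psi) of a 2CSP by exhaustive search (time |Sigma|^n).
    constraints maps an edge (u, v) to the set of allowed color pairs."""
    best = 0.0
    for coloring in product(range(alphabet_size), repeat=n):
        sat = sum((coloring[u], coloring[v]) in allowed
                  for (u, v), allowed in constraints.items())
        best = max(best, sat / len(constraints))
    return best

# An odd cycle with "neighbors must differ" constraints over two colors is
# never fully satisfiable; the optimum is 2/3:
diff = {(0, 1), (1, 0)}
print(csp_value(3, 2, {(0, 1): diff, (1, 2): diff, (0, 2): diff}))  # 0.666...
```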

The starting point of some of our reductions is the following version of the Probabilistically Checkable Proof (PCP) theorem, which asserts that it is NP-hard to distinguish between a 2CSP instance whose value is 1 and one whose value is at most 1 − η, where η is some small constant:

Theorem 2.1

PCP Theorem [Dinur 2007]. Given a 3SAT instance φ of size n, there is a polynomial-time reduction that produces a 2CSP instance ψ, of size |ψ| = n · polylog n (variables and constraints) and constant alphabet size, such that

• (Completeness) If OPT(φ) = 1 then OPT(ψ) = 1.

• (Soundness) If OPT(φ) < 1 then OPT(ψ) < 1 − η, for some constant η = Ω(1).

• (Graph) The constraint graph is d-regular, for some constant d, and bipartite.

See, e.g., the full version of Braverman et al. [2017] or Aaronson et al. [2014] for derivations of this formulation of the PCP theorem.

Notice that since the size of the reduction is near-linear, ETH implies that solving the above problem requires near-exponential time.

Corollary 2.1

Let ψ be as in Theorem 2.1. Then assuming ETH, distinguishing between OPT(ψ) = 1 and OPT(ψ) < 1 − η requires time 2^{Ω̃(n)}.

Label Cover

Definition 2.7

LABEL COVER. LABEL COVER is a maximization problem, and a special case of 2CSP. The input is a bipartite graph G = (A, B, E), alphabets Σ_A, Σ_B, and a projection π_e : Σ_A → Σ_B for every e ∈ E.

The output is a labeling φ_A : A → Σ_A, φ_B : B → Σ_B. Given a labeling, we say that a constraint (or edge) (a, b) ∈ E is satisfied if π_{(a,b)}(φ_A(a)) = φ_B(b). The value of a labeling is the fraction of e ∈ E that are satisfied by the labeling. The value of the instance is the maximum fraction of constraints satisfied by any assignment.

We often encounter an assignment that only labels a subset of A ∪ B but leaves the rest unlabeled. We refer to such an assignment as a partial assignment to an instance; more specifically, for any V ⊆ A ∪ B, a V-partial assignment (or partial assignment on V) is a function ϕ : V → Σ. For notational convenience, we sometimes write Σ^V to denote the set of all functions from V to Σ.

Theorem 2.2

Moshkovitz-Raz PCP [Moshkovitz and Raz 2010 (Theorem 11)]. For every n and every ϵ > 0 (in particular, ϵ may be a function of n), solving 3SAT on inputs of size n can be reduced to distinguishing between the case that a (d_A, d_B)-bi-regular instance of LABEL COVER, with parameters |A| + |B| = n^{1+o(1)} · poly(1/ϵ), |Σ_A| = 2^{poly(1/ϵ)}, and d_A, d_B, |Σ_B| = poly(1/ϵ), is completely satisfiable, versus the case that it has value at most ϵ.

Counting the number of satisfying assignments is even harder. The following hardness result is well known, and we sketch its proof only for completeness:

Fact 2.1

There is a linear-time reduction from #3SAT to counting the number of satisfying assignments of a LABEL COVER instance.

Proof

Construct a vertex in A for each variable and a vertex in B for each clause. Set Σ_A ≜ {0, 1} and let Σ_B ≜ {0, 1}^3 \ {000} (i.e., Σ_B is the set of satisfying assignments to the literals of a 3SAT clause, after applying negations). Now if variable x appears in clause C, add a constraint that the assignments to x and C are consistent (taking into account the sign of x in C). Notice that: (i) any assignment to A corresponds to a unique assignment to the 3SAT formula; and (ii) if that assignment satisfies the 3SAT formula, it uniquely defines a satisfying assignment to B. Therefore there is a one-to-one correspondence between satisfying assignments to the 3SAT formula and satisfying labelings of the LABEL COVER instance.
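The construction in this proof is mechanical, and the following sketch spells it out; the encoding of a clause as a triple of signed integers (+x for variable x − 1, −x for its negation) is our own choice, not from the text.

```python
from itertools import product

def threesat_to_labelcover(clauses):
    """The construction from Fact 2.1: A-vertices are variables with
    Sigma_A = {0, 1}; B-vertices are clauses with Sigma_B = {0,1}^3 minus
    000; each (variable, clause) edge checks that the variable's bit is
    consistent with the clause label's bit at that position."""
    sigma_b = [t for t in product((0, 1), repeat=3) if any(t)]
    edges = []
    for c_idx, clause in enumerate(clauses):
        for pos, lit in enumerate(clause):
            def consistent(a, b, lit=lit, pos=pos):
                # b[pos] is the truth value of the literal; undo the sign.
                return a == (b[pos] if lit > 0 else 1 - b[pos])
            edges.append((abs(lit) - 1, c_idx, consistent))
    return edges, sigma_b

def count_labelcover_solutions(num_vars, clauses):
    """#3SAT via the one-to-one correspondence from the proof."""
    edges, sigma_b = threesat_to_labelcover(clauses)
    count = 0
    for assign in product((0, 1), repeat=num_vars):
        # The A-side labeling forces each clause's label; it extends to a
        # satisfying labeling iff the forced label avoids 000.
        if all(any(all(c(assign[v], b) for v, ci, c in edges if ci == idx)
                   for b in sigma_b)
               for idx in range(len(clauses))):
            count += 1
    return count

# count_labelcover_solutions(3, [(1, 2, 3)]) == 7: all assignments but 000.
```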

2.5 Learning Theory

For a universe (or ground set) U, a concept C is simply a subset of U and a concept class 𝒞 is a collection of concepts. For convenience, we sometimes relax the definition and allow the concepts to not be subsets of U; all definitions here extend naturally to this case.

The VC and Littlestone’s Dimensions can be defined as follows.

Definition 2.8

VC Dimension [Vapnik and Chervonenkis 1971]. A subset S ⊆ U is said to be shattered by a concept class 𝒞 if, for every T ⊆ S, there exists a concept C ∈ 𝒞 such that T = S ∩ C.

The VC Dimension VC-dim(𝒞, U) of a concept class 𝒞 with respect to the universe U is the largest d such that there exists a subset S ⊆ U of size d that is shattered by 𝒞.
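Definition 2.8 can be checked by brute force on small examples. The following sketch (our own, exponential-time) searches for the largest shattered subset:

```python
from itertools import combinations

def vc_dim(universe, concepts):
    """VC dimension by brute force: the largest subset S of the universe
    all of whose 2^|S| intersection patterns with the concepts are
    realized (Definition 2.8)."""
    concepts = [frozenset(c) for c in concepts]
    for d in range(len(universe), -1, -1):
        for s in combinations(universe, d):
            patterns = {frozenset(set(s) & c) for c in concepts}
            if len(patterns) == 2 ** d:
                return d

# Intervals over a line shatter any two points but never three (the
# middle of three points cannot be excluded alone):
intervals = [set(range(i, j)) for i in range(6) for j in range(i, 6)]
print(vc_dim(range(5), intervals))  # -> 2
```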

Definition 2.9

Mistake tree and Littlestone’s Dimension [Littlestone 1987]. A depth-d instance-labeled tree of U is a full binary tree of depth d such that every internal node of the tree is assigned an element of U. For convenience, we will identify each node in the tree canonically by a binary string s of length at most d.

A depth-d mistake tree (aka shattered tree [Ben-David et al. 2009]) for a universe U and a concept class 𝒞 is a depth-d instance-labeled tree of U such that, if we let v_s ∈ U denote the element assigned to the node s for every s ∈ {0, 1}^{<d}, then, for every leaf ℓ ∈ {0, 1}^d, there exists a concept C ∈ 𝒞 that agrees with the path from the root to ℓ; that is, for every 0 ≤ i < d, v_{ℓ≤i} ∈ C iff ℓ_{i+1} = 1, where ℓ≤i denotes the prefix of ℓ of length i.

The Littlestone's Dimension L-dim(𝒞, U) of a concept class 𝒞 with respect to the universe U is defined as the maximum d such that there exists a depth-d mistake tree for U, 𝒞.
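Littlestone's Dimension also admits a direct recursive characterization: an element x can root a depth-(d + 1) mistake tree iff the concepts containing x and those excluding x each admit a depth-d tree. A brute-force sketch (our own, exponential-time):

```python
def l_dim(universe, concepts):
    """Littlestone's Dimension by direct recursion over the class."""
    def rec(cs):
        if len(cs) <= 1:
            return 0                  # a single concept: depth-0 tree
        best = 0
        for x in universe:
            ones = [c for c in cs if x in c]
            zeros = [c for c in cs if x not in c]
            if ones and zeros:
                best = max(best, 1 + min(rec(ones), rec(zeros)))
        return best
    return rec([frozenset(c) for c in concepts])

# A chain of 9 threshold concepts over {0,...,7}: an adversary can
# binary-search, forcing floor(log2(9)) = 3 mistakes:
thresholds = [set(range(t, 8)) for t in range(9)]
print(l_dim(range(8), thresholds))  # -> 3
```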

An equivalent formulation of Littlestone’s Dimension is through mistakes made in online learning, as stated below. This interpretation will be useful in the proof of Theorem 14.4.

Definition 2.10

Mistake bound. An online algorithm A is an algorithm that, at time step i, is given an element x_i ∈ U and outputs a prediction p_i ∈ {0, 1} of whether x_i belongs to the unknown concept. After the prediction, the algorithm is told the correct answer h_i ∈ {0, 1}. The number of prediction mistakes of A on a sequence (x_1, h_1), …, (x_n, h_n) is the number of incorrect predictions, i.e., Σ_{i∈[n]} 1[p_i ≠ h_i]. The mistake bound of A for a concept class 𝒞 is defined as the maximum number of prediction mistakes of A over all sequences (x_1, h_1), …, (x_n, h_n) that correspond to a concept C ∈ 𝒞 (i.e., h_i = 1[x_i ∈ C] for all i ∈ [n]).

Theorem 2.3

Littlestone [1987]. For any universe U and any concept class 𝒞, L-dim(𝒞, U) is equal to the minimum, over all online algorithms, of the mistake bound for 𝒞, U.
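The "mistake bound ≤ L-dim" direction of Theorem 2.3 is witnessed by Littlestone's Standard Optimal Algorithm (SOA). A sketch, reusing the brute-force l_dim function from the previous snippet:

```python
def soa_predict(universe, version_space, x):
    """One SOA step: predict the label whose surviving version space has
    the larger Littlestone dimension (empty space counts as -1)."""
    ones = [c for c in version_space if x in c]
    zeros = [c for c in version_space if x not in c]
    d1 = l_dim(universe, ones) if ones else -1
    d0 = l_dim(universe, zeros) if zeros else -1
    return 1 if d1 >= d0 else 0

def soa_mistakes(universe, concepts, target, sequence):
    """Run SOA online against a target concept. On every mistake the
    Littlestone dimension of the version space strictly drops, so the
    total is at most L-dim(C, U), as Theorem 2.3 requires."""
    version_space = [frozenset(c) for c in concepts]
    mistakes = 0
    for x in sequence:
        h = int(x in target)          # the correct answer, revealed after
        mistakes += (soa_predict(universe, version_space, x) != h)
        version_space = [c for c in version_space if (x in c) == h]
    return mistakes

# With the thresholds example above: at most 3 mistakes on any sequence.
```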

The following facts are well-known and follow easily from the above definitions.

Fact 2.2

For any universe U and concept class 𝒞, we have

$$\operatorname{VC-dim}(\mathcal{C}, U) \leq \operatorname{L-dim}(\mathcal{C}, U) \leq \log_2 |\mathcal{C}|.$$
Fact 2.3

For any two universes U_1, U_2 and any concept class 𝒞,

$$\operatorname{L-dim}(\mathcal{C}, U_1 \cup U_2) \leq \operatorname{L-dim}(\mathcal{C}, U_1) + \operatorname{L-dim}(\mathcal{C}, U_2).$$
2.6 Information Theory

In this section, we introduce the information-theoretic quantities used mostly in Chapter 12. For a more thorough introduction, the reader is referred to Cover and Thomas [2012]. Unless stated otherwise, all logs in this book are base 2.

Definition 2.11

Let μ be a probability distribution on a sample space Ω. The Shannon entropy (or just entropy) of μ, denoted by H(μ), is defined as

$$H(\mu) \triangleq \sum_{\omega \in \Omega} \mu(\omega) \log \frac{1}{\mu(\omega)}.$$

Definition 2.12

Binary entropy function. For p ∈ [0, 1], the binary entropy function is defined as follows (with a slight abuse of notation): H(p) := − p log p − (1 − p) log(1 − p).

Fact 2.4

Concavity of binary entropy. Let μ be a distribution on [0, 1], and let p ∼ μ. Then H(E[p]) ≥ E[H(p)].

For a random variable A, we shall write H(A) to denote the entropy of the induced distribution on the support of A. We use the same abuse of notation for other information-theoretic quantities appearing later in this section.

Definition 2.13

The conditional entropy of a random variable A conditioned on B is defined as

$$H(A \mid B) \triangleq \mathbb{E}_{b \sim B}\left[H(A \mid B = b)\right].$$
Fact 2.5

Chain rule.

$$H(AB) = H(A) + H(B \mid A).$$
Fact 2.6

Conditioning decreases entropy. H(A | B) ≤ H(A).

Another measure we will use (briefly) in our proof is that of mutual information, which informally captures the correlation between two random variables.

Definition 2.14

Conditional mutual information. The mutual information between two random variables A and B, denoted by I(A; B), is defined as

$$I(A; B) \triangleq H(A) - H(A \mid B) = H(B) - H(B \mid A).$$

The conditional mutual information between A and B given C, denoted by I(A; B | C), is defined as

$$I(A; B \mid C) \triangleq H(A \mid C) - H(A \mid BC) = H(B \mid C) - H(B \mid AC).$$
The following is a well-known fact on mutual information.

Fact 2.7

Data processing inequality. Suppose we have the following Markov chain:

$$X \to Y \to Z,$$

where X ⊥ Z | Y. Then I(X; Y) ≥ I(X; Z) or, equivalently, H(X | Y) ≤ H(X | Z).

Mutual information is related to the Kullback-Leibler divergence (also known as relative entropy).

Definition 2.15

Kullback-Leibler Divergence. Given two probability distributions μ_1 and μ_2 on the same sample space Ω such that (∀ω ∈ Ω)(μ_2(ω) = 0 ⇒ μ_1(ω) = 0), the Kullback-Leibler divergence between them is defined as

$$D_{KL}(\mu_1 \,\|\, \mu_2) \triangleq \sum_{\omega \in \Omega} \mu_1(\omega) \log \frac{\mu_1(\omega)}{\mu_2(\omega)}.$$
The connection between the mutual information and the Kullback-Leibler divergence is provided by the following fact.

Fact 2.8

For random variables A, B, and C, we have

$$I(A; B \mid C) = \mathbb{E}_{b, c \sim (B, C)}\left[D_{KL}\big((A \mid B = b, C = c) \,\|\, (A \mid C = c)\big)\right].$$
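These identities are easy to sanity-check numerically. The following sketch (our own; base-2 logs as above) verifies Fact 2.8 on a random joint distribution of three binary random variables:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (base 2) of a distribution given as an array."""
    p = np.asarray(p, float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def kl(p, q):
    """D_KL(p || q); assumes q(w) = 0 implies p(w) = 0 (Definition 2.15)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float((p[mask] * np.log2(p[mask] / q[mask])).sum())

# A random joint distribution p(a, b, c) over binary A, B, C.
rng = np.random.default_rng(0)
joint = rng.random((2, 2, 2))
joint /= joint.sum()
p_c = joint.sum(axis=(0, 1))               # p(c)
p_ac = joint.sum(axis=1)                   # p(a, c)
p_bc = joint.sum(axis=0)                   # p(b, c)

# I(A; B | C) = H(A|C) + H(B|C) - H(AB|C), each via H(X|C) = H(XC) - H(C):
i_cond = (entropy(p_ac) - entropy(p_c)) + (entropy(p_bc) - entropy(p_c)) \
         - (entropy(joint) - entropy(p_c))

# E_{b,c ~ (B,C)} [ D_KL( (A | b, c) || (A | c) ) ]:
expect = sum(p_bc[b, c] * kl(joint[:, b, c] / p_bc[b, c], p_ac[:, c] / p_c[c])
             for b in range(2) for c in range(2))

assert abs(i_cond - expect) < 1e-9         # the two sides agree
```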
2.7 Useful Lemmata

2.7.1 Concentration

Lemma 2.2

Chernoff bound. Let X_1, …, X_n be independent and identically distributed (i.i.d.) random variables taking values in {0, 1}, and let p be the probability that X_i = 1. Then, for any δ > 0, we have

$$\Pr\left[\left|\frac{1}{n}\sum_{i=1}^{n} X_i - p\right| \geq \delta p\right] \leq 2 e^{-\min\{\delta, \delta^2\}\, pn/3}.$$
2.7.2 Pseudorandomness

Theorem 2.4

k-wise independence Chernoff bound [Schmidt et al. 1995 (Theorem 5.I)]. Let x_1, …, x_n ∈ [0, 1] be k-wise independent random variables, and let μ ≜ E[Σ_i x_i] and δ ≤ 1. Then

$$\Pr\left[\left|\sum_{i} x_i - \mu\right| \geq \delta \mu\right] \leq e^{-\Omega(\min\{k,\, \delta^2 \mu\})}.$$
2.7.3 λ-Biased Sets

Definition 2.16

λ-biased sets. Let G be a finite field, and t > 0 an integer. A multiset S ⊆ G^t is λ-biased if for every non-trivial character χ of G^t,

$$\left|\mathbb{E}_{s \sim S}\left[\chi(s)\right]\right| \leq \lambda.$$
Lemma 2.3

[Azar et al. 1998 (Theorem 3.2)]. A randomly chosen multiset S ⊆ G^t of cardinality Θ(t log |G| / λ²) is λ-biased with high probability.
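For a prime field the characters of G^t are explicit, so Definition 2.16 can be checked by direct enumeration — which is also the brute-force route taken in the next paragraph. A sketch (our own encoding; p prime, exponential in t):

```python
import cmath
import random
from itertools import product

def bias(S, p, t):
    """Largest |E_{s in S}[chi(s)]| over the nontrivial additive
    characters of (F_p)^t for prime p, where chi_y(x) = omega^{<x, y>}
    and omega is a primitive p-th root of unity."""
    omega = cmath.exp(2j * cmath.pi / p)
    worst = 0.0
    for y in product(range(p), repeat=t):
        if not any(y):
            continue                  # skip the trivial character
        avg = sum(omega ** (sum(a * b for a, b in zip(s, y)) % p)
                  for s in S) / len(S)
        worst = max(worst, abs(avg))
    return worst

p, t = 5, 2
full_group = list(product(range(p), repeat=t))
print(round(bias(full_group, p, t), 6))   # the whole group is 0-biased
sample = [tuple(random.randrange(p) for _ in range(t)) for _ in range(200)]
print(bias(sample, p, t))                 # small, as Lemma 2.3 predicts
```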

For many applications, an explicit construction is necessary. In our case, however, we can enumerate over all sets S of sufficient cardinality in quasi-polynomial time.² The following Sampling Lemma due to Ben-Sasson et al. [2003] allows us to estimate the average of any function over G^t using only one random line and (1 + o(1)) log₂ |G^t| bits of randomness:

Lemma 2.4

Sampling Lemma [Ben-Sasson et al. 2003 (Lemma 4.3)]. Let B : G^t → [0, 1]. Then, for any ϵ > 0,

$$\Pr_{x, y \in G^t}\left[\left|\mathbb{E}_{i \in G}\left[B(x + i \cdot y)\right] - \mathbb{E}_{z \in G^t}\left[B(z)\right]\right| \geq \epsilon\right] \leq O\!\left(\frac{1}{\epsilon^2 |G|}\right).$$
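A sketch of the estimator behind the lemma (our own, over a prime field): draw one random line {x + i·y : i ∈ G} and average B along it. The p points of a random line are pairwise independent, which yields the Chebyshev-style concentration above; the tighter randomness count in the text comes from drawing the direction from a small (e.g., λ-biased) set rather than uniformly.

```python
import random
from itertools import product

def line_estimate(B, p, t, rng=random):
    """Estimate the average of B over (F_p)^t by averaging B over one
    random line {x + i*y : i in F_p} (the estimator of Lemma 2.4)."""
    x = [rng.randrange(p) for _ in range(t)]
    y = [rng.randrange(p) for _ in range(t)]
    line = [tuple((xj + i * yj) % p for xj, yj in zip(x, y))
            for i in range(p)]
    return sum(B(pt) for pt in line) / p

# Example: B is the indicator of a random set of density about 1/2.
p, t = 101, 2
dense = {pt for pt in product(range(p), repeat=t) if random.random() < 0.5}
print(line_estimate(lambda pt: pt in dense, p, t))  # close to 0.5 w.h.p.
```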
2.7.4 Partitions

Given a 2CSP formula, we provide a few techniques to deterministically partition n variables into approximately √n subsets of approximately √n variables each, so that the number of constraints between every pair of subsets is approximately as expected.

Greedy Partition

Lemma 2.5

Let G = (V, E) be a d-regular graph and n ≜ |V|. We can partition V into n/k disjoint subsets {S_1, …, S_{n/k}} of size at most 2k such that:

$$\forall i \neq j \in [n/k]: \quad \left|E \cap (S_i \times S_j)\right| \leq \frac{8 d^2 k^2}{n}. \tag{2.1}$$
Proof

We assign vertices to subsets iteratively, and show by induction that we can always maintain (2.1) and the bound on the subset sizes. Since the average subset size is less than k, we have by Markov's inequality that at each step less than half of the subsets are full. The next vertex we want to assign, v, has neighbors in at most d subsets S_j. By our induction hypothesis, each such S_j is of size at most 2k, so it has at most 2dk outgoing edges; on average over i ∈ [n/k], S_j has less than 4dk²/n edges to each S_i. Applying Markov's inequality again, S_j has at least 8d²k²/n edges to less than a (1/2d)-fraction of the subsets S_i. In total, we ruled out less than half of the subsets for being full, and less than half for having too many edges to some subset that contains a neighbor of v. Therefore there always exists some subset S_i to which we can add v while maintaining the induction hypothesis.
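The proof is constructive and yields the following greedy procedure. This sketch (our own data layout) tracks pairwise edge counts explicitly and uses the proof's threshold, so by the induction argument the inner loop always finds a feasible subset on d-regular inputs with suitable k:

```python
from collections import Counter, defaultdict

def greedy_partition(adj, n, k, d):
    """Greedy partition from the proof of Lemma 2.5: assign vertices
    0..n-1 one at a time to one of n/k subsets, skipping subsets that are
    full (size 2k) or whose edge count toward a subset containing a
    neighbor of v would exceed the 8*d^2*k^2/n budget of (2.1)."""
    m = n // k
    subsets = [set() for _ in range(m)]
    where = {}                            # vertex -> index of its subset
    pair_edges = defaultdict(int)         # {i, j} -> #edges between S_i, S_j
    cap = 8 * d * d * k * k / n
    for v in range(n):
        placed_nbrs = Counter(where[u] for u in adj[v] if u in where)
        for i in range(m):
            if len(subsets[i]) >= 2 * k:
                continue                  # subset is full
            if any(pair_edges[frozenset((i, j))] + c > cap
                   for j, c in placed_nbrs.items()):
                continue                  # would violate (2.1)
            subsets[i].add(v)
            where[v] = i
            for j, c in placed_nbrs.items():
                pair_edges[frozenset((i, j))] += c
            break
        else:                             # the induction rules this out
            raise RuntimeError("no feasible subset found")
    return subsets
```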

Derandomized Partition

We use the Chernoff bound with Θ(log n)-wise independent variables to deterministically partition the variables into subsets of cardinality ≈ √n. Our (somewhat naive) deterministic algorithm for finding a good partition takes quasi-polynomial time (n^{O(log n)}), which is negligible with respect to the sub-exponential size of our reduction.³

Lemma 2.6

Let G = (A, B, E) be a bipartite (d_A, d_B)-bi-regular graph, and let n_A ≜ |A|, n_B ≜ |B|; set also n ≜ n_B + n_A and ρ ≜ √n. Let {T_1, …, T_{n_B/ρ}} be an arbitrary partition of B into disjoint subsets of size ρ. There is a quasi-polynomial deterministic algorithm (alternatively, linear-time randomized algorithm) that finds a partition of A into {S_1, …, S_{n_A/ρ}}, such that:

$$\forall i: \quad |S_i| = (1 \pm o(1))\,\rho \tag{2.2}$$

and

$$\forall i, j: \quad \left|E \cap (S_i \times T_j)\right| = (1 \pm o(1))\,\frac{d_B\, \rho^2}{n_A}. \tag{2.3}$$
Proof

Suppose that we place each a ∈ A into a uniformly random S_i. By the Chernoff bound and a union bound, (2.2) and (2.3) hold with high probability. Now, by the Chernoff bound for k-wise independent variables (Theorem 2.4), it suffices to partition A using a Θ(log n)-wise independent distribution. Such a distribution can be generated with a sample space of size n^{O(log n)} (e.g., Alon et al. [1986]). Therefore, we can enumerate over all possibilities in quasi-polynomial time. By the probabilistic argument, we will find at least one partition that satisfies (2.2) and (2.3).
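A sketch of the randomized step and of the property being checked (all names and the slack parameterization are our own). The uniform randomness below is exactly what the Θ(log n)-wise independent sample space replaces in the derandomized version:

```python
import numpy as np

def random_partition(n_A, num_subsets, seed=0):
    """The randomized step from this proof: place each a in A into a
    uniformly random S_i. The derandomized variant instead draws the
    choices from a Theta(log n)-wise independent sample space of size
    n^{O(log n)} and tries each of its points."""
    rng = np.random.default_rng(seed)
    assignment = rng.integers(0, num_subsets, size=n_A)
    return [set(np.flatnonzero(assignment == i).tolist())
            for i in range(num_subsets)]

def partition_ok(subsets, edges, B_partition, slack):
    """Check (2.2)/(2.3)-style conditions: every |S_i| and every edge
    count |E ∩ (S_i × T_j)| is within a (1 ± slack) factor of its mean."""
    mean_size = sum(len(s) for s in subsets) / len(subsets)
    if any(not (1 - slack) * mean_size <= len(s) <= (1 + slack) * mean_size
           for s in subsets):
        return False
    mean_edges = len(edges) / (len(subsets) * len(B_partition))
    for s in subsets:
        for t in B_partition:
            cnt = sum(1 for a, b in edges if a in s and b in t)
            if not (1 - slack) * mean_edges <= cnt <= (1 + slack) * mean_edges:
                return False
    return True
```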

2.7.5 How to Catch a Far-from-Uniform Distribution

The following lemma is due to Daskalakis and Papadimitriou [2009].

Lemma 2.7

Lemma 3 in the full version of Daskalakis and Papadimitriou [2009]. Let be real numbers satisfying the following properties for some . Then .

2.7.6 Simulation Theorem

Let D : {0, 1}^N → {0, 1} be a decision problem. We consider the following query complexity model (also called decision tree complexity). Each query is an index k ∈ [N], and the answer is the k-th bit of the input. The randomized query complexity of D is denoted by BPP^{dt}_δ(D), where δ is the allowed probability of error.

We also consider the following communication complexity model. Here, for every k ∈ [N], Alice holds a vector α_k ∈ {0, 1}^M and Bob holds an index β_k ∈ [M], for some M = poly(N). Their goal is to compute D on the input (α_1(β_1), …, α_N(β_N)). The standard bounded-error two-party probabilistic communication complexity of this simulated problem is denoted by BPP^{cc}_δ(D).
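Concretely, the communication problem is the composition of D with an index gadget: coordinate k of D's input is the β_k-th bit of Alice's vector α_k. A small sketch (our own encoding):

```python
def lift_with_index(D, alphas, betas):
    """The composed problem from Section 2.7.6: Alice holds the vectors
    alpha_1..alpha_N (M bits each), Bob holds the indices beta_1..beta_N;
    together they determine the input z_k = alpha_k[beta_k] on which D is
    evaluated. Protocols for this composition are what Theorem 2.5
    relates back to the randomized query complexity of D."""
    z = [alpha[beta] for alpha, beta in zip(alphas, betas)]
    return D(z)

# Toy example: D = parity of the N derived bits (only the derived bit of
# each coordinate matters, not the rest of Alice's vectors).
parity = lambda bits: sum(bits) % 2
alphas = [[0, 1, 1], [1, 0, 0]]     # N = 2, M = 3
betas = [2, 0]                      # z = (alphas[0][2], alphas[1][0]) = (1, 1)
print(lift_with_index(parity, alphas, betas))  # -> 0
```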

To “lift” from query complexity hardness to communication complexity, we use the following recent simulation theorem for BPP, due to Göös et al. [2017].

Theorem 2.5

BPP simulation theorem [Göös et al. 2017]. There exists M = poly(N) such that for any constant 0 < δ < 1/2,

$$BPP^{cc}_{\delta}(D) = \Theta\left(BPP^{dt}_{\delta}(D) \cdot \log N\right).$$
1. As usual, n is the size of the description of the instance, i.e., the size of the circuits S and P.

2. Note that we need an ϵ-biased set for a large field. Such constructions are not as common in the literature, which mostly focuses on the field 𝔽₂. To the best of our knowledge, existing explicit constructions for larger fields require much larger cardinality. Nevertheless, for our modest pseudo-randomness desiderata, we could actually use the explicit construction from Alon et al. [1992]. For ease of presentation, we prefer to brute-force derandomize the construction from Azar et al. [1998].

3. Do not confuse this with the quasi-polynomial lower bound we obtain for the running time of the community detection problem.
