4 Optimizing Runtime Complexity
4.1 Introduction
The algorithm presented in Chapter 3 was the first algorithm to outperform FFT in practice for reasonably sparse signals. However, it has a runtime of O(log n √(nk log n)), which is polynomial in n and only outperforms FFT for k smaller than Θ(n/log n).
4.1.1 Results
In this chapter, we address this limitation by presenting two new algorithms for the Sparse Fourier Transform. We show
• an O(k log n)-time algorithm for the exactly k-sparse case, and
• an O(k log n log(n/k))-time algorithm for the general case.
The key property of both algorithms is their ability to achieve o(n log n) time, and thus improve over the FFT, for any k = o(n). These algorithms are the first known algorithms that satisfy this property. Moreover, if one assumes that FFT is optimal and hence the DFT cannot be computed in less than O(n log n) time, the algorithm for the exactly k-sparse case is optimal as long as k = n^{Ω(1)}. Under the same assumption, the result for the general case is at most one log log n factor away from the optimal runtime for the case of “large” sparsity k = n/log^{O(1)} n.
For the general case, given a signal x, the algorithm computes a k-sparse approximation x̂′ of its Fourier transform x̂ that satisfies the following ℓ₂/ℓ₂ guarantee:
‖x̂ − x̂′‖₂ ≤ C · min_{k-sparse y} ‖x̂ − y‖₂,    (4.1)
where C is some approximation factor and the minimization is over k-sparse signals.
Furthermore, our algorithm for the exactly sparse case is quite simple and has low big-Oh constants. In particular, our implementation of a variant of this algorithm, described in Chapter 6, is faster than FFTW, a highly efficient implementation of the FFT, for n = 2²² and k ≤ 2¹⁷ [Hassanieh et al. 2012]. In contrast, for the same signal size, the algorithms in Chapter 3 were faster than FFTW only for k ≤ 2000.
We complement our algorithmic results by showing that any algorithm that works for the general case must use at least Ω(k log(n/k)/ log log n) samples from x. The proof of this lower bound can be found in Appendix C. The lower bound uses techniques from Price and Woodruff [2011], which shows a lower bound of Ω(k log(n/k)) for the number of arbitrary linear measurements needed to compute the k-sparse approximation of an n-dimensional vector x̂. In comparison to Price and Woodruff [2011], our bound is slightly worse but it holds even for adaptive sampling, where the algorithm selects the samples based on the values of the previously sampled coordinates. Note that our algorithms are non-adaptive, and thus limited by the more stringent lower bound of Price and Woodruff [2011].
4.1.2 Techniques
Recall from Chapter 3 that we can use the flat window filters coupled with a random permutation of the spectrum to bin/bucketize the Fourier coefficients into a small number of buckets. We can then use that to estimate the positions and values of the large frequency coefficients that were isolated in their own bucket. Here, we use the same filters introduced in Chapter 3. In this case, the filter G has the property that the value of Ĝ is “large” over a constant fraction of the pass region, referred to as the “super-pass” region. We say that a coefficient is “isolated” if it falls into a filter’s super-pass region and no other coefficient falls into the filter’s pass region. Since the super-pass region of our filters is a constant fraction of the pass region, the probability of isolating a coefficient is constant.
However, the main difference in this chapter, which allows us to achieve the stated running times, is a fast method for locating and estimating isolated coefficients. Further, our algorithm is iterative, so we also provide a fast method for updating the signal so that identified coefficients are not considered in future iterations. Below, we describe these methods in more detail.
New Techniques: Location and Estimation
Our location and estimation methods depend on whether we handle the exactly sparse case or the general case. In the exactly sparse case, we show how to estimate the position of an isolated Fourier coefficient using only two samples of the filtered signal. Specifically, we show that the phase difference between the two samples is linear in the index of the coefficient, and hence we can recover the index by estimating the phases. This approach is inspired by the frequency offset estimation in orthogonal frequency division multiplexing (OFDM), which is the modulation method used in modern wireless technologies (see Heiskala and Terry [2001, chapter 2]).
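To make the phase-difference idea concrete, here is a minimal Python sketch (not the book's code) for the idealized setting where a single frequency has already been isolated: two consecutive time samples determine both the index and the value of the coefficient.

```python
import numpy as np

# Idealized one-sparse signal: a single frequency f with coefficient v.
n = 1024
f = 277                                   # unknown index to recover
v = 3.0 - 2.0j                            # unknown coefficient
t = np.arange(n)
x = v * np.exp(2j * np.pi * f * t / n)

# Two samples suffice: x[1]/x[0] = exp(2*pi*1j*f/n), so the phase
# difference between consecutive samples is linear in f.
phase = np.angle(x[1] / x[0]) % (2 * np.pi)
f_hat = int(round(phase * n / (2 * np.pi))) % n
v_hat = x[0]                              # once isolated, x[0] is the coefficient

print(f, f_hat)                           # both print 277
print(v, v_hat)
```

In the actual algorithm the two "samples" are bin values produced by HASHTOBINS with two different time shifts, and the recovered index is the permuted location of the isolated coefficient; the filtering guarantees that, with constant probability, a bin behaves like the one-sparse signal above.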
In order to design an algorithm for the general case, we employ a different approach. Specifically, we can use two samples to estimate (with constant probability) individual bits of the index of an isolated coefficient. Similar approaches have been employed in prior work. However, in those papers, the index was recovered bit by bit, and one needed Ω(log log n) samples per bit to recover all bits correctly with constant probability. In contrast, we recover the index one block of bits at a time, where each block consists of O(log log n) bits. This approach is inspired by the fast sparse recovery algorithm of Gilbert et al. [2010]. Applying this idea in our context, however, requires new techniques. The reason is that, unlike in Gilbert et al. [2010], we do not have the freedom of using arbitrary “linear measurements” of the vector x̂, and we can only use the measurements induced by the Fourier transform. As a result, the extension from “bit recovery” to “block recovery” is the most technically involved part of the algorithm. Section 4.3.1 contains further intuition on this part.
New Techniques: Updating the Signal
The aforementioned techniques recover the position and the value of any isolated coefficient. However, during each filtering step, each coefficient becomes isolated only with constant probability. Therefore, the filtering process needs to be repeated to ensure that each coefficient is correctly identified. In Chapter 3, the algorithm simply performs the filtering O(log n) times and uses the median estimator to identify each coefficient with high probability. This, however, would lead to a running time of O(k log² n) in the k-sparse case, since each filtering step takes k log n time.
One could reduce the filtering time by subtracting the identified coefficients from the signal. In this way, the number of non-zero coefficients would be reduced by a constant factor after each iteration, so the cost of the first iteration would dominate the total running time. Unfortunately, subtracting the recovered coefficients from the signal is a computationally costly operation, corresponding to a so-called non-uniform DFT (see Gilbert et al. [2008] for details). Its cost would override any potential savings.
In this chapter, we introduce a different approach: instead of subtracting the identified coefficients from the signal, we subtract them directly from the bins obtained by filtering the signal. The latter operation can be done in time linear in the number of subtracted coefficients, since each of them “falls” into only one bin. Hence, the computational costs of each iteration can be decomposed into two terms, corresponding to filtering the original signal and subtracting the coefficients. For the exactly sparse case these terms are as follows.
• The cost of filtering the original signal is O(B log n), where B is the number of bins. B is set to O(k′), where k′ is the number of yet-unidentified coefficients. Thus, initially B is equal to O(k), but its value decreases by a constant factor after each iteration.
• The cost of subtracting the identified coefficients from the bins is O(k).
Since the number of iterations is O(log k), and the cost of filtering is dominated by the first iteration, the total running time is O(k log n) for the exactly sparse case.
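The following sketch (hypothetical helper name, not the book's pseudocode) illustrates why this subtraction step is cheap: under an idealized "boxcar" filter whose weight is exactly 1 inside a bin, each identified coefficient affects a single bin, so removing all of them costs time linear in their number.

```python
def subtract_from_bins(u, recovered, sigma, b, B, n):
    """Subtract identified coefficients from the bin values u.

    `recovered` maps a frequency index i to its identified value.  Each i
    lands in exactly one bin h_{sigma,b}(i) (see Section 4.2 for the bin
    map), so the loop runs in O(len(recovered)) time.  An ideal filter
    weight of 1 inside the bin is assumed for simplicity.
    """
    for i, v in recovered.items():
        pi = (sigma * (i - b)) % n        # permuted position pi_{sigma,b}(i)
        j = int(round(pi * B / n)) % B    # bin index h_{sigma,b}(i)
        u[j] -= v
    return u
```

By contrast, subtracting the same coefficients from the time-domain signal would require evaluating a non-uniform DFT, which is exactly the cost the chapter avoids.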
For the general case, we need to set k′ and B more carefully to obtain the desired running time. The cost of each iterative step is multiplied by the number of filtering steps needed to compute the location of the coefficients, which is Θ(log(n/B)). If we set B = Θ(k′), this would be Θ(log n) in most iterations, giving a Θ(k log² n) running time. This is too slow when k is close to n. We avoid this by decreasing B more slowly and k′ more quickly. In the r-th iteration, we set B = k/poly(r). This allows the total number of bins to remain O(k) while keeping log(n/B) small: at most O(log log k) more than log(n/k). Then, by having k′ decrease according to k′ = k/r^{Θ(r)} rather than k/2^{Θ(r)}, we decrease the number of rounds to O(log k/ log log k). Some careful analysis shows that this counteracts the log log k loss in the log(n/B) term, achieving the desired O(k log n log(n/k)) running time.
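A small numeric illustration of this schedule (schematic only; the Θ-constants are taken to be 1 and the choice poly(r) = r² is an assumption made for the example, writing B_r and k′_r for the values used in round r) shows the two properties used above: the total number of bins stays O(k) while the number of rounds is far smaller than log k.

```python
import math

k = 2**20
total_bins, rounds, r = 0, 0, 1
while k // (r ** r) >= 1:            # k'_r = k / r^r  (fast decrease)
    total_bins += k // (r * r)       # B_r  = k / r^2  (slow decrease, one choice of poly(r))
    rounds += 1
    r += 1

print("rounds:", rounds)                                   # much smaller than log2(k) = 20
print("total bins / k:", round(total_bins / k, 2))         # bounded: sum over r of 1/r^2 = O(1)
print("log k / log log k:", round(math.log2(k) / math.log2(math.log2(k)), 2))
```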
4.2 Algorithm for the Exactly Sparse Case
In this section, we assume x̂i ∈ {−L, …, L} for some precision parameter L. To simplify the bounds, we assume L ≤ n^c for some constant c > 0; otherwise the log n term in the running time bound is replaced by log L. We also assume that x̂ is exactly k-sparse. We will use the filter G with parameter δ = 1/(4n²L).
Definition 4.1 We say that (G, Ĝ′) is a flat window function with parameters B > 1, δ > 0, and α > 0 if |supp(G)| = O((B/α) log(n/δ)) and Ĝ′ satisfies
• Ĝ′i = 1 for |i| ≤ (1 − α)n/(2B) and Ĝ′i = 0 for |i| ≥ n/(2B),
• Ĝ′i ∈ [0, 1] for all i, and
• ‖Ĝ′ − Ĝ‖∞ < δ.
The above notion corresponds to the (1/(2B), (1 − α)/(2B), δ, O((B/α) log(n/δ)))-flat window function. In Appendix D, we give efficient constructions of such window functions, where G can be computed in O((B/α) log(n/δ)) time and, for each i, Ĝ′i can be computed in O(log(n/δ)) time. Of course, for i ∉ [(1 − α)n/(2B), n/(2B)], Ĝ′i ∈ {0,1} can be computed in O(1) time. The fact that Ĝ′i takes ω(1) time to compute for i ∈ [(1 − α)n/(2B), n/(2B)] will add some complexity to our algorithm and analysis. We will need to ensure that we rarely need to compute such values. A practical implementation might find it more convenient to precompute the window functions in a preprocessing stage, rather than compute them on the fly.
The algorithm NOISELESSSPARSEFFT (SFT 3.0) is described as Algorithm 4.1. The algorithm has three functions:
• HASHTOBINS. This permutes the spectrum of x̂ − ẑ with Pσ,a,b, then “hashes” to B bins. The guarantee will be described in Lemma 4.1.
• NOISELESSSPARSEFFTINNER. Given time-domain access to x and a sparse vector ẑ such that x̂ − ẑ is k′-sparse, this function finds “most” of x̂ − ẑ.
• NOISELESSSPARSEFFT. This iterates NOISELESSSPARSEFFTINNER until it finds x̂ exactly.
Algorithm 4.1 SFT 3.0: Exact Sparse Fourier Transform for k = o(n)
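The pseudocode figure itself is not reproduced here. As a conceptual stand-in, the following sketch (a hypothetical O(n)-time reference with an ideal boxcar filter and time shift a = 0, not the fast implementation whose running time is analyzed in Lemma 4.1) shows what the bins computed by HASHTOBINS represent: every frequency of x̂ − ẑ is sent, after the permutation, to one of B buckets.

```python
import numpy as np

def hash_to_bins_reference(x_hat, z_hat, sigma, b, B):
    """Illustrative O(n) reference for the binning performed by HASHTOBINS.

    Buckets the permuted spectrum of x_hat - z_hat into B bins, assuming an
    ideal filter of weight 1 inside each bin.  The real procedure obtains
    (approximately) these values from a small number of time samples.
    """
    n = len(x_hat)
    u = np.zeros(B, dtype=complex)
    for i in range(n):
        pi = (sigma * (i - b)) % n        # permuted position pi_{sigma,b}(i)
        j = int(round(pi * B / n)) % B    # bin index h_{sigma,b}(i)
        u[j] += x_hat[i] - z_hat[i]
    return u
```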
We analyze the algorithm “bottom-up,” starting from the lower-level procedures.
Analysis of NOISELESSSPARSEFFTINNER and HASHTOBINS
For any execution of NOISELESSSPARSEFFTINNER, define the support S = supp(x̂ − ẑ). Recall that πσ,b(i) = σ(i − b) mod n. Define hσ,b(i) = round(πσ,b(i)B/n) and oσ,b(i) = πσ,b(i) − hσ,b(i)n/B. Note that therefore |oσ,b(i)| ≤ n/(2B). We will refer to hσ,b(i) as the “bin” that the frequency i is mapped into, and oσ,b(i) as the “offset.” For any i ∈ S define two types of events associated with i and S and defined over the probability space induced by σ and b:
• “Collision” event Ecoll(i): holds iff hσ,b(i) ∈ hσ,b(S\ {i}), and
• “Large offset” event Eoff (i): holds iff |oσ,b(i)| ≥ (1 − α)n/(2B).
Claim 4.1 For any i ∈ S, the event Ecoll(i) holds with probability at most 4|S|/B.
Proof Consider distinct i, j ∈ S. By Lemma 2.1,
Pr[hσ,b(i) = hσ,b(j)] ≤ Pr[πσ,b(i) − πσ,b(j) mod n ∈ [−n/B, n/B]] ≤ 4/B.
By a union bound over j ∈ S, Pr[Ecoll(i)] ≤ 4 |S| /B.
Claim 4.2 For any i ∈ S, the event Eoff (i) holds with probability at most α.
Proof Note that oσ,b(i) ≡ πσ,b(i) ≡ σ(i − b) (mod n/B). For any odd σ and any l ∈ [n/B], we have that Prb[σ(i − b) ≡ l (mod n/B)]= B/n. Since only αn/B offsets oσ,b(i) cause Eoff (i), the claim follows.
Lemma 4.1 Suppose B divides n. The output û of HASHTOBINS satisfies
ûj = Σ_{hσ,b(i)=j} (x̂ − ẑ)i Ĝ′_{−oσ,b(i)} ω^{aσi} ± δ‖x̂‖₁.
Let ζ = |{i ∈ supp(ẑ) : Eoff(i)}|. The running time of HASHTOBINS is O((B/α) log(n/δ) + ‖ẑ‖₀ + ζ log(n/δ)).
Proof The proof can be found in Appendix A.3.
Lemma 4.2 Consider any i ∈ S such that neither Ecoll(i) nor Eoff(i) holds. Let j = hσ,b(i). Then
and j ∈ J.
Proof The proof can be found in Appendix A.4.
For each invocation of NOISELESSSPARSEFFTINNER, let P be the set of all pairs (i, v) for which the command ŵi ← v was executed. Claims 4.1 and 4.2 and Lemma 4.2 together guarantee that for each i ∈ S the probability that P does not contain the pair (i, (x̂ − ẑ)i) is at most 4|S|/B + α. We complement this observation with the following claim.
Claim 4.3 For any j ∈ J we have j ∈ hσ,b(S). Therefore, |J| = |P| ≤ |S|.
Proof Consider any j ∉ hσ,b(S). From Equation (A.1) in the proof of Lemma 4.2 it follows that |ûj| ≤ δnL < 1/2.
Lemma 4.3 Consider an execution of NOISELESSSPARSEFFTINNER, and let S = supp(x̂ − ẑ). If |S| ≤ k′, then E[‖x̂ − ẑ − ŵ‖₀] ≤ 8(β + α)|S|.
Proof Let e denote the number of coordinates i ∈ S for which either Ecoll(i) or Eoff(i) holds. Each such coordinate might not appear in P with the correct value, leading to an incorrect value of ŵi. In fact, it might result in an arbitrary pair (i′, v′) being added to P, which in turn could lead to an incorrect value of ŵi′. By Claim 4.3 these are the only ways that ŵ can be assigned an incorrect value. Thus, we have
‖x̂ − ẑ − ŵ‖₀ ≤ 2e.
Since E[e] ≤ (4|S|/B + α)|S| ≤ (4β + α)|S|, the lemma follows.
Analysis of NOISELESSSPARSEFFT
Consider the tth iteration of the procedure, and define St = supp(x̂ − ẑ) where ẑ denotes the value of the variable at the beginning of iteration t. Note that |S0| = |supp(x̂)| ≤ k.
We also define an indicator variable It which is equal to 0 iff |St|/|St−1| ≤ 1/8. If It = 1 we say the tth iteration was not successful. Let γ = 8 · 8(β + α). From Lemma 4.3 it follows that Pr[It = 1 | |St−1| ≤ k/2^{t−1}] ≤ γ. From Claim 4.3 it follows that even if the tth iteration is not successful, then |St|/|St−1| ≤ 2.
For any t ≥ 1, define an event E(t) that occurs iff I1 + … + It ≥ t/2. Observe that if none of the events E(1) … E(t) holds then |St| ≤ k/2^t.
Lemma 4.4 Let E = E(1) ∪ … ∪ E(λ) for λ = 1 + log k. Assume that (4γ)^{1/2} < 1/4. Then Pr[E] ≤ 1/3.
Proof Let t′ = ⌈t/2⌉. We have
Pr[E(t)] ≤ (t choose t′) γ^{t′} ≤ 2^t γ^{t′} ≤ (4γ)^{t′}.
Therefore,
Pr[E] ≤ Σ_{t=1}^{λ} Pr[E(t)] ≤ Σ_{t=1}^{λ} (4γ)^{⌈t/2⌉} ≤ 2 · (4γ)/(1 − 4γ) ≤ 1/3,
where the last step uses the assumption (4γ)^{1/2} < 1/4, i.e., 4γ < 1/16. ■
Theorem 4.1 Suppose x̂ is k-sparse with entries from {−L, …, L} for some known L = n^{O(1)}. Then the algorithm NOISELESSSPARSEFFT runs in expected O(k log n) time and returns the correct vector x̂ with probability at least 2/3.
Proof The correctness follows from Lemma 4.4. The running time is dominated by O(log k) executions of HASHTOBINS.
Assuming a correct run, in every round t we have
‖ẑ‖₀ ≤ k + |St| ≤ k + k/2^t ≤ 2k.
Therefore,
E[|{i ∈ supp(ẑ) : Eoff(i)}|] ≤ α‖ẑ‖₀ ≤ 2αk,
so the expected running time of each execution of HASHTOBINS is O((B/α) log(n/δ) + k + αk log(n/δ)). Setting α = Θ(2^{−t/2}) and β = Θ(1), the expected running time in round t is O(2^{−t/2}k log(n/δ) + k + 2^{−t/2}k log(n/δ)). Therefore the total expected running time is O(k log n). ■
4.3 Algorithm for the General Case
For the general case, we only achieve Equation (4.1) for ‖x̂‖₂ ≤ n^{O(1)} · min_{k-sparse y} ‖x̂ − y‖₂. In general, for any parameter δ > 0 we can add δ‖x̂‖₂ to the right hand side of Equation (4.1) and run in time O(k log(n/k) log(n/δ)).
Pseudocode for the general algorithm SFT 4.0 is shown in Algorithms 4.2 and 4.3.
4.3.1 Intuition
Let S denote the “heavy” O(k/∊) coordinates of x̂. The overarching algorithm SPARSEFFT (SFT 4.0) works by first “locating” a set L containing most of S, then “estimating” x̂L to get ẑ. It then repeats on x̂ − ẑ. We will show that each heavy coordinate has a large constant probability of both being in L and being estimated well. As a result, x̂ − ẑ is probably nearly k/4-sparse, so we can run the next iteration with k → k/4. The later iterations then run faster and achieve a higher success probability, so the total running time is dominated by the time in the first iteration and the total error probability is bounded by a constant.
Algorithm 4.2 SFT 4.0: General Sparse Fourier Transform for k = o(n), part 1 of 2
In the rest of this intuition, we will discuss the first iteration of SPARSEFFT with simplified constants. In this iteration, hashes are to B = O(k/∊) bins and, with 3/4 probability, we get ẑ so that x̂ − ẑ is nearly k/4-sparse. The actual algorithm will involve a parameter α in each iteration, roughly guaranteeing that with 1 − √α probability, we get ẑ so that x̂ − ẑ is nearly √α · k-sparse; the formal guarantee will be given by Lemma 4.11. For this intuition we only consider the first iteration where α is a constant.
Location
As in the noiseless case, to locate the “heavy” coordinates we consider the “bins” computed by HASHTOBINS with Pσ,a,b. This roughly corresponds to first permuting the coordinates according to the (almost) pairwise independent permutation Pσ,a,b, partitioning the coordinates into B = O(k/∊) “bins” of n/B consecutive indices, and observing the sum of values in each bin. We get that each heavy coordinate i has a large constant probability that the following two events occur: no other heavy coordinate lies in the same bin, and only a small (i.e., O(∊/k)) fraction of the mass from non-heavy coordinates lies in the same bin. For such a “well-hashed” coordinate i, we would like to find its location τ = πσ,b(i) = σ(i − b) among the ∊n/k < n/k consecutive values that hash to the same bin.
Algorithm 4.3 SFT 4.0: General Sparse Fourier Transform for k = o(n), part 2 of 2
Let θ = 2πτ/n, so θ ∈ [0, 2π). In the noiseless case, we showed that the difference in phase in the bin using Pσ,0,b and using Pσ,1,b is θ plus a negligible O(δ) term. With noise this may not be true; however, we can say for any β ∈ [n] that the difference in phase between using Pσ,a,b and Pσ,a+β,b, as a distribution over uniformly random a ∈ [n], is βθ + ν with (for example) E[ν²] = 1/100 (all operations on phases modulo 2π). We can only hope to get a constant number of bits from such a “measurement.” So our task is to find τ within a region Q of size n/k using O(log(n/k)) “measurements” of this form.
One method for doing so would be to simply do measurements with random β ∈ [n]. Then each measurement lies within π/4 of βθ with at least 3/4 probability. On the other hand, for j ≠ τ and as a distribution over β, 2πβ(j − τ)/n is roughly uniformly distributed around the circle. As a result, each measurement is probably more than π/4 away from 2πβj/n. Hence, O(log(n/k)) repetitions suffice to distinguish among the n/k possibilities for τ. However, while the number of measurements is small, it is not clear how to decode in polylog rather than Ω(n/k) time.
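The simulation below illustrates this brute-force variant under an assumed noise model (Gaussian phase noise standing in for the hashing noise; the helper names and constants are hypothetical): a few dozen random-β phase measurements single out τ, but the decoder must scan all n/k candidates.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 2**16, 2**6
region = n // k                            # number of candidate locations for tau
tau = int(rng.integers(region))            # unknown location to find

def measure(beta, noise_std=0.3):
    # Model assumption: observed phase difference is 2*pi*beta*tau/n plus noise.
    return (2 * np.pi * beta * tau / n + rng.normal(0, noise_std)) % (2 * np.pi)

betas = rng.integers(n, size=4 * int(np.log2(region)))   # O(log(n/k)) measurements
obs = np.array([measure(b) for b in betas])

def agreements(j):
    # Count measurements within pi/4 of the phase predicted by candidate j.
    pred = (2 * np.pi * betas * j / n) % (2 * np.pi)
    d = np.abs(obs - pred)
    return int(np.sum(np.minimum(d, 2 * np.pi - d) < np.pi / 4))

tau_hat = max(range(region), key=agreements)   # Omega(n/k)-time decoding step
print(tau, tau_hat)                            # tau_hat recovers tau
```

The t-ary search described next replaces the exhaustive scan with a vote over t candidate subregions per level, which is what brings the decoding time down to polylogarithmic.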
To solve this, we instead do a t-ary search on the location for t = Θ(log n). At each of O(log_t(n/k)) levels, we split our current candidate region Q into t consecutive subregions Q1, …, Qt, each of size w. Now, rather than choosing β ∈ [n], we choose β ∈ [n/(16w), n/(8w)] uniformly at random. By the upper bound on β, for each q ∈ [t] the values {2πβj/n | j ∈ Qq} all lie within 2πβw/n ≤ π/4 of each other on the circle. On the other hand, if |j − τ| > 16w, then β(j − τ) mod n will still be roughly uniformly distributed about the circle. As a result, we can check a single candidate element eq from each subregion: if eq is in the same subregion as τ, each measurement usually agrees in phase; but if eq is more than 16 subregions away, each measurement usually disagrees in phase. Hence with O(log t) measurements, we can locate τ to within O(1) consecutive subregions with failure probability 1/t^{Θ(1)}. The decoding time is O(t log t).
This primitive LOCATEINNER lets us narrow down the candidate region for τ to a subregion that is a t′ = Ω(t) factor smaller. By repeating LOCATEINNER O(log_t(n/k)) times, LOCATESIGNAL can find τ precisely. The number of measurements is then O(log t log_t(n/k)) = O(log(n/k)) and the decoding time is O(t log t log_t(n/k)) = O(log(n/k) log n). Furthermore, the “measurements” (which are actually calls to HASHTOBINS) are non-adaptive, so we can perform them in parallel for all O(k/∊) bins, with O(log(n/δ)) average time per measurement. This gives O(k log(n/k) · log(n/δ)) total time for LOCATESIGNAL.
This lets us locate every heavy and “well-hashed” coordinate with 1/tΘ(1) = o(1) failure probability, so every heavy coordinate is located with arbitrarily high constant probability.
Estimation
By contrast, estimation is fairly simple. As in Algorithm 4.1, we can estimate (x̂ − ẑ)i as û_{hσ,b(i)}, where û is the output of HASHTOBINS. Unlike in Algorithm 4.1, we now have noise that may cause a single such estimate to be poor even if i is “well-hashed.” However, we can show that for a random permutation Pσ,a,b the estimate is “good” with constant probability. ESTIMATEVALUES takes the median of Rest = O(log(1/∊)) such samples, getting a good estimate with 1 − ∊/64 probability. Given a candidate set L of size k/∊, with 7/8 probability at most k/8 of the coordinates are badly estimated. On the other hand, with 7/8 probability, at least 7k/8 of the heavy coordinates are both located and well estimated. This suffices to show that, with 3/4 probability, the largest k elements J of our estimate ŵ contain good estimates of 3k/4 large coordinates, so that after subtracting these estimates the residual is close to k/4-sparse.
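A small sketch of one natural way to realize this combining rule (an assumption made for illustration: the median is taken on real and imaginary parts separately, and the sampling itself would come from repeated calls to HASHTOBINS with fresh random permutations): as long as each individual sample is accurate with constant probability, the median of O(log(1/∊)) independent samples is accurate except with the small failure probability quoted above.

```python
import numpy as np

def median_estimate(samples):
    """Combine repeated noisy complex estimates of a single coefficient."""
    s = np.asarray(samples, dtype=complex)
    return np.median(s.real) + 1j * np.median(s.imag)

# Toy usage: the true coefficient is 2 - 1j; most samples are mildly noisy,
# a minority are badly corrupted (e.g., by collisions); the median ignores them.
rng = np.random.default_rng(0)
good = (2 - 1j) + 0.05 * (rng.standard_normal(9) + 1j * rng.standard_normal(9))
bad = 10 * (rng.standard_normal(3) + 1j * rng.standard_normal(3))
print(median_estimate(np.concatenate([good, bad])))   # close to 2 - 1j
```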
4.3.2 Analysis
Formal Definitions
As in the noiseless case, we define πσ,b(i) = σ(i − b) mod n, hσ,b(i) = round(πσ,b(i)B/n), and oσ,b(i) = πσ,b(i) − hσ,b(i)n/B. We say hσ,b(i) is the “bin” that frequency i is mapped into, and oσ,b(i) is the “offset.” We define h⁻¹σ,b(j) = {i ∈ [n] | hσ,b(i) = j}.
Define
Err(x, k) = min_{k-sparse y} ‖x − y‖₂.
In each iteration of SPARSEFFT, define x̂′ = x̂ − ẑ, and let
ρ² = Err²(x̂′, k) + δ²n‖x̂‖₁²,  μ² = ∊ρ²/k,  S = {i ∈ [n] : |x̂′i|² ≥ μ²}.
Then |S| ≤ (1 + 1/∊)k = O(k/∊) and ‖x̂′_{[n]\S}‖₂² ≤ (1 + ∊)ρ². We will show that each i ∈ S is found by LOCATESIGNAL with probability 1 − O(α), when B = Ω(k/(α∊)).
For any i ∈ S define three types of events associated with i and S and defined over the probability space induced by σ and b:
• “Collision” event Ecoll(i): holds iff hσ,b(i) ∈ hσ,b(S \ {i});
• “Large offset” event Eoff(i): holds iff |oσ,b(i)| ≥ (1 − α)n/(2B); and
• “Large noise” event Enoise(i): holds iff ‖x̂′_{h⁻¹σ,b(hσ,b(i)) \ S}‖₂² ≥ ρ²/(αB).
By Claims 4.1 and 4.2, Pr[Ecoll(i)] ≤ 4|S|/B = O(α) and Pr[Eoff(i)] ≤ α for any i ∈ S.
Claim 4.4 For any i ∈ S, Pr[Enoise(i)] ≤ 8α.
Proof For each j ≠ i, Pr[hσ,b(j) = hσ,b(i)] ≤ 4/B by Lemma 2.1.
Then
E[‖x̂′_{h⁻¹σ,b(hσ,b(i)) \ S}‖₂²] ≤ (4/B)‖x̂′_{[n]\S}‖₂² ≤ 4(1 + ∊)ρ²/B.
The result follows by Markov’s inequality.
We will show for i ∈ S that if none of Ecoll(i), Eoff(i), and Enoise(i) hold then SPARSEFFTINNER recovers x̂′i with 1 − O(α) probability.
Lemma 4.5 Let a ∈ [n] uniformly at random, B divide n, and the other parameters be arbitrary in
û = HASHTOBINS(x, ẑ, Pσ,a,b, B, δ, α).
Then for any i ∈ [n] with j = hσ,b(i) and none of Ecoll(i), Eoff(i), or Enoise(i) holding,
E_a[|ûj − x̂′i ω^{aσi}|²] ≤ 2ρ²/(αB).
Proof The proof can be found in Appendix A.5.
Properties of LOCATESIGNAL
In our intuition, we made a claim that if β ∈ [n/(16w), n/(8w)] uniformly at random, and i > 16w, then βi mod n is “roughly uniformly distributed about the circle” and hence not concentrated in any small region. This is clear if β is chosen as a random real number; it is less clear in our setting where β is a random integer in this range. We now prove a lemma that formalizes this claim.
Lemma 4.6 Let T ⊂ [m] consist of t consecutive integers, and suppose β ∈ T uniformly at random. Then for any i ∈ [n] and set S ⊂ [n] of l consecutive integers,
Pr[βi mod n ∈ S] ≤ ⌈im/n⌉(1 + ⌊l/i⌋)/t.
Proof Note that any interval of length l can cover at most 1 + ⌊l/i⌋ elements of any arithmetic sequence of common difference i. Then {βi | β ∈ [m]} is such a sequence, and there are at most ⌈im/n⌉ intervals an + S overlapping this sequence. Hence, at most ⌈im/n⌉(1 + ⌊l/i⌋) of the β ∈ [m] have βi mod n ∈ S. Hence, Pr[βi mod n ∈ S] ≤ ⌈im/n⌉(1 + ⌊l/i⌋)/t. ■
Lemma 4.7 Let i ∈ S. Suppose none of Ecoll(i), Eoff(i), and Enoise(i) hold, and let j = hσ,b(i). Consider any run of LOCATEINNER with . Let f > 0 be a parameter such that
for C larger than some fixed constant. Then with probability at least ,