1.9 Stochastic Sequential Analysis (SSA) Protocol (Deep Learning Without NNs)
The SSA protocol is shown in Figure 1.5 (from prior publications and patent work, see [1–3]) and is a general signal-processing flow topology and database schema (Left Panel), with specialized variants for CCC (Center) and for kinetic feature extraction based on blockade-level duration observations (Right). The SSA Protocol allows for the discovery, characterization, and classification of localizable, approximately stationary, statistical signal structures in channel current data, genomic data, or sequential data in general. The core signal-processing stage in Figure 1.5 is usually the feature extraction stage, at the center of which is a generalized HMM. The SSA Protocol also has a built-in recovery protocol for weak signal handling, outlined next, in which the HMM methods are complemented by the strengths of other ML methods.
Figure 1.5 (Left) The general stochastic sequential analysis flow topology. (Center) The general signal-processing flow in performing channel current analysis is typically Input ➔ tFSA ➔ Meta-HMMBD ➔ SVM ➔ Output. (Right) Notable differences occur in channel current cheminformatics during state discovery when EVA-projection (emission variance amplification projection), or a similar method, is used to achieve a quantization on states; the flow then becomes Input ➔ tFSA ➔ HMMBD/EVA (state discovery) ➔ meta-HMMBD-side ➔ SVM ➔ Output. In gene-finding, by contrast, the flow is simply Input ➔ meta-HMMBD-side ➔ Output. In gene-finding, however, the HMM internal “sensors” are sometimes replaced, locally, with profile-HMMs [1, 3] (equivalent to position-dependent Markov models, or pMMs, see Chapter 7), or with SVM-based profiling [1, 3], so the topology can differ not only in the connections between the boxes shown, but in their ability to embed in other boxes as part of an internal refinement.
Source: Based on Winters‐Hilt [1, 3].
The sequence of algorithmic methods used in the SSA Protocol, for the information-processing flow topology shown in Figure 1.5, comprises a weak signal handling protocol as follows: (i) the weakness of the (fast) Finite State Automaton (FSA) methods will be shown to be their difficulty with nonlocal structure identification, for which HMM methods (and tuning metaheuristics) are the solution; (ii) for the HMM, in turn, the main weakness is in local sensing “classification” due to conditional independence assumptions. Once in the setting of a classification problem, however, the problem can be solved via incorporation of generalized SVM methods [1, 3]. If facing only a classification task (data already preprocessed), the SVM is also the method of choice in what follows. (iii) The weakness of the SVM, whether used for classification or clustering, but especially for the latter, is the need to optimize over algorithmic, model (kernel), chunking, and other process parameters during learning. This is solved in (iv) via metaheuristics such as simulated annealing and genetic-algorithm optimization. (iv) The main weaknesses in the metaheuristic tuning effort are partly resolved via use of the “front-end” methods, like the FSA, and partly resolved by a knowledge discovery process using the SVM clustering methods. The SSA Protocol’s weak signal acquisition and analysis method thereby establishes a robust signal-processing platform.
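As a minimal illustration of this staged flow, the following Python sketch (hypothetical names and toy data, not the author’s implementation; a simple grid search stands in for the simulated annealing/genetic-algorithm tuning stage) chains an FSA-style front end, a stand-in for HMM feature extraction, and SVM classification with parameter tuning, assuming NumPy and scikit-learn:

```python
# Sketch of the SSA stage sequence: FSA-style front end -> HMM-style
# feature extraction -> SVM classification, with a tuning loop standing
# in for the metaheuristic stage. Names and data are illustrative only.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

def fsa_front_end(signal, low, high):
    """Fast finite-state pass: keep samples inside a blockade band."""
    return signal[(signal >= low) & (signal <= high)]

def level_features(segment, n_levels=4):
    """Stand-in for HMM/EVA feature extraction: level-occupancy histogram."""
    edges = np.linspace(segment.min(), segment.max(), n_levels + 1)
    occupancy, _ = np.histogram(segment, bins=edges)
    return occupancy / occupancy.sum()

# Toy data: two classes of blockade signals with different level statistics.
rng = np.random.default_rng(0)
X = np.array([level_features(fsa_front_end(
        rng.normal(loc=0.3 + 0.2 * (i % 2), scale=0.1, size=2000), 0.0, 1.0))
        for i in range(60)])
y = np.array([i % 2 for i in range(60)])

# Stages (iii)/(iv): parameter tuning (grid search here; SA/GA in the
# full protocol) wrapped around the SVM classification stage.
tuner = GridSearchCV(SVC(kernel="rbf"), {"C": [1, 10], "gamma": [0.5, 2.0]})
tuner.fit(X, y)
print("best params:", tuner.best_params_, "cv accuracy:", tuner.best_score_)
```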
The HMM methods are the central methodology, or stage, in the SSA Protocol, particularly in the gene finders, and sometimes in the CCC implementations, in that the other stages can be dropped or merged with the HMM stage in many incarnations. For example, in some CCC analysis situations the tFSA methods could be eliminated entirely in favor of the more accurate (but time-consuming) HMM-based approaches to the problem, with signal states defined or explored in much the same setting, but with the optimized Viterbi path solution taken as the basis for the signal acquisition.
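Since the optimized Viterbi path can itself serve as the basis for signal acquisition, the core recursion is worth showing. The following is a generic log-space Viterbi decoder for a discrete-emission HMM (a textbook sketch assuming NumPy; the book’s HMMBD variants additionally model state durations):

```python
# Generic log-space Viterbi decoder for a discrete-emission HMM.
import numpy as np

def viterbi(obs, log_pi, log_A, log_B):
    """obs: sequence of observation indices; log_pi: (N,) initial log-probs;
    log_A: (N, N) transition log-probs; log_B: (N, M) emission log-probs.
    Returns the most probable hidden-state path."""
    N, T = log_pi.shape[0], len(obs)
    delta = np.empty((T, N))            # best log-prob ending in state j at t
    back = np.empty((T, N), dtype=int)  # argmax predecessor pointers
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # (from-state, to-state)
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    path = np.empty(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):      # backtrace from the best endpoint
        path[t] = back[t + 1, path[t + 1]]
    return path
```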
The HMM features and other features (from NN, wavelet, or spike profiling, etc.) can be fused and selected via various data fusion methods, such as a modified Adaboost selection (from [1, 3], and Chapter 11). The HMM-based feature extraction provides a well-focused set of “eyes” on the data, whatever its nature, according to the underpinnings of its Bayesian statistical representation. The key is that the HMM not be too limiting in its state definition, while there is the typical engineering trade-off in the choice of the number of states, N, which impacts the order of computation via a quadratic factor in N in the various dynamic programming calculations (comprising the Viterbi and Baum–Welch algorithms, among others).
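A boosting-based selection over fused feature sets can be sketched as follows (standard AdaBoost feature importances are used here purely for illustration; the modified Adaboost of [1, 3] and Chapter 11 differs in detail), assuming NumPy and scikit-learn:

```python
# Sketch: fuse feature sets by concatenation, rank features by AdaBoost
# importance, and keep the top-k for downstream SVM classification.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def select_fused_features(hmm_feats, other_feats, labels, keep=10):
    """hmm_feats, other_feats: (n_samples, d1), (n_samples, d2) arrays.
    Returns the selected feature columns and their indices."""
    fused = np.hstack([hmm_feats, other_feats])
    booster = AdaBoostClassifier(n_estimators=100).fit(fused, labels)
    top = np.argsort(booster.feature_importances_)[::-1][:keep]
    return fused[:, top], top
```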
The HMM “sensor” capabilities can be significantly improved by switching from profile-Markov Model (pMM) sensors to pMM/SVM-based sensors, as indicated in [1, 3] and Chapter 7, where the improved performance and generalization capability of this approach are demonstrated.
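A minimal pMM sensor can be sketched as below (an illustrative construction over integer-encoded aligned windows, assuming NumPy; the pMM/SVM sensors of Chapter 7 feed such log-likelihood scores into an SVM rather than thresholding them directly):

```python
# Minimal position-dependent Markov model (pMM) sensor sketch.
import numpy as np

def train_pmm(windows, alphabet_size, pseudocount=1.0):
    """windows: (num_seqs, L) integer-encoded aligned training windows.
    Returns per-position conditional log-probs, shape (L-1, A, A)."""
    L = windows.shape[1]
    counts = np.full((L - 1, alphabet_size, alphabet_size), pseudocount)
    for w in windows:
        for t in range(1, L):
            counts[t - 1, w[t - 1], w[t]] += 1   # position-specific counts
    return np.log(counts / counts.sum(axis=2, keepdims=True))

def pmm_score(window, log_cond):
    """Log-likelihood of one window under the position-dependent model."""
    return sum(log_cond[t - 1, window[t - 1], window[t]]
               for t in range(1, len(window)))
```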
In standard band-limited (and not time-limited) signal analysis with periodic waveforms, sampling is done at the Nyquist rate to have a fully reproducible signal. If the sample information is needed elsewhere, it is then compressed (possibly lossily) and transmitted (a “smart encoder”). The received data is then decompressed and reconstructed (by simply summing wave components, e.g. a “simple” decoder). If the signal is sparse or compressible, then compressive sensing [190] can be used, where sampling and compression are combined into one efficient step to obtain compressive measurements (the simple encoding in [190], since a set of random projections is employed), which are then transmitted (general details on noise in this context are described in [191, 192]). On the receiving end, the decompression and reconstruction steps are, likewise, combined using an asymmetric “smart” decoding step.

This progression toward asymmetric compressive signal processing can be taken a step further if we consider signal sequences to be equivalent when they have the same stationary statistics. What is obtained is a method similar to compressive sensing, but involving stationary-statistics generative-projection sensing, where the signal processing is non-lossy at the level of stationary-statistics equivalence. In the SCW signal analysis the signal source is generative in that it is describable via an HMM, and the HMM’s Viterbi-derived generative projections are used to describe the sparse components contributing to the signal source. In SCW encoding the modulation of stationary statistics can be man-made or natural, with the latter in many experimental situations involving a flow phenomenology that has stationary statistics. If the signal is man-made, usually the underlying stochastic process is still a natural source, where it is the changes in the stationary statistics that are under the control of the man-made encoding scheme. Transmission and reception are then followed by generative projection via Viterbi-HMM template matching, or via Viterbi-HMM feature extraction followed by separate classification (using an SVM). So in the SCW approach the encoding is even simpler (possibly nonexistent, other than directly passing the quantized signal) and is applicable to any noise source with stationary statistics (e.g. a stationary signal with reproducible statistics, the case for many experimental observations). The decoding must be even “smarter,” on the other hand, in that generalized Viterbi algorithms are used, and possibly other ML methods as well, SVMs in particular. An example of stationary-statistics sensing with an ML-based decoder is described in application to CCC studies in Chapter 14.
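As an illustrative sketch of the “smart” decoding step, the following ranks candidate HMM templates, each capturing one hypothesized set of stationary statistics, by the Viterbi log-likelihood of the received quantized signal (a hypothetical setup using the hmmlearn package, not the implementation of Chapter 14):

```python
# Sketch: SCW-style decoding as Viterbi-HMM template matching, where
# each template is a fitted hmmlearn GaussianHMM modeling one set of
# stationary statistics. Returns the best-matching template.
import numpy as np
from hmmlearn import hmm

def decode_scw(signal, templates):
    """signal: (T, 1) array of received samples;
    templates: dict mapping name -> fitted hmm.GaussianHMM."""
    best_name, best_logp = None, -np.inf
    for name, model in templates.items():
        logp, _ = model.decode(signal, algorithm="viterbi")
        if logp > best_logp:
            best_name, best_logp = name, logp
    return best_name, best_logp
```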