Read the book: Informatics and Machine Learning - Stephen Winters-Hilt - Page 18
1.5.2 HMMs for Cheminformatics and Generic Signal Analysis
The prospect of having HMM feature extraction in the streaming signal processing pipeline (at O(L) cost for data of size L) offers powerful real‐time feature extraction capabilities and specialized filtering (all of which is implemented in the Nanoscope, Chapter 14). One such processing method, described in Chapter 6, is the HMM/Expectation Maximization (EM) EVA (Emission Variance Amplification) Projection, which provides simplified, automated tFSA Kinetic Feature Extraction from channel current signal. What is needed is the equivalent of low‐pass filtering on blockade levels while retaining sharpness in the timing of the level changes. This is not possible with a standard low‐pass filter, because the edges are blurred by the local filtering process; the HMM‐based filter does not blur them, as the data in Figure 1.4 show.
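The contrast between local low‐pass filtering and HMM‐based filtering can be illustrated with a minimal sketch. It is not the HMM/EM EVA Projection of Chapter 6: it substitutes a plain Viterbi decode of a two‐state Gaussian‐emission HMM, and the signal levels (1.0 and 0.4), noise level, window size, and the helper names `moving_average`, `viterbi_levels`, and `transition_width` are all assumptions chosen for illustration.

```python
import math
import random

def moving_average(x, w):
    """Centered moving average (a simple local low-pass filter), window w."""
    half = w // 2
    out = []
    for i in range(len(x)):
        lo, hi = max(0, i - half), min(len(x), i + half + 1)
        out.append(sum(x[lo:hi]) / (hi - lo))
    return out

def viterbi_levels(x, levels, sigma, stay=0.99):
    """Viterbi decode of a Gaussian-emission HMM (needs >= 2 levels).

    Returns, for each sample, the mean of the decoded state's level.
    """
    n = len(levels)
    log_stay = math.log(stay)
    log_move = math.log((1.0 - stay) / (n - 1))

    def log_emit(v, mu):
        # Shared sigma, so the normalization constant can be dropped.
        return -0.5 * ((v - mu) / sigma) ** 2

    def log_trans(i, j):
        return log_stay if i == j else log_move

    score = [log_emit(x[0], mu) for mu in levels]  # uniform prior
    back = []
    for v in x[1:]:
        ptr, new = [], []
        for j in range(n):
            best = max(range(n), key=lambda i: score[i] + log_trans(i, j))
            ptr.append(best)
            new.append(score[best] + log_trans(best, j) + log_emit(v, levels[j]))
        back.append(ptr)
        score = new
    state = max(range(n), key=lambda i: score[i])
    path = [state]
    for ptr in reversed(back):  # trace back from the last sample
        state = ptr[state]
        path.append(state)
    path.reverse()
    return [levels[s] for s in path]

def transition_width(y, lo=0.5, hi=0.9):
    """Count samples lying strictly between the two levels (a blurred edge)."""
    return sum(1 for v in y if lo < v < hi)

# Synthetic two-level "blockade" signal with one edge at sample 200 (assumed data).
random.seed(0)
true = [1.0] * 200 + [0.4] * 200
noisy = [t + random.gauss(0.0, 0.1) for t in true]

smoothed = moving_average(noisy, 41)
decoded = viterbi_levels(noisy, levels=[1.0, 0.4], sigma=0.1)

print("blurred samples, moving average:", transition_width(smoothed))
print("blurred samples, Viterbi decode:", transition_width(decoded))
```

The moving average spreads the level change over roughly the width of its window, whereas the decoded path switches levels at a single sample, which is the qualitative behavior the paragraph above attributes to the HMM‐based filter.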
The HMM is a common intrinsic statistical sequence modeling method (implementations and applications are mainly drawn from [135–158] in what follows), so the question naturally arises: how can extrinsic “side‐information” be optimally incorporated into a HMM? One answer is to treat duration distribution information itself as side‐information, and a process for incorporating side‐information into a HMM is shown on that basis. It is thereby demonstrated how to bootstrap from a HMM to a HMMD (more generally, a hidden semi‐Markov model or HSMM, as described in Chapter 7).
In many applications, the ability to incorporate state duration into the HMM is very important, because the conventional HMM‐based Viterbi and Baum‐Welch algorithms are otherwise critically constrained to modeling state‐interval distributions that are geometric (this is shown in Chapter 7). This can lead to significant decoding failure in noisy environments when the state‐interval distributions are not geometric (or approximately geometric). The starkest contrast occurs for multimodal distributions and heavy‐tailed distributions, the latter occurring for exon and intron length distributions (and thus critical in gene finders). The hidden Markov model with binned duration (HMMBD) algorithm eliminates the HMM geometric distribution modeling constraint, as well as the HMMD maximum duration constraint, and reduces the computational time of all HMMBD‐based methods to approximately that of the HMM process alone.
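The geometric constraint can be made concrete with a short sketch. In a standard HMM, a state with self‐transition probability a is occupied for exactly d steps with probability a^(d-1)(1-a), which always peaks at d = 1 and decays monotonically. The function names and the bimodal sample durations below are hypothetical, chosen only to show that the best geometric fit cannot represent two modes.

```python
def hmm_duration_pmf(self_prob, max_d):
    """P(duration = d), d = 1..max_d, implied by a self-transition probability."""
    return [(self_prob ** (d - 1)) * (1.0 - self_prob) for d in range(1, max_d + 1)]

def best_geometric_fit(durations):
    """Maximum-likelihood self-transition probability: mean duration = 1/(1-a)."""
    mean = sum(durations) / len(durations)
    return 1.0 - 1.0 / mean

# Hypothetical bimodal durations: half the intervals last 5 steps, half last 50.
durations = [5] * 50 + [50] * 50
a = best_geometric_fit(durations)
pmf = hmm_duration_pmf(a, 60)

print("fitted self-transition probability:", round(a, 4))
print("model P(d=1), P(d=5), P(d=50):", pmf[0], pmf[4], pmf[49])
```

Whatever value of a is fitted, the model assigns its highest probability to d = 1, a duration that never occurs in the sample, and it cannot place separate peaks at 5 and 50.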
Figure 1.4 Edge feature enhancement via HMM/EM EVA filter. The filter “projects” via a Gaussian parameterization on emissions with variance boosted by the factor indicated. From prior publications by the author [1–3].
Source: Based on Winters‐Hilt [1–3].
In adopting any model with “more parameters,” such as a HMMBD over a HMM, there is potentially a problem with having sufficient data to support the additional modeling. This is generally not a problem for any HMM that requires thousands of samples of non‐self transitions for sensor modeling, such as the gene finding described in what follows: knowing the boundary positions allows the regions of self‐transitions (the durations) to be extracted in similar numbers, which is typically sufficient for effective modeling of the duration distributions in a HMMD.
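The counting argument above can be sketched as follows: once a state labeling is known, every boundary (non‐self transition) delimits a run of self‐transitions, so a labeled sequence yields about as many duration samples as boundary samples. The single‐letter labels (e, i, j standing for exon, intron, junk) and the function names are illustrative assumptions, not the book's data format.

```python
from collections import defaultdict
from itertools import groupby

def durations_by_state(labels):
    """Run lengths of each maximal block of identical labels, grouped by state."""
    out = defaultdict(list)
    for state, run in groupby(labels):
        out[state].append(sum(1 for _ in run))
    return dict(out)

def count_boundaries(labels):
    """Number of non-self transitions (positions where the label changes)."""
    return sum(1 for prev, cur in zip(labels, labels[1:]) if prev != cur)

# Hypothetical labeled sequence: e = exon, i = intron, j = junk.
labels = "eeeiiiiieejjjjee"
durations = durations_by_state(labels)
n_runs = sum(len(v) for v in durations.values())

print("durations per state:", durations)
print("boundaries:", count_boundaries(labels), "duration samples:", n_runs)
```

For any non‐empty labeling the number of duration samples is the number of boundaries plus one, so a training set with thousands of annotated boundaries supplies thousands of durations as well.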
Improvement in overall HMM application rests not only on the aforementioned improvements to the HMM/HMMBD, but also on improvements to the hidden state model and emission model. This is because standard HMMs are at low Markov order in transitions (first) and in emissions (zeroth), and transitions are decoupled from emissions (which can miss critical structure in the model, such as state transition probabilities that are sequence dependent). This weakness is eliminated if we generalize to the largest state‐emission clique possible, fully interpolated on the data set, as is done with the generalized‐clique HMM, where gene finding is performed on the Caenorhabditis elegans genome. The clique generalization improves the modeling of the critical signal information at the transitions between exon regions and noncoding regions, e.g. intron and junk regions. In doing this we arrive at a HMM structure identification platform that is novel and robust in a number of ways.
Prior HMM‐based systems for SSA had undesirable limitations and disadvantages. For example, their speed of operation made such systems difficult, if not impossible, to use for real‐time analysis of information. In the SSA Protocol described here, distributed generalized HMM processing, together with the SVM‐based Classification and Clustering Methods (described next), permits general use of the SSA Protocol free of the usual limitations. After the HMM and SSA methods are described, their synergistic union is used to convey a new approach to signal analysis with HMM methods, including a new form of stochastic‐carrier wave (SCW) communication.