1.3 FSA‐Based Signal Acquisition and Bioinformatics
Many signal features of interest are time limited but not band limited in the observational context of interest, such as noise “clicks,” “spikes,” or impulses. To acquire these signal features a time‐domain finite state automaton (tFSA) is often most appropriate [116–124]. Human hearing, for example, is a nonlinear system that thereby circumvents the restrictions of the Gabor limit (allowing, for example, musical geniuses with “perfect pitch”), where time‐frequency acuity surpasses what would be possible by linear signal processing alone [116], such as with Nyquist‐sampled linear‐response recording devices that are bound by the limits imposed by the Fourier uncertainty principle (or Benedicks’ theorem) [117]. Thus, even when the powerful Fourier Transform or Hidden Markov Model (HMM) feature extraction methods are utilized to full advantage, there is often a sector of the signal analysis that is only conveniently accessible (without significant oversampling) by way of FSAs, such that parallel processing with both HMM and FSA methods is often needed (results demonstrating this in the context of channel current analysis [1–3] are described in Chapter 14). Not all of the methods employed at the FSA processing stage derive from standard signal processing approaches, either; some are purely statistical, such as oversampling [118] (used in radar range oversampling [119, 120]) and dithering [121] (used in device stabilization and to reduce quantization error [122, 123]).
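To make the tFSA idea concrete, the following is a minimal sketch of a time‐domain FSA that acquires “spike” events from a sampled trace in a single pass. The two‐state design, threshold, and minimum‐duration parameter are illustrative assumptions for this sketch, not the specific constructions developed in Chapters 2–4.

```python
# Minimal tFSA sketch: acquire "spike" events in one O(L) pass.
# States, threshold, and min_duration are illustrative assumptions.

def tfsa_spike_acquire(samples, threshold, min_duration=1):
    """Scan samples once and return (start, end) index pairs for regions
    where |sample| exceeds threshold for at least min_duration samples."""
    BASELINE, IN_SPIKE = 0, 1          # the two FSA states
    state, start, events = BASELINE, None, []
    for i, x in enumerate(samples):
        if state == BASELINE:
            if abs(x) > threshold:     # transition: baseline -> in-spike
                state, start = IN_SPIKE, i
        else:                          # state == IN_SPIKE
            if abs(x) <= threshold:    # transition: in-spike -> baseline
                if i - start >= min_duration:
                    events.append((start, i))
                state = BASELINE
    if state == IN_SPIKE and len(samples) - start >= min_duration:
        events.append((start, len(samples)))   # spike runs to end of record
    return events

# Example: acquire impulse events from a short noisy trace.
trace = [0.1, 0.0, 2.5, 2.7, 0.2, -0.1, 3.1, 0.0]
print(tfsa_spike_acquire(trace, threshold=1.0))   # [(2, 4), (6, 7)]
```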
All of the tFSA signal acquisition methods described in Chapters 2–4 are O(L) (in “order of,” or “big‐oh,” notation), i.e. they scan the data with a computational complexity no greater than that of simply reading the data (via a “read” or “touch” command). Because the signal acquisition is only O(L), it is not significantly costly, computationally, to repeat the acquisition analysis multiple times, with a more informed process at each iteration, to arrive at a “bootstrap” signal acquisition process. In such a setting, signal acquisition is often biased initially toward very high specificity (with very poor sensitivity), to obtain a “gold standard” set of highly likely true signals that can be data mined for their attributes. With a filter stage thereby trained, later scan passes can admit suspected signals with very weak specificity (very high sensitivity now), with high specificity then recovered by use of the filter. This allows a bootstrap process to reach very high specificity (SP) and sensitivity (SN) at the tFSA acquisition stage on the signals of interest.
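The control flow of this two‐pass bootstrap can be sketched schematically as below. The `acquire` and `train_filter` callables are hypothetical placeholders (for instance, `acquire` could be the tFSA sketch above, run with strict versus loose threshold settings); the point is the structure: a high‐specificity O(L) pass yields the gold‐standard set, whose attributes train a filter that restores specificity on a later high‐sensitivity pass.

```python
# Schematic bootstrap acquisition; `acquire` and `train_filter` are
# hypothetical placeholders standing in for a tFSA and an attribute filter.

def bootstrap_acquire(data, acquire, train_filter, strict, loose):
    gold = acquire(data, **strict)        # pass 1: O(L), very high specificity
    keep = train_filter(gold)             # mine attributes of gold-standard set
    candidates = acquire(data, **loose)   # pass 2: O(L), very high sensitivity
    return [s for s in candidates if keep(s)]   # filter recovers specificity
```

Each pass remains O(L), so a handful of iterations costs no more, in order terms, than reading the data a handful of times.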
An example of a bootstrap FSA from genomic analysis is to first scan through a genome base‐by‐base and obtain counts on nucleotide pairs with different gap sizes between the observed nucleotides [1, 3]. This then allows a mutual information analysis on the nucleotide pairs taken at the different gap sizes (shown in Chapters 3 and 4). What is found for prokaryotic genomes, with their highly dense gene placement that is mostly protein coding (i.e. where there is little “junk” deoxyribonucleic acid (DNA) and no introns), is a clear signal indicating anomalous statistical linkages on bases three apart [1, 3, 60]. What is discovered thereby is codon structure, where the coding information comes in groups of three bases. Knowing this, a repeated pass (bootstrap) with frequency analysis of the 64 possible 3‐base groupings can then be done, at which point anomalously low counts on “stop” codons are observed. Upon identification of the stop codons, their placement (topology) in the genome can then be examined, and it is found that their counts are anomalously low because there are large stretches with no stop codon (i.e. there are stop codon “voids,” known as open reading frames, or “ORF”s). The codon void topologies are examined in a comparative genomic analysis in [60] (and shown in Chapter 3). The stop codons, which should occur every 21 codons on average if the DNA sequence data were random, are sometimes not seen for stretches of several hundred codons. In genomic data these long voids flag genes, and the anomalous, non‐random DNA sequence is more distinctive the longer the gene‐coding region. This basic analysis provides a gene‐finder for prokaryotic genomes that comprises a one‐page Python script and performs with 90–99% accuracy depending on the prokaryotic genome (shown in Chapter 3). A second page of Python coding to introduce a “filter,” along the lines of the bootstrap learning process mentioned above, leads to an ab initio prokaryotic gene‐predictor with 98.0–99.9% accuracy. Python code to accomplish this is shown in Chapter 4. In this bootstrap acquisition process all that is used is the raw genomic data (with its highly structured intrinsic statistics) and methods for identifying statistical anomalies and informatics structural anomalies: (i) anomalously high mutual information is identified (revealing codon structure); (ii) anomalously high (or low) statistics on an attribute or event are then identified (low stop codon counts, lengthy stop codon voids); and (iii) anomalously frequent sub‐sequences (binding site motifs) are found in the neighborhood of the identified ORFs (used in the filter).
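The first two bootstrap steps can be condensed into a toy‐scale sketch, assuming the standard genetic‐code stop codons (TAA, TAG, TGA): (i) mutual information between base pairs at varying gap sizes, which on dense prokaryotic coding DNA shows the period‐three codon signal, and (ii) a scan for long stop‐codon voids (putative ORFs). Function names and the `min_codons` cutoff are illustrative; the full gene‐finder is developed in Chapters 3 and 4.

```python
# Toy-scale sketch of the genomic bootstrap steps; names and cutoffs
# are illustrative, not the Chapter 3/4 implementation.
import math
from collections import Counter

def gap_mutual_information(seq, gap):
    """Mutual information (bits) between bases separated by `gap`
    intervening bases; gap=2 pairs bases three positions apart."""
    pairs = [(seq[i], seq[i + gap + 1]) for i in range(len(seq) - gap - 1)]
    n = len(pairs)
    pxy = Counter(pairs)                       # joint counts
    px = Counter(x for x, _ in pairs)          # marginal counts, first base
    py = Counter(y for _, y in pairs)          # marginal counts, second base
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

STOP_CODONS = {"TAA", "TAG", "TGA"}            # standard genetic code

def stop_codon_voids(seq, frame=0, min_codons=100):
    """Return (start, end) codon-index pairs of stop-codon voids
    (putative ORFs) of at least min_codons in the given reading frame."""
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    voids, start = [], 0
    for i, codon in enumerate(codons):
        if codon in STOP_CODONS:
            if i - start >= min_codons:        # void long enough to flag
                voids.append((start, i))
            start = i + 1
    if len(codons) - start >= min_codons:      # trailing void, if any
        voids.append((start, len(codons)))
    return voids
```

On random sequence the expected void length is about 21 codons (3 stop codons out of 64), so voids of hundreds of codons are strong gene candidates, which is what the anomaly scan exploits.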
Ad hoc signal acquisition refers to finding the solution for “this” situation (whatever “this” is) without consideration of wider application; in other words, the solution is strongly data dependent. Data‐dependent methodologies are, by definition, not defined at the outset, but must be invented as the data begins to be understood. As with data dependency in non‐evolutionary search metaheuristics, where there is no optimal search method that is guaranteed to always work well, here there is no optimal signal acquisition method known in advance. This simply restates a fundamental limit from non‐evolutionary search metaheuristics in another form [1, 3]. What can be done, however, is to assemble the core tools and techniques from which a solution can be constructed and to perform a bootstrap algorithmic learning process with those tools (examples in what follows) to arrive at a functional signal acquisition on the data being analyzed. A universal, automated, bootstrap learning process may eventually be possible using evolutionary learning algorithms. This is related to the co‐evolutionary Free Lunch Theorem [1, 3], and is discussed in Chapter 12.
“Bootstrap” refers to solving a problem by seemingly paradoxical measures (the name references Baron von Munchausen, who freed the horse he was riding from a bog by pulling himself, and the horse with him, up by his bootstraps). Such algorithmic methods often involve repeated passes over the data sequence, with improved priors, or a trained filter, among other things, to improve performance. The bootstrap amplifier from electrical engineering is an amplifier circuit where part of the output is used as input, particularly at start‐up (known as bootstrapping), allowing proper self‐initialization to a functional state (by amplifying ambient circuit noise in some cases). The bootstrap FSA proposed here is a meta‐algorithmic method in that performance “feedback” with learning is used in algorithmic refinements, with iterated meta‐algorithmic learning, to arrive at a functional signal acquisition status.
Acquisition is often all that is needed in a signal analysis problem, where a basic means to acquire the signals is sought, followed by a basic statistical analysis of those signals and their occurrences. Various methods for signal acquisition using FSA constructs are described in what follows, with a focus on statistical anomalies to identify the presence of signal and “lock on” [1, 3]. The signal acquisition is initially guided only by use of statistical measures to recognize anomalies. Informatics methods and information theory measures are central to the design of a good FSA acquisition method, however, and will be reviewed in the signal acquisition context [1, 3], along with HMMs.
Thus, FSA processes allow signal regions to be identified, or “acquired,” in O(L) time. Furthermore, in that same order of time complexity, an entire panoply of statistical moments can also be computed on the signals (and used in a bootstrap learning process). The O(L) feature extraction of statistical moments on the acquired signal region may suffice for localized events and structures. For sequential information or events, however, there is often a non‐local, or extended structural, aspect to the signal sought. In these situations we need a general, powerful way to analyze sequential signal data that is stochastic (random, but with statistics, such as the average, that may be unchanging over time if “stationary”). The general method for performing stochastic sequential analysis (SSA) is via HMMs, as will be extensively described in Chapters 6 and 7, and briefly summarized in Section 1.5. HMM approaches require an identification of “states” in the signal analysis. If an identification of states is difficult, such as in situations where there can be changes in meaning according to context, e.g. language, then HMMs may not be useful. Text and language analytics are described in Chapters 5 and 13, and briefly outlined in the next section.
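As a minimal sketch of the O(L) moment computation mentioned above: a single pass over an acquired region accumulates the raw power sums, from which mean, variance, skewness, and kurtosis follow in closed form. The function name is illustrative, and the sketch assumes a non‐degenerate region (nonzero variance).

```python
# One-pass (O(L)) statistical moments of an acquired signal region.
# Illustrative sketch; assumes the region has nonzero variance.

def region_moments(region):
    """Return (mean, variance, skewness, kurtosis) from raw power sums."""
    n = s1 = s2 = s3 = s4 = 0
    for x in region:                   # single O(L) pass
        n += 1
        s1 += x; s2 += x * x; s3 += x ** 3; s4 += x ** 4
    mean = s1 / n
    var = s2 / n - mean ** 2
    # Central third and fourth moments from raw power sums.
    m3 = s3 / n - 3 * mean * s2 / n + 2 * mean ** 3
    m4 = s4 / n - 4 * mean * s3 / n + 6 * mean ** 2 * s2 / n - 3 * mean ** 4
    return mean, var, m3 / var ** 1.5, m4 / var ** 2
```

Such moment vectors, computed per acquired region, are exactly the kind of attributes a bootstrap filter stage can be trained on.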