Читать книгу Probability - Robert P. Dobrow - Страница 13
ОглавлениеINTRODUCTION
All theory, dear friend, is gray, but the golden tree of life springs ever green.
—Johann Wolfgang von Goethe
Probability began by first considering games of chance. But today, it has practical applications in areas as diverse as astronomy, economics, social networks, and zoology that enrich the theory and give the subject its unique appeal.
In this book, we will flip coins, roll dice, and pick balls from urns, all the standard fare of a probability course. But we have also tried to make connections with real-life applications and illustrate the theory with examples that are current and engaging.
You will see some of the following case studies again throughout the text. They are meant to whet your appetite for what is to come.
I.1 Walking the Web
There are about one trillion websites on the Internet. When you google a phrase like “Can Chuck Norris divide by zero?,” a remarkable algorithm called PageRank searches these sites and returns a list ranked by importance and relevance, all in the blink of an eye. PageRank is the heart of the Google search engine. The algorithm assigns an “importance value” to each web page and gives it a rank to determine how useful it is.
PageRank is a significant accomplishment of mathematics and linear algebra. It can be understood using probability. Of use are probability concepts called Markov chains and random walks, explored in Chapter 11. Imagine a web surfer who starts at some web page and clicks on a link at random to find a new site. At each page, the surfer chooses from one of the available hypertext links equally at random. If there are two links, it is a coin toss, heads or tails, to decide which one to pick. If there are 100 links, each one has a 1% chance of being chosen. As the web surfer moves from page to random page, they are performing a random walk on the web.
What is the PageRank of site ? Suppose the web surfer has been randomly walking the web for a very long time (infinitely long in theory). The probability that they visit site is precisely the PageRank of that site. Sites that have lots of incoming links will have a higher PageRank value than sites with fewer links.
The PageRank algorithm is actually best understood as an assignment of probabilities to each site on the web. Such a list of numbers is called a probability distribution. And since it comes as the result of a theoretically infinitely long random walk, it is known as the limiting distribution of the random walk. Remarkably, the PageRank values for billions of websites can be computed quickly and in real time.
I.2 Benford's Law
Turn to a random page in this book. Look in the middle of the page and point to the first number you see. Write down the first digit of that number.
You might think that such first digits are equally likely to be any integer from 1 to 9. But a remarkable probability rule known as Benford's law predicts that most of your first digits will be 1 or 2; the chances are almost 50%. The probabilities go down as the numbers get bigger, with the chance that the first digit is 9 being less than 5% (Fig. I.1).
Benford's law, also known as the “first-digit phenomenon,” was discovered over 100 years ago, but it has generated new interest in recent years. There are a huge number of datasets that exhibit Benford's law, including street addresses, populations of cities, stock prices, mathematical constants, birth rates, heights of mountains, and line items on tax returns. The last example, in particular, caught the eye of business Professor Mark Nigrini who showed that Benford's law can be used in forensic accounting and auditing as an indicator of fraud [2012].
FIGURE I.1: Benford's law describes the frequencies of first digits for many real-life datasets.
Durtschi et al. [2004] describe an investigation of a large medical center in the western United States. The distribution of first digits of check amounts differed significantly from Benford's law. A subsequent investigation uncovered that the financial officer had created bogus shell insurance companies in her own name and was writing large refund checks to those companies. Applications to international trade were investigated in Cerioli et al. [2019].
I.3 Searching the Genome
Few areas of modern science employ probability more than biology and genetics. A strand of DNA, with its four nucleotide bases adenine, cytosine, guanine, and thymine, abbreviated by their first letters, presents itself as a sequence of outcomes of a four-sided die. The enormity of the data—about three billion “letters” per strand of human DNA—makes randomized methods relevant and viable.
Restriction sites are locations on the DNA that contain a specific sequence of nucleotides, such as G-A-A-T-T-C. Such sites are important to identify because they are locations where the DNA can be cut and studied. Finding all these locations is akin to finding patterns of heads and tails in a long sequence of coin tosses. Theoretical limit theorems for idealized sequences of coin tosses become practically relevant for exploring the genome. The locations for such restriction sites are well described by the Poisson process, a fundamental class of random processes that model locations of restriction sites on a chromosome, as well as car accidents on the highway, service times at a fast food chain, and when you get your text messages.
On the macrolevel, random processes are used to study the evolution of DNA over time in order to construct evolutionary trees showing the divergence of species. DNA sequences change over time as a result of mutation and natural selection. Models for sequence evolution, called Markov processes, are continuous time analogues of the type of random walk models introduced earlier.
Miller et al. [2012] analyze the sequenced polar bear genome and give evidence that the size of the bear population fluctuated with key climactic events over the past million years, growing in periods of cooling and shrinking in periods of warming. Their paper, published in the Proceedings of the National Academy of Sciences, is all biology and genetics. But the appendix of supporting information is all probability and statistics. Similar analyses, rooted in probability theory, continue to be performed investigating relationships between species, as described in Mather et al. [2020].
I.4 Big Data
The search for the Higgs boson, the so-called “God particle,” at the Large Hadron Collider in Geneva, Switzerland, generated 200 petabytes of data (1 petabyte = bytes). That is as much data as the total amount of printed material in the world at the time! In physics, genomics, climate science, marketing, even online gaming and film, the sizes of datasets being generated are staggering. How to store, transmit, visualize, and process such data is one the great challenges of science.
Probability is being used in a central way for such problems in a methodology called compressed sensing.
In the average hospital, many terabytes (1 terabyte = bytes) of digital magnetic resonance imaging (MRI) data are generated each year. A half-hour MRI scan might collect 100 Mb of data. These data are then compressed to a smaller image, say 5 Mb, with little loss of clarity or detail. Medical and most natural images are compressible since lots of pixels have similar values. Compression algorithms work by essentially representing the image as a sum of simple functions (such as sine waves) and then discarding those terms that have low information content. This is a fundamental idea in signal processing, and essentially what is done when you take a picture on your cell phone and then convert it to a JPEG file for sending to a friend or uploading to the web.
Compressed sensing asks: If the data are ultimately compressible, is it really necessary to acquire all the data in the first place? Can just the final compressed data be what is initially gathered? And the startling answer is that by randomly sampling the object of interest, the final image can be reconstructed with similar results as if the object had been fully sampled. Random sampling of MRI scans produces an image of similar quality as when the entire object is scanned. The new technique has reduced MRI scan time to one-seventh the original time, from about half an hour to less than 5 minutes, and shows enormous promise for many other applied areas. For more information on this topic, the reader is directed to Mackenzie [2009].
I.5 From Application to Theory
Having sung the praises of applications and case studies, we come back to the importance of theory.
Probability has been called the science of uncertainty. “Mathematical probability” may seem an oxymoron like jumbo shrimp or civil war. If any discipline can profess a claim of “certainty,” surely it is mathematics with its adherence to rigorous proof and timeless results.
One of the great achievements of modern mathematics was putting the study of probability on a solid scientific foundation. This was done in the 1930s, when the Russian mathematician Andrey Nikolaevich Kolmogorov built up probability theory in a rigorous way similarly to how Euclid built up geometry. Much of his work is the material of a graduate-level course, but the basic framework of axiom, definition, theory, and proof sets the framework for the modern treatment of the subject.
One of the joys of learning probability is the compelling imagery we can exploit. Geometers draw circles and squares; probabilists toss coins and roll dice. There is no perfect circle in the physical universe. And the “fair coin” is an idealized model. Yet when you take real pennies and toss them repeatedly, the results conform so beautifully to the theory.
In this book, we use the computer program R. R is free software and an interactive computing environment available for download at http://www.r-project.org/
. If you have never used R before, we encourage you to work through the introductory R supplement to familiarize yourself with the language. As you work through the text, the associated supplements support working with the code and script files. The script files only require R. For working with the supplements, you can read the pdf versions, or if you want to run the code yourself, we recommend using RStudio to open these RMarkdown files. RStudio has a free version, and it provides a useful user interface for R. RMarkdown files allow R code to be interwoven with text in a reproducible fashion.
Simulation plays a significant role in this book. Simulation is the use of random numbers to generate samples from a random experiment. Today, it is a bedrock tool in the sciences and data analysis. Many problems that were for all practical purposes impossible to solve before the computer age are now easily handled with simulation.
There are many compelling reasons for including simulation in a probability course. Simulation helps build invaluable intuition for how random phenomena behave. It will also give you a flexible platform to test how changes in assumptions and parameters can affect outcomes. And the exercise of translating theoretical models into usable simulation code (easy to do in R) will make the subject more concrete and hopefully easier to understand.
And, most importantly, it is fun! Students enjoy the hands-on approach to the subject that simulation offers. It is thrilling to see some complex theoretical calculation “magically” verified by a simulation.
To succeed in this subject, read carefully, work through the examples, and do as many problems as you can. But most of all, enjoy the ride!
The results concerning fluctuations in coin tossing show that widely held beliefs …are fallacious. They are so amazing and so at variance with common intuition that even sophisticated colleagues doubted that coins actually misbehave as theory predicts. The record of a simulated experiment is therefore included….
—William Feller, An Introduction to Probability Theory and Its Applications, Vol. 1, Third Edition (1968), page xi.