Medical Statistics - David Machin

Medical Statistics and Data Science

Because of the availability of large amounts of data over the last few decades, the term data science has emerged to describe the substantial current intellectual effort around research with the goal of extracting information from these data. The type of data currently available in all sorts of application domains is often massive in size, very heterogeneous and far from being collected under designed or controlled experimental conditions. Nonetheless, it contains information, often substantial information, and it has been argued that data science is a new interdisciplinary approach that makes maximal use of this information. However, data alone is typically not that informative, and (machine) learning from data needs conceptual frameworks. Data science would seem to encompass statistics. We would argue, though, that statistics is crucial for providing conceptual frameworks that enhance the understanding of fundamental phenomena, highlight limitations and provide a formalism for properly founded data analysis, information extraction and quantification of uncertainty, as well as for the analysis and development of algorithms that carry out these key tasks.

As taught at a number of universities, data science differs from statistics in several ways. Statistics originated before the computer and its core concern is with statistical models. No serious statistician is beguiled into confusing their model with reality (‘All models are wrong, but some are useful’ to quote the famous statistician George Box). Nevertheless, models are very useful in describing how the world might be, and for making generalisations beyond the data. Data science is empirical and reliant on large data sets, whereas one of the key successes of statistics is doing inference on relatively small data sets, such as those available in agriculture and laboratories. Data science is often used for prediction, and the idea is that with the vast amounts of data now available electronically (such as that provided by national health services) one can look at empirical relationships and build up accurate predictors, such as how drugs will behave in individuals. These predictions are often highly successful, but without a model it can be difficult to know why a particular prediction was made, and how generalisable the predictions might be. Data science is related to the concept of ‘big data’. However, simply because a sample is large does not mean it is unbiased.
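The point that a large sample can still be biased can be illustrated with a small simulation. The following is a minimal sketch under invented assumptions, not an analysis of real data: the population of blood pressure values, the inclusion probabilities and the sample sizes are all hypothetical, chosen only to show that a very large sample with selective inclusion can estimate a mean worse than a small random sample.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: systolic blood pressure (mmHg) for 200,000 people.
population = [random.gauss(130, 15) for _ in range(200_000)]
true_mean = statistics.fmean(population)

# Assumed selection mechanism: people with higher values are more likely
# to appear in the records (for example, because they attend clinic more).
def inclusion_probability(value):
    return 0.9 if value > 140 else 0.3

# Large but biased sample: each person is included with the probability above.
big_biased = [x for x in population if random.random() < inclusion_probability(x)]

# Small simple random sample drawn from the same population.
small_random = random.sample(population, 200)

# Error of each sample mean relative to the population mean.
big_biased_error = statistics.fmean(big_biased) - true_mean
small_random_error = statistics.fmean(small_random) - true_mean

print(f"population mean: {true_mean:.1f}")
print(f"biased sample, n={len(big_biased)}: error {big_biased_error:+.1f}")
print(f"random sample, n={len(small_random)}: error {small_random_error:+.1f}")
```

In this sketch the biased sample contains tens of thousands of values, yet its mean is shifted upwards by several mmHg, and collecting more data by the same mechanism would not remove the shift. The random sample of 200 is noisier but is not systematically wrong.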

A case in point is the reported link between taking hormone replacement therapy (HRT) and lower heart disease rates observed in some large data sets. However, a key issue is whether women who use HRT are already more health conscious. It can be difficult to know whether this difference is adequately accounted for in conclusions drawn from the big data. Thus, it was only when the results of a randomised controlled trial of the use of HRT (Writing Group for the Women's Health Initiative Investigators 2002) became available that HRT was shown not to protect against heart disease. In fact, the trial identified an increased risk for total cardiovascular disease with hazard ratio 1.22 and 95% confidence interval 1.09 to 1.36 (the technical terms will be explained in Chapter 11). In this example, big data led to a wrong conclusion.
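As a rough check on how the reported figures fit together, the short sketch below works on the log scale, where confidence intervals for ratios are usually constructed. It assumes, and this is an assumption rather than something stated in the text, that the interval is approximately symmetric about the log hazard ratio with a multiplier of 1.96. The variable names are illustrative only.

```python
import math

# Figures reported in the text for total cardiovascular disease.
hazard_ratio = 1.22
ci_lower, ci_upper = 1.09, 1.36

# Assumption: the 95% interval is log(HR) +/- 1.96 * SE on the log scale.
z = 1.96
log_se = (math.log(ci_upper) - math.log(ci_lower)) / (2 * z)

# Centre of the interval on the log scale, transformed back to a ratio.
# Under the assumption above it should sit close to the reported estimate.
centre = math.exp((math.log(ci_lower) + math.log(ci_upper)) / 2)

# A hazard ratio of 1 means no difference between the groups, so an
# interval lying wholly above 1 points to an increased risk.
excludes_no_effect = ci_lower > 1.0 or ci_upper < 1.0

print(f"approximate SE of log hazard ratio: {log_se:.3f}")
print(f"centre of interval on ratio scale:  {centre:.2f}")
print(f"interval excludes a ratio of 1:     {excludes_no_effect}")
```

The centre of the interval recovered this way agrees with the reported hazard ratio to within rounding, and the whole interval lies above 1, which is why the trial result is read as an increased rather than a reduced risk.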
