Daily Southtown

Data studies may upset faith in science

By Gary Smith

Coffee was wildly popular in Sweden in the 17th century — and also illegal. King Gustav III believed that it was a slow poison and devised a clever experiment to prove it. He commuted the sentences of murderous twin brothers who were waiting to be beheaded, on one condition: One brother had to drink three pots of coffee every day while the other drank three pots of tea. The early death of the coffee-drinker would prove that coffee was poison.

It turned out that the coffee-drinking twin outlived the tea drinker, but it wasn’t until the 1820s that Swedes were finally legally permitted to do what they had been doing all along — drink lots of coffee.

The cornerstone of the scientific revolution is the insistence that claims be tested with data, ideally in a randomized controlled trial. Gustav’s experiment was noteworthy for its use of identical male twins, which eliminated the confounding effects of sex, age and genes. The most glaring weakness was that nothing statistically persuasive can come from such a small sample.

Today, the problem is not the scarcity of data, but the opposite. We have too much data, and it is undermining the credibility of science.

Luck is inherent in random trials. In a medical study, some patients may be healthier. In an agricultural study, some soil may be more fertile. In an educational study, some students may be more motivated. Researchers consequently calculate the probability (the p-value) that the outcomes might happen by chance. A low p-value indicates that the results cannot easily be attributed to the luck of the draw.

How low? In the 1920s, the great British statistician Ronald Fisher said that he considered p-values below 5% to be persuasive, and so 5% became the hurdle for the “statistically significant” certification needed for publication, funding and fame.

It is not a difficult hurdle. Suppose that a hapless researcher calculates the correlations among hundreds of variables, blissfully unaware that the data are all, in fact, random numbers. On average, one out of 20 correlations will be statistically significant, even though every correlation is nothing more than coincidence.
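The one-in-20 arithmetic is easy to verify with a short simulation. This is a sketch, not anything from the column itself; the sample sizes are arbitrary, and it uses NumPy and SciPy to correlate columns of pure noise:

```python
# A sketch with made-up sizes: correlate columns of pure noise and
# count how many pairs clear the 5% "statistical significance" hurdle.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)           # fixed seed so the run is repeatable
n_obs, n_vars = 200, 100                 # arbitrary, hypothetical sizes
data = rng.normal(size=(n_obs, n_vars))  # every column is random noise

significant = total = 0
for i in range(n_vars):
    for j in range(i + 1, n_vars):
        _, p = stats.pearsonr(data[:, i], data[:, j])
        total += 1
        significant += p < 0.05

# Roughly 5% of the 4,950 pairwise correlations come out "significant",
# even though every one of them is coincidence.
print(f"{significant} of {total} pairs have p < 0.05 ({significant / total:.1%})")
```

With 100 noise columns there are 4,950 pairs to test, so a 5% false-positive rate hands the hapless researcher a couple of hundred "discoveries."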

Real researchers don’t correlate random numbers but, all too often, they correlate what are essentially randomly chosen variables. This haphazard search for statistical significance even has a name: data mining. As with random numbers, the correlation between randomly chosen, unrelated variables has a 5% chance of being fortuitously statistically significant. Data mining can be augmented by manipulating, pruning and otherwise torturing the data to get low p-values.

To find statistical significance, one need merely look sufficiently hard. Thus, the 5% hurdle has had the perverse effect of encouraging researchers to do more tests and report more meaningless results.

As a result, silly relationships are published in good journals simply because the results are statistically significant.

Students do better on a recall test if they study for the test after taking it (Journal of Personality and Social Psychology).

Japanese-Americans are prone to heart attacks on the fourth day of the month (British Medical Journal).

Bitcoin prices can be predicted from stock returns in the paperboard, containers and boxes industry (National Bureau of Economic Research).

Elderly Chinese women can postpone their deaths until after the celebration of the Harvest Moon Festival (Journal of the American Medical Association).

Women who eat breakfast cereal daily are more likely to have male babies (Proceedings of the Royal Society).

People can use power poses to increase their dominance hormone testosterone and reduce their stress hormone cortisol (Psychological Science).

Hurricanes are deadlier if they have female names (Proceedings of the National Academy of Sciences).

Investors can obtain a 23% annual return in the market by basing their buy/sell decisions on the number of Google searches for the word “debt” (Scientific Reports).

These now-discredited studies are the tip of a statistical iceberg that has come to be known as the replication crisis.

A team led by John Ioannidis looked at attempts to replicate 34 highly respected medical studies and found that only 20 were confirmed. The Reproducibility Project attempted to replicate 97 studies published in leading psychology journals and confirmed only 35. The Experimental Economics Replication Project attempted to replicate 18 experimental studies reported in leading economics journals and confirmed only 11.

I wrote a satirical paper that was intended to demonstrate the folly of data mining. I looked at Donald Trump’s voluminous tweets and found statistically significant correlations between: Trump tweeting the word “president” and the S&P 500 index two days later; Trump tweeting the word “ever” and the temperature in Moscow four days later; Trump tweeting the word “more” and the price of tea in China four days later; and Trump tweeting the word “democrat” and some random numbers I had generated.

I concluded — tongue as firmly in cheek as I could hold it — that I had found “compelling evidence of the value of using data-mining algorithms to discover statistically persuasive, heretofore unknown correlations that can be used to make trustworthy predictions.”

I naively assumed that readers would get the point of this nerd joke: Large data sets can easily be mined and tortured to identify patterns that are utterly useless. I submitted the paper to an academic journal and the reviewer’s comments demonstrate beautifully how deeply embedded is the notion that statistical significance supersedes common sense: “The paper is generally well written and structured. This is an interesting study and the authors have collected unique datasets using cutting-edge methodology.”

It is tempting to believe that more data means more knowledge. However, the explosion in the number of things that are measured and recorded has magnified beyond belief the number of coincidental patterns and bogus statistical relationships waiting to deceive us.

If the number of true relationships yet to be discovered is limited, while the number of coincidental patterns grows exponentially with the accumulation of more and more data, then the probability that a randomly discovered pattern is real inevitably approaches zero.
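That argument can be made concrete with a back-of-the-envelope calculation. The numbers below are entirely hypothetical, chosen only to illustrate the shape of the curve: suppose 100 true relationships are hiding among the candidate patterns, each detected with 80% power, while every other candidate is a coincidence that clears the 5% hurdle by luck.

```python
# Hypothetical numbers for illustration only: 100 true relationships,
# 80% power, 5% significance level; every other candidate is coincidence.
def prob_real(n_candidates, n_true=100, power=0.8, alpha=0.05):
    true_hits = n_true * power                    # real patterns that test significant
    false_hits = (n_candidates - n_true) * alpha  # coincidences that test significant
    return true_hits / (true_hits + false_hits)

for n in (1_000, 100_000, 10_000_000):
    print(f"{n:>12,} candidate patterns: "
          f"P(a significant pattern is real) = {prob_real(n):.4f}")
# The probability falls from about 0.64 toward zero as candidates multiply.
```

As the pool of candidate patterns grows while the stock of true relationships stays fixed, the false positives swamp the real discoveries.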

The problem today is not that we have too little data, but that we have too much, and it seduces researchers into ransacking it for patterns that are easy to find, likely to be coincidental, and unlikely to be useful.

PEOPLEIMAGES/GETTY: Coffee was wildly popular in Sweden in the 17th century. It was also illegal then, as the country’s king believed it was poison.
