Baltimore Sun

Believe in science? Data studies may shake your faith

- By Gary Smith Gary Smith (gary.smith@pomona.edu) is an economics professor at Pomona College and author of "The AI Delusion" and the forthcoming "Distrust: Big Data, Data-Torturing, and the Assault on Science." This column originally appeared on Bloomberg

Coffee was wildly popular in Sweden in the 17th century — and also illegal. King Gustav III believed that it was a slow poison and devised a clever experiment to prove it. He commuted the sentences of murderous twin brothers who were waiting to be beheaded, on one condition: One brother had to drink three pots of coffee every day while the other drank three pots of tea. The early death of the coffee-drinker would prove that coffee was poison.

It turned out that the coffee-drinking twin outlived the tea drinker, but it wasn’t until the 1820s that Swedes were finally legally permitted to do what they had been doing all along — drink coffee, lots of coffee.

The cornerstone of the scientific revolution is the insistence that claims be tested with data, ideally in a randomized controlled trial. Gustav's experiment was noteworthy for his use of identical male twins, which eliminated the confounding effects of sex, age and genes. The most glaring weakness was that nothing statistically persuasive can come from such a small sample.

Today, the problem is not the scarcity of data, but the opposite. We have too much data, and it is undermining the credibility of science.

Luck is inherent in random trials. In a medical study, some patients may be healthier. In an agricultural study, some soil may be more fertile. In an educational study, some students may be more motivated. Researchers consequently calculate the probability (the p-value) that the outcomes might happen by chance. A low p-value indicates that the results cannot easily be attributed to the luck of the draw.

How low? In the 1920s, the great British statistician Ronald Fisher said that he considered p-values below 5% to be persuasive and, so, 5% became the hurdle for the "statistically significant" certification needed for publication, funding and fame.

It is not a difficult hurdle. Suppose that a hapless researcher calculates the correlations among hundreds of variables, blissfully unaware that the data are all, in fact, random numbers. On average, one out of 20 correlations will be statistically significant, even though every correlation is nothing more than coincidence.

Real researchers don't correlate random numbers, but, all too often, they correlate what are essentially randomly chosen variables. This haphazard search for statistical significance even has a name: data mining. As with random numbers, the correlation between randomly chosen, unrelated variables has a 5% chance of being fortuitously statistically significant. Data mining can be augmented by manipulating, pruning and otherwise torturing the data to get low p-values. To find statistical significance, one need merely look sufficiently hard. Thus, the 5% hurdle has had the perverse effect of encouraging researchers to do more tests and report more meaningless results.
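The hapless researcher's predicament is easy to simulate. The sketch below (a minimal illustration, not anyone's published study) correlates 200 pairs of pure random-number series and counts how many clear the 5% significance hurdle; the critical value 0.361 is the approximate two-tailed 5% cutoff for a Pearson correlation with 30 observations. Roughly one pair in 20 comes up "significant" despite every series being noise.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

random.seed(1)                # fixed seed so the run is repeatable
N_PAIRS, N_OBS = 200, 30
R_CRIT = 0.361                # approx. two-tailed 5% critical |r| for n = 30

# Correlate pairs of unrelated random series and count "significant" hits.
hits = 0
for _ in range(N_PAIRS):
    x = [random.gauss(0, 1) for _ in range(N_OBS)]
    y = [random.gauss(0, 1) for _ in range(N_OBS)]
    if abs(pearson_r(x, y)) > R_CRIT:
        hits += 1

print(f"{hits} of {N_PAIRS} random correlations were 'significant' "
      f"({hits / N_PAIRS:.1%})")
```

With enough variables, such chance hits are guaranteed; reporting only the hits is exactly the data mining the column describes.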

Thus, silly relationships are published in good journals simply because the results are statistically significant:

Students do better on a recall test if they study for the test after taking it (Journal of Personality and Social Psychology);

Japanese-Americans are prone to heart attacks on the fourth day of the month (British Medical Journal);

Bitcoin prices can be predicted from stock returns in the paperboard, containers and boxes industry (National Bureau of Economic Research);

Elderly Chinese women can postpone their deaths until after the celebration of the Harvest Moon Festival (Journal of the American Medical Association);

Women who eat breakfast cereal daily are more likely to have male babies (Proceedings of the Royal Society);

People can use power poses to increase their dominance hormone testosterone and reduce their stress hormone cortisol (Psychological Science);

Hurricanes are deadlier if they have female names (Proceedings of the National Academy of Sciences);

Investors can obtain a 23% annual return in the market by basing their buy/sell decisions on the number of Google searches for the word "debt" (Scientific Reports).

These now-discredited studies are the tip of a statistical iceberg that has come to be known as the replication crisis.

A team led by Stanford Medicine Professor John Ioannidis looked at attempts to replicate 34 highly respected medical studies and found that only 20 were confirmed. The Reproducibility Project attempted to replicate 97 studies published in leading psychology journals and confirmed only 35. The Experimental Economics Replication Project attempted to replicate 18 experimental studies reported in leading economics journals and confirmed only 11.

I wrote a satirical paper that was intended to demonstrate the folly of data mining. I looked at Donald Trump's voluminous tweets and found statistically significant correlations between: Mr. Trump tweeting the word "president" and the S&P 500 index two days later; Mr. Trump tweeting the word "ever" and the temperature in Moscow four days later; Mr. Trump tweeting the word "more" and the price of tea in China four days later; and Mr. Trump tweeting the word "Democrat" and some random numbers I had generated.

I concluded — tongue as firmly in cheek as I could hold it — that I had found "compelling evidence of the value of using data-mining algorithms to discover statistically persuasive, heretofore unknown correlations that can be used to make trustworthy predictions."

I naively assumed that readers would get the point of this nerd joke: Large data sets can easily be mined and tortured to identify patterns that are utterly useless. I submitted the paper to an academic journal, and the reviewer's comments demonstrate beautifully how deeply embedded the notion is that statistical significance supersedes common sense: "The paper is generally well written and structured. This is an interesting study and the authors have collected unique datasets using cutting-edge methodology."

It is tempting to believe that more data means more knowledge. However, the explosion in the number of things that are measured and recorded has magnified beyond belief the number of coincidental patterns and bogus statistical relationships waiting to deceive us.

If the number of true relationships yet to be discovered is limited, while the number of coincidental patterns is growing exponentially with the accumulation of more and more data, then the probability that a randomly discovered pattern is real is inevitably approaching zero.
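This base-rate argument can be made concrete with a line of Bayes-rule arithmetic. The sketch below uses illustrative numbers of my own choosing (a test with 80% power and the 5% significance threshold) to show how the share of "significant" findings that reflect real effects collapses as true relationships become rarer among the hypotheses tested.

```python
def p_real_given_significant(prior, power=0.8, alpha=0.05):
    """Bayes' rule: fraction of 'significant' findings that are real.

    prior -- probability a tested relationship is actually true
    power -- chance a real effect passes the significance test
    alpha -- chance a spurious effect passes anyway (the 5% hurdle)
    """
    true_positives = prior * power
    false_positives = (1 - prior) * alpha
    return true_positives / (true_positives + false_positives)

# As true relationships get scarcer, significant results become
# increasingly likely to be coincidences.
for prior in (0.5, 0.1, 0.01, 0.001):
    print(f"prior {prior:>6}: P(real | significant) = "
          f"{p_real_given_significant(prior):.3f}")
```

When only one hypothesis in a thousand is true, fewer than 2% of the statistically significant results are real, which is the column's point: pile up enough data and tests, and significance alone tells you almost nothing.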

The problem today is not that we have too few data, but that we have too much data, which seduces researchers into ransacking it for patterns that are easy to find, likely to be coincidental, and unlikely to be useful.

DAVID HORSEY/THE SEATTLE TIMES. A satirical paper intended to demonstrate the folly of data mining by claiming statistically significant correlations between Donald Trump tweeting the word "president" and the S&P 500 index two days later was instead taken seriously by an academic journal.
