The Oneida Daily Dispatch (Oneida, NY)

Sorry, wrong number: Statistical benchmark comes under fire

- By Malcolm Ritter

NEW YORK (AP) >> Earlier this fall Dr. Scott Solomon presented the results of a huge heart drug study to an audience of fellow cardiologists in Paris.

The results Solomon was describing looked promising: Patients who took the medication had a lower rate of hospitalization and death than patients on a different drug.

Then he showed his audience another number.

“There were some gasps, or ‘Ooohs,’” Solomon, of Harvard’s Brigham and Women’s Hospital, recalled recently. “A lot of people were disappointed.”

One investment analyst reacted by reducing his forecast for peak sales of the drug — by $1 billion. What happened? The number that caused the gasps was 0.059. The audience was looking for something under 0.05.

What it meant was that Solomon’s promising results had run afoul of a statistical concept you may never have heard of: statistical significance. It’s an all-or-nothing thing. Your statistical results are either significant, meaning they are reliable, or not significant, indicating an unacceptably high chance that they were just a fluke.

The concept has been used for decades. It holds a lot of sway over how scientific results are appraised, which studies get published, and what medicines make it to drugstores.

But this year has brought two high-profile calls from critics, including from inside the arcane world of statistics, to get rid of it — in part out of concern that it prematurely dismisses results like Solomon’s.

Significance is reflected in a calculation that produces something called a p-value. Usually, if the p-value comes in below 0.05, the study findings are considered significant. If not, the study has failed the test.
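
For readers who want to see the arithmetic, here is a minimal sketch in Python of how a comparison like this is typically boiled down to a p-value and held up against the 0.05 cutoff. The counts are entirely made up for illustration (they are not data from Solomon’s trial or any real study), chosen only so the result lands near the 0.059 that drew the gasps.

    # Illustrative only: hypothetical event counts, not data from any real trial.
    import math

    events_a, n_a = 186, 1000   # hypothetical events (hospitalization or death) on the new drug
    events_b, n_b = 220, 1000   # hypothetical events on the comparison drug

    rate_a, rate_b = events_a / n_a, events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)                  # event rate if there were no difference
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))   # standard error of the rate difference
    z = (rate_a - rate_b) / se                                    # how many standard errors apart the rates are
    p_value = math.erfc(abs(z) / math.sqrt(2))                    # two-sided p-value, normal approximation

    print(f"p-value = {p_value:.3f}")
    print("statistically significant" if p_value < 0.05 else "not significant under the usual 0.05 cutoff")

With these made-up numbers the p-value comes out around 0.06, so under the usual rule the apparent benefit would be declared not significant, even though the estimated rates clearly differ.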

Solomon’s study just missed. So the apparent edge his drug was showing over the other medication was deemed insignificant. By this criterion there was no “real” difference.

Solomon believes the drug in fact produced a real benefit and that a larger or longer-lasting study could have reached statistical significance.

“I’m not crying over spilled milk,” he said. “We do set the rules. The question is, is that the right way to go about it?”

He’s not alone in asking that question.

“It is a safe bet that people have suffered or died because scientists (and editors, regulators, journalists and others) have used significance tests to interpret results,” epidemiologist Kenneth Rothman of RTI Health Solutions in Research Triangle Park, N.C., and Boston University wrote in 2016.

The danger is both that a potentially beneficial medical finding can be ignored because a study doesn’t reach statistical significance, and that a harmful or fruitless medical practice can be accepted simply because it does, he said in an email.

The p-value cutoff for significance is “a measure that has gained gatekeeper status ... not only for publication but for people to take your results seriously,” says Northwestern University statistician Blake McShane.

It’s no wonder that a statistician, at a recent talk to journalists about the issue just before Halloween, displayed a slide of a jack-o’-lantern carved with this sight, obviously terrifying to anyone in science or medicine: “P = .06.”

McShane and others argue that the importance of the p-value threshold is undeserved. He co-authored a call to abolish the notion of statistical significance, which was published in the prestigious journal Nature this year. The proposal attracted more than 800 cosigners.

Even the American Statistical Association, which had never issued any formal statement on specific statistical practices, came down hard in 2016 on using any kind of p-value cutoff in this way. And this year it went further, declaring in a special issue with 43 papers on the subject, “It is time to stop using the term ‘statistically significant’ entirely.”

What’s the problem? McShane and others list several:

— The p-value does not directly measure the likelihood that the outcome of an experiment is just a fluke. What it really represents is widely misunderstood, even by scientists and some statisticians, said Nicole Lazar, a statistics professor at the University of Georgia. (A short simulation sketch after this list illustrates what a p-value does and does not measure.)

— Using a label of statistical significance “gives more certainty than is actually warranted,” Lazar said. “We should recognize the fact that there is uncertainty in our findings.”

— The traditional cutoff of 0.05 is arbitrary.

— Statistical significance does not necessarily mean “significant” — or that a finding is important practically or scientifically, Lazar says. It might not even be true: Solomon cites a large heart drug study that found a significant treatment effect for patients born in August but not July, obviously just a random fluctuation.

— The term “statistical significance” sets up a goal line for researchers, a clear measure of success or failure. That means researchers can try a little bit too hard to reach it.

They may deliberately game the system to get an acceptable p-value, or just unconsciously choose analytic methods that help, McShane and Lazar said.

— That can distort the effects not only of individual experiments, but also the cumulative results of studies on a given topic, so that overall a drug can look “a lot better than it actually is,” McShane said.
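
To make the point about what a p-value does and does not measure concrete, here is a small, purely illustrative Python simulation; it is an addition for illustration, not something from the researchers quoted here. It repeatedly compares two groups drawn from the same distribution, so there is no real effect at all. The p-value in each comparison is computed under exactly that “no difference” assumption, which is why it cannot directly tell you the chance that a given finding is a fluke; and roughly 5 percent of the comparisons still dip below 0.05 by chance alone, the kind of random fluctuation behind the August-versus-July result Solomon describes.

    # Illustrative simulation: no real difference exists between the groups,
    # yet about 5% of comparisons still come out "statistically significant".
    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(0)
    n_experiments, n_per_group = 10_000, 50

    false_alarms = 0
    for _ in range(n_experiments):
        a = rng.normal(loc=0.0, scale=1.0, size=n_per_group)  # both groups share
        b = rng.normal(loc=0.0, scale=1.0, size=n_per_group)  # the same true mean
        p = ttest_ind(a, b).pvalue     # p-value computed assuming "no difference" is true
        if p < 0.05:
            false_alarms += 1

    # Expect roughly 5%: the p-value describes how often data this extreme would
    # arise if there were no effect; it is not the probability the finding is a fluke.
    print(f"'significant' results despite no real effect: {false_alarms / n_experiments:.1%}")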

What should be done instead? Abolish the bright line of statistical significance, and just report the p-value along with other analyses to give a more comprehensive outline of what the test result may mean, McShane and others say.

It may not be as clear-cut as a simple declaration of significance or insignificance, but “we’ll have a better idea of what’s going on,” Lazar said. “I think it will be easier to weed out the bad work.”

Not everybody buys the idea of doing away with statistical significance. Prominent Stanford researcher Dr. John Ioannidis says that abolition “could promote bias. Irrefutable nonsense would rule.” Although he agrees that a p-value standard of less than 0.05 is weak and easily abused, he believes scientists should use a more stringent p-value or other statistical measure instead, specified before the experiment is performed.

McShane said that although calls for abolishing statistical significance have been raised for years, there seems to be more momentum lately.

“Maybe,” he said, “it’s time to put the nail in the coffin on this one for good.”

PETER J. CARROLL—ASSOCIATED PRESS. In this July 1, 1960, file photo, a chemist works in a laboratory in Cambridge, Mass. For decades, scientists have used “statistical significance” to estimate whether their results are reliable or just flukes. It’s long been criticized, but 2019 has brought two high-profile calls to get rid of it entirely.
