A dangerous statistical quirk
Anyone with an interest in how research informs public policy should pay attention to p-values
Decisions affecting millions of people should be made using the best possible information. That’s why researchers, public officials, and anyone with views on social policy should pay attention to a controversy in statistics. The lesson: Watch out if you see a claim of the form “x is significantly related to y.”
At issue is a statistical test that researchers in a wide range of disciplines, from medicine to economics, use to draw conclusions from data. Let’s say you have a pill that’s supposed to make people rich. You give it to 30 people, and they wind up 1 percent richer than a similar group that took a placebo.
Before you can attribute this difference to your magic pill, you need to test your results with a narrow and dangerously subtle question: How likely would you be to get this result if your pill had no effect whatsoever? If this probability, or so-called p-value, is less than a stated threshold—often set at 5 percent—the result is deemed “statistically significant.”
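For the technically inclined, here is a minimal sketch of the calculation just described, using Python and made-up numbers; the 30-person groups and the 5-point spread in outcomes are assumptions for illustration, not data from any real study.

```python
# A minimal sketch of the p-value calculation described above,
# with made-up numbers: 30 people on the pill vs. 30 on a placebo.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical wealth changes, in percent: the pill group averages about 1 point more.
placebo = rng.normal(loc=0.0, scale=5.0, size=30)
pill = rng.normal(loc=1.0, scale=5.0, size=30)

# Two-sample t-test: how likely is a gap at least this large
# if the pill truly had no effect?
t_stat, p_value = stats.ttest_ind(pill, placebo)
print(f"p-value = {p_value:.3f}")  # "significant" only if below the chosen threshold, often 0.05
```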
The problem is, people tend to place great weight on this declaration of statistical significance without understanding what it really means. A low p-value doesn’t, for example, mean that the pill almost certainly works. Any such conclusion would need more information—including, for a start, some reason to think the pill could make you richer.
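A little arithmetic, with assumed numbers, shows why. Suppose only 1 in 100 candidate pills genuinely works, a typical study detects 80 percent of the pills that do work, and 5 percent of useless pills pass the test anyway. Then most pills that clear the significance bar still don't work.

```python
# A back-of-the-envelope illustration (assumed numbers) of why a low
# p-value is not the probability that the pill works.
prior = 0.01   # assume only 1 in 100 candidate pills truly works
power = 0.80   # chance a working pill passes the test
alpha = 0.05   # chance a useless pill passes anyway (the significance threshold)

true_positives = prior * power          # working pills that pass
false_positives = (1 - prior) * alpha   # useless pills that pass by chance

prob_works_given_significant = true_positives / (true_positives + false_positives)
print(f"{prob_works_given_significant:.0%}")  # roughly 14%: most "significant" pills still don't work
```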
In addition, statistical significance isn’t policy significance. The size of the estimated effect matters. It might be so small as to lack practical value, even though it’s statistically significant. The converse is also true: An estimated effect might be so strong as to demand attention, even though it fails the p-value test.
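A rough calculation, again with assumed numbers, makes the point: give a negligible 0.02-point gain a sample of a million people per group and it sails past the 5 percent threshold.

```python
# A sketch (assumed summary numbers) of how a trivially small effect can be
# "statistically significant" once the sample is large enough.
from scipy import stats

# Treated group is 0.02 points richer on average; a 5-point spread in both groups.
result = stats.ttest_ind_from_stats(
    mean1=0.02, std1=5.0, nobs1=1_000_000,  # treated
    mean2=0.00, std2=5.0, nobs2=1_000_000,  # control
)
print(f"p-value = {result.pvalue:.4f}")  # about 0.005: "significant", yet far too small to matter
```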
These reservations apply even to statistical investigation done right. Unfortunately, it very often isn’t. Researchers commonly engage in “p-hacking,” tweaking data in ways that generate low p-values but actually undermine the test. Absurd results can be made to pass the p-value test, and important findings can fail. Despite all this, a good p-value tends to be a prerequisite for publication in scholarly journals. As a result, only a small and unrepresentative sample of research ever sees the light of day.
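A short simulation, under assumed conditions, shows how easily this happens: run the same experiment on 100 pills that do nothing, and roughly five of them will look "significant" by chance alone.

```python
# A small simulation (assumed setup) of one route to spurious significance:
# test enough useless pills and some will clear the 5% bar by luck.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
significant = 0

for _ in range(100):                       # 100 pills, none of which work
    placebo = rng.normal(0.0, 5.0, size=30)
    pill = rng.normal(0.0, 5.0, size=30)   # same distribution: no real effect
    if stats.ttest_ind(pill, placebo).pvalue < 0.05:
        significant += 1

print(significant)  # around 5 of the 100 useless pills look "significant"
```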
Why aren’t bad studies rooted out? Sometimes they are, but academic success depends on publishing novel results, so researchers have little incentive to check the work of others. Journals that publish research, and institutions that fund it, should demand more transparency. Require researchers to document their work, including any negative or “insignificant” results. Insist on replication. Supplement p-values with other measures, such as confidence intervals that indicate the size of the estimated effect. Look at the evidence as a whole, and beware of results that haven’t been repeated or that depend on a single method of measurement. And hold findings to a higher standard if they conflict with common sense.
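For illustration, here is a minimal sketch, with assumed numbers, of the kind of confidence interval that recommendation has in mind; unlike a bare p-value, it reports how large the estimated effect is and how uncertain the estimate remains.

```python
# A minimal sketch (assumed numbers) of a 95% confidence interval for the
# pill's effect: it conveys the size of the effect, not just a pass/fail verdict.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
placebo = rng.normal(0.0, 5.0, size=30)
pill = rng.normal(1.0, 5.0, size=30)

diff = pill.mean() - placebo.mean()
se = np.sqrt(pill.var(ddof=1) / 30 + placebo.var(ddof=1) / 30)
t_crit = stats.t.ppf(0.975, df=58)  # rough two-sample degrees of freedom

print(f"estimated effect: {diff:.2f} points, "
      f"95% CI: ({diff - t_crit * se:.2f}, {diff + t_crit * se:.2f})")
```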