Business a.m.

Is There a Replication Crisis in Research?


Wharton marketing professor Gideon Nave has collaborated with a multinational team of researchers on a project that aimed to replicate the results of 21 social science experiments published in the journals Nature and Science. According to the team’s research, only 13 of the 21 replications produced results that supported the original studies; the remaining eight, a surprising 38%, failed to reproduce the original results. The paper, titled “Evaluating the Replicability of Social Science Experiments in Nature and Science between 2010 and 2015,” was published in Nature Human Behaviour. Nave joined Knowledge@Wharton to talk about the paper and what it means for the future of research.

Knowledge@Wharton:

We rely on research journals to vet what they publish, but studies like yours have shown that the results of high-profile experiments often can’t be replicated. Do you think there’s a “replication crisis,” as people are calling this?

Gideon Nave: I don’t know if I want to use the word crisis to describe it, but we certainly know that many results that are published in top academic journals, including classic results that are part of textbooks and TED Talks, do not replicate well. That means that if you repeat the experiment with exactly the same materials in a different population, sometimes in very similar populations, the results do not seem to hold.

Top academic journals like Science and Nature, which are the ones that we used in this study, have acceptance rates of something like 5% of papers [that are submitted], so it’s not like they don’t have papers to select from. In my view, the replication rates that we have seen in these studies are lower than what you would expect.

Knowledge@Wharton:

Can you describe some of the experiments you replicated?

Nave: The experiments that we used were social science experiments involving human participants, either online or in laboratory studies. The experiments we selected also typically had some manipulation, meaning there is an experimental setting where half of the population gets some treatment, and the other half gets another.

For example, we had a study in which people viewed a picture of a statue. In one condition, it was Rodin’s “The Thinker,” and in the other one, it was a man throwing a discus. The assumption of the researchers was that when you show people Rodin’s Thinker, it makes them more analytical, so this was the manipulation. And then they measured people’s religious beliefs. The finding that the paper reported was that when you look at the picture of Rodin and become more analytical, you are less likely to report that you believe in God.

Knowledge@Wharton:

What did you find when you tried to replicate that one?

Nave: This study specifically did not replicate. I think [the problem is] the manipulation itself. I’m not sure that looking at Rodin’s statue makes you more analytical in the first place.

Knowledge@Wharton:

You looked at a total of 21 experiments. What were some of the key takeaways from the entire project?

Nave: There is an ongoing debate in the social sciences as to whether there is a [replication] problem or not. The results of previous studies that failed to replicate a large number of papers published in top journals in psychology and economics were dismissed by some of the researchers. Some said that this was just some kind of statistical fluke … or maybe that the replications were not sufficiently similar to the original [experiments]. We wanted to overcome some of these limitations.

In order to do so, we first sent all of the materials to the original authors and got their endorsement of the experiment. In case we got something wrong, we also got comments from them. It was a joint collaboration with the original authors in order to make the replication as close as possible to the original.

The second thing we did was pre-register the analysis, so everything was open online. People could go and read what we were doing. Everything was very clear a priori — before we ran the studies — [in terms of] what analyses we would use.

The third thing was using much larger samples than the original studies. Sample size is a very important factor in the experiment. If you have a large sample, you are more likely to be able to detect effects that are smaller…. The larger your sample is, the better the estimate you have of the effect size, and the better your capacity to detect smaller effects. One finding from the previous research that has been done on replicability is that even if studies do replicate, the effect in the replication seems to be smaller than in the original. We wanted to be ready for that. In order to do so, we had samples that were sufficiently large to detect effects that are even half of the original finding.
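As a rough illustration of that sample-size logic, the sketch below uses a standard power calculation to compare the participants needed to detect an original effect versus half of it. It assumes a simple two-group design and a hypothetical Cohen’s d of 0.5; the project’s actual designs and effect sizes varied.

```python
# Sketch: power-based sample sizing for a replication. Assumes a simple
# two-group design and Cohen's d effect sizes; the actual study designs varied.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
original_d = 0.5       # hypothetical effect size reported by an original study
alpha = 0.05
target_power = 0.90    # power the replication aims for

for fraction in (1.0, 0.5):  # the original effect, and half of it
    n_per_group = analysis.solve_power(
        effect_size=original_d * fraction,
        alpha=alpha,
        power=target_power,
        alternative="two-sided",
    )
    print(f"To detect {fraction:.0%} of d={original_d}: "
          f"~{n_per_group:.0f} participants per group")
```

Halving the target effect roughly quadruples the required sample, which is why replications aiming at smaller effects need to be much larger than the originals.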

Knowledge@Wharton:

Even in the studies that did replicate, the effect size was much smaller, correct?

Nave: Yes. We’ve seen it in previous studies. Again, in this study, because the samples were so large, the studies that failed to replicate had essentially a zero effect. But then we could tease apart the studies that didn’t replicate from the ones that did replicate. Even the studies that did replicate well had on average an effect that was only 75% of the original, which means that the original studies probably overstated the size of the effect by 33%.

This is something that one would expect to see if there is a publication bias in the literature. If results that are positive are being published, and results that are negative are not being published, you expect to see an inflation of the effect size. Indeed, this is what we saw in the studies. This means that if you want to replicate a study in the future, you probably want to use a larger number of participants than the original had, so you can be sure that you will detect an effect that is smaller than what was reported originally.
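A small simulation can make that inflation argument concrete. The sketch below repeatedly runs an underpowered two-group experiment with a modest true effect, keeps only the statistically significant “positive” results, and compares the average published effect with the true one. All numbers are illustrative, not the paper’s data.

```python
# Sketch: how publishing only statistically significant "positive" results
# inflates the apparent effect size. Numbers are illustrative, not the paper's data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_d, n_per_group, runs = 0.3, 30, 20000   # small true effect, underpowered samples

published = []
for _ in range(runs):
    control = rng.normal(0.0, 1.0, n_per_group)
    treated = rng.normal(true_d, 1.0, n_per_group)
    t, p = stats.ttest_ind(treated, control)
    if p < 0.05 and t > 0:                   # only "positive" findings get published
        pooled_sd = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2)
        published.append((treated.mean() - control.mean()) / pooled_sd)

print(f"true effect:           d = {true_d}")
print(f"mean published effect: d = {np.mean(published):.2f}")   # noticeably larger
```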

Knowledge@Wharton:

This was a collaboration of all of these researchers. What was their reaction to the results?

Nave: The reactions were pretty good, overall. When this crisis debate started and there were many failures to replicate the original findings, replication was not a normal thing to do. It was perceived by the authors of the original studies as something that is very hostile. I have to say, it doesn’t feel nice when your own study doesn’t replicate. But now after a few years, it’s becoming more and more normal. I think there is more acceptance that it is OK if your study doesn’t replicate. It does not mean that you did something bad on purpose. It can happen, and the researchers were quite open to this possibility.

If you look at the media coverage of our studies, one of the authors — the author of the Rodin analytical-thinking and religious-belief study — said that the [original] study was silly in the first place. [We have] commentaries from the eight authors of the papers that did not replicate well. In some cases they find reasons why their experiment would not replicate — for example, the population is different. Many times, the subject pool has changed. If you are studying things like the influence of technology on behavior, then over the few years that went by between the original study and the replications, maybe there could have been changes in our reactions to technology and how technology influences us. This could, for example, be a reason why a study fails to replicate. But overall, this is a very constructive process, and we’ve seen positive responses overall, even among those whose findings could not be replicated by us.

Knowledge@Wharton:

Does this sort of failure to replicate occur more often in social science experiments versus medical ones, for example? If so, what could be done differently?

Nave: I don’t think that it’s more likely to occur in the social sciences. I think one important thing that was driving this replicability movement in the social sciences is that there were better settings to test human subjects, either online or using laboratories that were professionally designed to run a large number of participants. We have to recognize that this is not the case in many other branches of science. For example, an MRI scan is something that takes two hours to do and costs $400. You would not expect a replicability researcher to run 500 participants in the MRI because it would take forever and cost a lot of money. We would have to accept the limitation of a small sample, and a limited capacity to replicate exists when we have boundaries on the number of participants that we can run.

Knowledge@Wharton:

What kinds of implications does your study have?

Nave: One important feature of our study is that, before we ran the experiments, we recruited more than 200 scientists and had them predict what would happen in the replication. We asked them what they thought the probability was that the study would replicate. We also had them participate in a prediction market. There were 21 prediction markets for the 21 studies. In these markets, our participants started with some amount of money, and they could buy and sell stocks for the different experiments. At the end of the study, every stock would give them 100 cents if the study replicated well, and zero cents if it didn’t replicate. All of the stock prices started at 50 cents, and the prices changed as a function of people’s beliefs about whether the experiments would replicate or not.

At the end of the study, we looked at the final stock prices. A high stock price implied that the market thought the experiment was more likely to replicate, and these prices very closely matched the results of the experiments. In fact, none of the studies whose closing price was lower than 50 cents replicated. Only three studies with closing prices higher than 50 cents failed to replicate, which shows that people did know which studies would replicate before we even ran the studies.
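To make the reading of those prices concrete, the snippet below treats a closing price (in dollars, out of a $1 payout) as the market’s implied probability of replication and applies the simple “above 50 cents” decision rule described above. The prices and outcomes are invented for illustration, not the study’s actual data.

```python
# Sketch: interpreting prediction-market closing prices as implied replication
# probabilities. Prices and outcomes below are hypothetical, not the study's data.
closing_prices = [0.82, 0.35, 0.61, 0.22, 0.57]     # dollars; $1.00 paid if replicated
replicated     = [True, False, True, False, False]  # hypothetical outcomes

for price, outcome in zip(closing_prices, replicated):
    predicted = price > 0.50        # simple decision rule from the interview
    print(f"closing price ${price:.2f} -> implied P(replicate) ~ {price:.0%}, "
          f"market says {'replicates' if predicted else 'fails'}, "
          f"actual: {'replicated' if outcome else 'failed'}")
```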

That’s great news. It tells us that our scientists have the capacity to tell apart the experiments that replicate and those that do not. I think one of the takeaways from this, which also relates to our future work, is the need to find out which properties of these studies predict whether they will replicate or not. It’s very clear that the sample size and the P value, which represents the strength of the statistical evidence in the data, are very important. It seems that studies that had small samples and a high P value were less likely to replicate. The strength of the theory also seems to be a predictor.

From a practical angle, I think we should expect effects in replication studies to be smaller than the original ones. If one wants to replicate an original study, I would definitely recommend not calculating the sample size based on the effect of the original study, but having a sufficient number of participants to detect 75% of the original effect. I think this is also a lesson we can generalize to the previous replication projects. I participated in one project that aimed to replicate studies in economics. And there were [similar] studies in psychology. These studies had large samples, but the samples were calibrated to detect the original effects. Because of that, it’s very possible that they failed to replicate findings because they could not detect effects smaller than the originals, which, as we now know, one should expect.
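As a rough rule of thumb behind that recommendation: for a two-group comparison, the required sample size scales with the inverse square of the smallest effect you want to detect, so targeting 75% of the original effect needs roughly (1/0.75)² ≈ 1.8 times the original sample. The short calculation below spells that out with a hypothetical original sample size; it is an approximation, not the project’s exact planning procedure.

```python
# Rough rule of thumb: required n scales ~ 1 / effect_size**2, so powering a
# replication for a fraction f of the original effect multiplies n by 1 / f**2.
# The original_n value is hypothetical; this is not the project's exact procedure.
original_n = 100   # per-group sample size of a hypothetical original study
for f in (1.0, 0.75, 0.5):
    multiplier = 1.0 / f**2
    print(f"target {f:.0%} of the original effect -> "
          f"~{multiplier:.2f}x the sample, about {original_n * multiplier:.0f} per group")
```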

Knowledge@Wharton:

Could this research change how large, high-profile journals like Nature and Science accept papers?

Nave: I think it will, and I think it already has. If we look at the studies that we replicated, all of the experiments that failed to replicate took place between the years 2010 and 2013. From the last two years of the studies that we selected, everything replicated. These are only four studies, so I’m not going to make bold claims, like “everything now replicates.” But it’s very clear that there were changes in journal policies. This is especially true for psychology journals, where one now has to share the data and the analysis scripts. You get a special badge recognizing when you pre-register the study. Pre-registration is a very important thing. It’s committing to an analysis plan — the number of participants that you will run and how you will analyze the data — before you do [the experiment]. When you do that, you limit the amount of bias that your own decisions, when analyzing the data, can induce. There were previous studies conducted, mostly here at Wharton, showing that when you have some flexibility in the analysis, you are very likely to find results that are statistically significant but do not reflect a real effect. By pre-registering, researchers tie their own hands before collecting the data, and that allows them to generate results that are more robust and replicable.

Knowledge@Wharton:

What’s the next step in your own research?

Nave: With regard to replicability, we are now looking at what made people predict so well whether those studies would replicate or not. One of the things that we’ve done here was also to try to use machine learning to go over the papers — using features such as the P value, the sample size, the text of the papers and some [other] information — and see whether an algorithm can predict as well as the humans whether studies will replicate or not. For this specific [experiment], the algorithm can detect replicability in something like 80% of the cases, which is not bad at all. So, we are working on automating this process.
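A minimal sketch of that kind of pipeline, using only two toy features (the reported P value and sample size) and synthetic labels rather than the team’s actual data or feature set, might look like the following; the toy numbers will not reproduce the roughly 80% accuracy mentioned above.

```python
# Sketch: predicting replicability from simple paper features.
# Features and labels are synthetic placeholders, not the project's dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n_papers = 200
p_values = rng.uniform(0.001, 0.05, n_papers)      # reported P values
sample_sizes = rng.integers(20, 500, n_papers)     # original sample sizes

# Toy generative rule: lower P values and larger samples make replication more likely.
logit = -5.0 - 1.5 * np.log10(p_values) + 1.0 * np.log10(sample_sizes)
replicated = (rng.random(n_papers) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

X = np.column_stack([np.log10(p_values), np.log10(sample_sizes)])
model = LogisticRegression()
scores = cross_val_score(model, X, replicated, cv=5)
print(f"cross-validated accuracy: {scores.mean():.2f}")
```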

Another thing is just to continue replicating. Replicability should be an integral part of the scientific process. We have neglected it, maybe for some time. Maybe it was because people were perceived as belligerent or aggressive if they tried to challenge other people’s views. But when you think of it, this is the way science has progressed for many years. If a study doesn’t replicate, you’d better know it before building on it and standing on the shoulders of the researchers who conducted it.

The Rodin study that I just described had about 400 citations in as little as four years, and there were studies [that failed to replicate] that had even more citations. This is Science and Nature. These papers have a high impact on many disciplines, and they are accepted based on their potential impact. So, these early results are important. I think that we should keep replicating findings, and researchers and journal editors should be aware that if results are not replicable, it could lead to a waste of people’s time, of people’s careers and of public money that is used to generate additional studies based on the original result.
