Why registration of studies matters

by Andreas Baumann

Recently, I linked to a petition asking for signatures in support of fixed protocols in clinical studies: all studies must be registered and all results reported. The aim is to avoid the “publication bias” problem, in which people only report findings that conform to some scheme of interest.
This matters for a number of reasons. It taints this wondrous human endeavour that we call science, but there are also more pragmatic reasons, such as the fact that private pharmaceutical firms spend a lot of resources trying to replicate scientific findings that might be the result of chance. To illustrate this, I’d like to tell you about two statistical concepts: significance and power (i).

A significant finding is one that, under some criterion, is unlikely under the null hypothesis. Think about comparing two means m_1 and m_2, with the difference D = m_1 - m_2. The null hypothesis is that the true value of D is 0. The means could be the heights of girls and boys, and you wonder whether there is a systematic difference in their mean heights.
The thing is, when comparing their mean heights, you want to have some idea about whether or not the difference stems from random variation. Say, you could have sampled the tallest boys and the smallest girls by chance. The significance of your finding quantifies this under the assumptions of the null hypothesis, i.e., that there is no difference in reality. You run a test, and you find that the significance (p-value) of D is 0.04. This means that, if the null hypothesis were true, the chance of seeing a value of D at least as extreme (in absolute value) as this one would be 4%.
One thing you should know about the p-value: under the null (in a correctly designed RCT), it’s uniformly distributed on [0;1], meaning that every value in that interval is equally likely. As a consequence, even when there is no effect (no difference in mean heights, in the example above), a test at the 5% level will still turn up a significant result about 5% of the time by chance alone.
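If you'd like to see this for yourself, here is a minimal simulation sketch in Python. Everything in it is an assumption made for illustration: the function names are made up, the heights are drawn from a normal distribution with an invented mean of 150 and standard deviation of 7, and the test is a simple z-test with known standard deviation rather than whatever test a real study would use. Boys and girls are given the same true mean, so the null hypothesis holds by construction.

```python
import math
import random


def two_sided_p_value(z):
    """Two-sided p-value for a standard normal test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_null_p_values(n_studies=5000, n_per_group=30, sd=7.0, seed=1):
    """Simulate studies in which both groups have the SAME true mean height.

    The mean (150) and standard deviation (7) are illustrative, not real data.
    """
    rng = random.Random(seed)
    # Standard error of D = m_1 - m_2 when the standard deviation is known.
    se = sd * math.sqrt(2.0 / n_per_group)
    p_values = []
    for _ in range(n_studies):
        boys = [rng.gauss(150.0, sd) for _ in range(n_per_group)]
        girls = [rng.gauss(150.0, sd) for _ in range(n_per_group)]
        d = sum(boys) / n_per_group - sum(girls) / n_per_group
        p_values.append(two_sided_p_value(d / se))
    return p_values


p_values = simulate_null_p_values()

# Share of studies that come out "significant" although there is no effect.
share_significant = sum(p < 0.05 for p in p_values) / len(p_values)

# Share of p-values falling in each tenth of [0, 1].
deciles = [
    sum(i / 10 <= p < (i + 1) / 10 for p in p_values) / len(p_values)
    for i in range(10)
]

print(f"share with p < 0.05: {share_significant:.3f}")
print("share per tenth of [0, 1]:", [round(x, 3) for x in deciles])
```

If the p-values are uniform, each tenth of the interval should hold roughly 10% of them, and roughly 5% of the simulated studies should fall below 0.05 even though no difference exists.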
Power is not quite as intuitively understood, in my opinion, but it relates to the ability to detect an effect of a certain size: when you’re looking for a small effect, you need to look at more data points than if you’re looking for a large effect. Normally, in designing tests, you have a trade-off between significance and power: for a given sample size, if you want to be very sure that you’re not picking up results by chance, you lose some of your ability to detect small effects. Think about this for a moment.
The problem with clinical trials is that if you test, say, 50 candidate drugs under a 5% significance criterion, and none of them actually works, you still end up with a 92.3% chance of finding a significant effect for at least one of the drugs (0.923 = 1-0.95^50, assuming the tests are independent). This is in itself not problematic: it’s a consequence of our probability-based modeling of the real world, and things can be done about this (ii).
The problem is this: you’re the researcher in question, and you report the one drug that you found had a positive effect, without reporting the others. Why should you? They didn’t turn up significant results, did they? Then your candidate drug is picked up, and more extensive testing is performed with larger sample sizes, etc. Maybe they can’t replicate your result, because it was the result of a multiple comparison problem.
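The 92.3% figure is easy to check. The sketch below (in Python, with made-up function names) computes it directly and then confirms it by simulation. It assumes that none of the 50 drugs has any effect and that the 50 tests are independent. Since p-values are uniform on [0;1] under the null, the simulation simply draws each p-value as a uniform random number instead of simulating patients.

```python
import random


def chance_of_any_false_positive(n_tests, alpha=0.05):
    """Probability of at least one significant result among n_tests
    independent tests when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** n_tests


def simulate_screens(n_screens=20000, n_drugs=50, alpha=0.05, seed=1):
    """Share of simulated screens in which at least one drug looks significant.

    Assumption: no drug works, so each p-value is drawn uniformly from [0, 1).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_screens):
        if any(rng.random() < alpha for _ in range(n_drugs)):
            hits += 1
    return hits / n_screens


exact = chance_of_any_false_positive(50)
simulated = simulate_screens()

print(f"exact:     {exact:.3f}")
print(f"simulated: {simulated:.3f}")
```

The exact value matches the 0.923 above, and the simulated share should land close to it. A researcher who reports only the one "winning" drug from such a screen is, more often than not, reporting noise.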

Furthermore, as more research is performed in a field, you’ll see effect sizes shrink: because pilot studies tend to be smaller than subsequent studies, the ones that reach significance tend to be those whose estimates fell in the right tail of the effect distribution (iii). Having some central form of registration for studies allows us to weigh all the evidence together: studies showing a significant effect, and studies not showing any effect. This allows practitioners to choose better options for patients, and it allows pharmaceutical firms to concentrate their efforts on drugs where the data do seem promising, instead of basing their efforts on statistical artifacts. A wise man proportions his belief to the evidence, as David Hume said. Let’s act as wise men and not proportion our belief to those bits of evidence that seem convenient to us.


(i) These are very elementary introductions. If you’re interested in a more rigorous approach, consult an introduction to mathematical statistics, such as Hogg, McKean & Craig.
(ii) What to do about multiple comparison problems is a matter of discussion: a modern response is to use multilevel Bayesian modeling of the estimates.
(iii) One interesting thing is that research into the excess morbidity and mortality of smokers does not seem to exhibit this pattern of shrinking effect sizes, which might suggest that studies don’t control for enough covariates.