Myths About Randomization and Balancing in Human Subjects Research
February 2025
I'm getting a lot of work from clinicians who'd like me to analyze their human subjects data. And a recurrent feature of these analyses involves managing "imbalance" in prognostic covariates or co-morbidities between the treatment groups. So imagine my surprise when I learned from a bunch of top-level sources that we don't have to do this at all.
Statistical testing for baseline differences in randomized trials has been called meaningless [1], absurd [2], and an unhealthy practice based on a mistaken belief system [3]. The widely-cited CONSORT guidelines (Consolidated Standards of Reporting Trials) agree:
"Tests of baseline differences are not necessarily wrong, just illogical. Such hypothesis testing is superfluous and can mislead investigators and their readers." [4]
Yet these analyses are some of the most requested by my clients, occupying many billable hours. I tell them not to, but they fear the persistent demands of reviewers and editors. So what exactly are we doing, and why?
The real voyage of discovery consists not in seeking new landscapes, but in having new eyes.
–Marcel Proust, À la recherche du temps perdu
In this month's post we examine randomization and balancing from the perspective of a single article, "Seven Myths of Randomization in Clinical Trials" [5]. It's quite approachable (= basic math), covers many key points, and has a good bibliography. The frequentist author jousts with his Bayesian nemeses, which adds to the spice level. And, reminiscent of my classroom days, it offers a lesson in probability using three versions of a simple dice game. Full disclosure: I've elaborated on this game and humanized it by giving the players fictitious names. What follows should be useful for anyone grappling with these issues, from beginning students in probability to advanced practitioners in human subjects research.
Three Games of Dice
The game is played by Susan the statistician and Govind the gambler, who work as a team against a dice dealer. The game involves two dice, one Red and one Black. In each round Susan has to predict the probability, P, that the total score on the dice will be exactly 10. Using this prediction, Govind bets on the outcome. Depending on Susan's prediction, he can choose to bet on the score being 10 or on the score being 'not-10'. If he wins his bet, he and Susan share the winnings. This motivates Susan to give Govind the best estimates possible.
The dealer offers three versions of the game:

Variant 1: Before any dice-rolling, Susan gives the probability. Govind bets, and then the two dice, Red and Black, are rolled together.

Variant 2: The Red die is rolled, but it's hidden from the players. Susan gives the probability, Govind bets, and then the Black die is rolled.

Variant 3: The Red die is rolled so everyone can see it. Susan gives the probability, Govind bets, and then the Black die is rolled.
Attention, probability students!
Grab your calculators and figure out for each variant what Susan should predict. 🔢 I know you will crush it. Don't scroll down until you try.
The Curious Case of The Predictions
Variant 1: There are 6 possible outcomes for Red and 6 possible outcomes for Black, therefore (6 x 6) or 36 combinations. Of these, only 3 combinations yield a score of 10:
Red 4 + Black 6
Red 5 + Black 5
Red 6 + Black 4
So Susan knows that the probability of a score of 10 is 3/36, or P = 1/12.
Variant 2: Although Red has been rolled and its value already determined, Susan can't see the outcome. But she knows how dice work, so she has to consider each result on Red and on Black equally likely. So the probability of scoring 10 in Variant 2 is still 3/36, the same as in Variant 1. Susan's response doesn't change just because Red has already been rolled. P = 1/12.
Variant 3: In this variant, Susan has some partial information: the score on Red. Her thinking branches on that score. If Red shows 1, 2, or 3, a total of 10 is impossible and she predicts P = 0; if Red shows 4, 5, or 6, exactly one Black outcome completes the total and she predicts P = 1/6. So half of the time she will predict P = 0 and half the time she will predict P = 1/6. If we looked at Susan's predictions over many rounds, her average prediction would be (0 + 1/6)/2, or 1/12. How 'bout them apples?
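If you'd rather let a computer do the counting, all three variants can be checked by brute-force enumeration. A quick sketch in Python (the function names are mine, not the paper's):

```python
from itertools import product
from fractions import Fraction

# Enumerate all 36 equally likely (Red, Black) outcomes.
outcomes = list(product(range(1, 7), repeat=2))

# Variants 1 and 2: Susan has no usable information about Red,
# so she averages over every possible outcome.
p_unconditional = Fraction(sum(r + b == 10 for r, b in outcomes), len(outcomes))
print(p_unconditional)  # 1/12

# Variant 3: Susan conditions on the observed Red score.
def p_given_red(red):
    return Fraction(sum(red + b == 10 for b in range(1, 7)), 6)

calls = [p_given_red(red) for red in range(1, 7)]
# 0 for Red in {1, 2, 3}; 1/6 for Red in {4, 5, 6}

# Averaged over many rounds, her conditional calls recover 1/12.
print(sum(calls) / len(calls))  # 1/12
```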
How the Dice Game Models Human Subjects Research
Each dice roll represents a research subject taking part in our trial.
The total score is the subject's response – the outcome we can measure.
The score on Red represents the part of the response due to prognostic factors or "confounders" such as age, gender, current medications, kidney function or cholesterol level. Game variants treat this differently:
Variant 1: No prognostic information is available.
Variant 2: There is prognostic information "out there" but no one knows what it is.
Variant 3: There is prognostic information known at baseline.
The score on Black represents the part of the response due to assigned treatment (i.e. drug vs. placebo).
The Black die alone is what we want to measure, but the Red die is always in the game. I thought that was a very nice analogy.
Finally, notice that Susan's predictions were the same in Variants 1 and 2, and also in Variant 3 if we average over her long-term play (all the subjects in the trial). Only in one case did she see the Red score beforehand, and that did shift her thinking. But in every single case she had a probabilistic model of dice behavior (how they roll) and impact (the range of possible scores). And that's all we need for our covariates, no matter how many potential confounders there may be.
With the game and its relevance in hand, let's see how the author uses it to discuss various myths and truths about randomization. There are seven myths in the paper (all worth reading), but to stay focused I will cover only three. For ease of reference, the original numbering is maintained:
Myth 2: Balance of prognostic factors is necessary for valid inference.
This is what my clients really worry about, which is why they ask me to apply a ton of tests and adjustments to their data. But perhaps they can quit worrying. As we learned in the dice games, there are two kinds of covariates at work: those that are observed and those that are not. Like it or not, both are always in the game.
Here's a concrete example of an observed, unbalanced covariate and how to deal with it. I've paraphrased the author's example and added to it by providing some hypothetical numbers:
Imagine a long-term trial of an asthma inhaler (drug vs. placebo) in a sample where some of the asthmatics are already on oral steroids and some are not. Suppose that once the trial is underway we find that the treatment groups are well balanced on age and sex but quite unbalanced with respect to steroid use (Table 1). Panic ensues.
But this is exactly equivalent to having carried out two randomized sub-trials, one for patients on oral steroids and one for those not (Table 2). We still have perfect randomization to treatment, just an asymmetry in numbers per group. And we can combine the appropriate inferences from each sub-trial to make up the whole. When sample sizes are large (>200), there's almost no loss of precision.
So asymmetry in covariates is just asymmetry in numbers. And there's no requirement to balance numbers in any epidemiological study, as all the outcome measures (rates, means, and the like) have built-in corrections for sample sizes (N). "It thus follows that imbalance of an observed covariate is not a problem". And if the detected imbalances are in continuous variables, such as age, analysis of covariance (ANCOVA) will take care of that nicely.
FYI, this is called post-stratification and can be used if we think that any observed covariate is going to have a major effect on the outcome. But this thinking is based on clinical judgement and not on any statistical test of percentages or "imbalances" at baseline. Gosh, I think I just put myself out of a job.
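To make the two-sub-trials idea concrete, here is a minimal post-stratification sketch. The per-arm summaries are invented for illustration (they are not the values from Table 1 or Table 2, or from the paper): each stratum is analyzed as its own randomized sub-trial, and the stratum-specific effects are recombined with inverse-variance weights.

```python
import math

# Hypothetical sub-trial summaries after stratifying on steroid use:
# per arm, a (mean response, standard error) pair. All numbers invented.
strata = {
    "on steroids":  {"drug": (4.1, 0.30), "placebo": (2.9, 0.35)},
    "off steroids": {"drug": (5.0, 0.25), "placebo": (3.6, 0.28)},
}

# Treat each stratum as its own randomized sub-trial, then recombine
# the stratum-specific treatment effects with inverse-variance weights.
effects, weights = [], []
for name, arms in strata.items():
    (m_d, se_d), (m_p, se_p) = arms["drug"], arms["placebo"]
    diff = m_d - m_p                  # effect within this stratum
    var = se_d**2 + se_p**2           # variance of that difference
    effects.append(diff)
    weights.append(1 / var)
    print(f"{name}: effect {diff:.2f} (SE {math.sqrt(var):.2f})")

pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))
print(f"pooled effect {pooled:.2f} (SE {pooled_se:.2f})")
```

Note that the pooled standard error comes out smaller than either stratum's alone, which is the "almost no loss of precision" point in numbers.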
Myth 5: Randomization precludes balancing covariates
It's worth pointing out that randomization, which is considered good, can sometimes produce imbalance, which is considered bad. This causes some to jettison the benefits of randomization in favor of balancing covariates at all costs. Common approaches include stratification before randomization and also patient matching. Senn observes that
"Matched pairs designs are almost never possible unless the matching is WITHIN patients, for example, as when treating both eyes for cataracts."
It would be great if people were plant seeds and we could match and block them at will before planting them in our experimental plots. However, the recruitment of human subjects is continuous, not simultaneous, and we never know until the end (perhaps years later) what the subject population will be. And if matching on one variable is good, why not match on 2, 3, or N? An obsession with matching leads to the realization that no two humans can actually be matched; the only perfect match on all variables is the subject herself.
Realize also that whether you match or not, all will be reflected in the precision of your predictions. Here's Senn's example:
"I choose a famous trial...in which 29 patients suffering from enuresis were treated in separate periods of 14 days with a placebo...and [then] a treatment under investigation. The number of dry nights were noted. [Using a matched-pairs t-test] ...the observed mean difference (treatment-placebo) is 2.172 and the one-sided p-value...is 0.00074. [...] However, if you repeat the analysis ignoring the pairing, then you obtain very different results. The parametric t-test now gives a p-value of 0.0141 [...] In other words, ignoring the blocking factor leads to a p-value that is much less impressive – nearly 20 times as large as previously. [...] Although the point estimate of 2.172 is the same for the two analyses, the 95% confidence intervals are (0.91, 3.43) for the matched-pairs analysis and (0.24, 4.10) for the analysis ignoring matching."
The wider confidence interval, of course, indicates more uncertainty.
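The raw enuresis data aren't reproduced in the post, but the paired-vs-unpaired contrast is easy to demonstrate on simulated data with the same flavor: strong between-patient variation, a within-patient (crossover) design, and a treatment effect near Senn's 2.17. All numbers below are invented.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Crossover-style simulation: each of 29 patients has a strong personal
# baseline, plus a treatment effect (loosely echoing the enuresis trial;
# the actual trial data are not used here).
n, effect = 29, 2.17
baseline = rng.normal(8.0, 2.0, n)          # large between-patient variation
placebo = baseline + rng.normal(0, 0.8, n)
treatment = baseline + effect + rng.normal(0, 0.8, n)

paired = stats.ttest_rel(treatment, placebo)    # respects the pairing
unpaired = stats.ttest_ind(treatment, placebo)  # ignores the pairing

print(f"paired p-value   = {paired.pvalue:.2e}")
print(f"unpaired p-value = {unpaired.pvalue:.2e}")
# Ignoring the pairing discards the baseline information, so the unpaired
# p-value comes out much larger (less impressive), as in Senn's example.
```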
The author then re-does this analysis using both non-parametric (permutation test) and Bayesian approaches and reaches the same conclusion. So it doesn't really matter what kind of "statistical Susan" you are. If you can see and adjust for covariates, adjust by all means. Which leads us to...
Myth 6: Observed covariates can be ignored because one has randomized.
Referring to the dice game:
"To ignore observed prognostic covariates is to treat variant 3 of the game as though it were game 1. This is not logical. In fact, it is not logical to ignore prognostic covariates even if they are perfectly balanced."
Remember that Susan is motivated to give Govind her best prediction to maximize their winnings. And when she did have Red die information (Variant 3) she adjusted her predictions dramatically. In the same way, we should never ignore any source of information that could help us find the truth in our voyage of discovery. The difficulty, of course, is figuring out what constitutes a prognostic variable in the first place and then what to do with it. People are not dice. Typically, one decides a priori to consider a fixed number of variables based on established clinical information and weights them somehow in the analysis. But that is beyond today's discussion.
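Here is a small sketch of what "not ignoring" an observed prognostic covariate buys you, using simulated data (the variable names and numbers are mine, purely for illustration). The covariate plays the role of the Red die: even when randomization balances it on average, conditioning on it sharpens the estimate of the treatment (Black die) effect.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated trial: a prognostic covariate (say, baseline severity) drives
# much of the outcome, even though randomization balances it on average.
n = 200
treat = rng.integers(0, 2, n)        # randomized 0/1 assignment
severity = rng.normal(0, 1, n)       # the "Red die": prognostic factor
outcome = 1.0 * treat + 2.0 * severity + rng.normal(0, 1, n)

def ols_se(X, y):
    """Return (estimates, standard errors) for ordinary least squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta, np.sqrt(np.diag(cov))

ones = np.ones(n)
# Unadjusted: treats Variant 3 as though it were Variant 1.
b0, se0 = ols_se(np.column_stack([ones, treat]), outcome)
# Adjusted: conditions on the observed prognostic covariate (ANCOVA-style).
b1, se1 = ols_se(np.column_stack([ones, treat, severity]), outcome)

print(f"unadjusted effect {b0[1]:.2f} (SE {se0[1]:.2f})")
print(f"adjusted effect   {b1[1]:.2f} (SE {se1[1]:.2f})")  # smaller SE
```

Both analyses are valid; the adjusted one is simply more precise, which is exactly the Variant 3 lesson.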
Strategic Summary
I hope this blog post provides some encouragement by reminding you that frequentist statistics already accommodates within- and between-group variation, and therefore adjusts for observed and unobserved covariates by design. Once randomization is accomplished, the confidence interval is your friend:
"[Confidence intervals] are as wide and imprecise as they are because they make allowance for imbalance. An analogy can be made with engineering here. It is rather missing the point to claim that engineering calculations are useless because they depend on mathematical idealisations that cannot take all physical factors into account, whilst overlooking the fact that precisely because this is so engineers build allowances and tolerances into their calculations and specifications."
"Conventional statistical calculations have tolerance built into them. They use the fact that patients will differ not only between groups but also within groups. They use the disparity in results within groups as a means of scaling the uncertainty regarding between-group differences."
"Differences in covariates are only relevant [if] they help us predict outcomes we would have seen between groups in the absence of treatment. If we knew what these differences [in outcome] would be, then the differences in the covariates themselves would be irrelevant. Thus we do not need to concentrate on the indefinitely many varying covariates. Their relevance is bounded by outcome and, if we have randomized, the [measured outcome] variation within groups is related to the variation between [groups] in a way that can be described probabilistically by the Fisherian machinery." [6]
References and Resources
1. Harvey, LA. 2018. Statistical testing for baseline differences between randomised groups is not meaningful. Spinal Cord 56: 919.
2. Elkins, MR. 2015. Assessing baseline comparability in randomised trials. J Physiotherapy 61.4: 228.
3. De Boer, MR et al. 2015. Testing for baseline differences in randomized controlled trials: an unhealthy research behavior that is hard to eradicate. Int J Behav Nutr Phys Act 12: 1.
4. Moher, D et al. 2010. CONSORT 2010 explanation and elaboration: updated guidelines for reporting parallel group randomised trials. BMJ Clinical Research 340: 869.
5. Senn, S. 2012. Seven myths of randomisation in clinical trials. Statistics in Medicine 32.9: 1439.
6. Referring to Ronald A. Fisher, whose name is attached to the F-distribution.
If you liked this blog post, here's another: The NIH Simplified Review Framework