Trial and error: the perils of the p value

Too many clinical trials produce results that are statistically significant but clinically meaningless, according to two US cardiologists.

Double-blind placebo-controlled trials may be the gold standard, but all that is gold does not glister. Too often trials produce results that enable drugs to be licensed or treatments to come into use without clearly demonstrating their clinical superiority over existing practice.
 
What is needed, say Drs Sanjay Kaul and George Diamond from Cedars-Sinai Medical Center in Los Angeles, is a measure of clinical significance to complement that of statistical significance. Sometimes such a measure is implicit in a trial that is sized, for example, to detect a 15 per cent improvement in outcomes: the presumption then is that a 15 per cent improvement is clinically significant.
 
But trial investigators are often reluctant to specify what a clinically important outcome would be, they say in the Journal of the American College of Cardiology. The result is “an erroneous tendency to equate statistical significance with clinical significance”. The two are not the same: a statistically significant finding in a big trial may reflect a difference too small to matter clinically, while a statistically insignificant difference in a small trial does not rule out a clinically important effect.
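A rough illustration of the distinction (the trial sizes, event rates and the simple pooled z test below are invented for the example, not taken from the paper): in a very large trial a clinically trivial difference can comfortably cross the p < 0.05 threshold, while a small trial can miss a large, clinically important one.

```python
from math import erf, sqrt

def two_proportion_p_value(events_a, n_a, events_b, n_b):
    """Two-sided p value from a simple pooled two-proportion z test."""
    p_a, p_b = events_a / n_a, events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = abs(p_a - p_b) / se
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))  # 2 * (1 - Phi(|z|))

# Huge trial, tiny difference (10.0% vs 9.7% event rates): p is about 0.001,
# yet the difference may be of little clinical consequence.
print(two_proportion_p_value(20_000, 200_000, 19_400, 200_000))

# Small trial, large difference (30% vs 20%): clinically important if real,
# yet p is about 0.25, so the trial is "negative".
print(two_proportion_p_value(15, 50, 10, 50))
```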
 
As Dr Kaul put it in an interview with Heartwire, “The p value has imposed its will on the scientific community. Many investigators interpret a very small p value as very strong evidence. While it does suggest an effect is real, a p value doesn’t give any ‘oomph’ value. We need a measure that will give us ‘oomph value’”.
 
This is not the only problem. Many trials use composite endpoints that lump several outcome events together, which allows the trials to be smaller. But the results are then often driven by the commonest, yet least important, of those events.
 
Take a trial called Stent-PAMI, designed to show whether putting a tiny wire cage (a stent) into a blocked artery after a heart attack was better than angioplasty (clearing the artery with a balloon-like device). The endpoint of this trial combined death, another heart attack, disabling stroke, and reblocking (restenosis) of the artery within a year requiring a second procedure to clear it.
 
The trial showed stenting was better: the combined endpoint occurred in 12.6 per cent of patients versus 20.1 per cent. But this result was driven by the least important of the four possible outcomes, restenosis. There were insignificant reductions in the risk of having another heart attack or suffering a stroke, and deaths were actually increased, though again not significantly. So the positive result rested entirely on the outcome that mattered least.
 
A review of 27 trials into heart treatments using composite endpoints found only seven were driven by “hard” outcomes such as death. The rest were made positive by outcomes such as recurrent angina, or the need to be re-admitted to hospital, both much less important to the patient than dying. This could be corrected by giving the different outcomes in the combined endpoint different weights, but that choice can be highly subjective, leading to what the authors call “intellectual gerrymandering”.
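How easily the choice of weights can tilt a composite is shown by a small invented example (all event rates and weightings below are hypothetical, not drawn from any real trial): with equal weights the commonest event dominates and the new treatment looks clearly better; weight the outcomes by how much they matter to patients and the advantage all but vanishes.

```python
# Hypothetical two-arm trial with a three-part composite endpoint.
# Event rates (proportion of patients affected) are invented for illustration.
rates = {
    "death":             {"treatment": 0.030, "control": 0.025},  # slightly worse
    "heart attack":      {"treatment": 0.050, "control": 0.055},  # slightly better
    "revascularisation": {"treatment": 0.070, "control": 0.140},  # much better
}

def weighted_composite(weights):
    """Weighted sum of event rates for each trial arm (lower is better)."""
    totals = {"treatment": 0.0, "control": 0.0}
    for outcome, arms in rates.items():
        for arm, rate in arms.items():
            totals[arm] += weights[outcome] * rate
    return totals

# Equal weights: the commonest (least serious) event dominates and the
# treatment looks clearly better (0.15 vs 0.22).
print(weighted_composite({"death": 1, "heart attack": 1, "revascularisation": 1}))

# Weight death and heart attack more heavily, as a patient might: the advantage
# vanishes and the treatment now looks marginally worse (0.92 vs 0.915).
print(weighted_composite({"death": 20, "heart attack": 5, "revascularisation": 1}))
```

Two equally defensible sets of weights point in opposite directions, which is exactly the room for “intellectual gerrymandering” the authors warn about.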
 
A final problem is the common use of subgroup analysis to try to find, in a negative trial, a subgroup of patients who benefited. The more such analyses are done, the more likely it is that at least one will prove statistically significant by chance alone: if 20 subgroup analyses are each tested at the conventional 5 per cent level, the probability of at least one false positive is about 0.64. And because these analyses are underpowered, false negatives may also be common.
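That figure follows from the standard multiple-comparisons calculation: if each of k independent comparisons is tested at the 5 per cent level, the chance that at least one comes up positive by luck alone is 1 − 0.95^k. A minimal sketch (treating the tests as independent, which real subgroup analyses rarely are):

```python
# Probability of at least one false-positive finding among k independent
# comparisons, each tested at significance level alpha.
def false_positive_risk(k, alpha=0.05):
    return 1 - (1 - alpha) ** k

for k in (1, 5, 10, 20, 50):
    print(f"{k:2d} subgroup analyses -> risk {false_positive_risk(k):.2f}")
# 20 analyses give about 0.64, the figure quoted above.
```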
 
The problem is even greater when the subgroups are defined only after the trial has failed to find a significant benefit for the whole group. Authors often fail to disclose this: a recent analysis of 97 trials published in the New England Journal of Medicine in 2005 and 2006 found that two-thirds did not say whether their subgroups were pre-specified or post hoc.
 
The main value of subgroup analysis, the authors argue, is to test the robustness of the overall finding by showing that it applies consistently across the subgroups, not to hunt for statistical significance in a failed trial.