Why Do All the Large Alzheimer’s Drug Trials Fail?


by Jim Schnabel

July 8, 2013

Through all the twists and turns of Alzheimer’s theorizing over the past three decades, one stark fact has remained constant: Drugs that aim to stop the disease process have always failed in large-scale clinical trials.

The enormous costs of these failed trials—not counting the dashed hopes of Alzheimer’s patients and their families—would have been avoided if the same negative results had shown up in earlier, cheaper clinical trials. Quite often they did not. In fact, the data from some of these earlier trials have been very encouraging. In one recent case, involving a human antibody product called IVIg (intravenous immunoglobulin), Alzheimer’s patients treated with the optimal dose for three years stabilized cognitively, even as their placebo-receiving counterparts suffered worsening dementia. The IVIg-treated patients also showed less evidence of Alzheimer’s-related brain shrinkage. Yet in a more recent, larger trial, IVIg-treated patients on the whole fared no better than a placebo group. Similarly, a Russian antihistamine drug called dimebon seemed surprisingly effective in a 183-patient trial reported in 2008—but subsequent larger trials showed no treatment effect.

Such discrepancies have raised the question: are these Alzheimer’s drugs really ineffective, or are large clinical trials being run in ways that fail to reveal their effectiveness?

“In a way we’re struggling because we don’t have the theory and the proof of concept [of effective treatment],” says Lon Schneider, the director of the Alzheimer's Disease Research and Clinical Center at the University of Southern California, and co-author of a recent analysis of Alzheimer’s trials.

Smaller is chancier

On the one hand, larger clinical trials are done precisely because smaller trials are considered less reliable. By definition, smaller trials—having fewer subjects—give random chance greater sway over the results. “It’s very easy to flip a coin ten times and get seven heads,” says Schneider. Thus, when the recruited patient population is small, the group selected for treatment may already have—by the luck of the draw—a very different underlying rate of disease progression, compared with the placebo group. Such a mis-shuffle can make a worthless drug seem effective, or an effective drug seem worthless—even toxic. But only the positive results are likely to be published and rewarded with further funding.
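To see how much sway chance has in a small trial, consider a quick simulation. The setup below is purely illustrative (an inert drug, an average 18-month worsening of about five ADAS-cog points with wide patient-to-patient variation) and is not drawn from any real study; the point is only how often a small trial makes nothing look like something.

```python
# A minimal simulation (illustrative numbers only, not from any real trial)
# of how chance imbalance in small trials can mimic a treatment effect.
# Every patient, "treated" or not, draws an 18-month ADAS-cog decline from
# the same distribution, so the drug is inert by construction.
import numpy as np

rng = np.random.default_rng(0)

def spurious_effect_rate(n_per_arm, n_trials=10_000, apparent_benefit=2.0):
    # Assume a mean 18-month worsening of ~5 ADAS-cog points, SD ~6.
    placebo = rng.normal(5.0, 6.0, size=(n_trials, n_per_arm)).mean(axis=1)
    treated = rng.normal(5.0, 6.0, size=(n_trials, n_per_arm)).mean(axis=1)
    # Fraction of null trials in which the inert drug appears to slow
    # decline by at least `apparent_benefit` points.
    return np.mean((placebo - treated) >= apparent_benefit)

for n in (30, 90, 400):
    print(f"{n:>4} patients per arm: "
          f"{spurious_effect_rate(n):.1%} of null trials look 'positive'")
```

In this toy setup, roughly one null trial in ten with 30 patients per arm shows the treated group "beating" placebo by two or more points, while with 400 patients per arm that essentially never happens.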

Smaller, cheaper, locally managed trials may also have relatively loose controls, compared with larger, more professionally managed trials. If, for example, patients are not properly “blinded” to their status as treatment or placebo recipients, then the treated ones could get a very large placebo-effect boost, while the placebo group gets the opposite—an adverse “nocebo” effect from the expectation that their disease will worsen unabated. This is suspected to have happened in the 183-patient dimebon trial, which was conducted in Russia. “They used stock dimebon tablets, which were uncoated so they tasted bitter,” says Schneider. The placebo tablets, which were of a slightly different color, did not taste bitter and thus were easily distinguishable from dimebon. In the larger trials, dimebon was coated to make it less distinguishable from placebo, and in that new formulation showed no sign of effectiveness.

The ever-varying yardstick

Patients want a drug that slows or stops their disease. But medical science sets the bar higher, insisting that any drug should slow or stop disease more than a placebo does. Thus, even a drug that seems to do wonders for patients—stopping their disease in its tracks—will still be deemed a failure if the placebo-receiving patients do about as well.

A key yardstick in these trials, then, is the average rate at which the placebo group declines. Ideally, this yardstick will remain more or less the same from one trial to the next. “If you have a small variance [in this rate], that gives you an instrument that’s more precise and can identify an effective drug better than an instrument with a wider variance,” says Schneider.

However, as he and Mary Sano of the Mount Sinai School of Medicine found in a 2009 analysis of 18-month Alzheimer’s drug trials, this rate of descent into dementia—as measured by a standard test known as the ADAS-cog (the cognitive subscale of the Alzheimer’s Disease Assessment Scale)—varies considerably from one trial’s placebo group to another, even for large, phase 3 trials. When the placebo group’s decline rate is at the rapid end of this range, as it was for the early IVIg study, it makes the treatment seem more effective than otherwise.

Perhaps more importantly, when the placebo group’s decline rate is at the slow end of this range, it leaves little or no room for a treatment effect to show up. As Schneider and Sano put it in their paper, this inconstant and sometimes low decline rate of the placebo group “might make it more unlikely than previously assumed that a modestly effective drug can be reliably recognized, especially when the drug might work only to attenuate decline in function and not to improve function.” [italics added]
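A rough calculation makes the arithmetic of that problem concrete. Assume, purely for illustration and not from the Schneider and Sano data, that a drug attenuates ADAS-cog worsening by 30 percent and that a trial enrolls 300 patients per arm; the chance of detecting the drug then depends almost entirely on how much the placebo group actually declines.

```python
# An illustrative power calculation (assumed numbers, not taken from the
# Schneider-Sano analysis) for a drug that attenuates, rather than reverses,
# decline. The detectable signal is a fixed fraction of however much the
# placebo group worsens, so a slow-declining placebo group starves the
# trial of signal.
import math
from scipy.stats import norm

def approx_power(placebo_decline, attenuation=0.30, sd=6.0,
                 n_per_arm=300, alpha=0.05):
    delta = attenuation * placebo_decline      # group difference the drug can create
    se = sd * math.sqrt(2.0 / n_per_arm)       # standard error of that difference
    z_alpha = norm.ppf(1 - alpha / 2)
    return 1 - norm.cdf(z_alpha - delta / se)  # normal-approximation power

for decline in (2.0, 4.0, 6.0, 8.0):           # placebo ADAS-cog worsening at 18 months
    print(f"placebo worsens by {decline:.0f} points -> "
          f"power ~ {approx_power(decline):.0%}")
```

Under these assumptions, a placebo group that worsens by eight points gives the trial near-certain power to see the drug, while one that worsens by only two points leaves the same drug detectable less than a quarter of the time.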

This is not merely hypothetical. A relatively slow decline of the placebo group, which would have at least partly obscured any treatment effect, was seen in the first phase 3 studies of dimebon. A clinical-trial consulting firm has also written publicly about the same phenomenon in an unnamed Alzheimer’s drug trial:

[A] customer came to us with concerns that an ongoing Alzheimer’s disease trial was demonstrating drift in the ADAS-cog and MMSE scores. Patients who should have been getting worse were getting better or showing minimal decline, indicating that perhaps their disease was too mild for the trial’s enrollment criteria. [italics added]

Works for some patients but not others?

All else being equal, a bigger trial should at least be better at minimizing random differences between treatment and placebo groups—differences not only in underlying rates of disease progression but also in factors such as compliance with drug-taking. However, a bigger, more diverse patient population is not always better. Indeed, in such a population it could be harder to find a sign of efficacy for a drug that works only (or mainly) against a certain form or stage of disease. For this reason scientists often go through the dataset of a large, seemingly “failed” trial in search of patient subgroups that appeared to have responded to a drug. This subgroup analysis is sometimes done in an exploratory, after-the-fact manner, and as such has been derided as “shooting an arrow at a wall and then drawing a target around it.” “I’m afraid that academics engage in this to possibly an even greater degree than the companies do; they’re unable to say ‘our drug didn’t work,’” says Schneider.
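The statistical hazard behind that jibe is easy to demonstrate. In the sketch below, a hypothetical setup not modeled on any particular trial, a completely inert drug is tested after the fact in eight subgroups of a large trial.

```python
# A small simulation (hypothetical setup, not modeled on any actual trial)
# of after-the-fact subgroup hunting. The drug is inert: both arms draw
# declines from the same distribution. We then carve each arm into eight
# subgroups (as if by biomarker or severity strata) and look for one in
# which the drug "worked."
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n_trials, n_per_arm, n_subgroups = 2_000, 400, 8

false_hits = 0
for _ in range(n_trials):
    placebo = rng.normal(5.0, 6.0, n_per_arm)
    treated = rng.normal(5.0, 6.0, n_per_arm)
    for t, c in zip(np.array_split(treated, n_subgroups),
                    np.array_split(placebo, n_subgroups)):
        # "Responding" subgroup: nominally significant and in the right direction.
        if ttest_ind(t, c).pvalue < 0.05 and t.mean() < c.mean():
            false_hits += 1
            break

print(f"An apparently responding subgroup appears in {false_hits / n_trials:.0%} "
      f"of trials of a completely inert drug")
```

Under these assumptions, roughly one null trial in five throws up at least one subgroup in which the do-nothing drug looks nominally significant and better than placebo.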

However, subgroup comparisons can in principle point to real treatment effects, which may then be confirmed by further studies of those subgroups. The recently completed phase 3 study of IVIg, for example, was designed to detect treatment effects in any of several patient subgroups. “The two subgroups that were reported in the [sponsoring company] press release to be encouraging are apo-E4 carriers, and the moderately [vs. mildly] impaired group,” says Norman Relkin, the neurologist at Weill Cornell Medical College who designed the study. (Apo-E4 is a gene variant associated with earlier onset and higher risk of Alzheimer’s.) Relkin’s team has also been looking at biomarker data to determine how IVIg—a mix of pooled human antibodies, including anti-amyloid-beta-oligomer and anti-inflammatory antibodies—might have slowed the disease process in certain types of patients. “Yes, this was a negative study, but not a failed study; I think it accomplished a lot,” says Relkin, who intends to keep studying IVIg and its potential mechanisms of action.

Incremental changes

In response to the continuing negative outcomes of Alzheimer’s clinical trials, researchers have been designing some new trials in which patients are treated earlier in the disease course—when they may respond better—and for periods longer than 18 months, to allow more divergence between treatment and placebo groups. But this “incremental” change in trial designs, as Schneider puts it, still fails to take into account that different drugs have different possible mechanisms of action, different sources of outcome variability, and different possible windows of optimal effectiveness in the disease course. “In principle some drugs could show effects at six months and twelve months while other drugs might not show an effect for a much longer period,” he says.

On the whole, though, Schneider thinks that all the sources of statistical noise are likely to obscure only the weaker treatment effects. “If we had markedly effective drugs, we would see through some of these limitations,” he says.