10  Biomarkers, Ageing Clocks and the Design of Longevity Trials

A therapy for ageing faces a peculiar problem: ageing takes a lifetime, but trials cannot. This chapter is about the measurements and study designs that make the untestable testable.

The previous part of the book set out a pharmacy of candidate rejuvenations — restriction and its mimetics, senolytics, reprogramming, regenerative cells. Each rests on a mechanism, and several work strikingly well in mice. But a mechanism is not a medicine, and a mouse is not a person. The question that now governs everything is the unglamorous one: how would we ever know that any of these slows human ageing? The answer turns out to depend less on new biology than on two instruments of a different kind — a ruler that can measure ageing faster than ageing happens, and a study design honest enough to trust it.

This chapter moves the argument from the laboratory to the clinic. Part III asked what can be done to a cell; Part IV asks how a claim about people can be made to stand up. The difficulty is structural rather than technical. A drug for hypertension can be judged in weeks against blood pressure, and against strokes within a few years; a drug for ageing is judged against a process that unfolds over eight decades and culminates in events — disease, disability, death — that no one will wait to count. The field has responded by trying to compress that timescale into something measurable, and the credibility of the entire enterprise rests on whether the compression is legitimate.

10.1 What counts as a biomarker of ageing

A biomarker of ageing is a measurable quantity that reports an individual’s biological age — the functional state of the body — rather than the years since birth. The appeal is obvious: if biological age can be read in a blood sample, then an intervention can be tested by whether it lowers that reading, long before any clinical event could be counted. The danger is equally obvious, and most of this chapter is about it: a number that merely tracks age is not the same as a number that governs it, and confusing the two is the field’s characteristic error.

The instruments fall into recognisable generations. The first epigenetic clocks were trained to predict chronological age from DNA-methylation patterns, and did so remarkably well (Hannum et al., 2013; Horvath, 2013); but a perfect predictor of chronological age is, by construction, blind to the deviations from it that matter most. The second generation was therefore trained not on age but on outcomes — PhenoAge against a composite of clinical mortality risk (Levine et al., 2018), GrimAge against time to death itself (Lu et al., 2019) — and predicts lifespan and healthspan far better for it. A third design abandons the snapshot for the slope: DunedinPACE estimates the rate at which a person is currently ageing, the quantity an intervention would most directly change (Belsky et al., 2022). Around these sit a widening family of measures: the pan-mammalian clocks that read a conserved ageing signature across species (Lu et al., 2023), transcriptomic clocks built from the universal gene-expression changes of mammalian ageing (Tyshkovskiy et al., 2026), imaging-based readouts of nuclear organisation (Alvarez-Kuglen et al., 2024), and clocks built deliberately from the accumulation of stochastic molecular noise rather than any programmed signal (Meyer & Schumacher, 2024). The mechanics of these instruments belong to Chapter 3; what matters here is their use as endpoints.
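The blind spot of a clock trained only on chronological age can be made concrete with a toy simulation. This is a sketch under invented assumptions, not a model of any published clock: the hidden `deviation`, the noise levels and the two `clock_*` variables are all hypothetical. It compares the "age acceleration" residual from a clock that tracks the calendar with one that tracks biological state.

```r
set.seed(1)
n <- 2000

# Hypothetical cohort: biological age = chronological age + a hidden deviation
chron_age <- runif(n, 40, 80)
deviation <- rnorm(n, mean = 0, sd = 5)
bio_age   <- chron_age + deviation

# Two hypothetical clocks (assumed noise levels, for illustration only)
clock_first  <- chron_age + rnorm(n, sd = 1)  # trained to match the calendar
clock_second <- bio_age   + rnorm(n, sd = 3)  # trained on outcome-linked state

# "Age acceleration" is the part of the clock not explained by chronological age
accel_first  <- resid(lm(clock_first  ~ chron_age))
accel_second <- resid(lm(clock_second ~ chron_age))

# How much of the hidden deviation does each residual recover?
cor_first  <- cor(accel_first,  deviation)
cor_second <- cor(accel_second, deviation)
print(round(c(first = cor_first, second = cor_second), 2))
```

Under these assumptions the first residual should be close to pure noise, while the second recovers much of the hidden deviation. Real first-generation clocks are imperfect age predictors and so retain some biological signal; the sketch shows only the limiting case the text describes.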

Methylation is not the only currency. For clinical trials, where blood is drawn and assays must be cheap and reproducible, an expert consensus distilled an initial field of 258 candidate markers to a short, pragmatic panel — interleukin-6, the tumour-necrosis-factor receptors, C-reactive protein, growth/differentiation factor 15, insulin, IGF-1, cystatin C, NT-proBNP and glycated haemoglobin — chosen because each is reliable to measure, relevant to ageing biology, and predictive of mortality and functional decline (Justice et al., 2018). Alongside these molecular readouts run the functional ones that geriatric medicine has used for decades — gait speed, grip strength, the cumulative frailty index — which forgo mechanism but capture the thing patients actually care about.

Key concept: the four tests of a biomarker of ageing

The proliferation of clocks forced the field to agree on what makes any of them trustworthy. A usable biomarker of ageing must pass four tests, set out in the consortium framework that now anchors the discussion (Justice et al., 2018; Moqri et al., 2023). It must be reliable and feasible — measurable accurately, cheaply and repeatably. It must be relevant to ageing — grounded in the underlying biology rather than merely fashionable. It must predict outcomes — robustly forecasting mortality and age-related functional decline. And, hardest of all, it must be responsive to intervention — it must move when ageing is genuinely altered, and move in proportion to the benefit. The first three tests are satisfied by many clocks. The fourth, as the next section argues, is satisfied by almost none, and it is the only one that makes a biomarker fit to serve as a trial endpoint.

10.2 From correlate to endpoint: the validation problem

The four tests draw a sharp line, and most clocks fall on the wrong side of it. A clock can predict mortality superbly — pass test three with distinction — and still be useless for testing a therapy, because prediction and responsiveness are different properties. Grey hair predicts mortality; dyeing it changes nothing. The quantity a trial needs is not a correlate of ageing but a surrogate endpoint: a marker that sits on the causal path from intervention to outcome, such that moving the marker reliably moves the outcome in the same direction and degree. Establishing that — validating a surrogate — is far harder than building a clock, and the history of medicine is littered with surrogates that betrayed the trials built on them.
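The difference between a correlate and a surrogate can be sketched with a small simulation. Everything here is hypothetical: `frailty` stands in for the latent ageing process, and the effect sizes and noise levels are arbitrary choices. In scenario A the treatment acts on the marker alone (the hair dye); in scenario B it acts on the underlying process that the marker reads.

```r
set.seed(42)
n <- 20000
treated <- rep(c(0, 1), each = n / 2)
frailty <- rnorm(n)                      # hypothetical latent ageing process

# Scenario A: treatment shifts the marker directly; the process is untouched
marker_A  <- frailty - 0.5 * treated + rnorm(n, sd = 0.5)
outcome_A <- frailty + rnorm(n)

# Scenario B: treatment shifts the process; marker and outcome both follow
frailty_B <- frailty - 0.5 * treated
marker_B  <- frailty_B + rnorm(n, sd = 0.5)
outcome_B <- frailty_B + rnorm(n)

# Treated-minus-control difference in means
arm_difference <- function(x) mean(x[treated == 1]) - mean(x[treated == 0])

results <- data.frame(
  scenario       = c("A: marker only", "B: causal path"),
  marker_effect  = c(arm_difference(marker_A),  arm_difference(marker_B)),
  outcome_effect = c(arm_difference(outcome_A), arm_difference(outcome_B)),
  marker_outcome_cor = c(cor(marker_A, outcome_A), cor(marker_B, outcome_B))
)
print(results)
```

In both scenarios the marker correlates with the outcome and moves under treatment, so a trial that reports only the marker cannot tell them apart. Only the outcome column separates the dyed hair from the slowed process.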

The honest evidence we have suggests the field is not there yet. When the CALERIE trial of sustained caloric restriction was read by epigenetic clocks, the verdict depended on which clock was consulted: one purpose-built pace-of-ageing measure registered a small slowing, while others showed nothing, and the cohort’s telomeres told a third story again (Section 6.1) (Waziry et al., 2023). A single intervention, a single set of participants, several instruments, several answers. That divergence is not a defect of any one clock; it is a sign that we do not yet know which signal, if any, is the one a therapy must move. There is even a deeper doubt: a strand of theory argues that the methylation changes the clocks read may be a programmed output of development rather than a record of damage, in which case resetting the clock might leave the underlying ageing untouched — a healthier-looking dial on the same engine (Gems et al., 2024).

Analogy: the speedometer and the destination

Imagine judging whether a new engine will get a car further on a tank of fuel by watching the speedometer. Speed is easy to read and correlates with progress, so it is tempting to declare the faster car the more efficient one. But the speedometer measures velocity, not range; a car can run faster and arrive sooner at an empty tank. A validated surrogate would be a gauge wired directly to fuel consumption — moving it necessarily changes the distance travelled. Most ageing clocks today are speedometers: informative, even predictive, but not yet shown to be wired to the outcome that matters. The validation problem is the work of establishing which gauges are the fuel gauges. Until that is done, a trial that reports only a moved clock has measured velocity and called it range — precisely the slippage that Chapter 11 will track through the field’s recurring bouts of overstatement.

This is why the careful trials in this work report their biomarker results as signals rather than verdicts, and why the gap between a slowed clock and a lengthened life is the single most important caveat a reader can carry forward. It is not an argument against biomarkers — without them, longevity trials are simply impossible on any human timescale — but an argument for holding them to the fourth test before trusting them with a conclusion.

10.3 Designing a trial for a lifetime

Suppose, then, the honest course: refuse to trust an unvalidated surrogate and insist on a hard clinical endpoint — disease, disability, death. The arithmetic immediately explains why almost no one does this. A trial’s ability to detect a benefit depends on how many events it accrues, and in reasonably healthy older adults the events are mercifully rare and the plausible effect modest. Figure 10.1 makes the consequence visible: detecting a ten-per-cent slowing of ageing through its effect on a composite of age-related clinical events demands thousands of participants per arm, and at smaller effects the requirement balloons past any feasible study; a validated continuous biomarker, carrying far more information per person, brings the same question within reach of a few hundred.

The power calculation behind Figure 10.1:
library(ggplot2)

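# Normal quantiles for a two-sided 5% test and 80% power
za <- qnorm(0.975); zb <- qnorm(0.80)
r  <- seq(0.05, 0.30, by = 0.005)          # fractional slowing of ageing pace

# Continuous biomarker endpoint: standardised effect scales with slowing
# (the factor of 3 is an illustrative assumption)
d  <- 3 * r
n_biomarker <- 2 * (za + zb)^2 / d^2       # per-arm n, comparison of two means

# Binary clinical endpoint: composite age-related events, 20% baseline
pc <- 0.20                                 # control-arm event rate
pt <- pc * (1 - r)                         # treated-arm rate, assumed to fall in proportion to r
n_clinical  <- (za + zb)^2 * (pc * (1 - pc) + pt * (1 - pt)) / (pc - pt)^2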

df <- rbind(
  data.frame(r = r, n = n_clinical,  endpoint = "Clinical events (composite, 20% baseline)"),
  data.frame(r = r, n = n_biomarker, endpoint = "Validated biomarker (continuous)")
)
df$endpoint <- factor(df$endpoint,
  levels = c("Clinical events (composite, 20% baseline)",
             "Validated biomarker (continuous)"))

pal <- c("Clinical events (composite, 20% baseline)" = "#9C4A2E",
         "Validated biomarker (continuous)"          = "#0F6E66")

ggplot(df, aes(r, n, colour = endpoint)) +
  geom_hline(yintercept = 1000, linetype = "dotted", colour = "grey70") +
  annotate("text", x = 0.30, y = 1000, label = "~1,000 / arm",
           hjust = 1, vjust = -0.5, size = 2.8, colour = "grey55") +
  geom_line(linewidth = 1) +
  scale_x_continuous(labels = scales::percent) +
  scale_y_log10(labels = scales::label_comma(),
                breaks = c(10, 100, 1000, 10000, 100000)) +
  scale_colour_manual(values = pal) +
  labs(x = "Slowing of the pace of ageing achieved by the intervention",
       y = "Participants required per arm (log scale)",
       colour = NULL) +
  theme_minimal(base_size = 11) + theme(legend.position = "top")
Figure 10.1: Why the endpoint decides the trial, simulated. Participants required per arm (logarithmic scale) to detect a given slowing of the pace of ageing, at 80% power and a two-sided 5% significance level. A hard clinical endpoint (rust) — a composite of age-related events with a 20% incidence over the trial — needs roughly thirty times as many participants as a validated continuous biomarker (teal), and at small effect sizes climbs toward tens of thousands per arm. The figure is illustrative: real numbers depend on event rates, biomarker variance and follow-up. But the shape is the field’s central constraint, and it explains both the appeal of biomarker endpoints and the design of trials such as TAME, which raises the achievable event rate by counting many diseases at once rather than one.

The field has assembled a toolkit of partial escapes from this bind. The first is the validated surrogate of the previous section: if the fourth test can be met, the cheaper curve in the figure becomes legitimate. The second is to raise the event rate deliberately by widening the endpoint — counting not one disease but the first occurrence of any of several, a composite of age-related multimorbidity, on the geroscience premise that an intervention against ageing should delay all of them together. The third is to exploit data already collected: where a randomised trial is impossible, one can specify the trial that would answer the question and then emulate it within a large observational database, a discipline of target trial emulation that forces causal questions to be stated as explicitly as a protocol would and exposes the assumptions a naïve analysis would hide (Hernán & Robins, 2016). Adaptive and decentralised designs — recruiting and monitoring participants remotely, adjusting the protocol as data accrue — lower cost and broaden access, at the price of new threats to rigour.
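The second escape, widening the endpoint, can be put in rough numbers. The sketch below uses the same normal-approximation formula as the power calculation behind Figure 10.1 and makes three loud assumptions: each component disease has the same hypothetical 5% incidence over the trial, the components occur independently, and the intervention lowers every component by the same fraction (the geroscience premise, simplified). None of these figures comes from a real trial.

```r
za <- qnorm(0.975)   # two-sided 5% significance
zb <- qnorm(0.80)    # 80% power

# Per-arm sample size for a binary endpoint (normal approximation)
n_per_arm <- function(pc, r) {
  pt <- pc * (1 - r)                     # treated-arm event rate
  (za + zb)^2 * (pc * (1 - pc) + pt * (1 - pt)) / (pc - pt)^2
}

single_rate  <- 0.05                     # hypothetical incidence of one disease
n_components <- 4                        # hypothetical number of diseases counted

# Probability of at least one event, assuming independent components
composite_rate <- 1 - (1 - single_rate)^n_components

r <- 0.10                                # assumed proportional reduction in event rate
n_single    <- n_per_arm(single_rate, r)
n_composite <- n_per_arm(composite_rate, r)

print(round(c(composite_rate = composite_rate,
              n_single = n_single,
              n_composite = n_composite), 3))
```

Under these assumptions the composite needs several times fewer participants than the single-disease endpoint. If the components are correlated, or the intervention moves only some of them, the gain shrinks.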

The sample size for a two-arm trial scales, for a continuous endpoint, with the inverse square of the standardised effect: halving the effect quadruples the participants. For a binary clinical endpoint it scales with the event rates as well, specifically with \([p_c(1-p_c) + p_t(1-p_t)]/(p_c - p_t)^2\), where \(p_c\) and \(p_t\) are the event rates in the control and treated arms. A low baseline incidence is therefore doubly punishing, shrinking both the contrast and the information each participant contributes. A treatment that cuts a 20% event rate to 18% produces a difference of just two percentage points, and recovering that signal from the noise of individual variation takes thousands of people followed for years (MetricGate, 2025). This is the quantitative root of the whole problem, and it has two honest solutions and one dishonest one. Honestly, you can validate a continuous surrogate (more information per person) (Moqri et al., 2024) or enrich the event rate with a multimorbidity composite or a higher-risk population (a larger contrast to detect). Dishonestly, you can quietly substitute an unvalidated biomarker for the clinical endpoint and report the small, cheap trial as though it answered the large, expensive question (Herzog et al., 2025). The figure above is, in effect, a map of which trials are affordable, and therefore of which temptations are strongest.
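The 20%-to-18% example can be worked through by hand with the same normal-approximation formula used for Figure 10.1, at 80% power and a two-sided 5% significance level, so that \(z_{0.975} + z_{0.80} \approx 1.960 + 0.842\) and its square is about 7.849. The figures are illustrative, as in the figure itself.

\[
n_{\text{clinical}} = (z_{0.975} + z_{0.80})^2 \, \frac{p_c(1-p_c) + p_t(1-p_t)}{(p_c - p_t)^2}
\approx 7.849 \times \frac{0.20 \times 0.80 + 0.18 \times 0.82}{(0.20 - 0.18)^2}
= 7.849 \times \frac{0.3076}{0.0004} \approx 6{,}036 \text{ per arm}
\]

For the continuous endpoint, taking the figure's assumed standardised effect of \(d = 0.30\) for the same ten-per-cent slowing:

\[
n_{\text{biomarker}} = \frac{2\,(z_{0.975} + z_{0.80})^2}{d^2}
\approx \frac{2 \times 7.849}{0.30^2} \approx 174 \text{ per arm},
\qquad
d = 0.15 \;\Rightarrow\; n_{\text{biomarker}} \approx 698 \text{ per arm}
\]

Halving \(d\) multiplies the requirement by four. At this effect size the clinical endpoint needs roughly thirty-five times as many participants as the biomarker, consistent with the "roughly thirty times" read off the figure.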

A concrete instance of the design problem arrives with the most radical therapy examined so far. Translating a pulsed reprogramming regimen into a human protocol means converting an animal dosing schedule into a defined human one, and choosing endpoints that can register rejuvenation without waiting for it to play out as longevity — the same coupling of dose and measurable readout examined for partial reprogramming in Section 8.2. The therapy may be novel; the measurement problem is the one this chapter has described throughout.

10.4 The regulatory puzzle: an indication that is not a disease

Even a perfectly designed trial must answer to a regulator, and here the field meets an obstacle written into the architecture of medicine. Drug regulators approve treatments for diseases, against recognised indications; ageing is not classified as a disease, and so, in the plainest sense, there is nothing for an anti-ageing drug to be approved for. This is not a bureaucratic quibble but a genuine fork. One response is to argue that ageing should be reclassified — a debate sharpened by the World Health Organization’s introduction of an “old age” code in its eleventh disease classification, which drew protest precisely because of what medicalising ageing might license or deny. The other, and the one the field has largely taken, sidesteps reclassification entirely.

That second route is the geroscience hypothesis turned into regulatory strategy. If a single underlying process drives the major chronic diseases of later life, then an intervention aimed at that process should postpone them as a group — and a trial can therefore take as its indication not “ageing” but a composite of age-related diseases, an outcome regulators already understand (Kennedy et al., 2014). The success criterion becomes the delayed onset of multimorbidity and functional decline, which is both measurable and clinically meaningful, and which conveniently raises the event rate that Figure 10.1 showed to be decisive. This is the conceptual move that made a longevity trial regulable at all, and the panel of blood biomarkers assembled for exactly this purpose (Justice et al., 2018) is its supporting apparatus — a way to show a regulator that the underlying biology is moving while the slow clinical endpoints accumulate.

Caveat: an indication that is not a disease

The geroscience workaround is elegant, but it leaves three problems unresolved, and a reader should hold all three. First, a composite endpoint can be gamed by its construction: which diseases are counted, and how a “first event” is defined, can quietly determine the result, and the temptation to choose a composite that is easy to move is real (Moqri et al., 2023; Moqri et al., 2024). Second, regulatory acceptance of a surrogate is not scientific validation of it; an agency may permit a biomarker endpoint for pragmatic reasons while the fourth test remains formally unmet, and approval can then be mistaken for proof (Moqri et al., 2024). Third, and most broadly, an intervention licensed against multimorbidity in high-risk older adults tells us little about whether it should be given to healthy people for decades, which is what the popular imagination, and much of the commercial pipeline, actually has in mind. The distance between “delays a cluster of diseases in the frail elderly” and “slows ageing in the healthy young” is enormous, and it is routinely elided (Ferrucci et al., 2023; Rothwell, 2005). The regulatory framing solves the problem of how to run the trial; it does not, by itself, license the use most people are hoping for (Committee for Medicinal Products for Human Use, 2022).

10.5 Landmark trials and the emerging landscape

The principles above are not abstractions; they were forged by, and are visible in, the small set of human trials that now define the field. The exemplar is TAME — Targeting Aging with Metformin — conceived explicitly as a regulatory proof of concept rather than a bet on a particular drug (Barzilai et al., 2016). Its design embodies every move of this chapter: metformin, a cheap and exhaustively characterised diabetes drug, tested in older adults against a composite of age-related outcomes — major cardiovascular events, cancer, dementia and death — with the supporting biomarker panel running underneath (Justice et al., 2018). The point was never that metformin is the ideal geroprotector; the meta-analytic evidence in animals is in fact stronger for rapamycin, which mirrors dietary restriction in a way metformin does not (Ivimey-Cook et al., 2025). The point was to establish that a trial against ageing can be designed, run and read — to cut the regulatory path that successors would walk.

Its complement is PEARL, a decentralised trial of low-dose rapamycin that reported acceptable safety and several healthspan signals over a year in a community of self-selected participants, demonstrating both the promise and the limits of the remote, participant-driven model (Moel et al., 2025). Beyond these, the wider landscape is consistent in a way this chapter has prepared us to expect: the first-in-human senolytic studies measured feasibility, tolerability and target engagement — that the drugs reach their tissues and lower senescent-cell burden — rather than disease outcomes (Section 7.3) (Hickson et al., 2019; Justice et al., 2019); the CALERIE analysis read biological age with clocks rather than counting deaths (Waziry et al., 2023); and the reprogramming therapies of Chapter 8 approach their first human tests with the dosing-and-endpoint problem of Section 8.2 still squarely before them. Table 10.1 sets the principal trials side by side.

Table 10.1: The landmark longevity trials, and the honest scope of each. Almost without exception they measure safety, target engagement or biomarkers rather than the lengthened, healthier life the field is ultimately after — a pattern that is a feature of an early field, not a failing, but one that the next chapter insists be read clearly.
| Trial | Agent | Design | Primary endpoint | What it measures |
| --- | --- | --- | --- | --- |
| TAME | Metformin | Multicentre RCT, older adults | Composite of age-related multimorbidity and death | Whether one drug delays many diseases, and whether such a trial is regulable (Barzilai et al., 2016; Justice et al., 2018) |
| PEARL | Rapamycin (low dose) | Decentralised RCT, one year | Safety and healthspan metrics | Tolerability and biomarker signals of a geroprotector in the community (Moel et al., 2025) |
| Senolytic pilots (IPF, diabetic kidney disease, Alzheimer’s) | Dasatinib + quercetin; fisetin | Small, often open-label | Feasibility, tolerability, senescent-cell burden | Target engagement, not yet disease outcomes (see Section 7.3) (Hickson et al., 2019; Justice et al., 2019) |
| CALERIE | Caloric restriction | RCT, two years, healthy adults | Cardiometabolic and immune markers; biological age | Whether restriction moves a clock and improves markers, not lifespan (see Section 6.1) (Waziry et al., 2023) |

The common thread is unmistakable. These are, almost to a trial, studies of whether the intervention does what its mechanism says — reaches the target, moves the marker, proves tolerable — and not yet studies of whether it lengthens a healthy life. That is the appropriate state of an early translational field, and nothing to apologise for. But it places an unusual burden on the reader of longevity science, who must constantly distinguish a moved biomarker from a changed destiny, and a regulable composite from a proven benefit.

The machinery assembled here — clocks that compress a lifetime into a blood test, surrogate endpoints awaiting validation, composites engineered to satisfy a regulator, trials that measure engagement and call it a beginning — is what makes the science of reversible ageing possible at all. It is also, precisely because it substitutes a fast proxy for a slow truth, what makes the field so easy to oversell. Every shortcut in this chapter is a place where a hopeful result can outrun its own evidence.

The next chapter follows that fault line deliberately. It turns from how evidence is made to how it is misread — to the paradigms that were embraced and then quietly discarded, the case studies of overreach, and the discipline of reading a longevity claim for what it actually shows. Having built the instruments, we now learn to doubt them well.