Learning from Literature: Integrating LLMs and Bayesian Hierarchical Modeling for Oncology Trial Design

Authors: Guannan Gong, PhD; Satrajit Roychoudhury, PhD; Allison Meisner, MD; Lajos Pusztai, MD, DPhil; Sarah B. Goldberg, MD, MPH; Wei Wei, PhD

Overview

Designing modern oncology clinical trials requires synthesizing evidence from prior studies to inform hypothesis generation, effect-size estimation, and sample size determination. In practice, this process relies heavily on qualitative summaries or aggregate statistics that incompletely capture heterogeneity across patient populations and study designs. As a result, trials may be based on misspecified assumptions, leading to underpowered studies or misleading conclusions.

To address this challenge, we introduce LEAD-ONC (Literature to Evidence for Analytics and Design in Oncology), an AI-assisted framework that transforms published oncology trial reports into quantitative, design-relevant evidence. LEAD-ONC integrates advances in large language models (LLMs), survival data reconstruction, and Bayesian hierarchical modeling to enable principled learning from prior trials and prospective trial design under uncertainty.

Methodological Framework

Given a set of expert-curated clinical trial publications meeting predefined eligibility criteria, LEAD-ONC operates in three stages:

  1. Structured Evidence Extraction
    Large language models are used to extract baseline characteristics and trial-level metadata from published manuscripts.
  2. Survival Data Reconstruction
    Individual patient data are reconstructed from published Kaplan–Meier curves, enabling downstream modeling beyond aggregate summaries.
  3. Bayesian Hierarchical Modeling for Trial Design
    Reconstructed survival data are integrated across studies using Bayesian hierarchical models to generate predictive survival distributions tailored to a prespecified target trial population.
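
To make the output of stage 2 concrete, the sketch below shows the form reconstructed individual patient data typically take (one follow-up time and one event indicator per patient) and how a Kaplan-Meier estimate and median survival can be recomputed from them. It does not perform the reconstruction itself, and the source does not specify which reconstruction algorithm LEAD-ONC uses. The function names and the toy records are hypothetical, chosen only for illustration.

```python
from typing import List, Optional, Tuple


def kaplan_meier(times: List[float], events: List[int]) -> List[Tuple[float, float]]:
    """Return the (time, survival probability) steps of a Kaplan-Meier estimate.

    times  : follow-up time per reconstructed patient (months)
    events : 1 if death was observed at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= removed
    return curve


def median_survival(curve: List[Tuple[float, float]]) -> Optional[float]:
    """First time at which estimated survival is at or below 0.5, if reached."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Hypothetical toy records, NOT data from any trial discussed here.
toy_times = [2.0, 3.5, 3.5, 5.0, 6.0, 7.5, 8.0, 9.0, 11.0, 12.0]
toy_events = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]

curve = kaplan_meier(toy_times, toy_events)
median = median_survival(curve)
print(median)
```

Patient-level records of this kind are what allow the third stage to model survival directly rather than relying on a single published hazard ratio or median.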

This framework enables probabilistic projections of treatment effects that explicitly account for between-trial heterogeneity and uncertainty, two elements often overlooked in traditional trial planning.
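
As a rough illustration of how a hierarchical model turns several prior trials into a predictive distribution for a new one, the sketch below fits a simple normal-normal random-effects model. This is a deliberate simplification and an assumption on our part: LEAD-ONC models reconstructed survival data, whereas this sketch works on one summary effect and standard error per trial. The function name `predictive_draws`, the half-normal prior scale on the between-trial standard deviation, and the toy inputs are all hypothetical.

```python
import numpy as np


def predictive_draws(y, se, tau_prior_scale=2.0, n_draws=20000, seed=0):
    """Draw the effect in a NEW trial under a normal-normal hierarchical model.

    Model (illustrative assumption, not the LEAD-ONC model):
        y_i ~ Normal(theta_i, se_i^2),  theta_i ~ Normal(mu, tau^2),
        flat prior on mu, half-normal(tau_prior_scale) prior on tau.
    """
    y = np.asarray(y, dtype=float)
    se = np.asarray(se, dtype=float)
    rng = np.random.default_rng(seed)

    # Grid approximation to the marginal posterior of tau.
    tau_grid = np.linspace(1e-3, 5 * tau_prior_scale, 400)
    var = se[None, :] ** 2 + tau_grid[:, None] ** 2      # shape (grid, trials)
    w = 1.0 / var
    mu_hat = (w * y).sum(axis=1) / w.sum(axis=1)         # E[mu | tau, y]
    v_mu = 1.0 / w.sum(axis=1)                           # Var[mu | tau, y]

    log_post = (
        -0.5 * (tau_grid / tau_prior_scale) ** 2         # half-normal prior
        + 0.5 * np.log(v_mu)
        - 0.5 * np.log(var).sum(axis=1)
        - 0.5 * (((y - mu_hat[:, None]) ** 2) / var).sum(axis=1)
    )
    p = np.exp(log_post - log_post.max())
    p /= p.sum()

    # Sample tau, then mu | tau, then the new trial's effect | mu, tau.
    idx = rng.choice(len(tau_grid), size=n_draws, p=p)
    mu = rng.normal(mu_hat[idx], np.sqrt(v_mu[idx]))
    return rng.normal(mu, tau_grid[idx])


# Hypothetical per-trial effects (months) and standard errors, NOT real data.
toy_diff = [1.0, 4.0, 2.5, 3.5, 0.5]
toy_se = [1.5, 2.0, 1.2, 1.8, 1.6]

draws = predictive_draws(toy_diff, toy_se)
lo, hi = np.percentile(draws, [2.5, 97.5])
prob_3mo = float(np.mean(draws >= 3.0))
print(f"mean {draws.mean():.1f}, 95% interval ({lo:.1f}, {hi:.1f}), P(>=3) {prob_3mo:.2f}")
```

The predictive interval for a new trial is wider than the interval for the pooled mean because it adds the between-trial variance on top of the uncertainty in the mean. That extra spread is the heterogeneity a fixed-effect summary would ignore.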

Demonstration in Non–Small-Cell Lung Cancer

We demonstrate LEAD-ONC using five phase III trials in first-line non–small-cell lung cancer evaluating PD-1 or PD-L1 inhibitors with or without CTLA-4 blockade. Clustering based on extracted baseline characteristics identified three clinically interpretable populations defined by histology.

For a hypothetical prospective randomized trial in a mixed-histology population comparing single-agent versus dual immune checkpoint inhibition, LEAD-ONC projected:

  • Median overall survival difference: 2.8 months
  • 95% credible interval: −2.0 to 7.6 months
  • Probability of achieving ≥3 months benefit: ~0.45

Because LEAD-ONC remains under active development, these findings are intended as methodological demonstrations rather than definitive clinical guidance.
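
The three projected quantities above can be checked against one another with a back-of-the-envelope calculation. The sketch below assumes the predictive distribution is approximately normal, which is our simplification and not a statement about how LEAD-ONC computes the probability. The assumption is at least plausible here because the reported interval is symmetric around the 2.8-month estimate. Variable names are illustrative.

```python
from statistics import NormalDist

# Values reported in the demonstration above.
point_estimate = 2.8           # projected median OS difference (months)
ci_low, ci_high = -2.0, 7.6    # 95% credible interval (months)
threshold = 3.0                # clinically relevant benefit (months)

# Assumption: treat the predictive distribution as normal and back out
# its standard deviation from the width of the 95% interval.
z = NormalDist().inv_cdf(0.975)
approx_sd = (ci_high - ci_low) / (2 * z)

approx = NormalDist(mu=point_estimate, sigma=approx_sd)
prob_benefit = 1.0 - approx.cdf(threshold)
print(round(prob_benefit, 2))
```

Under this approximation the probability of at least a 3-month benefit comes out at about 0.47, in line with the reported value of roughly 0.45. The small gap is what one would expect if the actual predictive distribution is not exactly normal.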

Significance

LEAD-ONC illustrates how AI-driven extraction from the biomedical literature, combined with principled statistical modeling, can support evidence-driven oncology trial design. The framework provides a foundation for improving hypothesis formulation, power calculations, and decision-making under uncertainty, with potential applications across disease areas and therapeutic modalities.

How to Reference This Work

arXiv Citation

Gong G, Roychoudhury S, Meisner A, Pusztai L, Goldberg SB, Wei W.
Learning from Literature: Integrating LLMs and Bayesian Hierarchical Modeling for Oncology Trial Design.
arXiv preprint arXiv:2602.08172, 2026.
https://doi.org/10.48550/arXiv.2602.08172

BibTeX
@article{gong2026leadonc,
  title={Learning from Literature: Integrating LLMs and Bayesian Hierarchical Modeling for Oncology Trial Design},
  author={Gong, Guannan and Roychoudhury, Satrajit and Meisner, Allison and Pusztai, Lajos and Goldberg, Sarah B and Wei, Wei},
  journal={arXiv preprint arXiv:2602.08172},
  year={2026},
  doi={10.48550/arXiv.2602.08172}
}

Status and Availability

  • Status: Under active development
  • Version: arXiv v1 (submitted February 9, 2026)
  • Intended use: Methodological research and trial design exploration

About LEAD-ONC

LEAD-ONC is an AI-assisted framework that transforms published oncology trial reports into quantitative, design-ready evidence for survival analysis and trial planning.

Current App Version: v1.2.1