One-Arm Fixed Design with Survival Endpoints

Design space explored

All comparisons were conducted for a one-arm, fixed-design clinical trial with a survival primary endpoint, analyzed using the log-rank test under a proportional-hazards framework. The objective was to compare sample-size and event calculations across commonly used software implementations over a broad but structured design space.

Varying design parameters

Four key design parameters were varied systematically:

  • Type I error rate (\(\alpha\), two-sided):
    \(\alpha \in \{0.01, 0.05, 0.10, 0.20, 0.49\}\)

  • Power (\(1 - \beta\)):
    \(1 - \beta \in \{0.51, 0.80, 0.90, 0.99\}\)

  • Hazard ratio (HR):
    \(\text{HR} \in \{0.10, 0.50, 0.70, 0.90, 0.99\}\)

  • Survival probability at 3 years under the null hypothesis:
    \(S_0(3\text{y}) \in \{0.10, 0.30, 0.60, 0.90\}\)

The 3-year survival probability was used to calibrate the baseline hazard under an exponential survival assumption, ensuring consistency across methods requiring explicit specification of the underlying survival distribution.
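Under the exponential assumption, the baseline hazard is determined by the 3-year survival probability via \(S_0(t) = e^{-\lambda_0 t}\). A minimal sketch of this calibration (function name is ours):

```python
from math import log

def baseline_hazard(s0_3y: float, t: float = 3.0) -> float:
    """Exponential baseline hazard implied by S0(t) = exp(-lambda * t)."""
    return -log(s0_3y) / t

# Example: S0(3y) = 0.60 implies lambda0 ≈ 0.1703 per year.
print(round(baseline_hazard(0.60), 4))  # 0.1703
```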

The full factorial combination of these parameters yielded 400 distinct design scenarios.
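The scenario count follows directly from the grid sizes (5 × 4 × 5 × 4); a quick enumeration:

```python
from itertools import product

alphas        = [0.01, 0.05, 0.10, 0.20, 0.49]  # two-sided alpha
powers        = [0.51, 0.80, 0.90, 0.99]        # 1 - beta
hazard_ratios = [0.10, 0.50, 0.70, 0.90, 0.99]
s0_3y         = [0.10, 0.30, 0.60, 0.90]        # survival at 3 years under H0

scenarios = list(product(alphas, powers, hazard_ratios, s0_3y))
print(len(scenarios))  # 400 = 5 * 4 * 5 * 4
```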

Fixed design assumptions

To isolate the impact of the statistical method and software implementation, the following parameters were held constant across all scenarios:

  • Accrual duration: 3 years
  • Additional follow-up after end of accrual: 3 years
  • Test sidedness: two-sided log-rank test

Accrual was assumed to be uniform over the accrual period. Censoring arose solely from administrative study termination; no additional loss to follow-up, noncompliance, or competing risks were assumed.
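Under these assumptions (exponential survival, uniform accrual over \(a\) years, administrative censoring at \(a + f\)), the probability that an enrolled patient's event is observed has the closed form \(P = 1 - \bigl(e^{-\lambda f} - e^{-\lambda (a+f)}\bigr)/(a\lambda)\). A sketch, with names of our choosing:

```python
from math import exp, log

def event_probability(lam: float, accrual: float = 3.0, followup: float = 3.0) -> float:
    """P(event observed) with exponential hazard `lam`, uniform accrual over
    [0, accrual], and administrative censoring at accrual + followup years."""
    a, f = accrual, followup
    return 1.0 - (exp(-lam * f) - exp(-lam * (a + f))) / (a * lam)

# Example: hazard calibrated from S0(3y) = 0.60.
lam = -log(0.60) / 3.0
print(round(event_probability(lam), 3))  # ≈ 0.530
```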

Relevancy classification of design scenarios

Each scenario was assigned a relevancy category based solely on the Type I error rate and power, following the same rule‑based system used for two‑arm designs:

  • High relevance \[ 0.02 < \alpha < 0.15 \quad\text{and}\quad \text{power} > 0.75. \] These represent realistic single‑arm designs commonly used in phase II studies, especially in rare‑disease contexts.

  • Medium relevance
    Scenarios not meeting the criteria for high or low relevance; these represent plausible but less standard design choices.

  • Low relevance \[ \alpha > 0.20 \quad\text{or}\quad \text{power} < 0.60 \quad\text{or}\quad \text{power} \ge 0.99. \] These combinations are rarely used in practice but included intentionally to explore numerical robustness across asymptotic and exact methods.

The classification is descriptive only and does not filter or influence sample size computations.
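The rule can be expressed directly in code. Since power \(\ge 0.99\) satisfies both the high and the low criteria, we assume here that the low rule takes precedence (an assumption consistent with the wording above):

```python
def relevancy(alpha: float, power: float) -> str:
    """Descriptive relevancy category from (alpha, power) only."""
    # Low-relevance rule checked first so that power >= 0.99 is always "low".
    if alpha > 0.20 or power < 0.60 or power >= 0.99:
        return "low"
    if 0.02 < alpha < 0.15 and power > 0.75:
        return "high"
    return "medium"

print(relevancy(0.05, 0.80))  # high
print(relevancy(0.49, 0.90))  # low
print(relevancy(0.01, 0.90))  # medium (alpha not in (0.02, 0.15))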

Scope of the comparison

For each of the 400 scenarios, all methods and software implementations were used to compute:

  1. The required number of events to achieve the target power under the specified hazard ratio, and
  2. The corresponding total sample size, accounting for accrual and follow-up.
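For orientation only: a common approximation for the one-sample log-rank test takes \(d = (z_{1-\alpha/2} + z_{1-\beta})^2 / (\log \text{HR})^2\) events, then converts events to patients by dividing by the probability of observing an event under accrual and follow-up. This is a sketch, not the exact formula used by any of the compared packages:

```python
from math import ceil, log
from statistics import NormalDist

def required_events(alpha: float, power: float, hr: float) -> float:
    """Approximate events for a two-sided one-sample log-rank test."""
    z = NormalDist().inv_cdf
    return (z(1 - alpha / 2) + z(power)) ** 2 / log(hr) ** 2

def total_sample_size(alpha: float, power: float, hr: float, p_event: float) -> int:
    """Patients needed so the expected number of events reaches d."""
    return ceil(required_events(alpha, power, hr) / p_event)

print(round(required_events(0.05, 0.80, 0.70), 1))        # ≈ 61.7 events
print(total_sample_size(0.05, 0.80, 0.70, p_event=0.53))  # 117 patients
```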

The design space was intentionally chosen to span both conventional trial settings and extreme configurations, allowing a comprehensive assessment of agreement, divergence, and stability across methods.

Methods compared

  • nQuery: SOT1 / One Sample Log-Rank Test with Accrual
  • OneArm2stage::phase2.TTE()¹
  • SampleSizeSingleArmSurvival::calcSampleSizeArcsine()
  • rashnu::oneSurvSampleSize()

Results

📌 Overall: unlike the two-arm survival settings, the one-arm fixed survival framework shows major method-specific differences, with no implementation providing close agreement with nQuery across realistic or extended design regions.

  • Across all relevancy levels, none of the evaluated methods reproduces nQuery’s sample sizes closely. Even our rewrite of OneArm2stage (expected to match nQuery) shows systematic deviations, with N-ratios typically below 1.00 in high- and medium-relevancy settings.
  • The rashnu and sssas methods implement the approach of Nagashima et al. (2020), based on an arcsine transformation of the Kaplan–Meier estimator. Both consistently yield larger sample sizes than nQuery, with N-ratios frequently well above 1.00, sometimes substantially so.
  • In high-relevancy designs, all three methods already differ from nQuery, with OneArm2stage underestimating and the Nagashima-based methods overestimating. Dispersion is wide, and no method stays close to the reference band.
  • In medium- and low-relevancy scenarios, discrepancies grow even larger. Extreme deviations are driven by designs where nQuery outputs very large sample sizes, amplifying differences between formulas and underlying assumptions.
  • Overall pattern:
    • OneArm2stage < nQuery (systematic underestimation)
    • Nagashima-based methods > nQuery (systematic overestimation)
    • No method aligns tightly with nQuery.
N-Ratio — 1-Arm Survival (relative to nQuery sample sizes)

| Relevancy | Method | Min  | Q1   | Mean | Median | Q3   | Max   |
|-----------|--------|------|------|------|--------|------|-------|
| high      | oa2s   | 0.67 | 0.74 | 0.78 | 0.79   | 0.82 | 0.92  |
| high      | rashnu | 1.46 | 1.56 | 1.67 | 1.63   | 1.76 | 2.01  |
| high      | sssas  | 1.22 | 1.29 | 1.37 | 1.32   | 1.42 | 1.59  |
| medium    | oa2s   | 0.49 | 0.67 | 0.78 | 0.81   | 0.88 | 1.04  |
| medium    | rashnu | 1.30 | 1.45 | 1.61 | 1.59   | 1.74 | 2.17  |
| medium    | sssas  | 1.10 | 1.22 | 1.31 | 1.29   | 1.40 | 1.59  |
| low       | oa2s   | 0.22 | 0.65 | 1.14 | 0.86   | 1.10 | 14.33 |
| low       | rashnu | 0.67 | 1.57 | 2.12 | 1.86   | 2.18 | 16.67 |
| low       | sssas  | 0.55 | 1.33 | 1.72 | 1.49   | 1.70 | 13.67 |

Red: fewer than 50% of N-ratios within ±10%.
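Each table row is a quantile summary of per-scenario N-ratios (method N divided by nQuery N). Such summaries can be reproduced with the standard library (the data below are illustrative, not the study results):

```python
from statistics import mean, quantiles

def summarize(ratios):
    """Min/Q1/Mean/Median/Q3/Max summary of N-ratios for one method."""
    q1, med, q3 = quantiles(ratios, n=4)  # exclusive method by default
    return {"min": min(ratios), "q1": q1, "mean": mean(ratios),
            "median": med, "q3": q3, "max": max(ratios)}

# Hypothetical ratios for one method, not the study data:
print(summarize([0.7, 0.8, 0.8, 0.9, 1.0]))
```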


Interactive results

You can explore all results interactively, filter designs, compare methods, and inspect individual cases in the interactive section of the website.

➡️ “Interactive results”

References

Nagashima, Kengo, Hisashi Noma, Yasunori Sato, and Masahiko Gosho. 2020. “Sample Size Calculations for Single‐arm Survival Studies Using Transformations of the Kaplan–Meier Estimator.” Pharmaceutical Statistics 20 (3): 499–511. https://doi.org/10.1002/pst.2090.

Footnotes

  1. Our rewrite that accepts an accrual time (see Fork-repo for more details).