Two-Arm Fixed Design with Survival Endpoints

Design space explored

All comparisons were conducted for a two-arm, fixed-design randomized clinical trial with a survival primary endpoint, analyzed using the logrank test under a proportional hazards framework. The objective was to compare sample size and event calculations across commonly used software implementations over a broad but structured design space.

Varying design parameters

Four key design parameters were varied systematically:

  • Type I error rate (\(\alpha\), two-sided):
    \(\alpha \in \{0.01, 0.05, 0.10, 0.20, 0.49\}\)

  • Power (\(1 - \beta\)):
    \(1 - \beta \in \{0.51, 0.80, 0.90, 0.99\}\)

  • Hazard ratio (HR):
    \(\text{HR} \in \{0.10, 0.50, 0.70, 0.90, 0.99\}\)

  • Control-group survival probability at 3 years:
    \(S_c(3\text{y}) \in \{0.10, 0.30, 0.60, 0.90\}\)

The 3-year survival probability was used to calibrate the baseline hazard under an exponential survival assumption for the control group, ensuring consistency across methods requiring explicit specification of the underlying survival distribution.
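Under the exponential assumption this calibration is a one-liner, since \(S(t) = e^{-\lambda t}\) implies \(\lambda = -\ln S(t)/t\). A minimal sketch (the function name is illustrative, not from any of the compared packages):

```python
import math

def exponential_hazard_from_survival(surv_prob: float, time: float) -> float:
    """Baseline hazard rate implied by S(t) = exp(-lambda * t)."""
    return -math.log(surv_prob) / time

# Example: control-group 3-year survival of 0.60
lam = exponential_hazard_from_survival(0.60, 3.0)  # ~0.1703 events per year
```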

The full factorial combination of these parameters yielded 400 distinct design scenarios.

Fixed design assumptions

To isolate the impact of the statistical method and software implementation, the following parameters were held constant across all scenarios:

  • Accrual duration: 3 years
  • Additional follow-up after end of accrual: 3 years
  • Allocation ratio (experimental : control): 1 : 1
  • Test sidedness: two-sided logrank test

Accrual was assumed to be uniform over the accrual period. Censoring arose solely from administrative study termination; no additional loss to follow-up, noncompliance, or competing risks were assumed.
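With uniform accrual over \(a\) years, \(f\) years of additional follow-up, and administrative censoring only, a patient's censoring time is uniform on \([f, a+f]\), which gives a closed-form probability of observing an event under an exponential hazard \(\lambda\): \(P(\text{event}) = 1 - \bigl(e^{-\lambda f} - e^{-\lambda(a+f)}\bigr)/(\lambda a)\). An illustrative sketch (our own code, not from any compared package):

```python
import math

def event_probability(lam: float, accrual: float, followup: float) -> float:
    """P(event observed) for exponential event times with hazard `lam`,
    uniform accrual over `accrual` years, `followup` years of additional
    follow-up, and administrative censoring only."""
    return 1.0 - (math.exp(-lam * followup)
                  - math.exp(-lam * (accrual + followup))) / (lam * accrual)

# Control arm: hazard calibrated from S_c(3y) = 0.60; 3y accrual + 3y follow-up
lam_c = -math.log(0.60) / 3.0
p_c = event_probability(lam_c, accrual=3.0, followup=3.0)  # ~0.53
```

Dividing the required number of events by the (allocation-weighted) average of this probability across the two arms converts events into a total sample size.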

Relevancy classification of design scenarios

To help interpret method agreement across the full parameter grid, each scenario was classified a priori using a rule‑based assessment of its practical relevance, based solely on the Type I error rate and power:

  • High relevance
    Scenarios with \[ 0.02 < \alpha < 0.15 \quad\text{and}\quad \text{power} > 0.75. \] These settings correspond to realistic operating characteristics used in confirmatory time‑to‑event trials.

  • Medium relevance
    Scenarios that do not fall into the “high” or “low” categories.
    These represent plausible but less conventional design choices encountered in exploratory or constrained development programs.

  • Low relevance
    Scenarios where \[ \alpha > 0.20 \quad\text{or}\quad \text{power} < 0.60 \quad\text{or}\quad \text{power} \ge 0.99. \] Such configurations are generally unrealistic for operational trial design, but are retained to evaluate robustness and numerical behavior under extreme conditions.

The classification does not filter scenarios; it simply provides context for understanding discrepancies between methods.
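The rule can be written down directly. One point needs a convention: the "high" and "low" rules overlap (e.g. \(\alpha = 0.05\) with power \(= 0.99\)), and the sketch below assumes the "low" rule takes precedence, since those configurations are flagged as unrealistic:

```python
def relevancy(alpha: float, power: float) -> str:
    """Rule-based relevancy class from two-sided alpha and power.
    Assumes the 'low' rule wins where the rules overlap."""
    if alpha > 0.20 or power < 0.60 or power >= 0.99:
        return "low"
    if 0.02 < alpha < 0.15 and power > 0.75:
        return "high"
    return "medium"
```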

Scope of the comparison

For each of the 400 scenarios, all methods and software implementations were used to compute:

  1. The required number of events to achieve the target power under the specified hazard ratio, and
  2. The corresponding total sample size, accounting for accrual and follow-up.

The design space was intentionally chosen to span both conventional trial settings and extreme configurations, allowing a comprehensive assessment of agreement, divergence, and stability across methods.
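As a reference point for step 1, the required event count is often obtained from the Schoenfeld approximation, \(d = (z_{1-\alpha/2} + z_{1-\beta})^2 / \bigl(p(1-p)\,(\ln \text{HR})^2\bigr)\), where \(p\) is the allocation fraction. This is a sketch of that textbook formula; the packages compared below may use different approximations (e.g. Lakatos):

```python
import math
from statistics import NormalDist

def schoenfeld_events(hr: float, alpha: float, power: float,
                      ratio: float = 1.0) -> float:
    """Required number of events (Schoenfeld approximation) for a
    two-sided log-rank test under proportional hazards."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    p = ratio / (1 + ratio)  # allocation fraction in the experimental arm
    return (z_a + z_b) ** 2 / (p * (1 - p) * math.log(hr) ** 2)

# HR = 0.70, two-sided alpha = 0.05, power = 0.90, 1:1 allocation
d = schoenfeld_events(0.70, 0.05, 0.90)  # ~330 events
```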

Methods compared

  • East: Log Rank Test Given Accrual Duration and Study Duration [SU-2S-LRSD]
  • nQuery: STT1 / Two Sample Log-Rank Test of Exponential Survival
  • rpact::getSampleSizeSurvival()
  • rashnu::lakatosSampleSizeSurvival()
  • gsDesign2::fixed_design_ahr()

Results

📌 Overall: within realistic design regions, the four methods yield fixed‑design survival sample sizes highly consistent with East (N‑ratio = method sample size ÷ East sample size).

  • In high‑relevancy settings, all methods yield N‑ratios extremely close to 1.00, showing near‑perfect agreement with East for typical time‑to‑event designs.
  • Medium‑relevancy scenarios introduce slightly more dispersion, but deviations remain small (generally within a few percent) and without systematic over‑ or under‑estimation.
  • Low‑relevancy cases exhibit the largest discrepancies. These extreme designs make N‑ratios mechanically unstable and are not clinically meaningful.
  • When ignoring these outlier regions, all four implementations again converge tightly around 1.00.
  • rashnu uses a Lakatos-based approach, but the resulting differences are negligible here (see the Lakatos and Lan post for details).
N-Ratio, 2-Arm Survival (relative to East sample sizes)

| Relevancy | Method    | Min  | Q1   | Mean | Median | Q3   | Max  |
|-----------|-----------|------|------|------|--------|------|------|
| high      | gsdesign2 | 1.00 | 1.00 | 1.00 | 1.00   | 1.00 | 1.01 |
| high      | nquery    | 1.00 | 1.00 | 1.01 | 1.00   | 1.01 | 1.03 |
| high      | rashnu    | 1.00 | 1.00 | 1.00 | 1.00   | 1.00 | 1.02 |
| high      | rpact     | 1.00 | 1.00 | 1.00 | 1.00   | 1.00 | 1.00 |
| medium    | gsdesign2 | 0.98 | 1.00 | 1.01 | 1.00   | 1.01 | 1.05 |
| medium    | nquery    | 1.00 | 1.00 | 1.03 | 1.02   | 1.04 | 1.11 |
| medium    | rashnu    | 1.00 | 1.00 | 1.01 | 1.01   | 1.02 | 1.07 |
| medium    | rpact     | 0.98 | 0.99 | 1.00 | 1.00   | 1.00 | 1.00 |
| low       | gsdesign2 | 0.85 | 1.00 | 1.21 | 1.00   | 1.07 | 5.44 |
| low       | nquery    | 0.98 | 1.00 | 1.26 | 1.01   | 1.13 | 2.84 |
| low       | rashnu    | 0.96 | 1.00 | 1.09 | 1.00   | 1.03 | 1.87 |
| low       | rpact     | 0.78 | 1.00 | 1.12 | 1.00   | 1.00 | 5.44 |

Red (in the original colored table): fewer than 50% of N-ratios within ±10% of East.


Interactive results

You can explore all results interactively, filter designs, compare methods, and inspect individual cases in the interactive section of the website.

➡️ “Interactive results”