Two-Arm Fixed Design with Binary Endpoints

Design space explored

All comparisons were performed for a two-arm, fixed-design randomized clinical trial with a binary primary endpoint. The objective was to compare sample size calculations across commonly used software implementations over a broad but controlled design space, assuming a test for a difference in proportions under a pooled-variance framework.

Varying design parameters

Four key design parameters were varied systematically:

  • Type I error rate (\(\alpha\), two-sided):
    \(\alpha \in \{0.01,\ 0.05,\ 0.10,\ 0.20,\ 0.49\}\)

  • Power (\(1 - \beta\)):
    \(1 - \beta \in \{0.51,\ 0.80,\ 0.90,\ 0.99\}\)

  • Control-group response rate:
    \(\pi_c \in \{0.10,\ 0.30,\ 0.50,\ 0.80,\ 0.90\}\)

  • Treatment effect (difference in proportions):
    \(\Delta\pi = \pi_e - \pi_c \in \{0.05,\ 0.15,\ 0.25,\ 0.49\}\)

The experimental-group response rate was derived as
\(\pi_e = \pi_c + \Delta\pi\),
with scenarios resulting in \(\pi_e > 1\) discarded.

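The full factorial combination contains 5 × 4 × 5 × 4 = 400 scenarios. Discarding those with \(\pi_e > 1\) leaves 15 valid \((\pi_c, \Delta\pi)\) pairs, and therefore 5 × 4 × 15 = 300 valid design scenarios.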
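The scenario count can be checked by enumerating the grid. The Python sketch below is an illustration only: the tooling used for the study is not shown in this document, and the names `build_scenarios`, `ALPHAS`, `POWERS`, `PI_C` and `DELTAS` are hypothetical. It builds the full factorial combination and discards scenarios with \(\pi_e > 1\).

```python
from itertools import product

# Parameter values as listed in the design space above.
ALPHAS = [0.01, 0.05, 0.10, 0.20, 0.49]
POWERS = [0.51, 0.80, 0.90, 0.99]
PI_C = [0.10, 0.30, 0.50, 0.80, 0.90]
DELTAS = [0.05, 0.15, 0.25, 0.49]


def build_scenarios():
    """Return all valid (alpha, power, pi_c, pi_e) tuples."""
    scenarios = []
    for alpha, power, pi_c, delta in product(ALPHAS, POWERS, PI_C, DELTAS):
        pi_e = round(pi_c + delta, 2)  # round away floating-point noise
        if pi_e > 1:  # impossible response rate: discard
            continue
        scenarios.append((alpha, power, pi_c, pi_e))
    return scenarios


full_grid_size = len(ALPHAS) * len(POWERS) * len(PI_C) * len(DELTAS)
scenarios = build_scenarios()
print(full_grid_size, len(scenarios))  # 400 300
```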

Fixed design assumptions

To isolate differences due purely to the statistical method or software implementation, the following characteristics of the design were held constant across scenarios:

  • Allocation ratio (experimental : control): 1 : 1
  • Test sidedness: two-sided

Relevancy classification of design scenarios

To help interpret method agreement across the full parameter grid, each scenario was classified a priori according to its practical relevance.
The classification follows a rule‑based system determined solely by the Type I error rate and power:

  • High relevance
    Scenarios with \[ 0.02 < \alpha < 0.15 \quad\text{and}\quad \text{power} > 0.75, \] representing operating characteristics commonly used in realistic confirmatory settings.

  • Medium relevance
    Scenarios not meeting the criteria for “high” or “low.”
    These correspond to plausible but less conventional values of α and power used in exploratory or constrained designs.

  • Low relevance
    Scenarios with \[ \alpha > 0.20 \quad\text{or}\quad \text{power} < 0.60 \quad\text{or}\quad \text{power} \ge 0.99. \] These configurations are rarely used operationally (e.g., extremely liberal α or near‑perfect power), but are retained to evaluate numerical behavior in challenging cases.

The classification does not filter scenarios nor influence computation; it simply contextualizes discrepancies between methods across a wide design space.
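The rule set can be written as a small function. The sketch below rests on one assumption that the text does not settle: a scenario such as \(\alpha = 0.05\) with power 0.99 satisfies both the "high" and the "low" criteria as written, and the "low" rule is checked first here so that such scenarios are classified as low. The function name `classify_relevancy` is hypothetical.

```python
def classify_relevancy(alpha, power):
    """Classify a scenario as 'high', 'medium' or 'low' relevance.

    Assumption: the 'low' rule takes precedence over the 'high' rule,
    which matters only when power >= 0.99 and 0.02 < alpha < 0.15.
    """
    if alpha > 0.20 or power < 0.60 or power >= 0.99:
        return "low"
    if 0.02 < alpha < 0.15 and power > 0.75:
        return "high"
    return "medium"


examples = [(0.05, 0.80), (0.01, 0.90), (0.20, 0.80), (0.05, 0.99), (0.49, 0.90)]
for alpha, power in examples:
    print(alpha, power, classify_relevancy(alpha, power))
```

Under this precedence, \(\alpha = 0.01\) and \(\alpha = 0.20\) fall into the medium class whenever power is 0.80 or 0.90, because neither value meets the "high" bounds or the "low" threshold.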

Scope of the comparison

For each of the 300 design scenarios, all software tools and methods were used to compute the required total sample size to achieve the target power under the specified proportions.

The design space was intentionally chosen to span both conventional trial settings and extreme configurations, allowing a comprehensive assessment of agreement, divergence, and stability across methods.

Z-pooled

Methods compared

  • East: Difference of Proportions test [PN-2S-DI]
  • nQuery: PTT36 / Inequality Tests for Difference of Two Proportions
  • rpact::getSampleSizeRates()
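For orientation, the textbook normal-approximation sample size for a two-sided pooled-variance Z test with 1:1 allocation can be sketched as follows. This is a generic formula, not a confirmed description of what East, nQuery or rpact compute internally. The function name `n_per_arm_z_pooled` is hypothetical, and rounding up per arm is an assumption, since rounding conventions may differ between tools.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_arm_z_pooled(pi_c, pi_e, alpha, power):
    """Per-arm sample size, two-sided pooled-variance Z test, 1:1 allocation.

    Textbook normal approximation; not necessarily the algorithm
    implemented by any of the tools compared here.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    pi_bar = (pi_c + pi_e) / 2  # pooled proportion used under H0
    sd_h0 = sqrt(2 * pi_bar * (1 - pi_bar))  # pooled variance term
    sd_h1 = sqrt(pi_c * (1 - pi_c) + pi_e * (1 - pi_e))  # unpooled term
    n = ((z_alpha * sd_h0 + z_beta * sd_h1) / (pi_e - pi_c)) ** 2
    return ceil(n)  # assumption: round up per arm


n_arm = n_per_arm_z_pooled(pi_c=0.30, pi_e=0.45, alpha=0.05, power=0.80)
print("per arm:", n_arm, "total:", 2 * n_arm)
```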

Results

Overall: for two‑arm binary fixed‑design trials using the Z‑pooled test, nQuery and rpact reproduce East’s sample sizes with near‑perfect consistency, with meaningful differences only in edge‑case, low‑practicality scenarios.

  • In high‑ and medium‑relevancy designs, both nQuery and rpact show excellent agreement with East, with N‑ratios consistently centered around 1.00 and virtually no dispersion.
  • Even in the low‑relevancy region, N‑ratios remain centered on 1.00. rpact matches East to two decimal places throughout, while nQuery shows a few isolated deviations (N‑ratios from 0.88 to 1.11), which correspond to designs with extreme proportions or highly unbalanced variance assumptions.
  • The plot confirms this stability: points are tightly clustered at 1.00 across the full range of East sample sizes, with only a small number of outliers in the low‑relevancy panel.
  • No systematic bias is visible for either method, and neither method shows drift with increasing sample size.

N-Ratio 2-Arms Binary, pooled-variance
According to East sample sizes

Relevancy   Method   Min    Q1     Mean   Median   Q3     Max
high        nquery   0.99   1.00   1.00   1.00     1.00   1.01
high        rpact    1.00   1.00   1.00   1.00     1.00   1.00
medium      nquery   0.99   1.00   1.00   1.00     1.00   1.00
medium      rpact    1.00   1.00   1.00   1.00     1.00   1.00
low         nquery   0.88   1.00   1.00   1.00     1.00   1.11
low         rpact    1.00   1.00   1.00   1.00     1.00   1.00
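
The summary columns above can be reproduced from per-scenario sample sizes along the following lines. This sketch assumes that an N-ratio is a method's sample size divided by the reference sample size for the same scenario, and that quartiles use linear interpolation; neither detail is stated in this document. The function names and the toy numbers are hypothetical and are not study results.

```python
from statistics import fmean, median, quantiles


def n_ratio(n_method, n_reference):
    """Assumed definition: method sample size over reference sample size."""
    return n_method / n_reference


def summarise(ratios):
    """Min, quartiles, mean and max of a list of N-ratios."""
    # Assumption: 'inclusive' quartiles (linear interpolation between points).
    q1, _, q3 = quantiles(ratios, n=4, method="inclusive")
    return {
        "Min": min(ratios),
        "Q1": q1,
        "Mean": fmean(ratios),
        "Median": median(ratios),
        "Q3": q3,
        "Max": max(ratios),
    }


# Hypothetical toy sample sizes, for illustration only.
reference_n = [100, 200, 400, 800]
method_n = [100, 202, 400, 796]
ratios = [n_ratio(m, r) for m, r in zip(method_n, reference_n)]
print({name: round(value, 2) for name, value in summarise(ratios).items()})
```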

Exact

Methods compared

  • East: Difference of Proportions test [PN-2S-DI]
  • nQuery: PTT36 / Inequality Tests for Difference of Two Proportions
  • bbssr::BinarySampleSize(test = "Fisher")
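
To make concrete what an exact sample size calculation involves, the sketch below computes the unconditional power of a two-sided Fisher exact test by enumerating all outcomes of two binomial arms, then searches for the smallest per-arm sample size that reaches the target power. It is a generic illustration under stated assumptions, not a description of what East, nQuery or bbssr implement. The two-sided p-value rule (summing all tables no more likely than the observed one), the 1:1 allocation, and all function names are assumptions.

```python
from math import comb


def fisher_p_two_sided(x1, x2, n1, n2):
    """Two-sided Fisher p-value: total probability of all tables with the
    observed margins that are no more likely than the observed table."""
    m = x1 + x2
    lo, hi = max(0, m - n2), min(n1, m)
    total = comb(n1 + n2, m)
    probs = [comb(n1, k) * comb(n2, m - k) / total for k in range(lo, hi + 1)]
    p_obs = probs[x1 - lo]
    # Small relative tolerance guards against floating-point ties.
    return min(1.0, sum(p for p in probs if p <= p_obs * (1 + 1e-7)))


def binom_pmf(k, n, p):
    """Binomial probability of k responders among n patients."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def exact_power(n, pi_c, pi_e, alpha):
    """Unconditional rejection probability with n patients per arm."""
    power = 0.0
    for x_e in range(n + 1):
        for x_c in range(n + 1):
            if fisher_p_two_sided(x_e, x_c, n, n) <= alpha:
                power += binom_pmf(x_e, n, pi_e) * binom_pmf(x_c, n, pi_c)
    return power


def smallest_n_per_arm(pi_c, pi_e, alpha, target_power, n_max=200):
    """First n whose exact power reaches the target (None if not found)."""
    for n in range(2, n_max + 1):
        if exact_power(n, pi_c, pi_e, alpha) >= target_power:
            return n
    return None


n_exact = smallest_n_per_arm(pi_c=0.10, pi_e=0.59, alpha=0.05, target_power=0.80)
print("per arm:", n_exact)
```

Exact power is not monotone in the sample size, so "the first n that reaches the target" and "the smallest n beyond which power stays above the target" can give different answers. Choices of this kind, together with the definition of the two-sided p-value, are examples of conventions that can differ between implementations; this sketch does not establish which ones account for the differences reported here.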

Results

Overall: for exact tests in two‑arm binary designs, bbssr aligns well with nQuery (median N‑ratio 0.99 at every relevancy level), while East typically produces smaller sample sizes (median N‑ratio 0.91 to 0.94).

  • Using nQuery as the reference, bbssr stays reasonably close across all relevancy levels, with N‑ratios typically fluctuating around 1.00 and stabilising as sample sizes increase.
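  • East, however, yields systematically smaller sample sizes than nQuery, with mean N‑ratios between 0.89 and 0.92.
  • In high‑ and medium‑relevancy scenarios, bbssr converges smoothly toward the reference, while East stays at or below 1.00 in the high‑relevancy scenarios and only occasionally exceeds it in the medium‑relevancy ones (maximum 1.07).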
  • In the low‑relevancy region, both methods show more dispersion, but the general pattern remains unchanged:
    • bbssr ≈ nQuery
    • East < nQuery (structural methodological difference)
  • These differences arise from the underlying exact‑test implementations; see “What Do East and nQuery Compute Exactly?” for details.

N-Ratio 2-Arms Binary, exact test
According to nQuery sample sizes

Relevancy   Method   Min    Q1     Mean   Median   Q3     Max
high        bbssr    0.81   0.97   0.98   0.99     1.00   1.05
high        east     0.64   0.86   0.90   0.91     0.95   1.00
medium      bbssr    0.80   0.97   0.98   0.99     1.00   1.03
medium      east     0.60   0.87   0.92   0.94     0.99   1.07
low         bbssr    0.79   0.97   0.98   0.99     1.00   1.05
low         east     0.57   0.83   0.89   0.92     0.97   1.07

Red: fewer than 50% of N-ratios within ±10%
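
The colour rule in the legend can be expressed as a simple check. The sketch assumes that "±10%" means an N-ratio between 0.90 and 1.10 inclusive, and that a group is flagged red when fewer than half of its N-ratios fall in that band. The function names and the toy values are hypothetical and are not study results.

```python
def share_within(ratios, tolerance=0.10):
    """Fraction of N-ratios inside 1 +/- tolerance (bounds inclusive)."""
    # The tiny slack keeps boundary values such as 1.10 inside the band
    # despite floating-point rounding.
    inside = [r for r in ratios if abs(r - 1.0) <= tolerance + 1e-12]
    return len(inside) / len(ratios)


def is_flagged_red(ratios, tolerance=0.10, min_share=0.50):
    """Assumed rule: red when fewer than min_share of ratios are in band."""
    return share_within(ratios, tolerance) < min_share


toy_ratios = [0.80, 0.85, 0.95, 1.00]  # hypothetical values
print(share_within(toy_ratios), is_flagged_red(toy_ratios))
```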

Interactive results

You can explore all results interactively, filter designs, compare methods, and inspect individual cases in the interactive section of the website.

➡️ “Interactive results”