Two-Arm Fixed Design with a Binary Endpoint
Design space explored
All comparisons were performed for a two-arm, fixed-design randomized clinical trial with a binary primary endpoint. The objective was to compare sample size calculations across commonly used software implementations over a broad but controlled design space, assuming a test for a difference in proportions under a pooled-variance framework.
Varying design parameters
Four key design parameters were varied systematically:
- Type I error rate (\(\alpha\), two-sided): \(\alpha \in \{0.01,\ 0.05,\ 0.10,\ 0.20,\ 0.49\}\)
- Power (\(1 - \beta\)): \(1 - \beta \in \{0.51,\ 0.80,\ 0.90,\ 0.99\}\)
- Control-group response rate: \(\pi_c \in \{0.10,\ 0.30,\ 0.50,\ 0.80,\ 0.90\}\)
- Treatment effect (difference in proportions): \(\Delta\pi = \pi_e - \pi_c \in \{0.05,\ 0.15,\ 0.25,\ 0.49\}\)
The experimental-group response rate was derived as
\(\pi_e = \pi_c + \Delta\pi\),
with scenarios resulting in \(\pi_e > 1\) discarded.
The full factorial combination yields 400 parameter sets; discarding those with \(\pi_e > 1\) leaves 300 valid design scenarios.
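Under these definitions the design grid can be enumerated directly; a minimal Python sketch using the parameter lists above:

```python
from itertools import product

# Parameter grid from the design space above
alphas = [0.01, 0.05, 0.10, 0.20, 0.49]
powers = [0.51, 0.80, 0.90, 0.99]
pi_c_values = [0.10, 0.30, 0.50, 0.80, 0.90]
deltas = [0.05, 0.15, 0.25, 0.49]

# Keep only scenarios where pi_e = pi_c + delta does not exceed 1
scenarios = [
    (alpha, power, pi_c, delta)
    for alpha, power, pi_c, delta in product(alphas, powers, pi_c_values, deltas)
    if pi_c + delta <= 1
]
print(len(scenarios))  # 300 valid scenarios out of 400 combinations
```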
Fixed design assumptions
To isolate differences due purely to the statistical method or software implementation, the following characteristics of the design were held constant across scenarios:
- Allocation ratio (experimental : control): 1 : 1
- Test sidedness: two-sided
Relevancy classification of design scenarios
To help interpret method agreement across the full parameter grid, each scenario was classified a priori according to its practical relevance.
The classification follows a rule‑based system determined solely by the Type I error rate and power:
High relevance
Scenarios with
\[ 0.02 < \alpha < 0.15 \quad\text{and}\quad \text{power} > 0.75, \]
representing operating characteristics commonly used in realistic confirmatory settings.
Medium relevance
Scenarios not meeting the criteria for “high” or “low.” These correspond to plausible but less conventional values of \(\alpha\) and power used in exploratory or constrained designs.
Low relevance
Scenarios with
\[ \alpha > 0.20 \quad\text{or}\quad \text{power} < 0.60 \quad\text{or}\quad \text{power} \ge 0.99. \]
These configurations are rarely used operationally (e.g., extremely liberal \(\alpha\) or near-perfect power), but are retained to evaluate numerical behavior in challenging cases.
The classification does not filter scenarios nor influence computation; it simply contextualizes discrepancies between methods across a wide design space.
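The rules above translate directly into a small function; a Python sketch, assuming the “low” rule takes precedence when a scenario satisfies both “low” and “high” (e.g. \(\alpha = 0.05\) with power \(= 0.99\)):

```python
def classify_relevance(alpha: float, power: float) -> str:
    """Rule-based relevancy label from Type I error rate and power.

    Assumption: the 'low' rule wins when a scenario matches both
    'low' and 'high' (e.g. a conventional alpha with power >= 0.99).
    """
    if alpha > 0.20 or power < 0.60 or power >= 0.99:
        return "low"
    if 0.02 < alpha < 0.15 and power > 0.75:
        return "high"
    return "medium"

print(classify_relevance(0.05, 0.90))  # high
print(classify_relevance(0.20, 0.80))  # medium (alpha outside (0.02, 0.15), not low)
print(classify_relevance(0.49, 0.51))  # low
```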
Scope of the comparison
For each of the 300 design scenarios, all software tools and methods were used to compute the required total sample size to achieve the target power under the specified proportions.
The design space was intentionally chosen to span both conventional trial settings and extreme configurations, allowing a comprehensive assessment of agreement, divergence, and stability across methods.
Z-pooled
Methods compared
- East: Difference of Proportions test [PN-2S-DI]
- nQuery: PTT36 / Inequality Tests for Difference of Two Proportions
- rpact: rpact::getSampleSizeRates()
Results
Overall: for two‑arm binary fixed‑design trials using the Z‑pooled test, nQuery and rpact reproduce East’s sample sizes with near‑perfect consistency, with meaningful differences only in edge‑case, low‑practicality scenarios.
- In high- and medium-relevancy designs, both nQuery and rpact show excellent agreement with East, with N-ratios consistently centered around 1.00 and virtually no dispersion.
- Even in the low-relevancy region, the two methods remain very close to 1.00. A few isolated deviations appear, but they correspond to designs with extreme proportions or highly unbalanced variance assumptions.
- The plot confirms this stability: points are tightly clustered at 1.00 across the full range of East sample sizes, with only a small number of outliers in the low‑relevancy panel.
- No systematic bias is visible for either method, and neither method shows drift with increasing sample size.
N-Ratio, 2-arm binary, pooled variance (relative to East sample sizes)

| Relevancy | Method | Min | Q1 | Mean | Median | Q3 | Max |
|---|---|---|---|---|---|---|---|
| high | nquery | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.01 |
| high | rpact | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| medium | nquery | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| medium | rpact | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| low | nquery | 0.88 | 1.00 | 1.00 | 1.00 | 1.00 | 1.11 |
| low | rpact | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
Exact
Methods compared
- East: Difference of Proportions test [PN-2S-DI]
- nQuery: PTT36 / Inequality Tests for Difference of Two Proportions
- bbssr: bbssr::BinarySampleSize(test = "Fisher")
Results
📌 Overall: for exact tests in two-arm binary designs, bbssr aligns well with nQuery, while East produces consistently smaller sample sizes.
- Using nQuery as the reference, bbssr stays reasonably close across all relevancy levels, with N-ratios typically fluctuating around 1.00 and stabilising as sample sizes increase.
- East, however, shows systematic underestimation relative to nQuery.
- In high- and medium-relevancy scenarios, bbssr converges smoothly toward the reference, while East remains consistently below 1.00.
- In the low-relevancy region, both methods show more dispersion, but the general pattern remains unchanged: bbssr ≈ nQuery, while East < nQuery (a structural methodological difference).
- These differences arise from the underlying exact-test implementations; see “What Do East and nQuery Compute Exactly?” for details.
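To illustrate what an exact calculation involves, the power of the conditional (Fisher) exact test can be computed by enumerating both binomial margins. A pure-Python sketch of this approach follows; bbssr, East, and nQuery each use their own implementations, and differences in details such as the two-sided p-value rule are exactly where discrepancies can enter:

```python
from math import comb

def fisher_p(x_e: int, x_c: int, n: int) -> float:
    """Two-sided Fisher exact p-value for x_e/n vs x_c/n successes:
    sum of hypergeometric probabilities no larger than the observed one."""
    m = x_e + x_c                       # fixed success margin

    def hyper(k: int) -> float:         # P(X_e = k | margins)
        return comb(n, k) * comb(n, m - k) / comb(2 * n, m)

    p_obs = hyper(x_e)
    return sum(hyper(k) for k in range(max(0, m - n), min(n, m) + 1)
               if hyper(k) <= p_obs * (1 + 1e-7))  # tolerance for float ties

def fisher_power(n: int, pi_c: float, pi_e: float,
                 alpha: float = 0.05) -> float:
    """Exact power with n subjects per arm, by full enumeration
    of all (x_c, x_e) outcome pairs under the two binomials."""
    def pmf(k: int, p: float) -> float:
        return comb(n, k) * p**k * (1 - p) ** (n - k)

    return sum(pmf(x_c, pi_c) * pmf(x_e, pi_e)
               for x_c in range(n + 1) for x_e in range(n + 1)
               if fisher_p(x_e, x_c, n) <= alpha)

print(round(fisher_power(30, 0.30, 0.60), 3))
```

The exact sample size is then the smallest n whose enumerated power reaches the target; because attainable power is a step function of n, small implementation differences can shift the resulting n noticeably.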
N-Ratio, 2-arm binary, exact test (relative to nQuery sample sizes)

| Relevancy | Method | Min | Q1 | Mean | Median | Q3 | Max |
|---|---|---|---|---|---|---|---|
| high | bbssr | 0.81 | 0.97 | 0.98 | 0.99 | 1.00 | 1.05 |
| high | east | 0.64 | 0.86 | 0.90 | 0.91 | 0.95 | 1.00 |
| medium | bbssr | 0.80 | 0.97 | 0.98 | 0.99 | 1.00 | 1.03 |
| medium | east | 0.60 | 0.87 | 0.92 | 0.94 | 0.99 | 1.07 |
| low | bbssr | 0.79 | 0.97 | 0.98 | 0.99 | 1.00 | 1.05 |
| low | east | 0.57 | 0.83 | 0.89 | 0.92 | 0.97 | 1.07 |

Red: < 50% of N-ratios within ±10%.
Interactive results
You can explore all results interactively, filter designs, compare methods, and inspect individual cases in the interactive section of the website.