A health system implements a new sepsis protocol across 20 hospitals. A researcher plans to evaluate effectiveness using a stepped-wedge cluster randomized design in which hospitals sequentially adopt the protocol every 3 months. She calculates the sample size based on individual patient outcomes (mortality), arriving at 2,000 patients total. The biostatistician identifies a critical error. Evaluate what modification is needed.
A 41-year-old research fellow designs a non-inferiority trial comparing oral to IV antibiotics for osteomyelitis. She sets the non-inferiority margin at 10% (cure rate difference), expects 85% cure in both groups, and calculates 300 patients per arm for 80% power with α=0.025 (one-sided). Her mentor suggests this underestimates required sample size. Evaluate the mentor's concern.
A pharmaceutical company tests a new antidepressant in 500 patients (250 per arm) and finds a 2-point improvement on a 52-point depression scale compared to placebo (p=0.04). The study was originally powered to detect a 4-point difference. The company seeks FDA approval citing statistical significance. Analyze the regulatory and scientific implications.
A meta-analysis of 5 previous trials testing a surgical technique shows a pooled effect size of 15% complication reduction (from 20% to 17%, p=0.30, I²=0%). An investigator wants to design a definitive trial. She calculates that 1,200 patients per arm would provide 80% power to detect this 3% absolute difference. Analyze whether this sample size is justified.
A 52-year-old oncologist designs a trial comparing chemotherapy regimens for pancreatic cancer. She plans 60 patients per arm for 80% power to detect a 3-month improvement in median survival (from 9 to 12 months) with α=0.05. After IRB approval, a competing trial publishes results showing the control regimen actually achieves 11-month median survival. Apply the appropriate modification to the study.
A researcher completes a pilot study of 30 patients testing a new cognitive behavioral therapy for PTSD. The intervention group (n=15) showed a mean improvement of 12 points on a PTSD symptom scale versus 5 points in controls (SD=8, p=0.02). She plans a larger trial and uses the pilot data to calculate that 64 patients per group will provide 80% power. Analyze the major flaw in this approach.
A 38-year-old epidemiologist is designing a study to detect a rare adverse effect of a new vaccine. The background rate of Guillain-Barré syndrome (GBS) is 1 per 100,000 person-years. She wants 80% power to detect a doubling of risk (RR=2.0) with α=0.05. The calculated sample size requires 2.4 million participants. Apply the most feasible alternative study design.
A cardiologist plans a trial comparing two anticoagulants for stroke prevention in atrial fibrillation. Historical data shows annual stroke rates of 5% with warfarin. She wants to detect a 40% relative risk reduction (to 3% annually) with 90% power and α=0.05. The sample size calculation yields 1,200 patients per arm. After 6 months of recruitment, only 400 patients total have enrolled. Evaluate the most appropriate strategy to salvage the study.
A 45-year-old physician is reviewing a published RCT that tested a new diabetes medication. The study reported 85% power to detect a 0.5% reduction in HbA1c with p<0.05. The results showed a 0.4% reduction (p=0.08). The pharmaceutical company claims the drug would be significant with a larger sample. Analyze the validity of this claim.
A researcher is designing a clinical trial to compare a new antihypertensive medication to placebo. She wants to detect a 10 mmHg difference in systolic blood pressure with 80% power and α = 0.05. The standard deviation of systolic BP in the population is 15 mmHg. She calculates that 36 patients per group are needed. The funding agency can only support 50 total patients. Apply the appropriate modification to the study design.
Explanation:

***Account for intra-cluster correlation coefficient (ICC) requiring substantial sample size inflation***
- In cluster-randomized designs, observations within the same cluster (hospital) are not independent; the **intra-cluster correlation coefficient (ICC)** quantifies this correlation and is used to compute a **design effect**.
- Neglecting the ICC produces an **underpowered study** because the effective sample size is smaller than the total number of patients measured.

*Adjust for multiple time periods using Bonferroni correction*
- The **Bonferroni correction** controls **Type I error** across multiple independent hypothesis tests; it is not a tool for determining sample size in nested longitudinal designs.
- Although the stepped-wedge design involves multiple time points, the primary analysis typically uses a **single model** (e.g., GEE or GLMM) with time as a fixed effect.

*Use hospital-level outcomes instead of patient-level outcomes as unit of analysis*
- Although the hospital is the **unit of randomization**, collapsing the data to hospital-level means discards patient-level variation, causing a substantial loss of **statistical information** and precision.
- Modern biostatistical practice uses **multilevel modeling** to retain the richness of patient-level data while accounting for cluster-level randomization.

*Include random effects for both hospital and time period in power calculation*
- Random effects matter in the **analysis phase**, but the "critical error" in the vignette is the initial failure to inflate the sample size for **clustering (ICC)**.
- Power calculations for stepped-wedge designs are complex and do involve time parameters, but **ICC-based inflation** is the most fundamental adjustment when moving from individual to cluster randomization.

*Increase alpha to 0.10 to account for cluster randomization reducing power*
- Raising the **alpha level** (significance threshold) is not a standard or scientifically acceptable way to compensate for power lost to **clustering**.
- Standard practice keeps **alpha at 0.05** and instead increases the **sample size** or the number of clusters to reach the desired power (usually 80-90%).
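The arithmetic behind the correct answer can be made concrete. Below is a minimal Python sketch of the standard design-effect inflation for a parallel cluster-randomized design; the ICC of 0.05 and the cluster size of 100 patients per hospital are illustrative assumptions, not values given in the vignette. Stepped-wedge designs use a more elaborate correction (the Hussey-Hughes formulation), but the ICC-driven principle is the same.

```python
def design_effect(cluster_size: int, icc: float) -> float:
    """Standard design effect for equal cluster sizes: DEFF = 1 + (m - 1) * ICC."""
    return 1 + (cluster_size - 1) * icc

# Illustrative assumptions (not from the vignette): 20 hospitals with
# 100 patients each (2,000 total) and an assumed ICC of 0.05 for mortality.
n_total = 2_000
m = 100        # patients per hospital
icc = 0.05     # assumed intra-cluster correlation

deff = design_effect(m, icc)
print(f"Design effect: {deff:.2f}")                        # 5.95
print(f"Effective sample size: {n_total / deff:.0f}")      # ~336 of 2,000
print(f"ICC-inflated total needed: {n_total * deff:.0f}")  # ~11,900
```

Even a modest ICC can shrink an apparently large trial to a few hundred effectively independent observations, which is exactly why ignoring it is the "critical error."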
Explanation:

***Correct; non-inferiority trials require larger samples than superiority trials for equivalent power***
- **Non-inferiority trials** must exclude a difference greater than a pre-specified margin, which typically demands a **larger sample size** than a superiority trial of the same outcome, because the margin is usually tighter than the effect size a superiority trial targets.
- Because the goal is to show the new treatment is "not unacceptably worse" (rather than "better"), the **statistical threshold** often necessitates higher enrollment to achieve adequate **power**.

*Incorrect; the calculation appropriately uses one-sided alpha for non-inferiority testing*
- It is true that **non-inferiority testing** conventionally uses a **one-sided alpha of 0.025**, but this does not negate the fact that such trials inherently require more participants.
- The mentor's concern is about the **total N**, which remains insufficient despite the correct one-sided alpha convention.

*Correct; the margin should be set at 5% requiring doubling of sample size*
- There is no universal rule that the **non-inferiority margin** must be 5%; it is set by **clinical judgment** and regulatory standards for the specific condition.
- A 5% margin would increase the sample size roughly fourfold, not twofold, since n scales with 1/margin²; in any case, a 10% margin is common in **antibiotic trials** for osteomyelitis.

*Incorrect; non-inferiority trials actually require smaller samples due to less stringent hypotheses*
- This is a common misconception; non-inferiority trials are more demanding because the **null hypothesis** assumes the new treatment is inferior.
- Excluding **inferiority** within a tight **margin (delta)** generally requires more patients than demonstrating superiority over placebo.

*Correct; dropout rates in antibiotic trials necessitate 20% inflation of calculated sample size*
- **Attrition** is a real concern, but there is no fixed rule that every trial needs a **20% inflation** factor.
- The mentor's concern targets the **base calculation** and the statistical nature of non-inferiority designs, not merely the **dropout rate**.
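A rough sense of the numbers helps. The sketch below uses a standard normal-approximation formula for a two-proportion non-inferiority comparison. The zero-true-difference assumption in the first two calls is the optimistic textbook default; the third call illustrates why the mentor is right to worry, since even a small true deficit in the new arm sharply inflates the required n. This is a sketch under those assumptions, not a replication of the fellow's calculation.

```python
from statistics import NormalDist
from math import ceil

z = NormalDist().inv_cdf

def ni_n_per_arm(p_ctrl: float, p_new: float, margin: float,
                 alpha: float = 0.025, power: float = 0.80) -> int:
    """Per-arm n for a two-proportion non-inferiority test
    (normal approximation; H0: new arm worse than control by >= margin)."""
    za, zb = z(1 - alpha), z(power)
    var = p_ctrl * (1 - p_ctrl) + p_new * (1 - p_new)
    return ceil((za + zb) ** 2 * var / (margin - (p_ctrl - p_new)) ** 2)

print(ni_n_per_arm(0.85, 0.85, margin=0.10))  # ~201/arm: optimistic floor
print(ni_n_per_arm(0.85, 0.85, margin=0.05))  # ~801/arm: halving the margin ~quadruples n
print(ni_n_per_arm(0.85, 0.83, margin=0.10))  # ~330/arm: a 2% true deficit inflates n
```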
Explanation:

***Approval not warranted; observed effect is smaller than pre-specified clinically meaningful difference***
- Although the result is **statistically significant** (p=0.04), the observed 2-point improvement is only half of the **pre-specified 4-point difference** deemed clinically relevant.
- Regulatory bodies such as the **FDA** weigh **clinical significance** alongside p-values, requiring that a drug provide a meaningful benefit to patients.

*Approval warranted; the study achieved statistical significance with adequate power*
- Statistical significance does not by itself justify approval if the **effect size** is too small to provide a real therapeutic advantage.
- Being **powered** for a 4-point difference means the study was designed to reliably detect a clinically meaningful effect; observing a smaller effect suggests limited clinical utility, not regulatory success.

*Approval not warranted; the study was underpowered for the observed effect size*
- A study that yields a significant result (p < 0.05) was, by definition, adequately powered to detect the effect it observed in that sample.
- The issue is not **power** or sample size but the **magnitude of effect** failing to meet the pre-defined target for clinical relevance.

*Approval warranted if sensitivity analyses confirm robustness of findings*
- **Sensitivity analyses** confirm that results are not driven by outliers or analytic choices; they cannot transform a **clinically trivial** difference into a meaningful one.
- Even a robust, consistent 2-point difference remains below the 4-point **minimum clinically important difference (MCID)**.

*Approval warranted; post-hoc power analysis shows adequate power for 2-point difference*
- **Post-hoc power analysis** is generally considered scientifically flawed and redundant once the **p-value** is known.
- Demonstrating power for a 2-point difference does not erase the fact that the drug failed to meet the **efficacy threshold** the researchers defined at the outset.
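A back-of-envelope check shows why the 2-point result cannot be rescued. Assuming the reported p-value came from a two-sided z-test (an assumption, since the vignette does not state the analysis), the implied 95% confidence interval sits entirely below the 4-point target:

```python
from statistics import NormalDist

nd = NormalDist()

diff, p = 2.0, 0.04                 # reported effect and p-value
z_obs = nd.inv_cdf(1 - p / 2)       # ~2.05, assuming a two-sided z-test
se = diff / z_obs                   # implied standard error ~0.97
lo, hi = diff - 1.96 * se, diff + 1.96 * se
print(f"95% CI: ({lo:.2f}, {hi:.2f})")  # ~(0.09, 3.91): entirely below the 4-point MCID
```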
Explanation:

***Not justified; a 3% absolute reduction lacks clinical significance for most surgical outcomes***
- A **3% absolute risk reduction** (from 20% to 17%) might be statistically detectable, but it is often too small to justify the **cost, risks, and resources** of adopting a new surgical technique.
- The investigator should first ask whether the target effect meets the **minimal clinically important difference (MCID)**; designing a massive trial to find a tiny effect is often a poor use of resources.

*Justified; the meta-analysis provides the best estimate of true effect size*
- Meta-analyses sit high in the evidence hierarchy, but a **p-value of 0.30** means the pooled effect is not statistically significant and may reflect **random chance**.
- Powering a large trial on a non-significant, potentially **spurious effect size** creates a high risk of a **futile study**.

*Justified only if cost-effectiveness analysis supports the intervention*
- Cost-effectiveness is a secondary consideration that follows the determination of **clinical efficacy and safety**.
- Even a cost-effective intervention does not justify a trial whose **sample size calculation** rests on statistically unreliable (p=0.30) data.

*Not justified; the high I² indicates substantial heterogeneity making pooled estimates unreliable*
- This is factually incorrect: the vignette states **I²=0%**, which indicates **no observed statistical heterogeneity** among the trials.
- **I²=0%** means the trials' results were consistent with one another, even though the pooled estimate remained non-significant.

*Justified; the non-significant p-value indicates need for larger, definitive trial*
- A non-significant meta-analytic p-value does not automatically mandate a larger trial; it means the **null hypothesis** could not be rejected.
- Planning a trial around a **3% difference** that failed to reach significance (p=0.30) ignores the real possibility that the **true effect size** is zero.
Explanation:

***Increase sample size to detect smaller absolute difference (11 to 14 months)***
- Because the **control regimen** performs better than expected (11 months rather than 9), detecting the planned 3-month survival benefit now corresponds to a smaller **relative effect** (a hazard ratio closer to 1), which requires a larger cohort.
- The **required sample size** is roughly inversely proportional to the square of the **effect size**; the same 3-month absolute difference on a higher baseline is a smaller relative difference, so a **larger sample size** is needed to maintain 80% power. (See the sketch below.)

*Continue as planned since the study is already approved and funded*
- Proceeding without modification would leave the study **underpowered**, substantially raising the risk of a **Type II error** (failing to detect a true difference).
- Ethical clinical research requires a design with sufficient **statistical validity** to answer its primary question, which N=60 per arm no longer provides.

*Change control arm to the newly published superior regimen*
- The control arm's expected performance changed, but the **investigational regimen** remains the same; replacing the control entirely would answer a different study question.
- The goal is still to compare the **oncologist's specific regimen** against the established control; the appropriate fix is adjusting the **sample size** for the control's updated baseline performance.

*Add a third arm using the newly published regimen*
- A third arm would dilute **statistical power**, introduce **multiple comparisons**, and require even more patients than a two-arm adjustment.
- It would substantially change the **study design** and scope without addressing the immediate statistical deficiency of the primary comparison.

*Terminate the study as equipoise no longer exists for the original control*
- **Equipoise** persists: it remains unknown whether the new regimen beats the 11-month survival benchmark, so the study is still relevant and needs only **statistical recalibration**.
- Termination is reserved for situations where a treatment is proven **harmful** or so clearly **inferior** to standard of care that randomization becomes unethical.
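To see why the revised baseline demands more patients, the sketch below applies the Schoenfeld approximation for log-rank tests, assuming exponential survival so that the hazard ratio is roughly the ratio of median survivals; accrual, follow-up, and censoring are ignored, so treat the event counts as illustrative.

```python
from statistics import NormalDist
from math import log, ceil

z = NormalDist().inv_cdf

def events_needed(hr: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Schoenfeld approximation: events for a two-arm log-rank test, 1:1 allocation."""
    return ceil(4 * (z(1 - alpha / 2) + z(power)) ** 2 / log(hr) ** 2)

# Under an exponential assumption, HR ~ ratio of median survivals.
print(events_needed(9 / 12))   # original design (9 -> 12 months): ~380 events
print(events_needed(11 / 14))  # revised target (11 -> 14 months): ~540 events
```

The same 3-month absolute gain now corresponds to a hazard ratio closer to 1, so roughly 40% more events (and correspondingly more patients) are needed.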
Explanation:

***Sample size should be based on minimal clinically important difference, not pilot results***
- Relying on **pilot-study effect sizes** is inherently risky because small samples often yield **inflated effect estimates** through random variation, leading to underpowered definitive trials.
- The **minimal clinically important difference (MCID)** is the smallest change that matters to patients, making it the most robust and ethical basis for the **sample size calculation**.

*Pilot studies typically overestimate effect sizes leading to underpowered main studies*
- This phenomenon, sometimes called the **"winner's curse,"** is real, but it describes a consequence of the methodology rather than the fundamental **design principle** that was violated.
- The core flaw is the failure to anchor the definitive trial's power analysis to **clinical significance (MCID)**.

*The p-value from pilot data should be incorporated into power calculations*
- **P-values** measure statistical significance in a specific dataset and should never serve as inputs to **power calculations**.
- Power calculations require an **effect size**, a measure of **variability (standard deviation)**, and the chosen alpha and beta error rates.

*The standard deviation from pilot data is unreliable with small samples*
- Small samples do estimate the **population standard deviation** imprecisely, but this is a secondary technical limitation rather than the primary methodological flaw.
- Even with a perfectly estimated standard deviation, powering on the pilot's **observed mean difference** instead of the **MCID** would still be flawed.

*Power calculations should use 90% power for definitive trials, not 80%*
- **80% power** is a standard, accepted convention in medical research; using it is not a "major flaw."
- Raising power to **90%** reduces the chance of a **Type II error** but does not address the bias introduced by pilot-derived effect sizes.
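The contrast between pilot-based and MCID-based powering is easy to demonstrate. The sketch below uses the standard two-sample normal approximation; the 5-point MCID is a hypothetical value chosen for illustration, not one given in the vignette, and these figures will not reproduce the vignette's 64 per group, which depends on the exact inputs the researcher used.

```python
from statistics import NormalDist
from math import ceil

z = NormalDist().inv_cdf

def n_per_group(delta: float, sd: float,
                alpha: float = 0.05, power: float = 0.80) -> int:
    """Two-sample normal approximation: n = 2 * (sd/delta)^2 * (z_a + z_b)^2."""
    return ceil(2 * (sd / delta) ** 2 * (z(1 - alpha / 2) + z(power)) ** 2)

sd = 8.0
print(n_per_group(delta=7, sd=sd))  # pilot difference (12 - 5 points): ~21/group
print(n_per_group(delta=5, sd=sd))  # hypothetical 5-point MCID: ~41/group
```

If the pilot's 7-point difference is an inflated fluke and the clinically meaningful effect is nearer 5 points, a trial sized on the pilot estimate is underpowered from the start.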
Explanation:

***Self-controlled case series analyzing risk periods after vaccination***
- A **self-controlled case series (SCCS)** is highly efficient for rare vaccine adverse events because it requires data only from **individuals who have both the exposure and the outcome**, each serving as their own control.
- The design eliminates **time-invariant confounding** (e.g., genetics, baseline health) and drastically reduces the required sample size compared with traditional cohort studies for rare events such as **Guillain-Barré syndrome**.

*Case-control study with known GBS cases and matched controls*
- Although effective for **rare diseases**, a standard case-control study can suffer from **recall bias** and from the difficulty of selecting a representative, well-matched **control group** for vaccine exposures.
- Modern vaccine epidemiology often prefers the SCCS because it avoids **control selection** entirely and automatically controls for individual-level confounders.

*Increase alpha to 0.10 to reduce required sample size by 30%*
- Raising the **alpha level** (Type I error rate) reduces the required sample size but compromises the **scientific validity** of the safety study.
- This is generally unacceptable in **regulatory science** when assessing serious safety signals such as GBS, where a strict false-positive rate is crucial.

*Phase IV post-marketing surveillance with voluntary reporting*
- Voluntary reporting systems (e.g., **VAERS**) suffer from substantial **under-reporting** and **reporting bias**, making accurate estimation of the **relative risk (RR)** difficult.
- Such systems are designed for **signal detection**, not formal hypothesis testing or confirming a doubling of risk with specified **statistical power**.

*Cluster randomization of communities to reduce required participants*
- **Cluster randomization** usually *increases* the total number of participants required relative to individual randomization because of the **design effect** and intra-cluster correlation.
- For an event with 1/100,000 incidence, cluster randomization does nothing to resolve the **feasibility problem** of needing millions of subjects to accrue enough GBS events.
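For intuition, here is the crude relative-incidence calculation at the heart of an SCCS, with entirely hypothetical counts (the 42-day risk window and the event and person-time figures are invented for illustration); real SCCS analyses use conditional Poisson regression with adjustment for age and season.

```python
# Hypothetical aggregate data (illustration only): GBS cases within a
# 42-day post-vaccination risk window vs. all remaining follow-up time.
events_risk, pyears_risk = 12, 460.0    # events and person-years in risk windows
events_base, pyears_base = 30, 2300.0   # events and person-years in baseline windows

rate_risk = events_risk / pyears_risk
rate_base = events_base / pyears_base
print(f"Relative incidence: {rate_risk / rate_base:.2f}")  # ~2.0
```

Because every case contributes person-time to both windows, the design needs only the cases themselves rather than millions of vaccinated and unvaccinated participants.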
Explanation:

***Add multiple international sites to accelerate enrollment***
- Increasing the **recruitment rate** by adding centers is the most direct way to reach the required **sample size** of 1,200 patients per arm within a reasonable timeframe.
- This strategy preserves the original **study design**, **statistical power**, and **primary endpoint**, maintaining the trial's scientific rigor.

*Extend follow-up time and use time-to-event analysis to increase statistical power*
- Longer **follow-up** accrues more events but does not fix a trial that is severely **underpowered** on enrollment (400 enrolled versus 2,400 required).
- Extended follow-up also raises **study costs** and **attrition** (loss to follow-up), which can introduce significant bias.

*Reduce power requirement to 80% and recalculate required sample size*
- Dropping from 90% to 80% power shrinks the required sample size but increases the risk of a **Type II error**, i.e., failing to detect a true **treatment effect**.
- Even after such a reduction, current enrollment (400) remains far below any plausible recalculated threshold for a **low-incidence event** like stroke.

*Change primary endpoint to a composite outcome including TIA and stroke*
- A **composite endpoint** raises the event rate and lowers the required sample size, but it changes the **clinical hypothesis** and may produce a result driven by the less severe component (TIA).
- Making this change mid-trial requires a formal **protocol amendment** and risks the appearance of "data dredging" or post-hoc manipulation.

*Switch to Bayesian adaptive design with informative priors from historical data*
- **Bayesian designs** are flexible, but switching mid-trial to incorporate **historical priors** can introduce significant bias and is often unacceptable to regulators for primary efficacy.
- The approach is also highly complex and does nothing to solve the underlying logistical failure of **patient recruitment** threatening the trial.
Explanation:

***The claim is invalid; the study was already powered for a smaller effect size***
- The study had **85% power** to detect a **0.5% reduction**; the observed **0.4% reduction** is a smaller effect, for which the study was not adequately powered.
- Enlarging the **sample size** merely to reach significance for a smaller, possibly clinically unimportant effect is closer to a marketing tactic than a scientific conclusion.

*The claim is valid; increasing sample size would likely achieve significance*
- Increasing **N (sample size)** shrinks the **standard error** and can push a p-value below 0.05, but it does not make the smaller effect clinically important.
- Chasing significance for results outside the original **study design** and hypothesis risks **p-hacking**.

*The claim is valid only if the standard deviation was underestimated*
- If the planning **standard deviation** was underestimated, the study would indeed have been **underpowered**, but that does not justify a post-hoc claim that significance is guaranteed with more patients.
- The central issue is the **effect size (delta)**: the study was powered for 0.5%, and the drug achieved only 0.4%.

*The claim is invalid; the p-value indicates no true effect exists*
- A **p-value > 0.05** does not prove "no effect"; it means the study failed to reject the **null hypothesis** with the available data.
- Interpreting **absence of evidence** as evidence of absence is a classic error; the drug may work, but this trial did not demonstrate it.

*The claim is valid; post-hoc power analysis supports increasing sample size*
- **Post-hoc power analysis** is widely criticized because it is a deterministic function of the observed p-value and adds no new information. (The sketch below makes this explicit.)
- Using after-the-fact power calculations to justify a larger **sample size** is poor statistical practice.
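The circularity of post-hoc power is easy to show: under a two-sided z-test approximation, "observed power" is a pure function of the p-value, so quoting it adds nothing the p-value has not already said. A minimal sketch:

```python
from statistics import NormalDist

nd = NormalDist()

def observed_power(p_two_sided: float, alpha: float = 0.05) -> float:
    """'Post-hoc power' at the observed effect size: a deterministic
    transform of the p-value (two-sided z-test approximation)."""
    z_obs = nd.inv_cdf(1 - p_two_sided / 2)
    z_crit = nd.inv_cdf(1 - alpha / 2)
    return nd.cdf(z_obs - z_crit)

print(f"{observed_power(0.08):.0%}")  # ~42%, recoverable from p = 0.08 alone
print(f"{observed_power(0.05):.0%}")  # exactly 50% whenever p equals alpha
```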
Explanation:

***Change to a crossover design to utilize patients as their own controls***
- A **crossover design** lets each participant serve as their own **control**, which removes **inter-individual variability** from the comparison and increases statistical power.
- This modification is the most efficient way to reduce the **required sample size** without relaxing **alpha** or the **detectable difference**. (See the sketch below.)

*Increase alpha to 0.10 to reduce required sample size*
- Raising **alpha** doubles the tolerated **Type I error** (false-positive) rate, compromising the trial's statistical integrity.
- Although it reduces the required sample size, accepting a 10% false-positive rate is generally considered scientifically unsound in clinical trials.

*Decrease power to 70% to reduce required sample size*
- Lowering **power** increases **Type II error (beta)**, making the study more likely to miss a true clinical effect.
- A smaller sample would be permitted, but the resulting **underpowered** study would be unlikely to yield a definitive conclusion.

*Increase the detectable difference to 12 mmHg*
- A larger **minimum detectable difference** reduces the sample size because bigger effects are easier to detect statistically.
- However, it would leave the study blind to **clinically significant** smaller differences, such as the originally targeted 10 mmHg.

*Perform an interim analysis at 25 patients to stop early for efficacy*
- **Interim analyses** carry **statistical penalties** (e.g., O'Brien-Fleming boundaries) that typically require a *larger* maximum sample size to preserve overall alpha and power.
- Early stopping is possible only if the treatment effect proves **exceptionally large**, which cannot be assumed at the design stage.
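The vignette's parallel-group figure checks out, and the crossover saving can be sketched as follows; the within-subject correlation of 0.7 is an assumed value for illustration (real blood-pressure data would dictate it), and crossover-specific issues such as washout and carryover are ignored.

```python
from statistics import NormalDist
from math import ceil, sqrt

z = NormalDist().inv_cdf
za, zb = z(0.975), z(0.80)   # alpha = 0.05 two-sided, 80% power

delta, sd = 10.0, 15.0       # detectable difference and SD of systolic BP

# Parallel two-group design: n per group.
n_parallel = ceil(2 * (sd / delta) ** 2 * (za + zb) ** 2)
print(n_parallel)            # 36, matching the vignette

# Crossover (paired) design: SD of within-subject differences is
# sd * sqrt(2 * (1 - rho)); rho = 0.7 is an assumed within-subject correlation.
rho = 0.7
sd_d = sd * sqrt(2 * (1 - rho))
n_crossover = ceil((sd_d / delta) ** 2 * (za + zb) ** 2)
print(n_crossover)           # ~11 subjects total, each as their own control
```

With roughly a dozen subjects instead of 72, the design comfortably fits within the 50-patient funding ceiling.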
- Statistical power definition
- Type I and Type II errors
- Factors affecting power
- Sample size calculation for different study designs
- Effect size estimation
- Minimally important difference
- Sample size for non-inferiority trials
- Sample size for equivalence trials
- Power calculations for subgroup analyses
- Post-hoc power analysis limitations
- Adaptive sample size methods
- Power for repeated measures designs
- Group sequential designs