You are conducting a study comparing the efficacy of two different statin medications. Two groups are placed on different statin medications, statin A and statin B. Baseline LDL levels are drawn for each group and are subsequently measured every 3 months for 1 year. Average baseline LDL levels for each group were identical. The group receiving statin A exhibited an 11 mg/dL greater reduction in LDL in comparison to the statin B group. Your statistical analysis reports a p-value of 0.052. Which of the following best describes the meaning of this p-value?

There is a 5.2% chance of observing a difference in reduction of LDL of 11 mg/dL or greater even if the two medications have identical effects

There is a 95% chance that the difference in reduction of LDL observed reflects a real difference between the two groups

Though A is more effective than B, there is a 5% chance the difference in reduction of LDL between the two groups is due to chance

If 100 permutations of this experiment were conducted, 5 of them would show similar results to those described above

This is a statistically significant result

In the study, all participants who were enrolled and randomly assigned to treatment with pulmharkimab were analyzed in the pulmharkimab group regardless of medication nonadherence or refusal of allocated treatment. A medical student reading the abstract is confused about why some participants assigned to pulmharkimab who did not adhere to the regimen were still analyzed as part of the pulmharkimab group. Which of the following best reflects the purpose of such an analysis strategy?

To assess treatment efficacy more accurately

To increase the internal validity of the study

A health system implements a new sepsis protocol across 20 hospitals. A researcher plans to evaluate effectiveness using a stepped-wedge cluster randomized design where hospitals sequentially adopt the protocol every 3 months. She calculates sample size based on individual patient outcomes (mortality) needing 2,000 patients total. The biostatistician identifies a critical error. Evaluate what modification is needed.

Account for the intra-cluster correlation coefficient (ICC), which requires substantial sample size inflation

Adjust for multiple time periods using Bonferroni correction

Use hospital-level outcomes instead of patient-level outcomes as the unit of analysis

Increase alpha to 0.10 to account for cluster randomization reducing power

Include random effects for both hospital and time period in the power calculation

Power calculations for subgroup analyses

Subgroup Pitfalls - The Double Danger

Analyzing multiple subgroups introduces two major statistical risks, creating a high chance for spurious findings.
Danger 1: Inflation of Type I Error (False Positives)
- Testing multiple hypotheses (one per subgroup) increases the probability of finding a significant result by chance alone.
- This is the problem of multiple comparisons.
Danger 2: Reduced Statistical Power (False Negatives)
- Splitting the study population into smaller subgroups reduces the sample size (n) for each test.
- A smaller n lowers statistical power, which decreases the ability to detect a true effect and increases the risk of a Type II error.
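
Both dangers can be put into rough numbers. The sketch below is illustrative only: it assumes the subgroup tests are independent (real subgroups often overlap or correlate), uses a normal-approximation two-sample test, and takes a hypothetical standardized effect size of 0.3. The function names are made up for this example.

```python
from statistics import NormalDist


def familywise_error_rate(k, alpha=0.05):
    """Chance of at least one false positive across k independent tests."""
    return 1 - (1 - alpha) ** k


def two_sample_power(n_per_arm, effect_size, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    effect_size is the standardized mean difference (an assumed value).
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected z-statistic when the true standardized difference is effect_size
    shift = effect_size * (n_per_arm / 2) ** 0.5
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Danger 1: false-positive risk grows with the number of subgroup tests
for k in (1, 5, 10):
    print(k, round(familywise_error_rate(k), 3))

# Danger 2: power falls when 200 patients per arm are split into 4 equal subgroups
print(round(two_sample_power(200, 0.3), 2))  # whole trial
print(round(two_sample_power(50, 0.3), 2))   # one subgroup
```

Under these assumptions, 10 subgroup tests carry about a 40% chance of at least one false positive, and power drops from about 0.85 in the whole trial to about 0.32 in a single quarter-sized subgroup.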

⭐ To be considered credible, subgroup analyses should be pre-specified in the study protocol and supported by a formal statistical test for interaction.

Valid Subgroups - The Credibility Gauntlet

Subgroup analyses are prone to false positives (Type I errors). Treat them with skepticism unless they pass stringent criteria.

Pre-specified: Was the subgroup hypothesis declared before the study began (a priori)? Post-hoc analyses are hypothesis-generating only.
Biologically Plausible: Is there a credible scientific reason for the effect to differ in this subgroup?
Statistically Significant Interaction: This is the most crucial test. The formal test for interaction (or heterogeneity) should be statistically significant (e.g., p < 0.05). A significant result indicates that the difference in treatment effect between subgroups is unlikely to be explained by chance alone.
Consistency: Is the effect seen across multiple related outcomes?
Independent Confirmation: Has the finding been replicated in other independent studies?

⭐ Interaction Test is Key: A significant p-value for the treatment effect within a subgroup is insufficient. You MUST have a significant p-value for the interaction to claim a true subgroup effect.
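
A minimal sketch of why the interaction test matters, using made-up numbers. It assumes two non-overlapping subgroups whose effect estimates are independent and approximately normal, and it compares them with a simple z-test on the difference. In practice the interaction is usually tested with a treatment-by-subgroup term in a regression model, so treat this as an approximation. All function names and values here are hypothetical.

```python
from statistics import NormalDist


def two_sided_p(z_stat):
    """Two-sided p-value for a standard normal test statistic."""
    return 2 * (1 - NormalDist().cdf(abs(z_stat)))


def subgroup_p_value(effect, se):
    """p-value for the treatment effect within a single subgroup."""
    return two_sided_p(effect / se)


def interaction_p_value(effect_1, se_1, effect_2, se_2):
    """p-value for the difference between two independent subgroup effects."""
    diff = effect_1 - effect_2
    se_diff = (se_1 ** 2 + se_2 ** 2) ** 0.5
    return two_sided_p(diff / se_diff)


# Hypothetical LDL reductions (mg/dL) with standard errors in two subgroups
p_a = subgroup_p_value(-10.0, 4.0)   # "significant" within subgroup A
p_b = subgroup_p_value(-5.0, 4.0)    # "non-significant" within subgroup B
p_int = interaction_p_value(-10.0, 4.0, -5.0, 4.0)

print(round(p_a, 3), round(p_b, 3), round(p_int, 3))
```

With these numbers the effect is significant in subgroup A and non-significant in subgroup B, yet the interaction p-value is well above 0.05. The data therefore do not support a claim that the treatment works differently in the two subgroups.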

High‑Yield Points - ⚡ Biggest Takeaways

Subgroup analyses are inherently underpowered due to smaller sample sizes compared to the overall study.

This ↑ the risk of Type II errors (false negatives), i.e., failing to detect a true effect within a subgroup.

Statistically significant findings in subgroups, especially if not pre-specified, may be due to chance.

The correct statistical method to compare effects between subgroups is a test of interaction.

Do not compare subgroup p-values directly (e.g., significant in one, non-significant in another).

Subgroup findings, especially post-hoc ones, should be considered hypothesis-generating, not confirmatory.