test: add Tutorial 24 drift guard (staggered-vs-collapsed power claims) by igerber · Pull Request #549 · igerber/diff-diff · GitHub
test: add Tutorial 24 drift guard (staggered-vs-collapsed power claims) #549

Merged: igerber merged 1 commit into main from test/tutorial24-drift-test on Jun 25, 2026
@igerber igerber commented Jun 25, 2026

Owner

Summary

  • Add a drift guard for Tutorial 24 (staggered-rollout vs. collapsed-2×2 power decision guide), pinning its two load-bearing claims against estimator-default / simulation drift:
    1. Monotonic dilution, fast → slow — the collapsed 2×2 reports a monotonically shrinking share of the truth (93.5% / 80.9% / 61.8%) and its CI coverage of the effect-on-treated collapses, while Callaway–Sant'Anna stays near nominal.
    2. CS-vs-2×2 MDE crossover / near-parity at slow rollout — the 2×2's MDE climbs (~0.37 → ~0.60) while CS's barely moves (~0.55), so the power gap closes to parity.
  • Remove the now-resolved Testing/Docs row from TODO.md.
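The dilution claim in item 1 reduces to an ordering check on the three quoted shares. A minimal sketch, using only the percentages stated above; the helper name `is_strictly_decreasing` is hypothetical and is not taken from the test file in this PR:

```python
# Share of the true effect reported by the collapsed 2x2, ordered from
# fast to slow rollout (percentages quoted in this PR description).
shares_pct = [93.5, 80.9, 61.8]


def is_strictly_decreasing(values):
    """Return True when every value is strictly smaller than the one before it."""
    return all(later < earlier for earlier, later in zip(values, values[1:]))


# Hypothetical stand-in for the structural dilution assertion.
dilution_is_monotonic = is_strictly_decreasing(shares_pct)
```

A strict comparison is used here so that a tie between two rollout speeds would also count as drift; whether the real test is strict is an assumption.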

Because nbsphinx_execute = "never", Read the Docs renders the committed notebook outputs without re-executing them, so the tutorial's prose can silently drift from the live library. These assertions re-derive the load-bearing numbers from the same public generator (generate_staggered_data) and estimators the tutorial uses, then check them against the committed notebook outputs.
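One way to picture the rendered-surface cross-check is a scan of the notebook JSON for the quoted numbers. The sketch below assumes the standard nbformat v4 layout (`cells`, `source`, `outputs`); the helper names and the toy notebook are hypothetical, and the actual test in this PR may work differently:

```python
def rendered_text(notebook):
    """Concatenate cell sources and text outputs of an nbformat-v4 notebook dict."""
    chunks = []
    for cell in notebook.get("cells", []):
        source = cell.get("source", "")
        chunks.append(source if isinstance(source, str) else "".join(source))
        for output in cell.get("outputs", []):
            # Stream outputs carry "text"; execute_result/display_data carry
            # a "data" mapping with a "text/plain" entry.
            text = output.get("text") or output.get("data", {}).get("text/plain", "")
            chunks.append(text if isinstance(text, str) else "".join(text))
    return "\n".join(chunks)


def missing_quotes(notebook, expected):
    """Return the expected strings that do not appear anywhere in the notebook."""
    text = rendered_text(notebook)
    return [quote for quote in expected if quote not in text]


# Toy notebook standing in for the committed tutorial (hypothetical content).
toy_notebook = {
    "cells": [
        {"cell_type": "markdown", "source": ["The 2x2 reports 93.5% of the truth."]},
        {
            "cell_type": "code",
            "source": "print(shares)",
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["80.9\n", "61.8\n"]},
            ],
        },
    ]
}
```

For a real notebook, the dict would come from `json.load` on the `.ipynb` file. Plain substring matching is deliberately crude: a short quote such as `1.8` would also match `61.8`, so quotes should be specific enough to be unambiguous.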

Methodology references

  • Method name(s): N/A — test-only. Exercises existing DifferenceInDifferences (collapsed 2×2) and CallawaySantAnna(control_group="never_treated") via tutorial-drift checks; no estimator/math/source changes.
  • Paper / source link(s): Callaway & Sant'Anna (2021) — the tutorial's subject; no new methodology introduced.
  • Intentional deviations from the source: None.

Validation

  • Tests added: tests/test_t24_staggered_vs_collapsed_power_drift.py (9 tests). Structure mirrors the T25 split:
    • Deterministic (unmarked, run in every CI leg incl. pure-Python): panel composition; §1 estimand gap; monotonic dilution (structural E2/E1, exact); a rendered-surface quote cross-check (19 committed numbers); a notebook-kwargs sync guard.
    • Monte Carlo (@pytest.mark.slow; run only in the Rust legs, which pass -m '', so they stay off the ~1h pure-Python budget): dilution coverage collapse; MDE crossover; flat-vs-growing estimand targeting. These assert robust orderings with wide margins, calibrated against real reduced-sim runs, rather than flaky exact pins.
  • Re-derivation reproduces the notebook's committed numbers exactly (E1=2.61, E2=2.10, 2×2=2.17, CS=2.67, 81%; dilution 93.5 / 80.9 / 61.8%).
  • All 9 tests pass locally (-m '', 6.6s); black --check and ruff clean. Local codex review: ✅ no P0/P1/P2 findings.
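To illustrate what "robust orderings with wide margins" can look like for the MDE claim, here is a minimal sketch. The function name, the `min_shrink` margin, and the use of 0.55 for both CS values are assumptions made for illustration (the description gives only "~0.55"); none of this is taken from the test file:

```python
def power_gap_closes(mde_2x2, mde_cs, min_shrink=0.05):
    """Check that the 2x2's MDE climbs and the CS-vs-2x2 gap shrinks.

    Both arguments are (fast_rollout, slow_rollout) pairs of MDEs.
    min_shrink is a hypothetical margin on how much the gap must close.
    """
    gap_fast = abs(mde_cs[0] - mde_2x2[0])
    gap_slow = abs(mde_cs[1] - mde_2x2[1])
    climbs = mde_2x2[1] > mde_2x2[0]
    return climbs and (gap_fast - gap_slow) >= min_shrink


# Approximate values quoted in this PR: 2x2 ~0.37 -> ~0.60, CS ~0.55.
gap_closes = power_gap_closes(mde_2x2=(0.37, 0.60), mde_cs=(0.55, 0.55))
```

Asserting on the shrinking gap rather than on which estimator has the lower MDE at slow rollout avoids pinning the exact reversal, which the commit message describes as simulation-sensitive.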

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

🤖 Generated with Claude Code

Pins the two load-bearing quantitative claims in
docs/tutorials/24_staggered_vs_collapsed_power.ipynb against
estimator-default / simulation drift, closing the deferred Testing/Docs
TODO row (branch staggered-analysis-2x2):

1. Monotonic dilution fast -> slow: the collapsed-2x2 reports a
   monotonically shrinking share of the truth (93.5% / 80.9% / 61.8%) and
   its CI coverage of the effect-on-treated collapses, while CS stays near
   nominal. Pinned deterministically (estimands are means of the noise-free
   true_effect column) so it runs in every CI leg.
2. CS-vs-2x2 MDE crossover / near-parity at slow rollout: the 2x2's MDE
   climbs (~0.37 -> ~0.60) while CS's barely moves (~0.55) so the power gap
   closes to parity. Pinned as robust orderings (the exact reversal is
   simulation-sensitive, per the prose).

Structure mirrors the T25 split: deterministic structural pins + a
rendered-surface quote cross-check + a notebook-kwargs sync guard run
unmarked; the Monte Carlo sweeps (coverage collapse, MDE crossover,
flat-vs-growing estimand targeting) are @pytest.mark.slow so they stay off
the pure-Python budget and run in the Rust legs (-m '') at full count.

Removes the resolved row from TODO.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@igerber igerber force-pushed the test/tutorial24-drift-test branch from a38113b to 805e8c0 Compare June 25, 2026 12:32
@igerber igerber added the ready-for-ci Triggers CI test workflows label Jun 25, 2026
@igerber igerber merged commit a5bbceb into main Jun 25, 2026
25 of 26 checks passed
@igerber igerber deleted the test/tutorial24-drift-test branch June 25, 2026 14:16