dCDH heterogeneity: per-path + global placebo predict_het R-parity + df threading by igerber · Pull Request #449 · igerber/diff-diff · GitHub
Skip to content

dCDH heterogeneity: per-path + global placebo predict_het R-parity + df threading#449

Merged
igerber merged 5 commits into
mainfrom
dcdh-heterogeneity-placebo-and-df
May 16, 2026
Merged

dCDH heterogeneity: per-path + global placebo predict_het R-parity + df threading#449
igerber merged 5 commits into
mainfrom
dcdh-heterogeneity-placebo-and-df

Conversation

@igerber

@igerber igerber commented May 15, 2026

Copy link
Copy Markdown
Owner

Summary

Closes TODO #422 (per-path placebo predict_het emission) and pilot-412 (heterogeneity OLS df threading) in a single bundle. Same surface, same R fixture, single clean R-parity story.

R-fixture probe (scenario 22) verified that did_multiplegt_dyn(by_path, predict_het, placebo) emits per-by_level heterogeneity OLS rows on backward (placebo) horizons via R's per-by_level dispatcher (DIDmultiplegtDYN:::did_multiplegt_main placebo block at the effect = matrix(-i, ...) rbind site). Scenario 23 confirmed the same emission on the global (non-by_path) surface. Both surfaces now mirror in Python.

R syntax note: did_multiplegt_dyn(predict_het=list(\"X\", c(...)), placebo=N) rejects positive-only horizons (c(1, 2, 3) errors with "specified numbers in predict_het that exceed the number of placebos" because R reuses the same horizon vector for both forward and backward indexing). The c(-1) sentinel triggers "compute heterogeneity for ALL forward (1..effects) AND ALL placebo (1..placebo) positions" — both new fixtures use this syntax.

Survey gate is warn + skip: when survey_design + placebo + heterogeneity are co-set, fit() emits a UserWarning and falls back to forward-horizon-only heterogeneity. The Binder TSL cell-period allocator's REGISTRY justification is tied to post-period attribution; backward-horizon attribution puts ψ_g mass on a pre-period cell, which is a separate library-extension claim deferred to a follow-up methodology PR. Forward-horizon predict_het + survey_design continues to work unchanged.

Phase 2 df threading: _compute_heterogeneity_test now passes df = n_obs - n_params to safe_inference on the non-survey OLS path (matches R qt(0.975, df.residual(model))); pre-PR Python used df=None (Z critical), producing 0.1-2% rtol gaps on p_value/conf_int vs R. Forward parity tests tightened from "unpinned" to INFERENCE_RTOL=1e-4. Rank-deficient designs use design.shape[1] as df denominator (pre-drop column count); fully rank-deficient is NaN-short-circuited by the existing guard. Near-rank-deficient edge case tracked as a Low TODO follow-up.

Methodology references (required if estimator / math changes)

  • Method name(s): ChaisemartinDHaultfoeuille.predict_het × placebo × by_path/paths_of_interest
  • Paper / source link(s): de Chaisemartin & D'Haultfœuille (NBER WP 29873) Web Appendix Section 1.5 (Lemma 7); R DIDmultiplegtDYN 2.3.3 did_multiplegt_main placebo + predict_het block
  • Any intentional deviations from the source (and why):
    • Survey + backward-horizon predict_het: warn + skip rather than compute. Pre-period Binder TSL allocator derivation is deferred (tracked in TODO.md). Forward + survey continues to work.
    • Joint Wald F-test across all predict_het rows: R aggregates; Python emits per-horizon inference only. Documented in REGISTRY heterogeneity Note.
    • Rank-deficient df: n_obs - n_params (pre-drop column count) vs R's df.residual = n - rank post-drop. Affects only near-rank-deficient designs that solve_ols retains rather than NaN-out. Tracked as Low TODO.

Validation

  • Tests added/updated:
    • R-parity (new): tests/test_chaisemartin_dhaultfoeuille_parity.py::TestDCDHDynRParityHeterogeneityWithPlacebo (scenario 23, global) + ::TestDCDHDynRParityByPathHeterogeneityWithPlacebo (scenario 22, per-path × 3 paths) pinning all 6 inference fields at BETA_RTOL=1e-6/SE_RTOL=1e-5 for beta/se/t_stat/n_obs and INFERENCE_RTOL=1e-4 for p_value/conf_int.
    • Forward-horizon parity tightened: existing TestDCDHDynRParityHeterogeneity and TestDCDHDynRParityByPathHeterogeneity extended to assert p_value/conf_int at the new INFERENCE_RTOL=1e-4 (was unpinned pre-PR; replaced the Z-vs-t deviation note with the positive parity claim).
    • Cross-surface (new): tests/test_chaisemartin_dhaultfoeuille.py::TestByPathPredictHetPlacebo covers placebo het column population, survey-gate warn+skip, forward+survey anti-regression, out_idx<0 eligibility guard, single-path telescope (path_heterogeneity_effects[(only_path,)] == heterogeneity_effects bit-exactly), summary rendering, and direct-call NotImplementedError backstop.
    • Local-invariant tests refactored: test_*_inference_matches_safe_inferencetest_*_inference_local_invariants to verify SE-derivation wiring (t_stat = beta/se, symmetric conf_int, p_value in [0, 1]) without back-deriving n_params.
    • 314 dCDH tests pass (was 312 pre-PR).
  • Backtest / simulation / notebook evidence: scenarios 22 + 23 in benchmarks/data/dcdh_dynr_golden_values.json, regenerable via Rscript benchmarks/R/generate_dcdh_dynr_test_values.R.

Security / privacy

  • Confirm no secrets/PII in this PR: yes (canonical secret-pattern scan in pre-merge-check returned no hits).

🤖 Generated with Claude Code

igerber and others added 3 commits May 15, 2026 17:49
Closes TODO #422 + pilot-412 in a single PR (same surface, same R fixture,
same parity story).

Phase 0 probe verified R behavior: did_multiplegt_dyn(by_path, predict_het,
placebo) emits per-path heterogeneity OLS results on backward (placebo)
horizons via R's per-by_level dispatcher (DIDmultiplegtDYN:::did_multiplegt_main
placebo block at the `effect = matrix(-i, ...)` rbind site). New scenario 22
in benchmarks/R/generate_dcdh_dynr_test_values.R captures this with
predict_het=list("het_x", c(-1)) — the c(-1) sentinel triggers "compute
heterogeneity for ALL forward (1..effects) AND ALL placebo (1..placebo)
positions" per the R source path read at script time.

Phase 1A implementation (non-survey): _compute_heterogeneity_test gains a
placebo: int = 0 parameter and iterates forward (1..L_max) and backward
(-1..-placebo) horizons in a single loop. Explicit `if out_idx < 0: continue`
eligibility guard prevents numpy negative-index silent wrap on N_mat[g, out_idx]
when F_g - 1 + l_h < 0. _compute_path_heterogeneity_test forwards the param;
fit() passes placebo=L_max if self.placebo else 0 to both global and per-path
call sites. to_dataframe(level="by_path") placebo rows now read het_* values
from path_heterogeneity_effects negative-int keys (mirroring the existing
path_placebo_event_study negative-key convention) instead of the pre-PR
hardcoded NaN-fill.

Survey gate: when survey_design is active AND placebo > 0 + heterogeneity is
requested, _compute_heterogeneity_test raises NotImplementedError eagerly with
a documented message. The Binder TSL cell-period allocator's REGISTRY
justification is tied to post-period attribution; backward-horizon attribution
puts ψ_g mass on a pre-period cell, which is a separate library-extension
claim that needs its own derivation. Forward-horizon predict_het + survey
continues to work unchanged. Pre-period allocator derivation tracked as a
new follow-up TODO row.

Phase 2 (df threading): _compute_heterogeneity_test now passes
df = n_obs - n_params to safe_inference on the non-survey OLS path, matching
R did_multiplegt_dyn(predict_het=...)'s t-distribution inference
(qt(0.975, df.residual(model)) site). Pre-PR Python used df=None (Z critical),
producing 0.1-2% rtol gaps on p_value/conf_int vs R. Existing forward-horizon
parity tests now pin t/p/CI at INFERENCE_RTOL=1e-4 (was unpinned). Rank-
deficient designs use design.shape[1] as df denominator (pre-drop column
count); fully rank-deficient is NaN-short-circuited by the existing guard.
Near-rank-deficient edge case tracked as a new Low TODO follow-up.

R parity: scenario 22 (multi_path_reversible_predict_het_with_placebo,
placebo=2, effects=3, by_path=3) pinned at BETA_RTOL=1e-6/SE_RTOL=1e-5 for
beta/se/t_stat/n_obs and INFERENCE_RTOL=1e-4 for p_value/conf_int across
3 paths × (3 forward + 2 placebo) = 15 horizons.

Cross-surface tests (TestByPathPredictHetPlacebo): placebo het column
population, survey-gate NotImplementedError, forward+survey anti-regression,
out_idx<0 eligibility guard, single-path telescope
(path_heterogeneity_effects[(only_path,)] == heterogeneity_effects bit-
exactly), summary rendering. The two existing local-invariant tests
(test_*_inference_matches_safe_inference) refactored to verify SE-derivation
wiring (t_stat=beta/se, conf_int symmetric around beta, p_value in [0,1])
without back-deriving n_params.

REGISTRY: heterogeneity Z-vs-t deviation note replaced with positive
"R parity (post-2026-05-15 df threading)" framing including the rank-
deficient caveat. New "Per-path placebo heterogeneity" Note documents the
R parity, syntax requirements (c(-1) sentinel), survey gate, and test
anchors. CHANGELOG entry under [Unreleased]. llms-full.txt by_path entry
extended with placebo het composition + survey-gate mention. API rst extended
with the same.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…_het parity

Two changes that resolve scope/contract issues with the previous commit on
this branch:

1. Survey gate (scope correctness). The eager `if use_survey and placebo > 0:
raise NotImplementedError` at function entry of `_compute_heterogeneity_test`
fired on EVERY survey + heterogeneity call when fit() passed
`placebo=L_max if self.placebo else 0` with default `placebo=True`. That
broke the previously-supported forward-horizon survey + predict_het path,
contradicting the changelog's "Forward-horizon predict_het + survey_design
is supported" claim. Refactored to a per-iteration backstop inside the
horizon loop (raises only when actually about to compute a backward
iteration under survey). fit() at both global and per-path heterogeneity
call sites now wraps the placebo arg with a warn-and-skip: if
`survey_design + placebo + heterogeneity` are all active, emit a
UserWarning explaining backward-horizon predict_het is deferred under
survey designs and pass `placebo=0` so only forward-horizon heterogeneity
is computed. The function-level NotImplementedError remains as a defensive
backstop for direct callers that bypass fit().

2. Global placebo predict_het parity (scope completeness). The previous
commit extended `_compute_heterogeneity_test` to compute backward horizons,
which means `results.heterogeneity_effects` (global, non-by_path surface)
also gains negative-int keys when `placebo=True + heterogeneity`. The R-
parity coverage was scoped to per-path only (scenario 22). Added scenario
23 (`multi_path_reversible_predict_het_with_placebo_global`) that calls
`did_multiplegt_dyn(predict_het=list("het_x", c(-1)), placebo=2, effects=3)`
WITHOUT by_path; verified R emits forward (effect=1,2,3) AND placebo
(effect=-1,-2) rows on the global surface with 3 paths × 30 switchers each
= n_obs=90 per row. New parity test class
`TestDCDHDynRParityHeterogeneityWithPlacebo` pins all 6 inference fields
across 5 horizons at the same tolerances as the per-path version.

Test changes:
- `test_predict_het_placebo_survey_design_raises` renamed to
  `test_predict_het_placebo_survey_design_warns_and_skips_backward`;
  asserts UserWarning is emitted, no exception raised, and both
  `heterogeneity_effects` and `path_heterogeneity_effects` contain only
  positive-int keys (forward-only emission).
- New `test_compute_heterogeneity_test_direct_call_raises_on_backward_survey`
  exercises the per-iteration backstop directly; locks the API contract
  for any future internal call site that bypasses fit().
- New `test_parity_multi_path_reversible_predict_het_with_placebo_global`
  pins scenario 23 R-parity.

Doc updates: REGISTRY heterogeneity Note clarifies "warn + skip" semantics
(was "raises NotImplementedError"), references both global and per-path
parity classes. CHANGELOG entry rewritten to reflect global + per-path
scope. llms-full.txt and API rst updated similarly.

Test count: 312 -> 314 (added 2; the renamed test still passes under its
new contract).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The top-level heterogeneity Note in REGISTRY.md still said "This
implementation provides only post-treatment regressions" — stale after
the placebo predict_het additions in this branch's earlier commits.
Updated to reflect:

1. Python now matches R on per-horizon placebo regressions (when
   `placebo=True` + `heterogeneity=` are co-set) — refers readers to
   the "Placebo predict_het" sub-note for the full contract (global +
   per-path scope, survey warn+skip behavior, R-parity classes).
2. The remaining gap from R is the joint null Wald F-test across all
   predict_het rows — Python emits per-horizon inference only.
3. The `controls` mutex stays unchanged.

Cross-references the existing render-time note in `_render_heterogeneity_section`
("Per-horizon regressions only (no joint F-test)") so the registry text
matches the user-facing summary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Copy link
Copy Markdown

Two informational findings from the latest review:

1. Stale `to_dataframe(level="by_path")` docstring at
   `chaisemartin_dhaultfoeuille_results.py:1530-1558` still claimed
   placebo `het_*` columns are NaN. Updated to document the post-#422
   contract: positive-horizon AND negative-horizon (placebo) rows are
   both populated when `placebo=True + heterogeneity=` are co-set;
   placebo rows under `survey_design` remain NaN with a fit-time
   UserWarning.

2. JSON golden-fixture type instability for empty `placebo_predict_het`
   / `placebo_horizons` slots. R's `jsonlite::toJSON` serializes
   plain `list()` as `[]` (array) but populated named lists as `{}`
   (object), so consumers iterating `.items()` on the slot saw
   different shapes across scenarios. Fixed both ends:
   - R-side: extractors initialize empty slots with
     `structure(list(), names = character(0))` which jsonlite
     serializes as `{}` even when empty. Verified across 4 scenarios
     (20, 21, 22, 23) — all `placebo_predict_het` / `placebo_horizons`
     slots now serialize as objects regardless of population.
   - Python-side: added `_as_dict` helper in the parity test module
     as a defensive backstop coercing any non-dict (None / [] /
     missing) to {}. Used at the two call sites that read optional
     placebo slots so consumers can call `.items()` uniformly.

The golden JSON regenerated; type-stable across all scenarios (verified
via `jq` on each scenario's predict_het type).

314 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: bcdb00a19d0d673d29a890a148ebc836633201db


Overall Assessment

✅ Looks good

Executive Summary

  • No unmitigated P0/P1 issues found in the changed estimator paths. The new backward-horizon predict_het support and finite-df threading are consistent with the updated methodology contract in docs/methodology/REGISTRY.md:L637-L643.
  • The prior re-review P3s appear addressed: the by-path to_dataframe() docs now describe placebo het_* population in diff_diff/chaisemartin_dhaultfoeuille_results.py:L1546-L1566, and the R fixture extractor now stabilizes empty placebo mappings in benchmarks/R/generate_dcdh_dynr_test_values.R:L631-L707.
  • Remaining methodology caveats are explicitly documented/tracked rather than silent: survey-weighted backward-horizon heterogeneity is warn+skip, and near-rank-deficient non-survey df still uses n_obs - n_params; both are tracked in TODO.md:L63-L64.
  • One minor docs regression remains: the public fit() docstring still says heterogeneity has “no placebo regressions,” and the heterogeneity summary note does not mention the new survey forward-only fallback.
  • Verification note: I could not run pytest here because python -m pytest --version fails with No module named pytest.

Methodology

  • No unmitigated findings. The changed behavior is internally consistent across diff_diff/chaisemartin_dhaultfoeuille.py:L3878-L3915, diff_diff/chaisemartin_dhaultfoeuille.py:L5087-L5215, the updated registry note at docs/methodology/REGISTRY.md:L637-L643, and the new parity coverage in tests/test_chaisemartin_dhaultfoeuille_parity.py:L1558-L1811.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings. The previous golden-fixture shape instability appears resolved by the explicit empty-object serialization in benchmarks/R/generate_dcdh_dynr_test_values.R:L631-L707.

Tech Debt

  • Severity: P3-informational. Impact: survey_design + placebo + heterogeneity intentionally falls back to forward-horizon-only heterogeneity because the pre-period cell allocator is still underived; this is now explicit and tracked, not silent. Concrete fix: follow the tracked TODO to derive a pre-period survey allocator before enabling backward-horizon survey predict_het. References: diff_diff/chaisemartin_dhaultfoeuille.py:L3885-L3899, docs/methodology/REGISTRY.md:L637-L643, TODO.md:L63-L63.
  • Severity: P3-informational. Impact: near-rank-deficient non-survey heterogeneity still uses df = n_obs - n_params instead of post-drop rank, so R parity can still drift in that edge case; fully rank-deficient cases remain NaN-consistent. Concrete fix: thread effective rank from solve_ols() into _compute_heterogeneity_test, as already tracked. References: diff_diff/chaisemartin_dhaultfoeuille.py:L5195-L5215, TODO.md:L64-L64.

Security

  • No findings.

Documentation/Tests

  • Severity: P3. Impact: the public/in-code documentation is only partially updated. fit() still says heterogeneity is “post-treatment regressions only (no placebo regressions),” which now contradicts the implementation, API docs, registry, and new parity tests; the heterogeneity summary note also omits the survey forward-only fallback. This can mislead users and future reviewers who rely on in-code docstrings as methodology references. Concrete fix: update the heterogeneity parameter docstring in diff_diff/chaisemartin_dhaultfoeuille.py:L980-L989 and the summary note in diff_diff/chaisemartin_dhaultfoeuille_results.py:L1279-L1283 to match the contract already documented in docs/api/chaisemartin_dhaultfoeuille.rst:L33-L41 and docs/methodology/REGISTRY.md:L637-L643.
  • No other findings. The previous by-path to_dataframe() doc drift flagged in the last review looks fixed in diff_diff/chaisemartin_dhaultfoeuille_results.py:L1546-L1566, and the new global/per-path placebo parity tests are present in tests/test_chaisemartin_dhaultfoeuille_parity.py:L1558-L1811.

…tract

Two stale in-code documentation surfaces from the latest review:

1. `chaisemartin_dhaultfoeuille.py:980-989` — the `heterogeneity`
   parameter docstring on `fit()` still said "post-treatment regressions
   only (no placebo regressions)". Updated to document the post-#422
   contract: per-horizon OLS regressions on forward AND backward
   (placebo) horizons when `placebo=True`; survey_design composes with
   forward horizons but warns + skips backward horizons until the
   pre-period cell allocator is derived.

2. `chaisemartin_dhaultfoeuille_results.py:1279-1286` — the heterogeneity
   summary note didn't mention the survey forward-only fallback.
   Extended the note to cover the gating semantics so users reading
   `result.summary()` under `survey_design + heterogeneity` know what
   they're getting.

Both surfaces now match the contract already documented in the API rst,
REGISTRY, and CHANGELOG. No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Copy link
Copy Markdown

@igerber igerber added the ready-for-ci Triggers CI test workflows label May 15, 2026
@igerber igerber merged commit 2cb8def into main May 16, 2026
33 of 34 checks passed
@igerber igerber deleted the dcdh-heterogeneity-placebo-and-df branch May 16, 2026 10:18
TDL77 pushed a commit to TDL77/diff-diff that referenced this pull request May 18, 2026
…h test

Two follow-ups closing the residual Low TODOs from PR igerber#449:

1. igerber#64 — Rank-deficient df threading. `_compute_heterogeneity_test`'s
   non-survey OLS path now computes `df = n_obs - rank(design)` via
   `_detect_rank_deficiency` (the same helper `solve_ols` calls
   internally), matching R's `df.residual = n - rank(design)`
   post-drop convention. For full-rank designs `rank == n_params` and
   behavior is bit-identical to the pre-PR `n_obs - n_params` path —
   all 4 forward-horizon parity tests still pass at the same
   `BETA_RTOL=1e-6` / `SE_RTOL=1e-5` / `INFERENCE_RTOL=1e-4`
   tolerances. For near-rank-deficient designs that `solve_ols`
   retains rather than NaN-out (e.g., cohort-collinearity at high
   horizons), the post-drop rank is strictly lower than `n_params`,
   so the post-PR `df` is strictly larger, matching R's `lm()`
   convention. Locked by
   `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathPredictHetPlacebo::test_heterogeneity_df_uses_post_drop_rank`
   which pins `safe_inference(... df=n_obs - rank)` against the
   stored `t_stat` / `p_value` / `conf_int` at `atol=1e-12`.

2. igerber#62 — Negative-baseline path regression test. New
   `TestByPathNonBinary::test_negative_baseline_path_supported`
   exercises switchers with `D_{g,1} = -1` and asserts that
   `path_effects` correctly contains negative-baseline tuple keys
   (`(-1, 0, 0, 0)`, `(-1, 1, 1, 1)`). The existing
   `test_negative_integer_D_supported` only covered paths with
   negative values in non-baseline positions (e.g.,
   `(0, -1, -1, -1)`), which does not trigger R's documented
   `substr(path, 1, 1)` baseline-extraction bug. Python's tuple-key
   matching is correct under any baseline value; this test pins the
   contract. No R-parity fixture is added because R is the buggy
   side on this regime — the deviation is already documented in the
   REGISTRY non-binary treatment Note.

TODO.md: drops the two corresponding Low rows (igerber#419 negative-baseline
test gap + follow-up rank-deficient df threading). REGISTRY heterogeneity
caveat updated to drop the "rank-deficient gap" prose and replace with
the positive "uses post-drop rank, matching R's lm() convention" claim.
CHANGELOG `[Unreleased]` extended with both items.

Test count: 314 -> 316.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TDL77 pushed a commit to TDL77/diff-diff that referenced this pull request May 18, 2026
Two P3 informational findings from R2:

1. TODO.md Tier B backlog still listed the dCDH heterogeneity df-
   threading and by-path placebo predict_het items as open, but PR
   igerber#449 closed both. Replaced the two stale bullets with a single
   bullet for the remaining survey + backward-horizon allocator
   derivation (the one Medium follow-up explicitly tracked in the
   wrap-up commit).

2. Three test-prose comments still said `df = n_obs - n_params`
   while the implementation and REGISTRY now use
   `df = n_obs - rank(design)`. Updated each comment to the
   post-drop rank wording; full-rank designs continue to have
   `rank == n_params` so the SE-derivation invariants under test
   are unchanged.

Comment-only drift; no behavior change. 317 tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-for-ci Triggers CI test workflows

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant