
igerber · 2026-05-14T18:37:52Z

Summary

Holistic re-audit of merged #408 (compose by_path × survey_design Wave 4) + #424 (post-merge docs-drift cleanup). The per-PR CI cleanup review on #424 couldn't see the combined post-PR holistic state; local agentic codex review surfaced 2 sibling-surface test gaps + 1 real [Unreleased] CHANGELOG drift.

- Sibling-surface test gaps: #408 (Compose by_path / paths_of_interest with survey_design, Wave 4 #10) shipped replicate-weight regressions for by_path, but the parallel paths_of_interest selector only had analytical/gate tests under survey. Both selectors share the same _compute_path_effects / _compute_path_placebos IF code path, so the missing tests were a selector-symmetry oversight. Added test_paths_of_interest_replicate_weight_per_path_se_finite and test_paths_of_interest_survey_design_placebo_replicate_weight, both locking in finite per-horizon SE under Rao-Wu (JK1) and the _refresh_path_inference contract (every per-path entry's t_stat matches safe_inference at the FINAL df_survey).
- CHANGELOG drift: the original [Unreleased] by_path entry said that trends_linear, trends_nonparam, heterogeneity, design2, honest_did, and survey_design all raise NotImplementedError. Each subsequent gate-lift PR added its own [Unreleased] entry but never updated this original list. Rewrote it to reflect the current state: only design2 + honest_did remain gated.

2 files, +152/-1. No methodology changes, no behavior changes. CHANGELOG fix is hygiene; tests add coverage on already-shipped surfaces.

Test plan

pytest tests/test_chaisemartin_dhaultfoeuille.py::TestByPathSurveyDesignAnalytical -m '' green (2 new + 16 existing)
CI AI review on the diff

🤖 Generated with Claude Code

…ale by_path gate list in CHANGELOG

Holistic re-audit of merged #408 (compose `by_path` × `survey_design` Wave 4) + #424 (post-merge docs-drift cleanup). Per-PR CI on #424 couldn't see the combined post-PR holistic state. Local agentic codex review surfaced 2 sibling-surface test gaps + 1 real `[Unreleased]` CHANGELOG drift.

**Sibling-surface coverage**: #408's Wave-4 PR shipped replicate-weight regressions for `by_path` (`test_per_path_replicate_se_finite`, `test_per_path_inference_refreshes_to_lower_final_df`) but the parallel `paths_of_interest` selector only had analytical / gate / unobserved-path tests under survey. Both selectors share the same `_compute_path_effects` and `_compute_path_placebos` IF code path, so the test gap was a selector-symmetry oversight, not a methodology gap. Added:

- `test_paths_of_interest_replicate_weight_per_path_se_finite`: pins finite per-horizon SE under Rao-Wu (JK1) AND the `_refresh_path_inference` contract (every per-path entry's `t_stat` matches `safe_inference` at the FINAL `df_survey`, not the per-path snapshot from before replicate-weight fits appended to the shared `_replicate_n_valid_list`).
- `test_paths_of_interest_survey_design_placebo_replicate_weight`: same invariants on the `_compute_path_placebos` branch.

**CHANGELOG drift**: the original `[Unreleased]` `by_path` entry (added when by_path first shipped) said `trends_linear`, `trends_nonparam`, `heterogeneity`, `design2`, `honest_did`, and `survey_design` all raise `NotImplementedError`. Each subsequent gate-lift PR shipped its own `[Unreleased]` entry, but none of them went back to update this original entry's stale gated-features list. Users reading the changelog in order get contradictory upgrade guidance. Rewrote the gates list to reflect actual current state: only `design2` + `honest_did` remain gated. Pre-existing single-CHANGELOG-cycle hygiene gap, surfaced by the #408 holistic audit but applies independently of any specific subsequent PR.

Holistic pilot finding NOT addressed: phantom `heterogeneity was composed in` claim in #424's CHANGELOG. That's correct in real main (#412 lifted the heterogeneity gate), but appears as a code/doc mismatch in the pilot because pilot construction (per `feedback_holistic_pilot_true_merge_cherry_pick_pitfall`) is `#408 + #424 deltas only` and doesn't include intermediate sibling PRs like #412. This is a structural limitation of the strict-delta pilot pattern, so no fix-PR action is needed in main.

github-actions · 2026-05-14T18:43:04Z

Overall Assessment

✅ Looks good. No unmitigated P0/P1 issues found. This diff is test-only plus a changelog correction, and the affected by_path survey/replicate behavior is already documented in the methodology registry. I found one P2 code-quality/test issue and one P3 test-coverage gap.

Executive Summary

No estimator, weighting, variance, or identification logic changed in the diff; methodology risk is therefore low.
The updated by_path changelog entry is consistent with the documented registry contract for survey-aware per-path effects/placebos and final df_survey refresh.
P2: one new test introduces unused safe_inference() outputs and therefore does not fully validate the refreshed inference fields it computes.
P3: the new replicate-placebo paths_of_interest regression only proves “at least one finite entry,” so a one-path omission could still slip through.
Static review only: pytest and ruff were not installed in this environment, so I did not execute the suite.

Methodology

None. The diff does not change estimator code. The relevant by_path survey-design / replicate-weight / final-df_survey contract is already documented in docs/methodology/REGISTRY.md and in _compute_path_effects, _compute_path_placebos, and _refresh_path_inference (all in diff_diff/chaisemartin_dhaultfoeuille.py). The changelog update in CHANGELOG.md matches that documented scope. Concrete fix: none.

Code Quality

P2: In test_paths_of_interest_replicate_weight_per_path_se_finite (tests/test_chaisemartin_dhaultfoeuille.py), exp_p and exp_ci are assigned from safe_inference(...) but never used. Impact: with Ruff F rules enabled in pyproject.toml, this is likely an F841 lint hit, and the test only proves t_stat refresh rather than the full inference tuple it computes. Concrete fix: either assert vals["p_value"] and vals["conf_int"] against exp_p / exp_ci, or replace those bindings with _ if only t_stat is intended.
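The first option in the P2 fix can be sketched as follows. This is a minimal illustration, not the repository's test: `inference_stub` is a hypothetical stand-in for `safe_inference` (it uses a normal approximation and ignores `df_survey`), and the `"effect"` and `"se"` keys are assumed names; only `t_stat`, `p_value`, and `conf_int` come from the review text.

```python
import math
from statistics import NormalDist


def inference_stub(effect, se, alpha=0.05):
    """Hypothetical stand-in for safe_inference, NOT the library's code.

    Returns (t_stat, p_value, conf_int) from a normal approximation and
    sets every field to NaN together when the inputs are unusable.
    """
    if not (math.isfinite(effect) and math.isfinite(se)) or se <= 0:
        nan = float("nan")
        return nan, nan, (nan, nan)
    z = effect / se
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    half = NormalDist().inv_cdf(1.0 - alpha / 2.0) * se
    return z, p, (effect - half, effect + half)


def same(a, b):
    """NaN-aware comparison so that a NaN/NaN pair counts as a match."""
    if math.isnan(a) and math.isnan(b):
        return True
    return math.isclose(a, b, rel_tol=1e-12, abs_tol=1e-12)


def check_entry(vals):
    """Assert that all stored inference fields match a fresh recomputation."""
    exp_t, exp_p, exp_ci = inference_stub(vals["effect"], vals["se"])
    assert same(vals["t_stat"], exp_t)
    assert same(vals["p_value"], exp_p)          # exp_p is now consumed
    assert same(vals["conf_int"][0], exp_ci[0])  # exp_ci is now consumed
    assert same(vals["conf_int"][1], exp_ci[1])


# A consistent entry passes; an entry with a stale p_value would not.
t, p, ci = inference_stub(2.0, 1.0)
check_entry({"effect": 2.0, "se": 1.0, "t_stat": t, "p_value": p, "conf_int": ci})
```

Comparing all three fields also removes the unused bindings that would trigger F841, so the lint fix and the stronger invariant come from the same change.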

Performance

None. Impact: no performance-sensitive production code changed. Concrete fix: none.

Maintainability

None. Impact: the implementation surface is unchanged; only tests and changelog text moved. Concrete fix: none.

Tech Debt

None. Impact: I did not identify a new deferred-work item that needs tracking in TODO.md. Concrete fix: none.

Security

None. Impact: no executable production path or secret-handling surface changed. Concrete fix: none.

Documentation/Tests

P3: The new replicate-placebo sibling test (tests/test_chaisemartin_dhaultfoeuille.py) only checks that res.path_placebo_event_study is non-None and that at least one populated entry has a finite SE. The analytical sibling in the same file already asserts the full selector surface size under paths_of_interest. Impact: a regression that silently drops one requested path on the replicate-placebo branch could still pass as long as another requested path produces one finite horizon. Concrete fix: assert that both requested path keys are present in res.path_placebo_event_study before checking SE/inference.
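The gap described in P3 can be reproduced with a small sketch. The layout of `res.path_placebo_event_study` is an assumption here (a dict mapping each path tuple to a dict of horizon entries with an `"se"` field), and `weak_check` / `strong_check` are hypothetical helper names; the two path tuples are the ones named in the follow-up commit.

```python
import math

REQUESTED_PATHS = [(0, 1, 1, 1), (0, 1, 0, 0)]


def weak_check(placebos):
    """Original invariant: non-None surface with at least one finite SE."""
    assert placebos is not None
    finite = [
        entry["se"]
        for horizons in placebos.values()
        for entry in horizons.values()
        if math.isfinite(entry["se"])
    ]
    assert finite, "no finite SE found"


def strong_check(placebos, requested):
    """Stricter invariant: every requested path must be present first."""
    assert placebos is not None
    for path in requested:
        assert path in placebos, f"missing requested path {path}"
    weak_check(placebos)


# Simulated regression: the second requested path was silently dropped.
dropped = {(0, 1, 1, 1): {-1: {"se": 0.12}}}
weak_check(dropped)  # passes, so the regression would go unnoticed

try:
    strong_check(dropped, REQUESTED_PATHS)
except AssertionError as exc:
    print(f"caught: {exc}")
```

Checking key presence before the finite-SE scan keeps the failure message specific to the dropped path instead of surfacing later as a vague coverage miss.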

CI AI review on #435:

- P2: exp_p / exp_ci returned from safe_inference() were unused (F841). Extended event-study test to assert p_value and conf_int match safe_inference output too, not just t_stat. The safe_inference contract is joint NaN propagation across the full inference tuple, so testing all three jointly is the right invariant.
- P3: placebo test only checked 'at least one finite entry', so a silent path-drop on the replicate-weight placebo branch could pass. Added explicit assertions that both (0,1,1,1) and (0,1,0,0) are present in res.path_placebo_event_study before checking SE/inference.

github-actions · 2026-05-14T19:56:36Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 2f9535382cfa5e64315391fbea3b9d47be390a14

Overall Assessment

✅ Looks good.

Executive Summary

Re-review result: the prior P2 test-quality issue is resolved. The new replicate-weight paths_of_interest test now asserts t_stat, p_value, and conf_int together from safe_inference() in tests/test_chaisemartin_dhaultfoeuille.py:L9933-L10015.
Re-review result: the prior P3 placebo-surface coverage gap is resolved. The new placebo test now asserts both requested paths are present before the finite-SE check in tests/test_chaisemartin_dhaultfoeuille.py:L10018-L10093.
No estimator, weighting, variance, identification, or default-behavior code changed in this diff; the change set is tests plus a changelog correction.
The changelog update is consistent with the documented by_path methodology contract and remaining gates in CHANGELOG.md:L59-L61 and docs/methodology/REGISTRY.md:L643-L679.
Static review only: python -m pytest ... could not run here because pytest is not installed.

Methodology

No findings. Severity: none. Impact: the diff does not alter estimator math or inference code, and the changelog sync matches the registry’s documented by_path scope, including survey support and the remaining design2 / honest_did gates in docs/methodology/REGISTRY.md:L643-L679. Concrete fix: none.

Code Quality

No findings. Severity: none. Impact: the earlier unused-safe_inference()-output issue is fixed because the new test now consumes the returned p_value and conf_int rather than leaving them unused in tests/test_chaisemartin_dhaultfoeuille.py:L9996-L10015. Concrete fix: none.

Performance

No findings. Severity: none. Impact: only test code and changelog prose changed; no production-path performance surface moved. Concrete fix: none.

Maintainability

No findings. Severity: none. Impact: the new tests are narrowly targeted to the previously missing paths_of_interest sibling surfaces and reuse the existing survey fixture/test pattern in tests/test_chaisemartin_dhaultfoeuille.py:L9933-L10093. Concrete fix: none.

Tech Debt

No findings. Severity: none. Impact: I did not identify new deferrable work that needs TODO.md tracking. Concrete fix: none.

Security

No findings. Severity: none. Impact: no executable production logic, secret-handling, or dependency surface changed. Concrete fix: none.

Documentation/Tests

No findings. Severity: none. Impact: the prior P3 omission risk is fixed by asserting both requested placebo paths exist before checking for any finite entry in tests/test_chaisemartin_dhaultfoeuille.py:L10063-L10078, and the changelog drift is corrected in CHANGELOG.md:L59-L61. Concrete fix: none.

Review note: static analysis only; test execution was not possible because pytest is unavailable in this environment.

igerber added the ready-for-ci label (Triggers CI test workflows) May 14, 2026

igerber merged commit 5372e88 into main May 14, 2026
25 of 26 checks passed

igerber deleted the fix-audit-408-r2 branch May 14, 2026 21:08

Fix #408 holistic audit residuals: sibling-surface replicate-weight test gaps + stale by_path gate list #435
igerber merged 2 commits into main from fix-audit-408-r2
