Wave 3 estimator observability: HonestDiD M=0 test, Wooldridge canonical-link warning, ARP vertex diagnostic by igerber · Pull Request #453 · igerber/diff-diff · GitHub
Skip to content

Wave 3 estimator observability: HonestDiD M=0 test, Wooldridge canonical-link warning, ARP vertex diagnostic#453

Merged
igerber merged 4 commits into
mainfrom
wave-3-observability
May 16, 2026
Merged

Wave 3 estimator observability: HonestDiD M=0 test, Wooldridge canonical-link warning, ARP vertex diagnostic#453
igerber merged 4 commits into
mainfrom
wave-3-observability

Conversation

@igerber

@igerber igerber commented May 16, 2026

Copy link
Copy Markdown
Owner

Summary

Wave 3 of multi-wave tech-debt paydown. Three Tier-A items bundled under a coherent "estimator observability" theme — each commit is independently rollback-able.

  • bc0bf39a HonestDiD test_m0_short_circuit: replace wall-clock elapsed < 0.5s proxy with mock.patch on scipy.optimize.linprog + assert_not_called(). CI-safe (no timing dependency), instantaneous, and a direct correctness signal — the M=0 fast path in _compute_worst_case_bias is supposed to skip the LP solver, so verifying the mock was never called is exactly right.
  • cf363048 WooldridgeDiD canonical-link warning: module-level _warn_if_canonical_link_violated(method, y, stacklevel) helper called from _fit_ols, _fit_logit, _fit_poisson at stacklevel=3 (matching the existing _warn_and_fill_nan_cohort convention). Four mismatches detected: ols+binary → recommend logit; ols+count → recommend poisson; logit+fractional → note QMLE consistent but not canonical; poisson+binary → recommend logit. Estimator remains consistent under QMLE; warning surfaces the canonical-link violation (W2023 Prop 3.1) that breaks numerical equivalence between QMLE-direct and OLS-by-imputation paths. REGISTRY.md gains a one-bullet note.
  • 350ce841 HonestDiD ARP vertex-rejection diagnostic: _enumerate_vertices now tracks n_total / n_linalg_error / n_infeasible counters and emits a RuntimeWarning when (a) enumeration exhausts without feasible vertices, or (b) ≥ 50% of bases were rejected for LinAlgError. The previous silent-skip behavior is fully preserved (continue on LinAlgError, return List[np.ndarray]); the warning is purely additive observability. No caller modifications.
  • a6c562c4 TODO.md: remove 3 Tier-A bullets + 2 Methodology/Correctness rows (item 1: Tier-A only; items 5 + 6: both surfaces).

Net diff: 6 files, +284/-28.

Tests

  • HonestDiD: 30/30 pass in tests/test_methodology_honest_did.py, including new TestARPVertexEnumeration (3 cases: exhausted enumeration, heavy rejection, healthy enumeration). Old test_m0_short_circuit rewritten — drops the @skipif(CI=="true") decorator (CI-safe now).
  • WooldridgeDiD: 112/112 pass in tests/test_wooldridge.py, including new TestCanonicalLinkWarning (6 cases: 3 positive + 3 negative warning paths).
  • HonestDiD-touching: 138/138 across the whole repo.

Test plan

  • pytest tests/test_methodology_honest_did.py -v — 30 pass
  • pytest tests/test_wooldridge.py -v — 112 pass
  • pytest tests/test_methodology_honest_did.py::TestOptimalFLCI::test_m0_short_circuit -v — passes (no skip on CI)
  • pytest tests/test_wooldridge.py::TestCanonicalLinkWarning -v — 6 pass
  • pytest tests/test_methodology_honest_did.py::TestARPVertexEnumeration -v — 3 pass

🤖 Generated with Claude Code

@github-actions

Copy link
Copy Markdown

igerber and others added 3 commits May 16, 2026 08:49
…g mock

The test was using `time.time() - t0 < 0.5s` as a proxy for "M=0 took the
fast path." Wall-clock proxies on shared CI runners are flaky, so the test
was `skipif(CI=="true")` and only ran locally.

The fast path is `_compute_worst_case_bias(... M=0) -> 0.0` at
honest_did.py:1650, which means `scipy.optimize.linprog` is never reached.
Direct correctness signal: `mock.patch` on `diff_diff.honest_did.optimize.linprog`
plus `assert_not_called()`. CI-safe, instantaneous, and verifies the actual
short-circuit path rather than a timing proxy.

Drops the CI skipif decorator and the unused `import os`. Tier A row 1 of
post-Wave-2 backlog.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`_enumerate_vertices` was swallowing `np.linalg.LinAlgError` on every
basis with `try / except / continue`, and `_compute_arp_test` returned
False (conservative non-rejection) when the returned list was empty.
Users had no way to tell whether the test "did not reject" because the
data didn't support rejection or because the vertex search was
numerically pathological (singular A_sys on every basis, degenerate
moment-inequality system).

Instrument `_enumerate_vertices` with three counters (`n_total`,
`n_linalg_error`, `n_infeasible`) and emit a `RuntimeWarning` at
function exit when:
  1. `vertices == []` after `n_total > 0` bases tried — enumeration
     exhausted; caller will fall back to conservative non-rejection.
  2. `vertices != []` but `n_linalg_error / n_total >= 0.5` —
     enumeration heavily constrained; recovered vertices may be
     numerically fragile.

`RuntimeWarning` (not `UserWarning`) marks this as a numerical /
algorithmic signal rather than a user-input issue. `stacklevel=3` so the
warning surfaces at `_compute_arp_test`'s caller, matching the codebase
convention for one-level-deep helper warnings.

No changes to the return type, the caller (`_compute_arp_test`), or
the algorithm semantics — the previous silent-skip behavior is fully
preserved, the diagnostic is purely additive.

Tier-A row in the post-Wave-2 backlog (TODO.md, item 6, PR #334
reference). Adds new `TestARPVertexEnumeration` class in
test_methodology_honest_did.py with three cases: exhausted enumeration,
heavy rejection, healthy enumeration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Items 1 and 6 from the post-Wave-2 backlog are now shipped:
  - HonestDiD test_m0_short_circuit (mock-based fix, commit bc0bf39)
  - HonestDiD ARP vertex-rejection diagnostic (commit 0bfaabba)

Removals:
  - Tier-A bullets for items 1 and 6
  - Methodology/Correctness row for HonestDiD ARP vertex-rejection (PR #334)

Item 5 (WooldridgeDiD method/outcome pairing) reframed in place — PR #453
R1 review pointed out that the original "canonical link requirement
(W2023 Prop 3.1) not enforced" framing misrepresented Wooldridge (2023):
Table 1 lists Gaussian/OLS for "any response" and logistic-Bernoulli for
"binary OR fractional", so neither OLS-on-binary nor logit-on-fractional
is a Prop 3.1 violation. A useful hint still exists at the *efficiency*
level (OLS on binary is consistent but logit is typically more efficient
for inference), but should not be framed as a methodology violation.

Both the Methodology/Correctness row and the Tier-A bullet for item 5
are rewritten to reflect this. Item 5 stays open with the corrected
framing for a future wave to address.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber igerber force-pushed the wave-3-observability branch from a6c562c to 7c0bdb2 Compare May 16, 2026 12:50
@github-actions

Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 7c0bdb26e368bfa3d54b0ea04658d9bb4faaf6b0


Overall Assessment

✅ Looks good

Executive Summary

  • No unmitigated P0/P1 issues in the current diff.
  • HonestDiD’s code change is observability-only: _enumerate_vertices() still returns [] on vertex-search failure, and _compute_arp_test() still treats that as conservative non-rejection diff_diff/honest_did.py:L1935-L1957, diff_diff/honest_did.py:L2005-L2010.
  • The M=0 regression is materially better: it now directly asserts the LP solver is skipped instead of relying on wall-clock timing tests/test_methodology_honest_did.py:L249-L279, diff_diff/honest_did.py:L1651-L1667.
  • Re-review scope: the prior Wooldridge canonical-link methodology issue no longer appears in the current branch. The remaining TODO correctly reframes it as an optional efficiency hint, and the public method contract still allows logit for binary or fractional outcomes TODO.md:L99-L99, diff_diff/wooldridge.py:L273-L275.
  • P3: the new “heavy rejection” test does not actually isolate the heavily constrained branch; as written it can pass via the exhausted warning, so that branch is still effectively unpinned tests/test_methodology_honest_did.py:L519-L546.
  • I could not execute the tests locally because pytest is unavailable and the Python environment here is missing numpy.

Methodology

No findings. The HonestDiD change does not alter the estimator, identified set, or inference surface. Delta^RM still uses the documented conservative naive FLCI path because ARP remains disabled pending calibration diff_diff/honest_did.py:L2469-L2486, docs/methodology/REGISTRY.md:L2692-L2695.

Code Quality

No findings. Replacing the timing proxy with a direct mock-based assertion is a better correctness signal for the M=0 fast path tests/test_methodology_honest_did.py:L249-L279.

Performance

No findings. The added counters and warning checks are O(1) per enumerated basis and do not change the combinatorial search complexity diff_diff/honest_did.py:L1894-L1957.

Maintainability

No findings. The TODO cleanup matches the actual state of the branch, including the corrected Wooldridge framing TODO.md:L95-L100, TODO.md:L175-L178.

Tech Debt

No findings. Removing the resolved HonestDiD TODOs is appropriate, and the remaining HonestDiD ARP calibration debt stays tracked separately TODO.md:L95-L95, docs/methodology/REGISTRY.md:L2693-L2694.

Security

No findings.

Documentation/Tests

  • Severity P3 | Impact: tests/test_methodology_honest_did.py:L519-L546 is intended to cover the new heavily constrained warning in diff_diff/honest_did.py:L1947-L1957, but the fixture only has two unique nuisance-row types across 5 moments. That makes every 3-row basis singular, so the test can satisfy its assertion through the exhausted warning alone and will not catch regressions specific to the ≥50%-rejected-with-some-feasible-vertices branch. | Concrete fix: replace the fixture with one that yields both feasible and singular bases, then assert specifically on the heavily constrained message. Keep the pure exhaustion case separate as already covered at tests/test_methodology_honest_did.py:L505-L517.

P3 — `test_enumerate_vertices_warns_on_heavy_rejection` previously used
a fixture (X_tilde with 2 unique row types, 5 moments → all 3-row bases
singular) where the assertion could pass via the `exhausted` warning
instead of the intended `heavily constrained` branch. The branch was
effectively untested.

Rewrite the fixture: 5 moments × 1 nuisance column, C(5,2)=10 bases. By
design, 6 bases trip LinAlgError (pairs among the singular-X_tilde
indices) and 4 bases produce feasible vertices (each pairs a positive
X_tilde with the unique negative X_tilde at index 4). 60% rejection rate
hits the heavily-constrained branch specifically, not exhaustion.

Switched to `pytest.warns(RuntimeWarning, match="heavily constrained")`
so the test now fails if the message changes or if the wrong branch
fires. Added a `len(vertices) >= 1` assertion to guard against the
fixture inadvertently producing the exhausted-branch outcome.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Copy link
Copy Markdown

@igerber igerber added the ready-for-ci Triggers CI test workflows label May 16, 2026
@igerber igerber merged commit 176174c into main May 16, 2026
33 of 34 checks passed
@igerber igerber deleted the wave-3-observability branch May 16, 2026 14:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-for-ci Triggers CI test workflows

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant