Bump AI PR review to gpt-5.5 + add Single-Pass Completeness Mandate by igerber · Pull Request #404 · igerber/diff-diff · GitHub
Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
26 changes: 16 additions & 10 deletions .claude/commands/ai-review-local.md
111 changes: 96 additions & 15 deletions .claude/scripts/openai_review.py
Original file line number Diff line number Diff line change
Expand Up @@ -853,6 +853,8 @@ def apply_token_budget(
# Source: https://platform.openai.com/docs/pricing
# MAINTENANCE: Update when OpenAI changes pricing.
PRICING = {
"gpt-5.5": (5.00, 30.00),
"gpt-5.5-pro": (30.00, 180.00),
"gpt-5.4": (2.50, 15.00),
"gpt-5.4-pro": (30.00, 180.00),
"gpt-4.1": (2.00, 8.00),
Expand Down Expand Up @@ -900,8 +902,8 @@ def estimate_cost(
"code changes that have not yet been submitted as a pull request.",
),
(
"Review ONLY the changes introduced by this PR (diff)",
"Review ONLY the changes shown in the diff below",
"Review the changes introduced by this PR (diff)",
"Review the code changes shown in the diff below",
),
(
"If the PR changes an estimator",
Expand Down Expand Up @@ -933,6 +935,62 @@ def estimate_cost(
"Use the branch name only to understand which "
"methods/papers are intended.",
),
# Replace the CI Single-Pass Completeness Mandate with a local-mode note.
# The CI Mandate instructs the reviewer to run shell greps, load sibling
# files, and sweep transitive paths — none of which the local raw API
# path can do. Leaving the CI wording in place would cause the model to
# claim audits it cannot perform.
(
"""## Single-Pass Completeness Mandate (Initial Review Only)

This is an INITIAL review. Treat this as the only chance to enumerate findings.
Follow-up rounds are expensive — find ALL P0/P1/P2 issues in this pass.

Before finalizing, confirm you have run each of these audits on the diff:

1. **Sibling-surface mirror audit**: For every fix or change in a method, schema,
default-value path, or report block, identify the parallel surface in the same
codebase (BR ↔ DR, schema ↔ renderer, default ↔ precomputed, summary ↔ full)
and check whether the same change applies there. Flag the unmirrored side as P1.

2. **Pattern-wide grep**: When you flag any anti-pattern or bug class, use `grep`
on `diff_diff/**.py` to identify sibling occurrences of the same pattern and
enumerate them in the SAME finding. Only LOAD a sibling file's full contents
if grep returns a hit and you need surrounding context to verify the issue.
Do not defer pattern-class findings to a follow-up round.

3. **Reciprocal/symmetry check**: For dispatch code, validation, or guards in
one direction (A-on-B), explicitly enumerate the reciprocal direction (B-on-A)
and confirm coverage.

4. **Transitive workflow deps**: For GH Actions workflow `paths:` or pytest
selection changes, sweep transitive auto-loaded files (conftest.py,
pyproject.toml, ancestor conftests) and confirm they are included.

5. **Scope override (with carve-outs)**: The audits above explicitly authorize
loading files outside the diff to verify completeness. This overrides the
"minimum surrounding context" default in the Rules section below.

**DO NOT load these paths** (the workflow's diff-build deliberately excludes
them; they are noise or out-of-scope):
- `docs/tutorials/*.ipynb` (notebook outputs are large JSON blobs)
- `benchmarks/data/real/*.json`
- `benchmarks/data/real/*.csv`""",
"""## Single-Pass Completeness Audit (Local Review)

This is a local review running as a static-prompt API call. You do NOT have
shell or file-loading access — only the prompt content below is available
(diff + changed source files + first-level imports).

Find ALL P0/P1/P2 issues within the loaded context. Audit sibling surfaces,
parallel patterns, and reciprocal directions THAT ARE VISIBLE in the loaded
files.

Do NOT claim to have run shell greps, loaded sibling files outside the
prompt, or audited paths not present here. If a relevant audit is impossible
because the necessary context is not in the prompt, say so explicitly rather
than asserting completeness.""",
),
]


Expand Down Expand Up @@ -1095,15 +1153,30 @@ def compile_prompt(
# ---------------------------------------------------------------------------

ENDPOINT = "https://api.openai.com/v1/responses"
DEFAULT_MODEL = "gpt-5.4"
DEFAULT_TIMEOUT = 300 # seconds
DEFAULT_MODEL = "gpt-5.5"
DEFAULT_TIMEOUT = 300 # seconds — non-reasoning models
REASONING_TIMEOUT = 900 # seconds — reasoning models can take 10-15 min
DEFAULT_MAX_TOKENS = 16384
REASONING_MAX_TOKENS = 32768


def _is_reasoning_model(model: str) -> bool:
"""Return True for models that use internal chain-of-thought reasoning."""
return model.startswith(("o1", "o3", "o4")) or "-pro" in model
return model.startswith(("o1", "o3", "o4", "gpt-5.4", "gpt-5.5")) or "-pro" in model


def _resolve_timeout(timeout: "int | None", model: str) -> int:
"""Resolve the effective HTTP timeout for an API call.

If --timeout was explicitly provided, use it. Otherwise pick
REASONING_TIMEOUT (900s) for reasoning models and DEFAULT_TIMEOUT
(300s) for standard models. This prevents reasoning-model reviews
from hitting a too-short default when the wrapper command does not
pass --timeout.
"""
if timeout is not None:
return timeout
return REASONING_TIMEOUT if _is_reasoning_model(model) else DEFAULT_TIMEOUT


def estimate_tokens(text: str) -> int:
Expand Down Expand Up @@ -1134,13 +1207,19 @@ def call_openai(
prompt: str,
model: str,
api_key: str,
timeout: int = DEFAULT_TIMEOUT,
timeout: "int | None" = None,
) -> "tuple[str, dict]":
"""Call the OpenAI Responses API.

If ``timeout`` is None, resolves to REASONING_TIMEOUT (900s) for
reasoning models and DEFAULT_TIMEOUT (300s) otherwise — same logic
as the CLI ``--timeout`` flag. This guards future direct callers
against the old 300s-everywhere default.

Returns (content, usage) where usage is the API response's usage dict
containing input_tokens and output_tokens.
"""
timeout = _resolve_timeout(timeout, model)
reasoning = _is_reasoning_model(model)
max_tokens = REASONING_MAX_TOKENS if reasoning else DEFAULT_MAX_TOKENS

Expand Down Expand Up @@ -1343,8 +1422,12 @@ def main() -> None:
parser.add_argument(
"--timeout",
type=int,
default=DEFAULT_TIMEOUT,
help=f"HTTP request timeout in seconds (default: {DEFAULT_TIMEOUT})",
default=None,
help=(
f"HTTP request timeout in seconds. If omitted, defaults to "
f"{REASONING_TIMEOUT} for reasoning models and {DEFAULT_TIMEOUT} for "
f"standard models."
),
)
parser.add_argument(
"--delta-diff",
Expand Down Expand Up @@ -1374,6 +1457,10 @@ def main() -> None:

args = parser.parse_args()

# Resolve --timeout: reasoning models default to REASONING_TIMEOUT (900s),
# standard models to DEFAULT_TIMEOUT (300s). Explicit --timeout overrides.
args.timeout = _resolve_timeout(args.timeout, args.model)

# Post-parse validation
if args.context != "minimal" and not args.repo_root:
parser.error(
Expand Down Expand Up @@ -1614,13 +1701,7 @@ def main() -> None:
sys.exit(0)

# Call OpenAI API
if _is_reasoning_model(args.model) and args.timeout == DEFAULT_TIMEOUT:
print(
f"Note: {args.model} is a reasoning model. Consider --timeout 900 "
"for large reviews.",
file=sys.stderr,
)
print(f"Sending review to {args.model}...", file=sys.stderr)
print(f"Sending review to {args.model} (timeout: {args.timeout}s)...", file=sys.stderr)
print(f"Estimated input tokens: ~{est_tokens:,}", file=sys.stderr)
if cost_str:
print(f"Estimated cost: {cost_str}", file=sys.stderr)
Expand Down
46 changes: 44 additions & 2 deletions .github/codex/prompts/pr_review.md
Original file line number Diff line number Diff line change
Expand Up @@ -54,6 +54,42 @@ When reviewing new features or code paths, specifically check:
- Command to check: `grep -n "pattern" diff_diff/*.py`
- Flag as P1 if only partial fixes were made

## Single-Pass Completeness Mandate (Initial Review Only)

This is an INITIAL review. Treat this as the only chance to enumerate findings.
Follow-up rounds are expensive — find ALL P0/P1/P2 issues in this pass.

Before finalizing, confirm you have run each of these audits on the diff:

1. **Sibling-surface mirror audit**: For every fix or change in a method, schema,
default-value path, or report block, identify the parallel surface in the same
codebase (BR ↔ DR, schema ↔ renderer, default ↔ precomputed, summary ↔ full)
and check whether the same change applies there. Flag the unmirrored side as P1.

2. **Pattern-wide grep**: When you flag any anti-pattern or bug class, use `grep`
on `diff_diff/**.py` to identify sibling occurrences of the same pattern and
enumerate them in the SAME finding. Only LOAD a sibling file's full contents
if grep returns a hit and you need surrounding context to verify the issue.
Do not defer pattern-class findings to a follow-up round.

3. **Reciprocal/symmetry check**: For dispatch code, validation, or guards in
one direction (A-on-B), explicitly enumerate the reciprocal direction (B-on-A)
and confirm coverage.

4. **Transitive workflow deps**: For GH Actions workflow `paths:` or pytest
selection changes, sweep transitive auto-loaded files (conftest.py,
pyproject.toml, ancestor conftests) and confirm they are included.

5. **Scope override (with carve-outs)**: The audits above explicitly authorize
loading files outside the diff to verify completeness. This overrides the
"minimum surrounding context" default in the Rules section below.

**DO NOT load these paths** (the workflow's diff-build deliberately excludes
them; they are noise or out-of-scope):
- `docs/tutorials/*.ipynb` (notebook outputs are large JSON blobs)
- `benchmarks/data/real/*.json`
- `benchmarks/data/real/*.csv`

## Deferred Work Acceptance

This project tracks deferred technical debt in `TODO.md` under "Tech Debt from Code Reviews."
Expand All @@ -68,7 +104,9 @@ This project tracks deferred technical debt in `TODO.md` under "Tech Debt from C
and incorrect statistical output are not.

Rules:
- Review ONLY the changes introduced by this PR (diff) and the minimum surrounding context needed.
- Review the changes introduced by this PR (diff). The Single-Pass Completeness
Mandate above authorizes broader audits (sibling surfaces, pattern-wide greps,
reciprocal checks, transitive deps) — do those upfront rather than deferring.
- Provide a single Markdown report with:
- Overall assessment (see Assessment Criteria below)
- Executive summary (3–6 bullets)
Expand Down Expand Up @@ -116,7 +154,11 @@ When this is a re-review (the PR has prior AI review comments):
- New P1+ findings on unchanged code MAY be raised but must be marked "[Newly identified]"
to distinguish from moving goalposts. Limit these to clear, concrete issues — not
speculative concerns or stylistic preferences.
- New code added since the last review IS in scope for new findings.
- New code added since the last review IS in scope for new findings — apply the
Single-Pass Completeness Mandate's audits (sibling surfaces, pattern-wide greps,
reciprocal checks) to that new code in this re-review pass. For UNCHANGED code,
the existing [Newly identified] convention from the bullet above still applies:
new P1+ findings MAY be raised but must be marked "[Newly identified]".
- If all previous P1+ findings are resolved, the assessment should be ✅ even if new
P2/P3 items are noticed.

Expand Down
2 changes: 1 addition & 1 deletion .github/workflows/ai_pr_review.yml
Loading
Loading