[codex] PostgreSQL compatibility harness foundation by renecannao · Pull Request #50 · ProxySQL/ParserSQL · GitHub
Skip to content

[codex] PostgreSQL compatibility harness foundation#50

Open
renecannao wants to merge 31 commits into
mainfrom
feat/postgresql-compat-harness
Open

[codex] PostgreSQL compatibility harness foundation#50
renecannao wants to merge 31 commits into
mainfrom
feat/postgresql-compat-harness

Conversation

@renecannao

@renecannao renecannao commented Jun 11, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Add the PostgreSQL compatibility harness foundation for ParserSQL.
  • Pin and validate the libpg_query 17-latest and 18-latest oracle sources.
  • Add the compatibility result model, statement-type mapping, differential runner, and deterministic inventory normalization.
  • Add PG17-to-PG18 inventory and structural delta extraction for grammar, keywords, parse nodes, and protobuf schemas.
  • Add baseline transition enforcement, deterministic committed CI case generation, and reviewed PostgreSQL syntax witnesses.
  • Add deterministic Markdown compatibility report generation for PostgreSQL 18.
  • Add Makefile orchestration commands and GitHub Actions coverage for the pg_compat gate and scheduled full validation.
  • Document the PostgreSQL compatibility workflow, prerequisites, cache behavior, and result taxonomy.

What is included

Validation

  • make test with local Homebrew dependency overrides: 1,299 tests run, 1,262 passed, 37 skipped.
  • make test-pg-compat PG_COMPAT_CACHE=/tmp/parsersql-pg-compat-task9: 87 tests passed; checked 24,080 committed pg_compat CI cases.
  • make pg-compat PG_COMPAT_CACHE=/tmp/parsersql-pg-compat-task9: skipped 13 known invalid/unsplittable PostgreSQL corpus files per oracle role; checked 24,080 committed pg_compat CI cases; committed PostgreSQL compatibility baseline matches.
  • make pg-compat-refresh PG_COMPAT_CACHE=/tmp/parsersql-pg-compat-task9: regenerated PostgreSQL 18 baseline/report, 51,415 baseline rows, 24,080 CI cases, 167 structural features.
  • python3 YAML parse check for .github/workflows/ci.yml: jobs include pg-compat and pg-compat-full.
  • git diff --check: passed after Task 12 docs.
  • Result taxonomy verification: every accepted committed PostgreSQL case is one of DEEP_SUPPORTED, CLASSIFIED_ONLY, PARTIAL, ERROR, TRAILING_INPUT, or TYPE_MISMATCH.

Notes

  • PostgreSQL source pins remain immutable in tests/pg_compat/upstream_pins.json; CI does not silently advance 18-latest.
  • The PR/push pg_compat job runs the deterministic committed gate (make test-pg-compat).
  • The scheduled job runs the full pinned PG17/PG18 compatibility validation (make pg-compat).
  • libpg_query is not linked into production parser targets; it is confined to benchmarks and the pg_compat harness.
  • DEEP_SUPPORTED is the only full parser parity result; CLASSIFIED_ONLY is reported separately as top-level statement routing coverage.

Summary by CodeRabbit

  • New Features

    • Added PostgreSQL compatibility testing to verify parser alignment with PostgreSQL 17 and 18 versions.
  • Build & CI

    • Implemented automated compatibility testing in continuous integration with weekly scheduled runs.
    • Added build targets for running compatibility tests locally and refreshing test baselines.
  • Documentation

    • Added comprehensive documentation for the compatibility testing harness and reproduction procedures.

@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Copy link
Copy Markdown

@renecannao renecannao marked this pull request as ready for review June 16, 2026 10:32

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (5)
scripts/pg_compat/run_compat.py (1)

275-275: ⚡ Quick win

Remove unused _diagnostics variable.

The partition_rows function returns a tuple, but only the first element is used. The second element _diagnostics is assigned but never referenced.

♻️ Suggested cleanup
-    accepted, _diagnostics = partition_rows(rows)
+    accepted, _ = partition_rows(rows)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/pg_compat/run_compat.py` at line 275, The partition_rows function
call is unpacking both return values into accepted and _diagnostics, but
_diagnostics is never used afterward. Remove the unused variable assignment by
only capturing the first element from the partition_rows return value in the
assignment statement.
scripts/pg_compat/generate_report.py (1)

223-223: ⚡ Quick win

Consider dynamic version numbers in section headers for consistency.

The title on line 196 uses target_major dynamically, but the "PG18 Backlog" header and "PG17 to PG18 Release Delta" header hardcode version numbers. For consistency and future maintainability (e.g., extending to PG19), consider:

# Line 223
f"## PG{target_major} Backlog",

# Line 250
f"## PG{_major(previous['pg_version'])} to PG{target_major} Release Delta",

Also applies to: 250-250

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/pg_compat/generate_report.py` at line 223, The section headers in the
generate_report.py file hardcode version numbers (PG18, PG17 to PG18) instead of
using dynamic variables for consistency. At line 223, replace the hardcoded "##
PG18 Backlog" string with an f-string using the target_major variable to match
the dynamic approach used in the title at line 196. At line 250, similarly
replace the hardcoded "## PG17 to PG18 Release Delta" header with an f-string
that constructs the version numbers dynamically using
_major(previous['pg_version']) and target_major. This ensures the headers will
adapt automatically if the script is extended to support future PostgreSQL
versions.
scripts/pg_compat/fetch_libpg_query.sh (1)

187-210: 💤 Low value

Python os.setsid() could fail if already a session leader.

Line 196 calls os.setsid() to create a new session for proper signal handling. This will fail with EPERM if the calling process is already a session leader (rare in practice, but possible if the script is invoked through certain process managers).

However, the os.execvp() call on line 199 will still execute even if setsid() fails, so the subprocess will run. For this use case (interruptible git/curl operations in the compat harness), this edge case is acceptable.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/pg_compat/fetch_libpg_query.sh` around lines 187 - 210, The
os.setsid() call in the Python snippet within the run_interruptible function
could fail with EPERM if the process is already a session leader, causing the
script to exit prematurely. Wrap the os.setsid() call in a try-except block to
gracefully handle this edge case, catching OSError or other potential exceptions
and continuing execution since the subprocess will still run properly even if
session creation fails.
scripts/pg_compat/baseline.py (1)

308-353: 💤 Low value

Consider adding strict=True to zip for clarity.

The zip() call at line 324 lacks an explicit strict= parameter. While the length validation at lines 314-318 ensures the lists have the same length, adding strict=True would make the intent explicit and provide a built-in runtime check.

Note: This requires Python 3.10+. If the project targets Python 3.9 or earlier, skip this suggestion.

♻️ Optional enhancement
-    for index, (oracle_row, metadata_row) in enumerate(
-        zip(oracle_rows, metadata_rows)
-    ):
+    for index, (oracle_row, metadata_row) in enumerate(
+        zip(oracle_rows, metadata_rows, strict=True)
+    ):
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/pg_compat/baseline.py` around lines 308 - 353, The zip() call in the
validate_witnesses function that iterates over oracle_rows and metadata_rows
lacks an explicit strict parameter. Add strict=True to the zip() call to make
the intent explicit and provide a built-in runtime check that ensures both
sequences have equal length, which aligns with the length validation already
performed at the start of the function. This change requires Python 3.10 or
later support.
tests/pg_compat/fixtures/pg17.proto (1)

2-2: Add clarifying comment or message stub for undefined Node type in fixture.

The Node type is referenced but not defined in this protobuf fixture. While the text-based extraction logic in compare_versions.py handles this without issue, the fixture lacks self-contained clarity. Either add a brief comment explaining that Node is intentionally omitted (e.g., // Node defined separately), or include a minimal stub (message Node {}), similar to the pattern used in related tests.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/pg_compat/fixtures/pg17.proto` at line 2, The protobuf fixture file
references the Node type in the target_list field but does not define it,
reducing clarity and self-containment. In the pg17.proto fixture file where the
repeated Node target_list field is defined, add either a brief clarifying
comment (such as "// Node defined separately") above or near the target_list
field definition, or include a minimal stub message definition for Node (e.g.,
"message Node {}") elsewhere in the file to match patterns used in related
tests. This clarifies that the omission is intentional and aids anyone reviewing
the fixture.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/ci.yml:
- Line 113: Pin all GitHub Actions references to commit SHAs instead of semantic
versions to comply with the repository's security policy. In
`.github/workflows/ci.yml` at lines 113 and 133, replace `actions/checkout@v4`
with `actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af172` (or the current
latest v4 commit SHA). At lines 116 and 136, replace `actions/cache@v4` with the
full commit SHA for the latest v4 release of that action. Verify each pinned SHA
corresponds to the intended version tag to prevent action mutations.
- Around line 106-124: The pg-compat and pg-compat-full jobs in
.github/workflows/ci.yml lack explicit permissions configuration and allow
credential persistence, creating unnecessary security risks. At
.github/workflows/ci.yml#L106-L124 (pg-compat job), add a permissions block with
contents: read at the job level and set persist-credentials: false on the
actions/checkout@v4 step. Apply the same changes at
.github/workflows/ci.yml#L126-L144 (pg-compat-full job) - add permissions: {
contents: read } at the job level and set persist-credentials: false on the
checkout action. Both jobs only require read access to the repository and do not
write artifacts or perform deployments.

---

Nitpick comments:
In `@scripts/pg_compat/baseline.py`:
- Around line 308-353: The zip() call in the validate_witnesses function that
iterates over oracle_rows and metadata_rows lacks an explicit strict parameter.
Add strict=True to the zip() call to make the intent explicit and provide a
built-in runtime check that ensures both sequences have equal length, which
aligns with the length validation already performed at the start of the
function. This change requires Python 3.10 or later support.

In `@scripts/pg_compat/fetch_libpg_query.sh`:
- Around line 187-210: The os.setsid() call in the Python snippet within the
run_interruptible function could fail with EPERM if the process is already a
session leader, causing the script to exit prematurely. Wrap the os.setsid()
call in a try-except block to gracefully handle this edge case, catching OSError
or other potential exceptions and continuing execution since the subprocess will
still run properly even if session creation fails.

In `@scripts/pg_compat/generate_report.py`:
- Line 223: The section headers in the generate_report.py file hardcode version
numbers (PG18, PG17 to PG18) instead of using dynamic variables for consistency.
At line 223, replace the hardcoded "## PG18 Backlog" string with an f-string
using the target_major variable to match the dynamic approach used in the title
at line 196. At line 250, similarly replace the hardcoded "## PG17 to PG18
Release Delta" header with an f-string that constructs the version numbers
dynamically using _major(previous['pg_version']) and target_major. This ensures
the headers will adapt automatically if the script is extended to support future
PostgreSQL versions.

In `@scripts/pg_compat/run_compat.py`:
- Line 275: The partition_rows function call is unpacking both return values
into accepted and _diagnostics, but _diagnostics is never used afterward. Remove
the unused variable assignment by only capturing the first element from the
partition_rows return value in the assignment statement.

In `@tests/pg_compat/fixtures/pg17.proto`:
- Line 2: The protobuf fixture file references the Node type in the target_list
field but does not define it, reducing clarity and self-containment. In the
pg17.proto fixture file where the repeated Node target_list field is defined,
add either a brief clarifying comment (such as "// Node defined separately")
above or near the target_list field definition, or include a minimal stub
message definition for Node (e.g., "message Node {}") elsewhere in the file to
match patterns used in related tests. This clarifies that the omission is
intentional and aids anyone reviewing the fixture.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 45be172d-156a-4e34-8517-09c2a7d56a32

📥 Commits

Reviewing files that changed from the base of the PR and between 3cfc581 and 901f666.

📒 Files selected for processing (45)
  • .github/workflows/ci.yml
  • .gitignore
  • Makefile
  • README.md
  • docs/benchmarks/REPRODUCING.md
  • docs/compatibility/postgresql-18.md
  • docs/superpowers/plans/2026-06-11-postgresql-libpg-query-compatibility.md
  • docs/superpowers/specs/2026-06-11-postgresql-libpg-query-compatibility-design.md
  • include/sql_engine/local_txn.h
  • scripts/pg_compat/baseline.py
  • scripts/pg_compat/common.py
  • scripts/pg_compat/compare_versions.py
  • scripts/pg_compat/extract_statements.py
  • scripts/pg_compat/fetch_libpg_query.sh
  • scripts/pg_compat/generate_report.py
  • scripts/pg_compat/run_compat.py
  • tests/pg_compat/ci_cases.jsonl
  • tests/pg_compat/expected_results.jsonl
  • tests/pg_compat/fixtures/pg17-gram.y
  • tests/pg_compat/fixtures/pg17-inventory.jsonl
  • tests/pg_compat/fixtures/pg17-kwlist.h
  • tests/pg_compat/fixtures/pg17-parsenodes.h
  • tests/pg_compat/fixtures/pg17.proto
  • tests/pg_compat/fixtures/pg18-gram.y
  • tests/pg_compat/fixtures/pg18-inventory.jsonl
  • tests/pg_compat/fixtures/pg18-kwlist.h
  • tests/pg_compat/fixtures/pg18-parsenodes.h
  • tests/pg_compat/fixtures/pg18.proto
  • tests/pg_compat/runner_cases.sql
  • tests/pg_compat/statement_type_cases.cpp
  • tests/pg_compat/structural_dispositions.json
  • tests/pg_compat/test_baseline.py
  • tests/pg_compat/test_common.py
  • tests/pg_compat/test_compare_versions.py
  • tests/pg_compat/test_extract_statements.py
  • tests/pg_compat/test_generate_report.py
  • tests/pg_compat/test_run_compat.py
  • tests/pg_compat/test_runner.py
  • tests/pg_compat/upstream_pins.json
  • tests/pg_compat/witnesses.json
  • tests/pg_compat/witnesses.sql
  • tools/pg_compat/pg_compat.cpp
  • tools/pg_compat/result.h
  • tools/pg_compat/statement_type_map.cpp
  • tools/pg_compat/statement_type_map.h

Comment thread .github/workflows/ci.yml
Comment on lines +106 to +124
pg-compat:
if: github.event_name != 'schedule'
runs-on: ubuntu-24.04
timeout-minutes: 30
env:
PG_COMPAT_CACHE: /tmp/parsersql-pg-compat
steps:
- uses: actions/checkout@v4

- name: Cache PostgreSQL compatibility sources
uses: actions/cache@v4
with:
path: /tmp/parsersql-pg-compat
key: pg-compat-${{ runner.os }}-${{ hashFiles('tests/pg_compat/upstream_pins.json') }}
restore-keys: |
pg-compat-${{ runner.os }}-

- name: Test PostgreSQL compatibility gate
run: make -f Makefile test-pg-compat

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Consolidation: Restrict job permissions and disable credential persistence in both pg-compat jobs.

Both the pg-compat and pg-compat-full jobs omit explicit permissions configuration and allow credential persistence, creating overly broad access and credential leakage risks.

  • .github/workflows/ci.yml#L106-L124 (pg-compat): add permissions: { contents: read } and set persist-credentials: false on checkout
  • .github/workflows/ci.yml#L126-L144 (pg-compat-full): add permissions: { contents: read } and set persist-credentials: false on checkout

Both jobs only need read access to the repository; they do not write artifacts or deploy.

🧰 Tools
🪛 zizmor (1.25.2)

[warning] 113-113: credential persistence through GitHub Actions artifacts (artipacked): does not set persist-credentials: false

(artipacked)


[warning] 106-124: overly broad permissions (excessive-permissions): default permissions used due to no permissions: block

(excessive-permissions)


[error] 113-113: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)

(unpinned-uses)


[error] 116-116: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)

(unpinned-uses)

📍 Affects 1 file
  • .github/workflows/ci.yml#L106-L124 (this comment)
  • .github/workflows/ci.yml#L126-L144
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/ci.yml around lines 106 - 124, The pg-compat and
pg-compat-full jobs in .github/workflows/ci.yml lack explicit permissions
configuration and allow credential persistence, creating unnecessary security
risks. At .github/workflows/ci.yml#L106-L124 (pg-compat job), add a permissions
block with contents: read at the job level and set persist-credentials: false on
the actions/checkout@v4 step. Apply the same changes at
.github/workflows/ci.yml#L126-L144 (pg-compat-full job) - add permissions: {
contents: read } at the job level and set persist-credentials: false on the
checkout action. Both jobs only require read access to the repository and do not
write artifacts or perform deployments.

Source: Linters/SAST tools

Comment thread .github/workflows/ci.yml

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Consolidation: Pin all GitHub Actions to commit SHAs.

Four action references use unpinned semantic versions instead of commit SHAs, which violates the repository's security policy and allows unreviewed action mutations.

  • .github/workflows/ci.yml#L113: actions/checkout@v4 → pin to SHA
  • .github/workflows/ci.yml#L116: actions/cache@v4 → pin to SHA
  • .github/workflows/ci.yml#L133: actions/checkout@v4 → pin to SHA
  • .github/workflows/ci.yml#L136: actions/cache@v4 → pin to SHA

Example: replace actions/checkout@v4 with actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af172 (or the current latest v4 commit SHA).

🧰 Tools
🪛 zizmor (1.25.2)

[warning] 113-113: credential persistence through GitHub Actions artifacts (artipacked): does not set persist-credentials: false

(artipacked)


[error] 113-113: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)

(unpinned-uses)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/ci.yml at line 113, Pin all GitHub Actions references to
commit SHAs instead of semantic versions to comply with the repository's
security policy. In `.github/workflows/ci.yml` at lines 113 and 133, replace
`actions/checkout@v4` with
`actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af172` (or the current
latest v4 commit SHA). At lines 116 and 136, replace `actions/cache@v4` with the
full commit SHA for the latest v4 release of that action. Verify each pinned SHA
corresponds to the intended version tag to prevent action mutations.

Source: Linters/SAST tools

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment