Comparing master...compressed-parse-table · tree-sitter/tree-sitter · GitHub
Skip to content
Permalink

Comparing changes

Choose two branches to see what’s changed or to start a new pull request. If you need to, you can also or learn more about diff comparisons.

Open a pull request

Create a new pull request by comparing changes across two branches. If you need to, you can also . Learn more about diff comparisons here.
base repository: tree-sitter/tree-sitter
Failed to load repositories. Confirm that selected base ref is valid, then try again.
Loading
base: master
Choose a base ref
...
head repository: tree-sitter/tree-sitter
Failed to load repositories. Confirm that selected head ref is valid, then try again.
Loading
compare: compressed-parse-table
Choose a head ref
Checking mergeability… Don’t worry, you can still create the pull request.
  • 9 commits
  • 13 files changed
  • 4 contributors

Commits on May 19, 2026

  1. perf(generate): add CSR-compressed parse tables (ABI 16)

    Apply Compressed Sparse Row (CSR) compression to the parse table for
    grammars that benefit, replacing the original dense + small split with
    three flat arrays:
    
      uint32_t parse_table_row_offsets[STATE_COUNT + 1]
      uint16_t parse_table_columns[TOTAL_NNZ]
      uint16_t parse_table_values[TOTAL_NNZ]
    
    Heuristic for enabling CSR (per grammar, all three must hold):
    
      1. LARGE_STATE_COUNT * SYMBOL_COUNT > STATE_COUNT * 40
         Ensures the dense table is large enough relative to total state
         count that savings outweigh the small-state grouping penalty.
    
      2. dense table density < 45%
         Ensures CSR actually saves space. Above ~50% density, CSR's
         per-entry column indices cost more than the zeros they eliminate.
         45% adds margin below the theoretical crossover.
    
      3. LARGE_STATE_COUNT * SYMBOL_COUNT > 50,000
         Avoids applying a format change to tiny grammars where fixed
         overhead dominates.
    
    Co-authored-by: Tuomas Hietanen <thorium@iki.fi>
    WillLillis and Thorium committed May 19, 2026
    Configuration menu
    Copy the full SHA
    ea90489 View commit details
    Browse the repository at this point in the history
  2. perf(generate): deduplicate identical small parse table entries

    When generating the hybrid format's small parse table, some grammars
    emit many states with identical grouped (symbol, action) data.Detect
    duplicates and reuse a single table offset for them.
    
    Co-authored-by: Tuomas Hietanen <thorium@iki.fi>
    WillLillis and Thorium committed May 19, 2026
    Configuration menu
    Copy the full SHA
    8f663fb View commit details
    Browse the repository at this point in the history
  3. Configuration menu
    Copy the full SHA
    d4d7c1b View commit details
    Browse the repository at this point in the history
  4. fixup: language test ABI>=15

    clason authored and WillLillis committed May 19, 2026
    Configuration menu
    Copy the full SHA
    2b8647f View commit details
    Browse the repository at this point in the history
  5. wip(generate): per-state representation picker for plan B

    Add dense/CSR/small picker keyed on overhead-aware per-state cost
    plus a dedup-promotion pass for canonical small-table groups. Emits
    PER_STATE_* defines for corpus-runner CSV. No emission change yet:
    the picker only reports stats while the existing 2-way path drives
    parser.c output.
    WillLillis committed May 19, 2026
    Configuration menu
    Copy the full SHA
    0532e54 View commit details
    Browse the repository at this point in the history
  6. wip(generate): state renumbering scaffold for plan B

    Add `assign_initial_state_repr` + `reorder_states_by_repr` so the
    picker output drives a contiguous [Dense | CSR | Small] tier layout
    with all Shift/Goto state references remapped through the new ids,
    plus parallel state-indexed arrays permuted to match.
    
    Step 2 wires this with the existing 2-way decision as input, so the
    permutation is the identity and parser.c output is byte-identical
    across the 320-grammar corpus. Step 3 will switch the input to the
    free 3-way picker and add three-way emission.
    WillLillis committed May 19, 2026
    Configuration menu
    Copy the full SHA
    73f7a94 View commit details
    Browse the repository at this point in the history
  7. wip(generate): emit 3-way per-state parse table layout (plan B)

    Replace the per-grammar CSR-vs-hybrid choice with a per-state picker
    that places each state in the smallest of dense / CSR / small. State
    ids are partitioned into contiguous tiers [Dense | CSR | Small] driven
    by `large_state_count` and a new `csr_state_count` field on
    TSLanguage. The runtime dispatches on two range checks; the rest of
    the lookup logic per tier is unchanged.
    
    States 0 (error) and 1 (start state) are pinned to the Dense tier so
    they remain at indices 0 and 1, matching the runtime's hard-coded
    expectations. Cost is bounded at ~2 * SYMBOL_COUNT * 2 bytes per
    grammar.
    
    Cleanup: drop --table-fmt CLI, OptLevel::ForceHybridTable /
    ForceCompressedTable, RenderError::ConflictingParseTableFlags,
    heuristic_should_compress, and use_compressed_tables since the free
    picker provably picks the smallest representation per state.
    
    Tests pass against the regenerated fixtures. The size savings are
    measured in a follow-up corpus run.
    WillLillis committed May 19, 2026
    Configuration menu
    Copy the full SHA
    65e8b25 View commit details
    Browse the repository at this point in the history
  8. wip(generate): bias the 3-way picker toward CSR over Small

    The bytes-optimal pick is not the parse-time-optimal pick. CSR uses
    O(log n) binary search; Small uses an O(group_count + nnz) linear
    scan over grouped sections. The unbiased picker put ~50% of states
    into Small, causing parse-time regressions of up to +24% on grammars
    like kotlin and javascript.
    
    Add a SMALL_VS_CSR_BIAS constant (0.6) that requires Small to be at
    least 40% smaller than CSR before the picker prefers it, applied both
    to the per-state argmin and the canonical-group dedup-promotion pass.
    This keeps each grammar's parse-time delta within roughly the ±10%
    band that csr-clean already occupies vs master, while still
    delivering ~7pp of the ~11pp size win available from the unbiased
    picker (~38% .so reduction vs master across the corpus).
    WillLillis committed May 19, 2026
    Configuration menu
    Copy the full SHA
    591847a View commit details
    Browse the repository at this point in the history

Commits on May 24, 2026

  1. Configuration menu
    Copy the full SHA
    5e2fb7d View commit details
    Browse the repository at this point in the history
Loading