Clean up parse table representation, use 16 bits for production_id by maxbrunsfeld · Pull Request #943 · tree-sitter/tree-sitter · GitHub
Skip to content

Clean up parse table representation, use 16 bits for production_id#943

Merged
maxbrunsfeld merged 2 commits into
masterfrom
new-parse-table-repr
Feb 26, 2021
Merged

Clean up parse table representation, use 16 bits for production_id#943
maxbrunsfeld merged 2 commits into
masterfrom
new-parse-table-repr

Conversation

@maxbrunsfeld

Copy link
Copy Markdown
Contributor

Fixes #511

Background

Currently, because of the way that Tree-sitter's parser ABI incrementally evolved, the production_id field on parse actions (which is used to identify the sequence of fields and aliases that applies to a given node's children) is represented as a uint8_t. So far, this has not been a problem for any of the grammars that we maintain, but other users have hit the limit, and when it happens, it's very confusing.

Change

This PR changes production_id to a nice uint16_t, which should always be more than large enough.

Unfortunately, because of some shortcomings of the old struct layout, this was not feasible to do without making a breaking change to the parser ABI. So, for the first time in a long time, I have bumped tree-sitter's TREE_SITTER_MIN_COMPATIBLE_LANGUAGE_VERSION constant. This means that new versions of the library will refuse to load parsers that were generated with previous versions of the CLI. Specifically ts_parser_set_language will return false. Users will need to regenerate their parsers.

Normally, this is not a problem: I think that for most use cases, users can easily upgrade the library and their parsers together. There are a few exceptions that I know of:

  1. Atom - At some point, when Atom adopts this version of Tree-sitter, this change will affect Atom users who have installed packages using third-party Tree-sitter parsers. The packages will needto be updated to use a regenerated version of the parsers. /cc @darangi - I don't think Atom plans on upgrading Tree-sitter anytime soon, so this is probably not urgent for you I think?

  2. Neovim - I think that some end-users of Neovim have installed their own third-party parsers. When Neovim upgrades to the latest version of Tree-sitter, Neovim will refuse to load these older parsers. The end users will need to install new versions of the parsers. /cc @bfredl @vigoux @theHamsta @tjdevries

  3. Emacs-Tree-Sitter - /cc @ubolonton I'm not sure if users can install their own parsers. If so, then this is something to be aware of.

Details

Since an ABI change was already needed, I took this opportunity to fix some dumb aspects of the current ABI. There are no more bitfields (they were used unnecessarily before). Also, I've added one more field to the TSLanguage struct - production_id_count, which makes it so that all the arrays on TSLanguage can have their length known at runtime.

Also remove the use of bitfields from the parse table format.
In all cases, bitfields were not necessary to achieve the
current binary sizes. Avoiding them makes the binaries more
portable.

There was no way to make this change backward-compatible,
so we have finally dropped support for parsers generated
with an earlier version of Tree-sitter.

At some point, when Atom adopts this version of Tree-sitter,
this change will affect Atom users who have installed packages
using third-party Tree-sitter parsers. The packages will need
to be updated to use a regenerated version of the parsers.
I think this is the last additional field that's needed so
that every array member of TSLanguage has a length that
can be calculated at runtime.
@ubolonton

Copy link
Copy Markdown
Contributor

maxbrunsfeld added a commit that referenced this pull request Mar 1, 2021
Due to the breaking ABI change in #943, this is our chance
to reorder the fields in a more logical way.
qjoner

This comment was marked as spam.

@dundargoc dundargoc added this to the 1.0 milestone Feb 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Tree-sitter generates invalid code for grammars with large numbers of fields and/or aliases

4 participants