In the not-too-distant future, I'd like to bump Tree-sitter's version to 1.0, indicating a greater degree of stability and completeness. After that I'd like to regenerate all of the parsers in the tree-sitter GitHub org, and bump them to 1.0 as well. Before doing this, there are several important problems with the framework that I think should be fixed.
Tasks
Unicode character properties - Support ECMAScript unicode property escapes in regexes.
Partial Precedence Orderings - The integer precedence system makes some grammars shockingly difficult to maintain.
  Update tree-sitter-javascript and tree-sitter-typescript to use this more flexible precedence scheme. Right now, the integer precedence system is making it very difficult to continue development of tree-sitter-typescript in particular, because of the mix of different conflicts between types and expressions.
  Dynamic precedence should probably stay integer-only, for simplicity.
Grammars with many fields, aliases - By historical accident, generated parsers use too small an integer type (uint8_t) for storing nodes' field and alias information. Parsers with large numbers of fields can cause integer overflows (Tree-sitter generates invalid code for grammars with large numbers of fields and/or aliases #511).
  Store production_id as a uint16_t (Clean up parse table representation, use 16 bits for production_id #943).
  Strategy - Decide whether we're going to bother to maintain backward compatibility with old generated parsers. If so, the library code will need to become a bit more complicated in order to consume both binary formats.
  Grammars - Regenerate all the parsers with the new representation.
Fix issues with the get_column external scanner API (Fix the behavior of Lexer.get_column #978).
CLI Ergonomics
  In the parse command, auto-detect UTF-16 files and decode them accordingly. This will help Windows users who currently trip over the suggested echo command in the docs (feat: add encoding flag and automatically check if a file might be utf16 #2368).
  Support grammars defined as ECMAScript modules instead of CommonJS modules.
Reduce Coupling to Node - Introduce some Tree-sitter specific GRAMMAR_PATH setting where the CLI will search for grammar modules, instead of relying on node_modules and npm.
Mergeable Git Repos - Make it easier to collaborate on grammars by removing generated files from version control.
  Add pack and publish subcommands to the Tree-sitter CLI, for uploading tarballs and compiled .wasm files to the GitHub releases API (Store generated files as GH release artifacts instead of checking them into git repositories #730 (comment)).
Documentation
  Document the expression/identifier syntax.
  [...] tree-sitter-highlight rust crate (just using tree queries directly).
  Document the tags.scm queries used for code navigation on GitHub (Document queries/tags.scm #660).
Formalize the query spec
Stretch Goals
I'm recording these here even though they are a bit less urgent.
Incremental Parsing Perf - Enhance the external scanner API to allow for looser state comparisons, avoiding the catastrophic node-reuse failures seen in the HTML parser (Incremental parsing is ineffective when a new tag is opened tree-sitter-html#23).
  Figure out if the new scanner function can be made optional (with the parser generator inspecting scanner.c to decide whether to link against a _compare function).
  Update tree-sitter-html to use this API, improving its incremental performance.
Native Library, WASM parsers - Add a compile-time option to link the C library against a standard WASM engine (V8, wasmtime, or wasmer). When this feature is enabled, allow the native library to load WASM parsers, marshaling the parse table into native memory, and using WASM execution only for the lexing phase. This will make it more useful to distribute parsers as pre-compiled .wasm files, instead of as C code. The performance cost should be small, because all of the expensive parsing operations will still be native (Add optional WASM feature to the native library, allowing it to run wasm-compiled parsers via wasmtime #1864).
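The overflow described under "Grammars with many fields, aliases" comes down to fixed-width integer wraparound. As a rough illustration only (this is not Tree-sitter's generated code), the sketch below uses Python's ctypes to mimic storing an ID in a uint8_t versus a uint16_t. The helper name stored_id and the sample counts are hypothetical.

```python
import ctypes


def stored_id(value, ctype=ctypes.c_uint8):
    """Return `value` as it reads back after being stored in a fixed-width C integer.

    ctypes integer types do no overflow checking, so out-of-range values
    wrap around the same way an unsigned C integer would.
    """
    return ctype(value).value


if __name__ == "__main__":
    # Hypothetical ID values, chosen only to straddle the 8-bit limit.
    for count in (200, 255, 256, 300):
        narrow = stored_id(count, ctypes.c_uint8)
        wide = stored_id(count, ctypes.c_uint16)
        print(f"id {count}: uint8_t reads back {narrow}, uint16_t reads back {wide}")
```

Any ID above 255 silently maps onto a smaller one when stored in 8 bits, which is why a grammar with enough fields or aliases could end up with invalid tables, and why widening the stored type removes the problem for realistic grammar sizes.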
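For the UTF-16 auto-detection item under CLI Ergonomics, one common approach is to look for a byte-order mark at the start of the file. The sketch below assumes BOM-based detection only. It is not confirmed to be how the Tree-sitter CLI behaves, and decode_source is a hypothetical helper.

```python
import codecs


def decode_source(data):
    """Decode raw file bytes, treating a leading UTF-16 byte-order mark as the signal.

    Assumption: files without a UTF-16 BOM are UTF-8. A real detector might
    also need heuristics for BOM-less UTF-16 input.
    """
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM, picks the byte order, and strips it.
        return data.decode("utf-16")
    # "utf-8-sig" decodes UTF-8 and drops a UTF-8 BOM if one is present.
    return data.decode("utf-8-sig")


if __name__ == "__main__":
    # Simulate a file written as UTF-16 with a BOM.
    raw = "let x = 1;\n".encode("utf-16")
    print(repr(decode_source(raw)))
```

Checking the BOM first keeps the common UTF-8 path unchanged while covering files written by tools that default to UTF-16.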