Trigram tokeniser + inverted-index code search substrate in Zig 0.16. Ports GitHub's Blackbird code search architecture (castle L125) and Datadog's full-text search via n-gram + posting list (castle L216) to a single library.
v0.0.2 — 29/29 tests pass on Zig 0.16. Adds ULEB128 varint posting-list codec + ClickHouse-style sparse-grams tokeniser. extractTrigrams + InvertedIndex from v0.0.1 unchanged; varint.encodePostingList / decodePostingList provide a delta-encoded byte format; SparseInvertedIndex cuts ~63% of n-grams while preserving lookup correctness.
- `Trigram`: `u32`, 3 bytes packed into the low 24 bits in lexicographic order so sorted comparisons match byte order.
- `packTrigram(a, b, c)` / `unpackTrigram(t)`: round-trip helpers.
- `extractTrigrams(allocator, text)`: sorted-deduped iterator over overlapping 3-byte windows.
- `DocId`: `u64`.
- `InvertedIndex` with `add` / `postingFor` / `trigramCount` / `intersect` / `search`.
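
The packing rule and the posting-list intersection behind `search` can be sketched as below. This is a minimal illustration, not the library's source: the `[3]u8` return type of `unpackTrigram` and the `intersectSorted` helper (name and signature) are assumptions made for the example. Placing the first byte in bits 16..23 is what makes integer order agree with byte order.

```zig
const std = @import("std");

pub const Trigram = u32;
pub const DocId = u64;

/// Packs three bytes into the low 24 bits, first byte most significant,
/// so that comparing two Trigram values compares the bytes lexicographically.
pub fn packTrigram(a: u8, b: u8, c: u8) Trigram {
    return (@as(u32, a) << 16) | (@as(u32, b) << 8) | @as(u32, c);
}

/// Inverse of packTrigram. Returning [3]u8 is an assumption of this sketch.
pub fn unpackTrigram(t: Trigram) [3]u8 {
    return .{
        @as(u8, @truncate(t >> 16)),
        @as(u8, @truncate(t >> 8)),
        @as(u8, @truncate(t)),
    };
}

/// Hypothetical helper: two-pointer intersection of two ascending,
/// deduplicated posting lists. `out` must hold at least @min(a.len, b.len)
/// items. Returns the number of doc ids written.
pub fn intersectSorted(a: []const DocId, b: []const DocId, out: []DocId) usize {
    var i: usize = 0;
    var j: usize = 0;
    var n: usize = 0;
    while (i < a.len and j < b.len) {
        if (a[i] < b[j]) {
            i += 1;
        } else if (a[i] > b[j]) {
            j += 1;
        } else {
            out[n] = a[i];
            n += 1;
            i += 1;
            j += 1;
        }
    }
    return n;
}

test "pack round-trips and preserves byte order" {
    const bytes = unpackTrigram(packTrigram('f', 'o', 'o'));
    try std.testing.expectEqualSlices(u8, "foo", &bytes);
    try std.testing.expect(packTrigram('a', 'b', 'c') < packTrigram('a', 'b', 'd'));
}
```

A literal query of three or more bytes reduces to intersecting the posting lists of its trigrams. Because a trigram match does not prove the bytes are contiguous in the document, candidates typically still need a verification pass against the text.
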
- `varint.encodeU64` / `decodeU64`: ULEB128 unsigned codec (up to 10 bytes per `u64`).
- `varint.encodePostingList` / `decodePostingList`: delta-encoded varint posting list. Dense doc-id corpora compress to ~1 byte per doc (8x vs raw `[]u64`).
- `sparse_grams.extractSparseTrigrams(allocator, text, cfg)`: ClickHouse L286 sparse-grams: hash-rolled fragment-cut tokeniser. Always emits first + last trigram; otherwise keeps a window iff `wyhash(window) mod period == 0`. Default period 3 yields ~33% retention.
- `SparseInvertedIndex`: same shape as `InvertedIndex` but tokenises sparsely. Query + add must share the same `SparseConfig` for results to be comparable.
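
To make the "~1 byte per doc" figure concrete, here is a minimal sketch of ULEB128 plus delta encoding. The function names mirror the list above, but the signatures (caller-supplied buffers, a `Decoded` result struct, no length prefix) are assumptions for illustration; the library's actual byte layout may differ.

```zig
const std = @import("std");

pub const Decoded = struct { value: u64, len: usize };

/// Writes `value` as ULEB128 (7 payload bits per byte, high bit set on every
/// byte except the last). `out` needs up to 10 bytes. Returns bytes written.
pub fn encodeU64(value: u64, out: []u8) usize {
    var v = value;
    var i: usize = 0;
    while (true) {
        const low: u8 = @truncate(v & 0x7f);
        v >>= 7;
        if (v == 0) {
            out[i] = low;
            return i + 1;
        }
        out[i] = low | 0x80;
        i += 1;
    }
}

/// Reads one ULEB128 value from the front of `bytes`.
pub fn decodeU64(bytes: []const u8) error{ Truncated, Overflow }!Decoded {
    var result: u64 = 0;
    var shift: u6 = 0;
    for (bytes, 0..) |byte, i| {
        // The 10th byte may only contribute bit 63 and must be the last one.
        if (shift == 63 and byte > 1) return error.Overflow;
        result |= @as(u64, byte & 0x7f) << shift;
        if ((byte & 0x80) == 0) return .{ .value = result, .len = i + 1 };
        shift += 7;
    }
    return error.Truncated;
}

/// Encodes ascending doc ids as gaps: the first id, then each id minus its
/// predecessor. `out` needs up to 10 bytes per id. Returns bytes written.
pub fn encodePostingList(ids: []const u64, out: []u8) usize {
    var n: usize = 0;
    var prev: u64 = 0;
    for (ids) |id| {
        n += encodeU64(id - prev, out[n..]);
        prev = id;
    }
    return n;
}

/// Decodes gaps back into absolute doc ids until `bytes` is exhausted.
/// `out` must be large enough for every id. Returns the number of ids decoded.
pub fn decodePostingList(bytes: []const u8, out: []u64) !usize {
    var pos: usize = 0;
    var count: usize = 0;
    var prev: u64 = 0;
    while (pos < bytes.len) {
        const d = try decodeU64(bytes[pos..]);
        prev = try std.math.add(u64, prev, d.value);
        out[count] = prev;
        count += 1;
        pos += d.len;
    }
    return count;
}

test "dense ids cost about one byte each" {
    const ids = [_]u64{ 1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007 };
    var buf: [80]u8 = undefined;
    const n = encodePostingList(&ids, &buf);
    // 2 bytes for the first id (1000), then 1 byte for each gap of 1.
    try std.testing.expectEqual(@as(usize, 9), n);
    var back: [8]u64 = undefined;
    const m = try decodePostingList(buf[0..n], &back);
    try std.testing.expectEqualSlices(u64, &ids, back[0..m]);
}
```

Here 8 ids take 9 bytes instead of the 64 bytes of a raw `[]u64`. The saving depends on the gaps being small: any gap of 128 or more needs a second byte, so sparse posting lists compress less well.
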

`zig build test  # 29 unit tests`

- No regex-to-trigram planner. v0.0.2 still only handles literal queries ≥ 3 bytes. The regex → trigram-set planner is v0.0.3.
- No sharding. v0.0.2 is single-shard. v0.0.3 ships sharding by Git blob SHA hash for the multi-billion-document scale Blackbird targets.
- No ranking. Returns docs in DocId order. BM25 is v0.0.3.
- No incremental update. Build-once. v0.0.3 ships incremental update + delete.
- No persistence. Memory-only. v0.0.3 ships a single-file on-disk layout that uses the v0.0.2 varint posting format.
- K11 zig-symbol-emit (P13 shipped): symbols emitted from Zig source feed naturally into Blackbird as documents; the symbol graph + trigram index together give the codex an `impact(symbol)` + `search(text)` substrate without the GitNexus / Cypher daemon.
- Datadog L216 ngram FTS: same data structure; this Zig port is the substrate Datadog and Blackbird both describe.
- steam / orderbook / edge-ledger — any byte-slice column at the silver/gold layer can be indexed for sub-100 ms substring search.
Concepts adapted from GitHub Blackbird code search (castle L125) and Datadog full-text-search engineering (castle L216). Frontier port by Sean Collins (sean@sunlitmoon.online).
AGPL-3.0-or-later. See LICENSE.
