SMC17/blackbird-zig: Trigram inverted index with ULEB128 varint posting-list codec and ClickHouse-style sparse-gram tokenizer. 63% n-gram reduction, correct lookup. 29 tests.

blackbird-zig

License: AGPL-3.0-or-later. Language: Zig.

Trigram tokeniser and inverted-index code search substrate in Zig 0.16. It ports GitHub's Blackbird code search architecture (castle L125) and Datadog's n-gram and posting-list full-text search (castle L216) into a single library.

Status

v0.0.2: 29/29 tests pass on Zig 0.16. This release adds a ULEB128 varint posting-list codec and a ClickHouse-style sparse-grams tokeniser. extractTrigrams and InvertedIndex are unchanged from v0.0.1; varint.encodePostingList / decodePostingList provide a delta-encoded byte format; SparseInvertedIndex drops ~63% of n-grams while preserving lookup correctness.
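
To make the delta-encoded byte format concrete, here is a minimal sketch of a ULEB128 codec and a delta-coded posting list. It is an illustration under assumptions, not the library's code: the buffer-based signatures, the `Decoded` struct, the error names, and the choice to store the first id as a gap from zero with no length prefix are all hypothetical; only the function names come from this README.

```zig
const std = @import("std");

/// ULEB128: 7 payload bits per byte, least significant group first; the high
/// bit of each byte means "more bytes follow". Returns bytes written (1..10).
pub fn encodeU64(value: u64, out: []u8) usize {
    var v = value;
    var i: usize = 0;
    while (v >= 0x80) : (i += 1) {
        out[i] = @as(u8, @truncate(v)) | 0x80;
        v >>= 7;
    }
    out[i] = @as(u8, @truncate(v));
    return i + 1;
}

/// Hypothetical result type: the decoded value and how many bytes it used.
pub const Decoded = struct { value: u64, len: usize };

pub fn decodeU64(bytes: []const u8) error{ Truncated, Overflow }!Decoded {
    var result: u64 = 0;
    for (bytes, 0..) |b, i| {
        // A u64 needs at most 10 bytes, and the 10th may carry only one bit.
        if (i >= 10 or (i == 9 and b > 1)) return error.Overflow;
        const shift: u6 = @intCast(7 * i);
        result |= @as(u64, b & 0x7f) << shift;
        if ((b & 0x80) == 0) return .{ .value = result, .len = i + 1 };
    }
    return error.Truncated;
}

/// Assumes `ids` is sorted ascending. Each id is stored as the gap to its
/// predecessor (the first as the gap from 0). Returns bytes written.
pub fn encodePostingList(ids: []const u64, out: []u8) usize {
    var prev: u64 = 0;
    var n: usize = 0;
    for (ids) |id| {
        n += encodeU64(id - prev, out[n..]);
        prev = id;
    }
    return n;
}

/// Reverses encodePostingList by summing the gaps. Returns the id count.
pub fn decodePostingList(bytes: []const u8, out: []u64) !usize {
    var prev: u64 = 0;
    var pos: usize = 0;
    var count: usize = 0;
    while (pos < bytes.len) : (count += 1) {
        if (count == out.len) return error.NoSpaceLeft;
        const d = try decodeU64(bytes[pos..]);
        prev = try std.math.add(u64, prev, d.value);
        out[count] = prev;
        pos += d.len;
    }
    return count;
}
```

Under these assumptions the four ids 1000, 1001, 1002, 1003 encode to 5 bytes (two for 1000, then one per gap of 1) instead of 32 bytes as raw u64 values, which is where the roughly one byte per doc figure for dense corpora comes from.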

What ships

v0.0.1 (carried)

  • Trigram = u32, 3 bytes packed into the low 24 bits with the first byte most significant, so integer comparison of packed trigrams matches lexicographic byte order.
  • packTrigram(a, b, c) / unpackTrigram(t) — round-trip helpers.
  • extractTrigrams(allocator, text) — sorted-deduped iterator over overlapping 3-byte windows.
  • DocId = u64.
  • InvertedIndex with add / postingFor / trigramCount / intersect / search.
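
The packing and extraction described in the list above can be sketched as follows. This is a minimal illustration, not the library's source: the `[3]u8` return type of `unpackTrigram` is a guess, and `extractTrigramsInto` is a hypothetical buffer-based stand-in for `extractTrigrams(allocator, text)` that avoids allocator details.

```zig
const std = @import("std");

/// Alias as described in the README: three bytes in the low 24 bits of a u32.
pub const Trigram = u32;

/// The first byte lands in the most significant of the three used bytes, so
/// comparing packed values as integers matches comparing the bytes in order.
pub fn packTrigram(a: u8, b: u8, c: u8) Trigram {
    return (@as(u32, a) << 16) | (@as(u32, b) << 8) | @as(u32, c);
}

/// Assumed return shape: the three original bytes, in order.
pub fn unpackTrigram(t: Trigram) [3]u8 {
    return .{
        @as(u8, @truncate(t >> 16)),
        @as(u8, @truncate(t >> 8)),
        @as(u8, @truncate(t)),
    };
}

/// Hypothetical helper: writes every overlapping 3-byte window of `text` into
/// `out`, sorts them, and removes duplicates in place. `out` must hold at
/// least text.len - 2 entries. Returns the sorted, deduplicated prefix.
pub fn extractTrigramsInto(text: []const u8, out: []Trigram) []Trigram {
    if (text.len < 3) return out[0..0];
    const n = text.len - 2;
    std.debug.assert(out.len >= n);
    for (0..n) |i| {
        out[i] = packTrigram(text[i], text[i + 1], text[i + 2]);
    }
    const grams = out[0..n];
    std.mem.sort(Trigram, grams, {}, std.sort.asc(Trigram));
    // Compact the sorted slice, keeping one copy of each distinct value.
    var w: usize = 1;
    for (grams[1..]) |g| {
        if (g != grams[w - 1]) {
            grams[w] = g;
            w += 1;
        }
    }
    return grams[0..w];
}
```

For example, "banana" has four windows (ban, ana, nan, ana), which this sketch reduces to the three sorted trigrams ana, ban, nan.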

v0.0.2 additions

  • varint.encodeU64 / decodeU64 — ULEB128 unsigned codec (up to 10 bytes per u64).
  • varint.encodePostingList / decodePostingList — delta-encoded varint posting lists. Dense doc-id corpora compress to ~1 byte per doc (8x smaller than a raw []u64), because any gap below 128 fits in a single varint byte.
  • sparse_grams.extractSparseTrigrams(allocator, text, cfg) — ClickHouse L286 sparse-grams: hash-rolled fragment-cut tokeniser. Always emits first + last trigram; otherwise keeps a window iff wyhash(window) mod period == 0. Default period 3 yields ~33% retention.
  • SparseInvertedIndex — same shape as InvertedIndex but tokenises sparsely. Query + add must share the same SparseConfig for results to be comparable.
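
The sparse-gram retention rule above (always keep the first and last trigram, otherwise keep a window iff its hash is 0 mod period) can be sketched as below. The `SparseConfig` fields, the `seed` value, and the helper names `keepWindow` and `countKept` are hypothetical; the README only states the rule, the use of wyhash, and the default period of 3.

```zig
const std = @import("std");

/// Hypothetical config shape. Only the default period of 3 comes from the
/// README; the seed field and its value are assumptions for illustration.
pub const SparseConfig = struct {
    period: u64 = 3,
    seed: u64 = 0,
};

/// Decides whether the 3-byte window starting at `i` is kept.
/// Requires text.len >= 3 and i <= text.len - 3.
pub fn keepWindow(text: []const u8, i: usize, cfg: SparseConfig) bool {
    std.debug.assert(cfg.period > 0);
    const last = text.len - 3;
    // The first and last windows are always emitted.
    if (i == 0 or i == last) return true;
    // Interior windows are sampled by hash, so the decision depends only on
    // the window's bytes and the config, not on its position.
    const h = std.hash.Wyhash.hash(cfg.seed, text[i .. i + 3]);
    return (h % cfg.period) == 0;
}

/// Counts kept windows before deduplication.
pub fn countKept(text: []const u8, cfg: SparseConfig) usize {
    if (text.len < 3) return 0;
    var kept: usize = 0;
    for (0..text.len - 2) |i| {
        if (keepWindow(text, i, cfg)) kept += 1;
    }
    return kept;
}
```

Because interior windows are chosen purely from their bytes and the config, indexing and querying with different period or seed values would sample different trigram sets, which is why the same SparseConfig has to be used on both sides.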

Build

zig build test                  # 29 unit tests

What it does NOT do (yet)

  • No regex-to-trigram planner. v0.0.2 still only handles literal queries ≥ 3 bytes. The regex → trigram-set planner is v0.0.3.
  • No sharding. v0.0.2 is single-shard. v0.0.3 ships sharding by Git blob SHA hash for the multi-billion-document scale Blackbird targets.
  • No ranking. Returns docs in DocId order. BM25 is v0.0.3.
  • No incremental update. Build-once. v0.0.3 ships incremental update + delete.
  • No persistence. Memory-only. v0.0.3 ships a single-file on-disk layout that uses the v0.0.2 varint posting format.

Composes with shipped substrate

  • K11 zig-symbol-emit (P13 shipped) — symbols emitted from Zig source feed naturally into Blackbird as documents; the symbol graph + trigram index together give the codex an impact(symbol) + search(text) substrate without the GitNexus / Cypher daemon.
  • Datadog L216 ngram FTS — same data structure; this Zig port is the substrate Datadog and Blackbird both describe.
  • steam / orderbook / edge-ledger — any byte-slice column at the silver/gold layer can be indexed for sub-100 ms substring search.

Credit

Concepts adapted from GitHub Blackbird code search (castle L125) and Datadog full-text-search engineering (castle L216). Frontier port by Sean Collins (sean@sunlitmoon.online).

License

AGPL-3.0-or-later. See LICENSE.
