feat(core, indexer): support group_by searching and fix hsnw sparse by JalinWang · Pull Request #527 · alibaba/zvec · GitHub
feat(core, indexer): support group_by searching and fix hsnw sparse #527

Open
JalinWang wants to merge 17 commits into alibaba:main from JalinWang:feat/group_by

Conversation

JalinWang commented Jun 25, 2026

Collaborator

Summary

This PR implements group_by search (a.k.a. group-by deduplication) across the core interface, algorithm, and DB layers of zvec. Users can now retrieve top-K results per group instead of globally, with full support for fetch_vector, is_linear, and bf_pks query modes on supported index types.
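To make "top-K per group" concrete, here is a minimal brute-force sketch. It is not zvec code: `ScoredDoc`, `Group`, `GroupFn`, and `TopKPerGroup` are hypothetical names, and a distance-style metric (smaller score is better) is assumed.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical types for illustration only; not zvec's actual API.
struct ScoredDoc {
  uint64_t key;
  float score;  // assumption: smaller is better (distance-style metric)
};

struct Group {
  std::string id;
  std::vector<ScoredDoc> docs;
};

// Assumed callback shape: maps a document key to its group id.
using GroupFn = std::function<std::string(uint64_t)>;

// Keeps at most `group_count` groups and at most `group_topk` docs per group.
// Docs are visited best-first, so each group fills with its best members and
// groups appear in order of their best member.
std::vector<Group> TopKPerGroup(std::vector<ScoredDoc> docs,
                                const GroupFn &group_of,
                                std::size_t group_topk,
                                std::size_t group_count) {
  std::sort(docs.begin(), docs.end(),
            [](const ScoredDoc &a, const ScoredDoc &b) {
              return a.score < b.score;
            });
  std::vector<Group> groups;
  std::map<std::string, std::size_t> index_of;  // group id -> slot in `groups`
  for (const ScoredDoc &doc : docs) {
    const std::string gid = group_of(doc.key);
    auto it = index_of.find(gid);
    if (it == index_of.end()) {
      if (groups.size() >= group_count) continue;  // no room for a new group
      it = index_of.emplace(gid, groups.size()).first;
      groups.push_back(Group{gid, {}});
    }
    std::vector<ScoredDoc> &bucket = groups[it->second].docs;
    if (bucket.size() < group_topk) bucket.push_back(doc);
  }
  return groups;
}
```

With `group_topk = 2` and `group_count = 2`, a global top-4 could be dominated by one group, whereas this returns two groups with two documents each when enough candidates exist.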

Motivation

Prior to this PR, group_by was either silently ignored (IVF, Vamana) or produced incorrect/empty results (DiskAnn, sparse fetch_vector). This PR makes group_by behavior consistent and well tested across all index types: supported types return grouped results, and unsupported types reject the query explicitly.

Previous Status

| Index Type | group_by (graph search) | group_by + is_linear | group_by + bf_pks | fetch_vector in group_by | Notes |
| --- | --- | --- | --- | --- | --- |
| flat (dense) | N/A (always linear) | Yes | Yes | Yes | Full support via FlatStreamerContext |
| flat_sparse | N/A (always linear) | Yes | Yes | No | Known limitation; search_group/search_group_p_keys don't emit vectors |
| hnsw (dense) | Yes | Yes | Yes | Yes | Graph search via HnswAlgorithm::search, bf via HnswStreamer::search_bf_impl and search_bf_by_p_keys_impl |
| hnsw_sparse | Yes | Yes | Yes | Partial (no sparse vector in group result) | topk_to_group_result uses get_vector_meta(id) only, not full sparse data |
| hnsw_rabitq | Yes | Yes | Yes | Yes | Both graph and bf paths handle group_by; HnswRabitqContext has full group result + fetch |
| diskann | Yes (buggy) | Yes (buggy) | Yes (buggy) | No | Empty results: linear_search/keys_search never populated group_topk_heaps_; knn_search had a typo at L985 writing to the wrong heap; and fetch_vector doesn't work in normal search |
| ivf | No | No | No | No | IVFSearcherContext has no group_by_search(), no group_topk_heaps. group_by is silently ignored |
| vamana | No | No | No | No | VamanaContext has no group support. topk_to_result only builds flat results. group_by is silently ignored |

Current Status

Unsupported index types now fail fast with IndexError_Unsupported and an error log, instead of silently returning empty or incorrect results.

| Support | Reject |
| --- | --- |
| flat (dense) | diskann |
| flat (sparse) | ivf |
| hnsw (dense) | vamana |
| hnsw (sparse) | |
| hnsw_rabitq | |

Group-by search will be enabled for the rejected index types incrementally in follow-up work.
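The fail-fast behavior can be sketched roughly as follows. This is an assumption-laden illustration, not the PR's code: `Status`, `QueryParam`, and `PrepareForSearch` are hypothetical, `Status::kUnsupported` stands in for IndexError_Unsupported, and a single lookup table replaces the per-index checks the PR adds inside each `_prepare_for_search`.

```cpp
#include <iostream>
#include <set>
#include <string>

// Hypothetical stand-ins; kUnsupported plays the role of IndexError_Unsupported.
enum class Status { kOk, kUnsupported };

struct QueryParam {
  bool has_group_by = false;  // assumption: true when a group_by callback is set
};

// Rejects group_by queries up front for index types that cannot serve them,
// instead of letting the search run and return empty or wrong results.
Status PrepareForSearch(const std::string &index_type,
                        const QueryParam &query) {
  static const std::set<std::string> kRejects = {"diskann", "ivf", "vamana"};
  if (query.has_group_by && kRejects.count(index_type) > 0) {
    std::cerr << "group_by is not supported by index type '" << index_type
              << "'\n";
    return Status::kUnsupported;
  }
  return Status::kOk;
}
```

The point of the check is that the caller sees an explicit error code and a log line rather than an empty result set that looks like "no matches".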

Key Changes

Core Interface Layer

  • index_param.h: Added GroupByParam struct (group_topk, group_count, group_by callback) and group_by_param field on BaseIndexQueryParam.
  • index.h / index.cc: Added group_doc_list_ to SearchResult. Introduced for_each_doc helper to uniformly iterate over flat and grouped results for score normalization and vector reverting. Added _set_group_by_on_context static helper called by supported index types at the end of _prepare_for_search.
  • Per-index _prepare_for_search: FlatIndex, HNSWIndex, HNSWRabitqIndex call _set_group_by_on_context. IVFIndex, DiskAnnIndex, and VamanaIndex add early rejection checks for group_by.
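A simplified sketch of the interface pieces listed above may help when reading the diff. The field names group_topk, group_count, group_by, doc_list_, and group_doc_list_ come from this description; everything else (the `Doc` and `GroupDocs` types, the callback signature, the `HasGroupBy` helper, and the exact `for_each_doc` signature) is an assumption and may differ from index_param.h and index.h.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Simplified stand-ins; the real definitions live in index_param.h / index.h.
struct Doc {
  uint64_t key;
  float score;
};

struct GroupByParam {
  uint32_t group_topk = 0;   // max docs kept per group
  uint32_t group_count = 0;  // max number of groups returned
  std::function<std::string(uint64_t)> group_by;  // assumed callback shape
};

struct GroupDocs {
  std::string group_id;
  std::vector<Doc> docs;
};

struct SearchResult {
  std::vector<Doc> doc_list_;              // flat results
  std::vector<GroupDocs> group_doc_list_;  // grouped results
};

// group_by is active only when a param is present and its callback is set.
inline bool HasGroupBy(const GroupByParam *param) {
  return param != nullptr && static_cast<bool>(param->group_by);
}

// Applies fn to every doc in whichever list the search filled, so callers
// need one code path for score normalization and vector reverting.
template <typename Fn>
void for_each_doc(SearchResult &result, bool has_group_by, Fn &&fn) {
  if (has_group_by) {
    for (GroupDocs &group : result.group_doc_list_) {
      for (Doc &doc : group.docs) fn(doc);
    }
  } else {
    for (Doc &doc : result.doc_list_) fn(doc);
  }
}
```

A caller could then normalize scores with a single call such as `for_each_doc(result, has_group_by, [](Doc &d) { d.score = 1.0f - d.score; });` regardless of result shape.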

Algorithm Layer

  • flat_sparse_search.h: Populates IndexSparseDocument in group_by results when fetch_vector is enabled (previously missing).
  • hnsw_sparse_context.h: Fetches full sparse vector data via get_sparse_data + SparseUtility::ReverseSparseFormat in group results (previously only stored get_vector_meta).

DB Layer

  • engine_helper.hpp: Translates DB-layer vector_column_params::GroupByParams to core-layer GroupByParam when building the engine query param.
  • vector_column_indexer.cc: Returns GroupVectorIndexResults when search_result.group_doc_list_ is populated.
  • combined_vector_column_indexer.cc: Merges group_by results across multiple index blocks — adjusts doc keys by block offset, merges groups by group_id, sorts within groups by score (respecting metric direction), and truncates to group_topk / group_count.
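The cross-block merge described for combined_vector_column_indexer.cc can be sketched as below. All names here (`BlockResult`, `MergeGroupResults`, `smaller_is_better`) are hypothetical. One step is an assumption beyond the description: groups are ordered by their best document before the group_count truncation, whereas the description only mentions sorting within groups.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical simplified types; not the PR's actual definitions.
struct Doc {
  uint64_t key;
  float score;
};

struct GroupDocs {
  std::string group_id;
  std::vector<Doc> docs;
};

struct BlockResult {
  uint64_t key_offset;  // added to block-local doc keys
  std::vector<GroupDocs> groups;
};

std::vector<GroupDocs> MergeGroupResults(const std::vector<BlockResult> &blocks,
                                         bool smaller_is_better,
                                         std::size_t group_topk,
                                         std::size_t group_count) {
  // Metric direction: distances sort ascending, similarities descending.
  auto better = [smaller_is_better](const Doc &a, const Doc &b) {
    return smaller_is_better ? a.score < b.score : a.score > b.score;
  };
  // 1. Rebase keys by block offset and bucket docs by group id.
  std::map<std::string, std::vector<Doc>> by_id;
  for (const BlockResult &block : blocks) {
    for (const GroupDocs &group : block.groups) {
      for (Doc doc : group.docs) {
        doc.key += block.key_offset;
        by_id[group.group_id].push_back(doc);
      }
    }
  }
  // 2. Sort inside each group and keep the best group_topk docs.
  std::vector<GroupDocs> merged;
  for (auto &entry : by_id) {
    std::vector<Doc> &docs = entry.second;
    std::sort(docs.begin(), docs.end(), better);
    if (group_topk > 0 && docs.size() > group_topk) docs.resize(group_topk);
    merged.push_back({entry.first, std::move(docs)});
  }
  // 3. Assumption: rank groups by their best doc so that truncating to
  //    group_count drops the weakest groups rather than arbitrary ones.
  std::sort(merged.begin(), merged.end(),
            [&better](const GroupDocs &a, const GroupDocs &b) {
              return better(a.docs.front(), b.docs.front());
            });
  if (group_count > 0 && merged.size() > group_count) {
    merged.resize(group_count);
  }
  return merged;
}
```

Every bucket in `by_id` is created by pushing a document, so `docs.front()` in step 3 is always valid.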

Tests

  • index_group_by_test.cc (new, 520 lines): Data-driven GroupByInterfaceTest fixture at the core interface layer with RunOk/RunRejected methods. Covers dense (flat, hnsw, hnsw_rabitq — graph, linear, bf_pks, fetch_vector), sparse (flat, hnsw — graph, linear, bf_pks, fetch_vector), and unsupported (vamana, ivf, diskann) index types.
  • vector_column_indexer_test.cc (+257 lines): Data-driven GroupByIndexerTest fixture at the DB indexer layer, mirroring the core test pattern. Covers dense (flat, hnsw — graph, linear, bf_pks, fetch_vector), sparse (flat, hnsw), and unsupported (ivf, diskann with optional plugin skip).

Test Results

All group_by tests pass:

  • GroupByInterfaceTest.Dense — flat/hnsw/hnsw_rabitq (graph, linear, bf_pks, fetch_vector)
  • GroupByInterfaceTest.Sparse — flat_sparse/hnsw_sparse (graph, linear, bf_pks, fetch_vector)
  • GroupByInterfaceTest.UnsupportedIndexTypes — vamana/ivf/diskann correctly rejected
  • GroupByIndexerTest.Dense — flat/hnsw (graph, linear, bf_pks, fetch_vector)
  • GroupByIndexerTest.Sparse — flat_sparse/hnsw_sparse
  • GroupByIndexerTest.UnsupportedIndexTypes — ivf/diskann correctly rejected

JalinWang added 15 commits June 18, 2026 16:12
…ssing

Added for_each_doc helper to unify iteration over flat doc_list_ or grouped
group_doc_list_, eliminating 5 of 6 duplicated if(has_group_by) blocks in
_dense_search and _sparse_search. Reduces code duplication and makes the
group_by logic easier to maintain.

The one remaining dual-rail (reformer normalize) is preserved because the
batch API requires different handling for grouped vs flat results.

Move DiskAnn group_by support to feat/group_by_diskann branch.
On this branch, DiskAnn explicitly rejects group_by with
IndexError_Unsupported, matching Vamana and IVF behavior.
JalinWang changed the title from "Feat/group by" to "feat(core, indexer): support group_by searching" Jun 25, 2026
JalinWang marked this pull request as ready for review June 25, 2026 08:08
JalinWang removed the request for review from chinaux June 25, 2026 08:23
JalinWang changed the title from "feat(core, indexer): support group_by searching" to "feat(core, indexer): support group_by searching and fix hsnw sparse" Jun 25, 2026
JalinWang requested review from egolearner and feihongxu0824 and removed request for zhourrr June 25, 2026 08:24

uint32_t group_count =
    query_params.group_by ? query_params.group_by->group_count : 0;
if (group_count > 0 && merged_group_docs.size() > group_count) {
  merged_group_docs.resize(group_count);

Collaborator

What is this truncation based on? I don't see merged_group_docs being sorted anywhere.

    (search_param->group_by_param && search_param->group_by_param->group_by);
if (has_group_by) {
  result->group_doc_list_ = std::move(
      const_cast<core::IndexGroupDocumentList &>(context->group_result()));

Collaborator

minor: consider adding a mutable_group_result method to the context, so the const_cast is no longer needed.

// Return grouped results when group_by is active
if (!search_result.group_doc_list_.empty()) {
  auto result = std::make_shared<GroupVectorIndexResults>(
      std::move(search_result.group_doc_list_));

Collaborator

reverted_vector_list_ and reverted_sparse_values_list_ are presumably still needed here, aren't they?

std::min(256u, vamana_search_param->prefetch_lines);
params.set(core::PARAM_VAMANA_STREAMER_PL, real_search_pl);
context->update(params);
_set_group_by_on_context(search_param, context);

Collaborator

Vamana doesn't support group_by, so does the context still need to be set here?

Collaborator

In the group_by search case, this result should be empty.
