feat(core, indexer): support group_by searching and fix hnsw sparse #527
Open
JalinWang wants to merge 17 commits into
…ssing Added for_each_doc helper to unify iteration over flat doc_list_ or grouped group_doc_list_, eliminating 5 of 6 duplicated if(has_group_by) blocks in _dense_search and _sparse_search. Reduces code duplication and makes the group_by logic easier to maintain. The one remaining dual-rail (reformer normalize) is preserved because the batch API requires different handling for grouped vs flat results.
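A minimal sketch of what such a helper could look like, assuming simplified stand-in types. `Doc`, `GroupDocs`, the field layout of `SearchResult`, and the `negate_scores` caller are hypothetical and only illustrate the idea of iterating both layouts through one entry point; the real signature in `index.cc` may differ.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the real result types (assumed, not from the PR).
struct Doc {
  uint64_t key;
  float score;
};
struct GroupDocs {
  uint64_t group_id;
  std::vector<Doc> docs;
};
struct SearchResult {
  std::vector<Doc> doc_list_;              // flat results
  std::vector<GroupDocs> group_doc_list_;  // grouped results
};

// Visit every document exactly once, whichever container holds the results.
template <typename Fn>
void for_each_doc(SearchResult &result, bool has_group_by, Fn &&fn) {
  if (has_group_by) {
    for (auto &group : result.group_doc_list_) {
      for (auto &doc : group.docs) fn(doc);
    }
  } else {
    for (auto &doc : result.doc_list_) fn(doc);
  }
}

// Example caller: a post-processing step written once for both layouts.
inline void negate_scores(SearchResult &result, bool has_group_by) {
  for_each_doc(result, has_group_by, [](Doc &doc) { doc.score = -doc.score; });
}
```

With this shape, each post-processing step needs a single call site instead of its own `if (has_group_by)` branch.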
Move DiskAnn group_by support to feat/group_by_diskann branch. On this branch, DiskAnn explicitly rejects group_by with IndexError_Unsupported, matching Vamana and IVF behavior.
iaojnh
reviewed
Jun 25, 2026
uint32_t group_count =
    query_params.group_by ? query_params.group_by->group_count : 0;
if (group_count > 0 && merged_group_docs.size() > group_count) {
  merged_group_docs.resize(group_count);
Collaborator
What is the basis for truncating here? I don't see merged_group_docs being sorted anywhere.
    (search_param->group_by_param && search_param->group_by_param->group_by);
if (has_group_by) {
  result->group_doc_list_ = std::move(
      const_cast<core::IndexGroupDocumentList &>(context->group_result()));
Collaborator
minor: consider giving context a mutable_group_result method, so the const_cast is no longer needed.
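A rough sketch of that suggestion, assuming a simplified context class. `SearchContext`, its `add` method, the `take_group_result` helper, and the `std::vector<int>` alias are hypothetical placeholders; only the accessor names `group_result` and `mutable_group_result` come from this thread.

```cpp
#include <utility>
#include <vector>

// Hypothetical stand-in for core::IndexGroupDocumentList (assumed type).
using IndexGroupDocumentList = std::vector<int>;

class SearchContext {
 public:
  // Read-only view for callers that only inspect the results.
  const IndexGroupDocumentList &group_result() const { return group_result_; }

  // Mutable view, so a caller can move the results out without const_cast.
  IndexGroupDocumentList *mutable_group_result() { return &group_result_; }

  // Test-only helper in this sketch to fill the list.
  void add(int value) { group_result_.push_back(value); }

 private:
  IndexGroupDocumentList group_result_;
};

// Transfers ownership of the grouped results out of the context.
inline IndexGroupDocumentList take_group_result(SearchContext &context) {
  return std::move(*context.mutable_group_result());
}
```

Apart from readability, this avoids relying on the fact that modifying an object through const_cast is only well-defined when the underlying object was not declared const.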
// Return grouped results when group_by is active
if (!search_result.group_doc_list_.empty()) {
  auto result = std::make_shared<GroupVectorIndexResults>(
      std::move(search_result.group_doc_list_));
Collaborator
reverted_vector_list_ and reverted_sparse_values_list_ are presumably still needed here, aren't they?
    std::min(256u, vamana_search_param->prefetch_lines);
params.set(core::PARAM_VAMANA_STREAMER_PL, real_search_pl);
context->update(params);
_set_group_by_on_context(search_param, context);
Collaborator
Vamana does not support group_by. Does the context still need to be set here?
Collaborator
In the group_by search case, this result should be empty.

Summary
This PR implements group_by search (a.k.a. group-by deduplication) across the core interface, algorithm, and DB layers of zvec. Users can now retrieve top-K results per group instead of globally, with full support for fetch_vector, is_linear, and bf_pks query modes on supported index types.
Motivation
Prior to this PR, group_by was either silently ignored (IVF, Vamana) or produced incorrect/empty results (DiskAnn, sparse fetch_vector). This PR brings consistent, well-tested group_by behavior to all index types.
Previous Status
and the fetch vector doesn’t work in normal search
Current Status
Unsupported index types now fail fast with IndexError_Unsupported and an error log, instead of silently returning empty or incorrect results. We will gradually enable GroupBy searching for them.
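The fail-fast behavior can be sketched roughly as follows. Everything here is a simplified assumption for illustration: the `IndexType` enum, the numeric value of `IndexError_Unsupported`, the callback signature of `group_by`, and the free-function form of `prepare_for_search` are hypothetical. The activation check mirrors the `group_by_param && group_by_param->group_by` condition quoted in the review above.

```cpp
#include <cstdint>
#include <functional>
#include <iostream>
#include <memory>
#include <string>

// Hypothetical stand-ins; the real definitions live in the zvec core headers.
enum class IndexType { kFlat, kHnsw, kHnswRabitq, kIvf, kVamana, kDiskAnn };
constexpr int IndexError_Unsupported = -1;  // placeholder value (assumed)

struct GroupByParam {
  uint32_t group_topk = 0;
  uint32_t group_count = 0;
  std::function<std::string(uint64_t)> group_by;  // assumed signature
};
struct QueryParam {
  std::shared_ptr<GroupByParam> group_by_param;
};

// group_by counts as active only when the param and its callback are both set.
inline bool has_group_by(const QueryParam &query) {
  return query.group_by_param && query.group_by_param->group_by;
}

inline bool supports_group_by(IndexType type) {
  switch (type) {
    case IndexType::kFlat:
    case IndexType::kHnsw:
    case IndexType::kHnswRabitq:
      return true;
    default:
      return false;
  }
}

// Reject early with an error log instead of returning empty or wrong results.
inline int prepare_for_search(IndexType type, const QueryParam &query) {
  if (has_group_by(query) && !supports_group_by(type)) {
    std::cerr << "group_by is not supported by this index type\n";
    return IndexError_Unsupported;
  }
  return 0;
}
```

A query without group_by is unaffected on every index type; only the combination of an active group_by and an unsupported index is rejected.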
Key Changes
Core Interface Layer
index_param.h: Added GroupByParam struct (group_topk, group_count, group_by callback) and group_by_param field on BaseIndexQueryParam.
index.h / index.cc: Added group_doc_list_ to SearchResult. Introduced for_each_doc helper to uniformly iterate over flat and grouped results for score normalization and vector reverting. Added _set_group_by_on_context static helper called by supported index types at the end of _prepare_for_search.
_prepare_for_search: FlatIndex, HNSWIndex, HNSWRabitqIndex call _set_group_by_on_context. IVFIndex, DiskAnnIndex, and VamanaIndex add early rejection checks for group_by.
Algorithm Layer
flat_sparse_search.h: Populates IndexSparseDocument in group_by results when fetch_vector is enabled (previously missing).
hnsw_sparse_context.h: Fetches full sparse vector data via get_sparse_data + SparseUtility::ReverseSparseFormat in group results (previously only stored get_vector_meta).
DB Layer
engine_helper.hpp: Translates DB-layer vector_column_params::GroupByParams to core-layer GroupByParam when building the engine query param.
vector_column_indexer.cc: Returns GroupVectorIndexResults when search_result.group_doc_list_ is populated.
combined_vector_column_indexer.cc: Merges group_by results across multiple index blocks: adjusts doc keys by block offset, merges groups by group_id, sorts within groups by score (respecting metric direction), and truncates to group_topk / group_count.
Tests
index_group_by_test.cc (new, 520 lines): Data-driven GroupByInterfaceTest fixture at the core interface layer with RunOk / RunRejected methods. Covers dense (flat, hnsw, hnsw_rabitq; graph, linear, bf_pks, fetch_vector), sparse (flat, hnsw; graph, linear, bf_pks, fetch_vector), and unsupported (vamana, ivf, diskann) index types.
vector_column_indexer_test.cc (+257 lines): Data-driven GroupByIndexerTest fixture at the DB indexer layer, mirroring the core test pattern. Covers dense (flat, hnsw; graph, linear, bf_pks, fetch_vector), sparse (flat, hnsw), and unsupported (ivf, diskann with optional plugin skip).
Test Results
All group_by tests pass:
GroupByInterfaceTest.Dense: flat/hnsw/hnsw_rabitq (graph, linear, bf_pks, fetch_vector)
GroupByInterfaceTest.Sparse: flat_sparse/hnsw_sparse (graph, linear, bf_pks, fetch_vector)
GroupByInterfaceTest.UnsupportedIndexTypes: vamana/ivf/diskann correctly rejected
GroupByIndexerTest.Dense: flat/hnsw (graph, linear, bf_pks, fetch_vector)
GroupByIndexerTest.Sparse: flat_sparse/hnsw_sparse
GroupByIndexerTest.UnsupportedIndexTypes: ivf/diskann correctly rejected
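For readers following the cross-block merge described under the DB Layer changes, here is a hedged sketch of the four steps (offset keys, merge by group_id, sort within groups, truncate). All types and the `merge_group_results` function are hypothetical simplifications, not the code in combined_vector_column_indexer.cc. In particular, ranking groups by their best document before truncating to group_count is an assumption of this sketch; the PR description does not state how groups are ordered.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the real result types (assumed layout).
struct Doc {
  uint64_t key;
  float score;
};
struct GroupDocs {
  std::string group_id;
  std::vector<Doc> docs;
};
struct BlockResult {
  uint64_t key_offset;  // offset that maps block-local keys to global keys
  std::vector<GroupDocs> groups;
};

inline std::vector<GroupDocs> merge_group_results(
    const std::vector<BlockResult> &blocks, uint32_t group_topk,
    uint32_t group_count, bool higher_is_better) {
  // Metric direction: similarity metrics rank high scores first,
  // distance metrics rank low scores first.
  auto better = [higher_is_better](const Doc &a, const Doc &b) {
    return higher_is_better ? a.score > b.score : a.score < b.score;
  };

  // Steps 1 and 2: shift keys by block offset and merge docs by group_id.
  std::map<std::string, std::vector<Doc>> by_group;
  for (const auto &block : blocks)
    for (const auto &group : block.groups)
      for (const auto &doc : group.docs)
        by_group[group.group_id].push_back(
            {doc.key + block.key_offset, doc.score});

  // Step 3: sort within each group, then keep at most group_topk docs.
  std::vector<GroupDocs> merged;
  for (auto &entry : by_group) {
    auto &docs = entry.second;
    std::sort(docs.begin(), docs.end(), better);
    if (group_topk > 0 && docs.size() > group_topk) docs.resize(group_topk);
    merged.push_back({entry.first, std::move(docs)});
  }

  // Step 4 (assumption): rank groups by their best doc before truncating,
  // so that resize() drops the weakest groups rather than arbitrary ones.
  std::sort(merged.begin(), merged.end(),
            [&better](const GroupDocs &a, const GroupDocs &b) {
              return better(a.docs.front(), b.docs.front());
            });
  if (group_count > 0 && merged.size() > group_count)
    merged.resize(group_count);
  return merged;
}
```

Sorting the groups before the final `resize` is what makes the truncation deterministic with respect to score; without an ordering step, which groups survive would depend on container order.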