Optimize Int8 and Int16 integer IN filters#23299
Conversation
d99ae33 to
5141088
Compare
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing codex/in-list-direct-bitmap-filter (5141088) to 01bf68c (merge-base) diff using: in_list_strategy File an issue against this benchmark runner |
|
|
||
| #[inline(always)] | ||
| fn index(value: Self::Native) -> usize { | ||
| value as u16 as usize |
There was a problem hiding this comment.
the idea is just to reinterpret the signed value into a usize
There was a problem hiding this comment.
I think this is good as an comment within the code
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagein_list_strategy — base (merge-base)
in_list_strategy — branch
File an issue against this benchmark runner |
| } | ||
|
|
||
| #[test] | ||
| fn bitmap_filter_i8_handles_signed_boundaries_and_slices() -> Result<()> { |
There was a problem hiding this comment.
One small follow-up idea: it could be useful to add a public-path regression through InListExpr::try_new_from_array, or a SQL-level test, for Int8 and Int16 signed values.
That would cover the dispatch path in strategy.rs in addition to the filter implementation itself.
There was a problem hiding this comment.
This is a good idea -- I reviewed the existing slt (SQL) test coverage and it doesn't really have comprensive IN list coverage -- I will add some in a follow on PR
There was a problem hiding this comment.
Tracking with an issue:
First PR
|
Benchmarks look good |
|
Per @geoffreyclaude in I will address @kosiew 's comments and merge this PR so we can keep thigns moving |
## Which issue does this PR close? - Part of apache#19241 - part of apache#23307 - Follow on to apache#23299 ## Rationale for this change We (mostly @geoffreyclaude ) are in the process of optimizing IN lists with specialized implementations for various different data types. @kosiew suggested in apache#23299 (comment) that we add SQL level coverage for the IN list optimizations to both cover the dispatch logic as well as ensure everything works end to end. > One small follow-up idea: it could be useful to add a public-path regression through InListExpr::try_new_from_array, or a SQL-level test, for Int8 and Int16 signed values. ## What changes are included in this PR? 1. Add `in_list.slt` file to specifically target the IN lists 2. Only add test coverage for integer types note I think we should have significiantly more SLT coverage (list on apache#23307) but to keep the review small / understandable I plan to do it with several PRs ## Are these changes tested? Only tests ## Are there any user-facing changes? No

Which issue does this PR close?
Rationale for this change
This is an alternate implementation proposal for optimizing small integer
INlist from @geoffreyclaudeThe in #23013 does some shenangans, including rebuilding the Arrow array data directly in order to probe a bitmap keyed by an unsigned type. Handling signed small integer types directly keeps the implementation simpler and avoids the extra array-data path while preserving the compact bitmap representation for these small domains.
What changes are included in this PR?
This updates the specialized
INlist bitmap filter to handleInt8,UInt8,Int16, andUInt16directly.Signed small integer filters now map values to bitmap indexes by their bit pattern, so negative values use the same compact bitmap domain as their unsigned counterparts.
Are these changes tested?
Yes. This adds coverage for signed
Int8andInt16bitmap filters, including boundary values and sliced arrays.Are there any user-facing changes?
No. This is an internal physical expression optimization for small integer
INlist filters.