[dbsp] rank and dense_rank operators.#6085
Conversation
There was a problem hiding this comment.
Shouldn't both of these be <=?
| CF: CmpFunc<V>, | ||
| OV: DBData, | ||
| RV: DBData, | ||
| PF: Fn(&V, &mut RV) + 'static, |
There was a problem hiding this comment.
I don't really like mutable arguments in the IR; all my dataflow analyses work on pure functions. I can probably pretend in the IR that this is just a regular function and rewrite it before generating Rust code.
There was a problem hiding this comment.
changed to return RV instead.
| .typed() | ||
| } | ||
|
|
||
| // /// Number the rows in the group according to the order defined by `CF`. |
There was a problem hiding this comment.
this is for the next PR? Maybe you want to move it to another commit.
| }) | ||
| } | ||
|
|
||
| // /// See [`Stream::dense_rank_custom_order`]. |
There was a problem hiding this comment.
I expected dense_rank to be in this PR, only ROW_NUMBER to be missing
There was a problem hiding this comment.
it is, I removed the commented function.
|
Thanks. I addressed all comments. |
|
I have pushed two commits, the second one which implements compiler support for RANK and DENSE_RANK. |
mythical-fred
left a comment
There was a problem hiding this comment.
Two findings — one bug, one nit.
We resisted adding these operators for a long time. The reason is that it's very easy to use them in a way that makes incremental updates inefficient: inserting a new element with rank 1 requires updating the ranks of all existing records in the collection. Instead, we supported a limited subset of rank queries that can compiled into top-k. Having said that, rank queries are not always bad, e.g., they can be efficient when groups are small or when most new values are added with the highest rank. Performance aside, it is sometime impossible to avoid rank queries. This commit introduces general-purpose rank and dense_rank operators that match the semantics of RANK and DENSE_RANK in SQL. The algorithm in a nutshell finds the smallest rank affected by new inputs and update the ranks of all records from that point forward. This commit doesn't yet implement the ROW_NUMBER operator, which I will work on next. Signed-off-by: Leonid Ryzhyk <ryzhyk@gmail.com>
Signed-off-by: Mihai Budiu <mbudiu@feldera.com>
d8505ae to
cddf591
Compare
Signed-off-by: Mihai Budiu <mbudiu@feldera.com>
Signed-off-by: Mihai Budiu <mbudiu@feldera.com>
Signed-off-by: Mihai Budiu <mbudiu@feldera.com>
mythical-fred
left a comment
There was a problem hiding this comment.
Commit message subject has a trailing period; please fix commit history (rebase).
There was a problem hiding this comment.
Nit: "it are" -> "it is" (subject/verb agreement).

We resisted adding these operators for a long time. The reason is that it's very easy to use them in a way that makes incremental updates inefficient: inserting a new element with rank 1 requires updating the ranks of all existing records in the collection.
Instead, we supported a limited subset of rank queries that can compiled into top-k.
Having said that, rank queries are not always bad, e.g., they can be efficient when groups are small or when most new values are added with the highest rank. Performance aside, it is sometime impossible to avoid rank queries.
This commit introduces general-purpose rank and dense_rank operators that match the semantics of RANK and DENSE_RANK in SQL.
The algorithm in a nutshell finds the smallest rank affected by new inputs and update the ranks of all records from that point forward.
This commit doesn't yet implement the ROW_NUMBER operator, which I will work on next.
Describe Manual Test Plan
automatic tests only
Checklist
Breaking Changes?
Mark if you think the answer is yes for any of these components:
Describe Incompatible Changes