Added support for basic incremental index sorting by llalexandru00 · Pull Request #3656 · h2database/h2database · GitHub
Skip to content

Added support for basic incremental index sorting#3656

Open
llalexandru00 wants to merge 2 commits into
h2database:masterfrom
llalexandru00:master
Open

Added support for basic incremental index sorting#3656
llalexandru00 wants to merge 2 commits into
h2database:masterfrom
llalexandru00:master

Conversation

@llalexandru00

Copy link
Copy Markdown

Performance bottleneck

I am using a lot of multi-table simple queries with ORDER BY (and eventually OFFSET/LIMIT), which aren't optimized in any sense. All records are fetched, a final sorting is done and eventually an OFFSET/LIMIT is applied. There is an existing optimization in H2 based on the Select.sortUsingIndex flag, which hints that the records are already fetched in a sorted way (because an index is used), so a final sorting is not required. Also, LIMIT clause has a lot to benefit from this, as only some records are fetched. However, this works only for single-table queries.

Example

The most basic example is something like SELECT * FROM TEST1 CROSS JOIN TEST2 ORDER BY TEST1.F1 TEST2.F2 LIMIT 5. Right now, H2 sequentially scans both tables, sorts the result set according to ORDER BY TEST1.F1 TEST2.F2 and retrieves the first 5 rows. If both tables have ~1000 unique records, this operation takes ~450ms (embedded anonymous in-memory database v2.1.214).

Research

  • PostgreSQL (at least 14) has "INCREMENTAL SORTING" part of the planning phase. This improves the sorting at the end as the result set can be sorted in batches. Intuitively, the LIMIT is also optimized due to this technique.
  • DuckDB (at least 0.5.1) and SQLite (at least 3.39.3) also have differential performance for LIMIT in the context of multi-table ORDER BY (LIMIT 1 is considerably faster than LIMIT 100000 even for multi-table queries). Of course, these are C based back-ends, but the theoretic approach is relevant here.
  • Apache Derby (at least 10.4.2) has a a poor performance on cross joins, but still has a consistent improvement between LIMIT values. This may be misleading, as the final sorting performance difference may spoil the performance check.

Solution

I suggest using the best b-tree index for the topFilterTable - the one that matches the longest prefix of columns from the ORDER BY clause as possible. Using such index will ensure that the result set is partially sorted on the first Select.alreadySorted elements from ORDER BY.

When OFFSET/LIMIT is specified, only LIMIT elements shall be retrieved initially. To ensure completeness, extra records should be fetched until a "bigger" record is encountered based on the first Select.alreadySorted fields. Sorting these and applying LIMIT is a complete and sound solution.

Using something like LIMIT X will make SELECT to retrieve only X records and eventually fetches some extra records to ensure completeness.

  • Best case scenario: ORDER BY has many fields matched by an index or the records have a high variance. Very few extra records are fetched (maybe none). In this case, the number of extra comparisons is low, so the overhead is clearly acceptable comparing to the fetching of all records.
  • Worst case scenario: the best index matched only the first 1/2 fields from ORDER BY and these have a very low variance. Many extra records are fetched (maybe the whole result-set). In this case, the number of extra comparisons is high, but the number of fields for which one comparison is done is very low (1/2). Therefore, the overhead of applying this solution on result-sets with low variance is negligible.

Result

The same SQL I provided earlier now runs in ~6 ms. The change never compromises as this does not interfere with join order or index selection. Only when it is safe to use/upgrade an index, Select.alreadySorted is used.

  • In general, all CROSS JOIN queries with ORDER BY and LIMIT are working considerably faster.
  • Queries that don't use ORDER BY are not affected by the change.
  • Inner joins without index may use a sort index to aid ORDER BY and LIMIT
  • Inner joins with index may upgrade to a more complex index to also cover ORDER BY and LIMIT, but always preserves the invariant proposed by the index in the first place.

Comment thread h2/src/main/org/h2/command/query/Select.java
Comment thread h2/src/main/org/h2/command/query/Select.java
Comment thread h2/src/main/org/h2/command/query/Select.java Outdated
@h2database h2database deleted a comment from sonatype-lift Bot Oct 15, 2022

@katzyn katzyn left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your contribution!

Please, send a license statement as described here
https://h2database.com/html/build.html#providing_patches
to our mailing list:
https://groups.google.com/g/h2-database
(It is partially pre-moderated, some new messages will be visible only after approval.)

Comment thread h2/src/main/org/h2/command/query/Select.java Outdated
Comment thread h2/src/main/org/h2/command/query/Select.java Outdated
Comment thread h2/src/main/org/h2/command/query/Select.java
Comment thread h2/src/main/org/h2/command/query/Select.java
@katzyn

katzyn commented Oct 18, 2022

Copy link
Copy Markdown
Contributor

@llalexandru00

Copy link
Copy Markdown
Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants