Dedup pushdown (TopHits Agg) should work with Object fields by LantaoJin · Pull Request #4991 · opensearch-project/sql · GitHub
Skip to content

Dedup pushdown (TopHits Agg) should work with Object fields#4991

Merged
qianheng-aws merged 1 commit into
opensearch-project:mainfrom
LantaoJin:pr/issues/4990
Dec 24, 2025
Merged

Dedup pushdown (TopHits Agg) should work with Object fields#4991
qianheng-aws merged 1 commit into
opensearch-project:mainfrom
LantaoJin:pr/issues/4990

Conversation

@LantaoJin

@LantaoJin LantaoJin commented Dec 23, 2025

Copy link
Copy Markdown
Member

Description

#4844 converted dedup to TopHits Agg. But failed to parse dedup column if the column is a child of Object field.
#4360 restored the internal primitive value in a Map for Aggregates (first, last, min, max) which stored these Map objects in their accumulators.(first, last, min, max) stored these Map objects in their accumulators. But this fixing was not necessary since #4844 fixed them in other way.

In this PR:

  1. fix the bug of get the dedup column names
  2. revert Fix ClassCastException for value-storing aggregates on nested PPL fields #4360

Related Issues

Resolves #4990

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • New PPL command checklist all confirmed.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff or -s.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Lantao Jin <ltjin@amazon.com>
@coderabbitai

coderabbitai Bot commented Dec 23, 2025

Copy link
Copy Markdown
Contributor

@LantaoJin

Copy link
Copy Markdown
Member Author

cc @aaarone90
cc @ahkcs can you help to confirm the current fixing could address the issue of #4359 as long as the CalcitePPLAggregationIT passed?

@aalva500-prog aalva500-prog Dec 23, 2025

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @LantaoJin, thank you for the changes. The dedup command now works, but looks like the query execution plan changed. Looks like it now includes a PROJECT pushdown optimization:

curl -X POST "localhost:9200/_plugins/_ppl/_explain" \
-H "Content-Type: application/json" \
-d '{
"query": "source = big5 | dedup metrics.size | sort - @timestamp"          
}'
{
  "calcite": {
    "logical": "LogicalSystemLimit(sort0=[$7], dir0=[DESC-nulls-last], fetch=[10000], type=[QUERY_SIZE_LIMIT])\n  LogicalProject(agent=[$0], process=[$6], log=[$8], message=[$11], tags=[$12], cloud=[$13], input=[$15], @timestamp=[$17], data_stream=[$18], host=[$22], metrics=[$24], aws=[$27], event=[$32])\n    LogicalSort(sort0=[$17], dir0=[DESC-nulls-last])\n      LogicalProject(agent=[$0], agent.ephemeral_id=[$1], agent.id=[$2], agent.name=[$3], agent.type=[$4], agent.version=[$5], process=[$6], process.name=[$7], log=[$8], log.file=[$9], log.file.path=[$10], message=[$11], tags=[$12], cloud=[$13], cloud.region=[$14], input=[$15], input.type=[$16], @timestamp=[$17], data_stream=[$18], data_stream.dataset=[$19], data_stream.namespace=[$20], data_stream.type=[$21], host=[$22], host.name=[$23], metrics=[$24], metrics.size=[$25], metrics.tmin=[$26], aws=[$27], aws.cloudwatch=[$28], aws.cloudwatch.ingestion_time=[$29], aws.cloudwatch.log_group=[$30], aws.cloudwatch.log_stream=[$31], event=[$32], event.dataset=[$33], event.id=[$34], event.ingested=[$35], _id=[$36], _index=[$37], _score=[$38], _maxscore=[$39], _sort=[$40], _routing=[$41])\n        LogicalFilter(condition=[<=($42, 1)])\n          LogicalProject(agent=[$0], agent.ephemeral_id=[$1], agent.id=[$2], agent.name=[$3], agent.type=[$4], agent.version=[$5], process=[$6], process.name=[$7], log=[$8], log.file=[$9], log.file.path=[$10], message=[$11], tags=[$12], cloud=[$13], cloud.region=[$14], input=[$15], input.type=[$16], @timestamp=[$17], data_stream=[$18], data_stream.dataset=[$19], data_stream.namespace=[$20], data_stream.type=[$21], host=[$22], host.name=[$23], metrics=[$24], metrics.size=[$25], metrics.tmin=[$26], aws=[$27], aws.cloudwatch=[$28], aws.cloudwatch.ingestion_time=[$29], aws.cloudwatch.log_group=[$30], aws.cloudwatch.log_stream=[$31], event=[$32], event.dataset=[$33], event.id=[$34], event.ingested=[$35], _id=[$36], _index=[$37], _score=[$38], _maxscore=[$39], _sort=[$40], _routing=[$41], _row_number_dedup_=[ROW_NUMBER() OVER (PARTITION BY $25)])\n            LogicalFilter(condition=[IS NOT NULL($25)])\n              CalciteLogicalIndexScan(table=[[OpenSearch, big5]])\n",
    "physical": "EnumerableLimit(fetch=[10000])\n  EnumerableSort(sort0=[$7], dir0=[DESC-nulls-last])\n    CalciteEnumerableIndexScan(table=[[OpenSearch, big5]], PushDownContext=[[PROJECT->[agent, process, log, message, tags, cloud, input, @timestamp, data_stream, host, metrics, metrics.size, aws, event], AGGREGATION->rel#2183:LogicalAggregate.NONE.[](input=LogicalProject#2181,group={0},agg#0=LITERAL_AGG(1))], OpenSearchRequestBuilder(sourceBuilder={\"from\":0,\"size\":0,\"timeout\":\"1m\",\"_source\":{\"includes\":[\"agent\",\"process\",\"log\",\"message\",\"tags\",\"cloud\",\"input\",\"@timestamp\",\"data_stream\",\"host\",\"metrics\",\"metrics.size\",\"aws\",\"event\"],\"excludes\":[]},\"aggregations\":{\"composite_buckets\":{\"composite\":{\"size\":10000,\"sources\":[{\"metrics.size\":{\"terms\":{\"field\":\"metrics.size\",\"missing_bucket\":false,\"order\":\"asc\"}}}]},\"aggregations\":{\"$f1\":{\"top_hits\":{\"from\":0,\"size\":1,\"version\":false,\"seq_no_primary_term\":false,\"explain\":false,\"_source\":{\"includes\":[\"metrics.size\",\"agent\",\"process\",\"log\",\"message\",\"tags\",\"cloud\",\"input\",\"@timestamp\",\"data_stream\",\"host\",\"metrics\",\"aws\",\"event\"],\"excludes\":[]},\"script_fields\":{}}}}}}}, requestedTotalSize=2147483647, pageSize=null, startFrom=0)])\n"
  }

@LantaoJin LantaoJin Dec 24, 2025

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it now includes a PROJECT pushdown optimization

Did you run on the latest code? I didn't see the project pushdown action in the explain output.

@yuancu yuancu left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@qianheng-aws qianheng-aws merged commit 1192376 into opensearch-project:main Dec 24, 2025
44 of 47 checks passed
@opensearch-trigger-bot

Copy link
Copy Markdown
Contributor

LantaoJin added a commit to LantaoJin/search-plugins-sql that referenced this pull request Dec 24, 2025
…ch-project#4991)

Signed-off-by: Lantao Jin <ltjin@amazon.com>
(cherry picked from commit 1192376)
@LantaoJin LantaoJin added the backport-manually Filed a PR to backport manually. label Dec 24, 2025
qianheng-aws pushed a commit that referenced this pull request Dec 26, 2025
…4995)

(cherry picked from commit 1192376)

Signed-off-by: Lantao Jin <ltjin@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Dedup pushdown (top hits) not work when parse Object fields

4 participants