Support serializing external OpenSearch UDFs at pushdown time by yuancu · Pull Request #4618 · opensearch-project/sql · GitHub
Skip to content

Support serializing external OpenSearch UDFs at pushdown time#4618

Merged
LantaoJin merged 4 commits into
opensearch-project:mainfrom
yuancu:issues/4478
Nov 3, 2025
Merged

Support serializing external OpenSearch UDFs at pushdown time#4618
LantaoJin merged 4 commits into
opensearch-project:mainfrom
yuancu:issues/4478

Conversation

@yuancu

@yuancu yuancu commented Oct 21, 2025

Copy link
Copy Markdown
Collaborator

Description

Previously, externally registered OpenSearch UDFs can not be serialized as they are not registered in RelJsonSerializer. This PR collects these UDFs in a SqlOperatorTable and registers it in RelJsonSerializer.

Blocker

Blocked by #2813 (or potentially other issues that restrict groupping by struct fields)

UDF is serialized, but grouping by a generated struct seems to be problematic after pushdown.

  • source=weblogs | where host='1.2.3.4' | eval info = geoip('my-datasource', host) | stats count() by info:

    {
      "error": {
        "reason": "There was internal problem at backend",
        "details": "java.sql.SQLException: exception while executing query: class java.lang.String cannot be cast to class java.util.Map (java.lang.String and java.util.Map are in module java.base of loader 'bootstrap')",
        "type": "RuntimeException"
      },
      "status": 500
    }

    In this case, the result map is converted to a string when used as a group key.

    Directly running the DSL with the script gives the bucket key as string

    Query:

    {"from":0,"size":0,"timeout":"1m","query":{"term":{"host":{"value":"1.2.3.4","boost":1.0}}},"_source":{"includes":["host"],"excludes":[]},"aggregations":{"composite_buckets":{"composite":{"size":10000,"sources":[{"info":{"terms":{"script":{"source":"{\"langType\":\"calcite\",\"script\":\"rO0ABXNyABFqYXZhLnV0aWwuQ29sbFNlcleOq7Y6G6gRAwABSQADdGFneHAAAAADdwQAAAAGdAAHcm93VHlwZXQAknsKICAiZmllbGRzIjogWwogICAgewogICAgICAidWR0IjogIkVYUFJfSVAiLAogICAgICAidHlwZSI6ICJPVEhFUiIsCiAgICAgICJudWxsYWJsZSI6IHRydWUsCiAgICAgICJuYW1lIjogImhvc3QiCiAgICB9CiAgXSwKICAibnVsbGFibGUiOiBmYWxzZQp9dAAEZXhwcnQCw3sKICAib3AiOiB7CiAgICAibmFtZSI6ICJHRU9JUCIsCiAgICAia2luZCI6ICJPVEhFUl9GVU5DVElPTiIsCiAgICAic3ludGF4IjogIkZVTkNUSU9OIgogIH0sCiAgIm9wZXJhbmRzIjogWwogICAgewogICAgICAibGl0ZXJhbCI6ICJteS1kYXRhc291cmNlIiwKICAgICAgInR5cGUiOiB7CiAgICAgICAgInR5cGUiOiAiVkFSQ0hBUiIsCiAgICAgICAgIm51bGxhYmxlIjogZmFsc2UsCiAgICAgICAgInByZWNpc2lvbiI6IC0xCiAgICAgIH0KICAgIH0sCiAgICB7CiAgICAgICJpbnB1dCI6IDAsCiAgICAgICJuYW1lIjogIiQwIgogICAgfQogIF0sCiAgImNsYXNzIjogIm9yZy5vcGVuc2VhcmNoLnNxbC5leHByZXNzaW9uLmZ1bmN0aW9uLlVzZXJEZWZpbmVkRnVuY3Rpb25CdWlsZGVyJDEiLAogICJ0eXBlIjogewogICAgInR5cGUiOiAiTUFQIiwKICAgICJudWxsYWJsZSI6IGZhbHNlLAogICAgImtleSI6IHsKICAgICAgInR5cGUiOiAiVkFSQ0hBUiIsCiAgICAgICJudWxsYWJsZSI6IGZhbHNlLAogICAgICAicHJlY2lzaW9uIjogLTEKICAgIH0sCiAgICAidmFsdWUiOiB7CiAgICAgICJ0eXBlIjogIkFOWSIsCiAgICAgICJudWxsYWJsZSI6IGZhbHNlLAogICAgICAicHJlY2lzaW9uIjogLTEsCiAgICAgICJzY2FsZSI6IC0yMTQ3NDgzNjQ4CiAgICB9CiAgfSwKICAiZGV0ZXJtaW5pc3RpYyI6IHRydWUsCiAgImR5bmFtaWMiOiBmYWxzZQp9dAAKZmllbGRUeXBlc3NyABFqYXZhLnV0aWwuSGFzaE1hcAUH2sHDFmDRAwACRgAKbG9hZEZhY3RvckkACXRocmVzaG9sZHhwP0AAAAAAAAx3CAAAABAAAAABdAAEaG9zdH5yAClvcmcub3BlbnNlYXJjaC5zcWwuZGF0YS50eXBlLkV4cHJDb3JlVHlwZQAAAAAAAAAAEgAAeHIADmphdmEubGFuZy5FbnVtAAAAAAAAAAASAAB4cHQAAklQeHg=\"}","lang":"opensearch_compounded_script","params":{"utcTimestamp":1761646601217183000}},"missing_bucket":true,"missing_order":"first","order":"asc"}}}]}}}}

    Result:

    {
      "took": 137,
      "timed_out": false,
      "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": {
          "value": 1,
          "relation": "eq"
        },
        "max_score": null,
        "hits": []
      },
      "aggregations": {
        "composite_buckets": {
          "after_key": {
            "info": "{continent_name=Oceania, country_iso_code=AU, country_name=Australia, location=-33.4940,143.2104, time_zone=Australia/Sydney}"
          },
          "buckets": [
            {
              "key": {
                "info": "{continent_name=Oceania, country_iso_code=AU, country_name=Australia, location=-33.4940,143.2104, time_zone=Australia/Sydney}"
              },
              "doc_count": 1
            }
          ]
        }
      }
    }
For reference, the result of `source=weblogs | where host='1.2.3.4' | eval info = geoip('my-datasource', host)` is:
{
  "schema": [
    {
      "name": "host",
      "type": "ip"
    },
    {
      "name": "method",
      "type": "string"
    },
    {
      "name": "bytes",
      "type": "string"
    },
    {
      "name": "response",
      "type": "string"
    },
    {
      "name": "url",
      "type": "string"
    },
    {
      "name": "info",
      "type": "struct"
    }
  ],
  "datarows": [
    [
      "1.2.3.4",
      "GET",
      "1234",
      "200",
      "/history/voyager1/",
      {
        "continent_name": "Oceania",
        "country_iso_code": "AU",
        "country_name": "Australia",
        "location": "-33.4940,143.2104",
        "time_zone": "Australia/Sydney"
      }
    ]
  ],
  "total": 1,
  "size": 1
}

Workaround

Instead of directly groupping by the struct generated by geoip, I created tests that group results by sub-fields of result struct of geoip function. E.g. if the result object info {"str": "a string", "num": 1}, I group results by info.str.

Related Issues

Resolves #4478

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • New PPL command checklist all confirmed.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff or -s.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@yuancu yuancu added the bug Something isn't working label Oct 22, 2025
Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>
Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>
…r types are UNDEFINED

Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>
Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>
@yuancu yuancu changed the title WIP: Support serializing external OpenSearch UDFs at pushdown time Support serializing external OpenSearch UDFs at pushdown time Oct 28, 2025
@yuancu yuancu marked this pull request as ready for review October 28, 2025 10:06
@LantaoJin LantaoJin merged commit a003e8c into opensearch-project:main Nov 3, 2025
64 of 65 checks passed
@opensearch-trigger-bot

Copy link
Copy Markdown
Contributor

@yuancu yuancu deleted the issues/4478 branch November 4, 2025 02:05
yuancu added a commit to yuancu/sql-plugin that referenced this pull request Nov 4, 2025
…arch-project#4618)

* Supports serilizing external OpenSearch UDFs

Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>

* Correct subfield access logical when calling ITEM

Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>

* Resolve types of generated structs based on their values because their types are UNDEFINED

Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>

* Add explain and integration tests for geoip

Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>

---------

Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>
(cherry picked from commit a003e8c)
yuancu added a commit that referenced this pull request Nov 4, 2025
…ushdown time (#4618) (#4725)

* Support serializing external OpenSearch UDFs at pushdown time (#4618)

* Supports serilizing external OpenSearch UDFs

Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>

* Correct subfield access logical when calling ITEM

Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>

* Resolve types of generated structs based on their values because their types are UNDEFINED

Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>

* Add explain and integration tests for geoip

Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>

---------

Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>
(cherry picked from commit a003e8c)

* Update plan for testGeoIpPushedInAgg

Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>

* Mask position of fields to fix an undeterministic position of a field in the result plan for java21

Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>

---------

Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>
@yuancu yuancu added the backport-manually Filed a PR to backport manually. label Nov 4, 2025
expani pushed a commit to vinaykpud/sql that referenced this pull request Nov 4, 2025
…arch-project#4618)

* Supports serilizing external OpenSearch UDFs

Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>

* Correct subfield access logical when calling ITEM

Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>

* Resolve types of generated structs based on their values because their types are UNDEFINED

Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>

* Add explain and integration tests for geoip

Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>

---------

Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport 2.19-dev backport-failed backport-manually Filed a PR to backport manually. bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] geoip() function fails with deserialization error when used in aggregation grouping context

3 participants