feat: Add auto-classification support for storage service containers#26495

Open
edg956 wants to merge 67 commits into main from feat/support-container-classification

Conversation

Contributor

@edg956 edg956 commented Mar 14, 2026

Summary

Implements #21475

Add support for auto-classification (PII detection) on storage service containers (S3, GCS, etc.), enabling automatic tagging of sensitive data in files stored in cloud storage.

This extends OpenMetadata's existing auto-classification capabilities from database tables to storage containers with structured data (CSV, Parquet, etc.).

Changes

Schema & API

  • Add sampleData field to container.json schema for storing sample data
  • Create storageServiceAutoClassificationPipeline.json schema defining configuration for storage service auto-classification workflows
  • Add REST endpoints in ContainerResource for sample data operations: PUT/GET/DELETE /{id}/sampleData
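A minimal sketch of how a client might address the three sample data operations. It only builds the requests and does not send them. The `/api/v1/containers` base path, the host, the token placeholder, and the payload shape are assumptions for illustration, not taken from this PR.

```python
import json
import urllib.request

# Assumption: containers are served under /api/v1/containers on a local server.
BASE_URL = "http://localhost:8585/api/v1/containers"


def sample_data_request(container_id, method, payload=None, token="<jwt-token>"):
    """Build (but do not send) a request for /{id}/sampleData."""
    if method not in {"PUT", "GET", "DELETE"}:
        raise ValueError(f"unsupported method: {method}")
    body = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Authorization": f"Bearer {token}"}
    if body is not None:
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        f"{BASE_URL}/{container_id}/sampleData",
        data=body,
        headers=headers,
        method=method,
    )


# Hypothetical payload: column names plus rows of values.
payload = {"columns": ["email", "ssn"], "rows": [["a@example.com", "123-45-6789"]]}
put_req = sample_data_request("1234", "PUT", payload)
get_req = sample_data_request("1234", "GET")
delete_req = sample_data_request("1234", "DELETE")
```

Passing any of these to `urllib.request.urlopen` would perform the call against a running server.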

Backend (Java)

  • ContainerRepository: Implement sample data persistence and retrieval from entity_extension table
  • EntityRepository: Refactor validateColumn() to support both Table and Container column validation
  • PIIMasker: Extend PII masking to support Container entities with proper tag-based masking
  • ContainerResource: Add authorization for VIEW_SAMPLE_DATA and EDIT_SAMPLE_DATA operations
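The tag-based masking idea can be sketched in Python for illustration (the backend itself is Java). The `Column` stand-in, the `PII.Sensitive` tag name, and the mask string are assumptions here, not the actual PIIMasker code.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional

# Assumptions: illustrative tag name and mask value.
SENSITIVE_TAG = "PII.Sensitive"
MASK = "********"


@dataclass
class Column:
    """Hypothetical stand-in for a container dataModel column."""
    name: str
    tags: Optional[List[str]] = field(default_factory=list)


def mask_sample_data(
    columns: List[Column],
    sample_columns: List[str],
    rows: List[List[Any]],
    authorized: bool,
) -> List[List[Any]]:
    """Mask values of sensitive columns unless the caller is authorized."""
    if authorized:
        return rows
    # Treat missing tags as "no tags" so a null tag list cannot raise.
    sensitive = {c.name for c in columns if SENSITIVE_TAG in (c.tags or [])}
    masked_idx = {i for i, name in enumerate(sample_columns) if name in sensitive}
    return [
        [MASK if i in masked_idx else value for i, value in enumerate(row)]
        for row in rows
    ]


columns = [Column("email", ["PII.Sensitive"]), Column("city"), Column("zip", None)]
rows = [["a@example.com", "Lima", "15001"]]
masked = mask_sample_data(columns, ["email", "city", "zip"], rows, authorized=False)
```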

Ingestion Framework (Python)

  • Storage Samplers: Implement StorageSampler base class with S3 and GCS concrete implementations for reading structured files
  • Fetcher Strategy: Add StorageFetcherStrategy for fetching Container entities from storage services
  • SamplerProcessor: Extend to handle Container entities alongside Table entities
  • PII Processor: Update to classify container columns using ClassifiableEntityType union (Table | Container)
  • Metadata Sink: Add Container sample data ingestion via OMetaContainerMixin
  • Patch Mixin: Support Container dataModel column tag updates

Testing

  • Add comprehensive integration tests for container classification (MinIO/S3 with PII detection)
  • Add unit tests for StorageFetcherStrategy filtering and SamplerProcessor container handling
  • Reorganize auto-classification tests by entity type (databases/ and containers/)

Bug Fixes

  • Fix sample data not being retrieved when requesting container with fields=["sampleData"] - added proper field handling in ContainerRepository.setFields()

Type of Change

  • New feature
  • Bug fix (sample data retrieval)

Test Plan

Unit Tests

cd ingestion
pytest tests/unit/profiler/test_container_fetcher.py
pytest tests/unit/sampler/test_container_sampler_processor.py

Integration Tests

cd ingestion
pytest tests/integration/auto_classification/containers/

Manual Testing

  1. Configure a storage service (S3/GCS) with structured files containing PII
  2. Run storage service metadata ingestion
  3. Run storage service auto-classification pipeline with storeSampleData: true
  4. Verify containers have:
    • PII tags on sensitive columns (email, SSN, credit card, etc.)
    • Sample data stored and retrievable via API
    • PII masking applied when user lacks authorization

Checklist

  • I have read the CONTRIBUTING document
  • I have added tests around the new logic
  • I have added a test that covers the bug fix scenario
  • For JSON Schema changes: Updated to add new pipeline type and container field
  • Code formatted with mvn spotless:apply, make py_format

Summary by Gitar

  • UI/UX Enhancements:
    • Added EntityTabs.SAMPLE_DATA to CONTAINER_DEFAULT_TABS to expose the new sample data feature in the UI.

This will update automatically on new commits.

@github-actions github-actions Bot added the Ingestion and safe to test labels Mar 14, 2026
Base automatically changed from feat/refactor-to-make-autoclassification-tableless to main March 18, 2026 01:20
@edg956 edg956 force-pushed the feat/support-container-classification branch from d9e3d87 to 6195e0f Compare March 25, 2026 20:01
Comment thread ingestion/src/metadata/pii/base_processor.py
Comment thread ingestion/src/metadata/profiler/source/fetcher/fetcher_strategy.py Outdated
@edg956 edg956 force-pushed the feat/support-container-classification branch from 3853ec8 to 25a28e0 Compare March 27, 2026 12:32
Comment thread ingestion/src/metadata/profiler/source/fetcher/fetcher_strategy.py
@edg956 edg956 changed the title from "wip" to "feat: Add auto-classification support for storage service containers" Mar 27, 2026
@edg956 edg956 self-assigned this Mar 27, 2026
@edg956 edg956 marked this pull request as ready for review March 27, 2026 13:45
@edg956 edg956 requested a review from a team as a code owner March 27, 2026 13:45
@edg956 edg956 force-pushed the feat/support-container-classification branch from 8deee2f to 2837179 Compare March 27, 2026 13:45
@github-actions github-actions Bot requested a review from a team as a code owner March 27, 2026 13:49
Contributor

github-actions Bot commented Mar 27, 2026

🛡️ TRIVY SCAN RESULT 🛡️

Target: openmetadata-ingestion-base-slim:trivy (debian 12.13)

Vulnerabilities (4)

Package Vulnerability ID Severity Installed Version Fixed Version
libpng-dev CVE-2026-33416 🚨 HIGH 1.6.39-2+deb12u3 1.6.39-2+deb12u4
libpng-dev CVE-2026-33636 🚨 HIGH 1.6.39-2+deb12u3 1.6.39-2+deb12u4
libpng16-16 CVE-2026-33416 🚨 HIGH 1.6.39-2+deb12u3 1.6.39-2+deb12u4
libpng16-16 CVE-2026-33636 🚨 HIGH 1.6.39-2+deb12u3 1.6.39-2+deb12u4

🛡️ TRIVY SCAN RESULT 🛡️

Target: Java

Vulnerabilities (37)

Package Vulnerability ID Severity Installed Version Fixed Version
com.fasterxml.jackson.core:jackson-core CVE-2025-52999 🚨 HIGH 2.12.7 2.15.0
com.fasterxml.jackson.core:jackson-core CVE-2025-52999 🚨 HIGH 2.13.4 2.15.0
com.fasterxml.jackson.core:jackson-databind CVE-2022-42003 🚨 HIGH 2.12.7 2.12.7.1, 2.13.4.2
com.fasterxml.jackson.core:jackson-databind CVE-2022-42004 🚨 HIGH 2.12.7 2.12.7.1, 2.13.4
com.google.code.gson:gson CVE-2022-25647 🚨 HIGH 2.2.4 2.8.9
com.google.protobuf:protobuf-java CVE-2021-22569 🚨 HIGH 3.3.0 3.16.1, 3.18.2, 3.19.2
com.google.protobuf:protobuf-java CVE-2022-3509 🚨 HIGH 3.3.0 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2022-3510 🚨 HIGH 3.3.0 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2024-7254 🚨 HIGH 3.3.0 3.25.5, 4.27.5, 4.28.2
com.google.protobuf:protobuf-java CVE-2021-22569 🚨 HIGH 3.7.1 3.16.1, 3.18.2, 3.19.2
com.google.protobuf:protobuf-java CVE-2022-3509 🚨 HIGH 3.7.1 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2022-3510 🚨 HIGH 3.7.1 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2024-7254 🚨 HIGH 3.7.1 3.25.5, 4.27.5, 4.28.2
com.nimbusds:nimbus-jose-jwt CVE-2023-52428 🚨 HIGH 9.8.1 9.37.2
com.squareup.okhttp3:okhttp CVE-2021-0341 🚨 HIGH 3.12.12 4.9.2
commons-beanutils:commons-beanutils CVE-2025-48734 🚨 HIGH 1.9.4 1.11.0
commons-io:commons-io CVE-2024-47554 🚨 HIGH 2.8.0 2.14.0
dnsjava:dnsjava CVE-2024-25638 🚨 HIGH 2.1.7 3.6.0
io.airlift:aircompressor CVE-2025-67721 🚨 HIGH 0.27 2.0.3
io.netty:netty-codec-http CVE-2026-33870 🚨 HIGH 4.1.96.Final 4.1.132.Final, 4.2.10.Final
io.netty:netty-codec-http2 CVE-2025-55163 🚨 HIGH 4.1.96.Final 4.2.4.Final, 4.1.124.Final
io.netty:netty-codec-http2 CVE-2026-33871 🚨 HIGH 4.1.96.Final 4.1.132.Final, 4.2.11.Final
io.netty:netty-codec-http2 GHSA-xpw8-rcwv-8f8p 🚨 HIGH 4.1.96.Final 4.1.100.Final
io.netty:netty-handler CVE-2025-24970 🚨 HIGH 4.1.96.Final 4.1.118.Final
net.minidev:json-smart CVE-2021-31684 🚨 HIGH 1.3.2 1.3.3, 2.4.4
net.minidev:json-smart CVE-2023-1370 🚨 HIGH 1.3.2 2.4.9
org.apache.avro:avro CVE-2024-47561 🔥 CRITICAL 1.7.7 1.11.4
org.apache.avro:avro CVE-2023-39410 🚨 HIGH 1.7.7 1.11.3
org.apache.derby:derby CVE-2022-46337 🔥 CRITICAL 10.14.2.0 10.14.3, 10.15.2.1, 10.16.1.2, 10.17.1.0
org.apache.ivy:ivy CVE-2022-46751 🚨 HIGH 2.5.1 2.5.2
org.apache.mesos:mesos CVE-2018-1330 🚨 HIGH 1.4.3 1.6.0
org.apache.spark:spark-core_2.12 CVE-2025-54920 🚨 HIGH 3.5.6 3.5.7
org.apache.thrift:libthrift CVE-2019-0205 🚨 HIGH 0.12.0 0.13.0
org.apache.thrift:libthrift CVE-2020-13949 🚨 HIGH 0.12.0 0.14.0
org.apache.zookeeper:zookeeper CVE-2023-44981 🔥 CRITICAL 3.6.3 3.7.2, 3.8.3, 3.9.1
org.eclipse.jetty:jetty-server CVE-2024-13009 🚨 HIGH 9.4.56.v20240826 9.4.57.v20241219
org.lz4:lz4-java CVE-2025-12183 🚨 HIGH 1.8.0 1.8.1

🛡️ TRIVY SCAN RESULT 🛡️

Target: Node.js

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: Python

Vulnerabilities (15)

Package Vulnerability ID Severity Installed Version Fixed Version
apache-airflow CVE-2025-68438 🚨 HIGH 3.1.5 3.1.6
apache-airflow CVE-2025-68675 🚨 HIGH 3.1.5 3.1.6, 2.11.1
apache-airflow CVE-2026-26929 🚨 HIGH 3.1.5 3.1.8
apache-airflow CVE-2026-28779 🚨 HIGH 3.1.5 3.1.8
apache-airflow CVE-2026-30911 🚨 HIGH 3.1.5 3.1.8
cryptography CVE-2026-26007 🚨 HIGH 42.0.8 46.0.5
jaraco.context CVE-2026-23949 🚨 HIGH 5.3.0 6.1.0
jaraco.context CVE-2026-23949 🚨 HIGH 6.0.1 6.1.0
pyOpenSSL CVE-2026-27459 🚨 HIGH 24.1.0 26.0.0
starlette CVE-2025-62727 🚨 HIGH 0.48.0 0.49.1
urllib3 CVE-2025-66418 🚨 HIGH 1.26.20 2.6.0
urllib3 CVE-2025-66471 🚨 HIGH 1.26.20 2.6.0
urllib3 CVE-2026-21441 🚨 HIGH 1.26.20 2.6.3
wheel CVE-2026-24049 🚨 HIGH 0.45.1 0.46.2
wheel CVE-2026-24049 🚨 HIGH 0.45.1 0.46.2

🛡️ TRIVY SCAN RESULT 🛡️

Target: /etc/ssl/private/ssl-cert-snakeoil.key

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/extended_sample_data.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/lineage.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_data.json

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_data.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_data_aut.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_usage.json

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_usage.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_usage_aut.yaml

No Vulnerabilities Found

Contributor

github-actions Bot commented Mar 27, 2026

🛡️ TRIVY SCAN RESULT 🛡️

Target: openmetadata-ingestion:trivy (debian 12.12)

Vulnerabilities (4)

Package Vulnerability ID Severity Installed Version Fixed Version
libpam-modules CVE-2025-6020 🚨 HIGH 1.5.2-6+deb12u1 1.5.2-6+deb12u2
libpam-modules-bin CVE-2025-6020 🚨 HIGH 1.5.2-6+deb12u1 1.5.2-6+deb12u2
libpam-runtime CVE-2025-6020 🚨 HIGH 1.5.2-6+deb12u1 1.5.2-6+deb12u2
libpam0g CVE-2025-6020 🚨 HIGH 1.5.2-6+deb12u1 1.5.2-6+deb12u2

🛡️ TRIVY SCAN RESULT 🛡️

Target: Java

Vulnerabilities (37)

Package Vulnerability ID Severity Installed Version Fixed Version
com.fasterxml.jackson.core:jackson-core CVE-2025-52999 🚨 HIGH 2.12.7 2.15.0
com.fasterxml.jackson.core:jackson-core CVE-2025-52999 🚨 HIGH 2.13.4 2.15.0
com.fasterxml.jackson.core:jackson-databind CVE-2022-42003 🚨 HIGH 2.12.7 2.12.7.1, 2.13.4.2
com.fasterxml.jackson.core:jackson-databind CVE-2022-42004 🚨 HIGH 2.12.7 2.12.7.1, 2.13.4
com.google.code.gson:gson CVE-2022-25647 🚨 HIGH 2.2.4 2.8.9
com.google.protobuf:protobuf-java CVE-2021-22569 🚨 HIGH 3.3.0 3.16.1, 3.18.2, 3.19.2
com.google.protobuf:protobuf-java CVE-2022-3509 🚨 HIGH 3.3.0 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2022-3510 🚨 HIGH 3.3.0 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2024-7254 🚨 HIGH 3.3.0 3.25.5, 4.27.5, 4.28.2
com.google.protobuf:protobuf-java CVE-2021-22569 🚨 HIGH 3.7.1 3.16.1, 3.18.2, 3.19.2
com.google.protobuf:protobuf-java CVE-2022-3509 🚨 HIGH 3.7.1 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2022-3510 🚨 HIGH 3.7.1 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2024-7254 🚨 HIGH 3.7.1 3.25.5, 4.27.5, 4.28.2
com.nimbusds:nimbus-jose-jwt CVE-2023-52428 🚨 HIGH 9.8.1 9.37.2
com.squareup.okhttp3:okhttp CVE-2021-0341 🚨 HIGH 3.12.12 4.9.2
commons-beanutils:commons-beanutils CVE-2025-48734 🚨 HIGH 1.9.4 1.11.0
commons-io:commons-io CVE-2024-47554 🚨 HIGH 2.8.0 2.14.0
dnsjava:dnsjava CVE-2024-25638 🚨 HIGH 2.1.7 3.6.0
io.airlift:aircompressor CVE-2025-67721 🚨 HIGH 0.27 2.0.3
io.netty:netty-codec-http CVE-2026-33870 🚨 HIGH 4.1.96.Final 4.1.132.Final, 4.2.10.Final
io.netty:netty-codec-http2 CVE-2025-55163 🚨 HIGH 4.1.96.Final 4.2.4.Final, 4.1.124.Final
io.netty:netty-codec-http2 CVE-2026-33871 🚨 HIGH 4.1.96.Final 4.1.132.Final, 4.2.11.Final
io.netty:netty-codec-http2 GHSA-xpw8-rcwv-8f8p 🚨 HIGH 4.1.96.Final 4.1.100.Final
io.netty:netty-handler CVE-2025-24970 🚨 HIGH 4.1.96.Final 4.1.118.Final
net.minidev:json-smart CVE-2021-31684 🚨 HIGH 1.3.2 1.3.3, 2.4.4
net.minidev:json-smart CVE-2023-1370 🚨 HIGH 1.3.2 2.4.9
org.apache.avro:avro CVE-2024-47561 🔥 CRITICAL 1.7.7 1.11.4
org.apache.avro:avro CVE-2023-39410 🚨 HIGH 1.7.7 1.11.3
org.apache.derby:derby CVE-2022-46337 🔥 CRITICAL 10.14.2.0 10.14.3, 10.15.2.1, 10.16.1.2, 10.17.1.0
org.apache.ivy:ivy CVE-2022-46751 🚨 HIGH 2.5.1 2.5.2
org.apache.mesos:mesos CVE-2018-1330 🚨 HIGH 1.4.3 1.6.0
org.apache.spark:spark-core_2.12 CVE-2025-54920 🚨 HIGH 3.5.6 3.5.7
org.apache.thrift:libthrift CVE-2019-0205 🚨 HIGH 0.12.0 0.13.0
org.apache.thrift:libthrift CVE-2020-13949 🚨 HIGH 0.12.0 0.14.0
org.apache.zookeeper:zookeeper CVE-2023-44981 🔥 CRITICAL 3.6.3 3.7.2, 3.8.3, 3.9.1
org.eclipse.jetty:jetty-server CVE-2024-13009 🚨 HIGH 9.4.56.v20240826 9.4.57.v20241219
org.lz4:lz4-java CVE-2025-12183 🚨 HIGH 1.8.0 1.8.1

🛡️ TRIVY SCAN RESULT 🛡️

Target: Node.js

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: Python

Vulnerabilities (33)

Package Vulnerability ID Severity Installed Version Fixed Version
Authlib CVE-2026-27962 🔥 CRITICAL 1.6.6 1.6.9
Authlib CVE-2026-28490 🚨 HIGH 1.6.6 1.6.9
Authlib CVE-2026-28498 🚨 HIGH 1.6.6 1.6.9
Authlib CVE-2026-28802 🚨 HIGH 1.6.6 1.6.7
PyJWT CVE-2026-32597 🚨 HIGH 2.10.1 2.12.0
Werkzeug CVE-2024-34069 🚨 HIGH 2.2.3 3.0.3
aiohttp CVE-2025-69223 🚨 HIGH 3.12.12 3.13.3
aiohttp CVE-2025-69223 🚨 HIGH 3.13.2 3.13.3
apache-airflow CVE-2025-68438 🚨 HIGH 3.1.5 3.1.6
apache-airflow CVE-2025-68675 🚨 HIGH 3.1.5 3.1.6, 2.11.1
apache-airflow CVE-2026-26929 🚨 HIGH 3.1.5 3.1.8
apache-airflow CVE-2026-28779 🚨 HIGH 3.1.5 3.1.8
apache-airflow CVE-2026-30911 🚨 HIGH 3.1.5 3.1.8
apache-airflow-providers-http CVE-2025-69219 🚨 HIGH 5.6.0 6.0.0
azure-core CVE-2026-21226 🚨 HIGH 1.37.0 1.38.0
cryptography CVE-2026-26007 🚨 HIGH 42.0.8 46.0.5
google-cloud-aiplatform CVE-2026-2472 🚨 HIGH 1.130.0 1.131.0
google-cloud-aiplatform CVE-2026-2473 🚨 HIGH 1.130.0 1.133.0
jaraco.context CVE-2026-23949 🚨 HIGH 5.3.0 6.1.0
jaraco.context CVE-2026-23949 🚨 HIGH 6.0.1 6.1.0
protobuf CVE-2026-0994 🚨 HIGH 4.25.8 6.33.5, 5.29.6
pyOpenSSL CVE-2026-27459 🚨 HIGH 24.1.0 26.0.0
pyasn1 CVE-2026-23490 🚨 HIGH 0.6.1 0.6.2
pyasn1 CVE-2026-30922 🚨 HIGH 0.6.1 0.6.3
python-multipart CVE-2026-24486 🚨 HIGH 0.0.20 0.0.22
ray CVE-2025-62593 🔥 CRITICAL 2.47.1 2.52.0
starlette CVE-2025-62727 🚨 HIGH 0.48.0 0.49.1
tornado CVE-2026-31958 🚨 HIGH 6.5.3 6.5.5
urllib3 CVE-2025-66418 🚨 HIGH 1.26.20 2.6.0
urllib3 CVE-2025-66471 🚨 HIGH 1.26.20 2.6.0
urllib3 CVE-2026-21441 🚨 HIGH 1.26.20 2.6.3
wheel CVE-2026-24049 🚨 HIGH 0.45.1 0.46.2
wheel CVE-2026-24049 🚨 HIGH 0.45.1 0.46.2

🛡️ TRIVY SCAN RESULT 🛡️

Target: usr/bin/docker

Vulnerabilities (4)

Package Vulnerability ID Severity Installed Version Fixed Version
stdlib CVE-2025-68121 🔥 CRITICAL v1.25.5 1.24.13, 1.25.7, 1.26.0-rc.3
stdlib CVE-2025-61726 🚨 HIGH v1.25.5 1.24.12, 1.25.6
stdlib CVE-2025-61728 🚨 HIGH v1.25.5 1.24.12, 1.25.6
stdlib CVE-2026-25679 🚨 HIGH v1.25.5 1.25.8, 1.26.1

🛡️ TRIVY SCAN RESULT 🛡️

Target: /etc/ssl/private/ssl-cert-snakeoil.key

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /home/airflow/openmetadata-airflow-apis/openmetadata_managed_apis.egg-info/PKG-INFO

No Vulnerabilities Found

Contributor

github-actions Bot commented Mar 27, 2026

🔴 Playwright Results — 3 failure(s), 18 flaky

✅ 3944 passed · ❌ 3 failed · 🟡 18 flaky · ⏭️ 86 skipped

Shard Passed Failed Flaky Skipped
🔴 Shard 1 295 1 1 4
🟡 Shard 2 753 0 6 8
🟡 Shard 3 730 0 2 7
🟡 Shard 4 751 0 1 18
🟡 Shard 5 686 0 1 41
🔴 Shard 6 729 2 7 8

Genuine Failures (failed on all attempts)

Pages/SearchSettings.spec.ts › Restore default search settings (shard 1)
Error: expect(received).toEqual(expected) // deep equality

- Expected  - 0
+ Received  + 5

@@ -45,10 +45,15 @@
        "boost": 20,
        "field": "displayName.keyword",
        "matchType": "exact",
      },
      Object {
+       "boost": 20,
+       "field": "name.keyword",
+       "matchType": "exact",
+     },
+     Object {
        "boost": 10,
        "field": "name",
        "matchType": "phrase",
      },
      Object {
Pages/Glossary.spec.ts › Add and Remove Assets (shard 6)
Test timeout of 180000ms exceeded.
Pages/Users.spec.ts › Check permissions for Data Steward (shard 6)
ReferenceError: getApiContext is not defined
🟡 18 flaky test(s) (passed on retry)
  • Pages/UserCreationWithPersona.spec.ts › Create user with persona and verify on profile (shard 1, 1 retry)
  • Features/ActivityAPI.spec.ts › Activity event is created when description is updated (shard 2, 1 retry)
  • Features/ActivityAPI.spec.ts › Activity event shows the actor who made the change (shard 2, 1 retry)
  • Features/DomainFilterQueryFilter.spec.ts › Subdomain assets should be visible when parent domain is selected (shard 2, 1 retry)
  • Features/DomainFilterQueryFilter.spec.ts › Domain filter should work with different asset types (shard 2, 1 retry)
  • Features/Glossary/GlossaryWorkflow.spec.ts › should start term as Draft when glossary has reviewers (shard 2, 2 retries)
  • Features/Glossary/MUIGlossaryMutualExclusivity.spec.ts › MUI-ME-S01: Selecting ME child should auto-deselect siblings (shard 2, 1 retry)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
  • Flow/PersonaFlow.spec.ts › Set default persona for team should work properly (shard 3, 1 retry)
  • Pages/DataContracts.spec.ts › Create Data Contract and validate for Store Procedure (shard 4, 1 retry)
  • Pages/Entity.spec.ts › Tier Add, Update and Remove (shard 5, 1 retry)
  • Pages/Glossary.spec.ts › Column dropdown drag-and-drop functionality for Glossary Terms table (shard 6, 1 retry)
  • Pages/InputOutputPorts.spec.ts › Lineage section collapse/expand (shard 6, 1 retry)
  • Pages/Lineage/DataAssetLineage.spec.ts › verify create lineage for entity - Data Model (shard 6, 1 retry)
  • Pages/Lineage/DataAssetLineage.spec.ts › verify create lineage for entity - Spreadsheet (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
  • Pages/Lineage/LineageRightPanel.spec.ts › Verify custom properties tab IS visible for supported type: searchIndex (shard 6, 1 retry)
  • Pages/ODCSImportExport.spec.ts › Multi-object ODCS contract - object selector shows all schema objects (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

edg956 and others added 9 commits March 27, 2026 20:46
Extend container entity schema to support sample data storage, enabling
PII detection and classification workflows on storage service containers.

Changes:
- Add sampleData field to container.json for storing sample data
- Create storageServiceAutoClassificationPipeline.json schema defining
  configuration for storage service auto-classification pipelines
- Update workflow.json to include StorageServiceAutoClassificationPipeline
  as a supported pipeline type

This provides the schema foundation for running auto-classification
workflows on S3, GCS, and other storage service containers.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Implement Java backend functionality to handle sample data ingestion,
storage, and PII masking for container entities.

Changes:
- ContainerRepository: Add sample data retrieval and storage operations
- EntityRepository: Extend sample data support to container entities
- ContainerResource: Add REST endpoint for container sample data ingestion
- PIIMasker: Extend PII masking to support container entities

This enables the backend to process and store sample data from storage
service containers and apply PII masking rules during data retrieval.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Add Container to the ClassifiableEntityType union, enabling PII detection
and auto-classification workflows to process storage service containers
alongside database tables.

Changes:
- Update ClassifiableEntityType from Table-only to Union[Table, Container]
- Import Container entity type
- Update module docstring to reflect current support

This type extension allows the PII processor to handle both database
tables and storage containers uniformly.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Implement container-specific API mixin for sample data operations and
integrate it into the main OpenMetadata client.

Changes:
- Add OMetaContainerMixin with ingest_container_sample_data method
- Handle binary data encoding (base64) and serialization errors
- Register mixin in OpenMetadata class hierarchy
- Mirror table sample data ingestion patterns for consistency

This provides the Python API layer for ingesting sample data from
storage service containers into OpenMetadata.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
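The base64 handling mentioned in the commit above can be illustrated with a short sketch. It shows one way to make rows containing binary cells JSON-serializable; `encode_rows` and the payload shape are hypothetical and not the mixin's actual code.

```python
import base64
import json


def encode_rows(rows):
    """Return a copy of rows in which bytes-like cells become base64 strings."""
    return [
        [
            base64.b64encode(cell).decode("ascii")
            if isinstance(cell, (bytes, bytearray))
            else cell
            for cell in row
        ]
        for row in rows
    ]


rows = [["alice", b"\x00\xff"], ["bob", None]]
encoded = encode_rows(rows)
# json.dumps would raise TypeError on the raw bytes cell; the encoded copy is safe.
serialized = json.dumps({"columns": ["name", "blob"], "rows": encoded})
```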
Add sampler implementations for storage services to extract sample data
from structured containers (Parquet, CSV) for auto-classification.

Changes:
- Create base StorageSamplerInterface for storage service sampling
- Implement S3Sampler for AWS S3 containers with structured file support
- Implement GCSSampler for Google Cloud Storage containers
- Support column extraction and data sampling for structured formats
- Handle dataModel-based column definitions from containers

Storage samplers read container metadata, fetch file contents, and
generate sample datasets for downstream PII detection.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Extend the base PII processor to handle both Table and Container
entities with unified column extraction logic.

Changes:
- Add _get_entity_columns helper to extract columns from Table or Container
- Handle Container entities with optional dataModel.columns structure
- Improve column matching with safe fallback for missing columns
- Use generic entity reference in error reporting
- Add early return when entity has no columns to process

This enables PII detection to run on storage containers the same way
it processes database tables.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
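A rough sketch of the unified column extraction described in the commit above, using dataclass stand-ins rather than the generated OpenMetadata entity classes. The function names mirror the commit message, but the bodies are an assumption about the approach.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class Column:
    name: str


@dataclass
class Table:
    columns: List[Column]


@dataclass
class DataModel:
    columns: Optional[List[Column]] = None


@dataclass
class Container:
    dataModel: Optional[DataModel] = None


ClassifiableEntityType = Union[Table, Container]


def get_entity_columns(entity: ClassifiableEntityType) -> Optional[List[Column]]:
    """Tables expose columns directly; containers expose them via an optional dataModel."""
    if isinstance(entity, Table):
        return entity.columns
    return entity.dataModel.columns if entity.dataModel else None


def find_column(columns: List[Column], column_name: str) -> Optional[Column]:
    """Safe lookup: returns None instead of raising when the column is missing."""
    return next((c for c in columns if c.name == column_name), None)


with_model = Container(DataModel([Column("email")]))
without_model = Container()
```

A caller would return early when `get_entity_columns` yields `None` or an empty list, and skip any sampled column for which `find_column` returns `None`.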
Extend the sampler processor to handle both database and storage service
entities with appropriate sampler class selection.

Changes:
- Detect service type from source config (Database vs Storage)
- Import StorageServiceAutoClassificationPipeline
- Handle both Table and Container entity types in _run method
- Add column validation for Container entities (via dataModel.columns)
- Create storage-specific sampler interfaces for S3 and GCS
- Update sampler_interface to support Container entities
- Improve error messages with entity type context

The processor now dynamically selects database or storage samplers based
on the pipeline configuration type.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Implement fetcher strategy pattern for storage services to retrieve
containers for auto-classification workflows.

Changes:
- Add StorageFetcherStrategy to handle storage service entity fetching
- Update EntityFetcher to select appropriate strategy based on service type
- Support both DatabaseService and StorageService in strategy selection
- Import StorageService type for service detection
- Improve error messages with specific service type information

The fetcher now dynamically creates database or storage-specific
strategies to retrieve entities based on pipeline configuration.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
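The strategy selection in the commit above could look roughly like the following. The two classes are empty stand-ins, and the registry and `select_fetcher_strategy` are hypothetical names used only for this sketch.

```python
class DatabaseFetcherStrategy:
    """Stand-in for the strategy that yields Table entities."""
    entity_kind = "Table"


class StorageFetcherStrategy:
    """Stand-in for the strategy that yields Container entities."""
    entity_kind = "Container"


# Hypothetical registry keyed by service type name.
STRATEGY_BY_SERVICE_TYPE = {
    "Database": DatabaseFetcherStrategy,
    "Storage": StorageFetcherStrategy,
}


def select_fetcher_strategy(service_type):
    """Instantiate the strategy for a service type, naming the type on failure."""
    try:
        strategy_cls = STRATEGY_BY_SERVICE_TYPE[service_type]
    except KeyError:
        supported = ", ".join(sorted(STRATEGY_BY_SERVICE_TYPE))
        raise NotImplementedError(
            f"No fetcher strategy for service type {service_type!r}; "
            f"supported: {supported}"
        ) from None
    return strategy_cls()


strategy = select_fetcher_strategy("Storage")
```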
Add AutoClassification pipeline support to S3 and GCS storage service
specifications, enabling UI and workflow registration.

Changes:
- Add AutoClassification to S3ServiceSpec supported pipelines
- Add AutoClassification to GCSServiceSpec supported pipelines
- Import StorageServiceAutoClassificationPipeline in both specs

This registers the auto-classification workflow type for storage
services in the ingestion framework's service registry.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Comment on lines +45 to 48
    "supportsProfiler": {
      "title": "Supports Profiler",
      "$ref": "../connectionBasicType.json#/definitions/supportsProfiler"
    }
Contributor

Flagging this in case it's missed (also called out in my comment on CreateIngestionPipelineImpl). This claims ADLS supports the profiler / auto-classification pipeline, but the Python samplers shipped in this PR cover only S3 and GCS (ingestion/src/metadata/ingestion/source/storage/{s3,gcs}/service_spec.py).

With this flag set, the UI will let a user configure an auto-classification pipeline for an ADLS service which then fails at pipeline execution time with a "no sampler registered" style error. Same applies to customStorageConnection.json.

Suggestion is to drop the supportsProfiler addition from adlsConnection.json and customStorageConnection.json for now, and bring them back in the follow-up PRs that ship their samplers.

Contributor Author

This is ok because the UI does not let users create ADLS services in OM. This is Collate-specific

IceS2 previously approved these changes Apr 14, 2026
Comment on lines +483 to +485
logger.debug(
    f"Container {container.fullyQualifiedName.root} has no dataModel, skipping column tag patch"
)
Contributor

Shouldn't this be a Warning?

Comment on lines +81 to +84
and entry.get("Key")
and not entry.get("Key").endswith("/")
and "/_delta_log/" not in entry.get("Key")
and not entry.get("Key").endswith("/_SUCCESS")
Contributor

Could potentially be extracted
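One possible shape for that extraction, as a sketch: the visible conditions move into a small predicate. `is_data_file_key` is a hypothetical name, and any conditions that precede the quoted lines in the original expression are not covered here.

```python
def is_data_file_key(key):
    """True for object keys that look like data files.

    Excludes empty keys, folder markers (trailing "/"), Delta Lake log
    entries, and _SUCCESS marker files.
    """
    return bool(
        key
        and not key.endswith("/")
        and "/_delta_log/" not in key
        and not key.endswith("/_SUCCESS")
    )


entries = [
    {"Key": "sales/2024/data.parquet"},
    {"Key": "sales/2024/"},
    {"Key": "sales/_delta_log/0000.json"},
    {"Key": "sales/2024/_SUCCESS"},
    {},
]
data_keys = [e["Key"] for e in entries if is_data_file_key(e.get("Key"))]
```

Calling `entry.get("Key")` once and passing the result in also avoids repeating the lookup four times.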

IceS2 previously approved these changes Apr 15, 2026
manerow previously approved these changes Apr 15, 2026
IceS2 previously approved these changes Apr 20, 2026

gitar-bot Bot commented Apr 23, 2026

Code Review 👍 Approved with suggestions 17 resolved / 18 findings

Adds auto-classification support for storage service containers with comprehensive fixes addressing 14 validation, null-safety, and error-handling issues. Consider forwarding queryText parameter to the POST aggregation path to maintain consistency across both code paths.

💡 Quality: queryText not forwarded in independent (POST) aggregation path

In getAggregationOptions, the new queryText parameter is only passed to the getAggregateFieldOptions (GET) path. When isIndependent is true, postAggregateFieldOptions is called without queryText, so the search text won't influence filter options in the independent mode. This may be intentional if the POST endpoint doesn't support it, but it's worth confirming the behavior is expected.

✅ 17 resolved
Edge Case: Sample data accepted without validation when dataModel is null

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/ContainerRepository.java:497
In ContainerRepository.addSampleData, column name validation is skipped when container.getDataModel() is null, but the sample data is still persisted. This means a container without a data model can accept sample data with completely arbitrary column names. TableRepository.addSampleData always enforces column validation. Consider whether containers without a data model should reject sample data entirely, or if this permissiveness is intentional.

Edge Case: base_processor accesses entity.dataModel.columns without null check

📄 ingestion/src/metadata/pii/base_processor.py:93-99
In _get_entity_columns (line 98), for Container entities the code accesses entity.dataModel.columns with a ternary guard on entity.dataModel, but this is only used in _run at line 112-114 where a None result causes a silent early return. However, if dataModel exists but columns is None, the method returns None and classification is silently skipped—this seems intentional but worth noting.

More importantly, in the existing _run method at line 120, the code does next((c for c in columns if c.name == column_name), None) which correctly handles missing columns with a continue. This is good defensive coding.

Edge Case: _filter_entities accesses attributes without getattr guard

📄 ingestion/src/metadata/profiler/source/fetcher/fetcher_strategy.py:430-439
_filter_entities at lines 430-431 accesses self.source_config.bucketFilterPattern and self.source_config.containerFilterPattern directly, while _filter_buckets (line 371) and _filter_containers (line 397-398) use getattr(..., None) for the same attributes. This inconsistency means _filter_entities would raise an AttributeError if the source config doesn't have these attributes, whereas the individual filter methods handle it gracefully.
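The difference between the two access styles can be shown with a small sketch. `SimpleNamespace` stands in for the source config object, and `get_filter_patterns` is a hypothetical helper, not code from the PR.

```python
from types import SimpleNamespace


def get_filter_patterns(source_config):
    """Read both filter patterns, defaulting to None when the attribute is absent."""
    return (
        getattr(source_config, "bucketFilterPattern", None),
        getattr(source_config, "containerFilterPattern", None),
    )


bare_config = SimpleNamespace()
partial_config = SimpleNamespace(bucketFilterPattern="^prod-.*")

bare_patterns = get_filter_patterns(bare_config)
partial_patterns = get_filter_patterns(partial_config)

# Direct attribute access on the same object fails instead of defaulting.
try:
    bare_config.bucketFilterPattern
    direct_access_failed = False
except AttributeError:
    direct_access_failed = True
```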

Bug: hasPiiSensitiveTag(Container) NPE when tags is null

📄 openmetadata-service/src/main/java/org/openmetadata/service/security/mask/PIIMasker.java:305 📄 openmetadata-service/src/main/java/org/openmetadata/service/security/mask/PIIMasker.java:312-316
The new hasPiiSensitiveTag(Container) method calls container.getTags().stream() without a null check. The tags field is optional in the Container JSON schema and can be null at runtime (e.g., when not fetched/populated). This will throw a NullPointerException when PII masking is triggered on a container whose tags haven't been loaded.

Note: the existing hasPiiSensitiveTag(Table) has the same issue, but it's out of scope for this PR.

Bug: addSampleData missing @Transaction annotation

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/ContainerRepository.java:494 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/ContainerRepository.java:508-515 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/ContainerRepository.java:508
ContainerRepository.addSampleData performs a DAO insert (lines 512-518) but lacks the @Transaction annotation. Both TableRepository.addSampleData and FileRepository.addSampleData consistently use @Transaction. Without it, the insert is not protected by a transaction boundary, risking partial/inconsistent state on failure.

...and 12 more resolved from earlier reviews

Labels

Ingestion, safe to test

5 participants