Perf release gate by bric3 · Pull Request #9068 · DataDog/dd-trace-java · GitHub

Perf release gate #9068

Merged
bric3 merged 11 commits into master from bdu/r-gate on Jul 22, 2025

Conversation

@bric3

@bric3 bric3 commented Jun 30, 2025

Contributor

What Does This Do

Checks whether the release meets the expected performance thresholds, and sends a notification if it does not.

Screenshot 2025-07-11 at 16 37 58 Screenshot 2025-07-11 at 16 13 18

Motivation

Ensure that releases meet the expected performance thresholds.

Additional Notes

Contributor Checklist

Jira ticket: [PROJ-IDENT]

@bric3 bric3 requested a review from a team as a code owner June 30, 2025 16:59
@bric3 bric3 requested review from colin-higgins and removed request for a team June 30, 2025 16:59
@bric3 bric3 added the labels "tag: no release notes" (Changes to exclude from release notes) and "comp: tooling" (Build & Tooling) Jun 30, 2025
@bric3 bric3 marked this pull request as draft June 30, 2025 16:59
@bric3 bric3 changed the title from "chore(ci): Basic slo breach prototype" to "Perf release gate" Jun 30, 2025
@pr-commenter

pr-commenter Bot commented Jun 30, 2025


@ddyurchenko ddyurchenko left a comment

Contributor


Looks good from my side! 🎉
We also need @igoragoli's review for the final approval.

@ddyurchenko ddyurchenko requested a review from igoragoli July 1, 2025 13:43
@bric3 bric3 requested a review from a team July 1, 2025 14:07

@igoragoli igoragoli left a comment

Contributor


Thanks @bric3! 🙌

There are just some points regarding startup:petclinic.* thresholds that need to be addressed before merging.

Comment thread .gitlab/benchmarks/bp-runner.fail-on-breach.yml Outdated
Comment thread .gitlab/benchmarks/bp-runner.fail-on-breach.yml Outdated
Comment thread .gitlab/macrobenchmarks.yml Outdated
@bric3 bric3 requested a review from a team July 2, 2025 15:24
@bric3 bric3 force-pushed the bdu/r-gate branch 7 times, most recently from 7e61d82 to feb60ef July 10, 2025 12:48
@ddyurchenko ddyurchenko self-requested a review July 11, 2025 14:39
@bric3 bric3 marked this pull request as ready for review July 11, 2025 15:45

@igoragoli igoragoli left a comment

Contributor


Looks good, thanks Brice!

I think it's a good idea to include the source for the SLOs in the thresholds file, nice.

Comment thread .gitlab/macrobenchmarks.yml Outdated
    when: always
  - when: manual
    allow_failure: true
  - when: on_success # TODO: PLEASE revert before merging the PR

Contributor Author


todo:

  • To revert before merging

@ddyurchenko

Contributor

I still see some yellow in the results, so while the job won't block releases, it will send warning messages to the Java guild channel.
Augusto practically finalized the changes to reporting (so confidence intervals and thresholds, including the warning threshold, are now clearly displayed). I propose rebuilding the Java image so that these changes are included, and updating the SLOs once more for p50 latency, p99 latency, and the mean startup time (the execution_time metric), so that they are no longer in the yellow zone.

SLO breach check  | 
SLO breach check  | #### high_load--only-tracing
SLO breach check  | 
SLO breach check  | - 🟩 `throughput` 1250.55 op/s > 1100.00 op/s
SLO breach check  | 
SLO breach check  | #### high_load--otel-latest
SLO breach check  | 
SLO breach check  | - 🟩 `throughput` 1245.11 op/s > 1100.00 op/s
SLO breach check  | 
SLO breach check  | #### normal_operation--only-tracing
SLO breach check  | 
SLO breach check  | - 🟩 `agg_http_req_duration_p50` 2.12 ms < 2.36 ms
SLO breach check  | - 🟨 `agg_http_req_duration_p99` 7.10 ms < 7.89 ms
SLO breach check  | 
SLO breach check  | #### normal_operation--otel-latest
SLO breach check  | 
SLO breach check  | - 🟨 `agg_http_req_duration_p50` 2.12 ms < 2.34 ms
SLO breach check  | - 🟨 `agg_http_req_duration_p99` 8.75 ms < 9.50 ms
SLO breach check  | 
SLO breach check  | #### startup:petclinic:appsec:GlobalTracer
SLO breach check  | 
SLO breach check  | - 🟨 `execution_time` 235.01 ms < 260.00 ms
SLO breach check  | 
SLO breach check  | #### startup:petclinic:iast:GlobalTracer
SLO breach check  | 
SLO breach check  | - 🟩 `execution_time` 231.70 ms < 260.00 ms
SLO breach check  | 
SLO breach check  | #### startup:petclinic:profiling:GlobalTracer
SLO breach check  | 
SLO breach check  | - 🟨 `execution_time` 361.27 ms < 368.00 ms
SLO breach check  | 
SLO breach check  | #### startup:petclinic:tracing:GlobalTracer
SLO breach check  | 
SLO breach check  | - 🟨 `execution_time` 243.13 ms < 260.00 ms
SLO breach check  | 
SLO breach check  | ---
SLO breach check  | 
SLO breach check  | Legend:
SLO breach check  | - 🟩 pass
SLO breach check  | - 🟥 breach
SLO breach check  | - 🟨 warning
SLO breach check  | - (unstable) unstable
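
A report in this shape is easy to post-process. The following is only a minimal sketch, assuming the line layout shown above; `summarize_report` and its regular expressions are hypothetical helpers and not part of the benchmarking platform tools. It groups each (scenario, metric) pair by the status emoji on its line.

```python
import re
from collections import defaultdict

# Hypothetical helper for illustration; not part of the real check-slo-breaches job.
STATUS_BY_EMOJI = {"🟩": "pass", "🟨": "warning", "🟥": "breach"}

# Job log prefix as it appears in the report above (assumed stable).
PREFIX_RE = re.compile(r"^SLO breach check\s*\|\s?")
SCENARIO_RE = re.compile(r"^#### (?P<scenario>\S+)$")
METRIC_RE = re.compile(
    r"^- (?P<emoji>🟩|🟨|🟥) `(?P<metric>[^`]+)` "
    r"(?P<value>[\d.]+) \S+ (?P<op>[<>]) (?P<limit>[\d.]+) \S+$"
)


def summarize_report(text):
    """Group (scenario, metric) pairs by status from the textual report."""
    by_status = defaultdict(list)
    scenario = None
    for raw in text.splitlines():
        line = PREFIX_RE.sub("", raw).strip()
        scenario_match = SCENARIO_RE.match(line)
        if scenario_match:
            scenario = scenario_match.group("scenario")
            continue
        metric_match = METRIC_RE.match(line)
        # Legend lines such as "- 🟩 pass" have no backticked metric and are skipped.
        if metric_match and scenario is not None:
            status = STATUS_BY_EMOJI[metric_match.group("emoji")]
            by_status[status].append((scenario, metric_match.group("metric")))
    return dict(by_status)
```

Listing only the entries under "warning" gives the set of SLOs that would need adjusting to leave the yellow zone.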

@igoragoli

igoragoli commented Jul 16, 2025

Contributor

Augusto practically finalized the changes to reporting

Changes for reporting done! 🙂

I'm updating the registry.ddbuild.io/images/benchmarking-platform-tools-ubuntu:latest image (the one used in the check-slo-breaches job across the board) here: https://gitlab.ddbuild.io/DataDog/benchmarking-platform-tools/-/jobs/1031270236

@ddyurchenko

Contributor

Thanks @igoragoli! I reran the job https://gitlab.ddbuild.io/DataDog/apm-reliability/dd-trace-java/-/jobs/1031460223 and, based on its results, I will provide suggestions to @bric3.

Contributor


Based on the results of https://gitlab.ddbuild.io/DataDog/apm-reliability/dd-trace-java/-/jobs/1031460223, I suggest updating the SLOs to the following values:


          # Standard macrobenchmarks
          # https://benchmarking.us1.prod.dog/trends?projectId=4&branch=master&trendsTab=per_scenario&scenario=normal_operation%2Fonly-tracing&trendsType=scenario
          - name: normal_operation/only-tracing
            thresholds:
              - agg_http_req_duration_p50 < 2.36 ms
              - agg_http_req_duration_p99 < 7.89 ms
          # https://benchmarking.us1.prod.dog/trends?projectId=4&branch=master&trendsTab=per_scenario&scenario=normal_operation%2Fotel-latest&trendsType=scenario
          - name: normal_operation/otel-latest
            thresholds:
              - agg_http_req_duration_p50 < 2.5 ms
              - agg_http_req_duration_p99 < 10 ms

          # https://benchmarking.us1.prod.dog/trends?projectId=4&branch=master&trendsTab=per_scenario&scenario=high_load%2Fonly-tracing&trendsType=scenario
          - name: high_load/only-tracing
            thresholds:
              - throughput > 1100.0 op/s
          # https://benchmarking.us1.prod.dog/trends?projectId=4&branch=master&trendsTab=per_scenario&scenario=high_load%2Fotel-latest&trendsType=scenario
          - name: high_load/otel-latest
            thresholds:
              - throughput > 1100.0 op/s

          # Startup macrobenchmarks
          # https://benchmarking.us1.prod.dog/trends?projectId=4&branch=master&trendsTab=per_scenario&scenario=startup%3Apetclinic%3Atracing%3AGlobalTracer&trendsType=scenario
          # https://benchmarking.us1.prod.dog/trends?projectId=4&branch=master&trendsTab=per_scenario&scenario=startup%3Apetclinic%3Aappsec%3AGlobalTracer&trendsType=scenario
          # https://benchmarking.us1.prod.dog/trends?projectId=4&branch=master&trendsTab=per_scenario&scenario=startup%3Apetclinic%3Aiast%3AGlobalTracer&trendsType=scenario
          - name: "startup:petclinic:(tracing|appsec|iast):GlobalTracer"
            thresholds:
              - execution_time < 280 ms
          # https://benchmarking.us1.prod.dog/trends?projectId=4&branch=master&trendsTab=per_scenario&scenario=startup%3Apetclinic%3Aprofiling%3AGlobalTracer&trendsType=scenario
          - name: "startup:petclinic:profiling:GlobalTracer"
            thresholds:
              - execution_time < 420 ms
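
To make the semantics of these entries concrete, here is a minimal sketch of how such thresholds could be evaluated against measured values. It is an illustration under assumptions, not the actual check-slo-breaches implementation: `parse_threshold` and `find_breaches` are hypothetical names, and treating each `name` as a regular expression matched against the full scenario name is inferred only from the `(tracing|appsec|iast)` entry above.

```python
import operator
import re

# Hypothetical evaluator for illustration; the real job may behave differently.
THRESHOLD_RE = re.compile(
    r"^(?P<metric>\w+)\s*(?P<op>[<>])\s*(?P<limit>\d+(?:\.\d+)?)\s*(?P<unit>\S+)$"
)
OPS = {"<": operator.lt, ">": operator.gt}


def parse_threshold(expr):
    """Split an expression such as 'throughput > 1100.0 op/s' into its parts."""
    match = THRESHOLD_RE.match(expr.strip())
    if match is None:
        raise ValueError(f"unsupported threshold expression: {expr!r}")
    return (
        match.group("metric"),
        match.group("op"),
        float(match.group("limit")),
        match.group("unit"),
    )


def find_breaches(slos, scenario, measurements):
    """Return the threshold expressions that `measurements` violate.

    `slos` is a list of {"name": ..., "thresholds": [...]} mappings; each name
    is assumed to be a regex that must match the whole scenario name.
    `measurements` maps metric names to values in the threshold's unit.
    """
    breaches = []
    for slo in slos:
        if re.fullmatch(slo["name"], scenario) is None:
            continue
        for expr in slo["thresholds"]:
            metric, op, limit, _unit = parse_threshold(expr)
            if metric in measurements and not OPS[op](measurements[metric], limit):
                breaches.append(expr)
    return breaches
```

For example, with the startup entry above, a measured `execution_time` of 231.70 for `startup:petclinic:iast:GlobalTracer` yields no breaches, while a value of 300.0 would return `["execution_time < 280 ms"]`.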

Contributor Author


Done

Contributor Author


Another tweak
image

@ddyurchenko ddyurchenko left a comment

Contributor


We need to update some of the SLO thresholds before merging.

@bric3 bric3 enabled auto-merge (squash) July 22, 2025 10:08
@bric3 bric3 merged commit 4e4c286 into master Jul 22, 2025
503 checks passed
@bric3 bric3 deleted the bdu/r-gate branch July 22, 2025 10:33
@github-actions github-actions Bot added this to the 1.52.0 milestone Jul 22, 2025

Labels

comp: tooling (Build & Tooling), tag: no release notes (Changes to exclude from release notes)


3 participants