
mymeiyi · 2026-04-23T07:38:05Z

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

When an FE starts in metadata_failure_recovery mode on a host that differs from the one recorded in the recovered metadata, the CloudClusterChecker drops that FE. No FE is then left in BDBJE, so the FE cannot start normally:

2026-04-23 11:37:15,024 INFO (cloud cluster check|82) [Env.dropFrontendFromBDBJE():3515] remove frontend: name: fe_83d061f4_31b3_43ee_9764_5506795e0bfe, role: FOLLOWER, 183.70.1.1:9010, is alive: false
2026-04-23 11:37:15,026 INFO (cloud cluster check|82) [CloudSystemInfoService.updateFrontends():442] dropped cloud frontend=name: fe_83d061f4_31b3_43ee_9764_5506795e0bfe, role: FOLLOWER, 183.70.1.1:9010, is alive: false

2026-04-23 11:39:01,373 INFO (mysql-nio-pool-3|491) [BDBEnvironment.getReplicationGroupAdmin():237] addresses is empty
2026-04-23 11:39:01,374 WARN (mysql-nio-pool-3|491) [FrontendsProcNode.getFrontendsInfo():105] failed to get leader: Cannot invoke "com.sleepycat.je.rep.util.ReplicationGroupAdmin.getMasterNodeName()" because "replicationGroupAdmin" is null
2026-04-23 11:39:01,374 INFO (mysql-nio-pool-3|491) [FrontendsProcNode.getFrontendsInfo():124] bdbje fes [], env fes []
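
A quick way to check whether an FE log shows this symptom is to look for both log signatures quoted above. The following is a minimal sketch: the function name `has_dropped_fe_symptom` is hypothetical, the log file path is supplied by the caller, and the patterns are taken only from the log lines in this description.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of Doris: succeeds when the given FE log
# contains both signatures quoted in the problem summary.
has_dropped_fe_symptom() {
    local log_file="$1"
    # 1) A frontend was removed from BDBJE.
    grep -qF 'Env.dropFrontendFromBDBJE()' "$log_file" || return 1
    # 2) BDBJE later reported an empty address list.
    grep -q 'getReplicationGroupAdmin().*addresses is empty' "$log_file" || return 1
    return 0
}
```

For example, `has_dropped_fe_symptom /path/to/fe.log && echo "FE was dropped during recovery"`, where the path is whatever log location the deployment uses.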

This PR also modifies the regression framework to support starting an FE with restore_snapshot.

Release note

None

Check List (For Author)

Test
- Regression test
- Unit Test
- Manual test (add detailed scripts or steps below)
- No need to test or manual test. Explain why:
  - This is a refactor/code format and no logic has been changed.
  - Previous test can cover this change.
  - No code files have been changed.
  - Other reason
Behavior changed:
- No.
- Yes.
Does this need documentation?
- No.
- Yes.

Check List (For Reviewer who merge this PR)

Confirm the release note
Confirm test cases
Confirm document
Add branch pick label

Thearas · 2026-04-23T07:38:13Z

Copilot

Pull request overview

Fixes cloud FE startup in metadata_failure_recovery when restored metadata contains a stale FE host/IP, which previously caused CloudClusterChecker to drop the only FE and leave the BDBJE group empty. Also extends the docker-compose runtime and regression framework to better support restore/snapshot recovery workflows.

Changes:

- FE: In cloud recovery mode, locate the FE entry by nodeName and persist an updated host to match the current node before cloud cluster checking.
- Regression framework: Add SuiteCluster start/stop helpers for meta services and recyclers.
- Docker-compose runtime: Auto-detect and run a restore script, start FE with recovery flags, update default FDB version, and adjust fdb monitor config.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.


| File | Description |
| --- | --- |
| regression-test/framework/src/main/groovy/org/apache/doris/regression/suite/SuiteCluster.groovy | Adds start/stop helpers for meta services and recyclers in the regression cluster wrapper. |
| fe/fe-core/src/main/java/org/apache/doris/catalog/Env.java | Updates FE host in recovered metadata (cloud mode) to prevent `CloudClusterChecker` from dropping self during recovery. |
| docker/runtime/doris-compose/resource/init_fe.sh | Executes `restore_snapshot.sh` when present and starts FE with `--metadata_failure_recovery` and `--recovery_journal_id`. |
| docker/runtime/doris-compose/resource/fdb.conf | Adds a `backup_agent.1` section alongside the existing backup agent config. |
| docker/runtime/doris-compose/command.py | Bumps default FoundationDB image version for cloud clusters. |


Copilot · 2026-04-23T07:44:20Z

+    // if not specific fe indices, then start all meta services
+    void startMetaServices(int... indices) {
+        runMsCmd(START_WAIT_TIMEOUT + 5, "start  --wait-timeout ${START_WAIT_TIMEOUT}".toString(), indices)
+    }
+
+    // indices start from 1, not 0
+    // if not specific fe indices, then start all recyclers


The new start/stop helpers for meta services use names/comments that are inconsistent with the existing API in this class (e.g. existing restartMs(...) vs new startMetaServices/stopMetaServices). Also the comments mention “fe indices”/“be indices” but these methods take meta-service/recycler indices. Consider renaming to match the established *Ms naming (or aligning the existing method) and fix the copied comments to avoid confusion for suite authors.

Suggested change

-    // if not specific fe indices, then start all meta services
-    void startMetaServices(int... indices) {
-        runMsCmd(START_WAIT_TIMEOUT + 5, "start  --wait-timeout ${START_WAIT_TIMEOUT}".toString(), indices)
-    }
-
-    // indices start from 1, not 0
-    // if not specific fe indices, then start all recyclers
+    // if not specific meta-service indices, then start all meta services
+    void startMs(int... indices) {
+        runMsCmd(START_WAIT_TIMEOUT + 5, "start --wait-timeout ${START_WAIT_TIMEOUT}".toString(), indices)
+    }
+
+    // Backward-compatible alias for suite code that still uses the longer name.
+    void startMetaServices(int... indices) {
+        startMs(indices)
+    }
+
+    // indices start from 1, not 0
+    // if not specific recycler indices, then start all recyclers

Copilot · 2026-04-23T07:44:21Z

+    // if not specific be indices, then stop all meta services
+    void stopMetaServices(int... indices) {
+        runMsCmd(STOP_WAIT_TIMEOUT + 5, "stop --wait-timeout ${STOP_WAIT_TIMEOUT}".toString(), indices)
+        waitHbChanged()
+    }
+
+    // indices start from 1, not 0
+    // if not specific be indices, then stop all recyclers


stopMetaServices(...)/stopRecyclers(...) comments refer to “be indices”, but these methods operate on meta services / recyclers. Updating these copied comments will help avoid using the wrong indices when writing regression suites.

Suggested change

-    // if not specific be indices, then stop all meta services
-    void stopMetaServices(int... indices) {
-        runMsCmd(STOP_WAIT_TIMEOUT + 5, "stop --wait-timeout ${STOP_WAIT_TIMEOUT}".toString(), indices)
-        waitHbChanged()
-    }
-
-    // indices start from 1, not 0
-    // if not specific be indices, then stop all recyclers
+    // if not specific meta service indices, then stop all meta services
+    void stopMetaServices(int... indices) {
+        runMsCmd(STOP_WAIT_TIMEOUT + 5, "stop --wait-timeout ${STOP_WAIT_TIMEOUT}".toString(), indices)
+        waitHbChanged()
+    }
+
+    // indices start from 1, not 0
+    // if not specific recycler indices, then stop all recyclers

Copilot · 2026-04-23T07:44:21Z

+        bash "$RECOVERY_SCRIPT"
+        mv "$RECOVERY_SCRIPT" "${RECOVERY_SCRIPT}.bak"


The recovery-script path is executed and then immediately renamed to .bak without checking the script’s exit status. If restore_snapshot.sh fails, this will both hide the failure and prevent retry on the next container start, and then proceed to start FE with recovery flags against an un-restored meta dir. Capture the exit code and only rename the script / continue startup when the restore succeeds; otherwise log the failure and exit non-zero.

Suggested change

-        bash "$RECOVERY_SCRIPT"
-        mv "$RECOVERY_SCRIPT" "${RECOVERY_SCRIPT}.bak"
+        bash "$RECOVERY_SCRIPT"
+        RECOVERY_RES=$?
+        if [ $RECOVERY_RES -ne 0 ]; then
+            health_log "ERROR: Recovery script failed with exit code $RECOVERY_RES"
+            exit $RECOVERY_RES
+        fi
+        mv "$RECOVERY_SCRIPT" "${RECOVERY_SCRIPT}.bak"
+        MV_RES=$?
+        if [ $MV_RES -ne 0 ]; then
+            health_log "ERROR: Failed to rename recovery script to ${RECOVERY_SCRIPT}.bak"
+            exit $MV_RES
+        fi
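
The same guard could also be factored into a small function, which keeps the "rename only after success" rule in one place. This is a sketch under assumptions: `run_restore_once` is a hypothetical name and not something defined in init_fe.sh, and logging through `health_log` is left to the caller.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run the restore script at most once.
# The script is renamed to .bak only after it exits successfully, so a
# failed restore is retried on the next container start.
run_restore_once() {
    local script="$1"
    [ -f "$script" ] || return 0     # no restore script present: nothing to do
    bash "$script" || return $?      # propagate the restore script's exit code
    mv "$script" "${script}.bak"     # the function returns mv's exit status
}
```

A caller might use it as `run_restore_once "$RECOVERY_SCRIPT" || exit $?`, so that FE is not started with recovery flags against a meta dir that was never restored.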

some fix

f270635

Copilot AI review requested due to automatic review settings April 23, 2026 07:38

mymeiyi requested review from CalvinKirs, dataroaring and morningman as code owners April 23, 2026 07:38

mymeiyi changed the title ~~[fix](fe) fix host not match if fe starts in metadata_failure_recovery~~ [fix](fe) fix host not match if start fe in metadata_failure_recovery Apr 23, 2026

Copilot started reviewing on behalf of mymeiyi April 23, 2026 07:39

Copilot AI reviewed Apr 23, 2026


[fix](fe) fix host not match if start fe in metadata_failure_recovery #62748

mymeiyi wants to merge 1 commit into apache:master from mymeiyi:fix-br-0423

3 participants
