[fix](fe) fix host not match if start fe in metadata_failure_recovery by mymeiyi · Pull Request #62748 · apache/doris · GitHub
Open
mymeiyi wants to merge 1 commit into apache:master from mymeiyi:fix-br-0423
mymeiyi (Contributor) commented Apr 23, 2026

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

  1. When an FE starts in metadata_failure_recovery mode on a host different from the one recorded in the metadata, CloudClusterChecker drops that FE. BDBJE is then left with no FE, and the FE cannot start normally:
2026-04-23 11:37:15,024 INFO (cloud cluster check|82) [Env.dropFrontendFromBDBJE():3515] remove frontend: name: fe_83d061f4_31b3_43ee_9764_5506795e0bfe, role: FOLLOWER, 183.70.1.1:9010, is alive: false
2026-04-23 11:37:15,026 INFO (cloud cluster check|82) [CloudSystemInfoService.updateFrontends():442] dropped cloud frontend=name: fe_83d061f4_31b3_43ee_9764_5506795e0bfe, role: FOLLOWER, 183.70.1.1:9010, is alive: false

2026-04-23 11:39:01,373 INFO (mysql-nio-pool-3|491) [BDBEnvironment.getReplicationGroupAdmin():237] addresses is empty
2026-04-23 11:39:01,374 WARN (mysql-nio-pool-3|491) [FrontendsProcNode.getFrontendsInfo():105] failed to get leader: Cannot invoke "com.sleepycat.je.rep.util.ReplicationGroupAdmin.getMasterNodeName()" because "replicationGroupAdmin" is null
2026-04-23 11:39:01,374 INFO (mysql-nio-pool-3|491) [FrontendsProcNode.getFrontendsInfo():124] bdbje fes [], env fes []
  2. Modify the regression framework to support starting an FE with restore_snapshot.

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

Copilot AI review requested due to automatic review settings April 23, 2026 07:38
Thearas (Contributor) commented Apr 23, 2026

mymeiyi changed the title from "[fix](fe) fix host not match if fe starts in metadata_failure_recovery" to "[fix](fe) fix host not match if start fe in metadata_failure_recovery" on Apr 23, 2026
Copilot AI left a comment

Pull request overview

Fixes cloud FE startup in metadata_failure_recovery when restored metadata contains a stale FE host/IP, which previously caused CloudClusterChecker to drop the only FE and leave the BDBJE group empty. Also extends the docker-compose runtime and regression framework to better support restore/snapshot recovery workflows.

Changes:

  • FE: In cloud recovery mode, locate the FE entry by nodeName and persist an updated host to match the current node before cloud cluster checking.
  • Regression framework: Add SuiteCluster start/stop helpers for meta services and recyclers.
  • Docker-compose runtime: Auto-detect and run a restore script, start FE with recovery flags, update default FDB version, and adjust fdb monitor config.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| regression-test/framework/src/main/groovy/org/apache/doris/regression/suite/SuiteCluster.groovy | Adds start/stop helpers for meta services and recyclers in the regression cluster wrapper. |
| fe/fe-core/src/main/java/org/apache/doris/catalog/Env.java | Updates FE host in recovered metadata (cloud mode) to prevent CloudClusterChecker from dropping self during recovery. |
| docker/runtime/doris-compose/resource/init_fe.sh | Executes restore_snapshot.sh when present and starts FE with --metadata_failure_recovery and --recovery_journal_id. |
| docker/runtime/doris-compose/resource/fdb.conf | Adds a backup_agent.1 section alongside the existing backup agent config. |
| docker/runtime/doris-compose/command.py | Bumps default FoundationDB image version for cloud clusters. |


Comment on lines +673 to +679
// if not specific fe indices, then start all meta services
void startMetaServices(int... indices) {
    runMsCmd(START_WAIT_TIMEOUT + 5, "start --wait-timeout ${START_WAIT_TIMEOUT}".toString(), indices)
}

// indices start from 1, not 0
// if not specific fe indices, then start all recyclers
Copilot AI Apr 23, 2026

The new start/stop helpers for meta services use names/comments that are inconsistent with the existing API in this class (e.g. existing restartMs(...) vs new startMetaServices/stopMetaServices). Also the comments mention “fe indices”/“be indices” but these methods take meta-service/recycler indices. Consider renaming to match the established *Ms naming (or aligning the existing method) and fix the copied comments to avoid confusion for suite authors.

Suggested change
- // if not specific fe indices, then start all meta services
- void startMetaServices(int... indices) {
-     runMsCmd(START_WAIT_TIMEOUT + 5, "start --wait-timeout ${START_WAIT_TIMEOUT}".toString(), indices)
- }
- // indices start from 1, not 0
- // if not specific fe indices, then start all recyclers
+ // if not specific meta-service indices, then start all meta services
+ void startMs(int... indices) {
+     runMsCmd(START_WAIT_TIMEOUT + 5, "start --wait-timeout ${START_WAIT_TIMEOUT}".toString(), indices)
+ }
+ // Backward-compatible alias for suite code that still uses the longer name.
+ void startMetaServices(int... indices) {
+     startMs(indices)
+ }
+ // indices start from 1, not 0
+ // if not specific recycler indices, then start all recyclers

Comment on lines +699 to +706
// if not specific be indices, then stop all meta services
void stopMetaServices(int... indices) {
    runMsCmd(STOP_WAIT_TIMEOUT + 5, "stop --wait-timeout ${STOP_WAIT_TIMEOUT}".toString(), indices)
    waitHbChanged()
}

// indices start from 1, not 0
// if not specific be indices, then stop all recyclers
Copilot AI Apr 23, 2026

stopMetaServices(...)/stopRecyclers(...) comments refer to “be indices”, but these methods operate on meta services / recyclers. Updating these copied comments will help avoid using the wrong indices when writing regression suites.

Suggested change
- // if not specific be indices, then stop all meta services
- void stopMetaServices(int... indices) {
-     runMsCmd(STOP_WAIT_TIMEOUT + 5, "stop --wait-timeout ${STOP_WAIT_TIMEOUT}".toString(), indices)
-     waitHbChanged()
- }
- // indices start from 1, not 0
- // if not specific be indices, then stop all recyclers
+ // if not specific meta service indices, then stop all meta services
+ void stopMetaServices(int... indices) {
+     runMsCmd(STOP_WAIT_TIMEOUT + 5, "stop --wait-timeout ${STOP_WAIT_TIMEOUT}".toString(), indices)
+     waitHbChanged()
+ }
+ // indices start from 1, not 0
+ // if not specific recycler indices, then stop all recyclers

Comment on lines +178 to +179
bash "$RECOVERY_SCRIPT"
mv "$RECOVERY_SCRIPT" "${RECOVERY_SCRIPT}.bak"
Copilot AI Apr 23, 2026

The recovery-script path is executed and then immediately renamed to .bak without checking the script’s exit status. If restore_snapshot.sh fails, this will both hide the failure and prevent retry on the next container start, and then proceed to start FE with recovery flags against an un-restored meta dir. Capture the exit code and only rename the script / continue startup when the restore succeeds; otherwise log the failure and exit non-zero.
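
The handling described in this comment could look roughly like the sketch below. It is an illustration under assumptions, not the PR's code: the function name run_recovery_script is hypothetical, and only the RECOVERY_SCRIPT variable, the bash invocation, and the .bak rename come from the reviewed lines.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of init_fe.sh): run the restore script and
# rename it to .bak only when it exits successfully.
run_recovery_script() {
    local script="$1"
    local rc=0

    # No restore script present: nothing to do, not an error.
    if [ ! -f "$script" ]; then
        return 0
    fi

    # Capture the exit code instead of ignoring it.
    bash "$script" || rc=$?

    if [ "$rc" -ne 0 ]; then
        echo "restore script $script failed with exit code $rc" >&2
        # Leave the script in place so the next container start can retry.
        return "$rc"
    fi

    # Restore succeeded: mark the script as done.
    mv "$script" "${script}.bak"
}
```

A caller would then stop before starting FE with the recovery flags when the restore fails, for example with `run_recovery_script "$RECOVERY_SCRIPT" || exit 1`. Whether a hard exit is the right policy for the compose runtime is a design choice left to the PR author.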

3 participants