[fix](fe) fix host not match if start fe in metadata_failure_recovery #62748
mymeiyi wants to merge 1 commit into apache:master
Pull request overview
Fixes cloud FE startup in metadata_failure_recovery when restored metadata contains a stale FE host/IP, which previously caused CloudClusterChecker to drop the only FE and leave the BDBJE group empty. Also extends the docker-compose runtime and regression framework to better support restore/snapshot recovery workflows.
Changes:
- FE: In cloud recovery mode, locate the FE entry by `nodeName` and persist an updated host to match the current node before cloud cluster checking.
- Regression framework: Add SuiteCluster start/stop helpers for meta services and recyclers.
- Docker-compose runtime: Auto-detect and run a restore script, start FE with recovery flags, update default FDB version, and adjust fdb monitor config.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| regression-test/framework/src/main/groovy/org/apache/doris/regression/suite/SuiteCluster.groovy | Adds start/stop helpers for meta services and recyclers in the regression cluster wrapper. |
| fe/fe-core/src/main/java/org/apache/doris/catalog/Env.java | Updates FE host in recovered metadata (cloud mode) to prevent CloudClusterChecker from dropping self during recovery. |
| docker/runtime/doris-compose/resource/init_fe.sh | Executes restore_snapshot.sh when present and starts FE with --metadata_failure_recovery and --recovery_journal_id. |
| docker/runtime/doris-compose/resource/fdb.conf | Adds a backup_agent.1 section alongside the existing backup agent config. |
| docker/runtime/doris-compose/command.py | Bumps default FoundationDB image version for cloud clusters. |
```groovy
// if not specific fe indices, then start all meta services
void startMetaServices(int... indices) {
    runMsCmd(START_WAIT_TIMEOUT + 5, "start --wait-timeout ${START_WAIT_TIMEOUT}".toString(), indices)
}

// indices start from 1, not 0
// if not specific fe indices, then start all recyclers
```
The new start/stop helpers for meta services use names/comments that are inconsistent with the existing API in this class (e.g. existing restartMs(...) vs new startMetaServices/stopMetaServices). Also the comments mention “fe indices”/“be indices” but these methods take meta-service/recycler indices. Consider renaming to match the established *Ms naming (or aligning the existing method) and fix the copied comments to avoid confusion for suite authors.
Suggested change:

```diff
-// if not specific fe indices, then start all meta services
-void startMetaServices(int... indices) {
-    runMsCmd(START_WAIT_TIMEOUT + 5, "start --wait-timeout ${START_WAIT_TIMEOUT}".toString(), indices)
-}
-// indices start from 1, not 0
-// if not specific fe indices, then start all recyclers
+// if not specific meta-service indices, then start all meta services
+void startMs(int... indices) {
+    runMsCmd(START_WAIT_TIMEOUT + 5, "start --wait-timeout ${START_WAIT_TIMEOUT}".toString(), indices)
+}
+// Backward-compatible alias for suite code that still uses the longer name.
+void startMetaServices(int... indices) {
+    startMs(indices)
+}
+// indices start from 1, not 0
+// if not specific recycler indices, then start all recyclers
```
```groovy
// if not specific be indices, then stop all meta services
void stopMetaServices(int... indices) {
    runMsCmd(STOP_WAIT_TIMEOUT + 5, "stop --wait-timeout ${STOP_WAIT_TIMEOUT}".toString(), indices)
    waitHbChanged()
}

// indices start from 1, not 0
// if not specific be indices, then stop all recyclers
```
stopMetaServices(...)/stopRecyclers(...) comments refer to “be indices”, but these methods operate on meta services / recyclers. Updating these copied comments will help avoid using the wrong indices when writing regression suites.
Suggested change:

```diff
-// if not specific be indices, then stop all meta services
-void stopMetaServices(int... indices) {
-    runMsCmd(STOP_WAIT_TIMEOUT + 5, "stop --wait-timeout ${STOP_WAIT_TIMEOUT}".toString(), indices)
-    waitHbChanged()
-}
-// indices start from 1, not 0
-// if not specific be indices, then stop all recyclers
+// if not specific meta service indices, then stop all meta services
+void stopMetaServices(int... indices) {
+    runMsCmd(STOP_WAIT_TIMEOUT + 5, "stop --wait-timeout ${STOP_WAIT_TIMEOUT}".toString(), indices)
+    waitHbChanged()
+}
+// indices start from 1, not 0
+// if not specific recycler indices, then stop all recyclers
```
```shell
bash "$RECOVERY_SCRIPT"
mv "$RECOVERY_SCRIPT" "${RECOVERY_SCRIPT}.bak"
```
The recovery-script path is executed and then immediately renamed to .bak without checking the script’s exit status. If restore_snapshot.sh fails, this will both hide the failure and prevent retry on the next container start, and then proceed to start FE with recovery flags against an un-restored meta dir. Capture the exit code and only rename the script / continue startup when the restore succeeds; otherwise log the failure and exit non-zero.
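One way to address this comment is sketched below. It is an illustration only, not the code in the PR: `run_restore_once` is a hypothetical helper name, and only `RECOVERY_SCRIPT` comes from the reviewed snippet. The helper captures the restore script's exit status, renames the script to `.bak` only on success, and returns the failing status otherwise so the caller can abort before starting FE.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of the PR): run a restore script once,
# and mark it as done only if it succeeded.
run_restore_once() {
    local script="$1"

    # No restore script present: nothing to do.
    [ -f "$script" ] || return 0

    # Capture the exit status without tripping `set -e` in the caller.
    local rc=0
    bash "$script" || rc=$?

    if [ "$rc" -ne 0 ]; then
        echo "restore script $script failed with exit code $rc" >&2
        return "$rc"
    fi

    # Rename only after success, so a failed restore is retried
    # on the next container start.
    mv "$script" "${script}.bak"
}
```

In `init_fe.sh` this could be called as `run_restore_once "$RECOVERY_SCRIPT" || exit 1` before FE is started with the recovery flags, so a failed restore stops startup instead of launching FE against an un-restored meta dir.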

What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
`CloudClusterChecker` drops the FE, which leaves no FE in BDBJE, so the FE cannot start normally.
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)