Problem
The agent has no observability or recovery module. Four gaps:
- No deployment health. Nothing tracks whether a deployment's containers
are running. Agent health is independent of deployment state, so a dead
deployment produces no signal.
- No recovery. On startup the agent starts only the file watcher and API
server with no deployment reconcile (cmd/agent/main.go:21-83); on shutdown
it stops only the API server. Recovery after a host or Docker daemon restart
depends entirely on each container's restart: policy
(agent/templates/*/docker-compose.yml). When that policy does not fire (a
compose file without it, a missing external network, an image that will not
pull at boot), deployments stay down with no auto-heal.
- No notifications. No mechanism exists to alert an operator to anything.
Email is used only for registry credentials and certbot.
- Resource visibility is thin. Host CPU and memory are read point-in-time
(internal/system/stats.go); there is no per-container monitoring and no
spike detection.
Constraint: recovery must not restart deployments the user intentionally
stopped. Reconcile against tracked desired state, not a blind up of every
directory.
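
A minimal sketch of that constraint in Go, assuming the agent persists a per-deployment desired state somewhere. The DesiredState type, its values, and PlanReconcile are hypothetical names for illustration; nothing like them exists in the agent today. The planner only ever returns deployments that are tracked as desired-running and are not currently running, so an intentionally stopped or untracked directory is never touched.

```go
package main

import (
	"fmt"
	"sort"
)

// DesiredState is a hypothetical record of what the user last asked for.
type DesiredState string

const (
	DesiredRunning DesiredState = "running"
	DesiredStopped DesiredState = "stopped"
)

// PlanReconcile returns the deployments that should be started: those
// tracked as desired-running that are not currently running. Deployments
// the user stopped, and directories with no tracked state, are ignored.
func PlanReconcile(desired map[string]DesiredState, running map[string]bool) []string {
	var toStart []string
	for name, want := range desired {
		if want == DesiredRunning && !running[name] {
			toStart = append(toStart, name)
		}
	}
	sort.Strings(toStart) // deterministic order for logs and tests
	return toStart
}

func main() {
	desired := map[string]DesiredState{
		"blog": DesiredRunning,
		"shop": DesiredRunning,
		"demo": DesiredStopped, // stopped on purpose, must stay down
	}
	running := map[string]bool{"shop": true}

	fmt.Println(PlanReconcile(desired, running)) // prints [blog]
}
```

How the running set is obtained (for example, from the Docker daemon) is left out on purpose; the sketch only shows the decision rule.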
Why it matters
A reboot or a resource spike becomes a silent outage. The first signal is a
user reporting the site is down.
Proposal
One observability and recovery module that reuses the existing AI action engine.
- Health monitoring. Poll per-deployment container state (running / exited
/ restarting, health, exit codes) and expose it via the API.
- Resource monitoring. Extend internal/system/stats.go with per-container
CPU and memory and configurable spike thresholds.
- Notifications. A pluggable channel subsystem, email first, interface open
for webhook and others. Notify on ill-health, resource spikes, and
recovery-from-shutdown events.
- AI-native recovery. At boot and on detected ill-health, reconcile
deployments to desired state. Drive remediation through the existing
service_action suggestion engine (internal/ai/suggestions.go:
restart / rebuild / pull) so the recovery is diagnosis-led, not blind.
Emit a recovery event when deployments are brought back after a restart.
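
One way the pluggable channel subsystem and the spike threshold check could fit together, sketched in Go. Every name here (Event, Channel, Dispatcher, IsSpike, memoryChannel) is an assumption made for illustration and not an existing agent type; a real email or webhook channel would implement the same Channel interface.

```go
package main

import "fmt"

// EventKind covers the three notification triggers named in the proposal.
type EventKind string

const (
	EventUnhealthy EventKind = "unhealthy"
	EventSpike     EventKind = "resource_spike"
	EventRecovered EventKind = "recovered"
)

// Event is a hypothetical notification payload.
type Event struct {
	Kind       EventKind
	Deployment string
	Detail     string
}

// Channel is the pluggable interface: email first, webhook and others later.
type Channel interface {
	Name() string
	Send(e Event) error
}

// Dispatcher fans an event out to every registered channel.
type Dispatcher struct {
	channels []Channel
}

func (d *Dispatcher) Register(c Channel) { d.channels = append(d.channels, c) }

// Notify tries every channel and collects failures, so one broken
// channel does not block delivery on the others.
func (d *Dispatcher) Notify(e Event) []error {
	var errs []error
	for _, c := range d.channels {
		if err := c.Send(e); err != nil {
			errs = append(errs, fmt.Errorf("%s: %w", c.Name(), err))
		}
	}
	return errs
}

// IsSpike reports whether usage has reached the configured threshold.
// Both values are percentages.
func IsSpike(usagePercent, thresholdPercent float64) bool {
	return usagePercent >= thresholdPercent
}

// memoryChannel is an in-memory stand-in used only for this example.
type memoryChannel struct{ sent []Event }

func (m *memoryChannel) Name() string       { return "memory" }
func (m *memoryChannel) Send(e Event) error { m.sent = append(m.sent, e); return nil }

func main() {
	mem := &memoryChannel{}
	d := &Dispatcher{}
	d.Register(mem)

	if IsSpike(93.5, 90) {
		d.Notify(Event{Kind: EventSpike, Deployment: "shop", Detail: "memory at 93.5%"})
	}
	fmt.Println(len(mem.sent)) // prints 1
}
```

A single-sample threshold like this is the simplest possible rule; a real detector would likely need to require several consecutive samples above the threshold to avoid alerting on brief bursts.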
Action
Acceptance criteria
- After a host reboot, deployments self-recover without manual intervention and a recovery notification is sent.
- Operators are notified on deployment ill-health and on resource spikes.