Operations Runbook¶
Use /dashboard as the first stop for triage. The incident queue is sorted by status and priority.
Local Login And Reset¶
The API uses SQLite by default at apps/api/data/oath-bringer.db. Docker and production can override this with DATABASE_URL=file:/data/oath-bringer.db; local scripts, pnpm --filter @oath-bringer/api dev, and workspace ops scripts resolve relative DB paths from apps/api.
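The path rules above can be sketched as a small pure function. This is only an illustration: `resolveDbPath` and `DEFAULT_DB` are hypothetical names, not the API's actual code, and the sketch assumes POSIX-style paths and a `file:` prefix as in the `DATABASE_URL` example.

```typescript
// Hypothetical helper, not part of the documented API.
// Default location relative to apps/api, as stated in the runbook.
const DEFAULT_DB = "data/oath-bringer.db";

function resolveDbPath(databaseUrl: string | undefined, apiDir: string): string {
  // Fall back to the default when DATABASE_URL is unset or empty.
  const raw = databaseUrl && databaseUrl.length > 0 ? databaseUrl : `file:${DEFAULT_DB}`;
  // Strip the file: scheme used in the runbook's example.
  const filePath = raw.startsWith("file:") ? raw.slice("file:".length) : raw;
  // Absolute paths (Docker, production) are used as-is.
  if (filePath.startsWith("/")) return filePath;
  // Relative paths are anchored at apps/api, regardless of the current directory.
  const base = apiDir.replace(/\/+$/, "");
  return `${base}/${filePath.replace(/^\.\//, "")}`;
}

console.log(resolveDbPath(undefined, "/repo/apps/api"));
console.log(resolveDbPath("file:/data/oath-bringer.db", "/repo/apps/api"));
```

Anchoring relative paths at `apps/api` rather than the working directory is what keeps the dev server and the ops scripts pointed at the same file.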
Create or reset a local admin without exposing the password in logs:
pnpm ops:user:create -- --email=[email protected] --password='long-local-password' --name='Local Admin'
pnpm ops:user:reset -- --email=[email protected] --password='new-long-local-password'
Reset clears active sessions and passkeys for that user so password login is reliable again. If login still fails, confirm the API and script are using the same DB path in the JSON output from the script and the [DB] Database initialized at ... API startup line.
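When comparing the two paths by hand is error-prone, a check along these lines may help. It is a sketch under assumptions: only the `[DB] Database initialized at` prefix comes from this runbook, the function names are hypothetical, and the caller is expected to read the script's DB path out of its JSON output (the field name is not documented here, so it is passed in as a plain string).

```typescript
// Hypothetical helpers; only the log-line prefix is taken from the runbook.
function dbPathFromStartupLine(line: string): string | null {
  // Capture everything after the documented startup prefix.
  const match = line.match(/\[DB\] Database initialized at (.+)$/);
  return match?.[1]?.trim() ?? null;
}

function sameDatabase(scriptDbPath: string, startupLine: string): boolean {
  const apiPath = dbPathFromStartupLine(startupLine);
  // A missing or unparseable startup line counts as a mismatch.
  return apiPath !== null && apiPath === scriptDbPath;
}

console.log(
  sameDatabase(
    "/repo/apps/api/data/oath-bringer.db",
    "[DB] Database initialized at /repo/apps/api/data/oath-bringer.db",
  ),
);
```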
DNS¶
Signals:
- Check root cause is dns.
- Summary mentions lookup failure or missing host records.
Actions:
- Verify the hostname in data/operations-inventory.json.
- Check Cloudflare DNS records and proxy state.
- Confirm the service still owns the expected domain.
Cloudflare¶
Signals:
- Zone or Pages metadata is missing or stale.
- Public URL responds differently than origin.
- Provider diagnostics show missing_credentials, degraded, or error for Cloudflare.
Actions:
- Confirm Cloudflare token and account id availability for discovery.
- Supported local keys in ~/.secrets/Cloudflare or the API environment are CLOUDFLARE_API_TOKEN or CF_API_TOKEN, plus CLOUDFLARE_ACCOUNT_ID or CF_ACCOUNT_ID.
- Run pnpm ops:cloudflare:check for non-secret capability diagnostics.
- Inspect zone DNS, SSL/TLS mode, WAF events, and Pages or Worker deployment status.
- Bypass proxy temporarily only when you need to isolate origin behavior.
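The alias rules for Cloudflare credentials can be illustrated with a check that reports only key names, never values. This is a sketch, not the behavior of `pnpm ops:cloudflare:check`: the function and type names are hypothetical, the `ok` status is an assumption, and only `missing_credentials` and the four environment variable names come from this runbook.

```typescript
type CloudflareEnv = Record<string, string | undefined>;

interface CredentialCheck {
  status: "ok" | "missing_credentials"; // "ok" is an assumed label
  found: string[];   // names of aliases that were set, never their values
  missing: string[]; // alias groups with no value set
}

// Each inner array is one requirement that any listed alias can satisfy.
const CLOUDFLARE_ALIASES: string[][] = [
  ["CLOUDFLARE_API_TOKEN", "CF_API_TOKEN"],
  ["CLOUDFLARE_ACCOUNT_ID", "CF_ACCOUNT_ID"],
];

function checkCloudflareCredentials(env: CloudflareEnv): CredentialCheck {
  const found: string[] = [];
  const missing: string[] = [];
  for (const group of CLOUDFLARE_ALIASES) {
    // First alias with a non-empty value wins.
    const hit = group.find((name) => (env[name] ?? "").trim() !== "");
    if (hit) found.push(hit);
    else missing.push(group.join(" or "));
  }
  return { status: missing.length === 0 ? "ok" : "missing_credentials", found, missing };
}

console.log(checkCloudflareCredentials({ CF_API_TOKEN: "example-value" }));
```

Returning alias names rather than values keeps the output safe to paste into an incident channel.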
Production Deployment And Routing¶
Deploy from main after local validation. llama is the hardened production deploy target. Public Cloudflare tunnel connectors currently include llama and aequitas, so public deploys update both origins:
- ops:deploy:prod updates llama only through rsync/systemd.
- ops:deploy:aequitas builds the API image, imports it into aequitas k3s, patches the API deployment and initContainer image, mounts the local k3s kubeconfig read-only for ops checks, and waits for rollout.
- ops:deploy:public runs both paths so public tunnel connectors do not split across different code.
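The risk of split code across tunnel connectors can be made concrete with a small model of which script updates which origin. The mapping mirrors the descriptions above; the helper name `staleConnectors` and the data structures are hypothetical and are not part of the deploy scripts.

```typescript
type Origin = "llama" | "aequitas";

// Which origins each deploy script updates, per the runbook.
const DEPLOY_TARGETS: Record<string, Origin[]> = {
  "ops:deploy:prod": ["llama"],
  "ops:deploy:aequitas": ["aequitas"],
  "ops:deploy:public": ["llama", "aequitas"],
};

// Origins currently serving public traffic through Cloudflare tunnel connectors.
const TUNNEL_CONNECTORS: Origin[] = ["llama", "aequitas"];

// Returns the connectors still running old code after the given scripts ran.
function staleConnectors(scriptsRun: string[]): Origin[] {
  const updated = new Set<Origin>();
  for (const script of scriptsRun) {
    for (const origin of DEPLOY_TARGETS[script] ?? []) updated.add(origin);
  }
  return TUNNEL_CONNECTORS.filter((origin) => !updated.has(origin));
}

console.log(staleConnectors(["ops:deploy:prod"]));
console.log(staleConnectors(["ops:deploy:public"]));
```

A non-empty result means public requests could land on two different builds, which is the situation `ops:deploy:public` exists to avoid.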
Do not repair production by committing secrets or by relying on git credentials on a server. The deploy scripts preserve .env, .env.*, kubeconfig files, SQLite DB/WAL/SHM files, hostPath data, and Docker volumes.
Verify these routes after every deployment:
- https://oath-bringer.com/account-recovery
- https://oath-bringer.com/api/health
- http://127.0.0.1:4000/health from llama
- http://127.0.0.1:4000/api/health from llama
- /dashboard/hosts after Codex MCP account login
- MCP oath_hosts_list
- MCP oath_system_overview
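The four plain HTTP routes in this checklist lend themselves to a scripted summary; the dashboard and MCP checks still need a logged-in session. The sketch below only evaluates status codes that were collected elsewhere (for example with curl, run from the right host). The names are hypothetical, and treating any 2xx as a pass is an assumption.

```typescript
interface RouteCheck {
  target: string;
  from: "public" | "llama"; // where the request must be issued from
}

const POST_DEPLOY_CHECKS: RouteCheck[] = [
  { target: "https://oath-bringer.com/account-recovery", from: "public" },
  { target: "https://oath-bringer.com/api/health", from: "public" },
  { target: "http://127.0.0.1:4000/health", from: "llama" },
  { target: "http://127.0.0.1:4000/api/health", from: "llama" },
];

// statuses maps each target URL to the HTTP status observed for it.
function failedRoutes(statuses: Record<string, number | undefined>): string[] {
  return POST_DEPLOY_CHECKS
    .filter((check) => {
      const status = statuses[check.target];
      // A route that was never probed counts as a failure.
      return status === undefined || status < 200 || status >= 300;
    })
    .map((check) => `${check.target} (from ${check.from})`);
}

console.log(failedRoutes({ "https://oath-bringer.com/api/health": 200 }));
```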
The public /api/* path should resolve to one of the live APIs that pnpm ops:deploy:public updates. If account-level Cloudflare credentials become available, consolidate hostname routing to a single intended origin and update this runbook and deploy scripts in the same change. Until then, do not manually roll aequitas k3s for Oath Bringer; use pnpm ops:deploy:aequitas or pnpm ops:deploy:public.
Kubernetes Checks¶
The live ops cockpit reads Kubernetes through either a mounted kubeconfig or SSH:
- Mounted kubeconfig: set OPS_KUBECONFIG=/etc/oath-bringer/kubeconfig and mount the real file from the host, for example with OPS_KUBECONFIG_HOST_PATH=/opt/oath-bringer/kubeconfig.
- SSH kubectl: set OPS_KUBECTL_SSH to the host that can run kubectl.
If /api/ops/health reports Kubernetes unavailable, the error should name the missing file, invalid context, SSH target, or kubectl failure. Preserve kubeconfig files during deploys and never commit them.
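One way to picture the two access modes is as a function that turns the environment into a command line. This is a sketch only: `kubectlCommand` is a hypothetical name, preferring the mounted kubeconfig when both variables are set is an assumption, and the real cockpit may build its commands differently. The `--kubeconfig` flag is standard kubectl.

```typescript
type KubeEnv = Record<string, string | undefined>;

// Builds the argv for a kubectl call using whichever access mode is configured.
function kubectlCommand(env: KubeEnv, args: string[]): string[] {
  const kubeconfig = (env["OPS_KUBECONFIG"] ?? "").trim();
  const sshHost = (env["OPS_KUBECTL_SSH"] ?? "").trim();
  // Assumption: a mounted kubeconfig takes precedence over SSH.
  if (kubeconfig !== "") return ["kubectl", "--kubeconfig", kubeconfig, ...args];
  if (sshHost !== "") return ["ssh", sshHost, "kubectl", ...args];
  // Name the missing configuration, in the spirit of the health endpoint's errors.
  throw new Error("Kubernetes unavailable: neither OPS_KUBECONFIG nor OPS_KUBECTL_SSH is set");
}

console.log(kubectlCommand({ OPS_KUBECONFIG: "/etc/oath-bringer/kubeconfig" }, ["get", "pods"]).join(" "));
console.log(kubectlCommand({ OPS_KUBECTL_SSH: "aequitas" }, ["get", "pods"]).join(" "));
```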
Origin Or Network¶
Signals:
- Timeout, abort, connection refused, or unreachable origin.
- DNS resolves but fetch fails.
Actions:
- SSH to the deploy target from the inventory.
- Check reverse proxy, firewall, systemd, container, or Kubernetes status.
- Confirm origin ports and health endpoint routing.
App¶
Signals:
- HTTP 5xx responses.
- Service URL responds but health endpoint fails.
Actions:
- Inspect app logs first.
- Check the latest deployment and config changes.
- Verify dependencies such as databases, queues, auth, and storage.
Certificate¶
Signals:
- TLS, SSL, or certificate error in the check summary.
Actions:
- Check Cloudflare SSL/TLS mode.
- Inspect origin certificate expiration and chain.
- Renew origin or public certificates as needed.
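The Signals lists in the DNS, Origin Or Network, App, and Certificate sections can be read as a rough triage table. The sketch below encodes them as pattern checks on a check result. The input shape, the function name, and the order in which patterns are tried are assumptions made for illustration; the dashboard's own root-cause logic may differ.

```typescript
type RunbookSection = "DNS" | "Origin Or Network" | "App" | "Certificate" | "Unclassified";

interface CheckResult {
  summary: string;
  rootCause?: string;  // for example "dns", as named in the DNS section
  httpStatus?: number;
}

function classifyCheck(check: CheckResult): RunbookSection {
  const text = check.summary.toLowerCase();
  // An explicit root cause wins over summary text.
  if (check.rootCause === "dns") return "DNS";
  if (/\b(tls|ssl|certificate)\b/.test(text)) return "Certificate";
  if (/lookup fail|missing host record/.test(text)) return "DNS";
  if (/timeout|timed out|abort|connection refused|unreachable/.test(text)) return "Origin Or Network";
  // HTTP 5xx means the origin answered but the app failed.
  if (check.httpStatus !== undefined && check.httpStatus >= 500 && check.httpStatus <= 599) return "App";
  return "Unclassified";
}

console.log(classifyCheck({ summary: "connection refused by origin" }));
console.log(classifyCheck({ summary: "health endpoint failed", httpStatus: 503 }));
```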
GitLab CI/CD¶
Signals:
- Service has no URL but has a GitLab project path.
- Latest deployment status is failed or unknown.
- Provider diagnostics show missing_credentials or degraded for GitLab.
Actions:
- Open the service detail page and follow the GitLab pipelines link.
- Inspect failed jobs and recent merge activity.
- Re-run or roll back the deployment after the underlying failure is corrected.
- Make sure GITLAB_TOKEN or GITLAB_PRIVATE_TOKEN is available to the API process for deployment enrichment.
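To check by hand whether a token can see a project's pipelines, a request can be built against the GitLab REST API, which accepts a URL-encoded project path as the project id and a `PRIVATE-TOKEN` header. The sketch below only constructs the request. The function name, the `https://gitlab.com` default, and the preference for `GITLAB_TOKEN` over `GITLAB_PRIVATE_TOKEN` are assumptions, not the API's documented enrichment logic.

```typescript
interface GitLabRequest {
  url: string;
  headers: Record<string, string>;
}

// Returns null when neither token alias is set, so callers can report missing credentials.
function latestPipelineRequest(
  projectPath: string,
  env: Record<string, string | undefined>,
  baseUrl = "https://gitlab.com", // assumed host; self-managed instances differ
): GitLabRequest | null {
  const token = env["GITLAB_TOKEN"] || env["GITLAB_PRIVATE_TOKEN"];
  if (!token) return null;
  // GitLab accepts the namespaced path as the project id when URL-encoded.
  const id = encodeURIComponent(projectPath);
  return {
    url: `${baseUrl}/api/v4/projects/${id}/pipelines?per_page=1`,
    headers: { "PRIVATE-TOKEN": token },
  };
}

console.log(latestPipelineRequest("group/project", { GITLAB_TOKEN: "example-value" })?.url);
```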
Provider Diagnostics¶
Signals:
- Provider diagnostics cards are missing, stale, or not connected.
- Recent deployments are empty even though services have GitLab or Cloudflare metadata.
Actions:
- Run pnpm ops:refresh locally and inspect provider output.
- Open /dashboard/operations/resources to inspect discovered zones, DNS records, SSL/TLS settings, Pages projects, Workers, routes, linked services, and mapping gaps.
- Confirm the API process can read ~/.secrets or has equivalent environment variables.
- /api/operations/health includes non-secret providerDiagnostics and providerIssues with exact missing aliases and secret-file paths.
- Check OPERATIONS_REFRESH_ENABLED and OPERATIONS_REFRESH_INTERVAL_MS if scheduled updates are stale.
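The last check can be sketched as a staleness test over the two refresh settings. Several details here are assumptions rather than documented behavior: that `OPERATIONS_REFRESH_ENABLED` is enabled by the string `true`, that data older than twice the interval counts as stale, and the function name itself.

```typescript
interface RefreshState {
  stale: boolean;
  reason: string;
}

function refreshState(
  env: Record<string, string | undefined>,
  lastRefreshMs: number | null, // epoch ms of the last successful refresh, if any
  nowMs: number,
): RefreshState {
  // Assumption: the flag is enabled by the literal string "true".
  if ((env["OPERATIONS_REFRESH_ENABLED"] ?? "").toLowerCase() !== "true") {
    return { stale: true, reason: "OPERATIONS_REFRESH_ENABLED is not true" };
  }
  const interval = Number(env["OPERATIONS_REFRESH_INTERVAL_MS"]);
  if (!Number.isFinite(interval) || interval <= 0) {
    return { stale: true, reason: "OPERATIONS_REFRESH_INTERVAL_MS is not a positive number" };
  }
  if (lastRefreshMs === null) return { stale: true, reason: "no refresh recorded" };
  const age = nowMs - lastRefreshMs;
  // Assumption: allow one missed cycle before calling the data stale.
  return age > 2 * interval
    ? { stale: true, reason: `last refresh ${age} ms ago exceeds 2 x ${interval} ms` }
    : { stale: false, reason: "within expected interval" };
}

console.log(refreshState({ OPERATIONS_REFRESH_ENABLED: "true", OPERATIONS_REFRESH_INTERVAL_MS: "60000" }, 0, 100000));
```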
Investigation Bundles¶
Use the service detail page action to copy a redacted investigation bundle. The bundle includes service metadata, latest checks, provider resources, recent deployments, confidence, evidence chains, incident timeline, and next actions. Secret-like fields are redacted before export.
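Redaction of secret-like fields could look roughly like the following. This is an illustrative sketch, not the bundle exporter's code: the key pattern, the `[REDACTED]` placeholder, and the function name are all assumptions.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Assumed pattern for keys that look like secrets.
const SECRET_KEY = /token|secret|password|api[_-]?key|authorization/i;

// Returns a deep copy with the value of every secret-like key replaced.
function redact(value: Json): Json {
  if (Array.isArray(value)) return value.map((item) => redact(item));
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: Json } = {};
    for (const [key, child] of Object.entries(value)) {
      // Replace the whole value, including nested objects, when the key matches.
      out[key] = SECRET_KEY.test(key) ? "[REDACTED]" : redact(child);
    }
    return out;
  }
  return value;
}

console.log(JSON.stringify(redact({ service: "api", env: { GITLAB_TOKEN: "example-value" } })));
```

Matching on key names rather than values is simple but can miss secrets stored under neutral keys, so a bundle is still worth a quick read before it is shared.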