Operations Runbook

Use /dashboard as the first stop for triage. The incident queue is sorted by status and priority.

Local Login And Reset

The API uses SQLite by default at apps/api/data/oath-bringer.db. Docker and production can override this with DATABASE_URL=file:/data/oath-bringer.db. Local scripts, pnpm --filter @oath-bringer/api dev, and workspace ops scripts all resolve relative DB paths from apps/api.
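
To make the path rule concrete, here is a minimal sketch of how a file: URL could map to a file on disk. resolve_db_path is a hypothetical helper, not one of the project's scripts, and it assumes DATABASE_URL always uses the file: scheme.

```shell
# Hypothetical helper (not part of the repo): print the SQLite file the API
# would be expected to open, given the repo root as the first argument.
# Assumption: DATABASE_URL, when set, uses the file: scheme.
resolve_db_path() {
  repo_root=$1
  if [ -z "${DATABASE_URL:-}" ]; then
    # Default location described in this runbook.
    printf '%s\n' "$repo_root/apps/api/data/oath-bringer.db"
    return 0
  fi
  path=${DATABASE_URL#file:}
  case $path in
    /*) printf '%s\n' "$path" ;;                      # absolute path: use as-is
    *)  printf '%s\n' "$repo_root/apps/api/$path" ;;  # relative: resolve from apps/api
  esac
}
```

For example, DATABASE_URL=file:/data/oath-bringer.db resolves to /data/oath-bringer.db regardless of the repo root.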

Create or reset a local admin without exposing the password in logs:

pnpm ops:user:create -- --email=[email protected] --password='long-local-password' --name='Local Admin'
pnpm ops:user:reset -- --email=[email protected] --password='new-long-local-password'

Reset clears active sessions and passkeys for that user so password login is reliable again. If login still fails, confirm that the API and the script use the same DB path: compare the path in the script's JSON output with the [DB] Database initialized at ... line that the API prints at startup.
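
One way to pull the API's DB path out of a captured startup log is sketched below. db_path_from_log is a hypothetical helper, and it assumes the path is the last thing on the [DB] line, which this runbook does not confirm.

```shell
# Hypothetical helper: read API log text on stdin and print the path from the
# most recent "[DB] Database initialized at ..." line.
# Assumption: the path runs to the end of that line.
db_path_from_log() {
  sed -n 's/^.*\[DB] Database initialized at //p' | tail -n 1
}

# Usage (api.log is a placeholder for wherever the API output was captured):
#   api_path=$(db_path_from_log < api.log)
#   echo "API is using: $api_path"
```

Compare the printed value by eye with the path in the script's JSON output; the two must point at the same file.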

DNS

Signals:

  • The check reports dns as its root cause.
  • Summary mentions lookup failure or missing host records.

Actions:

  • Verify the hostname in data/operations-inventory.json.
  • Check Cloudflare DNS records and proxy state.
  • Confirm the service still owns the expected domain.
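
A quick command-line cross-check for missing host records can look like the sketch below. classify_dns_answer is a hypothetical helper that only inspects resolver output; it assumes dig is installed and does not consult Cloudflare or the inventory file.

```shell
# Hypothetical helper: read resolver output on stdin (for example from
# `dig +short`) and report whether any record came back.
# Lines starting with ";" are dig diagnostics, not records, so they are ignored.
classify_dns_answer() {
  if grep -v '^;' | grep -q '[^[:space:]]'; then
    echo "records-found"
  else
    echo "no-records"
  fi
}

# Usage, with the hostname taken from data/operations-inventory.json:
#   dig +short A "$hostname" | classify_dns_answer
```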

Cloudflare

Signals:

  • Zone or Pages metadata is missing or stale.
  • Public URL responds differently than origin.
  • Provider diagnostics show missing_credentials, degraded, or error for Cloudflare.

Actions:

  • Confirm that a Cloudflare token and account id are available for discovery.
  • Supported local keys in ~/.secrets/Cloudflare or the API environment are CLOUDFLARE_API_TOKEN or CF_API_TOKEN, plus CLOUDFLARE_ACCOUNT_ID or CF_ACCOUNT_ID.
  • Run pnpm ops:cloudflare:check for non-secret capability diagnostics.
  • Inspect zone DNS, SSL/TLS mode, WAF events, and Pages or Worker deployment status.
  • Bypass proxy temporarily only when you need to isolate origin behavior.
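
As a rough pre-check before running pnpm ops:cloudflare:check, the sketch below reports which credential aliases are present in the current shell environment without printing their values. cloudflare_credential_status is a hypothetical helper; it only looks at environment variables and does not read ~/.secrets/Cloudflare.

```shell
# Hypothetical helper: report presence of the Cloudflare aliases named above.
# Values are never echoed, only "present" or "missing".
cloudflare_credential_status() {
  token=missing
  account=missing
  if [ -n "${CLOUDFLARE_API_TOKEN:-}${CF_API_TOKEN:-}" ]; then
    token=present
  fi
  if [ -n "${CLOUDFLARE_ACCOUNT_ID:-}${CF_ACCOUNT_ID:-}" ]; then
    account=present
  fi
  printf 'token=%s account_id=%s\n' "$token" "$account"
}
```

Run it in the same environment as the API process; a shell with different variables will give a misleading answer.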

Production Deployment And Routing

Deploy from main after local validation. llama is the hardened production deploy target. Public Cloudflare tunnel connectors currently include llama and aequitas, so public deploys update both origins:

pnpm ops:deploy:prod
pnpm ops:deploy:public

ops:deploy:prod updates only llama, through rsync/systemd. ops:deploy:aequitas builds the API image, imports it into aequitas k3s, patches the API deployment and initContainer image, mounts the local k3s kubeconfig read-only for ops checks, and waits for rollout. ops:deploy:public runs both paths so the public tunnel connectors never serve different versions of the code.

Do not repair production by committing secrets or by relying on git credentials on a server. The deploy scripts preserve .env, .env.*, kubeconfig files, SQLite DB/WAL/SHM files, hostPath data, and Docker volumes.

Verify these routes after every deployment:

  • https://oath-bringer.com/account-recovery
  • https://oath-bringer.com/api/health
  • http://127.0.0.1:4000/health from llama
  • http://127.0.0.1:4000/api/health from llama
  • /dashboard/hosts after Codex MCP account login
  • MCP oath_hosts_list
  • MCP oath_system_overview
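
The HTTP routes in this list can be spot-checked with curl. The sketch below is one possible shape, not the project's own verification script: check_route and classify_http_status are hypothetical helpers, and treating any 2xx status as healthy is an assumption.

```shell
# Hypothetical helper: map an HTTP status code to a short verdict.
# Assumption: any 2xx means the route is healthy.
classify_http_status() {
  case $1 in
    2[0-9][0-9]) echo "ok" ;;
    000)         echo "unreachable" ;;  # curl reports 000 when no response arrived
    *)           echo "unexpected:$1" ;;
  esac
}

# Hypothetical helper: fetch one URL and print "<verdict> <url>".
check_route() {
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1" || true)
  printf '%s %s\n' "$(classify_http_status "$code")" "$1"
}

# Usage:
#   check_route https://oath-bringer.com/account-recovery
#   check_route https://oath-bringer.com/api/health
#   check_route http://127.0.0.1:4000/health       # run on llama
#   check_route http://127.0.0.1:4000/api/health   # run on llama
```

The /dashboard/hosts page and the two MCP tools need an authenticated session, so they still have to be checked by hand.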

The public /api/* path should resolve to one of the live APIs that pnpm ops:deploy:public updates. If account-level Cloudflare credentials become available, consolidate hostname routing to a single intended origin and update this runbook and deploy scripts in the same change. Until then, do not manually roll aequitas k3s for Oath Bringer; use pnpm ops:deploy:aequitas or pnpm ops:deploy:public.

Kubernetes Checks

The live ops cockpit reads Kubernetes through either a mounted kubeconfig or SSH:

  • Mounted kubeconfig: set OPS_KUBECONFIG=/etc/oath-bringer/kubeconfig and mount the real file from the host, for example with OPS_KUBECONFIG_HOST_PATH=/opt/oath-bringer/kubeconfig.
  • SSH kubectl: set OPS_KUBECTL_SSH to the host that can run kubectl.

If /api/ops/health reports Kubernetes unavailable, the error should name the missing file, invalid context, SSH target, or kubectl failure. Preserve kubeconfig files during deploys and never commit them.
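
Before digging into the API, it can help to confirm which access mode the environment actually selects. The sketch below is a hypothetical helper; checking OPS_KUBECONFIG before OPS_KUBECTL_SSH is an assumption, since this runbook does not state which one wins when both are set.

```shell
# Hypothetical helper: report how Kubernetes would be reached from this
# environment. Assumption: a configured kubeconfig takes precedence over SSH.
kube_access_mode() {
  if [ -n "${OPS_KUBECONFIG:-}" ]; then
    if [ -r "$OPS_KUBECONFIG" ]; then
      echo "kubeconfig:$OPS_KUBECONFIG"
    else
      echo "unavailable:missing file $OPS_KUBECONFIG"
    fi
  elif [ -n "${OPS_KUBECTL_SSH:-}" ]; then
    echo "ssh:$OPS_KUBECTL_SSH"
  else
    echo "unavailable:neither OPS_KUBECONFIG nor OPS_KUBECTL_SSH is set"
  fi
}
```

Run it inside the container or service environment, because the mounted path only exists there.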

Origin Or Network

Signals:

  • Timeout, abort, connection refused, or unreachable origin.
  • DNS resolves but fetch fails.

Actions:

  • SSH to the deploy target from the inventory.
  • Check reverse proxy, firewall, systemd, container, or Kubernetes status.
  • Confirm origin ports and health endpoint routing.

App

Signals:

  • HTTP 5xx responses.
  • Service URL responds but health endpoint fails.

Actions:

  • Inspect app logs first.
  • Check the latest deployment and config changes.
  • Verify dependencies such as databases, queues, auth, and storage.

Certificate

Signals:

  • TLS, SSL, or certificate error in the check summary.

Actions:

  • Check Cloudflare SSL/TLS mode.
  • Inspect origin certificate expiration and chain.
  • Renew origin or public certificates as needed.

GitLab CI/CD

Signals:

  • Service has no URL but has a GitLab project path.
  • Latest deployment status is failed or unknown.
  • Provider diagnostics show missing_credentials or degraded for GitLab.

Actions:

  • Open the service detail page and follow the GitLab pipelines link.
  • Inspect failed jobs and recent merge activity.
  • Re-run or roll back the deployment after the underlying failure is corrected.
  • Make sure GITLAB_TOKEN or GITLAB_PRIVATE_TOKEN is available to the API process for deployment enrichment.

Provider Diagnostics

Signals:

  • Provider diagnostics cards are missing, stale, or not connected.
  • Recent deployments are empty even though services have GitLab or Cloudflare metadata.

Actions:

  • Run pnpm ops:refresh locally and inspect provider output.
  • Open /dashboard/operations/resources to inspect discovered zones, DNS records, SSL/TLS settings, Pages projects, Workers, routes, linked services, and mapping gaps.
  • Confirm the API process can read ~/.secrets or has equivalent environment variables.
  • Query /api/operations/health; it includes non-secret providerDiagnostics and providerIssues with the exact missing aliases and secret-file paths.
  • Check OPERATIONS_REFRESH_ENABLED and OPERATIONS_REFRESH_INTERVAL_MS if scheduled updates are stale.
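
When judging whether scheduled updates are stale, a simple rule of thumb is to compare the age of the last refresh with the configured interval. The sketch below is illustrative only: refresh_state is a hypothetical helper, and the threshold of twice the interval is an assumption, not the project's definition of stale.

```shell
# Hypothetical helper: refresh_state <last_refresh_ms> <now_ms> <interval_ms>
# Assumption: a refresh older than two intervals counts as stale.
refresh_state() {
  last_ms=$1
  now_ms=$2
  interval_ms=$3
  age_ms=$(( now_ms - last_ms ))
  if [ "$age_ms" -gt $(( interval_ms * 2 )) ]; then
    echo "stale age_ms=$age_ms"
  else
    echo "fresh age_ms=$age_ms"
  fi
}

# Usage, with last_refresh_ms copied from wherever the last refresh time is shown:
#   refresh_state "$last_refresh_ms" "$(( $(date +%s) * 1000 ))" "$OPERATIONS_REFRESH_INTERVAL_MS"
```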

Investigation Bundles

Use the service detail page action to copy a redacted investigation bundle. The bundle includes service metadata, latest checks, provider resources, recent deployments, confidence, evidence chains, incident timeline, and next actions. Secret-like fields are redacted before export.
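
If extra material such as an env dump is attached alongside a bundle, it should be redacted too. The sketch below shows pattern-based redaction of KEY=value lines; it is not the exporter's actual redaction logic, redact_secrets is a hypothetical helper, and the list of secret-like name fragments is an assumption.

```shell
# Hypothetical helper: read KEY=value lines on stdin and mask the value when
# the key contains TOKEN, SECRET, PASSWORD, or KEY (uppercase only).
redact_secrets() {
  sed -E 's/^([A-Za-z0-9_]*(TOKEN|SECRET|PASSWORD|KEY)[A-Za-z0-9_]*)=.*/\1=[REDACTED]/'
}

# Usage:
#   env | redact_secrets
```

Review the output before sharing it; a name-based pattern cannot catch secrets stored under unrelated keys.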