Operations Runbook¶

Use /dashboard as the first stop for triage. The incident queue is sorted by status and priority.

For dashboard architecture, inventory fields, service onboarding, discovery, refresh, and verification commands, see Operations Dashboard.

Add Or Update A Service¶

Use data/operations-inventory.json as the editable source of truth for ownership, priority, environment, URLs, health endpoints, deploy targets, dependencies, and tags. Prefer automatic GitLab and Cloudflare discovery for repo, pipeline, zone, Pages, Worker, DNS, and deployment metadata.

After editing inventory, run:

pnpm ops:discover
pnpm ops:refresh
pnpm ops:verify

If a service can be discovered automatically, add only the fields discovery cannot infer, such as owner, priority, dependencies, and deployTarget.
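
As a rough illustration of that split, the sketch below models one inventory entry and keeps only the fields that have to be maintained by hand. The actual schema of data/operations-inventory.json is not shown in this runbook, so the `ServiceEntry` type, the `id`, `repo`, and `zone` field names, and the sample values are assumptions made for the example.

```typescript
// Hypothetical shape of one inventory entry; the real schema may differ.
interface ServiceEntry {
  id: string;
  owner?: string;
  priority?: string;
  dependencies?: string[];
  deployTarget?: string;
  // Examples of metadata that discovery is expected to fill in.
  repo?: string;
  zone?: string;
}

// Returns a copy that keeps only the fields discovery cannot infer.
function manualFieldsOnly(entry: ServiceEntry): ServiceEntry {
  return {
    id: entry.id,
    owner: entry.owner,
    priority: entry.priority,
    dependencies: entry.dependencies,
    deployTarget: entry.deployTarget,
  };
}
```

Serializing the result with `JSON.stringify` drops unset fields, so the output contains only what was entered by hand.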

Local Login And Reset¶

The API uses SQLite by default at apps/api/data/oath-bringer.db. Docker and production can override this with DATABASE_URL=file:/data/oath-bringer.db. Local scripts, pnpm --filter @oath-bringer/api dev, and workspace ops scripts all resolve relative DB paths from apps/api.
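
The path rules above can be sketched as a small pure function. This is not the API's actual resolver: the function name, the `apiDir` parameter, and the POSIX-only path handling are assumptions for illustration.

```typescript
// Default SQLite location relative to apps/api, as stated in the runbook.
const DEFAULT_DB_PATH = "data/oath-bringer.db";

// Resolves the SQLite file path from an optional DATABASE_URL value.
// `apiDir` is the absolute path of apps/api; relative paths resolve against it.
function resolveDbPath(databaseUrl: string | undefined, apiDir: string): string {
  const raw = databaseUrl && databaseUrl.length > 0 ? databaseUrl : `file:${DEFAULT_DB_PATH}`;
  // Strip the "file:" scheme used in DATABASE_URL values.
  const path = raw.startsWith("file:") ? raw.slice("file:".length) : raw;
  if (path.startsWith("/")) {
    return path; // absolute path, used as-is
  }
  // Relative path: drop a leading "./" and join it onto apps/api.
  const relative = path.startsWith("./") ? path.slice(2) : path;
  return `${apiDir.replace(/\/+$/, "")}/${relative}`;
}
```

Under these assumptions, an unset DATABASE_URL resolves inside apps/api, while file:/data/oath-bringer.db stays at /data/oath-bringer.db regardless of the working directory.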

Create or reset a local admin without exposing the password in logs:

pnpm ops:user:create -- --email='<admin-email>' --password='long-local-password' --name='Local Admin'
pnpm ops:user:reset -- --email='<admin-email>' --password='new-long-local-password'

Reset clears active sessions and passkeys for that user so password login is reliable again. If login still fails, compare the DB path in the script's JSON output with the [DB] Database initialized at ... line that the API prints at startup. Both must point to the same file.

DNS¶

Signals:

The check reports dns as the root cause.
Summary mentions lookup failure or missing host records.

Actions:

Verify the hostname in data/operations-inventory.json.
Check Cloudflare DNS records and proxy state.
Confirm the service still owns the expected domain.

Cloudflare¶

Signals:

Zone or Pages metadata is missing or stale.
Public URL responds differently than origin.
Provider diagnostics show missing_credentials, degraded, or error for Cloudflare.

Actions:

Confirm Cloudflare token and account id availability for discovery.
Supported local keys in .env, .env.local, ~/.secrets/OathBringer, ~/.secrets/OathBringer.env, ~/.secrets/Cloudflare, or the API environment are CLOUDFLARE_API_TOKEN or CF_API_TOKEN, plus CLOUDFLARE_ACCOUNT_ID or CF_ACCOUNT_ID.
Existing Cloudflare management-token compatibility aliases are also recognized: CLOUDFLARE_TOKEN and CF_TOKEN. R2/S3 keys such as R2_ACCOUNT_ID, R2_ACCESS_KEY_ID, R2_SECRET_ACCESS_KEY, and generic TOKEN_VALUE entries are intentionally ignored for Cloudflare zone discovery.
Run pnpm ops:cloudflare:check for non-secret capability diagnostics.
Inspect zone DNS, SSL/TLS mode, WAF events, and Pages or Worker deployment status.
Bypass proxy temporarily only when you need to isolate origin behavior.
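
The alias rules in the credential step can be sketched as a lookup that reports which variable names would be used, without printing any secret value. The alias names come from the list above; the precedence order (primary names before compatibility aliases) and the function names are assumptions, not the discovery code itself.

```typescript
type Env = Record<string, string | undefined>;

// Alias lists from the runbook; the precedence order is an assumption.
const CLOUDFLARE_TOKEN_KEYS = ["CLOUDFLARE_API_TOKEN", "CF_API_TOKEN", "CLOUDFLARE_TOKEN", "CF_TOKEN"];
const CLOUDFLARE_ACCOUNT_KEYS = ["CLOUDFLARE_ACCOUNT_ID", "CF_ACCOUNT_ID"];

// Returns the first key in `keys` that has a non-empty value in `env`.
function firstSetKey(env: Env, keys: string[]): string | undefined {
  return keys.find((key) => (env[key] ?? "").trim() !== "");
}

// Reports which aliases would supply credentials. R2/S3 keys and generic
// TOKEN_VALUE entries are never consulted, so they cannot satisfy the lookup.
function cloudflareCredentialSources(env: Env): {
  tokenKey: string | undefined;
  accountKey: string | undefined;
  missing: string[];
} {
  const tokenKey = firstSetKey(env, CLOUDFLARE_TOKEN_KEYS);
  const accountKey = firstSetKey(env, CLOUDFLARE_ACCOUNT_KEYS);
  const missing: string[] = [];
  if (!tokenKey) missing.push(CLOUDFLARE_TOKEN_KEYS.join(" | "));
  if (!accountKey) missing.push(CLOUDFLARE_ACCOUNT_KEYS.join(" | "));
  return { tokenKey, accountKey, missing };
}
```

An environment that only defines R2 keys yields two entries in `missing`, which matches the missing_credentials signal described for this section.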

Production Deployment And Routing¶

Deploy from main after local validation. llama is the hardened production deploy target. Public Cloudflare tunnel connectors currently include llama and aequitas, so public deploys update both origins:

pnpm ops:deploy:prod
pnpm ops:deploy:public

ops:deploy:prod updates llama only through rsync/systemd. ops:deploy:aequitas builds the API image, imports it into aequitas k3s, patches the API deployment and initContainer image, mounts the local k3s kubeconfig read-only for ops checks, wires provider credentials from existing remote runtime secret files into Kubernetes Secret refs, and waits for rollout. ops:deploy:public runs both paths so public tunnel connectors do not split across different code.

Current intentional public model: oath-bringer.com may be served by either llama or aequitas through Cloudflare Tunnel. Both hosts run active cloudflared connectors, and aequitas also runs two cloudflared pods in the cloudflare-tunnel k3s namespace. Both origins are treated as active public API origins, and pnpm ops:deploy:public is the required deployment path while that remains true. If Cloudflare routing is consolidated later, document the selected primary origin and failover model in this section before changing deploy behavior.
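
To make the reasoning behind the required deploy path concrete, the sketch below records which origins each script updates and reports any active public origin that would be left on old code. The script-to-origin mapping is taken from this section; the type and function names are illustrative assumptions.

```typescript
type Origin = "llama" | "aequitas";

// Origins each deploy script updates, as described in this section.
const ORIGINS_UPDATED: Record<string, Origin[]> = {
  "ops:deploy:prod": ["llama"],
  "ops:deploy:aequitas": ["aequitas"],
  "ops:deploy:public": ["llama", "aequitas"],
};

// Origins currently serving oath-bringer.com through Cloudflare Tunnel.
const ACTIVE_PUBLIC_ORIGINS: Origin[] = ["llama", "aequitas"];

// Returns the active public origins that a script would leave on old code.
function staleOriginsAfter(script: string): Origin[] {
  const updated = ORIGINS_UPDATED[script] ?? [];
  return ACTIVE_PUBLIC_ORIGINS.filter((origin) => !updated.includes(origin));
}
```

Only ops:deploy:public returns an empty list here, which is why it is the required path while both origins stay active.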

Do not repair production by committing secrets or by relying on git credentials on a server. The deploy scripts preserve .env, .env.*, kubeconfig files, SQLite DB/WAL/SHM files, hostPath data, and Docker volumes.

Verify these routes after every deployment:

https://oath-bringer.com/account-recovery
https://oath-bringer.com/api/health
http://127.0.0.1:4000/health from llama
http://127.0.0.1:4000/api/health from llama
/dashboard/hosts after Codex MCP account login
MCP oath_hosts_list
MCP oath_system_overview
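
The HTTP routes in this list lend themselves to a scripted pass. The sketch below only evaluates collected results, so it stays independent of how the requests are made. The rule that any 2xx status counts as healthy is an assumption, as are the `RouteResult` type and the function name. The dashboard and MCP checks are not covered.

```typescript
interface RouteResult {
  url: string;
  status: number | null; // null means the request failed before any HTTP response
}

// Routes from the list above that can be checked with a plain HTTP request.
const PUBLIC_ROUTES: string[] = [
  "https://oath-bringer.com/account-recovery",
  "https://oath-bringer.com/api/health",
];

// These only resolve when the check runs on llama itself.
const LLAMA_LOCAL_ROUTES: string[] = [
  "http://127.0.0.1:4000/health",
  "http://127.0.0.1:4000/api/health",
];

// Assumption: any 2xx status is healthy; everything else is a failure.
function failedRoutes(results: RouteResult[]): RouteResult[] {
  return results.filter((r) => r.status === null || r.status < 200 || r.status >= 300);
}
```

A caller would request every URL in `PUBLIC_ROUTES` (and `LLAMA_LOCAL_ROUTES` when running on llama), record each status, and treat a non-empty `failedRoutes` result as a failed verification.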

The public /api/* path should resolve to one of the live APIs that pnpm ops:deploy:public updates. If account-level Cloudflare credentials become available, consolidate hostname routing to a single intended origin and update this runbook and deploy scripts in the same change. Until then, do not manually roll aequitas k3s for Oath Bringer; use pnpm ops:deploy:aequitas or pnpm ops:deploy:public.

Kubernetes Checks¶

The live ops cockpit reads Kubernetes through either a mounted kubeconfig or SSH:

Mounted kubeconfig: set OPS_KUBECONFIG=/etc/oath-bringer/kubeconfig and mount the real file from the host, for example with OPS_KUBECONFIG_HOST_PATH=/opt/oath-bringer/kubeconfig.
SSH kubectl: set OPS_KUBECTL_SSH to the host that can run kubectl.

If /api/ops/health reports Kubernetes unavailable, the error should name the missing file, invalid context, SSH target, or kubectl failure. Preserve kubeconfig files during deploys and never commit them.
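
The two access modes can be modeled as a tagged union chosen from the environment. This is a sketch rather than the cockpit's real logic: it assumes a mounted kubeconfig takes precedence when both settings are present, and it does not check that the file or host actually exists.

```typescript
type KubeAccess =
  | { mode: "kubeconfig"; path: string }
  | { mode: "ssh"; host: string }
  | { mode: "unavailable"; reason: string };

// Picks the Kubernetes access mode from the two settings in this section.
// Assumption: OPS_KUBECONFIG wins when both settings are present.
function kubeAccessFromEnv(env: Record<string, string | undefined>): KubeAccess {
  const kubeconfig = (env["OPS_KUBECONFIG"] ?? "").trim();
  const sshHost = (env["OPS_KUBECTL_SSH"] ?? "").trim();
  if (kubeconfig !== "") return { mode: "kubeconfig", path: kubeconfig };
  if (sshHost !== "") return { mode: "ssh", host: sshHost };
  return { mode: "unavailable", reason: "neither OPS_KUBECONFIG nor OPS_KUBECTL_SSH is set" };
}
```

Returning a reason in the unavailable case mirrors the expectation that the health endpoint names what is missing.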

Origin Or Network¶

Signals:

Timeout, abort, connection refused, or unreachable origin.
DNS resolves but fetch fails.

Actions:

SSH to the deploy target from the inventory.
Check reverse proxy, firewall, systemd, container, or Kubernetes status.
Confirm origin ports and health endpoint routing.

App¶

Signals:

HTTP 5xx responses.
Service URL responds but health endpoint fails.

Actions:

Inspect app logs first.
Check the latest deployment and config changes.
Verify dependencies such as databases, queues, auth, and storage.

Certificate¶

Signals:

TLS, SSL, or certificate error in the check summary.

Actions:

Check Cloudflare SSL/TLS mode.
Inspect origin certificate expiration and chain.
Renew origin or public certificates as needed.
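
The Signals lists in the DNS, Origin Or Network, App, and Certificate sections can be read as a keyword triage over a check summary. The sketch below is one possible reading, not the dashboard's classifier. Only the dns label appears in this runbook, so the other labels, the keyword patterns, and the rule order are assumptions.

```typescript
type RootCause = "dns" | "origin" | "app" | "certificate" | "unknown";

// Keyword rules loosely derived from the Signals lists. The first match wins,
// so more specific signals are listed before broader ones.
const RULES: { cause: RootCause; pattern: RegExp }[] = [
  { cause: "certificate", pattern: /\b(tls|ssl|certificate)\b/i },
  { cause: "dns", pattern: /lookup fail|missing host record/i },
  { cause: "origin", pattern: /timeout|timed out|abort|connection refused|unreachable|fetch fail/i },
  { cause: "app", pattern: /\b5\d\d\b|health endpoint/i },
];

// Returns the first matching root cause, or "unknown" when nothing matches.
function guessRootCause(summary: string): RootCause {
  const rule = RULES.find((r) => r.pattern.test(summary));
  return rule ? rule.cause : "unknown";
}
```

A summary such as "DNS resolves but fetch fails" lands in origin rather than dns, in line with the Origin Or Network signals.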

GitLab CI/CD¶

Signals:

Service has no URL but has a GitLab project path.
Latest deployment status is failed or unknown.
Provider diagnostics show missing_credentials or degraded for GitLab.

Actions:

Open the service detail page and follow the GitLab pipelines link.
Inspect failed jobs and recent merge activity.
Re-run or roll back the deployment after the underlying failure is corrected.
Make sure GITLAB_TOKEN or GITLAB_PRIVATE_TOKEN is available to the API process for deployment enrichment. The GitLab base URL defaults to https://gitlab.lloydtheandroid.com and can be overridden with GITLAB_BASE_URL or GITLAB_URL. GITLAB_MFOX_TOKEN is recognized only as a compatibility alias.
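
The GitLab variable rules in the last step can be sketched as follows. The variable names and the default URL come from this runbook; the precedence order and the `GitLabConfig` shape are assumptions for illustration.

```typescript
interface GitLabConfig {
  baseUrl: string;
  tokenKey: string | undefined; // name of the variable that supplied the token
}

const DEFAULT_GITLAB_URL = "https://gitlab.lloydtheandroid.com";

// Assumed precedence: primary token names first, compatibility alias last.
function resolveGitLabConfig(env: Record<string, string | undefined>): GitLabConfig {
  const isSet = (key: string): boolean => (env[key] ?? "").trim() !== "";
  const tokenKey = ["GITLAB_TOKEN", "GITLAB_PRIVATE_TOKEN", "GITLAB_MFOX_TOKEN"].find(isSet);
  const urlKey = ["GITLAB_BASE_URL", "GITLAB_URL"].find(isSet);
  const baseUrl = urlKey ? (env[urlKey] ?? DEFAULT_GITLAB_URL).trim() : DEFAULT_GITLAB_URL;
  return { baseUrl, tokenKey };
}
```

Reporting the variable name instead of the token value keeps the output safe to paste into an incident.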

Provider Diagnostics¶

Signals:

Provider diagnostics cards are missing, stale, or not connected.
Recent deployments are empty even though services have GitLab or Cloudflare metadata.

Actions:

Run pnpm ops:refresh locally and inspect provider output.
Open /dashboard/operations/resources to inspect discovered zones, DNS records, SSL/TLS settings, Pages projects, Workers, routes, linked services, and mapping gaps.
Confirm the API process can read ~/.secrets or has equivalent environment variables.
/api/operations/health includes non-secret providerDiagnostics and providerIssues with exact missing aliases and secret-file paths.
Provider issue statuses distinguish missing_credentials, invalid_credentials, permission_denied, timeout, error, and connected; use those labels to pick the next action instead of treating every provider issue as generic degradation.
Check OPERATIONS_REFRESH_ENABLED and OPERATIONS_REFRESH_INTERVAL_MS if scheduled updates are stale.
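
One way to act on the status labels is a direct mapping from label to next step. The status names are the ones listed above; the suggested actions are illustrative wording for this sketch and are not returned by the API.

```typescript
type ProviderStatus =
  | "missing_credentials"
  | "invalid_credentials"
  | "permission_denied"
  | "timeout"
  | "error"
  | "connected";

// Suggested next step per status; the wording is an assumption.
function nextAction(status: ProviderStatus): string {
  switch (status) {
    case "missing_credentials":
      return "Add the missing alias or secret file named in providerIssues.";
    case "invalid_credentials":
      return "Replace the token, then re-run the provider check.";
    case "permission_denied":
      return "Grant the token access to the failing resource.";
    case "timeout":
      return "Check network reachability to the provider and retry.";
    case "error":
      return "Inspect provider output from pnpm ops:refresh.";
    case "connected":
      return "No action needed.";
  }
}
```

Because the switch covers every member of the union, adding a new status to the type makes the compiler flag the missing case.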

Investigation Bundles¶

Use the service detail page action to copy a redacted investigation bundle. The bundle includes service metadata, latest checks, provider resources, recent deployments, confidence, evidence chains, incident timeline, and next actions. Secret-like fields are redacted before export.
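
The redaction step can be pictured as a recursive copy that masks values under secret-like keys. The runbook does not define what counts as secret-like, so the key pattern and the `[REDACTED]` placeholder below are assumptions, and the real exporter may use different rules.

```typescript
// Assumed notion of "secret-like": key names containing these words.
const SECRET_KEY_PATTERN = /token|secret|password|api[_-]?key|authorization/i;

// Returns a deep copy with secret-like fields replaced by a placeholder.
function redact(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map((item) => redact(item));
  }
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
      out[key] = SECRET_KEY_PATTERN.test(key) ? "[REDACTED]" : redact(inner);
    }
    return out;
  }
  return value; // primitives pass through unchanged
}
```

Matching on key names is a conservative default: a whole nested object under a key such as `secrets` is masked rather than walked.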

Deploy Gates And Validation¶

Before pnpm ops:deploy (or commits touching ops paths), run:

pnpm --filter @oath-bringer/api test
pnpm type-check
pnpm build
pnpm ops:secrets:doctor
pnpm ops:infra:audit -- --json
pnpm ops:notifications:check -- --require-channel --json
pnpm ops:verify
pnpm mcp:test

Deploy is blocked when secrets doctor reports conflicts, infra audit has active criticals, notification routing has no configured channel, or public health assertion fails.
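
The blocking rule above amounts to four independent conditions. The sketch below encodes them over a summary object. The `GateResults` field names are hypothetical and do not reflect the JSON that the gate commands actually emit.

```typescript
// Hypothetical summary of gate outcomes; field names are assumptions.
interface GateResults {
  secretsConflicts: number; // conflicts reported by ops:secrets:doctor
  activeCriticals: number; // active criticals from ops:infra:audit
  configuredChannels: number; // channels found by ops:notifications:check
  publicHealthOk: boolean; // result of the public health assertion
}

// Returns the reasons a deploy would be blocked; an empty list means green.
function deployBlockers(r: GateResults): string[] {
  const blockers: string[] = [];
  if (r.secretsConflicts > 0) blockers.push("secrets doctor reports conflicts");
  if (r.activeCriticals > 0) blockers.push("infra audit has active criticals");
  if (r.configuredChannels === 0) blockers.push("notification routing has no configured channel");
  if (!r.publicHealthOk) blockers.push("public health assertion failed");
  return blockers;
}
```

Collecting every reason, instead of stopping at the first, lets an operator fix all failing gates in one pass.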

pnpm ops:deploy -- --verify-agent runs ops:verify after health passes.

Vault App-Scoped Secrets¶

Classified Unified CRM and Bridge Four credentials live under secret/local-secrets/unifiedcrm/* and secret/local-secrets/bridge-four.cc/*. Human Vault access uses Google Workspace OIDC with role admin; see Vault authentication before updating secrets. After OIDC login, populate these paths with:

bash deploy/scripts/vault-sync-app-secrets.sh

The local-workflows AppRole needs create/update on those paths; see Vault app-scoped secrets for the exact policy block.

Rotating Credentials¶

Update the canonical Vault path (or legacy source used by vault-sync-app-secrets.sh).
Run pnpm ops:secrets:doctor and confirm fingerprints changed (prefix only).
Run pnpm ops:cloudflare:check / pnpm ops:gitlab:check or POST /api/integrations/:id/check.
Redeploy only after gates are green (pnpm ops:deploy runs pre/post gates automatically).

Resolving Gate Failures¶

| Gate | Typical fix |
| --- | --- |
| `ops:secrets:doctor` | Resolve conflicting aliases; load missing keys from Vault |
| `ops:infra:audit` | Fix dangling tunnel CNAMEs, stale tunnels, unhealthy active services, stale incidents |
| `ops:notifications:check` | Configure Slack webhook/bot, SendGrid, webhook URL, or Twilio SMS env keys |
| Public `/api/health` | Run `pnpm ops:refresh`; verify GitLab/Cloudflare tokens on API runtime |

Adding An Integration¶

Add a provider definition in apps/api/src/integrations/catalog.ts if the vendor is new.
Implement probe in apps/api/src/integrations/probes.ts.
Store credentials in Vault under an app-scoped path; extend vaultSecrets.ts if needed.
Verify UI at /dashboard/integrations and MCP oath_integrations_list.
