docs: bootstrap OPS project with CLAUDE.md, session history, and TODO
- CLAUDE.md with project context, paths, architecture, deploy commands
- 12 session notes extracted from MDF Webseiten project (sessions 0022-0055)
- TODO.md with remaining items
- Notes organized in YYYY/MM date hierarchy
| .. | .. |
|---|
| 1 | +# Ops Dashboard - Project Context |
|---|
| 2 | + |
|---|
| 3 | +## Key Paths |
|---|
| 4 | + |
|---|
| 5 | +| What | Path | |
|---|
| 6 | +|------|------| |
|---|
| 7 | +| **Code repo (local)** | `~/dev/ai/OPS/` | |
|---|
| 8 | +| **Code repo (server)** | `/opt/data/ops-dashboard/` | |
|---|
| 9 | +| **Code repo (remote)** | `git.mnsoft.org/git/APPS/ops-dashboard.git` | |
|---|
| 10 | +| **Infrastructure repo (server)** | `/opt/infrastructure/` | |
|---|
| 11 | +| **Infrastructure repo (remote)** | `git.mnsoft.org/git/APPS/infrastructure.git` | |
|---|
| 12 | +| **Notes** | `~/dev/ai/OPS/Notes/` | |
|---|
| 13 | +| **TODO** | `~/dev/ai/OPS/Notes/TODO.md` | |
|---|
| 14 | + |
|---|
| 15 | +## Server Access |
|---|
| 16 | + |
|---|
| 17 | +```bash |
|---|
| 18 | +ssh mdf-system.ch # root, port 99 (via ~/.ssh/config) |
|---|
| 19 | +``` |
|---|
| 20 | + |
|---|
| 21 | +## Application |
|---|
| 22 | + |
|---|
| 23 | +| What | Detail | |
|---|
| 24 | +|------|--------| |
|---|
| 25 | +| **URL** | https://ops.tekmidian.com | |
|---|
| 26 | +| **Auth token** | `ops-mdf-2026-secure` | |
|---|
| 27 | +| **Container** | `ops-dashboard` | |
|---|
| 28 | +| **Stack** | FastAPI backend, vanilla JS frontend, SSE for real-time ops | |
|---|
| 29 | + |
|---|
| 30 | +## Architecture |
|---|
| 31 | + |
|---|
| 32 | +- Container mounts: `/opt/data`, `/opt/infrastructure`, `/var/run/docker.sock` + app source |
|---|
| 33 | +- Container has pyyaml 6.0.3 — can import toolkit directly |
|---|
| 34 | +- nsenter bridge for host operations (backup, restore, sync, promote, gen-timers) |
|---|
| 35 | +- `OPS_CLI` = `/usr/local/bin/ops` on host (bash shim -> `python3 -m toolkit.cli`) |
|---|
| 36 | +- Toolkit: `/opt/infrastructure/toolkit/` — 12 Python modules |
|---|
| 37 | +- Registry: `project.yaml` descriptors at `/opt/data/{project}/project.yaml` (source of truth) |
|---|
| 38 | + |
|---|
| 39 | +## Dashboard Pages |
|---|
| 40 | + |
|---|
| 41 | +| Page | Purpose | |
|---|
| 42 | +|------|---------| |
|---|
| 43 | +| **Dashboard** | Status tiles, project drill-down | |
|---|
| 44 | +| **Services** | Container cards, restart / logs / terminal | |
|---|
| 45 | +| **Backups** | Date-grouped, local + offsite, restore modal, multi-select delete | |
|---|
| 46 | +| **Operations** | Promote, sync, rebuild — SSE streaming modals | |
|---|
| 47 | +| **Schedules** | Backup timer management, edit modal | |
|---|
| 48 | +| **System** | CPU / mem / disk, health checks, timers | |
|---|
| 49 | + |
|---|
| 50 | +## Deploy |
|---|
| 51 | + |
|---|
| 52 | +```bash |
|---|
| 53 | +rsync -avz --delete ~/dev/ai/OPS/app/ mdf-system.ch:/opt/data/ops-dashboard/app/ |
|---|
| 54 | +rsync -avz --delete ~/dev/ai/OPS/static/ mdf-system.ch:/opt/data/ops-dashboard/static/ |
|---|
| 55 | +ssh mdf-system.ch 'docker restart ops-dashboard' |
|---|
| 56 | +``` |
|---|
| 57 | + |
|---|
| 58 | +## Mandatory Rules |
|---|
| 59 | + |
|---|
| 60 | +- **Never edit files directly on the server** — always rsync from local, then restart |
|---|
| 61 | +- **project.yaml is source of truth** — never hardcode project/env lists in the dashboard |
|---|
| 62 | +- **toolkit is importable** — use `from toolkit.X import Y` directly inside the container; no subprocess ops calls for data reads |
|---|
| 63 | +- **nsenter for mutations** — backup, restore, sync, promote must go through the nsenter bridge to the host Python venv, not direct toolkit calls |
|---|
| 64 | +- **SSE for long ops** — any operation that may take >2s must stream progress via SSE, never block the HTTP response |
|---|
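The "SSE for long ops" rule above can be illustrated with a minimal sketch of the wire format. This is not the dashboard's own code: `sse_event` and `stream_lines` are hypothetical helpers, and the `done` event name is an assumption. Only the `event:` / `data:` framing with a blank-line terminator comes from the Server-Sent Events format itself.

```python
from typing import Iterable, Iterator, Optional


def sse_event(data: str, event: Optional[str] = None) -> str:
    """Format one SSE frame; multi-line data gets one 'data:' field per line."""
    fields = []
    if event is not None:
        fields.append(f"event: {event}")
    for part in data.split("\n"):
        fields.append(f"data: {part}")
    # A blank line terminates the frame.
    return "\n".join(fields) + "\n\n"


def stream_lines(output: Iterable[str]) -> Iterator[str]:
    """Yield one frame per output line, then a final 'done' event (assumed name)."""
    for line in output:
        yield sse_event(line.rstrip("\n"))
    yield sse_event("complete", event="done")
```

In a FastAPI app, a generator like this would typically be returned through `StreamingResponse(..., media_type="text/event-stream")`, so the HTTP response starts immediately and progress arrives line by line.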
| .. | .. |
|---|
| 1 | +# Session 0001: Ops Dashboard Core Fixes |
|---|
| 2 | + |
|---|
| 3 | +**Date:** 2026-02-22 |
|---|
| 4 | +**Status:** Completed |
|---|
| 5 | +**Origin:** MDF Webseiten session 0024 |
|---|
| 6 | + |
|---|
| 7 | +--- |
|---|
| 8 | + |
|---|
| 9 | +## Work Done |
|---|
| 10 | + |
|---|
| 11 | +- [x] Removed load averages tile — replaced with Containers (running/total) + Processes tiles |
|---|
| 12 | +- [x] Fixed Health Checks section — was broken inside Docker, now runs via nsenter bridge on host |
|---|
| 13 | +- [x] Fixed Timers section — was broken (no systemd in container), now uses nsenter on host |
|---|
| 14 | +- [x] Added `run_command_host()` to `ops_runner.py` for arbitrary host commands via nsenter |
|---|
| 15 | +- [x] Rewrote timer parser — anchors on timestamp patterns instead of fragile column splitting |
|---|
| 16 | +- [x] Fixed ops CLI health check — removed stale /opt/data2 reference, added [OK]/[FAIL] output format |
|---|
| 17 | +- [x] Added Docker daemon running check to `ops health` (reports container count) |
|---|
| 18 | + |
|---|
| 19 | +## Key Decisions / Learnings |
|---|
| 20 | + |
|---|
| 21 | +- Dashboard container uses COPY (not volume mount) — requires `docker build` + recreate for changes to take effect |
|---|
| 22 | +- nsenter bridge pattern for host commands: `docker run --rm --privileged --pid=host alpine nsenter -t 1 -m -u -i -n -p --` |
|---|
| 23 | +- `ops health` must always exit 0: returning the issue count as the exit code breaks callers using `set -euo pipefail` |
|---|
| 24 | + |
|---|
| 25 | +## Files Changed |
|---|
| 26 | + |
|---|
| 27 | +- `app/routers/system.py` — nsenter for health+timers, containers/processes tiles |
|---|
| 28 | +- `app/ops_runner.py` — added `run_command_host()` |
|---|
| 29 | +- `static/js/app.js` — replaced Load tile with Containers + Processes tiles |
|---|
| 30 | + |
|---|
| 31 | +--- |
|---|
| 32 | + |
|---|
| 33 | +**Tags:** #Session #OpsDashboard |
|---|
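The nsenter bridge pattern from the session note above can be sketched as a small argv builder. The prefix is copied from the note; `build_host_command` and `format_for_log` are hypothetical names, and the real `run_command_host()` in `ops_runner.py` may look quite different.

```python
import shlex
from typing import List, Sequence

# Prefix taken from the session note: a throwaway privileged container that
# enters the namespaces of host PID 1.
NSENTER_PREFIX: List[str] = [
    "docker", "run", "--rm", "--privileged", "--pid=host", "alpine",
    "nsenter", "-t", "1", "-m", "-u", "-i", "-n", "-p", "--",
]


def build_host_command(args: Sequence[str]) -> List[str]:
    """Return the full argv that runs `args` on the host (hypothetical helper)."""
    if not args:
        raise ValueError("no host command given")
    # Concatenation builds a new list, so the shared prefix is never mutated.
    return NSENTER_PREFIX + list(args)


def format_for_log(argv: Sequence[str]) -> str:
    """Render argv as a shell-quoted string for log output only."""
    return shlex.join(argv)
```

Passing the command as an argv list rather than a shell string avoids quoting problems when the host command contains spaces or user-supplied values.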
| .. | .. |
|---|
| 1 | +# Session 0002: Backup Page Redesign v5-v6 |
|---|
| 2 | + |
|---|
| 3 | +**Date:** 2026-02-22 |
|---|
| 4 | +**Status:** Completed |
|---|
| 5 | +**Origin:** MDF Webseiten session 0026 |
|---|
| 6 | + |
|---|
| 7 | +--- |
|---|
| 8 | + |
|---|
| 9 | +## Work Done |
|---|
| 10 | + |
|---|
| 11 | +- [x] **v5**: Rewrote backup page from flat 65+ row table to date-grouped collapsible sections |
|---|
| 12 | + - Summary stat tiles (local count, offsite count, latest backup, total size) |
|---|
| 13 | + - Today/yesterday auto-expanded, older collapsed with chevron toggle animation |
|---|
| 14 | + - Combined local+offsite view with type badges |
|---|
| 15 | +- [x] **v6**: Deduplication + inline restore |
|---|
| 16 | + - Same filename in both local+offsite locations → single row with "local + offsite" badge |
|---|
| 17 | + - Removed separate Restore page from sidebar |
|---|
| 18 | + - Added Restore button per row with confirmation modal + SSE streaming output |
|---|
| 19 | + - Dry-run checkbox (default on) in restore modal |
|---|
| 20 | + |
|---|
| 21 | +## Key Decisions / Learnings |
|---|
| 22 | + |
|---|
| 23 | +- Inline restore replaces the separate Restore page — backups and restores live on one page |
|---|
| 24 | +- Dry-run default-on prevents accidental destructive restores |
|---|
| 25 | +- SSE streaming for restore output enables real-time feedback in the modal |
|---|
| 26 | +- Dedup by filename keeps the UI clean when the same backup exists locally and offsite |
|---|
| 27 | + |
|---|
| 28 | +--- |
|---|
| 29 | + |
|---|
| 30 | +**Tags:** #Session #OpsDashboard #Backups |
|---|
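The dedup-by-filename behaviour described above can be sketched as a merge keyed on the backup filename. This is an illustration only: the notes imply the dashboard does this merge when rendering the list, and `merge_backups` plus the row shape (`name`, `badge`) are assumptions made for the example.

```python
from typing import Dict, Iterable, List, Set


def merge_backups(local: Iterable[str], offsite: Iterable[str]) -> List[Dict[str, str]]:
    """Merge two filename lists into one row per filename with a location badge."""
    locations: Dict[str, Set[str]] = {}
    for name in local:
        locations.setdefault(name, set()).add("local")
    for name in offsite:
        locations.setdefault(name, set()).add("offsite")
    # Newest first, assuming filenames embed a sortable timestamp.
    return [
        {"name": name, "badge": " + ".join(sorted(locs))}
        for name, locs in sorted(locations.items(), reverse=True)
    ]
```

A file present in both lists yields a single row with the badge `local + offsite`, matching the v6 behaviour described in the notes.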
| .. | .. |
|---|
| 1 | +# Session 0003: Backup v8-v9 Delete, Multi-Select, URL Routing |
|---|
| 2 | + |
|---|
| 3 | +**Date:** 2026-02-22 |
|---|
| 4 | +**Status:** Completed |
|---|
| 5 | +**Origin:** MDF Webseiten session 0031 |
|---|
| 6 | + |
|---|
| 7 | +--- |
|---|
| 8 | + |
|---|
| 9 | +## Work Done |
|---|
| 10 | + |
|---|
| 11 | +### v8: Delete + Granular Restore |
|---|
| 12 | +- [x] `DELETE /api/backups/{project}/{env}/{name}` endpoint with path traversal validation |
|---|
| 13 | +- [x] Restore `mode` query param (full/db/wp) → passes `--db-only`/`--wp-only` to ops CLI |
|---|
| 14 | +- [x] Delete button on every backup row in Level 2 drill-down |
|---|
| 15 | +- [x] Restore Mode radio buttons (Full / Database only / WP-Content only) in restore modal |
|---|
| 16 | + |
|---|
| 17 | +### v9: Multi-Select, Upload, Source Selector, URL Routing |
|---|
| 18 | +- [x] URL hash routing — `#/backups/mdf/dev`, `#/dashboard/table`, `#/system` — browser refresh preserves location |
|---|
| 19 | +- [x] Multi-select delete — checkboxes per row, select-all header, blue selection bar, bulk delete |
|---|
| 20 | +- [x] Upload to offsite — purple "Upload" button on local-only backups |
|---|
| 21 | +- [x] Restore source selector — Local/Offsite radio buttons when backup exists in both locations |
|---|
| 22 | + |
|---|
| 23 | +## Key Decisions / Learnings |
|---|
| 24 | + |
|---|
| 25 | +- Path traversal validation is required on delete endpoint (user-supplied filename in URL) |
|---|
| 26 | +- Static files are volume-mounted (not COPY'd) — frontend changes don't require container rebuild |
|---|
| 27 | +- URL hash routing lets users bookmark specific dashboard views and survive page refresh |
|---|
| 28 | +- Granular restore (db-only / wp-only) avoids full restore when only one component needs recovery |
|---|
| 29 | + |
|---|
| 30 | +## Pending |
|---|
| 31 | + |
|---|
| 32 | +- Selection bar spacing CSS gap not taking effect (possible browser cache issue) |
|---|
| 33 | + |
|---|
| 34 | +--- |
|---|
| 35 | + |
|---|
| 36 | +**Tags:** #Session #OpsDashboard #Backups |
|---|
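Path traversal validation for a user-supplied backup name, as required on the delete endpoint above, might look like the following sketch. `resolve_backup_path` and its parameters are hypothetical; the actual checks in the dashboard are not shown in these notes.

```python
from pathlib import Path


def resolve_backup_path(base_dir: Path, name: str) -> Path:
    """Return the path of `name` inside `base_dir`, or raise ValueError."""
    # Reject anything that is not a plain filename.
    if not name or name in {".", ".."} or "/" in name or "\\" in name or "\x00" in name:
        raise ValueError(f"invalid backup name: {name!r}")
    base = base_dir.resolve()
    candidate = (base / name).resolve()
    # Defence in depth: the resolved file must sit directly in the base directory.
    if candidate.parent != base:
        raise ValueError(f"backup name escapes base directory: {name!r}")
    return candidate
```

An endpoint would typically translate the `ValueError` into an HTTP 400 response before touching the filesystem.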
| .. | .. |
|---|
| 1 | +# Session 0004: Adjacent Env Restriction & Lifecycle Operations |
|---|
| 2 | + |
|---|
| 3 | +**Date:** 2026-02-22 |
|---|
| 4 | +**Status:** Completed |
|---|
| 5 | +**Origin:** MDF Webseiten session 0034 |
|---|
| 6 | + |
|---|
| 7 | +--- |
|---|
| 8 | + |
|---|
| 9 | +## Work Done |
|---|
| 10 | + |
|---|
| 11 | +### Adjacent Environment Restriction |
|---|
| 12 | +- [x] Removed direct prod↔dev sync/promote paths from UI and API |
|---|
| 13 | +- [x] Only adjacent pairs allowed: dev↔int, int↔prod |
|---|
| 14 | +- [x] Backend returns HTTP 400 for invalid environment pairs |
|---|
| 15 | + |
|---|
| 16 | +### Container Lifecycle Operations (Rebuild/Recreate) |
|---|
| 17 | +- [x] Implemented three lifecycle operations (discovered Coolify API caused duplicate containers): |
|---|
| 18 | + - **Restart** — `docker restart` via SSH (safe, no image changes) |
|---|
| 19 | + - **Rebuild** — stop → build image → start (keeps data volumes) |
|---|
| 20 | + - **Recreate** — stop → wipe data → build image → start (full disaster recovery) |
|---|
| 21 | +- [x] Color-coded UI: green (restart), yellow (rebuild), red (recreate) |
|---|
| 22 | +- [x] Type-to-confirm dialog for destructive Recreate operation |
|---|
| 23 | +- [x] Fixed EventSource auto-reconnect causing duplicate banners across operations |
|---|
| 24 | +- [x] Fixed graceful handling of already-stopped containers, a NameError crash, and the container filter logic (OR vs AND) |
|---|
| 25 | + |
|---|
| 26 | +## Key Decisions / Learnings |
|---|
| 27 | + |
|---|
| 28 | +- Direct prod↔dev skips review in intermediate env (int) — adjacent-only enforced at API level, not just UI |
|---|
| 29 | +- Coolify stop prunes local Docker images — cannot use Coolify API to stop services with locally-built images |
|---|
| 30 | +- The EventSource must be explicitly closed once the operation completes; otherwise its auto-reconnect produces duplicate banners |
|---|
| 31 | +- Type-to-confirm for Recreate is appropriate UX — wipes data volumes, no undo |
|---|
| 32 | + |
|---|
| 33 | +--- |
|---|
| 34 | + |
|---|
| 35 | +**Tags:** #Session #OpsDashboard #Lifecycle |
|---|
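The adjacent-environment rule above can be sketched as an ordering check. The environment order `dev`, `int`, `prod` comes from the notes; `is_adjacent` and `validate_env_pair` are hypothetical names, and the real routers may encode the allowed pairs differently (for example as an explicit list).

```python
from typing import List

# Promotion order from the notes: dev <-> int <-> prod.
ENV_ORDER: List[str] = ["dev", "int", "prod"]


def is_adjacent(source: str, target: str) -> bool:
    """True only for neighbouring environments, in either direction."""
    if source not in ENV_ORDER or target not in ENV_ORDER:
        return False
    return abs(ENV_ORDER.index(source) - ENV_ORDER.index(target)) == 1


def validate_env_pair(source: str, target: str) -> None:
    """Raise ValueError for pairs the API should reject (HTTP 400 in the notes)."""
    if not is_adjacent(source, target):
        raise ValueError(f"invalid environment pair: {source} -> {target}")
```

Because the check is symmetric, both directions of each adjacent pair are accepted, while `prod` to `dev` and `dev` to `prod` are rejected on the server side regardless of what the UI offers.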
| .. | .. |
|---|
| 1 | +# Session 0005: rebuild.py Rewrite & App Volume Mount |
|---|
| 2 | + |
|---|
| 3 | +**Date:** 2026-02-22 |
|---|
| 4 | +**Status:** Completed |
|---|
| 5 | +**Origin:** MDF Webseiten session 0035 |
|---|
| 6 | + |
|---|
| 7 | +--- |
|---|
| 8 | + |
|---|
| 9 | +## Work Done |
|---|
| 10 | + |
|---|
| 11 | +- [x] Diagnosed root cause: dashboard was calling Coolify API stop/start on a placeholder test-nginx, not actual MDF containers |
|---|
| 12 | +- [x] Rewrote `rebuild.py` to use `ops rebuild` CLI via host nsenter bridge (no Coolify API) |
|---|
| 13 | +- [x] Updated `ops rebuild` to do `docker compose down` before `up -d --build` |
|---|
| 14 | +- [x] Added safety backup step to Recreate operation |
|---|
| 15 | +- [x] Added `app/` directory as volume mount to ops-dashboard compose (enables live edits without rebuild) |
|---|
| 16 | +- [x] Added ops-dashboard git remote at `git.mnsoft.org/git/APPS/ops-dashboard.git` |
|---|
| 17 | +- [x] Committed and pushed all server repos (MDF, infrastructure, ops-dashboard) |
|---|
| 18 | + |
|---|
| 19 | +## Key Decisions / Learnings |
|---|
| 20 | + |
|---|
| 21 | +- `app/` as volume mount is essential for iterating on dashboard backend without container rebuilds |
|---|
| 22 | +- `static/` was already volume-mounted; `app/` mount completes the live-edit setup |
|---|
| 23 | +- Coolify API is unreliable for locally-built images — ops CLI via nsenter bridge is the correct pattern |
|---|
| 24 | +- Safety backup before Recreate ensures data can be recovered if the recreate fails |
|---|
| 25 | + |
|---|
| 26 | +--- |
|---|
| 27 | + |
|---|
| 28 | +**Tags:** #Session #OpsDashboard #Infrastructure |
|---|
| .. | .. |
|---|
| 1 | +# Session 0006: SSE Streaming Backup & Upload Endpoints |
|---|
| 2 | + |
|---|
| 3 | +**Date:** 2026-02-22 |
|---|
| 4 | +**Status:** Completed |
|---|
| 5 | +**Origin:** MDF Webseiten session 0036 |
|---|
| 6 | + |
|---|
| 7 | +--- |
|---|
| 8 | + |
|---|
| 9 | +## Work Done |
|---|
| 10 | + |
|---|
| 11 | +- [x] Added `upload` subcommand to `offsite.py` CLI (function existed but wasn't wired) |
|---|
| 12 | +- [x] Converted `create_backup` endpoint from plain JSON to SSE streaming |
|---|
| 13 | +- [x] Converted `upload_offsite` endpoint from plain JSON to SSE streaming |
|---|
| 14 | +- [x] Changed both endpoints to accept GET+POST (EventSource API requires GET) |
|---|
| 15 | + |
|---|
| 16 | +## Key Decisions / Learnings |
|---|
| 17 | + |
|---|
| 18 | +- EventSource (SSE) requires GET requests — endpoints serving streaming output must accept GET |
|---|
| 19 | +- Converting to SSE streaming gives real-time feedback for long-running backup and upload operations |
|---|
| 20 | +- offsite.py had an upload function but it was never exposed as a CLI subcommand — easy fix, high value |
|---|
| 21 | + |
|---|
| 22 | +--- |
|---|
| 23 | + |
|---|
| 24 | +**Tags:** #Session #OpsDashboard #Backups #SSE |
|---|
| .. | .. |
|---|
| 1 | +# Session 0007: FTP Progress Callbacks & Upload Button Fix |
|---|
| 2 | + |
|---|
| 3 | +**Date:** 2026-02-22 |
|---|
| 4 | +**Status:** Completed |
|---|
| 5 | +**Origin:** MDF Webseiten session 0038 |
|---|
| 6 | + |
|---|
| 7 | +--- |
|---|
| 8 | + |
|---|
| 9 | +## Work Done |
|---|
| 10 | + |
|---|
| 11 | +- [x] Added FTP upload/download progress callbacks to `offsite.py` — prints every 5% with size info |
|---|
| 12 | +- [x] Increased FTP block size to 256KB for better throughput |
|---|
| 13 | +- [x] Added `flush=True` and `sys.stdout.reconfigure(line_buffering=True)` for SSE streaming compatibility |
|---|
| 14 | +- [x] Fixed Upload button — now passes exact filename through frontend → API (`?name=` param) → ops CLI; previously always uploaded the latest backup regardless of which row was clicked |
|---|
| 15 | +- [x] Added `cache: 'no-store'` to all `fetch()` calls in `app.js` to prevent stale UI state |
|---|
| 16 | +- [x] Added `renderBackups()` call after upload success and on upload modal close |
|---|
| 17 | +- [x] Added `/etc/tmpfiles.d/mdf-cleanup.conf` to auto-clean orphan `/tmp/tmp*` dirs older than 1 day |
|---|
| 18 | +- [x] Increased FTP data socket timeout to 300s for large transfers |
|---|
| 19 | +- [x] Verified via Playwright: LOCAL + OFFSITE badges display correctly, merge-by-filename works |
|---|
| 20 | + |
|---|
| 21 | +## Key Decisions / Learnings |
|---|
| 22 | + |
|---|
| 23 | +- `sys.stdout.reconfigure(line_buffering=True)` is required for progress output to stream through SSE — buffered stdout swallows output |
|---|
| 24 | +- The upload endpoint must accept a `name=` param; generic "upload latest" is wrong UX when user clicks a specific row |
|---|
| 25 | +- `cache: 'no-store'` on all fetches is necessary — stale backup list after upload is confusing |
|---|
| 26 | +- Corrupt backup (`prod_backup_20260219_164913.tar.gz`) failed FTP at ~5% consistently — safe to delete |
|---|
| 27 | + |
|---|
| 28 | +--- |
|---|
| 29 | + |
|---|
| 30 | +**Tags:** #Session #OpsDashboard #Backups #FTP |
|---|
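The "print every 5%" progress callback described above could be structured roughly as follows. This is a sketch, not the code in `offsite.py`: `make_progress_callback`, its parameters, and the message format are assumptions. The callback shape (one call per transferred block, receiving the block's bytes) matches what `ftplib.FTP.storbinary` and `retrbinary` expect.

```python
from typing import Callable


def _emit_flushed(message: str) -> None:
    # flush=True so each line reaches the SSE stream immediately.
    print(message, flush=True)


def make_progress_callback(
    total_bytes: int,
    step_percent: int = 5,
    emit: Callable[[str], None] = _emit_flushed,
) -> Callable[[bytes], None]:
    """Return a per-block callback that reports progress every `step_percent`."""
    sent = 0
    next_mark = step_percent

    def on_block(block: bytes) -> None:
        nonlocal sent, next_mark
        sent += len(block)
        if total_bytes <= 0:
            return  # unknown size: nothing meaningful to report
        percent = sent * 100 // total_bytes
        if percent >= next_mark:
            emit(f"{percent}% ({sent}/{total_bytes} bytes)")
            # Jump to the next threshold above the current percentage.
            next_mark = (percent // step_percent + 1) * step_percent

    return on_block
```

With the 256KB block size mentioned in the notes, usage would be along the lines of `ftp.storbinary(f"STOR {name}", fp, blocksize=256 * 1024, callback=make_progress_callback(size))`, where `ftp`, `name`, `fp`, and `size` are supplied by the caller.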
| .. | .. |
|---|
| 1 | +# Session 0008: Schedule Management & Backup Coverage System |
|---|
| 2 | + |
|---|
| 3 | +**Date:** 2026-02-23 |
|---|
| 4 | +**Status:** Completed |
|---|
| 5 | +**Origin:** MDF Webseiten session 0040 |
|---|
| 6 | + |
|---|
| 7 | +--- |
|---|
| 8 | + |
|---|
| 9 | +## Work Done |
|---|
| 10 | + |
|---|
| 11 | +- [x] Fixed `backup-all.sh` — was appending env suffix twice → files landed in `.../dev/dev/` |
|---|
| 12 | +- [x] Moved stranded double-nested backup files to correct directories |
|---|
| 13 | +- [x] Version-controlled `offsite.py` and `backup-all.sh` into infrastructure repo |
|---|
| 14 | +- [x] Added `_backup_generic()` function to ops CLI — tar-based fallback for projects without a dedicated CLI |
|---|
| 15 | +- [x] Added `backup:` config blocks to `registry.yaml` for MDF (03:15), SeriousLetter (03:00), Coolify (04:00) |
|---|
| 16 | +- [x] Created `gen-timers.py` — reads registry, generates systemd `.service` + `.timer` units automatically |
|---|
| 17 | +- [x] Added `ops gen-timers [--dry-run]` command — replaces legacy backup-all, mdf-backup, seriousletter-backup timers |
|---|
| 18 | +- [x] Created `schedule.py` FastAPI router: |
|---|
| 19 | + - `GET /api/schedule/` — returns backup config for all projects |
|---|
| 20 | + - `PUT /api/schedule/{project}` — updates config, writes registry via nsenter, regenerates timers |
|---|
| 21 | +- [x] Added "Schedules" nav item to dashboard sidebar (clock icon) |
|---|
| 22 | +- [x] Schedule page: table showing all projects with enabled/schedule/envs/offsite/retention columns |
|---|
| 23 | +- [x] Schedule edit modal: toggle, time picker, env checkboxes, offsite section, retention fields |
|---|
| 24 | + |
|---|
| 25 | +## Key Decisions / Learnings |
|---|
| 26 | + |
|---|
| 27 | +- registry.yaml drives both systemd timers and the dashboard schedule UI — single source of truth |
|---|
| 28 | +- `gen-timers` must auto-remove orphan timers (e.g. `backup-coolify.timer`) — prevents ghost schedules |
|---|
| 29 | +- `PUT /api/schedule/{project}` writes via nsenter (not inside container) because systemd lives on host |
|---|
| 30 | +- `backup-all.sh` must NOT append `/$env` suffix if the CLI already appends it internally |
|---|
| 31 | + |
|---|
| 32 | +--- |
|---|
| 33 | + |
|---|
| 34 | +**Tags:** #Session #OpsDashboard #Backups #Scheduling |
|---|
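Generating a systemd timer from a schedule time, as `gen-timers.py` is described as doing above, can be sketched like this. The function names are hypothetical, and the `backup-<project>.timer` naming is inferred from `backup-coolify.timer` in the notes. The unit directives used (`OnCalendar`, `Persistent`, `WantedBy=timers.target`) are standard systemd ones, but whether the real generator emits exactly these is an assumption.

```python
import re
from typing import Iterable, Set


def render_timer_unit(project: str, schedule: str) -> str:
    """Render a daily .timer unit for a HH:MM schedule such as '03:15'."""
    if not re.fullmatch(r"([01]\d|2[0-3]):[0-5]\d", schedule):
        raise ValueError(f"schedule must be HH:MM, got {schedule!r}")
    return (
        "[Unit]\n"
        f"Description=Backup timer for {project}\n"
        "\n"
        "[Timer]\n"
        f"OnCalendar=*-*-* {schedule}:00\n"
        "Persistent=true\n"
        "\n"
        "[Install]\n"
        "WantedBy=timers.target\n"
    )


def find_orphan_timers(existing: Iterable[str], projects: Iterable[str]) -> Set[str]:
    """Return backup timers on disk that no longer match a configured project."""
    wanted = {f"backup-{p}.timer" for p in projects}
    return {name for name in existing if name.startswith("backup-")} - wanted
```

The orphan check corresponds to the note that `gen-timers` must remove timers for projects whose backup config has gone away, so no ghost schedules keep firing.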
| .. | .. |
|---|
| 1 | +# Session 0009: Dashboard Rewrite Committed & Gen-Timers Migration |
|---|
| 2 | + |
|---|
| 3 | +**Date:** 2026-02-23 |
|---|
| 4 | +**Status:** Completed |
|---|
| 5 | +**Origin:** MDF Webseiten session 0047 |
|---|
| 6 | + |
|---|
| 7 | +--- |
|---|
| 8 | + |
|---|
| 9 | +## Work Done |
|---|
| 10 | + |
|---|
| 11 | +- [x] Committed ops-dashboard rewrite (8 files, rebuild.py reduced from 707 to 348 lines) |
|---|
| 12 | +- [x] Browser-tested all 5 dashboard pages: Dashboard, Backups, Schedules, Operations, MDF drill-down |
|---|
| 13 | + |
|---|
| 14 | +### Frontend Fixes (app.js) |
|---|
| 15 | +- [x] Fixed environment parsing — `cfg.environments` returns objects, needed `.map(e => e.name)` for promote/sync/lifecycle sections |
|---|
| 16 | +- [x] Removed `has_coolify` gate — Container Lifecycle section was incorrectly hidden entirely |
|---|
| 17 | +- [x] Changed all "Coolify API" text references to "docker compose" |
|---|
| 18 | +- [x] Fixed leftover banner bug — "Go to Backups" banner from Recreate persisted across subsequent Restart/Rebuild operations |
|---|
| 19 | + |
|---|
| 20 | +### Gen-Timers Migration |
|---|
| 21 | +- [x] Rewrote `cmd_gen_timers` to read from `all_projects()` descriptors instead of `registry.yaml` |
|---|
| 22 | +- [x] Orphan timer auto-cleanup (removed `backup-coolify.timer`) |
|---|
| 23 | +- [x] Schedule `PUT` endpoint now writes to `project.yaml` (not `registry.yaml`) — registry.yaml is now dead code |
|---|
| 24 | + |
|---|
| 25 | +### Seafile Healthchecks |
|---|
| 26 | +- [x] Added Docker HEALTHCHECK to `prod-mdf-seafile` (curl localhost:80, 60s start_period) |
|---|
| 27 | +- [x] Added Docker HEALTHCHECK to `prod-mdf-seafile-redis` (redis-cli ping) |
|---|
| 28 | +- [x] All 3 Seafile containers now report `healthy` in dashboard |
|---|
| 29 | + |
|---|
| 30 | +### Backup Bug Fixes |
|---|
| 31 | +- [x] Fixed single-env backup — MDF CLI `--all` flag was always backing up all envs even when one env was requested |
|---|
| 32 | +- [x] Fixed `bk.create` delegation — only delegates when command template contains `{env}` |
|---|
| 33 | + |
|---|
| 34 | +### Known Issues Remaining |
|---|
| 35 | +- Restore chain broken: the path of the file downloaded from offsite does not reach the actual restore step (wrong filename shown) |
|---|
| 36 | +- Backups page shows all entries as "Remote" (local/remote distinction broken in frontend) |
|---|
| 37 | + |
|---|
| 38 | +## Key Decisions / Learnings |
|---|
| 39 | + |
|---|
| 40 | +- `project.yaml` descriptors replace `registry.yaml` as source of truth for all ops commands |
|---|
| 41 | +- `has_coolify` gate should be removed — dashboard should always show lifecycle section |
|---|
| 42 | +- Browser testing after every batch of changes is essential — environment parsing bug only visible in browser |
|---|
| 43 | + |
|---|
| 44 | +--- |
|---|
| 45 | + |
|---|
| 46 | +**Tags:** #Session #OpsDashboard #Refactor |
|---|
| .. | .. |
|---|
| 1 | +# Session 0010: Sync Router Bidirectional Fix |
|---|
| 2 | + |
|---|
| 3 | +**Date:** 2026-02-25 |
|---|
| 4 | +**Status:** Completed |
|---|
| 5 | +**Origin:** MDF Webseiten session 0052 |
|---|
| 6 | + |
|---|
| 7 | +--- |
|---|
| 8 | + |
|---|
| 9 | +## Work Done |
|---|
| 10 | + |
|---|
| 11 | +- [x] Fixed `sync_data.py`: added the missing reverse sync pairs (int->prod, dev->int); pairs were previously defined in one direction only, which caused a "Connection lost" error when triggering an int->prod sync from the dashboard |
|---|
| 12 | + |
|---|
| 13 | +## Key Decisions / Learnings |
|---|
| 14 | + |
|---|
| 15 | +- Sync pairs must be defined bidirectionally in the router even if data only ever flows one direction (prod→int→dev) — the UI may call either direction depending on user intent |
|---|
| 16 | +- This was a trivial fix but caused a visible "Connection lost" failure in the dashboard |
|---|
| 17 | + |
|---|
| 18 | +--- |
|---|
| 19 | + |
|---|
| 20 | +**Tags:** #Session #OpsDashboard #BugFix |
|---|
| .. | .. |
|---|
| 1 | +# Session 0011: v15 Deploy, Debug & Full Verification |
|---|
| 2 | + |
|---|
| 3 | +**Date:** 2026-02-25 |
|---|
| 4 | +**Status:** Completed |
|---|
| 5 | +**Origin:** MDF Webseiten session 0054 |
|---|
| 6 | + |
|---|
| 7 | +--- |
|---|
| 8 | + |
|---|
| 9 | +## Work Done |
|---|
| 10 | + |
|---|
| 11 | +- [x] Deployed ops dashboard v15 to server (rsync + container rebuild) |
|---|
| 12 | +- [x] Fixed missing `OPS_CLI` path in `run_job()` — nsenter couldn't find the `ops` command |
|---|
| 13 | +- [x] Fixed same `OPS_CLI` bug in `restore.py` `_stream_to_job()` |
|---|
| 14 | +- [x] Fixed Python stdout buffering through nsenter pipe — added `PYTHONUNBUFFERED=1` to `_NSENTER_PREFIX` |
|---|
| 15 | +- [x] Fixed stdin inheritance — added `stdin=asyncio.subprocess.DEVNULL` to prevent `docker run -i` blocking |
|---|
| 16 | +- [x] Fixed `schedule.py` hard-coded `/usr/local/bin/ops` path — replaced with `OPS_CLI` constant |
|---|
| 17 | +- [x] Added logging to `run_job()` (command start, subprocess PID, exit code) |
|---|
| 18 | +- [x] Fixed terminal `docker exec` missing `-it` flags — shell was exiting immediately with code 0 |
|---|
| 19 | +- [x] Fixed MDF backup timer — `gen-timers` wasn't expanding `{env}` in custom command templates |
|---|
| 20 | +- [x] Verified backup (dev + int): lines streaming in real-time |
|---|
| 21 | +- [x] Verified disconnect/reconnect: output replay from offset works |
|---|
| 22 | +- [x] Verified restart mdf/dev: 2 containers restarted successfully |
|---|
| 23 | +- [x] Verified terminal: WebSocket handshake + interactive shell working |
|---|
| 24 | + |
|---|
| 25 | +## Key Decisions / Learnings |
|---|
| 26 | + |
|---|
| 27 | +| Bug | Root Cause | Fix | |
|---|
| 28 | +|-----|-----------|-----| |
|---|
| 29 | +| `nsenter: can't execute 'backup'` | `run_job()` missing OPS_CLI prefix | Added `[OPS_CLI]` to `full_args` | |
|---|
| 30 | +| Backup produces 0 lines | Python stdout buffered through pipe | Added `PYTHONUNBUFFERED=1` to nsenter prefix | |
|---|
| 31 | +| `docker run -i` hangs | stdin inherited from server process | `stdin=asyncio.subprocess.DEVNULL` | |
|---|
| 32 | +| Terminal exits immediately (code 0) | `docker exec` missing `-it` flags | Added `-it` to exec command | |
|---|
| 33 | +| MDF backups not running (2 nights) | `gen-timers`: `{env}` never expanded in custom command | Loop over envs + `.replace("{env}", env)` | |
|---|
| 34 | + |
|---|
| 35 | +- `PYTHONUNBUFFERED=1` is essential whenever running Python via the nsenter pipe: without it, block-buffered stdout holds output back, so nothing streams in real time |
|---|
| 36 | +- `stdin=asyncio.subprocess.DEVNULL` is required for non-interactive subprocess calls from async context |
|---|
| 37 | + |
|---|
| 38 | +## Commits |
|---|
| 39 | + |
|---|
| 40 | +- `9e13f76` — feat: ops dashboard v15 — persistent jobs + container terminal (Webseiten repo) |
|---|
| 41 | +- `4e65e9e` — fix: gen-timers expand {env} placeholder in custom backup commands (infrastructure repo) |
|---|
| 42 | + |
|---|
| 43 | +--- |
|---|
| 44 | + |
|---|
| 45 | +**Tags:** #Session #OpsDashboard #Debug #Deployment |
|---|
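Two of the fixes in the table above, unbuffered Python output and a detached stdin, can be shown together in a short asyncio sketch. `stream_command` is a hypothetical helper, not the dashboard's `run_job()`. The environment variable is set on the direct child here, whereas the notes add it to the nsenter prefix so that it reaches the Python process on the host.

```python
import asyncio
import os
import sys
from typing import List


async def stream_command(args: List[str]) -> List[str]:
    """Run a command and collect its output line by line as it arrives."""
    # PYTHONUNBUFFERED makes a Python child write each line immediately.
    env = dict(os.environ, PYTHONUNBUFFERED="1")
    proc = await asyncio.create_subprocess_exec(
        *args,
        stdin=asyncio.subprocess.DEVNULL,   # never inherit the server's stdin
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.STDOUT,   # interleave errors with normal output
        env=env,
    )
    lines: List[str] = []
    assert proc.stdout is not None
    async for raw in proc.stdout:
        # A real job runner would forward each line to SSE subscribers here.
        lines.append(raw.decode(errors="replace").rstrip("\n"))
    await proc.wait()
    return lines


if __name__ == "__main__":
    demo = [sys.executable, "-c", "print('step 1'); print('step 2')"]
    print(asyncio.run(stream_command(demo)))
```

With `stdin` left at its default, a child such as `docker run -i` can block waiting on the parent's stdin, which is the hang described in the table.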
| .. | .. |
|---|
| 1 | +# Session 0012: No-Backup Option for Promote & Sync |
|---|
| 2 | + |
|---|
| 3 | +**Date:** 2026-02-26 |
|---|
| 4 | +**Status:** Completed |
|---|
| 5 | +**Origin:** MDF Webseiten session 0055 |
|---|
| 6 | + |
|---|
| 7 | +--- |
|---|
| 8 | + |
|---|
| 9 | +## Work Done |
|---|
| 10 | + |
|---|
| 11 | +- [x] Added "Skip safety backup" checkbox to promote modal and sync modal |
|---|
| 12 | + - Backend: `no_backup` query param on `promote.py` and `sync_data.py` |
|---|
| 13 | + - Frontend: amber-colored checkbox, hidden for lifecycle operations (restart/rebuild/recreate) |
|---|
| 14 | + - Deployed and verified (200 OK) |
|---|
| 15 | + |
|---|
| 16 | +## Key Decisions / Learnings |
|---|
| 17 | + |
|---|
| 18 | +- Safety backup before promote/sync is the default — skip is opt-in, not opt-out |
|---|
| 19 | +- Amber color signals caution without being as severe as red |
|---|
| 20 | +- The checkbox is only shown for promote/sync, not for container lifecycle ops (different risk profile) |
|---|
| 21 | +- Useful when iterating quickly on dev/int where the overhead of a safety backup is unnecessary |
|---|
| 22 | + |
|---|
| 23 | +--- |
|---|
| 24 | + |
|---|
| 25 | +**Tags:** #Session #OpsDashboard #Promote #Sync |
|---|
| .. | .. |
|---|
| 1 | +# TODO |
|---|
| 2 | + |
|---|
| 3 | +## Open |
|---|
| 4 | + |
|---|
| 5 | +- [ ] Simple WordPress backup plugin (FTP + WebDAV only, UpdraftPlus alternative) |
|---|
| 6 | + |
|---|
| 7 | +## Ideas |
|---|
| 8 | + |
|---|
| 9 | +- [ ] Dark mode toggle |
|---|
| 10 | +- [ ] Mobile-responsive improvements |
|---|
| 11 | +- [ ] Log viewer with search/filter |
|---|
| 12 | +- [ ] Alerting rules configuration page |
|---|
| 13 | + |
|---|
| 14 | +## Completed (Summary) |
|---|
| 15 | + |
|---|
| 16 | +Dashboard feature-complete as of 2026-02-26. See session notes in Notes/2026/02/ for full history. |
|---|
| 17 | +Originated from MDF Webseiten project (sessions 0022-0055). |
|---|