docs: extract 14 additional session notes from MDF Webseiten (0013-0026)
Complete OPS session history now covers all ops-dashboard, ops CLI,
and infrastructure toolkit work from MDF Webseiten sessions 0018-0053.
# Session 0013: Infrastructure Repo & Ops CLI Bootstrap

**Date:** 2026-02-20
**Status:** Completed
**Origin:** MDF Webseiten session 0018

---

## Work Done

- [x] Created infrastructure repo at `git.mnsoft.org/git/APPS/infrastructure.git`
- [x] Local clone: `/Users/i052341/Daten/Cloud/08 - Others/MDF/Infrastruktur/Code/infrastructure/`
- [x] Server clone: `/opt/infrastructure/`
- [x] Wrote `ops` CLI (bash, ~250 lines) — symlinked to `/usr/local/bin/ops`
- [x] Created `servers/hetzner-vps/registry.yaml` — single source of truth for 5 projects
- [x] Captured 5 Traefik dynamic configs from server into git
- [x] Wrote `monitoring/healthcheck.sh` — container health + disk checks → ntfy
- [x] Installed `ops-healthcheck.timer` (every 5 min) on server
- [x] Added Docker labels (`ops.project`, `ops.environment`, `ops.service`) to all MDF compose files
- [x] Replaced hardcoded `container_name()` in `sync.py` with label-based discovery + UUID suffix fallback
- [x] Verified: `ops status`, `ops health`, `ops disk`, `ops backup mdf prod` all working

## Repo Structure Created

```
infrastructure/
├── ops                       # The ops CLI (bash)
├── servers/hetzner-vps/
│   ├── registry.yaml         # 5 projects defined
│   ├── traefik/dynamic/      # Traefik configs captured
│   ├── bootstrap/            # Coolify service payloads
│   ├── scaffolding/          # Shell aliases, SSH hardening, venv setup
│   ├── systemd/              # 6 timer/service units
│   └── install.sh            # Full fresh server setup script
├── monitoring/
│   ├── healthcheck.sh
│   ├── ops-healthcheck.service
│   └── ops-healthcheck.timer
└── docs/architecture.md
```

## Key Decisions / Learnings

- `ops` CLI uses `SCRIPT_DIR` with `readlink -f` for symlink-safe path resolution
- `registry.yaml` uses a `name_prefix` field; container matching uses `grep` with word anchoring to prevent substring false matches
- Label-based discovery is primary; a prefix search against Coolify UUID-suffixed names is the fallback
- Docker labels added to compose files are not live on running containers until restart — noted as a gap
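
The label-first, prefix-fallback resolution described above can be sketched in Python. This is a simplified illustration, not the actual `container_name()` from `sync.py`; the exact `{project}-{env}-{service}` prefix shape is an assumption, and `candidates` is injectable so the fallback path can be tested without Docker:

```python
import subprocess

def _docker_names(filters):
    """Return container names from `docker ps` with the given filter args."""
    out = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"] + filters,
        capture_output=True, text=True, check=True,
    ).stdout
    return [line for line in out.splitlines() if line]

def container_name(project, env, service, candidates=None):
    """Resolve a container: label-based discovery first, prefix fallback second.

    Passing `candidates` skips the label lookup and exercises only the
    fallback (useful in tests)."""
    if candidates is None:
        # Primary: match on the ops.* labels added to the compose files.
        names = _docker_names([
            "--filter", f"label=ops.project={project}",
            "--filter", f"label=ops.environment={env}",
            "--filter", f"label=ops.service={service}",
        ])
        if names:
            return names[0]
        candidates = _docker_names([])
    # Fallback: Coolify appends a UUID suffix, so match on the name prefix.
    prefix = f"{project}-{env}-{service}"
    for name in candidates:
        if name == prefix or name.startswith(prefix + "-"):
            return name
    raise LookupError(f"no container found for {prefix}")
```

The fallback accepts either an exact prefix match or the prefix followed by a `-` separator, which is what keeps UUID-suffixed names matchable without letting unrelated substrings through.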

## Files Changed

- `/opt/infrastructure/ops` — new ops CLI (bash)
- `/opt/infrastructure/servers/hetzner-vps/registry.yaml` — new registry
- `/opt/infrastructure/monitoring/healthcheck.sh` — new healthcheck script
- `Code/mdf-system/docker-compose.yaml` — added ops.* Docker labels
- `Code/mdf-system/scripts/sync/sync.py` — label-based container discovery, domain map fix

---

**Tags:** #Session #OpsCLI #Infrastructure
# Session 0014: Registry Naming & Backup System

**Date:** 2026-02-20
**Status:** Completed
**Origin:** MDF Webseiten session 0019

---

## Work Done

- [x] Fixed `sl-website` registry placement — moved under `seriousletter.services.website` to resolve a prefix collision
- [x] Renamed all 7 Coolify services to consistent `{project}-{env/purpose}` lowercase naming
- [x] Deleted a stale stopped MDF Dev duplicate from Coolify (UUID: qw8wso0ckskccoo0kcog84c0)
- [x] Fixed `ops backup/restore/sync` argument validation (was crashing on an unbound variable)
- [x] Fixed the SL CLI path in `registry.yaml` (pointed to the wrong location)
- [x] Added `container_name()` to SL `sync.py` with label + prefix fallback (mirrors the MDF pattern)
- [x] Made `ops backup <project>` work without an env arg (passes `--all` to the CLI)
- [x] Added a backup summary to `ops status` — latest backup per project/env, size, age with color coding
- [x] Consolidated backup dirs to `/opt/data/backups/{project}/{env}/` across all projects
- [x] Updated both MDF and SL CLIs for the per-env backup subdirectory structure
- [x] Volume consolidation: all data migrated from the 10GB to the 50GB volume at `/opt/data`
- [x] Updated all path references across compose files, CLIs, systemd units, registry, ops CLI

## Key Decisions / Learnings

- The registry was initially ambiguous about where `sl-website` lived — the prefix collision with other SL services caused matching bugs. Moving it under a `services.website` key made the prefix unique.
- Per-env backup subdirs (`/opt/data/backups/{project}/{env}/`) are the correct structure — flat dirs were the source of orphaned files.
- `ops backup <project>` without an env should be a valid shorthand — it delegates `--all` to the project CLI rather than requiring an explicit env arg.
- Container name resolution logic must be identical across project CLIs — label-based primary, prefix fallback secondary. Divergence causes mysterious "container not found" bugs.
- The old 10GB volume was kept mounted during migration to avoid cwd-in-mountpoint issues during `umount`.
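
The prefix collision is worth making concrete: with a naive first-match prefix search, a container named `sl-website-prod` can be claimed by a project whose prefix is `sl`. A longest-prefix-wins sketch of the matching hazard (function name hypothetical; the session's actual fix was making the `sl-website` prefix unique in the registry):

```python
def resolve_project(container, prefixes):
    """Return the registered prefix that owns `container`.

    Trying longer prefixes first means `sl-website-prod-1` resolves to
    `sl-website` instead of being swallowed by the shorter `sl` prefix."""
    for prefix in sorted(prefixes, key=len, reverse=True):
        if container == prefix or container.startswith(prefix + "-"):
            return prefix
    return None
```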

## Files Changed

- `/opt/infrastructure/servers/hetzner-vps/registry.yaml` — fixed sl-website placement, naming consistency
- `/opt/infrastructure/ops` — fixed arg validation, `cmd_backup` without env, backup summary in status
- `/opt/data/seriousletter/{dev,int,prod}/code/scripts/sync/sync.py` — added `container_name()` with fallback
- `Code/mdf-system/scripts/sync/sync.py` — per-env backup subdirectory paths
- All compose files, systemd units — `/opt/data2` → `/opt/data` path updates

---

**Tags:** #Session #OpsCLI #BackupSystem #Registry
# Session 0015: Offsite Backup Dashboard Fix & Status Format

**Date:** 2026-02-22
**Status:** Completed
**Origin:** MDF Webseiten session 0025

---

## Work Done

- [x] Fixed offsite backups not showing in the ops dashboard
  - `/api/backups/offsite` was calling `run_ops_json()` (in-container execution), but `ops offsite list` requires the host Python venv
  - Added a `run_ops_host_json()` helper to `ops_runner.py` using `nsenter`-based host execution
  - Updated the `backups.py` router to use `run_ops_host_json()` for offsite listing
  - Rebuilt and restarted the ops-dashboard container
- [x] Reformatted the backup list in `ops status` CLI output
  - Changed from a flat table sorted by project to date-grouped boxes
  - Each date gets its own Rich table: project / env / time / size / total columns
  - Latest backup per project/env shown, grouped by date descending, sorted by project then env within each date
- [x] Fixed a SeriousLetter backup path bug (CLI-level fix, required for dashboard data correctness)
  - The SL CLI was dumping backups flat into `/opt/data/backups/` — changed `backup-all.sh` to call the SL CLI per env with an explicit `--backup-dir`
  - Moved 15 orphaned backup files to the correct per-env directories
- [x] Ran a full backup cycle across all 6 environments (MDF + SL × dev/int/prod), verified offsite upload

## Key Decisions / Learnings

- Dashboard containers cannot use in-process `ops` commands that require host-side Python venvs — they must use the `nsenter` bridge. This is a recurring pattern: the in-container vs host execution boundary is an important architectural distinction in the ops-dashboard.
- Two execution helpers are needed: `run_ops_json()` (in-container, fast) and `run_ops_host_json()` (host via nsenter, required for backup/offsite commands).
- Date-grouped backup status is more readable than a flat project-sorted table — groups make it obvious if a date was missed entirely.
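
A minimal sketch of the `run_ops_host_json()` idea: break out of the container into the host's namespaces via `nsenter` against PID 1, run `ops`, and parse the JSON it prints. The exact nsenter flags are an assumption about this setup (they require a privileged container sharing the host PID namespace), and `runner` is injectable so the helper can be tested without Docker or root:

```python
import json
import subprocess

def run_ops_host_json(args, runner=subprocess.run):
    """Execute `ops` in the host's namespaces and parse its JSON output.

    Assumes PID 1 as seen from the container is the host init process,
    so nsenter can join the host mount/uts/net/ipc namespaces."""
    cmd = ["nsenter", "-t", "1", "-m", "-u", "-n", "-i", "ops"] + args
    result = runner(cmd, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)
```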

## Files Changed

- `/opt/data/ops-dashboard/app/ops_runner.py` — added `run_ops_host_json()` helper
- `/opt/data/ops-dashboard/app/routers/backups.py` — use host execution for offsite listing
- `/opt/infrastructure/ops` — reformatted backup summary with date-grouped Rich tables

---

**Tags:** #Session #OpsDashboard #BackupSystem #Offsite
# Session 0016: Backup Drill-Down Redesign & Restore Fix

**Date:** 2026-02-22
**Status:** Completed
**Origin:** MDF Webseiten session 0030

---

## Work Done

- [x] Fixed the restore API call — the `mdf` CLI was falling into interactive selection because no backup filename was passed
  - `app.js`: `startRestore()` now includes `&name=...` from `restoreCtx` in the API URL
- [x] Implemented the backups drill-down redesign (deployed as v7)
  - Replaced flat filter state with 3-level drill state (project → env → backup file)
  - Added cached backups to avoid re-fetching on drill-back
  - Extracted a `mergeBackups()` helper function
  - Implemented all 13 changes from the redesign plan
- [x] Fixed a browser cache problem preventing new JS from loading after a rebuild
  - Rebuilt the image and restarted the container to force a cache bust

## Key Decisions / Learnings

- The restore API must include the backup filename explicitly — passing only project/env and letting the CLI choose interactively breaks in a non-TTY server context.
- 3-level drill state (project → env → file) is the right UX pattern for hierarchical backup selection; flat filter state made navigation confusing and state management error-prone.
- Caching fetched backup lists at each level avoids latency on drill-back and reduces server load.
- Browser cache busting on vanilla JS apps requires either cache-control headers or a version query param — a container restart alone does not always clear client caches.
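
The drill-plus-cache behaviour can be illustrated with a small state object. Python is used here for brevity; the dashboard itself is vanilla JS, and these names are illustrative, not the actual `app.js` identifiers:

```python
class BackupDrill:
    """3-level drill state (project → env → backup file) with a fetch cache.

    `fetch` is the function that actually hits the API; results are cached
    per (project, env) so drilling back and forward doesn't re-fetch."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._cache = {}
        self.project = None
        self.env = None

    def drill(self, project, env):
        """Descend to a (project, env) and return its backup list."""
        self.project, self.env = project, env
        key = (project, env)
        if key not in self._cache:
            self._cache[key] = self._fetch(project, env)
        return self._cache[key]

    def back(self):
        """Pop one drill level; cached lists survive for instant re-display."""
        if self.env is not None:
            self.env = None
        else:
            self.project = None
```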

## Files Changed

- `/opt/data/ops-dashboard/static/js/app.js` — `startRestore()` fix, 3-level drill state, `mergeBackups()` helper
- Docker image rebuilt and container restarted

---

**Tags:** #Session #OpsDashboard #BackupSystem
# Session 0017: Modular Sync/Promote/Rebuild Architecture

**Date:** 2026-02-22
**Status:** Paused (context checkpoint)
**Origin:** MDF Webseiten session 0032

---

## Work Done

- [x] Fixed SL `detect_env()` — was returning "seriousletter" instead of the env name; now scans path components for the first match after "data"
- [x] Fixed an MDF `list_backups()` indentation bug — the try block was at the same level as the for loop, so only the last backup file was parsed
- [x] Added `promote` config to `registry.yaml` for mdf (rsync), seriousletter (git), ringsaday (git) — each defines promote type, branch mapping, post-pull behavior
- [x] Added a `promote` Typer command to SL `sync.py` — git fetch, diff preview, git pull, Dockerfile change detection, container rebuild/restart, health check; only dev→int and int→prod allowed
- [x] Added `cmd_promote` to the ops CLI — delegates to the project CLI with `--from`/`--to` args
- [x] Added `cmd_rebuild` to the ops CLI — starts containers, waits for health, restores the latest backup
- [x] Created 4 new FastAPI routers in ops-dashboard:
  - `promote.py` — SSE streaming promote endpoint
  - `sync_data.py` — SSE streaming sync endpoint
  - `registry.py` — exposes project list + environments + promote config as JSON
  - `rebuild.py` — SSE streaming rebuild/disaster-recovery endpoint
- [x] Updated `backups.py` to read the project list from the registry API instead of hardcoding it
- [x] Added an "Operations" page to the dashboard sidebar with three sections: Promote Code, Sync Data, Rebuild (Disaster Recovery)
- [x] Operations page uses an SSE modal with a dry-run toggle; project/direction buttons are populated dynamically from `/api/registry/`
- [x] Verified all 7 test categories pass
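
All four routers share the same shape: spawn the long-running command, forward each output line as an SSE event, and finish with an exit event. A self-contained sketch of that pattern (the dashboard's real `stream_ops_host()` may differ in details such as event names):

```python
import asyncio

def sse_event(data, event=None):
    """Frame a payload as a Server-Sent Events message."""
    lines = []
    if event:
        lines.append(f"event: {event}")
    lines.extend(f"data: {line}" for line in data.splitlines() or [""])
    lines.append("")  # the blank line terminates the event
    return "\n".join(lines) + "\n"

async def stream_ops_host(cmd):
    """Run a long-lived command and yield each output line as an SSE event."""
    proc = await asyncio.create_subprocess_exec(
        *cmd, stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.STDOUT)
    while True:
        line = await proc.stdout.readline()
        if not line:
            break
        yield sse_event(line.decode().rstrip("\n"))
    rc = await proc.wait()
    yield sse_event(str(rc), event="exit")
```

In FastAPI such a generator would be wrapped in a `StreamingResponse` with media type `text/event-stream`; the sketch keeps only the streaming core.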

## Key Decisions / Learnings

- All long-running ops commands (promote, sync, rebuild) use SSE streaming — consistent with the existing backup/restore pattern. The `stream_ops_host()` helper is the standard interface.
- The registry is the single source of truth for project/environment/promote config. The dashboard reads it dynamically — no hardcoded project names in API routers.
- Promote direction validation lives in the project CLI (`sync.py`), not in the ops CLI or dashboard — this keeps enforcement close to the implementation.
- `ops rebuild` is the disaster recovery entry point: bring up containers → wait for healthy → restore latest backup. Simple, composable.
- `detect_env()` path parsing must handle the full `/opt/data/seriousletter/{env}/code/...` structure — scanning for VALID_ENVS after "data" in the path components is robust.
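
The `detect_env()` fix can be sketched as follows (the env names come from the sessions above; this is an illustration, not the exact SL code):

```python
VALID_ENVS = ("dev", "int", "prod")

def detect_env(script_path):
    """Derive the environment from the script's own location.

    Scan the path components for the first valid env name after the
    'data' component, so /opt/data/seriousletter/int/code/... yields
    'int' — and a project name like 'seriousletter' is never returned."""
    parts = script_path.split("/")
    try:
        start = parts.index("data") + 1
    except ValueError:
        start = 0
    for part in parts[start:]:
        if part in VALID_ENVS:
            return part
    raise ValueError(f"no environment found in path: {script_path}")
```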

## Files Changed

- `/opt/data/seriousletter/{dev,int,prod}/code/scripts/sync/sync.py` — fix `detect_env`, add `promote` command
- `Code/mdf-system/scripts/sync/sync.py` (local + deployed to dev) — fix `list_backups` indentation
- `/opt/infrastructure/servers/hetzner-vps/registry.yaml` — add `promote` config per project
- `/opt/infrastructure/ops` — add `cmd_promote`, `cmd_rebuild`
- `/opt/data/ops-dashboard/app/routers/promote.py` — new SSE promote endpoint
- `/opt/data/ops-dashboard/app/routers/sync_data.py` — new SSE sync endpoint
- `/opt/data/ops-dashboard/app/routers/registry.py` — new registry JSON endpoint
- `/opt/data/ops-dashboard/app/routers/rebuild.py` — new SSE rebuild endpoint
- `/opt/data/ops-dashboard/app/routers/backups.py` — dynamic project list from registry
- `/opt/data/ops-dashboard/app/main.py` — register the 4 new routers
- `/opt/data/ops-dashboard/static/js/app.js` — Operations page UI + SSE modal
- `/opt/data/ops-dashboard/static/index.html` — nav link + ops-modal HTML

## Next Steps (at time of pause)

- [ ] Test backup creation from the dashboard UI
- [ ] Test a full promote dry-run via the dashboard (Operations page)
- [ ] Test a sync dry-run via the dashboard
- [ ] Commit infrastructure and code repo changes on the server
- [ ] DNS cutover mdf-system.de → .ch
- [ ] Disaster recovery test (destroy + rebuild SL dev)

---

**Tags:** #Session #OpsDashboard #OpsCLI #Promote #Sync #Rebuild #Registry
# Session 0018: CLI Contract Spec, Sync Compliance, Dashboard Bidirectional UI

**Date:** 2026-02-22
**Status:** Completed
**Origin:** MDF Webseiten session 0033

---

## Work Done

- [x] Defined the project CLI contract (`infrastructure/docs/cli-contract.md`, 514 lines): 4 required commands (backup, restore, sync, promote), exact flags, exit codes, output format, a compliance checklist, and a minimal shell CLI example for new projects
- [x] MDF sync.py contract compliance: ANSI suppression (NO_COLOR env var + TTY detection), `--yes` flag for backup, 6 cancellation paths changed from exit 0 to exit 2, `[error]` prefix helper for stderr
- [x] SL sync.py contract compliance: ANSI suppression, `error_exit()` helper, backup now uses per-env subdirectories, absolute path output after backup
- [x] Ops CLI de-hardcoding: removed stale `/opt/data2` from disk checks and healthcheck.sh, generalized hardcoded MDF-specific comments, added a `find_registry()` multi-server comment
- [x] Disaster recovery docs: fixed `install.sh` (single-volume layout, auto-detection), fixed `bootstrap.sh` (network pre-creation, local image builds, restore instructions), wrote `docs/disaster-recovery.md` (10-phase runbook)
- [x] Dashboard JS fix: fixed syntax errors in the Operations page onclick handlers (nested quotes)
- [x] Permanent cache fix: content-hashed asset URLs so manual `?v=XX` bumps are no longer needed
- [x] Bidirectional sync UI: `prod ↔ dev` with a direction picker modal ("content flows down" / "content flows up")
- [x] Deployed to server: ops CLI, registry, healthcheck, install.sh, bootstrap.sh, both sync.py scripts (all 3 envs), dashboard rebuilt with content hashing
- [x] Verified: ops status, ops health, promote dry-run, restore --list, dashboard SSE streaming
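
The content-hashing approach boils down to deriving the asset name from its bytes, so the URL changes exactly when the content does. A sketch of the idea (function names hypothetical; the dashboard's actual build step may differ):

```python
import hashlib
import re

def hashed_name(name, content):
    """app.js + file bytes → app.<hash8>.js — the URL is a function of
    the content, so browsers refetch exactly when the file changes."""
    digest = hashlib.sha256(content).hexdigest()[:8]
    stem, ext = name.rsplit(".", 1)
    return f"{stem}.{digest}.{ext}"

def rewrite_html(html, read_bytes):
    """Rewrite js/css references in `html` to their content-hashed names.

    `read_bytes` maps an asset name to its bytes (injectable for tests)."""
    return re.sub(
        r"\b[\w-]+\.(?:js|css)\b",
        lambda m: hashed_name(m.group(0), read_bytes(m.group(0))),
        html,
    )
```

Unlike a `?v=XX` query parameter, the name itself changes, so no manual bump is needed and stale intermediary caches cannot serve the old body under the new URL.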

## Key Decisions / Learnings

- The CLI contract enforces: ANSI off via `NO_COLOR` or non-TTY detection; exit codes 0 (success), 1 (error), 2 (cancelled by user); `[error]` prefix on stderr; `--yes` flag to skip prompts in automation
- Cancellation paths must exit 2, not 0 — exit 0 was masking user-cancelled operations in the dashboard
- Content hashing (not version query params) is the correct long-term cache-busting solution
- `find_registry()` multi-server support is documented but not yet implemented — a placeholder for the future
- The DR runbook is 10 phases: verify backups → restore server → install deps → clone repo → restore data → start services → verify
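
The contract's output and exit-code rules fit in a few lines; a Python rendition (helper names illustrative, the rules themselves are the ones listed above):

```python
import os
import sys

# Exit codes mandated by the contract:
# 0 = success, 1 = error, 2 = cancelled by the user.
EXIT_OK, EXIT_ERROR, EXIT_CANCELLED = 0, 1, 2

def use_color(stream=sys.stdout, environ=os.environ):
    """ANSI output only on a TTY and only when NO_COLOR is unset —
    the suppression rule the contract requires."""
    return stream.isatty() and "NO_COLOR" not in environ

def error_line(msg):
    """Contract-mandated stderr format: '[error] <message>'."""
    return f"[error] {msg}"
```

A compliant CLI would print `error_line(...)` to stderr and `sys.exit(EXIT_CANCELLED)` on every user-cancelled path, so callers like the dashboard can distinguish cancellation from success.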

## Files Changed

- `infrastructure/docs/cli-contract.md` — new, 514 lines, defines the full CLI contract
- `infrastructure/docs/disaster-recovery.md` — new, 10-phase DR runbook
- `infrastructure/install.sh` — single-volume layout with auto-detection
- `infrastructure/bootstrap.sh` — network pre-creation, local image builds, restore instructions
- `infrastructure/ops` — removed `/opt/data2`, generalized hardcoded comments, `find_registry()` note
- `infrastructure/healthcheck.sh` — removed stale `/opt/data2` disk check
- `Code/mdf-system/scripts/sync/sync.py` — ANSI suppression, `--yes`, exit 2 cancellations, `[error]` helper
- `Code/seriousletter-sync/sync.py` — ANSI suppression, `error_exit()`, per-env backup dirs, absolute path output
- `Code/ops-dashboard/` — JS onclick fix, content-hashed assets, bidirectional sync UI

---

**Tags:** #Session #OpsToolkit #OpsDashboard #CliContract #DisasterRecovery
# Session 0019: Offsite Download Feature Added to Dashboard

**Date:** 2026-02-22
**Status:** Completed
**Origin:** MDF Webseiten session 0039

---

## Work Done

- [x] Added an offsite download feature to the ops dashboard: per-row download buttons on the Backups page plus action bar buttons
- [x] Offsite download uses SSE streaming (consistent with the existing backup/restore/upload patterns)
- [x] Updated the ops registry with Seafile services (adds ops-visible services to the status output)

## Key Decisions / Learnings

- Offsite download follows the same SSE streaming pattern as backup upload — consistency across all long-running operations
- Per-row buttons (individual file download) and action bar buttons (bulk/selected) are both supported

## Files Changed

- `Code/ops-dashboard/` — offsite download UI (per-row + action bar) with SSE streaming
- `infrastructure/servers/hetzner-vps/registry.yaml` — added Seafile services

---

**Tags:** #Session #OpsDashboard #Offsite #SSE
# Session 0020: Backup Coverage Audit, Registry Fixes, Container Resolution

**Date:** 2026-02-23
**Status:** Completed
**Origin:** MDF Webseiten session 0041

---

## Work Done

- [x] Fixed a ringsaday backup error: added `backup_sources` (volumes, keys, server, website, .env) and `backup` config to the registry; changed `backup_dir` to `/opt/data/backups/ringsaday`; fixed `_backup_generic()` — changed the `-d` test to `-e` so individual files (not just directories) can be backed up; tested: 689 MB backup created successfully
- [x] Full backup coverage audit: identified kioskpilot (1.3 MB) and ops-dashboard (1.5 MB) as missing backups
- [x] Added kioskpilot backup (03:45, 30-day retention)
- [x] Added ops-dashboard to the registry + nightly backup (04:15, 30-day retention)
- [x] Now 6 nightly backup timers: mdf, seriousletter, ringsaday, kioskpilot, ops-dashboard, coolify
- [x] Fixed ringsaday container resolution: it was showing duplicated entries in `ops status`
  - Added a `{prefix}-{env}-` matching pattern to `find_containers()` (handles ringsaday-dev-UUID style names)
  - Added ringsaday-website as a sub-service with `environments: [prod]`
- [x] Deployed registry.yaml and the ops CLI to the server; 6 systemd timers active; backup dirs created

## Key Decisions / Learnings

- `_backup_generic()` used `-d` (directory test), which silently skipped individual files like `.env` and SSL keys — changing it to `-e` (existence test) makes it handle both files and directories
- Container naming for ringsaday uses `{prefix}-{env}-UUID` (Coolify-managed), different from the other projects — `find_containers()` needed a second pattern to match these
- ops-dashboard itself must be backed up — it holds its own config and data, and is easy to overlook
- A backup coverage audit should be a recurring check whenever new projects are added
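
The resolution rule can be sketched in Python (the real `find_containers()` is bash in the ops CLI; this shows only the anchored matching and de-duplication logic):

```python
import re

def find_containers(names, prefix, env):
    """Containers for (prefix, env): one anchored pattern matches both
    compose-style names like `ringsaday-dev` and Coolify-style
    `ringsaday-dev-<uuid>` names, and each name is returned once.

    The `(-|$)` anchor is what keeps `ringsaday-devtools` from matching
    the `dev` environment."""
    pattern = re.compile(rf"^{re.escape(prefix)}-{re.escape(env)}(-|$)")
    found = []
    for name in names:
        if pattern.match(name) and name not in found:
            found.append(name)
    return found
```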

## Files Changed

- `infrastructure/servers/hetzner-vps/registry.yaml` — kioskpilot backup, ops-dashboard entry, ringsaday website sub-service, ringsaday backup_sources
- `infrastructure/ops` — `_backup_generic()` `-d` → `-e` fix, `find_containers()` new UUID-style pattern

---

**Tags:** #Session #OpsToolkit #Backup #Registry #ContainerResolution
# Session 0021: Rebuild.py Coolify-Only Lifecycle, SSE Keepalive, Traefik Flush

**Date:** 2026-02-23
**Status:** Completed
**Origin:** MDF Webseiten session 0044 (part 1)

---

## Work Done

- [x] rebuild.py — removed all docker compose fallbacks; recreate is now Coolify stop → wipe → Coolify start; rebuild is Coolify stop → docker build → Coolify start; restart stays as `docker restart` (Coolify restart prunes local images — an intentional exception)
- [x] Fixed the build step: changed from `docker compose --profile {env} build` (requires all Coolify env vars) to `docker build -t {image}:{env} {context}` using the registry's `build_context` and `image_name` directly — no env vars needed
- [x] Added `_coolify_start_with_retry()`: polls for 60s after the API call, retries up to 3 times — handles Coolify silently dropping start requests
- [x] Container stabilization polling: `_poll_until_running` now waits for the container count to be stable for 2 consecutive polls (10s) before declaring success — previously it returned success on the first container appearance
- [x] "Already running/stopped" handling: a Coolify API HTTP 400 with that message is now treated as success, not an error
- [x] SSE keepalive for restore: restore connections were dropping during DB import (~60s of silence); added a `_stream_with_keepalive()` wrapper in `restore.py` — sends the SSE comment `: keepalive` every 15s
- [x] Added `responseForwarding.flushInterval: "-1"` to the ops-dashboard Traefik dynamic config — Traefik was buffering SSE responses, causing keepalives to not reach the client
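
The stabilization rule ("stable for 2 consecutive polls") is easy to get wrong; a sketch with an injectable clock and sleep so the logic is testable without real containers (names hypothetical, mirroring the `_poll_until_running` change described above):

```python
import time

def poll_until_stable(count_containers, expected, interval=5.0, timeout=60.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Succeed only after the container count equals `expected` for two
    consecutive polls. A single matching poll can catch containers that
    are still being created or are about to crash; requiring a streak
    of 2 filters out that transient state."""
    deadline = clock() + timeout
    streak = 0
    while clock() < deadline:
        streak = streak + 1 if count_containers() == expected else 0
        if streak >= 2:
            return True
        sleep(interval)
    return False
```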

## Key Decisions / Learnings

- Coolify `restart` prunes locally-built images — `docker restart` (bypassing Coolify) is the correct approach for services with local images; this is a documented exception in rebuild.py
- Coolify can silently queue-and-never-execute start requests — retry logic with polling is mandatory, not optional
- "Already running" from the Coolify API is a valid state (idempotent), not an error — treat HTTP 400 with that message as success
- SSE keepalive must happen at the application level (the `: keepalive` comment) AND Traefik must be configured to flush immediately (`flushInterval: "-1"`) — both are required; one alone is not enough
- Stable polling (2 consecutive matching counts) is more reliable than "at least one container appeared"
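
The keepalive wrapper pattern: race the next upstream event against a timer and emit an SSE comment line whenever the timer wins. A sketch of the `_stream_with_keepalive()` idea (the production interval is 15s; parameters here are illustrative):

```python
import asyncio

async def stream_with_keepalive(source, interval=15.0):
    """Wrap an async SSE event stream so that a `: keepalive` comment is
    emitted whenever `source` is silent for `interval` seconds — comment
    lines are ignored by SSE clients but keep proxies from dropping the
    connection during long silent phases like a DB import."""
    it = source.__aiter__()
    next_event = asyncio.ensure_future(it.__anext__())
    while True:
        try:
            # shield() keeps the pending __anext__ alive across timeouts.
            event = await asyncio.wait_for(asyncio.shield(next_event), interval)
        except asyncio.TimeoutError:
            yield ": keepalive\n\n"
            continue
        except StopAsyncIteration:
            return
        yield event
        next_event = asyncio.ensure_future(it.__anext__())
```

Note that this only works end to end if the proxy actually flushes each chunk, which is what the Traefik `flushInterval: "-1"` setting above provides.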

## Files Changed

- `Code/ops-dashboard/app/routers/rebuild.py` — Coolify-only lifecycle, `docker build` from registry config, `_coolify_start_with_retry()`, stable container polling, HTTP 400 success handling
- `Code/ops-dashboard/app/routers/restore.py` — `_stream_with_keepalive()` SSE keepalive wrapper
- Server: `/data/coolify/proxy/dynamic/ops-dashboard.yaml` — added `responseForwarding.flushInterval: "-1"`

---

**Tags:** #Session #OpsDashboard #Rebuild #SSE #Traefik #Coolify
# Session 0022: Post-Coolify Architecture Context for Ops Toolkit

**Date:** 2026-02-23
**Status:** Completed
**Origin:** MDF Webseiten session 0044 (Coolify Removal Complete)

---

## Work Done

- [x] Coolify fully removed from the server (6 containers, 18 UUID networks, the /data/coolify/ directory)
- [x] Standalone Traefik v3.6 confirmed as the proxy layer (was coolify-proxy, now independent at /opt/data/traefik/)
- [x] All 28 containers verified operational post-removal; 17/17 domains tested
- [x] Dynamic configs migrated: seriousletter.yaml and ringsaday.yaml moved to /opt/data/traefik/dynamic/
- [x] SSL certificates preserved: acme.json migrated to /opt/data/traefik/acme.json
- [x] Coolify archive retained: /opt/data/backups/coolify-final-20260223.tar.gz (125KB, 30-day window)

## Key Decisions / Learnings

- **The ops toolkit no longer depends on the Coolify API** — all lifecycle management (start/stop/rebuild/recreate) must use the Docker CLI and docker compose directly against the project compose files at `/opt/data/{project}/`
- **Container naming is now clean** — no more UUID suffixes. Pattern: `{env}-{project}-{service}` (e.g. `prod-mdf-wordpress`, `dev-seriousletter-backend`)
- **The proxy network is `proxy`** (replaces the old `coolify` network) — all Traefik-exposed containers connect to it
- **Project descriptors at `/opt/data/{project}/project.yaml`** are the new source of truth for container config — registry.yaml is deprecated (used only by gen-timers and the schedule PUT)
- **Docker provider + file provider** coexist in Traefik: MDF services use Docker labels; SeriousLetter, RingsADay, and KioskPilot use file provider configs
- metro.ringsaday.com returns 502 — a pre-existing issue unrelated to the Coolify removal (no metro service in the compose file)
- Docker system cleanup freed ~9GB of unused images and volumes during removal

## Architecture Reference (Post-Coolify)

```
Proxy:          Traefik v3.6 at /opt/data/traefik/
Config:         traefik.yaml (static), dynamic/ (file provider)
Certs:          /opt/data/traefik/acme.json
Proxy network:  proxy

Projects:
  MDF prod:      /opt/data/mdf/prod/ — WordPress, MySQL, Mail, PostfixAdmin, Roundcube, Seafile
  MDF int/dev:   /opt/data/mdf/{int,dev}/ — WordPress + MySQL
  SeriousLetter: /opt/data/seriousletter/{dev,int,prod}/
  RingsADay:     /opt/data/ringsaday/
  KioskPilot:    /opt/data/kioskpilot/
  Ops Dashboard: /opt/data/ops-dashboard/
```

## Files Changed

- Server: `/data/coolify/` — deleted (backed up first)
- Server: `/opt/data/traefik/dynamic/` — received the migrated seriousletter.yaml and ringsaday.yaml

---

**Tags:** #Session #OpsToolkit #Architecture #Traefik #PostCoolify
# Session 0023: Toolkit Bootstrap Starting Point

**Date:** 2026-02-23
**Status:** Completed
**Origin:** MDF Webseiten session 0045

---

## Work Done

- [x] Created `project.yaml` descriptors for all 5 projects (mdf, seriousletter, ringsaday, kioskpilot, ops-dashboard)
- [x] Updated the `ops-dashboard` docker-compose.yaml: network `coolify` → `proxy`
- [x] Added an Alpine pre-pull with retry (4 attempts, 15s delays) to `rebuild.py` — note: this was a pre-redesign patch, superseded by the Phase 5 rewrite in session 0046
- [x] Added image verification after build to `rebuild.py`
- [x] Identified Phase 3+4 toolkit work as the next immediate task (it was interrupted this session)

## Context / Background

This session was primarily about removing Coolify and migrating all projects to standalone Docker Compose. The OPS-relevant outcome is:

- All 5 `project.yaml` descriptors now exist and are the source of truth for the toolkit
- The `proxy` Docker network replaces the old `coolify` network — all Traefik-exposed containers connect to it
- The toolkit build (Phase 3+4) was planned but interrupted mid-session — completed in session 0046
- The plan was documented at `Notes/swarm/plan.md` (since cleaned up)

## Key Decisions / Learnings

- `container_prefix` in `project.yaml` uses an `{env}` placeholder (e.g. `"{env}-mdf"`) — the toolkit must expand this at runtime
- SeriousLetter uses `"{env}-seriousletter"` as its prefix (not `sl`)
- ops-dashboard gets its own `project.yaml` like all the other projects
| 31 | + |
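The `{env}` expansion described above can be sketched as follows (function and parameter names are illustrative; the real toolkit code may differ):

```python
def container_prefix_for(prefix_template: str, env: str) -> str:
    """Expand the {env} placeholder from a project.yaml container_prefix."""
    return prefix_template.replace("{env}", env)


def matching_containers(names: list[str], prefix_template: str, env: str) -> list[str]:
    """Return container names matching the expanded '{prefix}-*' pattern."""
    prefix = container_prefix_for(prefix_template, env) + "-"
    return [n for n in names if n.startswith(prefix)]
```

With the MDF descriptor's `"{env}-mdf"` template, `container_prefix_for(..., "prod")` yields `prod-mdf`, so `prod-mdf-wordpress` matches while `int-mdf-wordpress` does not.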
## Files Changed

- `/opt/data/mdf/project.yaml` — created
- `/opt/data/seriousletter/project.yaml` — created
- `/opt/data/ringsaday/project.yaml` — created
- `/opt/data/kioskpilot/project.yaml` — created
- `/opt/data/ops-dashboard/project.yaml` — created
- `/opt/data/ops-dashboard/docker-compose.yml` — network coolify→proxy
- `app/routers/rebuild.py` — Alpine retry + image verify (pre-redesign, superseded)

---

**Tags:** #Session #OpsToolkit #Infrastructure

# Session 0024: Toolkit and CLI Rewrite, Dashboard Migration

**Date:** 2026-02-23
**Status:** Completed
**Origin:** MDF Webseiten session 0046

---

## Work Done

### Phase 3: Shared Toolkit

- [x] Completed the 5 missing toolkit modules at `/opt/infrastructure/toolkit/`:
  - `cli.py` — main CLI entry point with all commands (status, start, stop, build, rebuild, destroy, backup, restore, sync, promote, logs, health, disk, backups, offsite, gen-timers, init)
  - `output.py` — formatted output (Rich tables, JSON mode, plain-text fallback)
  - `restore.py` — restore operations with CLI delegation support
  - `sync.py` — data sync between environments with CLI delegation
  - `promote.py` — code promotion (git, rsync, script) with adjacency enforcement
- [x] 7 modules already existed from prior sessions: `__init__.py`, `descriptor.py`, `docker.py`, `backup.py`, `database.py`, `health.py`, `discovery.py`

### Phase 4: Ops CLI Rewrite

- [x] Replaced the 950-line bash ops CLI with a 7-line bash shim → `python3 -m toolkit.cli`
- [x] Old CLI backed up as `ops.bak.20260223`
- [x] New commands added: `start`, `stop`, `build`, `destroy`, `logs`, `restart`, `init`
- [x] All commands read from `project.yaml` descriptors — no `registry.yaml` dependency
- [x] Container prefix matching fixed: handles `{env}` placeholder expansion in `container_prefix`

### Phase 5: Dashboard Adaptation

- [x] Rewrote 4 dashboard routers to use project.yaml:
  - `registry.py` — imports `toolkit.discovery.all_projects()` instead of parsing registry.yaml
  - `services.py` — uses `toolkit.descriptor.find()` for container name resolution
  - `rebuild.py` — major rewrite: 707 → 348 lines, removed all Coolify API code, uses docker compose directly
  - `schedule.py` — reads from descriptors for GET, still writes to registry.yaml for PUT (gen-timers compatibility)
- [x] Verified all API endpoints working:
  - `/api/registry/` — returns all 5 projects from descriptors
  - `/api/status/` — shows 25 containers
  - `/api/schedule/` — shows backup schedules for all 5 projects
  - `/api/services/logs/mdf/prod/wordpress` — correctly resolves the container name

## Key Decisions / Learnings

- `rebuild.py` now uses a `_compose_cmd()` helper that finds the compose file (.yaml/.yml) and env file (.env.{env}/.env) and adds `--profile {env}` — removes all Coolify API dependency
- The dashboard container has `/opt/infrastructure` mounted → it can import the toolkit directly via Python
- pyyaml 6.0.3 confirmed available in the dashboard container
- `schedule.py` still writes to `registry.yaml` for PUT/gen-timers — full descriptor migration is a future task
- `container_prefix_for(env)` expands `{env}` in the prefix, then matches `{prefix}-*` containers

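The compose-command resolution described above can be sketched roughly like this (a simplified stand-in for the real `_compose_cmd()` in `rebuild.py`; file-preference order and flag handling are assumptions):

```python
from pathlib import Path


def compose_cmd(project_dir: str, env: str) -> list[str]:
    """Build a `docker compose` argv for one project/env: prefer
    docker-compose.yaml over .yml and .env.{env} over the generic .env."""
    root = Path(project_dir)
    compose = next(
        (root / name for name in ("docker-compose.yaml", "docker-compose.yml")
         if (root / name).exists()),
        root / "docker-compose.yaml",  # missing file: let compose report the error
    )
    env_file = root / f".env.{env}"
    if not env_file.exists():
        env_file = root / ".env"
    cmd = ["docker", "compose", "-f", str(compose), "--profile", env]
    if env_file.exists():
        cmd += ["--env-file", str(env_file)]
    return cmd
```

Callers would append the subcommand, e.g. `compose_cmd("/opt/data/mdf/prod", "prod") + ["up", "-d", "--build"]`.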
## Files Changed

- `/opt/infrastructure/toolkit/cli.py` — new (all CLI commands)
- `/opt/infrastructure/toolkit/output.py` — new (Rich/JSON/plain output)
- `/opt/infrastructure/toolkit/restore.py` — new
- `/opt/infrastructure/toolkit/sync.py` — new
- `/opt/infrastructure/toolkit/promote.py` — new
- `/usr/local/bin/ops` — rewritten as a 7-line bash shim
- `app/routers/registry.py` — uses toolkit.discovery
- `app/routers/services.py` — uses toolkit.descriptor
- `app/routers/rebuild.py` — 707→348 lines, Coolify removed
- `app/routers/schedule.py` — descriptor-backed GET

---

**Tags:** #Session #OpsToolkit #OpsCLI #OpsDashboard

# Session 0025: Dashboard Bugs and SL Routing Fixes

**Date:** 2026-02-24
**Status:** Completed
**Origin:** MDF Webseiten session 0048 (Part 2 only — DNS cutover and mail recovery sections skipped)

---

## Work Done

### Operations Page: Recreate Replaced by Backup + Restore

- [x] Removed the "Recreate" lifecycle action (redundant with Rebuild for bind-mount projects)
- [x] Added a **Backup** button (blue): opens the lifecycle modal with SSE streaming to `/api/backups/stream/{project}/{env}`
- [x] Added a **Restore** button (purple): navigates to the Backups page at drill level 2 for that project/env
- [x] Added cache invalidation on backup success

### SeriousLetter Bad Gateway Fix

- [x] Diagnosed the root cause: SL containers were only on `seriousletter-network`, not on the `proxy` network Traefik uses
- [x] Permanent fix: added the `proxy` network to docker-compose.yaml for all 3 SL envs (prod/int/dev)
  - `backend` and `frontend` services get `proxy` in their networks list
  - `proxy: external: true` added to the networks section
- [x] Added health checks for both services:
  - Backend: `python3 -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/docs')"`
  - Frontend: `wget --spider -q http://127.0.0.1:3000/` (explicit `127.0.0.1`, not `localhost` — Alpine resolves `localhost` to IPv6 `::1`)

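In the compose files, the health checks above might look roughly like this (illustrative fragment; intervals, timeouts, and retries are assumptions, only the check commands and ports come from the note):

```yaml
services:
  backend:
    healthcheck:
      test: ["CMD", "python3", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/docs')"]
      interval: 30s
      timeout: 5s
      retries: 3
    networks:
      - default
      - proxy
  frontend:
    healthcheck:
      # 127.0.0.1, not localhost: Alpine resolves localhost to ::1 (IPv6),
      # and a service bound only to IPv4 0.0.0.0 would never answer
      test: ["CMD", "wget", "--spider", "-q", "http://127.0.0.1:3000/"]
      interval: 30s
      timeout: 5s
      retries: 3
    networks:
      - default
      - proxy

networks:
  proxy:
    external: true
```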
### Sync Routing Bug Fix

- [x] Fixed the sync section showing only MDF (not SeriousLetter)
- [x] Root cause (two-part):
  1. `registry.py` had `desc.sync.get("type") == "cli"` — SL had `sync.type: toolkit`, so this evaluated to `False`
  2. SL's `toolkit` type was itself wrong — it should be `cli` with a CLI path
- [x] Fix in `registry.py`: `"has_cli": desc.sync.get("type") == "cli"` → `"has_cli": bool(desc.sync.get("type"))`
- [x] Fix in `/opt/data/seriousletter/project.yaml`: `sync.type: toolkit` → `type: cli` with a `cli:` path

### Backup Date Inconsistency Fix

- [x] Fixed the overview card showing a stale "INT Latest" date while the drill-down showed the correct newer backups
- [x] Root cause: string comparison between incompatible date formats:
  - Compact (MDF CLI): `20260220_195300`
  - ISO (toolkit): `2026-02-24T03:00:42`
  - Since `'0' > '-'` in ASCII, compact dates always "won" the `>` comparison
- [x] Fix: added a `normalizeBackupDate()` function to convert all dates to ISO format at merge time in `mergeBackups()`

## Key Decisions / Learnings

- When adding a container to a new network, an ad-hoc `docker network connect` is lost on restart — the fix must go in the compose file
- Alpine resolves `localhost` to `::1` (IPv6). Services binding only IPv4 `0.0.0.0` won't respond. Use `127.0.0.1` explicitly in health checks.
- For the `has_cli` logic: any truthy `sync.type` value means the project has ops CLI support — don't compare to a specific string
- Date normalization must happen at merge time, not display time, so that `max()` comparisons are correct

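The actual `normalizeBackupDate()` fix lives in `static/js/app.js`; the logic it implements can be sketched in Python as (the compact-format layout is taken from the examples above, everything else is illustrative):

```python
import re


def normalize_backup_date(value: str) -> str:
    """Convert a compact timestamp like '20260220_195300' to ISO
    '2026-02-20T19:53:00'; pass already-ISO values through unchanged."""
    m = re.fullmatch(r"(\d{4})(\d{2})(\d{2})_(\d{2})(\d{2})(\d{2})", value)
    if m:
        y, mo, d, h, mi, s = m.groups()
        return f"{y}-{mo}-{d}T{h}:{mi}:{s}"
    return value


# Once normalized, plain lexicographic comparison picks the genuinely
# newest backup instead of letting '0' > '-' decide.
latest = max(
    normalize_backup_date(v)
    for v in ("20260220_195300", "2026-02-24T03:00:42")
)
```

Without normalization, `max("20260220_195300", "2026-02-24T03:00:42")` returns the compact (older) value, which is exactly the stale-card bug described above.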
## Files Changed

- `static/js/app.js` — removed recreate modal/handler, added backup modal, URL routing for restore button, cache invalidation, `normalizeBackupDate()` + `mergeBackups()` fix
- `app/routers/registry.py` — `has_cli` logic fix
- `/opt/data/seriousletter/project.yaml` — `sync.type` corrected
- `/opt/data/seriousletter/{prod,int,dev}/code/docker-compose.yaml` — proxy network + health checks

---

**Tags:** #Session #OpsDashboard #BugFix

# Session 0026: Persistent Jobs and Container Terminal

**Date:** 2026-02-25
**Status:** Completed
**Origin:** MDF Webseiten session 0053

---

## Work Done

### Feature 1: Persistent/Reconnectable Jobs

- [x] New `app/job_store.py` — in-memory job store that decouples the subprocess from the SSE connection
- [x] New `app/routers/jobs.py` — job management endpoints
- [x] New endpoints: `GET /api/jobs/`, `GET /api/jobs/{op_id}`, `GET /api/jobs/{op_id}/stream?from=N`
- [x] Added `run_job()` to `ops_runner.py` — runs the subprocess writing to the job store; NOT killed on browser disconnect
- [x] Added `job_sse_stream()` to `job_store.py` — shared SSE wrapper with keepalive
- [x] Rewrote 6 routers to use the job store pattern: backups.py, restore.py, sync_data.py, promote.py, rebuild.py, schedule.py
- [x] All routers follow the pattern: `create_job()` → `asyncio.create_task(run_job())` → `return StreamingResponse(job_sse_stream())`
- [x] Background cleanup task removes expired jobs every 5 minutes (1-hour TTL)
- [x] Frontend: auto-reconnect on SSE error via `/api/jobs/{op_id}/stream?from=N` (3 retries)
- [x] Frontend: checks for running jobs on page load, shows a reconnect banner

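The core of the job-store pattern can be sketched as follows (a minimal in-memory model; the real `job_store.py` additionally handles async streaming, keepalives, status metadata, and TTL cleanup):

```python
import itertools


class JobStore:
    """In-memory store: a job's output lines outlive any one SSE connection,
    so a browser disconnect never kills or loses a running operation."""

    def __init__(self) -> None:
        self._jobs: dict[str, dict] = {}
        self._ids = itertools.count(1)

    def create_job(self, kind: str) -> str:
        op_id = f"{kind}-{next(self._ids)}"
        self._jobs[op_id] = {"status": "running", "lines": []}
        return op_id

    def append(self, op_id: str, line: str) -> None:
        self._jobs[op_id]["lines"].append(line)

    def finish(self, op_id: str, status: str = "done") -> None:
        self._jobs[op_id]["status"] = status

    def stream_from(self, op_id: str, start: int = 0) -> list[str]:
        """Replay output from any position — this is what a reconnecting
        client uses via the ?from=N query parameter."""
        return self._jobs[op_id]["lines"][start:]
```

The subprocess writes via `append()` regardless of whether anyone is watching; an SSE handler just reads `stream_from(op_id, n)` starting at the client's last-seen line index.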
### Feature 2: Container Terminal

- [x] New `app/routers/terminal.py` — WebSocket endpoint with a PTY via `docker exec`
- [x] Protocol: `{"type":"input","data":"..."}` / `{"type":"resize","cols":80,"rows":24}` / `{"type":"output","data":"..."}`
- [x] Frontend: xterm.js 5.5.0 + addon-fit from CDN, terminal modal, Console button on the services page
- [x] Security: token auth, container name validation (regex allowlist), running check via docker inspect

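The container-name allowlist mentioned above could be as simple as the following (an assumed pattern based on Docker's naming rules; the real check in `terminal.py` may be stricter):

```python
import re

# Docker container names start with an alphanumeric and then allow only
# alphanumerics, underscore, dot, and hyphen. Rejecting everything else
# means a name can never smuggle shell metacharacters into `docker exec`.
_NAME_RE = re.compile(r"[a-zA-Z0-9][a-zA-Z0-9_.-]{0,127}")


def valid_container_name(name: str) -> bool:
    return _NAME_RE.fullmatch(name) is not None
```

Passing the validated name as a separate argv element (never through a shell string) closes the injection path entirely.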
### Fixes Applied

- [x] Restored bidirectional sync pairs in `sync_data.py` (regression from the engineer rewrite)
- [x] Restored multi-compose support in `rebuild.py` (`_all_compose_dirs`, `_compose_cmd_for` for Seafile)
- [x] Updated `main.py` with the jobs + terminal routers and the cleanup task in lifespan
- [x] Bumped APP_VERSION to v15-20260225
- [x] Also committed + pushed the `sync_data.py` bidirectional fix (git commit 31ac43f) and stabilization checks

## Key Decisions / Learnings

- Decoupling the subprocess from SSE via a job store is the correct pattern — a browser disconnect should never kill a running backup/restore
- The job store is in-memory (not persisted) — a server restart loses job history, which is acceptable
- xterm.js from CDN (not bundled) keeps the container image lean
- Container name validation via a regex allowlist prevents command injection through the WebSocket terminal endpoint
- The `from=N` query param on the stream endpoint enables replay from any position — the client tracks the last received line index

## Files Changed

- `app/job_store.py` — new (315 lines)
- `app/routers/jobs.py` — new (186 lines)
- `app/routers/terminal.py` — new (287 lines)
- `app/ops_runner.py` — added `run_job()` (388 lines total)
- `app/main.py` — added routers + cleanup task (138 lines)
- `app/routers/backups.py` — job store integration (287 lines)
- `app/routers/restore.py` — job store integration (290 lines)
- `app/routers/sync_data.py` — job store + bidirectional fix (71 lines)
- `app/routers/promote.py` — job store integration (69 lines)
- `app/routers/rebuild.py` — job store + multi-compose (365 lines)
- `static/js/app.js` — v15: reconnect + terminal (2355 lines)
- `static/index.html` — xterm.js CDN + terminal modal
- `static/css/style.css` — terminal styles

## State at Session End

Code written locally at `/Users/i052341/Daten/Cloud/08 - Others/MDF/Infrastruktur/Code/ops-dashboard/`. Not yet deployed to the server at the time of note creation. Deploy + verification is the next session's starting task.

---

**Tags:** #Session #OpsDashboard #PersistentJobs #Terminal