15 files added
changed files
Notes/2026/02/0013 - 2026-02-20 - Infrastructure Repo & Ops CLI Bootstrap.md
Notes/2026/02/0014 - 2026-02-20 - Registry Naming & Backup System.md
Notes/2026/02/0015 - 2026-02-22 - Offsite Backup Dashboard Fix & Status Format.md
Notes/2026/02/0016 - 2026-02-22 - Backup Drill-Down Redesign & Restore Fix.md
Notes/2026/02/0017 - 2026-02-22 - Modular Sync Promote Rebuild Architecture.md
Notes/2026/02/0018 - 2026-02-22 - CLI Contract Spec, Sync Compliance, Dashboard Bidirectional UI.md
Notes/2026/02/0019 - 2026-02-22 - Offsite Download Feature Added to Dashboard.md
Notes/2026/02/0020 - 2026-02-23 - Backup Coverage Audit, Registry Fixes, Container Resolution.md
Notes/2026/02/0021 - 2026-02-23 - Rebuild.py Coolify-Only Lifecycle, SSE Keepalive, Traefik Flush.md
Notes/2026/02/0022 - 2026-02-23 - Post-Coolify Architecture Context for Ops Toolkit.md
Notes/2026/02/0023 - 2026-02-23 - Toolkit Bootstrap Starting Point.md
Notes/2026/02/0024 - 2026-02-23 - Toolkit and CLI Rewrite and Dashboard Migration.md
Notes/2026/02/0025 - 2026-02-24 - Dashboard Bugs and SL Routing Fixes.md
Notes/2026/02/0026 - 2026-02-25 - Persistent Jobs and Container Terminal.md
Notes/2026/02/0027 - 2026-02-26 - Dynamic Backup Buttons & TEKMidian Registration.md
Notes/2026/02/0013 - 2026-02-20 - Infrastructure Repo & Ops CLI Bootstrap.md
# Session 0013: Infrastructure Repo & Ops CLI Bootstrap

**Date:** 2026-02-20
**Status:** Completed
**Origin:** MDF Webseiten session 0018

---

## Work Done

- [x] Created infrastructure repo at `git.mnsoft.org/git/APPS/infrastructure.git`
- [x] Local clone: `/Users/i052341/Daten/Cloud/08 - Others/MDF/Infrastruktur/Code/infrastructure/`
- [x] Server clone: `/opt/infrastructure/`
- [x] Wrote `ops` CLI (bash, ~250 lines) — symlinked to `/usr/local/bin/ops`
- [x] Created `servers/hetzner-vps/registry.yaml` — single source of truth for 5 projects
- [x] Captured 5 Traefik dynamic configs from server into git
- [x] Wrote `monitoring/healthcheck.sh` — container health + disk checks → ntfy
- [x] Installed `ops-healthcheck.timer` (every 5 min) on server
- [x] Added Docker labels (`ops.project`, `ops.environment`, `ops.service`) to all MDF compose files
- [x] Replaced hardcoded `container_name()` in `sync.py` with label-based discovery + UUID suffix fallback
- [x] Verified: `ops status`, `ops health`, `ops disk`, `ops backup mdf prod` all working

## Repo Structure Created

```
infrastructure/
├── ops                        # The ops CLI (bash)
├── servers/hetzner-vps/
│   ├── registry.yaml          # 5 projects defined
│   ├── traefik/dynamic/       # Traefik configs captured
│   ├── bootstrap/             # Coolify service payloads
│   ├── scaffolding/           # Shell aliases, SSH hardening, venv setup
│   ├── systemd/               # 6 timer/service units
│   └── install.sh             # Full fresh server setup script
├── monitoring/
│   ├── healthcheck.sh
│   ├── ops-healthcheck.service
│   └── ops-healthcheck.timer
└── docs/architecture.md
```

## Key Decisions / Learnings

- `ops` CLI uses `SCRIPT_DIR` with `readlink -f` for symlink-safe path resolution
- `registry.yaml` uses a `name_prefix` field; container matching uses `grep` with word anchoring to prevent substring false matches
- Label-based discovery is primary; prefix search on the Coolify UUID-suffixed name is the fallback
- Docker labels added to compose files do not take effect on running containers until they are restarted — noted as a gap

## Files Changed

- `/opt/infrastructure/ops` — new ops CLI (bash)
- `/opt/infrastructure/servers/hetzner-vps/registry.yaml` — new registry
- `/opt/infrastructure/monitoring/healthcheck.sh` — new healthcheck script
- `Code/mdf-system/docker-compose.yaml` — added `ops.*` Docker labels
- `Code/mdf-system/scripts/sync/sync.py` — label-based container discovery, domain map fix

---

**Tags:** #Session #OpsCLI #Infrastructure
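The label-first, prefix-fallback resolution described in this session can be sketched as a small pure function. This is illustrative only — the function name, arguments, and the shape of the container data are assumptions, not the actual `sync.py` code:

```python
def resolve_container(containers, project, environment, prefix):
    """containers: list of (name, labels-dict) pairs, e.g. collected from
    `docker ps` output. Returns the matching container name or None."""
    # Primary: match the ops.* labels added to the compose files.
    for name, labels in containers:
        if (labels.get("ops.project") == project
                and labels.get("ops.environment") == environment):
            return name
    # Fallback: Coolify appends a UUID suffix, so match on the name prefix.
    # Anchoring on a trailing '-' avoids substring false matches
    # (the same concern as the grep word anchoring in the registry matching).
    for name, _labels in containers:
        if name == prefix or name.startswith(prefix + "-"):
            return name
    return None
```

The two-step order matters: labels are authoritative once containers have been restarted with them; the prefix search covers containers still running without labels.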
Notes/2026/02/0014 - 2026-02-20 - Registry Naming & Backup System.md
# Session 0014: Registry Naming & Backup System

**Date:** 2026-02-20
**Status:** Completed
**Origin:** MDF Webseiten session 0019

---

## Work Done

- [x] Fixed `sl-website` registry placement — moved under `seriousletter.services.website` to resolve prefix collision
- [x] Renamed all 7 Coolify services to consistent `{project}-{env/purpose}` lowercase naming
- [x] Deleted stale stopped MDF Dev duplicate from Coolify (UUID: qw8wso0ckskccoo0kcog84c0)
- [x] Fixed `ops backup/restore/sync` argument validation (was crashing on unbound variable)
- [x] Fixed SL CLI path in `registry.yaml` (pointed to wrong location)
- [x] Added `container_name()` to SL `sync.py` with label + prefix fallback (mirrors MDF pattern)
- [x] Made `ops backup <project>` work without env arg (passes `--all` to CLI)
- [x] Added backup summary to `ops status` — latest backup per project/env, size, age with color coding
- [x] Consolidated backup dirs to `/opt/data/backups/{project}/{env}/` across all projects
- [x] Updated both MDF and SL CLIs for per-env backup subdirectory structure
- [x] Volume consolidation: all data migrated from 10GB to 50GB volume at `/opt/data`
- [x] Updated all path references across compose files, CLIs, systemd units, registry, ops CLI

## Key Decisions / Learnings

- Registry was initially ambiguous about where `sl-website` lived — prefix collision with other SL services caused matching bugs. Moving it under a `services.website` key made the prefix unique.
- Per-env backup subdirs (`/opt/data/backups/{project}/{env}/`) are the correct structure — flat dirs were the source of orphaned files.
- `ops backup <project>` without env should be a valid shorthand — it delegates `--all` to the project CLI rather than requiring explicit env arg.
- Container name resolution logic must be identical across project CLIs — label-based primary, prefix fallback secondary. Divergence causes mysterious "container not found" bugs.
- Old 10GB volume was kept mounted during migration to avoid cwd-in-mountpoint issues during `umount`.

## Files Changed

- `/opt/infrastructure/servers/hetzner-vps/registry.yaml` — fixed sl-website placement, naming consistency
- `/opt/infrastructure/ops` — fixed arg validation, `cmd_backup` without env, backup summary in status
- `/opt/data/seriousletter/{dev,int,prod}/code/scripts/sync/sync.py` — added `container_name()` with fallback
- `Code/mdf-system/scripts/sync/sync.py` — per-env backup subdirectory paths
- All compose files, systemd units — `/opt/data2` → `/opt/data` path updates

---

**Tags:** #Session #OpsCLI #BackupSystem #Registry
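The "latest backup per project/env" summary added to `ops status` reduces to a small grouping step. This is a hypothetical sketch of that selection logic, not the actual `ops` implementation; the tuple shape is an assumption:

```python
def latest_per_env(backups):
    """backups: iterable of (project, env, filename, mtime) tuples.
    Returns {(project, env): (filename, mtime)} keeping only the newest
    backup for each project/env pair."""
    latest = {}
    for project, env, name, mtime in backups:
        key = (project, env)
        # Keep the entry with the highest modification time per key.
        if key not in latest or mtime > latest[key][1]:
            latest[key] = (name, mtime)
    return latest
```

Age-based color coding then becomes a simple threshold on `now - mtime` per entry.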
Notes/2026/02/0015 - 2026-02-22 - Offsite Backup Dashboard Fix & Status Format.md
# Session 0015: Offsite Backup Dashboard Fix & Status Format

**Date:** 2026-02-22
**Status:** Completed
**Origin:** MDF Webseiten session 0025

---

## Work Done

- [x] Fixed offsite backups not showing in ops dashboard
  - `/api/backups/offsite` was calling `run_ops_json()` (in-container execution) but `ops offsite list` requires the host Python venv
  - Added `run_ops_host_json()` helper to `ops_runner.py` using `nsenter`-based host execution
  - Updated `backups.py` router to use `run_ops_host_json()` for offsite listing
  - Rebuilt and restarted ops-dashboard container
- [x] Reformatted backup list in `ops status` CLI output
  - Changed from flat table sorted by project to date-grouped boxes
  - Each date gets its own Rich table: project / env / time / size / total columns
  - Latest backup per project/env shown, grouped by date descending, sorted by project then env within each date
- [x] Fixed SeriousLetter backup path bug (CLI-level fix, required for dashboard data correctness)
  - SL CLI was dumping backups flat into `/opt/data/backups/` — changed `backup-all.sh` to call SL CLI per-env with explicit `--backup-dir`
  - Moved 15 orphaned backup files to correct per-env directories
- [x] Ran full backup cycle across all 6 environments (MDF + SL x dev/int/prod), verified offsite upload

## Key Decisions / Learnings

- Dashboard containers cannot use in-process `ops` commands that require host-side Python venvs — must use `nsenter` bridge. This is a recurring pattern: the in-container vs host execution boundary is an important architectural distinction in the ops-dashboard.
- Two execution helpers needed: `run_ops_json()` (in-container, fast) and `run_ops_host_json()` (host via nsenter, required for backup/offsite commands).
- Date-grouped backup status is more readable than a flat project-sorted table — groups make it obvious if a date was missed entirely.

## Files Changed

- `/opt/data/ops-dashboard/app/ops_runner.py` — added `run_ops_host_json()` helper
- `/opt/data/ops-dashboard/app/routers/backups.py` — use host execution for offsite listing
- `/opt/infrastructure/ops` — reformatted backup summary with date-grouped Rich tables

---

**Tags:** #Session #OpsDashboard #BackupSystem #Offsite
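A minimal sketch of the `nsenter` bridge idea behind `run_ops_host_json()`. The helper name, flag set, and wrapper shape here are assumptions, not the actual `ops_runner.py` code; the pattern requires a privileged container that shares the host PID namespace so the host's PID 1 is visible:

```python
def host_cmd(args):
    """Wrap a command so it executes in the host's namespaces instead of the
    container's. Targets PID 1 and enters the mount, UTS, network, and IPC
    namespaces, which makes host paths (like the ops venv) resolvable."""
    return ["nsenter", "-t", "1", "-m", "-u", "-n", "-i", "--"] + list(args)
```

A `run_ops_host_json()`-style helper would then pass `host_cmd(["ops", ...])` to `subprocess.run` and parse the JSON output, exactly as the in-container variant does for commands that don't need the host venv.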
Notes/2026/02/0016 - 2026-02-22 - Backup Drill-Down Redesign & Restore Fix.md
# Session 0016: Backup Drill-Down Redesign & Restore Fix

**Date:** 2026-02-22
**Status:** Completed
**Origin:** MDF Webseiten session 0030

---

## Work Done

- [x] Fixed restore API call — `mdf` CLI was falling into interactive selection because no backup filename was passed
  - `app.js`: `startRestore()` now includes `&name=...` from `restoreCtx` in the API URL
- [x] Implemented backups drill-down redesign (deployed as v7)
  - Replaced flat filter state with 3-level drill state (project → env → backup file)
  - Added cached backups to avoid re-fetching on drill-back
  - Extracted `mergeBackups()` helper function
  - Implemented all 13 changes from the redesign plan
- [x] Fixed browser cache problem preventing new JS from loading after rebuild
  - Rebuilt image and restarted container to force cache bust

## Key Decisions / Learnings

- Restore API must include the backup filename explicitly — passing only project/env and letting the CLI choose interactively breaks in non-TTY server context.
- 3-level drill state (project → env → file) is the right UX pattern for hierarchical backup selection; flat filter state made navigation confusing and state management error-prone.
- Caching fetched backup lists at each level avoids latency on drill-back and reduces server load.
- Browser cache busting on vanilla JS apps requires either cache-control headers or a version query param — container restart alone does not always clear client caches.

## Files Changed

- `/opt/data/ops-dashboard/static/js/app.js` — `startRestore()` fix, 3-level drill state, `mergeBackups()` helper
- Docker image rebuilt and container restarted

---

**Tags:** #Session #OpsDashboard #BackupSystem
Notes/2026/02/0017 - 2026-02-22 - Modular Sync Promote Rebuild Architecture.md
# Session 0017: Modular Sync/Promote/Rebuild Architecture

**Date:** 2026-02-22
**Status:** Paused (context checkpoint)
**Origin:** MDF Webseiten session 0032

---

## Work Done

- [x] Fixed SL `detect_env()` — was returning "seriousletter" instead of the env name; now scans path components for first match after "data"
- [x] Fixed MDF `list_backups()` indentation bug — try block was at same level as for loop, only parsed the last backup file
- [x] Added `promote` config to `registry.yaml` for mdf (rsync), seriousletter (git), ringsaday (git) — each defines promote type, branch mapping, post-pull behavior
- [x] Added `promote` Typer command to SL `sync.py` — git fetch, diff preview, git pull, Dockerfile change detection, container rebuild/restart, health check; only dev→int and int→prod allowed
- [x] Added `cmd_promote` to ops CLI — delegates to project CLI with `--from`/`--to` args
- [x] Added `cmd_rebuild` to ops CLI — starts containers, waits for health, restores latest backup
- [x] Created 4 new FastAPI routers in ops-dashboard:
  - `promote.py` — SSE streaming promote endpoint
  - `sync_data.py` — SSE streaming sync endpoint
  - `registry.py` — exposes project list + environments + promote config as JSON
  - `rebuild.py` — SSE streaming rebuild/disaster-recovery endpoint
- [x] Updated `backups.py` to read project list from registry API instead of hardcoding
- [x] Added "Operations" page to dashboard sidebar with three sections: Promote Code, Sync Data, Rebuild (Disaster Recovery)
- [x] Operations page uses SSE modal with dry-run toggle; project/direction buttons populated dynamically from `/api/registry/`
- [x] Verified all 7 test categories pass

## Key Decisions / Learnings

- All long-running ops commands (promote, sync, rebuild) use SSE streaming — consistent with existing backup/restore pattern. The `stream_ops_host()` helper is the standard interface.
- Registry is the single source of truth for project/environment/promote config. Dashboard reads it dynamically — no hardcoded project names in API routers.
- Promote direction validation lives in the project CLI (`sync.py`), not in the ops CLI or dashboard — keeps enforcement close to the implementation.
- `ops rebuild` is the disaster recovery entry point: bring up containers → wait for healthy → restore latest backup. Simple, composable.
- `detect_env()` path parsing must handle the full `/opt/data/seriousletter/{env}/code/...` structure — scanning for VALID_ENVS after "data" in path components is robust.

## Files Changed

- `/opt/data/seriousletter/{dev,int,prod}/code/scripts/sync/sync.py` — fix `detect_env`, add `promote` command
- `Code/mdf-system/scripts/sync/sync.py` (local + deployed to dev) — fix `list_backups` indentation
- `/opt/infrastructure/servers/hetzner-vps/registry.yaml` — add `promote` config per project
- `/opt/data/ops-dashboard/app/routers/promote.py` — new SSE promote endpoint
- `/opt/data/ops-dashboard/app/routers/sync_data.py` — new SSE sync endpoint
- `/opt/data/ops-dashboard/app/routers/registry.py` — new registry JSON endpoint
- `/opt/data/ops-dashboard/app/routers/rebuild.py` — new SSE rebuild endpoint
- `/opt/data/ops-dashboard/app/routers/backups.py` — dynamic project list from registry
- `/opt/data/ops-dashboard/app/main.py` — register 4 new routers
- `/opt/data/ops-dashboard/static/js/app.js` — Operations page UI + SSE modal
- `/opt/data/ops-dashboard/static/index.html` — nav link + ops-modal HTML

## Next Steps (at time of pause)

- [ ] Test backup creation from dashboard UI
- [ ] Test full promote dry-run via dashboard (Operations page)
- [ ] Test sync dry-run via dashboard
- [ ] Commit infrastructure and code repo changes on server
- [ ] DNS cutover mdf-system.de → .ch
- [ ] Disaster recovery test (destroy + rebuild SL dev)

---

**Tags:** #Session #OpsDashboard #OpsCLI #Promote #Sync #Rebuild #Registry
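The `detect_env()` fix described in this session (scan path components for the first valid env after the "data" component) might look like the following. The `VALID_ENVS` constant and exact function shape are assumptions based on the note:

```python
VALID_ENVS = {"dev", "int", "prod"}

def detect_env(path):
    """Return the environment name embedded in a deployment path, matching
    the /opt/data/seriousletter/{env}/code/... layout. Scanning only the
    components *after* "data" avoids the original bug, where the project
    directory name ("seriousletter") was returned instead of the env."""
    parts = path.strip("/").split("/")
    try:
        start = parts.index("data") + 1
    except ValueError:
        return None  # not under a .../data/... tree
    for part in parts[start:]:
        if part in VALID_ENVS:
            return part
    return None
```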
Notes/2026/02/0018 - 2026-02-22 - CLI Contract Spec, Sync Compliance, Dashboard Bidirectional UI.md
# Session 0018: CLI Contract Spec, Sync Compliance, Dashboard Bidirectional UI

**Date:** 2026-02-22
**Status:** Completed
**Origin:** MDF Webseiten session 0033

---

## Work Done

- [x] Defined project CLI contract (`infrastructure/docs/cli-contract.md`, 514 lines): 4 required commands (backup, restore, sync, promote), exact flags, exit codes, output format, compliance checklist, minimal shell CLI example for new projects
- [x] MDF sync.py contract compliance: ANSI suppression (NO_COLOR env var + TTY detection), `--yes` flag for backup, 6 cancellation paths changed exit 0 → exit 2, `[error]` prefix helper for stderr
- [x] SL sync.py contract compliance: ANSI suppression, `error_exit()` helper, backup now uses per-env subdirectories, absolute path output after backup
- [x] Ops CLI de-hardcoding: removed stale `/opt/data2` from disk checks and healthcheck.sh, generalized hardcoded MDF-specific comments, added `find_registry()` multi-server comment
- [x] Disaster recovery docs: fixed `install.sh` (single-volume layout, auto-detection), fixed `bootstrap.sh` (network pre-creation, local image builds, restore instructions), wrote `docs/disaster-recovery.md` (10-phase runbook)
- [x] Dashboard JS fix: fixed syntax errors in Operations page onclick handlers (nested quotes)
- [x] Permanent cache fix: content-hashed asset URLs so manual `?v=XX` bumps are no longer needed
- [x] Bidirectional sync UI: `prod ↔ dev` with direction picker modal ("content flows down" / "content flows up")
- [x] Deployed to server: ops CLI, registry, healthcheck, install.sh, bootstrap.sh, both sync.py scripts (all 3 envs), dashboard rebuilt with content hashing
- [x] Verified: ops status, ops health, promote dry-run, restore --list, dashboard SSE streaming

## Key Decisions / Learnings

- CLI contract enforces: ANSI off via `NO_COLOR` or non-TTY detection; exit codes 0 (success), 1 (error), 2 (cancelled by user); `[error]` prefix on stderr; `--yes` flag to skip prompts in automation
- Cancellation paths must exit 2, not 0 — exit 0 was masking user-cancelled operations in the dashboard
- Content hashing (not version query params) is the correct long-term cache-busting solution
- `find_registry()` multi-server support is documented but not yet implemented — placeholder for future
- DR runbook is 10 phases: verify backups → restore server → install deps → clone repo → restore data → start services → verify

## Files Changed

- `infrastructure/docs/cli-contract.md` — new, 514 lines, defines the full CLI contract
- `infrastructure/docs/disaster-recovery.md` — new, 10-phase DR runbook
- `infrastructure/install.sh` — single-volume layout with auto-detection
- `infrastructure/bootstrap.sh` — network pre-creation, local image builds, restore instructions
- `infrastructure/ops` — removed `/opt/data2`, generalized hardcoded comments, `find_registry()` note
- `infrastructure/healthcheck.sh` — removed stale `/opt/data2` disk check
- `Code/mdf-system/scripts/sync/sync.py` — ANSI suppression, `--yes`, exit 2 cancellations, `[error]` helper
- `Code/seriousletter-sync/sync.py` — ANSI suppression, `error_exit()`, per-env backup dirs, absolute path output
- `Code/ops-dashboard/` — JS onclick fix, content-hashed assets, bidirectional sync UI

---

**Tags:** #Session #OpsToolkit #OpsDashboard #CliContract #DisasterRecovery
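The contract's ANSI and exit-code rules can be sketched as follows. Helper names are illustrative, not taken from the actual `cli-contract.md`; the `NO_COLOR` handling follows the common convention of disabling color whenever the variable is present, regardless of its value:

```python
import os
import sys

# Exit codes required by the contract.
EXIT_OK = 0         # success
EXIT_ERROR = 1      # failure
EXIT_CANCELLED = 2  # cancelled by user (never 0 — would mask cancellation)

def use_color(stream=None, environ=None):
    """Emit ANSI sequences only when NO_COLOR is unset AND output is a TTY.
    Both checks matter: dashboards capture output via pipes (non-TTY), and
    automation can force plain output by exporting NO_COLOR."""
    environ = os.environ if environ is None else environ
    stream = sys.stdout if stream is None else stream
    if "NO_COLOR" in environ:
        return False
    return bool(getattr(stream, "isatty", lambda: False)())

def error(msg, stream=None):
    """Write a contract-compliant error line to stderr."""
    print(f"[error] {msg}", file=stream or sys.stderr)
```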
Notes/2026/02/0019 - 2026-02-22 - Offsite Download Feature Added to Dashboard.md
# Session 0019: Offsite Download Feature Added to Dashboard

**Date:** 2026-02-22
**Status:** Completed
**Origin:** MDF Webseiten session 0039

---

## Work Done

- [x] Added offsite download feature to ops dashboard: per-row download buttons on the Backups page plus action bar buttons
- [x] Offsite download uses SSE streaming (consistent with existing backup/restore/upload patterns)
- [x] Updated ops registry with Seafile services (adds ops-visible services to status output)

## Key Decisions / Learnings

- Offsite download follows the same SSE streaming pattern as backup upload — consistency across all long-running operations
- Per-row buttons (individual file download) and action bar buttons (bulk/selected) both supported

## Files Changed

- `Code/ops-dashboard/` — offsite download UI (per-row + action bar) with SSE streaming
- `infrastructure/servers/hetzner-vps/registry.yaml` — added Seafile services

---

**Tags:** #Session #OpsDashboard #Offsite #SSE
Notes/2026/02/0020 - 2026-02-23 - Backup Coverage Audit, Registry Fixes, Container Resolution.md
# Session 0020: Backup Coverage Audit, Registry Fixes, Container Resolution

**Date:** 2026-02-23
**Status:** Completed
**Origin:** MDF Webseiten session 0041

---

## Work Done

- [x] Fixed ringsaday backup error: added `backup_sources` (volumes, keys, server, website, .env) and `backup` config to registry; changed `backup_dir` to `/opt/data/backups/ringsaday`; fixed `_backup_generic()` — changed `-d` to `-e` flag so individual files (not just directories) can be backed up; tested: 689 MB backup created successfully
- [x] Full backup coverage audit: identified kioskpilot (1.3 MB) and ops-dashboard (1.5 MB) as missing backups
- [x] Added kioskpilot backup (03:45, 30-day retention)
- [x] Added ops-dashboard to registry + nightly backup (04:15, 30-day retention)
- [x] Now 6 nightly backup timers: mdf, seriousletter, ringsaday, kioskpilot, ops-dashboard, coolify
- [x] Fixed ringsaday container resolution: was showing duplicated entries in `ops status`
  - Added `{prefix}-{env}-` matching pattern to `find_containers()` (handles ringsaday-dev-UUID style names)
  - Added ringsaday-website as sub-service with `environments: [prod]`
- [x] Deployed registry.yaml and ops CLI to server; 6 systemd timers active; backup dirs created

## Key Decisions / Learnings

- `_backup_generic()` used `-d` (directory flag) which silently skipped individual files like `.env` and SSL keys — the fix to `-e` (existence check) makes it handle both files and directories
- Container naming for ringsaday uses `{prefix}-{env}-UUID` (Coolify-managed), different from other projects — `find_containers()` needed a second pattern to match these
- ops-dashboard itself must be backed up — it holds its own config and data, easy to overlook
- Backup coverage audit should be a recurring check whenever new projects are added

## Files Changed

- `infrastructure/servers/hetzner-vps/registry.yaml` — kioskpilot backup, ops-dashboard entry, ringsaday website sub-service, ringsaday backup_sources
- `infrastructure/ops` — `_backup_generic()` -d→-e fix, `find_containers()` new UUID-style pattern

---

**Tags:** #Session #OpsToolkit #Backup #Registry #ContainerResolution
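The `-d` → `-e` lesson translates to "check existence, not directory-ness" when collecting backup sources. A hypothetical Python equivalent of that check (the function and its `exists` hook are illustrative, not the bash `_backup_generic()` code):

```python
import os

def collect_sources(paths, exists=os.path.exists):
    """Split backup source paths into (present, missing).
    Uses an existence check — the Python analogue of bash `-e` — so
    individual files like .env and SSL keys are included; a directory-only
    check (bash `-d`, os.path.isdir) would silently skip them."""
    present, missing = [], []
    for p in paths:
        (present if exists(p) else missing).append(p)
    return present, missing
```

Reporting `missing` instead of silently dropping it is what surfaces misconfigured `backup_sources` entries.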
Notes/2026/02/0021 - 2026-02-23 - Rebuild.py Coolify-Only Lifecycle, SSE Keepalive, Traefik Flush.md
# Session 0021: Rebuild.py Coolify-Only Lifecycle, SSE Keepalive, Traefik Flush

**Date:** 2026-02-23
**Status:** Completed
**Origin:** MDF Webseiten session 0044 (part 1)

---

## Work Done

- [x] rebuild.py — removed all docker compose fallbacks; recreate is now Coolify stop → wipe → Coolify start; rebuild is Coolify stop → docker build → Coolify start; restart stays as `docker restart` (Coolify restart prunes local images — intentional exception)
- [x] Fixed build step: changed from `docker compose --profile {env} build` (requires all Coolify env vars) to `docker build -t {image}:{env} {context}` using registry `build_context` and `image_name` directly — no env vars needed
- [x] Added `_coolify_start_with_retry()`: polls 60s after API call, retries up to 3 times — handles Coolify silently dropping start requests
- [x] Container stabilization polling: `_poll_until_running` now waits for container count to be stable for 2 consecutive polls (10s) before declaring success — previously returned success on first container appearance
- [x] "Already running/stopped" handling: Coolify API HTTP 400 with that message now treated as success, not error
- [x] SSE keepalive for restore: restore connections were dropping during DB import (~60s silence); added `_stream_with_keepalive()` wrapper in `restore.py` — sends SSE comment `: keepalive` every 15s
- [x] Added `responseForwarding.flushInterval: "-1"` to ops-dashboard Traefik dynamic config — Traefik was buffering SSE responses, causing keepalives to not reach the client

## Key Decisions / Learnings

- Coolify `restart` prunes locally-built images — `docker restart` (bypassing Coolify) is the correct approach for services with local images; this is a documented exception in rebuild.py
- Coolify can silently queue-and-never-execute start requests — retry logic with polling is mandatory, not optional
- "Already running" from Coolify API is a valid state (idempotent), not an error — treat HTTP 400 with that message as success
- SSE keepalive must happen at the application level (`: keepalive` comment) AND Traefik must be configured to flush immediately (`flushInterval: "-1"`) — both are required; one alone is not enough
- Stable polling (2 consecutive matching counts) is more reliable than "at least one container appeared"

## Files Changed

- `Code/ops-dashboard/app/routers/rebuild.py` — Coolify-only lifecycle, `docker build` from registry config, `_coolify_start_with_retry()`, stable container polling, HTTP 400 success handling
- `Code/ops-dashboard/app/routers/restore.py` — `_stream_with_keepalive()` SSE keepalive wrapper
- Server: `/data/coolify/proxy/dynamic/ops-dashboard.yaml` — added `responseForwarding.flushInterval: "-1"`

---

**Tags:** #Session #OpsDashboard #Rebuild #SSE #Traefik #Coolify
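A sketch of the keepalive-wrapper idea from `restore.py`. The real `_stream_with_keepalive()` signature is not shown in the note; this version, an assumption, re-arms a timeout around each upstream event and emits an SSE comment line (ignored by EventSource clients) whenever the source is silent for longer than the interval:

```python
import asyncio

async def stream_with_keepalive(source, interval=15.0):
    """Yield events from an async iterator `source`, inserting the SSE
    comment `: keepalive` whenever no event arrives within `interval`
    seconds. Keeps proxies and clients from dropping an idle connection
    during long silent phases (e.g. a ~60s DB import)."""
    it = source.__aiter__()
    while True:
        # Fetch the next upstream event as a task so we can time-box it.
        task = asyncio.ensure_future(it.__anext__())
        while True:
            try:
                # shield() keeps the inner fetch alive across timeouts.
                event = await asyncio.wait_for(asyncio.shield(task), interval)
                break
            except asyncio.TimeoutError:
                yield ": keepalive\n\n"   # SSE comment, ignored by clients
            except StopAsyncIteration:
                return                    # upstream finished
        yield event
```

Note this only helps if the proxy actually forwards the comment bytes promptly, hence the paired Traefik `flushInterval: "-1"` change.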
Notes/2026/02/0022 - 2026-02-23 - Post-Coolify Architecture Context for Ops Toolkit.md
# Session 0022: Post-Coolify Architecture Context for Ops Toolkit

**Date:** 2026-02-23
**Status:** Completed
**Origin:** MDF Webseiten session 0044 (Coolify Removal Complete)

---

## Work Done

- [x] Coolify fully removed from server (6 containers, 18 UUID networks, /data/coolify/ directory)
- [x] Standalone Traefik v3.6 confirmed as proxy layer (was coolify-proxy, now independent at /opt/data/traefik/)
- [x] All 28 containers verified operational post-removal; 17/17 domains tested
- [x] Dynamic configs migrated: seriousletter.yaml, ringsaday.yaml moved to /opt/data/traefik/dynamic/
- [x] SSL certificates preserved: acme.json migrated to /opt/data/traefik/acme.json
- [x] Coolify archive retained: /opt/data/backups/coolify-final-20260223.tar.gz (125KB, 30-day window)

## Key Decisions / Learnings

- **Ops toolkit no longer depends on Coolify API** — all lifecycle management (start/stop/rebuild/recreate) must use Docker CLI and docker compose directly against project compose files at `/opt/data/{project}/`
- **Container naming is now clean** — no more UUID suffixes. Pattern: `{env}-{project}-{service}` (e.g. `prod-mdf-wordpress`, `dev-seriousletter-backend`)
- **Proxy network is `proxy`** (replaces old `coolify` network) — all Traefik-exposed containers connect to it
- **Project descriptors at `/opt/data/{project}/project.yaml`** are the new source of truth for container config — registry.yaml is deprecated (used only by gen-timers and schedule PUT)
- **Docker provider + file provider** coexist in Traefik: MDF services use Docker labels; SeriousLetter, RingsADay, KioskPilot use file provider configs
- metro.ringsaday.com returns 502 — pre-existing issue unrelated to Coolify removal (no metro service in compose)
- Docker system cleanup freed ~9GB of unused images and volumes during removal

## Architecture Reference (Post-Coolify)

```
Proxy:          Traefik v3.6 at /opt/data/traefik/
Config:         traefik.yaml (static), dynamic/ (file provider)
Certs:          /opt/data/traefik/acme.json
Proxy network:  proxy

Projects:
  MDF prod:       /opt/data/mdf/prod/ — WordPress, MySQL, Mail, PostfixAdmin, Roundcube, Seafile
  MDF int/dev:    /opt/data/mdf/{int,dev}/ — WordPress + MySQL
  SeriousLetter:  /opt/data/seriousletter/{dev,int,prod}/
  RingsADay:      /opt/data/ringsaday/
  KioskPilot:     /opt/data/kioskpilot/
  Ops Dashboard:  /opt/data/ops-dashboard/
```

## Files Changed

- Server: `/data/coolify/` — deleted (backed up first)
- Server: `/opt/data/traefik/dynamic/` — received migrated seriousletter.yaml and ringsaday.yaml

---

**Tags:** #Session #OpsToolkit #Architecture #Traefik #PostCoolify
Notes/2026/02/0023 - 2026-02-23 - Toolkit Bootstrap Starting Point.md
# Session 0023: Toolkit Bootstrap Starting Point

**Date:** 2026-02-23
**Status:** Completed
**Origin:** MDF Webseiten session 0045

---

## Work Done

- [x] Created `project.yaml` descriptors for all 5 projects (mdf, seriousletter, ringsaday, kioskpilot, ops-dashboard)
- [x] Updated `ops-dashboard` docker-compose.yaml: network `coolify` → `proxy`
- [x] Added Alpine pre-pull with retry (4 attempts, 15s delays) to `rebuild.py` — note: this was a pre-redesign patch, superseded by Phase 5 rewrite in session 0046
- [x] Added image verification after build to `rebuild.py`
- [x] Identified Phase 3+4 toolkit work as next immediate task (was interrupted this session)

## Context / Background

This session was primarily about removing Coolify and migrating all projects to standalone Docker Compose. The OPS-relevant outcome is:

- All 5 `project.yaml` descriptors now exist and are the source of truth for the toolkit
- The `proxy` Docker network replaces the old `coolify` network — all Traefik-exposed containers connect to it
- The toolkit build (Phase 3+4) was planned but interrupted mid-session — completed in session 0046
- The plan was documented at: `Notes/swarm/plan.md` (since cleaned up)

## Key Decisions / Learnings

- `container_prefix` in `project.yaml` uses `{env}` placeholder (e.g. `"{env}-mdf"`) — the toolkit must expand this at runtime
- SeriousLetter uses `"{env}-seriousletter"` as prefix (not `sl`)
- ops-dashboard gets its own `project.yaml` like all other projects

## Files Changed

- `/opt/data/mdf/project.yaml` — created
- `/opt/data/seriousletter/project.yaml` — created
- `/opt/data/ringsaday/project.yaml` — created
- `/opt/data/kioskpilot/project.yaml` — created
- `/opt/data/ops-dashboard/project.yaml` — created
- `/opt/data/ops-dashboard/docker-compose.yml` — network coolify→proxy
- `app/routers/rebuild.py` — Alpine retry + image verify (pre-redesign, superseded)

---

**Tags:** #Session #OpsToolkit #Infrastructure
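The `{env}` placeholder expansion in `container_prefix`, combined with the post-Coolify `{env}-{project}-{service}` naming from session 0022, reduces to two trivial helpers. These are illustrative, not the toolkit's actual code:

```python
def expand_prefix(container_prefix, env):
    """Expand the {env} placeholder from project.yaml at runtime,
    e.g. "{env}-seriousletter" + "dev" -> "dev-seriousletter"."""
    return container_prefix.format(env=env)

def container_name(container_prefix, env, service):
    """Full post-Coolify container name: {env}-{project}-{service}."""
    return f"{expand_prefix(container_prefix, env)}-{service}"
```

Doing the expansion at lookup time (rather than storing expanded names) keeps `project.yaml` environment-agnostic.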
Notes/2026/02/0024 - 2026-02-23 - Toolkit and CLI Rewrite and Dashboard Migration.md
....@@ -0,0 +1,65 @@

# Session 0024: Toolkit and CLI Rewrite and Dashboard Migration

**Date:** 2026-02-23
**Status:** Completed
**Origin:** MDF Webseiten session 0046

---

## Work Done

### Phase 3: Shared Toolkit

- [x] Completed 5 missing toolkit modules at `/opt/infrastructure/toolkit/`:
  - `cli.py` — main CLI entry point with all commands (status, start, stop, build, rebuild, destroy, backup, restore, sync, promote, logs, health, disk, backups, offsite, gen-timers, init)
  - `output.py` — formatted output (Rich tables, JSON mode, plain-text fallback)
  - `restore.py` — restore operations with CLI delegation support
  - `sync.py` — data sync between environments with CLI delegation
  - `promote.py` — code promotion (git, rsync, script) with adjacency enforcement
- [x] 7 modules already existed from prior sessions: `__init__.py`, `descriptor.py`, `docker.py`, `backup.py`, `database.py`, `health.py`, `discovery.py`

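As a rough illustration of that command surface, a subcommand dispatcher of the following shape covers the listed commands (argparse is used here for the sketch; the real `cli.py` may be organized differently, and the positional arguments are assumptions):

```python
import argparse

COMMANDS = ("status", "start", "stop", "build", "rebuild", "destroy",
            "backup", "restore", "sync", "promote", "logs", "health",
            "disk", "backups", "offsite", "gen-timers", "init")

def build_parser() -> argparse.ArgumentParser:
    # One subparser per command; most commands take optional project/env args.
    parser = argparse.ArgumentParser(prog="ops")
    sub = parser.add_subparsers(dest="command", required=True)
    for name in COMMANDS:
        cmd = sub.add_parser(name)
        cmd.add_argument("project", nargs="?")
        cmd.add_argument("env", nargs="?")
    return parser
```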
### Phase 4: Ops CLI Rewrite

- [x] Replaced the 950-line bash ops CLI with a 7-line bash shim → `python3 -m toolkit.cli`
- [x] Old CLI backed up as `ops.bak.20260223`
- [x] New commands added: `start`, `stop`, `build`, `destroy`, `logs`, `restart`, `init`
- [x] All commands read from `project.yaml` descriptors — no `registry.yaml` dependency
- [x] Container prefix matching fixed: handles `{env}` placeholder expansion in `container_prefix`

### Phase 5: Dashboard Adaptation

- [x] Rewrote 4 dashboard routers to use `project.yaml`:
  - `registry.py` — imports `toolkit.discovery.all_projects()` instead of parsing registry.yaml
  - `services.py` — uses `toolkit.descriptor.find()` for container name resolution
  - `rebuild.py` — massive rewrite: 707 → 348 lines, removed ALL Coolify API code, uses direct docker compose
  - `schedule.py` — reads from descriptors for GET, still writes to registry.yaml for PUT (gen-timers compatibility)
- [x] Verified all API endpoints working:
  - `/api/registry/` — returns all 5 projects from descriptors
  - `/api/status/` — shows 25 containers
  - `/api/schedule/` — shows backup schedules for all 5 projects
  - `/api/services/logs/mdf/prod/wordpress` — correctly resolves container name

## Key Decisions / Learnings

- `rebuild.py` now uses a `_compose_cmd()` helper that finds the compose file (`.yaml`/`.yml`) and the env file (`.env.{env}`/`.env`), and adds `--profile {env}` — removes all Coolify API dependency
- The dashboard container has `/opt/infrastructure` mounted → it can import the toolkit directly via Python
- pyyaml 6.0.3 confirmed available in the dashboard container
- `schedule.py` still writes to `registry.yaml` for PUT/gen-timers — full descriptor migration is a future task
- `container_prefix_for(env)` expands `{env}` in the prefix, then matches `{prefix}-*` containers

## Files Changed

- `/opt/infrastructure/toolkit/cli.py` — new (all CLI commands)
- `/opt/infrastructure/toolkit/output.py` — new (Rich/JSON/plain output)
- `/opt/infrastructure/toolkit/restore.py` — new
- `/opt/infrastructure/toolkit/sync.py` — new
- `/opt/infrastructure/toolkit/promote.py` — new
- `/usr/local/bin/ops` — rewritten as 7-line bash shim
- `app/routers/registry.py` — uses toolkit.discovery
- `app/routers/services.py` — uses toolkit.descriptor
- `app/routers/rebuild.py` — 707→348 lines, Coolify removed
- `app/routers/schedule.py` — descriptor-backed GET

---

**Tags:** #Session #OpsToolkit #OpsCLI #OpsDashboard
Notes/2026/02/0025 - 2026-02-24 - Dashboard Bugs and SL Routing Fixes.md

# Session 0025: Dashboard Bugs and SL Routing Fixes

**Date:** 2026-02-24
**Status:** Completed
**Origin:** MDF Webseiten session 0048 (Part 2 only — DNS cutover and mail recovery sections skipped)

---

## Work Done

### Operations Page: Recreate Replaced by Backup + Restore

- [x] Removed the "Recreate" lifecycle action (redundant with Rebuild for bind-mount projects)
- [x] Added **Backup** button (blue): opens the lifecycle modal with SSE streaming to `/api/backups/stream/{project}/{env}`
- [x] Added **Restore** button (purple): navigates to the Backups page at drill level 2 for that project/env
- [x] Added cache invalidation on backup success

### SeriousLetter Bad Gateway Fix

- [x] Diagnosed the root cause: SL containers were only on `seriousletter-network`, not on the `proxy` network that Traefik uses
- [x] Permanent fix: added the `proxy` network to docker-compose.yaml for all 3 SL envs (prod/int/dev)
  - `backend` and `frontend` services get `proxy` in their networks list
  - `proxy: external: true` added to the networks section
- [x] Added health checks for both services:
  - Backend: a `python3 -c` one-liner calling `urllib.request.urlopen("http://localhost:8000/docs")`
  - Frontend: `wget --spider -q http://127.0.0.1:3000/` (explicitly `127.0.0.1`, not `localhost` — Alpine resolves `localhost` to IPv6 `::1`)

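In compose terms, the permanent fix looks roughly like this (service and network names are from the note; everything else in each service definition is omitted, so this is a fragment, not the full file):

```yaml
services:
  backend:
    networks: [seriousletter-network, proxy]
  frontend:
    networks: [seriousletter-network, proxy]

networks:
  seriousletter-network: {}
  proxy:
    external: true
```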
### Sync Routing Bug Fix

- [x] Fixed the sync section only showing MDF (not SeriousLetter)
- [x] Root cause (two-part):
  1. `registry.py` had `desc.sync.get("type") == "cli"` — SL had `sync.type: toolkit`, which evaluated to `False`
  2. SL's `toolkit` type was itself wrong — it should be `cli` with a CLI path
- [x] Fix in `registry.py`: `"has_cli": desc.sync.get("type") == "cli"` → `"has_cli": bool(desc.sync.get("type"))`
- [x] Fix in `/opt/data/seriousletter/project.yaml`: `sync.type: toolkit` → `type: cli` with a `cli:` path

### Backup Date Inconsistency Fix

- [x] Fixed the overview card showing a stale "INT Latest" date while the drill-down showed the correct newer backups
- [x] Root cause: string comparison between incompatible date formats:
  - Compact (MDF CLI): `20260220_195300`
  - ISO (toolkit): `2026-02-24T03:00:42`
  - In ASCII, `'0' > '-'`, so compact dates always "won" the `>` comparison
- [x] Fix: added a `normalizeBackupDate()` function to convert all dates to ISO format at merge time in `mergeBackups()`

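The normalization idea, sketched in Python (the real `normalizeBackupDate()` lives in app.js; this regex-based version is an illustration of the same conversion, not the actual code):

```python
import re

# Compact CLI timestamps look like 20260220_195300
_COMPACT = re.compile(r"^(\d{4})(\d{2})(\d{2})_(\d{2})(\d{2})(\d{2})$")

def normalize_backup_date(value: str) -> str:
    """Convert compact dates to ISO so string comparison orders correctly;
    ISO inputs pass through unchanged."""
    m = _COMPACT.match(value)
    if not m:
        return value
    y, mo, d, h, mi, s = m.groups()
    return f"{y}-{mo}-{d}T{h}:{mi}:{s}"
```

After normalization, plain string `>` gives the right answer, whereas the raw compact form always sorts above ISO because `'0' > '-'`.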
## Key Decisions / Learnings

- When adding a container to a new network, an ad-hoc `docker network connect` is lost on restart — the fix must go in the compose file
- Alpine resolves `localhost` to `::1` (IPv6). Services binding only IPv4 `0.0.0.0` won't respond. Use `127.0.0.1` explicitly in health checks.
- For the `has_cli` logic: any truthy `sync.type` value means the project has ops CLI support — don't compare against one specific string
- Date normalization must happen at merge time, not display time, so the `max()` comparisons are correct

## Files Changed

- `static/js/app.js` — removed recreate modal/handler, added backup modal, URL routing for the restore button, cache invalidation, `normalizeBackupDate()` + `mergeBackups()` fix
- `app/routers/registry.py` — `has_cli` logic fix
- `/opt/data/seriousletter/project.yaml` — `sync.type` corrected
- `/opt/data/seriousletter/{prod,int,dev}/code/docker-compose.yaml` — proxy network + health checks

---

**Tags:** #Session #OpsDashboard #BugFix
Notes/2026/02/0026 - 2026-02-25 - Persistent Jobs and Container Terminal.md

# Session 0026: Persistent Jobs and Container Terminal

**Date:** 2026-02-25
**Status:** Completed
**Origin:** MDF Webseiten session 0053

---

## Work Done

### Feature 1: Persistent/Reconnectable Jobs

- [x] New `app/job_store.py` — in-memory job store that decouples the subprocess from the SSE connection
- [x] New `app/routers/jobs.py` — job management endpoints
- [x] New endpoints: `GET /api/jobs/`, `GET /api/jobs/{op_id}`, `GET /api/jobs/{op_id}/stream?from=N`
- [x] Added `run_job()` to `ops_runner.py` — runs the subprocess writing to the job store, NOT killed on browser disconnect
- [x] Added `job_sse_stream()` to `job_store.py` — shared SSE wrapper with keepalive
- [x] Rewrote 6 routers to use the job store pattern: backups.py, restore.py, sync_data.py, promote.py, rebuild.py, schedule.py
- [x] All routers follow the pattern: `create_job()` → `asyncio.create_task(run_job())` → `return StreamingResponse(job_sse_stream())`
- [x] Background cleanup task removes expired jobs every 5 minutes (1-hour TTL)
- [x] Frontend: auto-reconnect on SSE error via `/api/jobs/{op_id}/stream?from=N` (3 retries)
- [x] Frontend: checks for running jobs on page load and shows a reconnect banner

### Feature 2: Container Terminal

- [x] New `app/routers/terminal.py` — WebSocket endpoint with a PTY via `docker exec`
- [x] Protocol: `{"type":"input","data":"..."}` / `{"type":"resize","cols":80,"rows":24}` / `{"type":"output","data":"..."}`
- [x] Frontend: xterm.js 5.5.0 + addon-fit from CDN, terminal modal, Console button on the services page
- [x] Security: token auth, container name validation (regex allowlist), running check via `docker inspect`

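The protocol frames and the name allowlist are simple enough to sketch. The regex below is an assumed allowlist matching Docker's container-name character set, not necessarily the exact pattern in `terminal.py`:

```python
import json
import re

# Docker container names use [a-zA-Z0-9][a-zA-Z0-9_.-]*, so an allowlist of
# that shape rejects shell metacharacters before the name reaches docker exec.
CONTAINER_NAME = re.compile(r"^[a-zA-Z0-9][a-zA-Z0-9_.-]*$")

def valid_container_name(name: str) -> bool:
    return bool(CONTAINER_NAME.match(name))

def encode_frame(frame_type: str, **fields) -> str:
    """Serialize one WebSocket message, e.g. {"type":"resize","cols":80,...}."""
    return json.dumps({"type": frame_type, **fields})

def decode_frame(raw: str) -> dict:
    frame = json.loads(raw)
    if frame.get("type") not in {"input", "resize", "output"}:
        raise ValueError(f"unknown frame type: {frame.get('type')}")
    return frame
```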
### Fixes Applied

- [x] Restored bidirectional sync pairs in `sync_data.py` (regression from an engineer rewrite)
- [x] Restored multi-compose support in `rebuild.py` (`_all_compose_dirs`, `_compose_cmd_for` for Seafile)
- [x] Updated `main.py` with the jobs + terminal routers, cleanup task in lifespan
- [x] Bumped APP_VERSION to v15-20260225
- [x] Also committed + pushed the `sync_data.py` bidirectional fix (git commit 31ac43f) and stabilization checks

## Key Decisions / Learnings

- Decoupling the subprocess from SSE via a job store is the correct pattern — a browser disconnect should never kill a running backup/restore
- The job store is in-memory (not persisted) — a server restart loses job history, which is acceptable
- xterm.js from CDN (not bundled) keeps the container image lean
- Container name validation via a regex allowlist prevents command injection through the WebSocket terminal endpoint
- The `from=N` query param on the stream endpoint enables replay from any position — the client tracks the last received line index

## Files Changed

- `app/job_store.py` — new (315 lines)
- `app/routers/jobs.py` — new (186 lines)
- `app/routers/terminal.py` — new (287 lines)
- `app/ops_runner.py` — added `run_job()` (388 lines total)
- `app/main.py` — added routers + cleanup task (138 lines)
- `app/routers/backups.py` — job store integration (287 lines)
- `app/routers/restore.py` — job store integration (290 lines)
- `app/routers/sync_data.py` — job store + bidirectional fix (71 lines)
- `app/routers/promote.py` — job store integration (69 lines)
- `app/routers/rebuild.py` — job store + multi-compose (365 lines)
- `static/js/app.js` — v15: reconnect + terminal (2355 lines)
- `static/index.html` — xterm.js CDN + terminal modal
- `static/css/style.css` — terminal styles

## State at Session End

Code written locally at `/Users/i052341/Daten/Cloud/08 - Others/MDF/Infrastruktur/Code/ops-dashboard/`. Not yet deployed to the server at the time of note creation. Deploy + verification is the next session's starting task.

---

**Tags:** #Session #OpsDashboard #PersistentJobs #Terminal
Notes/2026/02/0027 - 2026-02-26 - Dynamic Backup Buttons & TEKMidian Registration.md

# 0027 - 2026-02-26 - Dynamic Backup Buttons & TEKMidian Registration

## Context

Changes made from the TEKMidian project session while registering TEKMidian in the ops dashboard.

## Changes

### 1. Dynamic "Create Backup" Buttons (app.js)

**Problem:** The "Create Backup" buttons on the Backups page were hardcoded to only `mdf` and `seriousletter`:
```javascript
for (const p of ['mdf', 'seriousletter']) {
    for (const e of ['dev', 'int', 'prod']) {
        // ...hardcoded button per project/env...
    }
}
```

**Fix:** Made the buttons dynamic from the `/api/schedule/` endpoint. Now all backup-enabled projects get buttons automatically, based on their configured environments:
```javascript
for (const s of (cachedSchedules || [])) {
    if (!s.enabled) continue;
    const envs = s.backup_environments || s.environments || [];
    // render a button per environment
}
```

Also added a schedule-data fetch in `renderBackups()` alongside the existing backup/offsite fetches:
```javascript
const [local, offsite, schedules] = await Promise.all([
    api('/api/backups/'),
    api('/api/backups/offsite').catch(() => []),
    cachedSchedules ? Promise.resolve(cachedSchedules) : api('/api/schedule/').catch(() => []),
]);
```

### 2. Empty-State Project Cards (app.js)

**Problem:** Projects with backup config but no backups yet didn't appear in the project cards grid (only projects with existing backup files showed up).

**Fix:** After the existing project-cards loop, added a second loop over `cachedSchedules` that shows backup-configured projects with 0 backups as dashed-border cards:
```javascript
for (const s of (cachedSchedules || [])) {
    if (!s.enabled || projects[s.project]) continue;
    // render a dashed card with "0 backups" and "No backups yet"
}
```

### 3. Cache Busting

- Bumped `APP_VERSION` from `v15-20260225` to `v16-20260226`
- Updated `index.html`: `app.js?v=15` to `app.js?v=16`

## Files Modified (on server)

- `/opt/data/ops-dashboard/static/js/app.js` — dynamic backup buttons, schedule fetch, empty-state cards, version bump
- `/opt/data/ops-dashboard/static/index.html` — cache bust `?v=16`

## Notes

- Edits require `sudo` — the file is owned by uid 501 (macOS user via scp)
- No container restart needed — static files are bind-mounted (`./static:/app/static`)
- First TEKMidian backup triggered via the schedule API (926K tar.gz)