# Repository Architecture

## Purpose

This document is the current source of truth for:

- repo package boundaries
- MCP user/developer surfaces
- registry/compiler/execution layering
- checkpoint compatibility flow
- where new features should plug in

For algorithm-specific integration expectations, read [`feature-placement.md`](feature-placement.md) after this document.

Maintenance rule:

- If a change alters MCP surfaces, boundary ownership, registry/compiler responsibilities, checkpoint materialization flow, or persisted handle types, update this document in the same change.

## Package Map

| Area | Owns |
| --- | --- |
| `src/dymad/agent/mcp` | MCP-facing tool adapters and server assembly |
| `src/dymad/agent/app` | transport-neutral app services shared by user-facing adapters |
| `src/dymad/agent/registry` | user-facing capability metadata, profiles, schemas, supported analyses/evaluations |
| `src/dymad/agent/compiler` | typed request validation and compilation into persisted requests |
| `src/dymad/agent/exec` | workflow orchestration and compatibility execution |
| `src/dymad/agent/facade` | stable typed boundary over persisted objects |
| `src/dymad/agent/store` | in-memory/filesystem-backed artifact records and handle persistence |
| `src/dymad/models` | model families, collections, typed model specs, rollout helpers |
| `src/dymad/training` | training runtime, phases, trainers, execution helpers |
| `src/dymad/io` | checkpoint loading, trajectory/data managers, legacy public runtime entrypoints |
| `src/dymad/core` | typed runtime/series/transform building blocks |
| `src/dymad/numerics` | math and linear-algebra utilities |
| `src/dymad/sako` | spectral analysis runtime and adapters |

## Layer Stack

Current user-facing stacks:

```text
MCP server
  -> user_tools / developer_tools
  -> app services where transport-neutral workflow assembly is shared
  -> registry + compiler
  -> CompatibilityExecutor
  -> FacadeOperations
  -> ObjectStore / FilesystemArtifactStore
  -> legacy runtime/training/checkpoint/analysis code

dymad CLI
  -> cli.py argument adapter
  -> agent/app path-first workflow service
  -> registry + compiler
  -> CompatibilityExecutor
  -> FacadeOperations
  -> ObjectStore / FilesystemArtifactStore
  -> legacy runtime/training/checkpoint/analysis code
```

Important distinction:

- `server.py` only registers tools and mode splits.
- `user_tools.py` is the high-level surface.
- `demo_tools.py` plus `developer_tools.py` expose the raw/developer surface.
- `cli.py` is the package-level path-first user interface; it should stay thin and delegate workflow assembly to `agent/app`.
- `CompatibilityExecutor` still owns orchestration, but some compatibility flows intentionally materialize through legacy `io/*` code instead of fully executor-native implementations.

## User Transports

DyMAD now has two user-facing transports over the same registry/compiler/executor/facade/store boundary:

- MCP user mode is structured and handle-first. It assumes dataset handles already exist and keeps `{"ok": ..., "data": ...}` envelopes.
- The `dymad` CLI is path-first and reproducibility-focused. It loads YAML configs, registers dataset paths through the facade, compiles through the user-mode training compiler, launches the same async worker, and writes `dymad-run.json` under the run directory so later CLI commands can recover handles and store location. `dymad train --config ...` can derive the run directory from the config file's directory plus `run.name`; `--out` remains available to choose and validate an explicit run directory.

## MCP Surfaces

`build_server(mode=...)` supports three registrations:

- `mode="user"`: high-level workflows
- `mode="developer"`: raw/debug/compatibility tools
- `mode="both"`: both surfaces on one server

### User Mode

User mode is registry/compiler-backed.
It currently exposes:

- `list_training_capabilities`
- `list_analysis_capabilities`
- `list_evaluation_capabilities`
- `describe_training_capability`
- `compile_training_request`
- `start_training_run`
- `describe_training_run`
- `read_training_run_log`
- `evaluate_checkpoint`
- `compile_analysis_request`
- `run_analysis_request`

Notes:

- user mode does not require raw `model_ref`
- user mode compiles `model_key` plus validated overrides into persisted compiled requests
- `describe_training_capability` is the authoritative contract for allowed overrides, phase-entry schemas, CV sweep support metadata, natural-language-to-override translation guidance, and surfaced training constraints
- user mode currently assumes dataset handles already exist

### Developer Mode

Developer mode keeps the raw and compatibility-oriented path available:

- `register_dataset_file`
- `inspect_dataset`
- `register_checkpoint`
- `prepare_prediction_request`
- `plan_checkpoint_prediction`
- `start_model_training`
- `describe_training_run`
- `read_training_run_log`
- `evaluate_model`
- `list_evaluation_capabilities`
- `list_model_capabilities`
- `resolve_model_capability`
- `list_profile_capabilities`
- `describe_training_capability`
- `describe_object`
- `list_objects`

Use developer mode when debugging boundary behavior, raw config/profile selection, or compatibility flows.
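To make the mode split concrete, the sketch below models the three `mode` values over the tool names listed in the User Mode and Developer Mode sections. It is an illustration only: `tools_for_mode` is a hypothetical helper, not part of `server.py`, and listing shared names once under `mode="both"` is an assumption about how overlapping tools would be handled.

```python
from typing import List, Tuple

# Tool names copied from the User Mode and Developer Mode lists in this document.
USER_TOOLS: Tuple[str, ...] = (
    "list_training_capabilities",
    "list_analysis_capabilities",
    "list_evaluation_capabilities",
    "describe_training_capability",
    "compile_training_request",
    "start_training_run",
    "describe_training_run",
    "read_training_run_log",
    "evaluate_checkpoint",
    "compile_analysis_request",
    "run_analysis_request",
)

DEVELOPER_TOOLS: Tuple[str, ...] = (
    "register_dataset_file",
    "inspect_dataset",
    "register_checkpoint",
    "prepare_prediction_request",
    "plan_checkpoint_prediction",
    "start_model_training",
    "describe_training_run",
    "read_training_run_log",
    "evaluate_model",
    "list_evaluation_capabilities",
    "list_model_capabilities",
    "resolve_model_capability",
    "list_profile_capabilities",
    "describe_training_capability",
    "describe_object",
    "list_objects",
)


def tools_for_mode(mode: str) -> List[str]:
    """Hypothetical helper: tool names that a given mode would expose."""
    if mode == "user":
        return list(USER_TOOLS)
    if mode == "developer":
        return list(DEVELOPER_TOOLS)
    if mode == "both":
        # Assumption: names present on both surfaces are listed once,
        # keeping first-seen order.
        return list(dict.fromkeys(USER_TOOLS + DEVELOPER_TOOLS))
    raise ValueError(f"unsupported mode: {mode!r}")


# Names that appear on both surfaces according to the lists above.
SHARED_TOOLS = sorted(set(USER_TOOLS) & set(DEVELOPER_TOOLS))
```

Four names (`describe_training_capability`, `describe_training_run`, `list_evaluation_capabilities`, `read_training_run_log`) appear in both lists, which is why the `both` case needs a policy for overlap at all.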
## Current Workflow Paths

### Training and Evaluation

High-level path:

```text
register_dataset_file
  -> describe_training_capability / list_training_capabilities
  -> compile_training_request
  -> start_training_run
  -> describe_training_run / read_training_run_log
  -> evaluate_checkpoint
```

CLI training enters the same path after resolving files from a YAML config:

```text
dymad train --config config.yaml [--out runs/foo]
  -> agent/app CLI workflow service
  -> register_dataset_file for train/valid/test paths
  -> compile_training_request
  -> start_training_run
  -> describe_training_run / read_training_run_log
  -> evaluate_checkpoint via dymad eval
```

Compilation resolves:

- `model_key` -> model capability -> default `model_ref`
- dataset kind compatibility
- default or explicit profile
- allowed user overrides
- optional single-split CV sweep settings under `overrides.cv`, including:
  - `param_grid` candidate definitions for grid or legacy candidate-based adaptive search
  - optional `search` policy whose `mode` selects the CV optimizer (`grid` or `nelder_mead_like`) plus optimizer-specific config such as simplex-style coefficients; in current runtime `nelder_mead_like` can either run a bounded continuous search over `search.bounds` lower/upper pairs or, when bounds are omitted, the legacy adaptive path over numeric single-split `param_grid` candidates
  - optional `selection` policy (`goal` plus ordered tie-breakers) for deterministic best-model choice
- phase overrides normalized against matching profile defaults so trainer-specific phase config is preserved unless explicitly overridden
- translation guidance and surfaced constraint notes for clients that map natural-language requests into structured overrides, including CV sweep requests
- effective config
- trainer kind

Execution is now submit-and-poll:

- `compile_training_request` still persists the validated compiled request
- `start_training_run` / `start_model_training` persist a `training_run` record immediately and spawn `dymad.agent.exec.training_worker`
- the worker reloads the persisted context, marks the run `RUNNING`, executes the private synchronous `_execute_training_run(...)` helper, then persists `SUCCEEDED` or `FAILED`
- `describe_training_run` is the polling surface and reconciles stale `RUNNING` jobs whose worker pid has disappeared without a terminal write
- `read_training_run_log` returns incremental log chunks from the persisted worker log

### Analysis

Current analysis path:

```text
compile_analysis_request
  -> persisted compiled analysis request
  -> run_analysis_request
  -> analysis-specific execution in CompatibilityExecutor
```

Currently supported workflow keys:

- `spectral_koopman`
- `vortex_transform_modes`

### Checkpoint Compatibility

Current checkpoint load path:

```text
dymad.io.load_model(...)
  -> CompatibilityExecutor.plan_checkpoint_prediction(...)
  -> FacadeOperations.register_checkpoint(...)
  -> FacadeOperations.prepare_prediction_request(...)
  -> legacy checkpoint materialization in dymad.io.checkpoint
```

This is an important current-state detail:

- `CompatibilityExecutor.plan_checkpoint_prediction(...)` is active.
- `CompatibilityExecutor.materialize_checkpoint_prediction(...)` is not the active materialization path today; it is a placeholder that raises `NotImplementedError`.
- the persisted checkpoint and prediction-request handles still record the boundary state used by `load_model(...)`.

So the boundary plan is real, but final checkpoint materialization still goes through `dymad.io.checkpoint`.
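The split between an active planning step and a placeholder materialization step can be pictured with a toy model. Everything in this sketch is hypothetical: `ToyExecutor`, `PredictionPlan`, `toy_load_model`, and the counter-based handle suffixes are stand-ins for illustration and do not reflect the real `CompatibilityExecutor`, `FacadeOperations`, or `dymad.io` signatures.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass(frozen=True)
class PredictionPlan:
    """Hypothetical record of the handles produced by the planning step."""
    checkpoint_handle: str  # uses the chk_* prefix from this document
    request_handle: str     # uses the pred_* prefix from this document


class ToyExecutor:
    """Stand-in for the executor role; not the real CompatibilityExecutor."""

    def __init__(self) -> None:
        self._counter = 0

    def _new_handle(self, prefix: str) -> str:
        # Assumption: handles are a prefix plus an opaque suffix.
        self._counter += 1
        return f"{prefix}_{self._counter}"

    def plan_checkpoint_prediction(self, checkpoint_path: str) -> PredictionPlan:
        # Active step: record checkpoint and prediction-request handles.
        return PredictionPlan(self._new_handle("chk"), self._new_handle("pred"))

    def materialize_checkpoint_prediction(self, plan: PredictionPlan) -> Any:
        # Placeholder step, mirroring the current state described above.
        raise NotImplementedError("materialization goes through the legacy loader")


def toy_load_model(
    checkpoint_path: str,
    executor: ToyExecutor,
    legacy_loader: Callable[[str], Any],
) -> Tuple[Any, PredictionPlan]:
    """Plan through the boundary, then materialize through the legacy path."""
    plan = executor.plan_checkpoint_prediction(checkpoint_path)
    model = legacy_loader(checkpoint_path)
    return model, plan
```

A caller would pass whatever legacy loading function applies, for example `toy_load_model("model.pt", ToyExecutor(), lambda path: {"path": path})`; the point is only that the plan is produced by the boundary while the model object comes from the legacy path.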
## Persisted Artifacts and Handles

The object store persists the main boundary objects used by MCP and compatibility workflows:

- datasets: `ds_*`
- checkpoints: `chk_*`
- training runs: `run_*`
- compiled training requests: `trainreq_*`
- compiled analysis requests: `analysisreq_*`
- evaluations: `eval_*`
- prediction requests: `pred_*`
- spectral snapshots: `specsnap_*`

If a new workflow needs durable planning or inspection across calls, it usually needs a new record type in `agent/store` plus matching facade helpers.

## Design Rules

- Keep policy and validation out of `server.py`.
- Prefer stable user-facing keys in `registry/*` over raw import strings in user-mode flows.
- Put request-shape validation in `compiler/*`, not in MCP adapters.
- Put orchestration in `exec/*`, not in registry or MCP modules.
- Put persistence logic in `store/*` and `facade/*`, not in executor methods.
- Keep model/math/runtime behavior in the implementation packages unless the public boundary changes.

## Tests That Define the Boundary

Use these as the fastest ground truth for the current architecture:

- `tests/test_mcp_server_modes.py`: user/developer mode split
- `tests/test_mcp_user_tools.py`: user-mode compile/train/evaluate path
- `tests/test_training_compiler.py`: typed training compiler behavior
- `tests/test_analysis_workflows.py`: compiled analysis workflows
- `tests/test_checkpoint_e2e_layering.py`: checkpoint planning through exec/facade/store
- `tests/test_public_load_model_boundary.py`: `load_model(...)` still materializes through `dymad.io.checkpoint`

## When Adding Features

If you are deciding where a change belongs, use [feature-placement.md](feature-placement.md). If your change moves the answer, update that file too.
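As a small aid when reading logs or debugging, the handle prefixes listed under Persisted Artifacts and Handles can be turned into a lookup. This is a sketch, not repository code: `handle_kind` is a hypothetical helper, and it assumes the prefix is everything before the first underscore and that the suffix is non-empty, which the document does not specify.

```python
from typing import Dict

# Prefixes copied from the Persisted Artifacts and Handles list.
HANDLE_PREFIXES: Dict[str, str] = {
    "ds": "dataset",
    "chk": "checkpoint",
    "run": "training run",
    "trainreq": "compiled training request",
    "analysisreq": "compiled analysis request",
    "eval": "evaluation",
    "pred": "prediction request",
    "specsnap": "spectral snapshot",
}


def handle_kind(handle: str) -> str:
    """Hypothetical helper: map a handle string to its record kind."""
    # Assumption: "<prefix>_<suffix>" with a non-empty opaque suffix.
    prefix, sep, suffix = handle.partition("_")
    if not sep or not suffix or prefix not in HANDLE_PREFIXES:
        raise ValueError(f"unrecognized handle: {handle!r}")
    return HANDLE_PREFIXES[prefix]
```

A new record type added to `agent/store` would need its prefix added to a table like this one, which is a quick way to notice that the persisted handle types have changed and that this document needs updating.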