# Repository Architecture

## Purpose

This document is the current source of truth for:

- repo package boundaries
- MCP user/developer surfaces
- registry/compiler/execution layering
- checkpoint compatibility flow
- where new features should plug in

For algorithm-specific integration expectations, read [`feature-placement.md`](feature-placement.md)
after this document.

Maintenance rule:

- If a change alters MCP surfaces, boundary ownership, registry/compiler responsibilities,
  checkpoint materialization flow, or persisted handle types, update this document in the same
  change.

## Package Map

| Area | Owns |
| --- | --- |
| `src/dymad/agent/mcp` | MCP-facing tool adapters and server assembly |
| `src/dymad/agent/app` | transport-neutral app services shared by user-facing adapters |
| `src/dymad/agent/registry` | user-facing capability metadata, profiles, schemas, supported analyses/evaluations |
| `src/dymad/agent/compiler` | typed request validation and compilation into persisted requests |
| `src/dymad/agent/exec` | workflow orchestration and compatibility execution |
| `src/dymad/agent/facade` | stable typed boundary over persisted objects |
| `src/dymad/agent/store` | in-memory/filesystem-backed artifact records and handle persistence |
| `src/dymad/models` | model families, collections, typed model specs, rollout helpers |
| `src/dymad/training` | training runtime, phases, trainers, execution helpers |
| `src/dymad/io` | checkpoint loading, trajectory/data managers, legacy public runtime entrypoints |
| `src/dymad/core` | typed runtime/series/transform building blocks |
| `src/dymad/numerics` | math and linear-algebra utilities |
| `src/dymad/sako` | spectral analysis runtime and adapters |

## Layer Stack

Current user-facing stacks:

```text
MCP server
  -> user_tools / developer_tools
  -> app services where transport-neutral workflow assembly is shared
  -> registry + compiler
  -> CompatibilityExecutor
  -> FacadeOperations
  -> ObjectStore / FilesystemArtifactStore
  -> legacy runtime/training/checkpoint/analysis code

dymad CLI
  -> cli.py argument adapter
  -> agent/app path-first workflow service
  -> registry + compiler
  -> CompatibilityExecutor
  -> FacadeOperations
  -> ObjectStore / FilesystemArtifactStore
  -> legacy runtime/training/checkpoint/analysis code
```

Important distinction:

- `server.py` only registers tools and applies the user/developer mode split.
- `user_tools.py` is the high-level surface.
- `demo_tools.py` plus `developer_tools.py` expose the raw/developer surface.
- `cli.py` is the package-level path-first user interface; it should stay thin and delegate
  workflow assembly to `agent/app`.
- `CompatibilityExecutor` still owns orchestration, but some compatibility flows intentionally
  materialize through legacy `io/*` code instead of fully executor-native implementations.

## User Transports

DyMAD now has two user-facing transports over the same registry/compiler/executor/facade/store
boundary:

- MCP user mode is structured and handle-first. It assumes dataset handles already exist and keeps
  `{"ok": ..., "data": ...}` envelopes.
- The `dymad` CLI is path-first and reproducibility-focused. It loads YAML configs, registers
  dataset paths through the facade, compiles through the user-mode training compiler, launches the
  same async worker, and writes `dymad-run.json` under the run directory so later CLI commands can
  recover handles and store location. `dymad train --config ...` can derive the run directory from
  the config file's directory plus `run.name`; `--out` remains available to choose and validate an
  explicit run directory.
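
The run-directory rule described above can be sketched as follows. This is a minimal illustration, not the CLI's actual code: `resolve_run_dir`, `write_manifest`, and the manifest keys (`handles`, `store_root`) are hypothetical names, and the validation that `--out` performs is not modeled.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def resolve_run_dir(config_path: Path, run_name: str, out: Optional[Path] = None) -> Path:
    """Pick the run directory: an explicit --out wins, else <config dir>/<run.name>."""
    if out is not None:
        return out
    return config_path.parent / run_name


def write_manifest(run_dir: Path, handles: dict, store_root: Path) -> Path:
    """Write dymad-run.json so later commands can recover handles and the store."""
    run_dir.mkdir(parents=True, exist_ok=True)
    manifest = run_dir / "dymad-run.json"
    # Hypothetical schema: the real manifest layout is not documented here.
    payload = {"handles": handles, "store_root": str(store_root)}
    manifest.write_text(json.dumps(payload, indent=2))
    return manifest


with tempfile.TemporaryDirectory() as tmp:
    config = Path(tmp) / "experiments" / "config.yaml"
    run_dir = resolve_run_dir(config, run_name="baseline")
    manifest = write_manifest(run_dir, {"run": "run_example"}, Path(tmp) / "store")
    recovered = json.loads(manifest.read_text())
```

Keeping the derivation in one small function makes the precedence between `--out` and the config-relative default easy to test in isolation.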

## MCP Surfaces

`build_server(mode=...)` supports three registrations:

- `mode="user"`: high-level workflows
- `mode="developer"`: raw/debug/compatibility tools
- `mode="both"`: both surfaces on one server
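
As a rough illustration of the three registrations, the sketch below selects tool names by mode. It does not call the real `build_server`; `tools_for_mode` and the abbreviated tool tuples are hypothetical, and de-duplicating tools shared by both surfaces is an assumption about how `mode="both"` could behave.

```python
# Abbreviated, illustrative subsets of the tool names listed in this document.
USER_TOOLS = (
    "compile_training_request",
    "start_training_run",
    "describe_training_run",
)
DEVELOPER_TOOLS = (
    "register_dataset_file",
    "start_model_training",
    "describe_training_run",
)


def tools_for_mode(mode: str) -> list:
    """Return the tool names a server would register for the given mode."""
    if mode == "user":
        names = USER_TOOLS
    elif mode == "developer":
        names = DEVELOPER_TOOLS
    elif mode == "both":
        names = USER_TOOLS + DEVELOPER_TOOLS
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    # dict.fromkeys keeps first-seen order and drops names shared by both surfaces.
    return list(dict.fromkeys(names))


both = tools_for_mode("both")
```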

### User Mode

User mode is registry/compiler-backed. It currently exposes:

- `list_training_capabilities`
- `list_analysis_capabilities`
- `list_evaluation_capabilities`
- `describe_training_capability`
- `compile_training_request`
- `start_training_run`
- `describe_training_run`
- `read_training_run_log`
- `evaluate_checkpoint`
- `compile_analysis_request`
- `run_analysis_request`

Notes:

- user mode does not require raw `model_ref`
- user mode compiles `model_key` plus validated overrides into persisted compiled requests
- `describe_training_capability` is the authoritative contract for allowed overrides, phase-entry
  schemas, CV sweep support metadata, natural-language-to-override translation guidance, and
  surfaced training constraints
- user mode currently assumes dataset handles already exist

### Developer Mode

Developer mode keeps the raw and compatibility-oriented path available:

- `register_dataset_file`
- `inspect_dataset`
- `register_checkpoint`
- `prepare_prediction_request`
- `plan_checkpoint_prediction`
- `start_model_training`
- `describe_training_run`
- `read_training_run_log`
- `evaluate_model`
- `list_evaluation_capabilities`
- `list_model_capabilities`
- `resolve_model_capability`
- `list_profile_capabilities`
- `describe_training_capability`
- `describe_object`
- `list_objects`

Use developer mode when debugging boundary behavior, raw config/profile selection, or compatibility
flows.

## Current Workflow Paths

### Training and Evaluation

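High-level path (dataset registration happens through developer mode or the CLI, because user mode
assumes dataset handles already exist):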

```text
register_dataset_file
  -> describe_training_capability / list_training_capabilities
  -> compile_training_request
  -> start_training_run
  -> describe_training_run / read_training_run_log
  -> evaluate_checkpoint
```

CLI training enters the same path after resolving files from a YAML config:

```text
dymad train --config config.yaml [--out runs/foo]
  -> agent/app CLI workflow service
  -> register_dataset_file for train/valid/test paths
  -> compile_training_request
  -> start_training_run
  -> describe_training_run / read_training_run_log
  -> evaluate_checkpoint via dymad eval
```

Compilation resolves:

- `model_key` -> model capability -> default `model_ref`
- dataset kind compatibility
- default or explicit profile
- allowed user overrides
- optional single-split CV sweep settings under `overrides.cv`, including:
  - `param_grid` candidate definitions for grid or legacy candidate-based adaptive search
  - optional `search` policy whose `mode` selects the CV optimizer (`grid` or
    `nelder_mead_like`) and carries optimizer-specific config such as simplex-style coefficients.
    In the current runtime, `nelder_mead_like` runs either:
    - a bounded continuous search over `search.bounds` lower/upper pairs, or
    - when bounds are omitted, the legacy adaptive path over numeric single-split `param_grid`
      candidates
  - optional `selection` policy (`goal` plus ordered tie-breakers) for deterministic best-model
    choice
- phase overrides normalized against matching profile defaults so trainer-specific phase config is
  preserved unless explicitly overridden
- translation guidance and surfaced constraint notes for clients that map natural-language requests
  into structured overrides, including CV sweep requests
- effective config
- trainer kind

Execution is now submit-and-poll:

- `compile_training_request` still persists the validated compiled request
- `start_training_run` / `start_model_training` persist a `training_run` record immediately and
  spawn `dymad.agent.exec.training_worker`
- the worker reloads the persisted context, marks the run `RUNNING`, executes the private
  synchronous `_execute_training_run(...)` helper, then persists `SUCCEEDED` or `FAILED`
- `describe_training_run` is the polling surface and reconciles stale `RUNNING` jobs whose worker
  pid has disappeared without a terminal write
- `read_training_run_log` returns incremental log chunks from the persisted worker log
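
The submit-and-poll contract, including stale-run reconciliation, can be sketched as below. This is a self-contained toy, not the executor's code: `TrainingRunRecord`, `describe_run`, and `poll_until_terminal` are hypothetical names, the liveness check is injected as a callable rather than tied to a platform API, and treating a vanished worker as `FAILED` is an assumption about what reconciliation does.

```python
from dataclasses import dataclass
from typing import Callable, Optional

TERMINAL = {"SUCCEEDED", "FAILED"}


@dataclass
class TrainingRunRecord:
    handle: str
    status: str
    worker_pid: Optional[int] = None


def describe_run(
    record: TrainingRunRecord, pid_alive: Callable[[int], bool]
) -> TrainingRunRecord:
    """Polling surface: return the record, reconciling a stale RUNNING state first."""
    stale = (
        record.status == "RUNNING"
        and record.worker_pid is not None
        and not pid_alive(record.worker_pid)
    )
    if stale:
        # Assumption: a worker that vanished without a terminal write counts as FAILED.
        record.status = "FAILED"
    return record


def poll_until_terminal(
    record: TrainingRunRecord, pid_alive: Callable[[int], bool], max_polls: int = 10
) -> str:
    """Poll up to max_polls times; a real client would sleep between polls."""
    for _ in range(max_polls):
        if describe_run(record, pid_alive).status in TERMINAL:
            break
    return record.status


# The worker died without writing SUCCEEDED/FAILED, so polling reconciles the record.
orphaned = TrainingRunRecord(handle="run_example", status="RUNNING", worker_pid=4242)
final_status = poll_until_terminal(orphaned, pid_alive=lambda pid: False)
```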

### Analysis

Current analysis path:

```text
compile_analysis_request
  -> persisted compiled analysis request
  -> run_analysis_request
  -> analysis-specific execution in CompatibilityExecutor
```

Currently supported workflow keys:

- `spectral_koopman`
- `vortex_transform_modes`

### Checkpoint Compatibility

Current checkpoint load path:

```text
dymad.io.load_model(...)
  -> CompatibilityExecutor.plan_checkpoint_prediction(...)
  -> FacadeOperations.register_checkpoint(...)
  -> FacadeOperations.prepare_prediction_request(...)
  -> legacy checkpoint materialization in dymad.io.checkpoint
```

This is an important current-state detail:

- `CompatibilityExecutor.plan_checkpoint_prediction(...)` is active.
- `CompatibilityExecutor.materialize_checkpoint_prediction(...)` is not the active materialization
  path today; it is a placeholder that raises `NotImplementedError`.
- the persisted checkpoint and prediction-request handles still record the boundary state used by
  `load_model(...)`.

So the boundary plan is real, but final checkpoint materialization still goes through
`dymad.io.checkpoint`.

## Persisted Artifacts and Handles

The object store persists the main boundary objects used by MCP and compatibility workflows:

- datasets: `ds_*`
- checkpoints: `chk_*`
- training runs: `run_*`
- compiled training requests: `trainreq_*`
- compiled analysis requests: `analysisreq_*`
- evaluations: `eval_*`
- prediction requests: `pred_*`
- spectral snapshots: `specsnap_*`

If a new workflow needs durable planning or inspection across calls, it usually needs a new record
type in `agent/store` plus matching facade helpers.
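
For tooling that needs to route on handle type, the prefixes listed above can be mapped to record kinds. The sketch below is illustrative only: `kind_of_handle` is a hypothetical helper, the kind labels other than `training_run` are made-up names, and nothing is assumed about the characters that follow each prefix.

```python
# Prefixes are taken from the list above; most kind labels are illustrative.
HANDLE_PREFIXES = {
    "ds_": "dataset",
    "chk_": "checkpoint",
    "run_": "training_run",
    "trainreq_": "compiled_training_request",
    "analysisreq_": "compiled_analysis_request",
    "eval_": "evaluation",
    "pred_": "prediction_request",
    "specsnap_": "spectral_snapshot",
}


def kind_of_handle(handle: str) -> str:
    """Return the record kind implied by a handle's prefix."""
    for prefix, kind in HANDLE_PREFIXES.items():
        if handle.startswith(prefix):
            return kind
    raise ValueError(f"unrecognized handle prefix: {handle!r}")


kinds = [kind_of_handle(h) for h in ("ds_example", "trainreq_example", "specsnap_example")]
```

A new record type would add one entry to this mapping alongside its store record and facade helpers.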

## Design Rules

- Keep policy and validation out of `server.py`.
- Prefer stable user-facing keys in `registry/*` over raw import strings in user-mode flows.
- Put request-shape validation in `compiler/*`, not in MCP adapters.
- Put orchestration in `exec/*`, not in registry or MCP modules.
- Put persistence logic in `store/*` and `facade/*`, not in executor methods.
- Keep model/math/runtime behavior in the implementation packages unless the public boundary
  changes.

## Tests That Define the Boundary

Use these as the fastest ground truth for the current architecture:

- `tests/test_mcp_server_modes.py`: user/developer mode split
- `tests/test_mcp_user_tools.py`: user-mode compile/train/evaluate path
- `tests/test_training_compiler.py`: typed training compiler behavior
- `tests/test_analysis_workflows.py`: compiled analysis workflows
- `tests/test_checkpoint_e2e_layering.py`: checkpoint planning through exec/facade/store
- `tests/test_public_load_model_boundary.py`: `load_model(...)` still materializes through
  `dymad.io.checkpoint`

## When Adding Features

If you are deciding where a change belongs, use [`feature-placement.md`](feature-placement.md).

If your change alters where such features should be placed, update that file too.