Repository Architecture

Purpose

This document is the current source of truth for:

repo package boundaries
MCP user/developer surfaces
registry/compiler/execution layering
checkpoint compatibility flow
where new features should plug in

For algorithm-specific integration expectations, read feature-placement.md after this document.

Maintenance rule:

If a change alters MCP surfaces, boundary ownership, registry/compiler responsibilities, checkpoint materialization flow, or persisted handle types, update this document in the same change.

Package Map

Area	Owns
`src/dymad/agent/mcp`	MCP-facing tool adapters and server assembly
`src/dymad/agent/app`	transport-neutral app services shared by user-facing adapters
`src/dymad/agent/registry`	user-facing capability metadata, profiles, schemas, supported analyses/evaluations
`src/dymad/agent/compiler`	typed request validation and compilation into persisted requests
`src/dymad/agent/exec`	workflow orchestration and compatibility execution
`src/dymad/agent/facade`	stable typed boundary over persisted objects
`src/dymad/agent/store`	in-memory/filesystem-backed artifact records and handle persistence
`src/dymad/models`	model families, collections, typed model specs, rollout helpers
`src/dymad/training`	training runtime, phases, trainers, execution helpers
`src/dymad/io`	checkpoint loading, trajectory/data managers, legacy public runtime entrypoints
`src/dymad/core`	typed runtime/series/transform building blocks
`src/dymad/numerics`	math and linear-algebra utilities
`src/dymad/sako`	spectral analysis runtime and adapters

Layer Stack

Current user-facing stacks:

MCP server
  -> user_tools / developer_tools
  -> app services where transport-neutral workflow assembly is shared
  -> registry + compiler
  -> CompatibilityExecutor
  -> FacadeOperations
  -> ObjectStore / FilesystemArtifactStore
  -> legacy runtime/training/checkpoint/analysis code

dymad CLI
  -> cli.py argument adapter
  -> agent/app path-first workflow service
  -> registry + compiler
  -> CompatibilityExecutor
  -> FacadeOperations
  -> ObjectStore / FilesystemArtifactStore
  -> legacy runtime/training/checkpoint/analysis code

Important distinction:

server.py only registers tools and applies the mode split.
user_tools.py is the high-level surface.
demo_tools.py plus developer_tools.py expose the raw/developer surface.
cli.py is the package-level path-first user interface; it should stay thin and delegate workflow assembly to agent/app.
CompatibilityExecutor still owns orchestration, but some compatibility flows intentionally materialize through legacy io/* code instead of fully executor-native implementations.

User Transports

DyMAD now has two user-facing transports over the same registry/compiler/executor/facade/store boundary:

MCP user mode is structured and handle-first. It assumes dataset handles already exist and keeps {"ok": ..., "data": ...} envelopes.
The dymad CLI is path-first and reproducibility-focused. It loads YAML configs, registers dataset paths through the facade, compiles through the user-mode training compiler, and launches the same async worker. It writes dymad-run.json under the run directory so later CLI commands can recover the handles and the store location. dymad train --config ... can derive the run directory from the config file's directory plus run.name; --out remains available to choose and validate an explicit run directory.
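
The run-directory rule for the CLI can be sketched as follows. This is a minimal illustration, not the code in cli.py or agent/app: the function names derive_run_dir and manifest_path are hypothetical, and the sketch assumes an explicit --out value simply takes precedence (the validation that --out performs is not modeled here).

```python
from pathlib import Path
from typing import Optional


def derive_run_dir(config_path: str, run_name: str, out: Optional[str] = None) -> Path:
    """Pick a run directory for a training launch (hypothetical helper)."""
    if out is not None:
        # Assumption: an explicit --out value wins over the derived location.
        return Path(out)
    # Otherwise use the config file's directory plus the configured run.name.
    return Path(config_path).parent / run_name


def manifest_path(run_dir: Path) -> Path:
    """Location of dymad-run.json, which sits under the run directory."""
    return run_dir / "dymad-run.json"
```

For example, derive_run_dir("experiments/config.yaml", "foo") yields experiments/foo, while passing out="runs/foo" yields runs/foo.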

MCP Surfaces

build_server(mode=...) supports three registrations:

mode="user": high-level workflows
mode="developer": raw/debug/compatibility tools
mode="both": both surfaces on one server
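
The three registrations can be pictured as a selection over two tool lists. The sketch below is an assumption-laden illustration, not the contents of server.py: select_tool_names is a hypothetical helper, the lists are small subsets of the tool names in this document, and the de-duplication of names that appear on both surfaces (such as describe_training_run) is an assumed behavior.

```python
from typing import List

# Illustrative subsets of the tool names listed in this document.
USER_TOOLS = ["compile_training_request", "start_training_run", "describe_training_run"]
DEVELOPER_TOOLS = ["register_dataset_file", "start_model_training", "describe_training_run"]


def select_tool_names(mode: str) -> List[str]:
    """Return the tool names a server would register for the given mode."""
    if mode == "user":
        names = USER_TOOLS
    elif mode == "developer":
        names = DEVELOPER_TOOLS
    elif mode == "both":
        names = USER_TOOLS + DEVELOPER_TOOLS
    else:
        raise ValueError(f"unsupported mode: {mode!r}")
    # dict.fromkeys drops duplicate names while keeping first-seen order.
    return list(dict.fromkeys(names))
```

Keeping the selection this small matches the rule that server.py holds no policy or validation beyond registration.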

User Mode

User mode is registry/compiler-backed. It currently exposes:

list_training_capabilities
list_analysis_capabilities
list_evaluation_capabilities
describe_training_capability
compile_training_request
start_training_run
describe_training_run
read_training_run_log
evaluate_checkpoint
compile_analysis_request
run_analysis_request

Notes:

user mode does not require raw model_ref
user mode compiles model_key plus validated overrides into persisted compiled requests
describe_training_capability is the authoritative contract for allowed overrides, phase-entry schemas, CV sweep support metadata, natural-language-to-override translation guidance, and surfaced training constraints
user mode currently assumes dataset handles already exist

Developer Mode

Developer mode keeps the raw and compatibility-oriented path available:

register_dataset_file
inspect_dataset
register_checkpoint
prepare_prediction_request
plan_checkpoint_prediction
start_model_training
describe_training_run
read_training_run_log
evaluate_model
list_evaluation_capabilities
list_model_capabilities
resolve_model_capability
list_profile_capabilities
describe_training_capability
describe_object
list_objects

Use developer mode when debugging boundary behavior, raw config/profile selection, or compatibility flows.

Current Workflow Paths

Training and Evaluation

High-level path:

register_dataset_file
  -> describe_training_capability / list_training_capabilities
  -> compile_training_request
  -> start_training_run
  -> describe_training_run / read_training_run_log
  -> evaluate_checkpoint

CLI training enters the same path after resolving files from a YAML config:

dymad train --config config.yaml [--out runs/foo]
  -> agent/app CLI workflow service
  -> register_dataset_file for train/valid/test paths
  -> compile_training_request
  -> start_training_run
  -> describe_training_run / read_training_run_log
  -> evaluate_checkpoint via dymad eval

Compilation resolves:

model_key -> model capability -> default model_ref
dataset kind compatibility
default or explicit profile
allowed user overrides
optional single-split CV sweep settings under overrides.cv, including:
- param_grid candidate definitions for grid or legacy candidate-based adaptive search
- optional search policy whose mode selects the CV optimizer (grid or nelder_mead_like), plus optimizer-specific config such as simplex-style coefficients. In the current runtime, nelder_mead_like either runs a bounded continuous search over search.bounds lower/upper pairs or, when bounds are omitted, falls back to the legacy adaptive path over numeric single-split param_grid candidates
- optional selection policy (goal plus ordered tie-breakers) for deterministic best-model choice
phase overrides normalized against matching profile defaults so trainer-specific phase config is preserved unless explicitly overridden
translation guidance and surfaced constraint notes for clients that map natural-language requests into structured overrides, including CV sweep requests
effective config
trainer kind

Execution is now submit-and-poll:

compile_training_request still persists the validated compiled request
start_training_run / start_model_training persist a training_run record immediately and spawn dymad.agent.exec.training_worker
the worker reloads the persisted context, marks the run RUNNING, executes the private synchronous _execute_training_run(...) helper, then persists SUCCEEDED or FAILED
describe_training_run is the polling surface and reconciles stale RUNNING jobs whose worker pid has disappeared without a terminal write
read_training_run_log returns incremental log chunks from the persisted worker log
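
The polling side of this flow can be sketched in a few lines. This is a hedged illustration, not the executor's implementation: RunRecord, reconcile, and read_log_chunk are hypothetical names, the liveness check is injected as a callable, and marking a stale run FAILED is an assumption (this document only says stale RUNNING jobs are reconciled).

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical stand-in for a persisted training_run record."""
    status: str
    worker_pid: Optional[int] = None


def reconcile(record: RunRecord, pid_alive: Callable[[int], bool]) -> RunRecord:
    """Resolve a RUNNING record whose worker vanished without a terminal write."""
    if (
        record.status == "RUNNING"
        and record.worker_pid is not None
        and not pid_alive(record.worker_pid)
    ):
        # Assumption: a stale run is recorded as FAILED.
        return replace(record, status="FAILED")
    return record


def read_log_chunk(log_text: str, offset: int, max_chars: int = 4096) -> Tuple[str, int]:
    """Return the next slice of the log and the offset to pass on the next call."""
    chunk = log_text[offset:offset + max_chars]
    return chunk, offset + len(chunk)
```

A client would call the polling surface until the status is terminal, passing the returned offset back each time so it only receives new log text.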

Analysis

Current analysis path:

compile_analysis_request
  -> persisted compiled analysis request
  -> run_analysis_request
  -> analysis-specific execution in CompatibilityExecutor

Currently supported workflow keys:

spectral_koopman
vortex_transform_modes

Checkpoint Compatibility

Current checkpoint load path:

dymad.io.load_model(...)
  -> CompatibilityExecutor.plan_checkpoint_prediction(...)
  -> FacadeOperations.register_checkpoint(...)
  -> FacadeOperations.prepare_prediction_request(...)
  -> legacy checkpoint materialization in dymad.io.checkpoint

This is an important current-state detail:

CompatibilityExecutor.plan_checkpoint_prediction(...) is active.
CompatibilityExecutor.materialize_checkpoint_prediction(...) is not the active materialization path today; it is a placeholder that raises NotImplementedError.
The persisted checkpoint and prediction-request handles still record the boundary state used by load_model(...).

So the boundary plan is real, but final checkpoint materialization still goes through dymad.io.checkpoint.

Persisted Artifacts and Handles

The object store persists the main boundary objects used by MCP and compatibility workflows:

datasets: ds_*
checkpoints: chk_*
training runs: run_*
compiled training requests: trainreq_*
compiled analysis requests: analysisreq_*
evaluations: eval_*
prediction requests: pred_*
spectral snapshots: specsnap_*
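
The prefixes above make it possible to recover a record kind from a handle string alone. The sketch below is a minimal illustration of that idea: handle_kind is a hypothetical helper, not a function in agent/store or agent/facade, and the assumption that the prefix is everything before the first underscore comes only from the patterns listed here.

```python
# Prefix-to-kind table taken from the handle patterns listed in this document.
HANDLE_PREFIXES = {
    "ds": "dataset",
    "chk": "checkpoint",
    "run": "training run",
    "trainreq": "compiled training request",
    "analysisreq": "compiled analysis request",
    "eval": "evaluation",
    "pred": "prediction request",
    "specsnap": "spectral snapshot",
}


def handle_kind(handle: str) -> str:
    """Map a handle such as 'ds_abc' to its record kind (hypothetical helper)."""
    prefix, separator, suffix = handle.partition("_")
    if not separator or not suffix or prefix not in HANDLE_PREFIXES:
        raise ValueError(f"unrecognized handle: {handle!r}")
    return HANDLE_PREFIXES[prefix]
```

A new record type would add one entry to such a table alongside its store record and facade helpers.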

If a new workflow needs durable planning or inspection across calls, it usually needs a new record type in agent/store plus matching facade helpers.

Design Rules

Keep policy and validation out of server.py.
Prefer stable user-facing keys in registry/* over raw import strings in user-mode flows.
Put request-shape validation in compiler/*, not in MCP adapters.
Put orchestration in exec/*, not in registry or MCP modules.
Put persistence logic in store/* and facade/*, not in executor methods.
Keep model/math/runtime behavior in the implementation packages unless the public boundary changes.

Tests That Define the Boundary

Use these as the fastest ground truth for the current architecture:

tests/test_mcp_server_modes.py: user/developer mode split
tests/test_mcp_user_tools.py: user-mode compile/train/evaluate path
tests/test_training_compiler.py: typed training compiler behavior
tests/test_analysis_workflows.py: compiled analysis workflows
tests/test_checkpoint_e2e_layering.py: checkpoint planning through exec/facade/store
tests/test_public_load_model_boundary.py: load_model(...) still materializes through dymad.io.checkpoint

When Adding Features

If you are deciding where a change belongs, use feature-placement.md.

If your change alters where a feature belongs, update that file too.