Repository Architecture

Purpose

This document is the current source of truth for:

  • repo package boundaries

  • MCP user/developer surfaces

  • registry/compiler/execution layering

  • checkpoint compatibility flow

  • where new features should plug in

For algorithm-specific integration expectations, read feature-placement.md after this document.

Maintenance rule:

  • If a change alters MCP surfaces, boundary ownership, registry/compiler responsibilities, checkpoint materialization flow, or persisted handle types, update this document in the same change.

Package Map

| Area | Owns |
| --- | --- |
| src/dymad/agent/mcp | MCP-facing tool adapters and server assembly |
| src/dymad/agent/app | transport-neutral app services shared by user-facing adapters |
| src/dymad/agent/registry | user-facing capability metadata, profiles, schemas, supported analyses/evaluations |
| src/dymad/agent/compiler | typed request validation and compilation into persisted requests |
| src/dymad/agent/exec | workflow orchestration and compatibility execution |
| src/dymad/agent/facade | stable typed boundary over persisted objects |
| src/dymad/agent/store | in-memory/filesystem-backed artifact records and handle persistence |
| src/dymad/models | model families, collections, typed model specs, rollout helpers |
| src/dymad/training | training runtime, phases, trainers, execution helpers |
| src/dymad/io | checkpoint loading, trajectory/data managers, legacy public runtime entrypoints |
| src/dymad/core | typed runtime/series/transform building blocks |
| src/dymad/numerics | math and linear-algebra utilities |
| src/dymad/sako | spectral analysis runtime and adapters |

Layer Stack

Current user-facing stacks:

MCP server
  -> user_tools / developer_tools
  -> app services where transport-neutral workflow assembly is shared
  -> registry + compiler
  -> CompatibilityExecutor
  -> FacadeOperations
  -> ObjectStore / FilesystemArtifactStore
  -> legacy runtime/training/checkpoint/analysis code

dymad CLI
  -> cli.py argument adapter
  -> agent/app path-first workflow service
  -> registry + compiler
  -> CompatibilityExecutor
  -> FacadeOperations
  -> ObjectStore / FilesystemArtifactStore
  -> legacy runtime/training/checkpoint/analysis code

Important distinction:

  • server.py only registers tools and applies the mode split.

  • user_tools.py is the high-level surface.

  • demo_tools.py plus developer_tools.py expose the raw/developer surface.

  • cli.py is the package-level path-first user interface; it should stay thin and delegate workflow assembly to agent/app.

  • CompatibilityExecutor still owns orchestration, but some compatibility flows intentionally materialize through legacy io/* code instead of fully executor-native implementations.

User Transports

DyMAD now has two user-facing transports over the same registry/compiler/executor/facade/store boundary:

  • MCP user mode is structured and handle-first. It assumes dataset handles already exist and keeps {"ok": ..., "data": ...} envelopes.

  • The dymad CLI is path-first and reproducibility-focused. It loads YAML configs, registers dataset paths through the facade, compiles through the user-mode training compiler, and launches the same async worker. It writes dymad-run.json under the run directory so later CLI commands can recover the handles and the store location. dymad train --config ... can derive the run directory from the config file’s directory plus run.name; --out remains available to choose and validate an explicit run directory.
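
The run-directory rule for the CLI can be sketched as follows. This is a minimal illustration, not the actual agent/app code: the function names, the assumption that an explicit --out takes precedence, and the key names written into dymad-run.json are all hypothetical.

```python
import json
from pathlib import Path
from typing import Optional


def resolve_run_dir(config_path: Path, run_name: str, out: Optional[Path] = None) -> Path:
    """Pick the run directory (assumed rule: explicit --out wins,
    otherwise <config file's directory>/<run.name>)."""
    if out is not None:
        return out
    return config_path.parent / run_name


def write_run_manifest(run_dir: Path, handles: dict, store_root: Path) -> Path:
    """Write dymad-run.json so later commands can recover handles and store
    location. The "handles" and "store_root" keys are illustrative only."""
    run_dir.mkdir(parents=True, exist_ok=True)
    manifest_path = run_dir / "dymad-run.json"
    payload = {"handles": handles, "store_root": str(store_root)}
    manifest_path.write_text(json.dumps(payload, indent=2))
    return manifest_path
```

For example, under these assumptions resolve_run_dir(Path("exp/config.yaml"), "foo") yields exp/foo, while passing out=Path("runs/foo") yields runs/foo.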

MCP Surfaces

build_server(mode=...) supports three registrations:

  • mode="user": high-level workflows

  • mode="developer": raw/debug/compatibility tools

  • mode="both": both surfaces on one server
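
As a rough illustration of the three registrations, the sketch below selects tool names by mode. It is not the real build_server: tools_for_mode is a hypothetical helper, the tuples hold only a subset of the tools each mode exposes, and de-duplicating tools that appear on both surfaces is an assumption about mode="both".

```python
# Subsets of the tool names each surface exposes; the full lists are longer.
USER_TOOLS = (
    "compile_training_request",
    "start_training_run",
    "describe_training_run",
    "evaluate_checkpoint",
)
DEVELOPER_TOOLS = (
    "register_dataset_file",
    "start_model_training",
    "describe_training_run",
    "evaluate_model",
)


def tools_for_mode(mode: str) -> tuple:
    """Return the tool names a server in the given mode would register."""
    if mode == "user":
        return USER_TOOLS
    if mode == "developer":
        return DEVELOPER_TOOLS
    if mode == "both":
        # dict.fromkeys keeps first-seen order and drops names shared by
        # both surfaces (assumed behavior, not confirmed by the source).
        return tuple(dict.fromkeys(USER_TOOLS + DEVELOPER_TOOLS))
    raise ValueError(f"unknown mode: {mode!r}")
```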

User Mode

User mode is registry/compiler-backed. It currently exposes:

  • list_training_capabilities

  • list_analysis_capabilities

  • list_evaluation_capabilities

  • describe_training_capability

  • compile_training_request

  • start_training_run

  • describe_training_run

  • read_training_run_log

  • evaluate_checkpoint

  • compile_analysis_request

  • run_analysis_request

Notes:

  • user mode does not require raw model_ref

  • user mode compiles model_key plus validated overrides into persisted compiled requests

  • describe_training_capability is the authoritative contract for allowed overrides, phase-entry schemas, CV sweep support metadata, natural-language-to-override translation guidance, and surfaced training constraints

  • user mode currently assumes dataset handles already exist
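
A minimal sketch of the idea behind these notes: the caller supplies a model_key and overrides, and compilation rejects anything outside the allowed set before producing a request. The function name, the request keys, and the error wording are hypothetical; the real contract is whatever describe_training_capability reports.

```python
def compile_request_sketch(model_key, overrides, allowed_overrides, dataset_handle):
    """Validate overrides against an allowed set and build a request dict.

    Hypothetical shape: the real compiler persists a typed compiled request
    and resolves model_ref internally, so the caller never passes it.
    """
    unknown = sorted(set(overrides) - set(allowed_overrides))
    if unknown:
        raise ValueError(f"overrides not allowed for {model_key!r}: {unknown}")
    return {
        "model_key": model_key,
        "dataset": dataset_handle,
        "overrides": dict(overrides),
    }
```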

Developer Mode

Developer mode keeps the raw and compatibility-oriented path available:

  • register_dataset_file

  • inspect_dataset

  • register_checkpoint

  • prepare_prediction_request

  • plan_checkpoint_prediction

  • start_model_training

  • describe_training_run

  • read_training_run_log

  • evaluate_model

  • list_evaluation_capabilities

  • list_model_capabilities

  • resolve_model_capability

  • list_profile_capabilities

  • describe_training_capability

  • describe_object

  • list_objects

Use developer mode when debugging boundary behavior, raw config/profile selection, or compatibility flows.

Current Workflow Paths

Training and Evaluation

High-level path:

register_dataset_file
  -> describe_training_capability / list_training_capabilities
  -> compile_training_request
  -> start_training_run
  -> describe_training_run / read_training_run_log
  -> evaluate_checkpoint

CLI training enters the same path after resolving files from a YAML config:

dymad train --config config.yaml [--out runs/foo]
  -> agent/app CLI workflow service
  -> register_dataset_file for train/valid/test paths
  -> compile_training_request
  -> start_training_run
  -> describe_training_run / read_training_run_log
  -> evaluate_checkpoint via dymad eval

Compilation resolves:

  • model_key -> model capability -> default model_ref

  • dataset kind compatibility

  • default or explicit profile

  • allowed user overrides

  • optional single-split CV sweep settings under overrides.cv, including:

    • param_grid candidate definitions for grid or legacy candidate-based adaptive search

    • optional search policy whose mode selects the CV optimizer (grid or nelder_mead_like), plus optimizer-specific config such as simplex-style coefficients. In the current runtime, nelder_mead_like runs a bounded continuous search when search.bounds supplies lower/upper pairs; when bounds are omitted, it falls back to the legacy adaptive path over numeric single-split param_grid candidates

    • optional selection policy (goal plus ordered tie-breakers) for deterministic best-model choice

  • phase overrides normalized against matching profile defaults so trainer-specific phase config is preserved unless explicitly overridden

  • translation guidance and surfaced constraint notes for clients that map natural-language requests into structured overrides, including CV sweep requests

  • effective config

  • trainer kind
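
The grid-mode CV settings above can be pictured with the sketch below: expand a param_grid into candidates, then pick a winner from a goal plus ordered tie-breakers. The data shapes (a "metrics" dict per result, (metric, direction) pairs) and both function names are assumptions made for illustration; they are not the compiler's actual schema.

```python
from itertools import product


def expand_param_grid(param_grid: dict) -> list:
    """Expand {name: [values...]} into one dict per grid candidate.

    Names are sorted so the candidate order is reproducible.
    """
    names = sorted(param_grid)
    value_lists = (param_grid[name] for name in names)
    return [dict(zip(names, values)) for values in product(*value_lists)]


def select_best(results: list, goal: tuple, tie_breakers: tuple = ()) -> dict:
    """Choose deterministically using the goal, then ordered tie-breakers.

    goal and each tie-breaker are (metric_name, "min" | "max") pairs.
    """
    def sort_key(result):
        parts = []
        for metric, direction in (goal, *tie_breakers):
            value = result["metrics"][metric]
            # Negate "max" metrics so a single min() handles both directions.
            parts.append(value if direction == "min" else -value)
        return tuple(parts)

    # min() returns the first minimal element, so full ties resolve to the
    # earliest candidate in grid order.
    return min(results, key=sort_key)
```

With a goal of ("val_loss", "min") and a tie-breaker of ("n_params", "min"), two candidates with equal validation loss resolve to the smaller model.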

Execution is now submit-and-poll:

  • compile_training_request still persists the validated compiled request

  • start_training_run / start_model_training persist a training_run record immediately and spawn dymad.agent.exec.training_worker

  • the worker reloads the persisted context, marks the run RUNNING, executes the private synchronous _execute_training_run(...) helper, then persists SUCCEEDED or FAILED

  • describe_training_run is the polling surface and reconciles stale RUNNING jobs whose worker pid has disappeared without a terminal write

  • read_training_run_log returns incremental log chunks from the persisted worker log
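
The stale-run reconciliation step can be sketched as below. This is an assumption-laden illustration rather than the executor's code: the record keys ("status", "worker_pid", "error"), the choice of FAILED as the reconciled status, and the function names are hypothetical, and the liveness probe is POSIX-only.

```python
import os


def pid_alive(pid: int) -> bool:
    """POSIX-only probe: signal 0 checks that a process exists without
    delivering a signal. Do not use this on Windows."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


def reconcile_run(run: dict, is_alive=pid_alive) -> dict:
    """Return the run record, marking it terminal if its worker vanished.

    Only RUNNING records are touched; the FAILED status and error text are
    assumed outcomes for a worker that exited without a terminal write.
    """
    if run["status"] == "RUNNING" and not is_alive(run["worker_pid"]):
        return {**run, "status": "FAILED",
                "error": "worker exited without a terminal write"}
    return run
```

Passing is_alive as a parameter keeps the reconciliation logic testable without spawning real processes.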

Analysis

Current analysis path:

compile_analysis_request
  -> persisted compiled analysis request
  -> run_analysis_request
  -> analysis-specific execution in CompatibilityExecutor

Currently supported workflow keys:

  • spectral_koopman

  • vortex_transform_modes

Checkpoint Compatibility

Current checkpoint load path:

dymad.io.load_model(...)
  -> CompatibilityExecutor.plan_checkpoint_prediction(...)
  -> FacadeOperations.register_checkpoint(...)
  -> FacadeOperations.prepare_prediction_request(...)
  -> legacy checkpoint materialization in dymad.io.checkpoint

Important current-state details:

  • CompatibilityExecutor.plan_checkpoint_prediction(...) is active.

  • CompatibilityExecutor.materialize_checkpoint_prediction(...) is not the active materialization path today; it is a placeholder that raises NotImplementedError.

  • the persisted checkpoint and prediction-request handles still record the boundary state used by load_model(...).

So the boundary plan is real, but final checkpoint materialization still goes through dymad.io.checkpoint.
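
The split between planning and materialization can be sketched like this. ExecutorSketch and load_model_sketch are stand-ins written for illustration, the handle values are placeholders, and legacy_loader represents the dymad.io.checkpoint path; none of this reproduces the real signatures.

```python
from typing import Any, Callable, Optional, Tuple


class ExecutorSketch:
    """Stand-in for the planning/materialization split; not the real class."""

    def plan_checkpoint_prediction(self, checkpoint_path: str) -> dict:
        # The real planner registers a checkpoint and prepares a prediction
        # request through the facade; these handle values are placeholders.
        return {
            "checkpoint": "chk_example",
            "prediction_request": "pred_example",
            "path": checkpoint_path,
        }

    def materialize_checkpoint_prediction(self, plan: dict) -> Any:
        # Mirrors the documented placeholder behavior.
        raise NotImplementedError("materialization still uses the legacy io path")


def load_model_sketch(
    checkpoint_path: str,
    legacy_loader: Callable[[str], Any],
    executor: Optional[ExecutorSketch] = None,
) -> Tuple[Any, dict]:
    """Plan through the executor, then materialize through the legacy loader."""
    executor = executor or ExecutorSketch()
    plan = executor.plan_checkpoint_prediction(checkpoint_path)
    model = legacy_loader(checkpoint_path)
    return model, plan
```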

Persisted Artifacts and Handles

The object store persists the main boundary objects used by MCP and compatibility workflows:

  • datasets: ds_*

  • checkpoints: chk_*

  • training runs: run_*

  • compiled training requests: trainreq_*

  • compiled analysis requests: analysisreq_*

  • evaluations: eval_*

  • prediction requests: pred_*

  • spectral snapshots: specsnap_*

If a new workflow needs durable planning or inspection across calls, it usually needs a new record type in agent/store plus matching facade helpers.
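
As an illustration of the handle prefixes listed above, the sketch below maps a handle string back to its record kind. The prefixes come from this document; the kind names on the right and the handle_kind helper are hypothetical labels, not the store's actual type names.

```python
# Prefix -> illustrative kind label (labels are not the store's real names).
HANDLE_PREFIXES = {
    "ds_": "dataset",
    "chk_": "checkpoint",
    "run_": "training_run",
    "trainreq_": "compiled_training_request",
    "analysisreq_": "compiled_analysis_request",
    "eval_": "evaluation",
    "pred_": "prediction_request",
    "specsnap_": "spectral_snapshot",
}


def handle_kind(handle: str) -> str:
    """Return the record kind implied by a handle's prefix."""
    for prefix, kind in HANDLE_PREFIXES.items():
        if handle.startswith(prefix):
            return kind
    raise ValueError(f"unrecognized handle prefix: {handle!r}")
```

A new durable record type would add one more prefix to such a table, alongside its store record and facade helpers.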

Design Rules

  • Keep policy and validation out of server.py.

  • Prefer stable user-facing keys in registry/* over raw import strings in user-mode flows.

  • Put request-shape validation in compiler/*, not in MCP adapters.

  • Put orchestration in exec/*, not in registry or MCP modules.

  • Put persistence logic in store/* and facade/*, not in executor methods.

  • Keep model/math/runtime behavior in the implementation packages unless the public boundary changes.

Tests That Define the Boundary

Use these as the fastest ground truth for the current architecture:

  • tests/test_mcp_server_modes.py: user/developer mode split

  • tests/test_mcp_user_tools.py: user-mode compile/train/evaluate path

  • tests/test_training_compiler.py: typed training compiler behavior

  • tests/test_analysis_workflows.py: compiled analysis workflows

  • tests/test_checkpoint_e2e_layering.py: checkpoint planning through exec/facade/store

  • tests/test_public_load_model_boundary.py: load_model(...) still materializes through dymad.io.checkpoint

When Adding Features

If you are deciding where a change belongs, use feature-placement.md.

If your change alters where a kind of feature belongs, update that file too.