seamless-config

seamless-config is the configuration and infrastructure-selection layer for Seamless projects.

Seamless models work as pipelines of cacheable steps. seamless-config answers the question "where does this pipeline run, and against which storage?" — locally in-process, in local worker processes, or on a remote cluster — without touching the step code. It reads plain YAML files, resolves cluster topology, and wires the right remote backends (buffer store, database, jobserver, dask scheduler) into the Seamless runtime before the first transformation runs.

Installation

pip install seamless-config

Quick start

import seamless.config

seamless.config.init()          # reads seamless.yaml / seamless.profile.yaml
                                # from the caller's directory upward
# … build and run your workflow …

From the command line, seamless-init performs the same initialisation and verifies that the configured remote services are reachable:

seamless-init                   # default stage
seamless-init --stage prod      # named stage
seamless-init --stage prod:gpu  # named stage + substage

Configuration files

seamless-config discovers two optional YAML files in the work directory (and optionally its parents):

File	Commit to VCS?	Purpose
`seamless.yaml`	Yes	Project-wide, deterministic defaults (project name, stage name, `inherit_from_parent`)
`seamless.profile.yaml`	No (add to `.gitignore`)	Developer-specific overrides — cluster hostnames, experimental settings, local credentials

Both files use the same command language. When a file contains inherit_from_parent, the loader also reads the parent directory and prepends its commands, repeating until a directory without that flag is reached (or the filesystem root). Parent defaults always run before child overrides.
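The inheritance rule can be sketched in a few lines. This is a minimal illustration, not the actual loader: `collect_commands` and the `load` callback are hypothetical names, each directory is reduced to a single already-parsed command list (the interplay between seamless.yaml and seamless.profile.yaml is ignored), and dropping the flag from the merged result is a choice made here for readability.

```python
from pathlib import PurePosixPath
from typing import Callable, Optional


def collect_commands(workdir: str, load: Callable[[str], Optional[list]]) -> list:
    """Return the commands for `workdir`, with inherited parent commands first.

    `load(directory)` is a hypothetical callback that returns the parsed
    command list found in that directory, or None when there is none.
    """
    chain = []
    current = PurePosixPath(workdir)
    while True:
        commands = load(str(current)) or []
        chain.append(commands)
        inherits = "inherit_from_parent" in commands
        # Stop at the first directory without the flag, or at the root.
        if not inherits or current.parent == current:
            break
        current = current.parent

    merged = []
    # `chain` runs child -> parent; reverse it so parent commands come first.
    for commands in reversed(chain):
        # Assumption: the flag itself is dropped from the merged list.
        merged.extend(c for c in commands if c != "inherit_from_parent")
    return merged


# Usage with an in-memory stand-in for the filesystem:
files = {
    "/work": [{"project": "my-project"}],
    "/work/sub": ["inherit_from_parent", {"subproject": "analysis"}],
}
merged = collect_commands("/work/sub", files.get)
print(merged)
```

Because the parent list is emitted first, a later child command can override a parent default simply by appearing after it.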

Command language

Each file must be a YAML list. Every item is either a bare string command or a single-key mapping:

# seamless.yaml
- project: my-project
- inherit_from_parent

Command	Argument	Effect
`project`	string	Sets the project name (used as the storage path component)
`subproject`	string	Sets an optional sub-path inside the project
`cluster`	string	Selects the active cluster by name
`execution`	`process` / `spawn` / `remote`	Sets the execution mode
`queue`	string	Selects a named queue on the current cluster
`remote`	`null` / `daskserver` / `jobserver`	Pins the remote backend when a cluster exposes both
`persistent`	boolean	Forces persistent storage on or off; defaults to `true` when a cluster is set
`clusters`	mapping	Defines cluster objects inline (runs before other commands)
`record`	boolean	Enables full execution-record capture (default: minimal records only)
`node`	string / null	Selects a named node within the current cluster (advanced: cluster-internal scheduling)
`inherit_from_parent`	—	Also reads commands from the parent directory, prepended
`stage <name>`	list of commands	Runs the nested commands only when the current stage matches `<name>`

If no execution command is encountered, the loader defaults to remote when a cluster is selected and process otherwise.
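As a rough sketch of that default, assuming commands have already been parsed into bare strings and single-key mappings: `normalise` and `resolve_execution` are illustrative names, not part of the seamless-config API, and the real loader tracks far more state than these two values.

```python
def normalise(item):
    """Turn a bare string or a single-key mapping into a (name, argument) pair."""
    if isinstance(item, str):
        return item, None
    if isinstance(item, dict) and len(item) == 1:
        (name, arg), = item.items()
        return name, arg
    raise ValueError(f"not a valid command: {item!r}")


def resolve_execution(commands):
    """Apply the documented default: 'remote' with a cluster, else 'process'."""
    cluster = None
    execution = None
    for name, arg in map(normalise, commands):
        if name == "cluster":
            cluster = arg
        elif name == "execution":
            execution = arg
    if execution is None:
        execution = "remote" if cluster is not None else "process"
    return execution


print(resolve_execution([{"project": "my-project"}]))
print(resolve_execution([{"cluster": "local"}]))
print(resolve_execution([{"cluster": "local"}, {"execution": "spawn"}]))
```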

See COMMAND_LANGUAGE.md for the full specification.

Stage blocks

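Use a stage <name>: key to activate commands only in a specific stage. In the example below, hpc-cluster is assumed to be defined elsewhere (for example in ~/.seamless/clusters.yaml):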

# seamless.profile.yaml
- clusters:
    local:
      tunnel: false
      type: local
      frontends:
        - hostname: localhost
          hashserver:
            bufferdir: /data/buffers
            conda: hashserver
            network_interface: 127.0.0.1
            port_start: 55100
            port_end: 55199
          database:
            database_dir: /data/db
            conda: seamless-database
            network_interface: 127.0.0.1
            port_start: 55200
            port_end: 55299
          jobserver:
            conda: seamless-jobserver
            network_interface: 127.0.0.1
            port_start: 55300
            port_end: 55399

- cluster: local

- stage prod:
    - cluster: hpc-cluster
    - execution: remote
    - remote: daskserver
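One way to picture how a stage block is evaluated is as a flattening pass over the parsed command list. The sketch below is an assumption about the behaviour, not the real implementation: `expand_stages` is a hypothetical helper, and treating nested stage blocks recursively is a guess.

```python
def expand_stages(commands, stage=None):
    """Flatten 'stage <name>' blocks, keeping only those that match `stage`."""
    result = []
    for item in commands:
        if isinstance(item, dict) and len(item) == 1:
            key, value = next(iter(item.items()))
            if isinstance(key, str) and key.startswith("stage "):
                if key[len("stage "):].strip() == stage:
                    # Matching block: splice its commands in at this position.
                    result.extend(expand_stages(value, stage))
                continue  # non-matching blocks are skipped entirely
        result.append(item)
    return result


# The relevant part of the profile above, as parsed YAML:
profile = [
    {"cluster": "local"},
    {"stage prod": [
        {"cluster": "hpc-cluster"},
        {"execution": "remote"},
        {"remote": "daskserver"},
    ]},
]
print(expand_stages(profile))          # default stage
print(expand_stages(profile, "prod"))  # 'prod' stage
```

In the prod case the later cluster command presumably replaces the earlier cluster: local, which is what makes the block act as an override.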

Cluster definitions

Clusters are defined in ~/.seamless/clusters.yaml and/or individual files under ~/.seamless/clusters/*.yaml. They can also be inlined in seamless.profile.yaml via the clusters command (useful for portable projects).

A cluster definition describes the topology of its frontend nodes and the services each one can host:

# ~/.seamless/clusters.yaml

mycluster:
  tunnel: true                # connect via SSH tunnel
  type: slurm                 # local | slurm | oar
  workers: 4                  # for 'spawn' / 'jobserver' mode
  frontends:
    - hostname: login.mycluster.example
      ssh_hostname: login.mycluster.example   # optional SSH override
      hashserver:
        bufferdir: /scratch/seamless/buffers
        conda: hashserver
        network_interface: 0.0.0.0
        port_start: 60100
        port_end: 60199
      database:
        database_dir: /scratch/seamless/db
        conda: seamless-database
        network_interface: 0.0.0.0
        port_start: 60200
        port_end: 60299
      daskserver:
        network_interface: 0.0.0.0
        port_start: 60300
        port_end: 60399

  default_queue: default
  queues:
    default:
      conda: seamless-dask
      walltime: "01:00:00"
      cores: 16
      memory: 32000MB
      tmpdir: /tmp
      maximum_jobs: 20
      unknown_task_duration: 1m
      target_duration: 10m
      lifetime_stagger: 4m

    highmem:
      TEMPLATE: default         # inherit all fields from 'default', then override
      cores: 4
      memory: 128000MB

Frontend services

Each frontend entry can expose any subset of:

Service	Role
`hashserver`	Stores and serves content-addressed buffers (the raw bytes of cell values)
`database`	Stores transformation results (maps input checksum → result checksum, backed by SQLite)
`jobserver`	HTTP job dispatch — accepts serialised transformations and returns results
`daskserver`	Dask-backed HPC scheduler — submits jobs to SLURM/OAR via `dask-jobqueue`

When both jobserver and daskserver are present on the same cluster, remote: jobserver or remote: daskserver must be specified explicitly in seamless.profile.yaml.
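The selection rule can be expressed as a small function. This is a sketch under assumptions: `select_remote_backend` is a hypothetical name, the frontend is represented as a plain dict of its service entries, and the error messages are invented for illustration.

```python
def select_remote_backend(frontend_services, pinned=None):
    """Pick 'jobserver' or 'daskserver' for a frontend.

    `frontend_services` is the mapping of service names found on a frontend;
    `pinned` is the value of the `remote` command, if one was given.
    """
    available = [s for s in ("jobserver", "daskserver") if s in frontend_services]
    if pinned is not None:
        if pinned not in available:
            raise ValueError(f"remote: {pinned} is not exposed by this cluster")
        return pinned
    if len(available) == 1:
        return available[0]
    if not available:
        raise ValueError("cluster exposes no remote backend")
    raise ValueError(
        "cluster exposes both jobserver and daskserver; "
        "add 'remote: jobserver' or 'remote: daskserver'"
    )


both = {"hashserver": {}, "database": {}, "jobserver": {}, "daskserver": {}}
print(select_remote_backend({"hashserver": {}, "daskserver": {}}))
print(select_remote_backend(both, pinned="jobserver"))
```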

Queue templates

A queue entry with a TEMPLATE key inherits all fields from the named queue (which must be defined earlier in the same cluster), then overrides only the explicitly provided fields:

queues:
  base:
    conda: seamless-dask
    walltime: "02:00:00"
    cores: 8
    memory: 16000MB
    maximum_jobs: 10
    tmpdir: /tmp
    unknown_task_duration: 1m
    target_duration: 10m

  gpu:
    TEMPLATE: base
    memory: 32000MB
    job_extra_directives: ["--gres=gpu:1"]
    dask_resources: {GPU: 1}
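Template resolution amounts to a dictionary merge in definition order. The following is a minimal sketch rather than the actual code: `resolve_queues` is a hypothetical helper that works on the parsed queues mapping and assumes the mapping preserves the order in which queues appear in the file.

```python
def resolve_queues(queues):
    """Expand TEMPLATE references in a parsed 'queues' mapping."""
    resolved = {}
    for name, fields in queues.items():
        fields = dict(fields)  # copy so the input is left untouched
        template = fields.pop("TEMPLATE", None)
        if template is not None:
            if template not in resolved:
                raise KeyError(
                    f"queue {name!r}: template {template!r} is not defined earlier"
                )
            # Inherit every field, then let explicit fields win.
            fields = {**resolved[template], **fields}
        resolved[name] = fields
    return resolved


queues = {
    "base": {"conda": "seamless-dask", "walltime": "02:00:00",
             "cores": 8, "memory": "16000MB"},
    "gpu": {"TEMPLATE": "base", "memory": "32000MB",
            "dask_resources": {"GPU": 1}},
}
resolved = resolve_queues(queues)
print(resolved["gpu"])
```

Here gpu ends up with cores: 8 and walltime "02:00:00" from base, while memory takes the overriding value.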

Execution modes

Mode	`execution:` value	Description
In-process	`process`	Transformations run in the client Python process. Default when no cluster is defined.
Local workers	`spawn`	Transformations are dispatched to a local worker pool (uses `workers` from the cluster definition).
Remote (jobserver)	`remote` + `remote: jobserver`	Lightweight HTTP jobserver with a fixed worker pool. Typically used as a test setup or for simple local-cluster scenarios.
Remote (daskserver)	`remote` + `remote: daskserver`	Delegates to a Dask cluster with persistent storage. Any Dask `Cluster` subclass is supported; current cluster definitions cover SLURM and OAR via `dask-jobqueue`.
Pure Dask	`remote` + `remote: daskserver` + `persistent: false`	Dask execution without Seamless persistence (no hashserver/database). For batch jobs that don't need incremental caching.

The three remote backends (jobserver, daskserver, pure Dask) are mutually exclusive: either the cluster frontend exposes exactly one of them, or you pin the choice with remote: jobserver or remote: daskserver in seamless.profile.yaml.

remote-http-launcher handles two scheduler-placement topologies automatically. When hostname is set, it connects over SSH to the configured frontend and runs the Dask wrapper as a daemon there; when hostname is absent, it runs the wrapper as a local daemon on the client. This is independent of the cluster type: a cluster with type: local still uses distributed.LocalCluster, but that LocalCluster may live either on the client or on a remote frontend reached over SSH. The configuration schema currently covers local, slurm, and oar; cloud provider support would require extending the cluster definition vocabulary, not the launcher or wrapper.

Persistence

When a cluster is selected and persistent: true (the default), seamless-config activates:

seamless_remote.buffer_remote — writes/reads buffers via the cluster's hashserver
seamless_remote.database_remote — checks and records transformation results via the cluster's database

In execution: remote mode it also activates the chosen job delegation backend (jobserver_remote or daskserver_remote).

Stages and substages

A stage is an independent execution and storage context. Each named stage gets its own subdirectory under the project path; without a named stage, the project path is used directly:

<bufferdir>/<project>[/<subproject>][/STAGE-<stage>]
<database_dir>/<project>[/<subproject>][/STAGE-<stage>]

Stages are useful when the same project has multiple phases — e.g. build, test, prod — that must not share cached results.

A substage further subdivides the job-dispatch scope (one jobserver/daskserver per substage) without splitting storage. Substages are useful when different parts of the same stage need different hardware (for example, CPU vs GPU queues).
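The path templates above can be reproduced with a short helper. This is a sketch only: `storage_path` is a hypothetical function, and it deliberately takes no substage argument, since substages do not split storage.

```python
from pathlib import PurePosixPath


def storage_path(base, project, subproject=None, stage=None):
    """Build <base>/<project>[/<subproject>][/STAGE-<stage>]."""
    path = PurePosixPath(base) / project
    if subproject:
        path /= subproject
    if stage:
        path /= f"STAGE-{stage}"
    return str(path)


print(storage_path("/data/buffers", "my-project"))
print(storage_path("/data/buffers", "my-project", stage="prod"))
print(storage_path("/data/db", "my-project", subproject="analysis", stage="test"))
```

Two substages of the same stage, such as prod:cpu and prod:gpu, would therefore resolve to the same buffer and database directories.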

Python API

import seamless.config

# Simple initialization (no named stage)
seamless.config.init()

# Named stage (re-evaluates config with 'stage prod:' blocks active)
seamless.config.set_stage("prod")

# Named stage + substage
seamless.config.set_stage("prod", "gpu")

# Change substage without changing stage
seamless.config.set_substage("cpu")

# Override the directory used for config file lookup
seamless.config.set_workdir("/path/to/project")

init() is a no-op if already initialised. set_stage() deactivates any previously active remote clients before re-configuring.

Forwarding remote clients to worker processes

When a job runs inside the cluster, it may need to connect back to the same remote services. collect_remote_clients / set_remote_clients serialise and restore the active client configuration.

Note: this API is already used by the daskserver. There is no need to do this from user code.

import seamless_config

# On the client, before submitting a job:
clients = seamless_config.collect_remote_clients("mycluster")
# Pass 'clients' to the worker via job parameters or an environment variable

# Inside the worker (or set SEAMLESS_REMOTE_CLIENTS=<json> in the environment).
# This must run before init():
seamless_config.set_remote_clients(clients, in_remote=True)

set_remote_clients must be called before init(). Worker bootstrap code can also call set_remote_clients_from_env() to pick up the JSON from the SEAMLESS_REMOTE_CLIENTS environment variable automatically.
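The environment-variable hand-off is a JSON round trip. The sketch below shows only that mechanism and does not call seamless-config: `export_clients` and `import_clients` are hypothetical helpers, and the payload is a placeholder, since the real structure returned by collect_remote_clients is not described here.

```python
import json

ENV_VAR = "SEAMLESS_REMOTE_CLIENTS"


def export_clients(clients, environ):
    """Serialise a client configuration into the worker's environment mapping."""
    environ[ENV_VAR] = json.dumps(clients)


def import_clients(environ):
    """Restore the client configuration, or return None when the variable is unset."""
    raw = environ.get(ENV_VAR)
    return None if raw is None else json.loads(raw)


# Placeholder payload; the real one comes from collect_remote_clients().
payload = {"example_key": "example_value"}

worker_env = {}  # stands in for the environment passed to the worker process
export_clients(payload, worker_env)
restored = import_clients(worker_env)
print(restored == payload)
```

In a real worker the mapping would be os.environ, and the restored value would be handed to set_remote_clients before init() runs.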

`seamless-init` CLI

seamless-init is a convenience script that calls seamless_config.init() (or set_stage), then calls ensure_initialized() on each active remote backend to verify that the servers are reachable before the workflow starts.

usage: seamless-init [--stage STAGE[:SUBSTAGE]]

It exits immediately (success) when the SEAMLESS_REMOTE_CLIENTS environment variable is present, so that worker processes that bootstrap themselves with set_remote_clients are not affected.
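The control flow described above might look roughly like this. It is a sketch, not the shipped script: `parse_stage` and `plan` are hypothetical names, and instead of calling seamless-config the function returns a tuple describing what would be called.

```python
import argparse
import os


def parse_stage(value):
    """Split 'STAGE[:SUBSTAGE]' into (stage, substage-or-None)."""
    stage, _, substage = value.partition(":")
    return stage, (substage or None)


def plan(argv=None, environ=None):
    """Return the action the CLI would take, without performing it."""
    environ = os.environ if environ is None else environ
    # Workers that bootstrap via set_remote_clients must not be re-initialised.
    if "SEAMLESS_REMOTE_CLIENTS" in environ:
        return ("skip",)
    parser = argparse.ArgumentParser(prog="seamless-init")
    parser.add_argument("--stage", default=None)
    args = parser.parse_args(argv)
    if args.stage is None:
        return ("init",)
    return ("set_stage", *parse_stage(args.stage))


print(plan([], environ={}))
print(plan(["--stage", "prod"], environ={}))
print(plan(["--stage", "prod:gpu"], environ={}))
print(plan([], environ={"SEAMLESS_REMOTE_CLIENTS": "{}"}))
```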

Service management (`seamless-service-*`)

seamless-config ships a Seamless-aware wrapper layer over the raw rhl-* helpers from remote-http-launcher. The wrappers accept Seamless-level arguments (--service, --cluster, --project, --stage) and dispatch to the right host over SSH; callers do not need to know the raw launcher key.

Command	Purpose
`seamless-service-ps [--server] [--persistent]`	Process state (and optionally persistent buffer/DB state) — `meta` block populates per-row `(service, project, stage)`
`seamless-service-logs --service hashserver [--tail N]`	Read the service log without knowing the raw key
`seamless-service-inspect --service hashserver`	Print the server state JSON (PID, port, status, workdir, command, `meta`)
`seamless-service-stop`	Stop processes via SIGINT → SIGTERM → SIGKILL escalation; preserves JSON state
`seamless-service-rm`	Remove JSON state; logs are preserved
`seamless-service-clear --service hashserver --project P [--stage S]`	Clear hashserver buffers or `seamless.db` for a project/stage
`seamless-service-resolve --service hashserver --cluster C --project P [--stage S]`	Translate Seamless-level args → raw `key`, `ssh_hostname`, `workdir`, `log_path` (no side effects, JSON output)

seamless-service-resolve is an extractor, not a synthesizer: it does not reimplement the resolution logic but shares the same code path as seamless-run and the launched clients. Tools and agents should call it on every invocation rather than caching its outputs, because keys, workdir paths, and host-selection logic may change between Seamless versions.

Cluster-wide variants (--cluster MYCLUSTER) operate on every service of that cluster.
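The SIGINT → SIGTERM → SIGKILL escalation used by seamless-service-stop follows a common pattern, sketched here for a POSIX host. Everything in it is an assumption for illustration: `stop_with_escalation` is a hypothetical function, the grace period and polling interval are invented, and the real command runs on the remote host over SSH rather than locally.

```python
import signal
import time


def stop_with_escalation(pid, send, is_alive, grace=5.0, poll=0.1, sleep=time.sleep):
    """Send increasingly forceful signals until the process is gone.

    `send(pid, sig)` delivers a signal (os.kill has this signature) and
    `is_alive(pid)` reports whether the process still exists. Returns the
    signal that stopped the process, or None if it survived all three.
    """
    for sig in (signal.SIGINT, signal.SIGTERM, signal.SIGKILL):
        send(pid, sig)
        waited = 0.0
        while waited < grace:
            if not is_alive(pid):
                return sig
            sleep(poll)
            waited += poll
    return None if is_alive(pid) else signal.SIGKILL


# Demonstration with a fake process that ignores SIGINT but honours SIGTERM:
sent = []
state = {"alive": True}


def fake_send(pid, sig):
    sent.append(sig)
    if sig == signal.SIGTERM:
        state["alive"] = False


result = stop_with_escalation(
    1234, fake_send, lambda pid: state["alive"],
    grace=0.3, poll=0.1, sleep=lambda seconds: None,
)
print(result == signal.SIGTERM)
```

Injecting `send`, `is_alive`, and `sleep` keeps the sketch testable without spawning a real process.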

Server-side requirements

remote-http-launcher must be installed on every remote server that seamless-service-* targets, because it provides all rhl-* helpers. There is no inline fallback in the wrappers. Two install paths are supported:

System install (with root): pip install remote-http-launcher into the system Python; helpers land in /usr/local/bin.
Conda base env install (no root): pip install remote-http-launcher into the remote host's conda base environment; helpers land in $HOME/miniforge3/bin or $HOME/miniconda3/bin.

No .bashrc edit is required for either path. seamless-service-* automatically prepends $HOME/miniforge3/bin:$HOME/miniconda3/bin to PATH on every SSH call, so conda-base installs work without any shell startup changes.

Note: remote-http-launcher itself has its own fallback — it can probe conda configuration via inline heredocs when no rhl-* helpers are available on the host. That fallback covers only the launcher's bootstrap; it does not extend to seamless-service-* or any other tooling that calls rhl-* directly.

Execution records (`record` command)

When a transformation completes successfully, Seamless writes one execution record into seamless.db keyed by tf_checksum. The record command in seamless.profile.yaml controls how much is captured:

- record: true       # full record: env fingerprints, compilation context, freshness, ...
- record: false      # minimal record (default): timing + memory + execution mode

The default (record: false) writes a small body containing tf_checksum, result_checksum, seamless_version, execution_mode, remote_target, wall/CPU/GPU timing, and memory peak. The record: true opt-in enables full reproducibility-audit capture, intended for debugging environment drift, auditing irreproducible results, and sharing receipts alongside shared seamless.db files. Records are write-once per tf_checksum; turning the flag on does not retroactively upgrade existing rows.
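The write-once behaviour can be illustrated with plain SQLite. The table layout below is invented for the example and is not the actual seamless.db schema; `open_records`, `write_record`, and `read_record` are hypothetical helpers.

```python
import json
import sqlite3


def open_records(path=":memory:"):
    """Open a database with a hypothetical one-row-per-tf_checksum table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records ("
        "tf_checksum TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )
    return conn


def write_record(conn, tf_checksum, body):
    """Insert a record unless one already exists; return True if it was written."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO records (tf_checksum, body) VALUES (?, ?)",
        (tf_checksum, json.dumps(body)),
    )
    conn.commit()
    return cur.rowcount == 1


def read_record(conn, tf_checksum):
    row = conn.execute(
        "SELECT body FROM records WHERE tf_checksum = ?", (tf_checksum,)
    ).fetchone()
    return None if row is None else json.loads(row[0])


conn = open_records()
first = write_record(conn, "abc123", {"execution_mode": "process"})
# A later run with 'record: true' does not upgrade the existing row.
second = write_record(conn, "abc123", {"execution_mode": "process", "env": "full"})
print(first, second, read_record(conn, "abc123"))
```

Using the checksum as the primary key with INSERT OR IGNORE is one simple way to get first-writer-wins semantics; the real implementation may enforce it differently.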

The Python equivalent is seamless.config.select_record(True). Capture is worker-side, so the recorded environment reflects where the job actually ran (jobserver worker, Dask worker, spawn child, or local).

See docs/agent/contracts/execution-records.md for the full behavioral contract.

Tool launch configuration

seamless-config ships an internal tools.yaml that describes how to construct the launch parameters for each server type (hashserver, database, jobserver, daskserver, pure_daskserver). These are consumed by remote-http-launcher, which starts the servers on the cluster frontend (via SSH tunnel if needed) and returns a live port.

The configure_hashserver, configure_database, configure_jobserver, configure_daskserver, and configure_pure_daskserver functions in seamless_config.tools assemble the final launch dict from the cluster definition and the current project/stage context. These functions are called internally by seamless-config and by launcher scripts in other packages; direct use is only needed when writing custom launch tooling.

Each launch dict carries a meta block that the launcher writes verbatim into the server-side state JSON: (service, cluster, mode, project, subproject, stage, substage, queue). seamless-service-* reads this block to populate the cluster/project/stage columns in seamless-service-ps output.

Relation to the Seamless ecosystem

seamless-core          ← the computation engine (cells, transformers, caching)
    ↑
seamless-config        ← this package: reads YAML, wires backends, provides CLI
    ↑                 ↗ seamless-remote    (buffer + database + job clients)
    └── activates   ─── seamless-dask      (Dask scheduler integration)
                     ↘ seamless-jobserver  (HTTP jobserver)
                      ↘ seamless-database  (SQLite result store)
                       ↘ hashserver        (content-addressed buffer server)
                        ↘ remote-http-launcher (SSH + process launcher)

seamless-config is the only package in this stack that a Seamless workflow script typically imports directly (besides seamless itself). All other packages are implementation details activated by init() / set_stage().