seamless-config
seamless-config is the configuration and infrastructure-selection layer for
Seamless projects.
Seamless models work as pipelines of cacheable steps. seamless-config answers
the question "where does this pipeline run, and against which storage?" —
locally in-process, in local worker processes, or on a remote cluster — without
touching the step code. It reads plain YAML files, resolves cluster topology,
and wires the right remote backends (buffer store, database, jobserver, dask
scheduler) into the Seamless runtime before the first transformation runs.
Installation
pip install seamless-config
Quick start
import seamless.config
seamless.config.init() # reads seamless.yaml / seamless.profile.yaml
# from the caller's directory upward
# … build and run your workflow …
From the command line, seamless-init performs the same initialisation and
verifies that the configured remote services are reachable:
seamless-init # default stage
seamless-init --stage prod # named stage
seamless-init --stage prod:gpu # named stage + substage
Configuration files
seamless-config discovers two optional YAML files in the work directory (and
optionally its parents):
| File | Commit to VCS? | Purpose |
|---|---|---|
| seamless.yaml | Yes | Project-wide, deterministic defaults (project name, stage name, inherit_from_parent) |
| seamless.profile.yaml | No (add to .gitignore) | Developer-specific overrides: cluster hostnames, experimental settings, local credentials |
Both files use the same command language. When a file contains
inherit_from_parent, the loader also reads the parent directory and prepends
its commands, repeating until a directory without that flag is reached (or the
filesystem root). Parent defaults always run before child overrides.
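The inheritance rule above can be sketched in plain Python. This is a minimal illustration, not the loader's actual implementation: `collect_commands` and the in-memory `files` mapping are hypothetical stand-ins for reading and parsing the YAML files, and dropping the `inherit_from_parent` marker from the merged list is an assumption made for readability.

```python
from pathlib import PurePosixPath


def collect_commands(workdir, files):
    """Return the command list for `workdir`, with parent commands prepended.

    `files` maps a directory path to its already-parsed command list
    (a hypothetical stand-in for reading seamless.yaml from disk).
    """
    directory = PurePosixPath(workdir)
    chain = []
    while True:
        commands = files.get(str(directory), [])
        chain.append(commands)
        # Stop at a directory without the flag, or at the filesystem root.
        if "inherit_from_parent" not in commands or directory == directory.parent:
            break
        directory = directory.parent
    merged = []
    # Outermost ancestor first, so parent defaults run before child overrides.
    for commands in reversed(chain):
        merged.extend(c for c in commands if c != "inherit_from_parent")
    return merged


files = {
    "/work": [{"project": "base"}, {"execution": "process"}],
    "/work/sub": ["inherit_from_parent", {"execution": "spawn"}],
}
print(collect_commands("/work/sub", files))
```

Because the child's `execution: spawn` comes last in the merged list, it overrides the parent's `execution: process`.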
Command language
Each file must be a YAML list. Every item is either a bare string command or a single-key mapping:
# seamless.yaml
- project: my-project
- inherit_from_parent
| Command | Argument | Effect |
|---|---|---|
| project | string | Sets the project name (used as the storage path component) |
| subproject | string | Sets an optional sub-path inside the project |
| cluster | string | Selects the active cluster by name |
| execution | process / spawn / remote | Sets the execution mode |
| queue | string | Selects a named queue on the current cluster |
| remote | null / daskserver / jobserver | Pins the remote backend when a cluster exposes both |
| persistent | boolean | Forces persistent storage on or off; defaults to true when a cluster is set |
| clusters | mapping | Defines cluster objects inline (runs before other commands) |
| record | boolean | Enables full execution-record capture (default: minimal records only) |
| node | string / null | Selects a named node within the current cluster (advanced: cluster-internal scheduling) |
| inherit_from_parent | (none) | Also reads commands from the parent directory, prepended |
| stage <name> | list of commands | Runs the nested commands only when the current stage matches <name> |
If no execution command is encountered, the loader defaults to remote when
a cluster is selected and process otherwise.
See COMMAND_LANGUAGE.md for the full specification.
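As a rough illustration of the two item shapes and the execution default described above, the sketch below works on an already-parsed YAML list. The function names `normalize` and `resolve_execution` are hypothetical and are not part of the seamless-config API; COMMAND_LANGUAGE.md remains the authoritative specification.

```python
def normalize(item):
    """Turn one YAML list item into a (command, argument) pair."""
    if isinstance(item, str):
        return item, None  # bare string command
    if isinstance(item, dict) and len(item) == 1:
        (name, argument), = item.items()
        return name, argument  # single-key mapping
    raise ValueError(f"not a valid command: {item!r}")


def resolve_execution(items):
    """Apply the documented default when no execution command is given."""
    state = {"cluster": None, "execution": None}
    for name, argument in map(normalize, items):
        if name in state:
            state[name] = argument  # later commands override earlier ones
    if state["execution"] is None:
        # Documented default: remote with a cluster, process otherwise.
        state["execution"] = "remote" if state["cluster"] else "process"
    return state["execution"]


print(resolve_execution([{"project": "my-project"}, "inherit_from_parent"]))
print(resolve_execution([{"cluster": "mycluster"}]))
```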
Stage blocks
Use a stage <name>: key to activate commands only in a specific stage:
# seamless.profile.yaml
- clusters:
    local:
      tunnel: false
      type: local
      frontends:
        - hostname: localhost
          hashserver:
            bufferdir: /data/buffers
            conda: hashserver
            network_interface: 127.0.0.1
            port_start: 55100
            port_end: 55199
          database:
            database_dir: /data/db
            conda: seamless-database
            network_interface: 127.0.0.1
            port_start: 55200
            port_end: 55299
          jobserver:
            conda: seamless-jobserver
            network_interface: 127.0.0.1
            port_start: 55300
            port_end: 55399
- cluster: local
- stage prod:
    - cluster: hpc-cluster
    - execution: remote
    - remote: daskserver
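The effect of a stage block can be pictured as a filter over the parsed command list. The following is a minimal sketch under that assumption; `active_commands` is a hypothetical helper, not a function exported by seamless-config, and the real loader may handle nesting or matching differently.

```python
def active_commands(items, stage=None):
    """Flatten a command list, keeping `stage <name>` blocks only for `stage`."""
    result = []
    for item in items:
        if isinstance(item, dict) and len(item) == 1:
            (key, value), = item.items()
            if isinstance(key, str) and key.startswith("stage "):
                # Nested commands run only when the block name matches.
                if key.split(None, 1)[1] == stage:
                    result.extend(active_commands(value, stage))
                continue
        result.append(item)
    return result


profile = [
    {"cluster": "local"},
    {"stage prod": [
        {"cluster": "hpc-cluster"},
        {"execution": "remote"},
        {"remote": "daskserver"},
    ]},
]
print(active_commands(profile))          # default stage: only 'cluster: local'
print(active_commands(profile, "prod"))  # prod stage: block contents appended
```

In the `prod` stage the later `cluster: hpc-cluster` command follows `cluster: local`, so it takes effect as the override.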
Cluster definitions
Clusters are defined in ~/.seamless/clusters.yaml and/or individual files
under ~/.seamless/clusters/*.yaml. They can also be inlined in
seamless.profile.yaml via the clusters command (useful for portable
projects).
A cluster definition describes the topology of its frontend nodes and the services each one can host:
# ~/.seamless/clusters.yaml
mycluster:
  tunnel: true                     # connect via SSH tunnel
  type: slurm                      # local | slurm | oar
  workers: 4                       # for 'spawn' / 'jobserver' mode
  frontends:
    - hostname: login.mycluster.example
      ssh_hostname: login.mycluster.example   # optional SSH override
      hashserver:
        bufferdir: /scratch/seamless/buffers
        conda: hashserver
        network_interface: 0.0.0.0
        port_start: 60100
        port_end: 60199
      database:
        database_dir: /scratch/seamless/db
        conda: seamless-database
        network_interface: 0.0.0.0
        port_start: 60200
        port_end: 60299
      daskserver:
        network_interface: 0.0.0.0
        port_start: 60300
        port_end: 60399
  default_queue: default
  queues:
    default:
      conda: seamless-dask
      walltime: "01:00:00"
      cores: 16
      memory: 32000MB
      tmpdir: /tmp
      maximum_jobs: 20
      unknown_task_duration: 1m
      target_duration: 10m
      lifetime_stagger: 4m
    highmem:
      TEMPLATE: default            # inherit all fields from 'default', then override
      cores: 4
      memory: 128000MB
Frontend services
Each frontend entry can expose any subset of:
| Service | Role |
|---|---|
| hashserver | Stores and serves content-addressed buffers (the raw bytes of cell values) |
| database | Stores transformation results (maps input checksum → result checksum, backed by SQLite) |
| jobserver | HTTP job dispatch: accepts serialised transformations and returns results |
| daskserver | Dask-backed HPC scheduler: submits jobs to SLURM/OAR via dask-jobqueue |
When both jobserver and daskserver are present on the same cluster,
remote: jobserver or remote: daskserver must be specified explicitly in
seamless.profile.yaml.
Queue templates
A queue entry with a TEMPLATE key inherits all fields from the named queue
(which must be defined earlier in the same cluster), then overrides only the
explicitly provided fields:
queues:
  base:
    conda: seamless-dask
    walltime: "02:00:00"
    cores: 8
    memory: 16000MB
    maximum_jobs: 10
    tmpdir: /tmp
    unknown_task_duration: 1m
    target_duration: 10m
  gpu:
    TEMPLATE: base
    memory: 32000MB
    job_extra_directives: ["--gres=gpu:1"]
    dask_resources: {GPU: 1}
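Template expansion amounts to a dictionary merge in definition order. The sketch below shows one way to express it on an already-parsed queue mapping; `resolve_queues` is a hypothetical name and the error handling is an assumption, not the package's documented behaviour.

```python
def resolve_queues(queues):
    """Expand TEMPLATE references in a queue mapping, in definition order."""
    resolved = {}
    for name, fields in queues.items():
        fields = dict(fields)  # copy so the input mapping is left untouched
        template = fields.pop("TEMPLATE", None)
        if template is None:
            resolved[name] = fields
            continue
        if template not in resolved:
            # The template must be defined earlier in the same cluster.
            raise KeyError(f"queue {name!r}: template {template!r} is not defined earlier")
        # Inherit every field, then override the explicitly provided ones.
        resolved[name] = {**resolved[template], **fields}
    return resolved


queues = {
    "base": {"conda": "seamless-dask", "cores": 8, "memory": "16000MB"},
    "gpu": {"TEMPLATE": "base", "memory": "32000MB", "dask_resources": {"GPU": 1}},
}
print(resolve_queues(queues)["gpu"])
```

Here the `gpu` queue keeps `cores: 8` and `conda: seamless-dask` from `base` and replaces only `memory`, adding `dask_resources`.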
Execution modes
| Mode | execution: value | Description |
|---|---|---|
| In-process | process | Transformations run in the client Python process. Default when no cluster is defined. |
| Local workers | spawn | Transformations are dispatched to a local worker pool (uses workers from the cluster definition). |
| Remote (jobserver) | remote + remote: jobserver | Lightweight HTTP jobserver with a fixed worker pool. Typically used as a test setup or for simple local-cluster scenarios. |
| Remote (daskserver) | remote + remote: daskserver | Delegates to a Dask cluster with persistent storage. Any Dask Cluster subclass is supported; current cluster definitions cover SLURM and OAR via dask-jobqueue. |
| Pure Dask | remote + remote: daskserver + persistent: false | Dask execution without Seamless persistence (no hashserver/database). For batch jobs that don't need incremental caching. |
The three remote backends (jobserver, daskserver, pure Dask) are mutually
exclusive — a cluster frontend must expose exactly one, or you must pin the
choice with remote: jobserver or remote: daskserver in
seamless.profile.yaml.
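The "exactly one, or pin it" rule can be written down as a small decision function. This is a sketch only: `select_remote_backend` is a hypothetical name, and treating a pin on a service the frontend does not expose as an error is an assumption rather than documented behaviour.

```python
def select_remote_backend(frontend_services, remote=None):
    """Pick 'jobserver' or 'daskserver' for execution: remote."""
    exposed = [s for s in ("jobserver", "daskserver") if s in frontend_services]
    if remote is not None:
        # Assumption: pinning a backend the cluster does not expose is an error.
        if remote not in exposed:
            raise ValueError(f"cluster does not expose {remote!r}")
        return remote
    if len(exposed) == 1:
        return exposed[0]  # unambiguous, no pin needed
    if not exposed:
        raise ValueError("cluster exposes no job backend")
    raise ValueError("both backends exposed; pin one with 'remote: ...'")


print(select_remote_backend({"hashserver", "database", "daskserver"}))
print(select_remote_backend({"jobserver", "daskserver"}, remote="jobserver"))
```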
remote_http_launcher handles two scheduler-placement topologies automatically:
when hostname is set it SSHes into the configured frontend and runs the Dask
wrapper as a daemon there; when hostname is absent it runs the wrapper as a
local daemon on the client. This is independent of cluster type: a cluster with
type: local still uses distributed.LocalCluster, but that LocalCluster may
live either on the client or on a remote frontend reached over SSH. The
configuration schema currently covers local, slurm, and oar; cloud provider
support would require extending the cluster definition vocabulary, not the
launcher or wrapper.
Persistence
When a cluster is selected and persistent: true (the default), seamless-config
activates:
- seamless_remote.buffer_remote: writes and reads buffers via the cluster's hashserver
- seamless_remote.database_remote: checks and records transformation results via the cluster's database
In execution: remote mode it also activates the chosen job delegation backend
(jobserver_remote or daskserver_remote).
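Combining this section with the defaults from the command table, the set of activated clients can be sketched as below. `active_backends` is a hypothetical helper that returns plain names for illustration; it does not import or call seamless_remote, and it assumes the remote backend has already been resolved when execution is remote.

```python
def active_backends(cluster, persistent=None, execution=None, remote=None):
    """List the remote client modules that would be activated."""
    if persistent is None:
        persistent = cluster is not None  # defaults to true when a cluster is set
    if execution is None:
        execution = "remote" if cluster else "process"
    backends = []
    if cluster and persistent:
        backends += ["buffer_remote", "database_remote"]
    if execution == "remote":
        if remote not in ("jobserver", "daskserver"):
            raise ValueError("remote backend must be resolved first")
        backends.append(f"{remote}_remote")
    return backends


print(active_backends("mycluster", remote="daskserver"))
print(active_backends("mycluster", persistent=False, remote="daskserver"))  # pure Dask
print(active_backends(None))  # in-process, nothing activated
```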
Stages and substages
A stage is an independent execution and storage context. Each stage gets its own subdirectory under the project path:
<bufferdir>/<project>[/<subproject>][/STAGE-<stage>]
<database_dir>/<project>[/<subproject>][/STAGE-<stage>]
Stages are useful when the same project has multiple phases — e.g. build,
test, prod — that must not share cached results.
A substage further subdivides the job-dispatch scope (one jobserver/daskserver per substage) without splitting storage. Substages are useful when different parts of the same stage need different hardware, for example CPU versus GPU queues.
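The storage layout shown above can be reproduced with a few lines of path arithmetic. `stage_path` is a hypothetical helper written for illustration; note that it takes no substage argument, since substages do not split storage.

```python
from pathlib import PurePosixPath


def stage_path(base, project, subproject=None, stage=None):
    """Build <base>/<project>[/<subproject>][/STAGE-<stage>]."""
    path = PurePosixPath(base) / project
    if subproject:
        path /= subproject
    if stage:
        path /= f"STAGE-{stage}"
    return str(path)


print(stage_path("/data/buffers", "my-project"))
print(stage_path("/data/buffers", "my-project", stage="prod"))
print(stage_path("/data/db", "my-project", subproject="sub", stage="test"))
```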
Python API
import seamless.config
# Simple initialization (no named stage)
seamless.config.init()
# Named stage (re-evaluates config with 'stage prod:' blocks active)
seamless.config.set_stage("prod")
# Named stage + substage
seamless.config.set_stage("prod", "gpu")
# Change substage without changing stage
seamless.config.set_substage("cpu")
# Override the directory used for config file lookup
seamless.config.set_workdir("/path/to/project")
init() is a no-op if already initialised. set_stage() deactivates any
previously active remote clients before re-configuring.
Forwarding remote clients to worker processes
When a job runs inside the cluster, it may need to connect back to the same
remote services. collect_remote_clients / set_remote_clients serialise and
restore the active client configuration.
Note: this API is already used by the daskserver. There is no need to do this from user code.
import seamless_config

# On the client, before submitting a job:
clients = seamless_config.collect_remote_clients("mycluster")
# Pass 'clients' to the worker via job parameters or environment variable

# Inside the worker (or set SEAMLESS_REMOTE_CLIENTS=<json> in the environment):
seamless_config.set_remote_clients(clients, in_remote=True)
set_remote_clients must be called before init(). Worker bootstrap code can
also call set_remote_clients_from_env() to pick up the JSON from the
SEAMLESS_REMOTE_CLIENTS environment variable automatically.
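The environment-variable hand-off is a plain JSON round trip. The sketch below illustrates the idea with stand-in functions and a made-up payload; `export_clients`, `clients_from_env`, and the dictionary contents are hypothetical, and the real structure is whatever collect_remote_clients returns.

```python
import json

ENV_VAR = "SEAMLESS_REMOTE_CLIENTS"


def export_clients(clients, environ):
    """Store the client configuration as JSON under SEAMLESS_REMOTE_CLIENTS."""
    environ[ENV_VAR] = json.dumps(clients)


def clients_from_env(environ):
    """Return the decoded configuration, or None when the variable is absent."""
    raw = environ.get(ENV_VAR)
    return None if raw is None else json.loads(raw)


# Hypothetical payload; the real structure is defined by seamless-config.
clients = {"buffer": {"url": "http://localhost:55100"}}
worker_env = {}  # stands in for the worker's os.environ
export_clients(clients, worker_env)
print(clients_from_env(worker_env) == clients)
```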
seamless-init CLI
seamless-init is a convenience script that calls seamless_config.init() (or
set_stage), then calls ensure_initialized() on each active remote backend
to verify that the servers are reachable before the workflow starts.
usage: seamless-init [--stage STAGE[:SUBSTAGE]]
It exits immediately (success) when the SEAMLESS_REMOTE_CLIENTS environment
variable is present, so that worker processes that bootstrap themselves with
set_remote_clients are not affected.
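The argument format and the early exit can be sketched as follows. Both functions are hypothetical illustrations of the behaviour described here, not the script's actual source.

```python
def parse_stage(spec):
    """Split 'STAGE[:SUBSTAGE]' into (stage, substage)."""
    if spec is None:
        return None, None  # default stage
    stage, sep, substage = spec.partition(":")
    return stage, (substage if sep else None)


def should_skip_init(environ):
    """Workers that carry SEAMLESS_REMOTE_CLIENTS skip initialisation."""
    return "SEAMLESS_REMOTE_CLIENTS" in environ


print(parse_stage("prod"))
print(parse_stage("prod:gpu"))
print(should_skip_init({"SEAMLESS_REMOTE_CLIENTS": "{}"}))
```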
Service management (seamless-service-*)
seamless-config ships a Seamless-aware wrapper layer over the raw rhl-*
helpers from remote-http-launcher. The wrappers accept Seamless-level
arguments (--service, --cluster, --project, --stage) and dispatch
to the right host over SSH, so callers do not need to know the raw launcher key.
| Command | Purpose |
|---|---|
| seamless-service-ps [--server] [--persistent] | Process state (and optionally persistent buffer/DB state); the meta block populates the per-row service, project, and stage columns |
| seamless-service-logs --service hashserver [--tail N] | Read the service log without knowing the raw key |
| seamless-service-inspect --service hashserver | Print the server state JSON (PID, port, status, workdir, command, meta) |
| seamless-service-stop | Stop processes via SIGINT → SIGTERM → SIGKILL escalation; preserves JSON state |
| seamless-service-rm | Remove JSON state; logs are preserved |
| seamless-service-clear --service hashserver --project P [--stage S] | Clear hashserver buffers or seamless.db for a project/stage |
| seamless-service-resolve --service hashserver --cluster C --project P [--stage S] | Translate Seamless-level args → raw key, ssh_hostname, workdir, log_path (no side effects, JSON output) |
seamless-service-resolve is an extractor, not a synthesizer: it shares
the same code path as seamless-run and the launched clients. Tools and
agents should call it on every invocation rather than caching its outputs;
keys, workdir paths, and host-selection logic may change between Seamless
versions.
Cluster-wide variants (--cluster MYCLUSTER) operate on every service of
that cluster.
Server-side requirements
remote-http-launcher must be installed on every remote server that
seamless-service-* targets — it provides all rhl-* helpers. There is
no inline fallback in the wrappers. Two supported install paths:
- System install (with root): pip install remote-http-launcher into the system Python; helpers land in /usr/local/bin.
- Conda base env install (no root): pip install remote-http-launcher into the remote host's conda base environment; helpers land in $HOME/miniforge3/bin or $HOME/miniconda3/bin.
No .bashrc edit is required for either path. seamless-service-*
automatically prepends $HOME/miniforge3/bin:$HOME/miniconda3/bin to
PATH on every SSH call, so conda-base installs work without any shell
startup changes.
Note: remote-http-launcher itself has its own fallback — it can probe
conda configuration via inline heredocs when no rhl-* helpers are
available on the host. That fallback covers only the launcher's bootstrap;
it does not extend to seamless-service-* or any other tooling that
calls rhl-* directly.
Execution records (record command)
When a transformation completes successfully, Seamless writes one execution
record into seamless.db keyed by tf_checksum. The record command in
seamless.profile.yaml controls how much is captured:
- record: true # full record: env fingerprints, compilation context, freshness, ...
- record: false # minimal record (default): timing + memory + execution mode
The default (record: false) writes a small body containing tf_checksum,
result_checksum, seamless_version, execution_mode, remote_target,
wall/CPU/GPU timing, and memory peak. The record: true opt-in enables
full reproducibility-audit capture, intended for debugging environment
drift, auditing irreproducible results, and sharing receipts alongside
shared seamless.db files. Records are write-once per tf_checksum;
turning the flag on does not retroactively upgrade existing rows.
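The write-once rule can be illustrated with an in-memory SQLite table. This is a sketch under stated assumptions: the `records` table, its two columns, and the `write_record` helper are hypothetical and do not reflect the real seamless.db schema.

```python
import json
import sqlite3


def write_record(conn, tf_checksum, record):
    """Insert a record unless one already exists for this tf_checksum."""
    existing = conn.execute(
        "SELECT 1 FROM records WHERE tf_checksum = ?", (tf_checksum,)
    ).fetchone()
    if existing is not None:
        return False  # write-once: the stored row is left untouched
    conn.execute(
        "INSERT INTO records (tf_checksum, body) VALUES (?, ?)",
        (tf_checksum, json.dumps(record)),
    )
    conn.commit()
    return True


conn = sqlite3.connect(":memory:")
# Hypothetical schema, for illustration only.
conn.execute("CREATE TABLE records (tf_checksum TEXT PRIMARY KEY, body TEXT)")

print(write_record(conn, "abc123", {"execution_mode": "process"}))
# A later, fuller record for the same checksum does not replace the first one.
print(write_record(conn, "abc123", {"execution_mode": "process", "full": True}))
```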
The Python equivalent is seamless.config.select_record(True). Capture
is worker-side, so the recorded environment reflects where the job
actually ran (jobserver worker, Dask worker, spawn child, or local).
See docs/agent/contracts/execution-records.md for the full behavioral contract.
Tool launch configuration
seamless-config ships an internal tools.yaml that describes how to
construct the launch parameters for each server type (hashserver, database,
jobserver, daskserver, pure_daskserver). These are consumed by
remote-http-launcher, which starts the servers on the cluster frontend (via
SSH tunnel if needed) and returns a live port.
The configure_hashserver, configure_database, configure_jobserver,
configure_daskserver, and configure_pure_daskserver functions in
seamless_config.tools assemble the final launch dict from the cluster
definition and the current project/stage context. These functions are called
internally by seamless-config and by launcher scripts in other packages;
direct use is only needed when writing custom launch tooling.
Each launch dict carries a meta block that the launcher writes verbatim
into the server-side state JSON: (service, cluster, mode, project,
subproject, stage, substage, queue). seamless-service-* reads this
block to populate the cluster/project/stage columns in seamless-service-ps
output.
Relation to the Seamless ecosystem
seamless-core      ← the computation engine (cells, transformers, caching)
      ↑
seamless-config    ← this package: reads YAML, wires backends, provides CLI
      ↑             ↗ seamless-remote        (buffer + database + job clients)
      └── activates ─── seamless-dask        (Dask scheduler integration)
                    ↘ seamless-jobserver     (HTTP jobserver)
                    ↘ seamless-database      (SQLite result store)
                    ↘ hashserver             (content-addressed buffer server)
                    ↘ remote-http-launcher   (SSH + process launcher)
seamless-config is the only package in this stack that a Seamless workflow
script typically imports directly (besides seamless itself). All other
packages are implementation details activated by init() / set_stage().