Working with compressed data

Seamless natively supports .zst (Zstandard) and .gz (gzip) compressed files. Compression never affects identity or caching: a compressed file and its uncompressed form have the same checksum and the same cache key, and are interchangeable.

Uploading compressed datasets

# Upload a directory of gzip-compressed PDB files with zero-copy hardlinking
seamless-upload --destination /buffers --hardlink /data/pdb/

For each compressed file, seamless-upload computes the canonical checksum by decompressing in memory (no decompressed copy is written to disk), then hard-links the original compressed file into the buffer directory. This is the recommended approach for large compressed datasets: it adds zero storage overhead and incurs only a one-time decompression cost for checksumming.
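The two steps described above can be sketched in plain shell. This is a rough illustration, not how seamless-upload is implemented. The helper names canonical_checksum and hardlink_into, the use of SHA-256 via sha256sum, and the flat "<checksum><suffix>" naming in the target directory are all assumptions made for the example.

```shell
# Hypothetical helper: hash the decompressed content of a file by streaming
# it through a pipe, so no decompressed copy is written to disk.
# ASSUMPTION: SHA-256 stands in for whatever checksum Seamless computes.
canonical_checksum() {
  local f="$1"
  case "$f" in
    *.zst) zstd -dc -- "$f" ;;
    *.gz)  gzip -dc -- "$f" ;;
    *)     cat -- "$f" ;;
  esac | sha256sum | cut -d' ' -f1
}

# Hypothetical helper: hard-link the original (still compressed) file into
# a target directory under its canonical checksum, keeping the suffix.
# ASSUMPTION: the flat "<checksum><suffix>" layout is illustrative only.
hardlink_into() {
  local f="$1" dest="$2" checksum suffix=""
  checksum=$(canonical_checksum "$f")
  case "$f" in
    *.zst) suffix=".zst" ;;
    *.gz)  suffix=".gz" ;;
  esac
  # A hard link shares the inode, so no extra data blocks are used.
  ln -- "$f" "$dest/$checksum$suffix"
}
```

Because the hash is taken over the decompressed stream, a file and its gzip-compressed copy yield the same value from canonical_checksum. Hard links only work when source and destination are on the same filesystem.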

Sidecar convention

The .CHECKSUM sidecar always uses the canonical name, with compression suffixes stripped:
- file.npy.zst → file.npy.CHECKSUM (not file.npy.zst.CHECKSUM)
- file.npy.gz → file.npy.CHECKSUM
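The naming rule above amounts to a small string transformation. A minimal sketch follows. The function name sidecar_name is hypothetical and is not part of Seamless.

```shell
# Hypothetical helper: map a data file name to its .CHECKSUM sidecar name,
# stripping a single trailing compression suffix (.zst or .gz) first.
sidecar_name() {
  local f="$1"
  case "$f" in
    *.zst) f="${f%.zst}" ;;
    *.gz)  f="${f%.gz}" ;;
  esac
  printf '%s.CHECKSUM\n' "$f"
}

sidecar_name file.npy.zst   # prints file.npy.CHECKSUM
sidecar_name file.npy.gz    # prints file.npy.CHECKSUM
sidecar_name file.npy       # prints file.npy.CHECKSUM
```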

Downloading in compressed form

seamless-download --destination /buffers --compression zst mydir/

--compression is all-or-nothing: every output file is written in compressed form and gets the corresponding suffix (.zst in the example above).

Compressing an existing buffer directory

The design is transparent enough that you can compress an existing hashserver buffer directory after the fact:

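# Requires bash (for [[ ]]) and GNU stat; on BSD/macOS use `stat -f%z` instead.
for f in /path/to/buffers/*/*; do
  # Skip anything already compressed, plus sidecar and lock files.
  if [[ -f "$f" && "$f" != *.zst && "$f" != *.gz && "$f" != *.BUFFERLENGTH && "$f" != *.LOCK ]]; then
    # Record the uncompressed size first, while the original still exists.
    stat -c%s "$f" > "${f}.BUFFERLENGTH"
    # Compress to "$f.zst" and delete the original on success.
    zstd --rm "$f"
  fi
done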

The hashserver, seamless-upload, seamless-download, and worker materialization all check for .zst and .gz variants on every file lookup. The .BUFFERLENGTH sidecar ensures /buffer-length returns the uncompressed size without decompressing. Pre-generating sidecars before compressing is important: without them, the /buffer-length endpoint would need to decompress each buffer to determine its uncompressed size (correct but expensive).
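The lookup and sidecar behavior described above can be pictured with the following sketch. It is not the hashserver's code. The function names resolve_buffer and buffer_length, and the order in which variants are tried, are assumptions made for illustration.

```shell
# Hypothetical helper: print whichever variant of a buffer exists on disk.
# ASSUMPTION: the order (plain, then .zst, then .gz) is illustrative only.
resolve_buffer() {
  local base="$1" candidate
  for candidate in "$base" "$base.zst" "$base.gz"; do
    if [ -f "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Hypothetical helper: print the uncompressed length of a buffer.
# Cheap path: read the .BUFFERLENGTH sidecar if it exists.
# Expensive path: decompress the whole buffer and count the bytes.
buffer_length() {
  local base="$1" found
  if [ -f "$base.BUFFERLENGTH" ]; then
    cat -- "$base.BUFFERLENGTH"
    return 0
  fi
  found=$(resolve_buffer "$base") || return 1
  case "$found" in
    *.zst) zstd -dc -- "$found" ;;
    *.gz)  gzip -dc -- "$found" ;;
    *)     cat -- "$found" ;;
  esac | wc -c | tr -d ' '
}
```

Both paths report the same number. The difference is that the fallback has to read and decompress the entire buffer, which is why writing the sidecars before compressing pays off for large directories.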

Python face

A Python-face compression mechanism (transformer.compression.my_pin) is planned but not yet implemented. Currently, compression is used through the CLI face (filename suffixes).