saturncloud/tokenfactory

Token Factory in-pod shim

The pod-side runner for Saturn Cloud's Token Factory product — a no-code LoRA fine-tuning service.

This package is the tiny shim that runs inside a Token Factory fine-tune job pod. All training logic lives in axolotl; this shim's job is to:

  1. Read a rendered axolotl YAML from ~/axolotl-config.yaml (overridable via TF_AXOLOTL_CONFIG_PATH).
  2. Run axolotl train <config>, tee'ing its stdout/stderr to both k8s logs and <output_dir>/training.log.
  3. On success, parse <output_dir>/trainer_state.json, verify an adapter was written, and register the result as a kind=checkpoint Artifact via the Atlas API.
  4. On failure, best-effort register an error Artifact.
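
The tee step in item 2 can be sketched as follows. This is a minimal illustration, not the shim's actual code: the function name run_and_tee is hypothetical, and it assumes stderr is merged into stdout so that both streams land in the same log in order.

```python
import subprocess
import sys
from pathlib import Path
from typing import Sequence


def run_and_tee(cmd: Sequence[str], log_path: Path) -> int:
    """Run cmd, copying its merged stdout/stderr to our stdout and to log_path.

    Hypothetical helper for illustration; returns the child's exit code.
    """
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as log:
        proc = subprocess.Popen(
            list(cmd),
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # assumption: merge stderr into stdout
            text=True,
            errors="replace",  # never crash on undecodable bytes
            bufsize=1,  # line-buffered, so lines stream as they arrive
        )
        assert proc.stdout is not None
        for line in proc.stdout:
            sys.stdout.write(line)  # the pod's stdout is what k8s logs capture
            sys.stdout.flush()
            log.write(line)
            log.flush()
        return proc.wait()
```

A call such as run_and_tee(["axolotl", "train", config_path], output_dir / "training.log") would then return axolotl's exit code, where config_path and output_dir stand in for values the shim already knows.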

Atlas owns the YAML rendering. Re-runs are submitted as fresh AI Studio resources; there is no in-place retry. Resume-from-checkpoint is a future Atlas-side feature, in which the renderer would set lora_model_dir to point at a previous run's output.

Environment

Var                     Required      Description
SATURN_RESOURCE_ID      for callback  The pod's deployment id. Used as producer.id and as the Atlas idempotency key.
SATURN_TOKEN            for callback  Bearer JWT for Atlas.
SATURN_BASE_URL         for callback  Atlas base URL.
TF_OUTPUT_SF_ID         for callback  SharedFolder id backing the pod's output mount.
TF_OUTPUT_SF_RELPATH    for callback  Relative path within that SharedFolder where this job's output lives.
TF_AXOLOTL_CONFIG_PATH  no            Path to the rendered axolotl YAML. Defaults to ~/axolotl-config.yaml.
TF_IMAGE_TAG            no            Image tag string echoed into the artifact metadata.

If the SATURN_* env vars aren't set the shim still runs axolotl, but skips the Atlas callback (useful for local dev).
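
A rough sketch of how the environment could be read is below. The names ShimEnv and load_env are hypothetical, and the all-or-nothing rule for the callback variables is an assumption: this README does not say how the shim treats a partially set group.

```python
import os
from dataclasses import dataclass
from typing import Dict, Mapping, Optional

# Variables the table above marks as "for callback".
CALLBACK_VARS = (
    "SATURN_RESOURCE_ID",
    "SATURN_TOKEN",
    "SATURN_BASE_URL",
    "TF_OUTPUT_SF_ID",
    "TF_OUTPUT_SF_RELPATH",
)


@dataclass(frozen=True)
class ShimEnv:
    """Hypothetical container for the shim's environment settings."""

    config_path: str
    image_tag: Optional[str]
    callback: Optional[Dict[str, str]]  # None means: skip the Atlas callback


def load_env(environ: Mapping[str, str] = os.environ) -> ShimEnv:
    config_path = environ.get("TF_AXOLOTL_CONFIG_PATH") or os.path.expanduser(
        "~/axolotl-config.yaml"
    )
    values = {name: environ.get(name, "") for name in CALLBACK_VARS}
    # Assumption: the callback is enabled only when every variable is set.
    callback = values if all(values.values()) else None
    return ShimEnv(
        config_path=config_path,
        image_tag=environ.get("TF_IMAGE_TAG"),
        callback=callback,
    )
```

With an empty environment, load_env returns callback=None, which corresponds to the local-dev mode described above.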

Atlas contract

  • Single artifact registration per resource. Idempotency key is SATURN_RESOURCE_ID; pod restarts for the same resource dedupe server-side.
  • Two calls: create the artifact (the server stamps status=pending), then PATCH it to ready or error. Create is retried (5 attempts, jittered backoff); PATCH is best-effort (2 attempts).
  • Hard failures (pod killed, OOM, eviction) produce no callback — Atlas's job reconciler converges the artifact row from k8s pod state.

The wire format and retry policy live in saturn_tokenfactory/atlas_client.py.
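
For orientation only, the retry shape described in the bullets above might look like the sketch below. It is not the code in atlas_client.py: the function names are hypothetical, the create and patch callables stand in for the real HTTP calls, and exponential backoff with full jitter is an assumption, since the contract only says "jittered backoff".

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    attempts: int,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn up to `attempts` times, sleeping a jittered delay between tries."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt + 1 == attempts:
                raise  # out of attempts: surface the last error
            # Assumption: exponential cap with full jitter.
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0.0, cap))
    raise AssertionError("unreachable")


def register_artifact(
    create: Callable[[], str],
    patch: Callable[[str, str], None],
    final_status: str,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Two-call registration: create (5 attempts), then best-effort PATCH (2).

    Returns True if the PATCH went through, False if it was given up on.
    A create failure propagates to the caller.
    """
    artifact_id = call_with_retry(create, attempts=5, sleep=sleep)
    try:
        call_with_retry(
            lambda: patch(artifact_id, final_status), attempts=2, sleep=sleep
        )
        return True
    except Exception:
        return False  # best-effort: leave convergence to the reconciler
```

Because the server dedupes on SATURN_RESOURCE_ID, retrying the create call is safe even when an earlier attempt succeeded but its response was lost.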

Exit codes

Code  Meaning
0     axolotl succeeded; artifact registered.
1     axolotl exited non-zero, or its outputs were incomplete. Best-effort error artifact registered.
2     Config error (missing env, missing/invalid YAML).
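
As an illustration of how codes 0 and 1 relate to axolotl's result, here is a small sketch. The names ExitCode, outputs_complete, and exit_code_for are hypothetical, and the adapter file names are an assumption based on what PEFT typically writes; the shim's real completeness check may differ. The sketch does not cover config errors (code 2) or what happens if artifact registration itself fails.

```python
import json
from enum import IntEnum
from pathlib import Path


class ExitCode(IntEnum):
    OK = 0
    TRAIN_FAILED = 1
    CONFIG_ERROR = 2


# Assumption: file names PEFT commonly uses for a saved LoRA adapter.
ADAPTER_FILES = ("adapter_model.safetensors", "adapter_model.bin")


def outputs_complete(output_dir: Path) -> bool:
    """True if trainer_state.json parses and an adapter file is present."""
    try:
        json.loads((output_dir / "trainer_state.json").read_text(encoding="utf-8"))
    except (OSError, ValueError):  # missing/unreadable file or invalid JSON
        return False
    return any((output_dir / name).is_file() for name in ADAPTER_FILES)


def exit_code_for(axolotl_returncode: int, output_dir: Path) -> ExitCode:
    """Map axolotl's return code plus output state to the shim's exit code."""
    if axolotl_returncode != 0 or not outputs_complete(output_dir):
        return ExitCode.TRAIN_FAILED
    return ExitCode.OK
```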

Local development

make conda-update      # set up the conda env
make format-backend    # black + isort
make lint-backend      # black/isort/flake8/mypy
make test-backend      # pytest

To run end-to-end against a tiny dataset, set the env vars manually, drop an axolotl YAML at ~/axolotl-config.yaml, and run python -m saturn_tokenfactory.

Out of scope

The shim doesn't know about:

  • Dataset formats — axolotl handles those via the YAML.
  • Model families / LoRA target_modules / per-family quirks — handled by axolotl + Atlas's renderer.
  • Experiment trackers (MLflow / W&B / Comet) — configure via the YAML; axolotl talks to them directly. A tag-based deep-link is left to the UI (search by tags.saturn.resource_id).
  • Multi-GPU / multi-node launching — the rendered YAML carries the right DeepSpeed/FSDP config and the pod's launch command picks the right torchrun invocation.
