The pod-side runner for Saturn Cloud's Token Factory product — a no-code LoRA fine-tuning service.
This package is the tiny shim that runs inside a Token Factory fine-tune job pod. All training logic lives in axolotl; this shim's job is to:
- Read a rendered axolotl YAML from `~/axolotl-config.yaml` (overridable via `TF_AXOLOTL_CONFIG_PATH`).
- Run `axolotl train <config>`, tee'ing its stdout/stderr to both k8s logs and `<output_dir>/training.log`.
- On success, parse `<output_dir>/trainer_state.json`, verify an adapter was written, and register the result as a `kind=checkpoint` Artifact via the Atlas API.
- On failure, best-effort register an error Artifact.
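The tee step above can be sketched roughly as follows. This is a minimal illustration, not the shim's actual code: the function name `run_and_tee` is hypothetical, and it assumes stderr is merged into stdout before being copied to both the pod's stdout (which k8s captures as logs) and the log file.

```python
import subprocess
import sys
from pathlib import Path


def run_and_tee(cmd: list[str], log_path: Path) -> int:
    """Run cmd, copying its merged stdout/stderr to our stdout and to log_path.

    Hypothetical helper; returns the child's exit code.
    """
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("w", encoding="utf-8") as log_file:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into the same stream
            text=True,
            bufsize=1,  # line-buffered so logs appear as training progresses
        )
        assert proc.stdout is not None
        for line in proc.stdout:
            sys.stdout.write(line)  # picked up by k8s as pod logs
            sys.stdout.flush()
            log_file.write(line)  # persisted next to the training outputs
            log_file.flush()
        return proc.wait()
```

With this helper, the training step would look something like `run_and_tee(["axolotl", "train", str(config_path)], output_dir / "training.log")`, where `config_path` and `output_dir` are placeholders.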
Atlas owns the YAML rendering. Re-runs are submitted as fresh AI Studio
resources (no in-place retry); resume-from-checkpoint is a future Atlas-side
feature (the renderer would set `lora_model_dir:` to point at a previous run's
output).
| Var | Required | Description |
|---|---|---|
| `SATURN_RESOURCE_ID` | for callback | The pod's deployment id. Used as `producer.id` and as the Atlas idempotency key. |
| `SATURN_TOKEN` | for callback | Bearer JWT for Atlas. |
| `SATURN_BASE_URL` | for callback | Atlas base URL. |
| `TF_OUTPUT_SF_ID` | for callback | SharedFolder id backing the pod's output mount. |
| `TF_OUTPUT_SF_RELPATH` | for callback | Relative path within that SharedFolder where this job's output lives. |
| `TF_AXOLOTL_CONFIG_PATH` | no | Path to the rendered axolotl YAML. Defaults to `~/axolotl-config.yaml`. |
| `TF_IMAGE_TAG` | no | Image tag string echoed into the artifact metadata. |
If the `SATURN_*` env vars aren't set, the shim still runs axolotl but
skips the Atlas callback (useful for local dev).
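One way the environment handling could look is sketched below. The function names are hypothetical, and treating a partially set group of `SATURN_*` vars as "callback disabled" is an assumption; the real shim may instead treat that case as a config error.

```python
import os
from pathlib import Path
from typing import Mapping

# The three SATURN_* vars marked "for callback" in the table.
SATURN_VARS = ("SATURN_RESOURCE_ID", "SATURN_TOKEN", "SATURN_BASE_URL")
DEFAULT_CONFIG_PATH = "~/axolotl-config.yaml"


def atlas_callback_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Return True only if every SATURN_* var is set and non-empty (assumed rule)."""
    return all(env.get(name) for name in SATURN_VARS)


def resolve_config_path(env: Mapping[str, str] = os.environ) -> Path:
    """Pick the axolotl YAML path, honoring TF_AXOLOTL_CONFIG_PATH if present."""
    raw = env.get("TF_AXOLOTL_CONFIG_PATH") or DEFAULT_CONFIG_PATH
    return Path(raw).expanduser()
```

Passing the environment as a mapping keeps both helpers easy to exercise in tests without mutating `os.environ`.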
- Single artifact registration per resource. Idempotency key is `SATURN_RESOURCE_ID`; pod restarts for the same resource dedupe server-side.
- Two-call: create the artifact (server stamps `status=pending`), then PATCH to `ready` or `error`. Create is retried (5 attempts, jittered backoff); PATCH is best-effort (2 attempts).
- Hard failures (pod killed, OOM, eviction) produce no callback; Atlas's job reconciler converges the artifact row from k8s pod state.
The wire format and retry policy live in saturn_tokenfactory/atlas_client.py.
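As a rough illustration of the two-call flow described above, the sketch below retries a create call five times and a status update twice. It is not the code in `atlas_client.py`: `create` and `patch` are hypothetical callables standing in for the real HTTP calls, and the backoff shape (exponential with full jitter, 1 s base, 30 s cap) is an assumption. Only the attempt counts come from the description above.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_with_jitter(
    fn: Callable[[], T],
    attempts: int,
    base_delay: float = 1.0,  # assumed value
    max_delay: float = 30.0,  # assumed value
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn up to `attempts` times, sleeping a random, growing delay between tries."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    last_exc: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:  # a real client would catch narrower errors
            last_exc = exc
            if attempt < attempts - 1:
                cap = min(max_delay, base_delay * 2**attempt)
                sleep(random.uniform(0, cap))  # full jitter
    assert last_exc is not None
    raise last_exc


def register_artifact(
    create: Callable[[], str],
    patch: Callable[[str, str], None],
    final_status: str,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Create the artifact (5 attempts), then best-effort PATCH its status (2 attempts).

    Returns True if the PATCH succeeded, False if it was given up on.
    A failed create propagates, since there is nothing to PATCH.
    """
    artifact_id = retry_with_jitter(create, attempts=5, sleep=sleep)
    try:
        retry_with_jitter(
            lambda: patch(artifact_id, final_status), attempts=2, sleep=sleep
        )
    except Exception:
        return False  # best-effort: a failed PATCH is swallowed here
    return True
```

Swallowing the PATCH failure fits the description above, where the server-side reconciler is the backstop for artifact rows that never leave `pending`.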
| Code | Meaning |
|---|---|
| 0 | axolotl succeeded; artifact registered. |
| 1 | axolotl exited non-zero, or its outputs were incomplete. Best-effort error artifact registered. |
| 2 | Config error (missing env, missing/invalid YAML). |
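The mapping from run outcome to exit codes 0 and 1 might be sketched like this. The names `ExitCode`, `outputs_complete`, and `exit_code_for` are hypothetical, and the adapter filenames checked are the ones PEFT conventionally writes, which is an assumption about what "an adapter was written" means here. Whether a failed Atlas registration changes the exit code is not stated above, so the sketch ignores it.

```python
import json
from enum import IntEnum
from pathlib import Path


class ExitCode(IntEnum):
    OK = 0
    TRAINING_FAILED = 1
    CONFIG_ERROR = 2


# Assumed adapter filenames (the usual PEFT outputs).
ADAPTER_FILENAMES = ("adapter_model.safetensors", "adapter_model.bin")


def outputs_complete(output_dir: Path) -> bool:
    """True if trainer_state.json parses as JSON and an adapter file exists."""
    try:
        json.loads((output_dir / "trainer_state.json").read_text(encoding="utf-8"))
    except (OSError, ValueError):  # missing/unreadable file or invalid JSON
        return False
    return any((output_dir / name).is_file() for name in ADAPTER_FILENAMES)


def exit_code_for(axolotl_returncode: int, output_dir: Path) -> ExitCode:
    """Map axolotl's return code and the state of its outputs to a shim exit code."""
    if axolotl_returncode != 0 or not outputs_complete(output_dir):
        return ExitCode.TRAINING_FAILED
    return ExitCode.OK
```

`ExitCode.CONFIG_ERROR` would be returned earlier, before axolotl is launched, when required env vars or the YAML are missing or invalid.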
```bash
make conda-update    # set up the conda env
make format-backend  # black + isort
make lint-backend    # black/isort/flake8/mypy
make test-backend    # pytest
```

To run end-to-end against a tiny dataset, set the env vars manually, drop an
axolotl YAML at `~/axolotl-config.yaml`, and run `python -m saturn_tokenfactory`.
The shim doesn't know about:
- Dataset formats: axolotl handles those via the YAML.
- Model families / LoRA `target_modules` / per-family quirks: handled by axolotl + Atlas's renderer.
- Experiment trackers (MLflow / W&B / Comet): configure via the YAML; axolotl talks to them directly. A tag-based deep-link is left to the UI (search by `tags.saturn.resource_id`).
- Multi-GPU / multi-node launching: the rendered YAML carries the right DeepSpeed/FSDP config and the pod's launch command picks the right `torchrun` invocation.