weecology · henrykironde · Mar 17, 2026 · Apr 3, 2026 · Apr 3, 2026 · Apr 4, 2026
diff --git a/docs/development/cluster.md b/docs/development/cluster.md
@@ -0,0 +1,120 @@
+# Cluster Distributed Runs
+
+This page shows supported patterns for running DeepForest across multiple GPUs and multiple nodes on a Slurm-managed cluster.
+
+## Shared Settings
+
+Use the same launch pattern for `train`, `evaluate`, and `predict`:
+
+- `devices=<gpus_per_node>` is the number of GPUs on each node
+- `num_nodes=<nnodes>` is the total number of nodes
+- `strategy=ddp` enables distributed data parallel execution (use `auto` for single-GPU jobs)
+- `workers=0` is required for large-tile prediction with `dataloader_strategy="window"`
+
+Launch every job step with **`srun`** so Lightning reads the Slurm environment. Set `#SBATCH --ntasks-per-node` equal to `devices` and `#SBATCH --nodes` equal to `num_nodes`.
+
+## Environment
+
+```bash
+ml conda
+eval "$(conda shell.bash hook)"
+conda activate predict
+cd /path/to/DeepForest
+```
+
+## Train
+
+For a quick distributed smoke test:
+
+```bash
+sbatch --nodes=2 --ntasks-per-node=2 --gpus-per-node=2 --cpus-per-task=4 --mem=80G --time=00:30:00 \
+  run_cluster_multinode_smoke.sh
+```
+
+For a real training run inside an `sbatch` script:
+
+```bash
+#SBATCH --nodes=2
+#SBATCH --ntasks-per-node=2
+#SBATCH --gres=gpu:2
+
+srun uv run deepforest train \
+  --strategy ddp \
+  accelerator=gpu \
+  devices=2 \
+  num_nodes=2 \
+  train.csv_file=/path/to/train.csv \
+  train.root_dir=/path/to/train_images \
+  validation.csv_file=/path/to/val.csv \
+  validation.root_dir=/path/to/val_images
+```
+
+## Evaluate
+
+```bash
+#SBATCH --nodes=2
+#SBATCH --ntasks-per-node=2
+#SBATCH --gres=gpu:2
+
+srun uv run deepforest evaluate \
+  /path/to/ground_truth.csv \
+  --root-dir /path/to/images \
+  --save-predictions eval_preds.csv \
+  -o eval_metrics.csv \
+  --strategy ddp \
+  accelerator=gpu \
+  devices=2 \
+  num_nodes=2
+```
+
+## Predict From CSV
+
+To run the cluster regression test, which also serves as an example launcher:
+
+```bash
+sbatch run_cluster_predict_test.sbatch
+```
+
+To run your own CSV prediction job directly:
+
+```bash
+srun uv run deepforest predict \
+  /path/to/images.csv \
+  --mode csv \
+  --root-dir /path/to/images \
+  -o predictions.csv \
+  --strategy ddp \
+  accelerator=gpu \
+  devices=2 \
+  num_nodes=2
+```
+
+## Predict A Large Tile
+
+For large rasters on a cluster, prefer `predict_tile(..., dataloader_strategy="window")`.
+
+The ready-to-run test launcher is:
+
+```bash
+sbatch run_cluster_predict_tile_test.sbatch
+```
+
+To run a tiled prediction job directly:
+
+```bash
+#SBATCH --nodes=2
+#SBATCH --ntasks-per-node=2
+#SBATCH --gres=gpu:2
+
+srun uv run python tests/cluster_predict_tile_driver.py \
+  --input-path /path/to/tile.tif \
+  --output-path tile_predictions.csv \
+  --model-name weecology/everglades-bird-species-detector \
+  --patch-size 1500 \
+  --patch-overlap 0 \
+  --dataloader-strategy window \
+  --devices 2 \
+  --num-nodes 2
+```
+
+See also the [multi-GPU and multi-node guide](../user_guide/distributed.md).
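
The matching rule in `docs/development/cluster.md` (`--ntasks-per-node` equal to `devices`, `--nodes` equal to `num_nodes`) can be checked at the start of a job, before training begins. The following is a minimal sketch, not part of DeepForest: `check_slurm_layout` is a hypothetical helper, and it assumes Slurm exports `SLURM_NTASKS_PER_NODE` and `SLURM_NNODES` as plain integers, which it normally does when `--ntasks-per-node` and `--nodes` are given.

```python
import os


def check_slurm_layout(devices, num_nodes, env=None):
    """Return mismatch messages; an empty list means the layout matches."""
    env = os.environ if env is None else env
    expected = {
        "SLURM_NTASKS_PER_NODE": devices,  # one task per GPU on each node
        "SLURM_NNODES": num_nodes,  # nodes in the allocation
    }
    problems = []
    for name, want in expected.items():
        raw = env.get(name)
        if raw is None:
            problems.append(f"{name} is not set; check the #SBATCH options")
        elif not raw.isdigit() or int(raw) != want:
            problems.append(f"{name}={raw}, but the trainer expects {want}")
    return problems


# Example with an explicit environment instead of the real one (hypothetical values).
fake_env = {"SLURM_NTASKS_PER_NODE": "2", "SLURM_NNODES": "1"}
for message in check_slurm_layout(devices=2, num_nodes=2, env=fake_env):
    print(message)
```

Passing `env` explicitly keeps the check testable on a laptop; inside a real job step, calling `check_slurm_layout(devices, num_nodes)` reads the live environment.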
diff --git a/docs/development/index.md b/docs/development/index.md
@@ -5,6 +5,7 @@
 ```{toctree}
 :maxdepth: 1
 
+cluster
 authors
 contributing
 code_of_conduct

diff --git a/docs/user_guide/07_scaling.md b/docs/user_guide/07_scaling.md
@@ -1,5 +1,7 @@
 # Scaling DeepForest using PyTorch Lightning
 
+For concise launch recipes, see the [multi-GPU and multi-node guide](distributed.md). If you are using a Slurm-managed cluster, see the [cluster developer guide](../development/cluster.md).
+
 ## Increase batch size
 
 It is more efficient to run a larger batch size on a single GPU. This is because the overhead of loading data and moving data between the CPU and GPU is relatively large. By running a larger batch size, we can reduce the overhead of these operations.
@@ -27,9 +29,7 @@ A few notes that can trip up those less used to multi-gpu training. These are fo
 
 2. Each device gets its own portion of the dataset. This means that they do not interact during forward passes.
 
-3. Make sure to use srun when combining with SLURM! This is an easy one to miss and will cause training to hang without error. Documented here
-
-https://lightning.ai/docs/pytorch/latest/clouds/cluster_advanced.html#troubleshooting.
+3. On Slurm, launch with **`srun`**; omitting it can cause training to hang without an error. Match `#SBATCH --ntasks-per-node` to `devices` and `#SBATCH --nodes` to `num_nodes`. See the [multi-GPU and multi-node guide](distributed.md) and [Lightning SLURM troubleshooting](https://lightning.ai/docs/pytorch/latest/clouds/cluster_advanced.html#troubleshooting).
 
 
 ## Prediction

diff --git a/docs/user_guide/09_configuration_file.md b/docs/user_guide/09_configuration_file.md
@@ -151,17 +151,30 @@ The number of cpus/gpus to use during model training. Deepforest has been tested
 
 ### accelerator
 
-Most commonly, `cpu`, `gpu` or `tpu` as well as other [options](https://pytorch-lightning.readthedocs.io/en/1.4.0/advanced/multi_gpu.html) listed:
+The most common values are `cpu`, `gpu`, and `tpu`. Other [options](https://lightning.ai/docs/pytorch/stable/accelerators/gpu.html) are described in the Lightning documentation.
 
-If `gpu`, it can be helpful to specify the data parallelization strategy. This can be done using the `strategy` arg in `main.create_trainer()`
+### num_nodes
+
+Number of machines for distributed training. Default is `1`. Set this to your Slurm node count for multi-node jobs. See [Scaling](07_scaling.md) and [distributed runs](distributed.md).
+
+### strategy
+
+Distributed training strategy passed to the Lightning `Trainer`. Default is `auto` (appropriate for single-GPU runs). Use `ddp` for multi-GPU or multi-node training.
+
+Set it in the config file, as a Hydra override on the command line (`strategy=ddp`), or via `create_trainer(strategy="ddp")`. Command-line overrides and `create_trainer` keyword arguments take precedence over the config file.
 
 ```python
-from deepforest import model as m
+from deepforest import main
 
-m.create_trainer(logger=comet_logger, strategy="ddp")
+m = main.deepforest()
+m.config.accelerator = "gpu"
+m.config.devices = 4
+m.config.num_nodes = 2
+m.config.strategy = "ddp"
+m.create_trainer(logger=comet_logger)  # comet_logger: a logger you created beforehand
 ```
 
-This is passed to the pytorch-lightning trainer, documented in the link above for multi-gpu training.
+On Slurm clusters, launch with `srun` so Lightning can read the job environment. Details are in [distributed runs](distributed.md).
 
 ### batch_size
 

diff --git a/docs/user_guide/11_training.md b/docs/user_guide/11_training.md
@@ -526,38 +526,28 @@ Usually creating this object does not cost too much computational time.
 
 #### Training across multiple nodes on a HPC system
 
-We have heard that this error can appear when trying to deep copy the pytorch lightning module. The trainer object is not pickleable.
-For example, on multi-gpu environments when trying to scale the deepforest model the entire module is copied leading to this error.
-Setting the trainer object to None and directly using the pytorch object is a reasonable workaround.
+On Slurm clusters, submit jobs with `srun` and set `devices`, `num_nodes`, and `strategy=ddp` to match your `#SBATCH` allocation. See [Scaling](07_scaling.md) and [distributed runs](distributed.md).
 
-Replace
+If you see a **Weakly referenced objects** error when scaling across GPUs, the trainer object may not pickle cleanly when the module is copied. A workaround is to set `m.trainer = None` and construct a `Trainer` directly:
 
 ```python
 m = main.deepforest()
-m.create_trainer()
-m.trainer.fit(m)
-```
-
-with
-
-```python
 m.trainer = None
 from pytorch_lightning import Trainer
 
-    trainer = Trainer(
-        accelerator="gpu",
-        strategy="ddp",
-        devices=model.config.devices,
-        enable_checkpointing=False,
-        max_epochs=model.config.train.epochs,
-        logger=comet_logger
-    )
+trainer = Trainer(
+    accelerator="gpu",
+    strategy="ddp",
+    devices=m.config.devices,
+    num_nodes=m.config.num_nodes,
+    enable_checkpointing=False,
+    max_epochs=m.config.train.epochs,
+    logger=comet_logger,
+)
 trainer.fit(m)
 ```
 
-The added benefits of this is more control over the trainer object.
-The downside is that it doesn't align with the .config pattern where a user now has to look into the config to create the trainer.
-We are open to changing this to be the default pattern in the future and welcome input from users.
+We are open to making this the default pattern and welcome input from users.
 
 #### Visualization during training
 
@@ -598,6 +588,8 @@ We provide a basic script to trigger a training run via CLI. This script is inst
 If you are using `uv` to manage your Python environment, remember to prefix these commands with `uv run`, for example: `uv run deepforest predict`.
 ```
 
+On a Slurm cluster, wrap the command in `srun` inside your batch script (see [Scaling](07_scaling.md) and [distributed runs](distributed.md)).
+
 ```bash
 deepforest train batch_size=8 train.csv_file=your_labels.csv train.root_dir=some/path
 ```

diff --git a/docs/user_guide/distributed.md b/docs/user_guide/distributed.md
@@ -0,0 +1,84 @@
+# Multi-GPU and Multi-Node Runs
+
+DeepForest uses PyTorch Lightning distributed execution. For most multi-GPU and multi-node runs, these settings matter:
+
+- `accelerator=gpu`
+- `devices=<gpus_per_node>`
+- `num_nodes=<nnodes>`
+- `strategy=ddp`
+
+On Slurm clusters, launch with **`srun`** inside your job allocation. Match `#SBATCH --ntasks-per-node` to `devices` and `#SBATCH --nodes` to `num_nodes`. See the [cluster developer guide](../development/cluster.md).
+
+Single-GPU jobs can keep the default `strategy=auto`.
+
+## Train
+
+```bash
+#SBATCH --nodes=<nnodes>
+#SBATCH --ntasks-per-node=<gpus_per_node>
+#SBATCH --gres=gpu:<gpus_per_node>
+
+srun uv run deepforest train \
+  --strategy ddp \
+  accelerator=gpu \
+  devices=<gpus_per_node> \
+  num_nodes=<nnodes> \
+  train.csv_file=/path/to/train.csv \
+  train.root_dir=/path/to/train_images \
+  validation.csv_file=/path/to/val.csv \
+  validation.root_dir=/path/to/val_images
+```
+
+## Evaluate
+
+```bash
+srun uv run deepforest evaluate \
+  /path/to/ground_truth.csv \
+  --root-dir /path/to/images \
+  --save-predictions eval_preds.csv \
+  -o eval_metrics.csv \
+  --strategy ddp \
+  accelerator=gpu \
+  devices=<gpus_per_node> \
+  num_nodes=<nnodes>
+```
+
+## Predict From CSV
+
+```bash
+srun uv run deepforest predict \
+  /path/to/images.csv \
+  --mode csv \
+  --root-dir /path/to/images \
+  -o predictions.csv \
+  --strategy ddp \
+  accelerator=gpu \
+  devices=<gpus_per_node> \
+  num_nodes=<nnodes>
+```
+
+## Predict A Large Tile
+
+For large geospatial rasters, use `predict_tile(..., dataloader_strategy="window")` instead of the simple CLI tile mode.
+
+```python
+from deepforest.main import deepforest
+
+m = deepforest()
+m.load_model("weecology/everglades-bird-species-detector")
+m.config.accelerator = "gpu"
+m.config.devices = 2
+m.config.num_nodes = 2
+m.config.strategy = "ddp"
+m.config.workers = 0
+m.create_trainer()
+
+results = m.predict_tile(
+    path="/path/to/tile.tif",
+    patch_size=1500,
+    patch_overlap=0,
+    dataloader_strategy="window",
+)
+```
+
+Launch that script with `srun`, using the same `#SBATCH` layout and trainer settings as the examples above. For a complete cluster example, see the [cluster developer guide](../development/cluster.md).
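
With `strategy=ddp`, the `devices` and `num_nodes` settings in `docs/user_guide/distributed.md` together fix the number of processes (`devices * num_nodes`), and each process works on its own slice of the data. The sketch below illustrates that split under one assumption: that samples are divided the way PyTorch's `DistributedSampler` divides them during training when shuffling is off (pad the index list to a multiple of the world size, then stride by rank). `shard_indices` is a hypothetical illustration, not DeepForest or Lightning code, and prediction may use a sampler that does not repeat samples.

```python
import math


def shard_indices(n_samples, rank, world_size):
    """Indices one process would see under a pad-then-stride split."""
    per_rank = math.ceil(n_samples / world_size)
    total = per_rank * world_size
    indices = list(range(n_samples))
    # Pad by wrapping around so every rank gets the same number of samples.
    indices += [indices[i % n_samples] for i in range(total - n_samples)]
    return indices[rank:total:world_size]


devices, num_nodes = 2, 2  # matches the examples in this guide
world_size = devices * num_nodes  # one process per GPU
for rank in range(world_size):
    print(rank, shard_indices(10, rank, world_size))
```

Because of the padding, a few samples can be seen twice when the dataset size is not a multiple of the world size, which is worth keeping in mind when comparing metrics between single-GPU and multi-GPU runs.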
diff --git a/docs/user_guide/index.md b/docs/user_guide/index.md
@@ -18,6 +18,7 @@ The User Guide covers the core DeepForest package usage and functionalities.
 05_model_architecture
 06_multi_species
 07_scaling
+distributed
 08_visualizations
 09_configuration_file
 10_better

diff --git a/pyproject.toml b/pyproject.toml
@@ -155,7 +155,8 @@ filterwarnings = [
 ]
 markers = [
     "slow: marks tests that are slow to run",
-    "integration: marks integration tests"
+    "integration: marks integration tests",
+    "cluster: marks tests intended for Slurm cluster runs only"
 ]
 
 [tool.coverage.run]

diff --git a/run_cluster_multinode_smoke.sh b/run_cluster_multinode_smoke.sh
@@ -0,0 +1,44 @@
+#!/usr/bin/env bash
+# Distributed smoke test: submit with Slurm, launch with srun (Lightning reads the job env).
+#
+# Example:
+#   sbatch --nodes=2 --ntasks-per-node=2 --gpus-per-node=2 --cpus-per-task=4 --mem=80G \
+#     --time=00:30:00 run_cluster_multinode_smoke.sh
+set -euo pipefail
+
+REPO_ROOT="${REPO_ROOT:-$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)}"
+DATA_ROOT="$REPO_ROOT/src/deepforest/data"
+TRAIN_CSV="${TRAIN_CSV:-$DATA_ROOT/OSBS_029.csv}"
+TRAIN_ROOT="${TRAIN_ROOT:-$DATA_ROOT}"
+VAL_CSV="${VAL_CSV:-$DATA_ROOT/OSBS_029.csv}"
+VAL_ROOT="${VAL_ROOT:-$DATA_ROOT}"
+LOG_ROOT="${LOG_ROOT:-$REPO_ROOT/lightning_logs_multinode_smoke}"
+
+GPUS_PER_NODE="${GPUS_PER_NODE:-${SLURM_GPUS_ON_NODE:-${SLURM_GPUS_PER_NODE:-1}}}"
+NNODES="${NNODES:-${SLURM_NNODES:-1}}"
+
+if [[ -z "${SLURM_JOB_ID:-}" ]]; then
+    echo "SLURM_JOB_ID is not set. Submit with sbatch or run inside salloc." >&2
+    exit 1
+fi
+
+echo "HOSTNAME=$(hostname)"
+echo "NNODES=$NNODES"
+echo "GPUS_PER_NODE=$GPUS_PER_NODE"
+echo "SLURM_NTASKS_PER_NODE=${SLURM_NTASKS_PER_NODE:-?}"
+
+cd "$REPO_ROOT"
+
+srun --kill-on-bad-exit=1 uv run deepforest train \
+  --disable-checkpoint \
+  --strategy ddp \
+  train.fast_dev_run=true \
+  workers=0 \
+  accelerator=gpu \
+  devices="$GPUS_PER_NODE" \
+  num_nodes="$NNODES" \
+  train.csv_file="$TRAIN_CSV" \
+  train.root_dir="$TRAIN_ROOT" \
+  validation.csv_file="$VAL_CSV" \
+  validation.root_dir="$VAL_ROOT" \
+  log_root="$LOG_ROOT"
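
The launcher above resolves `GPUS_PER_NODE` and `NNODES` through nested `${VAR:-default}` expansions: an explicit override wins, then the Slurm-provided variables, then `1`. A rough Python equivalent of that fallback chain is sketched below for readers who want to reuse the logic in a driver script. `first_nonempty` and `resolve_layout` are hypothetical names, not part of the repository.

```python
def first_nonempty(env, names, default):
    """Return the first variable in `names` that is set and non-empty."""
    for name in names:
        value = env.get(name, "")
        if value:  # like bash's ${VAR:-...}, an empty string counts as unset
            return value
    return default


def resolve_layout(env):
    """Mirror the launcher's defaults for GPUs per node and node count."""
    gpus = first_nonempty(
        env, ["GPUS_PER_NODE", "SLURM_GPUS_ON_NODE", "SLURM_GPUS_PER_NODE"], "1"
    )
    nodes = first_nonempty(env, ["NNODES", "SLURM_NNODES"], "1")
    return gpus, nodes


# Example with an explicit dict; in a job step, pass os.environ instead.
print(resolve_layout({"SLURM_GPUS_ON_NODE": "2", "SLURM_NNODES": "2"}))
```

The values are kept as strings, as in the shell script; convert them with `int(...)` before passing them to a trainer.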