120 changes: 120 additions & 0 deletions docs/development/cluster.md
@@ -0,0 +1,120 @@
# Cluster Distributed Runs

This page shows supported patterns for running DeepForest across multiple GPUs and multiple nodes on a Slurm-managed cluster.

## Shared Settings

Use the same launch pattern for `train`, `evaluate`, and `predict`:

- `devices=<gpus_per_node>` is the number of GPUs on each node
- `num_nodes=<nnodes>` is the total number of nodes
- `strategy=ddp` enables distributed data parallel execution (use `auto` for single-GPU jobs)
- `workers=0` is required for large-tile prediction with `dataloader_strategy="window"`

Launch every job step with **`srun`** so Lightning reads the Slurm environment. Set `#SBATCH --ntasks-per-node` equal to `devices` and `#SBATCH --nodes` equal to `num_nodes`.
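
The matching rule above can be checked at the start of a job. The sketch below is a hypothetical helper (`check_slurm_layout` is not part of DeepForest or Lightning) that compares the intended `devices` and `num_nodes` against the `SLURM_NTASKS_PER_NODE` and `SLURM_NNODES` variables that Slurm exports inside a job. It assumes both variables hold plain integers, which is the case when every node runs the same number of tasks.

```python
import os


def check_slurm_layout(devices, num_nodes, env=None):
    """Return human-readable mismatches between Slurm's layout and the trainer settings.

    Hypothetical helper for illustration; not part of DeepForest or Lightning.
    """
    env = os.environ if env is None else env
    problems = []

    # Slurm exports these inside a job; they are absent on a workstation.
    slurm_nodes = env.get("SLURM_NNODES")
    slurm_tasks = env.get("SLURM_NTASKS_PER_NODE")

    if slurm_nodes is not None and int(slurm_nodes) != num_nodes:
        problems.append(f"num_nodes={num_nodes} but SLURM_NNODES={slurm_nodes}")
    if slurm_tasks is not None and int(slurm_tasks) != devices:
        problems.append(f"devices={devices} but SLURM_NTASKS_PER_NODE={slurm_tasks}")
    return problems


if __name__ == "__main__":
    for problem in check_slurm_layout(devices=2, num_nodes=2):
        print("Layout mismatch:", problem)
```

Passing `env` explicitly keeps the check testable outside a Slurm job; inside a job the default reads the real environment.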

## Environment

```bash
ml conda
eval "$(conda shell.bash hook)"
conda activate predict
cd /path/to/DeepForest
```

## Train

For a quick distributed smoke test:

```bash
sbatch --nodes=2 --ntasks-per-node=2 --gpus-per-node=2 --cpus-per-task=4 --mem=80G --time=00:30:00 \
  run_cluster_multinode_smoke.sh
```

For a real training run inside an `sbatch` script:

```bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --gres=gpu:2

srun uv run deepforest train \
  --strategy ddp \
  accelerator=gpu \
  devices=2 \
  num_nodes=2 \
  train.csv_file=/path/to/train.csv \
  train.root_dir=/path/to/train_images \
  validation.csv_file=/path/to/val.csv \
  validation.root_dir=/path/to/val_images
```

## Evaluate

```bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --gres=gpu:2

srun uv run deepforest evaluate \
  /path/to/ground_truth.csv \
  --root-dir /path/to/images \
  --save-predictions eval_preds.csv \
  -o eval_metrics.csv \
  --strategy ddp \
  accelerator=gpu \
  devices=2 \
  num_nodes=2
```

## Predict From CSV

For the cluster regression test and example launcher:

```bash
sbatch run_cluster_predict_test.sbatch
```

To run your own CSV prediction job directly:

```bash
srun uv run deepforest predict \
  /path/to/images.csv \
  --mode csv \
  --root-dir /path/to/images \
  -o predictions.csv \
  --strategy ddp \
  accelerator=gpu \
  devices=2 \
  num_nodes=2
```

## Predict A Large Tile

For large rasters on a cluster, prefer `predict_tile(..., dataloader_strategy="window")`.

The ready-to-run test launcher is:

```bash
sbatch run_cluster_predict_tile_test.sbatch
```

To run a tiled prediction job directly:

```bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --gres=gpu:2

srun uv run python tests/cluster_predict_tile_driver.py \
  --input-path /path/to/tile.tif \
  --output-path tile_predictions.csv \
  --model-name weecology/everglades-bird-species-detector \
  --patch-size 1500 \
  --patch-overlap 0 \
  --dataloader-strategy window \
  --devices 2 \
  --num-nodes 2
```

See also the [multi-GPU and multi-node guide](../user_guide/distributed.md).
1 change: 1 addition & 0 deletions docs/development/index.md
@@ -5,6 +5,7 @@
```{toctree}
:maxdepth: 1

+cluster
authors
contributing
code_of_conduct
6 changes: 3 additions & 3 deletions docs/user_guide/07_scaling.md
@@ -1,5 +1,7 @@
# Scaling DeepForest using PyTorch Lightning

+For concise launch recipes, see the [multi-GPU and multi-node guide](distributed.md). If you are using a Slurm-managed cluster, see the [cluster developer guide](../development/cluster.md).
+
## Increase batch size

It is more efficient to run a larger batch size on a single GPU. This is because the overhead of loading data and moving data between the CPU and GPU is relatively large. By running a larger batch size, we can reduce the overhead of these operations.
@@ -27,9 +29,7 @@ A few notes that can trip up those less used to multi-gpu training. These are fo

2. Each device gets its own portion of the dataset. This means that they do not interact during forward passes.

-3. Make sure to use srun when combining with SLURM! This is an easy one to miss and will cause training to hang without error. Documented here
-
-https://lightning.ai/docs/pytorch/latest/clouds/cluster_advanced.html#troubleshooting.
+3. On SLURM, launch with **`srun`**. Match `#SBATCH --ntasks-per-node` to `devices` and `#SBATCH --nodes` to `num_nodes`. See the [multi-GPU and multi-node guide](distributed.md) and [Lightning SLURM troubleshooting](https://lightning.ai/docs/pytorch/latest/clouds/cluster_advanced.html#troubleshooting).
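
Note 2 can be pictured with a small sketch. It assumes an unshuffled round-robin split, padded so that every rank receives the same number of samples, which is how PyTorch's `DistributedSampler` behaves with `shuffle=False` and `drop_last=False`. The name `shard_indices` is hypothetical, and Lightning may substitute a different sampler for some stages.

```python
import math


def shard_indices(dataset_len, world_size, rank):
    """Indices that one rank would process under a padded round-robin split."""
    per_rank = math.ceil(dataset_len / world_size)
    total = per_rank * world_size
    # Wrap around to the start so every rank gets exactly `per_rank` samples.
    padded = [i % dataset_len for i in range(total)]
    return padded[rank:total:world_size]


if __name__ == "__main__":
    world_size = 2 * 2  # devices * num_nodes
    for rank in range(world_size):
        print(rank, shard_indices(10, world_size, rank))
```

With 10 samples on 4 ranks, two samples are repeated as padding, which is one reason metrics gathered across ranks can differ slightly from a single-GPU run.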


## Prediction
23 changes: 18 additions & 5 deletions docs/user_guide/09_configuration_file.md
@@ -151,17 +151,30 @@ The number of cpus/gpus to use during model training. Deepforest has been tested

### accelerator

-Most commonly, `cpu`, `gpu` or `tpu` as well as other [options](https://pytorch-lightning.readthedocs.io/en/1.4.0/advanced/multi_gpu.html) listed:
+Most commonly, `cpu`, `gpu` or `tpu` as well as other [options](https://lightning.ai/docs/pytorch/stable/accelerators/gpu.html).

-If `gpu`, it can be helpful to specify the data parallelization strategy. This can be done using the `strategy` arg in `main.create_trainer()`
+### num_nodes
+
+Number of machines for distributed training. Default is `1`. Set this to your Slurm node count for multi-node jobs. See [Scaling](07_scaling.md) and [distributed runs](distributed.md).
+
+### strategy
+
+Distributed training strategy passed to the Lightning `Trainer`. Default is `auto` (appropriate for single-GPU runs). Use `ddp` for multi-GPU or multi-node training.
+
+Set in the config file, as Hydra overrides (`strategy=ddp`), or via `create_trainer(strategy="ddp")`. CLI and `create_trainer` kwargs override the config file.
+
```python
-from deepforest import model as m
from deepforest import main

-m.create_trainer(logger=comet_logger, strategy="ddp")
+m = main.deepforest()
+m.config.accelerator = "gpu"
+m.config.devices = 4
+m.config.num_nodes = 2
+m.config.strategy = "ddp"
+m.create_trainer(logger=comet_logger)
```

-This is passed to the pytorch-lightning trainer, documented in the link above for multi-gpu training.
+On Slurm clusters, launch with `srun` so Lightning can read the job environment. Details are in [distributed runs](distributed.md).

### batch_size

36 changes: 14 additions & 22 deletions docs/user_guide/11_training.md
@@ -526,38 +526,28 @@ Usually creating this object does not cost too much computational time.

#### Training across multiple nodes on a HPC system

-We have heard that this error can appear when trying to deep copy the pytorch lightning module. The trainer object is not pickleable.
-For example, on multi-gpu environments when trying to scale the deepforest model the entire module is copied leading to this error.
-Setting the trainer object to None and directly using the pytorch object is a reasonable workaround.
+On Slurm clusters, submit jobs with `srun` and set `devices`, `num_nodes`, and `strategy=ddp` to match your `#SBATCH` allocation. See [Scaling](07_scaling.md) and [distributed runs](distributed.md).

-Replace
+If you see **Weakly referenced objects** when scaling across GPUs, the trainer object may not pickle cleanly when the module is copied. A workaround is to construct a `Trainer` directly:

```python
m = main.deepforest()
-m.create_trainer()
-m.trainer.fit(m)
-```
-
-with
-
-```python
m.trainer = None
from pytorch_lightning import Trainer

-trainer = Trainer(
-    accelerator="gpu",
-    strategy="ddp",
-    devices=model.config.devices,
-    enable_checkpointing=False,
-    max_epochs=model.config.train.epochs,
-    logger=comet_logger
-)
+trainer = Trainer(
+    accelerator="gpu",
+    strategy="ddp",
+    devices=m.config.devices,
+    num_nodes=m.config.num_nodes,
+    enable_checkpointing=False,
+    max_epochs=m.config.train.epochs,
+    logger=comet_logger,
+)
trainer.fit(m)
```

-The added benefits of this is more control over the trainer object.
-The downside is that it doesn't align with the .config pattern where a user now has to look into the config to create the trainer.
-We are open to changing this to be the default pattern in the future and welcome input from users.
+We are open to making this the default pattern and welcome input from users.

#### Visualization during training

@@ -598,6 +588,8 @@ We provide a basic script to trigger a training run via CLI. This script is inst
If you are using `uv` to manage your Python environment, remember to prefix these commands with `uv run`, for example: `uv run deepforest predict`.
```

+On a Slurm cluster, wrap the command in `srun` inside your batch script (see [Scaling](07_scaling.md) and [distributed runs](distributed.md)).
+
```bash
deepforest train batch_size=8 train.csv_file=your_labels.csv train.root_dir=some/path
```
84 changes: 84 additions & 0 deletions docs/user_guide/distributed.md
@@ -0,0 +1,84 @@
# Multi-GPU and Multi-Node Runs

DeepForest uses PyTorch Lightning distributed execution. For most multi-GPU and multi-node runs, these settings matter:

- `accelerator=gpu`
- `devices=<gpus_per_node>`
- `num_nodes=<nnodes>`
- `strategy=ddp`

On Slurm clusters, launch with **`srun`** inside your job allocation. Match `#SBATCH --ntasks-per-node` to `devices` and `#SBATCH --nodes` to `num_nodes`. See the [cluster developer guide](../development/cluster.md).

Single-GPU jobs can keep the default `strategy=auto`.
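
As a rough illustration of how these four settings fit together, the sketch below builds the override strings from a GPU and node count and falls back to `strategy=auto` when only one GPU is involved. `trainer_overrides` is a hypothetical helper, not a DeepForest function.

```python
def trainer_overrides(gpus_per_node, nnodes=1):
    """Build `key=value` overrides for a DeepForest CLI call (illustrative only)."""
    if gpus_per_node < 1 or nnodes < 1:
        raise ValueError("gpus_per_node and nnodes must be positive")
    world_size = gpus_per_node * nnodes  # total number of DDP processes
    strategy = "ddp" if world_size > 1 else "auto"
    return [
        "accelerator=gpu",
        f"devices={gpus_per_node}",
        f"num_nodes={nnodes}",
        f"strategy={strategy}",
    ]


if __name__ == "__main__":
    print(" ".join(trainer_overrides(gpus_per_node=2, nnodes=2)))
```

Deriving the strategy from the total process count keeps single-GPU and multi-node launches on one code path.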

## Train

```bash
#SBATCH --nodes=<nnodes>
#SBATCH --ntasks-per-node=<gpus_per_node>
#SBATCH --gres=gpu:<gpus_per_node>

srun uv run deepforest train \
  --strategy ddp \
  accelerator=gpu \
  devices=<gpus_per_node> \
  num_nodes=<nnodes> \
  train.csv_file=/path/to/train.csv \
  train.root_dir=/path/to/train_images \
  validation.csv_file=/path/to/val.csv \
  validation.root_dir=/path/to/val_images
```

## Evaluate

```bash
srun uv run deepforest evaluate \
  /path/to/ground_truth.csv \
  --root-dir /path/to/images \
  --save-predictions eval_preds.csv \
  -o eval_metrics.csv \
  --strategy ddp \
  accelerator=gpu \
  devices=<gpus_per_node> \
  num_nodes=<nnodes>
```

## Predict From CSV

```bash
srun uv run deepforest predict \
  /path/to/images.csv \
  --mode csv \
  --root-dir /path/to/images \
  -o predictions.csv \
  --strategy ddp \
  accelerator=gpu \
  devices=<gpus_per_node> \
  num_nodes=<nnodes>
```

## Predict A Large Tile

For large geospatial rasters, use `predict_tile(..., dataloader_strategy="window")` instead of the simple CLI tile mode.

```python
from deepforest.main import deepforest

m = deepforest()
m.load_model("weecology/everglades-bird-species-detector")
m.config.accelerator = "gpu"
m.config.devices = 2
m.config.num_nodes = 2
m.config.strategy = "ddp"
m.config.workers = 0
m.create_trainer()

results = m.predict_tile(
    path="/path/to/tile.tif",
    patch_size=1500,
    patch_overlap=0,
    dataloader_strategy="window",
)
```

Launch that script with the same `srun` Slurm pattern and trainer settings. For a complete cluster example, see the [cluster developer guide](../development/cluster.md).
1 change: 1 addition & 0 deletions docs/user_guide/index.md
@@ -18,6 +18,7 @@ The User Guide covers the core DeepForest package usage and functionalities.
05_model_architecture
06_multi_species
07_scaling
+distributed
08_visualizations
09_configuration_file
10_better
3 changes: 2 additions & 1 deletion pyproject.toml
@@ -155,7 +155,8 @@ filterwarnings = [
]
markers = [
"slow: marks tests that are slow to run",
-"integration: marks integration tests"
+"integration: marks integration tests",
+"cluster: marks tests intended for Slurm cluster runs only"
]

[tool.coverage.run]
44 changes: 44 additions & 0 deletions run_cluster_multinode_smoke.sh
@@ -0,0 +1,44 @@
#!/usr/bin/env bash
# Distributed smoke test: submit with Slurm, launch with srun (Lightning reads the job env).
#
# Example:
# sbatch --nodes=2 --ntasks-per-node=2 --gpus-per-node=2 --cpus-per-task=4 --mem=80G \
# --time=00:30:00 run_cluster_multinode_smoke.sh
set -euo pipefail

REPO_ROOT="${REPO_ROOT:-$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)}"
DATA_ROOT="$REPO_ROOT/src/deepforest/data"
TRAIN_CSV="${TRAIN_CSV:-$DATA_ROOT/OSBS_029.csv}"
TRAIN_ROOT="${TRAIN_ROOT:-$DATA_ROOT}"
VAL_CSV="${VAL_CSV:-$DATA_ROOT/OSBS_029.csv}"
VAL_ROOT="${VAL_ROOT:-$DATA_ROOT}"
LOG_ROOT="${LOG_ROOT:-$REPO_ROOT/lightning_logs_multinode_smoke}"

GPUS_PER_NODE="${GPUS_PER_NODE:-${SLURM_GPUS_ON_NODE:-${SLURM_GPUS_PER_NODE:-1}}}"
NNODES="${NNODES:-${SLURM_NNODES:-1}}"

if [[ -z "${SLURM_JOB_ID:-}" ]]; then
  echo "SLURM_JOB_ID is not set. Submit with sbatch or run inside salloc."
  exit 1
fi

echo "HOSTNAME=$(hostname)"
echo "SLURM_NNODES=$NNODES"
echo "GPUS_PER_NODE=$GPUS_PER_NODE"
echo "SLURM_NTASKS_PER_NODE=${SLURM_NTASKS_PER_NODE:-?}"

cd "$REPO_ROOT"

srun --kill-on-bad-exit=1 uv run deepforest train \
  --disable-checkpoint \
  --strategy ddp \
  train.fast_dev_run=true \
  workers=0 \
  accelerator=gpu \
  devices="$GPUS_PER_NODE" \
  num_nodes="$NNODES" \
  train.csv_file="$TRAIN_CSV" \
  train.root_dir="$TRAIN_ROOT" \
  validation.csv_file="$VAL_CSV" \
  validation.root_dir="$VAL_ROOT" \
  log_root="$LOG_ROOT"