HPX with PBS: How to run multiple HPX localities per compute node? #6713
-
@G-071 Could you please give me the output of …?
-
The logfiles are a bit on the longer end, so I sent them to you by email! Please let me know if you need any more information or if I should try running a different configuration. Since I forgot to mention it earlier: I am using the current HPX master and the preinstalled MPICH (version 4.2.3).
-
Hi @G-071, the key changes here are: …
-
@harith-hacky03 Alternatively, you'll get the same warning and a similar error message with the MPI parcelport. I am curious though: where did you get the …?

The solution: in a nutshell, one can get it to work by replacing pbsdsh with mpirun / mpiexec and adding the parameter …. The number of processes per node, the core binding, and the memory binding need to be added manually to the mpirun call in this case (see the sketch below). Here, cpu-bind makes sure the first 3 localities run on the first socket and the other ones on the second. It actually leaves a few cores on each socket unbound, but there appears to be no way around this (since we have 52 cores and 3 GPUs per socket). numactl makes sure we use the HBM memory. The GPU wrapper script looks like this (also sketched below); see the Aurora user guide for the explanations of the parameter choices here.

Caveats: …
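As a minimal sketch of what such an mpiexec call could look like: the node count, core lists, HBM NUMA node IDs, and binary name below are assumptions for illustration, and the additional HPX parameter mentioned above is elided.

```bash
#!/bin/bash
#PBS -l select=2

NNODES=$(wc -l < "$PBS_NODEFILE")
RANKS_PER_NODE=6   # assumed here: one locality per GPU, 3 per socket

# --cpu-bind: pin ranks 0-2 to cores on socket 0 and ranks 3-5 to socket 1
# (52 cores per socket, so a few cores per socket stay unbound).
# numactl --membind: allocate from the HBM NUMA nodes; the IDs below are
# placeholders, check `numactl -H` on a compute node for the real ones.
mpiexec -n $((NNODES * RANKS_PER_NODE)) --ppn ${RANKS_PER_NODE} \
    --cpu-bind=list:0-15:16-31:32-47:52-67:68-83:84-99 \
    numactl --membind=2-3 \
    ./gpu_wrapper.sh ./octotiger
```

And a sketch of the GPU wrapper in the style of the gpu_tile_compact.sh helper from the Aurora user guide (the rank-to-GPU mapping here is illustrative):

```bash
#!/usr/bin/env bash
# gpu_wrapper.sh (sketch): give each local rank its own GPU by setting
# Level Zero's ZE_AFFINITY_MASK from the local rank ID exported by the
# PALS launcher, then exec the real application.
num_gpus=6   # GPUs per Aurora node
gpu_id=$(( PALS_LOCAL_RANKID % num_gpus ))
export ZE_AFFINITY_MASK=${gpu_id}
exec "$@"
```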
Remaining questions before closing this issue: …
-
@G-071 LCI can work with …. You can use …. For Slingshot-11, just set …. Let me know how it goes! I would love to get LCI working on Aurora.
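For anyone trying this: the LCI parcelport is enabled at configure time via HPX's documented CMake options; a minimal configure sketch (directory names are placeholders):

```bash
# Configure HPX with the LCI parcelport; HPX_WITH_FETCH_LCI lets the HPX
# build download and build a matching LCI version by itself.
cmake -S hpx -B build \
    -DCMAKE_BUILD_TYPE=Release \
    -DHPX_WITH_PARCELPORT_LCI=ON \
    -DHPX_WITH_FETCH_LCI=ON
cmake --build build -j
```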
-
Yes, please! @dimitraka can help with that.
-
Hi! Do you have any draft documentation you would like to add? I can help with integrating it into the final docs.
-
I am currently trying to run Octo-Tiger on Aurora, which uses PBS instead of Slurm.
After a bit of trying, I was able to get a basic build going, with both the Intel GPU support and the MPI parcelport working. To run distributed scenarios, I was following the HPX documentation for usage with PBS.
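A minimal sketch of a runscript in that documented style (application path, locality count, and scenario flags are placeholders):

```bash
#!/bin/bash
#PBS -l select=2

# pbsdsh -u starts the command once per allocated node; HPX detects the
# PBS environment (PBS_NODEFILE) and wires the instances up as localities.
APP="$PBS_O_WORKDIR/octotiger --hpx:localities=2"
pbsdsh -u /bin/bash -c "$APP"
```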
However, currently I can only run distributed scenarios if I stick to one HPX locality per compute node.
If I run more localities on a node, Octo-Tiger will simply be executed independently once per locality.
For example, if I use the following line in my PBS runscript to run on two nodes, each with a single HPX locality,

`#PBS -l select=2`

Octo-Tiger works as expected. However, using the same Octo-Tiger/HPX build on a single node with 2 processes via

`#PBS -l select=1:mpiprocs=2`

does not work and just runs two independent instances of Octo-Tiger on the node, each running the complete scenario without communicating.

On Aurora, the recommended setting is to use one process per GPU tile, so to use the machine properly I would need 12 HPX localities per compute node, each using 8 CPU cores and one GPU tile (which actually leaves 4 cores per socket unused, but it still appears to be the recommended setup for the machine).
Is there a way to run multiple HPX localities per compute node with PBS? With Slurm+HPX this is easy (see the sketch below), but I do not have a lot of experience with PBS yet and could not find any information on how to do this in the documentation (I might have just missed it though).
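For comparison, a sketch of the Slurm setup I mean (counts and binary name are placeholders):

```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=12   # 12 HPX localities per node
#SBATCH --cpus-per-task=8      # 8 cores per locality

# HPX reads the Slurm environment, so one locality starts per task.
srun ./octotiger
```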