
Parallelize routing #1292

Merged
jcoupey merged 10 commits into VROOM-Project:master from michael-struwe-mischok:parallelize-routing
Nov 3, 2025

Conversation

@michael-struwe-mischok
Contributor

@michael-struwe-mischok michael-struwe-mischok commented Oct 7, 2025

Issue

#1291

Tasks

  • Update CHANGELOG.md (remove if irrelevant)
  • review

@jcoupey
Collaborator

jcoupey commented Oct 8, 2025

Thanks for submitting a PR! This fixes #1218.

@jcoupey jcoupey added this to the v1.15.0 milestone Oct 8, 2025
@jcoupey
Collaborator

jcoupey commented Oct 15, 2025

I've made a couple of nitpicking adjustments so that the parallelization code looks similar to other places in the codebase, and I've been testing this in a relevant setup: a remote osrm-routed server with poor bandwidth. The result is as expected: for an instance with 40 routes, routing time (mostly network time) drops from ~17s to ~4s.

The problem arises if we stretch the test to many more routes: for an instance with 400 routes, the parallelization somehow throttled the OSRM server and I ended up with [Error] Failed to connect to XX.XX.XX.XX:5000. This is especially frustrating as it can happen after a long search and spoil everything at the routing stage.

We should probably limit parallelization in a configurable way, maybe re-using the -t value from options.

Note: the same limitation theoretically applies to the parallelized matrix computations, but the number of parallel requests is much lower as it's bounded by the number of profiles in use.

@michael-struwe-mischok
Contributor Author

michael-struwe-mischok commented Oct 28, 2025

Added a limit to the parallelization using a semaphore. This starts all the threads, but has all except nb_thread of them wait. I think it shouldn't be a big problem to start too many idle threads since at this point the bottleneck is I/O, not CPU/RAM usage.
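
For readers skimming the diff, here's a minimal sketch of that gating pattern, assuming C++20; fetch_route and the function shape are hypothetical stand-ins, not the actual VROOM code:

```cpp
#include <chrono>
#include <cstddef>
#include <semaphore>
#include <thread>
#include <vector>

// Hypothetical stand-in for one blocking routing request to OSRM.
void fetch_route(std::size_t /*route_index*/) {
  std::this_thread::sleep_for(std::chrono::milliseconds(50));
}

// One thread per route is started up front, but at most nb_thread of
// them are past the semaphore at any given time, so at most nb_thread
// requests are in flight.
void route_all(std::size_t nb_routes, unsigned nb_thread) {
  std::counting_semaphore<128> semaphore(nb_thread);

  std::vector<std::thread> threads;
  threads.reserve(nb_routes);
  for (std::size_t i = 0; i < nb_routes; ++i) {
    threads.emplace_back([&semaphore, i] {
      semaphore.acquire();  // wait for a free slot
      fetch_route(i);       // network-bound work
      semaphore.release();  // hand the slot to the next waiting thread
    });
  }
  for (auto& t : threads) {
    t.join();
  }
}
```

While blocked on acquire(), the surplus threads only cost their stack memory, which matches the point above about I/O, not CPU/RAM, being the bottleneck.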

Tried this with vroom-docker to confirm it reacts to the threads config option & doesn't break catastrophically :-).

Collaborator

@jcoupey jcoupey left a comment


The use of counting_semaphore looks like a very neat solution, thanks for the update!

Out of curiosity, what is the reason for choosing a template value of 128?

@jcoupey
Collaborator

jcoupey commented Oct 31, 2025

I've been running further tests with a remote OSRM server and low bandwidth. Already having a few threads (4 or 8) makes a huge difference: the overall routing time is more than 6x faster. Then for e.g. 32 and 64 threads, the overall routing time is slightly higher, so it looks like the thread overhead exceeds the benefit.

Maybe that hints toward using a maximum of 32, which would also be in line with the rest of the parallelization strategy. @michael-struwe-mischok do you have any other input? What do you think?

@michael-struwe-mischok
Contributor Author

From how I understand the mentions of LeastMaxValue and max in https://en.cppreference.com/w/cpp/thread/counting_semaphore.html, the template value lets us request a maximum for the semaphore, but the actual maximum delivered by the implementation may be higher. E.g. if we say 128 and nb_thread is 200, the implementation only guarantees support for a count of at least 128, so starting the internal counter at 200 is only safe if the implementation's actual maximum happens to allow it. So it's a bit like allocating enough memory for what we need.
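
A tiny sketch of that reading of counting_semaphore (values are illustrative only):

```cpp
#include <iostream>
#include <semaphore>

int main() {
  // LeastMaxValue = 128: the implementation must support a count of at
  // least 128, and max() reports the (possibly larger) actual maximum.
  using Gate = std::counting_semaphore<128>;
  std::cout << "actual maximum: " << Gate::max() << '\n';

  // Initializing the counter above Gate::max() is not allowed, which
  // is what motivates clamping the initial count further down.
  Gate gate(128);
  gate.acquire();
  gate.release();
}
```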

The 128 was just a guess at how many parallel requests might be a good idea. We could also choose a higher maximum, e.g. 512: someone might configure their OSRM to be powerful enough for that many requests, or their network conditions may make it appropriate. (I assume the overhead of a too-high maximum is pretty low.)

Maybe we shouldn't use nb_thread for the number of parallel requests and should configure it separately instead. nb_thread is more about how much CPU is available to VROOM, while this is about how many requests to OSRM should be in flight at once. The default could be something like nb_thread * 4.

Example scenario:

  • Someone runs VROOM and OSRM on two machines that both have 4 cores
  • They configure both VROOM and OSRM to use 4 threads
  • Now if VROOM sends nb_thread requests in parallel, it sends a burst of 4 requests, waits for all of them to come back at once, then sends the next burst of 4
  • Instead, if VROOM sends e.g. 32 requests in parallel but OSRM can only actually handle 4 at a time, OSRM receives 32 requests and returns a burst of 4 responses, at which point the next 4 requests are already waiting. While those 4 responses are flying over the network, OSRM still has the other 28 requests to keep its CPU busy
  • The optimal value is the number of parallel requests that is just high enough to fully use the OSRM CPU without overwhelming it. Taking the network as a constant we cannot change, the bottleneck to exploit is keeping the OSRM CPU busy with our requests: total time = network-towards-OSRM + OSRM-CPU-for-all-the-requests + network-back-to-VROOM (see the worked numbers below)
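
To put made-up numbers on that (purely illustrative, not from the tests above): assume a 100 ms round trip, 25 ms of OSRM CPU per request, 4 OSRM threads and 32 routes. Bursts of 4 cost roughly 8 * (100 ms + 25 ms) = 1000 ms, since OSRM idles during every round trip. With all 32 requests in flight, the total is roughly one round trip plus the CPU time, 100 ms + 32 * 25 ms / 4 = 300 ms, because fresh work is already queued while responses travel back.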

However, in an ideal setup you should probably run OSRM and VROOM on the same machine to minimize network time. In that case, nb_thread should already be a pretty good value.

@jcoupey
Collaborator

jcoupey commented Oct 31, 2025

You're right that this should ideally be configured separately, but it's very routing-specific, so I don't really feel like having another dedicated parameter for it.

As your example shows, this is highly dependent on the routing deployment (not only osrm-routed but potential load balancing on top of it, not to mention other routing engines), so it's hard for us to make a generic guess. What we want to avoid is regressions where users suddenly start hitting errors because we changed the routing process internally, so we have to be careful and conservative about the defaults.

Using std::counting_semaphore<32> semaphore(nb_thread); as a default should be safe, while transparently providing a speedup. If someone uses -t 4, routing might indeed be less efficient than with 32 threads, as your example points out, but that's still a huge improvement over the previous behavior.

On the other hand, users whose routing setup allows higher request rates can increase the default value (we could even make the 128 or 32 a constexpr variable configurable somewhere). @michael-struwe-mischok what do you think?

@michael-struwe-mischok
Contributor Author

👍 In general I don't want to block this on details, so feel free to just do something that seems appropriate :-)

I think std::counting_semaphore<32> semaphore(nb_thread); is fine, just two nitpicks:

  • If the 32 is there as a maximum to avoid breakage in some situations, it would be better to do something like std::counting_semaphore<32> semaphore(min(32, nb_thread)); so that the maximum actually applies, because the implementation of counting_semaphore may use a higher maximum than the template value
  • I think we could use something like nb_thread * 2 here, which should be a bit faster while still being careful

@jcoupey
Collaborator

jcoupey commented Nov 3, 2025

Good point about taking the min. I've introduced a constexpr unsigned MAX_ROUTING_THREADS set to 32 so it is straightforward to change the behavior for anyone who dives a bit into the code.

I kept the nb_thread value (rather than e.g. nb_thread * 2): as long as a single parameter controls all parallelization, this feels more consistent. I know the number of threads used for routing requests does not really make sense in terms of CPU usage if you have a remote routing server, but it does if the routing server is on the same machine, and that's somewhat expected in a basic setup since we look for a local OSRM instance by default.
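
For context, a sketch of the merged shape; MAX_ROUTING_THREADS and the min-clamp come from this thread, everything else is a hypothetical simplification rather than the actual VROOM code:

```cpp
#include <algorithm>
#include <semaphore>

// Caps concurrent routing requests in a single, easy-to-find spot.
constexpr unsigned MAX_ROUTING_THREADS = 32;

// nb_thread is the value from the -t option.
void route_all(unsigned nb_thread) {
  // Clamp the initial count so the cap holds even when the
  // implementation's actual semaphore maximum exceeds the template
  // value, per the nitpick above.
  std::counting_semaphore<MAX_ROUTING_THREADS> semaphore(
      std::min(MAX_ROUTING_THREADS, nb_thread));

  // ... one thread per route, each wrapping its request in
  // acquire()/release(), as in the earlier sketch ...
}
```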

@jcoupey
Collaborator

jcoupey commented Nov 3, 2025

For the record, I've also run some quick tests using libosrm and noticed the same magnitude of routing time reduction.

@jcoupey jcoupey merged commit a91f5c9 into VROOM-Project:master Nov 3, 2025
4 checks passed