Fix thread leak for LOOPBACK workers in external worker pool#38432
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #38432 +/- ##
============================================
- Coverage 55.76% 55.75% -0.01%
Complexity 2095 2095
============================================
Files 1099 1099
Lines 172277 172288 +11
Branches 1350 1350
============================================
- Hits 96065 96064 -1
- Misses 73817 73829 +12
Partials 2395 2395
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
befc93c to
23dcafa
Compare
|
r: @Abacn |
|
cc'ed @tvalentyn since it is related to the DEADLINE EXCEEDED problem. |
|
Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment |
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a resource leak issue occurring in LOOPBACK mode, where SdkHarness workers running as background daemon threads were not being properly terminated. This accumulation of threads and gRPC channels across test runs led to resource starvation and intermittent DEADLINE_EXCEEDED errors. The changes introduce a mechanism to track these threads and explicitly signal them to shut down, ensuring stable performance during large test suites. Highlights
New Features🧠 You can now enable Memory (public preview) to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request implements a mechanism to track and cleanly shut down thread-based SdkHarness workers in the BeamFnExternalWorkerPoolServicer, preventing resource leaks in LOOPBACK mode. Reviewers identified a potential thread leak if StartWorker is called with an existing ID and suggested a logic fix. Additionally, feedback was provided to improve test stability by replacing sleep calls with synchronization primitives and to move an inline import to the top level for better code style.
|
This LGTM, though additional reviewer for gRPC related logic is preferred |
|
The last two presubmit runs both succeeded without any reruns. |
Abacn
left a comment
There was a problem hiding this comment.
LGTM
since the main code change only touches stop_worker it won't affect pipeline functionality
…38432) * Fix thread leak for LOOPBACK workers in external worker pool. * Fix lints. * Add comments. * Fix lints. * Address review comments.
We have been seeing "DEADLINE EXCEEDED" from time to time in our tests since we switched the default python runner to Prism.
Below is an example traceback:
The above traceback suggests that the Python client tries to retrieve state from Prism server, but it is timeout. We suspect the timeout is due to resource starvation from leaking threads and grpc channels of SdkHarness workers.
LOOPBACKmode (the default environment forPrismRunner), the external worker pool launchesSdkHarnessworkers inside the Python process as background daemon threads (not separate processes).Previously,
StopWorkerexplicitly skipped terminating thread-based workers under the assumption that they would exit automatically. However, becausePrismRunneracts as a process-level shared singleton(Use singleton prism server by default. #36228), its gRPC server remains active across tests and does not immediately close these control/data streams. As a result, background daemon threads and open gRPC channels accumulate sequentially across test runs.pytest, accumulating dozens or hundreds of these threads creates severe GIL contention, memory pressure, and socket exhaustion, eventually stalling pipeline execution and causingDEADLINE_EXCEEDEDtimeouts.In this PR, we add code to (1) track active
SdkHarnessworker threads and (2) pushSentinel.sentineldirectly into its_responsesqueue inStopWorker()to unblock the worker's request iterator and shut down all associated thread pools and grpc channel resources.