Add TCP support for Python Executor and local debugging guide#1906
Add TCP support for Python Executor and local debugging guide#1906renuka-fernando merged 9 commits intomainfrom
Conversation
Introduce a configurable Python executor (UDS or TCP) and wire it into the policy engine. Key changes: - Add [python_executor] section to config-template.toml (server mode/host/port and timeout). - Add PythonExecutorConfig types, defaults and validation in policy engine config. - Initialize pythonbridge from configuration in policy-engine main and use pythonbridge.IsAvailable when starting admin server. - Refactor pythonbridge StreamManager to use address + isTCP, add Init(cfg) to configure the global manager, support TCP vs UDS dialing, expose IsAvailable, and centralize timeout handling. Update tests accordingly. - Update Python executor server to accept a listen address (supports TCP or UDS), cleanup stale UDS files, and improve logging. - Update Python executor main to read TOML config (tomli/tomllib), add --listen flag and config-aware defaults; add tomli to requirements for Python <3.11. - Prevent overriding the executor listen address in docker-entrypoint; adjust docker-compose (comment out admin/metrics port mappings and set GATEWAY_CONTROLLER_HOST=host.docker.internal). - Skip common Python virtualenv/cache/.git directories when copying policy sources in the builder. These changes allow running the Python executor over TCP (useful for containerized setups) or UDS, provide configurable timeouts, improve startup robustness, and keep tests and tooling aligned.
Add a comprehensive PYTHON_DEBUG_GUIDE.md for running and debugging the Python Executor locally (TCP mode, VS Code/pdb tips, steps to build, run, and cleanup). Change pythonbridge factory to resolve the global StreamManager at GetPolicy time instead of carrying it on BridgeFactory: remove the StreamManager field, call GetStreamManager() in GetPolicy, and pass that instance to InitPolicy and the created bridge. Update the plugin registry template to stop injecting StreamManager into the generated BridgeFactory. This ensures the singleton StreamManager is configured after pythonbridge.Init(cfg) in main before use.
Add a new VS Code debug configuration for the Python Executor (debugpy) in .vscode/launch.json to launch the executor module with appropriate args and PYTHONPATH. Overhaul gateway/PYTHON_DEBUG_GUIDE.md: clarify the host-process debugging workflow, update the mermaid diagram, add warnings about Go module resolution and Rancher Desktop, instruct using host.docker.internal (or host IP) and specific docker-compose changes (set GATEWAY_CONTROLLER_HOST and comment Policy Engine ports), renumber and clarify steps, provide updated Docker Compose commands (run from gateway/), add cleanup guidance, and update the ports table and assorted wording tweaks.
Remove TOML config parsing and dependency; rely on environment variables for defaults instead. main.py no longer reads a --config TOML file at startup and uses PYTHON_EXECUTOR_LISTEN and PYTHON_POLICY_TIMEOUT (env) as defaults; help text updated to clarify listen can be a UDS path or host:port. requirements.txt drops tomli. VS Code launch.json entries no longer pass --config. PYTHON_DEBUG_GUIDE.md updated with Rancher Desktop (Lima) instructions to export HOST_IP and apply it to the relevant steps, plus adjusted docs to reflect that the executor binds to localhost:9010 when using TCP. These changes simplify startup and remove the toml parsing requirement.
Add support for configuring the Python executor Unix socket path and TCP mode, and wire that through runtime, config validation, tests, docs and the Docker entrypoint. - config: add PythonExecutorServer.Path, validate host when mode=tcp and ensure timeout > 0 - pythonbridge: use a configured socket/path and TCP flag for StreamManager and availability checks (fall back to default path when unset) - entrypoint: export PYTHON_EXECUTOR_SOCKET and start the python executor with --listen using the socket; unset PYTHON_EXECUTOR_LISTEN before launching - tests: add TestValidate_PythonExecutorConfig covering UDS/tcp, host/port validation and timeouts - docs: update PYTHON_DEBUG_GUIDE to clarify controller env var usage and authentication placeholder for API deploy These changes allow switching between UDS and TCP modes and customizing the UDS path reliably across components.
Use net.JoinHostPort with strconv.Itoa for building the host:port address in the pythonbridge client (adds strconv import). Update PYTHON_DEBUG_GUIDE.md by changing fenced blocks to text and removing redundant `cd gateway` lines and minor whitespace cleanup to improve clarity.
Namespace the Python executor config under policy_engine: update configs/config-template.toml, move PythonExecutor into the PolicyEngine struct, adjust defaults, validation messages, and error text accordingly, and update main.go to use cfg.PolicyEngine.PythonExecutor. Update unit tests to reflect the new path. Documentation: delete the standalone PYTHON_DEBUG_GUIDE.md and expand gateway/DEBUG_GUIDE.md with an Option 2B for running the Python Executor locally and related instructions. Also tweak the VS Code Python Executor launch config (name change and console setting).
Add TCP support for Python Executor and local debugging guide
📝 WalkthroughSummaryThis pull request adds TCP support for the Python Policy Executor, enabling it to run in separate containers and communicate with the Policy Engine over network sockets. It also introduces a comprehensive local debugging guide to facilitate development workflows. Key ChangesTransport Configuration
Policy Engine Integration
Configuration & Validation
Code Generation
Python Executor Server
Docker & Startup
Debugging & Development
Plugin Registry
Impact
WalkthroughThis PR enables TCP support for the Python Executor bridge alongside Unix domain sockets. It introduces configuration-driven initialization via a new 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.12.1)level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain modules listed in go.work or their selected dependencies" Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
gateway/gateway-runtime/docker-entrypoint.sh (1)
24-24:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winSync
--py.listenguard and Python Executor startup todocker-entrypoint-debug.sh.The comment on line 24 notes the debug entrypoint must be kept in sync. The new
--py.listenguard (lines 77–80) that rejects listen overrides and theunset PYTHON_EXECUTOR_LISTENstatement (line 188) are not present in the debug file and should be added.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@gateway/gateway-runtime/docker-entrypoint.sh` at line 24, The debug entrypoint is out of sync: add the same --py.listen guard logic from docker-entrypoint.sh (the code that rejects --py.listen overrides) into docker-entrypoint-debug.sh and also add the unset PYTHON_EXECUTOR_LISTEN cleanup (the unset PYTHON_EXECUTOR_LISTEN statement) so both entrypoints behave identically; locate the guard and unset usage in docker-entrypoint.sh and copy the logic into docker-entrypoint-debug.sh near where command-line args are parsed and where PYTHON_EXECUTOR_LISTEN is cleaned up.gateway/gateway-runtime/policy-engine/internal/pythonbridge/client.go (1)
229-236:⚠️ Potential issue | 🟠 Major | ⚡ Quick winApply default timeout before
Connectto bound blocking dial.
Connectruns beforewithDefaultTimeout, sogrpc.WithBlock()can wait much longer thanpythonExecutorTimeoutwhen the caller context has no deadline.Proposed fix
func (sm *StreamManager) Execute(ctx context.Context, req *proto.StreamRequest) (*proto.StreamResponse, error) { + ctx, cancel := withDefaultTimeout(ctx, getTimeout()) + defer cancel() + if !sm.IsConnected() { if err := sm.Connect(ctx); err != nil { return nil, fmt.Errorf("connect Python Executor: %w", err) } } - - ctx, cancel := withDefaultTimeout(ctx, getTimeout()) - defer cancel() if err := ctx.Err(); err != nil { return nil, err }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@gateway/gateway-runtime/policy-engine/internal/pythonbridge/client.go` around lines 229 - 236, Move the withDefaultTimeout(ctx, getTimeout()) call to occur before checking/using sm.IsConnected() and sm.Connect so the dial is bounded by the default timeout; specifically, obtain the new ctx and cancel via withDefaultTimeout(ctx, getTimeout()) (and defer cancel) prior to calling sm.IsConnected() and then call sm.Connect(ctx) when needed, ensuring you pass the timeout-wrapped ctx into sm.Connect rather than the original caller ctx.
🧹 Nitpick comments (2)
gateway/gateway-runtime/policy-engine/cmd/policy-engine/main.go (1)
301-336: ⚡ Quick winClose the
StreamManagerduring graceful shutdown.The shutdown sequence stops every long-lived component (admin, metrics, xDS, ALS, gRPC server) except the Python bridge. The
StreamManagerholds an open gRPC client connection; closing it explicitly allows the Python executor to observe a clean disconnect rather than a TCP RST or OS-level teardown.♻️ Suggested addition after `grpcServer.GracefulStop()`
grpcServer.GracefulStop() + +// Close the Python executor bridge connection (if initialized) +if sm := pythonbridge.GetStreamManager(); sm != nil { + if err := sm.Close(); err != nil { + slog.WarnContext(ctx, "Error closing Python bridge", "error", err) + } +}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@gateway/gateway-runtime/policy-engine/cmd/policy-engine/main.go` around lines 301 - 336, The shutdown sequence omits closing the StreamManager, leaving the Python bridge’s gRPC client connection unclosed; after grpcServer.GracefulStop() call, call the StreamManager.Close() (or appropriate Close/Shutdown method on the StreamManager instance) and wait if there is a blocking Wait/Join method (e.g., StreamManager.Close() and StreamManager.Wait()) so the Python executor sees a clean disconnect; locate the StreamManager variable/struct used by the server and ensure you handle nil checks and log any close errors via slog before proceeding.gateway/gateway-runtime/policy-engine/internal/pythonbridge/client.go (1)
513-528: ⚡ Quick winAvoid silent no-op when
Initis called after singleton creation.If
GetStreamManager()initializes first, a laterInit()updates config globals but cannot reconfigure the existing singleton. Add an explicit guard/log so this state is visible.Proposed guard
func Init(cfg config.PythonExecutorConfig) { pythonExecutorTimeout = cfg.Timeout @@ configuredAddress = address configuredIsTCP = isTCP streamManagerOnce.Do(func() { globalStreamManager = NewStreamManager(configuredAddress, configuredIsTCP) }) + if globalStreamManager != nil && + (globalStreamManager.address != configuredAddress || globalStreamManager.isTCP != configuredIsTCP) { + slog.Warn("Python executor bridge already initialized; ignoring reconfiguration", + "existing_address", globalStreamManager.address, + "existing_tcp", globalStreamManager.isTCP, + "requested_address", configuredAddress, + "requested_tcp", configuredIsTCP, + ) + } slog.Info("Python executor bridge initialized",🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@gateway/gateway-runtime/policy-engine/internal/pythonbridge/client.go` around lines 513 - 528, The Init function currently updates configuredAddress/configuredIsTCP but does nothing visible if the singleton was already created by GetStreamManager; add an explicit guard in Init that checks whether streamManagerOnce has already executed (i.e., whether globalStreamManager != nil) and log or return an error indicating Init was called after the StreamManager was created; specifically modify Init to detect globalStreamManager != nil (or use an atomic/boolean sentinel alongside streamManagerOnce), and emit a clear slog.Warn/Info message referencing configuredAddress and configuredIsTCP so callers can see that reconfiguration was a no-op for the existing StreamManager.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@gateway/DEBUG_GUIDE.md`:
- Around line 245-249: The markdownlint MD028 error is caused by an empty line
breaking the adjacent blockquote callouts; edit the DEBUG_GUIDE.md block
containing the [!NOTE] and [!WARNING] callouts so there is no blank line between
them (ensure each callout line begins with ">" so the NOTE and WARNING remain
part of the same blockquote sequence), e.g., remove the blank line between the
"> **When to use this:**..." and "> [!WARNING]" lines to restore contiguous
quoting.
In `@gateway/gateway-runtime/policy-engine/internal/pythonbridge/factory.go`:
- Around line 77-78: GetStreamManager() can return nil and calling
sm.InitPolicy(...) panics; add a nil guard after sm := GetStreamManager() to
handle the uninitialized case (e.g., if sm == nil return a descriptive error via
the same return path as InitPolicy uses or construct an appropriate error
response) so InitPolicy is only invoked on a non-nil *StreamManager; reference
GetStreamManager, the local variable sm, and the method InitPolicy when
implementing the guard and error return.
---
Outside diff comments:
In `@gateway/gateway-runtime/docker-entrypoint.sh`:
- Line 24: The debug entrypoint is out of sync: add the same --py.listen guard
logic from docker-entrypoint.sh (the code that rejects --py.listen overrides)
into docker-entrypoint-debug.sh and also add the unset PYTHON_EXECUTOR_LISTEN
cleanup (the unset PYTHON_EXECUTOR_LISTEN statement) so both entrypoints behave
identically; locate the guard and unset usage in docker-entrypoint.sh and copy
the logic into docker-entrypoint-debug.sh near where command-line args are
parsed and where PYTHON_EXECUTOR_LISTEN is cleaned up.
In `@gateway/gateway-runtime/policy-engine/internal/pythonbridge/client.go`:
- Around line 229-236: Move the withDefaultTimeout(ctx, getTimeout()) call to
occur before checking/using sm.IsConnected() and sm.Connect so the dial is
bounded by the default timeout; specifically, obtain the new ctx and cancel via
withDefaultTimeout(ctx, getTimeout()) (and defer cancel) prior to calling
sm.IsConnected() and then call sm.Connect(ctx) when needed, ensuring you pass
the timeout-wrapped ctx into sm.Connect rather than the original caller ctx.
---
Nitpick comments:
In `@gateway/gateway-runtime/policy-engine/cmd/policy-engine/main.go`:
- Around line 301-336: The shutdown sequence omits closing the StreamManager,
leaving the Python bridge’s gRPC client connection unclosed; after
grpcServer.GracefulStop() call, call the StreamManager.Close() (or appropriate
Close/Shutdown method on the StreamManager instance) and wait if there is a
blocking Wait/Join method (e.g., StreamManager.Close() and StreamManager.Wait())
so the Python executor sees a clean disconnect; locate the StreamManager
variable/struct used by the server and ensure you handle nil checks and log any
close errors via slog before proceeding.
In `@gateway/gateway-runtime/policy-engine/internal/pythonbridge/client.go`:
- Around line 513-528: The Init function currently updates
configuredAddress/configuredIsTCP but does nothing visible if the singleton was
already created by GetStreamManager; add an explicit guard in Init that checks
whether streamManagerOnce has already executed (i.e., whether
globalStreamManager != nil) and log or return an error indicating Init was
called after the StreamManager was created; specifically modify Init to detect
globalStreamManager != nil (or use an atomic/boolean sentinel alongside
streamManagerOnce), and emit a clear slog.Warn/Info message referencing
configuredAddress and configuredIsTCP so callers can see that reconfiguration
was a no-op for the existing StreamManager.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 538d1640-e5e1-4687-8600-11580ee75ef2
📒 Files selected for processing (15)
.vscode/launch.jsongateway/DEBUG_GUIDE.mdgateway/configs/config-template.tomlgateway/gateway-builder/internal/policyengine/generator.gogateway/gateway-builder/templates/plugin_registry.go.tmplgateway/gateway-runtime/docker-entrypoint.shgateway/gateway-runtime/policy-engine/cmd/policy-engine/main.gogateway/gateway-runtime/policy-engine/internal/config/config.gogateway/gateway-runtime/policy-engine/internal/config/config_test.gogateway/gateway-runtime/policy-engine/internal/pythonbridge/bridge_test.gogateway/gateway-runtime/policy-engine/internal/pythonbridge/client.gogateway/gateway-runtime/policy-engine/internal/pythonbridge/client_test.gogateway/gateway-runtime/policy-engine/internal/pythonbridge/factory.gogateway/gateway-runtime/python-executor/executor/server.pygateway/gateway-runtime/python-executor/main.py
💤 Files with no reviewable changes (1)
- gateway/gateway-builder/templates/plugin_registry.go.tmpl
| > [!NOTE] | ||
| > **When to use this:** You are developing or debugging a Python policy and need to set breakpoints, add print statements, or iterate rapidly without rebuilding Docker images. | ||
|
|
||
| > [!WARNING] | ||
| > Processes run directly on the host, so Go resolves modules via `go.work`. Local versions of `sdk` and other workspace modules are used instead of the published Go module versions — including any uncommitted or untagged changes. Behavior may differ from a production build. |
There was a problem hiding this comment.
Fix blockquote spacing between adjacent callouts (MD028).
There is a blank line inside the blockquote sequence between NOTE and WARNING, which triggers markdownlint. Remove the blank line or keep it quoted.
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 247-247: Blank line inside blockquote
(MD028, no-blanks-blockquote)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@gateway/DEBUG_GUIDE.md` around lines 245 - 249, The markdownlint MD028 error
is caused by an empty line breaking the adjacent blockquote callouts; edit the
DEBUG_GUIDE.md block containing the [!NOTE] and [!WARNING] callouts so there is
no blank line between them (ensure each callout line begins with ">" so the NOTE
and WARNING remain part of the same blockquote sequence), e.g., remove the blank
line between the "> **When to use this:**..." and "> [!WARNING]" lines to
restore contiguous quoting.
| sm := GetStreamManager() | ||
| resp, err := sm.InitPolicy(ctx, req) |
There was a problem hiding this comment.
Add nil guard on GetStreamManager() to prevent panic.
GetStreamManager() returns nil if pythonbridge.Init() has not been called. Calling sm.InitPolicy(...) on a nil *StreamManager panics. The comment documents the initialization contract, but does not enforce it — any test or integration path that bypasses main() would crash the process.
🛡️ Proposed fix
sm := GetStreamManager()
+if sm == nil {
+ return nil, fmt.Errorf("python bridge not initialized: call pythonbridge.Init() before GetPolicy()")
+}
resp, err := sm.InitPolicy(ctx, req)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| sm := GetStreamManager() | |
| resp, err := sm.InitPolicy(ctx, req) | |
| sm := GetStreamManager() | |
| if sm == nil { | |
| return nil, fmt.Errorf("python bridge not initialized: call pythonbridge.Init() before GetPolicy()") | |
| } | |
| resp, err := sm.InitPolicy(ctx, req) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@gateway/gateway-runtime/policy-engine/internal/pythonbridge/factory.go`
around lines 77 - 78, GetStreamManager() can return nil and calling
sm.InitPolicy(...) panics; add a nil guard after sm := GetStreamManager() to
handle the uninitialized case (e.g., if sm == nil return a descriptive error via
the same return path as InitPolicy uses or construct an appropriate error
response) so InitPolicy is only invoked on a non-nil *StreamManager; reference
GetStreamManager, the local variable sm, and the method InitPolicy when
implementing the guard and error return.
Purpose
Fix #1707
Fix #1704
Previously, communication between the Go Policy Engine and the Python Executor was hardcoded to use Unix Domain Sockets (UDS). This architectural constraint prevented the Python Executor from running in a separate container (e.g., for future GPU-isolated workloads) and made local development and debugging of Python policies extremely difficult, as developers could not easily attach IDE debuggers (like VS Code) or use tools like
pdbwithout complex container modifications.Goals
Approach
main.pyandserver.pyto accept a flexible--listenargument that automatically detects and binds to either a UDS path or a TCP address (e.g.,localhost:9010).--listen,--workers,--timeout, etc.) and environment variables.client.goto dial dynamically based on the configuration provided inconfig.toml([python_executor.server]).generator.goand the Python registry continue to use major-version keys (e.g.,prompt-compressor:v0) to maintain consistency with the Go Policy Engine's internal registry conventions.PYTHON_DEBUG_GUIDE.mddetailing the "Option 2" style local process debugging setup.Documentation
Created
gateway/PYTHON_DEBUG_GUIDE.mdto document the new local debugging workflow.Security checks
Test environment
Related PR: #1875