
fix: drop stale ReadyForQuery expectation when server enters COPY IN mode #6

Open
NikolayS wants to merge 1 commit into main from claude/fix-copy-in-readyforquery-ZpVXL


@NikolayS NikolayS commented Apr 10, 2026

Bug

When a client sends a COPY FROM STDIN statement via extended query protocol (Parse+Bind+Execute+Sync), pgdog adds a ReadyForQuery expectation to its protocol state queue for that Sync. But PostgreSQL ignores Sync during COPY IN mode (protocol spec §55.2.6) and never sends ReadyForQuery for it. The stale entry stays in the queue, done() never returns true, and the connection is never returned to the pool.

tokio-postgres (the most popular Rust PostgreSQL client) uses exactly this pattern — it sends COPY via extended protocol: Parse+Bind+Execute+Sync, then CopyData..., then CopyDone+Sync. PostgreSQL ignores the first Sync (it's already in COPY IN mode by then), producing only one ReadyForQuery instead of the two that pgdog expects.

Exact message sequence that causes desync

# Client → pgdog (handle side):
handle(Parse)     → add('1')                    queue: [ParseComplete]
handle(Bind)      → add('2')                    queue: [ParseComplete, BindComplete]
handle(Execute)   → add(ExecutionCompleted)      queue: [ParseComplete, BindComplete, ExecutionCompleted]
handle(Sync)      → add('Z')                    queue: [ParseComplete, BindComplete, ExecutionCompleted, RFQ]

# PostgreSQL → pgdog (forward side):
forward('1')      → pops ParseComplete           queue: [BindComplete, ExecutionCompleted, RFQ]
forward('2')      → pops BindComplete             queue: [ExecutionCompleted, RFQ]
forward('G')      → 'G'→Copy, pops ExecutionCompleted
                    (not RFQ, no push-back)       queue: [RFQ]
                    prepend('G')→Copy             queue: [Copy, RFQ]  ← STALE

# Client → pgdog (copy data):
handle(CopyDone)  → action('c')→Copy, pops Copy  queue: [RFQ]
handle(Sync)      → add('Z')                     queue: [RFQ, RFQ]

# PostgreSQL → pgdog (one RFQ, not two):
forward('C')      → pops RFQ, but C≠RFQ → push back  queue: [RFQ, RFQ]
forward('Z')      → pops one RFQ                 queue: [RFQ]  ← STALE FOREVER
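
The trace above can be replayed with a small model. This is only a sketch: `Expect`, `StateQueue`, and `replay_copy_cycle` are hypothetical names invented for illustration, and the queue logic is a simplification of what the trace describes, not pgdog's actual code. The `drop_rfq_on_copy_in` flag toggles the behaviour proposed in this PR.

```rust
use std::collections::VecDeque;

/// Responses the proxy still expects from the server (simplified).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Expect {
    ParseComplete,
    BindComplete,
    ExecutionCompleted,
    Copy,
    ReadyForQuery,
}

/// Toy stand-in for the proxy's protocol state queue; not pgdog's real type.
struct StateQueue {
    items: VecDeque<Expect>,
    drop_rfq_on_copy_in: bool,
}

impl StateQueue {
    fn new(drop_rfq_on_copy_in: bool) -> Self {
        Self { items: VecDeque::new(), drop_rfq_on_copy_in }
    }

    /// Pop the front entry only if it matches `want`; otherwise leave it.
    fn pop_if(&mut self, want: Expect) {
        if self.items.front() == Some(&want) {
            self.items.pop_front();
        }
    }

    /// Client -> proxy: record what the server is expected to answer.
    fn handle(&mut self, tag: u8) {
        match tag {
            b'P' => self.items.push_back(Expect::ParseComplete),
            b'B' => self.items.push_back(Expect::BindComplete),
            b'E' => self.items.push_back(Expect::ExecutionCompleted),
            b'S' => self.items.push_back(Expect::ReadyForQuery),
            b'c' => self.pop_if(Expect::Copy), // CopyDone ends the copy
            _ => {}
        }
    }

    /// Server -> proxy: consume the matching expectation.
    fn forward(&mut self, tag: u8) {
        match tag {
            b'1' => self.pop_if(Expect::ParseComplete),
            b'2' => self.pop_if(Expect::BindComplete),
            b'C' => self.pop_if(Expect::ExecutionCompleted),
            b'Z' => self.pop_if(Expect::ReadyForQuery),
            b'G' => {
                // CopyInResponse: the Execute turned out to be a COPY.
                self.pop_if(Expect::ExecutionCompleted);
                self.items.push_front(Expect::Copy);
                if self.drop_rfq_on_copy_in {
                    // The Sync sent with Execute will never be answered.
                    let pos = self.items.iter().position(|e| *e == Expect::ReadyForQuery);
                    if let Some(i) = pos {
                        self.items.remove(i);
                    }
                }
            }
            _ => {}
        }
    }
}

/// Replays the COPY cycle from the trace and returns the leftover expectations.
fn replay_copy_cycle(drop_rfq_on_copy_in: bool) -> Vec<Expect> {
    let mut q = StateQueue::new(drop_rfq_on_copy_in);
    for t in [b'P', b'B', b'E', b'S'] { q.handle(t); }
    for t in [b'1', b'2', b'G'] { q.forward(t); }
    for t in [b'c', b'S'] { q.handle(t); }
    for t in [b'C', b'Z'] { q.forward(t); }
    q.items.into_iter().collect()
}
```

In this model, `replay_copy_cycle(false)` ends with one leftover `ReadyForQuery`, matching the stale entry in the trace, while `replay_copy_cycle(true)` ends with an empty queue.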

Consequences

  • done() never returns true
  • Connection is never returned to the pool (query.rs:282-289)
  • rollback() fails with RollbackFailed
  • Effectively a connection leak per COPY operation

Verified end-to-end

Integration test using tokio-postgres::copy_in() through pgdog (integration/rust/tests/tokio_postgres/copy.rs):

  • WITHOUT fix: FATAL: query timeout at sink.finish() — pgdog's state machine is desynced and can't complete the COPY
  • WITH fix: COPY completes, subsequent SELECT count(*) returns correct results (passes in 0.09s)

Fix

When forward() receives CopyInResponse ('G'), call remove_one_rfq() to drop the ReadyForQuery that will never arrive. This makes the proxy resilient to clients that send Sync with the initial Parse+Bind+Execute for COPY statements.

Note: pgdog already handles COPY via extended protocol in prepared_statements.rs: CopyDone, CopyFail, and CopyData in handle() (lines 180-188), and CopyInResponse ('G') in forward() (line 229). The fix adds one call to the existing 'G' handler.

Tests

  • Unit test test_copy_in_with_client_double_sync: exercises the full sequence through PreparedStatements::forward() (the real code path) and asserts clean state after the COPY cycle completes
  • Integration test test_copy_in_extended_protocol: end-to-end test using tokio-postgres's Client::copy_in() through pgdog, verifying both COPY completion and subsequent query success

https://claude.ai/code/session_01PQvrbw2xJHgQBXtASWHFcv

…mode

When a client sends Bind+Execute+Sync for a COPY FROM STDIN statement,
pgdog adds a ReadyForQuery expectation for that Sync.  But PostgreSQL
ignores Sync during COPY IN mode (protocol spec §55.2.6) and never sends
ReadyForQuery for it.  The stale entry stays in the queue, done() never
returns true, and the connection is never returned to the pool.

Call remove_one_rfq() in forward() when we see CopyInResponse ('G') to
drop the ReadyForQuery that will never arrive.

Verified with end-to-end integration test using tokio-postgres copy_in():
- WITHOUT fix: query timeout - CopyDone hangs because state machine is desynced
- WITH fix: COPY completes, subsequent queries work normally

https://claude.ai/code/session_01PQvrbw2xJHgQBXtASWHFcv
NikolayS force-pushed the claude/fix-copy-in-readyforquery-ZpVXL branch from 98dfd00 to dae4519 on April 10, 2026 04:26

End-to-end demo: step by step

The test

integration/rust/tests/tokio_postgres/copy.rs — connects to pgdog with tokio-postgres and does a simple COPY FROM STDIN:

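use bytes::{BufMut, BytesMut};
use futures_util::SinkExt;
use tokio_postgres::NoTls;

// Connect to pgdog (port 6432), not directly to PostgreSQL
let (conn, connection) = tokio_postgres::connect(
    "host=127.0.0.1 user=pgdog dbname=pgdog password=pgdog port=6432",
    NoTls,
).await.unwrap();

// Drive the connection in the background; tokio-postgres makes no progress otherwise
tokio::spawn(async move {
    connection.await.unwrap();
});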

// Create table
conn.batch_execute(
    "DROP TABLE IF EXISTS _copy_test;
     CREATE TABLE _copy_test (id BIGINT, value TEXT);",
).await.unwrap();

// COPY FROM STDIN — tokio-postgres sends this via extended protocol
// (Parse, Bind, Execute, Sync), not simple query (Q)
let sink = conn
    .copy_in("COPY _copy_test (id, value) FROM STDIN")
    .await.unwrap();

// Send 10 rows
let mut buf = BytesMut::new();
for i in 0..10_i64 {
    buf.put_slice(format!("{}\trow_{}\n", i, i).as_bytes());
}
futures_util::pin_mut!(sink);
sink.send(buf.freeze()).await.unwrap();
let rows_copied = sink.finish().await.unwrap();  // ← fails WITHOUT fix
assert_eq!(rows_copied, 10);

// Query after COPY — proves the connection is still usable
let rows = conn.query("SELECT count(*) FROM _copy_test", &[]).await.unwrap();
let count: i64 = rows[0].get(0);
assert_eq!(count, 10);

What tokio-postgres sends on the wire

When copy_in() is called, tokio-postgres uses extended protocol, not simple query:

Client → Server:  Parse, Bind, Execute, Sync     ← first Sync
Server → Client:  ParseComplete, BindComplete, CopyInResponse
Client → Server:  CopyData, CopyData, ...
Client → Server:  CopyDone, Sync                 ← second Sync
Server → Client:  CommandComplete, ReadyForQuery  ← only ONE ReadyForQuery

PostgreSQL enters COPY IN mode after sending CopyInResponse. From that point it accepts only CopyData, CopyDone, and CopyFail. It silently ignores Flush and Sync, which includes the first Sync, and any other message type aborts the COPY with an error. So it sends only one ReadyForQuery (for the second Sync, the one that follows CopyDone).
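
The rule above can be expressed as a small counting function. This is a rough sketch of the server side only: `Frontend` and `count_ready_for_query` are hypothetical names, the `starts_copy_in` flag stands in for "this Execute runs a COPY FROM STDIN", and error paths (CopyFail handling, non-copy messages aborting the COPY) are deliberately not modelled.

```rust
/// Frontend (client -> server) messages relevant to this trace.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Frontend {
    Parse,
    Bind,
    /// `starts_copy_in` is an assumption of this sketch: true when the
    /// executed statement is COPY ... FROM STDIN.
    Execute { starts_copy_in: bool },
    Sync,
    Flush,
    CopyData,
    CopyDone,
}

/// Counts how many ReadyForQuery messages the server would send for `msgs`,
/// assuming no errors occur.
fn count_ready_for_query(msgs: &[Frontend]) -> usize {
    let mut in_copy_in = false;
    let mut ready_for_query = 0;
    for m in msgs {
        match (in_copy_in, *m) {
            // Execute of a COPY FROM STDIN switches the server into COPY IN mode.
            (false, Frontend::Execute { starts_copy_in: true }) => in_copy_in = true,
            // Sync and Flush are ignored while in COPY IN mode.
            (true, Frontend::Sync) | (true, Frontend::Flush) => {}
            // CopyDone ends COPY IN mode.
            (true, Frontend::CopyDone) => in_copy_in = false,
            // Outside COPY IN mode, every Sync is answered with ReadyForQuery.
            (false, Frontend::Sync) => ready_for_query += 1,
            _ => {}
        }
    }
    ready_for_query
}
```

Feeding it the tokio-postgres sequence (Parse, Bind, Execute, Sync, CopyData, CopyData, CopyDone, Sync) yields 1 even though the client sent two Sync messages, which is the mismatch pgdog's queue runs into.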

What happens inside pgdog WITHOUT the fix

pgdog's handle() adds a ReadyForQuery expectation for every Sync it sees:

handle(Parse)   → queue: [ParseComplete]
handle(Bind)    → queue: [ParseComplete, BindComplete]
handle(Execute) → queue: [ParseComplete, BindComplete, ExecutionCompleted]
handle(Sync)    → queue: [ParseComplete, BindComplete, ExecutionCompleted, ReadyForQuery]

Then forward() processes server responses:

forward(ParseComplete)    → queue: [BindComplete, ExecutionCompleted, ReadyForQuery]
forward(BindComplete)     → queue: [ExecutionCompleted, ReadyForQuery]
forward(CopyInResponse)   → queue: [Copy, ReadyForQuery]  ← ReadyForQuery is STALE

That stale ReadyForQuery will never be satisfied because PostgreSQL already ignored the first Sync.

Result WITHOUT fix

$ cargo test -p rust --test mod tokio_postgres::copy::test_copy_in_extended_protocol

FAILED — FATAL: query timeout
(pgdog waits for a ReadyForQuery that never arrives, hits the 2s query_timeout)

test result: FAILED. 0 passed; 1 failed; finished in 6.25s

The fix (one line)

In prepared_statements.rs, when forward() sees CopyInResponse ('G'), drop the stale ReadyForQuery:

'G' => {
    self.state.prepend('G');
    self.state.remove_one_rfq();  // ← the fix
}

Result WITH fix

$ cargo test -p rust --test mod tokio_postgres::copy::test_copy_in_extended_protocol

test tokio_postgres::copy::test_copy_in_extended_protocol ... ok

test result: ok. 1 passed; 0 failed; finished in 0.10s

COPY completes, subsequent SELECT returns 10 rows, connection is clean.


Generated by Claude Code


Driver investigation: which PostgreSQL clients trigger this bug?

We tested four popular drivers against pgdog without the fix, running the same pattern: COPY FROM STDIN, then a subsequent SELECT query.

Results

| Driver | Language | COPY protocol | Triggers bug? | How verified |
|---|---|---|---|---|
| tokio-postgres | Rust | Extended (Parse+Bind+Execute+Sync) | YES (FATAL: query timeout) | Integration test through pgdog |
| psycopg3 | Python | Simple query (send_query) | No (passes) | Tested + confirmed via source |
| node-postgres (pg-copy-streams) | Node.js | Simple query | No (passes) | Tested end-to-end |
| ruby-pg | Ruby | Simple query (PQsendQuery) | No (passes) | Tested + confirmed via source |

Of the four drivers tested here, only tokio-postgres triggers the bug. The other three use the simple query protocol (Q message) for COPY, which doesn't produce the stale ReadyForQuery.

Why only tokio-postgres?

tokio-postgres's copy_in() calls into_statement() → Parse, then query::encode() → Bind+Execute+Sync. This is the only driver we found that wraps COPY in extended protocol.

All other drivers use simple query:

  • psycopg3: despite using extended protocol for regular queries, its _execute_send() for COPY falls through to send_query() (no params, no force_extended, binary=False)
  • ruby-pg: copy_data uses exec() which calls PQsendQuery
  • node-postgres: pg-copy-streams uses query() which sends simple query

Drivers not yet tested

  • JDBC (Java) — likely simple query, worth verifying
  • pgx (Go) — may use extended protocol, worth investigating
  • libpq direct — uses simple query (PQexec)

The key question for any driver

Does it send Sync before the COPY data flow begins? If yes, PostgreSQL ignores that Sync during COPY IN mode, and pgdog's state machine will have a stale ReadyForQuery expectation that never gets fulfilled.
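
One way to apply this question to a captured trace is sketched below. It assumes you already have the sequence of frontend message type bytes for one statement (how you capture them is up to you), and `sends_sync_before_copy_data` is a hypothetical helper, not part of pgdog or any driver. The tag bytes are the standard protocol ones: 'P' Parse, 'B' Bind, 'E' Execute, 'S' Sync, 'Q' Query, 'd' CopyData, 'c' CopyDone, 'f' CopyFail.

```rust
/// Returns true if a Sync follows the most recent Execute before the first
/// COPY sub-protocol message (CopyData, CopyDone, or CopyFail) appears.
/// Returns false if the trace contains no COPY messages at all.
fn sends_sync_before_copy_data(frontend_tags: &[u8]) -> bool {
    let mut seen_execute = false;
    let mut sync_after_execute = false;
    for &tag in frontend_tags {
        match tag {
            // A simple Query starts a fresh statement; forget earlier state.
            b'Q' => {
                seen_execute = false;
                sync_after_execute = false;
            }
            // A new Execute resets the Sync flag for this statement.
            b'E' => {
                seen_execute = true;
                sync_after_execute = false;
            }
            b'S' if seen_execute => sync_after_execute = true,
            // First COPY message: answer the question for this statement.
            b'd' | b'c' | b'f' => return sync_after_execute,
            _ => {}
        }
    }
    false
}
```

For the tokio-postgres pattern, `sends_sync_before_copy_data(b"PBESddcS")` is true. For a simple-query COPY such as `b"Qddc"` it is false.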

Lev's upstream fix

Lev independently confirmed the bug and created pgdogdev#886. His approach is more defensive — instead of removing one ReadyForQuery, he uses retain() to remove all ReadyForQuery entries when Copy is seen in extended mode, and also fixes the extended flag tracking.
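
The difference between the two approaches can be illustrated on a plain `VecDeque`. This is a sketch under assumptions: `Expect`, `remove_one_rfq`, and `remove_all_rfq` here are illustrative stand-ins, not the code from this PR or from pgdogdev#886, and the extended-flag tracking mentioned above is not modelled.

```rust
use std::collections::VecDeque;

/// Simplified expectation entries; only the two relevant here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Expect {
    Copy,
    ReadyForQuery,
}

/// Approach in this PR (sketch): drop only the first pending ReadyForQuery.
fn remove_one_rfq(queue: &mut VecDeque<Expect>) {
    if let Some(i) = queue.iter().position(|e| *e == Expect::ReadyForQuery) {
        queue.remove(i);
    }
}

/// More defensive variant (sketch): drop every pending ReadyForQuery.
fn remove_all_rfq(queue: &mut VecDeque<Expect>) {
    queue.retain(|e| *e != Expect::ReadyForQuery);
}
```

The two behave identically when exactly one ReadyForQuery is pending. They differ only if more than one is queued when CopyInResponse arrives, for example if a client were to send several Sync messages before the server replied: the first variant leaves the extra entries in place, the second clears them all.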


Generated by Claude Code


Comprehensive driver testing: which clients trigger the COPY desync bug?

We tested/researched PostgreSQL drivers across all major languages. The question for each: does it send Sync as part of extended protocol before the COPY data flow begins?

Verified end-to-end against pgdog WITHOUT the fix

| Driver | Language | Direct PG | Through pgdog | Error |
|---|---|---|---|---|
| tokio-postgres | Rust | PASS | FAIL | FATAL: query timeout |
| postgres.js (porsager) | TypeScript/JS | PASS | FAIL | CONNECTION_CLOSED |
| pg8000 | Python | PASS | FAIL | portal "" does not exist |
| psycopg3 | Python | PASS | PASS | |
| node-postgres (pg) | Node.js | PASS | PASS | |
| ruby-pg | Ruby | PASS | PASS | |

All three extended-protocol drivers fail through pgdog, each with a different symptom of the same root cause (stale ReadyForQuery). All simple-query drivers pass fine.

Full driver survey (tested + researched)

Triggers the bug (Extended protocol with Sync for COPY):

| Driver | Language | Verified | Error |
|---|---|---|---|
| tokio-postgres | Rust | Tested | query timeout |
| postgres.js | TypeScript/JS | Tested | CONNECTION_CLOSED |
| pg8000 | Python | Tested | portal "" does not exist |
| Postgrex | Elixir | Source confirmed (Parse+Bind+Execute+Sync) | Not tested (no Elixir runtime) |
| PostgresNIO | Swift | Source confirmed (sendParseDescribeBindExecuteSync) | Not tested (no Swift runtime) |
| Diesel | Rust | Source confirmed (via libpq PQsendQueryParams) | Not tested |

Does NOT trigger the bug (Simple query for COPY):

| Driver | Language | Verified |
|---|---|---|
| psycopg2 | Python | Source confirmed |
| psycopg3 | Python | Tested + source confirmed |
| asyncpg | Python | Source confirmed (b'Q' message) |
| node-postgres (pg) | Node.js | Tested |
| pgx | Go | Source confirmed |
| lib/pq | Go | Source confirmed |
| go-pg | Go | Source confirmed |
| pgjdbc | Java | Source confirmed |
| Npgsql | .NET/C# | Source confirmed |
| ruby-pg | Ruby | Tested + source confirmed |
| sqlx | Rust | Source confirmed |
| r2dbc-postgresql | Kotlin/Java | Source confirmed |
| php-pgsql | PHP | Source confirmed (via libpq) |
| libpq | C | Simple query reference impl |

ORMs that depend on underlying driver: Drizzle (postgres.js OR pg), Kysely (pg default), Slonik (pg), SQLAlchemy (psycopg2/psycopg3/asyncpg/pg8000), TypeORM (pg), MikroORM (configurable).

Impact

6 drivers across Rust, TypeScript/JavaScript, Python, Elixir, and Swift use extended protocol for COPY and would trigger this bug. This includes the primary/default drivers for:

  • Elixir/Phoenix (Postgrex, the de facto standard Elixir PostgreSQL driver)
  • Swift/Vapor (PostgresNIO, the driver Vapor's PostgreSQL support builds on)
  • TypeScript when using postgres.js (usable with Drizzle and Kysely)

This is not a niche tokio-postgres-only issue.


Generated by Claude Code
