
Fix/cluster slot migration flaky test 2693#18

Closed
hanxizh9910 wants to merge 30 commits into unstable from
fix/cluster-slot-migration-flaky-test-2693

Conversation

@hanxizh9910
Owner

No description provided.

madolson and others added 30 commits September 30, 2025 08:25
As requested, here is a version of lolwut for 9 that visualizes a Julia
set with ASCII art.

Example:
```
127.0.0.1:6379> lolwut version 9
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                     .............                              
                                 ......................                         
                              ............................                      
                           ......:::--:::::::::::::::.......                    
                         .....:::=+*@@@=--------=+===--::....                   
                        ....:::-+@@@@&*+=====+%@@@@@@@@=-::....                 
                      .....:::-=+@@@@@%%*++*@@@@@@@@@&*=--::....                
                     .....::--=++#@@@@@@@@##@@@@@@@@@@@@@@=::....               
                    ......:-=@&#&@@@@@@@@@@@@@@@@@@@@@@@@@%-::...               
                   ......::-+@@@@@@@@@@@@@@@@@@&&@@@#%#&@@@-::....              
                  .......::-=+%@@@@@@@@@@@@@@@@#%%*+++++%@+-:.....              
                  .......::-=@&@@@@@@@@@@@@@@@@&*++=====---::.....              
                 .......:::--*@@@@@@@@@@@@@@@@@%++===----::::.....              
                ........::::-=+*%&@@@@@@@@@&&&%*+==----:::::......              
                ........::::--=+@@@@@@@@@@&##%*++==---:::::.......              
                .......:::::---=+#@@@@@@@@&&&#%*+==---:::::.......              
               ........:::::---=++*%%#&&@@@@@@@@@+=---::::........              
               .......:::::----=++*%##&@@@@@@@@@@%+=--::::.......               
               ......::::-----==++#@@@@@@@@@@@@@&%*+=-:::........               
               ......:::---====++*@@@@@@@@@@@@@@@@@@+-:::.......                
               .....::-=++==+++**%@@@@@@@@@@@@@@@@#*=--::.......                
                ....:-%@@%****%###&@@@@@@@@@@@@@@@@&+--:.......                 
                ....:-=@@@@@&@@@@@@@@@@@@@@@@@@@@@@@@=::......                  
                 ...::+@@@@@@@@@@@@@@@&&@@@@@@@@%**@+-::.....                   
                 ....::-=+%#@@@@@@@@@&%%%&@@@@@@*==-:::.....                    
                  ....::--+%@@@@@@@%++==++*#@@@@&=-:::....                      
                   ....:::-*@**@@+==----==*%@@@@+-:::....                       
                     .....:::---::::::::--=+@=--::.....                         
                       .........::::::::::::::.......                           
                         .........................                              
                             ..................                                 
                                    ...                                         
                                                                                
                                                                                
                                                                                
Ascii representation of Julia set with constant 0.41 + 0.29i
Don't forget to have fun! Valkey ver. 255.255.255
```

You can pass in arbitrary rows and columns (it looks best when rows is 2x the number of columns) and an arbitrary Julia constant, so the output is repeatable. Worst case it takes ~100us on my M2 MacBook, which should be fine to make sure it's not taking too many system resources.

---------

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
**Issue History:**
1. The flaky test issue "defrag didn't stop" was originally detected in
February 2025: valkey-io#1746
Solution for 1746: valkey-io#1762
2. Similar issue occurred recently:
https://github.com/valkey-io/valkey/actions/runs/16585350083/job/46909359496#step:5:7640

**Investigation:** 
1. First, the issue occurs specifically in the Active Defrag stream test in cluster mode.
2. After investigating `test_stream` in `memefficiency.tcl`, I found the root cause is in the defrag logic rather than the test itself: there were still failures with the same error even when I tried different parameters for the test.
3. Then I looked at malloc-stats and identified potential defrag issues, particularly in the 80B bin, where utilization only reaches ~75% after defrag instead of the expected near 100%, while other bins show proper defrag behavior. 80B is actually the size of a new stream (confirmed in `t_stream.c`) that we add during the test.
4. For 80B, after adding 200000 streams and fragmenting, `curregs` = 100030; after many defrag cycles, there are still 122 non-full slabs out of 511, with the remaining 446 items not defragged (average 4 per non-full slab).

**Detailed malloc-stats:**
- Total slabs: 511
- Non-full slabs: 122
- Full slabs: 511-122=389
- Theoretical maximum per slab: 256 items
- Allocated items in non-full slabs: 100030-389*256=446
- Average items per non-full slab: 446/122=3.66

**Root Cause:** 
**There are some immovable items which prevent complete defrag.**

**Problems in old defrag logic:**
1. The previous condition (we don't defrag if slab utilization > 'avg utilization' * 1.125) uses a 12.5% threshold that doesn't work well at low utilizations.

- Imagine the 446 items in 122 non-full slabs (avg 3.66 items per non-full slab) are distributed as, e.g., 81 slabs with 5 items each + 41 slabs with 1 item each.
- 12.5% threshold: 3.66*1.125=4.11
- If those 41 single items are immovable, they actually lower the average, so the remaining 81 slabs are above the threshold (5>4.11) and will not be defragged: defrag doesn't stop.

2. The distribution of immovable items across slabs was causing inconsistent defragmentation and flaky test outcomes.

- If those 41 single items are movable, they get moved and the average rises to 5; the 12.5% threshold is then 5*1.125=5.625, so the remaining 81 slabs fall below the threshold (5<5.625) and are defragged: defrag succeeds.
- This explains why we got flaky defrag tests.


**Final solution:**
1. Add one more condition before the old logic in `makeDefragDecision` to trigger defragmentation when a slab is less than 1/8 full (the 1/8 threshold (12.5%) was chosen to align with the existing utilization threshold factor). This ensures no low-utilization slab is left undefragged and stabilizes the defrag behavior.
2. Why we have immovable items, and how to handle them, will be investigated later.
3. Be sure to rebuild Valkey before testing.


**Local test result:**
- Before fix: pass rate 80.8% (63/78)
- After fix:
  - Stream test only: pass rate 100% (200/200)
  - Whole memefficiency.tcl: pass rate 100% (100/100)

Resolves valkey-io#2398, the "defrag didn't stop" issue, with help from @JimB123
@madolson

---------

Signed-off-by: Alina Liu <liusalisa6363@gmail.com>
Signed-off-by: asagegeLiu <liusalisa6363@gmail.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
)

New client flags reported by CLIENT INFO and CLIENT LIST:

* `i` for atomic slot migration importing client
* `E` for atomic slot migration exporting client

New flags in return value of `ValkeyModule_GetContextFlags`:

* `VALKEYMODULE_CTX_FLAGS_SLOT_IMPORT_CLIENT`: Indicates that the client
attached to this context is the slot import client.
* `VALKEYMODULE_CTX_FLAGS_SLOT_EXPORT_CLIENT`: Indicates that the client
attached to this context is the slot export client.

Users could use this to monitor the underlying client info of the slot
migration, and more clearly understand why they see extra clients during
the migration.

Modules can use these to detect keyspace notifications on import clients. I am also adding the export flag for symmetry, although there should not be keyspace notifications on the export client; it could still be visible in command filters or in server events triggered by that client.

---------

Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
The CVE fixes had a formatting and external test issue that wasn't
caught because private branches don't run those CI steps.

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
…ration (valkey-io#2635)

# Problem

In the current slot migration design, replicas are completely unaware of
the slot migration. Because of this, they do not know to hide importing
keys, which results in exposure of these keys to commands like KEYS,
SCAN, RANDOMKEY, and DBSIZE.

# Design

The main part of the design is that we will now listen for and process
the `SYNCSLOTS ESTABLISH` command on the replica. When a `SYNCSLOTS
ESTABLISH` command is received from the primary, we begin tracking a new
slot import in a special `SLOT_IMPORT_OCCURRING_ON_PRIMARY` state.
Replicas use this state to track the import and await a future
`SYNCSLOTS FINISH` message that tells them whether the import succeeded
or failed.

## Success Case

```
     Source                                          Target                         Target Replica
       |                                                |                                 |
       |------------ SYNCSLOTS ESTABLISH -------------->|                                 |
       |                                                |----- SYNCSLOTS ESTABLISH ------>|
       |<-------------------- +OK ----------------------|                                 |
       |                                                |                                 |
       |~~~~~~~~~~~~~~ snapshot as AOF ~~~~~~~~~~~~~~~~>|                                 |
       |                                                |~~~~~~ forward snapshot ~~~~~~~~>|
       |----------- SYNCSLOTS SNAPSHOT-EOF ------------>|                                 |
       |                                                |                                 |
       |<----------- SYNCSLOTS REQUEST-PAUSE -----------|                                 |
       |                                                |                                 |
       |~~~~~~~~~~~~ incremental changes ~~~~~~~~~~~~~~>|                                 |
       |                                                |~~~~~~ forward changes ~~~~~~~~~>|
       |--------------- SYNCSLOTS PAUSED -------------->|                                 |
       |                                                |                                 |
       |<---------- SYNCSLOTS REQUEST-FAILOVER ---------|                                 |
       |                                                |                                 |
       |---------- SYNCSLOTS FAILOVER-GRANTED --------->|                                 |
       |                                                |                                 |
       |                                            (performs takeover &                  |
       |                                             propagates topology)                 |
       |                                                |                                 |
       |                                                |------- SYNCSLOTS FINISH ------->|
 (finds out about topology                              |                                 |
  change & marks migration done)                        |                                 |
       |                                                |                                 |
```

## Failure Case
```
     Source                                          Target                         Target Replica
       |                                                |                                 |
       |------------ SYNCSLOTS ESTABLISH -------------->|                                 |
       |                                                |----- SYNCSLOTS ESTABLISH ------>|
       |<-------------------- +OK ----------------------|                                 |
     ...                                              ...                               ...
       |                                                |                                 |
       |                                             <FAILURE>                            |
       |                                                |                                 |
       |                                      (performs cleanup)                          |
       |                                                | ~~~~~~ UNLINK <key> ... ~~~~~~~>|
       |                                                |                                 |
       |                                                | ------ SYNCSLOTS FINISH ------->|
       |                                                |                                 |
```

## Full Sync, Partial Sync, and RDB

In order to ensure replicas that resync during the import are still
aware of the import, the slot import is serialized to a new
`cluster-slot-imports` aux field. The encoding includes the job name,
the source node name, and the slot ranges being imported. Upon loading
an RDB with the `cluster-slot-imports` aux field, replicas will add a
new migration in the `SLOT_IMPORT_OCCURRING_ON_PRIMARY` state.

It's important to note that a previously saved RDB file can be used as
the basis for partial sync with a primary. Because of this, whenever we
load an RDB file with the `cluster-slot-imports` aux field, even from
disk, we will still add a new migration to track the import. If after
loading the RDB, the Valkey node is a primary, it will cancel the slot
migration. Having this tracking state loaded on primaries will ensure
that replicas partial syncing to a restarted primary still get their
`SYNCSLOTS FINISH` message in the replication stream.

## AOF

Since AOF cannot be used as the basis for a partial sync, we don't
necessarily need to persist the `SYNCSLOTS ESTABLISH` and `FINISH`
commands to the AOF.

However, considering there is work to change this (valkey-io#59, valkey-io#1901), this design doesn't make any assumptions about it.

We will propagate the `ESTABLISH` and `FINISH` commands to the AOF, and
ensure that they can be properly replayed on AOF load to get to the
right state. Similar to RDB, if there are any pending "ESTABLISH"
commands that don't have a "FINISH" afterwards upon becoming primary, we
will make sure to fail those in `verifyClusterConfigWithData`.

Additionally, there was a bug in the existing slot migration where slot
import clients were not having their commands persisted to AOF. This has
been fixed by ensuring we still propagate to AOF even for slot import
clients.

## Promotion & Demotion

Since the primary is solely responsible for cleaning up unowned slots,
primaries that are demoted will not clean up previously active slot
imports. The promoted replica will be responsible for both cleaning up
the slot (`verifyClusterConfigWithData`) and sending a `SYNCSLOTS
FINISH`.

# Other Options Considered

I also considered tracking "dirty" slots rather than using the slot
import state machine. In this setup, primaries and replicas would simply
mark each slot's hashtable in the kvstore as dirty when something is
written to it and we do not currently own that slot.

This approach is simpler, but has a problem in that modules loaded on
the replica would still not get slot migration start/end notifications.
If the modules on the replica do not get such notifications, they will
not be able to properly contain these dirty keys during slot migration
events.

---------

Signed-off-by: Jacob Murphy <jkmurphy@google.com>
For now, introduce this and have it do nothing. Eventually, we can use
this to negotiate supported capabilities on either end. Right now, there
is nothing to send or support, so it just accepts it and doesn't reply.

---------

Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Seeing test failures due to this on the 9.0.0 branch:

```
[exception]: Executing test client: ERR Syntax error. Use: LOLWUT [columns rows] [real imaginary].
ERR Syntax error. Use: LOLWUT [columns rows] [real imaginary]
```

It turns out we were just providing the version as an argument, instead
of specifying which version to run.
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Makes our tests possible to run with TCL 9.

The latest Fedora now has TCL 9.0 and it's working now, including the
TCL TLS package. (This wasn't working earlier due to some packaging
errors for TCL packages in Fedora, which have been fixed now.)

This PR also removes the custom compilation of TCL 8 used in our Daily
jobs and uses the system default TCL version instead. The TCL version
depends on the OS. For the latest Fedora, you get 9.0, for macOS you get
8.5 and for most other OSes you get 8.6.

The checks for TCL 8.7 are removed, because 8.7 doesn't exist. It was
never released.

---------

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
If we don't wait for the replica to resync, the migration may be
cancelled by the time the replica resyncs, resulting in a test failure
when we can't find the migration on the replica.

---------

Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
This change exposes an existing, persistent (in nodes.conf) unique shard
identifier for each shard in the cluster as part of the `CLUSTER SHARDS` command
response.

```
1) 1) "slots"
   2) 1) (integer) 0
      2) (integer) 999
      3) (integer) 2001
      4) (integer) 3999
      5) (integer) 4501
      6) (integer) 5460
   3) "nodes"
   4) 1)  1) "id"
          2) "6e76043bed00e716e85035107866ea16e9a5f700"
          3) "port"
          4) (integer) 6385
          5) "ip"
          6) "127.0.0.1"
          7) "endpoint"
          8) "127.0.0.1"
          9) "role"
         10) "replica"
         11) "replication-offset"
         12) (integer) 8092
         13) "health"
         14) "online"
      2)  1) "id"
          2) "b2f8c841707b2246ec2a641c37f16e88fe0bb700"
          3) "port"
          4) (integer) 6380
          5) "ip"
          6) "127.0.0.1"
          7) "endpoint"
          8) "127.0.0.1"
          9) "role"
         10) "master"
         11) "replication-offset"
         12) (integer) 8092
         13) "health"
         14) "online"
   5) "id"
   6) "3f2a7bb7bbd5fc2a331fe9bf95f5e02bcca02430"
```

---------

Signed-off-by: Satheesha Chattenahalli Hanume Gowda <satheesha@apple.com>
Co-authored-by: Satheesha Chattenahalli Hanume Gowda <satheesha@apple.com>
With valkey-io#1401, we introduced additional filters to the CLIENT LIST/KILL
subcommands. The intended behavior was to pick the last value of each
filter. However, we introduced a memory leak for all the preceding
filter values.

Before this change:
```
> CLIENT LIST IP 127.0.0.1 IP 127.0.0.1
id=4 addr=127.0.0.1:37866 laddr=127.0.0.1:6379 fd=10 name= age=0 idle=0 flags=N capa= db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=21 multi-mem=0 rbs=16384 rbp=16384 obl=0 oll=0 omem=0 tot-mem=16989 events=r cmd=client|list user=default redir=-1 resp=2 lib-name= lib-ver= tot-net-in=49 tot-net-out=0 tot-cmds=0
```
Leak:
```
Direct leak of 11 byte(s) in 1 object(s) allocated from:
    #0 0x7f2901aa557d in malloc (/lib64/libasan.so.4+0xd857d)
    #1 0x76db76 in ztrymalloc_usable_internal /workplace/harkrisp/valkey/src/zmalloc.c:156
    #2 0x76db76 in zmalloc_usable /workplace/harkrisp/valkey/src/zmalloc.c:200
    #3 0x4c4121 in _sdsnewlen.constprop.230 /workplace/harkrisp/valkey/src/sds.c:113
    #4 0x4dc456 in parseClientFiltersOrReply.constprop.63 /workplace/harkrisp/valkey/src/networking.c:4264
    #5 0x4bb9f7 in clientListCommand /workplace/harkrisp/valkey/src/networking.c:4600
    #6 0x641159 in call /workplace/harkrisp/valkey/src/server.c:3772
    #7 0x6431a6 in processCommand /workplace/harkrisp/valkey/src/server.c:4434
    #8 0x4bfa9b in processCommandAndResetClient /workplace/harkrisp/valkey/src/networking.c:3571
    #9 0x4bfa9b in processInputBuffer /workplace/harkrisp/valkey/src/networking.c:3702
    #10 0x4bffa3 in readQueryFromClient /workplace/harkrisp/valkey/src/networking.c:3812
    #11 0x481015 in callHandler /workplace/harkrisp/valkey/src/connhelpers.h:79
    #12 0x481015 in connSocketEventHandler.lto_priv.394 /workplace/harkrisp/valkey/src/socket.c:301
    #13 0x7d3fb3 in aeProcessEvents /workplace/harkrisp/valkey/src/ae.c:486
    #14 0x7d4d44 in aeMain /workplace/harkrisp/valkey/src/ae.c:543
    #15 0x453925 in main /workplace/harkrisp/valkey/src/server.c:7319
    #16 0x7f2900cd7139 in __libc_start_main (/lib64/libc.so.6+0x21139)
```

Note: For the ID / NOT-ID filters we group all the options and perform
filtering, whereas for the remaining filters we only pick the last
filter option.

---------

Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
…lkey-io#2672)

We have relaxed `cluster-ping-interval` and `cluster-node-timeout`
so that the cluster has enough time to stabilize and propagate changes.

Fixes this test occasional failure when running with valgrind:

    [err]: Node #10 should eventually replicate node #5 in tests/unit/cluster/slave-selection.tcl
    #10 didn't became slave of #5

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
* Add cross version compatibility test to run with Valkey 7.2 and 8.0
* Add mechanism in TCL test to skip tests dynamically - valkey-io#2711

---------

Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
Signed-off-by: Harkrishn Patro <bunty.hari@gmail.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
…key-io#2716)

When the hash object does not exist, FXX should simply fail the check
without creating the object, while FNX should trivially succeed.

Note: this also fixes a potential compilation warning on some compilers
doing constant folding of a variable-length array when the size is a
constant expression.

Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
When using `skip` inside a test to skip it, running with --dump-logs
causes the logs to be dumped. Introduced in valkey-io#2342.

The reason is that the "skipped" exception is caught in the start_server proc
in tests/support/server.tcl, which is where the $::dump_logs variable is
checked.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
…alkey-io#747)

Currently, zdiff() becomes a no-op in the case that the first key is
empty.

The existing comment of "Skip everything if the smallest input is empty"
is misleading, as qsort is bypassed for the case of ZDIFF. There's no
guarantee that the first key holds the smallest cardinality.

The real reason behind such bypass is the following. ZDIFF computes the
difference between the first and all successive input sorted sets.
Meaning, if the first key is empty, we cannot reduce further from an
already empty collection, and thus zdiff() becomes a no-op.

Signed-off-by: Kyle Kim <kimkyle@amazon.com>
DEBUG LOADAOF sometimes works, but it results in -LOADING responses to
the primary, so there are lots of race conditions. It isn't something we
should be doing anyway. To test, I just disconnect the replica before
loading the AOF, then reconnect it.

Fixes valkey-io#2712

Signed-off-by: Jacob Murphy <jkmurphy@google.com>
This test was failing, and causing the next test to throw an exception.
It was failing because we never waited for the slot migration to connect
before proceeding.

Fixes valkey-io#2692

Signed-off-by: Jacob Murphy <jkmurphy@google.com>
…child snapshot is active (valkey-io#2721)

The race condition causes the client to be used and subsequently double
freed by the slot migration read pipe handler. The order of events is:

1. We kill the slot migration child process during CANCELSLOTMIGRATIONS
2. We then free the client associated with the target node
3. Although we kill the child process, it is not guaranteed that the
pipe from child to parent will be empty
4. If the pipe is not empty, we will later read that data out in
slotMigrationPipeReadHandler
5. In the pipe read handler, we attempt to write to the connection; if
writing to the connection fails, we attempt to free the client
6. However, the client was already freed, so this is a double free

Notably, the slot migration being aborted doesn't need to be triggered
by `CANCELSLOTMIGRATIONS`, it can be any failure.

To solve this, we simply:

1. Set the slot migration pipe connection to NULL whenever it is
unlinked
2. Bail out early in slot migration pipe read handler if the connection
is NULL

I also consolidated the killSlotMigrationChild call into one code path,
which is executed on client unlink. Before, there were two code paths
that would do this twice (once on slot migration job finish, and once on
client unlink). Sending the signal twice is fine, but inefficient.

Also, add a test to cancel during the slot migration snapshot to make
sure this case is covered (we only caught it during the module test).

---------

Signed-off-by: Jacob Murphy <jkmurphy@google.com>
…ey-io#2727)

The test was accidentally removed in PR valkey-io#1671.

Signed-off-by: Binbin <binloveplay1314@qq.com>
valkey-io#2329 introduced a bug that causes a blocked client in cluster mode to
receive two MOVED redirects instead of one. This was not seen in tests,
except in the reply schema validator.

The fix makes sure the client's pending command is cleared after sending
the MOVED redirect, to prevent it from being reprocessed.

Fixes valkey-io#2676.

---------

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Resolves valkey-io#2696 

The primary issue was that, with sanitizer mode, the test needed more
time for the primary's replication buffers to grow beyond `2 ×
backlog_size`. Increasing the `repl-timeout` threshold to 30s ensures
that the inactive replica is not disconnected while the full sync is
proceeding. `rdb-key-save-delay` controls or throttles the data written
to the client output buffer, and in this case we can deterministically
perform the full sync within 10s (10000 keys * 0.001s).

Increasing the `wait_for_condition` gives it enough retries to verify
that `mem_total_replication_buffers` reaches the required `2 ×
backlog_size`.

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
…SLOTS ESTABLISH (valkey-io#2498)

We now have valkey-io#2688 SYNCSLOTS CAPA, so remove the outdated comment.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Increased the wait time to a total of 10 seconds, where we check the log
for the `Done loading RDB` message.

Fixes valkey-io#2694

CI run (100 times):
https://github.com/roshkhatri/valkey/actions/runs/18576201712/job/52961907806

Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
…load crash (valkey-io#1826)

There are two issues in this test:
```
test {FUNCTION - test function flush} {
    for {set i 0} {$i < 10000} {incr i} {
        r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
    }
    set before_flush_memory [s used_memory_vm_functions]
    r function flush sync
    set after_flush_memory [s used_memory_vm_functions]
    puts "flush sync, before_flush_memory: $before_flush_memory, after_flush_memory: $after_flush_memory"

    for {set i 0} {$i < 10000} {incr i} {
        r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
    }
    set before_flush_memory [s used_memory_vm_functions]
    r function flush async
    set after_flush_memory [s used_memory_vm_functions]
    puts "flush async, before_flush_memory: $before_flush_memory, after_flush_memory: $after_flush_memory"

    for {set i 0} {$i < 10000} {incr i} {
        r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
    }
    puts "Test done"
}
```

The first one is visible in the test output: we can see that after
executing FUNCTION FLUSH, used_memory_vm_functions has not changed at all:
```
flush sync, before_flush_memory: 2962432, after_flush_memory: 2962432
flush async, before_flush_memory: 4504576, after_flush_memory: 4504576
```

The second one is that there is a crash when loading functions during
the async flush:
```
=== VALKEY BUG REPORT START: Cut & paste starting from here ===
 # valkey 255.255.255 crashed by signal: 11, si_code: 2
 # Accessing address: 0xe0429b7100000a3c
 # Crashed running the instruction at: 0x102e0b09c

------ STACK TRACE ------
EIP:
0   valkey-server                       0x0000000102e0b09c luaH_getstr + 52

Backtrace:
0   libsystem_platform.dylib            0x000000018b066584 _sigtramp + 56
1   valkey-server                       0x0000000102e01054 luaD_precall + 96
2   valkey-server                       0x0000000102e01b10 luaD_call + 104
3   valkey-server                       0x0000000102e00d1c luaD_rawrunprotected + 76
4   valkey-server                       0x0000000102e01e3c luaD_pcall + 60
5   valkey-server                       0x0000000102dfc630 lua_pcall + 300
6   valkey-server                       0x0000000102f77770 luaEngineCompileCode + 708
7   valkey-server                       0x0000000102f71f50 scriptingEngineCallCompileCode + 104
8   valkey-server                       0x0000000102f700b0 functionsCreateWithLibraryCtx + 2088
9   valkey-server                       0x0000000102f70898 functionLoadCommand + 312
10  valkey-server                       0x0000000102e3978c call + 416
11  valkey-server                       0x0000000102e3b5b8 processCommand + 3340
12  valkey-server                       0x0000000102e563cc processInputBuffer + 520
13  valkey-server                       0x0000000102e55808 readQueryFromClient + 92
14  valkey-server                       0x0000000102f696e0 connSocketEventHandler + 180
15  valkey-server                       0x0000000102e20480 aeProcessEvents + 372
16  valkey-server                       0x0000000102e4aad0 main + 26412
17  dyld                                0x000000018acab154 start + 2476

------ STACK TRACE DONE ------
```

The reason is that, in the old implementation (introduced in 7.0),
FUNCTION FLUSH used lua_unref to remove the script from the Lua VM.
lua_unref does not trigger the GC, so we cannot effectively reclaim
memory after FUNCTION FLUSH.

The other issue is that, since we don't re-create the Lua VM in
FUNCTION FLUSH, loading functions during a FUNCTION FLUSH ASYNC results
in a crash because the Lua engine state is not thread-safe.

The correct solution is to re-create a new Lua VM, just like SCRIPT
FLUSH does.

---------

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Ricardo Dias <ricardo.dias@percona.com>
Co-authored-by: Ricardo Dias <ricardo.dias@percona.com>
Signed-off-by: Hanxi Zhang <hanxizh@amazon.com>