Fix/cluster slot migration flaky test 2693#18
Closed
hanxizh9910 wants to merge 30 commits into `unstable`
As requested, here is a version of LOLWUT for version 9 that visualizes a Julia
set with ASCII art.
Example:
```
127.0.0.1:6379> lolwut version 9
.............
......................
............................
......:::--:::::::::::::::.......
.....:::=+*@@@=--------=+===--::....
....:::-+@@@@&*+=====+%@@@@@@@@=-::....
.....:::-=+@@@@@%%*++*@@@@@@@@@&*=--::....
.....::--=++#@@@@@@@@##@@@@@@@@@@@@@@=::....
......:-=@&#&@@@@@@@@@@@@@@@@@@@@@@@@@%-::...
......::-+@@@@@@@@@@@@@@@@@@&&@@@#%#&@@@-::....
.......::-=+%@@@@@@@@@@@@@@@@#%%*+++++%@+-:.....
.......::-=@&@@@@@@@@@@@@@@@@&*++=====---::.....
.......:::--*@@@@@@@@@@@@@@@@@%++===----::::.....
........::::-=+*%&@@@@@@@@@&&&%*+==----:::::......
........::::--=+@@@@@@@@@@&##%*++==---:::::.......
.......:::::---=+#@@@@@@@@&&&#%*+==---:::::.......
........:::::---=++*%%#&&@@@@@@@@@+=---::::........
.......:::::----=++*%##&@@@@@@@@@@%+=--::::.......
......::::-----==++#@@@@@@@@@@@@@&%*+=-:::........
......:::---====++*@@@@@@@@@@@@@@@@@@+-:::.......
.....::-=++==+++**%@@@@@@@@@@@@@@@@#*=--::.......
....:-%@@%****%###&@@@@@@@@@@@@@@@@&+--:.......
....:-=@@@@@&@@@@@@@@@@@@@@@@@@@@@@@@=::......
...::+@@@@@@@@@@@@@@@&&@@@@@@@@%**@+-::.....
....::-=+%#@@@@@@@@@&%%%&@@@@@@*==-:::.....
....::--+%@@@@@@@%++==++*#@@@@&=-:::....
....:::-*@**@@+==----==*%@@@@+-:::....
.....:::---::::::::--=+@=--::.....
.........::::::::::::::.......
.........................
..................
...
Ascii representation of Julia set with constant 0.41 + 0.29i
Don't forget to have fun! Valkey ver. 255.255.255
```
You can pass in arbitrary rows and columns (it looks best when rows is 2x the
number of columns) and an arbitrary Julia constant, so the output is repeatable.
Worst case it takes about 100us on my M2 MacBook, which should be fine to ensure
it doesn't take too many system resources.
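For reference, the classic escape-time iteration behind this kind of rendering can be sketched as below. This is an independent toy, not the PR's actual lolwut code; the shade ramp, iteration cap, and coordinate scaling are arbitrary choices.

```c
#include <stdio.h>

/* Toy escape-time Julia set renderer (a sketch, not the PR's lolwut code).
 * Each cell iterates z = z^2 + c; the iteration at which |z| exceeds 2
 * picks an ASCII shade. */
char julia_shade(double zr, double zi, double cr, double ci) {
    const char *ramp = ".:-=+*#%&@"; /* 10 shades, dim to bright */
    int i;
    for (i = 0; i < 9; i++) {
        double r2 = zr * zr, i2 = zi * zi;
        if (r2 + i2 > 4.0) break;   /* escaped the radius-2 circle */
        zi = 2.0 * zr * zi + ci;    /* imaginary part of z^2 + c */
        zr = r2 - i2 + cr;          /* real part of z^2 + c */
    }
    return ramp[i];
}

/* Render a cols x rows grid over roughly [-1.5, 1.5] x [-1.5, 1.5]. */
void render_julia(int cols, int rows, double cr, double ci) {
    for (int row = 0; row < rows; row++) {
        for (int col = 0; col < cols; col++) {
            double zr = 3.0 * (col + 0.5) / cols - 1.5;
            double zi = 3.0 * (row + 0.5) / rows - 1.5;
            putchar(julia_shade(zr, zi, cr, ci));
        }
        putchar('\n');
    }
}
```

Calling `render_julia(66, 33, 0.41, 0.29)` produces a picture in the spirit of the example output above.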
---------
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
**Issue History:**
1. The flaky test issue "defrag didn't stop" was originally detected in February 2025: valkey-io#1746, solved in valkey-io#1762.
2. A similar issue occurred recently: https://github.com/valkey-io/valkey/actions/runs/16585350083/job/46909359496#step:5:7640

**Investigation:**
1. The issue occurs specifically in the Active Defrag stream test in cluster mode.
2. After investigating `test_stream` in `memefficiency.tcl`, I found the root cause is in the defrag logic rather than the test itself: tests kept failing with the same error even when I tried different parameters.
3. I then looked at malloc-stats and identified potential defrag issues, particularly in the 80B bin, where utilization only reaches ~75% after defrag instead of the expected near 100%, while other bins show proper defrag behavior. 80B is the size of a new stream (confirmed in `t_stream.c`) that the test adds.
4. For 80B, after adding 200000 streams and fragmenting, `curregs` = 100030; after many defrag cycles, 122 of 511 slabs were still non-full, with the remaining 446 items not defragged (on average 4 per non-full slab).

**Detailed malloc-stats:**
- Total slabs: 511
- Non-full slabs: 122
- Full slabs: 511 - 122 = 389
- Theoretical maximum per slab: 256 items
- Allocated items in non-full slabs: 100030 - 389 * 256 = 446
- Average items per non-full slab: 446 / 122 = 3.66

**Root Cause:** Some immovable items prevent complete defrag.

**Problems in the old defrag logic:**
1. With the previous condition (we don't defrag if slab utilization > average utilization * 1.125), the 12.5% margin doesn't work well at low utilization.
   - Imagine the 446 items in 122 non-full slabs (avg 3.66 items per non-full slab) are distributed as, say, 81 slabs with 5 items each plus 41 slabs with 1 item each.
   - The threshold is 3.66 * 1.125 = 4.11.
   - If those 41 single items are immovable, they drag down the average, so the other 81 slabs sit above the threshold (5 > 4.11) and are never defragged: "defrag didn't stop".
2. The distribution of immovable items across slabs caused inconsistent defragmentation and flaky test outcomes.
   - If those 41 single items are movable, they get moved, the average rises to 5, the threshold becomes 5 * 1.125 = 5.625, and the 81 slabs fall below it (5 < 5.625) and are defragged: defrag succeeds.
   - This explains why the defrag tests were flaky.

**Final solution:**
1. Add one more condition, before the old logic in `makeDefragDecision`, to trigger defragmentation when a slab is less than 1/8 full (the 1/8 (12.5%) threshold was chosen to align with the existing utilization threshold factor). This ensures no low-utilization slabs are left undefragged and stabilizes defrag behavior.
2. Why there are immovable items, and how to handle them, will be investigated later.
3. Be sure to rebuild Valkey before testing it.

**Local test results:**
- Before the fix: pass rate 80.8% (63/78)
- After the fix, stream test only: pass rate 100% (200/200)
- After the fix, the whole memefficiency.tcl: pass rate 100% (100/100)

Resolves valkey-io#2398, the "defrag didn't stop" issue, with help from @JimB123 @madolson

---------

Signed-off-by: Alina Liu <liusalisa6363@gmail.com>
Signed-off-by: asagegeLiu <liusalisa6363@gmail.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
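The two-condition decision described above can be condensed into a small predicate. `makeDefragDecision` is the function named in the PR, but this standalone helper and its signature are hypothetical:

```c
#include <stdbool.h>

/* Hypothetical condensation of the decision in makeDefragDecision:
 * defrag a slab if it is less than 1/8 full (the new condition added by
 * this PR), or if its utilization is at most 1.125x the bin's average
 * utilization (the pre-existing condition). */
bool should_defrag_slab(int used_items, int slab_capacity, double avg_utilization) {
    double utilization = (double)used_items / slab_capacity;
    if (utilization < 1.0 / 8.0) return true;      /* new: nearly empty slab */
    return utilization <= avg_utilization * 1.125; /* old: near/below average */
}
```

In the 80B-bin scenario above (capacity 256, average 3.66 items per non-full slab), a 5-item slab has utilization 5/256 ≈ 2%, far below 1/8, so the new condition defrags it even though the old threshold (5 > 4.11) would not.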
New client flags reported by CLIENT INFO and CLIENT LIST:
* `i` for the atomic slot migration importing client
* `E` for the atomic slot migration exporting client

New flags in the return value of `ValkeyModule_GetContextFlags`:
* `VALKEYMODULE_CTX_FLAGS_SLOT_IMPORT_CLIENT`: indicates that the client attached to this context is the slot import client.
* `VALKEYMODULE_CTX_FLAGS_SLOT_EXPORT_CLIENT`: indicates that the client attached to this context is the slot export client.

Users can use these to monitor the underlying client info of the slot migration and understand more clearly why they see extra clients during the migration. Modules can use them to detect keyspace notifications on import clients. I am also adding export flags for symmetry, although there should not be keyspace notifications; they would potentially be visible in command filters or in server events triggered by that client.

---------

Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
The CVE fixes had a formatting and external test issue that wasn't caught because private branches don't run those CI steps. Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
…ration (valkey-io#2635)

# Problem

In the current slot migration design, replicas are completely unaware of the slot migration. Because of this, they do not know to hide importing keys, which exposes those keys to commands like KEYS, SCAN, RANDOMKEY, and DBSIZE.

# Design

The main part of the design is that we now listen for and process the `SYNCSLOTS ESTABLISH` command on the replica. When a `SYNCSLOTS ESTABLISH` command is received from the primary, we begin tracking a new slot import in a special `SLOT_IMPORT_OCCURRING_ON_PRIMARY` state. Replicas use this state to track the import and await a future `SYNCSLOTS FINISH` message that tells them whether the import succeeded or failed.

## Success Case

```
Source                            Target                         Target Replica
  |                                 |                                 |
  |------ SYNCSLOTS ESTABLISH ---->|                                 |
  |                                 |------ SYNCSLOTS ESTABLISH ---->|
  |<------------ +OK --------------|                                 |
  |                                 |                                 |
  |~~~~~~ snapshot as AOF ~~~~~~~~>|                                 |
  |                                 |~~~~~~ forward snapshot ~~~~~~~>|
  |---- SYNCSLOTS SNAPSHOT-EOF --->|                                 |
  |                                 |                                 |
  |<--- SYNCSLOTS REQUEST-PAUSE ---|                                 |
  |                                 |                                 |
  |~~~~~ incremental changes ~~~~~>|                                 |
  |                                 |~~~~~~ forward changes ~~~~~~~~>|
  |------ SYNCSLOTS PAUSED ------->|                                 |
  |                                 |                                 |
  |<-- SYNCSLOTS REQUEST-FAILOVER -|                                 |
  |                                 |                                 |
  |-- SYNCSLOTS FAILOVER-GRANTED ->|                                 |
  |                                 |                                 |
  |                        (performs takeover &                       |
  |                         propagates topology)                      |
  |                                 |                                 |
  |                                 |------ SYNCSLOTS FINISH ------->|
  | (finds out about topology       |                                 |
  |  change & marks migration done) |                                 |
  |                                 |                                 |
```

## Failure Case

```
Source                            Target                         Target Replica
  |                                 |                                 |
  |------ SYNCSLOTS ESTABLISH ---->|                                 |
  |                                 |------ SYNCSLOTS ESTABLISH ---->|
  |<------------ +OK --------------|                                 |
 ...                               ...                               ...
  |                                 |                                 |
  |                             <FAILURE>                             |
  |                                 |                                 |
  |                         (performs cleanup)                        |
  |                                 |                                 |
  |                                 |~~~~~~ UNLINK <key> ... ~~~~~~~>|
  |                                 |                                 |
  |                                 |------ SYNCSLOTS FINISH ------->|
  |                                 |                                 |
```

## Full Sync, Partial Sync, and RDB

To ensure replicas that resync during the import are still aware of it, the slot import is serialized to a new `cluster-slot-imports` aux field. The encoding includes the job name, the source node name, and the slot ranges being imported. Upon loading an RDB with the `cluster-slot-imports` aux field, replicas add a new migration in the `SLOT_IMPORT_OCCURRING_ON_PRIMARY` state.

It's important to note that a previously saved RDB file can be used as the basis for a partial sync with a primary. Because of this, whenever we load an RDB file with the `cluster-slot-imports` aux field, even from disk, we still add a new migration to track the import. If, after loading the RDB, the Valkey node is a primary, it cancels the slot migration. Having this tracking state loaded on primaries ensures that replicas partial-syncing to a restarted primary still get their `SYNCSLOTS FINISH` message in the replication stream.

## AOF

Since AOF cannot be used as the basis for a partial sync, we don't strictly need to persist the `SYNCSLOTS ESTABLISH` and `FINISH` commands to the AOF. However, considering there is work to change this (valkey-io#59, valkey-io#1901), this design makes no assumptions about it. We propagate the `ESTABLISH` and `FINISH` commands to the AOF and ensure they can be properly replayed on AOF load to reach the right state. As with RDB, if there are any pending `ESTABLISH` commands without a matching `FINISH` upon becoming primary, we make sure to fail those in `verifyClusterConfigWithData`.

Additionally, there was a bug in the existing slot migration where slot import clients were not having their commands persisted to AOF. This has been fixed by ensuring we still propagate to AOF even for slot import clients.

## Promotion & Demotion

Since the primary is solely responsible for cleaning up unowned slots, primaries that are demoted will not clean up previously active slot imports. The promoted replica is responsible for both cleaning up the slot (`verifyClusterConfigWithData`) and sending a `SYNCSLOTS FINISH`.

# Other Options Considered

I also considered tracking "dirty" slots rather than using the slot import state machine. In this setup, primaries and replicas would simply mark each slot's hashtable in the kvstore as dirty when something is written to it while we do not currently own that slot. This approach is simpler, but modules loaded on the replica would still not get slot migration start/end notifications, and without those notifications they could not properly contain the dirty keys during slot migration events.

---------

Signed-off-by: Jacob Murphy <jkmurphy@google.com>
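The replica-side tracking can be pictured as a tiny state machine. `SLOT_IMPORT_OCCURRING_ON_PRIMARY` is the state named in the PR; the other states and the transition helper below are illustrative inventions, not the PR's code:

```c
/* Illustrative sketch of the replica-side slot import tracking.
 * SLOT_IMPORT_OCCURRING_ON_PRIMARY is from the PR; everything else
 * here is a hypothetical simplification. */
typedef enum {
    SLOT_IMPORT_NONE,                 /* no import known */
    SLOT_IMPORT_OCCURRING_ON_PRIMARY, /* entered on SYNCSLOTS ESTABLISH */
    SLOT_IMPORT_FINISHED              /* entered on SYNCSLOTS FINISH */
} slotImportState;

typedef enum { SYNCSLOTS_ESTABLISH, SYNCSLOTS_FINISH } syncslotsMsg;

/* A replica that sees ESTABLISH starts hiding the importing keys; a
 * later FINISH (success or failure) ends the tracking. */
slotImportState on_syncslots(slotImportState s, syncslotsMsg msg) {
    switch (msg) {
    case SYNCSLOTS_ESTABLISH: return SLOT_IMPORT_OCCURRING_ON_PRIMARY;
    case SYNCSLOTS_FINISH:    return SLOT_IMPORT_FINISHED;
    }
    return s;
}
```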
For now, introduce this and have it do nothing. Eventually, we can use this to negotiate supported capabilities on either end. Right now, there is nothing to send or support, so it just accepts it and doesn't reply. --------- Signed-off-by: Jacob Murphy <jkmurphy@google.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Seeing test failures due to this on the 9.0.0 branch:
```
[exception]: Executing test client: ERR Syntax error. Use: LOLWUT [columns rows] [real imaginary].
ERR Syntax error. Use: LOLWUT [columns rows] [real imaginary]
```
It turns out we were just providing the version as an argument, instead of specifying which version to run.

Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Makes our tests possible to run with TCL 9. The latest Fedora now has TCL 9.0 and it's working now, including the TCL TLS package. (This wasn't working earlier due to some packaging errors for TCL packages in Fedora, which have been fixed now.) This PR also removes the custom compilation of TCL 8 used in our Daily jobs and uses the system default TCL version instead. The TCL version depends on the OS. For the latest Fedora, you get 9.0, for macOS you get 8.5 and for most other OSes you get 8.6. The checks for TCL 8.7 are removed, because 8.7 doesn't exist. It was never released. --------- Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
If we don't wait for the replica to resync, the migration may be cancelled by the time the replica resyncs, resulting in a test failure when we can't find the migration on the replica. --------- Signed-off-by: Jacob Murphy <jkmurphy@google.com> Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
This change exposes an existing, persistent (in nodes.conf) unique shard
identifier for each shard in the cluster as part of the `CLUSTER SHARDS` command
response.
```
1) 1) "slots"
2) 1) (integer) 0
2) (integer) 999
3) (integer) 2001
4) (integer) 3999
5) (integer) 4501
6) (integer) 5460
3) "nodes"
4) 1) 1) "id"
2) "6e76043bed00e716e85035107866ea16e9a5f700"
3) "port"
4) (integer) 6385
5) "ip"
6) "127.0.0.1"
7) "endpoint"
8) "127.0.0.1"
9) "role"
10) "replica"
11) "replication-offset"
12) (integer) 8092
13) "health"
14) "online"
2) 1) "id"
2) "b2f8c841707b2246ec2a641c37f16e88fe0bb700"
3) "port"
4) (integer) 6380
5) "ip"
6) "127.0.0.1"
7) "endpoint"
8) "127.0.0.1"
9) "role"
10) "master"
11) "replication-offset"
12) (integer) 8092
13) "health"
14) "online"
5) "id"
6) "3f2a7bb7bbd5fc2a331fe9bf95f5e02bcca02430"
```
---------
Signed-off-by: Satheesha Chattenahalli Hanume Gowda <satheesha@apple.com>
Co-authored-by: Satheesha Chattenahalli Hanume Gowda <satheesha@apple.com>
With valkey-io#1401, we introduced additional filters to the CLIENT LIST/KILL subcommands. The intended behavior was to pick the last value of each filter; however, we introduced a memory leak for all the preceding values. Before this change:
```
> CLIENT LIST IP 127.0.0.1 IP 127.0.0.1
id=4 addr=127.0.0.1:37866 laddr=127.0.0.1:6379 fd=10 name= age=0 idle=0 flags=N capa= db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=21 multi-mem=0 rbs=16384 rbp=16384 obl=0 oll=0 omem=0 tot-mem=16989 events=r cmd=client|list user=default redir=-1 resp=2 lib-name= lib-ver= tot-net-in=49 tot-net-out=0 tot-cmds=0
```
Leak:
```
Direct leak of 11 byte(s) in 1 object(s) allocated from:
    #0 0x7f2901aa557d in malloc (/lib64/libasan.so.4+0xd857d)
    #1 0x76db76 in ztrymalloc_usable_internal /workplace/harkrisp/valkey/src/zmalloc.c:156
    #2 0x76db76 in zmalloc_usable /workplace/harkrisp/valkey/src/zmalloc.c:200
    #3 0x4c4121 in _sdsnewlen.constprop.230 /workplace/harkrisp/valkey/src/sds.c:113
    #4 0x4dc456 in parseClientFiltersOrReply.constprop.63 /workplace/harkrisp/valkey/src/networking.c:4264
    #5 0x4bb9f7 in clientListCommand /workplace/harkrisp/valkey/src/networking.c:4600
    #6 0x641159 in call /workplace/harkrisp/valkey/src/server.c:3772
    #7 0x6431a6 in processCommand /workplace/harkrisp/valkey/src/server.c:4434
    #8 0x4bfa9b in processCommandAndResetClient /workplace/harkrisp/valkey/src/networking.c:3571
    #9 0x4bfa9b in processInputBuffer /workplace/harkrisp/valkey/src/networking.c:3702
    #10 0x4bffa3 in readQueryFromClient /workplace/harkrisp/valkey/src/networking.c:3812
    #11 0x481015 in callHandler /workplace/harkrisp/valkey/src/connhelpers.h:79
    #12 0x481015 in connSocketEventHandler.lto_priv.394 /workplace/harkrisp/valkey/src/socket.c:301
    #13 0x7d3fb3 in aeProcessEvents /workplace/harkrisp/valkey/src/ae.c:486
    #14 0x7d4d44 in aeMain /workplace/harkrisp/valkey/src/ae.c:543
    #15 0x453925 in main /workplace/harkrisp/valkey/src/server.c:7319
    #16 0x7f2900cd7139 in __libc_start_main (/lib64/libc.so.6+0x21139)
```
Note: for the ID / NOT-ID filters we group all the options and filter on all of them, whereas for the remaining filters we only pick the last option.

---------

Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
…lkey-io#2672) We have relaxed `cluster-ping-interval` and `cluster-node-timeout` so that the cluster has enough time to stabilize and propagate changes. Fixes an occasional failure of this test when running with valgrind: [err]: Node #10 should eventually replicate node #5 in tests/unit/cluster/slave-selection.tcl #10 didn't became slave of #5

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
* Add cross version compatibility test to run with Valkey 7.2 and 8.0 * Add mechanism in TCL test to skip tests dynamically - valkey-io#2711 --------- Signed-off-by: Harkrishn Patro <harkrisp@amazon.com> Signed-off-by: Harkrishn Patro <bunty.hari@gmail.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
…key-io#2716) When the hash object does not exist, FXX should simply fail the check without creating the object, while FNX should trivially succeed. Note: this also fixes a potential compilation warning on some compilers doing constant folding of a variable-length array when the size is a constant expression.

Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
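The semantics can be captured in a small predicate. The enum and helper below are hypothetical illustrations, not the PR's code:

```c
#include <stdbool.h>

/* Sketch of the condition semantics described above: for a missing hash
 * there are no fields, so FXX ("only if the field exists") must fail
 * without creating the object, while FNX ("only if the field does not
 * exist") is trivially satisfied. Names are hypothetical. */
typedef enum { COND_NONE, COND_FXX, COND_FNX } fieldCond;

bool cond_allows_write(fieldCond cond, bool hash_exists, bool field_exists) {
    if (cond == COND_FXX) return hash_exists && field_exists;
    if (cond == COND_FNX) return !field_exists; /* missing hash => no fields */
    return true;
}
```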
When using `skip` inside a test to skip it, running with --dump-logs causes the logs to be dumped. Introduced in valkey-io#2342. The reason is that the "skipped" exception is caught in the start_server proc in tests/support/server.tcl, which is where the $::dump_logs variable is checked.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
…alkey-io#747) Currently, zdiff() becomes a no-op in the case that the first key is empty. The existing comment of "Skip everything if the smallest input is empty" is misleading, as qsort is bypassed for the case of ZDIFF. There's no guarantee that the first key holds the smallest cardinality. The real reason behind such bypass is the following. ZDIFF computes the difference between the first and all successive input sorted sets. Meaning, if the first key is empty, we cannot reduce further from an already empty collection, and thus zdiff() becomes a no-op. Signed-off-by: Kyle Kim <kimkyle@amazon.com>
DEBUG LOADAOF sometimes works but it results in -LOADING responses to the primary so there are lots of race conditions. It isn't something we should be doing anyways. To test, I just disconnect the replica before loading the AOF, then reconnect it. Fixes valkey-io#2712 Signed-off-by: Jacob Murphy <jkmurphy@google.com>
This test was failing, and causing the next test to throw an exception. It is failing since we never waited for the slot migration to connect before proceeding. Fixes valkey-io#2692 Signed-off-by: Jacob Murphy <jkmurphy@google.com>
…child snapshot is active (valkey-io#2721) The race condition causes the client to be used and subsequently double-freed by the slot migration read pipe handler. The order of events is:
1. We kill the slot migration child process during CANCELSLOTMIGRATIONS.
2. We then free the client associated with the target node.
3. Although we kill the child process, the pipe from child to parent is not guaranteed to be empty.
4. If the pipe is not empty, we later read it out in slotMigrationPipeReadHandler.
5. In the pipe read handler, we attempt to write to the connection. If the write fails, we attempt to free the client.
6. However, the client was already freed, so this is a double free.

Notably, the slot migration abort doesn't need to be triggered by `CANCELSLOTMIGRATIONS`; any failure can cause it. To solve this, we simply:
1. Set the slot migration pipe connection to NULL whenever it is unlinked.
2. Bail out early in the slot migration pipe read handler if the connection is NULL.

I also consolidated the killSlotMigrationChild call to one code path, executed on client unlink. Before, two code paths would do this twice (once on slot migration job finish, and once on client unlink). Sending the signal twice is fine, but inefficient. Also, add a test that cancels during the slot migration snapshot to make sure this case is covered (we only caught it in the module test before).

---------

Signed-off-by: Jacob Murphy <jkmurphy@google.com>
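The shape of the fix is a sentinel-and-guard pattern. Types and names below are hypothetical stand-ins for the PR's code:

```c
#include <stddef.h>

/* Sketch of the double-free fix: the unlink path NULLs the migration's
 * connection pointer, and the pipe read handler bails out early instead
 * of writing to, and possibly re-freeing, a dead client. */
typedef struct slotMigration {
    void *conn; /* connection to the target node, or NULL once unlinked */
} slotMigration;

void unlink_migration_client(slotMigration *m) {
    m->conn = NULL; /* the client is being freed; forget the pointer */
}

/* Returns 1 if child output was forwarded, 0 if the handler bailed. */
int slot_migration_pipe_read_handler(slotMigration *m) {
    if (m->conn == NULL) return 0; /* fix: connection gone, do nothing */
    /* ... read from the child pipe and write to m->conn ... */
    return 1;
}
```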
…ey-io#2727) The test was accidentally removed in PR valkey-io#1671. Signed-off-by: Binbin <binloveplay1314@qq.com>
valkey-io#2329 introduced a bug that causes a blocked client in cluster mode to receive two MOVED redirects instead of one. This was not seen in tests, except by the reply schema validator. The fix makes sure the client's pending command is cleared after sending the MOVED redirect, to prevent it from being reprocessed. Fixes valkey-io#2676.

---------

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
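The essence of the fix is clearing the pending command in the same step that sends the redirect. The struct and helpers below are a hypothetical sketch, not the PR's code:

```c
#include <stddef.h>

/* Sketch of the fix: when a blocked client is unblocked into a -MOVED
 * redirect, the pending command must be cleared, otherwise the command
 * loop reprocesses it and emits a second -MOVED. Names are hypothetical. */
typedef struct client {
    const char *pending_command; /* command queued while blocked */
    int moved_replies;           /* number of -MOVED replies sent */
} client;

void reply_moved(client *c) {
    c->moved_replies++;          /* send the redirect */
    c->pending_command = NULL;   /* the fix: don't reprocess the command */
}

void process_pending(client *c) {
    if (c->pending_command != NULL) reply_moved(c);
}
```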
Resolves valkey-io#2696. The primary issue was that under sanitizer mode, the test needed more time for the primary's replication buffers to grow beyond `2 × backlog_size`. Increasing `repl-timeout` to 30s ensures that the inactive replica is not disconnected while the full sync is in progress. `rdb-key-save-delay` throttles the data written to the client output buffer, and in this case we can deterministically perform the full sync within 10s (10000 keys × 0.001s). Increasing the `wait_for_condition` retries gives it enough attempts to verify that `mem_total_replication_buffers` reaches the required `2 × backlog_size`.

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
…SLOTS ESTABLISH (valkey-io#2498) We now have valkey-io#2688 SYNCSLOTS CAPA; remove the outdated comment.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Increased the wait time to a total of 10 seconds, during which we check the log for the `Done loading RDB` message. Fixes valkey-io#2694. CI run (100 times): https://github.com/roshkhatri/valkey/actions/runs/18576201712/job/52961907806

Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
…load crash (valkey-io#1826) There are two issues in this test:
```
test {FUNCTION - test function flush} {
    for {set i 0} {$i < 10000} {incr i} {
        r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
    }
    set before_flush_memory [s used_memory_vm_functions]
    r function flush sync
    set after_flush_memory [s used_memory_vm_functions]
    puts "flush sync, before_flush_memory: $before_flush_memory, after_flush_memory: $after_flush_memory"

    for {set i 0} {$i < 10000} {incr i} {
        r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
    }
    set before_flush_memory [s used_memory_vm_functions]
    r function flush async
    set after_flush_memory [s used_memory_vm_functions]
    puts "flush async, before_flush_memory: $before_flush_memory, after_flush_memory: $after_flush_memory"

    for {set i 0} {$i < 10000} {incr i} {
        r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
    }
    puts "Test done"
}
```
The first is visible in the test output: after executing FUNCTION FLUSH, used_memory_vm_functions has not changed at all:
```
flush sync, before_flush_memory: 2962432, after_flush_memory: 2962432
flush async, before_flush_memory: 4504576, after_flush_memory: 4504576
```
The second is a crash when loading functions during the async flush:
```
=== VALKEY BUG REPORT START: Cut & paste starting from here ===
# valkey 255.255.255 crashed by signal: 11, si_code: 2
# Accessing address: 0xe0429b7100000a3c
# Crashed running the instruction at: 0x102e0b09c

------ STACK TRACE ------
EIP:
0   valkey-server                 0x0000000102e0b09c luaH_getstr + 52

Backtrace:
0   libsystem_platform.dylib      0x000000018b066584 _sigtramp + 56
1   valkey-server                 0x0000000102e01054 luaD_precall + 96
2   valkey-server                 0x0000000102e01b10 luaD_call + 104
3   valkey-server                 0x0000000102e00d1c luaD_rawrunprotected + 76
4   valkey-server                 0x0000000102e01e3c luaD_pcall + 60
5   valkey-server                 0x0000000102dfc630 lua_pcall + 300
6   valkey-server                 0x0000000102f77770 luaEngineCompileCode + 708
7   valkey-server                 0x0000000102f71f50 scriptingEngineCallCompileCode + 104
8   valkey-server                 0x0000000102f700b0 functionsCreateWithLibraryCtx + 2088
9   valkey-server                 0x0000000102f70898 functionLoadCommand + 312
10  valkey-server                 0x0000000102e3978c call + 416
11  valkey-server                 0x0000000102e3b5b8 processCommand + 3340
12  valkey-server                 0x0000000102e563cc processInputBuffer + 520
13  valkey-server                 0x0000000102e55808 readQueryFromClient + 92
14  valkey-server                 0x0000000102f696e0 connSocketEventHandler + 180
15  valkey-server                 0x0000000102e20480 aeProcessEvents + 372
16  valkey-server                 0x0000000102e4aad0 main + 26412
17  dyld                          0x000000018acab154 start + 2476
------ STACK TRACE DONE ------
```
The reason is that in the old implementation (introduced in 7.0), FUNCTION FLUSH uses lua_unref to remove each script from the Lua VM. lua_unref does not trigger GC, so we cannot effectively reclaim memory after FUNCTION FLUSH. The other issue is that, since we don't re-create the Lua VM in FUNCTION FLUSH, loading functions during a FUNCTION FLUSH ASYNC results in a crash, because the Lua engine state is not thread-safe. The correct solution is to re-create a new Lua VM, just like SCRIPT FLUSH does.

---------

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Ricardo Dias <ricardo.dias@percona.com>
Co-authored-by: Ricardo Dias <ricardo.dias@percona.com>
Signed-off-by: Hanxi Zhang <hanxizh@amazon.com>