DAOS-18727 pool: Fix reconf error handling (#18442) #18508
Ticket title is './recovery/pool_list_consolidation.py:PoolListConsolidationTest.test_lost_majority_ps_replicas - rdb-pool are recovered, three out of four ranks should have rdb-pool'
Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18508/1/execution/node/1307/log
Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18508/1/testReport/
When pool_svc_reconf_ult adds a PS replica, the replica creation request may encounter a network error such as -DER_GRPVER (e.g., if the destination rank has just started). This patch adds a retry loop for such errors, so that the reconfiguration does not give up.

In addition, add flag CRT_RPC_FLAG_CO_FAILOUT to the RSVC_START and RSVC_STOP CoRPCs, because by default a CoRPC executes the local handler even upon a group version mismatch, which seems unnecessary and has caused confusion during past debugging.

Signed-off-by: Li Wei <liwei@hpe.com>
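For readers unfamiliar with the pattern, the retry behavior described above can be sketched roughly as follows. This is a minimal standalone illustration, not the code in this patch: `SK_DER_GRPVER`, `add_replica_with_retry`, `fake_start`, and the bounded `max_attempts` are all hypothetical names and assumptions made for the sketch, and the real change may retry differently (for example, with a sleep or yield between attempts, or without an attempt limit).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Placeholder error number for this sketch only; the real DER_GRPVER
 * value is defined in the DAOS headers. */
#define SK_DER_GRPVER 1

/* Hypothetical callback standing in for the replica creation request. */
typedef int (*start_replica_fn)(void *arg);

/* Only transient errors are retried (assumption: just the group
 * version mismatch in this sketch). */
static bool
is_retryable(int rc)
{
	return rc == -SK_DER_GRPVER;
}

/* Call start() until it succeeds, fails with a non-retryable error,
 * or max_attempts is exhausted. Returns the last rc. */
static int
add_replica_with_retry(start_replica_fn start, void *arg, int max_attempts,
		       int *attempts_out)
{
	int rc = 0;
	int attempt = 0;

	assert(start != NULL && max_attempts > 0);

	while (attempt < max_attempts) {
		attempt++;
		rc = start(arg);
		if (!is_retryable(rc))
			break;
		/* A real ULT would likely sleep or yield here before retrying. */
	}

	if (attempts_out != NULL)
		*attempts_out = attempt;
	return rc;
}

/* Fake destination rank that rejects the first few requests, as a rank
 * that has just started might. */
struct fake_rank {
	int failures_left;
};

static int
fake_start(void *arg)
{
	struct fake_rank *rank = arg;

	if (rank->failures_left > 0) {
		rank->failures_left--;
		return -SK_DER_GRPVER;
	}
	return 0;
}
```

With a fake rank that fails twice, the helper succeeds on the third attempt; if the rank keeps failing, the last retryable error is returned once the attempt budget is used up.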
Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18508/2/testReport/