DAOS-18928 dfuse: increase MAX_DAOS_MT to 32 #18526
ece2201 to
60600f5
Compare
|
Test stage Functional on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18526/2/execution/node/582/log |
|
Ticket title is 'pil4dfs: increase MAX_DAOS_MT to 32' |
60600f5 to
b038095
Compare
Also prevent a crash if we go beyond MAX_DAOS_MT by just disabling interception in that case.
Features: pil4dfs
Signed-off-by: Mohamad Chaarawi <mohamad.chaarawi@hpe.com>
b038095 to
740d6ce
Compare
|
Test stage Functional on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18526/4/execution/node/1027/log |
|
Test stage Functional on EL 9 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18526/5/testReport/ |
|
Test stage Functional Hardware Large MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18526/5/testReport/ |
|
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18526/5/execution/node/1345/log |
|
Test stage Functional on EL 9 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18526/6/testReport/ |
daltonbohning
left a comment
I pushed most of my feedback here if you want to use it
https://github.com/daos-stack/daos/compare/mschaara/18928...dbohning/mschaara/18928?expand=1
| finally: | ||
| # Unmount this case's dfuse instances so the next case only sees its own mounts. | ||
| for dfuse in dfuses: | ||
| dfuse.stop() |
Since we know this runs with 10, 32, then 33 instances, do we want to optimize this to reuse the containers and mounts? I.e.
- Create and mount 10 containers
- Verify behavior
- Create and mount additional containers up to 32
- Verify behavior
- Create and mount additional containers up to 33
- Verify behavior
I could make that change if you agree
Good call, I will make the change.
I will wait for all testing to complete first, then post a follow-up PR with test-only changes to address your feedback.
|
|
||
| /* The max number of mount points for DAOS mounted simultaneously */ | ||
| #define MAX_DAOS_MT (8) | ||
| #define MAX_DAOS_MT (32) |
Can we not make this 128 or 256? The cost of this is what, a few KB? 32 is still in the realm of possibility of mounts on a common login.
I suggest increasing the size to something that will likely not happen. Otherwise you'll just hit this in the future and be annoyed again.
Discussed with Kevin offline. The change to simply disable interception once there are more than 32 mounts is sufficient: apps are no longer aborted as before, so exceeding the limit should not cause problems.
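For readers following the thread, here is a minimal sketch of the "disable interception instead of abort" behavior described above. It is not the pil4dfs code: `struct mnt_table_sketch`, `mnt_table_add()` and `discover_n_mounts()` are hypothetical names used only for illustration, and only `MAX_DAOS_MT` and its value of 32 come from this PR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define MAX_DAOS_MT 32 /* value proposed in this PR */

/* Hypothetical stand-in for the mount table; not the real pil4dfs layout. */
struct mnt_table_sketch {
	int  num_dfs;              /* slots currently in use */
	bool interception_enabled; /* cleared once the table overflows */
};

static void
mnt_table_init(struct mnt_table_sketch *t)
{
	t->num_dfs              = 0;
	t->interception_enabled = true;
}

/*
 * Try to register one more mount. On overflow, warn once and disable
 * interception rather than calling abort(). Returns true if a slot was used.
 */
static bool
mnt_table_add(struct mnt_table_sketch *t)
{
	if (t->num_dfs >= MAX_DAOS_MT) {
		if (t->interception_enabled)
			fprintf(stderr, "more than %d mounts, interception disabled\n",
				MAX_DAOS_MT);
		t->interception_enabled = false;
		return false;
	}
	t->num_dfs++;
	assert(t->num_dfs <= MAX_DAOS_MT);
	return true;
}

/* Simulate discovering n mounts; report whether interception stays enabled. */
static bool
discover_n_mounts(int n)
{
	struct mnt_table_sketch t;

	mnt_table_init(&t);
	for (int i = 0; i < n; i++)
		mnt_table_add(&t);
	return t.interception_enabled;
}
```

Under these assumptions, the 10 and 32 mount cases keep interception enabled, while the 33 mount case falls back to plain dfuse I/O without terminating the application.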
Quick-Functional: true
Test-tag: test_pil4dfs_many_mounts
Signed-off-by: Mohamad Chaarawi <mohamad.chaarawi@hpe.com>
|
only failure in https://jenkins-3.daos.hpc.amslabs.hpecorp.net/blue/organizations/jenkins/daos-stack%2Fdaos/detail/PR-18526/6/tests was the known issue with the dfuse build test. repushed with only test change using test-tag to avoid running all CI |
| hosts: | ||
| test_servers: 1 | ||
| test_clients: 1 | ||
| timeout: 900 |
Only if you need to push again: could reduce this to 600 since the actual execution time was ~6 minutes
| timeout: 900 | |
| timeout: 600 |
sigh.. i did repush but forgot about this
knard38
left a comment
From my understanding, the behaviour of discover_daos_mount_with_env() should also be changed to be consistent with the change done to discover_dfuse_mounts(). The same num_dfs >= MAX_DAOS_MT test returns an error that triggers an abort() via the caller init_myhook(). If the previous call to discover_dfuse_mounts() fills the last slot, the call to discover_daos_mount_with_env() will abort.
There is an additional ordering issue: the overflow guard fires before the dedup check (query_dfs_mount at L603). This means that even when D_IL_MOUNT_POINT points to a mount already registered by discover_dfuse_mounts() — so no new slot would be needed — the function aborts before it can detect that.
If I am correct, do you think the fix below (dedup first, then graceful warn-and-skip) would address this properly?
/* move dedup before the overflow guard */
idx = query_dfs_mount(fs_root);
if (idx >= 0)
	D_GOTO(out, rc = 0); /* already registered, no new slot needed */
if (num_dfs >= MAX_DAOS_MT) {
	D_WARN("D_IL_MOUNT_POINT ignored: table full (%d mounts). "
	       "Increase MAX_DAOS_MT or reduce simultaneous mounts.\n",
	       MAX_DAOS_MT);
	D_GOTO(out, rc = 0); /* graceful skip, interception disabled by caller */
}
| abort(); | ||
| } | ||
| pt_dfs_mt = &dfs_list[num_dfs]; | ||
| if (memcmp(fs_entry->mnt_type, STR_AND_SIZE(MNT_TYPE_FUSE)) == 0) { |
Nit
| if (memcmp(fs_entry->mnt_type, STR_AND_SIZE(MNT_TYPE_FUSE)) == 0) { | |
| if (memcmp(fs_entry->mnt_type, STR_AND_SIZE(MNT_TYPE_FUSE)) != 0) | |
| continue; |
Signed-off-by: Mohamad Chaarawi <mohamad.chaarawi@hpe.com>
While I totally agree with you, the env path is a test-only scenario that I had long asked Lei to remove, since it should never be used in production. For the sake of consistency, I'll push the change since it is quite straightforward. |
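The ordering point raised in the review (check for an already-registered mount before checking whether the table is full) can be illustrated with a small self-contained sketch. This is an assumption-laden model, not the pil4dfs implementation: `struct env_mount_table`, `query_mount()`, `add_env_mount()` and the result enum are hypothetical, and the real code uses `query_dfs_mount()`, `D_WARN` and `D_GOTO` instead.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_DAOS_MT 32  /* value proposed in this PR */
#define FS_ROOT_LEN 256 /* hypothetical path length limit */

/* Hypothetical mount table keyed by mount-point path. */
struct env_mount_table {
	int  num_dfs;
	char fs_root[MAX_DAOS_MT][FS_ROOT_LEN];
};

enum add_result { ADD_OK, ADD_DUPLICATE, ADD_SKIPPED_FULL };

/* Return the index of fs_root in the table, or -1 if it is not registered. */
static int
query_mount(const struct env_mount_table *t, const char *fs_root)
{
	for (int i = 0; i < t->num_dfs; i++) {
		if (strcmp(t->fs_root[i], fs_root) == 0)
			return i;
	}
	return -1;
}

/* Dedup first, then the overflow guard, then the actual registration. */
static enum add_result
add_env_mount(struct env_mount_table *t, const char *fs_root)
{
	if (query_mount(t, fs_root) >= 0)
		return ADD_DUPLICATE; /* already registered, no new slot needed */

	if (t->num_dfs >= MAX_DAOS_MT) {
		fprintf(stderr, "mount point %s ignored: table full (%d mounts)\n",
			fs_root, MAX_DAOS_MT);
		return ADD_SKIPPED_FULL; /* graceful skip instead of abort() */
	}

	assert(t->num_dfs < MAX_DAOS_MT);
	snprintf(t->fs_root[t->num_dfs], FS_ROOT_LEN, "%s", fs_root);
	t->num_dfs++;
	return ADD_OK;
}
```

With the checks in this order, a mount point that is already in a full table is reported as a duplicate rather than tripping the overflow path, which is the case the review comment was concerned about.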