Skip to content

DAOS-18928 dfuse: increase MAX_DAOS_MT to 32#18526

Open
mchaarawi wants to merge 4 commits into
masterfrom
mschaara/18928
Open

DAOS-18928 dfuse: increase MAX_DAOS_MT to 32#18526
mchaarawi wants to merge 4 commits into
masterfrom
mschaara/18928

Conversation

@mchaarawi

Copy link
Copy Markdown
Contributor

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@daosbuild3

Copy link
Copy Markdown
Collaborator

@github-actions

github-actions Bot commented Jun 23, 2026

Copy link
Copy Markdown

Ticket title is 'pil4dfs: increase MAX_DAOS_MT to 32'
Status is 'In Progress'
Labels: 'triaged'
https://daosio.atlassian.net/browse/DAOS-18928

Also prevent a crash if we go beyond MAX_DAOS_MT by just disabling
interception in that case.

Features: pil4dfs
Signed-off-by: Mohamad Chaarawi <mohamad.chaarawi@hpe.com>
@daosbuild3

Copy link
Copy Markdown
Collaborator

@daosbuild3

Copy link
Copy Markdown
Collaborator

@daosbuild3

Copy link
Copy Markdown
Collaborator

Test stage Functional Hardware Large MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18526/5/testReport/

@daosbuild3

Copy link
Copy Markdown
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18526/5/execution/node/1345/log

@daosbuild3

Copy link
Copy Markdown
Collaborator

@mchaarawi mchaarawi marked this pull request as ready for review June 24, 2026 19:27
@mchaarawi mchaarawi requested review from a team as code owners June 24, 2026 19:27

@daltonbohning daltonbohning left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread src/tests/ftest/dfuse/pil4dfs_many_mounts.py Outdated
Comment thread src/tests/ftest/dfuse/pil4dfs_many_mounts.py Outdated
Comment thread src/tests/ftest/dfuse/pil4dfs_many_mounts.py Outdated
Comment on lines +91 to +94
finally:
# Unmount this case's dfuse instances so the next case only sees its own mounts.
for dfuse in dfuses:
dfuse.stop()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we know this runs with 10, 32, then 33 instances, do we want to optimize this to reuse the containers and mounts? I.e.

  1. Create and mount 10 containers
  2. Verify behavior
  3. Create and mount additional containers up to 32
  4. Verify behavior
  5. Create and mount additional containers up to 33
  6. Verify behavior

I could make that change if you agree

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good call.. i will make the change.
will wait for all testing to complete first and will post following PR with test only changes to address your feedback.


/* The max number of mount points for DAOS mounted simultaneously */
#define MAX_DAOS_MT (8)
#define MAX_DAOS_MT (32)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we not make this 128 or 256? The cost of this is what, a few KB? 32 is still in the realm of possibility of mounts on a common login.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest increasing the size to something that will likely not happen. Otherwise you'll just hit this in the future and be annoyed again.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

discussed with Kevin offline. the change made to just disable interception once > 32 mounts are there is sufficient and this won't abort apps as before and cause problems.

Quick-Functional: true
Test-tag: test_pil4dfs_many_mounts

Signed-off-by: Mohamad Chaarawi <mohamad.chaarawi@hpe.com>
@mchaarawi

Copy link
Copy Markdown
Contributor Author

only failure in https://jenkins-3.daos.hpc.amslabs.hpecorp.net/blue/organizations/jenkins/daos-stack%2Fdaos/detail/PR-18526/6/tests was the known issue with the dfuse build test.

repushed with only test change using test-tag to avoid running all CI

daltonbohning
daltonbohning previously approved these changes Jun 25, 2026
hosts:
test_servers: 1
test_clients: 1
timeout: 900

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only if you need to push again: could reduce this to 600 since the actual execution time was ~6 minutes

Suggested change
timeout: 900
timeout: 600

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sigh.. i did repush but forgot about this

@knard38 knard38 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my understanding, the behaviour of discover_daos_mount_with_env() should also be changed to be consistent with the change done to discover_dfuse_mounts(). The same num_dfs >= MAX_DAOS_MT test returns an error that triggers an abort() via the caller init_myhook(). If the previous call to discover_dfuse_mounts() fills the last slot, the call to discover_daos_mount_with_env() will abort.

There is an additional ordering issue: the overflow guard fires before the dedup check (query_dfs_mount at L603). This means that even when D_IL_MOUNT_POINT points to a mount already registered by discover_dfuse_mounts() — so no new slot would be needed — the function aborts before it can detect that.

If I am correct, do you think the fix below (dedup first, then graceful warn-and-skip) would address this properly?

/* move dedup before the overflow guard */
idx = query_dfs_mount(fs_root);
if (idx >= 0)
    D_GOTO(out, rc = 0);   /* already registered — no new slot needed */

if (num_dfs >= MAX_DAOS_MT) {
    D_WARN("D_IL_MOUNT_POINT ignored: table full (%d mounts). "
           "Increase MAX_DAOS_MT or reduce simultaneous mounts.\n",
           MAX_DAOS_MT);
    D_GOTO(out, rc = 0);   /* graceful skip, interception disabled by caller */
}

Comment thread src/client/dfuse/pil4dfs/int_dfs.c Outdated
abort();
}
pt_dfs_mt = &dfs_list[num_dfs];
if (memcmp(fs_entry->mnt_type, STR_AND_SIZE(MNT_TYPE_FUSE)) == 0) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit

Suggested change
if (memcmp(fs_entry->mnt_type, STR_AND_SIZE(MNT_TYPE_FUSE)) == 0) {
if (memcmp(fs_entry->mnt_type, STR_AND_SIZE(MNT_TYPE_FUSE)) != 0)
continue;

Signed-off-by: Mohamad Chaarawi <mohamad.chaarawi@hpe.com>
@mchaarawi

Copy link
Copy Markdown
Contributor Author

From my understanding, the behaviour of discover_daos_mount_with_env() should also be changed to be consistent with the change done to discover_dfuse_mounts(). The same num_dfs >= MAX_DAOS_MT test returns an error that triggers an abort() via the caller init_myhook(). If the previous call to discover_dfuse_mounts() fills the last slot, the call to discover_daos_mount_with_env() will abort.

There is an additional ordering issue: the overflow guard fires before the dedup check (query_dfs_mount at L603). This means that even when D_IL_MOUNT_POINT points to a mount already registered by discover_dfuse_mounts() — so no new slot would be needed — the function aborts before it can detect that.

If I am correct, do you think the fix below (dedup first, then graceful warn-and-skip) would address this properly?

/* move dedup before the overflow guard */
idx = query_dfs_mount(fs_root);
if (idx >= 0)
    D_GOTO(out, rc = 0);   /* already registered — no new slot needed */

if (num_dfs >= MAX_DAOS_MT) {
    D_WARN("D_IL_MOUNT_POINT ignored: table full (%d mounts). "
           "Increase MAX_DAOS_MT or reduce simultaneous mounts.\n",
           MAX_DAOS_MT);
    D_GOTO(out, rc = 0);   /* graceful skip, interception disabled by caller */
}

while i totally agree with you, the env path is a test only scenario that i had long asked Lei to remove since it should never be used in production. but for the sake of consistency, ill push the change since it is quite straighforward.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

5 participants