Test misc-next (regular, SELF)#1629
Open
kdave wants to merge 10000 commits into
Open
Conversation
a464848 to
8f84140
Compare
d93f97d to
7d5a51d
Compare
When decoding osd_state and osd_weight from an incoming osdmap in osdmap_decode(), both are decoded for each osd, i.e., map->max_osd times. The ceph_decode_need() check only accounts for sizeof(*map->osd_weight) once. This can potentially result in an out-of-bounds memory access if the incoming message is corrupted such that the max_osd value exceeds the actual content of the osdmap message. This patch fixes the issue by changing the corresponding part in the ceph_decode_need() check to account for map->max_osd*sizeof(*map->osd_weight). Cc: stable@vger.kernel.org Fixes: dcbc919 ("libceph: switch osdmap decoding to use ceph_decode_entity_addr") Signed-off-by: Raphael Zimmer <raphael.zimmer@tu-ilmenau.de> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
cpuset_can_attach() currently adds the bandwidth of all migrating SCHED_DEADLINE tasks to sum_migrate_dl_bw. If the source and destination cpuset effective CPU masks do not overlap, the whole sum is then reserved in the destination root domain. set_cpus_allowed_dl(), however, subtracts bandwidth from the source root domain only when the affinity change really moves the task between root domains. A DL task can move between cpusets that are still in the same root domain, so including that task in sum_migrate_dl_bw can reserve destination bandwidth without a matching source-side subtraction. Share the root-domain move test with set_cpus_allowed_dl(). Keep nr_migrate_dl_tasks counting all migrating deadline tasks for cpuset DL task accounting, but add to sum_migrate_dl_bw only for tasks that need a root-domain bandwidth move. Keep using the destination cpuset effective CPU mask and leave the broader can_attach()/attach() transaction model unchanged. Fixes: 2ef269e ("cgroup/cpuset: Free DL BW in case can_attach() fails") Cc: stable@vger.kernel.org # v6.10+ Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn> Reviewed-by: Waiman Long <longman@redhat.com> Acked-by: Juri Lelli <juri.lelli@redhat.com> Tested-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
xe_bo_recompute_purgeable_state() walks all VMAs of a BO to determine
whether the BO can be made purgeable. This makes VMA create/destroy and
madvise updates O(n) in the number of mappings.
Replace the walk with BO-local counters protected by the BO dma-resv
lock:
- vma_count tracks the number of VMAs mapping the BO.
- willneed_count tracks active WILLNEED holders, including WILLNEED
VMAs and active dma-buf exports for non-imported BOs.
A DONTNEED BO is promoted back to WILLNEED on a 0->1 transition of
willneed_count. A BO is demoted to DONTNEED on a 1->0 transition only
when it still has VMAs, preserving the previous behaviour where a BO
with no mappings keeps its current madvise state.
PURGED remains terminal, preserving the existing "once purged, always
purged" rule.
Fixes: 4f44961 ("drm/xe/vm: Prevent binding of purged buffer objects")
v2:
- Use early return for imported BOs in all four helpers to avoid
nesting (Matt B).
- Group purgeability state into a purgeable sub-struct on struct
xe_bo (Matt B).
- Reword xe_bo_willneed_put_locked() kernel-doc to explain that a 1->0
transition means all remaining active VMAs are DONTNEED (Matt B).
v3:
- Move DONTNEED/PURGED reject from vma_lock_and_validate() into
xe_vma_create(), gated on attr->purgeable_state == WILLNEED.
Fixes vm_bind bypass and partial-unbind rejection on DONTNEED
BOs (Matt B).
- Drop .check_purged from MAP and REMAP; keep it for PREFETCH and
add a comment why (Matt B).
- Skip BO validation in vma_lock_and_validate() for non-WILLNEED
VMA remnants so cleanup/remap paths do not repopulate
DONTNEED/PURGED BOs.
Suggested-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Signed-off-by: Arvind Yadav <arvind.yadav@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patch.msgid.link/20260506132027.2556046-1-arvind.yadav@intel.com
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
(cherry picked from commit 23fb2ea)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
… flag The blitter engines' MEM_COPY and MEM_SET instructions were added as part of the same hardware change that introduced service copy engines (i.e., BCS1-BCS8) which is why the driver checks for service copy engine presence when deciding whether to use these instructions or the older XY_* instructions. However when making this decision the driver should consider which engines are part of the hardware architecture, not which engines are present/usable on the current device. For graphics IP versions that architecturally include service copy engines (i.e., everything Xe2 and later, plus PVC's Xe_HPC) we should use MEM_SET and MEM_COPY even in if all of the service copy engines wind up getting fused off. I.e., we need to decide based on whether the platform's graphics descriptor contains these engines, rather than whether the usable engine mask contains them. This logic got broken when gt->info.__engine_mask was removed, although in practice that mistake has been harmless so far because there haven't been any hardware SKUs that fuse off all of the service copy engines yet. Replace the incorrect has_service_copy_support() function with a GT feature flag that tracks more accurately whether the new blitter instructions are usable. In addition to fixing incorrect logic if all service copies are fused off, the flag also makes it more obvious what the calling code is trying to do; previously it wasn't terribly obvious why "has service copy engines" was being used as the condition for using different instructions on all copy engine types. The new feature flag is named 'has_xe2_blt_instructions' because we expect this flag to be set for all Xe2 and later platforms (i.e., everything officially supported by the Xe driver). Technically there's also one Xe1-era platform (PVC) that supports these engines/instructions and will set this flag, but this still seems to be the most clear and understandable name for the flag. Fixes: 61549a2 ("drm/xe: Drop __engine_mask") Cc: Balasubramani Vivekanandan <balasubramani.vivekanandan@intel.com> Reviewed-by: Balasubramani Vivekanandan <balasubramani.vivekanandan@intel.com> Link: https://patch.msgid.link/20260507-xe2_copy-v1-1-26506381b821@intel.com Signed-off-by: Matt Roper <matthew.d.roper@intel.com> (cherry picked from commit 09b3998) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
There look to be some nasty races here when triggering the invalidate_mappings hook: 1) We do xe_bo_alloc() followed by the attach, before the actual full bo init step in xe_dma_buf_init_obj(). However the bo is visible on the attachments list after the attach. This is bad since exporter driver, say amdgpu, can at any time call back into our invalidate_mappings hook, with an empty/bogus bo, leading to potential bugs/crashes. 2) Similar to 1) but here we get a UAF, when the invalidate_mappings hook is triggered. For example, we get as far as xe_bo_init_locked() but this fails in some way. But here the bo will be freed on error, but we still have it attached from dma-buf pov, so if the invalidate_mappings is now triggered then the bo we access is gone and we trigger UAF and more bugs/crashes. To fix this, move the attach step until after we actually have a fully set up buffer object. Note that the bo is not published to userspace until later, so not sure what the comment "Don't publish the bo until we have a valid attachment", is referring to. We have at least two different customers reporting hitting a NULL ptr deref in evict_flags when importing something from amdgpu, followed by triggering the evict flow. Hit rate is also pretty low, which would hint at some kind of race, so something like 1) or 2) might explain this. v2: - Shuffle the order of the ops slightly (no functional change) - Improve the comment to better explain the ordering (Matt B) Assisted-by: Gemini:gemini-3 #debug Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/work_items/7903 Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/work_items/4055 Fixes: dd08ebf ("drm/xe: Introduce a new DRM driver for Intel GPUs") Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: <stable@vger.kernel.org> # v6.8+ Reviewed-by: Matthew Brost <matthew.brost@intel.com> Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Link: https://patch.msgid.link/20260508102635.149172-3-matthew.auld@intel.com (cherry picked from commit af1f2ad) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Retry doesn't work here, since bo will be freed on error, leading to UAF. However, now that we do the alloc & init before the attach, we can now combine this as one unit and have the init do the alloc for us. This should make the retry safe. Reported by Sashiko. v2: Fix up the error unwind (CI) Closes: https://sashiko.dev/#/patchset/20260506184332.86743-2-matthew.auld%40intel.com Fixes: eb289a5 ("drm/xe: Convert xe_dma_buf.c for exhaustive eviction") Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: <stable@vger.kernel.org> # v6.18+ Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Link: https://patch.msgid.link/20260508102635.149172-4-matthew.auld@intel.com (cherry picked from commit 4796694) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
The purpose of a GPU reset is to make sure that fence can be signaled again and the signal and resume workers can make progress again. So waiting for the resume worker or any fence in the GPU reset path is just utterly nonsense. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Prike Liang <Prike.Liang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit fcd5f06)
This one was fortunately not looking so bad as the wait ioctl path, but there were still a few things which could be fixed/improved: 1. Allocating with GFP_ATOMIC was quite unnecessary, we can do that before taking the userq_lock. 2. Use a new mutex as protection for the fence_drv_xa so that we can do memory allocations while holding it. 3. Starting the reset timer is unnecessary when the fence is already signaled when we create it. 4. Cleanup error handling, avoid trying to free the queue when we don't even got one. v2: fix incorrect usage of xa_find, destroy the new mutex on error v3: cleanup ref ordering Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Sunil Khatri <sunil.khatri@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 1609eb0)
…queues Well the reset handling seems broken on multiple levels. As first step of fixing this remove most calls to the hang detection. That function should only be called after we run into a timeout! And *NOT* as random check spread over the code in multiple places. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Sunil Khatri <sunil.khatri@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 71bea36)
Fix lock inversions pointed out by Prike and Sunil. The hang detection timeout *CAN'T* grab locks under which we wait for fences, especially not the userq_mutex lock. Then instead of this completely broken handling with the hang_detect_fence just cancel the work when fences are processed and re-start if necessary. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Sunil Khatri <sunil.khatri@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 1b62077)
…REEMPTION_ENABLED [Why] dcn32_validate_bandwidth() wraps dcn32_internal_validate_bw() with DC_FP_START()/DC_FP_END(). In x86 non-RT, DC_FP_START takes fpregs_lock(), which disables local softirqs. The DML1 path through dcn32_enable_phantom_plane() calls kvzalloc() to allocate ~335 KiB for dc_plane_state. This triggers the vmalloc path, which calls BUG_ON(in_interrupt()) because it's invoked within the FPU-enabled (softirq disabled) region, leading to a kernel crash. [How] Wrap the dc_state_create_phantom_plane() call with the DC_RUN_WITH_PREEMPTION_ENABLED() macro to allow preemption during this memory allocation. Fixes: 235c676 ("drm/amd/display: add DCN32/321 specific files for Display Core") Closes: https://gitlab.freedesktop.org/drm/amd/-/work_items/4470 Reviewed-by: Aurabindo Pillai <aurabindo.pillai@amd.com> Signed-off-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Signed-off-by: James Lin <pinglei.lin@amd.com> Tested-by: Daniel Wheeler <daniel.wheeler@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 885ccbe) Cc: stable@vger.kernel.org
The legacy CPER debugfs reader can reach the payload path without a valid pointer snapshot. The remaining user byte count is also treated as the ring occupancy in dwords, so reads past the header can copy more than requested. Take the CPER lock before sampling pointers. Resample rptr/wptr for payload reads, bound the payload copy by available dwords and the remaining user size, and advance the file position for each dword copied. Signed-off-by: Xiang Liu <xiang.liu@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 1e40ef8)
gfx_v12_0_init_microcode() always loads RS64 CP ucode but never set adev->gfx.rs64_enable, so it stayed false and code that branches on it (e.g. MEC pipe reset) used the legacy CP_MEC_CNTL path incorrectly. Match GFX11: derive RS64 mode from the PFP firmware header (v2.0) via amdgpu_ucode_hdr_version(). Log at debug when RS64 is enabled. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Jesse Zhang <jesse.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit b03d535)
In scx_root_enable_workfn(), put_task_struct(p) is called before scx_error() dereferences p->comm and p->pid. If the iterator's reference is the last drop, the task is freed synchronously and the deref becomes a UAF. Move put_task_struct() past scx_error(). Reported-by: Sashiko <sashiko-bot@kernel.org> Closes: https://lore.kernel.org/all/20260511214031.AF5E9C2BCB0@smtp.kernel.org/ Fixes: f0e1a06 ("sched_ext: Implement BPF extensible scheduler class") Cc: stable@vger.kernel.org # v6.12+ Signed-off-by: Tejun Heo <tj@kernel.org>
…rg/pub/scm/linux/kernel/git/shuah/linux-kselftest Pull kunit fixes from Shuah Khan: "Fix to decouple KUNIT_DEBUGFS and KUNIT_ALL_TESTS options and fix KUNIT_DEBUGFS dependencies so it depends on DEBUG_FS without which it will not be useful" * tag 'linux_kselftest-kunit-fixes-7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: kunit: config: KUNIT_DEBUGFS should depend on DEBUG_FS kunit: config: Enable KUNIT_DEBUGFS by default
When iov_iter_get_pages2() fails in rds_message_zcopy_from_user(), the pinned pages are released with put_page(), and rm->data.op_mmp_znotifier is cleared. But we fail to properly clear rm->data.op_nents. Later when rds_message_purge() is called from rds_sendmsg() the cleanup loop iterates over the incorrectly non zero number of op_nents and frees them again. Fix this by properly resetting op_nents when it should be in rds_message_zcopy_from_user(). Fixes: 0cebacc ("rds: zerocopy Tx support.") Signed-off-by: Allison Henderson <achender@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260505234336.2132721-1-achender@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ena_phc_gettimex64() is setting the output parameter regardless of whether ena_com_phc_get_timestamp() succeeded or failed. When ena_com_phc_get_timestamp() returns an error, the timestamp parameter may contain uninitialized stack memory (e.g., when PHC is disabled or in blocked state) or invalid hardware values. Passing these to userspace via the PTP ioctl is both a security issue (information leak) and a correctness bug. Fix by checking the return code after releasing the lock and only setting the output timestamp on success. Fixes: e0ea341 ("net: ena: Add PHC support in the ENA driver") Cc: stable@vger.kernel.org Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Link: https://patch.msgid.link/20260507003518.22554-1-akiyano@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The function extent_invalidate_folio() has only a single caller inside btree_invalidate_folio(). There is no need to export such a function just for a single caller inside another file. Unexport extent_invalidate_folio() and move it to disk-io.c. And since we're moving the code, update the commit to match the current style, and remove the seemingly stale comment on the extent state removal, it's better explained by the comment just before btrfs_unlock_extent(). Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
The btree inode is very different from regular data inodes, as the btree inode is never exposed to user space operations. All operations are either initiated by btrfs metadata operations, or MM layer like memory pressure to release folios. This means we never need to handle partial folio invalidation inside btree_invalidate_folio(). With that said, we can slightly simplify the btree folio invalidation by: - Add ASSERT()s to make sure the range covers the whole folio - Remove "if (start > end)" check As the range always covers the full folio, that check is always false and can be removed. - Open code extent_invalidate_folio() Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
The fs_info is available everywhere and we don't need to store it inside a structure that is used within one function only, which is build_backref_tree(). The size of btrfs_backref_iter is now 48 bytes. Signed-off-by: David Sterba <dsterba@suse.com>
The iterator is used only once and within build_backref_tree() so we can avoid one allocation and place it on stack. Signed-off-by: David Sterba <dsterba@suse.com>
When writing into a preallocated extent, ordered extent completion calls btrfs_mark_extent_written() to convert the file extent item from the BTRFS_FILE_EXTENT_PREALLOC type to the BTRFS_FILE_EXTENT_REG type. If the preallocated extent was created beyond i_size with fallocate keep-size, and the inode is evicted and loaded again before the write, the inode's file_extent_tree is initialized only up to i_size. The beyond i_size prealloc extent is therefore not tracked there. After a write into that extent extends i_size, btrfs_mark_extent_written() updates the file extent item, but the corresponding range is not marked dirty in the inode's file_extent_tree. This can leave disk_i_size stale when the filesystem does not use the no-holes feature, so after remount the file size can go back to the old value. The following reproducer triggers the problem: $ cat test.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi mkfs.btrfs -f -O ^no-holes $DEV mount $DEV $MNT touch $MNT/file fallocate -n -l 2M $MNT/file umount $MNT mount $DEV $MNT dd if=/dev/zero of=$MNT/file bs=1M count=1 conv=notrunc ls -lh $MNT/file umount $MNT mount $DEV $MNT ls -lh $MNT/file umount $MNT Running the reproducer gives the following result: $ ./test.sh (...) 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.000596024 s, 1.8 GB/s -rw-rw-r-- 1 root root 1.0M May 8 16:34 /mnt/sdi/file -rw-rw-r-- 1 root root 0 May 8 16:34 /mnt/sdi/file Fix this by marking the written range dirty in the inode's file_extent_tree after successfully converting the prealloc extent to a regular extent. Fixes: 9ddc959 ("btrfs: use the file extent tree infrastructure") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Robbie Ko <robbieko@synology.com> [ Minor change log updates ] Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_load_free_space_tree() reads FREE_SPACE_INFO once and then chooses the bitmap or extent loader for all following free-space records until the next FREE_SPACE_INFO item. Those loaders currently enforce the selected record type only with ASSERT(). On production builds without CONFIG_BTRFS_ASSERT, a malformed free-space tree can therefore be decoded in the wrong mode. An EXTENT item can reach btrfs_free_space_test_bit() as bitmap data, while a BITMAP item can be added as a full free extent. The latter corrupts the in-memory free-space cache and the former can read beyond the item payload. Sanitizer validation reported: general protection fault Call trace: assert_eb_folio_uptodate() (fs/btrfs/extent_io.c:4134) extent_buffer_test_bit() (?:?) btrfs_free_space_test_bit() (fs/btrfs/free-space-tree.c:518) srso_alias_return_thunk() (arch/x86/include/asm/nospec-branch.h:375) __entry_text_end() (?:?) __asan_memcpy() (mm/kasan/shadow.c:103) read_extent_buffer() (?:?) load_free_space_bitmaps() (fs/btrfs/free-space-tree.c:1548) btrfs_get_32() (fs/btrfs/free-space-tree.c:?) btrfs_set_16() (fs/btrfs/free-space-tree.c:?) kmem_cache_alloc_noprof() (?:?) btrfs_load_free_space_tree() (fs/btrfs/free-space-tree.c:1685) load_free_space_tree_for_test() (?:?) rcu_disable_urgency_upon_qs() (kernel/rcu/tree.c:721) vprintk_emit() (?:?) __up_write() (kernel/locking/rwsem.c:1401) clone_commit_root_for_test() (?:?) test_extent_as_bitmap_mode_mismatch() (?:?) kmem_cache_free() (?:?) btrfs_free_path() (fs/btrfs/free-space-tree.c:1449) __add_block_group_free_space() (fs/btrfs/free-space-tree.c:20) run_test() (?:?) do_raw_spin_unlock() (?:?) btrfs_test_free_space_tree() (fs/btrfs/tests/free-space-tree-tests.c:547) btrfs_test_qgroups() (fs/btrfs/tests/qgroup-tests.c:462) btrfs_run_sanity_tests() (fs/btrfs/free-space-tree.c:?) init_btrfs_fs() (fs/btrfs/super.c:2690) do_one_initcall() (init/main.c:1382) __kasan_kmalloc() (?:?) rcu_is_watching() (?:?) do_initcalls() (init/main.c:1457) kernel_init_freeable() (init/main.c:1674) kernel_init() (init/main.c:1584) ret_from_fork() (?:?) __switch_to() (?:?) ret_from_fork_asm() (?:?) Validate every post-info key before decoding it. Reject keys whose type does not match the mode selected by FREE_SPACE_INFO, and reject keys whose range extends past the block group, returning -EUCLEAN instead of feeding the wrong record type to the bitmap or extent decoder. Also reject zero-length FREE_SPACE_EXTENT items in tree-checker, matching the existing FREE_SPACE_BITMAP zero-length check. This keeps the loader range check simple and prevents a zero-length extent item from being a valid on-disk free-space record. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Zhang Cen <rollkingzzc@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
ROOT_REF and ROOT_BACKREF items contain a struct btrfs_root_ref followed by the subvolume name. Several readers assume that this layout is already valid and then use the on-disk name length directly. A corrupted item can therefore make those readers address bytes outside the item, and BTRFS_IOC_GET_SUBVOL_INFO can copy too many bytes into its fixed-size UAPI name buffer. Validate ROOT_REF and ROOT_BACKREF items in tree-checker before any reader uses them. Reject records that do not contain a non-empty name, whose name_len does not exactly describe the remaining item payload, or whose name exceeds BTRFS_NAME_LEN. For BTRFS_IOC_GET_SUBVOL_INFO, copy only the validated on-disk name_len instead of deriving the copy length from the item size. The ioctl result is zeroed when allocated. That leaves the existing trailing zero byte untouched. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Zhang Cen <rollkingzzc@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
find_first_inode() and find_first_inode_to_shrink() lock root->inodes, then loop over them, occasionally skipping some inodes. When they skip an inode, they attempt to share the cpu/lock with cond_resched_lock(). However, that has a subtle problem associated with it. cond_resched_lock() only drops the lock if it needs to actually call schedule(). With CONFIG_PREEMPT_NONE, this means the full timeslice as detected at ticks. With 8+ cpus and default tunables, this is 2.8ms. So regardless of HZ, we will run for at least 2.8ms in this loop without dropping the lock, assuming it finds no suitable inodes. If HZ is small enough, it might be even worse as the tick granularity becomes bigger than the timeslice. The knock-on effect of this is that callers to btrfs_del_inode_from_root() like kswapd trying to shrink the inode slab or userspace threads calling evict() will spin on xa_lock(&root->inodes) for 2.8ms, so the extent map shrinker dominates the lock even though ostensibly it is intending to share it. This produces memory pressure as there is only one kswapd and it runs sequentially so it can get stuck in the inode slab shrinking. To fix it, simply replace cond_resched_lock() with an open coded variant which unconditionally does unlock/lock around cond_resched. Sharing the lock is decoupled from sharing the CPU, and all the users of the lock now share it fairly. I was able to reproduce this on test systems by producing a lot of empty files (to make a big root->inodes xarray), then producing memory pressure by reading large files larger than ram, triggering kswapd and the extent_map shrinker. The lock contention is visible with perf or lockstat. This patch also relieved a user-apparent bottleneck on a production system from the original report. Tested-by: Rik van Riel <riel@surriel.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
Currently there are two members inside btrfs_folio_state that are related to locked bitmap: - locked sub-bitmap inside btrfs_folio_state::bitmaps[] The enum btrfs_bitmap_nr_locked determines the sub-bitmap. - btrfs_folio_state::nr_locked Which records how many blocks are locked inside the folio. The locked sub-bitmap is a btrfs specific per-block tracking mechanism, which is mostly for async-submission, utilized by compressed writes. The sub-bitmap itself is a super set of nr_locked, as it can provide a more reliable tracking. But the sub-bitmap itself can be pretty large for the incoming huge folio, 2M sized folio for 4K page size, meaning 512 bits for one sub-bitmap. Furthermore, in the long run compression will be reworked to get rid of async-submission completely, there is not much need for a full sub-bitmap to track the locked status. This patch removes the locked sub-bitmap and only relies on @nr_locked atomic to do the tracking. This can also save 64 bytes from btrfs_folio_state::bitmaps[] for a huge folio. This will reduce some safety checks, as previously if a block is not locked, btrfs_folio_end_lock()/btrfs_folio_end_lock_bitmap() will find out that, and skip reducing @nr_locked for that block, and avoid under-flow. But this safety net itself shouldn't be necessary in the first place. If we're unlocking a block that is not locked, it's a bug in the logic, and we should catch it, not silently ignoring it. Thus I believe the removal of the extra safety net should not be a problem. Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs detects dirty folio which doesn't have an ordered extent at extent_writepage_io(), but that is not ideal: - The check is not handling all dirty blocks We can have multiple blocks inside a large folio, but the whole folio is marked ordered as long as there is one ordered extent in the range. We can still hit cases where some dirty blocks do not have corresponding ordered extents. Instead of checking the folio ordered flags, do the check at alloc_new_bio(), where we're already searching for ordered extents for writebacks. If we didn't find an ordered extent, we should already give an error message and notify the caller there is something wrong. This allows us to check every block that goes through submit_extent_folio(). With this new and more reliable check, we can remove the old check. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Currently during folio writeback, we call folio_clear_dirty_for_io()
before extent_writepage(), which causes folio dirty flag to be cleared,
but without touching the subpage bitmaps.
This works fine for the bio submission path, as we always call
btrfs_folio_clear_dirty() to clear the subpage bitmap.
But this is far from consistent, thus this patch is going to unify the
behavior to always use btrfs_folio_clear_dirty() helper to clear both
folio flag and subpage bitmap.
This involves:
- Replace folio_clear_dirty_for_io() with folio_test_dirty()
There is only one call site calling folio_clear_dirty_for_io() outside
of subpage.c, that's inside extent_write_cache_pages() just before
extent_writepage().
- Make btrfs_invalidate_folio() clear dirty range for the whole folio
The function btrfs_invalidate_folio() is also called during
extent_writepage().
If we had a folio completely beyond isize, we call
folio_invalidate() -> btrfs_invalidate_folio() to free the folio.
Since we no longer have folio_clear_dirty_for_io() to clear the folio
dirty flag, we must manually clear the folio dirty flag for the
to-be-invalidated folio, and also clear the PAGECACHE_TAG_DIRTY tag.
The tag clearing is done using a new helper,
btrfs_clear_folio_dirty_tag(), which is almost the same as the old
btree_clear_folio_dirty_tag(), but with minor improvements including:
* Remove the folio_test_dirty() check
We have already done an ASSERT().
* Add an ASSERT() to make sure folio is mapped
- Add extra ASSERT()s before clearing folio private
During development I hit dirty folios without the private flag set,
and that caused a lot of ASSERT()s.
The reason is that btrfs_invalidate_folio() is relying on the dirty
flag being cleared when it's called from extent_writepage().
Add extra ASSERT()s inside clear_folio_extent_mapped() to catch
wild dirty/writeback flags.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
…ated Currently there are only two folio ordered flag users: - extent_writepage_io() To ensure the folio range has an ordered extent covering it. This is from the legacy COW fixup mechanism, which is already removed and only a simple check is left. - btrfs_invalidate_folio() This is to avoid race with end_bbio_data_write(), where btrfs_finish_ordered_extent() will be called to handle the OE finishing. But for btrfs_invalidate_folio() we have already waited for the folio writeback to finish, and locked the folio. This means we can use the dirty flag to check if a range is already submitted or not. If the OE range is not dirty, it means the range has been submitted and its dirty flag was cleared. And since we have already waited for writeback, the endio function will handle the OE finishing. Thus if the range is not dirty, we must skip the range. If the OE range is dirty, it means we have allocated an ordered extent but have not yet submitted the range. And that's exactly the case where we need to truncate the ordered extent. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
This involves: - The ASSERT() inside end_bbio_data_write() It's only an ASSERT() and it has never been triggered as far as I know. - btrfs_migrate_folio() Since all folio_test_ordered() usage will be removed, there is no need to copy the folio ordered flag. - The ASSERT() inside btrfs_invalidate_folio() This one has its usefulness as it indeed caught some bugs during development. But that's the last user and will not be worth the folio flag or the subpage bitmap. This will allow btrfs to finally remove the ordered flags. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Btrfs has an internal flag/subpage bitmap called ordered, which is to indicate that a block has corresponding ordered extent covering it. However this requires extra synchronization between the inode ordered tree, and the folio flag/subpage bitmap, not to mention we need to maintain the extra folio flag with subpage bitmap. As a step to align btrfs_folio_state more closely to iomap_folio_state, remove the btrfs specific ordered flag/bitmap. This will also save us 64 bytes for the bitmap of a huge folio. Since we're here, also update the ASCII graph of the bitmap, as there are only 3 sub-bitmaps now, show all sub-bitmaps directly. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
The invariant that we want to maintain with subvolume qgroups is that the qgroup can only be deleted if there is no root. With squotas, we thought that it was sufficient to just check the usage, because we assumed that deleting a subvolume will drive it's qgroups usage to 0, and thus 0 usage implies no subvolume. However, this is false, for two reasons: - A subvol whose extents are all from before squotas was enabled. - A subvol that was created in this transaction and for which we have not yet run any delayed refs. In both cases, deleting the qgroup breaks the desired invariant and we are left with a subvolume with no qgroup but squotas are enabled. Fix this by unifying the deletion check logic between full qgroups and squotas. Squotas do all the same checks *and* the additional usage == 0 check, which is the one extra rule peculiar to squotas. Link: https://lore.kernel.org/linux-btrfs/adnBhWfJQ1n3hZC8@merlins.org/ Fixes: a8df356 ("btrfs: forbid deleting live subvol qgroup") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
The first transaction that enables squotas is special and a bit tricky. We have to set BTRFS_FS_QUOTA_ENABLED after the transaction to avoid a deadlock, so any delayed refs that run before we set the bit are not squota accounted. For data this is fine, we don't get an owner_ref, so there is no real harm, it's as if the extent predated squotas. However for metadata, the tree block will have gen == enable_gen so when we free it later, we will decrement the squota accounting, which can result in an underflow. Before it is freed, btrfs check shows errors, as we have mismatched usage between the node generations/owners and the squota values. There are two angles to this fix: 1. For extents that come in delayed_refs that run during the enable_gen transaction, we must actually set enable_gen to the *next* transaction. That is the first transaction that we can really properly account in any way. 2. For extents that come in between the end of our transaction handle and the time we set the BTRFS_FS_QUOTA_ENABLED bit, we need an additional bit, BTRFS_FS_SQUOTA_ENABLING which only affects recording squota deltas, so we do pick up those extents. Otherwise, we would miss them, even for enable_gen + 1. Fixes: bd7c1ea ("btrfs: qgroup: check generation when recording simple quota delta") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Simple quota accounting can undercount metadata tree block allocations in certain scenarios. When an undercounted subvolume is deleted and its tree blocks freed, the free deltas decrement rfer/excl past zero, wrapping the u64 to a value near U64_MAX. Once wrapped, can_delete_squota_qgroup() sees non-zero rfer and refuses to delete the qgroup. The qgroup becomes permanently orphaned in the quota tree, since there is no subvolume left to generate frees that would bring the counter back to zero. While we ultimately want to fix any mis-accounting at the source, it is also helpful and worthwhile to mitigate the damage by clamping rfer and excl to zero on underflow rather than allowing the u64 to wrap. This at least allows us to clean up the messed up qgroups on subvol deletion. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
I thought that it was likely I could harden squota deletion to the point that it was impossible to end up with an extent accounted to a qgroup outliving its qgroup. Several recent bugs have made me re-consider that position. Ultimately, this is a tradeoff between short term stability and long term strictness, but I think given that there could be another layer of bugs behind the 2-3 I just fixed, I would feel much more confident in people using squotas if the risk was "your values can get a bit out of whack which you can fix by deleting stuff or disabling/re-enabling/repairing" vs "it will abort your filesystem". As the final nail in the coffin, the Meta production kernel was lacking earlier fixes from me and Qu regarding subvol qgroup lifetime, so this is what we have been testing at scale, so I think at least for now upstream should have the same extra layer of protection. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Currently both check_free_space_extent() and check_free_space_bitmap()
share a very common validation on the keys.
Extract them into a helper, check_free_space_common_key(), and
change the output string ("extent" or "bitmap") depending on the key type.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add an extra check to ensure the free space extent/bitmap and space info keys won't overflow. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
This introduces extra checks using the previous key. If there is a previous key, we can do extra validations: - The previous key is FREE_SPACE_INFO This means the current extent/bitmap should be inside the free space info key range. And matches the type of the free space info. - The previous key is FREE_SPACE_EXTENT or FREE_SPACE_BITMAP In that case both the current and previous key should belong to the same block group. Thus the key type must match, and no overlap between the two keys. These extra checks are inspired by the recently added type checks during free space tree loading. The new tree-checker checks will allow earlier detection, but the loading time checks are still needed, as the tree-checker checks are still inside the same leaf, not matching per-bg level checks. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Any commits after this one are for testing and evaluation only. Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_record_root_in_trans() has a lockless fast path for shareable roots. It skips reloc_mutex when root->last_trans matches the current transaction and BTRFS_ROOT_IN_TRANS_SETUP is clear. The writer side publishes that state in two phases: it sets IN_TRANS_SETUP before updating root->last_trans, then clears the bit after btrfs_init_reloc_root() finishes. However, the reader-side smp_rmb() is before both loads, so it does not order the last_trans load against the later bit test. A reader can observe the new last_trans value while missing the setup bit and return before the relocation-root setup is complete. Read root->last_trans first, then issue the read barrier before testing IN_TRANS_SETUP. Also use clear_bit_unlock() for the writer's final clear and test_bit_acquire() for the successful fast path, so the lockless return observes the setup done before the bit was cleared. Fixes: 7585717 ("Btrfs: fix relocation races") Signed-off-by: Cen Zhang <zzzccc427@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
The comments at the beginning of subpage.c are out-of-date, a lot of the limits are already resolved. Update them to reflect the latest status. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
…maps [CURRENT LIMIT] Btrfs currently only supports sub-bitmaps (e.g. dirty bitmap) no larger than BITS_PER_LONG. That limit allows us to easily grab an unsigned long without the need to properly allocate memory for a larger bitmap. Unfortunately that limit prevents us from supporting huge folios. For 4K page size and block size, a huge folio (order 9) means 512 blocks inside a 2M folio. [ENHANCEMENT] To allow direct bitmap operations without allocating new memory, introduce two different ways to access the subpage bitmaps: - Return an unsigned long value This only happens if blocks_per_folio <= BITS_PER_LONG. We read out the sub-bitmap into an unsigned long, and return the value. This is the old existing method. This involves get_bitmap_value_##name() helper functions. And this time the helper functions are defined as inline functions instead of macros to provide better type checks. - Return a pointer where the sub-bitmap starts This only happens if blocks_per_folio >= BITS_PER_LONG. This is the new method for sub-bitmaps larger than BITS_PER_LONG. Since the sizes of sub-bitmaps are all aligned to BITS_PER_LONG, we can directly access the start word of the sub-bitmap. This involves get_bitmap_pointer_##name() helper functions. Then change the existing sub-bitmaps users to use the new helpers: - Bitmap dumping Switch between get_bitmap_value_##name() and get_bitmap_pointer_##name() depending on the sub-bitmap size. - btrfs_get_subpage_dirty_bitmap() Rename it to btrfs_get_subpage_dirty_bitmap_value() to follow the new value/pointer naming. Since we do not support huge folios yet, there is no pointer version for the dirty bitmap. Furthermore, add the support for bs == ps cases for btrfs_get_subpage_dirty_bitmap_value(), so that the caller no longer needs to check if the folio needs subpage handling. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
[CURRENT LIMIT] Btrfs currently only supports sub-bitmaps (e.g. dirty bitmap) no larger than BITS_PER_LONG. One call site that utilizes this limit is btrfs_bio_ctrl::submit_bitmap, which makes it very simple and straightforward to just grab an unsigned long value and assign it to submit_bitmap. Unfortunately that limit prevents us from supporting huge folios. For 4K page size and block size, a huge folio (order 9) means 512 blocks inside a 2M folio. [ENHANCEMENT] Instead of using a fixed unsigned long value, change btrfs_bio_ctrl::submit_bitmap to an unsigned long pointer. And for cases where an unsigned long can hold the whole bitmap, introduce @submit_bitmap_value, and just point that pointer to that unsigned long. Then update all direct users of bio_ctrl->submit_bitmap to use the pointer version. There are several call sites that get extra changes: - @range_bitmap inside extent_writepage_io() Which is only utilized to truncate the bitmap. Since we do not want to allocate new memory just for such temporary usage, change the original bitmap_set() and bitmap_and() into bitmap_clear() for the ranges outside of the target range. - Getting dirty subpage bitmap inside writepage_delalloc() Since we're passing an unsigned long pointer now, we need to go with different handling (bs == ps, blocks_per_folio <= BITS_PER_LONG, blocks_per_folio > BITS_PER_LONG). Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
With all the previous preparations, it's finally time to enable the huge folio support. - The max folio size Here we define BTRFS_MAX_FOLIO_SIZE, which is fixed at 2MiB. This will ensure we have a large enough but not too large folio for btrfs. This limit applies to all systems regardless of page size. Then we also define BTRFS_MAX_BLOCKS_PER_FOLIO, which depends on CONFIG_BTRFS_EXPERIMENTAL. If it's an experimental build, BTRFS_MAX_BLOCKS_PER_FOLIO is 512, otherwise it's BITS_PER_LONG. The filemap max order will be calculated using both BTRFS_MAX_FOLIO_SIZE and BTRFS_MAX_BLOCKS_PER_FOLIO. E.g. for 64K page size with 64K fs block size, the limit will be BTRFS_MAX_FOLIO_SIZE (2M), which limits the filemap max order to 5. This will be lower than the old order (6), but folios larger than 2M are rarely any better for IO performance. Meanwhile excessively large folios can cause other problems like stalling the IO pipeline for too long. For 4K page size and 4K fs block size, the limit will be increased to 2M from the old 256K. This new size is constrained by both BTRFS_MAX_FOLIO_SIZE (2M) and BTRFS_MAX_BLOCKS_PER_FOLIO (512 * 4K), allowing x86_64 to reach huge folio support, and the filemap max order will be 9. - btrfs_bio_ctrl::submit_bitmap This will be enlarged to contain BTRFS_MAX_BLOCKS_PER_FOLIO bits, and this will be on-stack memory. This will increase on-stack memory usage by 56 bytes compared to the baseline (before the first patch in the series). - Local @delalloc_bitmap inside writepage_delalloc() Unfortunately we cannot afford to handle an allocation error here, thus again we use on-stack memory. Thus this will increase on-stack memory usage by 56 bytes again. So unfortunately this means during the delalloc window, the writeback path will have +112 bytes on-stack memory usage, and for other cases the writeback path will have +56 bytes on-stack memory usage. The +56 bytes (btrfs_bio_ctrl::submit_bitmap) can be removed after we have reworked the compression submission, so the current on-stack submit_bitmap is mostly a workaround until then. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
…on()
In preparation to encode more information to the error value add a step
that verifies if the value is valid (i.e. < 0). This works for
compile-time and runtime (in debugging mode).
The compile-time check recognizes direct constants and defines an array
type. An invalid condition leads to negative array size which is caught
by compiler.
The runtime check constructs the array type from the condition and only
verifies the correct size, as we don't need to tweak the size to be
negative.
The sizeof() expressions do not generate any code. In the debugging
config the warning adds about 9KiB of btrfs.ko code size.
The array size trick is needed as we can't use static_array(), not even
with __builtin_constant_p().
Sample error message:
In file included from inode.c:40:
inode.c: In function ‘__cow_file_range_inline’:
transaction.h:261:26: error: size of unnamed array is negative
261 | (void)sizeof(char[-!(__builtin_constant_p(error) ? (error) < 0 : 1)]); \
| ^
transaction.h:275:9: note: in expansion of macro ‘VERIFY_NEGATIVE_ERROR’
275 | VERIFY_NEGATIVE_ERROR(error); \
| ^~~~~~~~~~~~~~~~~~~~~
inode.c:665:17: note: in expansion of macro ‘btrfs_abort_transaction’
665 | btrfs_abort_transaction(trans, 17);
| ^~~~~~~~~~~~~~~~~~~~~~~
Signed-off-by: David Sterba <dsterba@suse.com>
Optimize the btrfs_abort_transaction() for size as it (by our convention) must be put right after the error condition is detected. The exact file:line is reported so there's a portion that must be inlined. As this is cold code it bloats functions. In previous patch "btrfs: move transaction abort message to __btrfs_abort_transaction()" the error message was moved to the common helper, saving like 20KiB of btrfs.ko and several instructions per call site and some stack space. There's little left to be optimized, we need to keep the atomic test_and_set_bit() and to convey that as 'first hit' to __btrfs_abort_transaction(). Right now it's a bool, which takes 8 bytes on stack for each call but it's 1 bit of information. We can encode that to some of the other parameters. For that let's use the 'error' parameter, by convention it's negative errno so we can reliably detect if it's the first hit or a later error. Also the negation is usually implemented by a single instruction (NEG on x86_64) so the resulting object code is kept short. This reduces btrfs.ko by 8K and stack in several functions by 8 bytes. Cumulative effect with the other commit is -30K of btrfs.ko. While the encoding is an implementation detail, it's contained within the API. Making the transaction abort calls very light is desired. Signed-off-by: David Sterba <dsterba@suse.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.