Redfs ubuntu hwe 6.17.0 16.16 24.04.1 #101
hbirth wants to merge 46 commits into DDNStorage:redfs-ubuntu-hwe-6.17.0-16.16-24.04.1 from
Conversation
The clone from the ubuntu repository is also in here as a separate branch, in case we have to check back some time.
d1ad060 to 863185e
@hbirth I think the last patch needs to be integrated into the other patches causing it, and I'm confused about things like
I thought you had a script to check if all patches are there, so I even kept the sequence the same.
There are too many exceptions already - we need to create a whitelist per branch.
And for things that are merged upstream and have the cherry-pick ID in our 6.8 branch - the script already handles it. The order does not matter, it just needs to be there.
OK, so I shouldn't have applied those and fixed them later?
Yeah, I think the branch should only get patches that are absolutely needed. I had made an exception for the el9.6 branch and applied an empty patch to make the script happy. But in the end, all the other branches still show differences - we need to introduce a whitelist.
OK, if I do that, the patches will still not apply cleanly, so I have to fix them ... sometimes it's almost a rewrite, since the context changed. That's why I fixed all that later. So now I should create separate patches for that, or stop for every patch and compile and test? ... I will not know, after I have applied a patch - even if it applies cleanly - whether the result is the correct functionality (e.g. for iomap and DLM_LOCK the logic changed in the write iterator). My problem is that I will need fixes. Where do they go? Since you don't want a separate patch only for the fixes. I thought it would be easier if we have the series of patches and then one where we fix the kernel; then we know if a line was from a wrong fix, or we missed it and it was from the original patch that did not actually apply to this version. BTW, if I had to do a 6.19, I would omit the last patch but keep all the rest, and check afterwards how much of the fix applies.
863185e to 90a7d83
filtered the patches that were already in Ubuntu-hwe-6.17 |
90a7d83 to 8feba64
if we have to do this multiple times, we should think about squashing OUR changes to a couple of topics: page pinning, io-uring fixes, DLM_LOCK, etc. |
There is a good reason why upstream-first is preferred ;) It keeps things on the management agenda and avoids such changes. I think for upstream changes it makes sense to revert these internal patches and then to apply what upstream eventually gets.
	args->out_args[0].size = count;
}

static void fuse_release_user_pages(struct fuse_args_pages *ap, ssize_t nres,
2nd patch is not right, I had only done that in the 6.8 branch because it had introduced bad conflicts. Should not be needed for 6.17
8feba64 to c683e87
This is especially needed for better ftrace analysis, for example to build histograms. So far the request unique was missing, because it was added after the first trace message. IDs/req-unique now might not come up perfectly sequentially anymore, but especially with a cloned device or io-uring this did not work perfectly anyway.
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
fuse_uring_send_next_to_ring() can just call into fuse_uring_send() and avoid code duplication.
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Rename trace_fuse_request_send to trace_fuse_request_enqueue
Add trace_fuse_request_send
Add trace_fuse_request_bg_enqueue
Add trace_fuse_request_enqueue
This helps to track entire request time and time in different queues.
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
c683e87 to ee96a90
This is preparation for follow-up commits that allow running with a reduced number of queues.
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Add per-CPU and per-NUMA node bitmasks to track which io-uring queues are registered. Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Queue selection (fuse_uring_get_queue) can now handle a reduced number of queues - using io-uring is possible even with a single queue and entry. The FUSE_URING_REDUCED_Q flag is introduced to tell the fuse server that reduced queues are possible, i.e. if the flag is set, the fuse server is free to reduce the number of queues.
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Running background IO on a different core makes quite a difference.

fio --directory=/tmp/dest --name=iops.\$jobnum --rw=randread \
    --bs=4k --size=1G --numjobs=1 --iodepth=4 --time_based \
    --runtime=30s --group_reporting --ioengine=io_uring \
    --direct=1

unpatched READ: bw=272MiB/s (285MB/s)
patched   READ: bw=650MiB/s (682MB/s)

Reason is easily visible: the fio process is migrating between CPUs when requests are submitted on the queue for the same core.

With --iodepth=8
unpatched READ: bw=466MiB/s (489MB/s)
patched   READ: bw=641MiB/s (672MB/s)

Without io-uring (--iodepth=8)
READ: bw=729MiB/s (764MB/s)

Without fuse (--iodepth=8)
READ: bw=2199MiB/s (2306MB/s)

(Tests were done with
<libfuse>/example/passthrough_hp -o allow_other --nopassthrough \
    [-o io_uring] /tmp/source /tmp/dest)

Additional notes:
With FURING_NEXT_QUEUE_RETRIES=0 (--iodepth=8)
READ: bw=903MiB/s (946MB/s)
With just a random qid (--iodepth=8)
READ: bw=429MiB/s (450MB/s)

With --iodepth=1
unpatched READ: bw=195MiB/s (204MB/s)
patched   READ: bw=232MiB/s (243MB/s)

With --iodepth=1 --numjobs=2
unpatched READ: bw=366MiB/s (384MB/s)
patched   READ: bw=472MiB/s (495MB/s)

With --iodepth=1 --numjobs=8
unpatched READ: bw=1437MiB/s (1507MB/s)
patched   READ: bw=1529MiB/s (1603MB/s)
fuse without io-uring READ: bw=1314MiB/s (1378MB/s), 1314MiB/s-1314MiB/s ...
no-fuse               READ: bw=2566MiB/s (2690MB/s), 2566MiB/s-2566MiB/s ...

In summary, for async requests the core doing application IO is busy sending requests, and processing IOs should be done on a different core. Spreading the load on random cores is also not desirable, as the core might be frequency scaled down and/or in C1 sleep states. Not shown here, but differences are much smaller when the system uses the performance governor instead of schedutil (ubuntu default) - obviously at the cost of higher system power consumption for the performance governor, which is not desirable either.
Results without io-uring (which uses fixed libfuse threads per queue) heavily depend on the current number of active threads. Libfuse uses a default of max 10 threads, but the actual max number of threads is a parameter. Also, the non-io-uring results heavily depend on whether another workload was already running before, as libfuse starts these threads dynamically - i.e. the more threads are active, the worse the performance.
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
This is to further improve performance.

fio --directory=/tmp/dest --name=iops.\$jobnum --rw=randread \
    --bs=4k --size=1G --numjobs=1 --iodepth=4 --time_based \
    --runtime=30s --group_reporting --ioengine=io_uring \
    --direct=1

unpatched READ: bw=650MiB/s (682MB/s)
patched   READ: bw=995MiB/s (1043MB/s)

with --iodepth=8
unpatched READ: bw=641MiB/s (672MB/s)
patched   READ: bw=966MiB/s (1012MB/s)

Reason is that with --iodepth=x (x > 1) fio submits multiple async requests and a single queue might become CPU limited, i.e. spreading the load helps.
generic_file_direct_write() also does this and has a large comment about it. The reproducer here is xfstest's generic/209, which exists exactly to have competing DIO writes and cached IO reads.
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
This was done as a condition on direct_io_allow_mmap, but I believe this is not right, as a file might be open two times - once with write-back enabled and another time with FOPEN_DIRECT_IO.
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
With the reduced queue feature io-uring is marked as ready after
receiving the 1st ring entry. At this time other queues just
might be in the process of registration and then a race happens
fuse_uring_queue_fuse_req -> no queue entry registered yet
list_add_tail -> fuse request gets queued
So far fetching requests from the list only happened from
FUSE_IO_URING_CMD_COMMIT_AND_FETCH, but without new requests
on the same queue, it would actually never send requests
from that queue - the request was stuck.
fuse.h: add new opcode FUSE_COMPOUND
fuse_compound.c: add new functionality to pack multiple fuse operations into one compound command
file.c: add an implementation of open+getattr
Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>
There was a race between fuse_uring_cancel() and
fuse_uring_register()/fuse_uring_next_fuse_req(),
which comes from the queue reduction feature.
Race was
core-A core-B
fuse_uring_register
spin_lock(&queue->lock);
fuse_uring_ent_avail()
spin_unlock(&queue->lock);
fuse_uring_cancel()
spin_lock(&queue->lock);
ent->state = FRRS_USERSPACE;
list_move()
fuse_uring_next_fuse_req()
spin_lock(&queue->lock);
fuse_uring_ent_avail(ent, queue);
fuse_uring_send_next_to_ring()
spin_unlock(&queue->lock);
fuse_uring_send_next_to_ring
I.e. fuse_uring_ent_avail() was called two times and the 2nd time
when the entry was actually already handled by fuse_uring_cancel().
Solution is to not call fuse_uring_ent_avail() from
fuse_uring_register. With that the entry is not in state
FRRS_AVAILABLE and fuse_uring_cancel() will not touch it.
fuse_uring_send_next_to_ring() will mark it as FRRS_AVAILABLE,
and then either assign a request to it and change state again
or will not touch it at all anymore - race fixed.
This will be folded into the upstream queue reduction patches
and therefore has the RED-34640 commit message.
Also entirely removed is fuse_uring_do_register() as remaining
work can be done by the caller.
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
This is just to avoid code dup with an upcoming commit. Signed-off-by: Bernd Schubert <bschubert@ddn.com>
This issue could be observed sometimes during libfuse xfstests; dmesg prints something like "kernel: WARNING: CPU: 4 PID: 0 at fs/fuse/dev_uring.c:204 fuse_uring_destruct+0x1f5/0x200 [fuse]". The cause: if the fuse daemon has just submitted FUSE_IO_URING_CMD_REGISTER SQEs and then umount happens or the daemon quits at this very early stage, one or more unprocessed FUSE_IO_URING_CMD_REGISTER SQEs might still get processed after all uring queues have stopped. New ring entities are then created and added to ent_avail_queue, and fuse_uring_cancel immediately moves them to ent_in_userspace after the SQEs get canceled. These ring entities were never moved to ent_released and stayed in ent_in_userspace when fuse_uring_destruct was called. One way to solve this would be to also free 'ent_in_userspace' in fuse_uring_destruct(), but from a code point of view it is hard to see why that would be needed. As suggested by Joanne, another solution is to avoid moving entries in fuse_uring_cancel() to the 'ent_in_userspace' list and to release them directly.
Fixes: b6236c8 ("fuse: {io-uring} Prevent mount point hang on fuse-server termination")
Cc: Joanne Koong <joannelkoong@gmail.com>
Cc: <stable@vger.kernel.org> # v6.14
Signed-off-by: Jian Huang Li <ali@ddn.com>
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
When a request is terminated before it has been committed, the request is not removed from the queue's list. This leaves a dangling list entry that leads to list corruption and use-after-free issues. Remove the request from the queue's list for terminated non-committed requests.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Fixes: c090c8a ("fuse: Add io-uring sqe commit and fetch support")
Cc: stable@vger.kernel.org
Reviewed-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This fixes a memory leak.
FUSE_INIT has always been asynchronous with mount. That means the server processed this request after the mount syscall returned, so FUSE_INIT can't supply the root inode's ID, which hence currently has a hardcoded value. There are other limitations such as not being able to perform getxattr during mount, which is needed by selinux. To remove these limitations, allow the server to process FUSE_INIT while initializing the in-core super block for the fuse filesystem. This can only be done if the server is prepared to handle this, so add the FUSE_DEV_IOC_SYNC_INIT ioctl, which
a) lets the server know whether this feature is supported, returning ENOTTY otherwise, and
b) lets the kernel know to perform a synchronous initialization.
The implementation is slightly tricky, since fuse_dev/fuse_conn are set up only during super block creation. This is solved by setting the private data of the fuse device file to a special value ((struct fuse_dev *) 1) and waiting for this to be turned into a proper fuse_dev before commencing with operations on the device file.
Several merge conflicts during the backport - the ioctl handler was easier to rewrite from scratch.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
(cherry picked from commit dfb84c3)
no functional changes Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>
Take actions on the PR merged event of this repo. Run copy-from-linux-branch.sh and create a PR for redfs. (cherry picked from commit f54872e)
Switch to pull_request_target instead of pull_request as the github security requirement. Also limits the scope to protected PR. (cherry picked from commit b9980ad)
Remove the pull_request_target as it doesn't work. (cherry picked from commit 5328f66)
no functional change Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>
Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>
…) returns some error other than ENOSYS. The returned error got overwritten.
Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>
Simplify fuse_compound_req to hold only the pointers to the added fuse args and the request housekeeping. Simplify the open+getattr call by using helper functions to fill out the fuse request parameters.
Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>
For now compounds are a module option and disabled by default Signed-off-by: Bernd Schubert <bschubert@ddn.com>
The use of bitmap_weight() didn't give the actual index, but always returned the current cpu, which resulted in a totally wrong mapping. The code now just increases a counter for every mapping, ignores cores not in the given (numa) map, and finds the index that way. Also added is a pr_debug(), which can be activated for example with
echo "module redfs +p" >/proc/dynamic_debug/control
(Pity that upstream is not open for such debug messages.)
ee96a90 to ca3d1bf
	continue;

/* memory barrier to ensure we see the latest list state */
smp_rmb();
This patch can be equally removed - I was hoping it would fix a race, but it actually didn't.
hlist_for_each_entry(dentry, &inode->i_dentry, d_u.d_alias) {
	spin_lock(&dentry->d_lock);
	if (!d_unhashed(dentry))
		__d_drop(dentry);
I think there is a similar patch from Miklos, and Miklos had said this __d_drop() is not right. I need to check again. Keep it for now as it is, but we need to have a look at some point.
fs/fuse/file.c (Outdated)
cleanup:
	unlock_page(page);
	put_page(page);
error:
This function doesn't look right. The patch in 6.8 was not touching fuse_vma_close at all - why does it here?
*ppos = pos;

if (res > 0 && write && fopen_direct_io) {
	/*
Here we could take the upstream patch, with the right cherry-pick ID
This is a checkout of ubuntu-hwe-6.17.0-16.16 from the ubuntu repository and all our patches applied.
After resolving all the conflicts I had to fix some code that was semantically wrong, since the kernel has moved a lot since the patches were created.
This code has survived xfstests with /dev/fuse and io-uring so far.