Redfs ubuntu hwe 6.17.0 16.16 24.04.1 #101

Open

hbirth wants to merge 46 commits into DDNStorage:redfs-ubuntu-hwe-6.17.0-16.16-24.04.1 from hbirth:redfs-ubuntu-hwe-6.17.0-16.16_24.04.1

Conversation

@hbirth (Collaborator) commented Feb 18, 2026

This is a checkout of ubuntu-hwe-6.17.0-16.16 from the Ubuntu repository with all our patches applied.
After resolving all the conflicts I had to fix some code that was semantically wrong, since the kernel has moved on a lot since the patches were created.

This code has survived xfstests with /dev/fuse and io-uring so far.

hbirth requested a review from bsbernd on February 18, 2026 at 19:35
@hbirth (Collaborator, Author) commented Feb 18, 2026

The clone from the Ubuntu repository is also included here as a separate branch, in case we have to check back at some point.

hbirth force-pushed the redfs-ubuntu-hwe-6.17.0-16.16_24.04.1 branch from d1ad060 to 863185e on February 18, 2026 at 19:43
@bsbernd (Collaborator) commented Feb 18, 2026

@hbirth I think the last patch needs to be integrated into the other patch causing it. And I'm confused about things like 396b209 fuse: add simple request tracepoints, because

git tag --contains shows that it is part of 6.12 - already merged and not needed.

@hbirth (Collaborator, Author) commented Feb 18, 2026

@hbirth I think the last patch needs to be integrated into the other patch causing it. And I'm confused about things like 396b209 fuse: add simple request tracepoints, because

git tag --contains shows that it is part of 6.12 - already merged and not needed.

I thought you had a script to check if all patches are there, so I even kept the sequence the same.

@bsbernd (Collaborator) commented Feb 18, 2026

@hbirth I think the last patch needs to be integrated into the other patch causing it. And I'm confused about things like 396b209 fuse: add simple request tracepoints, because
git tag --contains shows that it is part of 6.12 - already merged and not needed.

I thought you had a script to check if all patches are there, so I even kept the sequence the same.

There are too many exceptions already - we need to create a whitelist per branch.

@bsbernd (Collaborator) commented Feb 18, 2026

And for things that are merged upstream and have the cherry-pick ID in our 6.8 branch - the script already handles it. The order does not matter, it just needs to be there.

@hbirth (Collaborator, Author) commented Feb 18, 2026

And for things that are merged upstream and have the cherry-pick ID in our 6.8 branch - the script already handles it. The order does not matter, it just needs to be there.

OK, so I shouldn't have applied those and fixed them later?

@bsbernd (Collaborator) commented Feb 18, 2026

Yeah, I think the branch should only get patches that are absolutely needed. I had made an exception for the el9.6 branch and applied an empty patch to make the script happy. But in the end, all the other branches still show differences - we need to introduce a whitelist.

@hbirth (Collaborator, Author) commented Feb 19, 2026

OK, if I do that, the patches will still not apply cleanly, so I have to fix them ... sometimes it's almost a rewrite since the context changed. That's why I fixed all that later. So should I now create separate patches for that, or stop at every patch and compile and test? ... Even if a patch applies cleanly, I will not know whether the result is the correct functionality (e.g. for iomap and DLM_LOCK the logic changed in the write iterator).

My problem is that I will need fixes. Where do they go, since you don't want a separate patch only for the fixes?

I thought it would be easier if we have the series of patches and then one where we fix the kernel; then we know whether a line came from a wrong fix, or whether we missed it and it came from an original patch that did not actually apply to this version.

BTW, if I had to do a 6.19, I would omit the last patch but keep all the rest, and check afterwards how much of the fix applies.

hbirth force-pushed the redfs-ubuntu-hwe-6.17.0-16.16_24.04.1 branch from 863185e to 90a7d83 on February 19, 2026 at 07:44
@hbirth (Collaborator, Author) commented Feb 19, 2026

Filtered out the patches that were already in Ubuntu-hwe-6.17.

hbirth force-pushed the redfs-ubuntu-hwe-6.17.0-16.16_24.04.1 branch from 90a7d83 to 8feba64 on February 19, 2026 at 13:54
@hbirth (Collaborator, Author) commented Feb 19, 2026

If we have to do this multiple times, we should think about squashing OUR changes into a couple of topics: page pinning, io-uring fixes, DLM_LOCK, etc.
Having code added just to change it later ... and even to a version that is already upstream ... is really error prone.
I'm referring especially to the compounds, which have to be updated to the upstream version.

@bsbernd (Collaborator) commented Feb 19, 2026

If we have to do this multiple times, we should think about squashing OUR changes into a couple of topics: page pinning, io-uring fixes, DLM_LOCK, etc. Having code added just to change it later ... and even to a version that is already upstream ... is really error prone. I'm referring especially to the compounds, which have to be updated to the upstream version.

There is a good reason why upstream-first is preferred ;) It keeps things on the management agenda and avoids such changes. I think for changes that go upstream it makes sense to revert these internal patches and then apply what upstream eventually gets.

args->out_args[0].size = count;
}

static void fuse_release_user_pages(struct fuse_args_pages *ap, ssize_t nres,
Collaborator (review comment):

The 2nd patch is not right; I had only done that in the 6.8 branch because it had introduced bad conflicts. It should not be needed for 6.17.

hbirth force-pushed the redfs-ubuntu-hwe-6.17.0-16.16_24.04.1 branch from 8feba64 to c683e87 on February 19, 2026 at 19:08
This is especially needed for better ftrace analysis,
for example to build histograms. So far the request unique
was missing, because it was only assigned after the first
trace message.

IDs/req-unique might now not come up perfectly sequentially
anymore, but especially with cloned devices or io-uring this
did not work perfectly anyway.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
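For readers less familiar with kernel tracepoints, below is a minimal sketch of an event that records the request unique; the event name and surrounding trace-header boilerplate are assumptions for illustration, not the actual patch (the fuse_in_header fields unique/opcode/nodeid are real).

/* Illustrative sketch only - a tracepoint carrying the request unique,
 * so ftrace histograms can key on it. Would live in the fuse trace
 * header; names are assumed. */
#include <linux/tracepoint.h>
#include "fuse_i.h"	/* struct fuse_req */

TRACE_EVENT(fuse_request_sketch,
	TP_PROTO(const struct fuse_req *req),
	TP_ARGS(req),

	TP_STRUCT__entry(
		__field(u64, unique)
		__field(u32, opcode)
		__field(u64, nodeid)
	),

	TP_fast_assign(
		__entry->unique = req->in.h.unique;
		__entry->opcode = req->in.h.opcode;
		__entry->nodeid = req->in.h.nodeid;
	),

	TP_printk("unique=%llu opcode=%u nodeid=%llu",
		  __entry->unique, __entry->opcode, __entry->nodeid)
);

With the unique present in every event, a histogram can key on it and compute per-request latency across the different queues.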
fuse_uring_send_next_to_ring() can just call into fuse_uring_send()
and avoid code duplication.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Rename trace_fuse_request_send to trace_fuse_request_enqueue
Add trace_fuse_request_send
Add trace_fuse_request_bg_enqueue
Add trace_fuse_request_enqueue

This helps to track the entire request time and the time spent
in the different queues.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
@hbirth hbirth force-pushed the redfs-ubuntu-hwe-6.17.0-16.16_24.04.1 branch from c683e87 to ee96a90 Compare February 19, 2026 19:13
bsbernd and others added 25 commits February 19, 2026 20:30
This is preparation for follow-up commits that allow running with a
reduced number of queues.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Add per-CPU and per-NUMA node bitmasks to track which
io-uring queues are registered.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
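A rough sketch of what such tracking could look like; the struct and helper names are assumptions for illustration (assuming qid == CPU id, as in the per-core queue model), not the actual patch.

/* Sketch: remember which io-uring queues are registered, per CPU and
 * per NUMA node. registered_cpus would be allocated with
 * zalloc_cpumask_var() at ring setup time. Illustrative names only. */
#include <linux/cpumask.h>
#include <linux/nodemask.h>
#include <linux/topology.h>

struct fuse_ring_reg_sketch {
	cpumask_var_t registered_cpus;   /* bit per CPU with a registered queue */
	nodemask_t    registered_nodes;  /* bit per node with at least one queue */
};

static void fuse_ring_mark_registered_sketch(struct fuse_ring_reg_sketch *reg,
					     unsigned int qid)
{
	cpumask_set_cpu(qid, reg->registered_cpus);
	node_set(cpu_to_node(qid), reg->registered_nodes);
}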
Queue selection (fuse_uring_get_queue) can handle a reduced number
of queues - using io-uring is now possible even with a single
queue and entry.

The FUSE_URING_REDUCED_Q flag is introduced to tell the fuse server
that reduced queues are possible, i.e. if the flag is set, the fuse
server is free to reduce the number of queues.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
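A minimal sketch of how selection could fall back when the current CPU has no registered queue; the function and parameter names are assumptions, not the actual fuse_uring_get_queue() code.

/* Sketch of reduced-queue selection: prefer the submitting CPU's queue,
 * otherwise any registered queue on the same NUMA node, otherwise the
 * first registered queue anywhere. Illustrative only. */
#include <linux/cpumask.h>
#include <linux/topology.h>

static unsigned int fuse_ring_pick_qid_sketch(const struct cpumask *registered,
					      unsigned int this_cpu)
{
	unsigned int cpu;

	if (cpumask_test_cpu(this_cpu, registered))
		return this_cpu;

	/* prefer a registered queue on the same NUMA node */
	for_each_cpu_and(cpu, registered, cpumask_of_node(cpu_to_node(this_cpu)))
		return cpu;

	/* last resort: any registered queue */
	return cpumask_first(registered);
}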
Running background IO on a different core makes quite a difference.

fio --directory=/tmp/dest --name=iops.\$jobnum --rw=randread \
--bs=4k --size=1G --numjobs=1 --iodepth=4 --time_based\
--runtime=30s --group_reporting --ioengine=io_uring\
 --direct=1

unpatched
   READ: bw=272MiB/s (285MB/s) ...
patched
   READ: bw=650MiB/s (682MB/s)

The reason is easily visible: the fio process migrates between CPUs
when requests are submitted on the queue for the same core.

With --iodepth=8

unpatched
   READ: bw=466MiB/s (489MB/s)
patched
   READ: bw=641MiB/s (672MB/s)

Without io-uring (--iodepth=8)
   READ: bw=729MiB/s (764MB/s)

Without fuse (--iodepth=8)
   READ: bw=2199MiB/s (2306MB/s)

(Tests were done with
<libfuse>/example/passthrough_hp -o allow_other --nopassthrough  \
[-o io_uring] /tmp/source /tmp/dest
)

Additional notes:

With FURING_NEXT_QUEUE_RETRIES=0 (--iodepth=8)
   READ: bw=903MiB/s (946MB/s)

With just a random qid (--iodepth=8)
   READ: bw=429MiB/s (450MB/s)

With --iodepth=1
unpatched
   READ: bw=195MiB/s (204MB/s)
patched
   READ: bw=232MiB/s (243MB/s)

With --iodepth=1 --numjobs=2
unpatched
   READ: bw=366MiB/s (384MB/s)
patched
   READ: bw=472MiB/s (495MB/s)

With --iodepth=1 --numjobs=8
unpatched
   READ: bw=1437MiB/s (1507MB/s)
patched
   READ: bw=1529MiB/s (1603MB/s)
fuse without io-uring
   READ: bw=1314MiB/s (1378MB/s), 1314MiB/s-1314MiB/s ...
no-fuse
   READ: bw=2566MiB/s (2690MB/s), 2566MiB/s-2566MiB/s ...

In summary, for async requests the core doing application IO is busy
sending requests, and processing the IOs should be done on a different
core. Spreading the load on random cores is also not desirable, as the
core might be frequency-scaled down and/or in C1 sleep states. Not
shown here, but differences are much smaller when the system uses the
performance governor instead of schedutil (the Ubuntu default) -
obviously at the cost of higher system power consumption for the
performance governor, which is not desirable either.

Results without io-uring (io-uring uses fixed libfuse threads per
queue) heavily depend on the current number of active threads. Libfuse
uses a default of max 10 threads, but the actual max number of threads
is a parameter. Also, the fuse results without io-uring heavily depend
on whether another workload was already running before, as libfuse
starts these threads dynamically - i.e. the more threads are active,
the worse the performance.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
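To illustrate the conclusion (hand async work to a nearby core rather than to the submitting or a random one), here is a small sketch; the helper name is an assumption and this is not the patch's actual selection logic.

/* Sketch: pick a core close to the submitter but not the submitter
 * itself, staying on the same NUMA node so the target is neither the
 * busy application core nor a random, possibly deep-sleeping one. */
#include <linux/cpumask.h>
#include <linux/topology.h>

static unsigned int fuse_ring_pick_nearby_cpu_sketch(unsigned int this_cpu)
{
	const struct cpumask *node_cpus = cpumask_of_node(cpu_to_node(this_cpu));
	unsigned int cpu = cpumask_next(this_cpu, node_cpus);

	if (cpu >= nr_cpu_ids)
		cpu = cpumask_first(node_cpus);	/* wrap; may be this_cpu on a 1-CPU node */
	return cpu;
}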
This is to further improve performance.

fio --directory=/tmp/dest --name=iops.\$jobnum --rw=randread \
--bs=4k --size=1G --numjobs=1 --iodepth=4 --time_based\
--runtime=30s --group_reporting --ioengine=io_uring\
--direct=1

unpatched
   READ: bw=650MiB/s (682MB/s)
patched:
   READ: bw=995MiB/s (1043MB/s)

with --iodepth=8

unpatched
   READ: bw=641MiB/s (672MB/s)
patched
   READ: bw=966MiB/s (1012MB/s)

The reason is that with --iodepth=x (x > 1) fio submits multiple async
requests and a single queue might become CPU limited, i.e. spreading
the load helps.
generic_file_direct_write() also does this and has a large
comment about it.

The reproducer here is xfstests' generic/209, which exists exactly
to exercise competing DIO write and cached IO read.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
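For reference, the general pattern generic_file_direct_write() follows (flush dirty cached pages in the range before the direct write, invalidate afterwards so cached readers do not see stale data) looks roughly like the sketch below; the wrapper is illustrative, only filemap_write_and_wait_range() and invalidate_inode_pages2_range() are real kernel APIs.

/* Sketch of the flush+invalidate pattern around a direct write;
 * the do_dio callback stands in for the actual DIO path. */
#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/uio.h>

static ssize_t dio_write_pattern_sketch(struct kiocb *iocb, struct iov_iter *from,
					ssize_t (*do_dio)(struct kiocb *, struct iov_iter *))
{
	struct address_space *mapping = iocb->ki_filp->f_mapping;
	loff_t pos = iocb->ki_pos;
	size_t count = iov_iter_count(from);
	ssize_t ret;

	if (!count)
		return 0;

	/* write out dirty cached pages overlapping the DIO range */
	ret = filemap_write_and_wait_range(mapping, pos, pos + count - 1);
	if (ret)
		return ret;

	ret = do_dio(iocb, from);

	/* drop now-stale cached pages so a concurrent cached read
	 * cannot return old data */
	if (ret > 0)
		invalidate_inode_pages2_range(mapping, pos >> PAGE_SHIFT,
					      (pos + ret - 1) >> PAGE_SHIFT);
	return ret;
}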
This was done conditionally on direct_io_allow_mmap, but I believe
this is not right, as a file might be open two times - once with
write-back enabled and another time with FOPEN_DIRECT_IO.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
With the reduced queue feature io-uring is marked as ready after
receiving the 1st ring entry. At this time other queues might still
be in the process of registration and then a race happens:

fuse_uring_queue_fuse_req -> no queue entry registered yet
    list_add_tail -> fuse request gets queued

So far fetching requests from the list only happened from
FUSE_IO_URING_CMD_COMMIT_AND_FETCH, but without new requests
on the same queue, it would actually never send requests
from that queue - the request was stuck.
fuse.h: add new opcode FUSE_COMPOUND

fuse_compound.c: add new functionality to pack multiple
fuse operations into one compound command

file.c: add an implementation of open+getattr

Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>
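Purely as a hypothetical illustration of the compound idea (none of these names are the actual implementation), a packed request could carry per-op headers back to back:

/* Hypothetical on-the-wire layout sketch for FUSE_COMPOUND -
 * illustrative only, not the real structures. */
struct fuse_compound_op_sketch {
	uint32_t opcode;    /* e.g. FUSE_OPEN, then FUSE_GETATTR */
	uint32_t arg_len;   /* length of the per-op argument block that follows */
};

struct fuse_compound_in_sketch {
	uint32_t op_count;  /* number of packed sub-operations */
	uint32_t padding;
	/* followed by op_count x (struct fuse_compound_op_sketch + args) */
};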
There was a race between fuse_uring_cancel() and
fuse_uring_register()/fuse_uring_next_fuse_req(),
which comes from the queue reduction feature.

The race was:

core-A                         core-B
fuse_uring_register
    spin_lock(&queue->lock);
    fuse_uring_ent_avail()
    spin_unlock(&queue->lock);

                                fuse_uring_cancel()
                                    spin_lock(&queue->lock);
                                    ent->state = FRRS_USERSPACE;
                                    list_move()

    fuse_uring_next_fuse_req()
        spin_lock(&queue->lock);
        fuse_uring_ent_avail(ent, queue);
        fuse_uring_send_next_to_ring()
        spin_unlock(&queue->lock);
        fuse_uring_send_next_to_ring

I.e. fuse_uring_ent_avail() was called two times, and the 2nd time
the entry had actually already been handled by fuse_uring_cancel().

The solution is to not call fuse_uring_ent_avail() from
fuse_uring_register(). With that the entry is not in state
FRRS_AVAILABLE and fuse_uring_cancel() will not touch it.
fuse_uring_send_next_to_ring() will mark it as FRRS_AVAILABLE,
and then either assign a request to it and change the state again,
or will not touch it at all anymore - race fixed.

This will be folded into the upstream queue reduction patches
and therefore has the RED-34640 commit message.

Also, fuse_uring_do_register() is entirely removed, as the remaining
work can be done by the caller.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
This is just to avoid code duplication with an upcoming commit.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
This issue could sometimes be observed during libfuse xfstests; dmesg
then prints something like "kernel: WARNING: CPU: 4 PID: 0 at
fs/fuse/dev_uring.c:204 fuse_uring_destruct+0x1f5/0x200 [fuse]".

The cause: the fuse daemon has just submitted
FUSE_IO_URING_CMD_REGISTER SQEs, and then umount happens or the fuse
daemon quits at this very early stage. After all uring queues have
stopped, one or more unprocessed FUSE_IO_URING_CMD_REGISTER SQEs might
still get processed; some new ring entities are then created and added
to ent_avail_queue, and fuse_uring_cancel immediately moves them to
ent_in_userspace after the SQEs get canceled. These ring entities were
never moved to ent_released and stayed in ent_in_userspace when
fuse_uring_destruct was called.

One way to solve it would be to also free 'ent_in_userspace' in
fuse_uring_destruct(), but from a code point of view it is hard to see
why that is needed. As suggested by Joanne, another solution is to
avoid moving entries to the 'ent_in_userspace' list in
fuse_uring_cancel() and to just release them directly.

Fixes: b6236c8 ("fuse: {io-uring} Prevent mount point hang on fuse-server termination")
Cc: Joanne Koong <joannelkoong@gmail.com>
Cc: <stable@vger.kernel.org> # v6.14
Signed-off-by: Jian Huang Li <ali@ddn.com>
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
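The suggested direction (release canceled entries directly instead of parking them on ent_in_userspace) could look roughly like the sketch below; this is not the literal patch and the state constant is assumed.

/* Sketch: on cancel of an unprocessed register SQE, move the ring entry
 * straight to ent_released so fuse_uring_destruct() never finds it on
 * ent_in_userspace. Illustrative only. */
#include "dev_uring_i.h"	/* fuse-internal ring definitions */

static void fuse_uring_cancel_sketch(struct fuse_ring_ent *ent,
				     struct fuse_ring_queue *queue)
{
	spin_lock(&queue->lock);
	/* previously: list_move(&ent->list, &queue->ent_in_userspace) */
	list_move(&ent->list, &queue->ent_released);
	ent->state = FRRS_RELEASED;	/* assumed state name */
	spin_unlock(&queue->lock);
}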
When a request is terminated before it has been committed, the request
is not removed from the queue's list. This leaves a dangling list entry
that leads to list corruption and use-after-free issues.

Remove the request from the queue's list for terminated non-committed
requests.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Fixes: c090c8a ("fuse: Add io-uring sqe commit and fetch support")
Cc: stable@vger.kernel.org
Reviewed-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
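The idea of the fix, sketched under assumed helper naming (not the literal upstream change):

/* Sketch: a request terminated before commit must not stay linked on
 * the queue list, otherwise a dangling entry remains. */
#include "fuse_i.h"		/* struct fuse_req */
#include "dev_uring_i.h"	/* struct fuse_ring_queue */

static void fuse_uring_drop_terminated_req_sketch(struct fuse_req *req,
						  struct fuse_ring_queue *queue)
{
	spin_lock(&queue->lock);
	if (!list_empty(&req->list))
		list_del_init(&req->list);
	spin_unlock(&queue->lock);
}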
FUSE_INIT has always been asynchronous with mount.  That means that the
server processed this request after the mount syscall returned.

This means that FUSE_INIT can't supply the root inode's ID, hence it
currently has a hardcoded value.  There are other limitations such as not
being able to perform getxattr during mount, which is needed by selinux.

To remove these limitations allow server to process FUSE_INIT while
initializing the in-core super block for the fuse filesystem.  This can
only be done if the server is prepared to handle this, so add
FUSE_DEV_IOC_SYNC_INIT ioctl, which

 a) lets the server know whether this feature is supported, returning
 ENOTTY otherwise.

 b) lets the kernel know to perform a synchronous initialization

The implementation is slightly tricky, since fuse_dev/fuse_conn are set up
only during super block creation.  This is solved by setting the private
data of the fuse device file to a special value ((struct fuse_dev *) 1) and
waiting for this to be turned into a proper fuse_dev before commecing with
operations on the device file.

Several merge conflicts during the backport - the ioctl handler
was easier to rewrite from scratch.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
(cherry picked from commit dfb84c3)
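The sentinel trick described above could be sketched as follows; the macro name, wait queue, and helper are assumptions for illustration only.

/* Sketch: the ioctl marks the device file with a placeholder, and
 * device-file operations wait until super block creation has installed
 * the real fuse_dev before going on. Illustrative only. */
#include <linux/fs.h>
#include <linux/wait.h>

struct fuse_dev;

#define FUSE_DEV_SYNC_INIT_SENTINEL	((struct fuse_dev *) 1)

static struct fuse_dev *fuse_get_dev_sync_init_sketch(struct file *file,
						      wait_queue_head_t *waitq)
{
	struct fuse_dev *fud;

	/* wait until the sentinel has been replaced with a real fuse_dev */
	wait_event(*waitq,
		   (fud = READ_ONCE(file->private_data)) != FUSE_DEV_SYNC_INIT_SENTINEL);
	return fud;
}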
no functional changes

Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>
Take actions on the PR merged event of this repo. Run
copy-from-linux-branch.sh and create a PR for redfs.

(cherry picked from commit f54872e)
Switch to pull_request_target instead of pull_request per the GitHub
security requirement. It also limits the scope to protected PRs.

(cherry picked from commit b9980ad)
Remove the pull_request_target as it doesn't work.

(cherry picked from commit 5328f66)
no functional change

Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>
Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>
…) returns

some error other than ENOSYS. The returned error got overwritten.

Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>
Simplify fuse_compound_req to hold only the pointers
to the added fuse args and the request housekeeping.

Simplify the open+getattr call by using helper functions
to fill out the fuse request parameters.

Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>
For now, compounds are a module option and disabled by default.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
The use of bitmap_weight() didn't give the actual index,
but always returned the current cpu, which resulted
in a totally wrong mapping.

The code now just increments a counter for every mapping, ignores
cores not in the given (numa) map, and then finds the index from that.

Also added is a pr_debug(), which can be activated for example with
echo "module redfs +p" >/proc/dynamic_debug/control
(a pity that upstream is not open to such debug messages).
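A small sketch of the corrected index calculation (names are assumptions; the real code lives in the queue-mapping helper):

/* Sketch: walk the cpus of the (numa) mask in order and count how many
 * registered queues come before this cpu - that count is the index,
 * instead of misusing bitmap_weight(). Illustrative only. */
#include <linux/cpumask.h>

static int fuse_qid_index_sketch(const struct cpumask *node_mask,
				 const struct cpumask *registered,
				 unsigned int this_cpu)
{
	unsigned int cpu;
	int idx = 0;

	for_each_cpu_and(cpu, node_mask, registered) {
		if (cpu == this_cpu)
			return idx;
		idx++;
	}
	return -1;	/* this_cpu has no registered queue in node_mask */
}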
@hbirth hbirth force-pushed the redfs-ubuntu-hwe-6.17.0-16.16_24.04.1 branch from ee96a90 to ca3d1bf Compare February 19, 2026 19:33
continue;

/* memory barrier to ensure we see the latest list state */
smp_rmb();
Collaborator (review comment):

This patch can equally be removed - I was hoping it would fix a race, but it actually didn't.

hlist_for_each_entry(dentry, &inode->i_dentry, d_u.d_alias) {
	spin_lock(&dentry->d_lock);
	if (!d_unhashed(dentry))
		__d_drop(dentry);
Collaborator (review comment):

I think there is a similar patch from Miklos, and Miklos had said this __d_drop() is not right. I need to check again. Keep it as it is for now, but we need to have a look at some point.

fs/fuse/file.c Outdated
cleanup:
	unlock_page(page);
	put_page(page);
error:
Collaborator (review comment):

This function doesn't look right. The patch in 6.8 was not touching fuse_vma_close at all - why does it touch it here?

*ppos = pos;

if (res > 0 && write && fopen_direct_io) {
/*
Collaborator (review comment):

Here we could take the upstream patch, with the right cherry-pick ID
