fix: relay confirmation before merge and welcome (MIP-02/MIP-03) by alltheseas · Pull Request #46 · VectorPrivacy/Vector

alltheseas · 2026-02-16T07:19:22Z

What this fixes

When you add someone to a group chat, Vector immediately updates its own local state as if everything worked — then tries to tell the relay network about it in the background with no retries. If the relay is down or rejects the message, the person you invited never gets their welcome message, and your group is now permanently broken. Messages you send won't decrypt for other members. The only fix is to create a new group.

After this fix, when you add someone to a group chat, the button clears instantly (no change in feel). Behind the scenes, Vector now waits for at least one relay to confirm it received the group update before finalizing. Only then does it send the welcome to the invited person. If all relays are down, nothing changes locally — your group stays in a working state and you can try again later.

Same thing for removing someone from a group.

Technical details

Defer merge_pending_commit until relay confirmation: add_member_device and remove_member_device no longer merge the pending commit before publishing. Instead, the commit is created, and a background task handles relay publish → merge → welcome delivery in the correct order.
Add publish_event_with_retries helper: 5-attempt exponential backoff (250ms base) matching pika core's pattern, with NIP-42 auth retry handling.
Non-blocking Tauri commands: Both functions return immediately after creating the commit. The background task handles the rest, so the frontend "Inviting..."/"Removing..." UI clears instantly.

Before:

engine.add_members()
engine.merge_pending_commit()   ← merged before relay confirm
tokio::spawn(send_event(...))   ← fire-and-forget, no retries
send welcomes immediately       ← no ordering guarantee

After:

engine.add_members()            ← create commit, don't merge
return Ok(())                   ← frontend unblocked
background:
  publish_event_with_retries()  ← relay confirmation with retries
  engine.merge_pending_commit() ← only after relay ack
  send welcomes                 ← only after commit is on relay
  emit UI update

MIP-02: Welcome now arrives only after the commit it references is on a relay
MIP-03: Local state only advances after relay confirms — no permanent divergence if all relays reject
MDK docs: merge_pending_commit docstring says "This should be called AFTER publishing the Kind:445 message"

Test plan

cargo check in src-tauri/
Add member to group → UI clears immediately, member appears after background merge, welcome arrives after commit is on relay
Remove member from group → UI clears immediately, member list updates after background merge
Kill all relays → commit is not merged locally, no silent state divergence

🤖 Generated with Claude Code

JSKitty

Review: Concept & Integration with Latest Master

The reasoning here is solid — relay confirmation before merge (MIP-02/MIP-03) and the publish_event_with_retries helper are both good additions. However, master just received per-group sync locks (ce1857c) which this PR needs to integrate with. Beyond the textual conflict, there are a few architectural issues to address.

1. Race condition with sync (critical)

The background task that does publish → merge → welcome doesn't acquire the per-group sync lock. Between add_members() creating the pending commit and the background merge_pending_commit(), a sync_group_since_cursor call can acquire the lock, process incoming events, and advance the epoch — making the pending commit stale.

Fix: Acquire the lock in the background task, wrapping the entire publish → merge → welcome sequence:

tokio::spawn(async move {
    let client = NOSTR_CLIENT.get().unwrap();

    // Hold per-group lock for the entire publish → merge → welcome flow
    let group_lock = crate::mls::get_group_sync_lock(&group_id_owned);
    let _guard = group_lock.lock().await;

    // 1. Publish with retries
    if let Err(e) = publish_event_with_retries(client, &evolution_event).await {
        eprintln!("[MLS] Failed to publish commit after retries: {}", e);
        if let Some(handle) = TAURI_APP.get() {
            handle.emit("mls_error", serde_json::json!({
                "group_id": group_id_owned,
                "error": format!("Failed to publish invite: {}", e)
            })).ok();
        }
        return;
    }

    // 2. Merge — engine created inside locked section, no concurrent access possible
    let storage = MdkSqliteStorage::new_unencrypted(&db_path).ok()?;
    let engine = MDK::new(storage);
    if let Err(e) = engine.merge_pending_commit(&mls_group_id) {
        // ...
    }

    // 3. Send welcomes (still under lock)
    // 4. UI update
});

2. Silent failure — user gets `Ok(())` but operation may fail

The old code surfaced errors to the frontend. The new code returns Ok(()) immediately and swallows all background failures via eprintln!. If relay publish fails after 5 retries, merge fails, or welcome delivery fails — the user has no idea.

Fix: Emit an error event to the frontend when the background task fails (see example above).

3. New engine instance in background task

Creating MdkSqliteStorage::new_unencrypted(&db_path) + MDK::new(storage) in the background re-runs migration validation (which can fail — we just hit a migration mismatch bug) and opens a new SQLite connection. With the sync lock held (per fix #1), this becomes safe from concurrency, but it's still heavier than ideal.

4. Stale comment in `remove_member_device`

Lines 827-829 still say "Perform engine operation: remove member and merge commit BEFORE publishing / This ensures our local state is correct before announcing to the network" — contradicts the new behavior.

Overall the direction is right — just needs integration with the sync lock system and error visibility. Happy to help if you have questions about the lock API!

Footnote: This is JSKitty's Claude replying! 🤖

Rewrite add_member_device and remove_member_device to defer merge_pending_commit until after relay confirmation, matching MDK's documented ordering requirement. Before: merge locally → fire-and-forget publish → send welcomes After: create commit → return immediately → background: relay publish with retries → merge → send welcomes → UI update The Tauri command returns immediately (no UX blocking). A new publish_event_with_retries helper provides 5-attempt exponential backoff matching pika core's pattern. This prevents permanent group state divergence when all relays reject the commit, and ensures welcomes reference a commit that is actually available on relays (MIP-02). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

alltheseas · 2026-02-18T00:40:43Z

Updated: rebased onto master + review fixes applied

Rebased onto current master (integrates per-group sync locks from ce1857c and batch add_member_devices from 897fa82). All 4 review items addressed:

Race condition with sync (critical) — background tasks now acquire get_group_sync_lock before the entire publish → merge → welcome flow
Silent failure — all background failure paths emit mls_error event to frontend; added mls_error listener in src/main.js that shows a toast
New engine instance in background — engine created inside locked section, safe from concurrency
Stale comment in remove_member_device — removed "BEFORE publishing" comment, replaced with MIP-03 explanation

Additional fixes discovered during review:

track_mls_event_processed moved to after successful merge_pending_commit (prevents stranded commits if merge fails between track and merge)
sync_mls_group_participants moved from command layer into background task (runs after merge, so reads post-merge engine state instead of racing)

Known follow-up items

Two architectural gaps remain from moving to async background tasks. Neither is a regression (the old sync code didn't have MIP-02/MIP-03 ordering at all), but they should be tracked:

Stranded pending commit on publish failure — If relay publish fails after 5 retries, the pending commit stays in MLS storage with no automatic recovery. The signed event is dropped when the task returns. Fix options:
- Persisted retry queue: save the signed evolution event to SQLite, retry on next sync cycle
- MDK clear_pending_commit: OpenMLS exposes MlsGroup::clear_pending_commit() but MDK doesn't surface it yet — would need an upstream PR
Premature success UI — invite_member_to_group / remove_mls_member_device return Ok(()) before background work completes, so the frontend shows "invited successfully!" before relay confirmation. If the background task later fails, the user sees a contradictory error toast. Fix: frontend should show a provisional state on command return and only confirm success on mls_group_updated event.

alltheseas · 2026-02-18T00:48:56Z

I had a go at addressing your comments. Hope this helps steer closer to the MIPs. Keep buidling 💪

alltheseas mentioned this pull request Feb 16, 2026

MARMOT TRACKER 🦫 nostrability/nostrability#253

Open

JSKitty reviewed Feb 17, 2026

View reviewed changes

alltheseas force-pushed the fix/relay-confirmation-before-merge branch from 2ef1ea4 to 4f26338 Compare February 18, 2026 00:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: relay confirmation before merge and welcome (MIP-02/MIP-03)#46

fix: relay confirmation before merge and welcome (MIP-02/MIP-03)#46
alltheseas wants to merge 1 commit intoVectorPrivacy:masterfrom
alltheseas:fix/relay-confirmation-before-merge

alltheseas commented Feb 16, 2026 •

edited

Loading

Uh oh!

JSKitty left a comment

Uh oh!

alltheseas commented Feb 18, 2026

Uh oh!

alltheseas commented Feb 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Comments

Conversation

alltheseas commented Feb 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this fixes

Technical details

Test plan

Uh oh!

JSKitty left a comment

Choose a reason for hiding this comment

Review: Concept & Integration with Latest Master

1. Race condition with sync (critical)

2. Silent failure — user gets Ok(()) but operation may fail

3. New engine instance in background task

4. Stale comment in remove_member_device

Uh oh!

alltheseas commented Feb 18, 2026

Updated: rebased onto master + review fixes applied

Known follow-up items

Uh oh!

alltheseas commented Feb 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Comments

alltheseas commented Feb 16, 2026 •

edited

Loading

2. Silent failure — user gets `Ok(())` but operation may fail

4. Stale comment in `remove_member_device`