Skip to content

fix: relay confirmation before merge and welcome (MIP-02/MIP-03)#46

Open
alltheseas wants to merge 1 commit intoVectorPrivacy:masterfrom
alltheseas:fix/relay-confirmation-before-merge
Open

fix: relay confirmation before merge and welcome (MIP-02/MIP-03)#46
alltheseas wants to merge 1 commit intoVectorPrivacy:masterfrom
alltheseas:fix/relay-confirmation-before-merge

Conversation

@alltheseas
Copy link

@alltheseas alltheseas commented Feb 16, 2026

What this fixes

When you add someone to a group chat, Vector immediately updates its own local state as if everything worked — then tries to tell the relay network about it in the background with no retries. If the relay is down or rejects the message, the person you invited never gets their welcome message, and your group is now permanently broken. Messages you send won't decrypt for other members. The only fix is to create a new group.

After this fix, when you add someone to a group chat, the button clears instantly (no change in feel). Behind the scenes, Vector now waits for at least one relay to confirm it received the group update before finalizing. Only then does it send the welcome to the invited person. If all relays are down, nothing changes locally — your group stays in a working state and you can try again later.

Same thing for removing someone from a group.

Technical details

  • Defer merge_pending_commit until relay confirmation: add_member_device and remove_member_device no longer merge the pending commit before publishing. Instead, the commit is created, and a background task handles relay publish → merge → welcome delivery in the correct order.
  • Add publish_event_with_retries helper: 5-attempt exponential backoff (250ms base) matching pika core's pattern, with NIP-42 auth retry handling.
  • Non-blocking Tauri commands: Both functions return immediately after creating the commit. The background task handles the rest, so the frontend "Inviting..."/"Removing..." UI clears instantly.

Before:

engine.add_members()
engine.merge_pending_commit()   ← merged before relay confirm
tokio::spawn(send_event(...))   ← fire-and-forget, no retries
send welcomes immediately       ← no ordering guarantee

After:

engine.add_members()            ← create commit, don't merge
return Ok(())                   ← frontend unblocked
background:
  publish_event_with_retries()  ← relay confirmation with retries
  engine.merge_pending_commit() ← only after relay ack
  send welcomes                 ← only after commit is on relay
  emit UI update
  • MIP-02: Welcome now arrives only after the commit it references is on a relay
  • MIP-03: Local state only advances after relay confirms — no permanent divergence if all relays reject
  • MDK docs: merge_pending_commit docstring says "This should be called AFTER publishing the Kind:445 message"

Test plan

  • cargo check in src-tauri/
  • Add member to group → UI clears immediately, member appears after background merge, welcome arrives after commit is on relay
  • Remove member from group → UI clears immediately, member list updates after background merge
  • Kill all relays → commit is not merged locally, no silent state divergence

🤖 Generated with Claude Code

Copy link

@JSKitty JSKitty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: Concept & Integration with Latest Master

The reasoning here is solid — relay confirmation before merge (MIP-02/MIP-03) and the publish_event_with_retries helper are both good additions. However, master just received per-group sync locks (ce1857c) which this PR needs to integrate with. Beyond the textual conflict, there are a few architectural issues to address.

1. Race condition with sync (critical)

The background task that does publish → merge → welcome doesn't acquire the per-group sync lock. Between add_members() creating the pending commit and the background merge_pending_commit(), a sync_group_since_cursor call can acquire the lock, process incoming events, and advance the epoch — making the pending commit stale.

Fix: Acquire the lock in the background task, wrapping the entire publish → merge → welcome sequence:

tokio::spawn(async move {
    let client = NOSTR_CLIENT.get().unwrap();

    // Hold per-group lock for the entire publish → merge → welcome flow
    let group_lock = crate::mls::get_group_sync_lock(&group_id_owned);
    let _guard = group_lock.lock().await;

    // 1. Publish with retries
    if let Err(e) = publish_event_with_retries(client, &evolution_event).await {
        eprintln!("[MLS] Failed to publish commit after retries: {}", e);
        if let Some(handle) = TAURI_APP.get() {
            handle.emit("mls_error", serde_json::json!({
                "group_id": group_id_owned,
                "error": format!("Failed to publish invite: {}", e)
            })).ok();
        }
        return;
    }

    // 2. Merge — engine created inside locked section, no concurrent access possible
    let storage = MdkSqliteStorage::new_unencrypted(&db_path).ok()?;
    let engine = MDK::new(storage);
    if let Err(e) = engine.merge_pending_commit(&mls_group_id) {
        // ...
    }

    // 3. Send welcomes (still under lock)
    // 4. UI update
});

2. Silent failure — user gets Ok(()) but operation may fail

The old code surfaced errors to the frontend. The new code returns Ok(()) immediately and swallows all background failures via eprintln!. If relay publish fails after 5 retries, merge fails, or welcome delivery fails — the user has no idea.

Fix: Emit an error event to the frontend when the background task fails (see example above).

3. New engine instance in background task

Creating MdkSqliteStorage::new_unencrypted(&db_path) + MDK::new(storage) in the background re-runs migration validation (which can fail — we just hit a migration mismatch bug) and opens a new SQLite connection. With the sync lock held (per fix #1), this becomes safe from concurrency, but it's still heavier than ideal.

4. Stale comment in remove_member_device

Lines 827-829 still say "Perform engine operation: remove member and merge commit BEFORE publishing / This ensures our local state is correct before announcing to the network" — contradicts the new behavior.


Overall the direction is right — just needs integration with the sync lock system and error visibility. Happy to help if you have questions about the lock API!


Footnote: This is JSKitty's Claude replying! 🤖

Rewrite add_member_device and remove_member_device to defer
merge_pending_commit until after relay confirmation, matching
MDK's documented ordering requirement.

Before: merge locally → fire-and-forget publish → send welcomes
After:  create commit → return immediately → background: relay
        publish with retries → merge → send welcomes → UI update

The Tauri command returns immediately (no UX blocking). A new
publish_event_with_retries helper provides 5-attempt exponential
backoff matching pika core's pattern.

This prevents permanent group state divergence when all relays
reject the commit, and ensures welcomes reference a commit that
is actually available on relays (MIP-02).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@alltheseas alltheseas force-pushed the fix/relay-confirmation-before-merge branch from 2ef1ea4 to 4f26338 Compare February 18, 2026 00:39
@alltheseas
Copy link
Author

Updated: rebased onto master + review fixes applied

Rebased onto current master (integrates per-group sync locks from ce1857c and batch add_member_devices from 897fa82). All 4 review items addressed:

  1. Race condition with sync (critical) — background tasks now acquire get_group_sync_lock before the entire publish → merge → welcome flow
  2. Silent failure — all background failure paths emit mls_error event to frontend; added mls_error listener in src/main.js that shows a toast
  3. New engine instance in background — engine created inside locked section, safe from concurrency
  4. Stale comment in remove_member_device — removed "BEFORE publishing" comment, replaced with MIP-03 explanation

Additional fixes discovered during review:

  • track_mls_event_processed moved to after successful merge_pending_commit (prevents stranded commits if merge fails between track and merge)
  • sync_mls_group_participants moved from command layer into background task (runs after merge, so reads post-merge engine state instead of racing)

Known follow-up items

Two architectural gaps remain from moving to async background tasks. Neither is a regression (the old sync code didn't have MIP-02/MIP-03 ordering at all), but they should be tracked:

  1. Stranded pending commit on publish failure — If relay publish fails after 5 retries, the pending commit stays in MLS storage with no automatic recovery. The signed event is dropped when the task returns. Fix options:

    • Persisted retry queue: save the signed evolution event to SQLite, retry on next sync cycle
    • MDK clear_pending_commit: OpenMLS exposes MlsGroup::clear_pending_commit() but MDK doesn't surface it yet — would need an upstream PR
  2. Premature success UIinvite_member_to_group / remove_mls_member_device return Ok(()) before background work completes, so the frontend shows "invited successfully!" before relay confirmation. If the background task later fails, the user sees a contradictory error toast. Fix: frontend should show a provisional state on command return and only confirm success on mls_group_updated event.

@alltheseas
Copy link
Author

I had a go at addressing your comments. Hope this helps steer closer to the MIPs. Keep buidling 💪

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants

Comments