fix: relay confirmation before merge and welcome (MIP-02/MIP-03)#46
fix: relay confirmation before merge and welcome (MIP-02/MIP-03)#46alltheseas wants to merge 1 commit intoVectorPrivacy:masterfrom
Conversation
JSKitty
left a comment
There was a problem hiding this comment.
Review: Concept & Integration with Latest Master
The reasoning here is solid — relay confirmation before merge (MIP-02/MIP-03) and the publish_event_with_retries helper are both good additions. However, master just received per-group sync locks (ce1857c) which this PR needs to integrate with. Beyond the textual conflict, there are a few architectural issues to address.
1. Race condition with sync (critical)
The background task that does publish → merge → welcome doesn't acquire the per-group sync lock. Between add_members() creating the pending commit and the background merge_pending_commit(), a sync_group_since_cursor call can acquire the lock, process incoming events, and advance the epoch — making the pending commit stale.
Fix: Acquire the lock in the background task, wrapping the entire publish → merge → welcome sequence:
tokio::spawn(async move {
let client = NOSTR_CLIENT.get().unwrap();
// Hold per-group lock for the entire publish → merge → welcome flow
let group_lock = crate::mls::get_group_sync_lock(&group_id_owned);
let _guard = group_lock.lock().await;
// 1. Publish with retries
if let Err(e) = publish_event_with_retries(client, &evolution_event).await {
eprintln!("[MLS] Failed to publish commit after retries: {}", e);
if let Some(handle) = TAURI_APP.get() {
handle.emit("mls_error", serde_json::json!({
"group_id": group_id_owned,
"error": format!("Failed to publish invite: {}", e)
})).ok();
}
return;
}
// 2. Merge — engine created inside locked section, no concurrent access possible
let storage = MdkSqliteStorage::new_unencrypted(&db_path).ok()?;
let engine = MDK::new(storage);
if let Err(e) = engine.merge_pending_commit(&mls_group_id) {
// ...
}
// 3. Send welcomes (still under lock)
// 4. UI update
});2. Silent failure — user gets Ok(()) but operation may fail
The old code surfaced errors to the frontend. The new code returns Ok(()) immediately and swallows all background failures via eprintln!. If relay publish fails after 5 retries, merge fails, or welcome delivery fails — the user has no idea.
Fix: Emit an error event to the frontend when the background task fails (see example above).
3. New engine instance in background task
Creating MdkSqliteStorage::new_unencrypted(&db_path) + MDK::new(storage) in the background re-runs migration validation (which can fail — we just hit a migration mismatch bug) and opens a new SQLite connection. With the sync lock held (per fix #1), this becomes safe from concurrency, but it's still heavier than ideal.
4. Stale comment in remove_member_device
Lines 827-829 still say "Perform engine operation: remove member and merge commit BEFORE publishing / This ensures our local state is correct before announcing to the network" — contradicts the new behavior.
Overall the direction is right — just needs integration with the sync lock system and error visibility. Happy to help if you have questions about the lock API!
Footnote: This is JSKitty's Claude replying! 🤖
Rewrite add_member_device and remove_member_device to defer
merge_pending_commit until after relay confirmation, matching
MDK's documented ordering requirement.
Before: merge locally → fire-and-forget publish → send welcomes
After: create commit → return immediately → background: relay
publish with retries → merge → send welcomes → UI update
The Tauri command returns immediately (no UX blocking). A new
publish_event_with_retries helper provides 5-attempt exponential
backoff matching pika core's pattern.
This prevents permanent group state divergence when all relays
reject the commit, and ensures welcomes reference a commit that
is actually available on relays (MIP-02).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2ef1ea4 to
4f26338
Compare
Updated: rebased onto master + review fixes appliedRebased onto current
Additional fixes discovered during review:
Known follow-up itemsTwo architectural gaps remain from moving to async background tasks. Neither is a regression (the old sync code didn't have MIP-02/MIP-03 ordering at all), but they should be tracked:
|
|
I had a go at addressing your comments. Hope this helps steer closer to the MIPs. Keep buidling 💪 |
What this fixes
When you add someone to a group chat, Vector immediately updates its own local state as if everything worked — then tries to tell the relay network about it in the background with no retries. If the relay is down or rejects the message, the person you invited never gets their welcome message, and your group is now permanently broken. Messages you send won't decrypt for other members. The only fix is to create a new group.
After this fix, when you add someone to a group chat, the button clears instantly (no change in feel). Behind the scenes, Vector now waits for at least one relay to confirm it received the group update before finalizing. Only then does it send the welcome to the invited person. If all relays are down, nothing changes locally — your group stays in a working state and you can try again later.
Same thing for removing someone from a group.
Technical details
merge_pending_commituntil relay confirmation:add_member_deviceandremove_member_deviceno longer merge the pending commit before publishing. Instead, the commit is created, and a background task handles relay publish → merge → welcome delivery in the correct order.publish_event_with_retrieshelper: 5-attempt exponential backoff (250ms base) matching pika core's pattern, with NIP-42 auth retry handling.Before:
After:
merge_pending_commitdocstring says "This should be called AFTER publishing the Kind:445 message"Test plan
cargo checkinsrc-tauri/🤖 Generated with Claude Code