extending shutdown to handle single agent setup in relayed mode#19
Conversation
License Check Results🚀 The license check job ran with the Bazel command: bazel run //:license-checkStatus: Click to expand output |
|
The created documentation from the pull request is available at: docu-html |
Signed-off-by: ShoroukRamzy <82869891+ShoroukRamzy@users.noreply.github.com>
| // (100.into(), vec![w40, w41, w42, w43, w44]), | ||
| // ] | ||
| // .into_iter() | ||
| // .collect(); |
| recorder_ids: vec![], | ||
| worker_assignments: agent_assignments().remove(&AGENT_ID).unwrap(), | ||
| timeout: Duration::from_secs(10), | ||
| timeout: Duration::from_secs(60), |
There was a problem hiding this comment.
Why do we need a longer timeout?
There was a problem hiding this comment.
while testing the single agent in relayed mode I encountered sporadic issue of missing all worker connections within the timeout, that's why I needed to increase the timeout.
There was a problem hiding this comment.
If there is a separate issue, we better make a separate ticket for it. I don't think it's a good idea to mask the issue with an unreasonably long timeout, especially as part of an unrelated PR.
There was a problem hiding this comment.
Thanks @artemsheinacn, I fully agree with you. I will revert it
|
@armin-acn please review when you have time |
4c11f36 to
8220056
Compare
artemsheinacn
left a comment
There was a problem hiding this comment.
Thanks for the PR @ShoroukRamzy!
Thank you for your feedbacks @artemsheinacn |
Hi @armin-acn, I don't have permissions to merge this PR. Could you please help me with this when you have time? Thanks! |
…pse-score#19) * extending shutdown to handle single agent setup in relayed mode * extending shutdown to handle single agent setup in relayed mode --------- Signed-off-by: ShoroukRamzy <82869891+ShoroukRamzy@users.noreply.github.com>
Hi @artemsheinacn,
This PR fixes a deadlock during shutdown when using a relayed signaling mode in a single-agent setup as per our discussion here
#17
The PrimaryReceiveRelay thread would block indefinitely waiting for remote agents that don't exist. The fix is to prevent the relay threads from being created at all in this scenario, as all workers are local and communicate via MPSC.
The SchedulerConnector for the relayed mode now conditionally creates the PrimaryReceiveRelay and PrimarySendRelay only if there are remote agents configured. This makes the shutdown clean and avoids spawning an unnecessary thread.
Thanks!