Fix agent-cascade/speak streaming TTS and thread STT/LLM/TTS options#176
Merged
The terminal cascade (assembly agent-cascade) and assembly speak could not produce audio. Three independent bugs in the streaming-TTS client, each masking the next, plus a diagnostics gap:

- Auth scheme: the TTS socket was opened with 'Authorization: Bearer <key>', but AssemblyAI streaming endpoints authenticate with the raw key (only Voice Agent uses Bearer). open_authorized_ws gained a bearer= flag (default True); TTS now passes bearer=False, matching the working agent-cascade init template.
- Flush tag: the client sent 'ForceFlushTextBuffer'; the server's tag is 'Flush'.
- End-of-stream: the loop waited for an Audio frame with is_final, which the live server never sets; it ends a synthesis with a 'FlushDone' frame. Without handling it, the loop blocked until the 60s recv timeout and the audio was lost.
- Diagnostics: the session-start error path discarded the server's Error-frame contents (generic "got 'Error'"); it now surfaces error_code/error, which is how the flush validation error became visible.

Also thread per-leg customization into agent-cascade (hybrid: common named flags + per-leg KEY=VALUE escape hatches), grouped into --help panels:

- STT: --speech-model, --format-turns/--no-format-turns, --turn-detection, --stt-config, --stt-config-file
- LLM: --max-tokens, --llm-config
- TTS: --language, --tts-config (new SpeakConfig.extra)

The reply trigger is now format-aware so --no-format-turns still replies; TTS sample rate stays locked to the live player; --tts-config rejects reserved keys.

Verified against the live sandbox: assembly speak produces a valid 24kHz WAV and a fully-customized file-driven cascade runs STT->LLM->TTS without error.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
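The end-of-stream and diagnostics fixes described above can be illustrated with a small sketch. It is not the code in aai_cli/tts/session.py: the names `authorization_value`, `collect_audio`, and `TTSServerError` are hypothetical, and the frame layout (a base64 `audio` field on `Audio` frames) is an assumption. Only the `Flush`/`FlushDone`/`Error` tags and the `error_code`/`error` fields come from the description.

```python
import base64
from typing import Iterable


class TTSServerError(RuntimeError):
    """Raised when the server reports an error (hypothetical helper)."""


def authorization_value(api_key: str, bearer: bool = True) -> str:
    """Return the Authorization header value for either auth scheme.

    bearer=True mirrors the Voice Agent scheme; bearer=False sends the
    raw key, which is what the streaming TTS endpoint is said to expect.
    """
    return f"Bearer {api_key}" if bearer else api_key


def collect_audio(frames: Iterable[dict]) -> bytes:
    """Gather audio bytes until a FlushDone frame ends the synthesis."""
    pcm = bytearray()
    for frame in frames:
        kind = frame.get("type")
        if kind == "Audio":
            # Assumed payload shape: base64-encoded audio under "audio".
            pcm += base64.b64decode(frame.get("audio", ""))
        elif kind == "FlushDone":
            # End-of-synthesis marker; do not wait for Audio.is_final.
            return bytes(pcm)
        elif kind == "Error":
            # Surface the server's details rather than a generic message.
            code = frame.get("error_code", "unknown")
            raise TTSServerError(f"{code}: {frame.get('error', '')}")
    raise TTSServerError("stream ended before FlushDone")
```

Taking an iterable of already-decoded frames keeps the termination logic separate from the socket, so it can be exercised without a live connection.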
f743ff2 to
00128ac
Rebased onto |
Summary
`assembly agent-cascade` (and `assembly speak`) could not produce audio. This fixes three independent streaming-TTS client bugs, each masking the next, and then threads full STT/LLM/TTS customization into `agent-cascade`.

Bug fixes (`aai_cli/tts/session.py`, `aai_cli/streaming/diagnostics.py`)

| Bug | Fix |
| --- | --- |
| Auth scheme (`Bearer <key>` vs raw key) | `open_authorized_ws` gained `bearer=` (default True for Voice Agent); TTS passes `bearer=False` |
| Flush tag (`ForceFlushTextBuffer`) | Send `{"type": "Flush"}` |
| End-of-stream (waited for `Audio.is_final`, server sends `FlushDone`) | End the synthesis on `FlushDone` |

Supporting: the session-start error path now surfaces the server's `error_code`/`error` instead of a generic `got 'Error'` (this is how the flush validation error became visible).

The working `agent-cascade` init template was the reference for all three: it authenticates with the raw key and sends `Flush`.

Feature: per-leg customization on `agent-cascade`

Hybrid surface (common named flags + per-leg `KEY=VALUE` escape hatches), grouped into `--help` panels:

| Leg | Flags |
| --- | --- |
| STT | `--speech-model`, `--format-turns`/`--no-format-turns`, `--turn-detection`, `--stt-config`, `--stt-config-file` |
| LLM | `--max-tokens`, `--llm-config` |
| TTS | `--language`, `--tts-config` |

Details: the reply trigger is now format-aware so `--no-format-turns` still replies; the TTS sample rate stays locked to the live player; `--tts-config` rejects reserved keys (`voice`/`language`/`sample_rate`). Precedence matches `stream` (a named flag/preset wins a head-to-head with `--stt-config`).

Test plan

- `./scripts/check.sh`: 2738 tests, 100% patch coverage, diff-scoped mutation gate, build/twine.
- `assembly --sandbox speak "hello there"` writes a valid 0.96s / 24kHz WAV; a fully-customized file-driven `agent-cascade` run drives STT→LLM→TTS with no error.

🤖 Generated with Claude Code
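The `KEY=VALUE` escape hatch, the reserved-key check, and the precedence rule can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: `parse_key_values` and `merge_with_named` are hypothetical names, values are kept as strings, and only the reserved key set (`voice`, `language`, `sample_rate`) and the "named flag wins" rule are taken from the description.

```python
from typing import Dict, Iterable, Mapping, Optional

# Reserved keys listed in the description for --tts-config.
RESERVED_TTS_KEYS = frozenset({"voice", "language", "sample_rate"})


def parse_key_values(
    pairs: Iterable[str], reserved: frozenset = frozenset()
) -> Dict[str, str]:
    """Parse repeated KEY=VALUE options into a dict, rejecting reserved keys."""
    extra: Dict[str, str] = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        key = key.strip()
        if not sep or not key:
            raise ValueError(f"expected KEY=VALUE, got {pair!r}")
        if key in reserved:
            raise ValueError(f"{key!r} is a reserved key")
        extra[key] = value  # values stay strings in this sketch
    return extra


def merge_with_named(
    extra: Mapping[str, str], named: Mapping[str, Optional[str]]
) -> Dict[str, str]:
    """Overlay named flags on the escape-hatch dict; a set named flag wins."""
    merged = dict(extra)
    merged.update({k: v for k, v in named.items() if v is not None})
    return merged
```

For example, `merge_with_named(parse_key_values(["speech_model=a"]), {"speech_model": "b"})` yields `{"speech_model": "b"}`, so the named flag overrides the escape-hatch value.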