Fix agent-cascade/speak streaming TTS and thread STT/LLM/TTS options#176

Merged: alexkroman merged 1 commit into main from agent-framework-tts-and-options on Jun 16, 2026
@alexkroman

Collaborator

Summary

assembly agent-framework (and assembly speak) could not produce audio. This fixes three independent streaming-TTS client bugs — each was masking the next — and then threads full STT/LLM/TTS customization into agent-framework.

Bug fixes (aai_cli/tts/session.py, aai_cli/streaming/diagnostics.py)

| Bug | Symptom | Fix |
| --- | --- | --- |
| Wrong auth scheme (`Bearer <key>` vs raw key) | "did not start the session (got 'Error')" | `open_authorized_ws` gained `bearer=` (default `True` for Voice Agent); TTS passes `bearer=False` |
| Wrong flush tag (`ForceFlushTextBuffer`) | "TTS error (3006): … 'Flush'" | send `{"type": "Flush"}` |
| Missing end-of-stream marker (waited for `Audio.is_final`, server sends `FlushDone`) | audio arrived but loop hung 60s then "stopped responding" | break the collect loop on `FlushDone` |

Supporting: the session-start error path now surfaces the server's error_code/error instead of a generic got 'Error' (this is how the flush validation error became visible).
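
A minimal sketch of what surfacing the server's error might look like. The helper name `describe_start_failure` and the frame shape (a dict with `type`, `error_code`, and `error` keys) are assumptions for illustration; this is not the code in aai_cli/streaming/diagnostics.py.

```python
from typing import Any, Mapping


def describe_start_failure(frame: Mapping[str, Any]) -> str:
    """Build a session-start failure message from the first server frame.

    Hypothetical helper: the frame layout is assumed, not taken from the CLI.
    """
    frame_type = frame.get("type")
    if frame_type == "Error":
        code = frame.get("error_code")
        detail = frame.get("error")
        if code is not None or detail:
            # Surface the server's own diagnosis instead of a generic message.
            label = "TTS error" if code is None else f"TTS error ({code})"
            return label if not detail else f"{label}: {detail}"
    # Fall back to the generic wording when the frame carries no details.
    return f"did not start the session (got {frame_type!r})"
```

With the details passed through, a rejected message tag shows up as a numbered TTS error rather than an opaque session-start failure.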

The working agent_framework init template was the reference for all three — it authenticates with the raw key and sends Flush.
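
The three fixes can be sketched together as follows. This is an illustration under stated assumptions, not the code in aai_cli/tts/session.py: the helper names are hypothetical, frames are assumed to be already-decoded JSON dicts, and the audio payload is assumed to be base64 text under an `audio` key. Only the `Flush` and `FlushDone` tags and the raw-key vs Bearer distinction come from this PR.

```python
import base64
import json
from typing import Any, Dict, Iterable, Mapping


def auth_headers(api_key: str, bearer: bool = True) -> Dict[str, str]:
    """Return the Authorization header; bearer=False sends the raw key."""
    value = f"Bearer {api_key}" if bearer else api_key
    return {"Authorization": value}


def flush_message() -> str:
    """Serialize the flush request with the tag the server accepts."""
    return json.dumps({"type": "Flush"})


def collect_audio(frames: Iterable[Mapping[str, Any]]) -> bytes:
    """Accumulate audio until the server signals the end of a synthesis."""
    pcm = bytearray()
    for frame in frames:
        kind = frame.get("type")
        if kind == "Audio":
            # Assumed payload layout: base64 text under an "audio" key.
            pcm.extend(base64.b64decode(frame["audio"]))
        elif kind == "FlushDone":
            # End-of-stream marker: stop here instead of waiting for is_final.
            break
        elif kind == "Error":
            raise RuntimeError(
                f"TTS error ({frame.get('error_code')}): {frame.get('error')}"
            )
    return bytes(pcm)
```

Breaking on `FlushDone` is what keeps the loop from sitting in the receive call until the timeout fires after the last audio frame has already arrived.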

Feature: per-leg customization on agent-framework

Hybrid surface (common named flags + per-leg KEY=VALUE escape hatches), grouped into --help panels:

  • Speech-to-text: --speech-model, --format-turns/--no-format-turns, --turn-detection, --stt-config, --stt-config-file
  • Language model: --max-tokens, --llm-config
  • Text-to-speech: --language, --tts-config

Details: the reply trigger is now format-aware so --no-format-turns still replies; the TTS sample rate stays locked to the live speaker; --tts-config rejects reserved keys (voice/language/sample_rate). Precedence matches stream (named flag/preset wins a head-to-head with --stt-config).
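
A rough sketch of the reserved-key check and the precedence rule described above. All function names and the example keys (`speed`, `speech_model`) are hypothetical; values are left as strings here, and the real CLI's parsing may differ.

```python
from typing import Dict, Iterable, Mapping

# Keys the CLI sets itself, so --tts-config may not override them.
RESERVED_TTS_KEYS = frozenset({"voice", "language", "sample_rate"})


def parse_pairs(pairs: Iterable[str]) -> Dict[str, str]:
    """Parse KEY=VALUE strings into a dict (values stay as strings)."""
    parsed: Dict[str, str] = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=VALUE, got {pair!r}")
        parsed[key] = value
    return parsed


def build_tts_extra(pairs: Iterable[str]) -> Dict[str, str]:
    """Parse --tts-config pairs, rejecting any reserved key."""
    extra = parse_pairs(pairs)
    clash = sorted(RESERVED_TTS_KEYS.intersection(extra))
    if clash:
        raise ValueError(f"--tts-config cannot set reserved keys: {', '.join(clash)}")
    return extra


def merge_stt(named: Mapping[str, object], config_pairs: Iterable[str]) -> Dict[str, object]:
    """Merge --stt-config pairs with named flags; named flags win."""
    merged: Dict[str, object] = dict(parse_pairs(config_pairs))
    # Apply named flags last so they override; None means the flag was not given.
    merged.update({k: v for k, v in named.items() if v is not None})
    return merged
```

For example, `merge_stt({"speech_model": "a"}, ["speech_model=b"])` keeps `"a"`, which mirrors the stated rule that a named flag wins a head-to-head with --stt-config.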

Test plan

  • Full gate green (./scripts/check.sh): 2738 tests, 100% patch coverage, diff-scoped mutation gate, build/twine.
  • Verified live against the sandbox: assembly --sandbox speak "hello there" writes a valid 0.96s / 24kHz WAV; a fully-customized file-driven agent-framework run drives STT→LLM→TTS with no error.

🤖 Generated with Claude Code

The terminal cascade (assembly agent-cascade) and assembly speak could not
produce audio. Three independent bugs in the streaming-TTS client, each masking
the next, plus a diagnostics gap:

- Auth scheme: the TTS socket was opened with 'Authorization: Bearer <key>', but
  AssemblyAI streaming endpoints authenticate with the raw key (only Voice Agent
  uses Bearer). open_authorized_ws gained a bearer= flag (default True); TTS now
  passes bearer=False, matching the working agent-cascade init template.
- Flush tag: sent 'ForceFlushTextBuffer'; the server's tag is 'Flush'.
- End-of-stream: the loop waited for an Audio frame with is_final, which the live
  server never sets — it ends a synthesis with a 'FlushDone' frame. Without
  handling it the loop blocked until the 60s recv timeout and the audio was lost.
- The session-start error path discarded the server's Error-frame contents
  (generic "got 'Error'"); it now surfaces error_code/error, which is how the
  flush validation error became visible.

Also thread per-leg customization into agent-cascade (hybrid: common named
flags + per-leg KEY=VALUE escape hatches), grouped into --help panels:
- STT: --speech-model, --format-turns/--no-format-turns, --turn-detection,
  --stt-config, --stt-config-file
- LLM: --max-tokens, --llm-config
- TTS: --language, --tts-config (new SpeakConfig.extra)

The reply trigger is now format-aware so --no-format-turns still replies; TTS
sample rate stays locked to the live player; --tts-config rejects reserved keys.

Verified against the live sandbox: assembly speak produces a valid 24kHz WAV and
a fully-customized file-driven cascade runs STT->LLM->TTS without error.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexkroman alexkroman force-pushed the agent-framework-tts-and-options branch from f743ff2 to 00128ac Compare June 16, 2026 14:00
@alexkroman alexkroman changed the title Fix agent-framework/speak streaming TTS and thread STT/LLM/TTS options Fix agent-cascade/speak streaming TTS and thread STT/LLM/TTS options Jun 16, 2026
@alexkroman

Collaborator Author

Rebased onto main after #175 renamed agent-framework → agent-cascade. Ported all changes onto the new module/command/symbol names (agent_cascade, AgentCascadeOptions, run_agent_cascade, command agent-cascade); the TTS bug-fix files (tts/session.py, streaming/diagnostics.py) were unaffected by the rename. Full gate re-run green against the new base.

@alexkroman alexkroman added this pull request to the merge queue Jun 16, 2026
Merged via the queue into main with commit 53b3141 Jun 16, 2026
19 checks passed
@alexkroman alexkroman deleted the agent-framework-tts-and-options branch June 16, 2026 14:23