Add voice-controlled computer use: `assembly control` command by alexkroman · Pull Request #271 · AssemblyAI/cli

alexkroman · 2026-06-23T19:39:09Z

Implements a new assembly control command that enables hands-free macOS UI automation through voice instructions. Users speak commands, which are transcribed via Streaming STT and executed by an LLM agent that decides which UI actions to take (typing, key chords, clicking elements, launching apps) through a native Swift helper.

Key changes

New aai_cli/control/ module — The core agent loop and supporting infrastructure:
- engine.py: Pure observe/act loop with injected responder/executor/renderer seams for testability
- actions.py: Action vocabulary (type_text, key_combo, click, launch_app, focus_app, get_ui_tree, screenshot)
- bridge.py: Adapts LLM Gateway (OpenAI-compatible) into the engine's responder interface
- tools.py: Exposes actions as OpenAI function-calling tool definitions
- helper.py: Manages the native Swift helper process (compile-once, run-long-lived, JSON-lines protocol)
- listen.py: Converts mic Streaming STT into an utterance stream (queue + worker thread)
- render.py: Surfaces loop progress (human stderr narration or NDJSON events)
- prompt.py: System prompt briefing the model on the voice-control loop
Native macOS helper — aai_cli/control/macos_ui_control.swift:
- Compiled once and cached by digest; runs as a long-lived child process
- Handles synthetic input (CGEvent for keystrokes/clicks), accessibility tree reading, app launch/focus
- JSON-lines request/response protocol matching the streaming system-audio helper pattern
Command wiring — aai_cli/commands/control/:
- __init__.py: Typer command with options (device, sample_rate, model, max_tokens, max_steps, dry_run, json)
- _exec.py: Run logic with injectable dependencies (transcripts, responder, helper) for testability
Comprehensive test coverage:
- tests/test_control.py: Pure loop, actions, engine, bridge, rendering (all external legs faked)
- tests/test_control_exec.py: Helper transport, build, mic listener, command wiring (macOS paths mocked)
- tests/_control_helpers.py: Shared fakes (RecordingRenderer, FakeProc, scripted responder, etc.)
Integration:
- Registered in command registry with help panel and ordering
- Updated help snapshots and root command list
- Added Swift resource to wheel artifacts in pyproject.toml
- Added control module to import-linter architecture contracts

Notable implementation details

Dependency injection throughout: Every external leg (mic, LLM, helper subprocess) is injectable so the loop is exercised with fakes — no microphone, network, subprocess, or macOS required in tests
macOS-only with graceful fallback: Platform check and Swift compiler detection raise CLIError with helpful suggestions; --dry-run mode refuses mutating actions but runs observe actions so the model can still "see"
Caching strategy: Helper binary is cached by source digest in user cache dir; rebuild only if source changes
Step budget: Per-turn step limit prevents runaway loops; hitting the budget surfaces feedback to the user
Tool validation: Model tool calls are validated against ACTION_SPECS before execution; invalid/refused calls are reported back to the model as failed tool results, not crashes

https://claude.ai/code/session_01PiUeSiTo5aV99PPfEQkuNc

A hands-free, voice-in/voice-out terminal agent that turns spoken instructions into real macOS UI actions — the "voice control plane" a browser/web service can't be, because it drives the actual desktop. Architecture (a `control/` feature slice with every external leg behind an injected seam, so the loop is hermetically testable with no mic, network, subprocess, or macOS): - actions/tools: the action vocabulary + its OpenAI function-calling schema. - engine: the pure observe/act loop (transcript -> LLM tool calls -> execute). - bridge: adapts the LLM Gateway into the engine's Responder seam. - listen: mic Streaming STT -> finalized utterances. - helper: spawns/talks JSON to a bundled Swift helper (CGEvent + the Accessibility API + NSWorkspace) — the "hands". - macos_ui_control.swift: the native helper (Codable JSON-lines protocol). `--dry-run` refuses every UI-mutating action (observe-only). macOS-only; fails fast elsewhere. Registered additively via SPEC; full gate green (100% patch coverage, mutation, types, lint, architecture contracts). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01PiUeSiTo5aV99PPfEQkuNc

aikido-pr-checks · 2026-06-23T19:39:37Z

+        if self._json:
+            self._event("user", text=text)
+        else:
+            output.error_console.print(output.muted(f"you: {text}"))


ControlRenderer.on_user prints user speech verbatim to stderr; avoid logging unsanitized user-controlled text (mask, truncate, or omit sensitive data).

Details

✨ AI Reasoning
The renderer's on_user implementation prints the finalized spoken instruction directly to stderr (error_console.print) in human mode. This logs unsanitized user-controlled speech (potential PII or CR/LF log injection) with no masking or sanitization.

🔧 How do I fix it?
Keep sensitive data such as emails, passwords, and tokens out of logs. When logging values tied to a user, prefer a safe identifier like a user ID over the raw input, and strip line breaks from any user-provided text you do log.

_{Reply @AikidoSec feedback: [FEEDBACK] to get better review comments in the future.}
_{Reply @AikidoSec ignore: [REASON] to ignore this issue.}
_{More info}

+    hands = deps.helper()
+    try:
+        api_key = state.resolve_api_key()
+        respond = deps.responder(api_key, opts)


+    try:
+        api_key = state.resolve_api_key()
+        respond = deps.responder(api_key, opts)
+        transcripts = deps.transcripts(api_key, opts)


alexkroman enabled auto-merge June 23, 2026 19:39

aikido-pr-checks Bot reviewed Jun 23, 2026

View reviewed changes

github-code-quality Bot found potential problems Jun 23, 2026

View reviewed changes

alexkroman added this pull request to the merge queue Jun 23, 2026

Merged via the queue into main with commit b7eb293 Jun 23, 2026
21 checks passed

alexkroman deleted the claude/zen-sagan-39ruol branch June 23, 2026 19:48

alexkroman mentioned this pull request Jun 23, 2026

Fix assembly control: helper failed to build on current macOS SDK #272

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add voice-controlled computer use: `assembly control` command#271

Add voice-controlled computer use: `assembly control` command#271
alexkroman merged 1 commit into
mainfrom
claude/zen-sagan-39ruol

alexkroman commented Jun 23, 2026

Uh oh!

aikido-pr-checks Bot Jun 23, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

alexkroman commented Jun 23, 2026

Key changes

Notable implementation details

Uh oh!

aikido-pr-checks Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants