Skip to content

Add voice-controlled computer use: assembly control command#271

Merged
alexkroman merged 1 commit into
mainfrom
claude/zen-sagan-39ruol
Jun 23, 2026
Merged

Add voice-controlled computer use: assembly control command#271
alexkroman merged 1 commit into
mainfrom
claude/zen-sagan-39ruol

Conversation

@alexkroman

Copy link
Copy Markdown
Collaborator

Implements a new assembly control command that enables hands-free macOS UI automation through voice instructions. Users speak commands, which are transcribed via Streaming STT and executed by an LLM agent that decides which UI actions to take (typing, key chords, clicking elements, launching apps) through a native Swift helper.

Key changes

  • New aai_cli/control/ module — The core agent loop and supporting infrastructure:

    • engine.py: Pure observe/act loop with injected responder/executor/renderer seams for testability
    • actions.py: Action vocabulary (type_text, key_combo, click, launch_app, focus_app, get_ui_tree, screenshot)
    • bridge.py: Adapts LLM Gateway (OpenAI-compatible) into the engine's responder interface
    • tools.py: Exposes actions as OpenAI function-calling tool definitions
    • helper.py: Manages the native Swift helper process (compile-once, run-long-lived, JSON-lines protocol)
    • listen.py: Converts mic Streaming STT into an utterance stream (queue + worker thread)
    • render.py: Surfaces loop progress (human stderr narration or NDJSON events)
    • prompt.py: System prompt briefing the model on the voice-control loop
  • Native macOS helperaai_cli/control/macos_ui_control.swift:

    • Compiled once and cached by digest; runs as a long-lived child process
    • Handles synthetic input (CGEvent for keystrokes/clicks), accessibility tree reading, app launch/focus
    • JSON-lines request/response protocol matching the streaming system-audio helper pattern
  • Command wiringaai_cli/commands/control/:

    • __init__.py: Typer command with options (device, sample_rate, model, max_tokens, max_steps, dry_run, json)
    • _exec.py: Run logic with injectable dependencies (transcripts, responder, helper) for testability
  • Comprehensive test coverage:

    • tests/test_control.py: Pure loop, actions, engine, bridge, rendering (all external legs faked)
    • tests/test_control_exec.py: Helper transport, build, mic listener, command wiring (macOS paths mocked)
    • tests/_control_helpers.py: Shared fakes (RecordingRenderer, FakeProc, scripted responder, etc.)
  • Integration:

    • Registered in command registry with help panel and ordering
    • Updated help snapshots and root command list
    • Added Swift resource to wheel artifacts in pyproject.toml
    • Added control module to import-linter architecture contracts

Notable implementation details

  • Dependency injection throughout: Every external leg (mic, LLM, helper subprocess) is injectable so the loop is exercised with fakes — no microphone, network, subprocess, or macOS required in tests
  • macOS-only with graceful fallback: Platform check and Swift compiler detection raise CLIError with helpful suggestions; --dry-run mode refuses mutating actions but runs observe actions so the model can still "see"
  • Caching strategy: Helper binary is cached by source digest in user cache dir; rebuild only if source changes
  • Step budget: Per-turn step limit prevents runaway loops; hitting the budget surfaces feedback to the user
  • Tool validation: Model tool calls are validated against ACTION_SPECS before execution; invalid/refused calls are reported back to the model as failed tool results, not crashes

https://claude.ai/code/session_01PiUeSiTo5aV99PPfEQkuNc

A hands-free, voice-in/voice-out terminal agent that turns spoken
instructions into real macOS UI actions — the "voice control plane" a
browser/web service can't be, because it drives the actual desktop.

Architecture (a `control/` feature slice with every external leg behind an
injected seam, so the loop is hermetically testable with no mic, network,
subprocess, or macOS):
- actions/tools: the action vocabulary + its OpenAI function-calling schema.
- engine: the pure observe/act loop (transcript -> LLM tool calls -> execute).
- bridge: adapts the LLM Gateway into the engine's Responder seam.
- listen: mic Streaming STT -> finalized utterances.
- helper: spawns/talks JSON to a bundled Swift helper (CGEvent + the
  Accessibility API + NSWorkspace) — the "hands".
- macos_ui_control.swift: the native helper (Codable JSON-lines protocol).

`--dry-run` refuses every UI-mutating action (observe-only). macOS-only;
fails fast elsewhere. Registered additively via SPEC; full gate green
(100% patch coverage, mutation, types, lint, architecture contracts).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01PiUeSiTo5aV99PPfEQkuNc
@alexkroman alexkroman enabled auto-merge June 23, 2026 19:39
Comment thread aai_cli/control/render.py
if self._json:
self._event("user", text=text)
else:
output.error_console.print(output.muted(f"you: {text}"))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ControlRenderer.on_user prints user speech verbatim to stderr; avoid logging unsanitized user-controlled text (mask, truncate, or omit sensitive data).

Details

✨ AI Reasoning
​The renderer's on_user implementation prints the finalized spoken instruction directly to stderr (error_console.print) in human mode. This logs unsanitized user-controlled speech (potential PII or CR/LF log injection) with no masking or sanitization.

🔧 How do I fix it?
Keep sensitive data such as emails, passwords, and tokens out of logs. When logging values tied to a user, prefer a safe identifier like a user ID over the raw input, and strip line breaks from any user-provided text you do log.

Reply @AikidoSec feedback: [FEEDBACK] to get better review comments in the future.
Reply @AikidoSec ignore: [REASON] to ignore this issue.
More info

hands = deps.helper()
try:
api_key = state.resolve_api_key()
respond = deps.responder(api_key, opts)
try:
api_key = state.resolve_api_key()
respond = deps.responder(api_key, opts)
transcripts = deps.transcripts(api_key, opts)
@alexkroman alexkroman added this pull request to the merge queue Jun 23, 2026
Merged via the queue into main with commit b7eb293 Jun 23, 2026
21 checks passed
@alexkroman alexkroman deleted the claude/zen-sagan-39ruol branch June 23, 2026 19:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants