SkillOpt-Sleep — plugins for Claude Code, Codex, and Copilot

Your coding agent forgets everything between sessions. SkillOpt-Sleep fixes that. While you sleep, it reviews what you did today, notices the rules you keep repeating ("always add a LIMIT", "answers in \boxed{}", "cite the source"), and writes them into your agent's long-term memory and skills — but only the rules that actually make it score better on your own past tasks. You wake up to an agent that's better at your work, and you approve every change before it sticks.

One engine, three thin shells. It synthesizes SkillOpt (validation-gated bounded text optimization — the research in this repo), Claude Dreams (offline consolidation; input never mutated; review-then-adopt), and the agent sleep idea (short-term experience → long-term competence).

Open-source tool, decoupled from the research. The engine lives in the top-level skillopt_sleep/ package with zero dependency on the paper's skillopt/ experiment code (the validation gate is vendored). Use it without the research stack.


| Platform | Folder | Mechanism | Status |
|---|---|---|---|
| Claude Code | claude-code/ | .claude-plugin + /skillopt-sleep command + skill + hooks | full, installable |
| Codex | codex/ | user-level skillopt-sleep skill + shared runner | full |
| Copilot | copilot/ | MCP server (sleep_* tools) + copilot-instructions | full (MCP) |

Install (pick your agent)

| Platform | Install | Then |
|---|---|---|
| Claude Code | /plugin marketplace add microsoft/SkillOpt → /plugin install skillopt-sleep | /skillopt-sleep status |
| Codex | git clone → bash plugins/codex/install.sh | /skillopt-sleep status |
| Copilot | git clone → register plugins/copilot/mcp_server.py as an MCP server | ask "run the sleep cycle" |

Requirements: Python ≥ 3.10 and the agent's CLI on PATH. All three call the same run-sleep.sh → python -m skillopt_sleep, so behaviour is identical everywhere. Default backend is mock (no API spend); --backend claude|codex uses your own budget.


How it works: one "night", in plain terms

harvest your past sessions → mine the tasks you keep doing → replay them offline
  → reflect on failures → propose a few rule edits → KEEP only edits that raise
    your held-out score → stage a proposal → (you) review & adopt

Nothing live changes until you adopt; every adopt backs up the prior file.
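The backup-on-adopt behaviour can be pictured with a short Python sketch. It is an illustration only, not the engine's code: the function name adopt_with_backup, the .bak naming scheme, and the timestamp format are all assumptions.

```python
import shutil
import time
from pathlib import Path
from typing import Optional


def adopt_with_backup(target: Path, proposed_text: str) -> Optional[Path]:
    """Copy `target` aside (if it exists), then overwrite it with the proposal.

    Hypothetical helper: the real backup location and naming may differ.
    """
    backup = None
    if target.exists():
        stamp = time.strftime("%Y%m%d-%H%M%S")
        backup = target.with_name(f"{target.name}.{stamp}.bak")
        shutil.copy2(target, backup)  # preserve the prior file before any write
    target.write_text(proposed_text, encoding="utf-8")
    return backup
```

The function returns the backup path (or None when there was no prior file), so a caller could report where the previous version was saved.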

The split that keeps it honest: dream-train / real-val / real-test

This is the heart of the design, borrowed from the SkillOpt paper's train/selection/test protocol:

| Split | Where it comes from | What it's for |
|---|---|---|
| train | your real tasks + optional "dreamed" variants | what the optimizer learns from. Over-dreaming here is fine: it's imagination. |
| val (selection) | your real tasks only, held out | the gate: an edit is kept only if it raises this score. Stops overfitting. |
| test | your real tasks only, held out, never seen during optimization | the final score we report. Kept as close to your real usage as possible. |

So you can dream up extra training examples to learn a rule robustly, while the rule is still judged on real, unseen tasks. A dream task can never land in val or test — that invariant is unit-tested.
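A minimal Python sketch of that invariant, under stated assumptions: the Task class, the is_dream flag, and split_tasks are hypothetical names, and the 4:1:5 default ratio is taken from the academic daily-cases setup mentioned later in this README. The engine's real splitting code may differ.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Task:
    task_id: str
    is_dream: bool = False  # hypothetical marker for synthesized variants


def split_tasks(
    real: Sequence[Task],
    dreamed: Sequence[Task],
    ratios: Tuple[int, int, int] = (4, 1, 5),
    seed: int = 0,
) -> Tuple[List[Task], List[Task], List[Task]]:
    """Split real tasks into train/val/test; dreamed tasks join train only."""
    if any(t.is_dream for t in real):
        raise ValueError("real task list contains a dream task")
    shuffled = list(real)
    random.Random(seed).shuffle(shuffled)
    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_val = len(shuffled) * ratios[1] // total
    train = shuffled[:n_train] + list(dreamed)
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    # The invariant from the text: a dream task never lands in val or test.
    assert not any(t.is_dream for t in val + test)
    return train, val, test
```

With ten real tasks and two dreamed ones, this yields six training tasks (four real, two dreamed), one validation task, and five test tasks.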


What each feature does for you (with examples)

Every control below works on all three platforms (pass it after the action, e.g. /skillopt-sleep run --rollouts-k 3).

--preferences "..." — tell it your house rules

The single most useful knob. Free text that steers what the optimizer writes, as a prior. Use it to encode the conventions you're tired of repeating.

# A backend engineer:
/skillopt-sleep run --preferences "Always use async/await, never callbacks. \
  Prefer pytest over unittest. Commit subjects in imperative mood under 50 chars."

# A data analyst:
/skillopt-sleep run --preferences "Every SQL query must end with LIMIT 1000 unless \
  I say otherwise. Money in USD with 2 decimals. Prefer CTEs over nested subqueries."

# A researcher:
/skillopt-sleep run --preferences "Cite sources as [Author, Year]. Math answers in \
  \\boxed{}. Keep explanations under 150 words unless I ask for depth."

What it does for you: the next morning your agent already follows these without you re-typing them, and the rules are validated against your real tasks (if a "preference" actually hurts your held-out score, the gate drops it).

--gate on|off — strict vs. greedy

  • on (default): an edit is kept only if it raises your held-out score. Safe — blocks plausible-but-wrong rules and reward-hacking.
  • off: greedy — keep edits without the strict check (still reports whether quality moved).

What it does for you: leave it on for trust. Flip it off when you're exploring and want to see everything the optimizer proposes.
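The difference between the two modes can be sketched in a few lines of Python. This is a simplified illustration, not the vendored validation gate: gated_update and the score callback are hypothetical names, and the real gate may compare scores differently (for example with a margin or over several seeds).

```python
from typing import Callable, Iterable, List


def gated_update(
    rules: List[str],
    candidates: Iterable[str],
    score: Callable[[List[str]], float],  # hypothetical held-out scorer
    gate: bool = True,
) -> List[str]:
    """Add candidate rules one by one.

    gate=True : keep a rule only if it strictly raises the held-out score.
    gate=False: greedy, keep every candidate.
    """
    kept = list(rules)
    best = score(kept)
    for rule in candidates:
        trial = kept + [rule]
        trial_score = score(trial)
        if not gate or trial_score > best:
            kept, best = trial, trial_score
    return kept


def toy_score(rules: List[str]) -> float:
    """Toy scorer: the LIMIT rule helps, the harmful rule ruins the score."""
    return 1.0 if "add LIMIT" in rules and "drop WHERE" not in rules else 0.0


print(gated_update([], ["add LIMIT", "drop WHERE"], toy_score, gate=True))
print(gated_update([], ["add LIMIT", "drop WHERE"], toy_score, gate=False))
```

With the gate on, only the helpful rule survives; with the gate off, both candidates are kept, including the one that lowers the score.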

--rollouts-k K — learn from contrast, not just failure

Re-runs each task K times and learns from the difference between the good and bad attempts, not just a single failure.

/skillopt-sleep run --rollouts-k 3

What it does for you: a much stronger signal. If your agent gets a task right 1 time in 3, the optimizer figures out what the winning attempt did and makes it reliable.
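One way to picture the contrastive signal is the Python sketch below. It assumes a hypothetical attempt(task, i) callable that produces the i-th rollout and a grade(task, output) callable that scores it; neither name comes from the engine, and the real reflection step is likely richer than picking a best/worst pair.

```python
from typing import Callable, Optional, Tuple


def contrast_pair(
    task: str,
    attempt: Callable[[str, int], str],   # hypothetical: run rollout number i
    grade: Callable[[str, str], float],   # hypothetical: score one output
    k: int = 3,
) -> Optional[Tuple[str, str]]:
    """Run `task` k times and return (best_output, worst_output).

    Returns None when every attempt scores the same, because there is
    then no good-versus-bad difference to learn from.
    """
    scored = [(grade(task, out), out) for out in (attempt(task, i) for i in range(k))]
    best = max(scored, key=lambda pair: pair[0])
    worst = min(scored, key=lambda pair: pair[0])
    if best[0] == worst[0]:
        return None
    return best[1], worst[1]
```

An optimizer could then be asked what the winning output did that the losing one did not, which is the "1 time in 3" case described above.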

--optimizer-model / --target-model — optimize cheap, deploy anywhere

Use a strong model to write the rules and a cheap model to run your tasks. The learned skill then helps the cheap model — or any model.

/skillopt-sleep run --optimizer-model sonnet --target-model haiku

What it does for you: spend a little on a smart optimizer overnight; your everyday cheap/fast agent inherits the upgrade. (Verified: a skill optimized on one model lifts a different one — cross-model and even cross-runtime Codex↔Claude.)

--budget-tokens N / --budget-minutes M — cap the spend

You decide how much the nightly "dreaming" costs; it auto-plans how many nights × how many rollouts fit.

/skillopt-sleep run --backend claude --budget-tokens 60000

What it does for you: predictable cost. It stops cleanly when the budget is hit and tells you what it skipped.
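A rough Python sketch of budget-capped execution follows. It assumes a per-task token estimate is available up front and that each run reports the tokens it actually spent; run_within_budget, run_task, and est_tokens_per_task are hypothetical names, not the engine's planner.

```python
from typing import Callable, List, Tuple


def run_within_budget(
    tasks: List[str],
    run_task: Callable[[str], int],  # hypothetical: returns tokens spent
    budget_tokens: int,
    est_tokens_per_task: int,        # assumed estimate, known in advance
) -> Tuple[List[str], List[str], int]:
    """Run tasks in order until the next one would not fit the budget.

    Returns (done, skipped, tokens_spent) so the caller can report
    exactly what was left out.
    """
    done: List[str] = []
    spent = 0
    for i, task in enumerate(tasks):
        if spent + est_tokens_per_task > budget_tokens:
            return done, tasks[i:], spent
        spent += run_task(task)
        done.append(task)
    return done, [], spent
```

Checking the estimate before each task, rather than the total afterwards, is what lets the loop stop before overspending instead of after.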

multi-objective (accuracy ↑, tokens ↓, latency ↓)

The reward can weight not just correctness but cost and speed, so a skill can learn to be cheaper and faster, not only more accurate.

What it does for you: "answer directly instead of opening five files" becomes a learned habit.
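As an illustration of such a reward, here is a minimal weighted-sum sketch in Python. The weights, the reference values used for normalization, and the function name are all assumptions made for this example; the README does not specify the actual formula.

```python
def reward(
    accuracy: float,
    tokens: int,
    latency_s: float,
    *,
    w_acc: float = 1.0,        # assumed weights, not the engine's defaults
    w_tok: float = 0.1,
    w_lat: float = 0.1,
    token_ref: int = 10_000,   # assumed normalization references
    latency_ref: float = 60.0,
) -> float:
    """Scalar reward: accuracy counts up, token and latency costs count down."""
    token_cost = min(tokens / token_ref, 1.0)
    latency_cost = min(latency_s / latency_ref, 1.0)
    return w_acc * accuracy - w_tok * token_cost - w_lat * latency_cost


# Two equally correct runs: one answers directly, one opens many files first.
direct = reward(1.0, tokens=1_000, latency_s=6.0)
wandering = reward(1.0, tokens=8_000, latency_s=48.0)
print(direct > wandering)  # the cheaper, faster run earns the higher reward
```

Because both runs are equally accurate, only the cost terms separate them, which is how a "be direct" habit could be rewarded.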

schedule / unschedule — set it and forget it

Built-in nightly scheduling (no manual cron):

/skillopt-sleep schedule --hour 3 --minute 17     # runs every night for this project
/skillopt-sleep unschedule                        # stop it

What it does for you: it just gets better while you sleep. The nightly run only stages a proposal — adopting is still your call (or add --auto-adopt when you schedule, if you trust it).
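For readers curious what a nightly entry for --hour 3 --minute 17 might look like, the Python sketch below formats a standard five-field cron line. The command string passed in is an assumption for illustration; the entry that schedule actually installs may use a different command or wrapper script.

```python
import shlex


def cron_line(hour: int, minute: int, project_dir: str, command: str) -> str:
    """Build a daily cron entry: minute, hour, then '* * *' for every day."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute must be 0-59")
    return f"{minute} {hour} * * * cd {shlex.quote(project_dir)} && {command}"


# Hypothetical command; check the installed entry for the real one.
print(cron_line(3, 17, "/home/me/proj", "python -m skillopt_sleep run"))
```

Note that cron puts the minute field before the hour field, the reverse of the order of the flags.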


Full action / flag reference

| Action | Does |
|---|---|
| status | nights so far + the latest staged proposal (read-only) |
| dry-run | harvest→mine→replay→report; stages nothing |
| run | full cycle; stages a proposal; nothing live changes |
| adopt | apply the staged proposal to CLAUDE.md/SKILL.md (backs up first) |
| harvest | debug: print the recurring tasks it mined |
| schedule / unschedule | install/remove the nightly cron entry |

| Flag | Default | Meaning |
|---|---|---|
| --backend mock\|claude\|codex | mock | who runs/optimizes (mock = free) |
| --preferences "..." | | your house rules, as a prior |
| --gate on\|off | on | strict held-out gate vs. greedy |
| --rollouts-k K | 1 | multi-rollout contrastive reflection |
| --optimizer-model / --target-model | | split the optimizer from the target |
| --budget-tokens / --budget-minutes | | cap the nightly spend |
| --scope invoked\|all | invoked | this project only, or all projects |
| --auto-adopt | off | apply without manual review (power users) |

Deep dive: the SkillOpt-Sleep guide section.


Does it actually work?

Yes — measured with real models on both Claude and Codex, scored on held-out tasks the optimizer never trained on:

  • gbrain-evals skillopt-v1 (the public suite gbrain scores SkillOpt on): deficient skills go 0.00 → 1.00 on all 4 seeds, including a real tool-use loop; cross-model transfer is positive; the gate blocks regressions. → the SkillOpt-Sleep guide section
  • Academic daily-cases (math / spreadsheet / search-QA, the paper's 4:1:5 split with dream-augmented train): see the SkillOpt-Sleep guide section.
  • Fresh load-test (a "SQL must always include LIMIT" analyst, built from scratch): held-out 0.00 → 1.00 on both backends. → the SkillOpt-Sleep guide section

Try the deterministic proof yourself (no API key, no spend):

python -m skillopt_sleep.experiments.run_experiment --persona researcher --assert-improves

It prints the held-out score rising to 1.0 as the gate accepts the right rules, and confirms the gate rejects an injected harmful edit.


Safety

  • Read-only harvest of your sessions. mock replay has no side effects.
  • Proposals are staged, never auto-applied (unless you opt in with --auto-adopt).
  • Every adopt writes a backup. Per-night token/time budget caps. Secrets redacted.