Your coding agent forgets everything between sessions. SkillOpt-Sleep fixes
that. While you sleep, it reviews what you did today, notices the rules you
keep repeating ("always add a LIMIT", "answers in \boxed{}", "cite the
source"), and writes them into your agent's long-term memory and skills — but
only the rules that actually make it score better on your own past tasks. You
wake up to an agent that's better at your work, and you approve every change
before it sticks.
One engine, three thin shells. It synthesizes SkillOpt (validation-gated bounded text optimization — the research in this repo), Claude Dreams (offline consolidation; input never mutated; review-then-adopt), and the agent sleep idea (short-term experience → long-term competence).
Open-source tool, decoupled from the research. The engine lives in the top-level
skillopt_sleep/ package with zero dependency on the paper's skillopt/ experiment code (the validation gate is vendored). Use it without the research stack.

| Platform | Folder | Mechanism | Status |
|---|---|---|---|
| Claude Code | claude-code/ | .claude-plugin + /skillopt-sleep command + skill + hooks | full, installable |
| Codex | codex/ | user-level skillopt-sleep skill + shared runner | full |
| Copilot | copilot/ | MCP server (sleep_* tools) + copilot-instructions | full (MCP) |

| Platform | Install | Then |
|---|---|---|
| Claude Code | /plugin marketplace add microsoft/SkillOpt → /plugin install skillopt-sleep | /skillopt-sleep status |
| Codex | git clone → bash plugins/codex/install.sh | /skillopt-sleep status |
| Copilot | git clone → register plugins/copilot/mcp_server.py as an MCP server | ask "run the sleep cycle" |

Requirements: Python ≥ 3.10 and the agent's CLI on PATH. All three call the same
run-sleep.sh → python -m skillopt_sleep, so behaviour is
identical everywhere. Default backend is mock (no API spend); --backend claude|codex uses your own budget.
harvest your past sessions → mine the tasks you keep doing → replay them offline
→ reflect on failures → propose a few rule edits → KEEP only edits that raise
your held-out score → stage a proposal → (you) review & adopt
Nothing live changes until you adopt; every adopt backs up the prior file.
This is the heart of the design, borrowed from the SkillOpt paper's train/selection/test protocol:
| Split | Where it comes from | What it's for |
|---|---|---|
| train | your real tasks + optional "dreamed" variants | what the optimizer learns from. Over-dreaming here is fine — it's imagination. |
| val (selection) | your real tasks only, held out | the gate: an edit is kept only if it raises this score. Stops overfitting. |
| test | your real tasks only, held out, never seen during optimization | the final score we report. Kept as close to your real usage as possible. |
So you can dream up extra training examples to learn a rule robustly, while
the rule is still judged on real, unseen tasks. A dream task can never
land in val or test — that invariant is unit-tested.
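The split rule above can be sketched in a few lines. This is a minimal illustration, not the package's code: the `Task` record, its `is_dream` flag, the `split_tasks` helper, and the default 4:1:5 ratio are all assumptions made for the example.

```python
from dataclasses import dataclass
import random


@dataclass(frozen=True)
class Task:
    """Hypothetical task record; `is_dream` marks a synthetic variant."""
    prompt: str
    is_dream: bool = False


def split_tasks(real, dreamed, ratios=(4, 1, 5), seed=0):
    """Split real tasks into train/val/test, then add dreams to train only."""
    if any(t.is_dream for t in real):
        raise ValueError("real task list contains a dreamed task")
    shuffled = list(real)
    random.Random(seed).shuffle(shuffled)  # deterministic for a given seed
    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_val = len(shuffled) * ratios[1] // total
    train = shuffled[:n_train] + list(dreamed)  # dreams join train only
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    # The invariant: held-out splits contain real tasks only.
    assert not any(t.is_dream for t in val + test)
    return train, val, test
```

Because the real tasks are partitioned before any dreamed task is appended, there is no code path by which a dream can reach val or test.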
Every control below works on all three platforms (pass it after the action,
e.g. /skillopt-sleep run --rollouts-k 3).
The single most useful knob. Free text that steers what the optimizer writes, as a prior. Use it to encode the conventions you're tired of repeating.
# A backend engineer:
/skillopt-sleep run --preferences "Always use async/await, never callbacks. \
Prefer pytest over unittest. Commit subjects in imperative mood under 50 chars."
# A data analyst:
/skillopt-sleep run --preferences "Every SQL query must end with LIMIT 1000 unless \
I say otherwise. Money in USD with 2 decimals. Prefer CTEs over nested subqueries."
# A researcher:
/skillopt-sleep run --preferences "Cite sources as [Author, Year]. Math answers in \
\\boxed{}. Keep explanations under 150 words unless I ask for depth."

What it does for you: the next morning your agent already follows these without you re-typing them, and the rules are validated against your real tasks (if a "preference" actually hurts your held-out score, the gate drops it).
- on (default): an edit is kept only if it raises your held-out score. Safe: it blocks plausible-but-wrong rules and reward-hacking.
- off: greedy. Edits are kept without the strict check (the run still reports whether quality moved).

What it does for you: leave it on for trust. Flip it off when you're
exploring and want to see everything the optimizer proposes.
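As a rough sketch of the two gate modes, assuming edits are plain text rules appended to the skill and that a hypothetical `score_fn` returns the held-out (val) score for a given skill text; `gate_edits` is an illustrative name, not the engine's API:

```python
def gate_edits(skill, candidate_edits, score_fn, gate="on"):
    """Try edits one at a time; with the gate on, keep only score-raising ones.

    `score_fn(skill_text) -> float` is assumed to score on the held-out split.
    """
    best = score_fn(skill)
    kept, rejected = [], []
    for edit in candidate_edits:
        trial = skill + "\n" + edit
        trial_score = score_fn(trial)
        # Gate on: strict improvement required. Gate off: accept greedily.
        if gate == "off" or trial_score > best:
            skill, best = trial, trial_score
            kept.append(edit)
        else:
            rejected.append(edit)
    return skill, kept, rejected, best
```

The comparison is strict (`>`), so an edit that leaves the score unchanged is also rejected when the gate is on. With the gate off, the returned score can end up lower than where it started, which is the "reports whether quality moved" part.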
Re-runs each task K times and learns from the difference between the good
and bad attempts, not just a single failure.
/skillopt-sleep run --rollouts-k 3

What it does for you: a much stronger signal. If your agent gets a task right 1 time in 3, the optimizer figures out what the winning attempt did and makes it reliable.
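One way to picture the contrastive step, as a sketch under stated assumptions: `run_task` is a hypothetical callable that returns a transcript and a pass/fail flag for one attempt, and `contrastive_pairs` is an illustrative helper rather than anything exported by the package.

```python
def contrastive_pairs(run_task, tasks, k=3):
    """Run each task k times and pair a passing attempt with a failing one.

    `run_task(task, attempt_index) -> (transcript, passed)` is hypothetical.
    """
    pairs = []
    for task in tasks:
        attempts = [run_task(task, i) for i in range(k)]
        good = [t for t, ok in attempts if ok]
        bad = [t for t, ok in attempts if not ok]
        if good and bad:  # only mixed outcomes carry a contrast
            pairs.append({
                "task": task,
                "good": good[0],
                "bad": bad[0],
                "pass_rate": len(good) / k,
            })
    return pairs
```

Tasks that always pass or always fail produce no pair here; the useful signal comes from the flaky ones, where the same task has both a winning and a losing transcript to compare.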
Use a strong model to write the rules and a cheap model to run your tasks. The learned skill then helps the cheap model — or any model.
/skillopt-sleep run --optimizer-model sonnet --target-model haiku

What it does for you: spend a little on a smart optimizer overnight; your everyday cheap/fast agent inherits the upgrade. (Verified: a skill optimized on one model lifts a different one, both cross-model and cross-runtime, Codex↔Claude.)
You decide how much the nightly "dreaming" costs; it auto-plans how many nights × how many rollouts fit.
/skillopt-sleep run --backend claude --budget-tokens 60000

What it does for you: predictable cost. It stops cleanly when the budget is hit and tells you what it skipped.
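The planning idea can be illustrated with a simple greedy allocation. This is only a sketch: the per-rollout token estimate, the `plan_budget` name, and the greedy strategy are assumptions, and the real planner may work differently.

```python
def plan_budget(budget_tokens, tasks, est_tokens_per_rollout, rollouts_k=1):
    """Schedule tasks in order until the next one would exceed the budget."""
    cost_per_task = est_tokens_per_rollout * rollouts_k
    planned, skipped, spent = [], [], 0
    for task in tasks:
        if spent + cost_per_task <= budget_tokens:
            planned.append(task)
            spent += cost_per_task
        else:
            skipped.append(task)  # reported back instead of silently dropped
    return {"planned": planned, "skipped": skipped, "spent": spent}
```

Under these assumed numbers, a 60,000-token budget with an estimated 5,000 tokens per rollout and `rollouts_k=3` covers four tasks; a fifth would be listed as skipped.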
The reward can weight not just correctness but cost and speed, so a skill can learn to be cheaper and faster, not only more accurate. What it does for you: "answer directly instead of opening five files" becomes a learned habit.
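A multi-objective reward of this kind might look like the following. The weights, the reference values used for normalization, and the `reward` function itself are illustrative assumptions, not the engine's actual formula.

```python
def reward(correct, tokens_used, seconds, *, w_correct=1.0, w_cost=0.1,
           w_speed=0.1, token_ref=10_000, time_ref=60.0):
    """Correctness minus capped penalties for token cost and wall-clock time."""
    cost_penalty = min(tokens_used / token_ref, 1.0)  # capped at 1.0
    time_penalty = min(seconds / time_ref, 1.0)       # capped at 1.0
    return (w_correct * float(correct)
            - w_cost * cost_penalty
            - w_speed * time_penalty)
```

With these weights, two correct answers are separated only by their cost and latency, so a direct answer outscores one that read five files to reach the same result. Keeping the penalty weights small relative to `w_correct` means a cheap wrong answer never beats an expensive right one.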
Built-in nightly scheduling (no manual cron):
/skillopt-sleep schedule --hour 3 --minute 17 # runs every night for this project
/skillopt-sleep unschedule # stop it

What it does for you: it just gets better while you sleep. The nightly run only
stages a proposal; adopting is still your call (or add --auto-adopt when you
schedule, if you trust it).
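For orientation, --hour 3 --minute 17 corresponds to the standard five-field crontab schedule `17 3 * * *` (minute first, then hour). The helper below only formats such a line; the function name and the command string passed to it are placeholders, not what the scheduler actually installs.

```python
def cron_line(hour, minute, command):
    """Format a five-field crontab entry that fires once a day at hour:minute."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute 0-59")
    # crontab field order: minute hour day-of-month month day-of-week
    return f"{minute} {hour} * * * {command}"


# Placeholder command; the real entry is written by `schedule`.
print(cron_line(3, 17, "<nightly sleep command>"))
```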

| Action | Does |
|---|---|
| status | nights so far + the latest staged proposal (read-only) |
| dry-run | harvest→mine→replay→report; stages nothing |
| run | full cycle; stages a proposal; nothing live changes |
| adopt | apply the staged proposal to CLAUDE.md/SKILL.md (backs up first) |
| harvest | debug: print the recurring tasks it mined |
| schedule / unschedule | install/remove the nightly cron entry |

| Flag | Default | Meaning |
|---|---|---|
| --backend mock\|claude\|codex | mock | who runs/optimizes (mock = free) |
| --preferences "..." | – | your house rules, as a prior |
| --gate on\|off | on | strict held-out gate vs. greedy |
| --rollouts-k K | 1 | multi-rollout contrastive reflection |
| --optimizer-model / --target-model | – | split the optimizer from the target |
| --budget-tokens / --budget-minutes | – | cap the nightly spend |
| --scope invoked\|all | invoked | this project only, or all projects |
| --auto-adopt | off | apply without manual review (power users) |

Deep dive: the SkillOpt-Sleep guide section.
Yes — measured with real models on both Claude and Codex, scored on held-out tasks the optimizer never trained on:
- gbrain-evals skillopt-v1 (the public suite gbrain scores SkillOpt on): deficient skills go 0.00 → 1.00 on all 4 seeds, including a real tool-use loop; cross-model transfer is positive; the gate blocks regressions. → the SkillOpt-Sleep guide section
- Academic daily-cases (math / spreadsheet / search-QA, the paper's 4:1:5 split with dream-augmented train): see the SkillOpt-Sleep guide section.
- Fresh load-test (a "SQL must always include LIMIT" analyst, built from scratch): held-out 0.00 → 1.00 on both backends. → the SkillOpt-Sleep guide section
Try the deterministic proof yourself (no API key, no spend):
python -m skillopt_sleep.experiments.run_experiment --persona researcher --assert-improves

It prints the held-out score rising to 1.0 as the gate accepts the right rules, and confirms the gate rejects an injected harmful edit.
- Read-only harvest of your sessions. mock replay has no side effects.
- Proposals are staged, never auto-applied (unless you opt in with --auto-adopt).
- Every adopt writes a backup. Per-night token/time budget caps. Secrets redacted.
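
The backup-before-adopt guarantee can be sketched as follows. This is a minimal illustration, assuming the staged proposal and the live skill file are both plain files; the `adopt` function, the file names, and the timestamped `.bak` naming are hypothetical, not the package's actual layout.

```python
import shutil
import time
from pathlib import Path
from typing import Optional


def adopt(staged: Path, live: Path) -> Optional[Path]:
    """Copy the staged proposal over the live file, backing up the old one first."""
    backup = None
    if live.exists():
        stamp = time.strftime("%Y%m%d-%H%M%S")
        backup = live.with_name(f"{live.name}.{stamp}.bak")
        shutil.copy2(live, backup)  # preserve the prior version before overwriting
    shutil.copy2(staged, live)      # the live file changes only at this point
    return backup
```

The staged file is left in place, so the same proposal can be inspected after adoption, and the returned backup path is what a rollback would copy back.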