Claude Dreaming and Outcomes: What Anthropic's Self-Improving Agents Mean for Your Team
TL;DR
On May 6, 2026, Anthropic launched two new features for Claude Managed Agents: Dreaming and Outcomes. In plain language: your AI agents can now review their own past work between sessions, learn from their mistakes, and write themselves playbooks for next time; they can also grade their own output against a quality standard you define before delivering results. Dreaming is a scheduled background process where the agent looks back at what it did, identifies patterns, and reorganizes its memory so the next run is better. Outcomes is a built-in quality gate where a separate AI evaluator checks the agent's work against your rubric and sends it back for another pass if it doesn't meet the bar. Early adopters are reporting dramatic results: legal AI company Harvey saw task completion rates increase roughly 6x with Dreaming, and medical document company Wisedocs cut review time by 50% with Outcomes. Here's what shipped, why it matters, and how to think about it for your team.
What Just Shipped
Two weeks ago we wrote about Memory — agents that remember. Dreaming is what happens when those agents start reviewing their notes.
If you read our post on Claude agents and memory, you know that Claude Managed Agents can now store information across sessions. That was a big step: instead of starting from scratch every time, an agent could carry forward what it learned. But storing memories is only half the story. The other half is making sense of them.
Think about how you get better at your job. You don't just accumulate experiences. You reflect on them. You notice that the same type of mistake keeps happening. You realize that one approach works better than another. You develop rules of thumb. You write down the process that worked so you can repeat it next time. That reflection — the act of stepping back, reviewing what happened, and extracting useful patterns — is what Dreaming gives AI agents.
Dreaming is a scheduled background process that runs between agent sessions. While nobody is using the agent, it reviews its past session transcripts, looks at its existing memory store, and produces a reorganized, improved version. Duplicate memories get merged. Contradicted entries get replaced with the latest information. New patterns get surfaced and written down as structured playbooks that future sessions can reference. The agent wakes up to its next job a little smarter than it was before.
Alongside Dreaming, Anthropic also shipped Outcomes — a quality assurance system built directly into the agent workflow. You write a rubric describing what a good result looks like, and a separate AI evaluator grades the agent's output against that rubric. If the output doesn't pass, the evaluator tells the agent exactly what needs to change, and the agent takes another pass. This loop continues until the work meets your standard. The evaluator runs in its own context window, completely independent of the agent that did the work — which means it isn't influenced by the agent's reasoning or biased by how the output was produced.
Why This Is a Bigger Deal Than It Sounds
The difference between a tool and a team member isn't capability. It's the ability to get better without being told.
Every team that has deployed AI agents at any real scale has hit the same two problems. First, the agent keeps making the same type of mistake. You correct it, it fixes the immediate issue, but two weeks later it makes the same mistake again because the correction didn't stick in a useful way. Second, the agent's output quality is inconsistent. Monday's report is great. Tuesday's report misses a section. Wednesday's report includes the section but formats it differently. There's no reliable floor on quality.
Dreaming attacks the first problem. Instead of passively accumulating memories — a growing pile of notes that might or might not be useful — the agent actively reviews its work and distills what it learned into structured, actionable knowledge. It's the difference between a filing cabinet full of notes and a well-organized operations manual that someone actually maintains.
Outcomes attacks the second problem. Instead of delivering whatever comes out on the first pass, the agent runs its work through a quality gate. Every time. Automatically. The rubric you write becomes the floor that no output falls below. It's the equivalent of having a reviewer who checks every piece of work before it goes out — except the reviewer never takes a day off, never gets tired, and applies the same standard every time.
Together, they solve the two things that make teams stop trusting AI agents: unpredictable quality and repeated mistakes.
How Dreaming Actually Works
It's not magic. It's a structured review process — the same thing good employees do on their own.
Here's what happens during a dreaming cycle. The agent reads through its existing memory store — all the notes, corrections, and patterns it has accumulated from past sessions. Then it reads through the transcripts of recent sessions — the actual work it did, including where it succeeded and where it struggled. It compares the two and asks: what did I learn that isn't reflected in my memories yet? What memories are outdated or wrong? What patterns keep coming up that I should codify into a playbook?
The output is a new, reorganized memory store. Duplicates merged. Stale entries updated. New insights added. And critically, the agent produces playbooks — structured guides for handling specific types of work based on what actually worked in past sessions. A playbook might say: “When processing vendor invoices from European suppliers, always check for VAT line items separately because they use a different format than domestic invoices. When the VAT field is missing, flag it for human review instead of guessing.”
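To make the cycle concrete, here's a minimal sketch of what one dreaming pass amounts to, written with the standard Anthropic Python SDK. Managed Agents runs this for you behind the scenes; the file layout, the consolidation prompt, and the model id below are our own assumptions for illustration, not Anthropic's implementation.

```python
# A conceptual sketch of one dreaming cycle, not the Managed Agents API.
# Assumptions: memory lives in memory.md, transcripts in sessions/*.txt,
# and the model id is a guess based on the Opus 4.7 release this post mentions.
from pathlib import Path

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def dreaming_cycle(agent_dir: Path) -> None:
    memory = (agent_dir / "memory.md").read_text()
    transcripts = "\n\n---\n\n".join(
        p.read_text() for p in sorted(agent_dir.glob("sessions/*.txt"))
    )

    response = client.messages.create(
        model="claude-opus-4-7",  # assumed id; use whatever model you run
        max_tokens=8000,
        messages=[{
            "role": "user",
            "content": (
                f"Current memory store:\n\n{memory}\n\n"
                f"Recent session transcripts:\n\n{transcripts}\n\n"
                "Rewrite the memory store: merge duplicate entries, replace "
                "contradicted ones with the latest information, and distill "
                "recurring patterns into structured playbooks."
            ),
        }],
    )

    # The reorganized store replaces the old one. The next session reads it,
    # and your team can open and edit the file at any time.
    (agent_dir / "memory.md").write_text(response.content[0].text)
```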
The important detail: Dreaming does not change the AI model itself. It doesn't modify the underlying weights or retrain anything. It writes plain-text notes and structured documents that the agent reads at the start of its next session. This means everything the agent “learned” is fully transparent, auditable, and editable by your team. You can open the playbooks, read them, correct them if needed, and delete ones you don't want. The agent's learning is visible, not hidden inside a neural network.
For teams in regulated industries — finance, healthcare, legal — this matters enormously. When an auditor asks “How does your AI decide how to handle this type of case?”, you can point to a specific playbook with specific rules that a human can read and verify. Try doing that with a fine-tuned model.
How Outcomes Actually Works
You define what “good” looks like. A separate AI enforces that standard on every run.
Outcomes works like this. You write a rubric — a description of what a successful output should contain, what quality standards it should meet, and what common mistakes should be flagged. This is written in plain language, not code. It might say: “The weekly client report should include a summary of activity, three key metrics compared to last week, any flagged risks, and a recommended next step. The tone should be professional but not formal. Numbers should be specific, not rounded. If data is missing, say so explicitly rather than omitting the section.”
When the agent finishes its work, a separate Claude instance — the grader — evaluates the output against your rubric. The grader runs in its own context window, completely isolated from the agent that produced the work. It doesn't see the agent's reasoning process. It doesn't know what shortcuts the agent took or what constraints it ran into. It only sees the final output and your rubric. This isolation is deliberate: it prevents the grader from being sympathetic to the agent's struggles and holding it to a lower standard.
If the output passes, it's delivered. If it fails, the grader produces specific feedback — not just “try again” but exactly what needs to change. The agent reads the feedback, revises the output, and submits it again. This loop continues until the rubric is satisfied. The result: a consistent quality floor on every single run, no matter how complex the task or how many things the agent is juggling.
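If you wanted to hand-roll the same pattern yourself, it reduces to a produce-grade-revise loop. The sketch below is our own approximation (the grader prompt, the JSON verdict format, and the round cap are assumptions, not how Managed Agents implements it), but it shows why the isolation matters: the grader call sees only the rubric and the output, never the worker's reasoning.

```python
# A hand-rolled sketch of the Outcomes loop, not the Managed Agents API.
# The grader prompt, JSON verdict format, and round cap are assumptions.
import json

import anthropic

client = anthropic.Anthropic()
MODEL = "claude-opus-4-7"  # assumed model id


def ask(prompt: str) -> str:
    response = client.messages.create(
        model=MODEL,
        max_tokens=4000,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text


def run_with_outcomes(task: str, rubric: str, max_rounds: int = 5) -> str:
    output = ask(task)
    for _ in range(max_rounds):
        # The grader sees only the rubric and the finished output: a fresh
        # context window on every evaluation, never the worker's reasoning.
        # (A production version would validate the JSON before trusting it.)
        verdict = json.loads(ask(
            "You are a strict grader. Evaluate the OUTPUT against the RUBRIC.\n\n"
            f"RUBRIC:\n{rubric}\n\nOUTPUT:\n{output}\n\n"
            'Reply with JSON only: {"pass": true or false, "feedback": "..."}'
        ))
        if verdict["pass"]:
            return output
        # Specific feedback goes back to the worker for another pass.
        output = ask(
            f"{task}\n\nYour previous draft:\n{output}\n\n"
            f"Reviewer feedback to address:\n{verdict['feedback']}"
        )
    return output  # round cap reached; in practice, escalate to a human
```

In the managed product, the retries and escalation policy are handled for you; the sketch is only meant to show the shape of the loop.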
What This Looks Like in Practice
Four concrete examples of work that just got more reliable.
1. A customer onboarding agent that stops repeating mistakes. Your onboarding agent sends welcome sequences to new customers. Last month it sent the wrong pricing tier information to three enterprise customers because it misread a field in your CRM. You corrected it each time. Without Dreaming, those corrections might fade as the memory store grows cluttered. With Dreaming, the agent reviews those sessions, notices the pattern, and writes a playbook: “When a customer's account type is Enterprise, always pull pricing from the Enterprise rate card, not the default. The CRM field ‘plan_type’ is the authoritative source, not ‘signup_tier’.” Next month, zero mistakes on that specific issue — and the playbook is there for your team to review.
2. A report generation agent that meets your quality bar every time. Your operations team has an agent that produces weekly performance reports. Some weeks the report is excellent. Other weeks it misses a section, rounds numbers when it shouldn't, or buries the key takeaway in paragraph three instead of leading with it. With Outcomes, you write a rubric: “Lead with the most significant metric change. Include all five department sections. Use exact numbers. Flag any metric that changed more than 10% from last week.” Now every report meets that standard before your team sees it. The weeks of inconsistent quality are over.
3. A legal document review agent that builds institutional knowledge. Harvey, the legal AI company, deployed Dreaming for their document review agents. These agents process contracts, filings, and case materials across hundreds of legal workflows. Before Dreaming, each session started fresh — the agent had memory, but it was an unstructured pile that grew less useful over time. With Dreaming, the agent consolidates what it has learned about specific document types, formatting quirks in different jurisdictions, and tool-specific workarounds into organized playbooks. The result: task completion rates increased roughly 6x. Not because the model got smarter, but because the agent's accumulated knowledge became organized and actionable.
4. A medical document review agent that meets compliance standards. Wisedocs processes medical documents and needs every review to meet specific internal quality guidelines — missing a field or misclassifying a document type can have real consequences. They built a quality-check agent on Managed Agents and used Outcomes to grade each review against their guidelines. Reviews that fall short get sent back automatically. The result: review time dropped by 50% while quality stayed aligned with their team's standards. The agent doesn't rush through reviews to save time — it does them faster because it doesn't have to second-guess itself when there's a clear rubric defining what “done” looks like.
How Dreaming and Outcomes Work Together
Dreaming makes the agent better over time. Outcomes makes the agent reliable right now. The combination is what matters.
Think of it this way. Outcomes ensures that the agent's output meets your standard today — it's the quality gate that catches problems before they reach your team. Dreaming ensures that the agent needs fewer passes through that quality gate over time — it's the improvement loop that reduces the number of mistakes the agent makes in the first place.
In week one, your agent might need three passes through the Outcomes grader before the report meets your rubric. By week four, after a few dreaming cycles, the agent has internalized the patterns that kept causing it to fail the rubric — and now it passes on the first try most of the time. The Outcomes rubric is still there as a safety net, but it catches less because the agent has genuinely improved.
This is the same dynamic you see with good employees. Early on, they need more review. Over time, they internalize the standards and need less. The review process doesn't go away — it just catches less, because the work is better.
What About Teams With Multiple Agents?
When 20 subagents are working in the same domain, Dreaming can aggregate what they all learned.
If you read our post on multiagent orchestration, you know that Claude can now coordinate teams of specialist agents working in parallel. Dreaming adds another dimension to this: when multiple subagents are all working on related tasks, a dreaming cycle can review their collective sessions, identify cross-agent patterns, and publish shared insights to a team-wide memory store.
This is something no individual agent session could produce on its own. One subagent might notice that a particular data format is inconsistent. Another might discover that a specific client always provides information in a non-standard way. A third might find a shortcut that works for a specific type of task. Individually, those are isolated observations. Through Dreaming, they become shared knowledge that every agent on the team can reference.
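Conceptually, that aggregation step might look like the sketch below. The directory layout, the prompt, and the shared-store file are assumptions for illustration; the managed feature handles publishing to the team-wide store itself.

```python
# Sketch of a team-level dreaming pass, not the Managed Agents API.
# Assumed layout: agents/<name>/memory.md per subagent, one shared file.
from pathlib import Path

import anthropic

client = anthropic.Anthropic()


def team_dreaming_cycle(team_dir: Path) -> None:
    # Pool every subagent's individual memory store into one document.
    combined = "\n\n".join(
        f"## {agent.name}\n{(agent / 'memory.md').read_text()}"
        for agent in sorted(team_dir.glob("agents/*"))
    )

    response = client.messages.create(
        model="claude-opus-4-7",  # assumed model id
        max_tokens=8000,
        messages=[{
            "role": "user",
            "content": (
                "These memory stores belong to subagents working in the "
                f"same domain:\n\n{combined}\n\n"
                "Extract only the insights that hold across agents, such as "
                "shared data quirks, client conventions, and reusable "
                "shortcuts, and write them up as team-wide playbooks."
            ),
        }],
    )

    # Every agent on the team reads this store at the start of its sessions.
    (team_dir / "shared_memory.md").write_text(response.content[0].text)
```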
For organizations that are starting to run multiple agents across different functions — research, operations, customer support, sales ops — this is the beginning of organizational learning for your AI team. Not just individual agents getting smarter, but the whole group benefiting from what any one member learns.
What This Doesn't Solve
Self-improving agents are powerful. But they still need clear direction and human oversight.
Dreaming makes agents better at the work they're already doing. It doesn't make them better at figuring out what work to do. If your agent is pointed at the wrong problem or given unclear instructions, it will get more efficient at doing the wrong thing. The playbooks it writes will optimize for the wrong outcomes. This is the same risk you face with any employee who works hard but doesn't understand the goal.
Similarly, Outcomes is only as good as the rubric you write. A vague rubric produces vague quality enforcement. A rubric that misses an important dimension will let outputs through that fail on that dimension. The quality gate is a mirror of your standards — if your standards aren't clearly articulated, the gate won't help.
The good news: both the playbooks and the rubrics are written in plain language, and your team can review, edit, and improve them over time. The agent's self-improvement is a starting point, not a replacement for human judgment about what “good” actually means.
Where This Fits in the Bigger Picture
Memory gave agents the ability to remember. Dreaming gives them the ability to learn. Outcomes gives them the ability to self-correct. Together, these three features close the loop.
If you've been following the Claude releases in 2026, each one has removed a specific limitation that kept AI agents from handling real work reliably. Better models (Opus 4.7) gave agents raw capability. Tools (Design, Creative Connectors, Microsoft 365) gave agents access to the software your team actually uses. Memory gave agents persistence. Multiagent orchestration gave agents the ability to coordinate complex, multi-step work.
Dreaming and Outcomes add the two things that were still missing: continuous improvement and quality assurance. An agent that can use your tools, remember what it learned, coordinate with other agents, get better over time, and check its own work against your standards — that's not a chatbot. That's infrastructure.
For teams that have been cautious about deploying AI agents because of reliability concerns, these features directly address the two most common objections: “the agent keeps making the same mistakes” and “the quality is too inconsistent to trust.” Both of those problems now have systematic solutions built into the platform.
How to Get Started
If you already have agents running, the first step is to write an Outcomes rubric for your most important workflow.
Start with the agent that produces the work your team reviews most carefully — the one where someone always checks the output before it goes out. That review process your team is doing manually? Write it down as a rubric. What do they look for? What causes them to send it back? What does “good enough to ship” actually mean? Turn that into an Outcomes rubric and let the agent self-correct before the work ever reaches a human reviewer.
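As a sketch, here is the weekly-report checklist from earlier in this post written down as a rubric constant that could plug into a grading loop like the run_with_outcomes() sketch above. The items are illustrative; yours should come from what your reviewers actually check before work goes out.

```python
# An illustrative rubric distilled from a manual review checklist. These
# items mirror the weekly-report example earlier in this post; replace
# them with the questions your own reviewers ask before shipping.
WEEKLY_REPORT_RUBRIC = """\
A passing weekly client report must:
1. Lead with the most significant metric change of the week.
2. Include all five department sections.
3. Use exact numbers, never rounded approximations.
4. Flag any metric that moved more than 10% week over week.
5. Say explicitly when data is missing instead of dropping the section.
6. Keep the tone professional but not formal.
"""

# Used with the hand-rolled loop sketched earlier:
# report = run_with_outcomes(
#     task="Draft this week's client report from the attached metrics.",
#     rubric=WEEKLY_REPORT_RUBRIC,
# )
```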
For Dreaming, the path is even simpler: enable it and wait. Once your agents have accumulated enough sessions, schedule a dreaming cycle and review the playbooks it produces. You might be surprised by what the agent noticed that your team hadn't explicitly taught it — and you might also find playbooks that need correcting, which is exactly how the system is designed to work. The agent proposes what it learned. Your team approves, edits, or removes what doesn't fit.
Both features are available now through Claude Managed Agents — Dreaming in research preview, Outcomes in public beta. The setup involves a developer or technical partner, but the rubrics and playbook reviews are work your team's domain experts are best equipped to handle.
FAQ
Does Dreaming change the AI model?
No. Dreaming does not modify Claude's underlying model weights or retrain anything. It writes plain-text notes and structured playbooks that the agent reads at the start of its next session. Everything the agent “learned” is fully transparent and auditable — you can read, edit, and delete any playbook.
How is Dreaming different from Memory?
Memory lets agents store and retrieve information across sessions. Dreaming is the process of actively reviewing that information, identifying patterns, merging duplicates, removing outdated entries, and creating organized playbooks. Memory is the filing cabinet. Dreaming is the person who organizes it.
What does an Outcomes rubric look like?
It's written in plain language. A rubric might say: “The report should include all five sections, lead with the most significant finding, use exact numbers rather than approximations, and flag any data gaps explicitly.” You don't need to write code. You need to describe what good work looks like.
Can I see what the agent learned through Dreaming?
Yes. All playbooks and memory entries are visible in the Claude Console and exportable. A developer or admin on your team can review every piece of knowledge the agent has accumulated. This is by design — Anthropic built Dreaming to be fully transparent.
What results are companies seeing?
Legal AI company Harvey saw task completion rates increase roughly 6x after implementing Dreaming. Medical document company Wisedocs cut document review time by 50% using Outcomes. These are early results from companies that were already running agents at scale.
Do I need to be technical to use Dreaming or Outcomes?
The initial setup requires a developer or technical partner who can configure Claude Managed Agents. But the ongoing work — writing Outcomes rubrics and reviewing Dreaming playbooks — is domain expertise, not technical expertise. The people on your team who know what good work looks like are the ones best positioned to define the rubrics and validate the playbooks.
How does this relate to multiagent orchestration?
They complement each other. Multiagent orchestration lets a lead agent coordinate multiple specialist agents working in parallel. Dreaming can aggregate learnings across all those agents and publish shared playbooks to a team-wide memory store. An agent team that coordinates, remembers, and learns collectively is significantly more capable than agents working in isolation.
The Deployed Kickstart gets your team building reliable agent workflows in a single day — including Outcomes rubrics for the work that matters most. The Partner program keeps your agent infrastructure current as Anthropic ships new capabilities like Dreaming, so the systems you build today get more capable over time, not obsolete.