
Long-running Agents

Pronunciation

/lɒŋ ˈrʌnɪŋ ˈeɪdʒənts/

Also known as: persistent agents, multi-session agents, background agents, autonomous workers

What are Long-running Agents?

Long-running agents are AI systems designed to work on tasks that span hours, days, or even weeks—far exceeding the context limits of a single conversation. Unlike chatbots that handle quick Q&A, these agents tackle substantial work:

  • Building entire features across multiple coding sessions
  • Processing thousands of documents over days
  • Managing ongoing projects with multiple stakeholders
  • Running continuous monitoring and response operations

The Core Challenge

Language models have finite context windows—typically 100K-200K tokens. A long-running task might require millions of tokens of context across its lifetime. How do you maintain coherent work when you can't remember everything?

The solution: Persistent state, strategic context management, and robust handoff between sessions.

How Long-running Agents Work

Session Architecture

Session 1: Initialization
├── Set up environment
├── Create progress tracking
├── Complete initial work
└── Checkpoint state

Sessions 2-N: Continuation
├── Load state from checkpoint
├── Verify environment health
├── Continue from last milestone
└── Checkpoint state

Final Session: Completion
├── Load state
├── Complete remaining work
├── Verify all requirements
└── Clean handoff
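
As a concrete illustration, the outer loop of a harness built around this session architecture might look like the Python sketch below. The checkpoint schema and the run_agent_session call are placeholders invented for this example, not any particular framework's API:

import json
from pathlib import Path

CHECKPOINT = Path("checkpoint.json")  # hypothetical checkpoint file

def load_state():
    """Load the previous session's checkpoint, or start fresh."""
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text())
    return {"initialized": False, "milestones_done": [], "next": None}

def save_state(state):
    """Persist state so the next session can resume from it."""
    CHECKPOINT.write_text(json.dumps(state, indent=2))

def run_agent_session(state):
    """Placeholder for one bounded agent session.

    A real harness would start a fresh model context here, feed it the
    checkpoint, let it work toward a single milestone, and stop well
    before the context window is exhausted.
    """
    if not state["initialized"]:
        state["initialized"] = True        # Session 1: set up environment
        state["next"] = "milestone-1"
    else:
        state["milestones_done"].append(state["next"])
        state["next"] = None               # Continuation: advance one milestone
    return state

def all_requirements_met(state):
    """Placeholder completion check; a real one runs verification tests."""
    return state["initialized"] and state["next"] is None

state = load_state()
while not all_requirements_met(state):
    state = run_agent_session(state)  # each iteration is one fresh context
    save_state(state)                 # checkpoint before the context is lost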

State Persistence Strategies

Strategy           | Use Case               | Example
-------------------|------------------------|--------------------------------------------
Progress files     | Human-readable status  | progress.txt with completed/pending tasks
Git commits        | Code changes           | Descriptive commits as state snapshots
Structured data    | Machine-readable state | JSON/YAML task lists with pass/fail status
External databases | Complex state          | Customer records, workflow status
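
A minimal checkpoint might combine two of these strategies: machine-readable task status alongside a human-readable progress file. In this sketch, tasks.json and the task schema are assumptions for illustration:

import json
from pathlib import Path

# Hypothetical task list; a real one would come from the project spec.
tasks = [
    {"name": "User authentication", "status": "complete"},
    {"name": "Dashboard UI", "status": "in_progress"},
]

# Machine-readable state: the next session parses this to resume.
Path("tasks.json").write_text(json.dumps(tasks, indent=2))

# Human-readable status: handy in prompts and for human reviewers.
lines = [f"[{'x' if t['status'] == 'complete' else ' '}] {t['name']}" for t in tasks]
Path("progress.txt").write_text("\n".join(lines) + "\n")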

Common Failure Modes

Anthropic's research identified several patterns that cause long-running agents to fail:

One-shotting

Problem: Agent tries to complete the entire project in a single session, exhausts context, and leaves work half-done.

Solution: Break work into milestones. Force checkpoints. Design for incremental progress.

Premature Completion Declaration

Problem: Agent claims "Done!" without actually verifying all requirements are met.

Solution: Mandate verification testing. Require end-to-end checks before completion. Don't trust the agent's self-assessment.

Environmental Degradation

Problem: Agent leaves bugs, undocumented changes, or broken state—forcing subsequent sessions to debug instead of advance.

Solution: Require "clean state" at each checkpoint. Run tests before handoff. Document all changes.

Testing Gaps

Problem: Agent marks features complete based on unit tests, but they fail in real user workflows.

Solution: Require browser automation or user-like verification. Unit tests aren't enough.
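
One way to do user-like verification is to drive the real UI with browser automation instead of trusting unit tests alone. A sketch using Playwright, where the app URL and selectors are hypothetical placeholders:

# Requires: pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    # Exercise the feature the way a user would, end to end.
    page.goto("http://localhost:3000/login")   # hypothetical app URL
    page.fill("#email", "test@example.com")    # hypothetical selectors
    page.fill("#password", "correct-horse")
    page.click("button[type=submit]")
    # Only mark the feature verified if the real workflow succeeds.
    page.wait_for_selector("text=Dashboard", timeout=5000)
    browser.close()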

Context Amnesia

Problem: New session loses critical context from previous sessions, repeating work or making contradictory decisions.

Solution: Structured handoff documents. External memory systems. Comprehensive progress tracking.
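
An illustrative handoff note (all project details here are invented for the example) might look like:

HANDOFF: session 3 -> session 4
Done:      User authentication (verified via browser test)
In flight: Dashboard UI; chart renders, filters not yet wired up
Next:      Wire dashboard filters to the metrics endpoint
Gotchas:   Restart the dev server after editing configuration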

Design Patterns for Success

The Initializer Pattern

First session is special:

Initializer Agent:
1. Set up development environment
2. Create tracking infrastructure (progress.txt, etc.)
3. Establish baseline (initial git commit)
4. Define milestone structure
5. Complete first milestone
6. Clean checkpoint

Subsequent sessions assume infrastructure exists.
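
In a harness, the split between initializer and continuation sessions can be a simple branch on whether the tracking infrastructure exists yet. A sketch, where the file layout is an assumption:

import subprocess
from pathlib import Path

def is_initialized():
    """Infrastructure exists only if the first session already ran."""
    return Path("progress.txt").exists() and Path(".git").exists()

if not is_initialized():
    # Initializer session: build the scaffolding later sessions rely on.
    subprocess.run(["git", "init"], check=True)
    Path("progress.txt").write_text("Milestone 1: pending\n")
    subprocess.run(["git", "add", "-A"], check=True)
    subprocess.run(["git", "commit", "-m", "Baseline: project initialized"], check=True)
else:
    # Continuation session: assume the infrastructure and resume from it.
    print(Path("progress.txt").read_text())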

The Feature List Pattern

Maintain a structured requirements document:

{
  "features": [
    {"name": "User authentication", "status": "complete", "verified": true},
    {"name": "Dashboard UI", "status": "in_progress", "verified": false},
    {"name": "Export to PDF", "status": "pending", "verified": false}
  ]
}

Agents can scan this to understand what's done and what's next.
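
For example, a session can pick its next task by scanning that document, assuming it lives in a file such as features.json:

import json
from pathlib import Path

doc = json.loads(Path("features.json").read_text())

def next_feature(features):
    """Resume in-progress work first, then fall back to pending items."""
    for status in ("in_progress", "pending"):
        for f in features:
            if f["status"] == status:
                return f
    return None

feature = next_feature(doc["features"])
if feature is None:
    print("All features complete and verified.")
else:
    print(f"Next up: {feature['name']} ({feature['status']})")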

The Verification Protocol

Before declaring any milestone complete:

  1. Run all tests (unit, integration, e2e)
  2. Verify through actual user interaction (browser automation)
  3. Check for regressions in previous functionality
  4. Document any deviations from spec
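
A harness can enforce this protocol mechanically by refusing to flip a milestone to complete until every check passes. A sketch, where the test commands and paths are assumptions for a typical Python web project:

import subprocess

CHECKS = [
    ["pytest", "tests/unit"],         # step 1: unit tests (paths are assumptions)
    ["pytest", "tests/integration"],  # step 1: integration tests
    ["pytest", "tests/e2e"],          # steps 2-3: user-like checks and regressions
]

def milestone_verified():
    """Return True only if every check exits cleanly."""
    for cmd in CHECKS:
        if subprocess.run(cmd).returncode != 0:
            return False  # never declare completion on a failing check
    return True

if milestone_verified():
    print("Milestone verified; safe to mark complete.")
else:
    print("Verification failed; milestone stays in progress.")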

The Clean Handoff Pattern

End each session with:

  • All tests passing
  • No uncommitted changes
  • Updated progress documentation
  • Clear next steps documented
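
Because each of these conditions is checkable, the harness can refuse to end a session until they hold. A minimal sketch of that gate (the pytest command is an assumption):

import subprocess

def worktree_clean():
    """No uncommitted changes: git status --porcelain prints nothing."""
    out = subprocess.run(["git", "status", "--porcelain"],
                         capture_output=True, text=True)
    return out.stdout.strip() == ""

def tests_pass():
    return subprocess.run(["pytest"]).returncode == 0

assert worktree_clean(), "Commit or revert changes before handoff"
assert tests_pass(), "Fix failing tests before handoff"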

Infrastructure Requirements

Long-running agents need robust harnesses:

Capability             | Why It's Needed
-----------------------|--------------------------------------------
Context compaction     | Summarize old context to fit new work
State persistence      | Remember across context window boundaries
Error recovery         | Handle failures gracefully mid-task
Progress tracking      | Know what's done and what's remaining
Environment management | Maintain clean, reproducible state
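
Context compaction, for example, can be as simple as folding the oldest transcript turns into a summary whenever a token budget is exceeded. A schematic sketch; the summarize call is a placeholder (in practice the model itself writes the summary), and four characters per token is a rough heuristic:

BUDGET_CHARS = 400_000  # rough proxy for a ~100K-token window

def summarize(turns):
    """Placeholder: a real harness asks the model for this summary."""
    return f"[Summary of {len(turns)} earlier turns, elided for space]"

def compact(transcript):
    """Fold the oldest half of the transcript into one summary turn."""
    while sum(len(t) for t in transcript) > BUDGET_CHARS and len(transcript) > 2:
        half = len(transcript) // 2
        transcript = [summarize(transcript[:half])] + transcript[half:]
    return transcript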

When to Use Long-running Agents

Good fit:

  • Large codebase changes (new features, refactors)
  • Document processing at scale
  • Multi-day research projects
  • Ongoing operational tasks

Poor fit:

  • Quick questions or single-turn tasks
  • Tasks requiring real-time human collaboration
  • Highly ambiguous work needing frequent clarification
  • Tasks where errors have immediate severe consequences

The Future of Long-running Work

As context windows grow and harness technology improves, long-running agents will handle increasingly ambitious projects:

  • Today: Build a feature over 3-4 sessions
  • Near future: Complete a sprint's worth of work autonomously
  • Long term: Run entire development projects with human oversight

The key insight: success requires infrastructure, not just better models. A sophisticated harness can make a good model great at long-running work. A poor harness will make even the best model fail.

Mentioned In

Anthropic Engineering: "Long-running agents face unique challenges around context management, state persistence, and error recovery that don't appear in single-turn interactions."