Long-running Agents
/lɒŋ ˈrʌnɪŋ ˈeɪdʒənts/
What are Long-running Agents?
Long-running agents are AI systems designed to work on tasks that span hours, days, or even weeks—far exceeding the context limits of a single conversation. Unlike chatbots that handle quick Q&A, these agents tackle substantial work:
- Building entire features across multiple coding sessions
- Processing thousands of documents over days
- Managing ongoing projects with multiple stakeholders
- Running continuous monitoring and response operations
The Core Challenge
Language models have finite context windows—typically 100K-200K tokens. A long-running task might require millions of tokens of context across its lifetime. How do you maintain coherent work when you can't remember everything?
The solution: Persistent state, strategic context management, and robust handoff between sessions.
How Long-running Agents Work
Session Architecture
Session 1: Initialization
├── Set up environment
├── Create progress tracking
├── Complete initial work
└── Checkpoint state
Sessions 2-N: Continuation
├── Load state from checkpoint
├── Verify environment health
├── Continue from last milestone
└── Checkpoint state
Final Session: Completion
├── Load state
├── Complete remaining work
├── Verify all requirements
└── Clean handoff
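To make the loop concrete, here is a minimal Python sketch of how a harness might drive one session of this architecture. Everything in it is an assumption for illustration: the checkpoint file name, the state shape, and the `do_milestone` callable (which stands in for the actual agent work) are not a prescribed format.

```python
import json
from pathlib import Path

CHECKPOINT = Path("checkpoint.json")  # hypothetical state file

def load_state() -> dict:
    """Load the previous session's checkpoint, or start fresh (Session 1)."""
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text())
    return {"milestones": [], "completed": []}

def save_state(state: dict) -> None:
    """Checkpoint state so the next session can resume where this one stopped."""
    CHECKPOINT.write_text(json.dumps(state, indent=2))

def run_session(do_milestone) -> None:
    """Drive one session: load state, continue from the last milestone, checkpoint."""
    state = load_state()
    remaining = [m for m in state["milestones"] if m not in state["completed"]]
    if not remaining:
        return  # final session: only verification and handoff remain
    do_milestone(remaining[0])             # the actual agent work happens here
    state["completed"].append(remaining[0])
    save_state(state)                      # clean checkpoint before exiting
```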
State Persistence Strategies
| Strategy | Use Case | Example |
|---|---|---|
| Progress files | Human-readable status | progress.txt with completed/pending tasks |
| Git commits | Code changes | Descriptive commits as state snapshots |
| Structured data | Machine-readable state | JSON/YAML task lists with pass/fail status |
| External databases | Complex state | Customer records, workflow status |
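As a concrete example of the first strategy, a hypothetical progress.txt (all task names invented) might look like:

```
# progress.txt (refactor project; updated at the end of each session)
DONE:    migrate auth module to new API
DONE:    update unit tests for auth
PENDING: migrate billing module
PENDING: end-to-end verification
NEXT:    start billing migration in src/billing/
```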
Common Failure Modes
Anthropic's research identified several patterns that cause long-running agents to fail:
One-shotting
Problem: Agent tries to complete the entire project in a single session, exhausts context, and leaves work half-done.
Solution: Break work into milestones. Force checkpoints. Design for incremental progress.
Premature Completion Declaration
Problem: Agent claims "Done!" without actually verifying all requirements are met.
Solution: Mandate verification testing. Require end-to-end checks before completion. Don't trust the agent's self-assessment.
Environmental Degradation
Problem: Agent leaves bugs, undocumented changes, or broken state—forcing subsequent sessions to debug instead of advance.
Solution: Require "clean state" at each checkpoint. Run tests before handoff. Document all changes.
Testing Gaps
Problem: Agent marks features complete based on passing unit tests, but the features fail in real user workflows.
Solution: Require browser automation or user-like verification. Unit tests aren't enough.
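One way to get user-like verification is browser automation. The sketch below uses Playwright (one tool among several); the URL, selectors, credentials, and success condition are placeholder assumptions for an imagined login flow, not a real application.

```python
# pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

def verify_login_flow() -> bool:
    """Pass only if a real browser can complete the login journey end to end."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("http://localhost:3000/login")   # placeholder URL
        page.fill("#email", "test@example.com")    # placeholder selectors
        page.fill("#password", "correct-horse")
        page.click("button[type=submit]")
        page.wait_for_url("**/dashboard")          # did we actually land there?
        ok = page.locator("h1").first.inner_text() == "Dashboard"
        browser.close()
        return ok
```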
Context Amnesia
Problem: New session loses critical context from previous sessions, repeating work or making contradictory decisions.
Solution: Structured handoff documents. External memory systems. Comprehensive progress tracking.
Design Patterns for Success
The Initializer Pattern
First session is special:
Initializer Agent:
1. Set up development environment
2. Create tracking infrastructure (progress.txt, etc.)
3. Establish baseline (initial git commit)
4. Define milestone structure
5. Complete first milestone
6. Clean checkpoint
Subsequent sessions assume infrastructure exists.
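A harness can make that split mechanical by checking for the tracking infrastructure on startup. A rough sketch, assuming the progress.txt convention above and a machine with git installed and configured:

```python
import subprocess
from pathlib import Path

def session_role() -> str:
    """Decide whether this session initializes or continues."""
    return "continuation" if Path("progress.txt").exists() else "initializer"

def initialize() -> None:
    """Session 1: create tracking infrastructure and a git baseline."""
    Path("progress.txt").write_text("MILESTONES:\n")   # tracking file
    subprocess.run(["git", "init"], check=True)
    subprocess.run(["git", "add", "-A"], check=True)
    subprocess.run(["git", "commit", "-m", "baseline: initial state"], check=True)
```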
The Feature List Pattern
Maintain a structured requirements document:
{
  "features": [
    {"name": "User authentication", "status": "complete", "verified": true},
    {"name": "Dashboard UI", "status": "in_progress", "verified": false},
    {"name": "Export to PDF", "status": "pending", "verified": false}
  ]
}
Agents can scan this to understand what's done and what's next.
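For example, a continuation session might locate its next task with a few lines like these (assuming the document above is saved as features.json):

```python
import json

def next_feature(path: str = "features.json") -> dict | None:
    """Return the first feature that still needs work, or None when all done."""
    with open(path) as f:
        features = json.load(f)["features"]
    # Anything not both complete and verified is still open work.
    open_items = [item for item in features
                  if item["status"] != "complete" or not item["verified"]]
    return open_items[0] if open_items else None
```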
The Verification Protocol
Before declaring any milestone complete:
- Run all tests (unit, integration, e2e)
- Verify through actual user interaction (browser automation)
- Check for regressions in previous functionality
- Document any deviations from spec
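A harness can enforce this protocol as a gate rather than a suggestion. The sketch below assumes a pytest-based project; the commands are placeholders for whatever test entry points the project actually has.

```python
import subprocess

# Placeholder commands; substitute the project's real test suites.
CHECKS = [
    ["pytest", "tests/unit"],
    ["pytest", "tests/integration"],
    ["pytest", "tests/e2e"],          # e.g. browser-driven checks
]

def milestone_verified() -> bool:
    """Refuse to declare a milestone complete unless every check passes."""
    return all(
        subprocess.run(cmd, capture_output=True).returncode == 0
        for cmd in CHECKS
    )
```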
The Clean Handoff Pattern
End each session with:
- All tests passing
- No uncommitted changes
- Updated progress documentation
- Clear next steps documented
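The checklist can likewise be enforced in code at session exit. This sketch assumes a git repository, a pytest suite, and the progress.txt convention from earlier; all of those are assumptions about the project, not requirements of the pattern.

```python
import subprocess

def clean_handoff(next_steps: str) -> None:
    """Refuse to end the session if it would leave the environment dirty."""
    # 1. All tests passing.
    if subprocess.run(["pytest"]).returncode != 0:
        raise RuntimeError("tests failing; fix before handoff")
    # 2. Record clear next steps in the progress documentation.
    with open("progress.txt", "a") as f:
        f.write(f"NEXT STEPS: {next_steps}\n")
    # 3. No uncommitted changes: commit docs and code together.
    subprocess.run(["git", "add", "-A"], check=True)
    subprocess.run(["git", "commit", "-m", "session handoff"], check=True)
```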
Infrastructure Requirements
Long-running agents need robust harnesses:
| Capability | Why It's Needed |
|---|---|
| Context compaction | Summarize old context to fit new work |
| State persistence | Remember across context window boundaries |
| Error recovery | Handle failures gracefully mid-task |
| Progress tracking | Know what's done and what's remaining |
| Environment management | Maintain clean, reproducible state |
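Of these, context compaction is the least obvious, so here is a shape-of-the-logic sketch: once the history exceeds a budget, older turns are collapsed into a summary while recent turns stay verbatim. The `summarize` helper is a placeholder for a real model call, and character counts approximate tokens purely for illustration.

```python
def compact(history: list[str], budget: int, keep_recent: int = 10) -> list[str]:
    """Collapse old turns into a summary once the context budget is exceeded."""
    def summarize(turns: list[str]) -> str:
        # Placeholder: a real harness would call a model to summarize.
        return "SUMMARY: " + " | ".join(t[:40] for t in turns)

    # Rough token proxy: total character length.
    if len(history) <= keep_recent or sum(len(t) for t in history) <= budget:
        return history
    old, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(old)] + recent
```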
When to Use Long-running Agents
Good fit:
- Large codebase changes (new features, refactors)
- Document processing at scale
- Multi-day research projects
- Ongoing operational tasks
Poor fit:
- Quick questions or single-turn tasks
- Tasks requiring real-time human collaboration
- Highly ambiguous work needing frequent clarification
- Tasks where errors have immediate severe consequences
The Future of Long-running Work
As context windows grow and harness technology improves, long-running agents will handle increasingly ambitious projects:
- Today: Build a feature over 3-4 sessions
- Near future: Complete a sprint's worth of work autonomously
- Long term: Run entire development projects with human oversight
The key insight: success requires infrastructure, not just better models. A sophisticated harness can make a good model great at long-running work. A poor harness will make even the best model fail.
Related Reading
- AI Agents - The systems that run long
- Agent Harness - Infrastructure enabling persistence
- Agent Evaluation - Testing long-running behavior
