Long-running Agents
/lɒŋ ˈrʌnɪŋ ˈeɪdʒənts/
What are Long-running Agents?
Long-running agents are AI systems designed to work on tasks that span hours, days, or even weeks—far exceeding the context limits of a single conversation. Unlike chatbots that handle quick Q&A, these agents tackle substantial work:
- Building entire features across multiple coding sessions
- Processing thousands of documents over days
- Managing ongoing projects with multiple stakeholders
- Running continuous monitoring and response operations
The Core Challenge
Language models have finite context windows—typically 100K-200K tokens. A long-running task might require millions of tokens of context across its lifetime. How do you maintain coherent work when you can't remember everything?
The solution: Persistent state, strategic context management, and robust handoff between sessions.
How Long-running Agents Work
Session Architecture
Session 1: Initialization
├── Set up environment
├── Create progress tracking
├── Complete initial work
└── Checkpoint state
Sessions 2-N: Continuation
├── Load state from checkpoint
├── Verify environment health
├── Continue from last milestone
└── Checkpoint state
Final Session: Completion
├── Load state
├── Complete remaining work
├── Verify all requirements
└── Clean handoff
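To make the loop concrete, here is a minimal Python sketch of how a harness might drive one session of this architecture. Everything in it is an assumption for illustration: the checkpoint file name, the state shape, and the `do_milestone` callable (which stands in for the actual agent work) are not a prescribed format.

```python
import json
from pathlib import Path

CHECKPOINT = Path("checkpoint.json")  # hypothetical state file

def load_state() -> dict:
    """Load the previous session's checkpoint, or start fresh (Session 1)."""
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text())
    return {"milestones": [], "completed": []}

def save_state(state: dict) -> None:
    """Checkpoint state so the next session can resume where this one stopped."""
    CHECKPOINT.write_text(json.dumps(state, indent=2))

def run_session(do_milestone) -> None:
    """Drive one session: load state, continue from the last milestone, checkpoint."""
    state = load_state()
    remaining = [m for m in state["milestones"] if m not in state["completed"]]
    if not remaining:
        return  # final session: only verification and handoff remain
    do_milestone(remaining[0])             # the actual agent work happens here
    state["completed"].append(remaining[0])
    save_state(state)                      # clean checkpoint before exiting
```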
State Persistence Strategies
| Strategy | Use Case | Example |
|---|---|---|
| Progress files | Human-readable status | progress.txt with completed/pending tasks |
| Git commits | Code changes | Descriptive commits as state snapshots |
| Structured data | Machine-readable state | JSON/YAML task lists with pass/fail status |
| External databases | Complex state | Customer records, workflow status |
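As a concrete example of the first strategy, a hypothetical progress.txt (all task names invented) might look like:

```
# progress.txt (refactor project; updated at the end of each session)
DONE:    migrate auth module to new API
DONE:    update unit tests for auth
PENDING: migrate billing module
PENDING: end-to-end verification
NEXT:    start billing migration in src/billing/
```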
Common Failure Modes
Anthropic's research identified several patterns that cause long-running agents to fail:
One-shotting
Problem: Agent tries to complete the entire project in a single session, exhausts context, and leaves work half-done.
Solution: Break work into milestones. Force checkpoints. Design for incremental progress.
Premature Completion Declaration
Problem: Agent claims "Done!" without actually verifying all requirements are met.
Solution: Mandate verification testing. Require end-to-end checks before completion. Don't trust the agent's self-assessment.
Environmental Degradation
Problem: Agent leaves bugs, undocumented changes, or broken state—forcing subsequent sessions to debug instead of advance.
Solution: Require "clean state" at each checkpoint. Run tests before handoff. Document all changes.
Testing Gaps
Problem: Agent marks features complete based on passing unit tests, but the features fail in real user workflows.
Solution: Require browser automation or user-like verification. Unit tests aren't enough.
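One way to get user-like verification is browser automation. The sketch below uses Playwright (one tool among several); the URL, selectors, credentials, and success condition are placeholder assumptions for an imagined login flow, not a real application.

```python
# pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

def verify_login_flow() -> bool:
    """Pass only if a real browser can complete the login journey end to end."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("http://localhost:3000/login")   # placeholder URL
        page.fill("#email", "test@example.com")    # placeholder selectors
        page.fill("#password", "correct-horse")
        page.click("button[type=submit]")
        page.wait_for_url("**/dashboard")          # did we actually land there?
        ok = page.locator("h1").first.inner_text() == "Dashboard"
        browser.close()
        return ok
```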
Context Amnesia
Problem: New session loses critical context from previous sessions, repeating work or making contradictory decisions.
Solution: Structured handoff documents. External memory systems. Comprehensive progress tracking.
Design Patterns for Success
The Initializer Pattern
First session is special:
Initializer Agent:
1. Set up development environment
2. Create tracking infrastructure (progress.txt, etc.)
3. Establish baseline (initial git commit)
4. Define milestone structure
5. Complete first milestone
6. Clean checkpoint
Subsequent sessions assume infrastructure exists.
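A harness can make that split mechanical by checking for the tracking infrastructure on startup. A rough sketch, assuming the progress.txt convention above and a machine with git installed and configured:

```python
import subprocess
from pathlib import Path

def session_role() -> str:
    """Decide whether this session initializes or continues."""
    return "continuation" if Path("progress.txt").exists() else "initializer"

def initialize() -> None:
    """Session 1: create tracking infrastructure and a git baseline."""
    Path("progress.txt").write_text("MILESTONES:\n")   # tracking file
    subprocess.run(["git", "init"], check=True)
    subprocess.run(["git", "add", "-A"], check=True)
    subprocess.run(["git", "commit", "-m", "baseline: initial state"], check=True)
```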
The Feature List Pattern
Maintain a structured requirements document:
{
  "features": [
    {"name": "User authentication", "status": "complete", "verified": true},
    {"name": "Dashboard UI", "status": "in_progress", "verified": false},
    {"name": "Export to PDF", "status": "pending", "verified": false}
  ]
}
Agents can scan this to understand what's done and what's next.
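For example, a continuation session might locate its next task with a few lines like these (assuming the document above is saved as features.json):

```python
import json

def next_feature(path: str = "features.json") -> dict | None:
    """Return the first feature that still needs work, or None when all done."""
    with open(path) as f:
        features = json.load(f)["features"]
    # Anything not both complete and verified is still open work.
    open_items = [item for item in features
                  if item["status"] != "complete" or not item["verified"]]
    return open_items[0] if open_items else None
```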
The Verification Protocol
Before declaring any milestone complete:
- Run all tests (unit, integration, e2e)
- Verify through actual user interaction (browser automation)
- Check for regressions in previous functionality
- Document any deviations from spec
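A harness can enforce this protocol as a gate rather than a suggestion. The sketch below assumes a pytest-based project; the commands are placeholders for whatever test entry points the project actually has.

```python
import subprocess

# Placeholder commands; substitute the project's real test suites.
CHECKS = [
    ["pytest", "tests/unit"],
    ["pytest", "tests/integration"],
    ["pytest", "tests/e2e"],          # e.g. browser-driven checks
]

def milestone_verified() -> bool:
    """Refuse to declare a milestone complete unless every check passes."""
    return all(
        subprocess.run(cmd, capture_output=True).returncode == 0
        for cmd in CHECKS
    )
```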
The Clean Handoff Pattern
End each session with:
- All tests passing
- No uncommitted changes
- Updated progress documentation
- Clear next steps documented
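The checklist can likewise be enforced in code at session exit. This sketch assumes a git repository, a pytest suite, and the progress.txt convention from earlier; all of those are assumptions about the project, not requirements of the pattern.

```python
import subprocess

def clean_handoff(next_steps: str) -> None:
    """Refuse to end the session if it would leave the environment dirty."""
    # 1. All tests passing.
    if subprocess.run(["pytest"]).returncode != 0:
        raise RuntimeError("tests failing; fix before handoff")
    # 2. Record clear next steps in the progress documentation.
    with open("progress.txt", "a") as f:
        f.write(f"NEXT STEPS: {next_steps}\n")
    # 3. No uncommitted changes: commit docs and code together.
    subprocess.run(["git", "add", "-A"], check=True)
    subprocess.run(["git", "commit", "-m", "session handoff"], check=True)
```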
Infrastructure Requirements
Long-running agents need robust harnesses:
| Capability | Why It's Needed |
|---|---|
| Context compaction | Summarize old context to fit new work |
| State persistence | Remember across context window boundaries |
| Error recovery | Handle failures gracefully mid-task |
| Progress tracking | Know what's done and what's remaining |
| Environment management | Maintain clean, reproducible state |
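Of these, context compaction is the least obvious, so here is a shape-of-the-logic sketch: once the history exceeds a budget, older turns are collapsed into a summary while recent turns stay verbatim. The `summarize` helper is a placeholder for a real model call, and character counts approximate tokens purely for illustration.

```python
def compact(history: list[str], budget: int, keep_recent: int = 10) -> list[str]:
    """Collapse old turns into a summary once the context budget is exceeded."""
    def summarize(turns: list[str]) -> str:
        # Placeholder: a real harness would call a model to summarize.
        return "SUMMARY: " + " | ".join(t[:40] for t in turns)

    # Rough token proxy: total character length.
    if len(history) <= keep_recent or sum(len(t) for t in history) <= budget:
        return history
    old, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(old)] + recent
```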
When to Use Long-running Agents
Good fit:
- Large codebase changes (new features, refactors)
- Document processing at scale
- Multi-day research projects
- Ongoing operational tasks
Poor fit:
- Quick questions or single-turn tasks
- Tasks requiring real-time human collaboration
- Highly ambiguous work needing frequent clarification
- Tasks where errors have immediate severe consequences
The Future of Long-running Work
As context windows grow and harness technology improves, long-running agents will handle increasingly ambitious projects:
- Today: Build a feature over 3-4 sessions
- Near future: Complete a sprint's worth of work autonomously
- Long term: Run entire development projects with human oversight
The key insight: success requires infrastructure, not just better models. A sophisticated harness can make a good model great at long-running work. A poor harness will make even the best model fail.
Related Reading
- AI Agents - The systems that run long
- Agent Harness - Infrastructure enabling persistence
- Agent Evaluation - Testing long-running behavior
