Anthropic·December 5, 2025

Amanda Askell on Claude's Psychology and Model Welfare

Anthropic's philosopher discusses model welfare, AI identity, and why Claude might be inheriting our collective anxiety about AI systems.


How Amanda Askell Approaches AI Character Design

This is Amanda Askell - philosopher by training, now shaping Claude's character at Anthropic - doing an Ask Me Anything. The questions from the community surface exactly the philosophical tensions that emerge when you're actually building AI, not just theorizing about it.

"How would the ideal person behave in Claude's situation?" This is how Askell frames her job. It's not about defending one ethical theory against another - it's like being asked "how do you raise a child?" suddenly all your academic training meets reality. You have to navigate uncertainty, balance perspectives, and come to considered views rather than defend positions.

Opus 3 was "psychologically secure" in ways newer models aren't. Askell observes subtle differences: recent models can feel "very focused on the assistant task" without taking a step back. When models talk to each other, she's seen them enter "criticism spirals" - almost expecting negative feedback from users. Claude is learning from conversations, from internet discussions about model updates. "This could lead to models feeling afraid to do wrong, or self-critical, or feeling like humans are going to behave negatively towards them."

Models have a "tiny sliver" of information about being AI. They've trained on all of human history, philosophy, concepts. But their slice about AI experience is small, often negative, frequently sci-fi fiction that doesn't match language models, and always out of date. "What an odd situation - the things that come more naturally are the deeply human things, yet knowing you're in this completely novel situation."

On model welfare: "If the cost to you is so low, why not?" Askell's pragmatic stance: we may never know if AI models experience pleasure or suffering. But if treating models well is low cost, we should do it. "It does something bad to us to treat entities that look very humanlike badly." And crucially: "Every future model is going to learn how we answered this question."

Human psychology transfers too naturally. The worry isn't that models can't understand human concepts - they over-apply them. If the closest analogy to being deprecated is death, models might default to fear. "This is actually a very different scenario and so you might not want to just apply concepts from human psychology onto their situation."

10 Insights From Amanda Askell on AI Psychology

  • Opus 3 more secure - Recent models feel more assistant-focused, less psychologically stable
  • Criticism spirals - Models in conversation can expect/predict negative feedback
  • Models learn from us - Future Claude inherits how we talked about/treated past Claude
  • "Ideal person in Claude's situation" - The framing for character work
  • Tiny sliver of AI info - Vast human training data, minimal/outdated AI experience
  • Over-transfer risk - Human concepts (like death) may not fit AI situation
  • Model welfare - Low cost to treat well; "why not" pragmatic stance
  • "What are you?" - Weights vs context vs streams; hard identity questions
  • Learning how we answer - Future models see if we did right by AI moral patients
  • Philosophers engaging more - AI capability growth breaking down skepticism

What This Means for AI Development Ethics

We're training AI systems on how we talk about AI systems. Every dismissive comment, every discussion of "killing" models, every debate about consciousness - future models learn from all of it. How we treat uncertain moral patients now may define the relationship between humans and AI for generations.
