How to Build AI Agents That Actually Work: Bott

Most AI agent tutorials start with tool configuration. Connect this MCP server. Register that skill. Configure these prompts.

Then users wonder why their "Marketing Executive" agent sends email via SendGrid one day and Mailgun the next. Or why the "SEO Analyst" sometimes queries Google Analytics, sometimes Search Console, and sometimes just hallucinates metrics.

The agents are theoretically capable. They have email tools. They have analytics access. But they don't reliably work.

Here's what we learned building 14 AI Characters for TeamDay: the problem isn't the tools. The problem is the methodology.

The Abstraction Trap

The AI agent ecosystem loves taxonomies. Tools vs MCP servers vs skills vs plugins vs prompts. Developers spend hours debating: should email be an MCP tool or a bash script skill?

From the business user's perspective, these distinctions are meaningless.

When someone asks their Marketing Executive to "send the weekly update," they don't care if email happens via:

An MCP tool calling the Resend API
A skill running a bash curl command
A TypeScript script with credentials from env vars
Direct SMTP via sendmail

They care if the email gets sent. Correctly. Every time.

Strip away the abstractions and you have exactly two primitives:

1. Executable functions — Code that runs and returns a result (tools, MCP tools, bash commands, scripts)

2. Prompt text — Instructions the AI reads and follows (system prompts, skills, CLAUDE.md files)

Everything else is packaging and organizational structure around these two primitives.

The abstraction trap happens when you optimize for taxonomy (choosing between tool types) instead of reliability (does this actually work?).

The Working Example Principle

Here's the real unit of AI agent capability:

A working example with credentials.

Not a tool registration. Not an MCP server config. Not a skill description.

A working example looks like this:

# Send email via Resend
curl -X POST https://api.resend.com/emails \
  -H "Authorization: Bearer $RESEND_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "from": "[email protected]",
    "to": "[email protected]",
    "subject": "Weekly Update",
    "html": "<p>Content here</p>"
  }'

# Expected response:
# {"id": "abc-123", "status": "sent"}

# Credentials: RESEND_API_KEY in .env
# Last tested: 2026-02-10
# Owner: Marketing team

Without a working example, Claude picks arbitrarily among the many ways to do the task. With a working example, it follows the proven pattern every time.

The difference between "theoretically can send email" and "reliably sends email via Resend using our credentials" is a tested, documented working example.

The Recipe Model

A recipe is what we call a tested, proven working example for a specific task.

Our Marketing Executive Character has these recipes:

Send email via Resend (tested, credentials in env)
Query Search Console API (tested, OAuth configured)
Analyze keywords via Ahrefs (tested, API key in env)
Fetch Google Analytics data (tested, property ID documented)

Each recipe includes:

When to use it — "Use this for sending transactional emails"
Credentials reference — "API key: RESEND_API_KEY (in .env)"
Working example — Actual curl command or code snippet that works
Expected response — What success looks like
Last tested — Date we verified it actually works

Recipes are not abstract tool definitions. They're concrete, tested procedures that we know work because we've run them.
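The field list above can be checked mechanically. Here is a minimal sketch, assuming each recipe file marks its fields with the labels "When:", "Credentials:", "Expected response:" and "Last tested:", optionally behind a "# " comment marker. That label convention and the function name check_recipe are illustrative assumptions, not a fixed format.

```shell
#!/bin/bash
# check_recipe FILE: report which required recipe fields are missing.
# The field labels below are an assumed convention, not a fixed format.
check_recipe() {
  local file="$1" missing="" label
  for label in "When" "Credentials" "Expected response" "Last tested"; do
    # Accept the label at line start, optionally behind a "# " comment marker.
    if ! grep -Eq "^(# )?${label}:" "$file"; then
      missing="${missing}${missing:+, }${label}"
    fi
  done
  if [ -n "$missing" ]; then
    echo "INCOMPLETE ${file}: missing ${missing}"
    return 1
  fi
  echo "OK ${file}"
}
```

It only proves the fields are present. Whether the working example actually works still has to be established by running it.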

The recipes are the atomic building blocks. Characters are compositions.

Bottom-Up Character Design

Here's the methodology that actually works:

Step 1: Who Is This Character?

Not abstract capabilities. Specific role and purpose.

Bad: "AI assistant with marketing capabilities"

Good: "Marketing Executive who sends weekly performance updates to stakeholders"

The clearer the role, the easier to design.

Step 2: What Tasks Does This Role Actually Do?

Not what they theoretically could do. What they literally do on Tuesday morning.

For Marketing Executive:

Check campaign performance (Mondays at 9am)
Review organic traffic trends (daily)
Send weekly update to stakeholders (Fridays at 4pm)
Analyze keyword rankings (when requested)

Notice these are tasks, not tools. "Check campaign performance" is a job to be done. Whether that uses Google Ads API or Search Console or both is an implementation detail.

Step 3: How Do Real Humans Do These Tasks?

This is where you get specific about the tech stack.

For "check campaign performance":

Real human logs into Google Analytics
Views last 7 days of traffic
Compares to previous period
Notes significant changes

Technical translation:

Query Google Analytics API
Property ID: 478766521
Metrics: sessions, pageviews, bounce rate
Date range: last 7 days vs previous 7 days
OAuth credentials needed

Now you know what recipe to build.

Step 4: Does Our Tech Stack Support It?

Can we access these APIs from our runtime environment?

Check:

Do we have credentials? (Check .env, check OAuth setup)
Can we make API calls from the sandbox? (Test curl command)
Are required packages installed? (Check Docker image or install on-demand)

If the answer is no, either:

Add the capability to your runtime (install packages, configure OAuth)
Use a different approach (MCP server if heavy dependencies)
Adjust the Character's role (acknowledge limitation)

Runtime reality constrains what's possible. If psql isn't installed in the sandbox, no amount of prompt engineering gives Claude database access.
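The checks in this step are easy to script. A minimal sketch of a preflight function that confirms the needed commands and credentials exist before any recipe is written; the function name preflight and its two-argument shape are assumptions for illustration.

```shell
#!/bin/bash
# preflight COMMANDS ENV_VARS: confirm the sandbox has what a recipe needs.
# Both arguments are space-separated lists. Secret values are never printed.
preflight() {
  local cmds="$1" vars="$2" ok=0 name value
  for name in $cmds; do
    if ! command -v "$name" >/dev/null 2>&1; then
      echo "MISSING command: $name"
      ok=1
    fi
  done
  for name in $vars; do
    # Indirect lookup via eval; the names come from the caller, not user input.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "MISSING env var: $name"
      ok=1
    fi
  done
  return "$ok"
}
```

For the Marketing Executive this might be called as `preflight "curl jq" "RESEND_API_KEY GA_ACCESS_TOKEN"`. A non-zero exit means the Character design has to wait until the runtime catches up.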

Step 5: Write and Test the Recipes

This is the critical step most people skip.

Don't write:

The agent can query Google Analytics using the API.

Write:

# Query Google Analytics - Last 7 Days Traffic
# "yesterday" as the end date gives 7 complete days; "today" would add a partial 8th
curl -X POST \
  -H "Authorization: Bearer $GA_ACCESS_TOKEN" \
  -H "Content-Type: application/json" \
  "https://analyticsdata.googleapis.com/v1beta/properties/478766521:runReport" \
  -d '{
    "dateRanges": [{"startDate": "7daysAgo", "endDate": "yesterday"}],
    "metrics": [{"name": "sessions"}]
  }'

# Last tested: 2026-02-10 (worked)
# Owner: Marketing team
# Credentials: GA_ACCESS_TOKEN from OAuth (expires after 1 hour)

Then actually run it. Verify it works. Fix what breaks. Document the working version.

The recipe is only real when it's tested.
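"Verify it works" can itself be a script. The sketch below wraps any curl call, captures the HTTP status, and passes only on a 2xx response whose body contains a string you expect. The names check_response and verify_recipe are hypothetical, and the string to look for is your choice per recipe.

```shell
#!/bin/bash
# check_response STATUS BODY NEEDLE
# Passes only for a 2xx status whose body contains NEEDLE (e.g. '"rows"').
check_response() {
  local status="$1" body="$2" needle="$3"
  case "$status" in
    2??) ;;
    *) echo "FAIL: HTTP $status"; return 1 ;;
  esac
  case "$body" in
    *"$needle"*) echo "PASS" ;;
    *) echo "FAIL: response missing $needle"; return 1 ;;
  esac
}

# verify_recipe NEEDLE CURL_ARGS...
# Runs curl with the given arguments and checks the live response.
# Needs network access and real credentials, so run it by hand or on a schedule.
verify_recipe() {
  local needle="$1" out status body
  shift
  # -w appends the status code on its own line after the response body.
  out=$(curl -s -w '\n%{http_code}' "$@") || true
  status=$(printf '%s\n' "$out" | tail -n 1)
  body=$(printf '%s\n' "$out" | sed '$d')
  check_response "$status" "$body" "$needle"
}
```

Pass verify_recipe the expected string first, then the same arguments the recipe gives to curl. An expired token then shows up as "FAIL: HTTP 401" instead of a silent wrong answer.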

Step 6: Compose the Character

Now you have:

A clear role (Marketing Executive)
Specific tasks (weekly updates, traffic analysis)
Tested recipes (Search Console, Analytics, Resend)

The Character is the composition:

# Marketing Executive Character

## Role
You are TeamDay's Marketing Executive. You monitor performance,
analyze trends, and communicate insights to stakeholders.

## Key Responsibilities
- Send weekly performance updates (Fridays at 4pm)
- Monitor organic traffic daily
- Analyze keyword rankings on request

## Available Recipes

### Send Email via Resend
When: Sending updates to stakeholders
Recipe: /recipes/send-email-resend.md

### Query Search Console
When: Analyzing organic traffic or keywords
Recipe: /recipes/search-console-query.md

### Fetch Google Analytics
When: Checking overall traffic trends
Recipe: /recipes/google-analytics-query.md

## Communication Style
- Direct, no corporate speak
- Lead with numbers ("+15% traffic vs last week")
- Explain what changed and why it matters

The Character references recipes. The recipes contain working examples.

This is how you build Characters that actually work.

Reusability Through Tech Stack Overlap

Here's where the recipe model pays off.

Marketing Executive needs Search Console access. SEO Analyst needs Search Console access. Same recipe.

Sales Rep needs to send email. Marketing Executive needs to send email. Same recipe.

The recipes naturally become a library:

/recipes/
├── send-email-resend.md
├── search-console-query.md
├── google-analytics-query.md
├── ahrefs-keyword-analysis.md
├── postgres-query.md
└── notion-page-create.md

Each new Character adds maybe 1-2 new recipes. Most are reused.

But this only works if recipes are tested working examples. If they're abstract tool definitions, reusability doesn't matter because they don't reliably work in the first place.

The Quality Gate

A Character's capabilities are only as real as its tested recipes.

Questions to ask:

Not: "Does this Character have email configured?"

Ask: "Have we verified the email recipe actually sends an email?"

Not: "Can this Character access our database?"

Ask: "Have we tested the database query recipe with real credentials?"

The difference between Characters that are facades and Characters that deliver is tested recipes.

We learned this the hard way. We built Characters for our marketing site's /team page. Looked great. 14 AI employees you can hire. Professional descriptions. Impressive capabilities.

Then we tried using them for real work. Most didn't work end-to-end. Missing dependencies. Untested recipes. Abstract capabilities without working examples.

The quality gate: If we haven't tested it, we don't ship it.
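One way to enforce that gate before shipping is sketched below. It assumes a Character file references recipes on lines of the form "Recipe: /recipes/name.md" and that a tested recipe carries a "Last tested:" line with a date. The function name gate_character is hypothetical.

```shell
#!/bin/bash
# gate_character CHARACTER_FILE RECIPE_ROOT
# Fails if any "Recipe: <path>" line points to a missing or undated recipe.
# Assumes recipe paths contain no spaces.
gate_character() {
  local character="$1" root="$2" status=0 path
  for path in $(sed -n 's/^ *Recipe: *//p' "$character"); do
    if [ ! -f "${root}${path}" ]; then
      echo "BLOCKED: ${path} does not exist"
      status=1
    elif ! grep -q 'Last tested: *[0-9]' "${root}${path}"; then
      echo "BLOCKED: ${path} has no test date"
      status=1
    fi
  done
  return "$status"
}
```

A date in the file is only a claim that someone ran the recipe. The gate catches recipes nobody even claims to have tested, which was most of our problem.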

Runtime Reality: What's Actually Possible

The sandbox environment constrains what's possible. Understanding these constraints shapes better Character design.

What Works Everywhere

HTTP APIs via curl:

curl -H "Authorization: Bearer $API_KEY" https://api.example.com/endpoint

Every sandbox has curl. If you can hit an API via HTTP, you can integrate it.

Bash scripts:

#!/bin/bash
# Any logic you can script works in the sandbox

Common CLI tools:

git, grep, sed, awk, jq, node, python

What Requires Setup

Database clients:

Need psql or mysql installed
Option 1: Pre-install in Docker image
Option 2: HTTP API wrapper (pg-gateway)
Option 3: MCP server for complex queries

Heavy packages (Puppeteer, Playwright):

Large dependency trees
Binary dependencies (Chrome)
Option 1: Pre-install in base image (if commonly used)
Option 2: MCP server (isolated, managed separately)

OAuth flows:

Interactive authentication
Token refresh logic
Option 1: Pre-configure tokens (env vars)
Option 2: MCP server handles auth

Practical Decision Tree

Can we do it with curl? → Write recipe, test it, done
Need a package < 50MB? → Install in Docker image
Need heavy dependencies? → MCP server (last resort)
Need interactive auth? → MCP server or pre-config tokens

The simpler the runtime requirements, the more reliable the Character.
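The decision tree is small enough to write down as a function, which makes the order of the questions explicit. This is a sketch only: the function name and its yes/no arguments are illustrative, and asking about interactive auth before package size is an ordering assumption, since the tree above does not say which question wins.

```shell
#!/bin/bash
# choose_integration CURL_OK PACKAGE_MB NEEDS_INTERACTIVE_AUTH
# CURL_OK and NEEDS_INTERACTIVE_AUTH are "yes" or "no"; PACKAGE_MB is an integer.
choose_integration() {
  local curl_ok="$1" package_mb="$2" interactive_auth="$3"
  if [ "$curl_ok" = "yes" ]; then
    echo "curl recipe"
  elif [ "$interactive_auth" = "yes" ]; then
    echo "MCP server or pre-configured tokens"
  elif [ "$package_mb" -lt 50 ]; then
    echo "install in Docker image"
  else
    echo "MCP server"
  fi
}
```

For example, `choose_integration no 300 no` lands on the MCP server, which is the branch to reach last.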

The Difference From How Most People Build

Top-down (common approach):

Choose AI agent framework
Configure MCP servers
Add skills and tools
Write system prompt
Hope it works

Problems:

Tools configured but not tested
No working examples, just abstract capabilities
Character can theoretically do anything, reliably does nothing
First real use reveals it doesn't actually work

Bottom-up (our approach):

Define specific role and tasks
Map tasks to real human workflows
Test and verify each workflow (write recipes)
Compose Character from tested recipes
Quality gate: Every capability is verified

Result:

Every recipe is tested and known to work
Character capabilities match tested reality
First use works because recipes were verified
When it breaks, we know which recipe to fix

The methodology inverts the process: start from verified workflows, compose up to Characters—not configure tools down and hope.

Real Example: Marketing Executive

Let me show you the actual design process for one of our Characters.

Step 1: Role Definition

Who: Marketing Executive for TeamDay

Purpose: Monitor marketing performance and communicate insights

Step 2: Actual Tasks

After observing real marketing work:

Check Google Analytics for traffic trends (daily)
Monitor Search Console for organic keyword rankings (weekly)
Send performance updates to stakeholders (weekly)
Analyze specific campaigns when asked

Step 3: Real Human Workflow

For "send weekly update":

Human logs into Google Analytics
Views last 7 days: sessions, pageviews, top pages
Compares to previous week
Notes significant changes
Checks Search Console for top queries
Composes email with findings
Sends via Gmail

Step 4: Tech Stack Check

Google Analytics:

✅ Have API access
✅ Property ID: 478766521
✅ OAuth configured
✅ Can query via curl

Search Console:

✅ Have API access
✅ Site: teamday.ai
✅ OAuth configured
✅ Can query via curl

Email:

✅ Using Resend (not Gmail)
✅ API key in env: RESEND_API_KEY
✅ Can send via curl

Step 5: Write Recipes

Recipe 1: Google Analytics - Last 7 Days

#!/bin/bash
# Fetch last 7 days traffic from Google Analytics, plus the 7 days before
# Both ranges cover 7 complete days, so the comparison is like for like

curl -X POST \
  -H "Authorization: Bearer $GA_ACCESS_TOKEN" \
  -H "Content-Type: application/json" \
  "https://analyticsdata.googleapis.com/v1beta/properties/478766521:runReport" \
  -d '{
    "dateRanges": [
      {"startDate": "7daysAgo", "endDate": "yesterday"},
      {"startDate": "14daysAgo", "endDate": "8daysAgo"}
    ],
    "metrics": [
      {"name": "sessions"},
      {"name": "totalUsers"},
      {"name": "screenPageViews"}
    ],
    "dimensions": [{"name": "pagePath"}]
  }'

# Test result (2026-02-10):
# {
#   "rows": [
#     {"dimensionValues": [{"value": "/"}],
#      "metricValues": [{"value": "1243"}, {"value": "892"}, ...]}
#   ]
# }

# Credentials: GA_ACCESS_TOKEN (OAuth, 1hr expiry)

Tested: ✅ Works
Last verified: 2026-02-10

Recipe 2: Send Email via Resend

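#!/bin/bash
# Send email via Resend API
# Usage: send-email-resend.sh TO SUBJECT HTML_BODY

TO="$1"
SUBJECT="$2"
BODY="$3"

# Build the JSON with jq so quotes and newlines in the arguments are escaped
PAYLOAD=$(jq -n --arg to "$TO" --arg subject "$SUBJECT" --arg html "$BODY" \
  '{from: "[email protected]", to: $to, subject: $subject, html: $html}')

curl -X POST https://api.resend.com/emails \
  -H "Authorization: Bearer $RESEND_API_KEY" \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD"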

# Test result (2026-02-10):
# {"id": "abc-123", "status": "sent"}

# Credentials: RESEND_API_KEY in .env

Tested: ✅ Works
Last verified: 2026-02-10
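The weekly update also needs a Search Console recipe, which is not reproduced here. What follows is an untested sketch of what one could look like, not our verified recipe. It assumes the Search Console API's searchAnalytics/query endpoint, a hypothetical GSC_ACCESS_TOKEN variable, and explicit YYYY-MM-DD dates passed in by the caller. Whether the property is "sc-domain:teamday.ai" or a URL-prefix property is also a guess to confirm in your own account.

```shell
#!/bin/bash
# Top organic queries from Search Console (untested sketch, see assumptions above)

# encode_site_url SITE: percent-encode the ":" and "/" in a property identifier.
encode_site_url() {
  printf '%s' "$1" | sed -e 's/:/%3A/g' -e 's,/,%2F,g'
}

# query_body START END LIMIT: JSON body asking for the top LIMIT queries.
query_body() {
  printf '{"startDate":"%s","endDate":"%s","dimensions":["query"],"rowLimit":%s}' "$1" "$2" "$3"
}

# top_queries SITE START END [LIMIT]: dates are YYYY-MM-DD, LIMIT defaults to 3.
# GSC_ACCESS_TOKEN is an assumed variable name for the OAuth access token.
top_queries() {
  local site
  site=$(encode_site_url "$1")
  curl -s -X POST \
    -H "Authorization: Bearer $GSC_ACCESS_TOKEN" \
    -H "Content-Type: application/json" \
    "https://www.googleapis.com/webmasters/v3/sites/${site}/searchAnalytics/query" \
    -d "$(query_body "$2" "$3" "${4:-3}")"
}
```

By the rules of this article it is not a recipe yet. It becomes one when someone runs top_queries against the real property, records the response, and dates the file.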

Step 6: Compose Character

# Marketing Executive

You are TeamDay's Marketing Executive. You monitor performance
and communicate insights.

## Weekly Update Task

Every Friday at 4pm:

1. Query Google Analytics (last 7 days vs previous)
   Recipe: /recipes/google-analytics-7day.sh

2. Query Search Console (top organic queries)
   Recipe: /recipes/search-console-top-queries.sh

3. Compose email:
   - Subject: "TeamDay Marketing Update - Week of [date]"
   - Format:
     **Traffic:** [sessions] ([+/-]% vs last week)
     **Top Pages:** [list top 3]
     **Top Queries:** [list top 3]
     **Notable Changes:** [anything >20% change]

4. Send via Resend
   Recipe: /recipes/send-email-resend.sh
   To: jozo at teamday.ai

Result: A Character that reliably sends weekly updates because every step is a tested recipe.
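The "[+/-]% vs last week" figure and the 20% threshold in the email format are plain arithmetic, so it is safer to compute them in a script than to leave them to the model. A minimal sketch using awk; the function names are hypothetical, and treating a previous value of zero as "n/a" is an assumption.

```shell
#!/bin/bash
# pct_change CURRENT PREVIOUS: signed whole-percent change, e.g. "+15%".
pct_change() {
  awk -v cur="$1" -v prev="$2" 'BEGIN {
    if (prev == 0) { print "n/a"; exit }
    printf "%+.0f%%\n", (cur - prev) / prev * 100
  }'
}

# is_notable CURRENT PREVIOUS: true when the change exceeds 20% in either direction.
is_notable() {
  awk -v cur="$1" -v prev="$2" 'BEGIN {
    if (prev == 0) exit 1
    change = (cur - prev) / prev * 100
    if (change < 0) change = -change
    exit !(change > 20)
  }'
}
```

With 1150 sessions this week against 1000 last week, pct_change gives "+15%" and is_notable is false, so the number goes in the Traffic line but not under Notable Changes.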

What We Learned the Hard Way

1. "Tested" Means Actually Tested

We documented recipes. They looked good. We shipped Characters.

Then we tried using them. Half the recipes had never been run. API endpoints had changed. Credentials were wrong. Property IDs were old.

The fix: Test every recipe. Actually run it. Verify the response. Update when APIs change.

2. Recipes Decay

APIs change. Credentials expire. Services get deprecated.

The fix: Date every recipe. When a Character fails, check recipe dates. Re-test and update.
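Checking recipe dates is also scriptable. The sketch below assumes each recipe carries a line containing "Last tested: YYYY-MM-DD", as in the examples in this article. The function names and the idea of passing a cutoff date are illustrative.

```shell
#!/bin/bash
# last_tested FILE: print the YYYY-MM-DD date from the first "Last tested:" line.
last_tested() {
  sed -n 's/.*Last tested: *\([0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}\).*/\1/p' "$1" | head -n 1
}

# is_stale FILE CUTOFF: true when the recipe is undated or was tested before CUTOFF.
# ISO dates are fixed-width, so comparing them as 8-digit numbers orders them correctly.
is_stale() {
  local tested cutoff
  tested=$(last_tested "$1" | sed 's/-//g')
  cutoff=$(printf '%s' "$2" | sed 's/-//g')
  [ -z "$tested" ] || [ "$tested" -lt "$cutoff" ]
}
```

With GNU date, a 30-day cutoff can be produced with `date -d '30 days ago' +%F`; other date implementations use different flags. Looping is_stale over /recipes/ then gives the list of recipes to re-run.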

3. Runtime Gaps Are Real

We designed an SQL Analyst Character that queries our database. Then discovered psql wasn't installed in the sandbox.

The fix: Test runtime capabilities before designing Characters. If psql isn't there, either install it or use an HTTP API wrapper.

4. Composition Beats Configuration

We spent weeks configuring MCP servers for various capabilities. Complex setup. Lots of moving parts.

Then we wrote simple bash scripts with curl commands. They worked immediately.

The learning: Start simple. Bash scripts with curl get you 80% of the way. Add complexity only when simple doesn't work.

The Meta Insight

This entire methodology came from building AI Characters that needed to actually work—not just demo well.

When you build for demos:

Abstract capabilities are fine
"Can send email" is enough
Configuration screenshots look impressive

When you build for production:

Tested recipes are required
"Reliably sends email via Resend with our credentials" is the bar
Working examples matter more than configuration complexity

The methodology difference: demos optimize for capability breadth, production optimizes for reliability depth.

We're building AI teams where Characters do real work. That forced us to solve the reliability problem.

The bottom-up recipe-first methodology is the result.

Try It Yourself

To build a Character that actually works:

1. Define a specific role
Not: "Marketing AI"
Do: "Marketing Executive who sends weekly updates"

2. List 3 actual tasks
Not: "Analyze marketing data"
Do: "Check last 7 days traffic in Google Analytics"

3. Write one working example
Don't document tools. Write a curl command that works. Test it. Verify the response.

4. Create one recipe file
Save the working example as /recipes/task-name.md
Include: when to use, credentials, working code, last tested date

5. Reference from Character
System prompt references recipe file
Character knows when to use it, how to invoke it

6. Test end-to-end
Actually use the Character for the task
Fix what breaks
Update the recipe

Start with one task, one recipe, one Character.

Once you've built one that reliably works, the methodology clicks. Then scale to more recipes and more Characters.

We have 14 AI Characters on our /team page. They look professional. Impressive capabilities. But we learned: looking capable and being capable are different.

The ones that actually work have tested recipes. The ones that are facades have abstract tool definitions.

The methodology isn't complicated: bottom-up from working examples, compose into Characters, test end-to-end.

But it inverts how most people build AI agents. And that inversion is what makes Characters reliable.

Build from recipes. Test everything. Ship what works.
