
Beyond Chatbots: Building Real-World Stateful AI Agents on Cloudflare

Build an AI-powered site reliability agent that remembers, schedules, and escalates. No chatbot required.

้ปƒๅฐ้ปƒ

้ปƒๅฐ้ปƒ

· 14 min read

Beyond Chatbots: Building Real-World Stateful AI Agents on Cloudflare

Most "AI agents" you see today are just LLM wrappers with a fancy prompt. They process a request, return a response, and forget everything. No memory. No scheduling. No persistence.

Real agents are different. They remember what happened yesterday. They wake up at 3 AM to check on things. They pause and ask for human approval when stakes are high. They maintain state across sessions, making decisions based on accumulated context, not just the current prompt.

In this tutorial, we'll build exactly that: a Smart Site Reliability Agent that monitors your websites, uses AI to detect anomalies, and escalates critical issues to you, all running on Cloudflare's edge network with zero cost when idle.

No chatbot UI. No conversational fluff. Just a stateful, autonomous agent doing real work.


What Makes an AI Agent "Stateful"?

A stateful AI agent is a long-running program that persists its memory, decisions, and context across interactions and restarts. Unlike stateless LLM calls where each request starts from scratch, a stateful agent accumulates knowledge over time.

Here's the key difference:

                   Stateless LLM Wrapper        Stateful AI Agent
Memory             None between requests        Persistent across sessions
Scheduling         Only responds when called    Can wake itself up on a schedule
Context            Single conversation turn     Accumulated history and patterns
Decision Making    Reactive only                Proactive: acts on its own
Cost When Idle     $0                           $0 (with hibernation)

Stateful vs Stateless AI Agents: key differences in memory, scheduling, and decision making

Think of it this way: a stateless LLM call is like asking a stranger for directions every time. A stateful agent is like having an assistant who knows your route, remembers the traffic patterns, and proactively suggests alternatives before you even ask.

The challenge has always been: where do you run a stateful agent in production? Traditional serverless functions are stateless by design. Containers require always-on infrastructure. That's where Cloudflare's approach gets interesting.


Why Cloudflare for AI Agents?

Cloudflare Agents SDK architecture: Worker routing to Durable Object agents with built-in SQLite, WebSocket, and scheduling

Cloudflare's Agents SDK is built on top of Durable Objects: essentially stateful micro-servers that live on Cloudflare's global edge network. Each agent instance is its own isolated server with:

  • Built-in SQLite database: no external database needed. Your agent's memory lives right next to its compute.

  • WebSocket support with hibernation: real-time connections that cost nothing when idle. The agent wakes up only when a message arrives.

  • Scheduled tasks (alarms): cron-like scheduling built into the runtime. Your agent can wake itself up to do work.

  • Automatic global distribution: each agent instance runs closest to where it's needed.

The killer feature? Hibernation. When your agent has no active connections and no pending alarms, it literally costs $0. It's like having a dedicated server that only charges you when it's thinking.

When to Use What

Before reaching for the Agents SDK, consider the alternatives:

Use Case                                          Best Choice
Simple request/response AI                        Regular Worker + Workers AI
Multi-step background jobs                        Cloudflare Workflows
Stateful, long-lived agent with real-time sync    Agents SDK ✅
Key-value state without real-time                 Durable Objects directly

The Agents SDK shines when you need persistent state + real-time communication + scheduled tasks in one package.


What We'll Build: A Smart Site Reliability Agent

Our agent isn't a simple uptime checker. It's an AI-powered reliability monitor that:

Feature                                            SDK Capability
⏰ Runs health checks every 5 minutes              scheduleEvery()
💾 Stores check history in SQLite                  this.sql
🧠 Uses AI to detect anomaly patterns              AI SDK integration
📡 Pushes live updates to a dashboard              WebSocket + useAgent
🔧 Supports manual controls via RPC                @callable()
🚨 Escalates critical issues for human approval    Human-in-the-loop

Smart Site Reliability Agent: feature overview showing scheduled checks, AI analysis, real-time dashboard, and human-in-the-loop escalation

By the end, you'll have a fully deployed agent that watches over your sites and thinks about what it sees, not just whether a URL returns 200.


Project Setup

Prerequisites

  • Node.js 20+ (Node 24+ recommended)

  • A Cloudflare account (Workers Paid plan for Durable Objects)

  • An API key from any LLM provider (OpenAI, Anthropic, or Cloudflare Workers AI)

Scaffold the Project

npm create cloudflare@latest site-reliability-agent -- --template cloudflare/agents-starter
cd site-reliability-agent
npm install

Project Structure

site-reliability-agent/
├── src/
│   ├── server.ts          # Agent class + Worker entry
│   └── client.tsx         # React dashboard with useAgent
├── wrangler.jsonc         # Cloudflare configuration
├── .dev.vars              # Local secrets (API keys)
└── package.json

Wrangler Configuration

// wrangler.jsonc
{
  "name": "site-reliability-agent",
  "main": "src/server.ts",
  // wrangler requires a compatibility_date; use a recent one
  "compatibility_date": "2025-01-01",
  "compatibility_flags": ["nodejs_compat"],
  "durable_objects": {
    "bindings": [
      {
        "name": "SiteAgent",
        "class_name": "SiteAgent"
      }
    ]
  },
  "migrations": [
    {
      "tag": "v1",
      "new_sqlite_classes": ["SiteAgent"]
    }
  ]
}

Add your LLM API key to .dev.vars:

# .dev.vars (never commit this file)
OPENAI_API_KEY=sk-your-key-here

Building the Agent Core

Defining State and the Agent Class

Let's start with the agent's state shape and core class:

// src/server.ts
import { Agent, routeAgentRequest, unstable_callable as callable } from "agents";

type Env = {
  SiteAgent: DurableObjectNamespace;
  OPENAI_API_KEY: string;
};

type SiteStatus = "healthy" | "degraded" | "down" | "unknown";

type AgentState = {
  monitoredUrls: string[];
  checkIntervalMinutes: number;
  lastCheckAt: string | null;
  currentStatus: Record<string, SiteStatus>;
  alertsEnabled: boolean;
  pendingEscalation: {
    url: string;
    reason: string;
    timestamp: string;
  } | null;
};

export class SiteAgent extends Agent<Env, AgentState> {
  // Default state when the agent is first created
  initialState: AgentState = {
    monitoredUrls: [],
    checkIntervalMinutes: 5,
    lastCheckAt: null,
    currentStatus: {},
    alertsEnabled: true,
    pendingEscalation: null,
  };

  async onStart() {
    // Initialize the SQLite table for check history
    this.sql`
      CREATE TABLE IF NOT EXISTS check_history (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        url TEXT NOT NULL,
        status_code INTEGER,
        response_time_ms INTEGER,
        status TEXT NOT NULL,
        ai_analysis TEXT,
        checked_at TEXT DEFAULT (datetime('now'))
      )
    `;
  }
}

A few things to notice:

  • initialState sets the default state for new agent instances

  • this.sql is a tagged template literal: it gives you direct SQLite access, no ORM needed

  • State updates via setState() are automatically synced to all connected WebSocket clients (see the sketch below)
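
As a concrete sketch of that last point: setState() stores the object you pass as the agent's new state, so spread this.state when you only mean to change one field. The toggleAlerts helper below is hypothetical, but alertsEnabled comes from the state shape above.

// Inside the SiteAgent class (hypothetical helper, shown for illustration)
private toggleAlerts(enabled: boolean) {
  this.setState({
    ...this.state,          // keep monitoredUrls, currentStatus, etc.
    alertsEnabled: enabled, // the only field that actually changes
  });
  // Every dashboard connected via useAgent receives the new state automatically.
}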

Health Check Logic with Scheduled Tasks

Now let's add the scheduled health checks:

// Inside the SiteAgent class

async onStart() {
  // ... SQLite init from above ...

  // Start the health check schedule
  if (this.state.monitoredUrls.length > 0) {
    this.scheduleEvery("runHealthChecks", `*/${this.state.checkIntervalMinutes} * * * *`);
  }
}

async runHealthChecks() {
  const results: Record<string, SiteStatus> = {};

  for (const url of this.state.monitoredUrls) {
    const result = await this.checkUrl(url);
    results[url] = result.status;

    // Store in SQLite
    this.sql`
      INSERT INTO check_history (url, status_code, response_time_ms, status)
      VALUES (${url}, ${result.statusCode}, ${result.responseTime}, ${result.status})
    `;
  }

  // Spread the existing state so fields like monitoredUrls are preserved.
  this.setState({
    ...this.state,
    currentStatus: results,
    lastCheckAt: new Date().toISOString(),
  });

  // Broadcast to all connected dashboard clients
  this.broadcast(JSON.stringify({
    type: "health_check_complete",
    results,
    timestamp: new Date().toISOString(),
  }));
}

private async checkUrl(url: string): Promise<{
  statusCode: number;
  responseTime: number;
  status: SiteStatus;
}> {
  const start = Date.now();

  try {
    const response = await fetch(url, {
      method: "GET",
      signal: AbortSignal.timeout(10_000), // 10s timeout
    });

    const responseTime = Date.now() - start;
    let status: SiteStatus = "healthy";

    if (!response.ok) {
      status = response.status >= 500 ? "down" : "degraded";
    } else if (responseTime > 3000) {
      status = "degraded";
    }

    return { statusCode: response.status, responseTime, status };
  } catch {
    return { statusCode: 0, responseTime: Date.now() - start, status: "down" };
  }
}

The scheduleEvery method accepts a cron expression. Every 5 minutes, the agent wakes up from hibernation, runs all health checks, stores results, updates its state, and broadcasts to any connected dashboards, then goes back to sleep.

Querying History with SQLite

The built-in SQLite database makes historical queries trivial:

// Inside the SiteAgent class

private getRecentHistory(url: string, limit = 20) {
  return this.sql<{
    status_code: number;
    response_time_ms: number;
    status: string;
    ai_analysis: string | null;
    checked_at: string;
  }>`
    SELECT status_code, response_time_ms, status, ai_analysis, checked_at
    FROM check_history
    WHERE url = ${url}
    ORDER BY checked_at DESC
    LIMIT ${limit}
  `;
}

private getStatusTrend(url: string) {
  return this.sql<{ status: string; count: number }>`
    SELECT status, COUNT(*) as count
    FROM check_history
    WHERE url = ${url}
      AND checked_at > datetime('now', '-1 hour')
    GROUP BY status
  `;
}

No external database. No connection strings. No cold starts on DB connections. The data lives right next to the agent's compute.
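
Because it's plain SQLite, aggregate reporting is just as easy. As a sketch, a hypothetical getDailyStats helper could compute uptime percentage and average latency over the last 24 hours:

// Inside the SiteAgent class (hypothetical helper)
private getDailyStats(url: string) {
  return this.sql<{ uptime_pct: number; avg_ms: number; checks: number }>`
    SELECT
      ROUND(100.0 * SUM(CASE WHEN status = 'healthy' THEN 1 ELSE 0 END) / COUNT(*), 1) AS uptime_pct,
      ROUND(AVG(response_time_ms), 0) AS avg_ms,
      COUNT(*) AS checks
    FROM check_history
    WHERE url = ${url}
      AND checked_at > datetime('now', '-24 hours')
  `;
}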


Adding AI-Powered Analysis

This is where our agent goes from "uptime checker" to "site reliability engineer." Instead of just checking status codes, we feed the check history to an LLM for pattern analysis.

import { generateText } from "ai";
import { createOpenAI } from "@ai-sdk/openai";

// Inside the SiteAgent class

async runHealthChecks() {
  // ... health check logic from above ...

  // After checks complete, ask AI to analyze patterns
  const hasIssues = Object.values(results).some(
    (s) => s === "degraded" || s === "down"
  );

  if (hasIssues) {
    await this.analyzeWithAI(results);
  }
}

private async analyzeWithAI(currentResults: Record<string, SiteStatus>) {
  // Gather recent history for context
  const historyByUrl: Record<string, any[]> = {};
  for (const url of this.state.monitoredUrls) {
    historyByUrl[url] = this.getRecentHistory(url, 10);
  }

  // Build the provider explicitly from the agent's env binding rather than
  // relying on process.env carrying the API key.
  const openai = createOpenAI({ apiKey: this.env.OPENAI_API_KEY });

  const { text: analysis } = await generateText({
    model: openai("gpt-4o-mini"),
    system: `You are a site reliability engineer analyzing website health data.
Be concise and actionable. Focus on patterns, not individual data points.
Flag anything that suggests an emerging problem, not just current outages.`,
    prompt: `Current check results: ${JSON.stringify(currentResults)}

Recent history (last 10 checks per URL):
${JSON.stringify(historyByUrl, null, 2)}

Analyze:
1. Are there any concerning patterns (increasing latency, intermittent failures)?
2. Is this likely a transient issue or systematic problem?
3. Recommended action: MONITOR, INVESTIGATE, or ESCALATE?`,
  });

  // Store the analysis
  for (const [url, status] of Object.entries(currentResults)) {
    if (status !== "healthy") {
      this.sql`
        UPDATE check_history
        SET ai_analysis = ${analysis}
        WHERE url = ${url}
        AND id = (SELECT MAX(id) FROM check_history WHERE url = ${url})
      `;
    }
  }

  // If AI recommends escalation, trigger human-in-the-loop
  if (analysis.includes("ESCALATE")) {
    this.setState({
      ...this.state, // preserve the rest of the agent state
      pendingEscalation: {
        url: Object.entries(currentResults)
          .filter(([, s]) => s !== "healthy")
          .map(([u]) => u)
          .join(", "),
        reason: analysis,
        timestamp: new Date().toISOString(),
      },
    });

    this.broadcast(JSON.stringify({
      type: "escalation_required",
      analysis,
      timestamp: new Date().toISOString(),
    }));
  }
}

The AI doesn't just check if a site is up; it looks at patterns. Is response time gradually increasing? Are failures clustered at specific times? Is this a CDN issue or an origin server problem? These are the kinds of insights that turn raw data into actionable intelligence.
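
One refinement worth considering: matching on the literal string "ESCALATE" is brittle. If you want a machine-readable decision, the AI SDK's generateObject with a zod schema is a natural fit. Here's a rough sketch; the schema and field names are mine, not part of the tutorial, and it reuses the openai provider plus the currentResults and historyByUrl values from analyzeWithAI above.

import { generateObject } from "ai";
import { z } from "zod";

// Hypothetical, structured variant of the analysis step.
const verdictSchema = z.object({
  summary: z.string(),                                    // short diagnosis
  action: z.enum(["MONITOR", "INVESTIGATE", "ESCALATE"]), // machine-readable decision
  affectedUrls: z.array(z.string()),
});

const { object: verdict } = await generateObject({
  model: openai("gpt-4o-mini"),
  schema: verdictSchema,
  prompt: `Current results: ${JSON.stringify(currentResults)}
Recent history: ${JSON.stringify(historyByUrl)}`,
});

if (verdict.action === "ESCALATE") {
  // ...set pendingEscalation exactly as before, now from structured fields
}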


Real-Time Dashboard with useAgent

Real-time monitoring dashboard with WebSocket state sync

The agent handles the backend. Now let's build a React frontend that stays in sync via WebSocket.

Connecting with useAgent

// src/client.tsx
import { useAgent } from "agents/react";

function Dashboard() {
  const agent = useAgent<SiteAgent, AgentState>({
    agent: "site-agent",
    name: "my-sites", // Each unique name = unique agent instance
  });

  if (!agent.state) return <div>Connecting to agent...</div>;

  return (
    <div className="dashboard">
      <header>
        <h1>Site Reliability Agent</h1>
        <span className="last-check">
          Last check: {agent.state.lastCheckAt ?? "Never"}
        </span>
      </header>

      <div className="status-grid">
        {agent.state.monitoredUrls.map((url) => (
          <StatusCard
            key={url}
            url={url}
            status={agent.state.currentStatus[url] ?? "unknown"}
          />
        ))}
      </div>

      {agent.state.pendingEscalation && (
        <EscalationBanner
          escalation={agent.state.pendingEscalation}
          onApprove={() => agent.stub.acknowledgeEscalation()}
          onDismiss={() => agent.stub.dismissEscalation()}
        />
      )}

      <ManualControls agent={agent} />
    </div>
  );
}

When the agent calls setState(), every connected dashboard updates instantly: no polling, no refetching. The useAgent hook handles WebSocket connection, reconnection, and state synchronization automatically.
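
The StatusCard and EscalationBanner components referenced above aren't part of the starter. Here is one minimal way they might look; purely illustrative markup, matching the props used in Dashboard and assuming SiteStatus is exported from server.ts.

// src/client.tsx (illustrative sketches)

function StatusCard({ url, status }: { url: string; status: SiteStatus }) {
  return (
    <div className={`status-card status-${status}`}>
      <span className="url">{url}</span>
      <span className="status">{status}</span>
    </div>
  );
}

function EscalationBanner({
  escalation,
  onApprove,
  onDismiss,
}: {
  escalation: { url: string; reason: string; timestamp: string };
  onApprove: () => void;
  onDismiss: () => void;
}) {
  return (
    <div className="escalation-banner">
      <strong>Escalation required: {escalation.url}</strong>
      <p>{escalation.reason}</p>
      <button onClick={onApprove}>Acknowledge</button>
      <button onClick={onDismiss}>Dismiss</button>
    </div>
  );
}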

Callable Methods for Manual Controls

The @callable() decorator exposes server-side methods that the frontend can call with full type safety:

// In src/server.ts โ€” inside SiteAgent class

@callable()
async addUrl(url: string) {
  if (this.state.monitoredUrls.includes(url)) {
    return { success: false, error: "URL already monitored" };
  }

  this.setState({
    ...this.state, // keep the other state fields intact
    monitoredUrls: [...this.state.monitoredUrls, url],
    currentStatus: { ...this.state.currentStatus, [url]: "unknown" },
  });

  // Restart the schedule if this is the first URL
  if (this.state.monitoredUrls.length === 1) {
    this.scheduleEvery(
      "runHealthChecks",
      `*/${this.state.checkIntervalMinutes} * * * *`
    );
  }

  return { success: true };
}

@callable()
async removeUrl(url: string) {
  this.setState({
    ...this.state,
    monitoredUrls: this.state.monitoredUrls.filter((u) => u !== url),
    currentStatus: Object.fromEntries(
      Object.entries(this.state.currentStatus).filter(([u]) => u !== url)
    ),
  });

  return { success: true };
}

@callable()
async triggerManualCheck() {
  await this.runHealthChecks();
  return { success: true, checkedAt: new Date().toISOString() };
}

On the client, calling these is as simple as:

// Type-safe RPC: no manual fetch calls needed
await agent.stub.addUrl("https://example.com");
await agent.stub.triggerManualCheck();
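
ManualControls isn't shown in the starter either. A minimal sketch might wrap the same stub calls in a small form; the AgentClient type alias is my own shorthand, and the imports mirror the Dashboard component above.

// src/client.tsx (illustrative sketch)
import { useState } from "react";

// Shorthand for the client returned by useAgent in Dashboard above.
type AgentClient = ReturnType<typeof useAgent<SiteAgent, AgentState>>;

function ManualControls({ agent }: { agent: AgentClient }) {
  const [url, setUrl] = useState("");

  return (
    <div className="manual-controls">
      <input
        value={url}
        onChange={(e) => setUrl(e.target.value)}
        placeholder="https://example.com"
      />
      <button
        onClick={async () => {
          await agent.stub.addUrl(url); // type-safe RPC into the agent
          setUrl("");
        }}
      >
        Add URL
      </button>
      <button onClick={() => agent.stub.triggerManualCheck()}>Check now</button>
    </div>
  );
}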

Human-in-the-Loop: Escalation That Works

Human-in-the-loop escalation flow: AI detects pattern, agent pauses, human decides, agent resumes

When the AI detects something serious, the agent doesn't just log it; it pauses and waits for human judgment:

// In SiteAgent class

@callable()
async acknowledgeEscalation() {
  const escalation = this.state.pendingEscalation;
  if (!escalation) return { success: false, error: "No pending escalation" };

  // Log the acknowledgment
  this.sql`
    INSERT INTO check_history (url, status_code, response_time_ms, status, ai_analysis)
    VALUES (
      ${escalation.url},
      0,
      0,
      'acknowledged',
      ${'Human acknowledged: ' + escalation.reason}
    )
  `;

  // Clear the escalation
  this.setState({ ...this.state, pendingEscalation: null });

  this.broadcast(JSON.stringify({
    type: "escalation_resolved",
    action: "acknowledged",
    timestamp: new Date().toISOString(),
  }));

  return { success: true };
}

@callable()
async dismissEscalation() {
  this.setState({ ...this.state, pendingEscalation: null });

  this.broadcast(JSON.stringify({
    type: "escalation_resolved",
    action: "dismissed",
    timestamp: new Date().toISOString(),
  }));

  return { success: true };
}

The escalation flow works like this:

  1. AI detects a pattern → recommends ESCALATE

  2. Agent updates state → pendingEscalation is set

  3. Dashboard shows banner → human sees the AI's analysis and reasoning

  4. Human decides → Acknowledge (take action) or Dismiss (false alarm)

  5. Agent records the decision → builds a history of escalations for future AI context

This is the real power of stateful agents: they can pause, wait, and resume based on human input without losing their context.
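
Step 5 hints at the longer game: past human decisions become context for future analyses. A small, hypothetical helper could feed recent acknowledgments back into the AI prompt:

// Inside the SiteAgent class (hypothetical helper)
private getRecentEscalationDecisions(limit = 5) {
  return this.sql<{ url: string; ai_analysis: string; checked_at: string }>`
    SELECT url, ai_analysis, checked_at
    FROM check_history
    WHERE status = 'acknowledged'
    ORDER BY checked_at DESC
    LIMIT ${limit}
  `;
}

// ...then, inside analyzeWithAI, append something like:
// `Previously acknowledged incidents: ${JSON.stringify(this.getRecentEscalationDecisions())}`
// to the prompt so the model knows what a human has already reviewed.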


Worker Entry Point

Don't forget the Worker entry that routes requests to agent instances:

// At the bottom of src/server.ts

export default {
  async fetch(request: Request, env: Env) {
    // Route to the correct agent instance; anything that isn't an
    // /agents/... request falls through to a 404.
    return (
      (await routeAgentRequest(request, env)) ??
      new Response("Not found", { status: 404 })
    );
  },
} satisfies ExportedHandler<Env>;

The routeAgentRequest function dispatches requests to the right Durable Object instance based on the URL pattern: /agents/site-agent/:instance-name.


Testing and Deploying to Production

Local Development

npx wrangler dev

This starts a local development server with full Durable Object support. Your agent runs with real SQLite, real WebSocket connections, and real scheduling, identical to production.

Open http://localhost:8787 to see your dashboard. Add a URL and watch the agent start monitoring.

Deploy to Cloudflare

# Set your API key as a secret
npx wrangler secret put OPENAI_API_KEY

# Deploy
npx wrangler deploy

Your agent is now live on Cloudflare's global network. Each unique instance name creates an isolated agent with its own state, database, and schedule.

Environment Separation

For staging vs production, use wrangler environments:

// wrangler.jsonc
{
  "name": "site-reliability-agent",
  "env": {
    "staging": {
      "name": "site-reliability-agent-staging",
      "vars": { "ENVIRONMENT": "staging" }
    },
    "production": {
      "name": "site-reliability-agent",
      "vars": { "ENVIRONMENT": "production" }
    }
  }
}
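
One caveat worth double-checking against the wrangler docs: bindings such as durable_objects are generally not inherited by named environments, so each environment block usually needs its own copy, roughly like the sketch below. Deploy a specific environment with npx wrangler deploy --env staging.

// wrangler.jsonc (sketch; verify inheritance rules in the current wrangler docs)
{
  "env": {
    "staging": {
      "name": "site-reliability-agent-staging",
      "vars": { "ENVIRONMENT": "staging" },
      "durable_objects": {
        "bindings": [{ "name": "SiteAgent", "class_name": "SiteAgent" }]
      }
    }
  }
}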

Performance, Limits, and Cost Breakdown

Cloudflare Agents Limits

Resource                   Limit
CPU time per request       30 seconds (refreshes per event)
Memory per instance        128 MB
SQLite storage             1 GB per Durable Object
WebSocket connections      32,768 per instance
Alarm precision            ~1 second

Cost Estimate

For a typical monitoring setup (100 URLs, checked every 5 minutes):

Component                     Monthly Cost
Worker requests (routing)     ~$0.50
Durable Object requests       ~$2.00
Durable Object duration       ~$1.50
SQLite storage (1 GB)         $0.20
AI API calls (OpenAI)         ~$5.00
Total                         ~$9.20/month

Compare this to running the same setup on AWS (Lambda + DynamoDB + EventBridge + API Gateway), where you'd easily spend $20-30/month for equivalent functionality, plus the engineering overhead of wiring all those services together.

The real savings come from hibernation. Your agent only consumes resources when it's actively checking sites or serving dashboard requests. Between checks, the cost is effectively zero.


Common Pitfalls I Learned the Hard Way

1. The destroy() Lifecycle Trap

When a Durable Object is evicted from memory, it doesn't call any cleanup hook: there is no reliable destroy()-style notification. If you're relying on in-memory state that isn't persisted via setState() or SQLite, it will be lost. Always persist important data immediately; don't batch writes.

2. State Serialization Limits

setState() serializes your state as JSON. This means:

  • No Date objects (use ISO strings)

  • No Map or Set (use plain objects and arrays)

  • No circular references

  • Keep state reasonably small, since it's synced to every connected client (see the example below)
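
A quick example of the conversions, using the agent's own state fields (assume this runs inside the SiteAgent class):

// Convert non-serializable values before calling setState().
const statusMap = new Map<string, SiteStatus>([["https://example.com", "healthy"]]);
const lastCheck = new Date();

this.setState({
  ...this.state,
  currentStatus: Object.fromEntries(statusMap), // Map -> plain object
  lastCheckAt: lastCheck.toISOString(),         // Date -> ISO 8601 string
});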

3. Alarm Retry Behavior

If your scheduled handler throws an error, Cloudflare will retry it. This is usually good, but if your handler isn't idempotent (e.g., it sends notifications), you'll get duplicate actions. Always design handlers to be safe to retry.
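
One way to make a notification step retry-safe is to record what you've already sent and check that record before sending again. A rough sketch, assuming a hypothetical ALERT_WEBHOOK_URL binding added to Env and wrangler:

// Inside the SiteAgent class: a sketch of a retry-safe notifier.
private async maybeNotify(url: string, reason: string) {
  // Skip if we already recorded a notification for this incident recently.
  const [prior] = this.sql<{ n: number }>`
    SELECT COUNT(*) AS n
    FROM check_history
    WHERE url = ${url}
      AND status = 'notified'
      AND ai_analysis = ${reason}
      AND checked_at > datetime('now', '-30 minutes')
  `;
  if (prior && prior.n > 0) return;

  // Record before sending: a retried alarm can't double-send, at the cost
  // that a failed webhook call below won't be retried automatically.
  this.sql`
    INSERT INTO check_history (url, status_code, response_time_ms, status, ai_analysis)
    VALUES (${url}, 0, 0, 'notified', ${reason})
  `;

  await fetch(this.env.ALERT_WEBHOOK_URL, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ url, reason }),
  });
}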

4. WebSocket Reconnection

Clients will disconnect; networks are unreliable. The useAgent hook handles reconnection automatically, but your UI should gracefully handle the "reconnecting" state. Always show the last known state while reconnecting, rather than a blank screen.


Conclusion

We built a stateful AI agent that goes well beyond chat:

  • Scheduled health checks that run autonomously on cron

  • Persistent memory via built-in SQLite, with no external database needed

  • AI-powered analysis that spots patterns, not just failures

  • Real-time dashboard with automatic WebSocket state sync

  • Human-in-the-loop escalation for critical decisions

The Cloudflare Agents SDK makes this surprisingly straightforward. The combination of Durable Objects (state + compute), built-in SQLite (persistent memory), WebSocket hibernation (zero idle cost), and scheduled alarms (autonomous execution) creates a platform where stateful agents are a first-class concept, not something you have to hack together from five different services.

What's Next

This is just the beginning. From here, you could:

  • Add MCP server support: expose your agent as a Model Context Protocol server so AI assistants like Claude can interact with it

  • Build multi-agent systems: have specialized agents that coordinate with each other

  • Add voice interaction: Cloudflare's roadmap includes real-time voice agent support

  • Integrate browser automation: use Cloudflare's Browser Rendering API for visual monitoring

The full source code for this project is available on GitHub. If you build something cool with the Agents SDK, I'd love to hear about it; drop a comment below or find me on GitHub.


Want to learn more about building AI-ready APIs? Check out my previous article: Your API Wasn't Built for AI Agents — Here's How to Fix It.

้ปƒๅฐ้ปƒ

้ปƒๅฐ้ปƒ

Full-stack product engineer and open source contributor based in Taiwan. I specialize in building practical solutions that solve real-world problems with focus on stability and user experience. Passionate about Product Engineering, Solutions Architecture, and Open Source collaboration.
