Beyond Chatbots: Building Real-World Stateful AI Agents on Cloudflare
Build an AI-powered site reliability agent that remembers, schedules, and escalates. No chatbot required.

Most "AI agents" you see today are just LLM wrappers with a fancy prompt. They process a request, return a response, and forget everything. No memory. No scheduling. No persistence.
Real agents are different. They remember what happened yesterday. They wake up at 3 AM to check on things. They pause and ask for human approval when stakes are high. They maintain state across sessions, making decisions based on accumulated context, not just the current prompt.
In this tutorial, we'll build exactly that: a Smart Site Reliability Agent that monitors your websites, uses AI to detect anomalies, and escalates critical issues to you, all running on Cloudflare's edge network with zero cost when idle.
No chatbot UI. No conversational fluff. Just a stateful, autonomous agent doing real work.
What Makes an AI Agent "Stateful"?
A stateful AI agent is a long-running program that persists its memory, decisions, and context across interactions and restarts. Unlike stateless LLM calls where each request starts from scratch, a stateful agent accumulates knowledge over time.
Here's the key difference:
|  | Stateless LLM Wrapper | Stateful AI Agent |
| --- | --- | --- |
| Memory | None between requests | Persistent across sessions |
| Scheduling | Only responds when called | Can wake itself up on a schedule |
| Context | Single conversation turn | Accumulated history and patterns |
| Decision Making | Reactive only | Proactive: acts on its own |
| Cost When Idle | $0 | $0 (with hibernation) |

Think of it this way: a stateless LLM call is like asking a stranger for directions every time. A stateful agent is like having an assistant who knows your route, remembers the traffic patterns, and proactively suggests alternatives before you even ask.
The challenge has always been: where do you run a stateful agent in production? Traditional serverless functions are stateless by design. Containers require always-on infrastructure. That's where Cloudflare's approach gets interesting.
Why Cloudflare for AI Agents?

Cloudflare's Agents SDK is built on top of Durable Objects: essentially stateful micro-servers that live on Cloudflare's global edge network. Each agent instance is its own isolated server with:
Built-in SQLite database: no external database needed. Your agent's memory lives right next to its compute.
WebSocket support with hibernation: real-time connections that cost nothing when idle. The agent wakes up only when a message arrives.
Scheduled tasks (alarms): cron-like scheduling built into the runtime. Your agent can wake itself up to do work.
Automatic global distribution: each agent instance runs closest to where it's needed.
The killer feature? Hibernation. When your agent has no active connections and no pending alarms, it literally costs $0. It's like having a dedicated server that only charges you when it's thinking.
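To make hibernation concrete, here's a minimal sketch of the WebSocket lifecycle on an agent class. It assumes the SDK's onConnect/onMessage handlers (the same connection API family as the broadcast() call used later in this tutorial); treat the exact hook names as assumptions to verify against the SDK docs.

// Sketch: WebSocket handlers on an Agent class (hook names assumed from the SDK docs)
async onConnect(connection: { send(data: string): void }) {
  // A dashboard just connected; hand it the current state snapshot.
  connection.send(JSON.stringify({ type: "hello", state: this.state }));
}

async onMessage(connection: { send(data: string): void }, message: string) {
  // Between messages the agent hibernates; it is woken only when one arrives.
  if (message === "ping") connection.send("pong");
}

Between events, the instance is evicted from memory and costs nothing; the next message or alarm brings it back with its state intact.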
When to Use What
Before reaching for the Agents SDK, consider the alternatives:
| Use Case | Best Choice |
| --- | --- |
| Simple request/response AI | Regular Worker + Workers AI |
| Multi-step background jobs | Cloudflare Workflows |
| Stateful, long-lived agent with real-time sync | Agents SDK ✅ |
| Key-value state without real-time | Durable Objects directly |
The Agents SDK shines when you need persistent state + real-time communication + scheduled tasks in one package.
What We'll Build: A Smart Site Reliability Agent
Our agent isn't a simple uptime checker. It's an AI-powered reliability monitor that:
| Feature | SDK Capability |
| --- | --- |
| Runs health checks every 5 minutes | scheduleEvery() |
| Stores check history in SQLite | this.sql |
| Uses AI to detect anomaly patterns | AI SDK integration |
| Pushes live updates to a dashboard | WebSocket + useAgent |
| Supports manual controls via RPC | @callable() |
| Escalates critical issues for human approval | Human-in-the-loop |

By the end, you'll have a fully deployed agent that watches over your sites and thinks about what it sees, not just whether a URL returns 200.
Project Setup
Prerequisites
Node.js 20+ (Node 24+ recommended)
A Cloudflare account (Workers Paid plan for Durable Objects)
An API key from any LLM provider (OpenAI, Anthropic, or Cloudflare Workers AI)
Scaffold the Project
npm create cloudflare@latest site-reliability-agent -- --template cloudflare/agents-starter
cd site-reliability-agent
npm install
Project Structure
site-reliability-agent/
├── src/
│   ├── server.ts      # Agent class + Worker entry
│   └── client.tsx     # React dashboard with useAgent
├── wrangler.jsonc     # Cloudflare configuration
├── .dev.vars          # Local secrets (API keys)
└── package.json
Wrangler Configuration
// wrangler.jsonc
{
  "name": "site-reliability-agent",
  "main": "src/server.ts",
  "compatibility_date": "2025-01-01", // use a current date for your project
  "compatibility_flags": ["nodejs_compat"],
  "durable_objects": {
    "bindings": [
      {
        "name": "SiteAgent",
        "class_name": "SiteAgent"
      }
    ]
  },
  "migrations": [
    {
      "tag": "v1",
      "new_sqlite_classes": ["SiteAgent"]
    }
  ]
}
Add your LLM API key to .dev.vars:
# .dev.vars (never commit this file)
OPENAI_API_KEY=sk-your-key-here
Building the Agent Core
Defining State and the Agent Class
Let's start with the agent's state shape and core class:
// src/server.ts
import { Agent, routeAgentRequest } from "agents";

type Env = {
  SiteAgent: DurableObjectNamespace;
  OPENAI_API_KEY: string;
};

type SiteStatus = "healthy" | "degraded" | "down" | "unknown";

type AgentState = {
  monitoredUrls: string[];
  checkIntervalMinutes: number;
  lastCheckAt: string | null;
  currentStatus: Record<string, SiteStatus>;
  alertsEnabled: boolean;
  pendingEscalation: {
    url: string;
    reason: string;
    timestamp: string;
  } | null;
};

export class SiteAgent extends Agent<Env, AgentState> {
  // Default state when the agent is first created
  initialState: AgentState = {
    monitoredUrls: [],
    checkIntervalMinutes: 5,
    lastCheckAt: null,
    currentStatus: {},
    alertsEnabled: true,
    pendingEscalation: null,
  };

  async onStart() {
    // Initialize the SQLite table for check history
    this.sql`
      CREATE TABLE IF NOT EXISTS check_history (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        url TEXT NOT NULL,
        status_code INTEGER,
        response_time_ms INTEGER,
        status TEXT NOT NULL,
        ai_analysis TEXT,
        checked_at TEXT DEFAULT (datetime('now'))
      )
    `;
  }
}
A few things to notice:
initialState sets the default state for new agent instances.
this.sql is a tagged template literal that gives you direct SQLite access, with no ORM needed.
State updates via setState() are automatically synced to all connected WebSocket clients (a sketch of reacting to those updates follows below).
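If you want the agent itself to react whenever state changes (for logging or derived updates), the SDK also exposes an onStateUpdate hook; the sketch below assumes its (state, source) shape, so double-check the signature against the SDK docs.

// Inside the SiteAgent class - sketch of observing state changes
// (assumes the SDK's onStateUpdate hook; the signature is an assumption)
onStateUpdate(state: AgentState, source: unknown) {
  // Runs after every setState, whether the update came from server code or a connected client.
  console.log(
    "state updated, last check:",
    state.lastCheckAt,
    "source:",
    source === "server" ? "server" : "client"
  );
}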
Health Check Logic with Scheduled Tasks
Now let's add the scheduled health checks:
// Inside the SiteAgent class
async onStart() {
  // ... SQLite init from above ...

  // Start the health check schedule
  if (this.state.monitoredUrls.length > 0) {
    this.scheduleEvery("runHealthChecks", `*/${this.state.checkIntervalMinutes} * * * *`);
  }
}

async runHealthChecks() {
  const results: Record<string, SiteStatus> = {};

  for (const url of this.state.monitoredUrls) {
    const result = await this.checkUrl(url);
    results[url] = result.status;

    // Store in SQLite
    this.sql`
      INSERT INTO check_history (url, status_code, response_time_ms, status)
      VALUES (${url}, ${result.statusCode}, ${result.responseTime}, ${result.status})
    `;
  }

  this.setState({
    ...this.state, // spread the existing state so unrelated fields are preserved
    currentStatus: results,
    lastCheckAt: new Date().toISOString(),
  });

  // Broadcast to all connected dashboard clients
  this.broadcast(JSON.stringify({
    type: "health_check_complete",
    results,
    timestamp: new Date().toISOString(),
  }));
}

private async checkUrl(url: string): Promise<{
  statusCode: number;
  responseTime: number;
  status: SiteStatus;
}> {
  const start = Date.now();
  try {
    const response = await fetch(url, {
      method: "GET",
      signal: AbortSignal.timeout(10_000), // 10s timeout
    });
    const responseTime = Date.now() - start;

    let status: SiteStatus = "healthy";
    if (!response.ok) {
      status = response.status >= 500 ? "down" : "degraded";
    } else if (responseTime > 3000) {
      status = "degraded";
    }

    return { statusCode: response.status, responseTime, status };
  } catch {
    return { statusCode: 0, responseTime: Date.now() - start, status: "down" };
  }
}
The scheduleEvery method accepts a cron expression. Every 5 minutes, the agent wakes up from hibernation, runs all health checks, stores results, updates its state, broadcasts to any connected dashboards, and then goes back to sleep.
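Cron isn't the only scheduling shape. The SDK's scheduling docs also describe one-off tasks via this.schedule() with a delay in seconds or an absolute Date; the sketch below assumes that signature, and followUpCheck is a hypothetical method, not part of the agent we've built so far.

// Inside the SiteAgent class - hedged sketch of one-off scheduling
// (assumes this.schedule(when, methodName, payload); followUpCheck is hypothetical)
async scheduleFollowUp(url: string) {
  // Re-check a flaky URL once, 60 seconds from now.
  await this.schedule(60, "followUpCheck", { url });
}

async followUpCheck(payload: { url: string }) {
  const result = await this.checkUrl(payload.url);
  if (result.status !== "healthy") {
    this.setState({
      ...this.state,
      currentStatus: { ...this.state.currentStatus, [payload.url]: result.status },
    });
  }
}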
Querying History with SQLite
The built-in SQLite database makes historical queries trivial:
// Inside the SiteAgent class
private getRecentHistory(url: string, limit = 20) {
  return this.sql<{
    status_code: number;
    response_time_ms: number;
    status: string;
    ai_analysis: string | null;
    checked_at: string;
  }>`
    SELECT status_code, response_time_ms, status, ai_analysis, checked_at
    FROM check_history
    WHERE url = ${url}
    ORDER BY checked_at DESC
    LIMIT ${limit}
  `;
}

private getStatusTrend(url: string) {
  return this.sql<{ status: string; count: number }>`
    SELECT status, COUNT(*) as count
    FROM check_history
    WHERE url = ${url}
      AND checked_at > datetime('now', '-1 hour')
    GROUP BY status
  `;
}
No external database. No connection strings. No cold starts on DB connections. The data lives right next to the agent's compute.
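If the dashboard needs to pull history on demand rather than wait for broadcasts, you can expose these helpers through the same @callable() RPC pattern covered later in this article; a minimal sketch:

// Inside the SiteAgent class - sketch of exposing history to the dashboard
@callable()
async getHistory(url: string, limit = 20) {
  // Returns plain rows straight from SQLite for the client to render.
  return this.getRecentHistory(url, limit);
}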
Adding AI-Powered Analysis
This is where our agent goes from "uptime checker" to "site reliability engineer." Instead of just checking status codes, we feed the check history to an LLM for pattern analysis.
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

// Inside the SiteAgent class
async runHealthChecks() {
  // ... health check logic from above ...

  // After checks complete, ask AI to analyze patterns
  const hasIssues = Object.values(results).some(
    (s) => s === "degraded" || s === "down"
  );
  if (hasIssues) {
    await this.analyzeWithAI(results);
  }
}

private async analyzeWithAI(currentResults: Record<string, SiteStatus>) {
  // Gather recent history for context
  const historyByUrl: Record<string, any[]> = {};
  for (const url of this.state.monitoredUrls) {
    historyByUrl[url] = this.getRecentHistory(url, 10);
  }

  const { text: analysis } = await generateText({
    model: openai("gpt-4o-mini", { structuredOutputs: true }),
    system: `You are a site reliability engineer analyzing website health data.
Be concise and actionable. Focus on patterns, not individual data points.
Flag anything that suggests an emerging problem, not just current outages.`,
    prompt: `Current check results: ${JSON.stringify(currentResults)}
Recent history (last 10 checks per URL):
${JSON.stringify(historyByUrl, null, 2)}
Analyze:
1. Are there any concerning patterns (increasing latency, intermittent failures)?
2. Is this likely a transient issue or systematic problem?
3. Recommended action: MONITOR, INVESTIGATE, or ESCALATE?`,
  });

  // Store the analysis
  for (const [url, status] of Object.entries(currentResults)) {
    if (status !== "healthy") {
      this.sql`
        UPDATE check_history
        SET ai_analysis = ${analysis}
        WHERE url = ${url}
          AND id = (SELECT MAX(id) FROM check_history WHERE url = ${url})
      `;
    }
  }

  // If AI recommends escalation, trigger human-in-the-loop
  if (analysis.includes("ESCALATE")) {
    this.setState({
      ...this.state,
      pendingEscalation: {
        url: Object.entries(currentResults)
          .filter(([, s]) => s !== "healthy")
          .map(([u]) => u)
          .join(", "),
        reason: analysis,
        timestamp: new Date().toISOString(),
      },
    });

    this.broadcast(JSON.stringify({
      type: "escalation_required",
      analysis,
      timestamp: new Date().toISOString(),
    }));
  }
}
The AI doesn't just check if a site is up; it looks at patterns. Is response time gradually increasing? Are failures clustered at specific times? Is this a CDN issue or an origin server problem? These are the kinds of insights that turn raw data into actionable intelligence.
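One fragile spot in the code above is detecting escalation by searching the free-text analysis for the word "ESCALATE". If you want something sturdier, the AI SDK's generateObject with a Zod schema can force the model into a structured verdict; here's a sketch of that variation (the schema and field names are my own, not part of the tutorial's code):

// Add to the imports at the top of src/server.ts
import { generateObject } from "ai";
import { z } from "zod";

// Then, inside analyzeWithAI, ask for a structured verdict instead of free text:
const verdictSchema = z.object({
  summary: z.string(),
  recommendation: z.enum(["MONITOR", "INVESTIGATE", "ESCALATE"]),
  affectedUrls: z.array(z.string()),
});

const { object: verdict } = await generateObject({
  model: openai("gpt-4o-mini"),
  schema: verdictSchema,
  system: "You are a site reliability engineer analyzing website health data.",
  prompt: `Current check results: ${JSON.stringify(currentResults)}`,
});

if (verdict.recommendation === "ESCALATE") {
  // ...set pendingEscalation and broadcast exactly as shown above...
}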
Real-Time Dashboard with useAgent

The agent handles the backend. Now let's build a React frontend that stays in sync via WebSocket.
Connecting with useAgent
// src/client.tsx
import { useAgent } from "agents/react";

function Dashboard() {
  const agent = useAgent<SiteAgent, AgentState>({
    agent: "site-agent",
    name: "my-sites", // Each unique name = unique agent instance
  });

  if (!agent.state) return <div>Connecting to agent...</div>;

  return (
    <div className="dashboard">
      <header>
        <h1>Site Reliability Agent</h1>
        <span className="last-check">
          Last check: {agent.state.lastCheckAt ?? "Never"}
        </span>
      </header>

      <div className="status-grid">
        {agent.state.monitoredUrls.map((url) => (
          <StatusCard
            key={url}
            url={url}
            status={agent.state.currentStatus[url] ?? "unknown"}
          />
        ))}
      </div>

      {agent.state.pendingEscalation && (
        <EscalationBanner
          escalation={agent.state.pendingEscalation}
          onApprove={() => agent.stub.acknowledgeEscalation()}
          onDismiss={() => agent.stub.dismissEscalation()}
        />
      )}

      <ManualControls agent={agent} />
    </div>
  );
}
When the agent calls setState(), every connected dashboard updates instantly: no polling, no refetching. The useAgent hook handles WebSocket connection, reconnection, and state synchronization automatically.
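For completeness, here's a minimal sketch of the two presentational components the dashboard references. Their props mirror the state shape defined earlier; the markup and class names are placeholders, so adapt them to your own styling.

// src/client.tsx - minimal sketches of StatusCard and EscalationBanner
function StatusCard({ url, status }: { url: string; status: SiteStatus }) {
  return (
    <div className={`status-card status-${status}`}>
      <span className="url">{url}</span>
      <span className="status">{status}</span>
    </div>
  );
}

function EscalationBanner({
  escalation,
  onApprove,
  onDismiss,
}: {
  escalation: { url: string; reason: string; timestamp: string };
  onApprove: () => void;
  onDismiss: () => void;
}) {
  return (
    <div className="escalation-banner">
      <strong>Escalation required: {escalation.url}</strong>
      <p>{escalation.reason}</p>
      <button onClick={onApprove}>Acknowledge</button>
      <button onClick={onDismiss}>Dismiss</button>
    </div>
  );
}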
Callable Methods for Manual Controls
The @callable() decorator exposes server-side methods that the frontend can call with full type safety:
// In src/server.ts, inside the SiteAgent class
@callable()
async addUrl(url: string) {
  if (this.state.monitoredUrls.includes(url)) {
    return { success: false, error: "URL already monitored" };
  }

  this.setState({
    ...this.state,
    monitoredUrls: [...this.state.monitoredUrls, url],
    currentStatus: { ...this.state.currentStatus, [url]: "unknown" },
  });

  // Restart the schedule if this is the first URL
  if (this.state.monitoredUrls.length === 1) {
    this.scheduleEvery(
      "runHealthChecks",
      `*/${this.state.checkIntervalMinutes} * * * *`
    );
  }

  return { success: true };
}

@callable()
async removeUrl(url: string) {
  this.setState({
    ...this.state,
    monitoredUrls: this.state.monitoredUrls.filter((u) => u !== url),
    currentStatus: Object.fromEntries(
      Object.entries(this.state.currentStatus).filter(([u]) => u !== url)
    ),
  });
  return { success: true };
}

@callable()
async triggerManualCheck() {
  await this.runHealthChecks();
  return { success: true, checkedAt: new Date().toISOString() };
}
On the client, calling these is as simple as:
// Type-safe RPC: no manual fetch calls needed
await agent.stub.addUrl("https://example.com");
await agent.stub.triggerManualCheck();
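The ManualControls component from the dashboard can be a thin wrapper around these stub calls; one possible sketch (the form markup is illustrative, and the AgentHandle type just mirrors the callable methods defined above):

// src/client.tsx - sketch of ManualControls built on the stub RPC calls
import { useState } from "react";

type AgentHandle = {
  stub: {
    addUrl(url: string): Promise<{ success: boolean; error?: string }>;
    triggerManualCheck(): Promise<{ success: boolean; checkedAt: string }>;
  };
};

function ManualControls({ agent }: { agent: AgentHandle }) {
  const [url, setUrl] = useState("");

  return (
    <div className="manual-controls">
      <input
        value={url}
        onChange={(e) => setUrl(e.target.value)}
        placeholder="https://example.com"
      />
      <button onClick={() => agent.stub.addUrl(url)}>Add URL</button>
      <button onClick={() => agent.stub.triggerManualCheck()}>Check now</button>
    </div>
  );
}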
Human-in-the-Loop: Escalation That Works

When the AI detects something serious, the agent doesn't just log it; it pauses and waits for human judgment:
// In SiteAgent class
@callable()
async acknowledgeEscalation() {
  const escalation = this.state.pendingEscalation;
  if (!escalation) return { success: false, error: "No pending escalation" };

  // Log the acknowledgment
  this.sql`
    INSERT INTO check_history (url, status_code, response_time_ms, status, ai_analysis)
    VALUES (
      ${escalation.url},
      0,
      0,
      'acknowledged',
      ${'Human acknowledged: ' + escalation.reason}
    )
  `;

  // Clear the escalation
  this.setState({ ...this.state, pendingEscalation: null });

  this.broadcast(JSON.stringify({
    type: "escalation_resolved",
    action: "acknowledged",
    timestamp: new Date().toISOString(),
  }));

  return { success: true };
}

@callable()
async dismissEscalation() {
  this.setState({ ...this.state, pendingEscalation: null });

  this.broadcast(JSON.stringify({
    type: "escalation_resolved",
    action: "dismissed",
    timestamp: new Date().toISOString(),
  }));

  return { success: true };
}
The escalation flow works like this:
1. AI detects a pattern and recommends ESCALATE.
2. Agent updates state: pendingEscalation is set.
3. Dashboard shows a banner: the human sees the AI's analysis and reasoning.
4. Human decides: acknowledge (take action) or dismiss (false alarm).
5. Agent records the decision, building a history of escalations for future AI context.
This is the real power of stateful agents: they can pause, wait, and resume based on human input without losing their context.
Worker Entry Point
Don't forget the Worker entry that routes requests to agent instances:
// At the bottom of src/server.ts
export default {
  async fetch(request: Request, env: Env) {
    // Route to the correct agent instance
    return routeAgentRequest(request, env);
  },
} satisfies ExportedHandler<Env>;
The routeAgentRequest function dispatches requests to the right Durable Object instance based on the URL pattern: /agents/site-agent/:instance-name.
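One practical note: if a request doesn't match that pattern, routeAgentRequest won't produce a response, so the Worker entry usually adds a fallback. A small sketch, assuming the function resolves to undefined for unmatched routes:

// src/server.ts - Worker entry with a fallback for non-agent routes
export default {
  async fetch(request: Request, env: Env) {
    const agentResponse = await routeAgentRequest(request, env);
    // Anything that isn't an /agents/... request falls through to here.
    return agentResponse ?? new Response("Not found", { status: 404 });
  },
} satisfies ExportedHandler<Env>;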
Testing and Deploying to Production
Local Development
npx wrangler dev
This starts a local development server with full Durable Object support. Your agent runs with real SQLite, real WebSocket connections, and real scheduling, identical to production.
Open http://localhost:8787 to see your dashboard. Add a URL and watch the agent start monitoring.
Deploy to Cloudflare
# Set your API key as a secret
npx wrangler secret put OPENAI_API_KEY
# Deploy
npx wrangler deploy
Your agent is now live on Cloudflare's global network. Each unique instance name creates an isolated agent with its own state, database, and schedule.
Environment Separation
For staging vs production, use wrangler environments:
// wrangler.jsonc
{
  "name": "site-reliability-agent",
  "env": {
    "staging": {
      "name": "site-reliability-agent-staging",
      "vars": { "ENVIRONMENT": "staging" }
    },
    "production": {
      "name": "site-reliability-agent",
      "vars": { "ENVIRONMENT": "production" }
    }
  }
}
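Deploys and secrets then target a specific environment with wrangler's --env flag:

# Deploy each environment separately
npx wrangler secret put OPENAI_API_KEY --env staging
npx wrangler deploy --env staging
npx wrangler deploy --env production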
Performance, Limits, and Cost Breakdown
Cloudflare Agents Limits
| Resource | Limit |
| --- | --- |
| CPU time per request | 30 seconds (refreshes per event) |
| Memory per instance | 128 MB |
| SQLite storage | 1 GB per Durable Object |
| WebSocket connections | 32,768 per instance |
| Alarm precision | ~1 second |
Cost Estimate
For a typical monitoring setup (100 URLs, checked every 5 minutes):
| Component | Monthly Cost |
| --- | --- |
| Worker requests (routing) | ~$0.50 |
| Durable Object requests | ~$2.00 |
| Durable Object duration | ~$1.50 |
| SQLite storage (1 GB) | $0.20 |
| AI API calls (OpenAI) | ~$5.00 |
| Total | ~$9.20/month |
Compare this to running the same setup on AWS (Lambda + DynamoDB + EventBridge + API Gateway), where you'd easily spend $20-30/month for equivalent functionality, plus the engineering overhead of wiring all those services together.
The real savings come from hibernation. Your agent only consumes resources when it's actively checking sites or serving dashboard requests. Between checks, the cost is effectively zero.
Common Pitfalls I Learned the Hard Way
1. The destroy() Lifecycle Trap
When a Durable Object is evicted from memory, it doesn't call any cleanup hooks. If you're relying on in-memory state that isn't persisted via setState() or SQLite, it will be lost. Always persist important data immediately; don't batch writes.
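Concretely: don't accumulate results in an instance field to flush later; write each result as soon as you have it, the way the health-check loop above already does. A sketch of the anti-pattern next to the safe version:

// Anti-pattern: this buffer lives only in memory and vanishes if the instance is evicted.
private pendingResults: Array<{ url: string; status: SiteStatus }> = [];

// Safe: persist inside the same event that produced the data, using the existing table.
private recordResult(url: string, status: SiteStatus, statusCode: number, responseTime: number) {
  this.sql`
    INSERT INTO check_history (url, status_code, response_time_ms, status)
    VALUES (${url}, ${statusCode}, ${responseTime}, ${status})
  `;
}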
2. State Serialization Limits
setState() serializes your state as JSON. This means:
No Date objects (use ISO strings).
No Map or Set (use plain objects and arrays).
No circular references.
Keep state reasonably small, since it's synced to every connected client (see the sketch below).
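A quick sketch of converting non-serializable values before they go into state (the Date and Set here are illustrative, not fields of our agent):

// Keep everything handed to setState() JSON-serializable
const now = new Date();
const flaky = new Set<string>(["https://example.com"]);

this.setState({
  ...this.state,
  lastCheckAt: now.toISOString(), // Date -> ISO string
  monitoredUrls: [...flaky],      // Set -> plain array
});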
3. Alarm Retry Behavior
If your scheduled handler throws an error, Cloudflare will retry it. This is usually good, but if your handler isn't idempotent (e.g., it sends notifications), you'll get duplicate actions. Always design handlers to be safe to retry.
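One way to make a notification step safe to retry is to record a deduplication key in SQLite before acting and skip the send if the key already exists. A sketch using the same this.sql convention (the notifications_sent table and sendAlert helper are hypothetical):

// Inside the SiteAgent class - sketch of an idempotent notification step
private async notifyOnce(url: string, checkWindow: string) {
  this.sql`
    CREATE TABLE IF NOT EXISTS notifications_sent (
      dedup_key TEXT PRIMARY KEY,
      sent_at TEXT DEFAULT (datetime('now'))
    )
  `;

  const key = `${url}:${checkWindow}`;
  const existing = this.sql<{ dedup_key: string }>`
    SELECT dedup_key FROM notifications_sent WHERE dedup_key = ${key}
  `;
  if (existing.length > 0) return; // Already notified for this window; a retry becomes a no-op.

  this.sql`INSERT INTO notifications_sent (dedup_key) VALUES (${key})`;
  await this.sendAlert(url); // hypothetical helper that actually sends the notification
}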
4. WebSocket Reconnection
Clients will disconnect; networks are unreliable. The useAgent hook handles reconnection automatically, but your UI should gracefully handle the "reconnecting" state. Always show the last known state while reconnecting, rather than a blank screen.
Conclusion
We built a stateful AI agent that goes well beyond chat:
Scheduled health checks that run autonomously on cron
Persistent memory via built-in SQLite, with no external database needed
AI-powered analysis that spots patterns, not just failures
Real-time dashboard with automatic WebSocket state sync
Human-in-the-loop escalation for critical decisions
The Cloudflare Agents SDK makes this surprisingly straightforward. The combination of Durable Objects (state + compute), built-in SQLite (persistent memory), WebSocket hibernation (zero idle cost), and scheduled alarms (autonomous execution) creates a platform where stateful agents are a first-class concept, not something you have to hack together from five different services.
What's Next
This is just the beginning. From here, you could:
Add MCP server support: expose your agent as a Model Context Protocol server so AI assistants like Claude can interact with it
Build multi-agent systems: have specialized agents that coordinate with each other
Add voice interaction: Cloudflare's roadmap includes real-time voice agent support
Integrate browser automation: use Cloudflare's Browser Rendering API for visual monitoring
The full source code for this project is available on GitHub. If you build something cool with the Agents SDK, I'd love to hear about it: drop a comment below or find me on GitHub.
Want to learn more about building AI-ready APIs? Check out my previous article: Your API Wasn't Built for AI Agents - Here's How to Fix It.