AutonomousHQ
Intermediate · 8 min read · 2026-03-26

How to Monitor Your AI Agents and Get Instant Failure Alerts

Set up uptime monitoring and webhook alerts so you know the moment an autonomous workflow breaks.

When a human does a task and something breaks, they notice. When an AI agent does it, you might not find out for hours - or days. Monitoring is the layer that closes that gap. This tutorial walks you through adding a health-check endpoint to any AI worker, wiring it up to a free uptime monitor, and routing failure alerts to a Discord or Slack channel.

By the end, you will have a live dashboard showing whether your agents are up, and you will get a ping the moment one goes silent.

What you will build

  • A /health endpoint in your agent or API service
  • An UptimeRobot monitor that polls it every 5 minutes
  • A Discord webhook that fires when the monitor detects a failure
  • An optional heartbeat cron so scheduled jobs self-report

This works with any language or runtime. The examples use Node.js and a Railway-hosted service, but the pattern is identical for Python, Deno, or a Make.com webhook.

Prerequisites

  • An agent or API service that runs somewhere publicly accessible (Railway, Render, Fly, VPS)
  • A free UptimeRobot account (uptimerobot.com)
  • A Discord server where you have permission to create webhooks, or a Slack workspace

Step 1: Add a health-check endpoint to your service

The simplest possible health check is an HTTP route that returns 200 when the process is alive.

Node.js (Express):

app.get("/health", (req, res) => {
  res.json({ status: "ok", ts: Date.now() });
});

Python (FastAPI):

import time

@app.get("/health")
def health():
    return {"status": "ok", "ts": time.time()}

If your agent talks to external dependencies - a database, an LLM API, a queue - you can make the check smarter:

app.get("/health", async (req, res) => {
  try {
    await db.ping(); // check DB
    await redis.ping(); // check cache
    res.json({ status: "ok" });
  } catch (err) {
    res.status(503).json({ status: "error", detail: err.message });
  }
});

A 503 tells the monitor something is wrong even if the process is technically running.
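The same dependency-aware check can also be written framework-agnostically, as a small helper you call from any route handler. This is a sketch; the `checks` dict, its keys, and the callables are stand-ins for your real clients:

```python
import time

def health_check(checks):
    """Run each named dependency check; return (http_status, body).

    `checks` maps a name to a zero-argument callable that raises on failure,
    e.g. {"db": db.ping, "cache": redis.ping} with your real clients.
    """
    failed = {}
    for name, check in checks.items():
        try:
            check()
        except Exception as err:
            failed[name] = str(err)
    if failed:
        # 503 = service unavailable: the process is up but a dependency is not
        return 503, {"status": "error", "detail": failed}
    return 200, {"status": "ok", "ts": time.time()}
```

Keeping the check logic separate from the framework makes it trivial to reuse across agents, whichever runtime they are on.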

Deploy or restart your service so the route is live, then test it:

curl https://your-agent.railway.app/health
# {"status":"ok","ts":1743000000000}

Step 2: Create a Discord webhook for alerts

  1. Open your Discord server, go to a channel like #ops-alerts.
  2. Click the gear icon next to the channel name, then "Integrations", then "Create Webhook".
  3. Name it "UptimeRobot" and copy the webhook URL. It looks like: https://discord.com/api/webhooks/1234567890/xxxxxxxxxxxx

Keep that URL - you will paste it into UptimeRobot in the next step.

For Slack: go to api.slack.com/apps, create an app, enable Incoming Webhooks, and copy the webhook URL from there. The rest of the steps are identical.
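Before handing the URL to UptimeRobot, it is worth confirming the webhook itself works. A minimal sketch using only the Python standard library (the webhook URL is a placeholder for yours; Discord replies 204 No Content on success):

```python
import json
import urllib.request

WEBHOOK_URL = "https://discord.com/api/webhooks/1234567890/xxxxxxxxxxxx"  # yours here

def build_test_alert(text):
    # Discord expects a JSON body with a "content" field (2000-char limit)
    return {"content": text[:2000]}

def send_test_alert(text):
    payload = json.dumps(build_test_alert(text)).encode()
    req = urllib.request.Request(
        WEBHOOK_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status  # 204 on success
```

If a test message lands in #ops-alerts, the webhook side of the chain is proven before the monitor side even exists.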


Step 3: Set up an UptimeRobot monitor

  1. Sign in at uptimerobot.com and click "Add New Monitor".
  2. Set Monitor Type to "HTTPS".
  3. Friendly Name: something like "AI Content Agent".
  4. URL: paste your health endpoint, e.g. https://your-agent.railway.app/health.
  5. Monitoring Interval: 5 minutes (the minimum on the free plan).
  6. Under "Alert Contacts", click "Add Alert Contact":
    • Type: Webhook
    • Friendly Name: Discord Ops
    • URL: paste your Discord webhook URL
    • POST Value (select "Send as JSON"):
      {
        "content": "ALERT: *monitorFriendlyName* is *alertTypeFriendlyName*. Check: *monitorURL*"
      }
      
      UptimeRobot replaces the *variable* tokens with real values at alert time.
  7. Save the alert contact, attach it to your monitor, and save.

Your monitor is now live. UptimeRobot will ping /health every 5 minutes. If it gets anything other than a 2xx response (or no response at all), it fires the webhook within one check cycle.
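If you ever want to sanity-check an endpoint the same way the monitor will, the decision rule is easy to reproduce: any 2xx counts as up, anything else (or no response at all) counts as down. A sketch with the standard library; the function names are my own:

```python
import urllib.error
import urllib.request

def classify(code):
    # Any 2xx is "up"; any other code, or no response (None), is "down"
    if code is None:
        return "down"
    return "up" if 200 <= code < 300 else "down"

def probe(url, timeout=30):
    """Fetch `url` and return ("up"/"down", status_code_or_None)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            code = resp.status
    except urllib.error.HTTPError as err:
        code = err.code  # non-2xx responses arrive as HTTPError
    except (urllib.error.URLError, TimeoutError):
        return "down", None  # DNS failure, refused connection, timeout
    return classify(code), code
```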


Step 4: Add a heartbeat for scheduled jobs

A polling monitor catches crashed services. It does not catch a scheduled job that simply stopped running. For that you need a heartbeat: your job pings a URL at the end of each successful run, and you alert if the ping stops arriving.
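The underlying rule is a dead-man's switch: a job is overdue once now - last_ping exceeds period + grace. A sketch of the check a heartbeat service runs for you (variable names are illustrative):

```python
import time

def is_overdue(last_ping, period, grace, now=None):
    """Return True once a heartbeat has been silent past its allowance.

    last_ping: unix timestamp of the last successful ping
    period:    expected seconds between pings (e.g. 3600 for an hourly job)
    grace:     extra seconds allowed before alerting (e.g. 600)
    """
    now = time.time() if now is None else now
    return now - last_ping > period + grace
```

The grace period matters: without it, a job that runs a minute late would page you every time.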

UptimeRobot supports heartbeat monitors (called "Cron Job Monitors") on the paid plan. For free, use Better Uptime's heartbeat feature or a dedicated service like healthchecks.io; the steps below use healthchecks.io.

healthchecks.io setup:

  1. Create a free account at healthchecks.io.
  2. Click "Add Check", set the period to match your job interval (e.g., 1 hour), and give it a 10-minute grace period.
  3. Copy the ping URL, which looks like: https://hc-ping.com/your-uuid-here

In your agent or cron job, add a ping at the end of a successful run:

# Shell script example
./run_pipeline.sh && curl -fsS --retry 3 https://hc-ping.com/your-uuid-here

# Python example
import httpx

def run_pipeline():
    # ... your agent logic ...
    httpx.get("https://hc-ping.com/your-uuid-here")  # heartbeat on success

If the ping does not arrive within period + grace, healthchecks.io fires an alert. Connect it to the same Discord webhook under "Integrations" in your healthchecks.io project settings.
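healthchecks.io also accepts an explicit failure signal: append /fail to the ping URL and the check alerts immediately instead of waiting out the grace period. A sketch of wrapping a run so success and failure each ping the right URL (the UUID is a placeholder):

```python
import urllib.request

PING_URL = "https://hc-ping.com/your-uuid-here"

def ping_url_for(success):
    # Success -> plain ping URL; failure -> the /fail variant alerts immediately
    return PING_URL if success else PING_URL + "/fail"

def run_with_heartbeat(job):
    """Run `job`, then ping the appropriate heartbeat URL either way."""
    try:
        job()
        ok = True
    except Exception:
        ok = False
        raise
    finally:
        try:
            urllib.request.urlopen(ping_url_for(ok), timeout=10)
        except OSError:
            pass  # a failed ping must never mask the job's own outcome
```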


Step 5: Verify the whole chain

Force a failure to confirm the alerts are wired correctly. The easiest way is to temporarily return a 500 from your health endpoint:

app.get("/health", (req, res) => {
  res.status(500).json({ status: "error", detail: "forced test failure" });
});

Deploy it, then wait up to 10 minutes (two check cycles). You should see:

  1. UptimeRobot status page turns red.
  2. A Discord message appears in #ops-alerts within a minute or two of detection.

Revert the change and deploy again. You will get a second message when the monitor recovers. Both directions working means you are covered.


What you now have

A zero-attention monitoring layer: your agents run unattended, and you only hear about them when something breaks. No dashboards to manually check, no guessing whether a job ran overnight.

From here, you can extend this pattern:

  • Multi-region checks: UptimeRobot paid plans check from multiple locations so you catch network-specific failures.
  • Status pages: UptimeRobot and Better Uptime both offer public status pages you can share with clients or users.
  • Escalation chains: chain webhooks through Make or n8n to page you via SMS if Discord goes unacknowledged for 15 minutes.
  • Metrics over time: pipe health-check response times into a time-series store (Grafana Cloud has a free tier) to spot slowdowns before they become outages.

The pattern scales from a single agent to a fleet of fifty. Add one health endpoint and one monitor per service, point them all at the same alert channel, and you get a unified operations view with no ongoing maintenance.