Skip to content

M11: Post-Deployment — Monitoring AI-Generated Systems

Tier 3 — Operations and Scale | Duration: 90 min (pre-work + workshop)


Shipping code is only half the battle. The real complexity emerges at 3 AM when your system is failing and your on-call engineer has 10 minutes to diagnose a production incident. This module teaches you how AI fundamentally changes incident response and system observability.

In traditional incident workflows, an engineer sees an alert, opens dashboards, digs through logs, correlates events, forms hypotheses about root causes, and deploys fixes—all manually. It’s slow and error-prone, especially on unfamiliar systems. AI changes this. With Claude Code and MCP tools, you can automate the investigation itself: ask an agent to “find what changed in the last hour,” “correlate this error with recent deployments,” or “trace this request through all services.” The agent pulls from monitoring systems, version control, logs, and traces—simultaneously—and narrows the search space to actionable causes in minutes instead of hours.

This module also covers the dual-agent model: local agents for interactive debugging (when you need human judgment) and cloud agents for automated monitoring and alerting (when you need scalability and speed). You’ll learn how to preserve organizational knowledge as experienced engineers leave, and how to structure observability so that both humans and AI can reason about what went wrong.

Takeaway: Incident investigation prompt templates, observability audit checklist, agent failure pattern taxonomy, escalation frameworks, and judgment about when AI accelerates vs. replacing human diagnosis.


  • Completed Tier 1 and Tier 2
  • Familiar with Claude Code, prompts, and MCP integrations
  • Basic understanding of monitoring, logging, or observability tools (or willingness to learn)
  • Access to a production system with logs and metrics (or willingness to use a simulated scenario)

Site Reliability Engineering (SRE) is a discipline that applies software engineering principles to operational problems. Core tenets:

  • Reliability is a feature, not a bug. It competes with velocity; trade-offs are intentional.
  • Observability > Monitoring. You can’t just track “CPU is high.” You need to ask arbitrary questions: “Why did this request take 5 seconds?” Observability is the ability to answer questions you didn’t anticipate.
  • Blameless postmortems. Incidents are learning opportunities, not staff problems.
  • Automation saves humans. Toil (manual, repetitive work) is the enemy. Automate it.

Observability Basics: Metrics, Logs, Traces

Section titled “Observability Basics: Metrics, Logs, Traces”

Metrics: Numbers over time. CPU usage, request latency, error rate. Questions: “Is the system healthy right now?”

Logs: Textual records of events. “User login failed,” “database connection timeout.” Questions: “What happened during that time window?”

Traces: End-to-end request flows across services. “Request entered API → hit database → returned to client.” Questions: “Why was that specific request slow?”

Together, they form the Three Pillars of Observability. No single pillar is enough.

Understanding the three pillars conceptually is necessary but not sufficient. You also need a way to collect and propagate telemetry data across services in a consistent, vendor-neutral way. That standard is OpenTelemetry (OTel).

What OpenTelemetry is: An open-source, vendor-neutral framework for instrumenting, generating, collecting, and exporting telemetry data (traces, metrics, logs). It is now the industry standard for distributed tracing and is supported by every major monitoring platform (Datadog, Honeycomb, Grafana, AWS, GCP, etc.).

Why it matters for AI systems: Without standardized trace IDs propagated in HTTP headers across service boundaries, you cannot reconstruct the full path of a request. An agent asked to “find what caused this latency spike” cannot reason across service boundaries if traces don’t connect.

Key concepts:

  • Trace context propagation: A unique trace ID is injected into request headers (traceparent). Every service that handles the request reads and forwards that ID, creating a connected chain from ingress to database and back.
  • Spans: Each unit of work within a trace (a single service call, a database query, an LLM inference). Spans carry start time, duration, status, and custom attributes.
  • OpenInference: An extension of OpenTelemetry with LLM-specific semantics — token counts, model names, cost per call, prompt/response payloads. Enables cost tracking and quality monitoring alongside latency.

Practical example: Instead of asking an agent to “show me logs,” a properly instrumented system lets you query: “Get all LLM calls with latency > 5 seconds in the last hour” or “Show me every request where cost exceeded $0.50 and the response was marked incorrect.” That level of precision requires structured traces with consistent attributes — which OTel provides.

Instrumentation checklist:

  • Use an OTel SDK (available for Python, Node.js, Java, Go, .NET, and others)
  • Propagate traceparent headers across all service-to-service calls
  • Add OpenInference spans around LLM calls to capture token usage and cost
  • Export to your monitoring platform of choice (vendor-neutral by design)

The timelines below are illustrative example scenarios, not benchmarks. Traditional incident resolution can range from 15 minutes to multiple days depending on incident complexity, system familiarity, and observability maturity. The AI-augmented scenario assumes good observability infrastructure, appropriate tool access, and a trained agent — conditions that take time to establish.

Example: Traditional workflow (well-instrumented system, familiar engineer):

Alert fires → Engineer wakes up
→ Opens dashboard (5 min)
→ Scans logs for errors (10 min)
→ Checks recent deployments (5 min)
→ Forms hypothesis (10 min)
→ Tests hypothesis with more logs (15 min)
→ Deploys fix or escalates (20 min)
Total: ~65 minutes

Example: AI-augmented workflow (same system, good observability, agent has tool access):

Alert fires → Engineer wakes up
→ Runs AI investigation: "What changed in the last hour?"
(agent queries metrics, logs, Git history, deployment logs in parallel)
→ Claude returns: "Deployment X modified retry logic; this request pattern wasn't tested"
→ Engineer validates hypothesis (5 min)
→ Deploys fix (5 min)
Total: ~10 minutes

The AI doesn’t replace the engineer’s judgment. It compresses the investigation phase and surfaces the most likely causes. The engineer still decides what to do.

The 10-minute scenario is an optimistic illustration of what becomes possible with mature observability. If your logs are unstructured, traces are missing, or the agent lacks tool access to the relevant systems, the AI-augmented workflow won’t deliver that compression. Observability readiness is a prerequisite, not an afterthought.

Multi-Agent Systems for Production Monitoring

Section titled “Multi-Agent Systems for Production Monitoring”

Two patterns:

  1. Local Agent (Interactive Debugging): You run claude code on your laptop, ask questions about your system, and iterate. Useful for “I don’t know what to check next” scenarios. Full context, human judgment, but not automated.

  2. Cloud Agent (Automated Monitoring): A background process runs continuously, monitors your systems, and escalates anomalies or incidents. Useful for “detect this pattern and alert me” scenarios. Less context, but doesn’t require human presence.

Both can coexist: automated agents catch issues; local agents investigate when humans are needed.

Note: Cloud agents for continuous monitoring are an emerging architectural pattern with growing adoption (2024-2026), not yet a universal production standard. Start with local agent-assisted investigation and graduate to automated cloud agents as your observability infrastructure matures.

When your best on-call engineer leaves, what walks out the door? All the unwritten knowledge: “When Service A acts up, always check Service B’s cache first.” “That log message is confusing, but it’s actually fine.” “The retry logic has a bug in edge case X, so watch for Y.”

AI provides a solution: encode this knowledge as agents and monitoring rules. A playbook for incident investigation becomes a set of MCP tools and prompts. A decade of debugging wisdom becomes part of the baseline for every on-call engineer, new or veteran.

Every alert you ignore teaches you to ignore alerts. Alert fatigue kills incident response. The traditional solution is manual tuning of thresholds. The AI solution: use agents to filter noise. “Alert me if error rate exceeds 5% and latency is above 2s and this error is new (not seen in the last week).” Multi-condition logic, adaptive thresholds, learned baselines.

When AI agents are part of your production system, they introduce failure modes that traditional monitoring doesn’t cover. These aren’t hypothetical — they’re documented patterns from production deployments.

Four failure archetypes:

  1. Premature action without grounding — The agent acts before it has sufficient context. It reads one log file, concludes root cause, and applies a fix without checking related services. Prevention: require agents to gather evidence from multiple sources before proposing remediation actions.

  2. Over-helpfulness / scope creep — The agent “helpfully” modifies adjacent systems while fixing the stated problem. Prevention: define explicit action boundaries per agent role; use read-only agents for investigation and separate agents with limited write permissions for remediation.

  3. Context pollution — Irrelevant history or misleading intermediate results accumulate in the agent’s context window, degrading reasoning quality over a long incident. Prevention: segment tasks into fresh agent invocations; summarize and compress context between phases.

  4. Fragile execution under load — The agent works correctly during testing but fails unpredictably under production load (concurrent requests, partial tool failures, slow API responses). Prevention: test agents under realistic load conditions; build in retry logic and graceful degradation.

Specific failure modes to monitor:

Failure ModeSignalResponse
Tool invocation errorsAgent calls a tool that returns an error; agent ignores or misinterprets itLog tool call success/failure rate; alert when error rate > 5%
Multi-step reasoning driftAgent’s reasoning quality degrades across a long chain of stepsLog quality scores per step; alert on > 20% degradation
Latent inconsistencySame input produces different outputs across repeated callsRun k-trial consistency checks; alert when consistency drops below 90%
Context window saturationAgent has exceeded its effective context capacity and starts losing earlier informationMonitor context length; truncate or summarize when approaching limits
Version driftModel upgrade changes output format, breaking downstream parsingRun regression test suites after model updates before routing production traffic
Cost-driven collapseAgent enters an infinite or runaway tool-calling loopAlert on cost-per-request exceeding 3x baseline (see next section)

Example walkthrough: A model upgrade introduces a subtle change in JSON serialization format. Your agent’s downstream parser breaks silently — it doesn’t throw an error, it just returns empty results. The system appears to function; no alerts fire. A regression test suite run against the upgraded model would have caught this before it reached production.

Key principle: Monitor the agent’s behavior, not just the system it’s monitoring. Tool invocation error rates, consistency scores, and reasoning step quality are first-class metrics.

For AI systems, cost is not just a budget concern — it’s an early warning indicator of system failures.

Why cost anomalies precede functional failures:

A well-functioning agent has a predictable cost-per-interaction baseline. When that baseline spikes, it almost always means something is wrong before any functional metric degrades:

  • 10x cost spike → Agent is likely in an infinite tool-calling loop, calling the same tool repeatedly without making progress
  • Sustained cost increase → Model may be generating excessively long outputs (prompt injection, runaway reasoning chains)
  • Cost spike correlated with accuracy drop → Context pollution or hallucination amplifying token usage
  • Cost spike after deployment → New code introduced a logic error causing redundant LLM calls

What to track:

  • Cost per request (absolute and relative to baseline)
  • Cost per model, per tool, per interaction type
  • Total daily/hourly cost with anomaly detection
  • Token usage breakdown: input tokens vs. output tokens (output token spikes are often the first signal)

Alert configuration:

Set alerts at multiple thresholds:

  • Warning: cost-per-request > 2x baseline for last 100 requests
  • Critical: cost-per-request > 5x baseline, or total hourly spend > 3x daily average rate

Example: An agent’s daily operating cost is $12. After a code deployment, it climbs to $180 within two hours. Investigation reveals a new code path causes the agent to retry a failing tool call in a tight loop, generating thousands of redundant LLM calls. The cost alert caught this before any user-facing error appeared.

Add cost tracking to your observability dashboards alongside latency and error rate from day one.

Escalation and Safety in Automated Response

Section titled “Escalation and Safety in Automated Response”

Automated agents require explicit escalation protocols. Without them, automation creates new failure modes: agents applying irreversible changes without human review, acting on insufficient evidence, or creating accountability gaps.

Action categorization:

Before deploying any automated agent, categorize the actions it can take:

CategoryExamplesPolicy
InformationalQuery logs, read metrics, generate summariesAuto-execute; no approval needed
ReversibleScale up a service, restart a pod, adjust a config flagAuto-execute with monitoring window and rollback readiness
Irreversible or high-riskDelete data, modify user records, make financial transactionsRequire explicit human approval before execution

Escalation triggers:

An automated agent should escalate to a human when:

  • The proposed action is irreversible or affects user data
  • Confidence in root cause is low (insufficient evidence)
  • The incident is novel — no matching historical pattern
  • Cost or blast radius of the remediation is above a defined threshold
  • Two consecutive automated actions have failed to resolve the issue

Audit trails:

Every agent action should be logged with: the triggering event, the reasoning chain, the action taken, the outcome, and the agent version. This log is essential for postmortems and for detecting systematic agent errors over time.

Example escalation flow: Agent detects database connection pool exhaustion. It identifies the likely fix (scale the connection pool). This is a reversible action with moderate risk. The agent pages the on-call engineer with its analysis and proposed action, rather than executing immediately. The engineer reviews in two minutes and approves. The agent applies the change and monitors for ten minutes with automatic rollback armed if error rates don’t improve.

Meta-Observability: Monitoring the Agent Itself

Section titled “Meta-Observability: Monitoring the Agent Itself”

Standard system observability tells you when your application is failing. Meta-observability tells you when your AI agent is failing — which is a different and equally important signal.

LLM observability metrics:

  • Token usage: Input and output tokens per call. Unexpected output token spikes often precede quality issues.
  • Cost per call: Track cost attribution by model, agent role, and interaction type.
  • Latency: P50 and P99 latency for LLM calls. P99 degradation often indicates context window pressure.
  • Tool call success rate: Fraction of tool calls that complete without error. This is often the first indicator of integration failures.
  • Consistency score: Across repeated identical inputs, how stable are outputs? Measured as k-trial pass rate.

Tooling options:

  • LangSmith / Langfuse: Purpose-built LLM observability platforms. Capture traces of agent reasoning chains, tool calls, prompt/response pairs.
  • Helicone: Proxy-based LLM observability; minimal code changes required.
  • Datadog / Honeycomb: General observability platforms with LLM monitoring integrations; useful if you want agent metrics in the same dashboard as system metrics.

What to watch for:

SignalLikely Cause
Accuracy drop after model upgradeVersion drift; run regression suite
Latency increase (P99)Context window growing; context length monitoring needed
Cost spike without traffic increaseRunaway tool calls or infinite loops
Consistency drop below 90%Model update or prompt regression
Tool call error rate > 5%Integration failure or API change in external tool

Connecting layers: Your observability stack should cover three layers simultaneously — infrastructure (CPU, memory, network), application (request latency, error rate, throughput), and agent (LLM cost, consistency, tool success rate). Gaps at any layer blind you to a whole class of failures.

  1. “Introduction to SRE” — Google SRE Book, Chapter 1. (Free online: https://sre.google/books/)

    • ~15 min. Defines SRE, sets context for why this matters.
  2. “Observability Basics” — Observe ability (Chapter 3 of “Observability Engineering” by Majors, Fong-Jones, Miranda). (Free summary online or book purchase)

    • ~15 min. The three pillars, why they matter, how to use them.
  3. “Kubernetes Troubleshooting with AI” — Kubernetes documentation on debugging + practical examples.

    • ~20 min. Concrete examples of using logs and traces to diagnose issues. Read even if you don’t use Kubernetes; the patterns transfer.
  4. “Benefits of Agentic AI in On-call Engineering” — Internal research or blog posts (e.g., from Anthropic, Stripe, or your own company).

    • ~15 min. Case studies of how organizations are using AI to speed incident response.
  5. “Your New Autonomous Teammate” — Internal or community documentation on using Claude Code in production contexts.

    • ~10 min. Practical advice on trust, safety, and integration.

ConceptDefinition
SRESite Reliability Engineering: applying software engineering to operational problems. Focus: reliability, automation, reducing toil.
ObservabilityAbility to understand system behavior by observing outputs (metrics, logs, traces). More powerful than traditional monitoring.
MetricsTime-series data: CPU, memory, request latency, error rate. Answer: “How is the system performing right now?”
LogsStructured or unstructured text records of events. Answer: “What happened during this time window?”
TracesEnd-to-end request flows across services. Answer: “Why was this specific request slow or failing?”
IncidentAn unplanned interruption or reduction in quality of service.
Root CauseThe underlying reason an incident occurred. Often multiple layers deep (not just “the code has a bug”).
Blameless PostmortemA learning meeting after an incident focused on systems, not people. What failed? Why? How do we prevent it?
Alert FatigueCondition where too many alerts cause responders to ignore them. Kills incident response.
ToilManual, repetitive, operational work. SRE goal: eliminate it through automation.
OpenTelemetry (OTel)Vendor-neutral, open-source standard for instrumenting and collecting telemetry (traces, metrics, logs). Industry default for distributed tracing.
Trace Context PropagationThe mechanism by which a trace ID is forwarded across service boundaries via HTTP headers, enabling end-to-end request reconstruction.
OpenInferenceAn OpenTelemetry extension with LLM-specific semantics: token counts, model names, cost per call, prompt/response content.
Meta-ObservabilityObservability applied to the AI agent itself: LLM cost, token usage, tool call success rate, consistency scores. Distinct from observing the system the agent monitors.
Observability MaturityA staged measure of how well a system is instrumented for answering arbitrary diagnostic questions. Prerequisite for effective AI-assisted incident response.

Observability Maturity: Established vs. Emerging Practices

Section titled “Observability Maturity: Established vs. Emerging Practices”

Not all practices in this module carry the same level of industry adoption. Using established practices without the AI-augmented extensions is perfectly valid; piloting emerging practices on non-critical systems before broader rollout is prudent.

PracticeMaturityGuidance
Three Pillars of Observability (metrics, logs, traces)Established — Universal industry standardImplement fully; no caveats
Blameless postmortemsEstablished — Proven across Google, Netflix, Stripe, and othersAdopt unconditionally
SRE automation (eliminate toil)Established — Core SRE principle since Google SRE (2016)Adopt unconditionally
Observability > monitoring distinctionEstablished — Well-documented; foundational to modern SREAdopt unconditionally
OpenTelemetry for instrumentationEstablished — CNCF standard, universal platform supportAdopt as default instrumentation approach
AI-assisted incident investigation (local agent)Emerging — Research-validated (2024-2025), growing adoptionPilot on non-critical systems first; build observability maturity first
Cloud agents for automated monitoringEmerging — Architectural pattern with growing production adoptionStart with alerts + human triage; automate incrementally as confidence grows
Cost tracking as reliability signalEmerging — Operationally documented in AI deployments (2025-2026)Implement from day one for AI systems; limited precedent in traditional SRE
Agent failure pattern detectionEmerging — Taxonomy documented (2025-2026); tooling maturingImplement monitoring metrics; adopt regression testing practices
Automated incident diagnosis (multi-agent)Pilot — Demonstrated feasibility (AidAI, IRCopilot); not yet standardResearch and plan; do not deploy without robust human oversight gates

Practical guidance: Start with the established practices. They are prerequisites. Layer in emerging AI-augmented patterns once your observability foundation (structured logs, distributed traces, OTel instrumentation) is solid. Attempting AI-assisted incident response on a poorly instrumented system wastes engineering effort and produces unreliable results.

Before deploying AI-assisted incident response, assess your observability readiness. These are prerequisites, not nice-to-haves.

  • Structured logs: Logs are emitted as JSON with consistent field names (timestamp, level, service, trace_id, message)
  • Consistent service naming: Every service uses the same identifier across logs, metrics, and traces
  • Error context captured: Errors include stack traces, request IDs, and relevant state — not just error messages
  • Metrics with clear labels: All key metrics (latency, error rate, throughput) include service, endpoint, and environment labels
  • Alerting on symptoms, not causes: Alerts fire on user-visible symptoms (error rate, latency SLO breach) rather than internal signals (CPU, memory) alone
  • Trace IDs propagated: traceparent headers are forwarded across all service-to-service calls
  • Spans instrumented: Key operations (DB queries, external calls, LLM inference) emit spans with duration and status
  • Service dependency map exists: You can enumerate which services call which, and what the failure blast radius is
  • Log aggregation in place: Logs from all services (including containers and pods) flow to a central queryable store
  • LLM calls instrumented: Every LLM call captures model name, token counts (input + output), cost, latency, and outcome
  • Tool call outcomes logged: Agent tool invocations log the tool name, input parameters, success/failure, and execution time
  • Cost attribution in place: You can query cost by model, agent role, and time window
  • Consistency baseline established: You have a k-trial pass rate benchmark for critical agent behaviors
  • Agent version tracked: Agent model version and prompt version are logged alongside actions for postmortem analysis

If you have gaps in the foundation tier, address those before investing in AI-augmented incident response. Common remediation steps:

  • Add a structured logging library (e.g., structlog in Python, winston in Node.js)
  • Instrument with OTel SDK and export to your monitoring platform
  • Add traceparent header propagation to your HTTP clients and servers
  • Wrap LLM client calls in OpenInference spans

System Observability:

OpenTelemetry:

LLM / Agent Observability:

  • Stripe: “Scaling Incident Response” (search their blog)
  • Anthropic: Case studies on AI in operations (when available)
  • Your company’s incident postmortems (learn from real incidents)

The hands-on session for this module: M11 Workshop: Post-Deployment — Monitoring AI-Generated Systems

After completing this module and the workshop, you will have:

  • A mental model for observability in AI-augmented systems
  • Familiarity with OpenTelemetry instrumentation for LLM applications
  • An incident response workflow using AI-assisted investigation
  • An observability audit checklist for your own systems
  • Understanding of cost tracking as a reliability signal