M11: Post-Deployment — Monitoring AI-Generated Systems
Tier 3 — Operations and Scale | Duration: 90 min (pre-work + workshop)
Module Overview
Section titled “Module Overview”Shipping code is only half the battle. The real complexity emerges at 3 AM when your system is failing and your on-call engineer has 10 minutes to diagnose a production incident. This module teaches you how AI fundamentally changes incident response and system observability.
In traditional incident workflows, an engineer sees an alert, opens dashboards, digs through logs, correlates events, forms hypotheses about root causes, and deploys fixes—all manually. It’s slow and error-prone, especially on unfamiliar systems. AI changes this. With Claude Code and MCP tools, you can automate the investigation itself: ask an agent to “find what changed in the last hour,” “correlate this error with recent deployments,” or “trace this request through all services.” The agent pulls from monitoring systems, version control, logs, and traces—simultaneously—and narrows the search space to actionable causes in minutes instead of hours.
This module also covers the dual-agent model: local agents for interactive debugging (when you need human judgment) and cloud agents for automated monitoring and alerting (when you need scalability and speed). You’ll learn how to preserve organizational knowledge as experienced engineers leave, and how to structure observability so that both humans and AI can reason about what went wrong.
Takeaway: Incident investigation prompt templates, observability audit checklist, agent failure pattern taxonomy, escalation frameworks, and judgment about when AI accelerates vs. replacing human diagnosis.
Prerequisites
Section titled “Prerequisites”- Completed Tier 1 and Tier 2
- Familiar with Claude Code, prompts, and MCP integrations
- Basic understanding of monitoring, logging, or observability tools (or willingness to learn)
- Access to a production system with logs and metrics (or willingness to use a simulated scenario)
Pre-Work: Theory and Readings (15-20 min)
Section titled “Pre-Work: Theory and Readings (15-20 min)”SRE Fundamentals
Section titled “SRE Fundamentals”Site Reliability Engineering (SRE) is a discipline that applies software engineering principles to operational problems. Core tenets:
- Reliability is a feature, not a bug. It competes with velocity; trade-offs are intentional.
- Observability > Monitoring. You can’t just track “CPU is high.” You need to ask arbitrary questions: “Why did this request take 5 seconds?” Observability is the ability to answer questions you didn’t anticipate.
- Blameless postmortems. Incidents are learning opportunities, not staff problems.
- Automation saves humans. Toil (manual, repetitive work) is the enemy. Automate it.
Observability Basics: Metrics, Logs, Traces
Section titled “Observability Basics: Metrics, Logs, Traces”Metrics: Numbers over time. CPU usage, request latency, error rate. Questions: “Is the system healthy right now?”
Logs: Textual records of events. “User login failed,” “database connection timeout.” Questions: “What happened during that time window?”
Traces: End-to-end request flows across services. “Request entered API → hit database → returned to client.” Questions: “Why was that specific request slow?”
Together, they form the Three Pillars of Observability. No single pillar is enough.
Implementing Observability: OpenTelemetry
Section titled “Implementing Observability: OpenTelemetry”Understanding the three pillars conceptually is necessary but not sufficient. You also need a way to collect and propagate telemetry data across services in a consistent, vendor-neutral way. That standard is OpenTelemetry (OTel).
What OpenTelemetry is: An open-source, vendor-neutral framework for instrumenting, generating, collecting, and exporting telemetry data (traces, metrics, logs). It is now the industry standard for distributed tracing and is supported by every major monitoring platform (Datadog, Honeycomb, Grafana, AWS, GCP, etc.).
Why it matters for AI systems: Without standardized trace IDs propagated in HTTP headers across service boundaries, you cannot reconstruct the full path of a request. An agent asked to “find what caused this latency spike” cannot reason across service boundaries if traces don’t connect.
Key concepts:
- Trace context propagation: A unique trace ID is injected into request headers (
traceparent). Every service that handles the request reads and forwards that ID, creating a connected chain from ingress to database and back. - Spans: Each unit of work within a trace (a single service call, a database query, an LLM inference). Spans carry start time, duration, status, and custom attributes.
- OpenInference: An extension of OpenTelemetry with LLM-specific semantics — token counts, model names, cost per call, prompt/response payloads. Enables cost tracking and quality monitoring alongside latency.
Practical example: Instead of asking an agent to “show me logs,” a properly instrumented system lets you query: “Get all LLM calls with latency > 5 seconds in the last hour” or “Show me every request where cost exceeded $0.50 and the response was marked incorrect.” That level of precision requires structured traces with consistent attributes — which OTel provides.
Instrumentation checklist:
- Use an OTel SDK (available for Python, Node.js, Java, Go, .NET, and others)
- Propagate
traceparentheaders across all service-to-service calls - Add OpenInference spans around LLM calls to capture token usage and cost
- Export to your monitoring platform of choice (vendor-neutral by design)
How AI Changes Incident Response
Section titled “How AI Changes Incident Response”The timelines below are illustrative example scenarios, not benchmarks. Traditional incident resolution can range from 15 minutes to multiple days depending on incident complexity, system familiarity, and observability maturity. The AI-augmented scenario assumes good observability infrastructure, appropriate tool access, and a trained agent — conditions that take time to establish.
Example: Traditional workflow (well-instrumented system, familiar engineer):
Alert fires → Engineer wakes up→ Opens dashboard (5 min)→ Scans logs for errors (10 min)→ Checks recent deployments (5 min)→ Forms hypothesis (10 min)→ Tests hypothesis with more logs (15 min)→ Deploys fix or escalates (20 min)Total: ~65 minutesExample: AI-augmented workflow (same system, good observability, agent has tool access):
Alert fires → Engineer wakes up→ Runs AI investigation: "What changed in the last hour?" (agent queries metrics, logs, Git history, deployment logs in parallel)→ Claude returns: "Deployment X modified retry logic; this request pattern wasn't tested"→ Engineer validates hypothesis (5 min)→ Deploys fix (5 min)Total: ~10 minutesThe AI doesn’t replace the engineer’s judgment. It compresses the investigation phase and surfaces the most likely causes. The engineer still decides what to do.
The 10-minute scenario is an optimistic illustration of what becomes possible with mature observability. If your logs are unstructured, traces are missing, or the agent lacks tool access to the relevant systems, the AI-augmented workflow won’t deliver that compression. Observability readiness is a prerequisite, not an afterthought.
Multi-Agent Systems for Production Monitoring
Section titled “Multi-Agent Systems for Production Monitoring”Two patterns:
-
Local Agent (Interactive Debugging): You run
claude codeon your laptop, ask questions about your system, and iterate. Useful for “I don’t know what to check next” scenarios. Full context, human judgment, but not automated. -
Cloud Agent (Automated Monitoring): A background process runs continuously, monitors your systems, and escalates anomalies or incidents. Useful for “detect this pattern and alert me” scenarios. Less context, but doesn’t require human presence.
Both can coexist: automated agents catch issues; local agents investigate when humans are needed.
Note: Cloud agents for continuous monitoring are an emerging architectural pattern with growing adoption (2024-2026), not yet a universal production standard. Start with local agent-assisted investigation and graduate to automated cloud agents as your observability infrastructure matures.
Knowledge Preservation and Scaling
Section titled “Knowledge Preservation and Scaling”When your best on-call engineer leaves, what walks out the door? All the unwritten knowledge: “When Service A acts up, always check Service B’s cache first.” “That log message is confusing, but it’s actually fine.” “The retry logic has a bug in edge case X, so watch for Y.”
AI provides a solution: encode this knowledge as agents and monitoring rules. A playbook for incident investigation becomes a set of MCP tools and prompts. A decade of debugging wisdom becomes part of the baseline for every on-call engineer, new or veteran.
Alert Fatigue and Noise Filtering
Section titled “Alert Fatigue and Noise Filtering”Every alert you ignore teaches you to ignore alerts. Alert fatigue kills incident response. The traditional solution is manual tuning of thresholds. The AI solution: use agents to filter noise. “Alert me if error rate exceeds 5% and latency is above 2s and this error is new (not seen in the last week).” Multi-condition logic, adaptive thresholds, learned baselines.
Common Agent Failure Patterns
Section titled “Common Agent Failure Patterns”When AI agents are part of your production system, they introduce failure modes that traditional monitoring doesn’t cover. These aren’t hypothetical — they’re documented patterns from production deployments.
Four failure archetypes:
-
Premature action without grounding — The agent acts before it has sufficient context. It reads one log file, concludes root cause, and applies a fix without checking related services. Prevention: require agents to gather evidence from multiple sources before proposing remediation actions.
-
Over-helpfulness / scope creep — The agent “helpfully” modifies adjacent systems while fixing the stated problem. Prevention: define explicit action boundaries per agent role; use read-only agents for investigation and separate agents with limited write permissions for remediation.
-
Context pollution — Irrelevant history or misleading intermediate results accumulate in the agent’s context window, degrading reasoning quality over a long incident. Prevention: segment tasks into fresh agent invocations; summarize and compress context between phases.
-
Fragile execution under load — The agent works correctly during testing but fails unpredictably under production load (concurrent requests, partial tool failures, slow API responses). Prevention: test agents under realistic load conditions; build in retry logic and graceful degradation.
Specific failure modes to monitor:
| Failure Mode | Signal | Response |
|---|---|---|
| Tool invocation errors | Agent calls a tool that returns an error; agent ignores or misinterprets it | Log tool call success/failure rate; alert when error rate > 5% |
| Multi-step reasoning drift | Agent’s reasoning quality degrades across a long chain of steps | Log quality scores per step; alert on > 20% degradation |
| Latent inconsistency | Same input produces different outputs across repeated calls | Run k-trial consistency checks; alert when consistency drops below 90% |
| Context window saturation | Agent has exceeded its effective context capacity and starts losing earlier information | Monitor context length; truncate or summarize when approaching limits |
| Version drift | Model upgrade changes output format, breaking downstream parsing | Run regression test suites after model updates before routing production traffic |
| Cost-driven collapse | Agent enters an infinite or runaway tool-calling loop | Alert on cost-per-request exceeding 3x baseline (see next section) |
Example walkthrough: A model upgrade introduces a subtle change in JSON serialization format. Your agent’s downstream parser breaks silently — it doesn’t throw an error, it just returns empty results. The system appears to function; no alerts fire. A regression test suite run against the upgraded model would have caught this before it reached production.
Key principle: Monitor the agent’s behavior, not just the system it’s monitoring. Tool invocation error rates, consistency scores, and reasoning step quality are first-class metrics.
Cost Tracking as a Reliability Signal
Section titled “Cost Tracking as a Reliability Signal”For AI systems, cost is not just a budget concern — it’s an early warning indicator of system failures.
Why cost anomalies precede functional failures:
A well-functioning agent has a predictable cost-per-interaction baseline. When that baseline spikes, it almost always means something is wrong before any functional metric degrades:
- 10x cost spike → Agent is likely in an infinite tool-calling loop, calling the same tool repeatedly without making progress
- Sustained cost increase → Model may be generating excessively long outputs (prompt injection, runaway reasoning chains)
- Cost spike correlated with accuracy drop → Context pollution or hallucination amplifying token usage
- Cost spike after deployment → New code introduced a logic error causing redundant LLM calls
What to track:
- Cost per request (absolute and relative to baseline)
- Cost per model, per tool, per interaction type
- Total daily/hourly cost with anomaly detection
- Token usage breakdown: input tokens vs. output tokens (output token spikes are often the first signal)
Alert configuration:
Set alerts at multiple thresholds:
- Warning: cost-per-request > 2x baseline for last 100 requests
- Critical: cost-per-request > 5x baseline, or total hourly spend > 3x daily average rate
Example: An agent’s daily operating cost is $12. After a code deployment, it climbs to $180 within two hours. Investigation reveals a new code path causes the agent to retry a failing tool call in a tight loop, generating thousands of redundant LLM calls. The cost alert caught this before any user-facing error appeared.
Add cost tracking to your observability dashboards alongside latency and error rate from day one.
Escalation and Safety in Automated Response
Section titled “Escalation and Safety in Automated Response”Automated agents require explicit escalation protocols. Without them, automation creates new failure modes: agents applying irreversible changes without human review, acting on insufficient evidence, or creating accountability gaps.
Action categorization:
Before deploying any automated agent, categorize the actions it can take:
| Category | Examples | Policy |
|---|---|---|
| Informational | Query logs, read metrics, generate summaries | Auto-execute; no approval needed |
| Reversible | Scale up a service, restart a pod, adjust a config flag | Auto-execute with monitoring window and rollback readiness |
| Irreversible or high-risk | Delete data, modify user records, make financial transactions | Require explicit human approval before execution |
Escalation triggers:
An automated agent should escalate to a human when:
- The proposed action is irreversible or affects user data
- Confidence in root cause is low (insufficient evidence)
- The incident is novel — no matching historical pattern
- Cost or blast radius of the remediation is above a defined threshold
- Two consecutive automated actions have failed to resolve the issue
Audit trails:
Every agent action should be logged with: the triggering event, the reasoning chain, the action taken, the outcome, and the agent version. This log is essential for postmortems and for detecting systematic agent errors over time.
Example escalation flow: Agent detects database connection pool exhaustion. It identifies the likely fix (scale the connection pool). This is a reversible action with moderate risk. The agent pages the on-call engineer with its analysis and proposed action, rather than executing immediately. The engineer reviews in two minutes and approves. The agent applies the change and monitors for ten minutes with automatic rollback armed if error rates don’t improve.
Meta-Observability: Monitoring the Agent Itself
Section titled “Meta-Observability: Monitoring the Agent Itself”Standard system observability tells you when your application is failing. Meta-observability tells you when your AI agent is failing — which is a different and equally important signal.
LLM observability metrics:
- Token usage: Input and output tokens per call. Unexpected output token spikes often precede quality issues.
- Cost per call: Track cost attribution by model, agent role, and interaction type.
- Latency: P50 and P99 latency for LLM calls. P99 degradation often indicates context window pressure.
- Tool call success rate: Fraction of tool calls that complete without error. This is often the first indicator of integration failures.
- Consistency score: Across repeated identical inputs, how stable are outputs? Measured as k-trial pass rate.
Tooling options:
- LangSmith / Langfuse: Purpose-built LLM observability platforms. Capture traces of agent reasoning chains, tool calls, prompt/response pairs.
- Helicone: Proxy-based LLM observability; minimal code changes required.
- Datadog / Honeycomb: General observability platforms with LLM monitoring integrations; useful if you want agent metrics in the same dashboard as system metrics.
What to watch for:
| Signal | Likely Cause |
|---|---|
| Accuracy drop after model upgrade | Version drift; run regression suite |
| Latency increase (P99) | Context window growing; context length monitoring needed |
| Cost spike without traffic increase | Runaway tool calls or infinite loops |
| Consistency drop below 90% | Model update or prompt regression |
| Tool call error rate > 5% | Integration failure or API change in external tool |
Connecting layers: Your observability stack should cover three layers simultaneously — infrastructure (CPU, memory, network), application (request latency, error rate, throughput), and agent (LLM cost, consistency, tool success rate). Gaps at any layer blind you to a whole class of failures.
Recommended Readings
Section titled “Recommended Readings”-
“Introduction to SRE” — Google SRE Book, Chapter 1. (Free online: https://sre.google/books/)
- ~15 min. Defines SRE, sets context for why this matters.
-
“Observability Basics” — Observe ability (Chapter 3 of “Observability Engineering” by Majors, Fong-Jones, Miranda). (Free summary online or book purchase)
- ~15 min. The three pillars, why they matter, how to use them.
-
“Kubernetes Troubleshooting with AI” — Kubernetes documentation on debugging + practical examples.
- ~20 min. Concrete examples of using logs and traces to diagnose issues. Read even if you don’t use Kubernetes; the patterns transfer.
-
“Benefits of Agentic AI in On-call Engineering” — Internal research or blog posts (e.g., from Anthropic, Stripe, or your own company).
- ~15 min. Case studies of how organizations are using AI to speed incident response.
-
“Your New Autonomous Teammate” — Internal or community documentation on using Claude Code in production contexts.
- ~10 min. Practical advice on trust, safety, and integration.
Key Concepts
Section titled “Key Concepts”| Concept | Definition |
|---|---|
| SRE | Site Reliability Engineering: applying software engineering to operational problems. Focus: reliability, automation, reducing toil. |
| Observability | Ability to understand system behavior by observing outputs (metrics, logs, traces). More powerful than traditional monitoring. |
| Metrics | Time-series data: CPU, memory, request latency, error rate. Answer: “How is the system performing right now?” |
| Logs | Structured or unstructured text records of events. Answer: “What happened during this time window?” |
| Traces | End-to-end request flows across services. Answer: “Why was this specific request slow or failing?” |
| Incident | An unplanned interruption or reduction in quality of service. |
| Root Cause | The underlying reason an incident occurred. Often multiple layers deep (not just “the code has a bug”). |
| Blameless Postmortem | A learning meeting after an incident focused on systems, not people. What failed? Why? How do we prevent it? |
| Alert Fatigue | Condition where too many alerts cause responders to ignore them. Kills incident response. |
| Toil | Manual, repetitive, operational work. SRE goal: eliminate it through automation. |
| OpenTelemetry (OTel) | Vendor-neutral, open-source standard for instrumenting and collecting telemetry (traces, metrics, logs). Industry default for distributed tracing. |
| Trace Context Propagation | The mechanism by which a trace ID is forwarded across service boundaries via HTTP headers, enabling end-to-end request reconstruction. |
| OpenInference | An OpenTelemetry extension with LLM-specific semantics: token counts, model names, cost per call, prompt/response content. |
| Meta-Observability | Observability applied to the AI agent itself: LLM cost, token usage, tool call success rate, consistency scores. Distinct from observing the system the agent monitors. |
| Observability Maturity | A staged measure of how well a system is instrumented for answering arbitrary diagnostic questions. Prerequisite for effective AI-assisted incident response. |
Observability Maturity: Established vs. Emerging Practices
Section titled “Observability Maturity: Established vs. Emerging Practices”Not all practices in this module carry the same level of industry adoption. Using established practices without the AI-augmented extensions is perfectly valid; piloting emerging practices on non-critical systems before broader rollout is prudent.
| Practice | Maturity | Guidance |
|---|---|---|
| Three Pillars of Observability (metrics, logs, traces) | Established — Universal industry standard | Implement fully; no caveats |
| Blameless postmortems | Established — Proven across Google, Netflix, Stripe, and others | Adopt unconditionally |
| SRE automation (eliminate toil) | Established — Core SRE principle since Google SRE (2016) | Adopt unconditionally |
| Observability > monitoring distinction | Established — Well-documented; foundational to modern SRE | Adopt unconditionally |
| OpenTelemetry for instrumentation | Established — CNCF standard, universal platform support | Adopt as default instrumentation approach |
| AI-assisted incident investigation (local agent) | Emerging — Research-validated (2024-2025), growing adoption | Pilot on non-critical systems first; build observability maturity first |
| Cloud agents for automated monitoring | Emerging — Architectural pattern with growing production adoption | Start with alerts + human triage; automate incrementally as confidence grows |
| Cost tracking as reliability signal | Emerging — Operationally documented in AI deployments (2025-2026) | Implement from day one for AI systems; limited precedent in traditional SRE |
| Agent failure pattern detection | Emerging — Taxonomy documented (2025-2026); tooling maturing | Implement monitoring metrics; adopt regression testing practices |
| Automated incident diagnosis (multi-agent) | Pilot — Demonstrated feasibility (AidAI, IRCopilot); not yet standard | Research and plan; do not deploy without robust human oversight gates |
Practical guidance: Start with the established practices. They are prerequisites. Layer in emerging AI-augmented patterns once your observability foundation (structured logs, distributed traces, OTel instrumentation) is solid. Attempting AI-assisted incident response on a poorly instrumented system wastes engineering effort and produces unreliable results.
Observability Audit Checklist
Section titled “Observability Audit Checklist”Before deploying AI-assisted incident response, assess your observability readiness. These are prerequisites, not nice-to-haves.
Foundation (All Systems)
Section titled “Foundation (All Systems)”- Structured logs: Logs are emitted as JSON with consistent field names (
timestamp,level,service,trace_id,message) - Consistent service naming: Every service uses the same identifier across logs, metrics, and traces
- Error context captured: Errors include stack traces, request IDs, and relevant state — not just error messages
- Metrics with clear labels: All key metrics (latency, error rate, throughput) include service, endpoint, and environment labels
- Alerting on symptoms, not causes: Alerts fire on user-visible symptoms (error rate, latency SLO breach) rather than internal signals (CPU, memory) alone
Distributed Systems
Section titled “Distributed Systems”- Trace IDs propagated:
traceparentheaders are forwarded across all service-to-service calls - Spans instrumented: Key operations (DB queries, external calls, LLM inference) emit spans with duration and status
- Service dependency map exists: You can enumerate which services call which, and what the failure blast radius is
- Log aggregation in place: Logs from all services (including containers and pods) flow to a central queryable store
AI-Agent-Specific
Section titled “AI-Agent-Specific”- LLM calls instrumented: Every LLM call captures model name, token counts (input + output), cost, latency, and outcome
- Tool call outcomes logged: Agent tool invocations log the tool name, input parameters, success/failure, and execution time
- Cost attribution in place: You can query cost by model, agent role, and time window
- Consistency baseline established: You have a k-trial pass rate benchmark for critical agent behaviors
- Agent version tracked: Agent model version and prompt version are logged alongside actions for postmortem analysis
Gaps and Remediation
Section titled “Gaps and Remediation”If you have gaps in the foundation tier, address those before investing in AI-augmented incident response. Common remediation steps:
- Add a structured logging library (e.g.,
structlogin Python,winstonin Node.js) - Instrument with OTel SDK and export to your monitoring platform
- Add
traceparentheader propagation to your HTTP clients and servers - Wrap LLM client calls in OpenInference spans
References
Section titled “References”Books and Guides
Section titled “Books and Guides”- Google SRE Book: Chapter 1 (Introduction)
- Observability Engineering (Book) by Majors, Fong-Jones, Miranda
- Kubernetes Documentation: Troubleshooting
Tools and Platforms (Sample)
Section titled “Tools and Platforms (Sample)”System Observability:
- Sentry (error tracking): https://sentry.io
- Datadog (monitoring and observability): https://www.datadoghq.com
- Prometheus + Grafana (open-source metrics): https://prometheus.io
- ELK Stack (logs): https://www.elastic.co
- Jaeger (distributed tracing): https://www.jaegertracing.io
- Honeycomb (query-driven observability): https://www.honeycomb.io
OpenTelemetry:
- OpenTelemetry (vendor-neutral instrumentation standard): https://opentelemetry.io
- OpenInference (LLM-specific OTel semantics): https://github.com/Arize-ai/openinference
LLM / Agent Observability:
- LangSmith (LLM tracing and evaluation): https://smith.langchain.com
- Langfuse (open-source LLM observability): https://langfuse.com
- Helicone (proxy-based LLM observability): https://www.helicone.ai
Articles and Blogs
Section titled “Articles and Blogs”- Stripe: “Scaling Incident Response” (search their blog)
- Anthropic: Case studies on AI in operations (when available)
- Your company’s incident postmortems (learn from real incidents)
Workshop
Section titled “Workshop”The hands-on session for this module: M11 Workshop: Post-Deployment — Monitoring AI-Generated Systems
Takeaway
Section titled “Takeaway”After completing this module and the workshop, you will have:
- A mental model for observability in AI-augmented systems
- Familiarity with OpenTelemetry instrumentation for LLM applications
- An incident response workflow using AI-assisted investigation
- An observability audit checklist for your own systems
- Understanding of cost tracking as a reliability signal