M11: Post-Deployment — Monitoring AI-Generated Systems

Tier 3 — Operations and Scale | Duration: 90 min (pre-work + workshop)

Module Overview

Shipping code is only half the battle. The real complexity emerges at 3 AM when your system is failing and your on-call engineer has 10 minutes to diagnose a production incident. This module teaches you how AI fundamentally changes incident response and system observability.

In traditional incident workflows, an engineer sees an alert, opens dashboards, digs through logs, correlates events, forms hypotheses about root causes, and deploys fixes—all manually. It’s slow and error-prone, especially on unfamiliar systems. AI changes this. With Claude Code and MCP tools, you can automate the investigation itself: ask an agent to “find what changed in the last hour,” “correlate this error with recent deployments,” or “trace this request through all services.” The agent pulls from monitoring systems, version control, logs, and traces—simultaneously—and narrows the search space to actionable causes in minutes instead of hours.

This module also covers the dual-agent model: local agents for interactive debugging (when you need human judgment) and cloud agents for automated monitoring and alerting (when you need scalability and speed). You’ll learn how to preserve organizational knowledge as experienced engineers leave, and how to structure observability so that both humans and AI can reason about what went wrong.

Takeaway: Incident investigation prompt templates, observability audit checklist, agent failure pattern taxonomy, escalation frameworks, and judgment about when AI accelerates vs. replacing human diagnosis.

Prerequisites

Completed Tier 1 and Tier 2
Familiar with Claude Code, prompts, and MCP integrations
Basic understanding of monitoring, logging, or observability tools (or willingness to learn)
Access to a production system with logs and metrics (or willingness to use a simulated scenario)

Pre-Work: Theory and Readings (15-20 min)

SRE Fundamentals

Site Reliability Engineering (SRE) is a discipline that applies software engineering principles to operational problems. Core tenets:

Reliability is a feature, not a bug. It competes with velocity; trade-offs are intentional.
Observability > Monitoring. You can’t just track “CPU is high.” You need to ask arbitrary questions: “Why did this request take 5 seconds?” Observability is the ability to answer questions you didn’t anticipate.
Blameless postmortems. Incidents are learning opportunities, not staff problems.
Automation saves humans. Toil (manual, repetitive work) is the enemy. Automate it.

Observability Basics: Metrics, Logs, Traces

Metrics: Numbers over time. CPU usage, request latency, error rate. Questions: “Is the system healthy right now?”

Logs: Textual records of events. “User login failed,” “database connection timeout.” Questions: “What happened during that time window?”

Traces: End-to-end request flows across services. “Request entered API → hit database → returned to client.” Questions: “Why was that specific request slow?”

Together, they form the Three Pillars of Observability. No single pillar is enough.

Implementing Observability: OpenTelemetry

Understanding the three pillars conceptually is necessary but not sufficient. You also need a way to collect and propagate telemetry data across services in a consistent, vendor-neutral way. That standard is OpenTelemetry (OTel).

What OpenTelemetry is: An open-source, vendor-neutral framework for instrumenting, generating, collecting, and exporting telemetry data (traces, metrics, logs). It is now the industry standard for distributed tracing and is supported by every major monitoring platform (Datadog, Honeycomb, Grafana, AWS, GCP, etc.).

Why it matters for AI systems: Without standardized trace IDs propagated in HTTP headers across service boundaries, you cannot reconstruct the full path of a request. An agent asked to “find what caused this latency spike” cannot reason across service boundaries if traces don’t connect.

Key concepts:

Trace context propagation: A unique trace ID is injected into request headers (traceparent). Every service that handles the request reads and forwards that ID, creating a connected chain from ingress to database and back.
Spans: Each unit of work within a trace (a single service call, a database query, an LLM inference). Spans carry start time, duration, status, and custom attributes.
OpenInference: An extension of OpenTelemetry with LLM-specific semantics — token counts, model names, cost per call, prompt/response payloads. Enables cost tracking and quality monitoring alongside latency.

Practical example: Instead of asking an agent to “show me logs,” a properly instrumented system lets you query: “Get all LLM calls with latency > 5 seconds in the last hour” or “Show me every request where cost exceeded $0.50 and the response was marked incorrect.” That level of precision requires structured traces with consistent attributes — which OTel provides.

Instrumentation checklist:

Use an OTel SDK (available for Python, Node.js, Java, Go, .NET, and others)
Propagate traceparent headers across all service-to-service calls
Add OpenInference spans around LLM calls to capture token usage and cost
Export to your monitoring platform of choice (vendor-neutral by design)

How AI Changes Incident Response

The timelines below are illustrative example scenarios, not benchmarks. Traditional incident resolution can range from 15 minutes to multiple days depending on incident complexity, system familiarity, and observability maturity. The AI-augmented scenario assumes good observability infrastructure, appropriate tool access, and a trained agent — conditions that take time to establish.

Example: Traditional workflow (well-instrumented system, familiar engineer):

Alert fires → Engineer wakes up
→ Opens dashboard (5 min)
→ Scans logs for errors (10 min)
→ Checks recent deployments (5 min)
→ Forms hypothesis (10 min)
→ Tests hypothesis with more logs (15 min)
→ Deploys fix or escalates (20 min)
Total: ~65 minutes

Example: AI-augmented workflow (same system, good observability, agent has tool access):

Alert fires → Engineer wakes up
→ Runs AI investigation: "What changed in the last hour?"
  (agent queries metrics, logs, Git history, deployment logs in parallel)
→ Claude returns: "Deployment X modified retry logic; this request pattern wasn't tested"
→ Engineer validates hypothesis (5 min)
→ Deploys fix (5 min)
Total: ~10 minutes

The AI doesn’t replace the engineer’s judgment. It compresses the investigation phase and surfaces the most likely causes. The engineer still decides what to do.

The 10-minute scenario is an optimistic illustration of what becomes possible with mature observability. If your logs are unstructured, traces are missing, or the agent lacks tool access to the relevant systems, the AI-augmented workflow won’t deliver that compression. Observability readiness is a prerequisite, not an afterthought.

Multi-Agent Systems for Production Monitoring

Two patterns:

Local Agent (Interactive Debugging): You run claude code on your laptop, ask questions about your system, and iterate. Useful for “I don’t know what to check next” scenarios. Full context, human judgment, but not automated.
Cloud Agent (Automated Monitoring): A background process runs continuously, monitors your systems, and escalates anomalies or incidents. Useful for “detect this pattern and alert me” scenarios. Less context, but doesn’t require human presence.

Both can coexist: automated agents catch issues; local agents investigate when humans are needed.

Note: Cloud agents for continuous monitoring are an emerging architectural pattern with growing adoption (2024-2026), not yet a universal production standard. Start with local agent-assisted investigation and graduate to automated cloud agents as your observability infrastructure matures.

Knowledge Preservation and Scaling

When your best on-call engineer leaves, what walks out the door? All the unwritten knowledge: “When Service A acts up, always check Service B’s cache first.” “That log message is confusing, but it’s actually fine.” “The retry logic has a bug in edge case X, so watch for Y.”

AI provides a solution: encode this knowledge as agents and monitoring rules. A playbook for incident investigation becomes a set of MCP tools and prompts. A decade of debugging wisdom becomes part of the baseline for every on-call engineer, new or veteran.

Alert Fatigue and Noise Filtering

Every alert you ignore teaches you to ignore alerts. Alert fatigue kills incident response. The traditional solution is manual tuning of thresholds. The AI solution: use agents to filter noise. “Alert me if error rate exceeds 5% and latency is above 2s and this error is new (not seen in the last week).” Multi-condition logic, adaptive thresholds, learned baselines.

Common Agent Failure Patterns

When AI agents are part of your production system, they introduce failure modes that traditional monitoring doesn’t cover. These aren’t hypothetical — they’re documented patterns from production deployments.

Four failure archetypes:

Premature action without grounding — The agent acts before it has sufficient context. It reads one log file, concludes root cause, and applies a fix without checking related services. Prevention: require agents to gather evidence from multiple sources before proposing remediation actions.
Over-helpfulness / scope creep — The agent “helpfully” modifies adjacent systems while fixing the stated problem. Prevention: define explicit action boundaries per agent role; use read-only agents for investigation and separate agents with limited write permissions for remediation.
Context pollution — Irrelevant history or misleading intermediate results accumulate in the agent’s context window, degrading reasoning quality over a long incident. Prevention: segment tasks into fresh agent invocations; summarize and compress context between phases.
Fragile execution under load — The agent works correctly during testing but fails unpredictably under production load (concurrent requests, partial tool failures, slow API responses). Prevention: test agents under realistic load conditions; build in retry logic and graceful degradation.

Specific failure modes to monitor:

Failure Mode	Signal	Response
Tool invocation errors	Agent calls a tool that returns an error; agent ignores or misinterprets it	Log tool call success/failure rate; alert when error rate > 5%
Multi-step reasoning drift	Agent’s reasoning quality degrades across a long chain of steps	Log quality scores per step; alert on > 20% degradation
Latent inconsistency	Same input produces different outputs across repeated calls	Run k-trial consistency checks; alert when consistency drops below 90%
Context window saturation	Agent has exceeded its effective context capacity and starts losing earlier information	Monitor context length; truncate or summarize when approaching limits
Version drift	Model upgrade changes output format, breaking downstream parsing	Run regression test suites after model updates before routing production traffic
Cost-driven collapse	Agent enters an infinite or runaway tool-calling loop	Alert on cost-per-request exceeding 3x baseline (see next section)

Example walkthrough: A model upgrade introduces a subtle change in JSON serialization format. Your agent’s downstream parser breaks silently — it doesn’t throw an error, it just returns empty results. The system appears to function; no alerts fire. A regression test suite run against the upgraded model would have caught this before it reached production.

Key principle: Monitor the agent’s behavior, not just the system it’s monitoring. Tool invocation error rates, consistency scores, and reasoning step quality are first-class metrics.

Cost Tracking as a Reliability Signal

For AI systems, cost is not just a budget concern — it’s an early warning indicator of system failures.

Why cost anomalies precede functional failures:

A well-functioning agent has a predictable cost-per-interaction baseline. When that baseline spikes, it almost always means something is wrong before any functional metric degrades:

10x cost spike → Agent is likely in an infinite tool-calling loop, calling the same tool repeatedly without making progress
Sustained cost increase → Model may be generating excessively long outputs (prompt injection, runaway reasoning chains)
Cost spike correlated with accuracy drop → Context pollution or hallucination amplifying token usage
Cost spike after deployment → New code introduced a logic error causing redundant LLM calls

What to track:

Cost per request (absolute and relative to baseline)
Cost per model, per tool, per interaction type
Total daily/hourly cost with anomaly detection
Token usage breakdown: input tokens vs. output tokens (output token spikes are often the first signal)

Alert configuration:

Set alerts at multiple thresholds:

Warning: cost-per-request > 2x baseline for last 100 requests
Critical: cost-per-request > 5x baseline, or total hourly spend > 3x daily average rate

Example: An agent’s daily operating cost is $12. After a code deployment, it climbs to $180 within two hours. Investigation reveals a new code path causes the agent to retry a failing tool call in a tight loop, generating thousands of redundant LLM calls. The cost alert caught this before any user-facing error appeared.

Add cost tracking to your observability dashboards alongside latency and error rate from day one.

Escalation and Safety in Automated Response

Automated agents require explicit escalation protocols. Without them, automation creates new failure modes: agents applying irreversible changes without human review, acting on insufficient evidence, or creating accountability gaps.

Action categorization:

Before deploying any automated agent, categorize the actions it can take:

Category	Examples	Policy
Informational	Query logs, read metrics, generate summaries	Auto-execute; no approval needed
Reversible	Scale up a service, restart a pod, adjust a config flag	Auto-execute with monitoring window and rollback readiness
Irreversible or high-risk	Delete data, modify user records, make financial transactions	Require explicit human approval before execution

Escalation triggers:

An automated agent should escalate to a human when:

The proposed action is irreversible or affects user data
Confidence in root cause is low (insufficient evidence)
The incident is novel — no matching historical pattern
Cost or blast radius of the remediation is above a defined threshold
Two consecutive automated actions have failed to resolve the issue

Audit trails:

Every agent action should be logged with: the triggering event, the reasoning chain, the action taken, the outcome, and the agent version. This log is essential for postmortems and for detecting systematic agent errors over time.

Example escalation flow: Agent detects database connection pool exhaustion. It identifies the likely fix (scale the connection pool). This is a reversible action with moderate risk. The agent pages the on-call engineer with its analysis and proposed action, rather than executing immediately. The engineer reviews in two minutes and approves. The agent applies the change and monitors for ten minutes with automatic rollback armed if error rates don’t improve.

Meta-Observability: Monitoring the Agent Itself

Standard system observability tells you when your application is failing. Meta-observability tells you when your AI agent is failing — which is a different and equally important signal.

LLM observability metrics:

Token usage: Input and output tokens per call. Unexpected output token spikes often precede quality issues.
Cost per call: Track cost attribution by model, agent role, and interaction type.
Latency: P50 and P99 latency for LLM calls. P99 degradation often indicates context window pressure.
Tool call success rate: Fraction of tool calls that complete without error. This is often the first indicator of integration failures.
Consistency score: Across repeated identical inputs, how stable are outputs? Measured as k-trial pass rate.

Tooling options:

LangSmith / Langfuse: Purpose-built LLM observability platforms. Capture traces of agent reasoning chains, tool calls, prompt/response pairs.
Helicone: Proxy-based LLM observability; minimal code changes required.
Datadog / Honeycomb: General observability platforms with LLM monitoring integrations; useful if you want agent metrics in the same dashboard as system metrics.

What to watch for:

Signal	Likely Cause
Accuracy drop after model upgrade	Version drift; run regression suite
Latency increase (P99)	Context window growing; context length monitoring needed
Cost spike without traffic increase	Runaway tool calls or infinite loops
Consistency drop below 90%	Model update or prompt regression
Tool call error rate > 5%	Integration failure or API change in external tool

Connecting layers: Your observability stack should cover three layers simultaneously — infrastructure (CPU, memory, network), application (request latency, error rate, throughput), and agent (LLM cost, consistency, tool success rate). Gaps at any layer blind you to a whole class of failures.

Key Concepts

Concept	Definition
SRE	Site Reliability Engineering: applying software engineering to operational problems. Focus: reliability, automation, reducing toil.
Observability	Ability to understand system behavior by observing outputs (metrics, logs, traces). More powerful than traditional monitoring.
Metrics	Time-series data: CPU, memory, request latency, error rate. Answer: “How is the system performing right now?”
Logs	Structured or unstructured text records of events. Answer: “What happened during this time window?”
Traces	End-to-end request flows across services. Answer: “Why was this specific request slow or failing?”
Incident	An unplanned interruption or reduction in quality of service.
Root Cause	The underlying reason an incident occurred. Often multiple layers deep (not just “the code has a bug”).
Blameless Postmortem	A learning meeting after an incident focused on systems, not people. What failed? Why? How do we prevent it?
Alert Fatigue	Condition where too many alerts cause responders to ignore them. Kills incident response.
Toil	Manual, repetitive, operational work. SRE goal: eliminate it through automation.
OpenTelemetry (OTel)	Vendor-neutral, open-source standard for instrumenting and collecting telemetry (traces, metrics, logs). Industry default for distributed tracing.
Trace Context Propagation	The mechanism by which a trace ID is forwarded across service boundaries via HTTP headers, enabling end-to-end request reconstruction.
OpenInference	An OpenTelemetry extension with LLM-specific semantics: token counts, model names, cost per call, prompt/response content.
Meta-Observability	Observability applied to the AI agent itself: LLM cost, token usage, tool call success rate, consistency scores. Distinct from observing the system the agent monitors.
Observability Maturity	A staged measure of how well a system is instrumented for answering arbitrary diagnostic questions. Prerequisite for effective AI-assisted incident response.

Observability Maturity: Established vs. Emerging Practices

Not all practices in this module carry the same level of industry adoption. Using established practices without the AI-augmented extensions is perfectly valid; piloting emerging practices on non-critical systems before broader rollout is prudent.

Practice	Maturity	Guidance
Three Pillars of Observability (metrics, logs, traces)	Established — Universal industry standard	Implement fully; no caveats
Blameless postmortems	Established — Proven across Google, Netflix, Stripe, and others	Adopt unconditionally
SRE automation (eliminate toil)	Established — Core SRE principle since Google SRE (2016)	Adopt unconditionally
Observability > monitoring distinction	Established — Well-documented; foundational to modern SRE	Adopt unconditionally
OpenTelemetry for instrumentation	Established — CNCF standard, universal platform support	Adopt as default instrumentation approach
AI-assisted incident investigation (local agent)	Emerging — Research-validated (2024-2025), growing adoption	Pilot on non-critical systems first; build observability maturity first
Cloud agents for automated monitoring	Emerging — Architectural pattern with growing production adoption	Start with alerts + human triage; automate incrementally as confidence grows
Cost tracking as reliability signal	Emerging — Operationally documented in AI deployments (2025-2026)	Implement from day one for AI systems; limited precedent in traditional SRE
Agent failure pattern detection	Emerging — Taxonomy documented (2025-2026); tooling maturing	Implement monitoring metrics; adopt regression testing practices
Automated incident diagnosis (multi-agent)	Pilot — Demonstrated feasibility (AidAI, IRCopilot); not yet standard	Research and plan; do not deploy without robust human oversight gates

Practical guidance: Start with the established practices. They are prerequisites. Layer in emerging AI-augmented patterns once your observability foundation (structured logs, distributed traces, OTel instrumentation) is solid. Attempting AI-assisted incident response on a poorly instrumented system wastes engineering effort and produces unreliable results.

Observability Audit Checklist

Before deploying AI-assisted incident response, assess your observability readiness. These are prerequisites, not nice-to-haves.

Foundation (All Systems)

Structured logs: Logs are emitted as JSON with consistent field names (timestamp, level, service, trace_id, message)
Consistent service naming: Every service uses the same identifier across logs, metrics, and traces
Error context captured: Errors include stack traces, request IDs, and relevant state — not just error messages
Metrics with clear labels: All key metrics (latency, error rate, throughput) include service, endpoint, and environment labels
Alerting on symptoms, not causes: Alerts fire on user-visible symptoms (error rate, latency SLO breach) rather than internal signals (CPU, memory) alone

Distributed Systems

Trace IDs propagated: traceparent headers are forwarded across all service-to-service calls
Spans instrumented: Key operations (DB queries, external calls, LLM inference) emit spans with duration and status
Service dependency map exists: You can enumerate which services call which, and what the failure blast radius is
Log aggregation in place: Logs from all services (including containers and pods) flow to a central queryable store

AI-Agent-Specific

LLM calls instrumented: Every LLM call captures model name, token counts (input + output), cost, latency, and outcome
Tool call outcomes logged: Agent tool invocations log the tool name, input parameters, success/failure, and execution time
Cost attribution in place: You can query cost by model, agent role, and time window
Consistency baseline established: You have a k-trial pass rate benchmark for critical agent behaviors
Agent version tracked: Agent model version and prompt version are logged alongside actions for postmortem analysis

Gaps and Remediation

If you have gaps in the foundation tier, address those before investing in AI-augmented incident response. Common remediation steps:

Add a structured logging library (e.g., structlog in Python, winston in Node.js)
Instrument with OTel SDK and export to your monitoring platform
Add traceparent header propagation to your HTTP clients and servers
Wrap LLM client calls in OpenInference spans

References

Books and Guides

Google SRE Book: Chapter 1 (Introduction)
Observability Engineering (Book) by Majors, Fong-Jones, Miranda
Kubernetes Documentation: Troubleshooting

Tools and Platforms (Sample)

System Observability:

Sentry (error tracking): https://sentry.io
Datadog (monitoring and observability): https://www.datadoghq.com
Prometheus + Grafana (open-source metrics): https://prometheus.io
ELK Stack (logs): https://www.elastic.co
Jaeger (distributed tracing): https://www.jaegertracing.io
Honeycomb (query-driven observability): https://www.honeycomb.io

OpenTelemetry:

OpenTelemetry (vendor-neutral instrumentation standard): https://opentelemetry.io
OpenInference (LLM-specific OTel semantics): https://github.com/Arize-ai/openinference

LLM / Agent Observability:

LangSmith (LLM tracing and evaluation): https://smith.langchain.com
Langfuse (open-source LLM observability): https://langfuse.com
Helicone (proxy-based LLM observability): https://www.helicone.ai

Articles and Blogs

Stripe: “Scaling Incident Response” (search their blog)
Anthropic: Case studies on AI in operations (when available)
Your company’s incident postmortems (learn from real incidents)

Workshop

The hands-on session for this module: M11 Workshop: Post-Deployment — Monitoring AI-Generated Systems

Takeaway

After completing this module and the workshop, you will have:

A mental model for observability in AI-augmented systems
Familiarity with OpenTelemetry instrumentation for LLM applications
An incident response workflow using AI-assisted investigation
An observability audit checklist for your own systems
Understanding of cost tracking as a reliability signal