We are witnessing a massive category error in software engineering.
We are trying to solve live production outages with tools designed to write code.
It is a fundamental mismatch of substrates.

Look at the image of the man shaving with a chainsaw. The chainsaw is an engineering marvel. It is incredibly powerful, high-velocity, and perfect for clearing forests. But applying it to a delicate five o'clock shadow is a disaster. It is too heavy. It lacks precision. One micro-slip, and you lose far more than just your stubble.
When we ask developer-focused AI tools like Cursor or Claude Code to troubleshoot a live production incident, we are doing the exact same thing. We are taking a high-velocity creation tool and dropping it into a fragile, real-time control system.
To build truly resilient systems, we must understand why the architecture of an operations harness looks entirely different from a coding harness.
Why does the developer substrate fail in operations?
To understand why developer tools fail in production, we have to look at the physics of the two environments. They operate on entirely different planes of state, time, and memory.
A developer edits an inert world:
- The files stay put. A codebase does not change its topology while you are looking at it.
- Git remembers everything. Every line of code, every commit, and every author is systematically recorded.
- The environment is static. Code is dead text until it is compiled and executed.
A coding harness is built for this static world. It relies on the local codebase being the absolute source of truth. It works because the playground holds still.
An operator acts on a living system:
- The topology is constantly moving. Microservices scale, nodes evict, and network paths shift in real time.
- The system has amnesia. When an incident ends, the transient runtime context disappears.
- The real context is unwritten. The hotfix applied to an ingress controller at 2:00 AM is rarely committed back to Git in the heat of the moment.
In production, static code is only ten percent of the story. The other ninety percent is system behavior under live, unpredictable load. A tool that only reads files is fundamentally blind to the physical reality of your runtime.
Why do ask-on-demand AI tools play catch-up during an incident?
Developer AI tools are episodic. They work on demand: you prompt, they respond. This pull-based workflow is fantastic when you are designing a database schema or writing a new endpoint. It is a catastrophic workflow when a Redis cluster failure is cascading through your stack.
During an incident, a tool that only works when you ask is always starting from behind.
### Episodic AI Triage (The Catch-up Loop)
[Incident Fires] ➔ [SRE copies logs] ➔ [Paste into Chat] ➔ [AI asks for metrics] ➔ [SRE copies Datadog] ➔ [MTTR ticks upward]
An on-demand tool starts with zero context. It does not know your VPC peering architecture. It does not understand upstream dependencies. It cannot see how a localized memory spike in a caching layer in one region is causing database connection pool exhaustion in another.
Every time you prompt an episodic assistant during a high-severity incident, you pay a steep tax. You copy logs. You paste Prometheus metrics. You manually sketch your architecture in text.
By the time the AI has enough context to be useful, your Mean Time to Understand (MTTU) has already stretched into hours.
The Illusion of Integration: Why MCPs and CLI access do not save developer tools
The obvious counter-argument is integration. Why not just bridge the gap?
With the Model Context Protocol (MCP), terminal extensions, and CLI execution plugins, you can now give Claude Code or Cursor direct access to your infrastructure. You can hook your IDE up to Datadog, AWS APIs, and kubectl.
Now, when an incident fires, the agent can "go and figure it out."
But giving a developer tool raw access to your production telemetry does not solve the structural mismatch. It actually exposes three deeper architectural vulnerabilities:
- The High-Latency Discovery Tax. Even with MCP access to Datadog, an on-demand agent is still reactive. It only wakes up when you prompt it. When the alert fires, it must start querying APIs from scratch. It has to trace the system topology under fire. It fetches historical metrics, compares them to current graphs, and reconstructs the causal tree in real time. By the time it builds a working model of the failure, your MTTR has already slipped. Integration does not change the fact that on-demand systems are always running behind.
- The Amnesia of Raw Telemetry. An MCP server lets an LLM fetch metrics, but it does not give the tool memory. If a similar Kafka consumer group lag caused an outage three months ago, and your team resolved it with a custom JVM flag update, your IDE agent does not know that. It sees the Datadog API payload as a brand new puzzle. It begins diagnosing the issue from first principles all over again. Without a compounding memory engine to correlate past incident resolutions, your automated triage is perpetually stuck on Day 1.
- The Unbounded Blast Radius. This is the most dangerous risk. A developer tool is optimized for action and completion. It wants to write code, execute commands, and solve the prompt you gave it. If you give a code-completion engine raw terminal or CLI execution rights during a high-severity incident, you are operating without guardrails. The agent might try to resolve a memory leak by restarting a container, unaware that the target pod sits on a node experiencing a severe network partition. The restart then triggers a cascading failover. Because developer tools lack built-in policy-guided guardrails, they cannot calculate the operational blast radius before they execute a change.
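The restart scenario in the last bullet can be made concrete with a minimal Python sketch of a check that runs before any command does. Everything in it is hypothetical: the `Node`, `Pod`, and `check_restart` names, the `partitioned` flag, and the replica threshold are illustrative assumptions, not the API of any real tool.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    partitioned: bool  # assumed to be fed by a live health signal


@dataclass(frozen=True)
class Pod:
    name: str
    node: Node
    healthy_replicas: int  # other healthy replicas of the same workload


def check_restart(pod: Pod, min_healthy_replicas: int = 1) -> tuple[bool, str]:
    """Decide whether a restart is safe before the command is executed."""
    if pod.node.partitioned:
        return False, f"node {pod.node.name} is partitioned; restart may cascade"
    if pod.healthy_replicas < min_healthy_replicas:
        return False, f"only {pod.healthy_replicas} healthy replicas would remain"
    return True, "restart is within policy"


# Hypothetical incident: the leaking pod lives on a partitioned node.
risky_pod = Pod("cache-0", Node("node-a", partitioned=True), healthy_replicas=2)
allowed, reason = check_restart(risky_pod)
print(allowed, reason)
```

The point of the sketch is ordering: the runtime state of the node is consulted first, and the restart is refused with a stated reason instead of being attempted and observed.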
What are the foundations of an operations harness?
If we want to automate reliability, we must stop trying to adapt creation tools for operational environments. We need a dedicated operations harness.
At RubixKube, we call the foundation of this harness Site Reliability Intelligence (SRI). It is built on three core architectural principles that are completely absent from coding assistants.
1. Continuous, Always-On Observation
An operations harness does not wait for a prompt. It runs an active, low-latency observer that continuously maps your topology. It learns what your microservices look like on Day 1, maps their dependencies by Week 1, and begins mapping causal relationships by Month 1. When an anomaly strikes, the harness already has the entire map in memory.
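As a rough illustration of "already has the map in memory", the sketch below keeps a dependency graph up to date as call events arrive, so that the question "who is affected if this service fails?" needs no API calls at alert time. The `TopologyMap` class, its method names, and the service names are assumptions made for this example only.

```python
from __future__ import annotations

from collections import defaultdict, deque


class TopologyMap:
    """In-memory dependency graph, updated continuously as events arrive."""

    def __init__(self) -> None:
        # caller -> set of services it calls
        self._calls: dict[str, set[str]] = defaultdict(set)

    def observe_call(self, caller: str, callee: str) -> None:
        self._calls[caller].add(callee)

    def forget_service(self, service: str) -> None:
        # Called when a service is scaled away or evicted.
        self._calls.pop(service, None)
        for callees in self._calls.values():
            callees.discard(service)

    def impacted_by(self, failing: str) -> set[str]:
        """Every service that depends on `failing`, directly or transitively."""
        callers: dict[str, set[str]] = defaultdict(set)
        for caller, callees in self._calls.items():
            for callee in callees:
                callers[callee].add(caller)
        seen: set[str] = set()
        queue = deque([failing])
        while queue:  # breadth-first walk over the reversed edges
            current = queue.popleft()
            for caller in callers[current]:
                if caller not in seen:
                    seen.add(caller)
                    queue.append(caller)
        seen.discard(failing)  # a dependency cycle must not list the service itself
        return seen


topo = TopologyMap()
topo.observe_call("checkout", "payments")
topo.observe_call("payments", "redis")
topo.observe_call("search", "elasticsearch")
print(sorted(topo.impacted_by("redis")))  # ['checkout', 'payments']
```

A production observer would build these edges from traces or service-mesh telemetry; here they are fed in by hand to keep the example self-contained.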
2. Compounding Memory Engine
Most infrastructure tools see your stack for the first time, every time you open them. A true reliability harness must remember.
Our Memory Engine compiles history across multiple dimensions:
- Resolution Notes. Short, unstructured notes written by SREs after solving past incidents.
- Graph Deltas. How your infrastructure changed over time.
- Implicit Rejections. Which automated recommendations your team turned down, and why.
This compounding memory ensures that when a database lock pattern occurs a second time, the system does not need to deduce the problem from scratch. It immediately reuses the previous causal chain as a strong hypothesis, cutting MTTU to under three minutes.
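One way to picture "reusing the previous causal chain as a hypothesis" is a similarity lookup over past incidents. The sketch below is an assumption-heavy illustration, not a description of the Memory Engine itself: it matches symptom tags with Jaccard similarity, which is a stand-in technique chosen for brevity, and the `PastIncident` type, the tags, and the resolution notes are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PastIncident:
    symptoms: frozenset[str]
    resolution_note: str  # short note written by an SRE after the fix


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Overlap between two symptom sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_hypothesis(
    current: frozenset[str],
    history: list[PastIncident],
    threshold: float = 0.5,
) -> Optional[PastIncident]:
    """Return the most similar past incident, or None if nothing is close enough."""
    best = max(history, key=lambda inc: jaccard(current, inc.symptoms), default=None)
    if best is None or jaccard(current, best.symptoms) < threshold:
        return None
    return best


# Hypothetical history; tags and notes are invented for the example.
HISTORY = [
    PastIncident(
        frozenset({"kafka_consumer_lag", "jvm_gc_pause", "orders_service"}),
        "Resolved with a custom JVM flag update on the consumers",
    ),
    PastIncident(
        frozenset({"db_lock_wait", "connection_pool_exhausted"}),
        "Terminated a long-running migration holding the lock",
    ),
]

match = best_hypothesis(frozenset({"kafka_consumer_lag", "jvm_gc_pause"}), HISTORY)
print(match.resolution_note if match else "no prior match")
```

The threshold matters: returning nothing is preferable to surfacing a weak match as if it were a known cause.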
3. The Closed OPEL Loop
Instead of simple prompt-and-response, an operations harness runs on a continuous OPEL Loop (Observe, Plan, Execute, Learn):
[ OBSERVE ] ─── Telemetry, Topology, and Logs
│
▼
[ PLAN ] ─── Correlate Signals & Root Cause Analysis (RCA)
│
▼
[ EXECUTE ] ─── Policy-Guided Action with Guardrails
│
▼
[ LEARN ] ─── Compounding Memory Engine
This execution layer operates behind strict, multi-layered guardrails. It runs model-generated commands inside hermetic, sandboxed execution environments, checks inputs against deterministic RBAC policies, and calculates the blast radius before proposing a change.
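The four stages above can be sketched as one pass of a loop in Python. This is a deliberately simplified illustration: the `OpelLoop` class and its callables are hypothetical, the policy gate is a plain function standing in for the sandboxing and RBAC checks described above, and memory is just a list.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class OpelLoop:
    observe: Callable[[], dict]                  # returns current signals
    plan: Callable[[dict, list], Optional[str]]  # signals + memory -> proposed action
    is_allowed: Callable[[str], bool]            # policy gate checked before execution
    execute: Callable[[str], bool]               # runs the action, returns success
    memory: list = field(default_factory=list)

    def run_once(self) -> str:
        signals = self.observe()                  # OBSERVE
        action = self.plan(signals, self.memory)  # PLAN
        if action is None:
            return "no_action"
        if not self.is_allowed(action):           # guardrail before EXECUTE
            self.memory.append((signals, action, "rejected"))
            return "rejected"
        outcome = "succeeded" if self.execute(action) else "failed"
        self.memory.append((signals, action, outcome))  # LEARN
        return outcome


# Hypothetical wiring: signal names and actions are invented for the example.
loop = OpelLoop(
    observe=lambda: {"redis_memory_pct": 97},
    plan=lambda s, m: "scale_redis" if s["redis_memory_pct"] > 90 else None,
    is_allowed=lambda action: action in {"scale_redis"},
    execute=lambda action: True,
)
print(loop.run_once())  # succeeded
```

Note that rejected actions are written to memory too, which mirrors the idea of learning from recommendations that were turned down.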
Stop shaving with a chainsaw
If you are trying to manage your production infrastructure with developer assistants, you are fighting the physics of the substrate. You are trading systemic safety for a conversational interface.
It is time to separate creation from control. Keep your coding tools in your editor to build great products. Give your production environment the dedicated, memory-retaining, and continuous reliability brain it deserves.
Want to see what an always-on operations harness looks like in your stack? Explore how RubixKube's Memory Engine and Safety Guardrails protect your runtime.
Frequently Asked Questions (FAQ)
What is the difference between a code-completion agent and an AI SRE?
A code-completion agent (like Cursor or Claude Code) is designed to generate and modify static files inside a local workspace. It operates on demand (episodically) and assumes a static context. An AI SRE is an always-on operational system that continuously ingests real-time telemetry, analyzes runtime infrastructure topology, performs automated root cause analysis, and executes remediation tasks within policy-guided guardrails.
How does the Model Context Protocol (MCP) affect incident response?
The Model Context Protocol (MCP) lets LLMs query external developer APIs and infrastructure logs, but it remains a reactive, transactional integration. Because an MCP-connected agent acts only when prompted, it pays a high-latency discovery tax during outages. It also lacks long-term memory, so it treats recurring incidents as entirely new problems.
Why is cognitive coordination debt dangerous during an incident?
Cognitive coordination debt occurs when human operators are forced to act as manual routers, copying logs from Datadog, architecture diagrams from wikis, and event traces from Kubernetes into an AI chat terminal. This manual overhead distracts engineers from critical decision-making, increases Mean Time to Understand (MTTU), and prolongs system downtime.




