Episode 3: The Spirit of Curiosity | A Cloud Pioneer on the Coming Decade of AI Operations

In the third episode of The Root Cause, Priyank Upadhyay sits down with cloud pioneer Randy Bias, co-author of "Pets vs Cattle," to unpack why the industry pours everything into AI-assisted coding while AI-assisted operations sits wide open, and what a decade of autonomous, agent-run infrastructure actually looks like.

In this episode of The Root Cause, Priyank Upadhyay sits down with Randy Bias to talk about the part of the AI revolution almost everyone is ignoring. Randy is VP of Technology and Strategy at Mirantis. He co-authored the "Pets vs Cattle" model that shaped how a generation thinks about infrastructure, served as an inaugural board director of the OpenStack Foundation, and founded Cloudscaling, which was acquired by EMC. He was building on the internet before there was a commercial internet, and using AWS EC2 in private beta when it had a single instance size and nobody knew what it was. In other words, when Randy says a shift is coming, it is worth listening.

The whole industry is busy teaching AI to write code. Almost nobody is teaching it to run the systems that code lives on. That gap, Randy argues, is the single largest open opportunity in infrastructure today, and most of the field is walking right past it.

Key Takeaways

  • Why AI-assisted operations is dramatically underinvested compared to AI-assisted coding, even though operators carry far more toil.
  • Why founders with no operational experience will struggle to build for operators, no matter how many customer interviews they run.
  • The difference between a sandbox and a playground, and why operations is fundamentally a team sport.
  • Why judgment and accountability are the one part of operations that cannot be handed to an agent.
  • Why, in a world where AI levels the playing field, domain expertise is the only real moat left.

The Great Imbalance: We Automate Code, Not Operations

Randy describes himself as a recovered AI skeptic. He doubted the hype for years, then started seriously using the tools in 2024 and changed his mind fast. What convinced him was not a demo. It was the pace. He went from treating AI as a blip to placing it in the same category as the internet and the cloud: a world-altering event that touches the entire IT stack from top to bottom.

But his enthusiasm comes with a sharp frustration. Every harness, every framework, every clever pattern the industry has built sits on top of software development. Coding agents have matured at an astonishing rate. Meanwhile, the people who keep production alive have been left almost entirely out of the conversation.

This matters because of where the pain actually lives. Developers have toil, yes. But operations teams routinely spend 80 to 90 percent of their time paying down technical debt or fighting fires, which leaves almost nothing for actually improving the systems they run. The opportunity to take that load off their backs is enormous, and it is sitting untouched. The reluctance is partly cultural ("we can't let agents touch production") and partly a simple lack of investment. As Randy puts it, the momentum just is not there yet, and he is hoping the industry builds it soon.

The Cynic's Test: You Cannot Serve Operators Without the Scars

There is a wave of new companies chasing AI for operations, and Randy is glad people see the opportunity. But the cynic in him notices something: a lot of these companies are founded by people with zero operational experience.

His warning is blunt. You can run all the customer interviews you want, but the real job of product discovery is reading between the lines, sussing out what customers are not telling you. With operators and operations executives, you simply cannot do that unless you have lived their reality. Without the scars, you end up building for the symptoms operators describe instead of the underlying causes they cannot articulate.

This is not gatekeeping. It is a practical observation about a domain where the hardest problems are the unspoken ones. And it is exactly why the pressure is about to spike: as developers ramp up coding velocity with AI, the size and frequency of changes hitting production climb with it. The volume of vulnerabilities is accelerating too. Teams that try to absorb that firehose with old, heavyweight processes like ITIL and ITSM will break under the load. Randy's view is that adoption is not a question of if but when, because the environment itself will force operators to change.

The Missing Piece: Where Is the Operations Harness?

This is the heart of the conversation. When you use a modern coding agent, you are not just using a model. You are using an entire harness around it: you can set a goal, write a specification, generate end-to-end tests, run adversarial reviews with other models, bring in an architecture reviewer, and let the whole thing run for hours until it converges on something worth looking at. That scaffolding is what turned raw models into something genuinely powerful.

Now ask the obvious question. Where is the equivalent for operations?

Where is the agent whose job is change management, that pops up when you touch production and asks whether you got sign-off, whether you talked to the right people, whether you followed the process? Where is the agent that maintains a full audit trail so a mistake can be traced and turned into a blameless post-mortem? Where is the architecture manager that keeps a living picture of how the system actually works, the inventory manager that tracks every asset and bill of materials, the SOC engineer that watches the stream of CVEs and tells you what is coming for you?
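As a rough illustration of the change-management agent described above, the gate could start as little more than a pre-flight check that refuses a production change until sign-off and review are recorded, and that writes every decision to an audit trail. Everything in this sketch (the `ChangeRequest` fields, the `preflight` function, the in-memory `audit_log`) is hypothetical and invented for illustration; none of it comes from the episode or from an existing tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Optional


@dataclass
class ChangeRequest:
    """Hypothetical record of a proposed production change."""
    summary: str
    author: str
    approved_by: Optional[str] = None                    # who signed off, if anyone
    reviewers: List[str] = field(default_factory=list)   # people consulted


# Illustrative only: a real audit trail would be durable and append-only.
audit_log: List[Dict] = []


def preflight(change: ChangeRequest) -> List[str]:
    """Return the unmet process checks; an empty list means the change may proceed."""
    problems = []
    if change.approved_by is None:
        problems.append("no sign-off recorded")
    elif change.approved_by == change.author:
        problems.append("author cannot approve their own change")
    if not change.reviewers:
        problems.append("no reviewers consulted")

    # Record the decision whether or not the change was allowed,
    # so a later post-mortem can trace what was known at the time.
    audit_log.append({
        "at": datetime.now(timezone.utc).isoformat(),
        "change": change.summary,
        "author": change.author,
        "allowed": not problems,
        "problems": list(problems),
    })
    return problems
```

The point of the sketch is the shape, not the rules: the checks are plain data an agent can explain back to the operator, and the log entry is written on both the allow and the deny path.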

None of this exists in any harness today. It lives in tribal knowledge, in scattered documents, in the heads of the people who happen to remember. Randy has read the system prompts of the leading coding tools line by line, and they are almost entirely about writing software. The operations harness has simply not been built yet. This is the same worldview that drives Site Reliability Intelligence: operations exists today because systems cannot understand themselves, and the work ahead is to change that.

Playground, Not Sandbox: Why Operations Is a Team Sport

The instinctive way to make AI safe in operations is to put it in a straitjacket, a sandbox so constrained the agent cannot do anything you did not pre-approve. Randy pushes back on this hard. If you constrain an agent so tightly that it cannot reason, search for a novel solution, or take a creative action, then it is no better than the deterministic automation we already had. You have spent a fortune on intelligence and then forbidden it from being intelligent.

His reframe is a playground instead of a sandbox. A space, possibly shared with other agents, where they can write code, make more tool calls, and freestyle, but with full auditability and traceability, and with oversight agents standing by as the adults in the room. Confine agents by role so each one only touches what it should, but do not lobotomize them.

This works because of a truth that separates operations from software development: operations is a team sport. A single developer can ship something meaningful alone. But you cannot run production alone. You are not awake 24/7, and when an incident hits you need multiple sets of eyes attacking it from different angles. That means a future built not on one assistant, but on teams of agents that share situational awareness, talk to each other, and work a problem together, with humans coordinating the whole thing.

And the path there runs through trust, which is earned, not granted. The safest place to begin is read-only: let agents observe, correlate, and recommend before they ever take an action. That is precisely the progression we believe in, and it is how trust in autonomy actually gets built.
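One way to picture the playground idea in code is a small authorization check that confines each agent to the tools of its role, starts every agent as read-only, and logs every decision. This is a minimal sketch under stated assumptions: the role names, tool names, and the two-step `Trust` ladder are all made up for illustration and do not describe any particular product.

```python
from enum import IntEnum


class Trust(IntEnum):
    """Hypothetical trust ladder; every agent starts at the bottom."""
    READ_ONLY = 0   # observe, correlate, recommend
    ACT = 1         # may also execute mutating actions within its role


# Hypothetical roles and the tools each one may touch.
ROLE_TOOLS = {
    "change-manager": {"read_tickets", "read_deploys"},
    "remediator": {"read_metrics", "restart_service"},
}

# Tools that change production state (also an assumption for this sketch).
MUTATING_TOOLS = {"restart_service"}

# Every decision is recorded, allowed or not, for later review.
audit_trail = []


def authorize(agent, role, trust, tool):
    """Decide whether a tool call is allowed and record the verdict."""
    if tool not in ROLE_TOOLS.get(role, set()):
        verdict = "denied: outside role"
    elif tool in MUTATING_TOOLS and trust < Trust.ACT:
        verdict = "recommend only: agent is read-only"
    else:
        verdict = "allowed"
    audit_trail.append((agent, role, tool, verdict))
    return verdict
```

Note what the check does not do: it places no limit on how the agent reasons or which of its permitted tools it combines. The constraint is on role and trust level, and promotion from `READ_ONLY` to `ACT` is a separate, human decision.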

Judgment: The One Job You Cannot Hand to an Agent

So how far does autonomy go? Randy thinks it goes much further than people expect over the long haul, with agents eventually running large parts of our systems using the same resiliency patterns and checks and balances that humans designed. But there is a hard limit, and it is not technical.

It is judgment.

Picture a SEV-1. You are down, you are losing millions of dollars an hour, and you have to decide: roll forward with an unproven fix, or try to reproduce it somewhere safe first. An agent cannot make that call well. It does not have the business context, it cannot weigh the real risk and reward, and most importantly, it cannot take responsibility for the decision. Responsibility and accountability are human, and they belong to a team that answers for the business. You can pull a lot of toil out of operations. You cannot pull out judgment, which means you cannot pull out the humans.

This is also why context matters so much. The most useful agent is one that understands not just the system but how the system maps to the business. Randy's example is sharp: on a platform like eBay, not all API endpoints are equal. An endpoint that lets a small shop list its inventory is an inconvenience if it fails. An endpoint tied directly to revenue is a different category of emergency. Encoding that distinction, turning tribal knowledge into a living, shared memory between humans and agents, is what lets good decisions get made quickly.
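Encoding that distinction does not have to be elaborate. The sketch below assumes a tiny catalogue that tags each endpoint with whether it is tied to revenue, and a `triage` function that turns the tag into a response. The paths, notes, and responses are invented for illustration; they are not real eBay endpoints or anyone's actual runbook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointContext:
    """Hypothetical business context attached to an API endpoint."""
    path: str
    revenue_critical: bool
    note: str


# Invented catalogue: in practice this is the tribal knowledge being written down.
CATALOGUE = {
    "/listings/bulk-upload": EndpointContext(
        "/listings/bulk-upload", False, "sellers list their inventory"),
    "/checkout/pay": EndpointContext(
        "/checkout/pay", True, "directly tied to revenue"),
}


def triage(path: str) -> str:
    """Map a failing endpoint to a response based on business impact."""
    ctx = CATALOGUE.get(path)
    if ctx is None:
        # No recorded context: do not guess, escalate to a person.
        return "unknown: ask a human for context"
    return "page on-call now" if ctx.revenue_critical else "open a ticket"


print(triage("/checkout/pay"))          # page on-call now
print(triage("/listings/bulk-upload"))  # open a ticket
```

The unknown case matters as much as the other two: when the business context has not been written down, the agent hands the question to a human instead of inventing a priority.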

The Exoskeleton Problem: Domain Expertise Is the Only Real Moat

The most quietly important idea in the episode is about people, not machines.

Randy offers a deliberately rough analogy. Imagine everyone could suddenly put on an exoskeleton that made them strong and fast enough to dunk like an NBA player. If they never learned the fundamentals of basketball, they would still be terrible players, because the exoskeleton lifts everyone equally. AI is that exoskeleton. It makes you feel superpowered, but it makes the person with real domain expertise just as superpowered, and they were already ahead.

So the only way to actually pull ahead is to go get the domain expertise. And here is the twist: the smartest way to use AI is to use it to acquire that expertise faster. You can have a model walk you through how a language, a storage system, or a compiler really works, deepening your understanding so you can then direct the tools with authority. What you cannot do is abdicate your thinking to the AI. If you let it do your reasoning for you, you only ever get back what it already has, which is a regurgitation of existing human knowledge. To push past the edge of what is known, the expertise has to live in you.

This is where the episode gets its name. Randy worries less about the veterans, who carry judgment and wisdom that AI lacks, than about a younger generation tempted to hand their thinking over before they have built any foundation. His advice is the same advice he has followed his whole career: stay curious, get into the guts of how things work, and break things on purpose so you can see what you are actually building. The graybeards with deep expertise will become formidable the moment they pick up the tools. Everyone else has to earn it the hard way, one fundamental at a time.

The next decade of operations will not be won by whoever adopts AI first. It will be won by whoever pairs it with the deepest understanding of the systems underneath. So here is the question worth sitting with: are you using these tools to do your thinking, or to sharpen it?

---

The Root Cause is an original series by RubixKube — Site Reliability Intelligence: see more, plan better, act safely, and learn with every incident.
