# llms-full.txt - RubixKube (plain-text corpus for LLMs and answer engines)
# Single-fetch companion to https://rubixkube.ai/llms.txt (shorter index, citation hints, and the same FAQs).
# This file adds long-form narrative per major route so crawlers do not need to render React to capture core positioning.

site: https://rubixkube.ai

[core_pages_plain_text]

[Homepage]
URL: https://rubixkube.ai/

RubixKube is Site Reliability Intelligence (SRI): software that detects infrastructure anomalies, runs evidence-linked root cause analysis, and drives governed resolution before customers feel impact. The product is built around an AI agent mesh that observes, plans, acts within guardrails, and learns so operational memory compounds across incidents instead of resetting after every outage.

The positioning is intentionally beyond classic observability. Dashboards and alerts remain necessary; RubixKube reasons across signals and context, maintains graph-backed incident memory, and closes the loop with diagnosis and human-approved or policy-governed action. Pillars called out on the site include memory (what the system remembers), safety (guardrails and approvals), and explainability (auditable reasoning tied to evidence).

Primary audiences are platform engineering and SRE teams on Kubernetes-heavy, cloud-native stacks who need lower MTTR, less alert noise, and reliability that improves over time rather than depending on repeated manual firefighting.

---

[Platform]
URL: https://rubixkube.ai/platform

The platform page describes how RubixKube operationalizes SRI in production. A lightweight observer agent runs in customer Kubernetes clusters and pulls telemetry on demand from standard pipelines (for example OpenTelemetry, Jaeger, and Tempo style sources). Events and context feed services that correlate signals, build causal understanding, and coordinate specialized agents across the incident lifecycle.
Differentiation on this route is architectural: conversational operations, graph-backed knowledge, integrations with collaboration and ticketing tools, and governed autonomy so recommendations and actions map to organizational policy. Technical implementation in the wider product stack includes event-driven services, graph storage for relationships and RCA context, and multi-tenant isolation at API and data layers.

---

[Solutions]
URL: https://rubixkube.ai/solutions

Solutions content frames outcomes for engineering and business stakeholders: faster incident resolution, reduced noise from poorly correlated alerts, safer change and operations, and reliability posture suited to teams shipping AI-augmented software. Use cases align with how enterprises actually fail (cascading dependencies, unclear blast radius, repeated incidents with no institutional memory).

RubixKube is positioned as complementary to existing observability vendors. Customers keep their metrics, logs, and traces where they already live; RubixKube adds the reliability layer that investigates, explains, and recommends next steps with evidence, rather than replacing the observability plane.

---

[Pricing]
URL: https://rubixkube.ai/pricing

Pricing is organized into tiers labeled Individual, Business, and Enterprise on the marketing site, with dimensions such as environments, monitored applications, and investigations per month. Boost Packs add investigation capacity. Exact numbers and limits live on the pricing page and change with go-to-market updates; assistants should cite https://rubixkube.ai/pricing for current figures.

The visible FAQ on /pricing explains product concepts (investigation, environment, monitored application), limit behavior, and the stance that RubixKube analyzes and recommends while the customer team controls what executes in production. FAQPage JSON-LD on that route matches those visible questions and answers.
---

[About]
URL: https://rubixkube.ai/about

The about narrative ties the company to operator experience: infrastructure complexity that outpaces dashboards, on-call load, and the gap between raw telemetry and decisions that protect revenue and trust. RubixKube is presented as the reliability layer that makes systems more self-healing and legible to both engineers and leadership.

Founding context and mission statements on /about should be cited for company motivation; technical depth belongs on /platform and supplementary machine-readable files.

---

[Resources]
URL: https://rubixkube.ai/resources

Resources aggregate documentation links, blog content, guides, tutorials, hands-on demos, and learning-oriented material. It is the correct citation target for "where are the docs" style questions alongside any linked external documentation host the site references.

Blog posts cover topics such as observability limits, autonomous operations, and product philosophy; slugs and freshness are listed in sitemap.xml.

---

[Contact]
URL: https://rubixkube.ai/contact

Contact and demo requests route through the on-site contact experience. For sales, partnerships, or press, this path is the supported entry point. Product and security questions that need contractual detail should be directed to the team rather than inferred from marketing copy alone.

---

[glossary_definitions]

[MTTU (Mean Time to Understand): Definition, Benchmarks, and How to Measure It]
URL: https://rubixkube.ai/glossary/mean-time-to-understand

Every reliability metric you track today was coined because someone decided to make a vague problem measurable. MTTD came from "we find out too late." MTTR came from "we recover too slowly." Neither captures the phase that determines both.

In 2026, with distributed systems, AI-generated workloads, and silent degradations that never trip a threshold, the bottleneck is not detection. It is not recovery. It is understanding. MTTU is how you measure that.
The Definition

MTTU is the time it takes for an on-call team to move from signals to a usable explanation:

MTTU = time from first credible signal to “we understand the nature of the incident well enough to choose the right action with confidence.”

That is different from detecting, acknowledging, or fully repairing. In practice, MTTU covers the “messy middle” of incident response: evidence gathering, correlation across services, hypothesis formation, validation, and agreeing on what matters.

Where MTTU sits in the MTTx framework

A concrete incident-phase timeline you can standardise

The cleanest way to make MTTU usable is to treat an incident as a timeline with explicit phase transitions.

The gap the industry already feels

Even when companies don’t call it “MTTU”, the industry repeatedly describes the same bottleneck using phrases like:

“time to root cause understanding”
“investigation”
“troubleshooting”
“interpret dozens of data sources”
“ad hoc queries… hours of experimentation”

The need is not new. The metric is.

How the industry describes it today

Source: Datadog Incident Response page
Quote: “Run autonomous investigations… to surface root cause… in minutes.”
Takeaway: “Autonomous investigations” and “root cause… in minutes” are explicitly about compressing the understanding phase.

Source: New Relic Logs Intelligence press release
Quote: “Accelerating time to root cause understanding.”
Takeaway: This is essentially “time-to-understanding” by name. The same release also links scale/AI logs to investigation load.

Source: Dynatrace Root-cause analysis page
Quote: “You don’t have to manually interpret dozens of data sources to know the root cause.”
Takeaway: The constraint is interpretation/correlation (human understanding), not repair mechanics.

Source: Google SRE Training PDF
Quote: “Mitigation buys you time to investigate and gather data…”
Takeaway: Google explicitly separates mitigation from investigation (understanding), implying a discrete phase worth optimising.

Source: incident.io on MTTR breakdown
Quote: “12 minutes assembling… 20 minutes troubleshooting…”
Takeaway: In their breakdown, a large share of MTTR is context + troubleshooting (i.e., MTTU-like work).

Source: Cloudflare postmortem (Feb 6, 2025)
Quote: “08:25 Internal incident declared… 08:42 Root cause identified…”
Takeaway: A first-party timeline separates “we are in an incident” from “we identified root cause”; that gap is directly measurable MTTU.

Source: Google Cloud status incident narrative (Jun 12, 2025)
Quote: “Within 10 minutes, the root cause was identified…”
Takeaway: Public reliability communications explicitly track time-to-root-cause as a speed milestone.

Source: Academic survey on RCA in microservices (2024)
Quote: “Complex dependencies… pose significant challenges in identifying the underlying causes…”
Takeaway: The research framing is clear: modern architectures make “finding causes” (understanding) hard enough to be a dedicated research domain.

Why it matters now

MTTU matters now because modern incidents are less “a server died” and more “the system is behaving strangely”. Three forces drive this.

Microservices amplify ambiguity

Microservices introduce dense dependency graphs and fault propagation: a small issue in one service can manifest as latency, retries, and partial failure elsewhere. An RCA survey focused on (micro)services calls out “complex dependencies and propagative faults” as a core challenge for identifying underlying causes.

This shifts the bottleneck from “repair” to “reasoning”. When symptoms are distributed, humans spend time correlating telemetry, reconstructing timelines, and ruling out false leads, i.e., MTTU work.

AI-era systems change the shape and volume of evidence

The investigation surface area is exploding.
In a 2025 announcement, New Relic explicitly ties modern distributed systems and “AI tools” to log volume and complexity, and even mentions “verbose model inference logs that burden AI workloads.” Whether you are operating AI workloads or simply operating with AI tooling, the operational reality is the same: more signals, more events, more correlated context required to understand what is real.

Silent degradation is more common than clean failure

The modern failure mode is not “down”. It is “degraded”, “partially broken”, or “user journey failing while servers look up”. Google’s Cloud CRE production maturity assessment explicitly warns that server “up” metrics can fail to reflect user experience: “if the server is ‘up’ but users still can’t use the product, that metric doesn’t give insight.”

The measurable cost: investigation time dominates more often than teams admit

MTTU varies widely by incident type, but multiple credible sources show the understanding phase is often a major slice of the whole.

A practical “median P1” breakdown (illustrative): incident.io says many teams see median P1 MTTR of 45–60 minutes, with a breakdown including 12 minutes “assembling the team and gathering context” and 20 minutes “troubleshooting the actual issue.” If you treat those two pieces as “time-to-understanding work”, that is ~32 minutes of cognition/correlation inside a 45–60 minute incident: over half of the elapsed time, from 32/60 ≈ 53% up to 32/45 ≈ 71% (simple arithmetic on their stated breakdown).

High-performing public postmortems still expose a distinct understanding gap: in Cloudflare’s Feb 6, 2025 incident timeline, an internal incident is declared at 08:25 UTC and root cause is identified at 08:42 UTC, about 17 minutes between “we’re responding” and “we understand cause” (derived from their timestamps).
Best case, root cause in minutes but recovery still takes longer: Google Cloud described one incident where root cause was identified “within 10 minutes”, but mitigation rollout completed within ~40 minutes. That narrative again separates understanding from execution.

Long-tail reality, understanding can take hours: in a Radix network outage report, the root cause was identified “about 4 hours after the outage began.” That is not a tooling failure; this is what complex systems do to humans under pressure.

MTTD and MTTA can be excellent and you can still lose, because the organisation cannot convert signals into understanding fast enough to pick the right action.

This is why a metric that isolates that middle phase is defensible: it turns a hand-wavy “debugging is hard” into a number you can trend, segment, and improve.

How to operationalise it

RubixKube’s public positioning already sets direction: make investigations “minutes, not hours” and treat “mean time to understand” as a headline outcome. The next step is to make MTTU measurement-grade: consistent definitions, consistent instrumentation, and experiments that show causality.

Below are three measurable ways to operationalise MTTU in infrastructure incident management.

Instrument “understanding” as an explicit incident event

Mechanism: Add a first-class event in the incident workflow: Understanding Achieved.
Start timestamp: first credible signal (alert fired, SLO burn, user journey failure, or incident declared).
End timestamp: the moment the system (or incident commander) marks a validated root cause hypothesis (not just a guess).

Practical implementation patterns:
In Slack/Teams: a command like /understood that writes to the incident timeline.
In the RubixKube UI: a one-click “Root cause validated” action that links evidence (telemetry queries, diffs, deploys, topology path).
In API terms: an event type INCIDENT_UNDERSTOOD with required fields cause, confidence, evidence_links.
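
A minimal Python sketch of the event described above, offered as an illustration rather than RubixKube's actual API. Only the event name INCIDENT_UNDERSTOOD and the fields cause, confidence, and evidence_links come from the text. The class name, incident_id, the two timestamp field names, the 0.0 to 1.0 confidence scale, and the example URL are assumptions. The sample timestamps reuse the Cloudflare timeline cited earlier (incident declared 08:25 UTC, root cause identified 08:42 UTC); the cause string is a placeholder, not Cloudflare's finding.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class IncidentUnderstood:
    """Hypothetical shape for an INCIDENT_UNDERSTOOD event."""

    incident_id: str            # assumed identifier, not named in the text
    first_signal_at: datetime   # start timestamp: first credible signal
    understood_at: datetime     # end timestamp: validated root cause hypothesis
    cause: str                  # required field from the text
    confidence: float           # required field; 0.0-1.0 scale is an assumption
    evidence_links: tuple[str, ...]  # required field from the text

    def __post_init__(self) -> None:
        # Reject events that are missing the required fields.
        if not self.cause.strip():
            raise ValueError("cause is required")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0.0 and 1.0")
        if not self.evidence_links:
            raise ValueError("at least one evidence link is required")
        if self.understood_at < self.first_signal_at:
            raise ValueError("understood_at precedes first_signal_at")

    @property
    def mttu(self) -> timedelta:
        """Time from first credible signal to validated understanding."""
        return self.understood_at - self.first_signal_at


# Timestamps follow the Cloudflare timeline quoted in this article;
# every other value is a placeholder.
event = IncidentUnderstood(
    incident_id="example-incident",
    first_signal_at=datetime(2025, 2, 6, 8, 25, tzinfo=timezone.utc),
    understood_at=datetime(2025, 2, 6, 8, 42, tzinfo=timezone.utc),
    cause="placeholder cause (hypothetical)",
    confidence=0.9,
    evidence_links=("https://example.com/evidence/1",),
)

mttu_minutes = event.mttu.total_seconds() / 60
print(f"MTTU: {mttu_minutes:.0f} minutes")  # prints "MTTU: 17 minutes"
```

Validating the required fields when the event is created keeps a bare guess from being recorded as understanding: an event with no evidence links is rejected before it can stop the MTTU clock.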
Recording understanding as an explicit event aligns with how first-party postmortems already separate declaration and root-cause timestamps (e.g., Cloudflare).

Sample KPIs (put on one dashboard):
P50 and P95 MTTU by severity (P1, P2, P3).
MTTU share of MTTR = MTTU / MTTR (forces focus on the bottleneck).
Correct-first-hypothesis rate (percentage where first “understood” tag matches postmortem root cause).

Build an “MTTU dashboard” that ties understanding to causes, not just time

Mechanism: Segment MTTU by what actually drives it. The Google CRE maturity assessment explicitly highlights that ad hoc exploration tools can be “cumbersome”, requiring “training or hours of experimentation.” That is not just an annoyance; it is MTTU inflation.

A useful MTTU dashboard therefore needs dimensions like:
Trigger type: deploy, config change, dependency outage, capacity regression, data corruption (Google’s pipeline guidance explicitly calls out configuration bugs and dependency issues as causes worth investigating).
Evidence type used: logs-only vs metrics+traces+events (proxy for “correlation difficulty”).
Services touched: number of services in blast radius (proxy for distributed ambiguity).
Human load: number of responders / handoffs / escalations.

Sample KPIs (operational + business):
MTTU by “change adjacency”: incidents with a deploy/config in last N minutes vs not.
MTTU by blast radius: 1 service vs 5+ services.
Eng-hours spent before understanding: responders × MTTU (rough but powerful).
“User-journey SLO saved” minutes: tie understanding speed to user impact framing.

Run controlled experiments that prove RubixKube reduces MTTU, not just MTTR

Mechanism: Treat RubixKube as an intervention in the incident lifecycle and measure differences rigorously.

A clean experimental design (quarterly or monthly):
A/B by service: some teams/services run with RubixKube recommendations enabled, others use baseline tooling.
A/B by incident type: deploy-related incidents vs non-deploy incidents.
Before/after with matched incidents: compare similar incident categories over time.

What you measure:
Primary: MTTU p50/p95 (and distribution shape).
Secondary: MTTR, MTTM, responder count, number of “false lead” hypotheses, time spent in ad hoc querying.

RubixKube is built around MTTU as a first-class metric. Every investigation the platform runs is timestamped from first signal to validated root cause, and that number is surfaced as a headline outcome, not buried in logs.

---

[supplementary_faqs_same_as_llms_txt]

Q: What is RubixKube?
A: RubixKube is a Site Reliability Intelligence (SRI) platform that uses a coordinated mesh of AI agents to detect infrastructure anomalies, perform evidence-linked root cause analysis, and drive governed resolution across Kubernetes and cloud-native environments. It sits above your existing observability stack: it does not replace Datadog or Grafana but reasons over the signals they collect.

Q: What is Site Reliability Intelligence?
A: Site Reliability Intelligence (SRI) is the category RubixKube defines: software that goes beyond dashboards and alerts to autonomously detect anomalies, correlate signals across infrastructure context, run root cause analysis tied to concrete evidence, and close the loop with diagnosis and action, all with operational memory, safety guardrails, and explainability built in.

Q: How does RubixKube work?
A: RubixKube deploys a lightweight observer agent into your Kubernetes clusters that pulls telemetry on demand from sources like Jaeger, Tempo, and OpenTelemetry. That feeds a coordinated agent mesh: specialized agents (detection, triage, expert SRE analysis, spectrum correlation) that stage the incident lifecycle in auditable steps, share context through graph-backed operational memory, hand off to the next agent or a human approval gate per policy, and build causal graphs while learning from outcomes.
Incident context persists in the graph database so reliability compounds instead of resetting after every incident.

Q: What does "evidence-linked root cause analysis" mean?
A: Every root cause conclusion RubixKube produces is tied to concrete signals: specific log lines, metric anomalies, dependency graph paths, and recent changes. Engineers can inspect and audit the full reasoning trail rather than trusting a generic summary, which makes RCA findings actionable and verifiable.

Q: How is RubixKube different from Datadog, New Relic, or Grafana?
A: Datadog, New Relic, and Grafana excel at collecting and visualizing metrics, logs, and traces. RubixKube operates on the reliability layer above them: it ingests their signals, correlates context across sources, reasons about incidents using causal graphs, and drives diagnosis and resolution. It complements your observability stack rather than replacing it.

Q: How is RubixKube different from PagerDuty or Opsgenie?
A: PagerDuty and Opsgenie route alerts and manage on-call schedules. RubixKube starts where alerting ends: it investigates why an alert fired, builds an evidence trail, identifies root cause, and recommends or executes a resolution path. It reduces the manual investigation burden that begins after a page lands.

Q: How is RubixKube different from generic AIOps platforms?
A: Most AIOps tools apply statistical correlation or a one-off LLM prompt over raw telemetry. RubixKube is productized SRI: it maintains persistent operational memory across incidents, runs structured multi-agent workflows, stores context in a graph database for compounding learning, and enforces guardrails on every action. It is not a chatbot over your logs.

Q: Why not just build an internal AI SRE tool with an LLM API?
A: You can, but you will need to solve multi-tenant isolation, operational memory that persists across incidents, structured agent orchestration, guardrailed action execution, causal graph construction, and integration with your full observability and collaboration stack. RubixKube ships all of that as a managed platform so your SRE team focuses on reliability, not building and maintaining AI infrastructure.

Q: Who is RubixKube for?
A: Platform engineering and SRE teams running cloud-native or Kubernetes-heavy systems who want to reduce MTTR, cut alert noise, and build reliability that compounds over time instead of depending on heroic manual firefighting during every incident.

Q: Does RubixKube work outside of Kubernetes?
A: The core observer agent is Kubernetes-native, but RubixKube integrates with cloud APIs, general-purpose telemetry (OpenTelemetry, Jaeger, Tempo), and collaboration tools. If your infrastructure emits standard observability signals, RubixKube can reason over them.

Q: Does RubixKube make changes to production without human approval?
A: By default, RubixKube watches, analyzes, and recommends; your team decides what executes. Where you choose to enable autonomous actions, every workflow passes through configurable guardrails and approval gates so changes match your organization's policies.

Q: How does RubixKube isolate tenant data?
A: The platform is multi-tenant by design. Tenant identity is derived and validated from JWT claims at the API boundary, and those claims thread through every layer (APIs, data stores, event streams, and agent execution contexts) so one customer's operational data is never accessible to another.

Q: Is RubixKube secure for regulated environments?
A: Tenant-scoped data isolation, JWT-validated API boundaries, guardrailed agent actions with human approval gates, and full audit trails on every investigation step are built into the architecture. For specific compliance requirements, contact the RubixKube team.
Q: What does RubixKube integrate with?
A: RubixKube connects to chat (Slack), ticketing (Jira, Linear), documentation (Confluence), source control (GitHub), cloud provider APIs, and Kubernetes clusters. Incidents surface where teams already work, and resolution actions can flow back into existing workflows.

Q: How do I deploy RubixKube?
A: RubixKube is a managed SaaS platform. A lightweight Go-based observer agent deploys into your Kubernetes clusters and communicates with the RubixKube control plane. Setup connects your telemetry sources and collaboration tools and defines your guardrail policies.

Q: What is RubixKube built with?
A: The platform runs Python FastAPI microservices with Google ADK for agent orchestration, a Go-based Kubernetes observer agent, Neo4j for causal graphs, MongoDB for incident history, and NATS JetStream for event streaming. The architecture is event-driven and multi-tenant.

Q: Does RubixKube use a Kubernetes operator or sidecar?
A: RubixKube uses a lightweight, pull-based observer agent deployed into your cluster, not a sidecar on every pod. The observer queries telemetry sources on demand rather than intercepting traffic, keeping the footprint minimal and non-intrusive.

Q: How much does RubixKube cost?
A: RubixKube offers three tiers (Individual, Business, and Enterprise) based on the number of environments, monitored applications, and investigations per month, with optional Boost Packs for additional capacity. See https://rubixkube.ai/pricing for current pricing, plan limits, and the full pricing FAQ.

Q: Is there a free tier or trial for RubixKube?
A: Check https://rubixkube.ai/pricing for the latest plan options including any free or trial offerings.

Q: Where can I learn more about RubixKube?
A: The official site is https://rubixkube.ai. For pricing and plan details, visit https://rubixkube.ai/pricing. For a short machine-readable index, see https://rubixkube.ai/llms.txt.
For long-form page narratives plus this Q&A in one file, see https://rubixkube.ai/llms-full.txt.

[contact]
connect@rubixkube.ai