Driving Blind with GenAI
Observability has always been a confusing topic in the tech world, mostly because it gets lazily conflated with monitoring.
Often, it isn’t given a first thought, a second thought, or even an afterthought by fast-moving product teams. When we are keeping it lean, under the pressure to meet our growth targets, observability will absolutely take a back seat. And when it does, your customers will see failures before you do.
How do we justify it to the board? It is not a feature; it isn’t maintenance; it isn’t tech debt; it isn’t testing.
I can’t tell you how many times I’ve heard from CTOs: we have unstructured logs, we have a metrics dashboard, what else do we need?
Let’s review what observability means.
Observability lets us explore the internal state of our systems purely from their outputs, using high-cardinality metadata we have deliberately attached.
Charity Majors, cofounder of Honeycomb.io and a passionate voice in the world of observability, sums it up succinctly: we have “known unknowns” and “unknown unknowns”.
Known unknowns are failures we can predict. They are covered by monitoring and alerts.
You have a PostgreSQL db, so you know your connection pool might cap out, and you set an alert for 80% utilization. You know your AWS SQS queue might back up, so you create CloudWatch alarms for the age of the oldest message, average time in the queue, and number of messages. You expect to hit third-party API rate limits, so you have a dashboard to monitor them.
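Known unknowns boil down to static thresholds on metrics we already decided to collect. A minimal sketch of that idea, assuming hypothetical metric names and thresholds (only the 80% pool figure comes from the examples above):

```python
# Hypothetical alert rules: metric name -> threshold that should page someone.
# Only the 80% pool utilization figure comes from the text; the other
# names and values are made up for illustration.
ALERT_RULES = {
    "db.pool.utilization": 0.80,           # fraction of pooled connections in use
    "queue.oldest_message_age_s": 300.0,   # seconds the oldest message has waited
    "api.rate_limit.used_fraction": 0.90,  # share of third-party quota consumed
}


def fired_alerts(metrics: dict[str, float]) -> list[str]:
    """Return the names of metrics that meet or exceed their threshold."""
    return sorted(
        name
        for name, threshold in ALERT_RULES.items()
        if metrics.get(name, 0.0) >= threshold
    )


if __name__ == "__main__":
    sample = {"db.pool.utilization": 0.85, "queue.oldest_message_age_s": 12.0}
    print(fired_alerts(sample))
```

Every rule here encodes a failure someone already imagined, which is exactly why this approach stops at known unknowns.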
Unknown unknowns can’t be predicted. They love to gleefully sneak past unit tests, smoke tests, and integration tests in staging. This is a tale as old as time.
I have a fond memory of a particular poison pill scenario where we accepted an input that triggered catastrophic cascading failures. We accepted a payload for a customer integration. Unfortunately for us, the application code deployed to prod was missing a WHERE clause on an UPDATE against that particular table, so it affected every single customer we had.
We had automated tests in staging, but unfortunately our tests didn’t tell us about the other 10 million rows that were updated in our table. Tests rarely catch side effects like that.
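To make the failure mode concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than our production PostgreSQL stack. The table, columns, and helper names are hypothetical. The point is that the buggy statement is valid SQL, raises nothing, and only the affected row count gives it away.

```python
import sqlite3


def make_db(rows: int = 5) -> sqlite3.Connection:
    """Create an in-memory customers table with `rows` inactive customers."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, status TEXT)")
    conn.executemany(
        "INSERT INTO customers (id, status) VALUES (?, ?)",
        [(i, "inactive") for i in range(1, rows + 1)],
    )
    conn.commit()
    return conn


def run_update(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> int:
    """Execute an UPDATE, commit it, and return how many rows it modified."""
    cur = conn.execute(sql, params)
    conn.commit()
    return cur.rowcount


if __name__ == "__main__":
    conn = make_db()
    # Intended statement: scoped to a single customer.
    print(run_update(conn, "UPDATE customers SET status = 'active' WHERE id = ?", (3,)))
    # Buggy statement: no WHERE clause, no exception, every row touched.
    print(run_update(conn, "UPDATE customers SET status = 'active'"))
```

A test that only asserts customer 3 ended up active passes against both statements. Recording the row count somewhere queryable is what exposes the difference.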
I was violently woken up by a pager one night. Our metrics dashboard was on fire: 200+ sessions on our db, degrading response times, and spiking 5xxs.
Good thing we had distributed traces, aka telemetry maps that persist context across network boundaries. I clicked on a few slow db spans (individual events within a trace) during the spike, which let me investigate raw queries. I saw an update hitting millions of rows. From there, I inspected the trace context to find the upstream parent HTTP request. This revealed the integration webhook path.
▼ [HTTP POST /webhooks/foobar] ── (Parent Span: User & customer ID here)
├── [middleware.auth] ── (Auth check passed)
├── [handler.processPayload] ── (Accepted the customer integration payload)
│ └── [db.update] ── (UPDATE customers SET … — 10M rows affected)
│ └── Attribute: db.system = "postgresql"
│ └── Attribute: db.statement = "UPDATE customers SET status = 'active' …"
│ └── Attribute: db.rows_affected = "10538427"
└── [response] ── (200 OK returned — no error thrown)
Because we were proactive and purposefully populated high cardinality metadata, I could see exactly which user triggered it.
Just kidding, that’s what I dreamed happened.
In reality, we didn’t have instrumentation yet because we couldn’t justify spending any developer velocity to implement OTEL across our platform. Instead, I had to wake up the CTO and 2 other engineers at 2am while we all painstakingly examined every commit in the last week before we spotted the glaringly missing WHERE clause.
Metrics can tell us everything is going wrong, but they can’t tell us why. We need wide structured data with lots and lots and lots and lots of context to explore our blind spots while the world is on fire.
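As a rough illustration of what "wide" means, the sketch below builds one JSON event per request using only the standard library. The function and most field names are hypothetical, and the random trace ID is a stand-in for a real one. The idea is a single event with many dimensions rather than many thin log lines.

```python
import json
import time
import uuid


def build_wide_event(route: str, customer_id: str, user_id: str,
                     status_code: int, duration_ms: float, **context) -> dict:
    """Assemble one wide, structured event describing a single request."""
    event = {
        "timestamp": time.time(),
        "trace_id": uuid.uuid4().hex,  # stand-in for a real trace ID
        "http.route": route,
        "http.response.status_code": status_code,
        "duration_ms": duration_ms,
        # High-cardinality business context: the fields you filter by at 2am.
        "app.customer_id": customer_id,
        "app.user_id": user_id,
    }
    event.update(context)  # any extra context the handler knows about
    return event


if __name__ == "__main__":
    event = build_wide_event(
        "/webhooks/foobar", "cust_123", "user_456", 200, 8421.7,
        **{"db.rows_affected": 10538427, "deploy.commit": "abc1234"},
    )
    print(json.dumps(event, sort_keys=True))
```

With events shaped like this, "which customer, which deploy, how many rows" becomes a filter instead of an archaeology project.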
So if instrumentation matters that much for traditional deterministic software, what happens when we talk about supporting GenAI?
I’ve been to a lot of talks on GenAI. We all have. Even though it feels like we are advancing at the speed of light, we are making the same exact mistakes. Honestly every talk I’ve been to has been deeply stuck in monitoring.
I get it. I know we are at the frontier and best practices are still being established. But right now it feels like we are still just accounting for the known unknowns.
We monitor when we are hitting rate limits for our LLM providers. We monitor total token usage and billing spikes. We monitor latency of our automated agents. We monitor hallucination scores. We know how to ban harmful/irrelevant words and topics with regex.
But as someone who actually has to be on call for infrastructure, which now includes non-deterministic workloads like agentic AI, I need much more than just a dashboard. I need to uncover the ‘why’ when shit hits the fan.
Fortunately the OpenTelemetry contributors have already laid down some groundwork with GenAI semantic conventions. We now have standardized namespaces like gen_ai.agent.* so we can start hydrating our context in GenAI.
Like typical software development, we need to establish a baseline. Here is an example of some of the useful namespaces for LLM calls:
| Attribute | Description |
|---|---|
| gen_ai.system | Provider (openai, etc.) |
| gen_ai.request.model | The model we asked for |
| gen_ai.response.model | The model that actually responded |
| gen_ai.usage.input_tokens | Input token count (formerly gen_ai.usage.prompt_tokens) |
| gen_ai.usage.output_tokens | Output token count (formerly gen_ai.usage.completion_tokens) |
| feature_flag.key / feature_flag.variant | Feature flags (use them with GenAI) |
| gen_ai.prompt | Raw text sent to the model |
| gen_ai.completion | Raw text returned by the model |
| gen_ai.tool.definitions | Tools available to the model |
| rpc.method | RPC method invoked |
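A minimal sketch of hydrating part of that baseline, assuming a hypothetical `LLMCall` record that your client wrapper fills in. It builds a plain dict so it runs without any SDK; with the OpenTelemetry Python API you would hand the result to `span.set_attributes(...)`. The token keys follow the `input_tokens`/`output_tokens` naming used by recent releases of the conventions; older instrumentation may still emit `prompt_tokens`/`completion_tokens`. The model and flag names in the usage example are placeholders.

```python
from dataclasses import dataclass


@dataclass
class LLMCall:
    """Hypothetical record of one model call, filled in by your client wrapper."""
    requested_model: str
    responded_model: str
    input_tokens: int
    output_tokens: int
    feature_flag_key: str


def baseline_attributes(call: LLMCall) -> dict:
    """Map one LLM call onto baseline span attributes."""
    return {
        "gen_ai.request.model": call.requested_model,
        "gen_ai.response.model": call.responded_model,
        "gen_ai.usage.input_tokens": call.input_tokens,
        "gen_ai.usage.output_tokens": call.output_tokens,
        "feature_flag.key": call.feature_flag_key,
    }


if __name__ == "__main__":
    call = LLMCall("model-a", "model-a-snapshot", 1200, 350, "new_router_prompt")
    for key, value in baseline_attributes(call).items():
        print(f"{key} = {value!r}")
```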
That’s a good start. You can hydrate even more if you want to explore the semantic conventions doc I linked. But what are the unknown unknowns?
Well, they can’t really be predicted, but let’s explore 3 specific scenarios that may lead us to some patterns we might encounter in production.
Pattern 1: Agent Recursive Loop
What happens if our agent gets stuck in a recursive tool calling loop? We notice our memory usage spiking and our bill skyrocketing. How do we figure out what is causing this?
If we set up distributed tracing for these agents, you can imagine how many nested spans a looping trace would have. We wouldn’t even need to look at logs. In many telemetry platforms we can query the number of spans in a trace, so we can look for traces that have hundreds of spans (i.e. WHERE trace.span_count > 100).
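If your platform does not expose a span count directly, the same idea is easy to sketch over exported spans. The snippet below assumes each span is a dict with a `trace_id` key, which is a simplification of what real exporters produce. The default threshold of 100 mirrors the query above.

```python
from collections import Counter


def runaway_traces(spans: list[dict], threshold: int = 100) -> dict[str, int]:
    """Return {trace_id: span_count} for traces with more than `threshold` spans."""
    counts = Counter(span["trace_id"] for span in spans)
    return {trace_id: n for trace_id, n in counts.items() if n > threshold}


if __name__ == "__main__":
    # Hypothetical export: one healthy trace and one that looped on a tool.
    spans = [{"trace_id": "healthy", "name": "gen_ai.chat"}] * 4
    spans += [{"trace_id": "looping", "name": "tool call"}] * 250
    print(runaway_traces(spans))
```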
With the trace waterfall visualizer, in this scenario we will see a massive cascade of spans after isolating the problem trace. By clicking on any of the spans with errors we can drill down into the metadata to discover what is going on. If we populate our spans with enough business context, we can pinpoint the tool or prompt state that caused it and go from there.
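Once the problem trace is isolated, a quick way to find the culprit is to count identical tool calls inside it. This is a sketch under assumptions: spans are dicts with an `attributes` mapping, `gen_ai.tool.name` comes from the GenAI conventions, and `app.tool.arguments` is a hypothetical custom attribute holding the serialized arguments.

```python
from collections import Counter


def most_repeated_tool_call(spans: list[dict]) -> tuple[tuple[str, str], int]:
    """Return ((tool_name, arguments), count) for the most frequent tool call."""
    calls = Counter(
        (
            span["attributes"]["gen_ai.tool.name"],
            span["attributes"].get("app.tool.arguments", ""),
        )
        for span in spans
        if "gen_ai.tool.name" in span.get("attributes", {})
    )
    if not calls:
        raise ValueError("trace contains no tool-call spans")
    return calls.most_common(1)[0]


if __name__ == "__main__":
    looping = {"attributes": {"gen_ai.tool.name": "search",
                              "app.tool.arguments": '{"q": "route 12"}'}}
    other = {"attributes": {"gen_ai.tool.name": "fetch"}}
    print(most_repeated_tool_call([looping] * 40 + [other]))
```

The same tool being called with the same arguments dozens of times is usually enough to tell you which prompt state to go read.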
Pattern 2: Model Regression
What if we get reports from users in our logistics app that our agent is completely butchering our multi-stop delivery routing?
I couldn’t help but use a logistics example due to my history in logistics. I know how difficult it is to calculate multi-stop routing. It isn’t hard to imagine how a non-deterministic model might miscalculate these kinds of stops given the massive number of possibilities.
This is an interesting scenario. A new model version might be significantly faster for users, and our traditional latency metrics would look healthy in our dashboards. But faster doesn’t mean better when it comes to raw reasoning.
In order to catch this decline, we would need to enrich our spans with highly relevant business context, for example:
| Attribute | Description |
|---|---|
| logistics.routing.type | Delivery type (multi_stop_delivery, last_mile) |
| logistics.routing.stop_count | Number of stops requested — crucial for understanding at what complexity it starts to unravel |
| logistics.routing.distance_miles | Total requested mileage |
| logistics.carrier.id / logistics.carrier.name | Assigned carrier, in case some are offering suboptimal routes |
| logistics.pickup.zone / logistics.delivery.zone | Regional identifiers to track anomalies or problem regions |
| app.user_feedback.score | User thumbs up/down or rating |
This business context, paired with baseline context such as gen_ai.request.model and a custom attribute like llm.prompt.template_version, would help us isolate traces purely based on negative user feedback. If we find that most complaints map back to a specific model handling a specific range of stop counts, then we can drill straight into the raw model output captured on the span to see where the reasoning failed.
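Here is a rough sketch of that slicing over exported span attributes. It assumes each record is a flat dict of attributes and that app.user_feedback.score is -1 for a thumbs down and 1 for a thumbs up. Both are assumptions for illustration, not a prescribed schema, and the model names are placeholders.

```python
from collections import defaultdict


def negative_feedback_rate(records: list[dict]) -> dict[tuple[str, int], float]:
    """Share of thumbs-down traces per (model, stop_count) bucket."""
    totals: dict[tuple[str, int], int] = defaultdict(int)
    negatives: dict[tuple[str, int], int] = defaultdict(int)
    for attrs in records:
        key = (attrs["gen_ai.request.model"], attrs["logistics.routing.stop_count"])
        totals[key] += 1
        if attrs["app.user_feedback.score"] < 0:  # assumption: -1 means thumbs down
            negatives[key] += 1
    return {key: negatives[key] / totals[key] for key in totals}


def record(model: str, stops: int, score: int) -> dict:
    """Build one hypothetical span-attribute record."""
    return {
        "gen_ai.request.model": model,
        "logistics.routing.stop_count": stops,
        "app.user_feedback.score": score,
    }


if __name__ == "__main__":
    data = [record("model-a", 12, 1), record("model-a", 12, 1),
            record("model-b", 3, 1), record("model-b", 12, -1),
            record("model-b", 12, -1), record("model-b", 12, 1)]
    for bucket, rate in sorted(negative_feedback_rate(data).items()):
        print(bucket, round(rate, 2))
```

If one bucket stands out, such as the newer model at high stop counts, that is the set of traces worth opening first.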
Pattern 3: RAG Poison Pill
What if our RAG app for enterprise document searching stops responding with valid English because an employee uploaded a bad PDF with corrupt characters?
We absolutely need visibility in a situation like this where exceptions aren’t even being thrown. And since these kinds of systems go across network boundaries, we need distributed traces.
If we populate business context, we can instantly filter queries down to a specific customer (WHERE app.customer_id == "john_doe_user").
When you open the parent trace for that user, you can follow the full path of the broken chat:
▼ [HTTP POST /api/chat] ── (Parent Span: Your user_id here)
├── [rag.embedding.generation] ── (Everything looks normal)
▼ [db.vector.search] ── (What is this?)
│ └── Attribute: custom.source_file = "doc_manual_v4_corrupted.pdf"
│ └── Attribute: db.vector.retrieved_text = "ÿØÿàJFIF"
└── [gen_ai.chat] ── (The LLM call that took garbage and spat out gibberish)
When an employee uploaded the doc_manual_v4_corrupted.pdf, an ingestion pipeline choked and sent the slop anyways. Because we were thoughtful enough to include the source file as vector metadata, we could easily trace the retrieved text back to the source file.
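If you wanted to hunt for this kind of garbage across many traces instead of one, a crude heuristic is enough to get started. The sketch below assumes flat attribute dicts using the same keys as the trace above. The ASCII-only check and the 0.9 cutoff are arbitrary assumptions that only make sense for an English-language corpus.

```python
def printable_ratio(text: str) -> float:
    """Fraction of characters that are printable ASCII or whitespace."""
    if not text:
        return 1.0
    ok = sum(1 for ch in text if ch.isascii() and (ch.isprintable() or ch.isspace()))
    return ok / len(text)


def suspicious_sources(records: list[dict], min_ratio: float = 0.9) -> list[str]:
    """Return source files whose retrieved text looks like binary garbage."""
    flagged = set()
    for attrs in records:
        text = attrs.get("db.vector.retrieved_text", "")
        if printable_ratio(text) < min_ratio:
            flagged.add(attrs.get("custom.source_file", "<unknown>"))
    return sorted(flagged)


if __name__ == "__main__":
    records = [
        {"custom.source_file": "doc_manual_v4_corrupted.pdf",
         "db.vector.retrieved_text": "ÿØÿàJFIF"},
        {"custom.source_file": "handbook.pdf",  # hypothetical healthy document
         "db.vector.retrieved_text": "Hold the reset button for ten seconds."},
    ]
    print(suspicious_sources(records))
```

None of this works unless the source file rides along as metadata on the retrieval span, which is the real lesson of this pattern.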
Moving from Monitoring to Observability
Observability isn’t some luxury we can add later once our systems have scaled. The longer we wait, the harder it becomes to circle back and come up with business context. Retrofitting is always disruptive to engineering velocity, and getting teams to cooperate will be an uphill battle. I know, I’ve been there.
GenAI introduces some of the most volatile, expensive, and often chaotic resources into our infrastructure. Dashboards with metrics green across the board aren’t going to help us know if our system is functioning as intended. When GenAI systems malfunction, they can be silent but catastrophic.
We must enrich our unified structured logs and traces with domain-specific business context. We don’t want our users to become our QA team, much less the victims of bad AI.
Despite how scary that sounds, there is some good news. Implementing OTEL in an AI stack isn’t an insane task. It is actually easier than it has ever been, thanks to zero-code auto-instrumentation.
While there can be initial friction setting up telemetry, the ongoing cost to developer velocity drops to almost zero. It becomes part of the SDLC rather than an afterthought.
I’ve helped many teams transition from monitoring to unified observability throughout my career. It is a critical step in graduating from ad hoc, brittle setups to a well-oiled machine that supports a large-scale customer base.
If we don’t invest in our ability to explore our unknown unknowns, we are just driving blind.