All About AI Observability (The Context Window #03)

In this episode of The Context Window, Grafana’s livestream series on AI in observability, I’m joined by Tiffany Jernigan and the engineers behind the product, Alexander Sniffin and Jack Gordley, for a deep dive into AI Observability in Grafana Cloud. We unpack what it actually is (a new way to instrument your AI apps and see canonical data about them, sitting alongside your existing telemetry), what an evaluation is in this context (including the difference between online evals, which run on live traffic, and offline evals, which run on a dataset of recorded conversations), and we walk through real demos: setting up AI Observability and evaluators, the analytics view, instrumenting a local coding agent, and Jack’s system prompt analysis tool. We also dig into topics that rarely get covered elsewhere: LLM-as-judge as an evaluation method, which evaluators catch the most bugs, and the cheating-LLM phenomenon.
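
To make the evaluation discussion concrete, here is a minimal sketch of the LLM-as-judge pattern: a second model grades an answer produced by the model under test. The judge prompt, the 1-5 scale, the gpt-4o-mini model name, and the tiny dataset are all illustrative assumptions for this post; they are not the built-in evaluators in Grafana Cloud’s AI Observability.

```python
# Minimal LLM-as-judge sketch (assumptions: OpenAI Python SDK v1+, an
# OPENAI_API_KEY in the environment, and an illustrative judge prompt).
import re

from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Answer: {answer}
Rate how helpful and factually grounded the answer is on a scale of 1-5.
Reply with only the number."""


def judge(question: str, answer: str, judge_model: str = "gpt-4o-mini") -> int:
    """Ask the judge model for a 1-5 score and parse the first digit it returns."""
    response = client.chat.completions.create(
        model=judge_model,
        temperature=0,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
    )
    match = re.search(r"[1-5]", response.choices[0].message.content or "")
    return int(match.group()) if match else 0


# Offline eval: score a fixed dataset of recorded conversations.
# An online eval would apply the same scoring to sampled live traffic instead.
dataset = [
    {"question": "What port does Grafana listen on by default?", "answer": "3000"},
]
for example in dataset:
    print(example["question"], "->", judge(example["question"], example["answer"]))
```

The same judge function covers both modes: point it at a saved dataset for offline runs, or at a sample of production conversations for online evaluation.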

Timestamps

  • 00:00:00 — Introductions
  • 00:02:00 — The last month in AI news
  • 00:11:09 — What is AI Observability in Grafana Cloud?
  • 00:19:20 — What is an evaluation?
  • 00:21:05 — The origin of AI Observability
  • 00:25:09 — Demo: Setting up AI Observability and evaluators
  • 00:32:04 — What is LLM as judge?
  • 00:38:43 — Demo: AI Observability Analytics
  • 00:40:52 — AI O11y is based on OpenTelemetry
  • 00:42:17 — Demo: Instrumenting a local coding agent
  • 00:47:18 — Potential future agentic use cases
  • 00:52:00 — Evaluators that catch the most bugs
  • 01:02:23 — Demo: System prompt analysis
  • 01:05:11 — Guess the prompt

Resources

News from the episode

AI Observability

Mentioned

See Also