Building Observable AI: Why Your AI System Is a Black Box (And How to Fix It)
You've deployed your AI system to production. It's been running for two weeks. On Tuesday morning, a stakeholder emails: "The AI gave the wrong answer on this customer query. What happened?"
If you can't answer that question — if you can't trace exactly what inputs the model saw, what tools it called, what intermediate reasoning it produced, and what caused the output — your AI system is a black box. And black boxes in production are liabilities.
Observability is the property of a system that allows you to understand its internal state from its external outputs. For traditional software, this means logs, metrics, and traces. For AI systems, we need all of those plus something more: decision traces.
The Three Layers of AI Observability
Layer 1: Infrastructure Metrics
The basics. Is the system up? Is it slow? Is it expensive?
from prometheus_client import Counter, Histogram

# Request volume, broken out by model, endpoint, and success/failure
inference_requests = Counter(
    'ai_inference_requests_total',
    'Total inference requests',
    ['model', 'endpoint', 'status'],
)

# Latency distribution; buckets cover sub-second responses up to 10 s
inference_latency = Histogram(
    'ai_inference_latency_seconds',
    'Inference latency distribution',
    ['model'],
    buckets=[0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0],
)

# Token consumption, split into prompt vs. completion tokens
token_usage = Counter(
    'ai_token_usage_total',
    'Token consumption',
    ['model', 'type'],  # type: prompt / completion
)
These metrics give you alerting capability. P95 latency over 3 seconds? Page someone. Token costs up 40% this week? Investigate.
But infrastructure metrics don't tell you why an output was wrong. That requires the next layers.
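The "page someone" check from the P95 example above can be made concrete. A minimal sketch, with the 3-second threshold taken from the example and everything else illustrative:

```python
import math

def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of a list of latency samples (seconds)."""
    ordered = sorted(samples)
    rank = min(len(ordered) - 1, math.ceil(0.95 * len(ordered)) - 1)
    return ordered[rank]

def should_page(latencies: list[float], threshold_s: float = 3.0) -> bool:
    """Return True when p95 latency breaches the alert threshold."""
    return p95(latencies) > threshold_s
```

In a real deployment this check typically lives in Prometheus alerting rules (`histogram_quantile` over the histogram buckets above); the Python version just makes the arithmetic explicit.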
Layer 2: Semantic Logging
Standard application logs record events: "Request received", "Database query executed", "Response sent." Semantic logging for AI adds structured context about the meaning of what happened.
import structlog

log = structlog.get_logger()

async def run_agent(task: str, user_id: str) -> AgentResult:
    trace_id = generate_trace_id()
    log.info(
        "agent.started",
        trace_id=trace_id,
        user_id=user_id,
        task_summary=task[:200],
        model=config.model,
        tools_available=[t.name for t in tools],
    )

    result = await agent.execute(task)

    log.info(
        "agent.completed",
        trace_id=trace_id,
        steps_taken=result.steps,
        tools_called=result.tool_calls,
        output_length=len(result.output),
        total_tokens=result.token_usage,
        duration_ms=result.duration_ms,
        success=result.success,
    )
    return result
Semantic logs are queryable. You can ask: "Show me all agent runs that took more than 10 steps and failed." Or: "Which tool is called most often in failed runs?"
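The first of those queries can be answered with nothing more than JSON-lines log files. A minimal sketch, assuming the field names from the `agent.completed` event above and structlog's default `event` key:

```python
import json

def failed_long_runs(log_lines: list[str], min_steps: int = 10) -> list[dict]:
    """Filter semantic logs for agent runs that took many steps and failed."""
    matches = []
    for line in log_lines:
        event = json.loads(line)
        if (event.get("event") == "agent.completed"
                and event.get("steps_taken", 0) > min_steps
                and not event.get("success", True)):
            matches.append(event)
    return matches
```

In practice you would run the equivalent query in Loki or Datadog, but the point of structured logs is exactly that the filter is this mechanical.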
Layer 3: Decision Traces
This is where AI observability diverges from traditional observability. Decision traces capture the reasoning chain of an AI system — every thought, every tool call, every intermediate result, in order.
We represent a decision trace as a tree:
{
  "trace_id": "tr_abc123",
  "task": "Find the top 3 competitors to Acme Corp in the CRM software space",
  "root": {
    "node_id": "n_001",
    "agent": "planner",
    "action": "decompose_task",
    "output": ["search_competitors", "rank_by_relevance", "format_output"],
    "children": [
      {
        "node_id": "n_002",
        "agent": "researcher",
        "action": "web_search",
        "input": "CRM software competitors Acme Corp 2025",
        "output": ["Salesforce", "HubSpot", "Pipedrive", "Zoho"],
        "latency_ms": 847,
        "children": []
      },
      {
        "node_id": "n_003",
        "agent": "ranker",
        "action": "rank_by_market_share",
        "input": ["Salesforce", "HubSpot", "Pipedrive", "Zoho"],
        "output": ["Salesforce", "HubSpot", "Pipedrive"],
        "reasoning": "Ranked by ARR and market presence; Zoho excluded due to different market segment",
        "latency_ms": 234,
        "children": []
      }
    ]
  }
}
With a decision trace, the Tuesday morning debugging question becomes answerable in seconds. You can see exactly what inputs the researcher agent got, what it returned, and why the ranker excluded Zoho.
Evaluation as Observability
Beyond debugging individual failures, you need to know if your system is drifting — gradually getting worse over time without any single obvious failure.
We address this with continuous evaluation: a background process that randomly samples production outputs and evaluates them against a set of rubrics.
import asyncio

async def continuous_eval_worker():
    """Runs in background, samples and evaluates production outputs."""
    while True:
        sample = await production_log.sample(n=50)
        for entry in sample:
            score = await evaluator.evaluate(
                input=entry.input,
                output=entry.output,
                rubrics=["accuracy", "hallucination", "format_compliance"],
            )
            metrics.record("eval_score", score.overall, tags={
                "agent": entry.agent_id,
                "date": entry.date,
            })
            if score.overall < QUALITY_THRESHOLD:
                await alert_team(entry, score)
        await asyncio.sleep(300)  # run every 5 minutes
This gives you a rolling quality score for your AI system — a signal that surfaces gradual degradation long before customers notice.
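The "rolling quality score" itself can be as simple as a windowed mean over the recorded eval scores. A minimal sketch; the window size and quality floor here are illustrative, not recommendations:

```python
from collections import deque

class DriftDetector:
    """Flags gradual quality degradation via a rolling mean of eval scores."""

    def __init__(self, window: int = 200, floor: float = 0.8):
        self.scores: deque[float] = deque(maxlen=window)
        self.floor = floor

    def record(self, score: float) -> bool:
        """Add a score; return True once a full window's mean drops below the floor."""
        self.scores.append(score)
        mean = sum(self.scores) / len(self.scores)
        return len(self.scores) == self.scores.maxlen and mean < self.floor
```

The key property is that no single bad output trips the alarm; only a sustained decline across the window does, which is exactly the failure mode per-request alerting misses.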
The Observability Stack We Recommend
For most production AI systems, we use:
| Concern | Tool |
|---|---|
| Metrics | Prometheus + Grafana |
| Logs | structured JSON → Loki or Datadog |
| Traces | OpenTelemetry → Jaeger or Tempo |
| Decision traces | Custom store (Postgres + S3) |
| Evals | Custom evaluator + Grafana dashboard |
The decision trace store deserves special attention. We store trace trees in Postgres (for structured querying) and the full trace JSON in S3 (for cost efficiency). A typical query looks like:
SELECT
    trace_id,
    task_summary,
    total_steps,
    total_latency_ms,
    success,
    root_cause_node
FROM decision_traces
WHERE
    success = false
    AND date > NOW() - INTERVAL '7 days'
    AND agent_id = 'researcher'
ORDER BY total_latency_ms DESC
LIMIT 20;
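The dual write behind that table can be sketched as follows. Everything here is illustrative: the row columns mirror the query above, and the actual Postgres and S3 clients (e.g. psycopg and boto3) are assumed and left as comments:

```python
import hashlib
import json
from datetime import datetime, timezone

def count_nodes(node: dict) -> int:
    """Total nodes in a decision-trace tree (one per step)."""
    return 1 + sum(count_nodes(c) for c in node.get("children", []))

def prepare_trace_write(trace: dict) -> tuple[dict, str, str]:
    """Split a trace into a queryable Postgres row and a full-JSON S3 payload."""
    body = json.dumps(trace, sort_keys=True)
    # Content-addressed key keeps re-uploads of the same trace idempotent
    digest = hashlib.sha256(body.encode()).hexdigest()
    key = f"traces/{trace['trace_id']}/{digest}.json"
    row = {
        "trace_id": trace["trace_id"],
        "task_summary": trace["task"][:200],
        "total_steps": count_nodes(trace["root"]),
        "s3_key": key,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
    # In production: INSERT the row via psycopg, PUT the body to S3 via boto3.
    return row, key, body
```

Keeping only the summary columns in Postgres keeps the table small enough to query interactively, while the full trace JSON stays cheap in S3.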
Start Before You Launch
The biggest mistake teams make is building observability after launch, when something has already gone wrong. By then, you have no baseline and no historical data to compare against.
Build observability into your AI system from day one. It will feel like overhead until the day you desperately need it — and then it will feel like the most important investment you made.
Sam Okoye is Head of Engineering at PrismGraph Technologies, with 15 years of experience building real-time systems and a passion for reliability engineering and elegant APIs.