I was building a multi-agent system where a router LLM dispatched incoming requests to one of four specialized agents: a code agent, a data analysis agent, a summarization agent, and a general Q&A agent. The routing step seemed like a solved problem — feed the query, get a category, move on.
That assumption broke quickly once I added telemetry.
The Counterintuitive Finding
What I expected: low-confidence routing decisions would be the expensive ones. The router would hedge, pick the wrong agent, the agent would fail or produce a bad result, and downstream error handling would kick in.
What actually happened was the opposite. Low-confidence decisions almost always triggered a fallback path I had wired in — when confidence dropped below 0.65, the system rerouted through the general Q&A agent as a safe default. Recoverable, if slower.
The real damage came from high-confidence misroutes. When the router scored a decision 0.92 or above but still picked the wrong agent, two things went wrong simultaneously. First, the specialized agent received a request it was not equipped for and produced a confident but incorrect output. Second, the error recovery logic never ran, because the system trusted the router’s confidence score.
I had wired retries and escalation only to low-confidence branches. The high-confidence path had no safety net.
What I Changed
I stopped treating confidence as a pass/fail gate and started tracking the confidence-accuracy joint distribution — how often high-confidence decisions were actually correct versus how often they were wrong in a costly way.
def log_route(qid, agent, score, label):
correct = agent == label
bucket = "high" if score >= 0.8 else "low"
metrics.increment(
"router.decision",
tags={
"correct": str(correct),
"confidence_bucket": bucket,
}
)After a week of collecting this, the high-confidence error rate was 12%. Not catastrophic by itself, but those 12% of cases had no fallback, so they were propagating silently into final outputs.
The fix was straightforward: add a validation step after every high-confidence dispatch. The specialized agent returns a brief self-check result alongside its main output. If the self-check fails, the system escalates regardless of what the router said.
The lesson I took from this: calibration matters more than raw accuracy in routing layers. A well-calibrated router that knows when it is uncertain is safer than a high-accuracy router that does not.