Scott Wueschinski

Your agent doesn't need more tools. It needs fewer.

Tool sprawl is the silent killer of agent reliability. Cutting tools, not adding them, is what makes agents production-grade.


Source: Anthropic Engineering, “Writing effective tools for agents”

There is a reflex in every engineering org I have worked inside. When an agent fails, the team adds a tool. The agent could not pull the invoice, so they ship a get_invoice tool. It could not reconcile the line items, so they ship reconcile_invoice. Six weeks later the agent has forty tools and a reliability problem nobody can name. The instinct that adds tools is the instinct that kills the agent.

I work as a Forward Deployed Engineer. I ship agents into production, not onto a conference stage. And the single most reliable way to make a flaky agent worse is to give it more tools. The single most reliable way to make it production-grade is to take tools away.

The tax you cannot see on the demo

Every tool an agent carries is loaded into context on every turn. Not just the name. The full description, the parameter schema, the type definitions, the enum values. That payload arrives before the user has typed a word. In Anthropic’s own engineering guidance from September 2025, the team that builds the model says it directly: “More tools don’t always lead to better outcomes.” They flag the “common error” of wrapping existing functionality without considering whether the agent can actually use it well.

Here is what that looks like in production. At a dozen tools an agent is sharp. At forty it starts borrowing arguments from the wrong tool, calling the right tool with the wrong parameters, and occasionally inventing a tool that does not exist because three of yours sound alike. The decision space widened, the model’s attention thinned, and the error rate climbed. This is not a graceful slope. It is a cliff, and your demo will sit comfortably on the safe side of it while production sits past the edge.

The cost compounds in two directions. Token cost rises roughly linearly with the number of schemas you load, so you pay more per call to be less accurate. And the failure mode is silent. A bloated agent does not throw an exception. It quietly completes the wrong task, and you find out from a customer.
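A rough way to see this overhead is to serialize your tool definitions and estimate their size. The sketch below is illustrative only: the two tool definitions are hypothetical, the four-characters-per-token ratio is a crude assumption rather than a real tokenizer, and the total ignores any prompt caching your provider may apply. Treat the numbers as order-of-magnitude.

```python
import json

# Hypothetical tool definitions in the usual name / description / JSON Schema shape.
TOOLS = [
    {
        "name": "get_invoice",
        "description": "Fetch a single invoice by its ID.",
        "input_schema": {
            "type": "object",
            "properties": {"invoice_id": {"type": "string"}},
            "required": ["invoice_id"],
        },
    },
    {
        "name": "reconcile_invoice",
        "description": "Match an invoice's line items against recorded payments.",
        "input_schema": {
            "type": "object",
            "properties": {
                "invoice_id": {"type": "string"},
                "mode": {"type": "string", "enum": ["strict", "tolerant"]},
            },
            "required": ["invoice_id"],
        },
    },
]

CHARS_PER_TOKEN = 4  # crude assumption; swap in a real tokenizer for actual numbers


def estimate_tokens(tool: dict) -> int:
    """Approximate the tokens one serialized tool definition occupies."""
    return len(json.dumps(tool)) // CHARS_PER_TOKEN


def context_overhead(tools: list[dict], turns: int) -> dict:
    """Estimate schema tokens per turn and across a whole conversation."""
    per_tool = {t["name"]: estimate_tokens(t) for t in tools}
    per_turn = sum(per_tool.values())
    # Assumes every schema is resent on every turn, with no caching.
    return {"per_tool": per_tool, "per_turn": per_turn, "total": per_turn * turns}


if __name__ == "__main__":
    report = context_overhead(TOOLS, turns=10)
    print(report["per_tool"])
    print(f"~{report['per_turn']} tokens per turn, ~{report['total']} over 10 turns")
```

Run it against your real tool list and sort by size. The largest entries are usually long enums and verbose descriptions, which makes them the first candidates for trimming.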

Consolidation is an engineering decision, not a cleanup chore

The fix the Anthropic team prescribes is the one production teams arrive at the hard way: consolidate. Their example is precise. Instead of shipping list_users, list_events, and create_event as three separate tools, build one schedule_event tool that does the whole workflow. The agent stops orchestrating low-level primitives and starts calling intent. You collapse three schemas into one, you remove two opportunities to misroute, and you move the orchestration logic out of the probabilistic layer and into deterministic code where it belongs.
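To make the consolidation concrete, here is a minimal sketch of what a schedule_event tool might look like once the three primitives move behind it. Everything except the four tool names is an assumption made for illustration: the in-memory Calendar class, the parameter names, the 30-minute search step, and the return shape are hypothetical, not Anthropic's implementation.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Calendar:
    """In-memory stand-in for a real calendar backend (hypothetical)."""
    users: dict[str, str]  # display name -> user id
    events: list[dict] = field(default_factory=list)

    # The former tools survive as plain functions the agent never sees.
    def list_users(self) -> dict[str, str]:
        return dict(self.users)

    def list_events(self, user_id: str) -> list[dict]:
        return [e for e in self.events if user_id in e["attendees"]]

    def create_event(self, title, start, end, attendees) -> dict:
        event = {"title": title, "start": start, "end": end, "attendees": list(attendees)}
        self.events.append(event)
        return event


def schedule_event(cal: Calendar, title: str, attendee_names: list[str],
                   earliest: datetime, duration_minutes: int = 30,
                   search_hours: int = 8) -> dict:
    """The single tool exposed to the agent: resolve people, find a slot, book it."""
    users = cal.list_users()
    unknown = [n for n in attendee_names if n not in users]
    if unknown:
        # Return an actionable message instead of raising, so the agent can recover.
        return {"ok": False, "error": f"Unknown attendees: {', '.join(unknown)}"}

    ids = [users[n] for n in attendee_names]
    busy = [(e["start"], e["end"]) for uid in ids for e in cal.list_events(uid)]
    duration = timedelta(minutes=duration_minutes)
    deadline = earliest + timedelta(hours=search_hours)

    start = earliest
    while start + duration <= deadline:
        end = start + duration
        # A slot is free if it ends before, or starts after, every busy interval.
        if all(end <= b_start or start >= b_end for b_start, b_end in busy):
            return {"ok": True, "event": cal.create_event(title, start, end, ids)}
        start += timedelta(minutes=30)
    return {"ok": False, "error": "No common free slot in the search window"}
```

The lookup, the conflict check, and the booking now run as deterministic code. The model supplies intent (who, what, roughly when) and gets back either a booked event or an error it can act on.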

Namespacing does the rest. Group related tools under common prefixes so the model can reason about boundaries instead of guessing across a flat list of forty look-alikes. This is not cosmetic. It reduces context load and it offloads computation from the part of the system least equipped to handle it: the model’s attention budget.
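One way to keep namespacing honest is to enforce it when tools are registered. The sketch below assumes a simple namespace_action naming convention; the namespaces and tool names are hypothetical examples, not a prescribed scheme.

```python
from collections import defaultdict

# Hypothetical namespaces and tool names, for illustration only.
ALLOWED_NAMESPACES = {"billing", "calendar", "crm"}

TOOL_NAMES = [
    "billing_get_invoice",
    "billing_reconcile_invoice",
    "calendar_schedule_event",
    "crm_search_contacts",
]


def group_by_namespace(names: list[str]) -> dict[str, list[str]]:
    """Group tool names by prefix, rejecting any tool outside a known namespace."""
    groups = defaultdict(list)
    for name in names:
        namespace, sep, action = name.partition("_")
        if not sep or not action or namespace not in ALLOWED_NAMESPACES:
            raise ValueError(f"Tool {name!r} is not under an allowed namespace")
        groups[namespace].append(action)
    return dict(groups)


if __name__ == "__main__":
    for namespace, actions in group_by_namespace(TOOL_NAMES).items():
        print(f"{namespace}: {', '.join(actions)}")
```

A check like this can run in CI so that a flat, un-prefixed tool such as get_invoice fails the build instead of quietly joining the list.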

The deeper move is to stop carrying tools at all when you do not need them. Lazy-load capability. Let the agent search for the tool it needs at the moment it needs it, rather than holding the entire catalog in context permanently. Better still, expose your tools as a code API the agent can write programs against, so the heavy surface lives on disk and only the relevant slice enters the window. The agent is good at writing code. It is bad at managing its own context. Design to that asymmetry.
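A minimal sketch of the lazy-loading idea: keep the full catalog outside the context window and expose a single search_tools tool that returns only the definitions matching the task at hand. The catalog entries, the function name, and the keyword-overlap scoring are all assumptions for illustration; a production version would more likely use embeddings or a proper search index.

```python
import re

# Hypothetical catalog held in application memory, not in the model's context.
CATALOG = {
    "billing_get_invoice": {
        "description": "Fetch one invoice by ID.",
        "input_schema": {"type": "object", "properties": {"invoice_id": {"type": "string"}}},
    },
    "billing_reconcile_invoice": {
        "description": "Match invoice line items against payments.",
        "input_schema": {"type": "object", "properties": {"invoice_id": {"type": "string"}}},
    },
    "calendar_schedule_event": {
        "description": "Find a free slot and book a meeting.",
        "input_schema": {"type": "object", "properties": {"title": {"type": "string"}}},
    },
}


def _words(text: str) -> set[str]:
    """Lowercase alphanumeric words; underscores in tool names act as separators."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search_tools(query: str, limit: int = 3) -> list[dict]:
    """Return up to `limit` tool definitions ranked by keyword overlap with the query."""
    query_words = _words(query)
    scored = []
    for name, spec in CATALOG.items():
        score = len(query_words & _words(f"{name} {spec['description']}"))
        if score:
            scored.append((score, name))
    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [{"name": name, **CATALOG[name]} for _, name in scored[:limit]]


if __name__ == "__main__":
    for tool in search_tools("reconcile invoice line items"):
        print(tool["name"])
```

Only the search_tools schema sits in context permanently. The definitions it returns enter the window for the turns that need them and can be dropped afterwards.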

The Cost of Doing Nothing (CODN)

Here is the trap. Adding a tool is cheap, fast, and feels like progress. Cutting one requires you to understand the workflow well enough to merge it, which is real engineering work. So teams default to addition, and the Cost of Doing Nothing accumulates invisibly. Nobody schedules a sprint to delete tools. The agent just gets slowly less reliable, the token bill creeps up, and trust erodes one wrong answer at a time until someone declares the whole agent “not ready for production” and nobody can point to the day it broke.

It broke the day you stopped curating. Reliability in agentic systems is not an emergent property of capability. It is a property of restraint. The teams shipping agents that actually hold up in production are not the ones with the longest tool list. They are the ones who treat every tool as a liability that must earn its place in the context window.

Audit your agent this week. Count the tools. Find the three that fire on every run and the thirty that fire on none. Merge what you can, namespace what survives, and lazy-load the rest. Then do it again next month. The agent that wins is not the one that can do everything. It is the one that carries only what the task in front of it demands, and writes code for the rest.
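The audit itself can start as a few lines over your call logs. This sketch assumes you can export, for each run, the list of tool names the agent called; the function name and the report keys are hypothetical.

```python
from collections import Counter


def audit_tools(registered: list[str], runs: list[list[str]]) -> dict:
    """Summarize tool usage; each run is the list of tool names it called."""
    calls = Counter(name for run in runs for name in run)
    # Count each tool at most once per run to find the ones used on every run.
    runs_using = Counter(name for run in runs for name in set(run))
    return {
        "every_run": sorted(n for n in registered if runs and runs_using[n] == len(runs)),
        "never_called": sorted(n for n in registered if calls[n] == 0),
        # Names the agent called that were never registered, i.e. invented tools.
        "unregistered": sorted(set(calls) - set(registered)),
    }


if __name__ == "__main__":
    registered = ["get_invoice", "reconcile_invoice", "send_email"]
    runs = [
        ["get_invoice", "reconcile_invoice"],
        ["get_invoice"],
        ["get_invoice", "fetch_invoice"],
    ]
    print(audit_tools(registered, runs))
```

Tools under never_called are candidates for removal or lazy-loading. Anything under unregistered is a name the agent made up, which often signals look-alike tools worth merging or renaming.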