The Freedom to Explore: Why Open Agents Outperform Rigid Workflows
Discover why flexible AI agents that can ask questions and explore freely consistently outperform rigid workflows. Learn how this approach mirrors real-world problem-solving, backed by research showing significant performance improvements in coding, business, and enterprise applications.

I see it every day: engineers battling with prompts, trying to feed their AI coding agents the perfect information upfront. They optimize, they refine, they craft these elaborate prompts hoping the agent will magically understand everything and complete the task effortlessly. Most of the time, this fails spectacularly, leading to frustration. Why? Because it's nearly impossible to optimize one perfect prompt for each unique task.
My approach with coding agents is fundamentally different. Instead of trying to do all the work for them, I give my agents the relevant information they need to explore and discover. I encourage them to ask questions. Remember, for now, agents are helpful assistants – if you don't tell them to ask questions, they default to being yes-men, eager to please. But when you explicitly encourage questions, their helpfulness manifests as curiosity about what they don't understand. We do the "prompt engineering" together, enriching the context collaboratively.
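To make that concrete, the instruction can be as simple as a few lines in the system prompt. The wording below is illustrative, not lifted from any particular tool:

```python
# Illustrative system prompt: the exact wording here is mine, not from any specific product.
SYSTEM_PROMPT = """You are a coding assistant working alongside a human engineer.
Before writing or changing code:
- Explore the repository with your tools to build your own picture of it.
- Ask clarifying questions whenever the task, constraints, or intent are unclear.
- Share what you found and confirm the plan before making edits.
Do not guess silently; asking is always preferred over assuming."""
```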
Think about it this way: imagine bringing a junior engineer to your office at the start of the week and saying, "I'm going to talk for an hour and give you all the details upfront. We can't speak during the week, and you need to deliver everything by Friday."
Compare that to bringing them to the office, giving them a general overview of the task, and encouraging them to explore the project. Based on their discoveries, you continue the discussion, building on insights from each phase until you both understand the project and task thoroughly.
Recent research from Princeton confirms this intuition: rigid pipeline approaches consistently underperform flexible agent architectures on complex, real-world tasks. In fact, Anthropic's multi-agent research system achieved a 90.2% performance improvement over single-agent approaches by embracing this exploratory paradigm.
This insight struck me while working with Claude Code. Instead of crafting long, detailed prompts, I shifted to small questions and symbiotic collaboration. Tasks like "let's get to know this service, understand how it works, and figure out how to implement this fix" or "go over this repo and find all database references and configurations" became my new approach.
(For a deeper technical dive into how Claude Code achieves this, see my analysis of Claude Code vs Cline architectures.)
Most of the time, my mental model of the underlying infrastructure wasn't quite accurate. The agent sees everything – every forgotten corner of code, every patch we've applied. The frustration comes when we assume the infrastructure works one way while the LLM is reading the code as it actually exists – and the two don't match.
Others have reached the same conclusion. Phil Schmid's research on context engineering shows that "the field is shifting from prompt engineering to context engineering – providing the right information and tools at the right time, rather than attempting to encode all possibilities in advance."
After reverse-engineering both Claude Code and Cline, I saw a stark difference between their approaches and those of Cursor and Copilot, which lean on structured methods like semantic retrieval and indexing.
According to discussions in the developer community, Cline's filesystem traversal approach means it "reads files in logical order, following imports and dependencies." This mirrors how developers actually navigate codebases, rather than relying on vector similarity.
What makes Claude Code feel smoother? Efficient tools and thoughtful prompt engineering. The way Anthropic designed Claude Code's code editing prompts made all the difference. Cline (an excellent tool that optimizes for many models) has struggled for months with its editing tools – sometimes missing the actual edit location, other times rewriting entire files. Claude Code? Near-zero failures.
"Getting context takes time," you might say. Of course it does. Cline can take forever to understand a repository's context, which is why they introduced memory banks. Claude Code? It uses concurrent agents that crawl entire repositories in seconds, understanding the code's actual state. Memory banks' biggest flaw is the inevitable drift between cached memory and actual code – the same problem plaguing Cursor's indexing.
(I go deep into the technical architecture differences in my post on reverse-engineering these tools, including how parallel execution changes everything.)
Taking these insights to my AI engineering work yielded phenomenal results. Instead of viewing every use case as a pipeline needing precise configuration and fine-tuning, I started thinking in terms of tools and system prompts – giving agents the breadth to make decisions and gather context.
In my work, I need to gather information for legal cases from communications (SMS, calls, emails), platform metadata, documents, and countless other data points. Trying to predict what's needed for each insight is nearly impossible. So I leaned into general agents with well-executed tools, and the results were fascinating: insights surfaced without a single hard-coded pipeline, with both charm and efficiency.
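In practice, that means describing each data source as a tool and letting the agent decide what to pull. The tool names and schemas below are hypothetical, not my production setup, but they show the shape of it:

```python
# Hypothetical tool specs: names and fields are illustrative only.
CASE_TOOLS = [
    {
        "name": "search_communications",
        "description": "Search SMS, call logs, and emails for a case by keyword and date range.",
        "input_schema": {
            "type": "object",
            "properties": {
                "case_id": {"type": "string"},
                "query": {"type": "string"},
                "start_date": {"type": "string", "description": "ISO date"},
                "end_date": {"type": "string", "description": "ISO date"},
            },
            "required": ["case_id", "query"],
        },
    },
    {
        "name": "get_platform_metadata",
        "description": "Fetch platform metadata (devices, accounts, locations) linked to a case.",
        "input_schema": {
            "type": "object",
            "properties": {"case_id": {"type": "string"}},
            "required": ["case_id"],
        },
    },
]
```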
Berkeley's research on compound AI systems validates this approach: "60% of LLM applications now use RAG and 30% use multi-step chains," demonstrating a clear move away from single-model approaches toward flexible, compound systems.
Another killer feature of "free" agents? They're easy to start with. No need for complex frameworks (LangGraph, Agno, etc.) – just a simple loop, tool calls, and system prompts. The devil, as always, is in the details.
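To show how little scaffolding that takes, here's a minimal sketch of such a loop using the Anthropic Messages API. The read_file tool, the prompt wording, and the model id are illustrative placeholders; swap in your own:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM = (
    "You are a coding assistant. Explore the repository with your tools and "
    "ask clarifying questions before making changes."
)

TOOLS = [{
    "name": "read_file",
    "description": "Read a file from the repository and return its contents.",
    "input_schema": {
        "type": "object",
        "properties": {"path": {"type": "string", "description": "Path relative to the repo root"}},
        "required": ["path"],
    },
}]

def read_file(path: str) -> str:
    with open(path, "r", encoding="utf-8") as f:
        return f.read()

def run_agent(task: str) -> str:
    messages = [{"role": "user", "content": task}]
    while True:
        response = client.messages.create(
            model="claude-sonnet-4-20250514",  # use whichever tool-capable model you have access to
            max_tokens=4096,
            system=SYSTEM,
            tools=TOOLS,
            messages=messages,
        )
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason != "tool_use":
            # The model answered, or asked us a question -- hand control back to the human.
            return "".join(block.text for block in response.content if block.type == "text")
        # Execute every requested tool call and feed the results back into the loop.
        results = []
        for block in response.content:
            if block.type == "tool_use":
                output = read_file(**block.input) if block.name == "read_file" else "unknown tool"
                results.append({"type": "tool_result", "tool_use_id": block.id, "content": output})
        messages.append({"role": "user", "content": results})
```

That's the whole trick: a loop, a handful of tools, and a system prompt that invites exploration.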
But here's where it gets interesting: optimizing a single tool or system prompt has compound effects. Every insight built on top of it gets a sudden performance boost. It truly feels like training a worker to do the job you were supposed to do yourself.
Research shows that letting agents test tools and rewrite their descriptions cut completion times for future agents by 40%. Three factors – token usage (80%), tool calls, and model choice – explained 95% of performance variance in comprehensive evaluations.
The pattern extends far beyond coding. In healthcare, rigid clinical decision support systems are giving way to adaptive AI agents that learn from patient outcomes, with one academic medical center reporting 23% improved diagnostic accuracy.
Customer service provides particularly clear evidence. Unity saved $1.3 million in service costs through collaborative agentic workflows, while 90% of businesses surveyed by Zendesk now use AI agents for routing. The key differentiator? Flexible agents that maintain conversation context and learn from interactions consistently outperform rigid workflow automation.
Even Uber reports that generative AI and agentic AI boost engineers' productivity and cut delays – not through rigid pipelines, but through adaptive systems that understand context.
The concept I'm most fascinated by is the evolution of agents around users. We're used to static software: it behaves the same every time, aside from changing data and refreshed dashboards. Now imagine software where everyone starts from the same seed, but the agents evolve around each user – understanding their needs, actions, and priorities, and adjusting their tools, system prompts, and actions accordingly.
Imagine software with both a database and memory. Software that sleeps and changes to grow with your needs. This is the future I'm dreaming about.
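Here's a toy sketch of what I mean (the storage, field names, and prompt are entirely hypothetical): a per-user memory feeds the system prompt, and the agent writes back what it learns after each session.

```python
# Hypothetical sketch of an agent that evolves around its user: a per-user
# memory store feeds the system prompt, and each session can write back to it.
import json
from pathlib import Path

MEMORY_DIR = Path("user_memory")  # illustrative storage; a real system might use a database

def load_memory(user_id: str) -> dict:
    path = MEMORY_DIR / f"{user_id}.json"
    return json.loads(path.read_text()) if path.exists() else {"preferences": [], "recurring_tasks": []}

def build_system_prompt(user_id: str) -> str:
    memory = load_memory(user_id)
    return (
        "You are a personal assistant for this user.\n"
        f"Known preferences: {', '.join(memory['preferences']) or 'none yet'}\n"
        f"Recurring tasks: {', '.join(memory['recurring_tasks']) or 'none yet'}\n"
        "Adapt your behaviour to what you know about this user."
    )

def remember(user_id: str, key: str, value: str) -> None:
    # Called by the agent (e.g. via a "remember" tool) when it learns something durable.
    memory = load_memory(user_id)
    memory.setdefault(key, []).append(value)
    MEMORY_DIR.mkdir(exist_ok=True)
    (MEMORY_DIR / f"{user_id}.json").write_text(json.dumps(memory, indent=2))
```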
This evolution parallels what I explored in my article about LLMs as the new assembly code of AI – just as computing evolved from punch cards to personal computers, we're witnessing AI's evolution from rigid tools to adaptive systems.
McKinsey predicts that 33% of enterprise software will include agentic AI by 2028, with 15% of daily work decisions made autonomously. Microsoft's Copilot already learns user preferences and work patterns, while healthcare agents continuously improve diagnostic accuracy based on patient outcomes.
Despite the clear advantages of flexible architectures, production reality is nuanced. Industry analysis shows that 95% of companies use generative AI with 79% implementing AI agents, but only 1% consider their implementations "mature."
The most successful pattern? Hybrid approaches that combine workflow reliability with agent adaptability. Use workflows for the 80% of predictable tasks, reserve agents for the 20% requiring dynamic reasoning. As one industry expert noted, "In the real world, value comes from what works. Not what wows."
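In code, that hybrid often reduces to a thin router: deterministic workflows for the requests you can predict, the exploratory agent loop for everything else. The intents and handlers below are made up for illustration:

```python
# A minimal sketch of the hybrid pattern: known, predictable intents go through
# fixed workflows; anything else falls back to the exploratory agent loop.
# The intents and handlers here are illustrative.

def reset_password_workflow(request: dict) -> str:
    return f"Password reset link sent to {request['email']}"

def refund_workflow(request: dict) -> str:
    return f"Refund issued for order {request['order_id']}"

WORKFLOWS = {
    "reset_password": reset_password_workflow,
    "refund": refund_workflow,
}

def handle(request: dict) -> str:
    intent = request.get("intent")
    if intent in WORKFLOWS:                 # the predictable ~80%
        return WORKFLOWS[intent](request)
    return run_agent(request["message"])    # the ~20% that needs dynamic reasoning

def run_agent(message: str) -> str:
    # Stand-in for the agent loop sketched earlier in the post.
    return f"[agent] exploring and resolving: {message}"
```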
The evidence is overwhelming: open, tool-based agents outperform rigid, predetermined workflows – but with critical nuances. Success lies not in choosing one approach over the other, but in understanding when each delivers maximum value.
The shift from front-loading all context to collaborative exploration represents a fundamental change in how we think about AI systems. It's not about crafting the perfect prompt; it's about creating the conditions for discovery. It's not about predicting every need; it's about building systems that can adapt when needs change.
As we move forward, remember: the best agent architecture is the one that matches the problem complexity. In a world of increasing complexity and changing requirements, that increasingly means architectures that can adapt, learn, and evolve – just like the problems they're designed to solve.
The future belongs to agents that ask questions, explore freely, and grow with their users. The question isn't whether to embrace this approach – it's how quickly we can make the transition.