General AgentsFeatured1,505 views5 likes

HDD: Human-Driven Development - How I Build AI Agents That Actually Work

The methodology that transformed my agent development from guesswork to systematic success

Sahar Carmel

• Director AI enablement

January 25, 2026 • 13 min read

HDD: Human-Driven Development - How I Build AI Agents That Actually Work

The methodology that transformed my agent development from guesswork to systematic success

A few months ago, I was building an analytics feature that needed to generate charts from natural language requests. Simple enough, right? "Show me revenue by month" → chart. I had Claude Code running, connected to BigQuery, and was feeling pretty confident.

I typed my first request, and Claude ran off to do its thing. A minute later, it came back with a query. The SQL looked reasonable. The results populated. I was about to call it a win.

Then I showed it to the analyst on the team.

"Where did it pull that from?" she asked, squinting at the query.

I looked closer. Claude had queried a table I didn't even know existed. Some legacy database that looked like it had the right column names.

"That table is garbage," she said flatly. "Nobody's touched it in months. All the real data lives in the dbt semantic layer."

This was the moment everything clicked for me.

The Problem With How We Build Agents

I've spent the last year building production AI agents at Flare and consulting with companies on AI-first development. And I keep seeing the same pattern: engineers approach agent building like they approach software architecture. They sit down, design the system upfront, define the tools, write the prompts, and deploy.

It rarely works.

The issue is that we're designing for how we think agents should work, not how they actually need to work in our specific environment. We're guessing at what tools they need, what context they require, what constraints they'll encounter.

That analyst didn't tell me the dbt layer was the source of truth because I didn't know to ask. Claude didn't know to use it because I didn't know to tell it. The gap between "reasonable-looking output" and "actually correct output" was invisible to both of us.

This is when I started developing a different approach. I call it HDD: Human-Driven Development.

What HDD Actually Is

The core idea is simple: before you build an agent to do something, watch how a human expert does it. Then, ask Claude Code to replicate that process. When it fails—and it will fail—diagnose why it failed. Each failure tells you something specific that's missing from your agent architecture.

Code

Observe human expert doing the task
         ↓
Ask Claude Code to do the same thing
         ↓
It fails
         ↓
Diagnose WHY:
  - Missing tool?
  - Context bloating?
  - Need a skill?
  - Prompt unclear?
         ↓
Fix that specific thing
         ↓
Try again
         ↓
Repeat until consistent success
         ↓
THEN migrate to production (Claude Agent SDK)

The key insight: you don't design the agent architecture upfront. You discover it through failures. Claude Code CLI is your debugging environment. The failures tell you exactly what's missing.

The Analytics Dashboard: A Case Study in HDD

Let me walk you through how this played out with the analytics dashboard project.

Phase 1: The First Failure

After the analyst told me the data was wrong, I sat with her for an hour. I watched how she actually worked. And what I saw surprised me.

She didn't just query the database. She opened the dbt repository first. She traced through the semantic layer, understanding how raw data transforms into meaningful business metrics. She looked at the ETL pipelines. Only then did she write a query—and when she did, it referenced specific dbt models, not raw tables.

Her mental model wasn't "find a table with the right columns." It was "understand how the organization processes data, then query the processed data."

Claude didn't have any of this context. It saw column names and made assumptions. Classic hallucination, but with data instead of facts.

The diagnosis: Missing domain knowledge about data architecture.

The fix: I added the entire dbt repository to Claude Code's context.

The result was immediate. Same question, completely different query. This time from the right place, with the right transformations, referencing the actual business logic that the organization uses.

Phase 2: The Chart Generation Problem

With the data problem solved, I moved on to the next challenge: generating actual charts from the queries.

I gave Claude freedom to create chart specifications. "Return JSON with the chart configuration." Simple enough.

The JSON it returned was syntactically perfect. Valid JSON, no errors. But when I rendered the charts, they were garbage. Fields that didn't exist in the data. Chart types that made no sense for the data structure. Encodings that were logically inconsistent.

This is a subtle failure mode that I've now seen across dozens of agent implementations: syntactically correct output that's semantically wrong. The LLM can generate valid JSON all day long. But "valid JSON" doesn't mean "valid chart specification for your specific system."

The diagnosis: Output space too open. Claude had no way to know what was actually possible in my charting system.

The fix: I built something I call SafeSpec—a constrained schema with exactly 10 chart types and 5 encoding fields, validated by Zod at runtime.

TypeScript

interface SafeSpec {
  type: "line" | "bar" | "horizontalBar" | "stackedBar" | 
        "pie" | "donut" | "area" | "scatter" | "metric" | "table";
  
  encoding: {
    x?: FieldEncoding;      // X-axis
    y?: FieldEncoding;      // Y-axis
    color?: ColorEncoding;  // Multi-series grouping
    value?: FieldEncoding;  // Single value (metric tiles)
    theta?: FieldEncoding;  // Pie/donut angle
  };
  
  data: {
    query: string;
    metric?: MetricReference;
  };
}

Now Claude can reason freely about what chart type fits the data, but every output is guaranteed to be valid in my system. It's like giving a child freedom to draw whatever they want—but only with the colors that exist in the box.

This is the pattern that I keep returning to: constraints that hook the LLM to your reality. It's structured output on steroids. Not just "return valid JSON," but "return JSON that can only contain values that make sense in this specific context."

Phase 3: The Subagent Discovery

As the system grew more complex, I hit another wall. The context window was filling up with all sorts of noise—dbt configurations, previous query results, chart rendering details, user conversation history. Claude was starting to lose focus on the actual task.

I wrote about this problem in my sub-agents article: every piece of information in the context window gets attention, and LLMs can't selectively ignore irrelevant information. When 30% of your context is Docker configs from a previous debugging session, the model starts thinking the main problem is infrastructure, not chart generation.

The diagnosis: Context pollution from unrelated concerns.

The fix: I decomposed the system into specialized subagents, each with its own context window:

Metric Finder: Searches the dbt semantic layer for relevant metrics
Query Builder: Constructs SQL queries against the semantic layer
Chart Selector: Determines the appropriate visualization type
Spec Validator: Validates output against SafeSpec schema

Each subagent does one thing well and returns only the essential information to the parent agent. The context stays clean. The focus stays sharp.

Phase 4: Validation

When did I know the system was ready for production?

When I took the generated queries to stakeholders—people who live in this data every day—and they said the queries were correct. Not "close enough." Not "with a few tweaks." Correct.

That's the bar. Human experts validating agent output. Not unit tests. Not benchmarks. Real domain experts confirming the work.

The dbt Semantic Layer: Why It Matters

Let me zoom in on one specific piece of this story, because it illustrates a pattern I see everywhere: the importance of organizational knowledge artifacts.

The dbt semantic layer is essentially a codified version of how an organization thinks about its data. It defines what "revenue" means. It specifies how customer segments are calculated. It documents the transformations that turn raw events into meaningful business metrics.

When I gave Claude access to this layer, something fundamental changed. It stopped making up metrics and started using the ones the organization had already defined. It understood the business logic, not just the table schemas.

According to dbt Labs, the semantic layer exists to create a "single, governed source of truth for business logic, metrics, and data contracts." This is exactly what an LLM needs: not raw data, but the organizational interpretation of that data.

This pattern extends far beyond analytics:

Legal teams have document templates and clause libraries—the institutional knowledge of how contracts should be structured
Engineering teams have architecture decision records and coding conventions—the accumulated wisdom of how to build in this specific codebase
Sales teams have playbooks and qualification frameworks—the distilled expertise of what works

Every organization has these knowledge artifacts. They're usually scattered across wikis, repos, and people's heads. When you're building agents with HDD, identifying and surfacing these artifacts is often the key breakthrough.

The Claude Code → Production Migration

Here's the workflow that emerged from months of iteration:

Development Phase (Claude Code CLI)

Do the task manually, or watch an expert do it
Ask Claude Code to replicate it
When it fails, diagnose why
Add the missing piece (tool, skill, context, constraint)
Repeat until consistent success

Production Phase (Claude Agent SDK)

Once it works reliably in Claude Code, migrate to SDK
Each subagent becomes a programmatic agent
Each tool becomes an SDK tool
Each constraint (like SafeSpec) becomes runtime validation

Why Claude Code CLI for development? Because it's designed for exploration and debugging. You can iterate quickly, see exactly what the model is thinking, and diagnose failures in real-time. The Agent SDK is built for production—containerization, session management, programmatic control—but it's not where you want to be doing discovery work.

As Anthropic's engineering team notes, agents work in a feedback loop: "gather context → take action → verify work → repeat." HDD is about making that feedback loop tight during development, before you ever deploy to production.

Why "Human-Driven" and Not "Human-Centered"?

The name is intentional. "Human-centered" suggests designing for humans. "Human-driven" means using human expertise to drive the development process.

In HDD, the human expert isn't just the end user. They're the blueprint. Their workflow, their knowledge, their judgment—these are what you're encoding into the agent system. You're not starting from abstract capabilities and hoping they'll be useful. You're starting from concrete human expertise and building the agent architecture to replicate it.

This is different from prompt engineering. You're not trying to write the perfect prompt that magically makes the agent work. You're systematically discovering, through failure, exactly what capabilities and constraints the agent needs.

It's also different from evaluation-driven development, where you build first and test against benchmarks later. In HDD, the "test" is continuous: does this work the way the human expert would do it?

Practical Principles for HDD

After applying this methodology across multiple projects and companies, some patterns have crystallized:

1. Start with observation, not implementation

Before writing any code, spend time with the human experts who do this task today. Watch them work. Ask questions. The goal isn't to interview them about what they do—it's to observe what they actually do, which is often different from what they'd describe.

2. Failures are data

When Claude fails, resist the urge to immediately fix it with a better prompt. Instead, categorize the failure:

Missing tool: Claude couldn't take an action it needed
Missing context: Claude didn't have information it needed
Missing constraint: Claude had too much freedom and hallucinated
Unclear task: The prompt was ambiguous

Each category has a different fix. Treating them all as "prompt problems" is how you end up with 3,000-word system prompts that still don't work.

3. Constraints are features

The SafeSpec example illustrates a counterintuitive truth: giving the LLM less freedom often produces better results. When the output space is constrained to only valid options, the model can focus on choosing the best option rather than inventing impossible ones.

This connects to what I wrote about in The Freedom to Explore—agents need freedom to explore and reason, but that freedom should be bounded by reality hooks that keep their outputs grounded.

4. Domain artifacts are gold

Every organization has accumulated knowledge in documents, configs, repos, and wikis. These artifacts encode institutional wisdom that took years to develop. Surfacing them for your agent is often more valuable than sophisticated prompting.

The dbt semantic layer was the breakthrough for my analytics agent. For other domains, it might be a coding style guide, a legal clause library, a sales playbook, or an architecture decision record.

5. Subagents preserve focus

When your context starts filling with irrelevant information, it's time to decompose. Each subagent should have a single, clear responsibility. They communicate through well-defined interfaces. The parent agent orchestrates without getting bogged down in implementation details.

This mirrors how effective human teams work: specialists with clear roles, communicating through defined interfaces.

6. Validation is human

The ultimate test isn't whether the agent produces syntactically correct output. It's whether domain experts say the output is actually correct. Build this validation into your development process from the start.

The Bigger Picture: Compilers for a New Era

I've written before about how Claude Code is more like a compiler than an assistant. HDD extends this metaphor.

In the early days of computing, programmers wrote assembly code by hand. Every instruction had to be exactly right. There was no margin for error. Then compilers emerged—systems that could transform high-level intent into correct low-level implementation.

We're at a similar inflection point with AI agents. The "assembly code" phase is building agents by hand-crafting prompts and hoping they work. HDD represents a more systematic approach: understanding what the agent needs through empirical observation, building constraints that guarantee valid outputs, and validating against human expertise.

The agents we're building today are the compilers of tomorrow. The question is whether we build them through guesswork or through systematic methodology.

Getting Started with HDD

If you want to try HDD on your next agent project, here's a minimal starting point:

Pick a task that a human expert currently does manually
Observe them doing it at least three times
Document the tools they use, the information they reference, the decisions they make
Ask Claude Code to do the same task
When it fails, identify which category of failure occurred
Fix that specific thing, then try again
Repeat until the output matches expert quality
Then migrate to your production environment

The process is slower than just deploying an agent and hoping it works. But the agents you build this way actually work—consistently, correctly, in ways that domain experts validate.

And that's the difference that matters.

Building agents with HDD or have questions about the methodology? Join Squid Club, our community of AI-first developers navigating this transition together.

Continue Reading

Back to Blog

HDD: Human-Driven Development - How I Build AI Agents That Actually Work

Sahar Carmel

HDD: Human-Driven Development - How I Build AI Agents That Actually Work

The Problem With How We Build Agents

What HDD Actually Is

The Analytics Dashboard: A Case Study in HDD

Phase 1: The First Failure

Phase 2: The Chart Generation Problem

Phase 3: The Subagent Discovery

Phase 4: Validation

The dbt Semantic Layer: Why It Matters

The Claude Code → Production Migration

Why "Human-Driven" and Not "Human-Centered"?

Practical Principles for HDD

1. Start with observation, not implementation

2. Failures are data

3. Constraints are features

4. Domain artifacts are gold

5. Subagents preserve focus

6. Validation is human

The Bigger Picture: Compilers for a New Era

Getting Started with HDD

Further Reading

Tags

Continue Reading