Why Pure AI Agents Fail in B2B (and How To Build Deterministic Workflows)

Written by corneliusrenken | Published 2026/01/20
Tech Story Tags: ai-agents | workflow-automation | b2b-saas | enterprise-ai-systems | b2b-ai-reliability | deterministic-ai | agent-architecture | llm-testing

TL;DR: Pure LLM agents struggle in B2B environments because flexibility comes at the cost of predictability; separating decision-making from execution through structured workflows makes AI systems reliable, testable, and commercially viable.

It's tempting to believe a working product can be built with a single LLM agent and a perfectly crafted prompt. Give the agent a goal and a few tools, and let it figure things out within a feedback loop. For exploratory or creative tasks, this can work remarkably well. For B2B systems, it usually doesn't.


I'm a product lead at Kombo, where I lead AI Apply, an LLM-powered system that submits tens of thousands of job applications into enterprise ATS systems every month. I ran into this problem when I first started building browser agents: it quickly became clear that to submit thousands of applications on behalf of candidates without making mistakes, guardrails need to be in place.

On paper, it looks like a perfect agent problem: messy enterprise UIs and endless variations of application forms. The output is theoretically simple: understand the underlying questions, parse them, let the candidate provide answers, and fill those answers in.


In practice, small failures quickly became expensive. A missed checkbox during parsing could mean hundreds of failed applications. A hallucinated field meant a poor candidate experience. The product was often almost right (I'd parse most fields correctly but miss one). But in B2B, where customers send us thousands of candidates and expect every single one to land in their HR system, "almost right" doesn't cut it.


The issue wasn't the LLM's quality or intelligence; it was the architecture.

Defining Agent and Workflow

When I refer to "agents" and "workflows," I'm referring to the terms coined by Anthropic in their article "Building effective agents".


In short, an agent is a single feedback loop with an input and output, where the LLM makes all intermediate choices. A workflow, on the other hand, consists of multiple predefined steps routed by traditional code. The latter offers far more stability.


Our first prototype was a true agent with a bunch of tools: give it the application form HTML, tell it to extract all fields, and let it figure out which tools to use to produce the final output. It worked in principle for simple forms (it never understood truly complex enterprise forms), but the outputs were always slightly different.


Agents Are Good at Reasoning, Bad at Execution

The mistake I made early on was letting the agent both decide what to do and execute every step.


LLMs are genuinely good at the messy parts: interpreting unclear forms, understanding dynamic pages with inconsistent labels, deciding what action should come next. However, they're much worse at the boring parts: repeating the same behavior the same way, producing stable outputs across runs, following constraints that aren't explicitly enforced.


With our product, we had a huge advantage: since I know the system will be parsing application forms and applying to them, I know what conditions it will encounter. While enterprise forms vary a lot, there are guardrails I can put in place that make the agent far less likely to produce an unexpected output, and that's what makes the product sellable.


Importantly, once an LLM is responsible for critical steps of a business (in our customers’ case: application form submissions), you need it to behave more like a deterministic function than an improvisational agent. For a pure agent architecture, this is incredibly hard to achieve. What you gain in flexibility, you lose in predictability.

Separating the Decision Making From Execution

While building the initial MVP, it became clear that leaving a complex operation, like parsing a long application form with dozens of questions, to a single orchestrator without guardrails led to inconsistent outcomes. So I pivoted away from the pure agent approach pretty quickly.


For some use cases, a true agent might work fine: when your input and output are well-defined but the path between them allows for flexibility (buying an item for a user on Amazon, for example). For my use case, though, every step along the way needs to be correct: I'm collecting metadata at each stage, and every output feeds into the next.


What finally worked was separating responsibilities:

  • Use a predefined workflow to decide what should happen
  • Use an LLM to interpret the unknown form at each workflow step


The workflow now defines explicit steps: identify all field elements on the page, then for each field extract its label, then classify its type (text input, dropdown, checkbox, etc.), then determine if it's required. The LLM handles each step, but it's no longer deciding what to do next. The architecture is.


Instead of a long prompt telling the agent about all the tasks it needs to complete, it became:

  1. The workflow defines a flow chart of moves to be taken
  2. Each move is performed by an LLM, allowing for the system to work on top of any application form
  3. Based on the LLM output, the flow chart defines where to move next
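
Here's a minimal sketch of that structure in TypeScript with Playwright. The `extractWithLLM` helper, the selectors, and the exact step granularity are illustrative, not our production code:

```typescript
// A minimal sketch of the step-routing idea, not Kombo's production code.
// `extractWithLLM` is a hypothetical helper wrapping a structured-output LLM call.
import type { Page } from 'playwright';
import { extractWithLLM } from './llm'; // hypothetical: (instruction, html) => Promise<T>

type FieldType = 'text' | 'dropdown' | 'checkbox' | 'file_upload';

interface ParsedField {
  selector: string;
  label: string;
  type: FieldType;
  required: boolean;
}

// The steps below are fixed nodes in the flow chart. The LLM only fills in the
// unknowns (what the messy HTML actually means); it never decides what comes next.
export async function parseForm(page: Page): Promise<ParsedField[]> {
  const html = await page.content();

  // Step 1: identify every field element on the page.
  const selectors = await extractWithLLM<string[]>(
    'List a CSS selector for every form field',
    html,
  );

  const fields: ParsedField[] = [];
  for (const selector of selectors) {
    const fieldHtml = await page.locator(selector).evaluate((el) => el.outerHTML);

    // Steps 2-4: label, type, required flag - one bounded LLM call per field.
    const meta = await extractWithLLM<Omit<ParsedField, 'selector'>>(
      'Return { label, type, required } for this field',
      fieldHtml,
    );
    fields.push({ selector, ...meta });
  }

  return fields;
}
```

The important property: the only thing the LLM ever returns is data. Control flow lives entirely in the surrounding code.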


After making this change, I was able to create assertions on the output of 2-hour workflows and verify that the output was exactly the same across runs. That's how much more deterministic it got.

How I Made Execution Deterministic

Everything below is basically guardrails: not because I don't trust LLMs to produce the right result, but because I needed to measure and bound their behavior.

Structured Output as a Contract

Every agent decision is returned as structured output (JSON with a predefined schema), and I use as many enums as possible instead of free-form strings.


This turns the LLM output into an interface you can rely on, so downstream code always receives input in an expected shape.
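
As a rough sketch of what that contract can look like (here with zod; the field names and enum values are illustrative):

```typescript
// A rough sketch of the output contract using zod; the field names and enum
// values here are illustrative, not the real schema.
import { z } from 'zod';

// Enums instead of free-form strings, so downstream code can switch exhaustively.
const FieldSchema = z.object({
  label: z.string(),
  type: z.enum(['text', 'dropdown', 'checkbox', 'file_upload', 'date']),
  required: z.boolean(),
});

const FormSchema = z.object({
  fields: z.array(FieldSchema),
});

export type ParsedForm = z.infer<typeof FormSchema>;

// Everything the model returns is validated at this boundary before any
// downstream code touches it; a schema violation throws immediately.
export function parseModelOutput(raw: string): ParsedForm {
  return FormSchema.parse(JSON.parse(raw));
}
```

Anything that doesn't match the schema fails loudly at the boundary instead of leaking into the rest of the workflow.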

Measuring Reliability With Iterative Testing

Once the output became structured, I could start testing it like software.


I keep fixtures for problematic forms and job postings. When something breaks in production, it becomes a test case. I then run the same extraction 50 to 100 times and measure: Did it include the expected fields? Did labels and types match what I expect?

If 49 out of 50 runs pass, the measured success rate is 98%. It's not perfect, but it gives me a solid baseline, and I can use other strategies (covered below) to close that final 2%. Just as importantly, this prevents regressions: without fixtures, you end up making "vibe-based" changes instead of engineering.


We run these iteration tests in Playwright. When a fixture fails, we add it to the suite, tweak the prompt or schema, and re-run 50 to 100 times. If the pass rate improves, we ship it. If not, we keep iterating. It's the only way to make real progress instead of guessing, and we now have a test suite of hundreds of varied application forms to ensure our system generalizes across them.
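
A stripped-down version of one such iteration test might look like this. The fixture name, the expected values, and the `parseForm` import (the workflow sketched earlier) are placeholders:

```typescript
// A stripped-down fixture test, not our actual suite. The fixture file, the
// expected values, and `parseForm` (from the workflow sketch above) are placeholders.
import { test, expect } from '@playwright/test';
import * as path from 'node:path';
import { parseForm } from './parse-form';

// The known-good parse for this fixture, checked in next to the HTML snapshot.
const expectedFields = [
  { selector: '#first-name', label: 'First name', type: 'text', required: true },
  { selector: '#resume', label: 'Resume', type: 'file_upload', required: true },
];

// Run with e.g. `npx playwright test --repeat-each=50`; 49 of 50 green runs
// reads as a ~98% success rate for this form.
test('acme-ats fixture parses to the known-good field list', async ({ page }) => {
  // Fixtures are saved copies of forms that broke in production.
  await page.goto('file://' + path.join(__dirname, 'fixtures', 'acme-ats.html'));

  const fields = await parseForm(page);

  // Assert on the structured output: selectors, labels, types, required flags.
  expect(fields).toEqual(expectedFields);
});
```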

The Final Few Percent

If a step has a small but non-zero failure rate, sampling works surprisingly well.


For crucial decisions in the workflow, I run the same decision twice in parallel and only accept the result if the outputs agree. Assuming a 2% failure rate per sample (the 49-out-of-50 pass rate above), requiring two independent samples to agree drops the chance of accepting a wrong answer to roughly 0.04%.
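
A sketch of that agreement check, with `decide` standing in for any structured LLM decision in the workflow:

```typescript
// A minimal sketch of the two-sample agreement check. `decide` stands in for
// any LLM-backed decision in the workflow that returns structured output.
async function decideWithAgreement<T>(
  decide: () => Promise<T>,
  maxAttempts = 3,
): Promise<T> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    // Take two independent samples of the same decision in parallel.
    const [a, b] = await Promise.all([decide(), decide()]);

    // Only accept when both structured outputs agree; otherwise retry the pair.
    if (JSON.stringify(a) === JSON.stringify(b)) return a;
  }
  // Fail cleanly rather than guessing which sample was right.
  throw new Error('Decision did not converge after repeated sampling');
}

// With an independent 2% failure rate per sample, both samples producing the
// same wrong answer happens at most around 0.02 * 0.02 = 0.0004, i.e. 0.04%.
```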

Verification Loops for Self-Healing

In addition, the workflow isn't a single shot. It's a loop.


After each step, I verify that the browser changed in the way I expected. Did the input value actually change? Did the page navigate? Did the validation error disappear? These are standard Playwright assertions, nothing fancy.


If the assertion fails, I know exactly which workflow step went wrong. I can retry that specific step, try an alternative approach, or fail cleanly with a specific error. I'm not at the mercy of an agent that just says "something went wrong" and tries to solve the issue by itself.
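
A simplified sketch of one such verified step. The `fillWithLLM` helper and the retry budget are hypothetical, and the verification is shown here as a plain value check:

```typescript
// A simplified sketch of one verified workflow step. `fillWithLLM` is a
// hypothetical helper that lets the model perform the actual interaction.
import type { Page } from 'playwright';

async function fillFieldVerified(
  page: Page,
  selector: string,
  value: string,
  fillWithLLM: (page: Page, selector: string, value: string) => Promise<void>,
  maxRetries = 2,
): Promise<void> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    await fillWithLLM(page, selector, value);

    // Verify the browser actually changed in the way we expected.
    const actual = await page.locator(selector).inputValue();
    if (actual === value) return; // step verified, the workflow moves on

    // We know exactly which step failed; retry it or fall through below.
  }
  // Fail cleanly with a specific error instead of submitting a broken application.
  throw new Error(`Field ${selector} never took the expected value`);
}
```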

Some of the weirdest edge cases I ran into were enterprise ATS systems where normal Playwright interactions just didn't work: the page was coded in some obscure legacy way, and clicking an input wouldn't focus it, or typing wouldn't register. The LLM would get stuck in a loop of "I tried everything, but the input just won't change." Having explicit verification at each step means I also catch this and fail cleanly instead of silently submitting a broken application.

The Takeaway

The workflow architecture keeps what LLMs are best at while removing them from the parts that need repeatability: failures are measurable, costs are predictable, regressions don't slip into production, and it made it possible to sell an infrastructure product that enterprises can trust.


In B2B, system reliability is non-negotiable, and the path forward isn't choosing between human logic and AI capability. It's architecting workflows that let each do what it does best.



Written by corneliusrenken | Product Lead at Kombo with expertise in Applied AI and Browser Automation
Published by HackerNoon on 2026/01/20