Llana was supposed to be small. The plan was a code-generation tool — type a request, get a function, ship. The plan ran into reality in roughly the order such plans do.

The first version generated reasonable code, then crashed if any step failed. The user would type a goal, watch a few minutes of work happen, and then watch the whole thing dissolve because a test runner returned a non-zero exit code we had not handled. The polite response was to retry. The honest response was that the user had to start over from scratch, and they did not, because they did the rational thing and stopped using the tool.

The second version did not crash, but it did something almost as bad: it would silently lose the work it had done when its context window ran out. The user would come back to a partially-written PR with no memory of how it had been started, no plan to finish it, and no graceful way to ask Llana to please continue. Adoption did not move.

By the fourth or fifth restart we had stopped pretending we were building a tool and started building an agent. The distinction is small in the abstract and enormous in practice. A tool produces an artifact when invoked. An agent maintains a process across invocations, recovers from interruption, and is judged less on its peak output than on the rate at which its work is salvageable.

Three things that turned out to matter more than we thought

Cancellation and resumption are the core loop, not a polish item. The single biggest unlock in Llana's trajectory was a feature we had been treating as nice-to-have: when an agent run is cancelled or fails, it checkpoints the completed tasks, the pending tasks, and a snapshot of its own reasoning about why it did what it did. Next time the user re-runs, Llana rebuilds context from this snapshot and continues. Adoption tripled the week we shipped it. Users were not asking us to make the agent faster; they were asking us to make it possible to walk away from it for an hour and come back without losing the day. The general lesson, which we now apply elsewhere: a system's UX is not defined by its performance on the happy path. It is defined by how it behaves when interrupted.
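A minimal sketch of that checkpoint-and-resume loop, under stated assumptions rather than as Llana's actual implementation. The names `Checkpoint`, `save_checkpoint`, `load_checkpoint`, and `run_agent` are hypothetical, as are the JSON file format and the idea that each task returns a short note about why it did what it did.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass, field
from typing import Callable, Optional


@dataclass
class Checkpoint:
    """Hypothetical snapshot: finished work, remaining work, and reasoning notes."""
    completed: list[str] = field(default_factory=list)
    pending: list[str] = field(default_factory=list)
    reasoning: list[str] = field(default_factory=list)


def save_checkpoint(cp: Checkpoint, path: str) -> None:
    # Write to a temp file in the same directory, then swap it in with
    # os.replace, so an interruption mid-write cannot leave a half-written file.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(asdict(cp), handle)
    os.replace(tmp_path, path)


def load_checkpoint(path: str) -> Optional[Checkpoint]:
    if not os.path.exists(path):
        return None
    with open(path) as handle:
        return Checkpoint(**json.load(handle))


def run_agent(tasks: list[str], execute: Callable[[str], str], path: str) -> Checkpoint:
    # If a checkpoint exists, resume from it and ignore `tasks`;
    # otherwise start a fresh run with every task pending.
    cp = load_checkpoint(path) or Checkpoint(pending=list(tasks))
    while cp.pending:
        task = cp.pending[0]
        try:
            note = execute(task)  # assumed to return a short reasoning note
        except (Exception, KeyboardInterrupt):
            # Failure or cancellation: persist what we have, then re-raise.
            save_checkpoint(cp, path)
            raise
        cp.pending.pop(0)
        cp.completed.append(task)
        cp.reasoning.append(f"{task}: {note}")
        save_checkpoint(cp, path)  # checkpoint at every task boundary
    return cp
```

Calling `run_agent` a second time with the same `path` skips whatever is already listed under `completed` and picks up at the first pending task, which is the walk-away-and-come-back behaviour described above.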

Plans are skeletons, not decoration. Early Llana had a flat execution loop — try to do the goal, hope it works. The loop was shallow because we thought the model was smart enough to keep its own implicit plan in working memory. It was not, and even when it was, we could not inspect the plan, intervene on the plan, or replan only the failing part. We added explicit multi-phase planning with dependency graphs. Each phase has clear entry and exit conditions. When phase three fails, Llana can replan phase three without losing the work it did in phases one and two. The model is no smarter than it was. The agent is much more competent. The plan is the difference.
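One way to picture a plan with dependencies, entry and exit conditions, and per-phase replanning is the sketch below. It is an illustration under assumptions, not Llana's planner: `Phase`, `execute_plan`, and the `replan` callback are hypothetical names, and the ordering uses the standard library's `graphlib.TopologicalSorter` (Python 3.9+).

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter
from typing import Any, Callable


def always_true(_: Any) -> bool:
    return True


@dataclass
class Phase:
    """Hypothetical plan phase with explicit entry and exit conditions."""
    name: str
    run: Callable[[dict], Any]                    # receives results of earlier phases
    depends_on: tuple = ()
    entry_ok: Callable[[dict], bool] = always_true  # may this phase start?
    exit_ok: Callable[[Any], bool] = always_true    # did it produce acceptable output?


def execute_plan(phases, replan, results=None, max_replans=1):
    """Run phases in dependency order; on failure, replan only the failing phase."""
    results = dict(results or {})
    by_name = {p.name: p for p in phases}
    # TopologicalSorter takes a mapping of node -> predecessors.
    graph = {p.name: set(p.depends_on) for p in phases}
    for name in TopologicalSorter(graph).static_order():
        if name in results:
            continue  # work kept from an earlier run is not redone
        phase = by_name[name]
        for attempt in range(max_replans + 1):
            try:
                if not phase.entry_ok(results):
                    raise RuntimeError(f"entry condition not met for {name}")
                output = phase.run(results)
                if not phase.exit_ok(output):
                    raise RuntimeError(f"exit condition not met for {name}")
                results[name] = output
                break
            except Exception as err:
                if attempt == max_replans:
                    raise
                # Ask for a new version of this phase only; earlier results stay.
                phase = replan(phase, err)
    return results
```

Because completed phases live in `results` and are never re-entered, a failure in the third phase costs one replan of that phase and nothing else.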

Multi-model routing beats single-model optimization. We spent weeks tuning Qwen3-32B for code generation before noticing that most of Llana's wall-clock time was being spent on tasks the 32B was overkill for — repository discovery, file enumeration, test triage. We added a phase router that hands cheap, structured tasks to a faster 8B, expensive reasoning to the 32B, and falls back to a closed-weight cloud model when the on-prem stack times out. Same Llana. Same outputs. Three times faster on average and noticeably more reliable in the tail.
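The routing rule can be sketched in a few lines. This is an assumption-laden illustration, not the router Llana ships: `make_router`, `CHEAP_KINDS`, and the task-kind strings are hypothetical, each model is represented as a plain callable, and an on-prem timeout is assumed to surface as Python's built-in `TimeoutError`.

```python
from typing import Callable

Model = Callable[[str], str]  # hypothetical: prompt in, completion out

# Task kinds the small model is assumed to handle well (taken from the text above).
CHEAP_KINDS = {"repository_discovery", "file_enumeration", "test_triage"}


def make_router(small: Model, large: Model, cloud: Model) -> Callable[[str, str], str]:
    """Build a router: cheap structured tasks go to `small`, everything else
    to `large`, and `cloud` is used only when the on-prem call times out."""

    def route(kind: str, prompt: str) -> str:
        on_prem = small if kind in CHEAP_KINDS else large
        try:
            return on_prem(prompt)
        except TimeoutError:
            # Fallback path: the on-prem stack did not answer in time.
            return cloud(prompt)

    return route
```

Keeping the routing decision in one small function makes it easy to log which tier handled each phase, which is where the wall-clock savings become visible.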

What we would do differently

We would ship the operational loop first, and the model second. This is almost the opposite of what felt right at the time.

The instinct, when you have a strong model, is to wrap it in just enough scaffolding to expose its strength. The instinct is correct in narrow ways and wrong in important ones. What users feel, day to day, is whether the agent crashes; whether it loses work; whether it can pick up where it left off when they need to make coffee. None of those things are properties of the model. All of them are properties of the loop you wrap around it. We spent weeks on tuning curves, weeks we should have spent on supervisor architecture.

Model quality matters. It matters less, by a lot, than the question of whether the system is willing to admit it has failed in a way the user can recover from. A weaker model with a stronger loop will be the better product for almost everyone, almost all the time.

The shape of an agent we now believe in

Plans up front. Each plan step revocable. Each plan step able to fail without taking down the run. State checkpointed at every meaningful boundary. A clear distinction between the agent's understanding of what it is doing and the action it is currently taking, so the former can survive the latter going wrong. A multi-tier model stack with cheap models for cheap work, expensive models for expensive work, and a fallback for when the primary tier times out. A user-visible record of what was tried, what worked, what didn't, and what the agent now believes.

None of this is glamorous. None of it shows up in a benchmark. It is, as far as we can tell, what makes the difference between an agent people use and one they politely close.