The conventional way to choose a base model is to read the leaderboards, run the evals on your own tasks, and pick the strongest. We did that. It is a useful exercise. It is also not what determined the choice.

What determined the choice was a question we kept coming back to whenever the conversation about model strategy reached a hard part: whose property is the model after we have finished improving it? Every other question was downstream of this one.

The frontier closed-weight models (call them Claude, GPT, Kimi) are excellent. They are also rented. You do not hold the weights, so any fine-tune the vendor allows stays with the vendor; you do not run them in your data center; and the version you bet on can be deprecated at the vendor's pace, not yours. There is real value in renting the best, especially early. There is also a ceiling on what you can build, because at some point you will want to do something the API does not let you do, and at that point the ceiling becomes the floor.

The strong open-weight models — DeepSeek, Mixtral, Llama, Qwen — let you cross that ceiling. They are not all equivalent. The license matters; the architecture matters; the family matters most.

What we considered

DeepSeek shipped impressive coder and reasoning models on permissive licenses. We ran them. They are good. Our hesitation was about pace and continuity: a strong DeepSeek-V2 does not promise a stronger DeepSeek-V3 in any particular direction, and our fine-tunes are tied to a tokenizer and an architecture, so we would have to redo them every time the underlying family pivots.

Kimi and other commercial-API offerings are closed. We discussed using them as cloud fallbacks, a role we still endorse. They could not be the foundation of work we wanted to own.

Mixtral and Llama were on the table for a long time. The deciding factor was multimodality — we needed vision and audio in the same family, with the same tokenizer, on the same release cadence. Llama 3 is excellent at text; we were not going to glue together three different vendors for the rest of the stack.

Qwen3 won, and the reasons read more like infrastructure than like benchmarks. Apache 2.0 across the family. Open weights for every variant we cared about: chat, reasoning, vision, audio, embedding, reranking. Tokenizer compatibility across all of them, which means our fine-tuning data does not need to be re-tokenized when we move between sizes. A single research org with a clear release cadence. And, importantly, an architecture we believe will accept the next generation of optimizations (quantization, expert routing, long-context training) without forcing us to start over.
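The tokenizer claim is the one item on that list that is cheap to check mechanically. What follows is a minimal sketch, not our actual tooling: the function names `vocab_diff` and `can_reuse_token_ids` are made up for illustration, and the toy vocabularies stand in for real ones. It compares two token-to-ID mappings and reports every token whose ID differs or is missing on one side. An identical mapping is a necessary condition for reusing pre-tokenized data, not a sufficient one; merge rules, normalization, and chat templates would need the same kind of comparison.

```python
from typing import Dict, List, Optional, Tuple

Mismatch = Tuple[str, Optional[int], Optional[int]]


def vocab_diff(vocab_a: Dict[str, int], vocab_b: Dict[str, int]) -> List[Mismatch]:
    """Return (token, id_in_a, id_in_b) for each token whose ID differs.

    A token missing from one vocabulary is reported with None on that side.
    """
    mismatches: List[Mismatch] = []
    for token in sorted(set(vocab_a) | set(vocab_b)):
        id_a = vocab_a.get(token)
        id_b = vocab_b.get(token)
        if id_a != id_b:
            mismatches.append((token, id_a, id_b))
    return mismatches


def can_reuse_token_ids(vocab_a: Dict[str, int], vocab_b: Dict[str, int]) -> bool:
    """True when both vocabularies map exactly the same tokens to the same IDs."""
    return not vocab_diff(vocab_a, vocab_b)


# Toy vocabularies (hypothetical): two sizes from one family, plus an outsider.
small = {"<pad>": 0, "hello": 1, "world": 2}
large = {"<pad>": 0, "hello": 1, "world": 2}
other = {"<pad>": 0, "world": 1, "hello": 2, "!": 3}

same_family = can_reuse_token_ids(small, large)
cross_family = vocab_diff(small, other)
```

With the Hugging Face `transformers` library, the real mappings could be obtained with `AutoTokenizer.from_pretrained(name).get_vocab()` for each checkpoint being compared, for example the hub repositories for Qwen3-8B and Qwen3-32B, and passed to the same functions. A non-empty diff means the stored token IDs cannot be carried over as-is.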

The downside, honestly

Qwen3 is not the strongest single-task model on every benchmark. Claude is better at long-form writing. GPT is better at certain kinds of code. There are tasks where, all else equal, we would rather have a frontier closed model than a Qwen3 fine-tune.

We accept this. The fine-tunes we are training on Qwen3-8B and Qwen3-32B are ours. They run on our hardware. They cost what they cost to run, not what someone decides to charge for them next quarter. They will exist when this round of API providers has rotated out their generation, and they will have absorbed whatever specialized training we did on them over years rather than months.

The decision is not "Qwen3 is the best model." It is "Qwen3 is the model we can compound improvements on."

The principle

There is a temptation, in a fast-moving field, to chase the best model released this week. You will be wrong about which one is best more often than your benchmarks suggest, and even when you are right, you will be replacing it inside a year. A strong base model is a starting point; what matters is the curve along which your version of it improves.

We bet on one family. We bet on owning the fine-tunes. The bet is not that Qwen3 will always be the strongest model; the bet is that our version of Qwen3, six versions in, will be the strongest model for the things we ship.

That is a slower bet. It is also a more durable one.