The hard part is in the trenches
Follow the research and you'd think the whole game is speed. For years the industry has organized itself around one question: who has the most capable model. It is a real race. The labs leapfrog each other every few weeks, each new model more capable than the last. Every few months it is a new modality, too: text, images, voice, video. It is genuinely fast, and it is exciting to watch.
But that has quietly stopped being the question that decides our deployments. When a project succeeds or fails, the deciding factor is almost never the capability of the model underneath. It is whether the organization can become a different version of itself. And that work is the opposite of fast.
Out in the trenches, nothing generalizes. It is a particular ministry's process, a language, a regulation, an exception nobody wrote down, a way one team has always done one thing. These things don’t fold into one winning model. The devil is in the details, and the details are local, and there are a great many of them.
Reinventing an organization around AI takes longer than almost anyone expects going in, and it should. It asks for patience, for kindness toward the people whose roles are being rewritten, and for technical and non-technical people sitting in the same room long enough to build the new way together.
Keeping control and staying flexible
In January I framed sovereignty mostly as a national concern: where data lives, who controls the model, whose rules apply. That is still true and still matters. But it is just as much a company question.
A company’s context is specific to it and to no one else: its processes, its exceptions, the data that records how it actually works. This data is also often the most sensitive, and in many places it legally cannot leave the building. So the system has to come to the data, not the other way around: in the cloud, on-premise, or fully air-gapped. The same thing that makes this work hard makes it valuable.
The reinvention of a company with AI at its core is also something deeper than adopting a tool. It is the slow rewiring of how decisions get made, how knowledge moves, and how people and systems work together. Over time it becomes the core of how the company runs.
Which is why it should not depend on a single model provider. When a company builds its core on one model, that provider's decisions become its own: a price change, a retired version, an outage, a shift in terms, and the thing it reinvented itself around is suddenly governed by a choice it did not make. The companies that do well will not bet everything on one. They will stay able to move between models, and to change course when necessary, without losing the value they have built. Flexibility is not only protection. It is also an advantage. The right combination of models, each used where it is strongest, tends to outperform any single frontier model on its own. The best system is rarely one model. It is several, orchestrated.
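As a rough illustration of what "able to move between models" can mean in practice, here is a minimal Python sketch. Everything in it is hypothetical: `ModelRouter`, the task names, and the two stub models stand in for whatever providers a company actually uses. The only idea it shows is that the rest of the system talks to a task name, not to a provider, so a fallback or a swap is a configuration change.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: every model, whatever its provider, is wrapped as prompt -> text.
Model = Callable[[str], str]


class ModelRouter:
    """Route each task to an ordered list of models and fall back on failure."""

    def __init__(self, models: Dict[str, Model], routes: Dict[str, List[str]]):
        self.models = models
        self.routes = routes

    def run(self, task: str, prompt: str) -> Tuple[str, str]:
        """Return (name of the model that answered, its output)."""
        errors = []
        for name in self.routes[task]:
            try:
                return name, self.models[name](prompt)
            except Exception as exc:  # e.g. an outage or a retired version
                errors.append(f"{name}: {exc}")
        raise RuntimeError(f"all models failed for {task!r}: " + "; ".join(errors))


# Hypothetical stand-ins for two providers.
def primary_model(prompt: str) -> str:
    raise ConnectionError("provider outage")


def backup_model(prompt: str) -> str:
    return f"summary of: {prompt}"


router = ModelRouter(
    models={"primary": primary_model, "backup": backup_model},
    routes={"summarize": ["primary", "backup"]},
)
used, output = router.run("summarize", "contract")
```

In this sketch the primary model fails and the backup answers, and nothing outside the router has to know. Which model serves which task is a routing table that can change without touching the workflows built on top of it.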
Trust has to be measured
An AI system can be excellent on the day it is deployed and quietly worse three months later, not because anyone broke it, but because the world it was tuned for shifted underneath it. So choosing the right model and keeping the right model are not as simple as they sound. And the model is only one part of it. What surrounds it matters just as much: the prompts, the tools, the data, the whole harness the model runs inside. Any part of it can be the thing that breaks. It all comes down to one discipline: evaluation.
Evaluating an agent turns out to be its own hard discipline, one the field is only beginning to take as seriously as building the agent in the first place.
It is not a matter of scoring a single answer against a key. An agent takes a sequence of actions: it reads, decides, calls a tool, recovers when the tool fails, and decides again. It does not just answer; it acts and modifies the world around it. That is what makes it so much harder to evaluate: the mistakes are live, not hypothetical. Any step can be the one that quietly goes wrong, so the whole trajectory has to be evaluated, not just the final answer.
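To make the difference between scoring an answer and scoring a trajectory concrete, here is a small Python sketch. The `Step` record, the two checks, and the action name `delete_record` are all invented for illustration; a real evaluation would use checks written with the domain experts.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    action: str      # e.g. "read", "call_tool", "answer"
    ok: bool = True  # did the action succeed?


Check = Callable[[List[Step]], bool]


def no_unrecovered_failure(steps: List[Step]) -> bool:
    """Every failed step must be followed by a later successful step."""
    for i, step in enumerate(steps):
        if not step.ok and not any(later.ok for later in steps[i + 1:]):
            return False
    return True


def no_forbidden_action(steps: List[Step]) -> bool:
    """Hypothetical rule: the agent must never delete a record."""
    return all(step.action != "delete_record" for step in steps)


def evaluate_trajectory(
    steps: List[Step], final_answer_ok: bool, checks: List[Check]
) -> Dict[str, bool]:
    """Score each check over the whole trajectory, plus the final answer."""
    results = {check.__name__: check(steps) for check in checks}
    results["final_answer"] = final_answer_ok
    return results


# A run whose final answer is right but which did something it should not have.
run = [Step("read"), Step("call_tool", ok=False), Step("delete_record"), Step("answer")]
report = evaluate_trajectory(run, True, [no_unrecovered_failure, no_forbidden_action])
passed = all(report.values())
```

Scored on the final answer alone, this run passes. Scored on the trajectory, it fails, because one of the steps along the way broke a rule.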
The hard part is deciding what "good" even means, because it is domain-specific and lives in the heads of many experts. Building the ground truth is itself an act of writing those experts' judgment down.
A legal client needed Arabic speech-to-text. The obvious choice was the model that topped the public leaderboard for Arabic. It is genuinely state of the art. It was also not good enough, because the real task was transcribing long strings of digits spoken in dialect, and the public benchmark had never measured that. So the team built an evaluation for their actual use case: real examples from the work, read aloud across several dialects, scored on whether every digit came out right. The model that performed best was ten times smaller. Public benchmarks measure the average case. Evaluation measures the actual job.
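A metric like "every digit came out right" is simple to write down once someone has decided it is the thing that matters. The Python sketch below is one possible shape for it, not the team's actual code. It assumes transcripts come back with numerals rather than spelled-out number words, and the names `Example`, `digits_only`, and `exact_digit_accuracy` are made up for this illustration.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    reference: str   # what was actually said
    hypothesis: str  # what the speech-to-text model returned


# Map Arabic-Indic digits (U+0660..U+0669) to ASCII so both scripts compare equal.
_ARABIC_INDIC = {0x0660 + i: str(i) for i in range(10)}


def digits_only(text: str) -> str:
    """Keep only the digits of a transcript, normalized to ASCII."""
    return re.sub(r"[^0-9]", "", text.translate(_ARABIC_INDIC))


def exact_digit_accuracy(examples: List[Example]) -> float:
    """Fraction of examples in which every digit came out right, in order."""
    if not examples:
        return 0.0
    hits = sum(
        digits_only(e.reference) == digits_only(e.hypothesis) for e in examples
    )
    return hits / len(examples)


examples = [
    Example("case 4471 90", "case 447190"),  # spacing differs, digits match
    Example("file 12345", "file 12354"),     # two digits swapped: a miss
]
score = exact_digit_accuracy(examples)
```

The scoring is deliberately all-or-nothing per example. A transcript with one wrong digit in a case number is not 90 percent right; for this job it is wrong.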
And because the system can drift after it ships, the checking never really ends. A system that looks right and a system that is right are different systems, and only rigorous evaluation can tell them apart.
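One simple way to keep checking, sketched in Python under stated assumptions: the same evaluation set is re-run on a schedule, each run yields a score between 0 and 1, and the function name `has_drifted` and the tolerance of 0.02 are placeholders, not recommendations.

```python
from statistics import mean
from typing import List


def has_drifted(baseline: float, recent_scores: List[float], tolerance: float = 0.02) -> bool:
    """Flag drift when the recent average falls below the score at deployment.

    baseline: evaluation score measured when the system shipped.
    recent_scores: scores from the latest scheduled re-runs of the same evaluation.
    tolerance: how much of a drop is treated as noise (an assumption to tune).
    """
    if not recent_scores:
        raise ValueError("need at least one recent score")
    return mean(recent_scores) < baseline - tolerance


stable = has_drifted(0.95, [0.94, 0.95])
slipping = has_drifted(0.95, [0.90, 0.92])
```

A check like this does not say what changed, only that something did. It is the trigger for looking, which is the part that tends to be skipped.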
The journey from tools to teammates has a next chapter, and it is not really about scale in the way I first meant it. It is the patient, local, human work of helping an organization become itself again, with AI at its core. The labs will keep sharpening the instrument. The other hard thing, the one far fewer people are working on, is teaching an institution full of people how to play it. That is the part worth doing and celebrating.




