When discussing AI with others, I can be sceptical about claims people or AI companies make. There’s a lot of hype out there, and I tend to be much more nuanced than the influencers online. Sometimes, I get the question: “Yeah, the AI doesn’t support that now, but it will when we add more data, right?” It took me a while to come up with a satisfying answer, and in summary, it’s this: The big innovations in LLMs don’t (just) come from adding data.
Yes, the state-of-the-art models today are often much larger than before, or were trained on much more data. Sure: GPT-4 was much bigger than GPT-3.5, which in turn was bigger than its predecessors. However, there are also reports that the effect of increasing the size of the models and/or the amount of training data is tapering off. In the meantime, some ideas around how to train models had a very big effect on the versatility and usefulness of LLMs.
There’s a pattern to some big innovations in the LLM space that could not be attained by “just adding data”. The following steps are necessary to unlock new capabilities in models:

1. Come up with the idea that a new kind of interaction or behaviour would be useful.
2. Design a way to encode it, often with new special tokens or a new output format.
3. Create and curate specialized training data in that format.
4. Fine-tune the model on that data.
These steps are not just about scaling up existing approaches and hoping that more data will unlock more capabilities. Let me show you three innovations that follow this pattern.
LLMs became mainstream all of a sudden when ChatGPT came out. Everyone was talking about it. The big invention here was a chat interface on top of LLMs that you could talk to. Before ChatGPT, LLMs were basic completion models. This means that when you were prompting a model, you were thinking about how it could complete a sentence:
"The square root of 9 is "
would be completed with (hopefully, since LLMs
are not great at mathematics): "3"
. Or "The quick brown fox jumps over "
would be completed with "the lazy dog."
.
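To make this concrete, here is a minimal sketch of prompting a pure completion model, using the Hugging Face transformers library and GPT-2; my choice of library and model here is purely for illustration.

```python
# A plain completion model just continues the text; there is no notion of
# "messages" or "turns". GPT-2 is used because it is a small, freely
# available completion-only model.
from transformers import pipeline  # pip install transformers torch

generator = pipeline("text-generation", model="gpt2")

result = generator(
    "The quick brown fox jumps over",
    max_new_tokens=5,
    do_sample=False,  # greedy decoding: always pick the most likely next token
)
print(result[0]["generated_text"])
```

Whether it actually completes with "the lazy dog." depends on the model, but the point is that the only interface is "continue this text".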
The big invention that allowed ChatGPT to jump into our collective consciousness was fine-tuning a model (i.e., retraining it on a specialized data set) such that it would respond with a chat message. To make that possible, they had to come up with a way to encode the start and end of messages and train the models to follow that style of generation. While this did require extra data that looked like chat messages, the actual jump came from realising that this would unlock new possibilities. (Fun fact: they did this because they thought it would generate more data they could use for training their models!)
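As an illustration of what that encoding can look like, here is a sketch that serializes a chat into one flat string using ChatML-style `<|im_start|>`/`<|im_end|>` markers. The exact special tokens differ per provider, and the helper below is my own illustration, not any provider’s actual implementation.

```python
# Sketch: turning a chat into a single text stream that a completion model
# can be fine-tuned on. The <|im_start|>/<|im_end|> markers follow the ChatML
# style; other providers use different special tokens.
def render_chat(messages: list[dict[str, str]]) -> str:
    rendered = ""
    for message in messages:
        rendered += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    # End with an opened assistant turn, so "completing the text" now means
    # "writing the assistant's reply".
    rendered += "<|im_start|>assistant\n"
    return rendered


chat = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the square root of 9?"},
]
print(render_chat(chat))
# A model fine-tuned on this format learns to end its reply with <|im_end|>,
# which is how the client knows the chat message is complete.
```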
This innovation made interaction between humans and LLMs easier, but the next innovation gave LLMs the capability to interact with tools.
Chatting with LLMs is now pretty common. Even my parents are hooked. The next iteration that created a leap in capabilities is function calling. This means that the model can signal that it wants the client to execute a function, for example a calculator (again, LLMs are bad at mathematics) or a Wikipedia lookup.
Before models got native support for this, you could use something like the ReAct pattern, which I learned about in Simon Willison’s ‘TIL’ post on this topic. This pattern is a prompting technique that alternates between reasoning and acting: it required telling the model which functions it could use, and parsing a structured response to detect when it wanted to call one. A bit clunky, because the models would sometimes use different syntax or fail to actually call a function.
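Here is a minimal sketch of what that pattern involves, with a made-up prompt and a regex-based parser; this is my own illustration of the idea, not Simon Willison’s implementation.

```python
# Sketch of ReAct-style plumbing: the prompt describes the available actions,
# and the client parses the model's free-text output to detect when it wants
# to call one. The prompt wording here is invented for illustration.
import re

REACT_PROMPT = """Answer the question. You can use these actions:

calculator: evaluates a mathematical expression
wikipedia: returns a summary of a Wikipedia article

Use this format:
Thought: your reasoning
Action: <action name>: <input>

Then stop and wait for an Observation before continuing."""

ACTION_PATTERN = re.compile(r"^Action: (\w+): (.*)$", re.MULTILINE)


def extract_action(response_text: str) -> tuple[str, str] | None:
    """Return (action, argument) if the model asked for one, else None."""
    match = ACTION_PATTERN.search(response_text)
    if match is None:
        return None  # the model answered directly, or drifted from the format
    return match.group(1), match.group(2)


print(extract_action("Thought: I need to compute this.\nAction: calculator: sqrt(9)"))
```

The regex is exactly where the clunkiness comes from: if the model uses slightly different syntax, the parser simply misses the call.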
Nowadays, most model providers have a way of specifying the functions in the API call, and have trained their models to be good at function calling. This capability unlocked what people now call “Agentic Systems”, a hot topic in AI influencer space, and very powerful. Some of the things made possible by function calling are MCP (Model Context Protocol, a standard for connecting AI models to data sources) and coding agents. Tools like Claude Code, Cursor and Windsurf rely heavily on function calling to help developers write code.
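The details vary per provider, but the shape is roughly the same everywhere: the tool is declared as a JSON schema, and the client dispatches the structured tool call the model returns. The sketch below is provider-agnostic; the names and the response shape are placeholders, not any specific vendor’s API.

```python
# Sketch: native function calling. Tools are declared as JSON schemas in the
# API request, and the model returns a structured tool call instead of free
# text. The tool-call dict below is a placeholder; real providers differ.
import json

TOOLS = [
    {
        "name": "calculator",
        "description": "Evaluate a basic arithmetic expression.",
        "parameters": {
            "type": "object",
            "properties": {"expression": {"type": "string"}},
            "required": ["expression"],
        },
    }
]


def dispatch(tool_call: dict) -> str:
    """Execute the tool the model asked for and return the result as text."""
    if tool_call["name"] == "calculator":
        arguments = json.loads(tool_call["arguments"])
        # eval() is only acceptable in a toy sketch like this one.
        return str(eval(arguments["expression"]))
    raise ValueError(f"Unknown tool: {tool_call['name']}")


# A typical agent loop: send the messages plus TOOLS, execute any tool call
# the model makes, append the result as a new message, and let it continue.
print(dispatch({"name": "calculator", "arguments": '{"expression": "2 + 2"}'}))
```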
Part of function calling was reasoning about what tools to use. Roughly around the same time this innovation was being rolled out, researchers started investigating this reasoning part for more complex tasks.
Another hyped capability that has some merit, if you ignore the claims of super-intelligence, is reasoning (or thinking¹). When you use a reasoning model (like OpenAI’s o1/o3/o4 models, Anthropic’s Claude 3.7 or higher, or DeepSeek R1), the model will first output tokens that expand on your query. This expansion will often lead to better answers for more complex tasks that require reasoning.
The idea of reasoning with LLMs was (to my knowledge) first explored in Chain-of-thought prompting. The most widely adopted form was asking models to “think step by step”. The authors showed some examples where this style of prompting improved the output of models for tasks that required common sense, arithmetic, and symbolic reasoning. This style of prompting soon found its way into a lot of “prompt engineering” guides.
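The difference is literally one sentence in the prompt. The question below is just an example I made up.

```python
# The same arithmetic question, asked directly and with the classic
# chain-of-thought suffix.
question = (
    "A shop sells pens in packs of 12. "
    "If I buy 7 packs and give away 30 pens, how many pens do I have left?"
)

direct_prompt = question
cot_prompt = question + "\nLet's think step by step."

# With the second prompt, the model tends to first write out the intermediate
# steps (7 * 12 = 84, then 84 - 30 = 54) before stating the final answer.
# That written-out trace is exactly what reasoning models were later trained
# to produce on their own.
```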
Of course, seeing the success people had with this, model providers started fine-tuning their LLMs to output these traces. A big part of this fine-tuning is creating and curating data (probably one of the main reasons OpenAI hides their thinking tokens).
One reason why I think this works is that expanding on the query gives the model more “surface area” to generate next tokens from (LLMs are still next-token predictors!). However, a recent paper argues that these reasoning models break down completely when the tasks become a bit too complex. Gary Marcus, commenting on this topic, says that neural networks (which LLMs are based on) do well within the boundaries of the data they are exposed to, but don’t generalize well beyond that.
As I mentioned in the introduction, I can be quite sceptical about some of the claims people make around AI. The hype around it brings simplistic, black-and-white thinking. The pattern expressed above can bring some nuance to the discourse.
For AI development, it suggests bigger isn’t automatically better. Smaller, more focused models can outperform the multi-million-dollar models on specific tasks. In the last year, we’ve seen several smaller models come up that have some advantage, such as being able to run on your own hardware, or costing much less to train. Apart from general-purpose models, companies can use libraries like spaCy, combining LLMs and humans-in-the-loop, to build small, focused models with excellent accuracy. Similarly, tools like Model2Vec can distill large embedding models into very fast static embeddings. These distilled models are much smaller and faster than the giant, general-purpose models.
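As a small illustration of that trade-off, a pretrained spaCy pipeline is tiny compared to any LLM, runs locally, and handles a narrow task like named-entity recognition just fine. The example sentence is arbitrary.

```python
# Sketch: a small, focused model instead of a giant general-purpose one.
# First: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline, runs locally
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

# Named-entity recognition without any LLM in the loop.
for entity in doc.ents:
    print(entity.text, entity.label_)
```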
In discussions about AI expectations, instead of arguing yes or no, we could steer the conversation towards asking what specific training approaches or data could unlock a particular capability. Compute or data volume is often not the (only) bottleneck.
One of the themes I take away from these three innovations in LLM technology is that you do need more data, but you need the right kind of data. This is not data that you can just scrape from the web. Summarized, what you need for these innovations:

- Chat: conversations formatted with special tokens that mark where each message starts and ends.
- Function calling: examples of tool definitions and structured tool calls in the format the model should learn to emit.
- Reasoning: curated traces of step-by-step thinking, marked with their own thinking tokens.
As you can see in that list, it’s not just data that you need: you also need new types of tokens to denote that you’re thinking or starting a new message. Some innovations may require even bigger architectural changes. And you need to actually come up with the idea that this might be beneficial in certain use cases. It’s not something these models just pick up from being fed more data scraped from the web.
I really like that the community at large picks up new kinds of interactions, shows how useful they can be, and that model providers then fine-tune their flagship models to take advantage of them. This suggests that the next breakthrough will come from whoever identifies what’s missing and systematically addresses it, not necessarily from the company with the biggest compute budget.
¹ I don’t like this name; it gives these models too much credit… I like “reasoning” better, but most model providers call it thinking, or use thinking tokens.