What's next for LLMs?
This is in no way a professional prediction, just a summary of some wild ideas that popped into my head during a recent discussion with friends. Take it with a grain of salt.
Reasoning, agents, and multimodality are the top three research topics for LLMs, as far as I can tell. For me, there is also the perennial topic of efficiency, or scalability. And beyond that, of course, world models and intelligent robotics.
Reinforcement learning with verifiable rewards boosted reasoning performance, as explored by DeepSeek and others, but it requires far more output tokens, by a factor of almost 10. That alone is roughly 10x inference compute per query, and to solve the same problem in the same wall-clock time, you would need roughly another 10x of serving capacity on top of it. A 100x resource demand like that is not desirable.
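As a back-of-envelope illustration (the 10x factors are rough figures from the paragraph above, not measurements):

```python
# Rough cost arithmetic for long reasoning traces. Both factors are assumptions
# taken from the text, not benchmarks.
output_tokens_factor = 10   # reasoning traces make outputs ~10x longer -> ~10x compute per query
same_latency_factor = 10    # answering in the same wall-clock time needs ~10x serving capacity

print(output_tokens_factor * same_latency_factor)  # ~100x overall resource demand
```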
I was wrong in 2023 about the inefficiency of LLM inference. I didn't like the quadratic scaling of the attention mechanism, but it turns out there are more ways to work around that bottleneck than seeking linear attention in the style of RWKV and other alternatives. Inference-time scaling and MoE are both game changers.
So I expect to be wrong again about the cost of reasoning. There is already preliminary research showing that reasoning happens inside the model and can be further trained into LLMs.
That is, reasoning should not happen at the output layer but in the intermediate layers. Writing down the reasoning process should merely serve as a memory of explored paths, to keep things on track, or to materialize the box to think out of.
The waste is not just on the output side: a significant share of input tokens goes to prompt text that is neither the instruction nor the context, but guidance. Such guidance should not just be cached but "compiled", in a more efficient and controllable manner, while retaining interpretability.
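One hedged reading of "compiled" guidance is prompt tuning: distill a long, reusable guidance prompt into a handful of learned soft-prompt vectors that a frozen model consumes as a prefix. A minimal PyTorch sketch, with all names and sizes hypothetical:

```python
# A sketch of "compiling" guidance into learned soft-prompt embeddings instead of
# re-sending it as text on every request (prompt-tuning style). This is one possible
# reading of "compiled" guidance, not an established recipe.
import torch
import torch.nn as nn

class CompiledGuidance(nn.Module):
    def __init__(self, hidden_size: int, num_virtual_tokens: int = 16):
        super().__init__()
        # Learned vectors that stand in for a long natural-language guidance prompt.
        self.virtual_tokens = nn.Parameter(torch.randn(num_virtual_tokens, hidden_size) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, hidden) embeddings of the actual instruction + context.
        batch = input_embeds.shape[0]
        prefix = self.virtual_tokens.unsqueeze(0).expand(batch, -1, -1)
        # Prepend the compiled guidance; a frozen LLM would see it like any other prefix.
        return torch.cat([prefix, input_embeds], dim=1)

# Usage: 16 virtual tokens standing in for, say, a 400-token guidance prompt.
guidance = CompiledGuidance(hidden_size=768, num_virtual_tokens=16)
dummy_input = torch.randn(2, 32, 768)  # stand-in for embedded instruction + context
print(guidance(dummy_input).shape)     # torch.Size([2, 48, 768])
```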
Prompt engineering and reasoning through output tokens are two sides of the same coin: we reach better results in an interpretable manner, but at a far higher cost than necessary, and that will become obvious once we find the right principle. Of course, history relies on such detours to get back on track.
Agents are another layer of waste, on top of individual inferences. They fall into fruitless planning and exploration, failing to snap out of infinite loops by acquiring the extra information or triggering the side effects that would break them out. Their interactions with tools (e.g., MCP), other agents (e.g., A2A), and humans (e.g., A2H) are all inefficient because natural language is the intermediate representation.
Imagine an alien species with a far more efficient language, like the one depicted in Arrival. Such a language could express complex concepts like tool invocation with parameters using just a single token - or perhaps a single multimodal token combining multiple meanings.
There should be special dense tokens to represent tools or even ways of thinking (e.g., the "Sequential Thinking" MCP). And why should a token be just an embedding, a point in a high-dimensional vector space? Why not a curve, a hypersurface, or a manifold?
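To make the idea a bit more concrete, here is a purely hypothetical sketch of what a "dense tool token" could look like: one learned embedding per tool, with parameters folded in as continuous slots instead of being spelled out in natural language. Every name, shape, and encoding choice here is an illustrative assumption:

```python
# A hypothetical "dense" tool token: a single embedding that carries both the tool
# identity and its (normalized) parameters, rather than a multi-token JSON blob.
import torch
import torch.nn as nn

class DenseToolToken(nn.Module):
    def __init__(self, num_tools: int, hidden_size: int, num_param_slots: int = 4):
        super().__init__()
        self.tool_embedding = nn.Embedding(num_tools, hidden_size)  # which tool is being called
        self.param_proj = nn.Linear(num_param_slots, hidden_size)   # continuous parameter slots

    def forward(self, tool_id: torch.Tensor, params: torch.Tensor) -> torch.Tensor:
        # tool_id: (batch,) integer ids; params: (batch, num_param_slots) normalized arguments.
        # The whole call collapses into one token embedding of shape (batch, hidden_size).
        return self.tool_embedding(tool_id) + self.param_proj(params)

# One "token" for a call that would otherwise take dozens of natural-language tokens.
dense = DenseToolToken(num_tools=128, hidden_size=768)
call = dense(torch.tensor([42]), torch.tensor([[0.1, 0.5, 0.0, 1.0]]))
print(call.shape)  # torch.Size([1, 768])
```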
Take some time to think about this, with the visualization from "On the Biology of a Large Language Model" up on the screen.
I partially believe that using modalities other than human perception as the intermediate representation is the key to the next breakthrough. Such representations would also capture the world model better.
At the end of the day, reaching human-level intelligence comes down to human-level efficiency. That efficiency will come from finding the right principle, which will also bring better performance.
And reasoning, agents, multimodality, efficiency, world models, and robotics would all be covered by this one integrated solution.