Autonomous driving is experiencing a surge of hype around new AI paradigms – especially in China, where Vision-Language-Action (VLA) models are touted as the next big thing. Major Chinese electric vehicle makers are boasting about AI “driver models” that can “see” the road, “think” in natural language, and “act” by controlling the car end-to-end. But amid this buzz, a non-consensus viewpoint is emerging: the real key to winning the self-driving race may not be the fanciest algorithm, but how we evaluate these systems. In other words, beyond flashy demos and buzzwords, the right metrics and rigorous evaluation will be king for autonomous driving. This article dives into the current hype around VLA, the differing approaches (such as Huawei’s skeptical focus on “World Models”), and why robust evaluation metrics and data quality are ultimately crucial to progress.
The VLA Buzz: Chinese Automakers Bet on Vision-Language-Action
In 2025, Chinese EV companies have aggressively embraced the idea of Vision-Language-Action (VLA) models for self-driving. VLA refers to a multimodal AI model that integrates vision, language understanding, and action output in one end-to-end system. The concept is inspired by human-like driving: the AI “sees” its environment through sensors, “interprets and reasons” about the scene (possibly in a language-like or logical form), and then directly outputs driving actions (steering, acceleration, etc.).
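To make the “one model from sensors to actions” idea concrete, here is a minimal interface sketch; the class and field names below are illustrative assumptions, not taken from any vendor’s code:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CameraFrame:
    """Placeholder for one camera image (in practice an HxWx3 tensor)."""
    pixels: list


@dataclass
class DrivingAction:
    """Low-level control command produced by the model."""
    steering_angle_rad: float
    acceleration_mps2: float


class VisionLanguageActionModel:
    """Illustrative interface only: a single model maps raw sensor input
    (plus optional language context such as a navigation instruction)
    directly to a driving action, with no hand-written planner sitting
    between perception and control."""

    def forward(self,
                frames: List[CameraFrame],
                instruction: Optional[str] = None) -> DrivingAction:
        # A real VLA system would run a multimodal transformer here;
        # this stub only shows the end-to-end signature.
        raise NotImplementedError
```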
Li Auto was one of the first to formally announce a VLA-based autonomous driving architecture, dubbed MindVLA, in an NVIDIA GTC conference talk by their autonomous driving lead Jia Peng. In that talk, Li Auto revealed that MindVLA is envisioned as a “driver big model” that works like a human driver – combining spatial/vision intelligence, language intelligence, and behavioral intelligence in one system. Unlike their previous approach, which coupled a vision-based end-to-end model with a separate language model, the new VLA architecture merges everything into a single model for joint perception, reasoning, and decision-making.
Notably, MindVLA is designed with two reasoning modes – a bit akin to Kahneman’s System 1 and System 2 thinking. As Li Auto explains, the model can produce two types of output: a “slow” output, which is a step-by-step chain-of-thought (CoT) reasoning trace (literally a text-based thought process the AI generates word by word), and a “fast” output, which skips the explicit reasoning chain and directly emits the action commands. In practice, this means the AI sometimes explains its logic internally and other times acts reflexively. For example, MindVLA might internally narrate a complex scenario (“If the sign says the bus lane is open to cars after 7 PM, then I…”) before executing a maneuver, or it might instantly output the steering trajectory for straightforward situations without verbose reasoning. Li Auto indicates that the slow CoT mode helps the system handle complex traffic logic (and provides a human-readable rationale), whereas the fast mode is used when immediate reaction is needed.

To achieve real-time performance (the target is >10 Hz planning), MindVLA uses a fixed, short CoT template for the slow mode and a parallel decoding scheme – essentially generating the chain-of-thought with causal attention (one token at a time) while outputting the action trajectory with bidirectional attention in one go. This engineering ensures that even the “thinking out loud” mode doesn’t slow the car’s responses to an unsafe degree. Indeed, Li Auto reports that with optimizations such as a smaller vocabulary and speculative decoding, their VLA inference can exceed 10 Hz on automotive-grade chips.
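Li Auto has not published implementation details, but the parallel-decoding idea can be illustrated with a toy attention mask: chain-of-thought positions attend causally, while action-trajectory positions attend to the full CoT prefix and to one another, so the whole trajectory can be emitted in a single step. The function name and token counts below are hypothetical:

```python
import numpy as np


def build_mixed_attention_mask(n_cot: int, n_action: int) -> np.ndarray:
    """Toy mask for the parallel-decoding scheme described above.

    Rows are query positions, columns are key positions; True means
    "may attend". The first n_cot positions are chain-of-thought tokens
    (causal: each sees only itself and earlier CoT tokens). The last
    n_action positions are action/trajectory tokens: they attend to the
    whole CoT prefix and bidirectionally to each other.
    """
    n = n_cot + n_action
    mask = np.zeros((n, n), dtype=bool)

    # Causal (lower-triangular) attention among CoT tokens.
    mask[:n_cot, :n_cot] = np.tril(np.ones((n_cot, n_cot), dtype=bool))

    # Action tokens see the full CoT prefix...
    mask[n_cot:, :n_cot] = True
    # ...and see each other bidirectionally, so the trajectory can be
    # decoded in one forward pass rather than token by token.
    mask[n_cot:, n_cot:] = True
    return mask


if __name__ == "__main__":
    # 4 reasoning tokens followed by 3 trajectory tokens.
    print(build_mixed_attention_mask(n_cot=4, n_action=3).astype(int))
```

In a real transformer this mask would gate the self-attention scores; the point is simply that mixing causal and bidirectional attention lets one forward pass serve both the reasoning text and the trajectory output.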
Advertisement: World Model – For readers interested in deep insights and data on the autonomous driving industry, consider subscribing to our World Model newsletter. We offer first-hand data on companies like Pony.ai and up-to-date China EV industry metrics, providing the hard numbers behind the news. Our paid subscribers gain access to detailed datasets (including product feature comparisons, deployment statistics, and even funding history data) in formats ready for analysis by your AI tools – giving you an informational edge for your investments or research. For more information or special data requests, feel free to reach out.