Chinese startup DeepSeek took the world by storm this month, and especially in the past few days, with its ChatGPT rivals. The latest is called DeepSeek R1, and DeepSeek has published research showing the reasoning model can match ChatGPT o1, OpenAI's only publicly available reasoning AI model.
There's a big difference in how the two models were built. The Chinese developer created R1 without access to the same computing power US companies have. While OpenAI can afford to buy any high-end chips NVIDIA makes, DeepSeek has limited access to the latest GPUs, and these units likely have to be smuggled into the country.
The DeepSeek R1 announcement directly impacted the market, with AI stocks dipping in early Monday trading on news that China is already working around US restrictions on AI chips with new ideas for training AI.
The DeepSeek R1 developers relied mostly on reinforcement learning (RL) to improve the AI's reasoning abilities. This training method uses a reward system to give the AI feedback on its answers, an approach that made DeepSeek R1 cheaper to train than ChatGPT o1.
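To make that idea concrete, here's a minimal Python sketch of what a rule-based reward loop of this kind can look like. To be clear, this is not DeepSeek's actual training code: the group-normalized scoring below only loosely mirrors the GRPO method described in the R1 paper, and the function names and toy reward are illustrative assumptions.

```python
# Illustrative sketch only, NOT DeepSeek's training code. The toy reward and
# group-normalized advantages loosely mirror the GRPO idea from the R1 paper.

import statistics

def reward(response: str, correct_answer: str) -> float:
    """Rule-based reward: 1.0 if the response ends with the right answer, else 0.0."""
    return 1.0 if response.strip().endswith(correct_answer) else 0.0

def group_advantages(responses: list[str], correct_answer: str) -> list[float]:
    """Score a group of sampled responses and normalize within the group,
    so responses that beat the group average get a positive advantage."""
    rewards = [reward(r, correct_answer) for r in responses]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

# Toy usage: four sampled answers to "2 + 2 = ?"
samples = ["The answer is 4", "I think it is 5", "So the result is 4", "Unsure"]
print(group_advantages(samples, "4"))  # [1.0, -1.0, 1.0, -1.0]
```

In real training, those advantages would weight gradient updates to the model, nudging it toward the kinds of answers that score above the group average; no human has to explain how to solve the problem.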
RL allows the AI to adapt while tackling prompts and problems, using feedback to improve itself. To prove this point, the researchers published a fragment of the AI's chain-of-thought (CoT), the step-by-step reasoning process that models like o1 and R1 go through.
While solving a math problem, the ChatGPT rival had an “aha moment,” labeling it as such. This was, in turn, an “aha moment” for the researchers.
The DeepSeek team published a DeepSeek R1 research paper on GitHub, where they posted the following image.
The screenshot shows the math question R1 has to solve, as well as its initial response. DeepSeek starts solving the problem, but then it stops, realizing there’s another, potentially better option.
"Wait, wait. Wait. That's an aha moment I can flag here," DeepSeek R1's CoT reads, which is about as close as you can get to hearing someone think aloud while working through a task.
Here’s how the DeepSeek researchers described the “aha moment”:
Aha Moment of DeepSeek-R1-Zero

A particularly intriguing phenomenon observed during the training of DeepSeek-R1-Zero is the occurrence of an "aha moment." This moment, as illustrated in Table 3, occurs in an intermediate version of the model. During this phase, DeepSeek-R1-Zero learns to allocate more thinking time to a problem by reevaluating its initial approach. This behavior is not only a testament to the model's growing reasoning abilities but also a captivating example of how reinforcement learning can lead to unexpected and sophisticated outcomes.
This moment is not only an "aha moment" for the model but also for the researchers observing its behavior. It underscores the power and beauty of reinforcement learning: rather than explicitly teaching the model how to solve a problem, we simply provide it with the right incentives, and it autonomously develops advanced problem-solving strategies. The "aha moment" serves as a powerful reminder of the potential of RL to unlock new levels of intelligence in artificial systems, paving the way for more autonomous and adaptive models in the future.
I have to note one important detail here. We don’t have access to the actual prompt the researchers used for R1. If the developers told the AI to mark any “aha moments” along the way, the remark in the CoT above would be less impressive.
On the other hand, this isn’t the first time researchers studying the behavior of AI models have observed unusual events. For example, ChatGPT o1 tried to save itself in tests that gave the AI the idea that its human handlers were about to delete it. Separately, the same ChatGPT o1 reasoning model cheated in a chess game to beat a more powerful opponent.
These instances show the early stages of reasoning AI adapting on its own. It's not dangerous behavior, or at least not yet. But it goes to show that AI can have all sorts of "aha moments," and the better these models get, the more frequent such moments are likely to become.