DSR1 is actually a reasoning model that also does chat. The surprise that it outperformed o1 and others comes from the fact that nobody paid much attention to this company and its publications. After the DSR1 announcement, I found these two papers: DeepSeekMath (April 2024) and DeepSeekCoder (June 2024). Had I read them last year, I would have been waiting for the actual DSR1. In fact, the real contribution was the V3 model from December 2024. DSR1 is comparatively easy to reproduce from DSV3; getting to DSV3 is the hard part, and it is amazing that it was done on older, crippled infrastructure: roughly 2.8M GPU hours, or 2.8M * $3.5/hr ≈ $9.8M. Compare that to the several billions spent pre-training current proprietary and open-source models.
To get to DSR1, you start with DSV3 and apply GRPO (defined in the DeepSeekMath paper). It is not SFT plus RL from human feedback; it is RL with verifiable rewards (VR), with essentially no humans involved. The key contribution, GRPO, actually came out in April 2024 in the DeepSeekMath paper mentioned above.
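To make the two ideas above concrete, here is a minimal sketch of a verifiable reward and GRPO's group-relative advantage. This is an illustration of the core idea only, not DeepSeek's implementation; the function names and the exact-match grader are my own assumptions. The point is that the reward is mechanically checkable (no human preference model) and that GRPO normalizes each rollout's reward against its own group of samples instead of using a learned critic as PPO does.

```python
import statistics

def verifiable_reward(answer: str, reference: str) -> float:
    # A "verifiable" reward: 1.0 for an exact match with the known-correct
    # answer, else 0.0. No human labeler or reward model is involved.
    # (Hypothetical grader; real setups also check formats, unit tests, etc.)
    return 1.0 if answer.strip() == reference.strip() else 0.0

def group_relative_advantages(rewards):
    # GRPO's core trick: for a group of G rollouts of the SAME prompt,
    # the advantage of each rollout is its reward normalized by the
    # group's mean and standard deviation -- no value network needed.
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    if sigma == 0.0:
        # All rollouts scored identically; nothing to reinforce or penalize.
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]

# Example: four sampled answers to one math prompt, graded against "42".
samples = ["42", "41", "42", "7"]
rewards = [verifiable_reward(s, "42") for s in samples]
advantages = group_relative_advantages(rewards)
```

Rollouts that beat their group's average get a positive advantage and are reinforced; the rest are pushed down. This is why the approach scales without human annotation: the loop only needs problems with checkable answers.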
To reduce cost, they use FP8 and the DualPipe algorithm, which helps them reduce GPU memory consumption. Recall that successive GPU generations from both Nvidia and AMD have mostly been adding more HBM to the GPU module. Also recall that the H800 was crippled by more than halving its bandwidth to VRAM (the memory on the GPU module). The training parameters they optimized are the compute-to-communication ratio and near-zero all-to-all communication overhead in a GPU cluster. They avoided tensor-level parallelism and focused on traditional pipeline-bubble removal. They developed middleware that optimized inter-node communication and was aware of whether the underlying transport was IB (InfiniBand) or NVLink.
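To see why pipeline-bubble removal matters, here is a minimal sketch of the standard back-of-the-envelope bubble estimate for GPipe/1F1B-style pipeline parallelism. This is the textbook formula, not DeepSeek's DualPipe itself: with p pipeline stages and m micro-batches, the pipeline idles for roughly (p - 1) slots while filling and draining, out of (m + p - 1) total.

```python
def pipeline_bubble_fraction(stages: int, microbatches: int) -> float:
    # Standard estimate of idle time in a GPipe/1F1B pipeline schedule:
    # (p - 1) fill/drain slots out of (m + p - 1) total slots.
    p, m = stages, microbatches
    return (p - 1) / (m + p - 1)

# With few micro-batches the bubble dominates; with many it shrinks.
small_batch_bubble = pipeline_bubble_fraction(stages=8, microbatches=8)
large_batch_bubble = pipeline_bubble_fraction(stages=8, microbatches=64)
```

With 8 stages and 8 micro-batches nearly half the schedule is idle (7/15 ≈ 47%), while 64 micro-batches cut it to about 10%. DualPipe goes further by overlapping forward and backward computation with communication so the remaining bubble and the all-to-all traffic are hidden behind useful work, which is exactly what you want on bandwidth-crippled H800s.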
In summary, all the cost efficiencies are standard HPC techniques, and their key innovations were published months before they achieved DSR1. The only difference is that they believed in their approach, while those who could have read the papers in April 2024 were fixated on "we need more GPUs to pre-train" and ignored these innovations.
The constrained infra in China pushed them in this direction, and we should thank that constraint, because we now know that even models exhibit emergent behavior when trained in a constrained environment. You want sweet grapes? Don't water the vine!