Friday, January 31, 2025

DeepSeek is On-Prem and the Real Value is Between the "&lt;think&gt;" Tags

High-Flyer is a hedge fund that was an early adopter of GPU acceleration in finance. The expertise it built in that field helped it launch a subsidiary, DeepSeek, which recently released R1 after two earlier LLMs. Everybody now knows what DeepSeek is, but not many know that it is not running in any cloud. It is all on-prem.

In China you can't get the H100, but you can get the H800, which is bandwidth-limited, and the H20, an even more cut-down part, yet still enough to train and run inference on a ~671B-parameter model. Some suggest that DeepSeek innovated because of these constraints. Maybe, but it could just as well be that they figured out first that one does not need to activate the whole model for every token of inference. The latest figures are that DSR1 activates only about 37B of its parameters per token, which, aggressively quantized, you can fit on a laptop GPU with only 12GB of VRAM.
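A quick back-of-envelope check makes the quantization angle concrete (my own arithmetic, not a published figure): resident weight memory is roughly parameter count times bits per weight, so 37B active parameters only approach a 12GB card at around 2-3 bits per weight.

```python
# Back-of-envelope weight-memory math (my numbers, not DeepSeek's):
# bytes ~= n_params * bits / 8; this ignores the KV cache and
# activations, so treat the results as lower bounds on real VRAM use.
def weight_gb(n_params: float, bits: int) -> float:
    return n_params * bits / 8 / 1e9

for bits in (16, 8, 4, 3):
    print(f"{bits:2d}-bit: 37e9 params -> {weight_gb(37e9, bits):5.1f} GB")
# 16-bit: 74.0 GB, 8-bit: 37.0 GB, 4-bit: 18.5 GB, 3-bit: ~13.9 GB
```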

Since the news on DSR1 hit the wires, consumer-grade GPUs have been flying off the shelves. It is super easy to run DSR1 using ollama (literally just type %ollama run deepseek-r1). Most of the value I derive from DSR1 is the bit inside the <think> tags. For example, I asked DSR1 how to compress and load an LLM into VRAM. The summary it gave was the obvious stuff, as shown below:

In summary, my approach would be:
1. Start by quantizing the model parameters to lower bit precision (e.g., 8-bit or 16-bit).
2. Apply pruning to remove unnecessary weights after quantization.
3. Use techniques like dynamic or static quantization for further compression during inference.
4. Implement checkpointing strategies to optimize memory usage during the inference process.
5. Possibly combine with other methods like knowledge distillation if there's excess capacity.
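None of that is wrong, and steps 1 and 3 are even easy to try at home. Here is a minimal sketch, assuming PyTorch's post-training dynamic quantization and a toy stand-in model (the layer sizes are invented; this is not DeepSeek's pipeline):

```python
# Minimal sketch of post-training dynamic quantization (steps 1 and 3).
# The model here is a toy stand-in, not an actual LLM.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(4096, 11008),
    nn.ReLU(),
    nn.Linear(11008, 4096),
)

# Weights are stored as int8; activations are quantized on the fly
# at inference time. Dynamic quantization runs on CPU.
qmodel = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 4096)
with torch.no_grad():
    print(qmodel(x).shape)  # torch.Size([1, 4096])
```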

But the real insight was in between the <think> tags. Among the tips there were knowledge distillation (not the same as model distillation) and the use of fixed-point math (something we did like 30 years ago to draw pictures using PostScript). I particularly liked the way it went from the top-of-the-head response of quantization, to fixed point, to knowledge distillation, and went on to consider (and eventually reject) sparsity techniques.
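If you want to fish that reasoning out programmatically, here is a minimal sketch, assuming the ollama Python package (pip install ollama) and the locally pulled deepseek-r1 model; the prompt and variable names are mine:

```python
# Pull the chain-of-thought out of a DSR1 reply.
# Assumes `ollama pull deepseek-r1` has already been run locally.
import re
import ollama

resp = ollama.chat(
    model="deepseek-r1",
    messages=[{"role": "user",
               "content": "How would you compress an LLM to fit in VRAM?"}],
)
text = resp["message"]["content"]

# R1 wraps its reasoning in <think>...</think> before the final answer.
m = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
if m:
    print(m.group(1).strip())  # the interesting bit
```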


DSR1 = DSV3 + GRPO + VR

DSR1 is actually a reasoning model that also does chat. But the surprise that it outperformed o1 and others came about because nobody paid much at...