Understanding DeepSeek R1

DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that has been making waves in the AI community. Not only does it match or even surpass OpenAI’s o1 model on many benchmarks, it also ships with fully MIT-licensed weights. This makes it the first non-OpenAI/Google model to deliver strong reasoning abilities in an open and accessible manner.

What makes DeepSeek-R1 especially exciting is its transparency. Unlike the less open approaches of some industry leaders, DeepSeek has published a detailed training methodology in their paper. The model is also remarkably cost-effective, with input tokens costing just $0.14-0.55 per million (vs o1’s $15) and output tokens at $2.19 per million (vs o1’s $60).

Until around GPT-4, the common wisdom was that better models needed more data and compute. While that still holds, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.

The Essentials

The DeepSeek-R1 paper presented multiple models, but the main ones are R1 and R1-Zero. Alongside these is a series of distilled models that, while interesting, I won’t discuss here.

DeepSeek-R1 relies on two major ideas:

1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by large-scale RL.

2. Group Relative Policy Optimization (GRPO), a reinforcement learning method that relies on comparing multiple model outputs per prompt, which avoids the need for a separate critic (a minimal sketch of the group-relative scoring follows after this list).
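
To make the GRPO idea concrete, here is a minimal sketch of the group-relative scoring it relies on: several completions are sampled for the same prompt, each gets a scalar reward, and each completion’s advantage is its reward normalized against the group’s mean and standard deviation. The function name and the toy rewards below are illustrative, not DeepSeek’s actual code.

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO's core trick: score each sampled completion relative to the other
    completions sampled for the same prompt, instead of using a learned critic.
    `rewards` is a sequence of scalar rewards, one per completion."""
    rewards = np.asarray(rewards, dtype=np.float64)
    # Normalize within the group; epsilon avoids division by zero when all rewards match.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Example: four completions sampled for one prompt, two of them judged correct.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
# Correct completions get positive advantages, incorrect ones negative.
```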

R1 and R1-Zero are both reasoning models. In practice, this means they perform Chain-of-Thought before answering: for the R1 series of models, the model first thinks inside a <think> tag and then responds with a final summary.
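
As a small illustration, R1-style output can be split into the reasoning trace and the final answer with a few lines of code. The <think> tag name follows the published R1 chat format, but treat the exact format (and this helper) as an assumption rather than an official API.

```python
import re

def split_r1_output(text):
    """Split an R1-style completion into (reasoning, answer).
    Assumes the chain-of-thought is wrapped in <think>...</think>
    and the final summary follows the closing tag."""
    match = re.search(r"<think>(.*?)</think>(.*)", text, flags=re.DOTALL)
    if match is None:
        return None, text.strip()  # no thinking block found
    return match.group(1).strip(), match.group(2).strip()

reasoning, answer = split_r1_output(
    "<think>2 + 2 = 4, and doubling gives 8.</think>The answer is 8."
)
print(answer)  # The answer is 8.
```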

R1-Zero vs R1

R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base with no supervised fine-tuning (SFT). RL is used to optimize the model’s policy to maximize reward. R1-Zero achieves excellent accuracy but often produces confusing outputs, such as mixing multiple languages in a single response. R1 fixes that by adding limited supervised fine-tuning and multiple RL passes, which improves both correctness and readability.
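
To show how rewards actually move the policy, here is a REINFORCE-style sketch of the update: completions with above-average reward in their group get their log-probabilities pushed up, the rest pushed down. GRPO additionally uses ratio clipping and a KL penalty against a reference model, both omitted here; the tensors and values are toy examples, not DeepSeek’s code.

```python
import torch

def policy_gradient_loss(logprobs, advantages):
    """Surrogate loss: minimizing it raises the log-probability of completions
    with positive group-relative advantage and lowers the others.
    `logprobs` are per-completion sums of token log-probabilities under the policy."""
    return -(advantages.detach() * logprobs).mean()

# Toy example: four completions, two of them rewarded (positive advantage).
logprobs = torch.tensor([-5.0, -6.0, -4.5, -5.5], requires_grad=True)
advantages = torch.tensor([1.0, -1.0, -1.0, 1.0])
loss = policy_gradient_loss(logprobs, advantages)
loss.backward()
print(logprobs.grad)  # rewarded completions get negative gradients, so a descent step raises their log-probs
```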

It is interesting that some languages may express certain concepts better, which leads the model to pick the most expressive language for the task.

Training Pipeline

The training pipeline that DeepSeek published in the R1 paper is extremely interesting. It shows how they built such strong reasoning models and what you can expect from each stage, including the problems the model from each stage still has and how the next stage addresses them.

It’s notable that their training pipeline differs from the usual approach:

The typical training approach: pretraining on a large dataset (training to predict the next word) to produce the base model → supervised fine-tuning → preference tuning via RLHF.
R1-Zero: pretrained → RL.
R1: pretrained → multi-stage training pipeline with multiple SFT and RL stages.

1. Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to ensure the RL process has a decent starting point.

2. First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as forcing the chain-of-thought into thinking tags); a sketch of such rule-based rewards follows after this list. Once training was near convergence, they moved to the next step. The result of this stage is a strong reasoning model, but one with weak general abilities, e.g., poor formatting and language mixing.

3. Rejection Sampling + general data: Create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model. They collected around 600k high-quality reasoning samples.

4. Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step produced a strong reasoning model with general abilities.

5. Second RL Stage: Add more reward signals (helpfulness, harmlessness) on top of the reasoning rewards to refine the final model. The result is DeepSeek-R1. They also performed model distillation into several smaller Qwen and Llama models.
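
Here is a minimal sketch of what the rule-based rewards in the first RL stage might look like: one rule checks that the output keeps its reasoning inside thinking tags, another checks the final answer against a reference for verifiable tasks such as math. The specific rules, tag names, and equal weighting are assumptions for illustration; the paper does not publish reward code.

```python
import re

def format_reward(completion):
    """1.0 if the completion wraps its reasoning in <think>...</think>
    before the final answer, else 0.0 (the formatting rule)."""
    return 1.0 if re.fullmatch(r"(?s)\s*<think>.+?</think>.*", completion) else 0.0

def accuracy_reward(completion, reference_answer):
    """1.0 if the text after the thinking block contains the reference answer,
    a deliberately crude accuracy rule for verifiable tasks like math."""
    answer_part = completion.split("</think>")[-1]
    return 1.0 if reference_answer in answer_part else 0.0

def total_reward(completion, reference_answer):
    # Equal weighting of accuracy and format is an assumption, not from the paper.
    return accuracy_reward(completion, reference_answer) + format_reward(completion)

print(total_reward("<think>14 * 3 = 42.</think>The answer is 42.", "42"))  # 2.0
```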