Understanding DeepSeek R1


DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that has been making waves in the AI community. Not only does it match, or even surpass, OpenAI’s o1 model on many benchmarks, but it also ships with fully MIT-licensed weights. This makes it the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible way.

What makes DeepSeek-R1 especially interesting is its transparency. Unlike the less open approaches of some industry leaders, DeepSeek has published a detailed training methodology in their paper. The model is also remarkably cheap to run, with input tokens costing just $0.14-0.55 per million (vs. o1’s $15) and output tokens at $2.19 per million (vs. o1’s $60).

Until around GPT-4, the conventional wisdom was that better models required more data and compute. While that still holds, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.

The Essentials

The DeepSeek-R1 paper presented several models, chief among them R1 and R1-Zero. These are followed by a series of distilled models that, while interesting, I won’t cover here.

DeepSeek-R1 relies on two major ideas:

1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by large-scale RL.

2. Group Relative Policy Optimization (GRPO), a reinforcement learning approach that relies on comparing multiple model outputs per prompt to avoid the need for a separate critic (a small sketch of this follows below).

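To make the critic-free idea concrete, here is a minimal sketch of the group-relative advantage computation that GRPO is built around. This is not DeepSeek’s code: the function name, the epsilon, and the example rewards are mine, and the full objective also includes a clipped policy-ratio term and a KL penalty that I omit here.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Compute group-relative advantages for one prompt.

    `rewards` holds the scalar reward of each of the G completions
    sampled for the same prompt. Instead of a learned critic, GRPO
    uses the group's own statistics as the baseline.
    """
    mean = rewards.mean()
    std = rewards.std()
    return (rewards - mean) / (std + 1e-8)  # epsilon avoids division by zero

# Example: four completions sampled for one prompt, already scored by a reward function.
rewards = torch.tensor([1.0, 0.0, 1.0, 0.5])
print(grpo_advantages(rewards))  # better-than-average completions get positive advantage
```

Completions that score above the group average are pushed up, the rest are pushed down, which is what removes the need for a value network.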
R1 and R1-Zero are both reasoning models. This essentially means they do Chain-of-Thought before answering. For the R1 series of models, this takes the form of thinking inside a <think> tag before answering with a final summary.

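For a rough sense of what that looks like in practice, here is a tiny parsing sketch. It assumes the <think>...</think> tag pair that R1-style outputs are commonly reported to use; the hard-coded completion is invented for illustration.

```python
import re

# A toy R1-style completion: reasoning inside <think>...</think>, then the final answer.
completion = (
    "<think>The user asks for 17 * 24. 17 * 24 = 17 * 20 + 17 * 4 "
    "= 340 + 68 = 408.</think>\n"
    "17 multiplied by 24 is 408."
)

match = re.search(r"<think>(.*?)</think>\s*(.*)", completion, re.DOTALL)
if match:
    reasoning, answer = match.group(1), match.group(2)
    print("Reasoning:", reasoning)
    print("Answer:", answer)
```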
R1-Zero vs R1

R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base with no supervised fine-tuning (SFT). RL is used to optimize the model’s policy to maximize reward. R1-Zero achieves excellent accuracy but often produces confusing outputs, such as mixing several languages in a single response. R1 fixes that by incorporating limited supervised fine-tuning and multiple RL passes, which improves both accuracy and readability.

It is fascinating how some languages may express certain concepts better, which leads the model to pick the most expressive language for the task.
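The paper describes R1-Zero’s reward as simple rule-based checks rather than a learned reward model: an accuracy reward for verifiable answers and a format reward for properly tagged reasoning. The sketch below only illustrates that idea; the function names, the exact-string match, and the equal weighting are my assumptions, not the authors’ implementation.

```python
import re

def format_reward(completion: str) -> float:
    """Reward completions whose reasoning is wrapped in <think>...</think>."""
    return 1.0 if re.search(r"<think>.*?</think>", completion, re.DOTALL) else 0.0

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """Reward completions whose final answer matches a known ground truth.

    Real checks are task-specific (math verifiers, code test suites);
    exact string matching is only a stand-in here.
    """
    answer = completion.split("</think>")[-1].strip()
    return 1.0 if answer == ground_truth else 0.0

def total_reward(completion: str, ground_truth: str) -> float:
    # Equal weighting of the two rule-based rewards is an assumption.
    return accuracy_reward(completion, ground_truth) + format_reward(completion)
```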

Training Pipeline

The training pipeline that DeepSeek published in the R1 paper is immensely interesting. It shows how they developed such strong reasoning models, and what you can expect from each stage. This includes the problems that the models coming out of each stage have, and how those were addressed in the next stage.

It’s interesting that their training pipeline diverges from the usual one:

The usual training strategy: