The Open R1 project by Hugging Face aims to create a fully open-source reproduction of the DeepSeek-R1 pipeline. This initiative allows developers and researchers to replicate, build upon, and customize the pipeline for tasks involving model training, evaluation, and synthetic data generation. The project uses a modular design, making it accessible and adaptable for a variety of use cases.
Key components of this reproduction include:
- Training Scripts: These scripts handle supervised fine-tuning (SFT), Group Relative Policy Optimization (GRPO), and other training strategies.
- Evaluation Tools: Tools for benchmarking and assessing model performance on datasets, using frameworks like lighteval.
- Synthetic Data Generation: Techniques for generating high-quality synthetic datasets to enrich model training.
The repository is under active development, welcoming contributions from the open-source community to enhance its functionality and scope.
Steps for Setting Up the DeepSeek-R1 Pipeline
1. Environment Setup
To get started, create a Python virtual environment using Conda:
```bash
conda create -n openr1 python=3.11 && conda activate openr1
```
Next, install the required dependencies, including PyTorch (v2.5.1) and vLLM:
```bash
pip install vllm==0.6.6.post1
pip install -e ".[dev]"
```
Ensure you log into your Hugging Face and Weights & Biases accounts for seamless integration:
```bash
huggingface-cli login
wandb login
```
Additionally, install Git Large File Storage (Git LFS) to manage and push large models or datasets:
```bash
sudo apt-get install git-lfs
git-lfs --version
```
2. Training Models
The pipeline supports multiple training approaches, including DeepSpeed ZeRO and Distributed Data Parallel (DDP). Configure the training method by adjusting the YAML configuration files in the configs directory.
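As a rough illustration of working with these YAML files, the sketch below loads a config and overrides one value before launching a new run. The keys shown are hypothetical examples, not taken from the repository's actual recipe files, and the snippet assumes PyYAML is installed:

```python
import yaml

# Hypothetical config: the exact keys depend on the recipes in configs/;
# the ones below are illustrative only.
config_text = """
model_name_or_path: Qwen/Qwen2.5-Math-1.5B-Instruct
learning_rate: 2.0e-5
num_train_epochs: 1
bf16: true
"""

config = yaml.safe_load(config_text)
config["num_train_epochs"] = 3  # override a single field before relaunching

print(config["num_train_epochs"])
```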
For example, to run Supervised Fine-Tuning (SFT) on a distilled dataset like Bespoke-Stratos-17k, use the following command:
```bash
accelerate launch --config_file=configs/zero3.yaml src/open_r1/sft.py \
    --model_name_or_path Qwen/Qwen2.5-Math-1.5B-Instruct \
    --dataset_name HuggingFaceH4/Bespoke-Stratos-17k \
    --learning_rate 2e-5 \
    --num_train_epochs 1 \
    --max_seq_length 4096 \
    --output_dir data/Qwen2.5-1.5B-Open-R1-Distill
```
For larger models or advanced configurations, you can modify parameters like batch size, gradient accumulation steps, or precision settings (e.g., bf16).
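When adjusting batch size and gradient accumulation, it helps to keep the effective (global) batch size in mind, since it is what actually changes the optimization dynamics. A minimal sketch, with illustrative numbers rather than the repository's defaults:

```python
def effective_batch_size(per_device_batch: int, grad_accum_steps: int, num_gpus: int) -> int:
    """Number of samples contributing to each optimizer step across all devices."""
    return per_device_batch * grad_accum_steps * num_gpus

# e.g. 4 samples per GPU, 8 accumulation steps, 8 GPUs -> 256 samples per step
print(effective_batch_size(4, 8, 8))  # 256
```

Halving the per-device batch (to fit memory) while doubling the accumulation steps keeps this number, and thus the training behavior, roughly unchanged.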
3. Evaluating Models
Use the lighteval framework to evaluate trained models against tasks or benchmarks. A typical evaluation command might look like this:
```bash
lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
    --custom-tasks src/open_r1/evaluate.py \
    --system-prompt="Please reason step by step, and put your final answer within \boxed{}." \
    --output-dir $OUTPUT_DIR
```
The evaluation can be scaled to utilize multiple GPUs for improved throughput by enabling Data Parallelism or Tensor Parallelism.
4. Generating Synthetic Data
Synthetic data generation is a cornerstone of the R1 pipeline. Using a distilled R1 model, you can produce datasets tailored to specific use cases. Here’s an example setup using the distilabel library:
```python
from datasets import load_dataset
from distilabel.models import vLLM
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import TextGeneration

# Prompts to distill from: here a small slice of a math dataset
dataset = load_dataset("AI-MO/NuminaMath-TIR", split="train").select(range(10))

with Pipeline(
    name="distill-qwen-7b-r1",
    description="Generate synthetic data from a distilled R1 model",
) as pipeline:
    llm = vLLM(
        model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
        tokenizer="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
        generation_kwargs={"temperature": 0.6, "max_new_tokens": 8192},
    )
    # Generate a completion for each prompt; map the dataset's "problem"
    # column onto the task's "instruction" input
    TextGeneration(llm=llm, input_mappings={"instruction": "problem"})

distiset = pipeline.run(dataset=dataset)
```
Generated datasets can then be pushed to the Hugging Face Hub for collaboration or reuse.
Key Takeaways
- Flexibility: The DeepSeek-R1 pipeline supports a range of tasks, from fine-tuning models to generating synthetic datasets.
- Community-Driven: The open-source nature of the project encourages contributions to expand its capabilities.
- Scalability: With support for distributed training and evaluation, the pipeline is adaptable to various hardware setups.
FAQs
Q1: What hardware is recommended for training large models?
A: The repository is optimized for setups with 8×H100 GPUs, but configurations can be adjusted for smaller systems.
Q2: Can I use custom datasets for training?
A: Yes, custom datasets can be integrated by specifying their paths in the configuration files.
Q3: Where can I find pre-trained models or datasets?
A: Check the Hugging Face Hub for available models and datasets compatible with the DeepSeek-R1 pipeline.