The Open R1 project by Hugging Face aims to create a fully open-source reproduction of the DeepSeek-R1 pipeline. This initiative allows developers and researchers to replicate, build upon, and customize the pipeline for tasks involving model training, evaluation, and synthetic data generation. The project uses a modular design, making it accessible and adaptable for a variety of use cases.
Key components of this reproduction include:
- Training Scripts: These scripts handle supervised fine-tuning (SFT), Group Relative Policy Optimization (GRPO), and other training strategies.
- Evaluation Tools: Tools for benchmarking and assessing model performance on datasets, using frameworks like lighteval.
- Synthetic Data Generation: Techniques for generating high-quality synthetic datasets to enrich model training.
The repository is under active development, welcoming contributions from the open-source community to enhance its functionality and scope.
Steps for Setting Up the DeepSeek-R1 Pipeline
1. Environment Setup
To get started, create a Python virtual environment using Conda:
```bash
conda create -n openr1 python=3.11 && conda activate openr1
```
Next, install the required dependencies, including PyTorch (v2.5.1) and vLLM:
```bash
pip install vllm==0.6.6.post1
pip install -e ".[dev]"
```
Ensure you log into your Hugging Face and Weights & Biases accounts for seamless integration:
```bash
huggingface-cli login
wandb login
```
Additionally, install Git Large File Storage (Git LFS) to manage and push large models or datasets:
```bash
sudo apt-get install git-lfs
git-lfs --version
```
2. Training Models
The pipeline supports multiple training approaches, including DeepSpeed ZeRO and Distributed Data Parallel (DDP). Configure the training method by adjusting the YAML configuration files in the configs directory.
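As a rough illustration of working with these YAML files, the sketch below loads a config and overrides one value before launching a new run. The keys shown are hypothetical examples, not taken from the repository's actual recipe files, and the snippet assumes PyYAML is installed:

```python
import yaml

# Hypothetical config: the exact keys depend on the recipes in configs/;
# the ones below are illustrative only.
config_text = """
model_name_or_path: Qwen/Qwen2.5-Math-1.5B-Instruct
learning_rate: 2.0e-5
num_train_epochs: 1
bf16: true
"""

config = yaml.safe_load(config_text)
config["num_train_epochs"] = 3  # override a single field before relaunching

print(config["num_train_epochs"])
```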
For example, to run Supervised Fine-Tuning (SFT) on a distilled dataset like Bespoke-Stratos-17k, use the following command:
```bash
accelerate launch --config_file=configs/zero3.yaml src/open_r1/sft.py \
    --model_name_or_path Qwen/Qwen2.5-Math-1.5B-Instruct \
    --dataset_name HuggingFaceH4/Bespoke-Stratos-17k \
    --learning_rate 2e-5 \
    --num_train_epochs 1 \
    --max_seq_length 4096 \
    --output_dir data/Qwen2.5-1.5B-Open-R1-Distill
```
For larger models or advanced configurations, you can modify parameters like batch size, gradient accumulation steps, or precision settings (e.g., bf16).
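When adjusting batch size and gradient accumulation, it helps to keep the effective (global) batch size in mind, since it is what actually changes the optimization dynamics. A minimal sketch, with illustrative numbers rather than the repository's defaults:

```python
def effective_batch_size(per_device_batch: int, grad_accum_steps: int, num_gpus: int) -> int:
    """Number of samples contributing to each optimizer step across all devices."""
    return per_device_batch * grad_accum_steps * num_gpus

# e.g. 4 samples per GPU, 8 accumulation steps, 8 GPUs -> 256 samples per step
print(effective_batch_size(4, 8, 8))  # 256
```

Halving the per-device batch (to fit memory) while doubling the accumulation steps keeps this number, and thus the training behavior, roughly unchanged.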
3. Evaluating Models
Use the lighteval framework to evaluate trained models against tasks or benchmarks. A typical evaluation command might look like this:
```bash
lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
    --custom-tasks src/open_r1/evaluate.py \
    --system-prompt="Please reason step by step, and put your final answer within \boxed{}." \
    --output-dir $OUTPUT_DIR
```
The evaluation can be scaled to utilize multiple GPUs for improved throughput by enabling Data Parallelism or Tensor Parallelism.
4. Generating Synthetic Data
Synthetic data generation is a cornerstone of the R1 pipeline. Using a distilled R1 model, you can produce datasets tailored to specific use cases. Here’s an example setup using the distilabel library:
```python
from datasets import load_dataset
from distilabel.models import vLLM
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import TextGeneration

# Prompts to distill from: here a small slice of a math dataset
dataset = load_dataset("AI-MO/NuminaMath-TIR", split="train").select(range(10))

with Pipeline(
    name="distill-qwen-7b-r1",
    description="Generate synthetic data from a distilled R1 model",
) as pipeline:
    llm = vLLM(
        model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
        tokenizer="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
        generation_kwargs={"temperature": 0.6, "max_new_tokens": 8192},
    )
    # Generate a completion for each prompt; map the dataset's "problem"
    # column onto the task's "instruction" input
    TextGeneration(llm=llm, input_mappings={"instruction": "problem"})

distiset = pipeline.run(dataset=dataset)
```
Generated datasets can then be pushed to the Hugging Face Hub for collaboration or reuse.
Key Takeaways
- Flexibility: The DeepSeek-R1 pipeline supports a range of tasks, from fine-tuning models to generating synthetic datasets.
- Community-Driven: The open-source nature of the project encourages contributions to expand its capabilities.
- Scalability: With support for distributed training and evaluation, the pipeline is adaptable to various hardware setups.
FAQs
Q1: What hardware is recommended for training large models?
A: The repository is optimized for setups with 8×H100 GPUs, but configurations can be adjusted for smaller systems.
Q2: Can I use custom datasets for training?
A: Yes, custom datasets can be integrated by specifying their paths in the configuration files.
Q3: Where can I find pre-trained models or datasets?
A: Check the Hugging Face Hub for available models and datasets compatible with the DeepSeek-R1 pipeline.