Dream-Coder-Base

Data

The adaptation data used to train Dream-Coder-Base consists mostly of OpenCoder, Stack-Edu, Dolmino, and DCLM-Baseline. The filtering script for the DCLM data is provided in data_filter.py. General text, code, and math are mixed with weights 2:7:1. The detailed per-dataset weights are as follows:

dclm_filtered: 0.17
wikibook: 0.02
finemath: 0.05
openmathinstruct: 0.025
tinygsm: 0.025
tulu: 0.005
natural_reasoning: 0.005
open_coder_anneal: 0.15
stack_v2_smol: 0.4
stack_edu_py: 0.15
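
As a sanity check, the per-dataset weights above can be summed by category to confirm that they reproduce the stated 2:7:1 mix. The snippet below is only a sketch: the assignment of each dataset to a category (`GROUPS`) is an assumption inferred from the dataset names, not something the repository states, and `group_totals` and `sample_source` are hypothetical helpers that are not part of the Dream-Coder code.

```python
import math
import random

# Weights copied verbatim from the list above.
WEIGHTS = {
    "dclm_filtered": 0.17,
    "wikibook": 0.02,
    "finemath": 0.05,
    "openmathinstruct": 0.025,
    "tinygsm": 0.025,
    "tulu": 0.005,
    "natural_reasoning": 0.005,
    "open_coder_anneal": 0.15,
    "stack_v2_smol": 0.4,
    "stack_edu_py": 0.15,
}

# ASSUMPTION: this category assignment is inferred from the dataset names;
# the README does not say which dataset counts as text, code, or math.
GROUPS = {
    "general": ["dclm_filtered", "wikibook", "tulu", "natural_reasoning"],
    "math": ["finemath", "openmathinstruct", "tinygsm"],
    "code": ["open_coder_anneal", "stack_v2_smol", "stack_edu_py"],
}


def group_totals(weights, groups):
    """Sum the per-dataset weights inside each category."""
    return {name: sum(weights[k] for k in keys) for name, keys in groups.items()}


def sample_source(rng, weights):
    """Pick one dataset name with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


totals = group_totals(WEIGHTS, GROUPS)

# The weights should form a probability distribution.
assert math.isclose(sum(WEIGHTS.values()), 1.0)

print({name: round(total, 3) for name, total in totals.items()})
# → {'general': 0.2, 'math': 0.1, 'code': 0.7}
```

Under this assumed grouping the totals are 0.2, 0.7, and 0.1 for general text, code, and math, which matches the 2:7:1 ratio. Calling `sample_source(random.Random(0), WEIGHTS)` illustrates how a data loader might draw the source of the next document in proportion to these weights.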

Evaluation

The evaluation code is based on lm-eval, Qwen-Coder Eval and Dream.

To evaluate on BigCodeBench, follow the environment installation guidelines in Qwen-Coder Eval, then run:

bash eval_bcb.sh

To evaluate on other tasks, first install the bundled lm_eval package:

cd lm_eval
pip install -e ".[math]"

then run:

bash eval_code_base.sh