The adaptation data used to train Dream-Coder-Base consists mainly of OpenCoder, Stack-Edu, Dolmino, and DCLM-Baseline. The script used to filter the DCLM data is provided in data_filter.py. General text, code, and math are mixed in a 2:7:1 ratio. The per-source weights are as follows:
dclm_filtered: 0.17
wikibook: 0.02
finemath: 0.05
openmathinstruct: 0.025
tinygsm: 0.025
tulu: 0.005
natural_reasoning: 0.005
open_coder_anneal: 0.15
stack_v2_smol: 0.4
stack_edu_py: 0.15
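As a quick sanity check, the per-source weights above can be summed per category to confirm they reproduce the 2:7:1 ratio. The sketch below is illustrative only: the assignment of each source to a category (for example, counting tulu and natural_reasoning as general text) is an assumption chosen so the totals match the stated ratio, and the names WEIGHTS, GROUPS, and group_totals are hypothetical, not part of the repository.

```python
from math import isclose

# Per-source weights, copied from the list above.
WEIGHTS = {
    "dclm_filtered": 0.17,
    "wikibook": 0.02,
    "finemath": 0.05,
    "openmathinstruct": 0.025,
    "tinygsm": 0.025,
    "tulu": 0.005,
    "natural_reasoning": 0.005,
    "open_coder_anneal": 0.15,
    "stack_v2_smol": 0.4,
    "stack_edu_py": 0.15,
}

# ASSUMPTION: this grouping is not stated in the source; it is one
# assignment under which the totals match the 2:7:1 ratio.
GROUPS = {
    "general": ["dclm_filtered", "wikibook", "tulu", "natural_reasoning"],
    "code": ["open_coder_anneal", "stack_v2_smol", "stack_edu_py"],
    "math": ["finemath", "openmathinstruct", "tinygsm"],
}


def group_totals(weights, groups):
    """Sum the per-source weights within each category."""
    return {
        name: sum(weights[source] for source in sources)
        for name, sources in groups.items()
    }


totals = group_totals(WEIGHTS, GROUPS)

# The weights should form a complete mixture (sum to 1).
assert isclose(sum(WEIGHTS.values()), 1.0, abs_tol=1e-9)

for name, total in totals.items():
    print(f"{name}: {total:.2f}")
```

Under this grouping the category totals come to roughly 0.2 (general), 0.7 (code), and 0.1 (math), which matches the stated ratio.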
The evaluation code is based on lm-eval, Qwen-Coder Eval and Dream.
To evaluate on BigCodeBench, follow the environment installation guide in Qwen-Coder Eval, then run:
bash eval_bcb.sh
To evaluate on other tasks, run:
cd lm_eval
pip install -e ".[math]"
then run:
bash eval_code_base.sh
