Sunbelt Computer Software

Dream-Coder-Base

Data

The adaptation data used to train Dream-Coder-Base consists mostly of OpenCoder, Stack-Edu, Dolmino, and DCLM-Baseline. The script used to filter the DCLM data is provided in data_filter.py. General text, code, and math are mixed with weights 2:7:1. The per-source weights are as follows:

dclm_filtered: 0.17
wikibook: 0.02
finemath: 0.05
openmathinstruct: 0.025
tinygsm: 0.025
tulu: 0.005
natural_reasoning: 0.005
open_coder_anneal: 0.15
stack_v2_smol: 0.4
stack_edu_py: 0.15
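
As a sanity check, the per-source weights above can be summed by domain to confirm that they match the stated 2:7:1 ratio. The sketch below is illustrative only: the assignment of each source to "general", "code", or "math" is an assumption inferred from the source names, and `group_totals` and `sample_sources` are hypothetical helpers, not part of this repository.

```python
import math
import random

# Mixture weights copied from the list above.
WEIGHTS = {
    "dclm_filtered": 0.17,
    "wikibook": 0.02,
    "finemath": 0.05,
    "openmathinstruct": 0.025,
    "tinygsm": 0.025,
    "tulu": 0.005,
    "natural_reasoning": 0.005,
    "open_coder_anneal": 0.15,
    "stack_v2_smol": 0.4,
    "stack_edu_py": 0.15,
}

# ASSUMPTION: this grouping is inferred from the source names,
# it is not stated in the README.
GROUPS = {
    "general": ["dclm_filtered", "wikibook", "tulu", "natural_reasoning"],
    "code": ["open_coder_anneal", "stack_v2_smol", "stack_edu_py"],
    "math": ["finemath", "openmathinstruct", "tinygsm"],
}


def group_totals(weights, groups):
    """Sum the per-source weights within each domain."""
    return {
        name: sum(weights[source] for source in sources)
        for name, sources in groups.items()
    }


def sample_sources(weights, k, seed=0):
    """Draw k source names with probability proportional to their weight."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=k)


totals = group_totals(WEIGHTS, GROUPS)

# The weights should form a probability distribution.
assert math.isclose(sum(WEIGHTS.values()), 1.0)

print({name: round(total, 3) for name, total in totals.items()})
print(sample_sources(WEIGHTS, k=5))
```

Under this grouping the domain totals come out to 0.2, 0.7, and 0.1 for general text, code, and math, which agrees with the 2:7:1 ratio.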

Evaluation

The evaluation code is based on lm-eval, Qwen-Coder Eval and Dream.

To evaluate on BigCodeBench, follow the environment installation guide in Qwen-Coder Eval, then run:

bash eval_bcb.sh

To evaluate on other tasks, first install the lm_eval package:

cd lm_eval
pip install -e ".[math]"

Then run the evaluation script:

bash eval_code_base.sh
