StarCoderData Corpus Overview | AI-generated and translated
Question: Tell me about bigcode/starcoderdata: what it is, what it contains, and how it is structured.
Answer:
What it is
starcoderdata is the exact training corpus used for StarCoder and StarCoderBase: 783 GB of code in 86 programming languages, plus 54 GB of GitHub Issues, 13 GB of Jupyter Notebooks (as both scripts and text-code pairs), and 32 GB of GitHub commits, for a total of roughly 250 billion tokens. StarCoderBase was trained on about 1 trillion tokens (roughly 4 epochs over this data), and StarCoder, the Python variant, was trained for a further 2 epochs on the Python subset.
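The epoch figure follows directly from the two token counts quoted above. As a quick check, here is the arithmetic using only the rounded numbers from this answer, so the result is approximate:

```python
# Rounded figures quoted in the text above; both are approximations.
CORPUS_TOKENS = 250e9      # ~250 billion tokens in starcoderdata
TRAINING_TOKENS = 1e12     # ~1 trillion tokens seen by StarCoderBase


def epochs_over_corpus(training_tokens: float, corpus_tokens: float) -> float:
    """Return how many passes over the corpus a token budget corresponds to."""
    return training_tokens / corpus_tokens


epochs = epochs_over_corpus(TRAINING_TOKENS, CORPUS_TOKENS)
print(f"about {epochs:.0f} epochs")
```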
That makes it the best open artifact for understanding what a "Codex-style" pretraining corpus actually looks like: a fully public version of the work that closed labs do in private.
The pipeline that produced it (the genuinely interesting part)
It starts from The Stack v1 (permissively licensed GitHub code, identified through license detection) and applies the following steps in order:
- Language selection: 86 languages chosen by data volume and popularity, plus config/markup formats (JSON, YAML, Markdown).
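- Quality filters: per-language heuristics (line-length limits, alphanumeric-character ratio, alphabetic-token ratio, auto-generated-file detection), plus filtering of data-heavy files (long JSON/YAML is cut back aggressively). Each language was inspected manually to tune the thresholds.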
- Near-deduplication: MinHash + LSH (Jaccard similarity of about 0.85, 5-gram shingles). Deduplication is the single most impactful step; their ablations show that near-deduplication works much better than exact deduplication alone.
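- PII redaction: they trained an NER model (StarPII) on an annotated dataset to detect and mask names, emails, keys, passwords, and IP addresses, which are then replaced with tokens such as <NAME>, <API_KEY>, and <IP_ADDRESS> (effectively the redaction policy you just added to your custom instructions, applied at corpus scale).
- Decontamination: files matching the HumanEval, MBPP, APPS, or GSM8K test sets are removed.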
- Opt-out: repositories belonging to developers who requested removal through the "Am I in The Stack" tool are deleted.
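To make the near-deduplication step more concrete, here is a minimal pure-Python sketch of MinHash with LSH banding. It is not BigCode's implementation: the number of hash functions (`NUM_PERM`), the band count (`BANDS`), the `\W+` tokenization, and the use of salted BLAKE2b as the hash family are all assumptions chosen for illustration. Only the 5-gram shingle size and the roughly 0.85 threshold come from the text above.

```python
import hashlib
import re
from collections import defaultdict
from itertools import combinations

NUM_PERM = 128     # number of hash functions (assumed, not stated in the text)
BANDS = 32         # LSH bands, so 4 rows per band (assumed)
THRESHOLD = 0.85   # Jaccard threshold quoted in the text
NGRAM = 5          # shingle size quoted in the text


def shingles(text: str, n: int = NGRAM) -> set[str]:
    """Tokenize on non-word characters and return the set of token n-grams."""
    tokens = [t for t in re.split(r"\W+", text) if t]
    if len(tokens) < n:
        return {" ".join(tokens)} if tokens else set()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def _hash(shingle: str, seed: int) -> int:
    """64-bit hash of a shingle; the salt stands in for a random permutation."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def minhash(text: str) -> tuple[int, ...]:
    """MinHash signature: the minimum hash of all shingles under each seed."""
    grams = shingles(text)
    if not grams:
        return (0,) * NUM_PERM  # empty files all share one signature
    return tuple(min(_hash(g, seed) for g in grams) for seed in range(NUM_PERM))


def estimated_jaccard(a: tuple[int, ...], b: tuple[int, ...]) -> float:
    """Fraction of signature positions that agree, an estimate of Jaccard."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def near_duplicate_pairs(docs: dict[str, str]) -> set[tuple[str, str]]:
    """Return pairs of document keys whose estimated Jaccard meets THRESHOLD."""
    sigs = {key: minhash(text) for key, text in docs.items()}
    rows = NUM_PERM // BANDS
    buckets = defaultdict(set)
    for key, sig in sigs.items():
        for band in range(BANDS):
            # Documents that agree on every row of a band become candidates.
            buckets[(band, sig[band * rows:(band + 1) * rows])].add(key)
    pairs = set()
    for keys in buckets.values():
        for a, b in combinations(sorted(keys), 2):
            if estimated_jaccard(sigs[a], sigs[b]) >= THRESHOLD:
                pairs.add((a, b))
    return pairs


base = " ".join(f"tok{i}" for i in range(300))
docs = {
    "a.py": base,
    "b.py": base + " extra_token",                       # one token appended
    "c.py": " ".join(f"other{i}" for i in range(300)),   # unrelated content
}
print(near_duplicate_pairs(docs))
```

Banding is what keeps this tractable at corpus scale: only documents that collide in at least one band are ever compared, instead of every pair.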
Structure on disk
The repository root has one directory per language (python/, cpp/, rust/, java/, ... 86 directories in total), and each directory contains Parquet shards. Code rows contain content (the source text), id, and max_stars_count. Four special subsets have different schemas, so loading the whole dataset in one call fails; load them separately: jupyter-scripts-dedup-filtered, jupyter-structured-clean-dedup, github-issues-filtered-structured, git-commits-cleaned.
One key formatting detail: content embeds metadata inline as special tokens, which StarCoder's tokenizer treats as atomic tokens:
<reponame>owner/repo<filename>src/foo.py<gh_stars>42
... actual code ...
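If you want the metadata and the code separately, the prefix shown above can be peeled off with a small parser. This is a sketch under assumptions: `split_metadata` is a hypothetical helper, and treating each tag as optional and the prefix as ending at the first newline are guesses based only on the single example above.

```python
import re
from typing import Optional

# Matches the prefix format shown above. Making every tag optional is an
# assumption; the text only shows the case where all three are present.
_PREFIX = re.compile(
    r"^(?:<reponame>(?P<reponame>[^<\n]*))?"
    r"(?:<filename>(?P<filename>[^<\n]*))?"
    r"(?:<gh_stars>(?P<gh_stars>[^<\n]*))?"
)


def split_metadata(content: str) -> tuple[dict[str, Optional[str]], str]:
    """Return (metadata, code) for a row's `content` string."""
    match = _PREFIX.match(content)  # always matches, possibly with empty width
    meta = match.groupdict()
    code = content[match.end():]
    # Drop the newline that separates the prefix from the code, if any.
    if any(value is not None for value in meta.values()) and code.startswith("\n"):
        code = code[1:]
    return meta, code


sample = "<reponame>owner/repo<filename>src/foo.py<gh_stars>42\nprint('hello')\n"
meta, code = split_metadata(sample)
print(meta)
print(code)
```

The star count is kept as a string because the text does not specify its value format.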
Commits look like <commit_before>...<commit_msg>...<commit_after>, while Issues use <issue_start>, <issue_comment>, and so on. If you train your own tokenizer on this data, you need to decide whether to keep these markers: they are how StarCoder learns repository- and file-conditioned generation, and they enable the trick of prompting with <gh_stars>1000 to bias toward higher-quality completions. FIM (<fim_prefix>/<fim_middle>/<fim_suffix>) is applied at training time rather than embedded in the dataset.
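Because FIM is not stored in the data, a training loop has to apply it itself. The sketch below shows the usual prefix-suffix-middle rearrangement using the three tokens named above. The character-level cut points and the `fim_rate` default of 0.5 are assumptions for illustration, and `apply_fim` is a hypothetical helper rather than StarCoder's training code.

```python
import random

FIM_PREFIX, FIM_MIDDLE, FIM_SUFFIX = "<fim_prefix>", "<fim_middle>", "<fim_suffix>"


def apply_fim(text: str, rng: random.Random, fim_rate: float = 0.5) -> str:
    """With probability fim_rate, rewrite text in prefix-suffix-middle order."""
    if len(text) < 2 or rng.random() >= fim_rate:
        return text  # leave the document in ordinary left-to-right order
    # Two distinct cut points (character level, an assumption) give three spans.
    lo, hi = sorted(rng.sample(range(len(text) + 1), 2))
    prefix, middle, suffix = text[:lo], text[lo:hi], text[hi:]
    # The model sees prefix and suffix first, then learns to produce the middle.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"


example = "def add(a, b):\n    return a + b\n"
print(apply_fim(example, random.Random(0), fim_rate=1.0))
```

In a real pipeline the cut points would more likely be chosen on token boundaries after tokenization, with the three markers mapped to their special-token ids.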
How to use it
It is gated: you must log in and accept The Stack's terms of use (downstream use must respect the original code licenses, and you agree to keep your copy in step with removal requests). Then:
from datasets import load_dataset

# Stream a single language; never blindly download the full 783 GB.
ds = load_dataset("bigcode/starcoderdata", data_dir="python",
                  split="train", streaming=True)

for i, row in enumerate(ds):
    print(row["content"][:300], "\n---")
    if i == 2:  # stop after three samples
        break
Sizing for your setup: the python/ directory alone is about 60 GB (about 20 billion tokens), which is more than enough for a Chinchilla-optimal training run of a 760M-parameter model (about 15 billion tokens). It fits on your workstation's 916 GB disk, but it is not a good fit for the Air. For the MI300X, the cleanest approach is streaming plus on-the-fly tokenization into a memory-mapped .bin file (a nanoGPT-style prepare.py); tokenizing 20 billion tokens with tiktoken is a one-off CPU job that takes roughly 1-2 hours. TinyLlama-1.1B used this dataset as the code portion of its mix, so its repository is a good reference for implementing the same pipeline at your current scale.
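As a rough illustration of the tokenize-to-.bin idea, the sketch below writes token ids as unsigned 16-bit integers and reads a slice back through a memory map. It uses only the standard library so that it runs anywhere; nanoGPT itself relies on NumPy for this. `byte_encode` is a hypothetical stand-in for a real tokenizer such as tiktoken, and `write_token_bin`, `read_tokens`, and the end-of-text id of 256 are likewise illustrative assumptions.

```python
import mmap
import os
import struct
import tempfile
from array import array
from typing import Callable, Iterable, List


def byte_encode(text: str) -> List[int]:
    """Stand-in tokenizer (assumption): one id per UTF-8 byte, range 0..255."""
    return list(text.encode("utf-8"))


def write_token_bin(texts: Iterable[str], encode: Callable[[str], List[int]],
                    path: str, eot_id: int) -> int:
    """Write uint16 token ids for each text to `path`; return the token count."""
    total = 0
    with open(path, "wb") as f:
        for text in texts:
            # "H" is an unsigned 16-bit integer; ids above 65535 raise OverflowError.
            ids = array("H", encode(text))
            ids.append(eot_id)  # document separator
            ids.tofile(f)
            total += len(ids)
    return total


def read_tokens(path: str, start: int, count: int) -> List[int]:
    """Read `count` token ids beginning at token index `start` via mmap."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Native byte order, 2 bytes per id, matching what array wrote.
            return list(struct.unpack_from(f"={count}H", mm, start * 2))


with tempfile.TemporaryDirectory() as tmp:
    bin_path = os.path.join(tmp, "train.bin")
    n_tokens = write_token_bin(["ab", "c"], byte_encode, bin_path, eot_id=256)
    print(n_tokens, read_tokens(bin_path, 2, 3))
```

In practice `texts` would be the `content` field of the streamed rows, so nothing but the .bin file ever needs to be stored. At two bytes per token, the roughly 20 billion Python tokens quoted above would come to about 40 GB on disk.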
Note: this is the v1 release. If you want the newer, larger version, the StarCoder2 counterpart is bigcode/the-stack-v2-train-smol-ids / -full-ids (about 900 billion tokens), but the file contents must be fetched from Software Heritage S3 (as described earlier). For a first training run, starcoderdata's inline text makes it easier to use.
References:
- starcoderdata dataset card
- StarCoder: May the source be with you! (paper)
- The Stack: 3 TB of permissively licensed source code (paper)
