Hub Storage Differences Explained | AI-generated and translated
Good question. Let me start from the storage layer.
The core difference: Git vs Git LFS vs Xet
GitHub stores everything as standard Git objects (packfiles). Git was designed for source code: it compresses text diffs very well but handles large binary files poorly. Each version of a binary file is stored essentially in full, so a 7GB model file with 10 versions takes up roughly 70GB on the server.
Hugging Face Hub looks like Git on the surface, but it uses two different backends for large files:
1. Git LFS (the traditional option, still the default)
When you run hf upload lzwjava/zz . this is what actually happens for large files:
Your machine                           HF Hub
────────────                           ──────
git commit (small files)        ──>    standard Git repo
(README.md, config.json, etc.)

LFS pointer files               ──>    LFS batch API
(large files tracked                   stores the actual blobs
 in .gitattributes)                    in object storage (S3/GCS)
The Git repo itself contains only pointer files, small text stubs like this:
version https://git-lfs.github.com/spec/v1
oid sha256:abc123...
size 7000000000
The actual model weights live in object storage (S3/GCS). When you run git clone, the LFS smudge filter fetches the real content from object storage rather than from Git. hf_hub_download skips Git altogether and fetches the file over HTTP.
This is why HF can host repos containing 100GB+ of model files: Git never sees the actual bytes.
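To make the pointer format concrete, here is a minimal sketch that builds and parses a pointer of the shape shown above. The helper names make_lfs_pointer and parse_lfs_pointer are hypothetical, written for illustration only; they are not part of Git LFS or huggingface_hub.

```python
import hashlib

LFS_SPEC = "https://git-lfs.github.com/spec/v1"

def make_lfs_pointer(content: bytes) -> str:
    """Build the small text stub that Git stores instead of `content`."""
    oid = hashlib.sha256(content).hexdigest()
    return f"version {LFS_SPEC}\noid sha256:{oid}\nsize {len(content)}\n"

def parse_lfs_pointer(text: str) -> dict:
    """Read the 'key value' lines of a pointer back into a dict."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line)
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "oid": digest,
        "size": int(fields["size"]),
    }

# Stand-in bytes; a real pointer would describe a multi-GB weights file.
payload = b"pretend these are model weights"
pointer = make_lfs_pointer(payload)
info = parse_lfs_pointer(pointer)
print(pointer)
```

The pointer is around a hundred bytes no matter how large the payload is, which is why the Git side of the repo stays small.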
2. Xet (the newer option, currently rolling out)
HF has been building a new backend called Xet, based on content-defined chunking (similar to rsync/restic). Instead of storing each file whole in object storage:
Traditional LFS:  stores the whole 7GB file as a single blob
Xet (CDC):        splits the file into variable-size chunks of about 64KB,
                  then deduplicates those chunks across files and versions
This means that if you fine-tune a model and only 5% of the weights change, Xet uploads and stores only the modified chunks. For iterative ML workflows that is a big advantage. The hf_api.py you mentioned references Xet support in _upload_large_folder.py and throughout the download path.
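The idea behind content-defined chunking can be sketched in a few lines. This is an illustration only, not Xet's actual algorithm: it assumes a simple gear-style rolling hash, and it uses much smaller chunk sizes than the ~64KB mentioned above so the demo runs quickly. All names and parameters here are hypothetical.

```python
import hashlib
import random

# Fixed pseudo-random table so chunk boundaries are reproducible.
_table_rng = random.Random(0)
GEAR = [_table_rng.getrandbits(64) for _ in range(256)]
MASK64 = (1 << 64) - 1

def cdc_chunks(data, avg_bits=12, min_size=1024, max_size=16384):
    """Cut where the top `avg_bits` bits of a rolling hash are all zero."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # After the 64-bit mask, h depends only on the last 64 bytes,
        # so boundaries follow the content rather than the file offset.
        h = ((h << 1) + GEAR[byte]) & MASK64
        size = i - start + 1
        at_boundary = size >= min_size and (h >> (64 - avg_bits)) == 0
        if at_boundary or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def store(chunks, chunk_store):
    """Add chunks keyed by SHA-256; return how many were not already stored."""
    new = 0
    for c in chunks:
        key = hashlib.sha256(c).hexdigest()
        if key not in chunk_store:
            chunk_store[key] = c
            new += 1
    return new

data_rng = random.Random(1)
v1 = bytes(data_rng.getrandbits(8) for _ in range(200_000))
v2 = v1[:100_000] + b"fine-tuned" + v1[100_000:]  # small edit in the middle

chunk_store = {}
v1_chunks = cdc_chunks(v1)
v2_chunks = cdc_chunks(v2)
new_v1 = store(v1_chunks, chunk_store)
new_v2 = store(v2_chunks, chunk_store)
print(f"v1: {len(v1_chunks)} chunks, {new_v1} new")
print(f"v2: {len(v2_chunks)} chunks, {new_v2} new")
```

Because boundaries depend on nearby bytes, inserting a few bytes shifts only the chunk around the edit; the chunks after it line up again and are already in the store. With fixed-size blocks, the same insertion would change every block that follows.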
Architecture comparison
                       GitHub                       Hugging Face Hub
───────────────────────────────────────────────────────────────────────────────
Protocol               Git (smart HTTP)             Git (smart HTTP)
                                                    + LFS batch API
                                                    + Xet CAS API
Large file storage     Git LFS (optional,           Git LFS (default)
                       opt-in per repo)             + Xet (content-addressed
                                                    chunk storage)
Max file size          ~2GB recommended             50GB+ supported
                                                    (hard limit ~100GB)
Deduplication          None (Git delta only)        Xet: cross-file chunk dedup
Clone/download         git clone (pulls             hf_hub_download (selective
                       everything)                  file fetch, symlink-based
                                                    cache, parallel downloads)
Repo types             Just "repo"                  models, datasets, spaces
                                                    (each with its own
                                                    metadata schema)
Metadata               README only                  Model card (YAML front
                                                    matter), dataset tags,
                                                    library name, metrics,
                                                    license, etc.
Pulling model weights  N/A (not designed for it)    First-class feature:
                                                    snapshot_download() and
                                                    from_pretrained() both
                                                    pull from the Hub
Web hosting            GitHub Pages                 Spaces (Gradio/Streamlit
                                                    apps, GPU hosting available)
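Selective fetch with hf_hub_download is possible because every file in a repo has its own HTTP address. A minimal sketch, assuming the public https://huggingface.co/<repo_id>/resolve/<revision>/<filename> pattern for model repos. The repo name and the resolve_url helper are hypothetical; in practice, huggingface_hub.hf_hub_download(repo_id=..., filename=...) handles the request and the local cache for you.

```python
from urllib.parse import quote

def resolve_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build the per-file download URL for a model repo (assumed pattern)."""
    return (
        "https://huggingface.co/"
        f"{repo_id}/resolve/{quote(revision, safe='')}/{quote(filename)}"
    )

# Hypothetical repo; only this one shard would be requested.
url = resolve_url("some-org/some-model", "model-00002-of-00004.safetensors")
print(url)
```

A client that wants one shard requests one URL, and nothing else in the repo is transferred.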
Why GitHub can't do this (and doesn't try to)
GitHub's value is in code collaboration: PRs, issues, Actions CI/CD, code review. Git's object model is optimized for diffing text. Running git diff on a 7B-parameter .safetensors file tells you nothing useful.
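HF Hub's value is in model distribution. The whole storage layer is designed around "a handful of small config files plus a few multi-GB binary blobs that rarely change". Selective download with hf_hub_download (fetching only model-00002-of-00004.safetensors rather than the whole repo) is not practical with plain Git, where a clone pulls the full repository by default.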
What your upload actually did
Start hashing 36 files.      # hash each file to detect changes
Finished hashing 36 files.   # upload only new or changed files
✓ Uploaded                   # push the commit through the HF API
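The hashing step is how HF implements incremental uploads: it checks which files already exist on the Hub (by SHA-256) and skips them, so unchanged files are not sent again. Git LFS also avoids re-sending objects the server already has (its batch API reports which ones are missing), but hf upload gets the same result without a local Git clone.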
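The skip-what-exists logic can be sketched as follows. This is an assumption-laden illustration, not the huggingface_hub implementation: plan_upload is a hypothetical helper, and remote_hashes stands in for whatever the Hub reports as already stored.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_file(path, block_size=1 << 20):
    """Hash a file in blocks so large files are never loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()

def plan_upload(folder, remote_hashes):
    """Return relative paths whose content hash is not in `remote_hashes`."""
    folder = Path(folder)
    to_upload = []
    for path in sorted(p for p in folder.rglob("*") if p.is_file()):
        if sha256_file(path) not in remote_hashes:
            to_upload.append(path.relative_to(folder).as_posix())
    return to_upload

with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.json").write_text("{}")
    (Path(tmp) / "weights.bin").write_bytes(b"\x00" * 1024)
    # Pretend the remote already holds the weights file (hypothetical state).
    already_remote = {hashlib.sha256(b"\x00" * 1024).hexdigest()}
    pending = plan_upload(tmp, already_remote)

print(pending)
```

Only config.json ends up in the plan, because the weights file's hash is already known to the remote.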
For zz (your CLI tool), the repo is probably all small files, so the upload is essentially an ordinary git push. But if you uploaded a 7GB .safetensors model, it would transparently take the LFS/Xet path.
