Workarounds for several huggingface-hub bugs
~Update: the issues described in this post have been fixed on the main branch of huggingface-hub but not yet released; you can install the main branch version manually.~
huggingface-hub currently has several bugs in its support for HF_ENDPOINT. I have filed issues upstream; until they are fixed, this post records temporary workarounds.
Why are these bugs worth a write-up?
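I have filed issues upstream along with partial fixes. I run hf-mirror.com, which depends heavily on HF_ENDPOINT, and users have been reporting these problems frequently of late, so some temporary workarounds are needed until the official fix lands.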
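More worryingly, upstream may not have much motivation to fix these bugs (my own unsubstantiated speculation). According to this official PR, the Private Hub experiment has been shut down, so the HF_ENDPOINT variable will no longer be actively promoted. Under that policy, although upstream states that the variable remains valid for now, support for it may gradually break. The problems I am currently hitting in the latest transformers and datasets are, I suspect, related to this policy (after all, the docs no longer introduce the variable, so new developers who are unaware of it can easily overlook it).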
Problem 1: Cannot upload files
Github Issue Link
To reproduce: as shown below, use the upload subcommand of huggingface-cli to upload every file in the current directory to the root of the remote repository:
$ export HF_ENDPOINT=https://hf-mirror.com
$ huggingface-cli upload --token hf_*** username/reponame . .
Consider using hf_transfer
for faster uploads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.
Traceback (most recent call last):
File "/home/padeoe/.local/bin/huggingface-cli", line 8, in
sys.exit(main())
File "/home/padeoe/.local/lib/python3.10/site-packages/huggingface_hub/commands/huggingface_cli.py", line 49, in main
service.run()
File "/home/padeoe/.local/lib/python3.10/site-packages/huggingface_hub/commands/upload.py", line 190, in run
print(self._upload())
File "/home/padeoe/.local/lib/python3.10/site-packages/huggingface_hub/commands/upload.py", line 251, in _upload
repo_id = self.api.create_repo(
File "/home/padeoe/.local/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
File "/home/padeoe/.local/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 3370, in create_repo
return RepoUrl(d["url"], endpoint=self.endpoint)
File "/home/padeoe/.local/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 437, in __init__
repo_type, namespace, repo_name = repo_type_and_id_from_hf_id(self, hub_url=self.endpoint)
File "/home/padeoe/.local/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 227, in repo_type_and_id_from_hf_id
raise ValueError(f"Unable to retrieve user and repo ID from the passed HF ID: {hf_id}")
ValueError: Unable to retrieve user and repo ID from the passed HF ID: https://huggingface.co/username/reponame
The command fails, and the error shows that it is still using https://huggingface.co rather than the https://hf-mirror.com set via HF_ENDPOINT.
Note that if HF_ENDPOINT="https://huggingface.co" is set, the problem above does not occur.
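The failure mode visible in the traceback can be illustrated with a minimal sketch. It assumes that the returned repository URL is parsed relative to the configured endpoint; `parse_repo_url` is a hypothetical, simplified stand-in written for this post, not the library's actual `repo_type_and_id_from_hf_id`.

```python
from typing import Tuple


def parse_repo_url(url: str, hub_url: str) -> Tuple[str, str]:
    """Hypothetical, simplified parser: split a repo URL into (namespace, name).

    It only accepts URLs that start with the configured hub_url, which is
    the assumption this sketch makes about where the mismatch comes from.
    """
    prefix = hub_url.rstrip("/") + "/"
    if not url.startswith(prefix):
        raise ValueError(
            f"Unable to retrieve user and repo ID from the passed HF ID: {url}"
        )
    namespace, _, repo_name = url[len(prefix):].partition("/")
    if not namespace or not repo_name:
        raise ValueError(
            f"Unable to retrieve user and repo ID from the passed HF ID: {url}"
        )
    return namespace, repo_name


if __name__ == "__main__":
    returned_url = "https://huggingface.co/username/reponame"

    # Endpoint matches the host in the returned URL: parsing succeeds.
    print(parse_repo_url(returned_url, "https://huggingface.co"))

    # Endpoint is a mirror but the URL still names huggingface.co: parsing fails.
    try:
        parse_repo_url(returned_url, "https://hf-mirror.com")
    except ValueError as err:
        print(err)
```

Under this assumption, the error appears whenever the host in the returned URL differs from HF_ENDPOINT, which matches the observation that setting HF_ENDPOINT to https://huggingface.co avoids it.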
Temporary workaround for Problem 1:
Uploading from Python code with HfApi.upload_folder still works:
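import os
# Set the endpoint before importing huggingface_hub, which reads it at import time.
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'
import huggingface_hub
huggingface_hub.login("hf_***")
from huggingface_hub import HfApi
api = HfApi()
# Upload the whole local folder to the root of the remote repository.
api.upload_folder(
    repo_id="username/reponame",
    folder_path="/path/to/your/repo",
    path_in_repo="."
)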
Problem 2: Cannot download certain datasets
Github Issue Link
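This issue is triggered only when all three of the following conditions hold:
- datasets>2.15.0 or huggingface-hub>0.19.4. This is very common: the latest versions of both trigger it.
- The dataset ID does not include an organization name, e.g. bookcorpus, gsm8k, wikipedia, as opposed to a dataset ID of the form A/B.
- HF_ENDPOINT is set and its value does not match the pattern (hub-ci.)?huggingface.co.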
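The three conditions can be expressed as a small check. This is only a sketch: `is_affected` and `parse_version` are hypothetical helpers written for this post, the regular expression is my assumption about how the (hub-ci.)?huggingface.co pattern applies to a full URL, and the "or" between the two version checks follows the wording of the list above.

```python
import re
from typing import Optional, Tuple

# Assumption: the official-endpoint pattern is matched against the whole URL.
OFFICIAL_ENDPOINT = re.compile(r"^https?://(hub-ci\.)?huggingface\.co/?$")


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '2.18.0' into (2, 18, 0); anything after the third number is ignored."""
    return tuple(int(n) for n in re.findall(r"\d+", text)[:3])


def is_affected(
    datasets_version: str,
    hub_version: str,
    dataset_id: str,
    endpoint: Optional[str],
) -> bool:
    """Return True when all three trigger conditions from the list hold."""
    # Condition 1: at least one library is newer than the last unaffected release.
    new_enough = (
        parse_version(datasets_version) > (2, 15, 0)
        or parse_version(hub_version) > (0, 19, 4)
    )
    # Condition 2: the dataset ID has no organization prefix (no "A/B" form).
    bare_id = "/" not in dataset_id
    # Condition 3: HF_ENDPOINT is set to something other than the official hosts.
    custom_endpoint = bool(endpoint) and not OFFICIAL_ENDPOINT.match(endpoint)
    return bool(new_enough and bare_id and custom_endpoint)


if __name__ == "__main__":
    print(is_affected("2.18.0", "0.21.4", "bookcorpus", "https://hf-mirror.com"))
    print(is_affected("2.18.0", "0.21.4", "A/B", "https://hf-mirror.com"))
```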
Steps to reproduce:
1. Install specific versions of the libraries (the latest at the time of writing)
pip install datasets==2.18.0
pip install huggingface_hub==0.21.4
2. Run the following Python code
import os
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'
from datasets import load_dataset
bookcorpus = load_dataset('bookcorpus', split='train')
The output is as follows:
Traceback (most recent call last):
File "", line 1, in
File "/home/padeoe/.local/lib/python3.10/site-packages/datasets/load.py", line 2556, in load_dataset
builder_instance = load_dataset_builder(
File "/home/padeoe/.local/lib/python3.10/site-packages/datasets/load.py", line 2228, in load_dataset_builder
dataset_module = dataset_module_factory(
File "/home/padeoe/.local/lib/python3.10/site-packages/datasets/load.py", line 1879, in dataset_module_factory
raise e1 from None
File "/home/padeoe/.local/lib/python3.10/site-packages/datasets/load.py", line 1830, in dataset_module_factory
with fs.open(f"datasets/{path}/{filename}", "r", encoding="utf-8") as f:
File "/home/padeoe/.local/lib/python3.10/site-packages/fsspec/spec.py", line 1295, in open
self.open(
File "/home/padeoe/.local/lib/python3.10/site-packages/fsspec/spec.py", line 1307, in open
f = self._open(
File "/home/padeoe/.local/lib/python3.10/site-packages/huggingface_hub/hf_file_system.py", line 228, in _open
return HfFileSystemFile(self, path, mode=mode, revision=revision, block_size=block_size, **kwargs)
File "/home/padeoe/.local/lib/python3.10/site-packages/huggingface_hub/hf_file_system.py", line 615, in __init__
self.resolved_path = fs.resolve_path(path, revision=revision)
File "/home/padeoe/.local/lib/python3.10/site-packages/huggingface_hub/hf_file_system.py", line 180, in resolve_path
repo_and_revision_exist, err = self._repo_and_revision_exist(repo_type, repo_id, revision)
File "/home/padeoe/.local/lib/python3.10/site-packages/huggingface_hub/hf_file_system.py", line 117, in _repo_and_revision_exist
self._api.repo_info(repo_id, revision=revision, repo_type=repo_type)
File "/home/padeoe/.local/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
File "/home/padeoe/.local/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 2413, in repo_info
return method(
File "/home/padeoe/.local/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
File "/home/padeoe/.local/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 2286, in dataset_info
hf_raise_for_status(r)
File "/home/padeoe/.local/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 362, in hf_raise_for_status
raise HfHubHTTPError(str(e), response=response) from e
huggingface_hub.utils._errors.HfHubHTTPError: 401 Client Error: Unauthorized for url: https://hf-mirror.com/api/datasets/bookcorpus/bookcorpus.py (Request ID: Root=1-65ee8659-5ab10eec5960c63e71f2bb58;b00bdbea-fd6e-4a74-8fe0-bc4682ae090e)
Temporary workaround for Problem 2
Downgrade to the versions below, or to anything older:
pip install datasets==2.15.0
or
pip install huggingface-hub==0.19.4
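Before downgrading, it can help to see which versions are actually installed. The sketch below uses the standard library's `importlib.metadata`; the helper names and the `LAST_GOOD` table are hypothetical, and the thresholds are simply the version numbers quoted in this post.

```python
import re
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple

# Last versions this post reports as unaffected.
LAST_GOOD = {"datasets": (2, 15, 0), "huggingface-hub": (0, 19, 4)}


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '2.18.0' into (2, 18, 0); anything after the third number is ignored."""
    return tuple(int(n) for n in re.findall(r"\d+", text)[:3])


def installed_version(package: str) -> Optional[Tuple[int, ...]]:
    """Return the installed version as a tuple, or None if the package is missing."""
    try:
        return parse_version(version(package))
    except PackageNotFoundError:
        return None


def newer_than_last_good(package: str) -> Optional[bool]:
    """True if the installed package is newer than the last unaffected version."""
    found = installed_version(package)
    if found is None:
        return None
    return found > LAST_GOOD[package]


if __name__ == "__main__":
    for name in LAST_GOOD:
        print(name, installed_version(name), newer_than_last_good(name))
```

A package reported as newer than its last unaffected version is a candidate for the downgrade commands above.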
Thanks
It really does work! Installing datasets==2.15.0 with pip is all it takes. It also works when tested on Baidu's platform.