Workarounds for several huggingface-hub bugs

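~Update: the issues described in this post have been fixed on the main branch of huggingface-hub but have not been released yet; you can install the main-branch version manually.~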

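huggingface-hub currently has several bugs in its support for HF_ENDPOINT. I have filed issues with the maintainers; until they are fixed, this post records some temporary workarounds.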

Why are these bugs worth writing up?

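I have already filed issues with the maintainers, along with partial fixes. Because I run hf-mirror.com, which depends heavily on HF_ENDPOINT, users have recently been reporting these problems to me frequently, so some temporary workarounds are needed until the official fixes land.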

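What worries me, moreover, is that the maintainers may not be strongly motivated to fix these bugs (an irresponsible personal guess on my part). According to this official PR, the Private Hub experiment has been shut down, so the HF_ENDPOINT variable will no longer be actively promoted. Under that policy, although the maintainers state that the variable still works for now, support for it may gradually break. The problems I am currently hitting in the latest versions of transformers and datasets are, I suspect, related to this policy (after all, once the documentation stops introducing the variable, new developers who do not know about it can easily overlook it).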

Problem 1: unable to upload files

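GitHub Issue Link
How to reproduce: as shown below, we use the upload subcommand of huggingface-cli to upload a project, sending every file in the current directory to the root of the remote repository: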

$ export HF_ENDPOINT=https://hf-mirror.com
$ huggingface-cli upload --token hf_*** username/reponame . .
Consider using hf_transfer for faster uploads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.
Traceback (most recent call last):
  File "/home/padeoe/.local/bin/huggingface-cli", line 8, in <module>
    sys.exit(main())
  File "/home/padeoe/.local/lib/python3.10/site-packages/huggingface_hub/commands/huggingface_cli.py", line 49, in main
    service.run()
  File "/home/padeoe/.local/lib/python3.10/site-packages/huggingface_hub/commands/upload.py", line 190, in run
    print(self._upload())
  File "/home/padeoe/.local/lib/python3.10/site-packages/huggingface_hub/commands/upload.py", line 251, in _upload
    repo_id = self.api.create_repo(
  File "/home/padeoe/.local/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/padeoe/.local/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 3370, in create_repo
    return RepoUrl(d["url"], endpoint=self.endpoint)
  File "/home/padeoe/.local/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 437, in __init__
    repo_type, namespace, repo_name = repo_type_and_id_from_hf_id(self, hub_url=self.endpoint)
  File "/home/padeoe/.local/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 227, in repo_type_and_id_from_hf_id
    raise ValueError(f"Unable to retrieve user and repo ID from the passed HF ID: {hf_id}")
ValueError: Unable to retrieve user and repo ID from the passed HF ID: https://huggingface.co/username/reponame

The command fails, and the error shows that it is still using https://huggingface.co rather than the https://hf-mirror.com set through HF_ENDPOINT.

Note that if HF_ENDPOINT is set to "https://huggingface.co", the problem does not occur.
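The traceback suggests that the repository URL returned by the server (on https://huggingface.co) is being parsed against the configured endpoint (https://hf-mirror.com). The sketch below only illustrates that kind of host mismatch. `parse_repo_url` and `strip_scheme` are hypothetical helpers written for this post, not the code in huggingface_hub.

```python
import re
from typing import Tuple


def strip_scheme(url: str) -> str:
    """Drop a leading http:// or https:// and any trailing slash."""
    return re.sub(r"^https?://", "", url).rstrip("/")


def parse_repo_url(url: str, endpoint: str) -> Tuple[str, str]:
    """Hypothetical, simplified parser: accept only URLs on the configured endpoint."""
    endpoint_host = strip_scheme(endpoint)
    host, _, path = strip_scheme(url).partition("/")
    parts = path.split("/")
    # A valid repo URL here is <endpoint host>/<namespace>/<name>.
    if host != endpoint_host or len(parts) != 2 or not all(parts):
        raise ValueError(
            f"Unable to retrieve user and repo ID from the passed HF ID: {url}"
        )
    return parts[0], parts[1]


if __name__ == "__main__":
    # Server URL and endpoint agree: parsing succeeds.
    print(parse_repo_url("https://huggingface.co/username/reponame",
                         "https://huggingface.co"))
    # Server answers with huggingface.co while the endpoint is the mirror.
    try:
        parse_repo_url("https://huggingface.co/username/reponame",
                       "https://hf-mirror.com")
    except ValueError as err:
        print(err)
```

In this sketch, parsing only works when the URL and the endpoint share a host, which is consistent with the observation that the official endpoint is unaffected.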

Temporary workaround for Problem 1:

Uploading from Python code with HfApi.upload_folder still works:

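import os

# Set HF_ENDPOINT before importing huggingface_hub,
# because the library reads it at import time.
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'

import huggingface_hub
from huggingface_hub import HfApi

huggingface_hub.login("hf_***")  # replace with your own access token

api = HfApi()
# Upload the whole local folder to the root of the remote repository.
api.upload_folder(
    repo_id="username/reponame",
    folder_path="/path/to/your/repo",
    path_in_repo="."
)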

Problem 2: unable to download certain datasets

GitHub Issue Link
This issue is triggered only when all three of the following conditions hold:

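  • datasets>2.15.0 and huggingface-hub>0.19.4. This is common: the latest versions of both trigger it.
  • The dataset ID contains no organization name, for example bookcorpus, gsm8k, or wikipedia, as opposed to a dataset ID of the form A/B.
  • HF_ENDPOINT is set and its value does not match the format (hub-ci.)?huggingface.co.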
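The three conditions above can be sketched as a small check. This is a minimal illustration, not library code: `is_affected` and `version_tuple` are hypothetical helpers, and the anchored regular expression is my assumption about how the (hub-ci.)?huggingface.co format is meant to be matched.

```python
import re
from typing import Optional, Tuple

# Assumption: an anchored form of the "(hub-ci.)?huggingface.co" pattern.
OFFICIAL_ENDPOINT = re.compile(r"^https?://(hub-ci\.)?huggingface\.co/?$")


def version_tuple(version: str) -> Tuple[int, ...]:
    """Turn '2.18.0' into (2, 18, 0); only the first three numbers are used."""
    return tuple(int(n) for n in re.findall(r"\d+", version)[:3])


def is_affected(
    datasets_version: str,
    hub_version: str,
    dataset_id: str,
    endpoint: Optional[str],
) -> bool:
    """Return True when all three trigger conditions hold."""
    # Condition 1: both libraries are newer than the last unaffected versions.
    new_libs = (
        version_tuple(datasets_version) > (2, 15, 0)
        and version_tuple(hub_version) > (0, 19, 4)
    )
    # Condition 2: the dataset ID has no organization prefix ("A/B" form).
    no_namespace = "/" not in dataset_id
    # Condition 3: HF_ENDPOINT is set to something other than the official hosts.
    custom_endpoint = bool(endpoint) and OFFICIAL_ENDPOINT.match(endpoint) is None
    return new_libs and no_namespace and custom_endpoint


if __name__ == "__main__":
    print(is_affected("2.18.0", "0.21.4", "bookcorpus", "https://hf-mirror.com"))
```

To check a real environment, the installed versions can be read with `importlib.metadata.version("datasets")` and `importlib.metadata.version("huggingface_hub")`, and the endpoint with `os.environ.get("HF_ENDPOINT")`.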

Steps to reproduce:

1. Install specific versions of the libraries (the latest at the time of writing)

pip install datasets==2.18.0
pip install huggingface_hub==0.21.4

2. Run the following Python code

import os
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'
from datasets import load_dataset
bookcorpus = load_dataset('bookcorpus', split='train')

The output is as follows:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/padeoe/.local/lib/python3.10/site-packages/datasets/load.py", line 2556, in load_dataset
    builder_instance = load_dataset_builder(
  File "/home/padeoe/.local/lib/python3.10/site-packages/datasets/load.py", line 2228, in load_dataset_builder
    dataset_module = dataset_module_factory(
  File "/home/padeoe/.local/lib/python3.10/site-packages/datasets/load.py", line 1879, in dataset_module_factory
    raise e1 from None
  File "/home/padeoe/.local/lib/python3.10/site-packages/datasets/load.py", line 1830, in dataset_module_factory
    with fs.open(f"datasets/{path}/{filename}", "r", encoding="utf-8") as f:
  File "/home/padeoe/.local/lib/python3.10/site-packages/fsspec/spec.py", line 1295, in open
    self.open(
  File "/home/padeoe/.local/lib/python3.10/site-packages/fsspec/spec.py", line 1307, in open
    f = self._open(
  File "/home/padeoe/.local/lib/python3.10/site-packages/huggingface_hub/hf_file_system.py", line 228, in _open
    return HfFileSystemFile(self, path, mode=mode, revision=revision, block_size=block_size, **kwargs)
  File "/home/padeoe/.local/lib/python3.10/site-packages/huggingface_hub/hf_file_system.py", line 615, in __init__
    self.resolved_path = fs.resolve_path(path, revision=revision)
  File "/home/padeoe/.local/lib/python3.10/site-packages/huggingface_hub/hf_file_system.py", line 180, in resolve_path
    repo_and_revision_exist, err = self._repo_and_revision_exist(repo_type, repo_id, revision)
  File "/home/padeoe/.local/lib/python3.10/site-packages/huggingface_hub/hf_file_system.py", line 117, in _repo_and_revision_exist
    self._api.repo_info(repo_id, revision=revision, repo_type=repo_type)
  File "/home/padeoe/.local/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/padeoe/.local/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 2413, in repo_info
    return method(
  File "/home/padeoe/.local/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/padeoe/.local/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 2286, in dataset_info
    hf_raise_for_status(r)
  File "/home/padeoe/.local/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 362, in hf_raise_for_status
    raise HfHubHTTPError(str(e), response=response) from e
huggingface_hub.utils._errors.HfHubHTTPError: 401 Client Error: Unauthorized for url: https://hf-mirror.com/api/datasets/bookcorpus/bookcorpus.py (Request ID: Root=1-65ee8659-5ab10eec5960c63e71f2bb58;b00bdbea-fd6e-4a74-8fe0-bc4682ae090e)

Temporary workaround for Problem 2

Downgrade both libraries to the versions below, or older:

pip install datasets==2.15.0

pip install huggingface-hub==0.19.4
