Encountered an error when loading data in WebDataset format using load_dataset during multi-machine training. #8201

@aihao2000

Describe the bug

Single-machine training works fine, but multi-machine training fails in various ways. The failure captured below is an IndexError raised inside load_dataset. Any help is appreciated.

Steps to reproduce the bug

from datasets import load_dataset

train_dataset = load_dataset(
    "webdataset", data_files=args.train_dataset, split="train", streaming=True, cache_dir="/dev/shm/.cache"
)

args.train_dataset is a list of tar files.

Expected behavior

[rank9]:     train_dataset = load_dataset(
[rank9]:   File "/mnt/aihao/miniconda3/envs/pixeldit/lib/python3.10/site-packages/datasets/load.py", line 1705, in load_dataset
[rank9]:     return builder_instance.as_streaming_dataset(split=split)
[rank9]:   File "/mnt/aihao/miniconda3/envs/pixeldit/lib/python3.10/site-packages/datasets/builder.py", line 1110, in as_streaming_dataset
[rank9]:     splits_generators = {sg.name: sg for sg in self._split_generators(dl_manager)}
[rank9]:   File "/mnt/aihao/miniconda3/envs/pixeldit/lib/python3.10/site-packages/datasets/packaged_modules/webdataset/webdataset.py", line 79, in _split_generators
[rank9]:     pipeline = self._get_pipeline_from_tar(tar_paths[0], tar_iterators[0])
[rank9]: IndexError: list index out of range
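
The IndexError on tar_paths[0] suggests that the tar list resolved to an empty list on rank 9. One way to narrow this down is to check, on every rank, which of the listed tar files that process can actually see before calling load_dataset. The sketch below is only a diagnostic aid, not a fix. It assumes the entries are local file paths rather than URLs and that the job is launched with torchrun, which sets the RANK environment variable. The helper names split_existing and describe_rank_view are hypothetical.

```python
import os
from pathlib import Path


def split_existing(paths):
    """Partition paths into (existing, missing) as seen by this process."""
    existing, missing = [], []
    for p in paths:
        # is_file() is evaluated on the local node, so results can differ per machine
        (existing if Path(p).is_file() else missing).append(str(p))
    return existing, missing


def describe_rank_view(paths):
    """Build a one-line report of what this rank can see."""
    existing, missing = split_existing(paths)
    # RANK is set by torchrun; "?" is used if the job was launched differently
    rank = os.environ.get("RANK", "?")
    return (
        f"[rank{rank}] {len(existing)} readable, "
        f"{len(missing)} missing of {len(paths)} tar files"
    )
```

Calling print(describe_rank_view(args.train_dataset)) just before load_dataset on every rank would show whether some machines see no tar files at all, for example because the data directory is not mounted on them.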

Environment info


  • datasets version: 4.8.4
  • Platform: Linux-5.14.0-284.25.1.el9_2.x86_64-x86_64-with-glibc2.35
  • Python version: 3.10.20
  • huggingface_hub version: 0.36.2
  • PyArrow version: 23.0.1
  • Pandas version: 2.3.3
  • fsspec version: 2026.2.0
