Describe the bug
Single-machine training works fine, but multi-machine training fails while loading the dataset: load_dataset raises an IndexError inside the webdataset builder (see the traceback below). Help me!
Steps to reproduce the bug
from datasets import load_dataset
train_dataset = load_dataset(
    "webdataset", data_files=args.train_dataset, split="train", streaming=True, cache_dir="/dev/shm/.cache"
)
args.train_dataset is a list of tar file paths.
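Expected behavior
The dataset should load on every rank, as it does on a single machine. Instead, the multi-machine run fails with the following traceback (from rank 9):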
[rank9]: train_dataset = load_dataset(
[rank9]: File "/mnt/aihao/miniconda3/envs/pixeldit/lib/python3.10/site-packages/datasets/load.py", line 1705, in load_dataset
[rank9]: return builder_instance.as_streaming_dataset(split=split)
[rank9]: File "/mnt/aihao/miniconda3/envs/pixeldit/lib/python3.10/site-packages/datasets/builder.py", line 1110, in as_streaming_dataset
[rank9]: splits_generators = {sg.name: sg for sg in self._split_generators(dl_manager)}
[rank9]: File "/mnt/aihao/miniconda3/envs/pixeldit/lib/python3.10/site-packages/datasets/packaged_modules/webdataset/webdataset.py", line 79, in _split_generators
[rank9]: pipeline = self._get_pipeline_from_tar(tar_paths[0], tar_iterators[0])
[rank9]: IndexError: list index out of range
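The failing line indexes tar_paths[0], which suggests that the list of tar files reaching the builder was empty on that rank. One possible cause, not confirmed here, is that the shard paths are not visible on every node. Below is a minimal diagnostic sketch, assuming the entries of args.train_dataset are local filesystem paths (not URLs) and that the launcher sets the RANK environment variable, as torchrun does. The helper names find_missing_files and check_shards are hypothetical and not part of datasets.

```python
import os
from typing import Iterable, List


def find_missing_files(paths: Iterable[str]) -> List[str]:
    """Return the paths that are not regular files visible from this node."""
    return [p for p in paths if not os.path.isfile(p)]


def check_shards(paths: Iterable[str], rank: int) -> List[str]:
    """Fail early, with the rank in the message, if the shard list is unusable.

    Hypothetical helper: assumes every entry is a local path, not a URL.
    """
    shard_list = list(paths)
    if not shard_list:
        raise ValueError(f"rank {rank}: shard list is empty")
    missing = find_missing_files(shard_list)
    if missing:
        raise FileNotFoundError(
            f"rank {rank}: {len(missing)} of {len(shard_list)} shards not found, "
            f"first missing: {missing[0]}"
        )
    return shard_list


# Intended use, before calling load_dataset (RANK is an assumption about the launcher):
#   rank = int(os.environ.get("RANK", "0"))
#   shards = check_shards(args.train_dataset, rank)
```

Running this on every rank before load_dataset would show whether the failing ranks see an empty or partially missing shard list, which narrows the problem to path visibility rather than the loader itself.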
Environment info
- datasets version: 4.8.4
- Platform: Linux-5.14.0-284.25.1.el9_2.x86_64-x86_64-with-glibc2.35
- Python version: 3.10.20
- huggingface_hub version: 0.36.2
- PyArrow version: 23.0.1
- Pandas version: 2.3.3
- fsspec version: 2026.2.0