
Feature Request / Discussion: Skip origin_metadata fetch when streaming=True? #8197

@yuxin00j


Hi team!

I've been looking into the initialization process of the DatasetBuilder and noticed that the ETag/origin metadata caching path is executed unconditionally.

Fetching the origin_metadata and computing the ETag is necessary to safely manage the local disk cache used by as_dataset(), but it seems redundant for streaming. When a user requests streaming=True, the pipeline goes through as_streaming_dataset() instead. Because streaming does not use a local cache, the streaming pipeline appears to ignore the generated _cache_dir string and never calls _get_dataset_fingerprint(). As a result, the origin metadata is fetched and hashed but never used.

I wanted to start a discussion to understand if there is another purpose for the etag/origin_metadata when streaming. If not, would it be possible/desirable to skip this fetch when streaming=True to improve initialization latency and save unnecessary network calls?
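
To make the idea concrete, here is a minimal sketch of one possible approach: compute the origin metadata lazily, so that only the code path that needs it pays for the fetch. This is a toy class, not the actual DatasetBuilder code. `ToyBuilder`, `fetch_etag`, and `fake_fetch_etag` are hypothetical names made up for illustration, and the real builder may need a different mechanism.

```python
from functools import cached_property
from typing import Callable, List, Tuple


class ToyBuilder:
    """Toy stand-in for a dataset builder; NOT the real DatasetBuilder."""

    def __init__(self, data_files: List[str], fetch_etag: Callable[[str], str]):
        self.data_files = data_files
        # Hypothetical callable: assumed to cost one network call per file.
        self._fetch_etag = fetch_etag

    @cached_property
    def origin_metadata(self) -> List[Tuple[str, str]]:
        # Computed on first access only, then cached on the instance.
        return [(f, self._fetch_etag(f)) for f in self.data_files]

    def as_dataset(self) -> str:
        # Local-cache path: needs the ETags to key the cache directory.
        return "cache-" + "-".join(etag for _, etag in self.origin_metadata)

    def as_streaming_dataset(self) -> List[str]:
        # Streaming path: never touches origin_metadata, so nothing is fetched.
        return list(self.data_files)


calls: List[str] = []


def fake_fetch_etag(path: str) -> str:
    # Records each simulated network call so the count can be inspected.
    calls.append(path)
    return f"etag({path})"


builder = ToyBuilder(["a.parquet", "b.parquet"], fake_fetch_etag)

builder.as_streaming_dataset()
streaming_calls = len(calls)  # fetches triggered by the streaming path

builder.as_dataset()
cached_calls = len(calls)  # fetches triggered once the cache path runs
```

In this toy version the streaming path triggers no fetches at all, while the cached path fetches once per file and reuses the result afterwards. Whether the real builder can defer the fetch this way depends on whether anything else reads the ETag during initialization, which is exactly what I'd like to clarify.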

Thanks!
