Hi team!
I've been looking into the initialization process of the DatasetBuilder and noticed that the ETag/origin metadata caching path is executed unconditionally.
While fetching the origin_metadata and calculating the ETag is necessary for safely managing a local disk cache when using as_dataset(), it seems this might be redundant for streaming. When a user requests streaming=True, the pipeline drops into as_streaming_dataset(). Because streaming does not use a local cache, it appears the streaming pipeline ignores the generated _cache_dir string and never calls _get_dataset_fingerprint(). As a result, the origin metadata is fetched and hashed without being used.
I wanted to start a discussion to understand whether the ETag/origin_metadata serves another purpose when streaming. If not, would it be possible and desirable to skip this fetch when streaming=True, to reduce initialization latency and avoid unnecessary network calls?
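To make the idea concrete, here is a rough sketch of one way the fetch could be deferred until something actually needs the fingerprint. This is a toy stand-in, not the real DatasetBuilder: the class name LazyMetadataBuilder, the fetch_origin_metadata callable, the _get_fingerprint helper, and the returned strings are all hypothetical and only illustrate the control flow.

```python
import hashlib
from typing import Callable, List, Optional


class LazyMetadataBuilder:
    """Toy stand-in for a builder (hypothetical, not the real DatasetBuilder)."""

    def __init__(self, fetch_origin_metadata: Callable[[], List[str]]):
        # Keep the fetcher around instead of calling it eagerly in __init__.
        self._fetch_origin_metadata = fetch_origin_metadata
        self._fingerprint: Optional[str] = None
        self.fetch_count = 0  # only here to make the behavior observable

    def _get_fingerprint(self) -> str:
        # Fetch and hash on first use, then reuse the stored value.
        if self._fingerprint is None:
            self.fetch_count += 1
            metadata = self._fetch_origin_metadata()
            digest = hashlib.sha256("".join(metadata).encode("utf-8"))
            self._fingerprint = digest.hexdigest()
        return self._fingerprint

    def as_dataset(self) -> str:
        # Cached path: the fingerprint identifies the local cache entry.
        return f"cache/{self._get_fingerprint()[:16]}"

    def as_streaming_dataset(self) -> str:
        # Streaming path: no local cache, so the metadata is never fetched.
        return "stream"
```

With this shape, calling as_streaming_dataset() never triggers the network fetch, while as_dataset() pays for it once on first use. Whether the real builder can be restructured this way depends on whether anything else reads the ETag during initialization, which is exactly what I'm hoping to find out.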
Thanks!