Reads CSV files into a dataset.
tf.compat.v1.data.experimental.make_csv_dataset(
file_pattern,
batch_size,
column_names=None,
column_defaults=None,
label_name=None,
select_columns=None,
field_delim=',',
use_quote_delim=True,
na_value='',
header=True,
num_epochs=None,
shuffle=True,
shuffle_buffer_size=10000,
shuffle_seed=None,
prefetch_buffer_size=None,
num_parallel_reads=None,
sloppy=False,
num_rows_for_inference=100,
compression_type=None,
ignore_errors=False,
encoding='utf-8'
)
Reads CSV files into a dataset, where each element of the dataset is a
(features, labels) tuple that corresponds to a batch of CSV rows. The features
dictionary maps feature column names to Tensors containing the corresponding
feature data, and labels is a Tensor containing the batch's label data.
By default, the first rows of the CSV files are expected to be headers listing
the column names. If the first rows are not headers, set header=False and
provide the column names with the column_names argument.
By default, the dataset is repeated indefinitely, reshuffling the order each
time. This behavior can be modified by setting the num_epochs and shuffle
arguments.
For example, suppose you have a CSV file containing:
| Feature_A | Feature_B |
|---|---|
| 1 | "a" |
| 2 | "b" |
| 3 | "c" |
| 4 | "d" |
import tensorflow as tf
# `filename` is the path to the CSV file shown above.
# No label column specified
dataset = tf.data.experimental.make_csv_dataset(filename, batch_size=2)
iterator = dataset.as_numpy_iterator()
print(dict(next(iterator)))
# prints a dictionary of batched features:
# OrderedDict([('Feature_A', array([1, 4], dtype=int32)),
# ('Feature_B', array([b'a', b'd'], dtype=object))])
# Set Feature_B as label column
dataset = tf.data.experimental.make_csv_dataset(
filename, batch_size=2, label_name="Feature_B")
iterator = dataset.as_numpy_iterator()
print(next(iterator))
# prints (features, labels) tuple:
# (OrderedDict([('Feature_A', array([1, 2], dtype=int32))]),
# array([b'a', b'b'], dtype=object))
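The header, num_epochs, and shuffle arguments described above can be combined as in the following minimal sketch. It assumes TensorFlow 2.x with eager execution; the temporary file and its contents are hypothetical, written here only so the snippet is self-contained.

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical headerless CSV file, created only for this illustration.
csv_path = os.path.join(tempfile.mkdtemp(), "no_header.csv")
with open(csv_path, "w") as f:
    f.write("1,a\n2,b\n3,c\n4,d\n")

dataset = tf.data.experimental.make_csv_dataset(
    csv_path,
    batch_size=2,
    column_names=["Feature_A", "Feature_B"],  # required because header=False
    header=False,   # the first row is data, not column names
    num_epochs=1,   # read the file once instead of repeating forever
    shuffle=False,  # keep rows in file order
)

# Convert each batch to plain Python lists for easy inspection.
batches = [
    {name: values.tolist() for name, values in batch.items()}
    for batch in dataset.as_numpy_iterator()
]
print(batches)
```

With num_epochs=1 and shuffle=False the iterator is finite and yields the rows in file order, which is convenient for evaluation or debugging.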
See the Load CSV data guide for more examples of using make_csv_dataset to read CSV data.
Args |
|---|
file_pattern
List of files or patterns of file paths containing CSV records. See
tf.io.gfile.glob for pattern rules.
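batch_size
An int representing the number of records to combine in a single batch.
column_names
An optional list of strings that corresponds to the CSV columns, in order,
one per column of the input record. If this is not provided, the column
names are inferred from the first row of the records. These names are the
keys of the features dictionary of each dataset element.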
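column_defaults
An optional list of default values for the CSV fields, one item per selected
column of the input record. Each item in the list is either a valid CSV dtype
(float32, float64, int32, int64, or string), or a
Tensor with one of the aforementioned types. The tensor can either be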
a scalar default value (if the column is optional), or an empty tensor (if
the column is required). If a dtype is provided instead of a tensor, the
column is also treated as required. If this list is not provided, tries
to infer types based on reading the first num_rows_for_inference rows of
files specified, and assumes all columns are optional, defaulting to 0
for numeric values and "" for string values. If both this and
select_columns are specified, these must have the same lengths, and
column_defaults is assumed to be sorted in order of increasing column
index.
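label_name
An optional string corresponding to the label column. If provided, the data
for this column is returned as a separate Tensor from
the features dictionary.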
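select_columns
An optional list of integer indices or string column names that specifies a
subset of columns of CSV data to select. If column names are provided, these
must correspond to names provided in
column_names or inferred from the file header lines. When this argument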
is specified, only a subset of CSV columns will be parsed and returned,
corresponding to the columns specified. Using this results in faster
parsing and lower memory usage. If both this and column_defaults are
specified, these must have the same lengths, and column_defaults is
assumed to be sorted in order of increasing column index.
field_delim
An optional string. Defaults to ",". The character delimiter that
separates fields in a record.
use_quote_delim
An optional bool. Defaults to True. If False, treats
double quotation marks as regular characters inside of the string fields.
na_value
Additional string to recognize as NA/NaN.
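header
A bool that indicates whether the first rows of the provided CSV files are
header lines with column names, which should not be included in the data.
num_epochs
An int specifying the number of times this dataset is repeated. If None,
cycles through the dataset forever.
shuffle
A bool that indicates whether the input should be shuffled.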
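shuffle_buffer_size
Buffer size to use for shuffling. A large buffer size ensures better
shuffling, but increases memory usage and startup time.
shuffle_seed
Randomization seed to use for shuffling.
prefetch_buffer_size
An int specifying the number of feature batches to prefetch for performance
improvement. Recommended value is the number of batches consumed per training
step. Defaults to auto-tune.
num_parallel_reads
Number of threads used to read CSV records from files. If greater than 1, the
results will be interleaved. Defaults to 1.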
sloppy
If True, reading performance will be improved at
the cost of non-deterministic ordering. If False, the order of elements
produced is deterministic prior to shuffling (elements are still
randomized if shuffle=True; if the seed is set, the order of elements
after shuffling is also deterministic). Defaults to False.
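num_rows_for_inference
Number of rows of a file to use for type inference if column_defaults is not
provided. If None, reads all the rows of all the files. Defaults to 100.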
compression_type
(Optional.) A tf.string scalar evaluating to one of
"" (no compression), "ZLIB", or "GZIP". Defaults to no compression.
ignore_errors
(Optional.) If True, ignores errors with CSV file parsing,
such as malformed data or empty lines, and moves on to the next valid
CSV record. Otherwise, the dataset raises an error and stops processing
when encountering any invalid records. Defaults to False.
encoding
Encoding to use when reading. Defaults to UTF-8.
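The type inference described under column_defaults and num_rows_for_inference can be pictured with the plain-Python sketch below. It is a simplified illustration, not TensorFlow's implementation: the helper names `classify` and `infer_column_defaults` are hypothetical, and the sketch distinguishes only int, float, and string, whereas the real function selects among the concrete CSV dtypes listed above.

```python
import csv
import io


def classify(value):
    """Return the narrowest of 'int', 'float', 'string' that can hold value."""
    try:
        int(value)
        return "int"
    except ValueError:
        pass
    try:
        float(value)
        return "float"
    except ValueError:
        return "string"


def infer_column_defaults(csv_text, num_rows_for_inference=100, na_value=""):
    """Guess a kind and an 'optional column' default for every column."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rank = {"int": 0, "float": 1, "string": 2}
    kinds = ["int"] * len(header)  # start narrow, widen as rows are read
    for row_index, row in enumerate(reader):
        if row_index >= num_rows_for_inference:
            break  # only the first rows take part in inference
        for col, value in enumerate(row):
            if value == na_value:
                continue  # missing values carry no type information
            kind = classify(value)
            if rank[kind] > rank[kinds[col]]:
                kinds[col] = kind
    # All inferred columns are optional: 0 for numeric, "" for string.
    defaults = {"int": 0, "float": 0.0, "string": ""}
    return {name: (kind, defaults[kind]) for name, kind in zip(header, kinds)}


sample = 'Feature_A,Feature_B,Feature_C\n1,"a",0.5\n2,"b",\n3,"c",2\n'
inferred = infer_column_defaults(sample)
print(inferred)
```

Because only the first num_rows_for_inference rows are inspected, a column whose later rows hold a wider type can be misclassified; passing column_defaults explicitly avoids relying on inference.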
Returns |
|---|
A dataset, where each element is a (features, labels) tuple that corresponds
to a batch of batch_size CSV rows. The features dictionary maps feature
column names to Tensors containing the corresponding column data, and
labels is a Tensor containing the column data for the label column
specified by label_name. If label_name is not specified, each element is
the features dictionary alone.
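To make the returned structure concrete, here is a minimal sketch that combines select_columns with label_name. It assumes TensorFlow 2.x with eager execution; the three-column file and the column name `Target` are hypothetical and exist only for this illustration.

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical three-column CSV file, created only for this illustration.
csv_path = os.path.join(tempfile.mkdtemp(), "example.csv")
with open(csv_path, "w") as f:
    f.write("Feature_A,Feature_B,Target\n1,a,0.5\n2,b,1.5\n3,c,2.5\n4,d,3.5\n")

dataset = tf.data.experimental.make_csv_dataset(
    csv_path,
    batch_size=2,
    select_columns=["Feature_A", "Target"],  # Feature_B is never parsed
    label_name="Target",                     # split off as the labels tensor
    num_epochs=1,
    shuffle=False,
)

# Each element is a (features, labels) tuple because label_name is set.
features, labels = next(dataset.as_numpy_iterator())
feature_names = list(features.keys())
first_feature_batch = features["Feature_A"].tolist()
first_label_batch = labels.tolist()
print(feature_names, first_feature_batch, first_label_batch)
```

The label column must be among the selected columns; the unselected column does not appear in the features dictionary.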
Raises |
|---|
ValueError
If any of the arguments is malformed.