Decodes each string into a sequence of code points with start offsets.
tf.strings.unicode_decode_with_offsets(
input,
input_encoding,
errors='replace',
replacement_char=65533,
replace_control_characters=False,
name=None
)
This op is similar to tf.strings.unicode_decode(...), but it also returns the
start byte offset of each character within its source string. These offsets
can be used to align the decoded characters with the original byte sequence.
Returns a tuple (codepoints, start_offsets) where:
- codepoints[i1...iN, j] is the Unicode codepoint for the jth character in input[i1...iN], when decoded using input_encoding.
- start_offsets[i1...iN, j] is the start byte offset for the jth character in input[i1...iN], when decoded using input_encoding.
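To make the relationship between codepoints and start offsets concrete, here is a minimal pure-Python sketch for a single UTF-8 byte string. The helper name decode_with_offsets is hypothetical and this is not TensorFlow's implementation; it assumes the input is valid UTF-8 and only illustrates what the two returned sequences mean.

```python
def decode_with_offsets(data):
    """Return (codepoints, start_offsets) for one valid UTF-8 byte string.

    Illustrative only: a hypothetical helper, not the TensorFlow op.
    """
    codepoints = []
    start_offsets = []
    offset = 0
    for ch in data.decode("utf-8"):
        codepoints.append(ord(ch))
        start_offsets.append(offset)
        # Advance by the number of bytes this character occupies in UTF-8.
        offset += len(ch.encode("utf-8"))
    return codepoints, start_offsets


codepoints, start_offsets = decode_with_offsets("G\xf6\xf6dnight".encode("utf-8"))
print(codepoints)     # one entry per character
print(start_offsets)  # byte position where each character begins
```

Characters that need more than one byte (here the two U+00F6 characters) cause the following offsets to jump by more than one.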
Args |
|---|
input
An N-dimensional, potentially ragged string tensor with shape
[D1...DN]. N must be statically known.
input_encoding
String name for the Unicode encoding that should be used to decode each string.
errors
Specifies the response when an input string can't be converted using the indicated encoding. One of:
- 'strict': Raise an exception for any illegal substrings.
- 'replace': Replace illegal substrings with replacement_char.
- 'ignore': Skip illegal substrings.
replacement_char
The replacement codepoint to be used in place of invalid substrings in input when errors='replace', and in place of C0 control characters in input when replace_control_characters=True.
replace_control_characters
Whether to replace the C0 control characters (U+0000 - U+001F) with the replacement_char.
name
A name for the operation (optional).
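The three errors modes and the control-character option can be approximated in plain Python, whose built-in decoder happens to use the same handler names. The sketch below is an analogy under stated assumptions, not the TensorFlow op: decode_codepoints is a hypothetical helper, the exception type raised for 'strict' is Python's, and the number of replacement characters emitted per illegal substring may differ from TensorFlow's behavior.

```python
def decode_codepoints(data, errors="replace",
                      replace_control_characters=False,
                      replacement_char=65533):
    """Decode UTF-8 bytes to a list of codepoints (illustrative helper)."""
    # Python's 'replace' handler always substitutes U+FFFD (65533), which
    # matches the documented default value of replacement_char.
    text = data.decode("utf-8", errors=errors)
    codepoints = [ord(ch) for ch in text]
    if replace_control_characters:
        # C0 control characters are U+0000 through U+001F.
        codepoints = [replacement_char if cp <= 0x1F else cp
                      for cp in codepoints]
    return codepoints


bad = b"a\xffb"  # 0xFF is never valid in UTF-8
print(decode_codepoints(bad, errors="replace"))
print(decode_codepoints(bad, errors="ignore"))
try:
    decode_codepoints(bad, errors="strict")
except UnicodeDecodeError as exc:
    print("strict mode raised:", type(exc).__name__)
print(decode_codepoints(b"a\tb", replace_control_characters=True))
```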
Returns | |
|---|---|
A tuple of N+1 dimensional tensors (codepoints, start_offsets). |
- codepoints is an int32 tensor with shape [D1...DN, (num_chars)].
- start_offsets is an int64 tensor with shape [D1...DN, (num_chars)].
The returned tensors are tf.Tensors if input is a scalar, or
tf.RaggedTensors otherwise.
Example:
>>> input = [s.encode('utf8') for s in (u'G\xf6\xf6dnight', u'\U0001f60a')]
>>> result = tf.strings.unicode_decode_with_offsets(input, 'UTF-8')
>>> result[0].to_list()  # codepoints
[[71, 246, 246, 100, 110, 105, 103, 104, 116], [128522]]
>>> result[1].to_list()  # offsets
[[0, 1, 3, 5, 6, 7, 8, 9, 10], [0]]
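As a sketch of how the offsets align characters with the original bytes, the following pure-Python snippet takes the offsets shown above for the first string and slices the byte string into one span per character. The helper char_byte_spans is hypothetical and works on plain Python lists, so the offsets would first need to be converted (for example with to_list()).

```python
def char_byte_spans(data, start_offsets):
    """Split `data` into per-character byte spans (illustrative helper)."""
    # Each character ends where the next one starts; the last one ends
    # at the end of the byte string.
    ends = list(start_offsets[1:]) + [len(data)]
    return [data[start:end] for start, end in zip(start_offsets, ends)]


data = "G\xf6\xf6dnight".encode("utf-8")
offsets = [0, 1, 3, 5, 6, 7, 8, 9, 10]  # taken from the example output

spans = char_byte_spans(data, offsets)
chars = [span.decode("utf-8") for span in spans]
print(spans)
print(chars)
```

Each span decodes to exactly one character, so joining the decoded spans reproduces the original string.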