Problem:
The current packing/unpacking situation is confusing and complex when it comes to dealing with the different binary and string types that will be packed to, or unpacked from, the MessagePack "str format family" and "bin format family" data types. It is difficult to determine the correct combination of options for a satisfactory type mapping in all situations.
In addition, the current msgpack-python (now msgpack) implementations do not have a solution (in either direction) for dealing with data containers that contain both string and binary data types.
For example (on the unpacking/deserialization side), the following byte sequence defines a MessagePack array that contains two elements: a unicode snowman character in utf-8, and an arbitrary byte sequence of [0x00, 0x01, 0x02]:
data = b'\x92\xa3\xe2\x98\x83\xc4\x03\x00\x01\x02'
What possible combination of msgpack.unpackb kwargs can properly unpack this to a two element list containing a python string and a suitable binary type (like bytearray in Python 2, and bytes in Python 3)? Conversely, how would you generate such a MessagePack structure from python (aside from direct generation as above)?
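To make the byte layout concrete, here is a minimal pure-Python 3 sketch that decodes only the three formats used in this example (fixarray, fixstr, and bin 8) and maps them per element. The function name unpack_subset is made up for illustration; this is not how msgpack itself is implemented, just a demonstration of the result I would like to get.

```python
def unpack_subset(buf, pos=0):
    """Decode one MessagePack object from buf, starting at pos.

    Only fixarray, fixstr and bin 8 are handled (illustration only).
    Returns a (value, next_position) tuple.
    """
    lead = buf[pos]
    if lead & 0xF0 == 0x90:  # fixarray: 1001XXXX, low 4 bits = element count
        pos += 1
        items = []
        for _ in range(lead & 0x0F):
            item, pos = unpack_subset(buf, pos)
            items.append(item)
        return items, pos
    if lead & 0xE0 == 0xA0:  # fixstr: 101XXXXX, low 5 bits = byte length
        start = pos + 1
        size = lead & 0x1F
        return buf[start:start + size].decode("utf-8"), start + size
    if lead == 0xC4:  # bin 8: one length byte follows the lead byte
        start = pos + 2
        size = buf[pos + 1]
        return bytes(buf[start:start + size]), start + size
    raise ValueError("unsupported lead byte: 0x%02x" % lead)


data = b'\x92\xa3\xe2\x98\x83\xc4\x03\x00\x01\x02'
value, end = unpack_subset(data)
print(value)  # a two-element list: the snowman as str, the raw bytes as bytes
```

The point is that the wire format already carries enough information to choose the right Python type for each element; no global switch is needed.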
Proposal:
Rather than having a collection of effectively global switches that can be sent to packb/unpackb (for example: raw_as_bytes and use_bin_type), it would be better if there were a method for defining an explicit typemap that would be used at a per-element level, and which defined the type mappings to use for both packb (from python to the MessagePack protocol) and unpackb (from MessagePack to python).
For example, it would be great if msgpack.unpackb(data_bytes, typemap='ideal') produced the "ideal" behaviour I outline in the tables further below. With the typemap switch, packing/unpacking could then work per element, rather than having issues with mixed-type sequences like the global switches currently do. Possible typemap values could mirror the columns defined further below ('ideal', '0.4', '0.5', 'default'), or something similar, where 'default' would be the default value and would currently equate to typemap='0.5'. It would also be illegal (raising ValueError) to specify kwargs like raw_as_bytes together with a typemap specification. I think this proposal will resolve any potential compatibility issues.
This typemap kwarg behaviour should be bidirectional. Specifically, there should also be a similar possibility for msgpack.packb(data, typemap='ideal').
In addition (with the exception of python 2 str and bytearray ambiguity) it should always be the case that unpackb(packb(v)) == v. This is currently not true with available msgpack python versions.
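As a rough sketch of what I mean, the following pure-Python 3 toy implements a per-element typemap for a tiny subset of MessagePack (short lists, str, and bytes). Everything here is hypothetical: the TYPEMAPS table, the packb/unpackb functions defined below, and the typemap names are invented for this illustration and are not part of msgpack. The '0.5' entry simply mirrors the Python 3 columns of the tables further below.

```python
# Hypothetical typemaps: "pack" maps a Python type to a MessagePack family,
# "unpack" maps a MessagePack family back to a Python type.
TYPEMAPS = {
    "ideal": {
        "pack": {str: "mp_str", bytes: "mp_bin"},
        "unpack": {"mp_str": str, "mp_bin": bytes},
    },
    "0.5": {
        "pack": {str: "mp_str", bytes: "mp_str"},
        "unpack": {"mp_str": bytes, "mp_bin": bytes},
    },
}


def packb(obj, typemap="ideal"):
    """Pack lists (< 16 items) of str/bytes (payload < 32 bytes)."""
    if isinstance(obj, list):
        if len(obj) > 15:
            raise ValueError("list too long for this sketch")
        head = bytes([0x90 | len(obj)])  # fixarray
        return head + b"".join(packb(item, typemap) for item in obj)
    family = TYPEMAPS[typemap]["pack"][type(obj)]  # KeyError if unsupported
    payload = obj.encode("utf-8") if isinstance(obj, str) else bytes(obj)
    if len(payload) > 31:
        raise ValueError("payload too long for this sketch")
    if family == "mp_str":
        return bytes([0xA0 | len(payload)]) + payload  # fixstr
    return bytes([0xC4, len(payload)]) + payload  # bin 8


def _unpack(buf, pos, targets):
    lead = buf[pos]
    if lead & 0xF0 == 0x90:  # fixarray
        pos += 1
        items = []
        for _ in range(lead & 0x0F):
            item, pos = _unpack(buf, pos, targets)
            items.append(item)
        return items, pos
    if lead & 0xE0 == 0xA0:  # fixstr
        family, start, size = "mp_str", pos + 1, lead & 0x1F
    elif lead == 0xC4:  # bin 8
        family, start, size = "mp_bin", pos + 2, buf[pos + 1]
    else:
        raise ValueError("unsupported lead byte: 0x%02x" % lead)
    raw = bytes(buf[start:start + size])
    # The target type is chosen per element, from the family on the wire.
    value = raw.decode("utf-8") if targets[family] is str else raw
    return value, start + size


def unpackb(buf, typemap="ideal"):
    value, end = _unpack(buf, 0, TYPEMAPS[typemap]["unpack"])
    if end != len(buf):
        raise ValueError("extra data after the first object")
    return value


mixed = ["\u2603", b"\x00\x01\x02"]
print(unpackb(packb(mixed, "ideal"), "ideal") == mixed)  # round trip holds
print(unpackb(packb(mixed, "0.5"), "0.5") == mixed)      # round trip is lossy
```

With the 'ideal' map, the mixed list survives the round trip unchanged. With the '0.5' map both elements are written as mp_str and come back as bytes, so the original str is lost.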
Explicit Type Mapping Tables
The tables below cover the type mappings for the msgpack bin and str formats: the current behaviour of different msgpack versions, as well as my proposed "ideal" mapping.
Note that, in the tables below:
- PackType == mp_str refers to the "str format family", with a lead byte of 101XXXXX, 0xd9, 0xda, or 0xdb
- PackType == mp_bin refers to the "bin format family", with a lead byte of 0xc4, 0xc5, or 0xc6
- The "PackType (ideal)" column indicates what I personally think the accurate pack/unpack targets should be for each type.
- packb/unpackb results for versions "< 0.5.x" and ">= 0.5.x" are what you get with default values for all global kwarg switches like raw_as_bytes
- I have intentionally not referenced the types where the mapping is (in my opinion) extremely clear, for example:
  - msgpack nil format ⇔ None
  - msgpack bool format ⇔ bool
  - msgpack int format family ⇔ int
  - msgpack float format family ⇔ float
  - msgpack array format family ⇔ list
    - aside: I'd prefer an immutable tuple here, although map/dict targets have to be mutable so consistency is an issue
  - msgpack map format family ⇔ dict
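For reference, the lead-byte rules for the two families in the notes above can be written as a small helper. The function name format_family is made up for this sketch; the byte values come from the MessagePack format specification.

```python
def format_family(lead):
    """Classify a MessagePack lead byte as 'mp_str', 'mp_bin' or 'other'."""
    # fixstr is 101XXXXX; str 8 / str 16 / str 32 are 0xd9 / 0xda / 0xdb
    if lead >> 5 == 0b101 or lead in (0xD9, 0xDA, 0xDB):
        return "mp_str"
    # bin 8 / bin 16 / bin 32 are 0xc4 / 0xc5 / 0xc6
    if lead in (0xC4, 0xC5, 0xC6):
        return "mp_bin"
    return "other"


for lead in (0xA3, 0xD9, 0xC4, 0x92):
    print(hex(lead), format_family(lead))
```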
Python 2 packing/serialization behaviour
| Python2 type | PackType (ideal) | PackType (<0.5.x) | PackType (>=0.5.x) | Comment |
|---|---|---|---|---|
| str | mp_bin | mp_str | mp_str | str is really bytes in python 2 |
| unicode | mp_str | mp_str | mp_str | Should always encode to utf8 |
| bytearray | mp_bin | ERROR * | mp_str | If this doesn't go to mp_bin, what does? See notes below on the ERROR case (which was good) |
* In the case of the ERROR for msgpack < 0.5.x, this was actually extremely useful, since it resulted in the default callback being invoked, where you could specifically manage bytearray (since msgpack-python itself did not). Now that 0.5.x silently encodes bytearray to mp_str, this is no longer possible. This is actually the problem that triggered me to raise this issue (after an update to 0.5.x broke a build).
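The pattern I was relying on looks roughly like the toy below. This is not msgpack code: toy_pack and bytearray_default are hypothetical stand-ins, written for Python 3, that only show how a default hook gets a chance to convert a type the packer does not handle itself.

```python
def toy_pack(obj, default=None):
    """Toy stand-in for a packer: returns a (family, payload) tuple."""
    if isinstance(obj, str):
        return ("mp_str", obj.encode("utf-8"))
    if isinstance(obj, bytes):
        return ("mp_bin", obj)
    if default is not None:
        # Unsupported type: let the caller convert it, then pack the result.
        # The hook is not passed on, so a bad hook cannot loop forever.
        return toy_pack(default(obj))
    raise TypeError("can't serialize %r" % (obj,))


def bytearray_default(obj):
    """Hypothetical hook that routes bytearray to the binary family."""
    if isinstance(obj, bytearray):
        return bytes(obj)
    raise TypeError("can't serialize %r" % (obj,))


print(toy_pack(bytearray(b"\x00\x01\x02"), default=bytearray_default))
```

If the packer instead accepts bytearray silently and writes it as mp_str, the hook never runs and the caller loses the ability to choose mp_bin.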
Python 2 unpacking/deserialization behaviour
| Msgpack type | Python2 type (ideal) | Python2 type (<0.5.x) | Python2 type (>=0.5.x) | Comment |
|---|---|---|---|---|
| mp_str | unicode | str | str | Should always decode assuming utf8 |
| mp_bin | str | str | str | bytearray would actually be a more literal/ideal unpack target, but is unfamiliar to most (and is mutable) |
Python 3 packing/serialization behaviour
| Python3 type | PackType (ideal) | PackType (<0.5.x) | PackType (>=0.5.x) | Comment |
|---|---|---|---|---|
| str | mp_str | mp_str | mp_str | Always encode with utf8 |
| bytes | mp_bin | mp_str | mp_str | |
Python 3 unpacking/deserialization behaviour
| Msgpack type | Python3 type (ideal) | Python3 type (<0.5.x) | Python3 type (>=0.5.x) | Comment |
|---|---|---|---|---|
| mp_str | str | bytes | bytes | The default conversion to bytes is particularly confusing |
| mp_bin | bytes | bytes | bytes | |
References
There are some other issues related to this that are worth referencing here:
- Backward incompatible API change toward 1.0 #191 -- regarding backwards compatibility issues moving towards 1.0
- unpack should decode string types by default #99 -- an old issue about properly decoding string types
- can't serialize bytearray #224 -- an issue specifically about serializing bytearray
- msgpack #121 -- an old/lengthy issue (now closed) about differentiating between raw binary data and strings
I decided to open a new issue, since this is a general proposal that A) explicitly clarifies type mapping in both directions between msgpack and python, and B) explicitly covers both the bin and str msgpack formats (in msgpack terminology) as well as both python str/bytes cases.