Unclear type mapping with both packb/unpackb, especially with mixed type sequences #281

@rwarren

Problem:

The current packing/unpacking situation is confusing and complex when it comes to the different binary and string types that are packed to, or unpacked from, the MessagePack "str format family" and "bin format family" data types. It is difficult to determine the correct combination of options for a satisfactory type mapping in all situations.

In addition, the current msgpack-python (now msgpack) implementations do not have a solution (in either direction) for dealing with data containers that contain both string and binary data types.

For example (on the unpacking/deserialization side), the following byte sequence defines a MessagePack array that contains two elements: a unicode snowman character in utf-8, and an arbitrary byte sequence of [0x00, 0x01, 0x02]:

data = b'\x92\xa3\xe2\x98\x83\xc4\x03\x00\x01\x02'

What possible combination of msgpack.unpackb kwargs can properly unpack this to a two element list containing a python string and a suitable binary type (like bytearray in Python 2, and bytes in Python 3)? Conversely, how would you generate such a MessagePack structure from python (aside from direct generation as above)?
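To make the desired result concrete, here is a minimal hand-written sketch of per-element decoding for the byte sequence above. It is not msgpack's implementation: `unpack_element` is a hypothetical helper, it assumes Python 3, and it only handles the three formats that appear in the example (fixarray, fixstr, and bin 8).

```python
# Hypothetical helper, not part of msgpack. Python 3 only.
# Supports just fixarray, fixstr, and bin 8, enough for the example bytes.
def unpack_element(buf, pos=0):
    """Return (value, next_pos) for the element starting at buf[pos]."""
    lead = buf[pos]
    if 0x90 <= lead <= 0x9f:          # fixarray: 1001XXXX, low nibble is the length
        items = []
        pos += 1
        for _ in range(lead & 0x0f):
            item, pos = unpack_element(buf, pos)
            items.append(item)
        return items, pos
    if 0xa0 <= lead <= 0xbf:          # fixstr: 101XXXXX, low 5 bits are the length
        n = lead & 0x1f
        start = pos + 1
        return buf[start:start + n].decode("utf-8"), start + n
    if lead == 0xc4:                  # bin 8: the next byte is the length
        n = buf[pos + 1]
        start = pos + 2
        return bytes(buf[start:start + n]), start + n
    raise ValueError("unsupported lead byte: 0x%02x" % lead)


data = b'\x92\xa3\xe2\x98\x83\xc4\x03\x00\x01\x02'
value, end = unpack_element(data)
print(value)
```

Because the target type is chosen per element from the lead byte, the str element becomes a Python 3 `str` and the bin element becomes `bytes` within the same list.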

Proposal:

Rather than having a collection of effectively global switches that can be passed to packb/unpackb (for example: raw_as_bytes and use_bin_type), it would be better to have a way of defining an explicit typemap that is applied at a per-element level, and that defines the type mappings to use for both packb (from python to the MessagePack protocol) and unpackb (from MessagePack to python).

For example, it would be great if msgpack.unpackb(data_bytes, typemap='ideal') produced the "ideal" behaviour I outline in the tables further below. With the typemap switch, packing/unpacking could then work per element, rather than having the issues with mixed-type sequences that the global switches currently have. Possible typemap values could mirror the columns defined further below: ('ideal', '0.4', '0.5', 'default'), or somesuch, where 'default' would be the default value, and would currently equate to typemap='0.5'. It would also be illegal (raising ValueError) to specify kwargs like raw_as_bytes together with a typemap specification. I think this proposal will resolve any potential compatibility issues.
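The argument handling described above could look roughly like the following. This is only a sketch of the proposal: `resolve_typemap`, `TYPEMAPS`, and `LEGACY_SWITCHES` are hypothetical names, and nothing here exists in msgpack today.

```python
# Hypothetical sketch of the proposed kwarg rules; not existing msgpack API.
TYPEMAPS = ("ideal", "0.4", "0.5", "default")
LEGACY_SWITCHES = ("raw_as_bytes", "use_bin_type")


def resolve_typemap(typemap=None, **kwargs):
    """Return the effective typemap name, or None if legacy switches apply."""
    legacy = sorted(k for k in kwargs if k in LEGACY_SWITCHES)
    if typemap is None:
        # No typemap given: fall back to the legacy switches if any were
        # passed, otherwise use the default mapping.
        if legacy:
            return None
        typemap = "default"
    elif legacy:
        # Mixing an explicit typemap with the global switches is illegal.
        raise ValueError(
            "typemap cannot be combined with: " + ", ".join(legacy))
    if typemap not in TYPEMAPS:
        raise ValueError("unknown typemap: %r" % (typemap,))
    # 'default' currently equates to the 0.5 behaviour.
    return "0.5" if typemap == "default" else typemap
```

Using `None` as the "not specified" value is one way to tell an explicit typemap apart from the default, so that existing calls that only pass the legacy switches keep working.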

This typemap kwarg behaviour should be bidirectional. Specifically, there should also be a similar possibility for msgpack.packb(data, typemap='ideal').
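For the packing direction, a per-element mapping under the "ideal" Python 3 rules (str to mp_str, bytes to mp_bin) can be sketched by hand as follows. `pack_ideal` is a hypothetical helper rather than msgpack's implementation, and it is deliberately limited to short values (fixstr/str 8, bin 8, fixarray).

```python
# Hypothetical helper, not part of msgpack. Python 3 only.
# Covers short str, bytes, and list values; everything else is rejected.
def pack_ideal(obj):
    if isinstance(obj, str):
        raw = obj.encode("utf-8")                  # str always goes to mp_str as utf-8
        if len(raw) < 32:
            return bytes([0xa0 | len(raw)]) + raw  # fixstr
        if len(raw) < 256:
            return bytes([0xd9, len(raw)]) + raw   # str 8
        raise ValueError("str too long for this sketch")
    if isinstance(obj, bytes):
        if len(obj) < 256:
            return bytes([0xc4, len(obj)]) + obj   # bin 8
        raise ValueError("bytes too long for this sketch")
    if isinstance(obj, list):
        if len(obj) < 16:
            body = b"".join(pack_ideal(item) for item in obj)
            return bytes([0x90 | len(obj)]) + body  # fixarray
        raise ValueError("list too long for this sketch")
    raise TypeError("unsupported type: %s" % type(obj).__name__)


packed = pack_ideal(['\u2603', b'\x00\x01\x02'])
print(packed)
```

Packing the snowman string together with the three raw bytes yields the same byte sequence as the hand-written `data` example earlier in this issue, which is the mixed-type case the global switches cannot express.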

In addition (with the exception of python 2 str and bytearray ambiguity) it should always be the case that unpackb(packb(v)) == v. This is currently not true with available msgpack python versions.

Explicit Type Mapping Tables

The tables below cover the type mapping for msgpack bin and str: the current behaviour for different msgpack versions, as well as my proposed "ideal" mapping.

Note that, in the tables below:

  • PackType == mp_str refers to the "str format family", with lead byte in (101XXXXX, 0xd9, 0xda, 0xdb)
  • PackType == mp_bin refers to the "bin format family", with lead byte in (0xc4, 0xc5, 0xc6)
  • The "PackType (ideal)" column indicates what I personally think the accurate pack/unpack targets should be for each type.
  • packb/unpackb results for versions "< 0.5.x" and ">= 0.5.x" are what you get with default values for all global kwarg switches like raw_as_bytes
  • I have intentionally not referenced the types where the mapping is (in my opinion) extremely clear, for example:
    • msgpack nil format <-> None
    • msgpack bool format <-> bool
    • msgpack int format family <-> int
    • msgpack float format family <-> float
    • msgpack array format family <-> list
      • aside: I'd prefer an immutable tuple here, although map/dict targets have to be mutable so consistency is an issue
    • msgpack map format family <-> dict
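The two format families named in the notes above can be told apart from the lead byte alone. A small sketch, where `format_family` is a hypothetical helper and the "mp_str"/"mp_bin" labels are just the shorthand used in this issue:

```python
# Hypothetical helper: classify a MessagePack lead byte using the
# shorthand from this issue ("mp_str" / "mp_bin").
def format_family(lead):
    # str family: fixstr (101XXXXX, i.e. 0xa0-0xbf), str 8/16/32
    if 0xa0 <= lead <= 0xbf or lead in (0xd9, 0xda, 0xdb):
        return "mp_str"
    # bin family: bin 8/16/32
    if lead in (0xc4, 0xc5, 0xc6):
        return "mp_bin"
    return None  # some other MessagePack format


print(format_family(0xa3), format_family(0xc4), format_family(0x92))
```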

Python 2 packing/serialization behaviour

| Python2 type | PackType (ideal) | PackType (<0.5.x) | PackType (>=0.5.x) | Comment |
|---|---|---|---|---|
| str | mp_bin | mp_str | mp_str | str is really bytes in python 2 |
| unicode | mp_str | mp_str | mp_str | Should always encode to utf8 |
| bytearray | mp_bin | ERROR* | mp_str | If this doesn't go to mp_bin, what does? See notes below on the ERROR case (which was good) |

* In the case of the ERROR for msgpack < 0.5.x, this was actually extremely useful, since it resulted in the default callback being invoked, where you could specifically manage bytearray (since msgpack-python itself did not). Now that 0.5.x silently encodes bytearray to mp_str, this is no longer possible. This is actually the problem that triggered me to raise this issue (after an update to 0.5.x broke a build).

Python 2 unpacking/deserialization behaviour

| Msgpack type | Python2 type (ideal) | Python2 type (<0.5.x) | Python2 type (>=0.5.x) | Comment |
|---|---|---|---|---|
| mp_str | unicode | str | str | Should always decode assuming utf8 |
| mp_bin | str | str | str | bytearray would actually be a more literal/ideal unpack target, but is unfamiliar to most (and is mutable) |

Python 3 packing/serialization behaviour

| Python3 type | PackType (ideal) | PackType (<0.5.x) | PackType (>=0.5.x) | Comment |
|---|---|---|---|---|
| str | mp_str | mp_str | mp_str | Always encode with utf8 |
| bytes | mp_bin | mp_str | mp_str | |

Python 3 unpacking/deserialization behaviour

| Msgpack type | Python3 type (ideal) | Python3 type (<0.5.x) | Python3 type (>=0.5.x) | Comment |
|---|---|---|---|---|
| mp_str | str | bytes | bytes | The default conversion to bytes is particularly confusing |
| mp_bin | bytes | bytes | bytes | |

References

There are some other issues related to this that are worth referencing here:

I decided to make a new issue, since this is a general proposal that A) explicitly clarifies type mapping in both directions between msgpack and python, and B) explicitly covers both the bin and str msgpack formats (in msgpack terminology) as well as both python str/bytes cases.