CBOR is MessagePack. At least, cbor-ruby started with the MessagePack sources. There's nothing at all complicated about MessagePack: there are a few types, they all have small prefixes, end of format. The story is that Carsten took MessagePack, wrote a standard, added some things he wanted, and called it something else. I personally think this is pretty bad faith, but I admit I haven't been involved in the community or discussions, so it's possible there are some things I don't know.
Disclaimer: I wrote and maintain a MessagePack library.
> (CBOR introduced) a neat [type|length] encoding instead of just [type], which
> applies to all types - integers, floats, strings, maps, etc.
Not really. For example there's no 11-bit integer type. In fact, CBOR is more-or-less identical to MP in this regard: there are a set number of (we'll use CBOR parlance here) tags, each of which indicates the length of succeeding binary data.
If you're just saying "length follows type", then this is in fact exactly how MP works. If you have 30 bytes of raw data, you use the bin 8 tag (0xc4), write 0x1e for the length, and then memcpy your data into your buffer.
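To make the comparison concrete, here's a minimal sketch that encodes the same 30 bytes both ways. The function names are mine, not from any library, and the CBOR side only handles lengths up to 255. In both formats you get one prefix byte, one length byte (0x1e), then the payload.

```python
def msgpack_bin8(data: bytes) -> bytes:
    """Encode data as MessagePack bin 8: 0xc4, one length byte, payload."""
    if len(data) > 0xFF:
        raise ValueError("bin 8 holds at most 255 bytes")
    return bytes([0xC4, len(data)]) + data


def cbor_byte_string(data: bytes) -> bytes:
    """Encode data as a definite-length CBOR byte string (major type 2)."""
    major = 2 << 5  # the major type lives in the top three bits
    n = len(data)
    if n < 24:
        return bytes([major | n]) + data      # length fits in the initial byte
    if n <= 0xFF:
        return bytes([major | 24, n]) + data  # 24 means "one length byte follows"
    raise ValueError("longer lengths are omitted from this sketch")


payload = bytes(30)  # 30 zero bytes standing in for raw data
print(msgpack_bin8(payload)[:2].hex())      # c41e
print(cbor_byte_string(payload)[:2].hex())  # 581e
```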
---
I'd like to discuss CBOR's design, which I think is very poorly thought out.
---
CBOR is broken because of indefinite-length data types. A CBOR implementation has to accept an infinite stream of data. Besides being impossible, this is unacceptable for pretty much any application. It's also poorly specified; Section 3.1 reads:
> In a streaming application, a data stream may be composed of a sequence of
> CBOR data items concatenated back-to-back. In such an environment, the
> decoder immediately begins decoding a new data item if data is found after
> the end of a previous data item.
"Immediately" is troubling here. In fact, the section goes on:
> Note that some applications and protocols will not want to use
> indefinite-length encoding. Using indefinite-length encoding allows an
> encoder to not need to marshal all the data for counting, but it requires a
> decoder to allocate increasing amounts of memory while waiting for the end
> of the item. This might be fine for some applications but not others.
I really don't know what to do here. The data format requires your application to allocate indefinitely, and the spec even says "this might be bad for you". Come on.
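Here's a rough sketch of what a decoder ends up doing for an indefinite-length byte string (0x5f, then chunks, then the 0xff break byte). It can't know the total size until the break arrives, so it just keeps buffering. The `max_bytes` cap and the `TooLarge` exception are hypothetical application policy that I made up for illustration; the spec doesn't define either. Chunk lengths above 255 are left out to keep it short.

```python
import io


class TooLarge(Exception):
    """Hypothetical error: the item outgrew the application's own limit."""


def read_indefinite_bytes(stream, max_bytes):
    """Read one indefinite-length CBOR byte string from a binary stream."""
    if stream.read(1) != b"\x5f":
        raise ValueError("not an indefinite-length byte string")
    buffered = bytearray()
    while True:
        head = stream.read(1)
        if not head:
            raise EOFError("stream ended before the break byte")
        initial = head[0]
        if initial == 0xFF:  # break: only now is the total size known
            return bytes(buffered)
        if initial >> 5 != 2:
            raise ValueError("chunk is not a byte string")
        info = initial & 0x1F
        if info < 24:
            length = info
        elif info == 24:
            length_byte = stream.read(1)
            if not length_byte:
                raise EOFError("stream ended inside a chunk header")
            length = length_byte[0]
        else:
            raise ValueError("longer chunk lengths are omitted from this sketch")
        if len(buffered) + length > max_bytes:
            raise TooLarge("item exceeds the application's limit")
        chunk = stream.read(length)
        if len(chunk) != length:
            raise EOFError("stream ended inside a chunk")
        buffered += chunk  # memory use grows with every chunk


data = bytes.fromhex("5f42010243030405ff")  # two chunks: 0102 and 030405
print(read_indefinite_bytes(io.BytesIO(data), max_bytes=16).hex())  # 0102030405
```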
---
CBOR's tags are pretty much MP v5's Extension types, but the spec makes them more confusing. Here's an excerpt:
> Decoders do not need to understand tags, and thus tags may be of little
> value in applications where the implementation decoding that stream know
> the semantic meaning of each item in the data flow. Their primary purpose
> in this specification is to define common data types such as dates.
So far, so good. CBOR then goes on to define more than a dozen tags:
- Standard date/time string
- Epoch-based date/time
- Positive bignum
- Negative bignum
- Decimal fraction
- Bigfloat
- Expected conversion to base64url encoding
- Expected conversion to base64 encoding
- Expected conversion to base16 encoding
- Encoded CBOR data item
- URI
- base64url
- base64
- Regular expression
- MIME message
- Self-describe CBOR
Besides being over-engineered ("Expected conversion to X"? "MIME message"?!), this puts pressure on implementations to support these extension types. The reason MP doesn't do this is that, for example, BigNum support isn't free and you don't want to tie your binary format to it. CBOR's spec says implementations are free to not convert these (at least, that's how I interpret "Decoders do not need to understand tags"), but it creates a schism: most dynamic languages include BigNum support (Ruby, Python, etc.), but at least C and C++ don't. This creates pressure on C/C++ implementations to pull in a lot of extra baggage, something the MP authors specifically and thoughtfully avoided.
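To illustrate the schism, here's a minimal sketch of decoding a positive bignum (tag 2 wrapping a byte string) in Python. The function name is mine and it only handles payloads shorter than 24 bytes. The point is that the hard part is a single call to the built-in `int.from_bytes`, because Python integers are already arbitrary-precision.

```python
def decode_positive_bignum(item: bytes) -> int:
    """Decode CBOR tag 2 wrapping a short definite-length byte string."""
    if len(item) < 2 or item[0] != 0xC2:  # major type 6 (tag), tag number 2
        raise ValueError("not a positive bignum")
    head = item[1]
    if head >> 5 != 2 or (head & 0x1F) >= 24:
        raise ValueError("sketch handles byte strings under 24 bytes only")
    length = head & 0x1F
    payload = item[2:2 + length]
    if len(payload) != length:
        raise ValueError("truncated payload")
    # Python's int has no fixed width, so no extra library is needed here.
    return int.from_bytes(payload, "big")


print(decode_positive_bignum(bytes.fromhex("c249010000000000000000")))
# 18446744073709551616, i.e. 2**64
```

A C decoder has nothing comparable in the language itself: it either links an arbitrary-precision library or hands the raw bytes to the application.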
Even worse, tags are undefined behavior in CBOR. Section 3.5 reads:
> A decoder that comes across a tag (Section 2.4) that it does not recognize,
> such as a tag that was added to the IANA registry after the decoder was
> deployed or a tag that the decoder chose not to implement, might issue a
> warning, might stop processing altogether, might handle the error and
> present the unknown tag value together with the contained data item to the
> application (as is expected of generic decoders), might ignore the tag and
> simply present the contained data item only to the application, or take
> some other type of action.
"some other type of action" is deeply worrying. Even more deeply worrying is that this applies to anything at all in CBOR:
> A decoder that comes across a simple value (Section 2.3) that it does not
> recognize, such as a value that was added to the IANA registry after the
> decoder was deployed or a value that the decoder chose not to implement,
> might issue a warning, might stop processing altogether, might handle the
> error by making the unknown value available to the application as such (as
> is expected of generic decoders), or take some other type of action.
I choose not to implement `int`. I decide instead to fill up your home folder. I'm a compliant CBOR implementation.
---
Canonical CBOR is a bad idea, as outlined by ludocode elsewhere. Generally speaking, canonicalization is application-specific and attempts to codify it end up being nonsensical, overwrought, or incomplete. And hey look, CBOR's is nonsensical AND incomplete: it punts on floats and tags.
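On the float point, here's a small sketch of why a canonical form has to make a choice. The same value has three well-formed CBOR encodings (half, single, and double precision), and all of them represent 1.5 exactly. The helper name is mine; `struct`'s `e` format is Python's built-in half-precision packing.

```python
import struct


def cbor_float_encodings(value: float) -> dict:
    """Return the half, single, and double precision CBOR encodings of value."""
    return {
        "half": b"\xf9" + struct.pack(">e", value),    # 16-bit IEEE 754
        "single": b"\xfa" + struct.pack(">f", value),  # 32-bit IEEE 754
        "double": b"\xfb" + struct.pack(">d", value),  # 64-bit IEEE 754
    }


for name, encoded in cbor_float_encodings(1.5).items():
    print(name, encoded.hex())
# half f93e00
# single fa3fc00000
# double fb3ff8000000000000
```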
---
There is no technical justification for CBOR. I know some places require standards, and that could've been solved by working with Sadayuki. I'm legitimately at a loss.
---
I know you're not into the "drama" part, but I think it's beneficial to get into it a little. Feel free to ignore; it's why I put it down here.
I'm not gonna characterize things. I'll just list links. I feel like that's the most fair way to do things.
It's also worth pointing out that CBOR's RFC is incorrect and unfair in its description of MP. Here's Section E.2:
> MessagePack has been essentially stable since it was first published around
> 2011; it has not yet had a transition. The evolution of MessagePack is
> impeded by an imperative to maintain complete backwards compatibility with
> existing stored data, while only few bytecodes are still available for
> extension. Repeated requests over the years from the MessagePack user
> community to separate out binary and text strings in the encoding recently
> have led to an extension proposal that would leave MessagePack's "raw" data
> ambiguous between its usages for binary and text data. The extension
> mechanism for MessagePack remains unclear.
MP v5 specifies separate types for string data and raw binary data. MP v5 is backwards-compatible with v4; in fact the v5 standard says implementations should provide a v4 compatibility mode. Nothing about the separate types for string and raw data is ambiguous; they're as separate as any other variable-length type. MP v5 includes a clear and simple extension mechanism:
> MessagePack allows applications to define application-specific types using
> the Extension type. Extension type consists of an integer and byte array
> where the integer represents a kind of type and the byte array represents
> data. Applications can assign 0 - 127 to store application-specific type
> information. MessagePack reserves -1 - -128 for future extension to add
> predefined types which will be described in separated documents.
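For a sense of how simple that mechanism is, here's a minimal sketch of the ext 8 format: 0xc7, one length byte, one signed type byte, then the data. The function names and the type code 5 are hypothetical, picked only for illustration. A real encoder would normally choose a fixext format for payloads of exactly 1, 2, 4, 8, or 16 bytes, which this sketch skips.

```python
def msgpack_ext8(type_code: int, data: bytes) -> bytes:
    """Encode an application-specific extension value as MessagePack ext 8."""
    if not 0 <= type_code <= 127:
        raise ValueError("applications may only assign types 0 through 127")
    if len(data) > 0xFF:
        raise ValueError("ext 8 holds at most 255 bytes")
    return bytes([0xC7, len(data), type_code]) + data


def parse_ext8(buf: bytes):
    """Decode an ext 8 value into a (type, data) pair."""
    if len(buf) < 3 or buf[0] != 0xC7:
        raise ValueError("not an ext 8 value")
    length = buf[1]
    type_code = int.from_bytes(buf[2:3], "big", signed=True)  # type is a signed byte
    data = buf[3:3 + length]
    if len(data) != length:
        raise ValueError("truncated payload")
    return type_code, data


encoded = msgpack_ext8(5, b"\x01\x02\x03")  # type 5 is a made-up application type
print(encoded.hex())        # c7030501 0203 without the space: c70305010203
print(parse_ext8(encoded))  # (5, b'\x01\x02\x03')
```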