What are the key differences between Apache Thrift, Google Protocol Buffers, MessagePack, ASN.1 and Apache Avro?

Protocol BuffersThriftasn.1Avro

Protocol Buffers Problem Overview


All of these provide binary serialization, RPC frameworks and IDL. I'm interested in key differences between them and characteristics (performance, ease of use, programming languages support).

If you know any other similar technologies, please mention it in an answer.

Protocol Buffers Solutions


Solution 1 - Protocol Buffers

ASN.1 is an ISO/ISE standard. It has a very readable source language and a variety of back-ends, both binary and human-readable. Being an international standard (and an old one at that!) the source language is a bit kitchen-sinkish (in about the same way that the Atlantic Ocean is a bit wet) but it is extremely well-specified and has decent amount of support. (You can probably find an ASN.1 library for any language you name if you dig hard enough, and if not there are good C language libraries available that you can use in FFIs.) It is, being a standardized language, obsessively documented and has a few good tutorials available as well.

Thrift is not a standard. It is originally from Facebook and was later open-sourced and is currently a top level Apache project. It is not well-documented -- especially tutorial levels -- and to my (admittedly brief) glance doesn't appear to add anything that other, previous efforts don't already do (and in some cases better). To be fair to it, it has a rather impressive number of languages it supports out of the box including a few of the higher-profile non-mainstream ones. The IDL is also vaguely C-like.

Protocol Buffers is not a standard. It is a Google product that is being released to the wider community. It is a bit limited in terms of languages supported out of the box (it only supports C++, Python and Java) but it does have a lot of third-party support for other languages (of highly variable quality). Google does pretty much all of their work using Protocol Buffers, so it is a battle-tested, battle-hardened protocol (albeit not as battle-hardened as ASN.1 is. It has much better documentation than does Thrift, but, being a Google product, it is highly likely to be unstable (in the sense of ever-changing, not in the sense of unreliable). The IDL is also C-like.

All of the above systems use a schema defined in some kind of IDL to generate code for a target language that is then used in encoding and decoding. Avro does not. Avro's typing is dynamic and its schema data is used at runtime directly both to encode and decode (which has some obvious costs in processing, but also some obvious benefits vis a vis dynamic languages and a lack of a need for tagging types, etc.). Its schema uses JSON which makes supporting Avro in a new language a bit easier to manage if there's already a JSON library. Again, as with most wheel-reinventing protocol description systems, Avro is also not standardized.

Personally, despite my love/hate relationship with it, I'd probably use ASN.1 for most RPC and message transmission purposes, although it doesn't really have an RPC stack (you'd have to make one, but IOCs make that simple enough).

Solution 2 - Protocol Buffers

We just did an internal study on serializers, here are some results (for my future reference too!)

Thrift = serialization + RPC stack

The biggest difference is that Thrift is not just a serialization protocol, it's a full blown RPC stack that's like a modern day SOAP stack. So after the serialization, the objects could (but not mandated) be sent between machines over TCP/IP. In SOAP, you started with a WSDL document that fully describes the available services (remote methods) and the expected arguments/objects. Those objects were sent via XML. In Thrift, the .thrift file fully describes the available methods, expected parameter objects and the objects are serialized via one of the available serializers (with Compact Protocol, an efficient binary protocol, being most popular in production).

ASN.1 = Grand daddy

ASN.1 was designed by telecom folks in the 80s and is awkward to use due to limited library support as compared to recent serializers which emerged from CompSci folks. There are two variants, DER (binary) encoding and PEM (ascii) encoding. Both are fast, but DER is faster and more size efficient of the two. In fact ASN.1 DER can easily keep up (and sometimes beat) serializers that were designed 30 years after itself, a testament to it's well engineered design. It's very compact, smaller than Protocol Buffers and Thrift, only beaten by Avro. The issue is having great libraries to support and right now Bouncy Castle seems to be the best one for C#/Java. ASN.1 is king in security and crypto systems and isn't going to go away, so don't be worried about 'future proofing'. Just get a good library...

MessagePack = middle of the pack

It's not bad but it's neither the fastest, nor the smallest nor the best supported. No production reason to choose it.

Common

Beyond that, they are fairly similar. Most are variants of the basic TLV: Type-Length-Value principle.

Protocol Buffers (Google originated), Avro (Apache based, used in Hadoop), Thrift (Facebook originated, now Apache project) and ASN.1 (Telecom originated) all involve some level of code generation where you first express your data in a serializer-specific format, then the serializer "compiler" will generate source code for your language via the code-gen phase. Your app source then uses these code-gen classes for IO. Note that certain implementations (eg: Microsoft's Avro library or Marc Gavel's ProtoBuf.NET) let you directly decorate your app level POCO/POJO objects and then the library directly uses those decorated classes instead of any code-gen's classes. We've seen this offer a boost performance since it eliminates a object copy stage (from application level POCO/POJO fields to code-gen fields).

Some results and a live project to play with

This project (https://github.com/sidshetye/SerializersCompare) compares important serializers in the C# world. The Java folks already have something similar.

1000 iterations per serializer, average times listed
Sorting result by size
Name                Bytes  Time (ms)
------------------------------------
Avro (cheating)       133     0.0142
Avro                  133     0.0568
Avro MSFT             141     0.0051
Thrift (cheating)     148     0.0069
Thrift                148     0.1470
ProtoBuf              155     0.0077
MessagePack           230     0.0296
ServiceStackJSV       258     0.0159
Json.NET BSON         286     0.0381
ServiceStackJson      290     0.0164
Json.NET              290     0.0333
XmlSerializer         571     0.1025
Binary Formatter      748     0.0344

Options: (T)est, (R)esults, s(O)rt order, (S)erializer output, (D)eserializer output (in JSON form), (E)xit

Serialized via ASN.1 DER encoding to 148 bytes in 0.0674ms (hacked experiment!)

Solution 3 - Protocol Buffers

Adding to the performance perspective, Uber recently evaluated several of these libraries on their engineering blog:

https://eng.uber.com/trip-data-squeeze/

The winner for them? MessagePack + zlib for compression

> Our goal was to find the combination of encoding protocol and > compression algorithm with the most compact result at the highest > speed. We tested encoding protocol and compression algorithm > combinations on 2,219 pseudorandom anonymized trips from Uber New York > City (put in a text file as JSON).

The lesson here is that your requirements drive which library is right for you. For Uber they couldn't use an IDL based protocol due to the schemaless nature of message passing they have. This eliminated a bunch of options. Also for them it's not only raw encoding/decoding time that comes into play, but the size of data at rest.

Size Results

Size Results

Speed Results

enter image description here

Solution 4 - Protocol Buffers

The one big thing about ASN.1 is, that ist is designed for specification not implementation. Therefore it is very good at hiding/ignoring implementation detail in any "real" programing language.

Its the job of the ASN.1-Compiler to apply Encoding Rules to the asn1-file and generate from both of them executable code. The Encoding Rules might be given in EnCoding Notation (ECN) or might be one of the standardized ones such as BER/DER, PER, XER/EXER. That is ASN.1 is the Types and Structures, the Encoding Rules define the on the wire encoding, and last but not least the Compiler transfers it to your programming language.

The free Compilers support C,C++,C#,Java, and Erlang to my knowledge. The (much to expensive and patent/licenses ridden) commercial compilers are very versatile, usually absolutely up-to-date and support sometimes even more languages, but see their sites (OSS Nokalva, Marben etc.).

It is surprisingly easy to specify an interface between parties of totally different programming cultures (eg. "embedded" people and "server farmers") using this techniques: an asn.1-file, the Encoding rule e.g. BER and an e.g. UML Interaction Diagram. No Worries how it is implemented, let everyone use "their thing"! For me it has worked very well. Btw.: At OSS Nokalva's site you may find at least two free-to-download books about ASN.1 (one by Larmouth the other by Dubuisson).

IMHO most of the other products try only to be yet-another-RPC-stub-generators, pumping a lot of air into the serialization issue. Well, if one needs that, one might be fine. But to me, they look like reinventions of Sun-RPC (from the late 80th), but, hey, that worked fine, too.

Solution 5 - Protocol Buffers

Microsoft's Bond (https://github.com/Microsoft/bond) is very impressive with performance, functionalities and documentation. However it does not support many target platforms as of now ( 13th feb 2015 ). I can only assume it is because it is very new. currently it supports python, c# and c++ . It's being used by MS everywhere. I tried it, to me as a c# developer using bond is better than using protobuf, however I have used thrift as well, the only problem I faced was with the documentation, I had to try many things to understand how things are done.

Few resources on Bond are as follows ( https://news.ycombinator.com/item?id=8866694 , https://news.ycombinator.com/item?id=8866848 , https://microsoft.github.io/bond/why_bond.html )

Solution 6 - Protocol Buffers

For performance, one data point is jvm-serializers benchmark -- it's quite specific, small messages, but might help if you are on Java platform. I think performance in general will often not be the most important difference. Also: NEVER take authors' words as gospel; many advertised claims are bogus (msgpack site for example has some dubious claims; it may be fast, but information is very sketchy, use case not very realistic).

One big difference is whether a schema must be used (PB, Thrift at least; Avro it may be optional; ASN.1 I think also; MsgPack, not necessarily).

Also: in my opinion it is good to be able to use layered, modular design; that is, RPC layer should not dictate data format, serialization. Unfortunately most candidates do tightly bundle these.

Finally, when choosing data format, nowadays performance does not preclude use of textual formats. There are blazing fast JSON parsers (and pretty fast streaming xml parsers); and when considering interoperability from scripting languages and ease of use, binary formats and protocols may not be the best choice.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionandreypoppView Question on Stackoverflow
Solution 1 - Protocol BuffersJUST MY correct OPINIONView Answer on Stackoverflow
Solution 2 - Protocol BuffersDeepSpace101View Answer on Stackoverflow
Solution 3 - Protocol BuffersAvnerView Answer on Stackoverflow
Solution 4 - Protocol BuffersnjimkoView Answer on Stackoverflow
Solution 5 - Protocol BuffersSrivathsa Harish VenkataramanaView Answer on Stackoverflow
Solution 6 - Protocol BuffersStaxManView Answer on Stackoverflow