Google has recently open sourced protocol buffers, one of the building blocks used for building large-scale systems. This is a good time to recap the controversies over data formats. In this debate, opinion has been sacred and facts largely open to interpretation.
The most recent iteration, from May, is "XML: the angle bracket tax", which takes issue with the inefficiency and developer-unfriendliness of XML as a data format. The first point is hard to dispute: all of those tags and attributes decorating the data add weight. (On the other hand, all of that redundancy means XML documents compress very well, but that adds another step.) Compared to the YAML equivalent on the page, the XML variant looks manifestly verbose. In all fairness, the example picked there (SOAP) is probably one of the worst-case scenarios for bloat, short of even more tchotchke-laden formats such as WS-Trust/WS-Security/WS-anything.
The jury is out on the second point. XML is not necessarily intended for direct presentation to users; that is why XSL exists as a stylesheet language to add presentation. For example, Internet Explorer's default tree view for XML files, with its collapsible nodes, is actually generated by applying a built-in default stylesheet. Manually tweaking XML files was never intended for end users. As for developers and tinkerers stuck with that problem, there is help in the form of XML editors that allow working directly with the high-level structure instead of the low-level syntax. But even with plain-text editing, consider that XHTML is a formulation of HTML as well-formed XML and is not particularly more challenging to work with than vanilla HTML 4.0.
Skepticism toward XML runs deep, including entire websites devoted to cataloging its deficiencies. But strong opinions run both ways. Inflammatory arguments bashing other formats as unnecessary reinventions of XML are also abundant; here is one comparing JSON vs XML.
Into this mess Google throws a curveball with protocol buffers, which define a binary serialization format. The overview includes the expected comparison against XML as a data point:
Protocol buffers have many advantages over XML for serializing structured data. Protocol buffers:
- are simpler
- are 3 to 10 times smaller
- are 20 to 100 times faster
- are less ambiguous
- generate data access classes that are easier to use programmatically
If efficiency is the primary constraint, a binary format is the way to go. The challenge then is making that format extensible. This is the catch: if a data format were known to be immutable forever, everything would be hand-coded to use the fewest possible bits. The reality is that systems evolve and protocols change. Version #2 of the application introduces extensions and yet must remain backwards compatible, so that it can still communicate with version #1 (and sometimes even in the opposite direction). Building in this level of flexibility without rewriting code for each version is the challenge.
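To make the extensibility point concrete, here is a minimal sketch of a tagged binary format in the style of the protocol buffers wire encoding, where each field carries a number and a wire type so that an older reader can skip fields it does not know. This is a simplified illustration, not the real library: it handles only non-negative integers and UTF-8 strings, and the field tables `V1_FIELDS` and `V2_FIELDS` and the helper names are made up for the example.

```python
from typing import Dict, Tuple, Union

VARINT, BYTES = 0, 2  # wire types: variable-length integer, length-delimited


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer 7 bits at a time, low bits first."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Return (value, next position) for the varint starting at pos."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_field(number: int, value: Union[int, str]) -> bytes:
    """Prefix each value with a key holding its field number and wire type."""
    if isinstance(value, int):
        return encode_varint((number << 3) | VARINT) + encode_varint(value)
    data = value.encode("utf-8")
    return encode_varint((number << 3) | BYTES) + encode_varint(len(data)) + data


def decode_known(buf: bytes, known: Dict[int, str]) -> Dict[str, Union[int, str]]:
    """Decode a message, keeping known fields and silently skipping the rest."""
    out: Dict[str, Union[int, str]] = {}
    pos = 0
    while pos < len(buf):
        key, pos = decode_varint(buf, pos)
        number, wire_type = key >> 3, key & 0x07
        if wire_type == VARINT:
            value, pos = decode_varint(buf, pos)
        elif wire_type == BYTES:
            length, pos = decode_varint(buf, pos)
            value = buf[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not supported in this sketch")
        if number in known:  # unknown field numbers are parsed past and dropped
            out[known[number]] = value
    return out


# Hypothetical schemas: version 2 adds field 3 without disturbing fields 1 and 2.
V1_FIELDS = {1: "id", 2: "name"}
V2_FIELDS = {1: "id", 2: "name", 3: "email"}

v2_message = (
    encode_field(1, 150)
    + encode_field(2, "alice")
    + encode_field(3, "alice@example.com")
)
v1_view = decode_known(v2_message, V1_FIELDS)  # old reader, new message
v2_view = decode_known(v2_message, V2_FIELDS)  # new reader, new message
```

Because every value is self-describing enough to be skipped, the version #1 reader recovers `id` and `name` from a version #2 message without any code changes, and a version #2 reader given a version #1 message simply finds no `email` field.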
Strangely, there is already a widely used binary serialization format: ASN.1. Virtually any large-scale web site is likely using it, because the X.509 certificates required for the SSL/TLS security protocol are encoded this way. It is more complex and slightly more powerful than protocol buffers, but boasts the same level of extensibility, since new optional fields can be added in such a way that the structure remains a superset of its previous version. But the greatest challenge for ASN.1 is ease of use. There are ASN.1 compilers, but the interface is nowhere near as polished as the cross-language programming model for protocol buffers. Parsing ASN.1 is so complex that integer overflows in the parser used on Windows featured in a remote-code execution vulnerability back in 2004.
One of the more interesting properties of ASN.1 is that the abstract structure of the data and its encoding are clearly separated. The same logical structure can be expressed in different ways on the wire, and while the most common variant, "basic encoding rules" or BER, is binary, a different one called XER allows outputting directly to XML. A small step toward reconciliation in the format wars.
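The separation of structure from encoding can be sketched in a few lines: one logical record, two serializers. The binary side below follows DER (a canonical subset of BER) for a SEQUENCE containing an INTEGER and a UTF8String, restricted to short-form lengths and non-negative integers. The XML side is only XER-like in spirit and makes no claim to follow the XER rules exactly; the `Person` record and its field names are hypothetical.

```python
import xml.etree.ElementTree as ET


def der_length(n: int) -> bytes:
    """Short-form length octet; long-form lengths are left out of this sketch."""
    if n >= 0x80:
        raise ValueError("sketch supports contents shorter than 128 bytes only")
    return bytes([n])


def der_integer(value: int) -> bytes:
    """INTEGER (tag 0x02) as minimal big-endian two's complement."""
    if value < 0:
        raise ValueError("sketch supports non-negative integers only")
    size = value.bit_length() // 8 + 1  # leaves room for the sign bit
    body = value.to_bytes(size, "big")
    return b"\x02" + der_length(len(body)) + body


def der_utf8(text: str) -> bytes:
    """UTF8String (tag 0x0C)."""
    body = text.encode("utf-8")
    return b"\x0c" + der_length(len(body)) + body


def der_sequence(*parts: bytes) -> bytes:
    """SEQUENCE (tag 0x30) wrapping already-encoded members."""
    body = b"".join(parts)
    return b"\x30" + der_length(len(body)) + body


def to_binary(record: dict) -> bytes:
    """Binary encoding of the abstract record."""
    return der_sequence(der_integer(record["id"]), der_utf8(record["name"]))


def to_xml(record: dict) -> str:
    """XML encoding of the same abstract record (XER-like, not strict XER)."""
    root = ET.Element("Person")
    ET.SubElement(root, "id").text = str(record["id"])
    ET.SubElement(root, "name").text = record["name"]
    return ET.tostring(root, encoding="unicode")


record = {"id": 150, "name": "alice"}
binary_form = to_binary(record)  # 13 bytes on the wire
xml_form = to_xml(record)        # the same data with angle brackets
```

Both outputs are derived from the same `record`; only the encoding rules differ, which is the property the ASN.1 design makes explicit.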