
Polyglots don’t lead to interoperability
tl;dr: The reason I filed w3ctag/design-principles#239
is that polyglots don’t lead to interoperability. Please don’t specify them.
What is a polyglot?
In plain English, a polyglot is someone who speaks more than one language.
In the context of web standards, a polyglot is a document which can be parsed as multiple different formats, ideally where the result is (in some sense) equivalent. Speccing a polyglot of format A and format B means requiring authors to write content that can be processed as both format A and format B, and allowing processors to interpret these documents as either format A or format B.
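To make that concrete, here's a minimal sketch of a JSON/JSON-LD polyglot (the field name and the context mapping are invented for illustration, not taken from any real spec): a plain JSON processor sees @context as just another key, while a JSON-LD processor uses it to reinterpret the same bytes as RDF.

```python
import json

# A hypothetical JSON/JSON-LD polyglot document. The field name and the
# @context mapping are made up for illustration.
document = """
{
  "@context": {"name": "https://schema.org/name"},
  "name": "Alice"
}
"""

# Processor 1: plain JSON. It sees an ordinary object with two keys, and
# "@context" is just one more key it can ignore.
data = json.loads(document)
print(data["name"])        # Alice
print("@context" in data)  # True

# Processor 2: JSON-LD. It uses "@context" to map the short key "name"
# onto the IRI https://schema.org/name, turning the same bytes into an
# RDF statement rather than a tree of keys and values.
```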
(Image: bridges between ecosystems.)
Why people spec polyglots
The name of the game in standards is finding consensus among disparate communities (e.g. the HTML & XML communities or the Web platform & Semantic Web communities). In this kind of environment, speccing a polyglot can be really tempting. A group that can’t agree on an approach ends up settling on a polyglot as a “why not both?” solution which satisfices the most intransigent participants from each ecosystem.
It’s common to downplay the added complexity of a polyglot by claiming that the complexity doesn’t impose any cost onto consuming software that isn’t doing [the optional] processing, because you can simply ignore the [optional] bits if you don’t need them but let the people who do want them benefit from them.
(Henri Sivonen of Mozilla, emphasis mine, in No Namespaces in JSON, Please)
Processing polyglots
Unfortunately, I’m pretty sure at this point that it’s always a mistake to specify polyglots, and that the costs of doing so are actually quite high.
Authors tend to test their documents with only one kind of processor, so they inadvertently introduce errors that would only be caught by the other kind of processor. In the case of Polyglot Markup, this happened when authors introduced XML errors into their documents but only tested with an HTML parser. Consumers using an XML parser would, instead of seeing the document, see an entirely unhelpful YSOD (the XML parser’s “yellow screen of death” error page). In other words, if the polyglot format contains fields that are only used by one kind of processor, those fields are likely to bit rot when authors routinely test their documents only with the other kind of processor.
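Here’s a minimal sketch of that failure mode using Python’s standard library (the markup is invented for illustration): the same bytes that an HTML parser accepts without complaint make an XML parser give up entirely.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Markup an author might write and test only in a browser: the bare "&"
# and the unclosed <br> are fine in HTML but fatal in XML.
markup = "<p>Fish & chips<br>Open late</p>"

# The HTML parser recovers and hands the text through.
class Collector(HTMLParser):
    def handle_data(self, data):
        print("HTML parser saw:", data)

Collector().feed(markup)

# The XML parser refuses to produce a document at all; this is the error
# an XML-consuming reader would see rendered as a YSOD.
try:
    ET.fromstring(markup)
except ET.ParseError as err:
    print("XML parser error:", err)
```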
Henri again:
This simply isn’t how format compatibility works at scale. You can’t just make a processing layer optional and have everything be happy for everyone.
One of three things happens:
- The optional processing layer introduces enough syntactic sugar that some producers start relying on the quasi-optional layer (JSON-LD in the JSON case) being there and consumers that didn’t want to buy into the quasi-optional layer are forced to implement the layer that was sold as optional.
- Producers don’t test with software that uses the quasi-optional layer, so what they output is broken for the purposes of the quasi-optional layer. The people who wanted the quasi-optional layer to be there can’t get the benefits anyway.
- A messy mixed state of the two above options without clearly converging on either.
And no, the tools won’t save you.
Security risks
Martin Thomson, a Distinguished Engineer at Mozilla, once said that a polyglot, by definition, projects a different face toward different processors.
If a security tool checks only one side of the polyglot, it’ll give a pass to content that exploits processors of the other side.
And we see this all the time in practice. An extremely common form of attack, for instance, is tricking an anti-virus scanner into thinking a document is benign and then getting the user to open that document with a different processor, which sees a malicious payload instead. Here are just a few examples I gathered in a few minutes of googling:
- OMG WTF PDF
- DICOM Images Have Been Hacked! Now What?
- Ubiquitous Bug Allows HIPAA-Protected Malware to Hide Behind Medical Images
- The GIFAR Image Vulnerability
- Polyglot Files: a Hacker’s best friend
- Malvertising fiends using polyglot exploits in real-world attacks
The decision to spec a polyglot format these days is not a neutral one from a security perspective.
JSON and JSON-LD polyglots
It’s common these days to specify data interchange formats as polyglots of JSON and JSON-LD. This specific kind of polyglot has particular costs that must be borne by users, authors, and implementors.
Two data models means twice the work!
HTML has two serializations, HTML and XHTML, with one data model: the DOM. Programmers can work with the DOM and don’t have to worry all that much about the bits on the wire.
JSON/JSON-LD polyglots, on the other hand, have one syntax but two different data models. Libraries and utilities which operate on one of the data models can’t be easily reused by programmers working with the other data model, which can lead to duplication of work.
When there are multiple data models in play, spec authors need to define a variant of each algorithm per data model; otherwise it’s easy to accidentally mix up concepts from the different data models, which leaves implementors and authors scratching their heads.
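As a sketch of the mismatch (assuming the third-party pyld library on the JSON-LD side, and a made-up one-field document): the same document is a tree of keys and values to a JSON programmer and a set of RDF triples to a JSON-LD programmer, and code written against one shape is no help with the other.

```python
import json
from pyld import jsonld  # assumes the third-party pyld package is installed

doc = {
    "@context": {"name": "https://schema.org/name"},
    "name": "Alice",
}

# Data model 1: a JSON tree. Tools on this side traverse dicts and lists,
# and the interesting thing is the "name" key.
print(json.dumps(doc, indent=2))

# Data model 2: RDF triples. Tools on this side operate on
# subject/predicate/object statements; the handy "name" key is gone,
# replaced by the IRI it was mapped to.
print(jsonld.to_rdf(doc, {"format": "application/n-quads"}))
# e.g. _:b0 <https://schema.org/name> "Alice" .
```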
Brittleness and prefixes
The JSON-LD-specific parts of JSON/JSON-LD polyglots are brittle and lead to a particularly acute case of the metacrap problem. The @context
section required by JSON-LD is the mechanism by which shortened, local names are mapped into RDF’s global, URI-based naming scheme.¹ It’s really easy for authors who test their content with a plain JSON parser to introduce errors which will cause JSON-LD-aware tools to choke.
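One concrete way this bites (again sketched with the assumed pyld library and made-up fields): an author adds a field, forgets to map it in the @context, and tests only with plain JSON tools. Plain JSON consumers see the field; JSON-LD expansion silently drops it.

```python
from pyld import jsonld  # assumes the third-party pyld package is installed

doc = {
    "@context": {"name": "https://schema.org/name"},
    "name": "Alice",
    "nickname": "Al",  # new field the author forgot to add to @context
}

# A plain JSON consumer sees "nickname" just fine -- it's right there.

# JSON-LD expansion maps keys through the @context; keys with no mapping
# are dropped, so JSON-LD consumers never see the nickname at all.
print(jsonld.expand(doc))
# [{'https://schema.org/name': [{'@value': 'Alice'}]}]
```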
Processing a document as JSON-LD is costlier than processing the equivalent plain JSON (it involves resolving the @context and expanding the document before you even get to your own logic), so in performance-sensitive applications it’s reasonable to expect this brittleness effect to be amplified: performance-sensitive processors are more likely to treat the format as plain JSON.
Desirable features which depend on JSON-LD processing
After the initial version of a polyglot is defined, it’s easy to add features which require the polyglot to be processed in one way and not the other, effectively (or maybe even explicitly) reneging on the original promise that the polyglot can be processed by either kind of tool without degrading the user experience.
For instance, if you want your JSON/JSON-LD polyglot to be able to carry some kind of cryptographic proof of its own data integrity, all pretense of being able to process the polyglot as plain JSON goes out the window. Calculating such a proof requires canonicalization of the data, and canonicalizing JSON-LD requires processing it as a format bearing RDF triples.
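Roughly what that looks like in practice, again assuming the pyld library: before anything can be hashed or signed, the document has to be canonicalized at the RDF layer (URDNA2015), which a processor that treats it as plain JSON simply can’t do.

```python
import hashlib
from pyld import jsonld  # assumes the third-party pyld package is installed

doc = {
    "@context": {"name": "https://schema.org/name"},
    "name": "Alice",
}

# Canonicalization happens at the RDF layer: expand the document, convert
# it to triples, and label/sort the triples deterministically.
canonical = jsonld.normalize(
    doc, {"algorithm": "URDNA2015", "format": "application/n-quads"}
)

# Only now is there a stable byte sequence to hash (and, in a real
# system, to sign).
print(hashlib.sha256(canonical.encode("utf-8")).hexdigest())
```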
Conclusion
Do not specify polyglot formats. They do not lead to interoperability (and the whole point of standards is interoperability). The benefits to users of speccing polyglots are marginal at best, while the ongoing costs to authors, implementors, and users are high. And what may have started life as a compromise between communities can turn out to not be a compromise at all once features are added that rely on the “optional to process” bits from one side of the polyglot.
What should you do instead?
When the proposed polyglot consists of a simpler “base” format with optional additions from another, more complex data model, consider specifying only the base format. It may be worth separately defining a canonical mapping from the base format to the complex data model, along with parser and serializer algorithms. In this kind of scenario the cost of the complexity is borne by the community which is asking for it, and if it turns out YAGNI, the complexity isn’t forever baked into the format.
For instance, if you’re considering speccing a JSON/JSON-LD polyglot, instead define a JSON format and a mapping between it and the RDF data model. That way, people with Linked Data toolchains have a well-defined way to import and export your format, and programmers in other ecosystems don’t have to deal with the added complexity of a polyglot. ∎
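As a sketch of what that split could look like (the field name, the published context, and the use of pyld are all illustrative assumptions): the wire format is plain JSON with no @context, and the mapping into RDF lives in a separately published context that only Linked Data consumers ever load.

```python
import json
from pyld import jsonld  # only consumers with Linked Data toolchains need this

# The format on the wire: plain JSON, no JSON-LD machinery at all.
wire = '{"name": "Alice"}'
doc = json.loads(wire)

# A hypothetical, separately published mapping from the format's field
# names into RDF. Ordinary consumers never see or load it.
PUBLISHED_CONTEXT = {"name": "https://schema.org/name"}

# Consumers who want RDF apply the mapping at import time.
print(jsonld.expand({"@context": PUBLISHED_CONTEXT, **doc}))
# [{'https://schema.org/name': [{'@value': 'Alice'}]}]
```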
Acknowledgments
Many thanks to Anne van Kesteren, Eryn Wells, Henri Sivonen, Jeffrey Yasskin, Martin Thomson, and Tantek Çelik for their insightful feedback on earlier versions of this post, and to Kristina Yasuda for a fun hallway track conversation at TPAC a few weeks ago that prompted me to finally write this down.
Notes
1. This mechanism, it turns out, is really weird. Of course, bound prefixes are an anti-pattern in language design, but this is a problem with JSON-LD in and of itself, and not specifically a problem of JSON/JSON-LD polyglots. ↩