Are text-based encodings even relevant any more?

YAML, JSON and Config Files

Some years ago, one of my very first blog posts was a piece on how terribly XML fails its essential design goals. I mentioned that YAML or JSON were much better compromise solutions to the underlying problem of expressing data in a form that is easily readable by both people and machines.

But the other day, I came across a couple of excellent articles by Martin Tournoij that pointed out some of the shortcomings of JSON and YAML, especially for configuration files. Both pieces are well-written and worth a read. They cite a number of issues, including JSON’s lack of comments and programmability, and YAML’s subtle formatting complexities and security issues.
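To make those complaints concrete, here is a minimal sketch (my own illustration, not taken from the articles) using Python’s built-in json module and the PyYAML library, which follows YAML 1.1’s implicit typing rules:

```python
import json
import yaml  # PyYAML, which implements YAML 1.1's implicit typing

# JSON flatly rejects comments, so annotating a config is impossible.
try:
    json.loads('{"retries": 3}  // how many times to retry')
except json.JSONDecodeError as err:
    print("JSON rejects comments:", err)

# YAML type-guesses unquoted scalars, with surprising results.
print(yaml.safe_load("country: no"))    # {'country': False}, the 'Norway problem'
print(yaml.safe_load("version: 3.10"))  # {'version': 3.1}, the trailing zero is gone
```

(The security issues are of the same flavour: older versions of PyYAML would happily instantiate arbitrary Python objects via yaml.load unless you remembered to ask for the safe loader.)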

The main focus of the articles is configuration files, and it occurred to me at the time that configuration files are probably only one use case out of many for text-based data formats. I started working on a wonderfully erudite (and exhaustingly long-winded) post on all the different use cases for YAML, JSON and XML, and then it hit me: there are no other proper use cases.

Let me clearly step out on a limb here:

I believe that configuration files are the only case for which text-based data formats such as JSON, YAML or XML are an optimal solution.

Text-based Encodings in the Wild

Of course I acknowledge that configuration files are not the only setting in which we see text-based data. Web pages are the most obvious example, and there are a myriad of others. The use of text-based encodings in these contexts is well-established, so I don’t mean to say that it doesn’t work – I just mean that it’s not the best solution. Here’s why…

The whole point of text-based data encodings is to have a single format that is ‘easy’ for both humans and machines to read and write. That’s their essential purpose, and that’s why I think XML is a massive failure, because it’s neither of those things. But the real question lies in where this requirement is a priority. In what settings is it important that data be encoded in a way that is easy for both machines and humans to read and write?

Consider the file formats for popular office suites. Both the OpenDocument and Office Open XML formats are based on XML files that are then zipped. But why? Even setting aside my general antipathy for XML, I can’t believe anyone in their right mind unzips these files and opens the XML in a text editor. Why not settle on a binary format that is compact and easy to process, and let the office software render it properly? Humans don’t (and shouldn’t have to) care about the ‘raw’ data of a spreadsheet file.
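If you doubt this, it’s easy to check; here’s a minimal sketch (the file name is hypothetical) that opens a .docx as the zip archive it really is:

```python
import zipfile

# An Office Open XML document is a zip archive of XML parts; the main
# body lives in word/document.xml (OpenDocument files use content.xml).
# "report.docx" is a hypothetical file name for illustration.
with zipfile.ZipFile("report.docx") as archive:
    print(archive.namelist())                  # the XML parts inside
    body = archive.read("word/document.xml")
    print(body[:300])                          # dense markup nobody edits by hand
```

Even this trivial peek surfaces markup so dense that nobody would choose to edit it directly.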

Of course, HTML and its derivatives, variants and hangers-on are the elephants in the room. The waters are a bit murkier here, but I still don’t think they constitute a counter-example to the basic point. First, HTML (like XML and SGML) is terrible for both machines and humans to read and write. It’s verbose, inconsistent, and visually noisy for humans, and complicated and expensive for computers to parse.

Secondly, I’d argue that HTML is closer to a programming language than to a data encoding. These days, we don’t use web pages to express information so much as human/machine interactions, which is a reasonable (if generic) definition of any application program. The use of languages like JavaScript, and of frameworks like Rails and Django that generate HTML dynamically, only strengthens this argument.
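As a toy illustration (entirely my own, and deliberately simplistic), notice that the HTML below is program output from start to finish; no human ever authors the ‘data’ directly:

```python
# HTML as program output rather than hand-authored data: the page is
# assembled by code, and only the rendered result is meant for humans.
def render_row(name: str, score: int) -> str:
    return f"<tr><td>{name}</td><td>{score}</td></tr>"

rows = "".join(render_row(n, s) for n, s in [("alice", 3), ("bob", 7)])
page = f"<html><body><table>{rows}</table></body></html>"
print(page)
```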

Now, programming languages themselves certainly are examples of encodings intended for use by both humans and machines, but they’re not intended for users – they’re used by developers. Of course developers are users (especially of development software), but I’m drawing the distinction here because developers have a different skill set to non-developers, and are far more likely to be comfortable using a text editor and navigating complex information structures. So even if HTML were a good text-based encoding, it’s not intended for non-technical consumption anyway.

The rest of us simply use a browser to access HTML, and the trouble that even the best web browsers have rendering HTML consistently and efficiently is a testament to its shortcomings on the machine side: HTML is hard for machines to process, not just for humans.

The underlying message here is that, if ‘real world’ humans don’t primarily access some data in its ‘raw’ form, there’s no point in spending the extra effort to make that raw form human-readable at all.

When Worlds Collide

Earlier in this piece, I made the point that text-based information formats exist to be easily accessible to both humans and machines. I’d go so far as to say that such formats represent an intersection between the user experience and the core application logic.

In general, good design principles mandate that we separate those things very distinctly. Presenting information to humans is a separate concern from manipulating that information, so we generally build separate components to deal with these different functions. That’s what architectural patterns like Model-View-Controller are all about.

Configuration files lie in that (to my mind) small intersection of requirements that make them a true exception or edge case to the more general principle of separating information processing and presentation. I’m struggling to find anything else that falls into that category.

Is this a startling or radical conclusion? It’s hard to say, but I was surprised by it. It does seem to me that text-based encodings have far fewer legitimate uses than we actually see in the real world. I’m happy to be corrected, if anyone has counter-examples I haven’t considered.