The Web platform: what it is

T.V. Raman said the Web is more than a Web Browser, and that Web technology means more than just HTML,¹ and I couldn’t agree more. Yet he goes on to characterize the HTML5 effort like so:

The HTML5 community would define themselves as encompassing all Web technologies, i.e., if it’s not HTML5 and implemented in a browser, it’s not the Web.

I wanted to write a bit about this disconnect.

What is the Web platform anyway? Honestly, it’s a weird thing: part designed, part congealed; part documented, part reverse-engineered. It consists of the technology broadly used to process the public content of the Web. Many tools that live outside of browsers are built on top of this platform, but what makes a technology a piece of the Web platform is that it works with the public content of the Web.

The Web platform includes many technologies we’re all familiar with by now: URLs, HTTP, REST, HTML, CSS, JavaScript, DOM, etc. It also includes lots of pieces users never see, like website APIs you can call into with XHR (or with your favorite programming language’s HTTP library), and data formats like XML and JSON, which aren’t user-facing.

Henri Sivonen’s second pass at a diagram of the web stack.

Tools intended to operate on public Web content usually need to handle that content the way that browsers do, because browsers are the tools that authors write their content for. (This is the Support Existing Content design principle.) If a browser handles unquoted attribute values, say, it behooves you to do the same, since your code will break on lots of web content if you do otherwise. Here’s another example: a web crawler like Google’s doesn’t run in a browser, but needs to process public Web content in as browser-compatible a way as possible, so that Google’s search results most accurately reflect what you and I will see when we click through to the pages returned.
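To make the unquoted-attribute case concrete, here is a minimal sketch in Python, using only the standard library. The snippet and the helper names (SNIPPET, AttrCollector, lenient_attrs, strict_parse_ok) are made up for illustration, and html.parser is only a tolerant parser, not a full implementation of browser parsing. Still, it shows the gap between a tool that accepts what authors actually write and one that doesn’t.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical markup of the kind found all over the public Web:
# attribute values with no quotes around them.
SNIPPET = '<a href=about.html class=nav>About</a>'


class AttrCollector(HTMLParser):
    """Records each start tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.attrs_seen = []

    def handle_starttag(self, tag, attrs):
        self.attrs_seen.append((tag, dict(attrs)))


def lenient_attrs(markup):
    """Parse markup tolerantly and return the (tag, attributes) pairs found."""
    collector = AttrCollector()
    collector.feed(markup)
    collector.close()
    return collector.attrs_seen


def strict_parse_ok(markup):
    """Return True if a strict XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(lenient_attrs(SNIPPET))
    print(strict_parse_ok(SNIPPET))
```

The tolerant parser recovers both attributes from the unquoted markup, while the strict XML parser rejects the same input outright. A tool built on the strict parser would simply fail on any page written this way.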

How do you make sure that XML technologies can co-exist on the Web alongside HTML without necessarily having HTML’s sloppiness leaking into all Web languages?

When he asks this, Raman seems to believe that the messiness of the Web is somehow contained in and limited to HTML and/or the major browsers, and that the rest of the platform doesn’t have to know about or handle the sorts of messiness that browsers have to. He seems to think the messiness is somehow imposed on the rest of the Web stack by the browsers.

But this is just wrong. The messiness is in the content on the Web; it’s simply one of the core facts of the Web. The browsers are just on the front lines of dealing with it. Any tool purporting to be useful on the Web, and any specification purporting to describe the reality of the Web, must recognize this.