Marking up code listings

The straightforward way to mark up source code in semantic HTML is to nest a <code> element inside a <pre>, a compound that looks like this:

<pre><code>
  …source code goes here…
</code></pre>

Aside: the existence of this HTML compound is why proposals for a code microformat, such as Anders Conbere’s hCode, get shot down. One of the questions to ask before proceeding in the microformats process is “Is there a compound of XHTML elements that would work?”, and here the answer is an emphatic “yes.”

Now, this compound is pretty minimal — we know that the text inside is source code, but we don’t know anything interesting about it. Let’s see about enhancing it.

For my first baby-step past the basics, I indicate the code’s language by putting a class onto the <code> element:

<pre><code class="python">
  …python source code goes here…
</code></pre>

There are several JavaScript-based automatic syntax highlighters, such as Dan Webb’s CodeHighlighter, which operate directly on such language-labeled <pre><code> blocks, so a really simple addition to the basic HTML compound can get you quite far. But say you want to handle syntax highlighting yourself: what should you do?
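To make the idea of a language-labeled block concrete, here is a rough sketch of how a tool might locate such blocks and read the language off the class attribute. It is not how CodeHighlighter or any other particular highlighter works; the names CodeBlockFinder and find_code_blocks are made up for illustration, and the sketch assumes the class attribute holds only the language name.

```python
from html.parser import HTMLParser


class CodeBlockFinder(HTMLParser):
    """Collect (language, source) pairs from <pre><code class="..."> blocks.

    Hypothetical helper: assumes the class attribute is just the language.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []       # finished (language, source) pairs
        self._in_pre = False   # currently inside a <pre>?
        self._in_code = False  # currently inside a <code> within a <pre>?
        self._language = None  # class of the open <code>, or None
        self._text = []        # text collected for the open <code>

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._in_pre = True
        elif tag == "code" and self._in_pre:
            self._in_code = True
            self._language = dict(attrs).get("class")
            self._text = []

    def handle_endtag(self, tag):
        if tag == "code" and self._in_code:
            self.blocks.append((self._language, "".join(self._text)))
            self._in_code = False
        elif tag == "pre":
            self._in_pre = False

    def handle_data(self, data):
        # Text inside nested elements (such as <var>) is collected too.
        if self._in_code:
            self._text.append(data)


def find_code_blocks(html_text):
    """Return a list of (language, source) pairs found in html_text."""
    finder = CodeBlockFinder()
    finder.feed(html_text)
    finder.close()
    return finder.blocks
```

A <code> element outside a <pre> is ignored, and a block with no class comes back with a language of None, which is a reasonable signal to leave it unhighlighted.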

Syntax highlighting

Take this simple snippet of JavaScript:

var foo = 4;

We can introduce basic syntax highlighting of variables by using the <var> element:

<pre><code class="javascript">
  var <var>foo</var> = 4;
</code></pre>

This is about as far as we can get without introducing our own semantics via custom classes.
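The <var> step is simple enough to automate. The following is a minimal sketch, assuming you already know which identifiers are variables; mark_variables is a made-up name, and the whole-word matching is deliberately naive (it would also wrap a matching word inside a string or a comment).

```python
import html
import re


def mark_variables(source, names):
    """Escape source for HTML and wrap whole-word uses of names in <var>.

    Hypothetical helper: it does no real parsing, so occurrences inside
    strings or comments get wrapped as well.
    """
    if not names:
        return html.escape(source, quote=False)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in names) + r")\b"
    )
    parts = []
    position = 0
    for match in pattern.finditer(source):
        # Escape the text between matches, then wrap the match itself.
        parts.append(html.escape(source[position:match.start()], quote=False))
        parts.append("<var>" + html.escape(match.group(0), quote=False) + "</var>")
        position = match.end()
    parts.append(html.escape(source[position:], quote=False))
    return "".join(parts)
```

Calling mark_variables("var foo = 4;", ["foo"]) returns the line used inside the markup above, var <var>foo</var> = 4;, and the var keyword is left alone because only the listed names are wrapped.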

So how should we choose what to highlight? What should we name the classes we create? As Jon Williams noted, our technique should take into account the wide variety of languages we might want to mark up — programming languages are but one such variety.

This is where Emacs comes in.

Emacs already knows how to syntax highlight pretty much every language I’d like to post snippets of, and its mechanism for doing so — font lock — maps disparate language features onto reasonably semantically-named font lock faces: builtin, comment, constant, function, keyword, string, type, and variable, to name the key ones. These are pretty good names to crib, and we can observe what names Emacs attaches to different parts of code too.
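Cribbing the names can be mechanical. Emacs’s standard faces are called things like font-lock-keyword-face and font-lock-function-name-face, so the short names listed above fall out of stripping the prefix and suffix. Here is a small sketch of that mapping; class_for_face is a hypothetical helper, not part of Emacs or of any tool mentioned in this post.

```python
def class_for_face(face):
    """Map an Emacs font lock face name to a short class name.

    Returns None for faces that do not follow the font-lock-*-face pattern.
    Hypothetical helper for illustration only.
    """
    prefix, suffix = "font-lock-", "-face"
    if not (face.startswith(prefix) and face.endswith(suffix)):
        return None
    name = face[len(prefix):-len(suffix)]
    # font-lock-function-name-face -> function,
    # font-lock-variable-name-face -> variable
    if name.endswith("-name"):
        name = name[:-len("-name")]
    return name or None
```

With this, font-lock-string-face becomes string and font-lock-variable-name-face becomes variable, while a face such as default maps to None and can be emitted as plain text.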

To illustrate, here’s an example pulled from my ~/.cshrc:

set complete=enhance

set ssh_hosts = `grep '^Host[ ][^*]' ~/.ssh/config | cut -c 6-`
complete ssh 'p/1/$ssh_hosts/'

# Aliases for pulling up a screen on each host
foreach host ($ssh_hosts)
    alias $host "ssh -t $host screen -DR"
end

Here’s how you might mark that up:

<pre><code class="csh"><span class="builtin">set</span> <var>complete</var>=enhance

<span class="builtin">set</span> <var>ssh_hosts</var> = <span class="string">`grep '^Host[ ][^*]' ~/.ssh/config | cut -c 6-`</span>
complete ssh <span class="string">'p/1/$ssh_hosts/'</span>

<span class="comment"># Aliases for pulling up a screen on each host</span>
<span class="keyword">foreach</span> host ($<var>ssh_hosts</var>)
    <span class="builtin">alias</span> $<var>host</var> <span class="string">"ssh -t $host screen -DR"</span>
<span class="keyword">end</span></code></pre>

Not only is Emacs a decent source of guidance on how to mark code up, it can also do most of the markup-writing heavy lifting for us. There are several tools of varying quality for automatically converting font locked Emacs buffers into equivalent HTML [1, 2, and 3], but I rolled something myself in about 50 lines of Emacs Lisp that Works For Me. I just select a region in some buffer and hit a keystroke: a marked-up version of the region gets dropped right into the clipboard for easy pasting.
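The Emacs Lisp itself is not shown here, but the general shape of such a conversion can be sketched in a few lines. The sketch below is in Python rather than Emacs Lisp and is not the code described above: it assumes the buffer has already been split into runs of (text, face) pairs with short face names, and render_listing is a made-up name.

```python
import html


def render_listing(language, runs):
    """Render (text, face) runs as a <pre><code class="..."> block.

    face is a short name such as "keyword", or None for plain text.
    Variables become <var>; every other face becomes a classed <span>.
    Hypothetical helper for illustration only.
    """
    parts = []
    for text, face in runs:
        escaped = html.escape(text, quote=False)
        if face is None:
            parts.append(escaped)
        elif face == "variable":
            parts.append("<var>{}</var>".format(escaped))
        else:
            parts.append(
                '<span class="{}">{}</span>'.format(html.escape(face), escaped)
            )
    return '<pre><code class="{}">{}</code></pre>'.format(
        html.escape(language), "".join(parts)
    )
```

For example, the runs ("foreach", "keyword"), (" host ($", None), ("ssh_hosts", "variable"), (")", None) rendered with the language csh reproduce the foreach line of the markup above, wrapped in its own <pre><code class="csh"> block.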

The colors I use are from my Emacs color theme, color-theme-hober2.el.

CSS rules derived from color-theme-hober2.el.

pre code .keyword      { color: #4682b4; }
pre code .type         { color: #3cb371; }
pre code .function     { color: #5f9ea0; }
pre code var           { color: #ff6a6a; }
pre code .string       { color: #fffacd; }
pre code .comment      { color: #9932cc; }
pre code .preprocessor { color: #f0e68c; }
pre code .constant     { color: #db7093; }
pre code .builtin      { color: #f4a460; }
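Rules like these are regular enough to generate from a table of colors rather than write by hand. As a rough sketch, assuming the theme is available as a mapping from short face names to hex colors (css_rules is a hypothetical helper, and nothing here reads color-theme-hober2.el itself):

```python
def css_rules(colors):
    """Build aligned CSS rules from a {face name: color} mapping.

    The "variable" face targets the <var> element; every other face
    targets a class of the same name. Hypothetical helper.
    """
    selectors = {
        name: "var" if name == "variable" else "." + name
        for name in colors
    }
    # Pad selectors to a common width so the braces line up.
    width = max((len(s) for s in selectors.values()), default=0)
    return "\n".join(
        "pre code {} {{ color: {}; }}".format(selectors[name].ljust(width), color)
        for name, color in colors.items()
    )
```

Passing {"keyword": "#4682b4", "variable": "#ff6a6a"} yields one rule for .keyword and one for var, in the same layout as the listing above.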

I’m hoping to write up a companion post over at Emacsen.org in which I detail the actual mechanics of the code, but I’m sufficiently busy these days that I doubt I’ll get to it in the near future.