Theresa O’Connor

Tag soup parsing

Bill de hÓra on tag soup parsing (emphasis mine):

The lesson I (re)learned was that using BeautifulSoup, and in the past Universal Feed Parser and Tidy, makes it clear there’s some economic value to be had in giving up on well-formedness in a judicious fashion. …

Engineers have a concept called tolerance. A tolerance specifies the variance in dimensions under which a part or component can be built and still be acceptable for production use. …

Every major commercial project I have worked on, every one, has had the issue of “data tolerances” being off, where two or more systems did not line up properly. The result invariably is to fix one end, both ends, or insert a compensating layer — what mechanics call a ‘shim’ and what programmers call “middleware”. …

We don’t have the tools or metrics just yet for defining data tolerances as an acceptable practice, but it might happen if enough of these kinds of parse-anything libraries come online, that we can come to put a dollar cost on what is involved in insisting on having perfect markup flying about end to end versus judiciously giving up on syntactic precision. …

If we assume or allow that most data on the web is syntactic junk and will always be syntactic junk,… then there is a good argument that says we’ll need a layer of convertors whose purpose is to parse content no matter what. …

In the semantic web case, I think tag soup parsers are a fundamental layer to that architecture — syntactic convertors that work just like analog-to-digital converters.
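
To make the "convertor" idea concrete, here is a minimal sketch of such a layer: sloppy markup goes in, well-formed XML comes out. It is not BeautifulSoup, Tidy, or the Universal Feed Parser. It uses Python's standard-library html.parser instead, and the names SoupToXml, soup_to_xml, and VOID_TAGS, along with the wrapping root element, are assumptions made up for illustration. The repair policy (close whatever is still open, drop stray end tags) is one simple choice among many, and unusual attribute names could still produce output an XML parser rejects.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

# Elements that never take a closing tag (a partial, illustrative list).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class SoupToXml(HTMLParser):
    """Hypothetical converter: accepts tag soup, emits balanced tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # output fragments, joined at the end
        self.open_tags = []  # stack of elements not yet closed

    def handle_starttag(self, tag, attrs):
        # dict() drops duplicate attributes, which XML does not allow.
        attr_text = "".join(
            f" {name}={quoteattr(value or '')}"
            for name, value in dict(attrs).items()
        )
        if tag in VOID_TAGS:
            self.out.append(f"<{tag}{attr_text}/>")
        else:
            self.out.append(f"<{tag}{attr_text}>")
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray end tag: tolerate it by ignoring it
        # Close everything opened since the matching start tag.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data))


def soup_to_xml(soup):
    parser = SoupToXml()
    parser.feed(soup)
    parser.close()
    # Close anything the author forgot to close.
    while parser.open_tags:
        parser.out.append(f"</{parser.open_tags.pop()}>")
    return "<root>" + "".join(parser.out) + "</root>"


print(soup_to_xml("<p>Hello <b>bold <i>both</b> text<br>more</div>"))
# → <root><p>Hello <b>bold <i>both</i></b> text<br/>more</p></root>
```

The sample input has a mis-nested i element, an unclosed p, a bare br, and an end tag with no matching start. The output can be handed to a strict XML parser, which is the shim role the quote describes: the tolerance lives in one layer, and everything downstream gets to assume well-formedness.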