In my spare time, I’ve indulged a curiosity about some resources in the ecosystem. In particular, I’ve implemented a simple Wikibase by taking advantage of David Green’s Mylyn wikitext package and the community edition of the MarkLogic XML Server. Some notes follow from that experiment.
The Wikibase stores both the wikitext page and the converted XHTML. That way, the Wikibase can provide the original source for editing (because some people feel strongly about whitespace) but still perform epic backend operations (searching, linking, extraction, etc.) on the XHTML pages. The relative simplicity of wikitext is a virtue here, as the simple HTML it produces approximates Plain Old Semantic HTML.
Ingrate that I am, I threw the Wikipedia articles from the markup languages category at Mylyn wikitext. The HTML article in particular (not surprisingly) has a lot of quoted HTML. Righteous hiccups from the MarkLogic XML parser. I futzed around with my simplistic Wikipedia extraction process (which might be the culprit) and tinkered with some regular expressions before and after the conversion (running into the usual impedance mismatch between regular expressions and tag nesting).
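To make that impedance mismatch concrete, here is a minimal sketch in plain Java. It is not the code from the experiment: the class name, the method names, and the sample markup are all made up for illustration. A lazy regular expression stops at the first closing tag it sees, while a scanner that counts nesting depth finds the matching one.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; nothing here comes from the original experiment.
final class RegexNestingDemo {

    // Naive pattern: capture everything up to the FIRST closing tag.
    private static final Pattern DIV =
            Pattern.compile("<div>(.*?)</div>", Pattern.DOTALL);

    /** What the regex believes is the content of the first div. */
    public static String firstDivByRegex(String html) {
        Matcher m = DIV.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    /** Content of the first div, found by tracking nesting depth. */
    public static String firstDivByDepth(String html) {
        final String open = "<div>";
        final String close = "</div>";
        int start = html.indexOf(open);
        if (start < 0) {
            return null;
        }
        int depth = 0;
        int i = start;
        while (i < html.length()) {
            if (html.startsWith(open, i)) {
                depth++;
                i += open.length();
            } else if (html.startsWith(close, i)) {
                depth--;
                if (depth == 0) {
                    // Matching close tag for the first open tag.
                    return html.substring(start + open.length(), i);
                }
                i += close.length();
            } else {
                i++;
            }
        }
        return null; // unbalanced input
    }

    public static void main(String[] args) {
        String html = "<div>outer <div>inner</div> tail</div>";
        System.out.println(firstDivByRegex(html)); // outer <div>inner
        System.out.println(firstDivByDepth(html)); // outer <div>inner</div> tail
    }
}
```

The depth counter only handles one bare tag name with no attributes, which is exactly why a real parser is the better tool once the markup gets wild.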
In the end, John Cowan’s TagSoup and about 5 lines of consumer code placated the XML parser. TagSoup is a more recent (and seemingly more robust) alternative to JTidy for XML domestication of wild HTML. (Also worthy of a passing mention in this space: for field work that inspects HTML in its unruly lair, I’ve done a few things with the Jericho HTML parser.)
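For a sense of what the XML parser objects to, the following sketch uses only the JDK's built-in parser to test whether a string is well formed. It shows the check, not the repair: TagSoup is not used here, and the class name, method name, and sample strings are assumptions invented for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper; the names are illustrative only.
final class WellFormedCheck {

    /** Returns true if the JDK XML parser accepts the markup as well formed. */
    public static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Raised for any well-formedness violation.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Typical HTML habit: a void element left open.
        System.out.println(isWellFormed("<p>An unclosed <br> tag</p>"));
        // The XHTML spelling of the same thing.
        System.out.println(isWellFormed("<p>A closed <br/> tag</p>"));
    }
}
```

The first call reports false and the second true. Markup of the first kind is what a cleanup pass has to rewrite into the second kind before an XML database will accept it.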
The result has blemishes, but it’s well formed, and that’s all XQuery needed. In implementing the basic services, the only thing I miss from DB2 pureXML is the ability to transform a document in memory with XQuery update operations. (Maybe some well-known alternative exists, and I simply haven’t turned it up yet.)
Anyway, MarkLogic runs the XQueries that implement paged search and navigation over the XHTML articles. A JSP with a textarea provides the basic editing. I can peruse and add outlandish asides to the Wikibase articles to my heart’s content.
Of course, this simple Wikibase lacks a ton of mandatory function (versioning, authentication, content locales, a decent authoring UI, you name it), and I may not take the experiment much farther. Still, good times: the ecosystem more than proved its mettle.