Getting started with XProc

You hear at times that tools should just work — that your first impulse should always be right, and the overall experience should be like easing into a bath. That might make sense for familiar productivity tools, but should we never learn a new way of thinking about problems?

After deciding to write a more serious XSL tags preprocessor, I needed a way to run tests. The obvious solution would have been Ant, but the task gave me a chance to go swimming in XProc (the pipeline language for processing XML documents).

It’s interesting to watch yourself learn a new tool — making mistakes that (in retrospect) are just silly because you’re trying to figure out the conceptual model, the syntax, and everything else all at the same time. Eventually, after enough blunders, things start working. (Or, as Blake put it, “If the fool would persist in his folly he would become wise.”)

In hopes it might help someone else get up to speed, here’s a list of practical realizations (all no doubt spelled out somewhere in the specification):

  • Make the top element either of the following (a minimal skeleton follows this list):
    • p:pipeline for a pipeline that takes external input and output arguments
    • p:declare-step for a pipeline that identifies its own inputs and outputs
  • When a p:declare-step pipeline uses XSLT, specify a parameters port to avoid an XProc error, as in:
    <p:input port="parameters" kind="parameter" primary="true"/>
  • Use attributes for static values and a nested p:with-option element for dynamic values, as in:
    <p:store omit-xml-declaration="false">
     <p:with-option name="href" select="concat($builddir, $namein)"/>
    </p:store>
  • Apply a sequence of operations to an XML document with a list of steps that identify neither input nor output.
  • Reuse the output from a step as input for multiple steps (for instance, to generate different formats from a single XML document) by referring to the step that produces the output. Don’t bother changing the default “result” and “source” names for the output and input ports of the producer and consumer. Instead:
    1. Set the @name attribute on the step that produces the XML.
    2. Supply p:input and p:pipe subelements in the steps that consume the XML, as in:
    <p:wrap-sequence name="tests" wrapper="test-files"/>
    
    <p:xslt name="expected">
      <p:input port="source">
        <p:pipe step="tests" port="result"/>
      </p:input>
      ...
    </p:xslt>

    In short, XProc has pull rather than push polarity: a consuming step names the producing step it reads from.

  • Save modified XML in an intermediate file (similar to a tee in a UNIX / Linux pipe) by using the same technique with the p:store element as one consumer and the next real pipeline step as a separate consumer, as in:
    <p:xslt name="expected">
      ...
    </p:xslt>
    
    <p:store>
      <p:with-option name="href" select="'xstag2xsl.xspec'"/>
    </p:store>
    
    <xpc:xspec>
      <p:input port="source">
        <p:pipe step="expected" port="result"/>
      </p:input>
    </xpc:xspec>
  • Process the files in a directory with:
    1. A p:directory-list step with a regular expression to match the files. The file list gets wrapped up as a single XML document (using a built-in XProc vocabulary with c:directory and c:file elements) and passed along the pipeline.
    2. A p:for-each step with a p:iteration-source subelement to iterate over the child elements identifying each file.
    3. A p:load step to get the file contents, as in:
    <p:directory-list include-filter=".*\.xspec$">
      <p:with-option name="path" select="$testdir"/>
    </p:directory-list>
    <p:for-each>
      <p:iteration-source select="/c:directory/c:file"/>
    
      <p:variable name="namein" select="/c:file/@name"/>
    
      <p:load>
        <p:with-option name="href" select="concat($testdir, $namein)"/>
      </p:load>
    
      ...
    </p:for-each>
  • Reuse the file list after you’re done loading and processing the files by piping the p:for-each output from an inner step that still carries each file element:
    <p:for-each>
      <p:output port="result">
        <p:pipe step="testname" port="result"/>
      </p:output>
    
      <p:iteration-source select="/c:directory/c:file"/>
    
      <p:variable name="namein" select="/c:file/@name"/>
    
      <p:identity name="testname"/>
    
      ...
    
      <p:store>
        <p:with-option name="href" select="concat($resultdir, $namein)"/>
      </p:store>
    </p:for-each>
    <p:wrap-sequence name="tests" wrapper="test-files"/>

    That is, the p:for-each step can specify which subordinate step generates its output.

  • Supply simple transforms inline within the pipeline:
    <p:xslt>
      <p:input port="stylesheet">
        <p:inline>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
      ...
    </xsl:stylesheet>
        </p:inline>
      </p:input>
    </p:xslt>
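
To tie the first few items together, here is a minimal sketch of a p:declare-step skeleton (the port wiring is generic, and the first-pass.xsl stylesheet is a placeholder, not code from the actual test pipeline). It declares the parameters port that p:xslt needs and chains steps that identify neither input nor output:

    <p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="1.0">
      <p:input port="source"/>
      <p:input port="parameters" kind="parameter" primary="true"/>
      <p:output port="result"/>

      <!-- each step reads its input from the preceding step's output -->
      <p:xslt>
        <p:input port="stylesheet">
          <p:document href="first-pass.xsl"/>
        </p:input>
      </p:xslt>

      <p:add-attribute match="/*" attribute-name="status" attribute-value="draft"/>
    </p:declare-step>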

Initial reactions and speculations: I do miss some of Ant’s features — in particular, managing dependencies and skipping steps by checking timestamps on intermediate files. (Also, I haven’t checked whether xmltask improves the Ant story.) On the other hand, applying operations to XML in memory instead of applying tools to XML files is a real benefit, and inline XSLT is the bee’s knees.

Inline XSLT is so convenient that it cries out for co-mingling XProc and XSLT. Instead of using the fine-grained XML manipulation steps, it might be more straightforward for adopters to apply an XSLT template as a single pipeline step (where the template takes its current node from the step input and converts its result to the step output).

More importantly, would complex pipelines be more maintainable as a kind of state machine? After all, a pipeline step can be considered a state transition. If so, SCXML offers one model for state choreography. But so does XSLT — the combination of the current node and the mode is a state that transitions through apply operations.

In other words, the entire pipeline could be re-conceived as a single complex transform of a collection of XML sources to a collection of outputs. An XProc pipeline would then apply to a particular kind of XML input under specific circumstances — in short, a special kind of template with an XProc implementation that:

  • Executes when the template pattern matches the current XML node in the specified mode.
  • Takes primary input from the matched node.
  • Returns its primary output as the template result.
  • Can apply other templates as a pipeline step (as sketched above but including other XProc as well as XSLT templates).

Easy to sketch, of course; a bear to solve all the state and streaming challenges. Still, wouldn’t it be something to use XSLT as a pipeline engine?


Tag libraries and callbacks for XSLT / XQuery

At the Balisage conference earlier this month, Vyacheslav Zholudev gave quite an interesting talk on virtual documents. That spurred a related thought: could XSLT / XQuery support an equivalent to JSP tag libraries?

Background: If you aren’t familiar with tag libraries, the goal is a clean division of responsibilities:

Page designer
Implements the layout and styling by embedding tags in HTML pages and applying CSS to the data produced by the tags.
Tag developer
Implements the data access and structuring by defining tags that query data stores, sort query results, and so on.

A typical page using tags has fragments like:

<h1>Attendees in 2010</h1>
<table class="attendees">
  <tr><th>Last Name</th>
    <th>First Name</th></tr>
<db:person-list select-year="2010" sort-by="last-name">
  <tr><td class="lname">{ $last-name  },</td>
    <td class="fname">{ $first-name }</td></tr>
</db:person-list>
</table>

The result after processing the db:person-list tag is something like:

<h1>Attendees in 2010</h1>
<table class="attendees">
  <tr><th>Last Name</th>
    <th>First Name</th></tr>
  <tr><td class="lname">Crawford,</td>
    <td class="fname">Randy</td></tr>
  <tr><td class="lname">Franklin,</td>
    <td class="fname">Aretha</td></tr>
  <tr><td class="lname">Green,</td>
    <td class="fname">Albert</td></tr>
</table>

XSL Pages: XSLT Simplified Stylesheet Modules already provide most of the constructs of the JSTL core library, and XSLT value templates offer a parallel to the Java EL expression language. The additions necessary for a complete story (sketched after this list):

  • Associating a tag name with a function or template implementation in an XSLT / XQuery module.
  • Importing the tag library module by allowing xsl:import inside the root element (that is, html) of a simplified stylesheet page.
  • Supporting XPath expressions in content by allowing value templates outside attributes in simplified stylesheet pages.
  • Processing tags in simplified stylesheet pages as calls to the associated function or template, passing attributes as parameters.
  • Processing the tag content during execution of the function or template as a callback (for instance, executing the tr fragment for each result produced by the db:person-list tag).
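
A purely hypothetical sketch of what a page would look like with those additions in place (the db namespace and the db-taglib.xsl module are invented, and the xsl:import placement and content value templates are the proposed extensions, not legal XSLT today):

<html xsl:version="2.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:db="http://example.com/tags/db">
  <!-- proposed: import the tag library module inside the root element -->
  <xsl:import href="db-taglib.xsl"/>
  <body>
    <h1>Attendees in 2010</h1>
    <table class="attendees">
      <tr><th>Last Name</th>
        <th>First Name</th></tr>
      <!-- proposed: the tag calls the associated template; attributes become
           parameters, and the tr content is executed as a callback per result -->
      <db:person-list select-year="2010" sort-by="last-name">
        <tr><td class="lname">{ $last-name },</td>
          <td class="fname">{ $first-name }</td></tr>
      </db:person-list>
    </table>
  </body>
</html>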

Callbacks pose the only major addition, but they would have general benefit for XSLT / XQuery. For instance, when converting between XML vocabularies, the optimal decision often requires knowledge of the current source context and result context. Callbacks could offer a natural solution to that problem.

To evaluate the idea, I wrote transforms for XSLT pages and tag libraries conforming to the sketch above (using Dimitre Novatchev’s anonymous function technique from FXSL for the callbacks). Running the transforms on a test case, I can produce an HTML page using XSLT tag libraries.  Here’s the experimental package.

Conclusions: The experiment teased out some refinements (such as being able to provide parameters as subelements) and provides a proof of concept, I think, of the benefits of tag pages and tag libraries for XSLT / XQuery. However, I may not do much more with these conversion transforms:

  • I don’t see a way for the callbacks to read the variables that were in scope at the tag call (along the lines of Java anonymous local classes). Maybe Novatchev is right and XSLT / XQuery needs an anonymous function constructor (after all, JavaScript has one), or maybe a more restrictive callback feature would do the trick.
  • Tag libraries must be implementable in XQuery as well as XSLT. For XQuery, tag libraries could provide a set of building blocks for rapid application development against an XML store.  For XSLT, a possible scenario would be composing transforms by mocking up the output document.

Anyway, many thanks to Vyacheslav for stimulating the thought.

Footnote:
Two existing initiatives have some relevance:

  • Cocoon eXtensible Server Pages (XSP) support pipelines that format data from multiple sources using XSLT transforms.
  • Resin XML Template Pages (XTP) applies XSLT stylesheets to what appear to be static placeholder tags within XHTML files. To get the full power of tag libraries, the tags should instead be executed (or interpreted).

The word according to Balisage

As I started looking forward to getting my assumptions torqued at the Balisage markup conference, it occurred to me to see what Jonathan Feinberg’s Wordle would make of the program descriptions:

[Image: tag cloud for the program of the Balisage markup conference]

Much the same as the buzz in the line for coffee between sessions — good of Jonathan to sum up.


The explicit and the implicit (RDF vs XML)

And in this corner …

RDF makes data explicit:

  • The RDF graph supports manifest relationships between any two objects.

    By contrast, the XML tree supports hierarchical containment relationships. Other relationships require something in addition to XML proper — either XLink (which has seen limited adoption) or, more typically, an attribute with referencing semantics in the vocabulary.

  • RDF dynamic typing annotates an object with multiple types as the properties of the object qualify it for those types.

    By contrast, an XML element tags an object with a single name. To capture other names again requires something in addition to XML proper — typically, a class or role attribute with type list semantics in the vocabulary.

  • RDF inference adds explicit data to the graph incrementally based on the implications of the current graph.

    XML itself has no method for identifying and operating on implications of the data. Applications for specific vocabularies sometimes add data to documents or, because additions might be invalid for the schema, rewrite documents completely. Just as often, vocabularies mandate application rules that don’t add data to the document.

Explicit data makes for easy processing, integration, and, more generally, agile data (as Lee Feigenbaum notes).

And yet …

An explicit document is more difficult to create, maintain, and understand. We’re adept at recognition. Anything that we would get without being told gets in the way: annoying at best; at worst, hiding the real news. Explicit data can also be awkward, as Benjamin Nowack notes with respect to tunneling structural and subject metadata through HTML with RDFa. In short, for authored documents, an XML markup with audience-sensitive implication is much more practical.

That leaves the challenge of getting from the XML representation suitable for people to the RDF representation that’s optimal for processors — a problem known in the RDF community as “lift.” The traditional answer has been GRDDL, in which an XML document refers to external XSLT transforms that extract the data from the document. XSLT, however, has a mismatch for the lift problem in that the output is RDF/XML — an XML document that happens to be a serialization of RDF — rather than RDF directly.
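For reference, GRDDL’s hook is just a profile and a transformation link in the document head (the stylesheet name here is illustrative):

<html xmlns="http://www.w3.org/1999/xhtml">
  <head profile="http://www.w3.org/2003/g/data-view">
    <title>An annotated page</title>
    <link rel="transformation" href="extract-dc.xsl"/>
  </head>
  ...
</html>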

XSPARQL, a proposal for integration of the XQuery query language for XML and the SPARQL query language for RDF, has a lot of promise for the lift problem. (Check out the use cases.) Although I didn’t find an example, it would seem plausible to embed XSPARQL within a script element, along the lines of:

<script type="application/xsparql">
declare namespace html="http://www.w3.org/1999/xhtml";
declare namespace dc="http://purl.org/dc/elements/1.1/";
let $page := fn:document-uri(/)
let $title := /html:html/html:head/html:title/text()
construct {
    $page dc:title $title .
}
</script>

A small subset of XSPARQL would be enough: a sequence of XPaths to extract values from the XML and a construct clause to build the triples. The XSPARQL could be referenced externally to share among XHTML documents or embedded to package with the XHTML documents.

Such an approach would be more flexible and convenient than the current alternatives (RDFa and GRDDL), leveraging both XML and RDF to maximum advantage.


A relation typing idiom in RDF

(Capturing a few conclusions about RDF idioms from some recent experiments….)

In object orientation and especially UML, relationships are typed at both ends.  So, in RDF,  I used to identify with diligence the domain and range for most properties, as in:

ex:contains a owl:ObjectProperty ;
  rdfs:domain ex:Container ;
  rdfs:range ex:Contained .

Because an RDF processor can infer the class of the domain and range objects, I had to be quite precise.  That tended to have rippling effects on the design. In particular, a herd of subproperties expressed the same semantic relation but took different domain and range classes.  Of course, the domain and range classes of the subproperties had to inherit from the domain and range classes of the base property. The result was a dissatisfying explosion of names for syntactic rather than semantic reasons.
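For instance, the containment relation between sections and paragraphs ended up needing its own property and classes (the names here are invented for illustration):

ex:sectionContains a owl:ObjectProperty ;
  rdfs:subPropertyOf ex:contains ;
  rdfs:domain ex:Section ;
  rdfs:range ex:Paragraph .

ex:Section a owl:Class ;
  rdfs:subClassOf ex:Container .

ex:Paragraph a owl:Class ;
  rdfs:subClassOf ex:Contained .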

Semantic Web for the Working Ontologist (helpfully recommended by Bob DuCharme) devotes a sidebar (pages 101 to 102) to cautioning against this common tendency among object-oriented programmers.  In essence, the recommendation is to use domain or range only when inference of the class will be useful.

Well, hush my puppies.

Still, it would seem like a step backward if the RDF model carried less information than a UML model.  Fortunately (and rather obviously in retrospect (as usual)),  the constraints can be expressed directly:

ex:Container a owl:Class ;
  rdfs:subClassOf [ owl:intersectionOf (
    [ a owl:Restriction ;
      owl:onProperty ex:contains ;
      owl:allValuesFrom ex:Contained ]
    ... cardinality or type restrictions on other properties ...
  )  ] .

ex:contains a owl:ObjectProperty .

In this idiom, each class maintains the type and cardinality constraints on its properties. The same property can be constrained differently by different classes.   The constraints on properties don’t license uninteresting inferences. The idiom is a bit verbose, but that’s a small price to pay for the benefits.

The other idiom that I’ve adopted recently is to prefer union to subclass assertions for abstract classes.  After all, an abstract class is exactly equivalent to its list of subclasses. For instance, the alternative to:

ex:DocumentType a owl:Class ;
  rdfs:subClassOf ex:Container .

ex:MarkupType a owl:Class ;
  rdfs:subClassOf ex:Container .

would be:

ex:Container a owl:Class ;
  owl:unionOf (ex:DocumentType ex:MarkupType) .

The idiom expresses the intent of the designer more precisely.  The only concern would be whether the union effectively closes the set of subclasses. That would be the case if a new union assertion with a superset of members created a self-contradictory model — something I haven’t investigated yet.

Anyway, other idioms?  Or, better idioms for typing relations?


Implementing a simple wiki on MarkLogic Server

In my spare time, I’ve indulged a curiosity about some resources in the ecosystem. In particular, I’ve implemented a simple Wikibase by taking advantage of David Green’s Mylyn wikitext package and the community edition of the MarkLogic XML Server. Some notes follow from that experiment.

The Wikibase stores both the wikitext page and the converted XHTML. That way, the Wikibase can provide the original source for editing (because some people feel strongly about whitespace) but still perform epic backend operations (searching, linking, extraction, etc) on the XHTML pages. The relative simplicity of wikitext is a virtue here, as simple HTML approximates Plain Old Semantic HTML.

Ingrate that I am, I threw the Wikipedia articles from the markup languages category at Mylyn wikitext. The HTML article in particular (not surprisingly) has a lot of quoted HTML. Righteous hiccups from the MarkLogic XML parser. I futzed around with my simplistic Wikipedia extraction process (which might be the culprit) and tinkered with some regular expressions before and after the conversion (running into the usual impedance mismatch between regular expressions and tag nesting).

In the end, John Cowan’s TagSoup and about 5 lines of consumer code placated the XML parser. TagSoup is a more recent (and seemingly more robust) alternative to JTidy for XML domestication of wild HTML. (Also worthy of a passing mention in this space: for field work that inspects HTML in its unruly lair, I’ve done a few things with the Jericho HTML parser.)

The result has blemishes, but it’s well formed, and that’s all XQuery needed. In implementing the basic services, the only thing I miss from DB2 PureXML is the ability to transform a document in memory with XQuery update operations. (Maybe some well-known alternative exists, and I simply haven’t turned it up yet.)

Anyway, MarkLogic runs the XQueries that implement paged search and navigation over the XHTML articles. A JSP with a textarea provides the basic editing. I can peruse and add outlandish asides to the Wikibase articles to my heart’s content.
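A minimal sketch of the paged-search idea, assuming MarkLogic’s built-in cts:search (the collection name and external variables are illustrative, not taken from the actual implementation):

xquery version "1.0-ml";
declare namespace xh = "http://www.w3.org/1999/xhtml";

declare variable $q as xs:string external;
declare variable $start as xs:integer external;
declare variable $page-size as xs:integer external;

(: search the stored XHTML articles and return one page of hits :)
let $hits := cts:search(fn:collection("wiki")/xh:html, cts:word-query($q))
return $hits[$start to $start + $page-size - 1]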

Of course, this simple Wikibase lacks a ton of mandatory function — versioning, authentication, content locales, a decent authoring UI, you name it — and I may not take the experiment much farther. Still, good times — the ecosystem more than proved its mettle.


Open Graph, or the website extended

The Open Graph protocol recently rolled out by FaceBook introduces a set of metadata properties for pages. When I mentioned Alex Iskold’s assessment of its implications, John Warren quickly pointed out some significant limitations of Open Graph as a metadata format:

  • The page has a single subject.
  • The properties of that subject are a flat list without deep structure or relations.
  • The subject identifier is not controlled.

Recognizing those limitations (thanks) makes it easier to see what’s going on. Open Graph does not take on the challenge of classifying resources for discovery or of publishing datasets. Instead, Open Graph provides a vehicle for delegating maintenance of resource descriptions to resource experts. For FaceBook, the experts include Pandora for music, IMDb for movies, Yelp for reviews, and so on. FaceBook just has to catch the expert definition of a resource when a user clicks on the Like widget.
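
That expert definition amounts to a handful of meta properties in the page head. Roughly what a movie page might declare (the values here are illustrative, not copied from IMDb):

<meta property="og:title" content="Goodbye Solo (2008)"/>
<meta property="og:type" content="movie"/>
<meta property="og:url" content="http://example.org/title/goodbye-solo"/>
<meta property="og:site_name" content="Example Movie Database"/>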

Both the technical and social engineering here are clever.  Because maintenance is distributed, the approach can scale. That’s reminiscent of the Semantic Web Environment Directory (SWED) and of Google, for that matter (because maintenance of the link data that’s crucial for the Google index is distributed to the authors of web pages). Also, besides making the initial implementation a lot easier, the narrow scope makes for better graph nodes. Allowing metadata about only a single subject encourages the creation of pages with focus — one of the tenets of Information Architecture.

From a Semantic Web perspective, a couple of points stand out.  First, as mentioned previously, Open Graph doesn’t address the typical Semantic Web scenario of publishing public datasets but, instead, augments a private graph with isolated public resource definitions. Second, instead of assigning a unique identifier to a real-world object, Open Graph appears to identify a real-world object indirectly through its association with a web page. That’s pragmatic for web publishers but challenging for processing. Consider the movie Goodbye Solo, which has web pages on IMDb and Rotten Tomatoes. I’d expect each site to treat its page as the canonical definition of the movie instead of accepting the other site’s claim to primacy. To provide a unifying social object (as explained by Dare Obasanjo) for qualifying the relationships among all of its fans, the movie needs a single object in the graph. Even though the object type is controlled and will be movie for both IMDb and Rotten Tomatoes, the title can and, in this case, does vary:  “Goodbye Solo (2008)” on IMDb and “Goodbye Solo (2009)” on Rotten Tomatoes. Perhaps FaceBook establishes equivalence through a combination of textual matching and common fans of both web pages or, perhaps, connecting a subset of the real fans via one web page is good enough.

While Open Graph is an open standard, the Like widget that harvests the data and the private graph that integrates the harvested data are the real keys to taking advantage of the Open Graph pages on IMDb and company. Other solutions can still crawl the Open Graph pages, but such solutions will have to come up with their own integration logic and will have to tolerate days or weeks of lag in updates.

Regardless, Open Graph shows a disruptive innovation:  a unified web experience through dynamic integration of pages hosted by multiple organizations.  No small feat.
