Schematron: a handy XML tool that’s not just for villains!

At JATSCon (the Journal Article Tag Suite Conference), one of the speakers, whose talk referenced Schematron, told the audience that he had asked his young son what he thought the word “Schematron” might mean. As though pointing out the obvious, the son replied that Schematron was “a helmet that helps you think up schemes, mostly worn by villains” [1]. Though it’s tempting to adopt this catchier (if nefarious-sounding) definition, Schematron is the name of a useful and practical XML technology widely employed by ordinary, not-particularly-villainous folk in scholarly publishing to do things like perform XML quality assurance/quality control (QA/QC) or impose business rules on their XML.

Schematron [2] is a rule-based validation language for making assertions about the presence or absence of certain patterns in XML documents. A “Schematron” refers to a collection of one or more rules, each containing tests. Schematrons are written in a form of XML, which makes them relatively easy for people, even non-programmers, to inspect, understand, and write.

Essentially, a Schematron performs two actions in sequence:

  1. Finds context nodes of interest in the document. A ‘context node’ can be an element of a particular type, a specific element in a certain place in the document, an attribute, or an attribute value. For example, suppose you want to check whether a <label> element exists within each of your displayed equations. In this case, the context node would be the element <disp-formula>.
  2. For each of those nodes of interest, checks whether a specific statement is true or false. For example, you might have a rule written to answer the question “are any of my displayed equations missing a label?”
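
As a minimal sketch of these two steps, the rule below selects every <disp-formula> as the context node and then tests whether it has a <label> child. The surrounding schema wrapper, the queryBinding value, and the wording of the message are illustrative assumptions rather than part of any particular publisher's rule set.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <pattern>
        <!-- Step 1: find the context nodes (every displayed equation) -->
        <rule context="disp-formula">
            <!-- Step 2: test a statement about each one; the message is emitted when the test is false -->
            <assert test="label">Displayed equation is missing a label</assert>
        </rule>
    </pattern>
</schema>
```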

Why Schematron?

At this point, you may be asking yourself how the DTD fits into this. After all, doesn’t the JATS DTD provide rules for what can and cannot be in journal article XML? It does, but only to a point.

DTDs, W3C XML Schema, and other so-called ‘grammar-based constraint languages’ can check for big things like whether a given element is allowed in another element or in a particular sequence, or whether attributes are allowed on a given element. What a DTD cannot do is check the rules that a particular organisation might want to apply to its own XML: whether an optional element or attribute is present, whether attributes contain one of the organisation’s approved values, whether particular content is present in a certain context, or how several XML files relate to one another. Rick Jelliffe, who invented Schematron, described it as “a feather duster to reach the parts other schema languages cannot reach”.

What Schematron can do

A Schematron can

  • Validate or report on document structure, such as the presence or absence of elements (“Does my <abstract> contain a <disp-formula>?”). It can also look for the location of elements (“In which section is the table in my article?”)
  • Validate or report on document content; i.e., there must be some content, or there must be some particular content and the content must follow some rule (“There must be a displayed equation in <abstract> and the label for this equation must always occur just after the math in the <disp-formula> element, not before it; tell me when this rule has been disobeyed!”)
  • Validate or report on the presence or absence of attributes, or on the content of attributes (“The attribute ‘specific-use’ must occur on <aff> within <contrib> and this attribute’s value must be ‘internal-use’”)
  • Check co-occurrence constraints: if X is true, then Y should be true; or, A, B, C, and W must all be present (somewhere)
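
Two of the checks quoted above might be sketched as follows. This is only an illustration: the messages are invented, and the choice of report for the first rule and assert for the second is one reasonable reading of the examples, not a prescribed approach.

```xml
<pattern xmlns="http://purl.oclc.org/dsdl/schematron">
    <!-- Structure: report every abstract that contains a displayed equation -->
    <rule context="abstract">
        <report test=".//disp-formula">Abstract contains a displayed equation</report>
    </rule>
    <!-- Attributes: every aff inside contrib must carry specific-use="internal-use" -->
    <rule context="contrib//aff">
        <assert test="@specific-use = 'internal-use'">aff within contrib must have specific-use="internal-use"</assert>
    </rule>
</pattern>
```

Because the second test compares the attribute value, it fails both when the attribute is absent and when it holds a different value, so a single assert covers presence and content.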

Schematron building blocks

The basic Schematron building blocks are as follows.

Tests are the basic components of Schematron, and are simply statements that perform an action. There are two kinds of Schematron test statements:

  • assert statements fire when the statement evaluates to FALSE (“this must always be true, so only tell me when it’s not”)
  • report statements fire when the statement evaluates to TRUE (“this is interesting, tell me when this happens”)

When an assert fails or when a report succeeds, it is possible (and very useful) to emit a text message. The person who creates the Schematron (AKA the “Schematron designer”) actually writes his or her own error messages into the tests, making them meaningful and understandable to anyone in the organisation who needs to see the messages (see Example 1, below).

Rules are groups of tests, where each rule specifies the context for the tests in the rule. Patterns are groups of rules, and phases can activate different patterns at different times.
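
The nesting of these building blocks can be sketched as a skeleton schema. The ids ‘initial’, ‘final’, ‘structure’, and ‘metadata’ are hypothetical names chosen for illustration, as are the two sample rules.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <!-- A phase names the patterns that are active when that phase is selected -->
    <phase id="initial">
        <active pattern="structure"/>
    </phase>
    <phase id="final">
        <active pattern="structure"/>
        <active pattern="metadata"/>
    </phase>

    <!-- A pattern groups rules -->
    <pattern id="structure">
        <!-- A rule sets the context for the tests it contains -->
        <rule context="disp-formula">
            <assert test="label">Displayed equation is missing a label</assert>
        </rule>
    </pattern>

    <pattern id="metadata">
        <rule context="article-meta">
            <assert test="volume">Article metadata must contain a volume number</assert>
        </rule>
    </pattern>
</schema>
```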

Example 1: A basic Schematron test

In JATS, the attribute ‘article-type’ (@article-type) on <article> is an optional attribute with no default. But suppose you want to write a Schematron rule to

  1. Ensure (assert) that in your journal articles @article-type is always present and
  2. Report when @article-type is “retraction” and provide a message to inform the Director of Publications

Your rule might look something like this:

<rule context="article">
    <assert test="exists(@article-type)">@article-type must always be present!</assert>
    <report test="@article-type='retraction'">Please inform Director of Publications about the retraction</report>
</rule>

Here, the rule (<rule>) sets the context node (element <article>) and contains two tests: an assert, which fires when it evaluates to FALSE, i.e., when <article> does not have @article-type; and a report, which fires when it evaluates to TRUE, i.e., when the value of @article-type is “retraction”. The Schematron designer writes the messages herself, in such a way that they make sense to those who run the Schematron.

Example 2: A test for conditional co-occurrence constraints

This example demonstrates how to use Schematron to validate conditional co-occurrence constraints. For example, suppose you decide that the Acknowledgements, if present, must contain exactly one paragraph in all of your journals except two (with journal IDs ‘ja’ and ‘jb’). For the two exception journals (‘ja’ and ‘jb’), suppose the Acknowledgements must contain exactly two paragraphs. You would then write the following two rules, each containing one test:

 1. <rule context="ack[//journal-id=('ja','jb')]">
 2.     <assert test="count(p) eq 2">'<name/>' in '<value-of select="//journal-id"/>' must contain exactly two paragraphs</assert>
 3. </rule>
 4. <rule context="ack">
 5.     <assert test="count(p) eq 1">'<name/>' in '<value-of select="//journal-id"/>' must contain exactly one paragraph</assert>
 6. </rule>

The first step is to deal with the exception journals (‘ja’ and ‘jb’). In the first rule (line 1) the context is the acknowledgements element (<ack>) that occurs in a journal whose code is either ‘ja’ or ‘jb’. The test checks whether the count of paragraph elements (<p>) is equal to 2 and, if this test evaluates to FALSE, gives the error message saying that the <ack> in the journal in question (‘ja’ or ‘jb’) must contain exactly two paragraphs. In line 4, the context of the rule is simply <ack>, and the assert test in line 5 checks whether exactly one <p> exists.

Incidentally, this illustrates an important principle of Schematron design: rule order matters. Within a pattern, each node is checked against only the first rule whose context matches it, so when two rules can match the same node the second fires only when the first does not. Here, if the journal code is ‘ja’ or ‘jb’ then the first rule applies; the second rule catches all other cases, i.e., articles from journals other than ‘ja’ and ‘jb’.

Other Schematron goodies

One advantage of using Schematron rules is that you can run different collections of rules to perform QA/QC at various stages of the production process. For instance, suppose you do not yet know the final metadata (issue number, posting date, page numbers, etc.). You could run an initial Schematron to check certain things in an article at that early stage of production but leave out any rules related to metadata. Then later, once the article is ready for production, you could run another Schematron to ensure that everything, including the metadata, is the way it should be.

To do this, you would create a modular library with two top-level Schematrons: Initial and Final. In the library, the rules that check the final metadata are collected in a module that the Initial Schematron does not invoke. In contrast, the Final Schematron calls all the modules the Initial does but also invokes the module that performs the final metadata check.
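
One way such a library might be laid out is sketched below, using the ISO Schematron <include> element. The file names and module split are hypothetical, and the sketch assumes that each included file has a single <pattern> as its document element and that your processor resolves includes before validation.

```xml
<!-- final.sch (hypothetical file name): top-level Schematron for the final stage -->
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <title>Final validation</title>
    <!-- Modules shared with the Initial Schematron -->
    <include href="modules/structure.sch"/>
    <include href="modules/content.sch"/>
    <!-- Module invoked only here: checks the final metadata -->
    <include href="modules/final-metadata.sch"/>
</schema>
```

Under the same assumptions, the Initial Schematron would be an otherwise identical file that simply omits the last include.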

Example 3: A test to ensure that, at the final validation stage, the <page-count> element is present

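<rule context="article-meta/counts">
    <!-- Fires for each <counts> element; the angle brackets in the message are escaped so the schema stays well-formed -->
    <assert test="exists(page-count)">metadata has no &lt;page-count&gt; element</assert>
</rule>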

It’s worth noting that if you use JATS for journal articles and BITS for books or conference papers, you can develop a modular Schematron that reuses certain modules to validate items of different genres (articles, books, etc.).

How to get started using Schematron

Strictly speaking, XPath is all you need to know to start writing Schematron. (XPath is a query language for selecting nodes from an XML document.) In practice, however, you also need to be familiar with XPath 2.0 and XSLT 2.0 functions, such as ‘matches’, ‘contains’, ‘current’, or ‘document’. And finally, a basic facility with regular expressions (regex) is very helpful. Regular expressions are a notation for describing patterns in text.

Example 4: A test (employing regular expressions) to verify that the volume number in the metadata is an Arabic numeral no longer than four digits, with no leading zero

<rule context="article-meta/volume"> 
    <assert test="matches(.,'^[1-9][0-9]{0,3}$')"><name/> '<value-of select="."/>' must be an Arabic numeral four digits or less, with no leading zero</assert> 
</rule>

There are a number of resources that can help you get started learning Schematron and it may be helpful to take an introductory course (for example, there is one offered by Mulberry Technologies).

Environment

In a typical implementation, the Schematron XML is compiled into XSLT code for deployment anywhere that XSLT can be used. Running the compiled XSLT against a document produces a validation report, commonly expressed in Schematron Validation Report Language (SVRL), a simple report language defined as part of ISO Schematron [3].
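
To give a feel for SVRL, here is a hand-written, illustrative fragment of the kind of report that the rule from Example 1 might produce for an article with no @article-type. It is not captured output from any tool; the exact attributes and the form of the location path vary between implementations.

```xml
<svrl:schematron-output xmlns:svrl="http://purl.oclc.org/dsdl/svrl">
    <!-- The pattern and rule that were applied to the document -->
    <svrl:active-pattern/>
    <svrl:fired-rule context="article"/>
    <!-- One entry per assert that failed; location points at the offending node -->
    <svrl:failed-assert test="exists(@article-type)" location="/article">
        <svrl:text>@article-type must always be present!</svrl:text>
    </svrl:failed-assert>
</svrl:schematron-output>
```

A report that succeeds is recorded in a similar way, in an svrl:successful-report element.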

The most user-friendly environment for learning, developing, testing, and using Schematron is probably oXygen Editor, which includes full support for ISO Schematron. The latest version supports JATS, which means that additional tools are available when oXygen detects that you are looking at a JATS XML document.

If you are using a vendor for converting manuscripts from author-submitted formats to JATS XML, then you may ask the vendor to run your Schematron rules. In this case, which is fairly common, the vendor typically compiles your Schematron into an XSLT application that reports in SVRL and runs it in batch mode, while your production team might run the Schematron in oXygen Editor in your organisation to do your own internal QA/QC.

If you use a vendor: trust, but verify

Many publishers rely on vendors to create their XML and expect the vendor to fully perform QA/QC on the XML. While vendors should do this, it is more important than ever for publishers to take ownership of their XML and perform checks themselves. Schematron is ideal for flagging errors that might be easy for the naked (human) eye to miss. If you are using a vendor for XML conversion, it’s a good idea to do the following:

  • have the vendor run your Schematron;
  • set up a QC system that would not allow the vendor to deliver XML files that contain Schematron errors (some web platforms will work with publishers on this);
  • run the delivered XML files through the Schematron on your end, just to make sure your XML is the XML you want

Schematron messages can be assigned one of three severity levels, Error, Warning, or Info (and for each of these, you can write your own custom message). Severity is conventionally recorded in the role attribute of an assert or report, and it is the validation tool that decides how each level is treated. In setting up your QA system, you might, for example, choose to allow your vendor to deliver XML files with warnings but not with errors, thereby saving your staff valuable time.
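
A sketch of how severity might be attached to the tests from Example 1 follows. It assumes a tool that maps values of the role attribute to severity levels, as oXygen does; the specific values ‘error’ and ‘warning’ are assumptions that should be checked against your tool's documentation.

```xml
<rule xmlns="http://purl.oclc.org/dsdl/schematron" context="article">
    <!-- Treated as an error: the vendor may not deliver files that trigger it -->
    <assert test="exists(@article-type)" role="error">@article-type must always be present!</assert>
    <!-- Treated as a warning: delivery is allowed, but staff are alerted -->
    <report test="@article-type='retraction'" role="warning">Please inform Director of Publications about the retraction</report>
</rule>
```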

Development, testing, and maintenance: essential tips

Although writing a simple Schematron is not too difficult, building a complex and efficient one is no easy task. Step one is to carefully gather and thoroughly document user requirements; this is essential. Schematron validation must fit well into existing workflows; where it does not, modify it to suit the workflow. Next, write your Schematron rules in modules (groups of related rules) where it makes sense to do so. This helps to ensure that various rules do not conflict with one another. Finally, invest the time to optimise and test Schematron performance.

Documenting requirements

It is impossible to overemphasise the importance of documenting the requirements in plain English as part of Schematron development and maintenance. Too often the Schematron code becomes unintelligible once the number of tests exceeds several dozen (and it is not unusual for a publisher’s Schematron to contain hundreds of tests). Ideally, the code should be self-documenting, with requirements tagged so that an XSLT stylesheet can generate HTML documents with readable requirements.

Example 5: Documenting a Schematron rule with plain English

<d:req>Volume number in the metadata must be an Arabic numeral no longer than four digits, with no leading zero.</d:req>
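
In context, such a requirement might sit directly beside the test that enforces it, as in the sketch below, which reuses the rule from Example 4. The namespace URI bound to the ‘d’ prefix is a placeholder, and the sketch assumes that your Schematron processor accepts (and ignores) elements from a foreign namespace inside a rule.

```xml
<pattern xmlns="http://purl.oclc.org/dsdl/schematron"
         xmlns:d="http://example.org/ns/schematron-doc">
    <rule context="article-meta/volume">
        <!-- Plain-English requirement, kept next to the test that enforces it -->
        <d:req>Volume number in the metadata must be an Arabic numeral no longer than four digits, with no leading zero.</d:req>
        <assert test="matches(.,'^[1-9][0-9]{0,3}$')"><name/> '<value-of select="."/>' must be an Arabic numeral four digits or less, with no leading zero</assert>
    </rule>
</pattern>
```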

Regression testing

When a Schematron reaches a certain level of complexity, every change you introduce may have unforeseen consequences. To make sure the previously developed code still works, you have to build a set of so-called Go and NoGo tests: Go tests should pass, while NoGo tests should fail and emit the expected messages. Developing such tests is time-consuming but necessary to ensure markup integrity and data quality.

Notes and sources

[1] Mike Eden, Cambridge University Press. Presentation: An Implementation of BITS: The Cambridge University Press Experience. JATSCon 2016.
[2] There are a number of Schematron implementations. In this article we refer to ISO Schematron: http://standards.iso.org/ittf/PubliclyAvailableStandards/c055982_ISO_IEC_19757-3_2016.zip
[3] Schematron: A Language for Making Assertions About Patterns Found in XML documents. http://www.schematron.com/

Text by Alexander (‘Sasha’) Schwarzman (@SashaSchwarzman) with Mary Seligy (@MarySeligy). Sasha Schwarzman has twenty years of experience in markup technologies, beginning with SGML, and later with XML, XPath, Schematron, and native XML databases. As Content Technology Architect at OSA–The Optical Society, he is involved in quality control and semantic enrichment of published and converted content, as well as effective management of XML-centric workflows. Prior to joining OSA, Sasha worked as Information Systems Analyst at the American Geophysical Union (AGU) and served as a co-chair of the NISO/NFAIS Working Group on Supplemental Materials to Journal Article. He holds the equivalent of a Master of Science degree in Mechanical Engineering from the Saint Petersburg State Polytechnic University, Russia, and a Master of Library Science degree from the University of Maryland, College Park, USA.

Image credit: Mary Seligy (@maryseligy)
