Why XML Metadata Matters More than Ever (and how you can optimise yours for reuse)

Why is journal article metadata so darned sexy? To be clear, by ‘metadata’ we mean not only the actual metadata itself (the collective information about an article), but also the representation of that metadata in the JATS XML standard. As if this weren’t exciting enough, the main reason metadata is increasingly a hot topic of conversation in scholarly publishing is that, in real and practical ways, it is an article’s passport to key destinations in the scholarly publishing universe. The quality of XML metadata can mean the difference between an article travelling wherever it needs to and something closer to a ‘stay-cation’ at home. In this article we’ll talk about why XML metadata matters more than ever today, and what you can do to optimise yours for dissemination, discovery and reuse.

Consider the primary objective of a journal article, which is to have its contents disseminated as widely as possible so that it can be used by as many people in as many ways as possible, ultimately serving the larger goal of advancing science and humanities research.

To accomplish this objective, especially in light of the titanic volume of journal articles churned out daily, an article has to travel many places: in and out of peer review systems, to the publisher’s website, to aggregators, digital catalogs, search engine return pages, databases, and discovery platforms, to partner publishers or societies, repositories, archives, and so on. And just as passports for humans provide authoritative assertions about our identities to the gatekeepers who permit our passage, article metadata provides the gatekeepers of key article destinations with assertions about the provenance, authorship, ownership, access, funding, and relevance of an article, all of which are useful (if not specifically required) for its admission to those destinations.

Humans are the end-users we typically have in mind when we publish articles. People need metadata to help them decide whether to invest the time to read (and cite) content, to properly store, organise and curate it, or to decide to fund more of it.

But in scholarly publishing today, no human reads an article before it is read by a machine. Whether it is a researcher looking at content on the journal’s website, a librarian cataloging an article for their users, or a funder reviewing research, the content in question is served to the human after a machine (or series of machines) has processed that content’s metadata. And there is no shortage of metadata to process: as people find ever more interesting and inventive ways to use content, the systems and tools required to do the work will require more and more metadata to understand and access the content they are tasked with processing. Furthermore, as the need to standardise content for improved interoperability increases, so does the need for persistent identifiers and other new forms of metadata that all must be captured within an article’s XML.

It is clear that if we want to achieve our objectives for journal content, we need to satisfy our machine readers with machine-friendly metadata. But what does ‘machine-friendly’ mean?

Unlike humans, who may not love but can tolerate a few quirks and a little ambiguity in metadata, machines are not flexible readers. Machine systems function effectively and efficiently with data that is accurate, consistent, detailed, and predictable. Sloppy, incomplete, or inconsistent metadata may seriously limit (if not prevent) an article’s potential for wide dissemination, discovery, exchange, and reuse by machine systems. If you are not sure where to begin to make your metadata machine-worthy, fret not! Below are the three main rules to follow for creating quality machine-friendly metadata.

1. Be consistent

An important rule for all quality XML tagging is consistency, and this applies to both metadata and full-text documents. Although JATS is an XML standard that is (almost) ubiquitously used in scholarly publishing today, it is also a loose standard, which means that there may be many ways to tag the same item and still be valid. The important thing is to choose a consistent way to tag a given item within an article and across articles and volumes.

For example, if you have decided that your organization will follow the JATS tag library’s stated best practice for tagging authors and affiliations, then stick to that and do not allow your way to stray to any other model, however valid and reasonable it may be. If you have decided on a particular ID syntax for all your elements with IDs, then stick to that. If you change how you (or a vendor creating XML on your behalf) encode a particular item in XML then make it a conscious, thoughtful decision based on what is best for your organization and the content it provides.
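
To make the ID example concrete, here is a minimal sketch of how a consistency check might be automated. Everything specific in it is an assumption for illustration: the house rule (affiliation IDs of the form "aff1", "aff2", and so on), the function name find_inconsistent_aff_ids, and the sample XML are all invented, and your own conventions may well differ.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical house rule: affiliation IDs look like "aff1", "aff2", ...
AFF_ID_PATTERN = re.compile(r"^aff\d+$")

# Invented sample: one ID follows the rule, one uses a different
# syntax, and one <aff> has no ID at all.
SAMPLE = """<article-meta>
  <aff id="aff1">Nunavut Arctic College</aff>
  <aff id="AFF-2">Example Institute</aff>
  <aff>Example University</aff>
</article-meta>"""


def find_inconsistent_aff_ids(xml_text):
    """Return the id (or None if missing) of each <aff> that breaks the rule."""
    root = ET.fromstring(xml_text)
    problems = []
    for aff in root.iter("aff"):
        aff_id = aff.get("id")
        if aff_id is None or not AFF_ID_PATTERN.match(aff_id):
            problems.append(aff_id)
    return problems


print(find_inconsistent_aff_ids(SAMPLE))  # ['AFF-2', None]
```

All three affiliations in the sample could sit in perfectly valid XML, which is the point: validity alone will not catch this kind of drift, so a check like this has to encode the decision your organization has made.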

2. Be explicit

To a machine, text on its own is just a meaningless blob. XML makes content accessible to machines, so the more explicit (detailed and specific) the XML, the more ways a machine system has to interact with it and the more possibilities there are for extracting and using components for various purposes. The key is to be explicit without sacrificing accuracy or consistency.

For example, an affiliation can be tagged this way:

<aff>Nunavut Arctic College, Repulse Bay, Nunavut X0A 0H0, Canada.</aff>

But depending on whether it is possible to accurately and consistently tag components of the affiliation above, you might have

<aff id="aff1">Nunavut Arctic College, <city>Repulse Bay</city>, <state>Nunavut</state> <postal-code>X0A 0H0</postal-code>, <country>Canada</country>.</aff>

Or even

<aff id="aff1"><institution-wrap><institution content-type="university">Nunavut Arctic College</institution><institution-id institution-id-type="ringgold">1931</institution-id></institution-wrap>, <city>Repulse Bay</city>, <state>Nunavut</state> <postal-code>X0A 0H0</postal-code>, <country>Canada</country>.</aff>
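
To see what the extra detail buys a machine, here is a rough sketch that reads the least and most explicit versions of the affiliation with Python's standard library. The function name extract_affiliation and the FIELDS list are invented for illustration, and the sample uses the JATS attribute name institution-id-type on the institution identifier. This is not how any particular indexing service works, only a demonstration of the principle.

```python
import xml.etree.ElementTree as ET

# The affiliation tagged as a single run of text.
PLAIN = "<aff>Nunavut Arctic College, Repulse Bay, Nunavut X0A 0H0, Canada.</aff>"

# The same affiliation with its components tagged explicitly.
DETAILED = (
    '<aff id="aff1"><institution-wrap>'
    '<institution content-type="university">Nunavut Arctic College</institution>'
    '<institution-id institution-id-type="ringgold">1931</institution-id>'
    '</institution-wrap>, <city>Repulse Bay</city>, <state>Nunavut</state> '
    '<postal-code>X0A 0H0</postal-code>, <country>Canada</country>.</aff>'
)

# Hypothetical list of components a downstream system might want.
FIELDS = ("institution", "institution-id", "city", "state", "postal-code", "country")


def extract_affiliation(xml_text):
    """Map each field name to the text of its first matching element, or None."""
    aff = ET.fromstring(xml_text)
    record = {}
    for name in FIELDS:
        node = next(aff.iter(name), None)
        record[name] = node.text if node is not None else None
    return record


print(extract_affiliation(PLAIN)["country"])     # None
print(extract_affiliation(DETAILED)["country"])  # Canada
```

With the plain version, every field comes back empty and a system would have to guess at where the city ends and the country begins. With the detailed version, each component, including the institution identifier, is available without any guessing.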

3. Follow best practices

This is similar to rule Number One (be consistent), but it means trying to be consistent not only within our own content, but with the content of others in scholarly publishing as well.

Whether we work for a publisher, vendor, compositor, library, or archive, we all benefit from following common XML tagging practices. For example, a publisher may own and control its own website technology, but ultimately its articles are processed by many (if not all) of the same systems that make up the current scholarly publishing infrastructure, such as Crossref, PubMed, ORCID, and Google Scholar, to name a few. If we all follow the same tagging practices, then the machine systems we collectively use can be programmed to process our content much more efficiently and effectively. Here are some ways to follow common tagging practices:

  • Look at the JATS tag library documentation. The JATS tag library often indicates common tagging practices, which is a good place to start if you want to tag in ways that are similar to others.
  • Follow JATS4R. ‘Common’ practice does not mean ‘best’ practice, and although the JATS tag library mentions one or two best practices, it is really not its purview. However, it is the mandate of JATS4R to develop and recommend best practices for tagging content that is machine friendly. Follow or become involved with JATS4R here: http://jats4r.org/how-to-participate/
  • Talk to other publishers to see how they tag things. Better yet, ask them to come along to the next JATS4R call.

Text and illustration by Mary Seligy (@maryseligy) of Canadian Science Publishing, on behalf of JATS4R.
