Thursday, January 4, 2018

Publishing the Homer Multitext project archive

The Homer Multitext project (HMT) is changing its publication practice in 2018.  All of our work in progress remains available from publicly visible repositories hosted on GitHub, but we are adopting a new format for integrating material from our working archive into publishable units.

Our goals have always been first to specify a model for all HMT data structures independent of any publication format, and then to select a format that fully captures the semantics of the model.  In choosing a format for publication, we prefer one that, while completely expressing the model, is as simple as possible.  It should be intelligible both to human readers and to software, and readily usable by the widest possible range of digital tools.

Beginning in 2014, we adopted the TTL serialization format of the Resource Description Framework (RDF) to integrate textual editions, data about physical artifacts like manuscripts, and documentary images into a single publishable file.  RDF was designed to facilitate dynamic exchange and automated linking of resources on the world wide web, and is widely used for that purpose in the digital humanities community today.  As a format for disseminating stable releases of HMT content, however, it is not ideal.  RDF can be quite verbose:  representing a single citable node of text in one of our editions, for example, requires more than half a dozen separate RDF statements.  It is often not immediately intelligible to human readers, and although the RDF model can be implemented in multiple formats (JSON and XML, in addition to TTL), RDF data can in practice be used only with software specifically aimed at RDF processing.

This month, we are releasing our first published data sets in the CITE Exchange format (CEX).  To quote the CEX specification, CEX is "a plain-text, line-oriented data format for serializing citable content following the models of the CITE Architecture."  CEX makes it possible to represent any of the fundamental models of the HMT archive (texts, citable collections of objects, and the complex relations among these objects that our archival data sets encode) as simple tabular structures in labelled blocks of a plain-text file you can inspect with any text editor.  All blocks in a CEX file are optional, so we can equally easily publish a single updated body of material, such as a new set of photographs of a manuscript or a newly edited section of a text, or an entire compilation of our current archive in a single plain-text file.  Because each CEX block is a table represented as lines of delimited text, generic tools, from spreadsheets and databases to ancient command-line utilities like `sed` and `grep`, can be applied directly to CEX data, in addition to the specialized code libraries we have developed that understand the semantics of citation with URNs.  (See https://cite-architecture.github.io/ for more information about the cross-platform code libraries.)
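
To give a sense of how little machinery a line-oriented format of labelled blocks needs, here is a minimal sketch in Python using only the standard library.  It is not one of the CITE code libraries, and it rests on assumptions you should check against the CEX specification:  that a block is introduced by a line beginning with `#!`, that columns are separated by `#`, and that lines beginning with `//` are comments.  The function name `parse_cex`, the sample URNs, and the `illustratedBy` relation are invented for illustration and are not HMT data.

```python
from collections import defaultdict


def parse_cex(text, delimiter="#"):
    """Group the lines of a CEX-like string into labelled blocks of rows.

    Assumptions (verify against the CEX specification): block labels are
    lines starting with "#!", columns are separated by `delimiter`, and
    lines starting with "//" are comments.
    """
    blocks = defaultdict(list)
    label = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("//"):
            continue  # skip blank lines and comments
        if line.startswith("#!"):
            label = line[2:].strip()
            blocks[label]  # register the block even if it has no rows
            continue
        if label is not None:
            # Naive split: a real parser must handle a delimiter
            # occurring inside a text passage.
            blocks[label].append(line.split(delimiter))
    return dict(blocks)


# Hypothetical sample with two optional blocks; the URNs are made up.
SAMPLE = """\
// a tiny illustrative file
#!ctsdata
urn:cts:example:group.work.ed:1.1#First citable passage
urn:cts:example:group.work.ed:1.2#Second citable passage

#!relations
urn:cts:example:group.work.ed:1.1#illustratedBy#urn:cite2:example:img.v1:1
"""

blocks = parse_cex(SAMPLE)
for urn, passage in blocks["ctsdata"]:
    print(urn, "->", passage)
```

Because every block is optional, the same function works whether a file carries one small update or a whole archive; a caller simply looks up the labels it cares about.  For real work, the CITE libraries are the better choice, since they also understand the semantics of the URNs rather than treating them as opaque strings.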

As a result, over the coming weeks you will see a series of short announcements of releases as we test and release one portion of our archive at a time.

Happy New Year, with complex data in simple formats!