The Homer Multitext: design

Monday, October 29, 2012

From graphs to applications

In 2012, the quantity of material collected by the HMT project has grown rapidly. To cope with this, we have been developing an automated system to identify the relations among all citable objects in the HMT data archive (texts, images, artifacts like manuscript pages, to name a few). In mathematical terms, these relations form a graph.

In the HMT graph, all nodes are identified by URN values (CTS URNs for texts, or CITE URNs for other kinds of objects). This simple, consistent reference format made it easy for us to develop a network service for working with the HMT graph: supply the service with a URN value, and the service finds all links to that URN.
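To make the lookup concrete, here is a minimal sketch in Python of a graph indexed by URN. It is not the HMT service itself: the `example` URNs, the relation labels, and the function names are all hypothetical, and the real service replies in XML rather than with Python tuples.

```python
from collections import defaultdict

# Hypothetical edges (URN, relation label, URN). Only the image URN
# appears in HMT material; the "example" URNs and the labels are
# invented for illustration.
EDGES = [
    ("urn:cts:example:group.work.ed:1.1", "appearsOn", "urn:cite:example:page.1"),
    ("urn:cite:example:page.1", "illustratedBy", "urn:cite:hmt:chsimg.VA094VN-0597"),
]


def build_index(edges):
    """Index every edge under both of its endpoints."""
    index = defaultdict(list)
    for subject, relation, obj in edges:
        index[subject].append((relation, obj))
        index[obj].append((relation, subject))
    return index


def links_for(index, urn):
    """Return all (relation, other URN) pairs that touch the given URN."""
    return list(index.get(urn, []))


INDEX = build_index(EDGES)
for relation, other in links_for(INDEX, "urn:cite:example:page.1"):
    print(relation, other)
```

Because every node is named by a URN, a single dictionary keyed on URN strings is enough to answer "what is linked to this object?" for texts, images, and artifacts alike.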

This is an important step in the long-term development of our digital multitext, and will certainly be the subject of future blog posts. For today, I simply want to announce a test site with end-user applications built on the graph service.

Like our other services for retrieving HMT data, the graph service replies with a simple XML format; as in our other service implementations, we can include XSLT stylesheets to format these graph descriptions as web pages for human readers.
We have written three sets of stylesheets that turn the graph data into three quite distinct applications:

a facsimile browser, for reading diplomatic editions of texts alongside documentary images
a multitext reader, for reading multiple versions of a single text
a graph navigator, for exploring links in the HMT project graph

You'll find test versions of all three of these apps at our new HMT Apps test page: http://beta.hpcc.uh.edu/tomcat/hmtapps/

If you're curious about how the graph service works, try viewing the XML source of one of the applications' web pages. If you just want to try out an app, feel free. Expect that the test versions on this site will evolve rapidly over the next several months. We'll post announcements on this blog when we install more static release versions elsewhere.

Thursday, August 9, 2012

πάθει, μάθος

When I ran the automated build to publish our collection of XML diplomatic editions, as blogged here earlier this week, I had inadvertently turned off a final automated validation of all the texts to be published, and some errors crept into the published package. I've rerun the build and republished the texts. If your system automatically uses the latest version of a Nexus artifact, you'll now get the correct versions. If you set the version manually, please update it to 2012.8.9; to download manually, get the zip file from the 2012.8.9 directory here.

Apologies to those who were mystified by the errors in the initial release. We're still learning how to automate the management of our archives as effectively as possible, and will continue to blog here about our successes and failures alike.

Monday, August 6, 2012

Release of archival text editions

A recent post noted the reorganization of our downloadable archive of image data. Today, we have published a first package of diplomatic editions of texts.

In contrast to our relatively static collection of large, binary data files of images, we expect to release new packages updating or expanding our archive of TEI-compliant editions whenever new material is ready. We are managing these releases with the Nexus repository management system, hosted at the University of Houston High Performance Computing Center (briefly noted here). Nexus organizes named "artifacts" (as it terms published objects) in groups, with further, specific identifiers distinguishing each published version. Most significantly for the on-going management of the HMT project's digital resources, Nexus supports the automated management of dependencies used in build systems like Maven or Gradle. This means that when we want to automate a task working with our editions, we can simply declare that our task depends on a particular version of the artifact, and the zipped file will be automatically retrieved over the internet and unpacked locally. (For the technically inclined, I recently blogged a note elsewhere about using automated dependency management in small disciplines like Classics.)

In addition to the TEI-compliant XML files, the package includes a CTS TextInventory file documenting the citation scheme of each edition, and how it maps onto the document's XML markup. README files summarize the contents and editorial status of each section of the archive. This release includes initial, unverified diplomatic editions of 32 Iliadic texts from papyrus sources, the Iliadic text and scholia from Iliad 1-6 in the Venetus A manuscript, and other texts from the first eleven folios of the Venetus A manuscript.

The package is named hmt-editions, and it belongs to the group org.homermultitext. For this package, a date string is used for the version identifier (formatted as YYYY.MM.DD). You can search for "hmt-editions" from the Nexus server at beta.hpcc.uh.edu/nexus, or directly download the zip file from this URL.

If you want to add a dependency on this package in your Maven or Gradle build, use the following Maven coordinates (in whatever syntax is appropriate to your build system):

groupId: org.homermultitext
artifactId: hmt-editions
version: 2012.8.7
type: zip
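
For readers curious how a build system turns these coordinates into a file to download, the sketch below applies the standard Maven repository layout in Python. The function name is hypothetical, and the base URL of the repository on the Nexus server is deliberately left out, since it is not given in this post.

```python
def repository_path(group_id, artifact_id, version, packaging):
    """Build the relative path of an artifact in the standard Maven layout."""
    # Dots in the group id become directory separators.
    group_dir = group_id.replace(".", "/")
    filename = f"{artifact_id}-{version}.{packaging}"
    return f"{group_dir}/{artifact_id}/{version}/{filename}"


print(repository_path("org.homermultitext", "hmt-editions", "2012.8.7", "zip"))
# prints org/homermultitext/hmt-editions/2012.8.7/hmt-editions-2012.8.7.zip
```

Appending this path to a repository's base URL gives the address that a dependency manager requests on your behalf.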

Friday, August 3, 2012

ICT · The Image Citation Tool

The Homer Multitext has developed an Image Citation Tool for use with digital images served by the CITE Image Service.

Humanist scholarship is the act of forging new connections between ideas, and placing those connections before the eyes of readers. Those readers must be free and empowered to judge the value of the new connections.

The heart of humanist scholarship, then, is quotation. Robert Sokolowski calls quotation a ‘curious conjunction of being able to name and to contain’;* V.A. Howard is more succinct: quotation is ‘replication-plus-reference’.** We would re-phrase this as “reproduction plus citation”.

Reproduction in a quotation allows us to talk about a particular artifact of human thought without the burden of reproducing its entire context. As a practical matter, it is easier to say “Μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος” than to reproduce the whole of the Iliad.

Citation in a quotation provides a link from the reproduced selection back to the context of that selection. “Iliad 1.1” names our quotation, as Howard says, but also invites the reader to explore Iliad 1.2, all of Book 1 of the Iliad, or the whole poem.

This is easy stuff. We were all taught to do it early in our educations, and we take it for granted.
It is easy, that is, with texts. Images are a different matter. How do you “quote” from an image? Most scholars use images from time to time in their work; few of those uses meet the rigorous standards of “quotation” that we take for granted with texts. The general practice is to open a digital image in an image-editor, cut out the area of the image under discussion, and paste that image into a document or web-page. If the image is cited, the citation is often to a page in a book that has published a version of the image, a museum’s accession number identifying an original work of art, or a URL to a web-page on which a digital version of the image appears. The citation does not provide a path from the selection to its context.

Nor is this kind of “image quotation” actionable in the way that a textual quotation (reproduction+citation) is. Given “Iliad 1.1”, it is simple to answer the question “What comes next?”. It is simple to know that “Iliad 1.1–1.10” includes Iliad 1.5. Given a cut-copied-pasted snippet of a digital image, and perhaps a citation to a web-page, it is not possible to answer with any degree of precision “Where is this snippet in its larger context? What parts of the image are adjacent to this snippet?” Given two snippets, it would require considerable computation to determine whether Snippet A contains Snippet B.
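
The textual half of that comparison can be sketched in a few lines of Python. This is only an illustration, assuming references of the simple "book.line" form; the function names are hypothetical.

```python
def parse_ref(ref):
    """Split a 'book.line' reference such as '1.10' into a pair of integers."""
    book, line = ref.split(".")
    return (int(book), int(line))


def range_contains(start, end, ref):
    """True if ref falls within the inclusive range start-end."""
    # Tuples compare book first, then line, so (1, 5) sorts before (1, 10).
    return parse_ref(start) <= parse_ref(ref) <= parse_ref(end)


print(range_contains("1.1", "1.10", "1.5"))   # True
print(range_contains("1.1", "1.10", "1.11"))  # False
```

The references are parsed into numbers because comparing them as plain strings would put "1.10" before "1.5".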

Since the Homer Multitext is committed to image-based scholarly editing, and to subsequent discussion and argument based closely on digital imagery of manuscripts and papyri, we have worked to develop a means of quoting images as rigorously, and usefully, as we can quote texts.

The CITE Image Service allows us to identify images with canonical citations in URN-notation. These URNs take the form: urn:cite:hmt:chsimg.VA094VN-0597. This points to a notional image, which might be delivered at any scale, or by services hosted on various machines with various addresses.
A CITE Image URN can take a suffix that identifies a rectangular region-of-interest: urn:cite:hmt:chsimg.VA094VN-0597:0.3833,0.2441,0.0783,0.0463. This URN+ROI can resolve to a “quotation”, that is, the region-of-interest on its own, or to a view of the larger image with the region-of-interest highlighted. These image-ROIs, canonically cited with URN notation, are concise, precise, and machine-actionable mechanisms for image quotation in the best tradition of humanist scholarship.
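
As a rough illustration of what "machine-actionable" means here, the Python sketch below splits a URN+ROI into its parts and tests whether one region contains another. It assumes that the four values are left, top, width, and height expressed as fractions of the image's dimensions; that reading, and every name in the code, is an assumption made for this sketch rather than a statement of the service's specification.

```python
from typing import NamedTuple, Optional, Tuple


class Roi(NamedTuple):
    # Assumed meaning of the four values: left, top, width, height,
    # each as a fraction of the full image.
    x: float
    y: float
    w: float
    h: float


def split_urn(urn: str) -> Tuple[str, Optional[Roi]]:
    """Separate the image URN from an optional region-of-interest suffix."""
    parts = urn.split(":")
    if len(parts) == 4:
        return urn, None
    if len(parts) == 5:
        values = [float(v) for v in parts[4].split(",")]
        if len(values) != 4:
            raise ValueError("a region-of-interest needs exactly four values")
        return ":".join(parts[:4]), Roi(*values)
    raise ValueError("unexpected number of URN components")


def contains(outer: Roi, inner: Roi) -> bool:
    """True if the inner rectangle lies entirely within the outer one."""
    return (
        outer.x <= inner.x
        and outer.y <= inner.y
        and inner.x + inner.w <= outer.x + outer.w
        and inner.y + inner.h <= outer.y + outer.h
    )


image, roi = split_urn("urn:cite:hmt:chsimg.VA094VN-0597:0.3833,0.2441,0.0783,0.0463")
whole = Roi(0.0, 0.0, 1.0, 1.0)
print(image)
print(contains(whole, roi))
```

With regions expressed as numbers relative to the whole image, the question "does Snippet A contain Snippet B?" becomes four comparisons instead of an image-matching problem.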

To help ourselves, our collaborators, and anyone else interested in working with the openly licensed images in our Homer Multitext Image Collection, we have developed a web-based tool for defining regions-of-interest on our digital images and capturing canonical citations for them. This is the Image Citation Tool.

A lengthy introduction to the tool and its use is included in our Homer Multitext documentation pages. Downloadable source code for the tool, a relatively simple web application in HTML, CSS, and JavaScript, is available from the HMT’s code repository.

References

* Sokolowski, Robert. “Quotation.” The Review of Metaphysics 37.4 (1984): 699-723.
** Howard, V.A. “On Musical Quotation.” Monist 58 (1974): 310.

Wednesday, July 25, 2012

Announcing version 1.0 of CHS Image Services

A recent post described our reorganization of the Homer Multitext project's archival image data.
We have been experimenting for some time with a preliminary internet service for working with canonically citable images.

Today, we are releasing version 1.0 of our implementation of the CHS Image Service, an extension to the CITE architecture's Collections. The CHS Image Service supports extended citation of images including regions of interest, and provides methods for gathering various kinds of information about a canonically citable image, including retrieving binary image data. We plan to follow up on this release shortly with a formal specification of version 1.0 of the CHS extension to CITE Collections.

In the meantime, if you are a developer interested in using canonically citable images, see this summary of CHS Image Services in our overview of the CITE architecture.

If you would like to run your own installation, see this guide to running a CHS Image Service.

If you are an end-user who currently uses HMT apps, you should see no changes at all (except perhaps that the web pages at our reference installation of CHS Image Services, amphoreus.hpcc.uh.edu/tomcat/chsimg/, may have a little nicer skin) — that's a feature of the design of chsimg 1.0. What you should expect to see over the next year or two is more rapid development of applications drawing on chsimg to incorporate canonically citable images in new ways to visualize and explore the Homer Multitext project's increasingly rich archive.

Tuesday, July 24, 2012

Verifying an inventory of scholia

One of the most important tasks the Homer Multitext project has been addressing is to compile a complete inventory of the scholia in the manuscripts we are studying. Remarkably, this has never been done, even for the much-discussed Venetus A manuscript. (The most thorough and accurate edition to date, Dindorf's admirable two-volume Scholia Graeca in Homeri Iliadem, excludes interior and exterior scholia from consideration. As to Erbse's Scholia Vetera in Iliadem, often misunderstood by classicists as systematic, or as an attempt to create a comprehensive inventory, a detailed comparison of our edition of Iliad 3 and 4 with Erbse's text showed that he publishes only about 80% of the scholiastic text in the Venetus A.)

Venetus A, folio 12 recto, with the first 25 lines of the Iliad; overlays show the location of scholia, color-coded for their class of placement on the folio.

Since the summer of 2010, we have used a system of machine-assisted visual proofing to help verify that we have indeed included all scholia on a given folio in our inventory. HMT editors create structured inventory notebooks that include a citation of visual evidence for each scholion they identify. These notebooks can be transformed into web pages with images illustrating each folio, with partially opaque overlays showing the position of each scholion. Editors can see at a glance whether any areas of a folio page have text outside an overlay.

This summer, as part of our effort to improve the automated management of our archival data, we have added an automated check that verifies the syntax of every image citation. Yesterday, I completed a review of inventories for Iliad 1-5 in the Venetus A. 3503 out of 3505 entries (99.9%) were syntactically valid: clearly, visual proofing is a pretty effective method of checking syntax as well as finding missed areas on a folio page. Next, we plan to package the verification tool in an expanded tool kit for HMT editors, so that they can validate the syntax of their citations before ever submitting a notebook for review. (That of course means: they will be required to validate that 100% of their citations are syntactically valid before submitting a notebook for review!)
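
A syntax check of this kind might look something like the following Python sketch. It is not the HMT verification tool: the pattern is a simplified guess at the shape of an image citation (a CITE URN with an optional suffix of four comma-separated numbers between 0 and 1), and the function names are hypothetical.

```python
import re

# Simplified, assumed grammar: urn:cite:<namespace>:<collection>.<object>
# optionally followed by ":" and four comma-separated decimal numbers.
CITATION = re.compile(
    r"urn:cite:[A-Za-z0-9]+:[A-Za-z0-9]+\.[A-Za-z0-9_-]+"
    r"(?::(\d+(?:\.\d+)?(?:,\d+(?:\.\d+)?){3}))?"
)


def is_valid(citation):
    """True if the citation matches the assumed syntax."""
    match = CITATION.fullmatch(citation)
    if match is None:
        return False
    roi = match.group(1)
    if roi is None:
        return True
    # Each value is assumed to be a fraction of the image, so 0 to 1.
    return all(0.0 <= float(value) <= 1.0 for value in roi.split(","))


def summarize(citations):
    """Return (number of valid citations, total number checked)."""
    valid = sum(1 for citation in citations if is_valid(citation))
    return valid, len(citations)


sample = [
    "urn:cite:hmt:chsimg.VA094VN-0597:0.3833,0.2441,0.0783,0.0463",
    "urn:cite:hmt:chsimg.VA094VN-0597:0.3833,0.2441,0.0783",
]
print(summarize(sample))  # (1, 2)
```

The second sample entry fails because its region-of-interest has only three values, the kind of slip that is easy to make when copying a citation by hand.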

Of course we cannot be sure that this method will find scholia when they are nearly invisible on our photographs. Participants in the 2011 summer seminar at the Center for Hellenic Studies made the alarming discovery that they had missed a number of exterior scholia visible in the 1901 facsimile edition by Comparetti. It is clear that in the century since Comparetti, the small scholia on the edges of the manuscript pages have suffered the most, and we have since established a practice of routinely checking each inventoried folio against Comparetti's facsimile. In some cases, our ultraviolet photography preserves legible text.

What does an inventory of 3500 scholia look like? I've posted one visualization: a pdf that offers a kind of "flip chart" of folios 12 recto through 80 recto of the Venetus A (that is, Iliad 1-5), with a very small version of our overlay image for each folio side. You can see that visualization, as well as some work in progress from the summer of 2012, here.

Friday, July 20, 2012

Digital images in the Homer Multitext project

Since 2007, when the Homer Multitext project photographed three manuscripts in Venice, digital images have been a fundamental component of the project. With all of our digital resources, our aims are first to make archival-quality sources fully available for downloading, then to publish them in internet services allowing retrieval by canonical citation, and finally to develop end-user applications drawing canonically cited material from our services.

In our reorganization of HMT material this summer, we have gathered archival versions of our images in a single location, linked from the new 'data' section of our web site or directly accessible from www.homermultitext.org/hmt-image-archive. The image archive currently includes approximately 10,000 digital images of five manuscripts. As I am posting this note, the entire archive is being mirrored at the College of the Holy Cross — a process that, even at internet speed, will take a few days to complete.

All images are available under the terms of a Creative Commons attribution license: further details are included in README files for each data set. We include md5 checksums for each image so that you can verify that downloads have not been corrupted.
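
If you would like to script that verification, a minimal Python sketch follows. The function names are hypothetical, and how you read the expected checksum for each image from the archive's checksum files is left to you, since their layout is not described in this post.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reading keeps memory use low for large image files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """True if the file's md5 digest equals the published checksum."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

Call `matches_checksum` with the path of a downloaded image and the checksum published alongside it; a result of `False` means the download should be repeated.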

Wednesday, July 18, 2012

Building the HMT web site

As part of our broader effort to automate the management of the expanding digital resources of the Homer Multitext project, we have recently defined four sections of the www.homermultitext.org web site:

archival data (hmt-data)
technical documentation (hmt-doc)
programs developed for the HMT (hmt-code)
end-user applications (hmt-apps)

The content is written in markdown in four repositories hosted on bitbucket.org. The markdown pages are converted to HTML5 and formatted for inclusion in the HMT web site using mdweb, a minimal system for managing structured documentation in markdown syntax which you can find among our first listings in the new hmt-code section of the site (here).

We hope that with this very lightweight documentation system we will be able to keep these sections of the web site up to date more easily. The new system has already simplified mirroring those sections of the HMT web site: you can currently find clones of all four sections on katoptron.holycross.edu.

To coordinate blog entries here with content on the homermultitext.org web site, we will include tags parallel to the four web site sections (data, doc, code, apps) on discussion here relevant to one of those sections.

Tuesday, July 17, 2012

University of Houston High Performance Computing Center

The High Performance Computing Center at the University of Houston has provided invaluable support to the Homer Multitext project for several years. If you have ever used one of our zoomable images, the stream of image data comes from amphoreus.hpcc.uh.edu, a machine dedicated to the Homer Multitext project's work. In January when I visited the University of Houston, part of the warm hospitality included a tour of the astonishing facilities there.

amphoreus.hpcc.uh.edu is part of this massive rack

Later this spring, HPCC extended their support by creating a second, virtual machine where we can test software in development before moving better tested, stable versions to amphoreus. One of our first experiences on the test machine has been to use Nexus, a repository management system, as a way to publish versioned packages of software, textual editions and other material created by the HMT project.

Left, Keith Crabb, director of HPCC; right, Alan Pfeiffer-Traum, system managing magician; center, I attempt to impersonate a professor in a suit.

This support is allowing us to reorganize both the way we make our archival material available to others, and the way we automatically use our archival material internally within the HMT project. We'll post more details here as new parts of our reorganized system are publicly released. Meanwhile, once again, thanks to Keith Crabb, the manager of the High Performance Computing Center at UH, and to Alan Pfeiffer-Traum, the system administrator who keeps everything running around the clock. Their support for the work of HMT editor Casey Dué, and for the HMT project as a whole, is dramatically changing the ways our material will reach the internet.

Welcome to the HMT

This blog discusses new developments and on-going research related to the Homer Multitext project (www.homermultitext.org). The HMT seeks to present the textual transmission of the Iliad and Odyssey in a historical framework. Such a framework is needed to account for the full reality of a complex medium of oral performance that underwent many changes over a long period of time. Using technology that takes advantage of the best available practices and open source standards that have been developed for digital publications, the HMT offers free access to a library of texts and images and tools to allow readers to discover and engage with the Homeric tradition.