Computing the Provenance of Art

By Rob Sanderson, Semantic Architect for the J. Paul Getty Trust

In this connected age, if information is not on the web, then for the majority of people it does not exist. As data becomes more and more connected within and across organizations, it is also moving to the web as a way to make it more accessible, discoverable and, most importantly, usable. Unless privacy issues dictate otherwise, if data is not on the web, then it does not exist for the purposes of scholarship. Data about the history of people, places, objects and their complex interrelationships avoids many of the privacy issues, and is a perfect domain for this work.

Computers are capable of very detailed analysis over huge amounts of information, often termed “big data”, but only when that data is represented clearly and consistently. The description of art is a collective human effort, and thus in order to gain the benefits of machines doing macro-level analysis, those descriptions need to be consistently created and connected across institutions and projects. The descriptions needed to model history in terms that machines can understand and compute on are more complex than the same concepts expressed in human language, as machines lack the context to understand the difference between a person and a painting, unless told.

Several organizations have begun to work together on an understandable set of guidelines and data models for how to do this, centered on the model derived from the American Art Collaborative project. That model adopts current best practices for usability and design of models defined in Linked Open Data (LOD), and is one of several emerging champions for the notion of Linked Open Usable Data (LOUD!). The model makes the simple things easy and the complex things possible, expressed in a way that is easy for software developers without PhDs in both art history and computer science to work with.

At the Getty Research Institute, this model is being used to recondition and republish the Provenance Index, a dataset of events in which the ownership of an object changed. The events are described based on primary source evidence from the archives of art dealers, from sales catalogs of auction houses, and from other similar research. The objects and people are also described, and connected where possible to other datasets. This work mirrors other ongoing Linked Open Data work around the Getty, such as the Museum collection (including the provenance) being expressed using the same model, and the description of conservation science and its literature in the Getty Conservation Institute. The Getty Vocabularies are already available as LOD, and work is ongoing to add the important Usability aspect.
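
To make the idea of ownership-change events concrete, here is a minimal sketch in Python of how such events might be represented so that a machine can compute a chain of owners. It is an illustration only, not the Getty's actual data model: the class names, fields, identifiers and the placeholder records ("Dealer A", "Collector B", "Museum C") are all hypothetical. Note that each entity carries an explicit type, which is how the software is told the difference between a person and a painting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A described thing; all fields here are hypothetical, for illustration."""
    id: str     # placeholder identifier, not a real published URI
    type: str   # explicit type, e.g. "Person", "Group" or "Painting"
    label: str  # human-readable name


@dataclass(frozen=True)
class OwnershipChange:
    """One event in which an object passed from one owner to another."""
    obj: Entity
    from_owner: Entity
    to_owner: Entity
    year: int
    source: str  # the primary evidence the event is based on


def ownership_chain(events, obj_id):
    """Return the labels of the successive owners of one object, by year."""
    relevant = sorted(
        (e for e in events if e.obj.id == obj_id), key=lambda e: e.year
    )
    if not relevant:
        return []
    chain = [relevant[0].from_owner.label]
    chain.extend(e.to_owner.label for e in relevant)
    return chain


# Placeholder records, invented purely for this example.
painting = Entity("obj/1", "Painting", "Untitled Landscape")
dealer_a = Entity("agent/1", "Group", "Dealer A")
collector_b = Entity("agent/2", "Person", "Collector B")
museum_c = Entity("agent/3", "Group", "Museum C")

events = [
    OwnershipChange(painting, collector_b, museum_c, 1950,
                    "hypothetical sales catalog"),
    OwnershipChange(painting, dealer_a, collector_b, 1920,
                    "hypothetical dealer stock book"),
]

print(ownership_chain(events, "obj/1"))
```

Because every event is recorded in the same shape, the same small function works for one painting or for a whole dataset, which is the kind of consistency that macro-level analysis depends on.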

Much of this information is being managed in a data platform called Arches. It was originally funded by the Getty Conservation Institute to look after information about immobile cultural heritage, such as temples or rock carvings, in war-torn states, but that scope has since broadened to encompass many of the data requirements for cultural heritage in general. Arches provides human user interfaces on top of the data, both for creating and editing records and for searching and displaying them to end users. It allows art historians to create and publish data in a consistent, connected and collaborative way: the three “C”s that drive Getty’s digital mission.