Notes from the Hierarchy Alignment session

At this session I shared some of the work I have done at GTN-Québec on hierarchy alignment, its strengths and limits, and we discussed ways to go beyond it.

First, our use case: we are harvesting bibliographic metadata from multiple sources that use different hierarchical thematic vocabularies (such as Dewey, the Library of Congress classification, etc.). We want our users to find resources from all sources using a classification term from any single vocabulary.

Most of our thematic hierarchies are small (<250 categories), and we have found it possible to align them by hand, using only the notions of related term (RT) and narrower term (NT) from ISO 2788, applied across hierarchies. It is thus possible to search for terms (and their associated resources) along a path of RT/NT relations that may be either internal or cross-vocabulary.
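
As a minimal sketch (with invented namespaces and category URIs), such cross-vocabulary links could be expressed in SKOS, whose skos:narrower and skos:related properties correspond to ISO 2788’s NT and RT:

```python
# Hypothetical cross-vocabulary alignment expressed with SKOS, whose
# skos:narrower and skos:related correspond to ISO 2788's NT and RT.
from rdflib import Graph, Namespace
from rdflib.namespace import SKOS

DEWEY = Namespace("http://example.org/dewey/")       # invented namespaces
LOCAL = Namespace("http://example.org/local-vocab/")

g = Graph()
# Internal hierarchy inside the local vocabulary.
g.add((LOCAL["spanish-studies"], SKOS.narrower, LOCAL["spanish-literature"]))
# Cross-vocabulary links: local subcategories hang under Dewey categories.
g.add((DEWEY["860-spanish-literature"], SKOS.narrower, LOCAL["spanish-literature"]))
g.add((LOCAL["spanish-studies"], SKOS.related, DEWEY["460-spanish-language"]))

# Following NT links from a Dewey term reaches locally classified resources.
for _, _, narrower in g.triples((DEWEY["860-spanish-literature"], SKOS.narrower, None)):
    print(narrower)
```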

Moreover, we can use a single (rich enough) vocabulary as a “pivot”: we establish correspondences between each vocabulary and the pivot, rather than between each pair of vocabularies, which keeps the number of alignments to maintain proportional to the number of vocabularies rather than quadratic in it. (We have chosen Dewey, the largest vocabulary we had to use, as the pivot.)

There are interesting corner cases, but more often than not they can be resolved at a finer level. For example, one vocabulary lumps linguistics and literature for each “foreign” language under that language’s studies. This does not correspond to any of Dewey’s categories, but specific linguistic or literary subcategories can be related as narrower terms to the appropriate Dewey categories. In more complex cases, a category falls between two Dewey categories, or vice versa.

In general, our choice has been to let such categories dangle when the alignment can be resolved at the subcategory level, which means that resources classified at that abstract level cannot be found; we have thus favoured precision over recall. But sometimes there are no subcategories to use, and there we reach the limits of this approach.

In the discussion, we gave many examples of “almost identical” categories, especially across languages; someone mentioned how the German “Gemüse” category (roughly our “vegetables”) excludes potatoes. We raised the prospect of codifying exceptions (NOT narrower term), but we agreed that using negation was probably more demanding than it was worth.

First, we agreed that for search, lack of precision was a comparatively minor flaw, and that we should distinguish matches by degree of quality. Path traversals with a cost at each step are a known solution to this problem; but we also need to convey to users that some results are more or less certainly relevant.
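
A minimal sketch of such a costed traversal, as a Dijkstra-style search with invented edge costs and category names (cross-vocabulary links cost more than internal ones, and RT more than NT); the accumulated path cost becomes the quality signal shown to users:

```python
# Sketch: rank categories reachable from a query term by the total cost of
# the RT/NT path leading to them; higher cost = less certainly relevant.
# The graph, costs and category names are invented for illustration.
import heapq

COST = {"NT": 0.1, "RT": 0.5, "NT-cross": 0.3, "RT-cross": 1.0}

edges = {
    "dewey:vegetables": [("NT", "dewey:root-vegetables"),
                         ("RT-cross", "local:garden-produce")],
    "dewey:root-vegetables": [("NT-cross", "local:potatoes")],
    "local:garden-produce": [],
    "local:potatoes": [],
}

def ranked_matches(start, max_cost=1.5):
    """Dijkstra-style traversal; returns categories with their path cost."""
    best = {start: 0.0}
    queue = [(0.0, start)]
    while queue:
        cost, term = heapq.heappop(queue)
        if cost > best[term]:
            continue  # stale queue entry
        for rel, nxt in edges.get(term, []):
            c = cost + COST[rel]
            if c <= max_cost and c < best.get(nxt, float("inf")):
                best[nxt] = c
                heapq.heappush(queue, (c, nxt))
    return sorted(best.items(), key=lambda kv: kv[1])

print(ranked_matches("dewey:vegetables"))
```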

Chris McDowall mentioned that a resource’s belonging to a category is contextual, and explained how supervaluationism distinguishes context-free “super-truths” from contextual truths. Concretely, this could be translated into element membership being qualified by the number of people asserting it. (This might be related to Yves Raymond’s session on user feedback for categories.)
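
As a toy sketch of that idea (the threshold and the data are invented), membership assertions could be counted per user, with only near-unanimous assertions promoted to “super-truths”:

```python
# Sketch: context-qualified category membership. Assertions are counted per
# user; only unanimous ones count as context-free "super-truths".
from collections import defaultdict

assertions = defaultdict(set)  # (resource, category) -> users asserting membership

def assert_membership(user, resource, category):
    assertions[(resource, category)].add(user)

def membership_degree(resource, category, population):
    """Fraction of the population asserting the membership."""
    return len(assertions[(resource, category)]) / population

assert_membership("alice", "potato", "vegetables")
assert_membership("bob", "potato", "vegetables")
assert_membership("alice", "potato", "Gemüse")  # contextual: excluded in German usage

for category in ("vegetables", "Gemüse"):
    degree = membership_degree("potato", category, population=2)
    print(category, degree, "super-truth" if degree >= 1.0 else "contextual")
```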

One big point of agreement was that category equivalence should in general be determined from large corpora (“Big Data”), not by hand. Michel Gagnon proposed using document classification as a way to infer category equivalence; in a private conversation, I mentioned the issue of corpora in different languages, and he proposed using DBpedia terms as a pivot.
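
One way this could work, sketched below with invented data and an invented similarity threshold: represent each category by the DBpedia terms found in its classified documents (a language-neutral pivot), and treat categories with similar term profiles as equivalence candidates:

```python
# Sketch: infer category equivalence from classified documents. Each category
# is summarized by the multiset of DBpedia terms in its documents; similar
# profiles suggest equivalent categories. Data and threshold are invented.
from collections import Counter
from math import sqrt

def profile(docs_terms):
    """Aggregate the DBpedia terms of all documents under one category."""
    total = Counter()
    for terms in docs_terms:
        total.update(terms)
    return total

def cosine(p, q):
    dot = sum(p[t] * q[t] for t in p)
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

dewey_cat = profile([["dbpedia:Cooking", "dbpedia:Vegetable"],
                     ["dbpedia:Vegetable", "dbpedia:Potato"]])
gemuese   = profile([["dbpedia:Vegetable", "dbpedia:Cabbage"]])

if cosine(dewey_cat, gemuese) > 0.6:  # invented threshold
    print("candidate equivalence")
```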

We also mentioned the possibility of inferring the taxonomy itself, rather than only equivalences between existing taxonomies; could there be intermediate cases between the rigidity of fixed vocabularies and the anarchy of folksonomies?

We also mentioned tools for multifaceted classification, such as feature lattices or componential analysis, and methodologies for ontology alignment (such as one from another domain, using category theory).

Notes from the Preserving Linked Data session

Only four participants! Antoine Isaac (Europeana), Romain Wenz (BnF), Ryan Donahue (Met), Cate O’Neill (Find&Connect)

It appears there are more urgent issues for LODLAM to solve.
In fact, the issues are similar to the ones raised about the WWW long ago. As the WWW survived them, maybe linked data can survive them too. It does, however, seem tricky for ‘reference’ datasets. And what would happen when you re-use others’ data?

Some (only slightly curated) bullet points:

– Basic issue: allowing decentralized data access and use, and preservation beyond the basic requirement of persistent URIs. Data and links can change!

– Handling updates is similar to what happens for historical place names in catalogues (e.g., the scope of “The Netherlands” as of 1821, as opposed to later).

– Preserving context: keeping different levels of truth, different parts of the provenance (time and data producers)

– RDF triples make time and data provenance tricky to represent, unless we go for quadruples (named graphs) or versioned URIs, which both have their disadvantages. BnF more or less tracks provenance manually (on demand).
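
A minimal sketch of the quadruple option, using rdflib named graphs (the graph URIs and the example triple are invented; Dublin Core terms are used for the provenance statements):

```python
# Sketch: attach time and producer provenance to triples by placing them in a
# named graph (quads) and describing that graph in the default graph.
from rdflib import Dataset, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, FOAF

EX = Namespace("http://example.org/")
graph_uri = URIRef("http://example.org/graph/2012-06-20")  # invented graph URI

ds = Dataset()
g = ds.graph(graph_uri)  # a named graph: its triples become quads
g.add((EX["person/123"], FOAF.name, Literal("Jane Doe")))

# Provenance about the graph itself, stored in the default graph.
ds.add((graph_uri, DCTERMS.created, Literal("2012-06-20")))
ds.add((graph_uri, DCTERMS.creator, URIRef("http://example.org/agents/bnf")))

print(ds.serialize(format="trig"))
```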

– For which “versions” of a resource (URI) should we serve representations (data)? There is interest in a “historical GET”, comparable to Memento (www.mementoweb.org); see the sketch after this point.
Basic solution: no versioned URIs for the resource, but keep track of different versions of the representations (RDF data, HTML page). data.bnf.fr uses the Internet Archive to archive its representations (just one canonical representation, RDF/XML, for each URI).
Could we create a dataset of datasets to find their archives back?
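
A minimal sketch of such a Memento-style “historical GET” (RFC 7089), using the protocol’s real Accept-Datetime header; the TimeGate URL is the Memento aggregator’s, which is an assumption here, and the target URI is only illustrative:

```python
# Sketch: ask a Memento TimeGate for the representation of a URI as it was
# at a given date. The TimeGate URL is assumed; the target is illustrative.
import requests

target = "http://data.bnf.fr/"  # resource whose past state we want
timegate = "http://timetravel.mementoweb.org/timegate/" + target  # assumed gateway

resp = requests.get(
    timegate,
    headers={"Accept-Datetime": "Thu, 20 Jun 2013 00:00:00 GMT"},
    allow_redirects=True,
)
print(resp.url)                              # URL of the chosen memento
print(resp.headers.get("Memento-Datetime"))  # when that version was captured
```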

– How do we decide what to preserve and give access to? Everything, every version? Linked data users probably want to get what is “best” for the identifier. And it may change! E.g., deprecating some names in authorities from preferred to alternative.
BnF has some cases where people ask to remove data (birth dates, attributions that are bad for someone’s reputation). In such cases, it’s not really desirable to even keep track of the historical data in the authoritative service.
Should we mint/re-use URIs or an HTTP status code for saying that data was removed? (A sketch using HTTP 410 Gone follows this point.)
Note: cf. OAIS: preservation success is success *for humans*!
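
One existing option, sketched below with an illustrative URI, is the standard HTTP 410 Gone status, which (unlike 404 Not Found) asserts a deliberate, permanent removal:

```python
# Sketch: distinguishing deliberately removed data (410 Gone) from data that
# was simply not found (404). The URI is illustrative.
import requests

resp = requests.get("http://example.org/authority/deprecated-record")
if resp.status_code == 410:
    print("Data was deliberately removed; do not expect it back.")
elif resp.status_code == 404:
    print("Not found; it may never have existed, or may reappear.")
else:
    print("Status:", resp.status_code)
```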

– Examples of linked data that was not preserved?
Probably some Talis datasets.

– Misc. remarks on persistent identifiers.
A trick to preserve identifiers is to embed identifiers inside other identifiers. But this needs some resolver service (see the sketch after this point)!
URI design: the problem of meaning attached to the URI. We need to separate the description function from the identification function.
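
A toy sketch of that trick, with an invented ARK-style key embedded in changing URIs, and a lookup table standing in for the resolver service:

```python
# Sketch: a stable key embedded inside URIs, plus a resolver mapping the key
# to the resource's current location. Table and URIs are illustrative.
import re

CURRENT_LOCATION = {  # maintained by the resolver service
    "ark:/12345/abc42": "http://new-host.example.org/records/abc42",
}

def resolve(uri):
    """Extract the embedded identifier and return its current home."""
    match = re.search(r"ark:/\d+/\w+", uri)
    if match:
        return CURRENT_LOCATION.get(match.group(0))
    return None

print(resolve("http://old-host.example.org/ark:/12345/abc42"))
```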

Introducing a project to publish the Getty Vocabularies as LOD

I am currently leading a project to publish all four Getty Vocabularies as LOD. The four Getty Vocabularies are: the Art & Architecture Thesaurus (AAT)®, the Union List of Artist Names (ULAN)®, the Getty Thesaurus of Geographic Names (TGN)®, and the Cultural Objects Name Authority (CONA)™. We are on track to start with AAT in July of this year. We will then move on to TGN, ULAN and finally CONA. Here is a PDF version of the most current flier – vocab_lod_flier

I am also interested in advice from the LODLAM community on what it takes to build and maintain a successful community of consumers of LOD versions of AAT, TGN, ULAN and CONA. Some of the discussions that would be helpful to us are:

  • Best ways to host and encourage open communication threads regarding things like issues, comments about our ontologies, offers of help, examples of complex SPARQL queries to share (an illustrative query appears after this list), etc.
  • Creating a road-map for community-built, open-source tools for access, contributing, matching, etc.
  • Use cases from the community
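
As one illustration of the kind of query consumers might share (the endpoint URL is hypothetical since the service has not yet launched, and the exact Getty modeling is an assumption; only the SKOS vocabulary itself is standard):

```python
# Illustrative SPARQL query: fetch an AAT-style concept's preferred and
# alternative labels. Endpoint URL and modeling are assumptions.
from SPARQLWrapper import SPARQLWrapper, JSON

query = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?concept ?pref ?alt WHERE {
  ?concept skos:prefLabel ?pref .
  OPTIONAL { ?concept skos:altLabel ?alt }
  FILTER (regex(str(?pref), "rhyta", "i"))
}
LIMIT 10
"""

sparql = SPARQLWrapper("http://vocab.example.org/sparql")  # hypothetical endpoint
sparql.setQuery(query)
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["concept"]["value"], row["pref"]["value"])
```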


Evaluating (and Enhancing) the Draft MODS RDF Ontology

The Library of Congress’ MODS/MADS Editorial Committee recently released a draft MODS RDF Ontology. Because Columbia University Libraries / Information Services uses MODS as its primary schema for our digital collections (particularly those in Academic Commons, our institutional repository), we decided to actively experiment with this RDF ontology, in the hopes that by improving it we will have an easy path forward to migrate our existing metadata into a triple store such as 4store, enrich it, use it as an authority system, and make it available for consumption by others.
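
As a rough sketch of that migration step (the host, port, graph URI and file name are all assumptions), MODS-derived RDF/XML could be pushed into a 4store graph over its RESTful HTTP interface:

```python
# Sketch: load MODS-derived RDF/XML into 4store via HTTP PUT on
# /data/<graph-uri>. Host, port, graph URI and file name are assumed.
import requests

graph_uri = "http://example.org/graphs/academic-commons"  # invented graph name
endpoint = "http://localhost:8000/data/" + graph_uri      # assumed 4store httpd

with open("mods_record.rdf", "rb") as f:                  # illustrative file
    resp = requests.put(
        endpoint,
        data=f.read(),
        headers={"Content-Type": "application/rdf+xml"},
    )
print(resp.status_code, resp.text)
```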

Our initial testing has shown some promise, especially as the Editorial Committee has already worked to address some of our initial concerns (particularly how the ontology favored literals over URIs); that said, it clearly has a ways to go. As such, I would like to propose a session for LODLAM where we could discuss how MODS RDF could be further improved to provide a robust and functional ontology for the LODLAM community.