Flexible/Inflexible Vocabularies, Thesauri, Subject Headings, and Lexicons

Fellow LOD-LAMer and Powerhouse Museum Web Manager and data monkey* Luke Dearnley made a stop at BPOC today to give a presentation to Balboa Park staff about the importance of open data and to encourage these institutions to consider making their datasets public. In the midst of his talk, he brought up how the Powerhouse and other institutions participating in the Museum Metadata Exchange could present their internal-facing vocabularies as food for people using open data (including the institutions themselves) to generate a list of terms that could be analyzed to increase the richness of the vocabulary for all participants in the exchange.

I think I can safely state that the majority of museums use internal lexicons, along with some form of standardized Vocabulary (which I’ll denote with a capital V), including Getty’s AAT, ULAN, and TGN (and the upcoming CONA); Library of Congress Subject Headings; and Museum Nomenclature. The Vocabularies, while comprehensive, are not complete and will never be complete. On the one hand, I have to groan a bit at the thought of sort of creating yet another standard (even if it’s a potentially more extensible and flexible model than any of the “commercial” Vocabularies). On the other, is the duality of having both flexible and inflexible Vocabularies within the same dataset a problem for linked data, or does this provide an opportunity to build upon existing structures (even if it’s not officially sanctioned by the organizing bodies of these Vocabularies)?

Thinking realistically, while Vocabularies have their place in the world, they’re nearly useless when trying to qualify tagging activities, which are just as important to the quality of the dataset as curatorially determined categories, object names, and other descriptors. Also thinking realistically, most data output from Collection Information Systems doesn’t identify which Vocabulary, if any, is in use in a given field. This prevents someone performing data analysis from weighting Vocabulary terms as “preferred terms” against alternate terms (such as those that come from tags or internal lexicons).
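
To make the weighting idea a little more concrete, here is a minimal sketch of what an analyst could do if exported terms carried their source. Everything in it is an assumption for illustration: the `Term` structure, the source names, the weights, and the sample labels are hypothetical, not something any Collection Information System is known to output.

```python
from dataclasses import dataclass

# Hypothetical weights: controlled Vocabularies outrank internal lexicons and tags.
SOURCE_WEIGHTS = {"AAT": 1.0, "LCSH": 1.0, "internal": 0.6, "tag": 0.3}


@dataclass(frozen=True)
class Term:
    label: str
    source: str  # which Vocabulary (or "internal" / "tag") the term came from


def rank_terms(terms):
    """Sort terms so that those from preferred sources come first."""
    # Unknown sources get weight 0.0, so unlabeled terms sink to the bottom.
    return sorted(terms, key=lambda t: SOURCE_WEIGHTS.get(t.source, 0.0), reverse=True)


# Illustrative terms attached to a single object record.
record_terms = [
    Term("comfy chair", "tag"),
    Term("armchairs", "AAT"),
    Term("lounge seating", "internal"),
]
ranked = rank_terms(record_terms)
print([t.label for t in ranked])
```

Without the `source` value on each term, none of this ranking is possible, which is really the point.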

Let’s assume then, if you add in social tagging alongside Vocabularies, you end up with a mass of descriptor terms. Some of these are kind of useless from the greater public’s point of view. Could one assume that a corpus of tags from a number of different locales and types of museum would provide enough data to evaluate against the standards? Could one then increase the value of the Vocabularies by linking the tags to them? I think so, but would the major players (Getty, LOC) be willing to go along with it? That, I think, is more of a political question, but one that will need to be addressed at some point in the future.
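
As a rough illustration of what linking tags to a Vocabulary could look like at its simplest, the sketch below matches tags to Vocabulary labels after normalizing case, accents, and whitespace. The identifiers (`vocab:0001` and so on) and the labels are made up for the example, and real matching would need far more than exact label comparison.

```python
import unicodedata


def normalize(label):
    """Lowercase, strip accents, and collapse whitespace for loose matching."""
    decomposed = unicodedata.normalize("NFKD", label)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.lower().split())


# Hypothetical Vocabulary extract: identifier -> preferred and alternate labels.
VOCABULARY = {
    "vocab:0001": ["armchairs", "arm chairs"],
    "vocab:0002": ["daguerreotypes"],
}


def build_index(vocabulary):
    """Map every normalized label to its term identifier."""
    index = {}
    for term_id, labels in vocabulary.items():
        for label in labels:
            index[normalize(label)] = term_id
    return index


def link_tags(tags, index):
    """Return (linked, unlinked): tags matched to a term id, and the rest."""
    linked, unlinked = {}, []
    for tag in tags:
        term_id = index.get(normalize(tag))
        if term_id is None:
            unlinked.append(tag)
        else:
            linked[tag] = term_id
    return linked, unlinked


index = build_index(VOCABULARY)
linked, unlinked = link_tags(["Arm  Chairs", "Daguerréotypes", "grandma's house"], index)
print(linked)
print(unlinked)
```

The unlinked remainder is the interesting part: it is the pool of terms that a Vocabulary does not yet cover.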

Standardized Vocabularies are important. They serve as handy resources for museums to begin to organize and describe their objects in a consistent way, and thus they are incredibly useful for people using open data to begin to make sense of the holdings. I know Susan Chun has thought an awful lot about this through her work on the Steve project, and she and I have talked about it to a degree. But I would be interested in discussing with others where Vocabularies fit into the linked and open data realm, and if we can or should leverage them as a method for providing a point of entry to the body of related tag descriptors.

*Sorry, Luke. I couldn’t quite remember what your alternate term for “data monkey” was.

Revisiting archival description

Apologies for the brevity of this blog post – I’m keeping this brief to make sure I get it posted before LOD-LAM.

So, archival description.

Archival records are hard to find. They’re often in large bodies of records, difficult to browse through and generally less cut-and-dried than publications, which are intended for public consumption. Archival finding aids are the researcher’s traditional first point of contact, providing background biographical information on the organization and/or personal creator(s), as well as a description of how the records are arranged and of the various levels of organizational hierarchy. They’re useful!

But they’re also a bit old-fashioned, at least as typically implemented. The finding aid structure poses a few issues for linked open data applications.

I see two[1] major problems with current archival description:

  • They’re hierarchical

Most countries’ archival description standards are based on a strict hierarchy from higher levels of description (fonds, etc.) to more precise levels of description (series, sub-series, file, item) with fairly rigidly prescribed relationships between items. The finding aid also assumes a “paper” whole-body approach, rather than a linking approach. This is kind of non-webby, and imposes a stricter order on documents than their creators may have had, in many cases.

(The Australians, of course, are a few steps ahead of the rest of us already.)


Perhaps an even bigger problem, though, is that:

  • They’re imprecise.

This is the real issue, or at least the most immediate issue. Archival descriptions are designed for human eyes in a paper world, and so they’re often encoded with a level of ambiguity that’s difficult for machines to extract. (LOCAH has been doing a great job of identifying points of concern and trying to route around them.)

Archival descriptions have some inherent ambiguity because interpretation of archival holdings is not always cut and dried, but that doesn’t mean that we have to be ambiguous in how we create those descriptions. We can be precise about the ways in which our collections are ambiguous.

I’d love to get a conversation going about revising descriptive standards to enhance precision in finding aids in order to enhance the ability to use them as computer-readable metadata. I can see a number of areas for improvement:

  • More strongly-typed data fields, rather than “fuzzy” fields that can hold a variety of types of subjectively-defined data
  • More focus on “globally-scoped” names rather than “locally scoped” (as pointed out by Pete@LOCAH here)
  • A stricter, clearer inheritance model rather than ISAD(G)’s rule of non-repetition (Thanks to Pete again)
  • Certainly more, which we can talk about at LOD-LAM!
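
To give the first three points above some shape, here is a small sketch, not a proposal for an actual standard. It assumes a hypothetical `Description` record with a typed date field that states its own uncertainty, a creator identified by a URI rather than a local name, and an explicit rule for inheriting values from parent levels. All field names and URIs are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DateRange:
    start_year: int
    end_year: int
    approximate: bool = False  # being precise about the ambiguity itself


@dataclass
class Description:
    identifier: str
    level: str  # e.g. "fonds", "series", "file"
    title: str
    dates: Optional[DateRange] = None
    creator_uri: Optional[str] = None  # globally scoped name (hypothetical URI)
    parent: Optional["Description"] = None


def resolve(desc, field_name):
    """Walk up the hierarchy until some level states the field explicitly."""
    node = desc
    while node is not None:
        value = getattr(node, field_name)
        if value is not None:
            return value
        node = node.parent
    return None


fonds = Description(
    "F1", "fonds", "Example family fonds",
    dates=DateRange(1890, 1950, approximate=True),
    creator_uri="http://example.org/agents/example-family",
)
file_ = Description("F1-S2-12", "file", "Correspondence", parent=fonds)

# The file says nothing about its creator, so the value is inherited from the fonds.
print(resolve(file_, "creator_uri"))
print(resolve(file_, "dates"))
```

With a rule like `resolve`, a machine no longer has to guess which higher-level statements apply to a lower-level record; the inheritance is spelled out rather than implied by non-repetition.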

The extent to which all this can be implemented will depend on the organization, of course – retrofitting older archival descriptions for all of this would be time-consuming, if practical at all. But I think there are a lot of benefits to be gained by changing practices going forward, and I see this as an enhancement to current descriptive standards/practices that can benefit more than just linked open data applications.


[1] Probably more than two, but for now I’ll focus on these.

Some questions

What are the objectives of linked open data for libraries, archives and museums? From the perspective of libraries, the objectives of providing metadata have traditionally been “finding, identifying, selecting and obtaining”; within other domains, I am sure that the same kinds of principles apply. However, there seems to be something lacking in these objectives when seen in the context of the semantic web. What are the new objectives when creating semantic metadata? At NTNU, we think that contextualizing (what contexts surround the item, its facets), comparing, and sharing are important, but there is surely more. (Hats off to Knut Hegna of the University of Oslo for the spark for these thoughts.)

What are the opportunities that arise from moving the technological and conceptual platform away from domain-specific to generic RDF; the move away from “understandable by an expert” to “useful and usable by everyone”? This is particularly relevant when we’ve got content (documents, images, experiences, information) (freely) available online, but also in the cases where we have metadata about other things that aren’t available online.
On a broader note: what does openness actually mean for an organization? It is something that potentially affects everything from staff attitudes to strategy…I don’t think that there is a roadmap for institutions taking the leap into openness.

Beyond OAI-PMH

C & NW RR, a general view of a classification yard at Proviso Yard, Chicago, Ill. (LOC)
OAI-PMH is a great way to ship large "collections" of records between repositories.

The Open Archives Initiative – Protocol for Metadata Harvesting (OAI-PMH) is the foundation on which the IMLS Digital Collections and Content project and the companion Opening History aggregation are built.* Although small increases in the use of OAI-PMH were seen over the course of the project, less than a quarter of IMLS National Leadership grant projects provide item-level metadata using OAI-PMH [1, 2]. In some cases, the absence of projects in the missing 75% is legitimate: they are not collections with readily available item-level metadata (e.g. narrative exhibits, interactives/games, etc.). But this still leaves many projects/collections out of a broader network of resources. OCLC/RLG found a higher percentage (48%) of member organizations using OAI-PMH, but it is unclear how much of their metadata was shared this way [3]. While recognizing that OAI-PMH has been successful at making millions of descriptions available, it’s worth pausing to wonder if 25-50% adoption is good enough.
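
For a sense of how close harvested item-level metadata already is to linked data, here is a minimal sketch that re-expresses one `oai_dc` record as subject/predicate/object triples. The sample record, the `example.org` base URI, and the way the subject URI is minted are all assumptions made up for illustration; this is not how the IMLS DCC aggregation or any particular server does it.

```python
import xml.etree.ElementTree as ET

# Namespaces used by OAI-PMH responses carrying unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# A made-up record, shaped like one <record> from a ListRecords response.
SAMPLE = """<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header><identifier>oai:example.org:item-1</identifier></header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>Classification yard</dc:title>
      <dc:subject>Railroads</dc:subject>
    </oai_dc:dc>
  </metadata>
</record>"""


def record_to_triples(xml_text, base_uri):
    """Turn each Dublin Core element of a record into a (s, p, o) tuple."""
    root = ET.fromstring(xml_text)
    oai_id = root.find("oai:header/oai:identifier", NS).text
    # Assumption: mint the subject URI from the last part of the OAI identifier.
    subject = base_uri + oai_id.rsplit(":", 1)[-1]
    triples = []
    for element in root.find("oai:metadata/oai_dc:dc", NS):
        # element.tag looks like "{http://purl.org/dc/elements/1.1/}title"
        namespace, _, local_name = element.tag[1:].partition("}")
        triples.append((subject, namespace + local_name, element.text))
    return triples


triples = record_to_triples(SAMPLE, "http://example.org/id/")
for triple in triples:
    print(triple)
```

The objects here are still plain strings. Turning "Railroads" into a link to a shared subject heading is where the real work of LOD begins.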

In light of the rapid growth of LOD in the last few years, I’ve been wondering how a large-scale aggregation like IMLS DCC might fit into this environment. Here are a few questions to discuss at #LODLAM:

  • What are the lessons from OAI-PMH that will be important for LOD-LAM?
  • How is the lack of one, common protocol for sharing data a benefit and/or a danger?
  • Will Linked Open Data be “low barrier” for some, but untouchable for many?
  • Can/should we build LOD on top of existing OAI-PMH installations? (see [4, 5])
  • Should we abandon OAI in favor of more web-friendly approaches? (See @edsu Digital Public Library as a Generative Platform)
  • What are the lessons from the Museums and the Machine-Processable Web and Europeana for U.S. organizations?
  • One of the reasons that OAI-PMH succeeded was the support of funders – what should funding agencies tell projects about implementing LOD?

"Nottingham School at the Interstates edge," in Teaching & Learning Cleveland
LOD offers the opportunity to move smaller units of information, quickly, to more access points.

  1. Palmer, C., Zavalina, O., Mustafoff, M. (2007) Trends in Metadata Practices: A Longitudinal Study of Collection Federation. pre-print available at: http://imlsdcc.grainger.illinois.edu/docs/JCDL07_final.pdf
  2. Jett, J.G. (2010). Supplementing OAI-PMH in the IMLS Digital Collections & Content Aggregation. Masters Thesis. Available at: http://goo.gl/SaWPE
  3. Ayers, L., Camden, B. P., German, L., Johnson, P., Miller, C. and Smith-Yoshimura, K. (2009) What we’ve learned from the RLG partners metadata creation workflows survey. Retrieved March 2, 2009 from http://www.oclc.org/programs/publications/reports/2009-04.pdf
  4. Haslhofer, B., & Schandl, B. (2008). The OAI2LOD Server: Exposing OAI-PMH metadata as linked data. International Workshop on Linked Data on the Web (LDOW2008), co-located with WWW. Available at: http://events.linkeddata.org/ldow2008/papers/03-haslhofer-schandl-oai2lod-server.pdf
  5. Haslhofer, Bernhard and Schandl, Bernhard (2010) Interweaving OAI-PMH Data Sources with the Linked Data Cloud. Int. J. Metadata, Semantics and Ontologies, 1 (5). pp. 17-31. Available at: http://eprints.cs.univie.ac.at/73/1/ijmso2010_haslhofer_schandl.pdf

* Disclaimer: these opinions are my own and may not reflect official opinions of the project or my colleagues.

This post has been cross-posted on Inherent Vice.

Crowdsourcing LOD-LAM

The ‘linked’ part of LOD can eat up a lot of time and resources. Machine processing might be able to match up places and subjects to established LOD sources, but disambiguating people and events can be trickier. Crowdsourcing may be one way in which these more complex or subtle relationships can be defined. It’s a topic that is perhaps worth some time at LOD-LAM.

It seems to me that there are at least 4 ways in which crowdsourced data might enrich the LOD offerings of LAMs…

Specifically designed projects
There are lots of examples now of projects created by institutions to enlist the public in extracting structured data from unstructured sources, or adding to existing descriptive data relating to objects, photographs or documents. There are fewer that seek to use crowdsourcing to define relationships between items, or between items and other entities. Any examples?

Machine tags
You can allow users to add semantics to the tags they add to collection items, creating machine tags (or triple tags). This allows people to refer to standard vocabularies in their tags, or define relationships with entities outside of your collection database. Flickr, for example, supports machine tags (you can browse them all here).

Machine tags are of course meant to be read by machines, so they’re not all that human friendly. If you wanted to encourage their use you’d probably want to create tools that simplified their construction, and perhaps some feedback mechanism to demonstrate their significance. That’s basically what I was experimenting with in the Flickr Machine Tag Challenge. People can generate machine tags automatically using my Identity Browser (based on People Australia), add them to Flickr, and keep track of their work via the FMTC scoreboard. Similarly I created a simple tool for generating machine tags from the NLA’s newspapers database.
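
For anyone who hasn’t met them, a machine tag has three parts: `namespace:predicate=value`. The sketch below builds and parses tags of that shape. The regular expression is my own simplified assumption about what counts as well formed, not Flickr’s actual validation rules, and the `dc:creator` example and its URI are invented.

```python
import re

# Assumed shape: namespace:predicate=value, with lowercase names before the "=".
MACHINE_TAG = re.compile(r"^([a-z][a-z0-9_]*):([a-z][a-z0-9_]*)=(.+)$")


def make_machine_tag(namespace, predicate, value):
    """Build a machine tag string, rejecting anything that does not fit the shape."""
    tag = f"{namespace}:{predicate}={value}"
    if not MACHINE_TAG.match(tag):
        raise ValueError(f"not a well-formed machine tag: {tag!r}")
    return tag


def parse_machine_tag(tag):
    """Return (namespace, predicate, value), or None for an ordinary tag."""
    match = MACHINE_TAG.match(tag)
    return match.groups() if match else None


tag = make_machine_tag("dc", "creator", "http://example.org/people/123")
print(tag)
print(parse_machine_tag(tag))
print(parse_machine_tag("sunset"))
```

A helper like `make_machine_tag` is the kind of thing a tool can hide behind a button, so that people never have to type the syntax themselves.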

Distributed linking
One of the good things about Linked Data is that it’s linked! There’s no reason why all the activity has to happen on institutional websites. It may be that the best way of enmeshing your collection in the cloud is to provide clear, persistent URIs and to help people and projects that use your stuff to publish their own research as LOD. Make it a co-operative endeavour rather than a ‘come to our site and help us’ project. This is what I have in mind for Invisible Australians, but we haven’t got very far yet.

Meta linking (I need some better names for these categories)
I’m sure there’s already something like this, but I can’t think of any examples right now (it’s late!). In wondering about where to go with the FMTC, I started thinking about a meta-level biographical linker, which would allow people to define and publish relationships between resources about people on the web. Sort of a semantic bookmarker, RDFa generator, biographical register… Perhaps using tools like LORE or even Zotero.

The point being, of course, that the links or annotations can exist completely separately from the resources they’re describing.

Anyway, I know there are people coming to LOD-LAM with much more experience than me on the crowdsourcing front, so I’d be really interested in having a discussion along these sorts of lines.

Not just collections

It seems that a lot of the LOD-LAM discussion and activity so far has centered on collections — the stuff sitting in collection management systems and databases. This is of course a natural starting point, but it might be good to consider what else there is that would benefit from LOD treatment. In particular, I’m interested in thinking about what we can do with the large amounts of contextual and interpretative material that LAMs produce — online exhibitions, finding aids, fact sheets, publications, encyclopedias, indexes, photo galleries and more. These sorts of things are packed with unstructured data — people, places, events and of course links to the collection dbs. By extracting and exposing these structures and linking into local authority systems and the LOD cloud we could create rich structures for discovery and understanding.

And then there’s material in the cloud — Flickr and elsewhere — as well as the work of our users, captured in blog posts or Zotero libraries. How might we start to develop a set of tools and practices that encourage semantic links across the wider LAM ecosystem?

Topic: Services

I thought it might be a good idea to start to introduce some of the topics we want to cover at the meeting so that people could begin to wrap their heads around these and related ideas.

I am particularly interested in exploring the services that we can create based on LOD that will connect users to resources held or managed by LAMs. There has been a lot of activity (well, at least some activity) around the creation of data sets and vocabularies, but so far little has been done or even speculated on the ways that users will benefit from this linked data.

I would like to spend some time exploring “wild and crazy ideas” about possible services based on LOD. To make things a bit more concrete, I always go back to Vannevar Bush’s MEMEX – with its function of organizing, storing, linking and collaborating around information resources. If MEMEX was a great idea in 1945, what’s the appropriate great idea for 2011? I encourage folks to come to the meeting with ideas in this area.