Flexible/Inflexible Vocabularies, Thesauri, Subject Headings, and Lexicons

Fellow LOD-LAMer, Powerhouse Museum Web Manager, and data monkey* Luke Dearnley stopped by BPOC today to give a presentation to Balboa Park staff on the importance of open data and to encourage these institutions to consider making their datasets public. During his talk, he suggested that the Powerhouse and other institutions participating in the Museum Metadata Exchange could publish their internal-facing vocabularies as fodder for people using open data (including the institutions themselves). Those users could generate a list of terms that could then be analyzed to enrich the vocabulary for all participants in the exchange.

I think I can safely state that the majority of museums use internal lexicons along with some form of standardized Vocabulary (which I’ll denote with a capital V), including Getty’s AAT, ULAN, and TGN (and the upcoming CONA); Library of Congress Subject Headings; and Museum Nomenclature. The Vocabularies, while comprehensive, are not complete and never will be. On the one hand, I have to groan a bit at the thought of creating yet another standard (even if it’s a potentially more extensible and flexible model than any of the “commercial” Vocabularies). On the other hand, I wonder: is the duality of having both flexible and inflexible Vocabularies within the same dataset a problem for linked data, or is it an opportunity to build upon existing structures (even if that isn’t officially sanctioned by the organizing bodies of these Vocabularies)?

Thinking realistically, while Vocabularies have their place in the world, they’re nearly useless for qualifying tagging activities, which are just as important to the quality of the dataset as curatorially determined categories, object names, and other descriptors. Also thinking realistically, most data output from Collection Information Systems doesn’t identify which Vocabulary, if any, is in use in a given field. This keeps someone performing data analysis from weighting Vocabulary terms as “preferred terms” against alternate terms (such as those that come from tags or internal lexicons).
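To make the weighting idea a little more concrete, here is a rough Python sketch of what an analyst could do if an export did label where each term came from. Everything in it is an assumption for illustration: the source labels ("vocabulary", "lexicon", "tag"), the weight values, and the `rank_terms` helper are hypothetical and not drawn from any Collection Information System or standard.

```python
from collections import defaultdict

# Hypothetical weights: illustrative assumptions, not part of any standard.
SOURCE_WEIGHTS = {"vocabulary": 1.0, "lexicon": 0.6, "tag": 0.3}


def rank_terms(descriptors):
    """Rank an object's descriptors by the summed weight of their sources.

    `descriptors` is a list of (term, source) pairs. Terms are compared
    case-insensitively; unknown sources fall back to the lowest weight.
    """
    fallback = min(SOURCE_WEIGHTS.values())
    scores = defaultdict(float)
    for term, source in descriptors:
        scores[term.strip().lower()] += SOURCE_WEIGHTS.get(source, fallback)
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# A made-up record mixing a Vocabulary term, an internal lexicon term, and tags.
record = [
    ("Teapots", "vocabulary"),
    ("teapot", "tag"),
    ("teapots", "tag"),
    ("tea set", "lexicon"),
]
ranked = rank_terms(record)
```

In this sketch the Vocabulary term rises to the top because a matching tag reinforces it, while the unmatched tag sinks to the bottom. Without a source label on each field, none of this is possible, which is the point.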

Let’s assume, then, that if you add social tagging alongside Vocabularies, you end up with a mass of descriptor terms, some of which are of little use from the greater public’s point of view. Could one assume that a corpus of tags from a number of different locales and types of museums would provide enough data to evaluate against the standards? Could one then increase the value of the Vocabularies by linking the tags to them? I think so, but would the major players (Getty, LOC) be willing to go along with it? That, I think, is more of a political question, but one that will need to be addressed at some point in the future.
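As a rough illustration of what "linking the tags" could mean at its simplest, the Python sketch below matches tags to Vocabulary concepts by normalized label. It is only a sketch under stated assumptions: the concept identifiers (`ex:0001`, `ex:0002`) and the tiny `VOCAB` table are invented for the example and are not real AAT or LCSH records, and the plural handling is deliberately crude. A real effort would need proper stemming, multilingual labels, and human review.

```python
import unicodedata


def normalize(label):
    """Lowercase, strip accents, collapse whitespace, drop a trailing 's'."""
    text = unicodedata.normalize("NFKD", label)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = " ".join(text.lower().split())
    # Crude plural handling, applied identically to tags and Vocabulary labels.
    return text[:-1] if text.endswith("s") and len(text) > 3 else text


def build_index(vocabulary):
    """Map each normalized label (preferred or alternate) to its concept ID."""
    index = {}
    for concept_id, labels in vocabulary.items():
        for label in labels:
            index.setdefault(normalize(label), concept_id)
    return index


def link_tags(tags, index):
    """Split tags into those that match a concept and those that do not."""
    linked, unlinked = {}, []
    for tag in tags:
        concept_id = index.get(normalize(tag))
        if concept_id is None:
            unlinked.append(tag)
        else:
            linked[tag] = concept_id
    return linked, unlinked


# Hypothetical Vocabulary: invented IDs with preferred and alternate labels.
VOCAB = {
    "ex:0001": ["teapots", "tea pots"],
    "ex:0002": ["steam engines"],
}
linked, unlinked = link_tags(["Teapot", "steam engine", "shiny"], build_index(VOCAB))
```

The unlinked tags are arguably the interesting output: pooled across many museums, they are the candidate terms that the Vocabularies don’t yet cover.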

Standardized Vocabularies are important. They serve as handy resources for museums to begin to organize and describe their objects in a consistent way, and thus they are incredibly useful for people using open data to begin to make sense of the holdings. I know Susan Chun has thought an awful lot about this through her work on the Steve project, and she and I have talked about it to a degree. But I would be interested in discussing with others where Vocabularies fit into the linked and open data realm, and whether we can or should leverage them as a point of entry to the body of related tag descriptors.

*Sorry, Luke. I couldn’t quite remember what your alternate term for “data monkey” was.