Curation of LOD

These are the session notes (sketchy I’m afraid) for the discussion on curation of linked open data on day 1 of the 2013 LODLAM summit in Montreal.  There are multiple ways to look at curation, and that can be seen in the different slants brought into the mix: curation of the data that the agency or person holds (and its state or fitness for reuse and supply), and the data it is desirable to link to (why, and what does that mean?).  It is no surprise that questions of control and authority emerged, along with questions around reliance and co-contribution.  What is the perfect combination, and how long will those combinations of data complement each other?

Moules, frites, bière
CC-BY Ingrid Mason


The wording in (brackets) is mine from recall.  Please feel free to comment and correct me if I’ve misinterpreted the notes.

  • Who to link to? (whose data to link the data you have to)
  • Why link to them? (is there a working relationship, how much prior collaboration, does this matter?)
  • How good are they? (what is the quality of the LOD you want to use and its relevance to your data?)
  • Who to trust may change over time?
  • Multiple suppliers of data (what to choose?)
  • Ecosystem (developing and changing)
  • Engagement of the curator in the ecosystem
  • Mediator, editor and value add through curation (to the use of LOD)
  • Mappings between different ontologies not just controlled vocabularies
  • Identity – automated linking (issues?)
  • Is VIAF a big enough grid? c/- IFLA hosted by OCLC
  • Wide reliance in (north American) libraries, e.g. OCLC example (Australia has the NLA People Australia service and there is ORCID too)
  • Linking is curation!
  • Is shared curation possible?
  • Institutional support – local, national and global linkages (follow culture, history, economics, language, trade routes and politics and there will be links?)
  • Whose requirements are being met?
  • Who pays for curation?
  • Who or what is a curator (of LOD)?
  • Curating what? (is it the data and the meaning or the interfaces too and the user experience of search and discovery too?)
  • Persistent URIs exist as long as the web exists
  • Quid pro quo – get it (LOD) out quick to get it improved (co-contribution of correction or uptake for testing?)
  • !! Editorial decisions of the consuming organisation !! (of LOD) (this is curation?)
  • “publishing (LOD) with the authority of the institution” (surely this is curation?)
  • Some access is better than no access (is that always true?)
  • Data always links with a person (?) (multiple links to data sources provides diversity and useful redundancy?)
  • Open curation to the masses
  • Curation ups the quality but need good processes to help with cleaning or correction
  • Pressure on public institutions to participate in the commons
  • There is a social dimension between the curator, the community and the LOD ecosystem
  • Can use redundancy (see it as an opportunity) to track errors, support consensus, and enable self-help
  • Unattributed assertions (how to manage these, whether to integrate these, or not to allow them?)
  • Bidirectional (is this always the case, you link to me, I link to you?)
  • Embrace messiness and get over control issues (provide notices where the data hasn’t been checked or gone through curation process?)
  • (Use LOD) to provide supplementary information (see BBC Music)
  • Encode linking and curation as LOD, use the W3C PROV-O ontology for provenance (see the sketch after this list)
  • Social quality – link Geodata – use: ID, City, Picture, Depiction
  • Example: OpenStreetMap
  • Buddy up with citizen curator (akin to citizen scientists)
  • BBC Wildlife’s trust of Wikipedia content – it filled in the gaps
  • See: Connecting the Smithsonian American Art Museum to Linked Data Cloud (US artists)
  • Flavours of LOD from well maintained and quality controlled provenance data to anonymous
  • Issues around how you present your LOD
  • Consumers may trust organisations but may not always want to trace it (the LOD)
  • Attribution and usage (don’t conflate these two concepts for dealing with rights)
  • CC0 is “no rights reserved”, effectively releasing the work into the public domain, whereas CC-BY-NC is an acknowledgement of copyright and defines the nature of the use (as a licence) requiring attribution and non-commercial use
  • Note CC0 likely does not apply under Australian law and possibly also not New Zealand
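
As a concrete companion to the PROV-O bullet above, here is a minimal sketch (mine, not from the session) of how a curatorial linking decision might be recorded as LOD with PROV-O, using Python’s rdflib. The link, curator and URIs are all hypothetical; the point is just that publishing the link as its own resource lets attribution and time of assertion travel with it.

```python
from rdflib import Graph, Namespace, URIRef, Literal
from rdflib.namespace import RDF, XSD

PROV = Namespace("http://www.w3.org/ns/prov#")

g = Graph()
g.bind("prov", PROV)

# A curated link (e.g. an owl:sameAs assertion published as its own resource).
link = URIRef("http://example.org/link/42")        # hypothetical
curator = URIRef("http://example.org/staff/jane")  # hypothetical

g.add((link, RDF.type, PROV.Entity))
g.add((curator, RDF.type, PROV.Agent))
g.add((link, PROV.wasAttributedTo, curator))
g.add((link, PROV.generatedAtTime,
       Literal("2013-06-19T10:00:00", datatype=XSD.dateTime)))

print(g.serialize(format="turtle"))
```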

Making the Case for LOD

These are the session notes (rough I’m afraid) for the discussion on making the case for linked open data on day 1 of the 2013 LODLAM summit in Montreal.  At some point I’d really like to summarise these ideas better or maybe get to a point where it is possible to tell success stories and cautionary tales so that those interested in making or reusing LOD can pick up and expand on the precious work done thus far.

Gold leaf floating caught on the wind
CC-BY Ingrid Mason

The wording in (brackets) is mine from recall.  Please feel free to comment and correct me if I’ve misinterpreted the notes.

  • What are the pain points? (also who feels the pain)
  • Should the O in LOD be K for knowledge and have it rebadged? (perhaps LOD isn’t the terminology for everyone to understand what LOD can do)
  • Explain LOD so people understand it (keep it simple, smarty-pants)
  • Different elevator pitches to stakeholders to get support (headlines for execs perhaps and technical speak for techs?)
  • Internal use case (who will invest and put their support behind you in a LOD project in your organisation)
  • Public use case (who are the public stakeholders and are there any general or specific needs that could be filled with LOD)
  • Listening (to stakeholders, to others experience, etc)
  • Benefits? (work out what these are and who will value what you do)
  • Responsibility? (who leads this work and/or needs to be involved to make it a success)
  • Demystifying LOD for stakeholders (non-tech speak and maybe outcomes in lay terms)
  • Keep LOD ‘under the hood’ (see slide 80, ALIAOnline Practical Linked (Open) Data for Libraries, Archives and Museums, to see how the web view and the underlying linked data are presented)
  • Who for? (make sure it is clear who the audience is for LOD project)
  • Why? (be clear about the goals for a LOD project)
  • What? (have a good think about what data to generate and integration and why)
  • Issues? Backlogs of wobbly data (this is very common and often underestimated, so perhaps including this in a LOD project outline ensures this doesn’t turn into a SNAFU)
  • Type of project – demo or BAU? (depends on how much traction with key supporters and how experimental a LOD project is)
  • Creative Commons (0), revenue risk (something to do with pressure around capacity to generate income if data isn’t CC0 (which is valid in the US but not Australia or NZ, btw))
  • Focus on your own data – less risk and less cost
  • Example, BBC Music – point out (use other LOD)
  • Users – what are their drivers?
  • Find ways to communicate to them (the users) e.g. via discovery
  • Scale – take care with this – ecosystem grows
  • Metrics e.g. AustLit.edu.au  (to justify investment and uptake)
  • What legal or funding requirements need to be surmounted to enable the data to be released as LOD?
  • Deal with rights and costs upfront (and offer value or benefits)
  • Attribution – how to deal with this or ask for it
  • Galaxy Zoo and gamification of the classification of galaxies
  • Work acknowledgement (perhaps rather than at triple level, which seems quite insane)
  • Figshare as an example (of the strength of openness in support of scholarly communication)
  • Scholarly practice and new practices of tagging (as part of a LOD project?)
  • Some ideas based on experience with e-artexte by artexte (small non-profit)
  • Problem: (how to get moving and get support)
  • Agree to be a guinea pig (this is a perfect idea)
  • Find advocates in the community
  • Publishing and visibility (catalogues online via website) (LOD apparent in search interface too?)
  • Work with a partner (Concordia), extension of library service (piggy back)
  • Solution: (what they did)
  • Open access repository (see news release)
  • Lots of outreach (getting buy-in and engagement by long term partners and supporters)
  • Next steps: (building on success)
  • Research projects (taking on new ideas)
  • Success stories (these are needed for LOD projects that hit the spot!)
  • Ways to work with technophobes “helps me do something I already do” (solve a problem with LOD?)
  • Works for open data (Wikimedia), can work with linked open data
  • Who to convince? (what do you need: money, permission, technical partners, registrar time?)
  • Who to trust? (what and who are you relying on and have you relied on them before?)
  • How to manage the question of authority? (publish your own LOD because you created it and monitor that which you integrate or ingest externally)
  • Deliver to core user stories (don’t go off into the wilds unless you’ve been funded to)
  • Prototype stage (is this Agile, i.e. make sure if you have key stakeholders they’re fully engaged)
  • Keep (iterating and checking?)
  • Talk about enhancement of services (competition?)
  • Kickbacks, and feedback loops (look at how to make the most from what you have?)
  • Need to be able to demonstrate (keep the focus and the make the scope small)
  • Social – embedding your knowledge (into the LOD?)
  • Embed LOD in the tools people are already using
  • Attach LOD and allow it to emerge by stealth (trickery)
  • We need to consolidate stories for each to use (write these up)
  • Use the design pattern library

Notes from Normalizing Licensing (and Data) Models

This was originally Normalizing Licensing and Data Models, but we decided that was too much to take on in one session. We had about 15 participants. I did my best to lead this session though was admittedly a bit exhausted! And now I’ve let too much time go by before getting my notes in here.

I started by describing some of the work we’re doing at Historypin to create metadata crowdsourcing and annotation tools for the public and in particular cultural heritage institutions. We talked briefly about our current efforts to consider the data models of Europeana and DPLA, as well as Open Annotation, and how we might incorporate some of this in as simple a way as possible, as we don’t want to differentiate between individuals and institutional contributors. I threw out this worksheet for comparing licensing across various platforms and would welcome anyone to add other examples to it (thanks Antoine Isaac for adding a bit to this already).

I think we agreed that we’ve come a long way from where we were 2 years ago at the last summit, when the 4-star scheme of open licensing of metadata was launched. Jerry Persons talked about Stanford policy and also about the week-long workshop they held in July of 2011 recommending CC0 for all bibliographic metadata.

We talked a bit about international issues of copyright and licensing, with Chris of Digital New Zealand weighing in with the very good point that CC0 is not an option in New Zealand, or at least not respected by New Zealand law. Romain from French National Library echoed this issue for France.

Romain also talked about what is copyrightable at all, noting that courts in France have tested the difference between non-intellectual or creative content and fact, which we agreed there is international precedent for, and I pointed out that we (at Historypin) are following the lead of the DPLA on this front.

From here we ventured a bit into creating and encouraging a culture of sharing in which institutions/individuals that share with open licensing could get some recognition, as well as some potential centralized site for tracking changes. We discussed the Cooper Hewitt release on to Github, though it was pointed out that Github was putting a 15 MB limit on files. The OpenGLAM Data Hub could be a great shared source for us to list content. We talked about the importance and potential of combining forces across GLAMs internationally, agreeing that this would be a good place to share and, as importantly, to show uses of and improvements to metadata.

We touched briefly on burnout among content providers who work very hard to release datasets only to see no one use them, or never hear about reuses of the datasets; encouraging this kind of community and circling back is therefore critical.

I’m sure I missed a ton, please feel free to make additions/corrections/etc in the comments or in the notes doc directly.

Starting to put the pieces together, with the help of the LODLAM 2013 community

Bonjour LODLAM community!

It has been a great pleasure to meet you all at LODLAM 2013 this spring/summer!  Now, I am trying to recall this amazing event and trying to put the pieces together after two weeks.

The LODLAM 2013 Timetable (draft) is what I have tried to rebuild for an overall picture. As you can see from the table, several facilitators’ names are missing. If you know who facilitated a session and would like to help me rebuild this landscape, please do not hesitate to drop me a note. Also, the 7 colors of the session blocks in the table reflect my own understanding of the issues; if you do not agree with my classification (especially from the view of the session facilitators), please let me know.

To witness this community doing something entirely different feels fantastic!

with great appreciation

andrea huang from taipei taiwan

Presentation of MisMuseos project in LODLAM Challenge Final

Mismuseos.net: Art After Technology (putting cultural data to work)

MISMUSEOS.NET, MONTREAL, CANADA.

Website: www.mismuseos.net

PRELIMINARY

LAM: Libraries, Archives and Museums are the places of our collective memory.

LOD: They contain a hidden graph, made up of nodes (entities) and lines (relations), with enormous possibilities for discovery and knowledge.

SEMANTIC WEB: Now we can compute those concealed relationships and expose the connections inside our collective memory graph, under some conditions.

DESCRIPTION OF THE PROBLEM

Museum data are distributed and not connected. There are more than 55,000 museums in 202 countries, and we cannot exploit the capacity of machines with information held in the current formats of knowledge representation.

So, the first part of the problem consists of building a Museums Micro Cloud of Linked, Clean and Curated Data with an underlying Specialized and Unified Graph.

Secondly, we wanted to connect the cultural and educational worlds in a knowledge ecosystem.  In other words, we wanted to valorise the information in our cultural heritage for educational purposes.

Our project shows a way to overcome the challenge of linking all the resources of all museums by making that possibility real for a group of the greatest Spanish museums.

The project is a free-access online solution available at http://mismuseos.net.

MisMuseos.net gathers museum metadata from multiple Spanish public institutions. It is a semantic Museum of Museums.

It works according to the standards of the Semantic Web and the principles of the Linked Open Data Web.

We currently have a collection of seven Spanish Great Museums (a meta-museum), where users can browse over 17,000 pieces of art and 2,650 artists.

Mismuseos.net allows users to find and discover museum-related content, and also to reach related external information thanks to the correlation with other datasets.

MAIN GOAL

The main goal of Mismuseos.net is to present a case of exploitation of Linked Data for the GLAM community through innovative end-user applications, like facet-based searches and semantic context creation, which drastically improve the user experience. The solution is built on GNOSS, a semantic and social software platform with a deep focus on the generation of social knowledge ecosystems and end-user applications in a Linked Data environment.

GOALS

In more detail, the project is guided by the following goals:

  • To put data to work: exploit public datasets and information on museums to generate benefits for users and improve the user’s experience.
  • To link datasets both to enrich content and to generate accurate contexts of information.
  • To clean up, curate, unify, extend and hybridize data into a knowledge domain.
  • To express all the data through a unique graph in the context of a specialized Micro Linked Data Cloud for culture and education.
  • To connect cultural and educational worlds in a knowledge ecosystem through the development of Hybrid or Extended Ontologies designed for that purpose.

DATASETS USED

  • Europeana dataset (CER.ES collection) and the online collections of public Spanish Museums.
  • DBpedia, used to supplement information about authors and to extract information on author and museum locations.
  • Geonames, used to obtain the geolocation data of artists and museums once we have obtained the place names from the primary source or from DBpedia. This information will be consumed in the future to locate them in a map view.
  • Didactalia, an index of over 50,000 educational resources on gnoss.com, linked to provide users with related educational content.

TECHNOLOGY AND MAIN FEATURES:

The solution has been developed on gnoss.com.

The featured applications in Mismuseos.net are:

  • Faceted Searches: The search engine enables aggregated searches by different facets and summarization of results for each successive search.
  • Contexts or related information: enriched content and navigation through graphs.  We have set several contexts depending on the object or entity that the user is viewing, which offer dynamically generated content:

1. Contexts for the entity ‘piece of art’

2. Contexts for the entity ‘artist’

  • Semantic Content Management System (SemCMS): The previous entities (pieces of art, artists and museums) are represented on the platform with their specific ontologies thanks to the Semantic Content Management System, which allows uploading an OWL file describing the concepts and relations within a particular knowledge domain and generates a semantic form with all the classes and properties represented in the OWL file. A minimal sketch of that first step follows below.
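
To make the SemCMS idea concrete, here is a minimal sketch (not the GNOSS implementation) of the first half of that pipeline: reading an OWL file with Python’s rdflib and listing each class with the properties whose domain it is, which is the raw material a generated form would expose. The file name is hypothetical.

```python
from rdflib import Graph
from rdflib.namespace import RDF, RDFS, OWL

g = Graph()
g.parse("museum-ontology.owl", format="xml")  # hypothetical OWL (RDF/XML) file

# For every declared class, list the properties whose rdfs:domain it is;
# each class/property pair becomes a field in the generated semantic form.
for cls in g.subjects(RDF.type, OWL.Class):
    print("Form for class:", g.value(cls, RDFS.label, default=cls))
    for prop in g.subjects(RDFS.domain, cls):
        print("  field:", g.value(prop, RDFS.label, default=prop))
```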

NEXT STEPS:

  • Expand the collections with more museums and galleries
  • Enable multilingual navigation in Mismuseos.net and, as a consequence, extend the cultural domains of the solution.
  • Show the results of a query in more specialized ways (now we can in some cases search on a map, but in the future this should extend to timelines and other forms of presentation). The connection between specialized fields of knowledge has not only an ontological solution, but also a solution based on the way the results are shown.

HOW OUR INNOVATIVE IDEAS WILL ADVANCE THE GLAM COMMUNITY

Direct advances/ benefits:

  • Extending the cultural graph with additional contexts through the connection to other cultural heritage resources of libraries and archives
  • Offering a semantic web publishing service for museums to serve customized semantic webpages with selected data coming from Mismuseos.net.
  • Providing the point of view from the educational world to reflect on the necessary research on ontological engineering in LAM.
  • Providing linked data or semantic contexts to third parties, putting your cultural information inside new platforms and spaces

Other potential advances:

  • Developing personalized cultural assistants depending on the user preferences
  • Offering varied levels of specialized information using different views and searching tools for every kind of user, from kids to researchers
  • Promoting the development of new applications for your exposed linked data, from games to digital books

CONCLUSIONS:

The main problems we have faced in this project have been:

  • Generating a scalable graph that can potentially integrate any Museum
  • Maintaining the datasets and data contained in them
  • Developing uses and presentations of those data appropriate to each environment or user group (Semantic Dynamic Publishing)

We think we have shown a possible way forward to solve this set of problems.

Thank you!


Notes from the Hierarchy Alignment session


I shared some of the work I’ve done at GTN-Québec on hierarchy alignment, along with its strengths and limits, and we discussed ways to go beyond it.

First, our use-case: we are harvesting bibliographic metadata from multiple sources, using different hierarchical thematic vocabularies (such as Dewey, Library of Congress, etc.). We want our users to find resources from all sources, using a classification term from any single vocabulary.

Most of our thematic hierarchies are small (<250 categories) and we have found it possible to align them by hand, using only the notion of related term (RT) and narrower term (NT) (from ISO 2788) across hierarchies. So it is possible to search for terms (and associated resources) along a path of RT/NT relations that may be either internal or cross-vocabulary.

Moreover, we can use a single (rich enough) vocabulary as a “pivot”, so we establish correspondence between each vocabulary and the pivot, rather than between each pair of vocabularies. (We have chosen to use Dewey, which was the largest vocabulary we had to use, as pivot.)
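
A minimal sketch of the pivot idea, under assumed data structures (this is illustrative Python, not our actual code): each vocabulary is aligned only to the Dewey pivot, yet a breadth-first walk over the RT/NT links still reaches equivalents in every other vocabulary.

```python
from collections import defaultdict, deque

# RT links are symmetric; NT/BT could be kept in a separate one-way dict,
# but symmetric edges are enough to illustrate the pivot mechanism.
edges = defaultdict(set)

def relate(a, b):
    edges[a].add(b)
    edges[b].add(a)

# Each non-pivot vocabulary is aligned to Dewey only, never to the others.
relate(("Vegetables", "lcsh"), ("635", "dewey"))   # hypothetical alignments
relate(("Légumes", "local"), ("635", "dewey"))

def reachable(term):
    """Terms reachable from `term` via RT/NT links, breadth-first."""
    seen, queue = {term}, deque([term])
    while queue:
        for nxt in edges[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

print(reachable(("Vegetables", "lcsh")))
# {('Vegetables', 'lcsh'), ('635', 'dewey'), ('Légumes', 'local')}
```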

There are interesting corner cases, but more often than not they can be resolved at a finer level. For example, one vocabulary lumps linguistics and literature for each “foreign” language under that language’s studies. This does not correspond to any of Dewey’s categories, but specific linguistic or literary subcategories can be related as narrower terms to the appropriate Dewey categories. In some more complex cases, some categories find themselves between two Dewey categories, or vice-versa.

In general, our choice has been to let some categories dangle if they can be resolved at the subcategory level, which means that resources at that abstract level cannot be found; thus we have favoured precision over recall. But sometimes, there are no sub-categories to use, and we are reaching the limits of this approach.

In the discussion, we gave many examples of “almost identical” categories, especially across languages; someone mentioned how the German “Gemüse” category (roughly our vegetables) excluded potatoes.  We raised the prospect of codifying exceptions (NOT narrower term), but we agreed using negation was probably more demanding than it was worth.

First, we agreed that for search, lack of precision was a comparatively minor flaw, and that we should distinguish matches by a degree of quality. Path traversals with a cost at each step are a known solution to this problem; but we need also to convey to the users that some results are more or less certainly relevant.
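
As a sketch of the cost-per-step idea (the terms and weights below are hypothetical, not a worked-out scheme): a Dijkstra-style traversal accumulates a cost along the RT/NT path, so results can be shown to users grouped by how certainly relevant they are.

```python
import heapq

# Hypothetical step costs: 0.0 for an exact mapping, higher for looser hops.
graph = {
    "lcsh:Vegetables": [("dewey:635", 0.0)],
    "dewey:635": [("local:Légumes", 0.0), ("dewey:641.35", 0.5)],
    "local:Légumes": [],
    "dewey:641.35": [],
}

def ranked_matches(start, max_cost=1.0):
    """Dijkstra-style traversal; lower accumulated cost = better match."""
    best = {start: 0.0}
    heap = [(0.0, start)]
    while heap:
        cost, term = heapq.heappop(heap)
        if cost > best[term]:
            continue  # stale heap entry
        for nxt, step in graph.get(term, []):
            new_cost = cost + step
            if new_cost <= max_cost and new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(heap, (new_cost, nxt))
    return sorted(best.items(), key=lambda kv: kv[1])

print(ranked_matches("lcsh:Vegetables"))
```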

Chris McDowall mentioned how a resource’s belonging to a category is contextual, and explained how supervaluationism distinguishes context-free “super-truths” from contextual truths. Concretely, this could be translated to element membership being qualified by the number of people asserting it. (This might be related to Yves Raimond’s session on user feedback for categories.)

One big point of agreement is that category equivalence should be determined through Big Data in general, and not by hand. Michel Gagnon proposed using document classification as a way to infer category equivalence; in a private conversation, I mentioned the issue of corpora in different languages, and he proposed using DBpedia terms as a pivot.

We also mentioned the possibility of inferring the taxonomy itself rather than only equivalence between existing taxonomies; could there be intermediate cases between the rigidity of fixed vocabularies and the anarchy of folksonomy?

We also mentioned tools for multifaceted classification, such as feature lattices or componential analysis, and methodologies for ontology alignment (such as this one for another domain, using category theory).

Notes from the Preserving linked data Session

Only four participants! Antoine Isaac (Europeana), Romain Wenz (BnF), Ryan Donahue (Met), Cate O’Neill (Find&Connect)

As it appears, there are more urgent issues to solve for LODLAM.
In fact the issues are similar to ones that were raised about the WWW long ago. As the WWW survived them, maybe LD can survive them too. It does, however, seem tricky for ‘reference’ datasets. And what would happen when you re-use others’ data?

Some (only slightly curated) bullet points:

– Basic issue: allowing decentralized data access and use, and preservation beyond the basic requirement of persistent URIs. Data/links can change!

– Handling updates, similar to what happens for historical place names in catalogues (the scope of “The Netherlands” as of 1821, as opposed to later).

– Preserving context: keeping different levels of truth, different parts of the provenance (time and data producers)

– RDF triples make time and data provenance tricky to represent, unless we go for quadruples (named graphs) or versioned URIs (which have their disadvantages). BnF more or less tracks the provenance manually (on demand).
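
For illustration, a small sketch (my addition, with hypothetical URIs) of the quadruple option, using rdflib’s Dataset in Python: each producer or version gets its own named graph, so provenance stays attached to every triple.

```python
from rdflib import Dataset, URIRef, Literal

FOAF_NAME = URIRef("http://xmlns.com/foaf/0.1/name")

ds = Dataset()

# One named graph per producer/version; the graph URI carries the provenance.
g2013 = ds.graph(URIRef("http://example.org/graph/bnf/2013"))  # hypothetical
g2013.add((URIRef("http://example.org/person/1"), FOAF_NAME,
           Literal("Jeanne Dupont")))

# Each triple can be traced back to the context (graph) it came from.
for s, p, o, ctx in ds.quads((None, None, None, None)):
    print(ctx, "says:", s, p, o)
```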

– Serve representations (data) for which “versions” of a resource (URI)? There is interest in a “historical GET”, comparable to Memento (www.mementoweb.org).
Basic solution: no versioned URIs for the resource, but keep track of different versions of the representations (RDF data, HTML page). data.bnf.fr uses the Internet Archive to archive its representations (just one canonical representation – RDF/XML – for each URI).
Creating a dataset of datasets to find their archives again?

– How to decide what to preserve/give access to? Everything/every version? Linked data users probably want to get what is “best” for the identifier. And it may change! E.g., deprecating some names in authorities from preferred to alternative.
BnF has some cases where people ask to remove data (birth dates, attributions that are not good for someone’s reputation). In such cases, it’s not really desirable to even keep track of historical data in the authoritative service.
Should we mint/re-use URIs or HTTP code for saying that data was removed?
Note: cf OAIS: preservation success is success *for humans*!

– Examples of linked data that was not preserved?
Probably some Talis datasets.

– Misc. remarks on persistent identifiers.
A trick to preserve identifiers is to embed identifiers inside other identifiers. But this needs some resolver service!
URI design: the problem of meaning attached to the URI. We need to separate the description function from the identification one.
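
A toy sketch of that resolver idea (my reading of the note, not an existing service), using ARK-style identifiers embedded in ordinary URLs; the registry and URLs are hypothetical.

```python
import re

# Hypothetical registry mapping persistent inner identifiers to current homes.
REGISTRY = {"ark:/12345/x9j3k": "https://archive.example.org/item/x9j3k"}

def resolve(uri):
    """Extract an embedded ARK-style identifier and look up its current URL."""
    match = re.search(r"ark:/\d+/[\w.]+", uri)
    if match is None:
        return None  # no embedded persistent identifier in this URI
    return REGISTRY.get(match.group(0))  # None if the identifier is unknown

# The outer host can change or die; the inner identifier still resolves.
print(resolve("http://old-host.example.com/ark:/12345/x9j3k"))
```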

Notes on the World War 1 Session

The World War 1 session ran a little over time and spilled out over lunch outside with a lot of talk about the war, literature and linking across data sets. I’ve copied here the people who listed their information on the sheet.

Each entry lists country, project, organisation, contact, and URL where given.

  • EU – Europeana WWI (Europeana), contact “Isaac, A.H.J.C.A.” – http://www.europeana1914-1918.eu/ and http://www.europeana-collections-1914-1918.eu/
  • UK – Trenches to Triples (King’s College), contact Geoffrey Browell – http://openmetadatapathway.blogspot.co.uk/ and http://www.jiscww1discovery.net/
  • Australia – Australia War Literature – http://www.austlit.edu.au/
  • Canada – Out of the Trenches / Au-delà des tranchées (Pan-Canadian Documentary Heritage), contact Pat Riva – http://www.ghamari.net:8080/canada/
  • New Zealand – Remembering WW1 (org ?, contact ?) – ww100.govt.nz
  • UK – Open Metadata gateway (King’s College London Archives), contact Geoff Browell
  • Finland / US – WW1LOD Project (Semantic Computing Research Group / Aalto), contacts Thea Lindquist, Eero Hyvönen et al. – http://purl.org/ww1lod
  • France – Awesome rdf-enabled online library (French National Library), contact Romain Wenz – data.bnf.fr
  • Canada – Muninn WW1 Project, contact Rob Warren – rdf.muninn-project.org/sparql

Where do we go from here?

I suggest that you look at the LODLAM group and sign up to the ww1-lod mailing list. We have had some very good talk about integrating GIS information and integrating data over multiple SPARQL servers.

Keep in touch and keep doing great work!