International Women’s Day is this week, so I was interested to see the following comment on a library IT discussion board:

“Our boss wanted us to find out how many books in our library are authored by men and by women. Is there a way to generate that kind of information?”

Responses on the discussion board included:

  • I do not think this is possible and besides some authors hide their gender
  • Not possible. In a perfect world, in which there were authority records for every personal author name in your local catalogue that included gender information, it would be theoretically possible to answer that question, but the vast majority of authority records still do not contain gender data

This got me thinking.

While MARC has served us well for many decades, it has become an albatross around our necks. We need to remember that MARC was created to automate print catalogue cards at a time when computer storage space was expensive. At the end of the day, a MARC record was designed to load information onto linear tape in a very compact way. Even today MARC Leader encoding options include “Catalogue added from tape”. Publishers, on the other hand, use ONIX, which was developed much later and is an “XML-based standard for rich book metadata”.

As a result, individual author authority information within a MARC environment tends to be pretty thin. Basically you get a name and sometimes dates of birth and death. This can cause confusion, and it is very easy for the wrong author to be matched against a publication. This week I enriched records for an anthology, and it was not always easy to identify the correct author. There are a few Peter Conrads, so which is the Australian Peter Conrad who authored the essay “A Rolf in sheeps clothing”? Was he born in 1945 or 1948? Is the record with the date of birth listed as 1945 the same Peter Conrad as the one dated 1948, or is the 1945 Peter Conrad a duplicate or data error?

Either way, the end result can be messy data that confuses patrons and librarians alike.

While libraries do clean up their records in their own siloed systems, what does this mean for the next generation of cloud-based collaborative systems?

Meanwhile, it is scary that Google is starting to do a much better job with author and publication metadata (see the screen shot at the bottom of this post), and yet there seems to be no conversation among librarians about what this means, what the opportunities are and what the threats are. Why is this, or am I missing something?

An alternative approach

Wikidata (which Google does use) also offers information about authors, and if someone has coded the information it can include gender, nationality and much, much more data in a machine-readable format than is possible in a MARC record. For example compare with

Theoretically, given the great work OCLC Research has been doing with linked data, it should be possible to integrate LC name authorities with information coming from sources such as Wikidata. There is VIAF, but there is VIAF-like cross-referencing within the WorldShare metadata interface.

If this were possible, a library could then see how many books in a collection were authored by women and how many by men, or how many books in the collection have been written by Australians, or Canadians, or Malaysians and so on. BIBFRAME may help. However, our library systems are not yet BIBFRAME compliant, and BIBFRAME itself is being developed and rolled out at a glacial pace.
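As a rough sketch of how such a count might work, assume each book in the catalogue already carries the LC name authority ID of its author, and that Wikidata holds an item for that author with the LC ID recorded (property P244) and a gender recorded (property P21). The function names, the sample holdings and the placeholder IDs below are all made up for illustration. The query is only built as a string here, not sent anywhere, and the result rows are hand-written in the shape of standard SPARQL JSON results.

```python
from collections import Counter


def build_query(lc_ids):
    """Build a SPARQL query string that looks up gender (P21) for
    authors identified by LC name authority ID (P244)."""
    values = " ".join('"{}"'.format(i) for i in lc_ids)
    return "\n".join([
        "SELECT ?lcId ?genderLabel WHERE {",
        "  VALUES ?lcId { " + values + " }",
        "  ?author wdt:P244 ?lcId .",
        "  OPTIONAL { ?author wdt:P21 ?gender . }",
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }',
        "}",
    ])


def gender_by_author(bindings):
    """Map each LC ID to a gender label, given SPARQL JSON result rows."""
    genders = {}
    for row in bindings:
        label = row.get("genderLabel", {}).get("value")
        if label is not None:
            genders[row["lcId"]["value"]] = label
    return genders


def count_books(holdings, genders):
    """Count books per author gender; unknown authors are kept visible."""
    counts = Counter()
    for _title, lc_id in holdings:
        counts[genders.get(lc_id, "not recorded")] += 1
    return counts


# Hypothetical holdings: (title, author's LC ID). The IDs are placeholders.
holdings = [
    ("Book A", "n00000001"),
    ("Book B", "n00000002"),
    ("Book C", "n00000001"),
    ("Book D", "n00000003"),
]

# Hand-written rows standing in for what a query service might return.
sample_rows = [
    {"lcId": {"value": "n00000001"}, "genderLabel": {"value": "female"}},
    {"lcId": {"value": "n00000002"}, "genderLabel": {"value": "male"}},
]

query = build_query(sorted({lc_id for _, lc_id in holdings}))
counts = count_books(holdings, gender_by_author(sample_rows))
print(counts)
```

Note the "not recorded" bucket. It covers the objection raised on the discussion board: where nobody has coded the information, or the author has chosen not to share it, the book is counted as unknown rather than guessed at.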

Meanwhile, Google is using linked data / semantic web and leaping further ahead of us. Compare the Google knowledge graph for the author Julie Murphy to your catalogue (see the screen shot of Julie’s Google knowledge graph at the bottom of this post).

How many catalogues bring in author tweets and images? Furthermore, Google is starting to do a much better job of presenting series information, and of course it also makes recommendations for “more like this” books. I suspect some of the metadata Google uses is being drawn in from publisher ONIX data, while some of it is coming from Wikipedia and Wikidata. Some of this metadata is no doubt coming from other sources, but all of it is being brought in on the fly. This raises an interesting question:

Should we look at linking our holdings into a Google knowledge graph, especially if the library patron is searching Google from within our institution’s IP address range? To be in front of our patrons, is it time to forgo our catalogue and / or discovery interface and use metadata and processes that only use Google?

We know library patrons go to Google before they go to the catalogue, so maybe we need to have a really, really hard look at our metadata and processes. For example, this alternative approach may also enable better linking of our future discovery platforms with Wikipedia, and we also know people tend to turn to Wikipedia before they turn to a library catalogue / discovery platform.


To deliver a more timely, richer, integrated and potentially even more accurate information service, maybe it’s time to take some of our sacred metadata cows out the back to be shot.