Something’s Gotta Give: Charleston Conference Proceedings, 2011

About this ebook

The theme of the 2011 Charleston Conference, the annual event that explores issues in book and serial acquisition, was "Something's Gotta Give." The conference, held November 2-5, 2011, in Charleston, SC, included 9 pre-meetings, more than 10 plenaries, and over 120 concurrent sessions. The theme reflected the increasing sense of strain felt by both libraries and publishers as troubling economic trends and rapid technological change challenge the information supply chain. What part of the system will buckle under this pressure? Who will be the winners and who will be the losers in this stressful environment? The Charleston Conference continues to be a major event for information exchange among librarians, vendors, and publishers. As it begins its fourth decade, the Conference is one of the most popular international meetings for information professionals, with almost 1,500 delegates. Conference attendees continue to remark on the informative and thought-provoking sessions. The Conference provides a collegial atmosphere where librarians, vendors, and publishers talk freely and directly about issues facing libraries and information providers. In this volume, the organizers of the meeting are pleased to share some of the learning experiences that they-and other attendees-had at the conference.
Language: English
Release date: October 15, 2012
ISBN: 9780983404347

    Preface & Acknowledgments

    The theme of the 2011 Charleston Conference: Issues in Book and Serial Acquisition was "Something’s Gotta Give!" from the song by Johnny Mercer, of Savannah, Georgia, fame. The theme seemed apropos, given that librarians, publishers, and vendors are certainly encountering several irresistible forces and immovable objects.

    Over 1,373 attendees, as well as 43 webinar participants, came to historic Charleston, SC, November 3-5, 2011, to discuss the issues that are consuming us day to day. There were 9 preconferences, 90 Vendor Showcase exhibitors, 11 Juried Product Development Forums, 14 plenary presentations, 31 lively lunch presentations, 18 poster sessions, 72 concurrent sessions, 16 shotgun sessions (pecha kucha-like), 19 Happy Hour presentations, 14 tech talks, and 14 innovation sessions.

    Several new features were included: poster sessions, shotgun sessions, webinars, and several offsite speakers. The skits brightened the mornings and afternoons and helped us keep our sense of humor!

    With so many program topics, the organizers placed sessions into groups pre-selected by the participants, and this volume follows that organization. Concurrent session papers are organized into areas encompassing Acquisitions and Collection Development, Administration and Management issues, Budgets and Evaluation, Technological issues, and End Users and Usage Statistics. These 94 papers are all excellent and encompass the multifaceted issues that are facing all of us: librarians, publishers, vendors, aggregators, consultants, and others.

    Notable plenary sessions included Michael Keller (Stanford University) discussing the Semantic Web, Mackenzie Smith (then at MIT) discussing Data Papers, Mark Dimunation (Library of Congress) discussing hidden collections, and Robert Darnton (Harvard University) showcasing the Digital Public Library of America. Clifford Lynch (CNI) and Lee Dirks (Microsoft Research) discussed new initiatives, and Scott Plutchak (UAB) brought Paul Courant (University of Michigan) and Fred Dylla (American Institute of Physics) into his popular Executive Roundtable discussion, as did Greg Tananbaum in his session with Kevin Guthrie (ITHAKA) and Anne Kenney (Cornell University). Ann Okerson’s Long Arm of the Law panel with three lawyers drew its usual enthusiastic high marks. The future of online newspapers was the subject of one of the last plenary sessions on Friday afternoon, with a panel including Debora Cheney (The Pennsylvania State University Library), Chris Cowan (ProQuest), Chuck Palsho (NewsBank), and Frederick Zarndt (Global Connexions). Brad Eden (Valparaiso University) spoke on Saturday morning regarding the demise of the status quo, and the Hyde Park Corner with Melody Burton (UBC) and Kimberly Douglas (California Institute of Technology) brought down the house. Clever summaries of many of the meetings were provided (via Skype) by Derek Law (University of Strathclyde).

    The 2011 Proceedings volume has been ably collected, organized, and edited by two incredible women—Beth Bernhardt and Leah Hinds. Our thanks go to them and to all of the helpers at Purdue University Press including Charles Watkinson, Katherine Purple, and Jennifer Theriot!

    Charleston, South Carolina, is a charming city, and we were glad to have everyone with us this past year. We look forward to seeing you all in Charleston in 2012 – November 7-10, 2012! The Charleston Conference website is www.katina.info/conference.

    Katina Strauch, Charleston Conference Founder and Convener

    Bruce Strauch, Owner and CEO of the Charleston Conference

    kstrauch@comcast.net

    Introduction

    The Charleston Conference continues to be a major event for information exchange among librarians, vendors, and publishers. Now in its thirty-first year, the Conference continues to be one of the most popular conferences in the Southeast. Conference attendees continue to remark on the informative and thought-provoking sessions. The Conference provides a collegial atmosphere where librarians, publishers, and vendors talk freely and directly about issues facing their libraries and information providers. All this interaction occurs in the wonderful city of Charleston, South Carolina. This is the seventh year that Beth R. Bernhardt has put together the proceedings from the Conference and the third year for Leah Hinds. We are pleased to share some of the learning experiences that we, and other attendees, had at the conference.

    The theme of the 2011 Charleston Conference was "Something’s Gotta Give!" While not all presenters prepared written versions of their remarks, enough did so that we are able to include an overview of such subjects as content development, consortia, education, out-of-the-box thinking, management, and technology issues. The unique nature of the Charleston Conference gives librarians, publishers, and library vendors the opportunity to holistically examine these and other points of interest.

    Katina Strauch, founder of the conference, is an inspiration to us. Her enthusiasm for the conference and the proceedings is motivating. We hope you, the reader, find the papers as informative as we do and that they encourage the continuation of the ongoing dialogue among librarians, vendors and publishers that can only enhance the learning and research experience for the ultimate user.

    Signed,

    Co-Editors of the 31st Charleston Conference Proceedings

    Beth R. Bernhardt, Electronic Resources Librarian, University of North Carolina at Greensboro and Main Conference Director

    Leah Hinds, Assistant Conference Director

    Plenary Sessions

    The Semantic Web for Publishers and Libraries

    Michael Keller, University Librarian, Director of Academic Information Resources, Founder/Publisher HighWire Press, Publisher Stanford University Press, Stanford University

    Thank you. Good morning, everyone. So, before I start this talk I’d like to offer a few explanations and some thanks. First the thanks. This talk and the work behind it owe a tremendous amount to my colleague, Jerry Persons, who is Stanford’s Chief Information Architect Emeritus. He continues to work on this particular domain with us and for us, and it’s good because of what he’s done.

    The explanation is that this is an introductory talk. Maybe I should ask right now how many of you are quite familiar with the principles of linked data and the semantic web? Please raise your hand. Perfect. So, the rest of you may learn something, and you may come up with lots of questions. There will be a lunch and learn session this afternoon with me, Jerry Persons and Rachel Frick from CLIR, and the DLF Program Officer, right in this room. So if you are burdened and there isn’t enough time because Anthony is being very strict, or I talk too long, please come to that.

    So, semantic web for libraries and publishers. I want to start with the problem set. What is it that we are working with here that is our concern? Frankly, the fact is we have way too many silos. We have red silos. We have concrete silos. We have blue silos. We have grain elevators that look like silos, only bigger—you can imagine who those might be. We in the library and publishing trades have forced readers, some of whom are also authors, to search iteratively for information that they want, need, or think might exist in many different silos that use many different search engines, vocabularies, and forms of user interfaces. We do not make it easier for readers to discover what is locally available, what is more or less easy to access remotely, and everything that might be available.

    We give them better interfaces, including ones that permit refinement of a result, but these interfaces show our holdings more or less at the title level. An example of such an interface is Stanford’s SearchWorks interface, based on Blacklight. Almost simultaneously we show the reader many other tools, some excellent in some ways, all of them good—because we select them, of course—and we suggest that our clients widen their search, to examine the literature more broadly.

    However, no single tool is comprehensive. We routinely do not refer our clients to the web, at least not on our own websites. Our online public access catalogs (OPACs) don’t refer them to the web either, except indirectly, when we have to go out on the web to look at an e-book or some e-information object or database to which we’ve subscribed. While indices and abstracts refer our readers to articles and journals which we may have licensed, we rely on other services, such as SFX and the like, to provide the links to the titles which have been revealed through the search and the secondary publications. So neither our OPACs nor our secondary databases refer directly to more than a tiny percentage of the vast collection of pages that is the World Wide Web. The web of course refers in fragmentary fashion to information resources we might—I emphasize might—have on hand for our readers. And the results of using those secondary publications or secondary databases, which are often very good, involve discovery tools and returns that involve relevance ranking determined in various ways. This presents us with different formats (this one is from XSearch, a locally branded product of Deep Web Technologies) and various options for refinement that may or may not differ from our OPAC refinements, if we have any. We therefore confuse our readership even more. Some of us provide our readers with lots of secondary databases, too many really, for all but a few who are forensic scholars.

    So, here is the Stanford interface. First you see XSearch down below; then we send the reader to select a database, which we organize by topic; and then we send the reader to the whole list. Selecting a database to search is something of an art. That’s why we have good reference librarians and subject specialists. And notice once again that we do not offer the web as a search engine, as an option, and for good reasons. Nevertheless, the discoverable relevant information resources on the web apparently are not part of our repertoire in so far as these interfaces document. And, in the case of Stanford, we offer our readers the choice of 1,113 databases. This could take all day to sort through if one is really assiduous, I suppose. We somehow conspired—well, actually we haven’t conspired; we’re less than a conspiracy—but in some ways we have made the search for information objects very difficult. By we, I mean librarians and publishers. We’ve just not had the tools, the methods, the vision, and yes, the gumption, to try something new.

    The next slide shows a little teeny-weeny, minuscule portion of what’s out there on the web that’s relevant to the economy of teaching, learning, and research that our folks have to sort through sometimes, mostly on their own. This picture, multiplied by maybe 1 billion, changes every day and gets more complicated every day, partly by the addition of new pages, partly by the addition of new sub-pages, and partly, frankly, by some sites just disappearing altogether. And the larger the number of websites indexed by Google or Bing or whatever search engine du jour, the more likely it is that the relevance of the return will be less pointed or precisely matched to what the researcher thought he or she might find. So I return to my earlier statement: we’ve got way too many silos, in way too many places, with too many difficulties of determining what is in the silos, and really with no way to get good returns on what’s in some of the silos that might be relevant. This of course is the service that most of our students start with, particularly the younger and more naïve of them. It too, however, consists of silos.

    Do you think one size fits all in the Google world? It doesn’t. Here are four of the principal silos: one for news, one for Google Books, one for Google Scholar, and one for Google Maps. That’s on top of the Google main database. Google’s main database is huge, growing, and changing all the time. These silos are very large, growing, and changing all the time, but you can’t look at each of these very easily except through some clever interfaces they have provided to these other silos. So given all the silos and search engines, our users—some of whom are authors, some of whom are teachers, many of whom are students, some of whom are people on the street—need us to find a better way. We are wasting their time, and we’re not presenting them with information and information objects they need to have and they think might exist. Facts about the information objects we have acquired or leased, facts about books, articles, films, and so forth that we have published or licensed, need to be found in the wild on the web. Ideally, we librarians and publishers will get the facts about what we have, and what we’re making public for fun or profit, discoverable on the web.

    So let’s look at the problems a little bit. First of all there are too many stovepipe systems. Second of all, there is too little precision with inadequate recall. And third, we are too far removed from the World Wide Web.

    Too many stovepipe systems: The landscape of discovery and access services is a shambles. I’ve shown you a slide to demonstrate that. It cannot be mapped in any logical way, not by us, who are supposed to be information professionals, and certainly not by the faculty and the students who must navigate this chaos. This state of affairs should not be a surprise. It grew up, as did Topsy. It just happened over the last 20, well, actually, over the last 150 years. There is too little precision with inadequate recall. Some of the problems are those various stovepipe systems. The dumbing-down effects of federation often hinder explicit searches. And each interface has its own search refinement trick or tricks. There are numerous overlapping discovery paths hampering full recall. Most of the problem results from limitations in the design and execution of the infrastructure that supports discovery and access. In any given silo, that infrastructure may work very well for what is in the silo, but it doesn’t work very well across all the silos, and certainly not across the web.

    A limiting factor is the problem of ambiguity. Most of our metadata uses a string of bytes to label a semantic entity. Semantic entity: people, places, things, events, ideas, objects. Discovery therefore is based on matching text labels, that is, on keyword searches. Discovery is not based on the meaning of the semantic entities, not based on the inherent meaning of whatever it is that has been labeled.

    For libraries, our fix has been authority files. We have been really assiduous about developing these; they are excellent, and we will make very good use of them in the linked data world. So authority files are authoritative strings, forms of strings: names, organizations, titles, places, events, topics, and so forth. But what about the case where no one-to-one relationship exists between a text-string label and the underlying semantic entity? What about the case where one word has multiple meanings? Take for example the text string Jaguar. All right, so here we have an example. We have the motorcar, the Jaguar, which introduced the XK series in 1996, the E-Jag between ’61 and ’74, and other ones coming out more recently, even though the marque was once owned by Ford and for all I know may still be. There is hardware and software named Jaguar, there was an Atari videogame console called Jaguar, and Mac OS X 10.2 was named Jaguar. In the music world, there was a heavy metal band formed in Bristol, England, in 1979. A Fender electric guitar was named Jaguar, and there was a Jaguar Wright, a singer and songwriter based in Philadelphia. In the military world, there is the Type 140 Jaguar class, a fast attack (torpedo) craft manufactured in Germany. More recently there was an Anglo-French ground attack aircraft called Jaguar, and there was, in the 1950s, the XF10F prototype swing-wing fighter made by Grumman on Long Island. Among the heroes, for those of you who are comic book fans or believe in fantasy in a big way, the Jaguar is a superhero in the Archie comics, and DC Comics’ Impact series also features a character called Jaguar. Of course in football there is the team in Jacksonville called the Jaguars. And now, finally, here is what I think of when I think of Jaguar: it is either a cat or a car for me, but here you have it. That is one illustration, I think, of the vocabulary of names, proper and otherwise, that creates ambiguity.
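
    On the linked data web, the remedy for this kind of ambiguity is to give each sense of a shared label its own identifier. The following is a minimal, illustrative sketch (not from the talk); the example.org URIs are placeholders rather than real identifiers.

```python
# Illustrative only: each sense of the string "Jaguar" gets its own URI,
# so software never has to guess which entity a bare label means.
# The example.org URIs below are placeholders, not real identifiers.
jaguar_senses = {
    "http://example.org/id/jaguar-animal": "Jaguar (the cat)",
    "http://example.org/id/jaguar-cars":   "Jaguar (the automobile marque)",
    "http://example.org/id/jaguar-atari":  "Atari Jaguar (game console)",
    "http://example.org/id/jaguars-nfl":   "Jacksonville Jaguars (football team)",
}

# A keyword index sees one string; the identifiers keep four entities distinct.
for uri, label in jaguar_senses.items():
    print(uri, "->", label)
```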

    The second limiting factor is the fact that we are evolving. We have evolved our systems to record a copy, a copy in our hands, particularly in the library world. So, most of the library metadata focuses on publication artifacts. We identify the responsibility for the creation of the artifact and we list topical headings. We describe it. For simple cases, with an author who has very few titles, things work out pretty well. However, for authors with many titles, with many editions, things are much more difficult. So as complexity increases, precision and recall suffer dramatically, and we live in a very complicated world, as you know.

    Here’s a search that we did on the Socrates interface, the old interface of the Stanford OPAC, on the terms Shakespeare and Hamlet—a very simple search. We get back 811 entries. Unflagging patience marks the task of flipping back and forth between hundreds of brief and full records to sort through the variances of the single entity. We have critical editions. We have 18th- and 19th-century collections of the plays. We have social and historical and literary analyses. We have video and audio recordings of performances. We have reviews and indices of the same. We have treatments of stagecraft and costumes and music. We have the lives and work of others associated with the plays, that is, performers and directors. We have other art forms inspired by the plays. I’ve neglected to add here that we also have a collection of documents, information objects, people, and arguments that refute the idea that Shakespeare wrote anything, including Hamlet.

    We’re too far removed from the World Wide Web. Together our metadata collections make up a big chunk of the dark web, the web that is not indexed. It is clear that visibility on the web promotes dramatic increases in discovery and access. So if you take a look at the traffic against the Flickr images from the Library of Congress and the Smithsonian, you’ll see a lot of traffic. When, in 2002, Google began to acquire an index of articles published through HighWire services, we saw a dramatic increase in the amount of traffic back then, and it has since increased. So this state of affairs is very well known.

    What is our working environment; what are we dealing with here? Take a look at this schematic to see the ecosystem in which publishers, libraries, students, and scholars are involved. Now this is very simple, I grant you, very simple. But you get the point. We have consumers and producers in the upper right-hand corner. We have the publishers and intermediaries taking some of those products and turning them into published works, which they sell or which we somehow acquire in our libraries, and which we then feed back to the students and scholars. And another piece of our ecosystem has to do with the network that we communicate on. Some years ago, many years ago, there was the Internet. There wasn’t much e-discovery or analytical communication going through that. But we had a whole bunch of prophets; three of the most important were Vannevar Bush, in the mid-50s, and Ted Nelson and Doug Engelbart, who predicted what the Internet could become. And then, thanks to another prophet, Tim Berners-Lee, the Internet became a web of pages of information.

    Scholarly journal publishers and some librarians realized early on that there were functional advantages to scholarship and to publishing in the web of pages. Yahoo, Google, and others realized that mining the web of pages by the words off the pages could make a rapidly growing web of pages reveal more through indexing and cataloging. As a matter of fact, indexing won out, as we now know, over cataloging. The web of data is the next big thing in discovering relevant information objects, and the next big thing in empowering individuals, communities, and industries to make better use of information that they or others create. What distinguishes this web of data, this linked data environment, from the web of pages is the principle of identifying entities, virtual and real, and linking them through statements of relationship, which are therefore descriptions that machinery can act upon.

    We are calling this next phase the linked data phase because it is entirely dependent upon statements of relationship and descriptions. But this phase is only a precursor to something even more complex and certainly more difficult to engineer. And that phase is the semantic web, which in theory will allow the machinery to build relationships and descriptions, to interoperate among themselves to satisfy requirements, requirements made by another system or made by a person, albeit without constant interaction with the demanding body, whether it’s a machine or a person. In short, in the semantic web the machines will understand meaning and presumably act upon it. That’s a scary thought.

    So what are the tools that are going to get us there? How do we work to alleviate our problems as information professionals, as librarians and publishers? Here’s the recipe: we identify people, places, things, events, and other entities, including ideas, embedded in the knowledge resources that a research university consumes and produces. We tie those facts together with names and connections. We publish those relationships as crawlable links on the web, crawlable and open for anyone to use. And we build and use applications that support discovery via the web of data. Some of those apps I can describe for you in primitive form today; I’ll show them to you.

    Here’s a pile of words representing, in a very small way, all of the words on the web that most search engines constantly use and constantly index. Good search engines can do a lot with this pile, but the search engines create a perception of relationships based on other factors such as the number of links containing words of interest or the traffic to a site. And from this pile of words, actually from the pile of webpages containing the words, we are going to build this linked environment. The structure of the new environment will be based on the meaning of the relationships.

    Here is an example, a very simple example, of how these relationships can describe a person. This is a graph of Yo-Yo Ma, the great cellist; you’ll see his big blue circle in the center there. This is only a small, tiny bit of the relationships that he has. So he was born in 1955; he is a musician; he loves the city of Paris, which is in a certain country, where the temperature is a certain temperature. He’s made a recording entitled Appalachian Journey, which is a music album. It features, among other things, the music of John Tavener. This is a graph that demonstrates how relations begin to define the elements on this page. Each of these elements has a relationship, through one means or another, through one hop or another, to all the others.

    Linked Data Web

    Here is another one; this is a silly one, and I have to confess there is one aspect of this I really don’t understand, but someone from Scotland will have to elucidate. So, this, in the middle of the picture, is haggis. Absolutely: haggis is a food made in a stomach, literally. It’s a Scottish delicacy, so they say. Crombies of Edinburgh manufacture or make it, and it’s Scottish. It involves a certain amount of whiskey, I presume, before, during, and after. It involves sheep. Robert Burns has apparently written about haggis. The Great Toppings for Pizza I don’t get. I think there is some oatmeal involved, so I’m having trouble putting the oatmeal on the pizza.

    Okay, now some geek talk: RDF triples and URI’s. The Resource Description Framework always expresses a simple sentence (subject, predicate, object) and is a way to describe objects or even ideas on the web. An object or an idea may have many RDF triples describing it, because every one of us has many different relationships and there are many different ways to describe us depending on where we are, who we are, what we are doing, and so forth. And, as I said, objects or ideas need not exist on the web. URI’s: Uniform Resource Identifiers. Like URLs, only stable and steady. These allow machine interaction among web objects, through the various syntactical schemes and protocols used to construct URI’s. So there is a vocabulary, there is a way of expressing URI’s, that is well known and being built up principally by the World Wide Web Consortium, with our support. We need at least three of these to support an RDF triple: subject, predicate, object.
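
    As a concrete illustration of the subject-predicate-object pattern, here is a minimal sketch (not from the talk) that restates a few of the Yo-Yo Ma relationships described earlier as RDF triples. It assumes the Python rdflib package; the example.org URIs and predicates are placeholders, not the vocabulary used on the slide.

```python
# A minimal sketch of RDF triples: each g.add(...) call is one
# subject-predicate-object statement. URIs and predicates are placeholders.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS, FOAF

EX = Namespace("http://example.org/id/")
g = Graph()

ma = EX["yo-yo-ma"]
g.add((ma, RDF.type, FOAF.Person))                   # Yo-Yo Ma is a person
g.add((ma, RDFS.label, Literal("Yo-Yo Ma")))
g.add((ma, EX.birthYear, Literal(1955)))             # born in 1955
g.add((ma, EX.profession, Literal("musician")))      # he is a musician
g.add((ma, EX.recorded, EX["appalachian-journey"]))  # he made this album
g.add((EX["appalachian-journey"], RDF.type, EX.MusicAlbum))

print(g.serialize(format="turtle"))                  # each statement is one triple
```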

    Here is a graph of URI’s with an RDF. The RDF is about Dr. Eric Miller; the green bits are the pointers, and the unhighlighted bits are the syntactical ways of expressing these elements, the elements of this sentence: Who is Dr. Eric Miller? Where is Dr. Eric Miller? Here are the linked data principles: use URI’s as names of things; use HTTP URI’s so that people can look up those names; when someone looks up a URI, provide useful, actionable RDF information; and include RDF statements that lead to other URI’s so that the reader can discover related things.
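
    The "look up a URI, get useful RDF back" principle can be sketched in a few lines. The following is a hedged example, not part of the talk: it assumes DBpedia’s public linked-data service is reachable and honors Turtle content negotiation (behavior that can change), and it uses the requests and rdflib packages.

```python
# A hedged sketch of dereferencing a URI per the linked data principles:
# ask for RDF, parse what comes back, and follow the statements to related things.
import requests
from rdflib import Graph

uri = "http://dbpedia.org/resource/Yo-Yo_Ma"   # assumed public linked-data URI
resp = requests.get(uri, headers={"Accept": "text/turtle"}, allow_redirects=True)
resp.raise_for_status()

g = Graph()
g.parse(data=resp.text, format="turtle")
print(len(g), "triples describing", uri)

# Each triple links out to other URIs a client could dereference in turn.
for subject, predicate, obj in list(g)[:5]:
    print(predicate, "->", obj)
```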

    Back to the problem of library metadata. Our metadata standards are closed. We have spent innumerable hours over 70 years devising these standards, modifying them, and so forth. It is a big industry. But they’re closed. Passive metadata is searchable by word, by string, but it is in the silos. It’s readable, it’s not actionable, it’s passive. The search results are refinable, but they are final. They don’t take you another step; you can’t go beyond the search results of your OPAC, or of very many of the publishers’ systems.

    Here is a comparison. We’re going to spend a little time on the right column there. Semantic metadata is open, or should be; it is dynamic, it is conceptualized, it is living; it’s actionable, it’s not passive; it exists in an environment—an ecology—of lots of these things. It is in the wild, ideally, where it can be used. It’s interactive and responsive; it can take you places; you can do things with it. You can resolve it with words; you can look at it with graphs, or both. It can lead to other queries and other views. So my plea to all of you, and to the worldwide web of libraries and publishers, is to make library bibliographic facts into RDF’s and URI’s, release them into the wild, and make library linked data open—usable by everyone.

    What about publishers? Why would publishers be interested in this? Well, publishers should be interested in aggregation: aggregating their content in their own realms and allowing aggregation of their content within other realms. They could aggregate information beyond their publications, beyond articles and books, to information about conferences that are relevant to the subject of an article or a book, to career building and employment opportunities, to collaborative communities, to commercial and other services, to advertisers—who support research, ostensibly, with specific source materials, processing, and trials—and to produce productive relationships with others. Publishers should want to provide actionable and constantly updated links in support of scholars, teachers, learners, and those in the academic publishing trade. And they should be interested in providing compelling tie-ins that bring users to the publishers themselves.

    Here are some of the entities that are already committed to making accomplishments in this sphere. There are a lot of them, and this is only a small selection: the Associated Press, the United States Department of Defense, C|Net, the Library of Congress, the British Library, Google, Wolfram, Thomson Reuters, Hearst Interactive Media, Novartis, PLoS One, The Guardian, Elsevier, Pearson, the British Museum in London, the BBC, HighWire, Merck, and AstraZeneca.

    I want to specifically mention a few. The British Library, not many months ago, released the entire British National Bibliography in RDF and URI’s. The entire British National Bibliography. This is a tremendous contribution. The Library of Congress has released the Library of Congress Subject Headings and the name authority files as RDF’s and URI’s in the wild. And the subject headings have links to AGROVOC, RAMEAU, other subject thesauri, and the National Agricultural Library subject index. Every personal and corporate entry in the LC name authority file is linked to the Virtual International Authority File, basically OCLC. VIAF is not yet open; it has not yet produced RDF’s and URI’s in the wild that I know of.

    Very significantly, about 18 months ago, the New York Times released into the wild all 500,000, and growing, of its index terms for use by anyone. That is tremendous. That is a whole other vocabulary outside of the ones we usually use. For publishers and libraries content is king. Although none of us should neglect services: services to our readers, our authors, and our institutions. However, if users cannot find content in their own context, there is a problem. Therefore, if you understand users to be readers, authors, teachers, and students, the following Venn diagram suggests the overlaps.

    Now, I believe publishers must make their content visible. Indeed, it’s an imperative, because if the published content is invisible there is no benefit, in tangible or intangible form, to the author, and certainly no benefit to the publisher. This is a PLoS article that was published in 2009 in their journal on Neglected Tropical Diseases. It was semantically enhanced by David Shotton and a few others at Oxford, and all those highlighted elements have information behind them. This, however, is not actionable; this was all hand built. It took 10 person-weeks to build it. It is, I believe, possible that we will be seeing more of these as we do a lot of tagging, as the publishers come up with better ways for semantics to be installed using RDF’s and URI’s. So, eventually you’ll be able to see lots of these, with links from the terms into information resources explaining them. I’ve already mentioned aggregation, and I couldn’t resist putting this slide in front of you. But, for libraries and publishers, aggregation is very important, and I emphasize, as this slide emphasizes, the multiple different forms that information objects might take in a really good aggregation. It doesn’t all have to be articles; it could be documentaries, it could be sounds, it could be webpages, it could be printed and published things.

    So, are we still confused and lost? Do we still have this problem of ambiguity? Well, yeah we do sort of, but there is a way out of it, and this sign in the upper left-hand corner—although it is not readable to most of you—is actually disambiguating a direction. And the point I’m making with this slide is that in the RDF, URI, or in the linked data world, there are very easy ways to make very arcane languages readable. The arcane language in this part of the slide is Irish.

    So what is the web of data’s progress? In 2007 these circles represented the agencies that were broadcasting, publishing URI’s and RDF’s. This is that same environment in 2011. Up here we have hundreds of millions of URI’s and RDF’s occupying gigabytes of content. Now we have hundreds of billions, going to trillions, of these entities out there. Fortunately they don’t take up that much space, because they are very short. So, there’s some encouragement.

    Here is the linked open data value proposition that was developed at a workshop we did at Stanford in late June. Linked open data puts information where people are looking for it: on the web. Linked open data can expand discoverability of our content. Linked open data opens opportunities for creative innovation, scholarly endeavor, and participation. It allows for open and continuous improvement of data, and it creates a store of machine-actionable data on which improved services can be built. Library linked open data might facilitate the breakdown of the tyranny of domain silos. Linked open data can also provide direct access to data in ways that are not currently possible, as well as provide unanticipated benefits that will emerge later as the source expands exponentially. Here is a slide which shows a linked open data application in action.

    It’s from Freebase, a Google company now, and it’s based on bibliographic facts from Stanford and web resources. It is about Stephen Jay Gould. You saw the editions of The Panda’s Thumb. Now you see the description of the book. Now you see excerpts from the book. A lot of them. Now you see a couple of reviews of the book. All of this is being created on the fly using RDF’s and URI’s; it is not hardwired. Here are the RDF’s, and you see there are a whole bunch of them there, that have been built, developed algorithmically for the site, sampled from here and there. Now we go to look at Stephen Jay Gould. We’re looking at the Panda’s Thumb site. Now we’re going to take a look at the site that is associated with the RDF for Stephen Jay Gould. You’ll see a wiki biography of Steve; you’ll see a list of books, some of which are readable on the web, a lot of which are underlined. You’ll see the same environment, his papers, and some of them are highlighted because they’re machine-readable. You see a video; this is where the sound comes up, I hope (video begins playing in background).

    We’ll look at some quotes from Steve that are from books and articles, reviews that he’s written—all of this assembled on the fly using this linked data environment that was built at Freebase. I think we’re going to look back at the papers because I need to show you something about how the papers function can work. These are people who cited this particular article, and you can go to the next tab over and look at the citations. Now we are going to look for Dawkins, Richard Dawkins. It takes a little while for the machine to think; this is not live, by the way, this is a movie. Here we start on the Dawkins slides. All of this is done with linked data, all of it done with bibliographic facts from Stanford and web resources of various kinds. The BnF, the Bibliothèque nationale de France, has created another interesting example using only data that they control, only bibliographic information they control and digitized content from Gallica, and another movie. So now we’re going to look at Victor Hugo, a complicated author, for a variety of reasons, and a very prolific author. You can see his pseudonyms; you can see the sources of the information about Victor Hugo and his output; you can see his works, lots of them, a whole lot of them. On the right, where it says Visualiser, it means this is where you can go to read the title in question or the edition in question. We went to Les Mis, and we’re going to look at the books, an enormous number of editions of Les Mis, hundreds actually, but also their translations. They are, as you know, the basis for operas and for musical productions. Les Misérables appears in anthologies, all of that indexed in this site.

    On Monday, Halloween, the Library of Congress announced a bibliographic framework for the digital age. A new bibliographic framework project will be focused on the web environment, linked data principles and mechanisms, and the Resource Description Framework as a basic data model. They have set down the notion that we’re moving from MARC to linked data; it is going to happen. The value proposition, which is also from that Stanford conference, would promote the following practices. This is 25 people gathered at Stanford from a variety of institutions: We want to publish data on the web for discovery and use, rather than preserving it in dark, more or less unreachable archives that are often proprietary and profit driven. We want to continuously improve data and linked data rather than wait to publish perfect data. We want to structure data semantically rather than preparing flat, unstructured data. We want to collaborate rather than work alone. We adopt web standards rather than domain-specific ones. We use open, commonly understood licenses rather than closed or local licenses.

    This is where we started when we went to the World Wide Web. This is the social web which floats on the World Wide Web but we must pay attention to it in our field. I remind you of what the linked data web looks like, what it is in terms of relationships, and how relationships describe meaning. We’re headed to this; we’re headed to the semantic web. A couple of big ideas that accompany these notions: The first is the ubiquitous computing that is essential and makes it possible for lots of players, people, and institutions around the world to participate. The mobile communications part of that ubiquity is very important, as it allows people to use the linked data web wherever they happen to be. So that is the way that the world is progressing. This is what we don’t want any more of.

    Data Papers in the Network Era

    Mackenzie Smith, Research Director, MIT Libraries

    Good morning. Again, my name is Mackenzie Smith, and I’m a Research Director at the MIT Libraries, where I was, until recently, the Associate Director for Technology Strategy.

    I think you’re going to see some interesting synergies between my talk today and the talk you just heard, because I’m also a linked data person and many of the things we’re going to talk about build on some of the background that Mike Keller just gave you, hopefully in a useful way.

    As background, the MIT Libraries have been involved for many years in developing innovative tools for the content industry, and particularly for libraries, like DSpace, the open-source institutional repository platform, and Simile, which is a set of open-source tools for linked data publishing and visualization on the web. I will talk a little bit more about that later.

    More recently, we have been very involved in thinking about the role of primary research data in scholarly communication and particularly how to apply linked data standards and tools to research data, which is all my way of explaining why I am here to talk to you today about the concept of data papers and why that idea may solve some of the problems we have today in getting the full benefit of research in the network era.

    Why data sharing is important: I’d like to start by explaining why this problem is of such pressing importance today and why I’m here to talk to you about it. The most immediate driver for research sharing is mandates. Many funders are requiring researchers to share their research data now, and there is growing pressure to provide better access to, and accountability for, taxpayer-funded research results. This is also driving a lot of the open access debate. One notable example of this is the National Institutes of Health (NIH) data sharing policy for any grants in excess of a certain amount of money, $500,000 in their case, and this policy has been in place since 2003. Another example: this year, we have a new policy from the National Science Foundation (NSF) about data sharing, and this applies to both PIs of grant projects and the research institutions they work for, which are the official grantees. The new guidelines include a mandatory data management plan which has to be part of every single grant proposal that gets submitted to the NSF. These plans are now part of the competitive review process, so with federal research funding continuing to get tighter every year, we now expect to see data management plans become a competitive advantage for PIs who do a good job with them.

    The NSF guidelines are available online at http://www.nsf.gov/pubs/policydocs/pappguide/nsf11001/gpgprint.pdf. These guidelines say that NSF data management plans have to be explicit about certain things, in particular about the policies and provisions that you are making to share your data, including future reuse, repurposing, and redistribution of that data. The reason for that instruction from the NSF is to get more leverage from that expensive data they are funding the production of, to generate new research and to get greater impact from that funding, which may make pretty good sense to everybody. But from the researcher’s perspective, the really big driver is credit. The core principle of the scientific method is that research should be reproducible to get the best possible science. Reproducible data-driven research is still very difficult to achieve in a lot of scientific disciplines, and it requires changes to workflow and scientific processes, but it is a very good reason for researchers to want to share their data. So they have these two drivers: mandates from their institutions and funders, and also this underlying desire to do better science and make the research reproducible.

    But if data sharing is such a great thing to do, and it is expected in so many cases, why hasn’t it happened already, and why is it not routine today? The reason is that it is still really hard. Even in the Internet age, and with ubiquitous platforms like the web available, it is still hard. Most researchers don’t object to sharing their research data; some do, but most don’t. But things get in the way, a lot of things: fixing data quality problems and documenting the data, usually after the fact when you’ve forgotten many of the steps that you took; losing control of your data and of who is going to do what with it, which is a very serious concern for some researchers; often very serious confidentiality and privacy issues, like HIPAA regulations or protecting the location of an endangered species; and commercial interest, from both private industry and universities, in whatever intellectual property rights might accrue to the data. This causes a lot of confusion about the policies by which data can be made available, which is one of the things NSF asks you to be clear about. This relates a little bit to Mike’s plea for open metadata. Metadata is in many ways the data of the library profession, and he and I agree with the proposal to make it openly available to get the most benefit from it.

    This is also true for the data of science, social science, and humanities research. Basically, we think that most data should be made openly available to get the most benefit from it. What stops researchers from doing this now is a lack of credit for all the extra effort and work that it is going to take to share their data effectively. The scholarly communication system needs ways to count high-quality data as a legitimate part of an individual’s research record and a valuable contribution to science or whatever discipline you are a part of. This has happened in a few cases: if you think of people like Craig Venter and Eric Lander and the Human Genome Project, they have gotten a huge amount of credit for creating that data and then sharing it publicly. More on Lander’s side than Venter’s side, but that is another story. The right cause won. Another barrier to including data in the scholarly communication system is the lack of infrastructure that we have had, which I’m going to talk about in a few minutes.

    So, first I need to say just a few words about what data is and what the state of data is, because I have learned over the years that data means very different things to every single person you talk to. As we well know, anything can be used as data for some purpose, and these days a lot of the public discussion around data, and whether it should be shared or not, is actually talking about business data, like website click streams, and public sector information, which is government-produced data, whereas I am talking about data that underlies research, and particularly scientific research. So, research data typically includes things like observational data, which would be things like sensor readings, telemetry, and survey data. It also includes experimental data, gene sequences, and spectrograms. It can include media, which includes text, images, audiovisual files, or a neuroimage, which is still an image even though it is created by an MRI machine. Simulations are a kind of data, and these are typically software or algorithms rather than numerical kinds of data sets. A key property of a lot of data is that it would be prohibitively expensive or impossible to reproduce. One of the big drivers of sharing data is that you cannot get it back again. If you’re doing climate change research, for example, or if you’re taking sensor data from the ocean to measure temperature and salinity and things like that, you have that data from a moment in time and you can’t just go back in time to get a new sample, because once time has passed that data changes. So, you understand, this is a very time-based type of data. Whereas other kinds of data, like genomic data, can be easily reproduced, so, in fact, there is debate about whether gene data should be shared and kept for long times, because it is getting cheaper and cheaper to just re-sequence a genome than to store an old one, and the techniques get better too. So, there’s a lot of tension in the community over how long to keep data and that type of thing. But sharing in the first place is really not that controversial.

    Also, keep in mind that, far more than text, data can be in standard, or proprietary, or discipline-specific formats, like the FITS format in astronomy—it’s very specific to that discipline. It can even be specific to a particular instrument: one particular confocal microscope has a proprietary format for the data that comes off that microscope, which the maker of the instrument dictated and controls. Data also requires software to do anything with it, and that software can also be standard or common, like the language R for statistical processing, or it can be proprietary and discipline specific. So another important property of data is that typically, without the software, the data is useless. The distinction between data and software is getting very blurry. Data can’t be neatly packaged like a book, so it has very fundamental differences from the kinds of content that we’ve historically dealt with.

    Finally, what do our researchers want to be able to do with this data, to inform how we want to share it? Well, obviously they need to be able to find it (as Mike just explained), evaluate it, process it, analyze it, visualize it, and annotate it. Sometimes they want to reuse it, whether it is to validate an experiment or to do new research of their own, either alone or in combination with other data, which is a very different set of requirements than what we have for text, for articles, books, and more traditional kinds of content.

    On the last point about reusing data, that has become a big driver for data sharing for a few important reasons. First is cost. As I explained, a lot of data can only be produced once, and it is also often very expensive to collect. Take an example like a neuroimaging study, where every single scan with MRI costs a minimum of $1,000. You can only get so many scans in your study, and you may not be able to achieve good statistical significance on your own. If you can combine your data set with other studies that did similar kinds of research, then you get a much bigger pool of data to do your analyses on and much more impressive and believable results. So, there’s a lot of pressure in many scientific disciplines to be able to pool the data to get better results—that is a big driver.

    The second is interdisciplinarity: being able to combine data from different fields, for example, climate change data with economic and population data, to look at the impact of policies and politics on climate change. Those fields do not talk to each other; their data is in very different formats, but there is a growing need to be able to combine it in order to perform important research.

    And third is the growth of computational science, like building better disease models from large aggregations of clinical trial data, as seen with efforts like Sage Bionetworks. If you haven’t looked at that, it is an open access database of clinical trial data from all the big pharma companies, who have decided that the data is actually pre-competitive and that they will get more advantage by aggregating and sharing it so they can mine it than they would if they clung to their data and kept it private.

    So, whatever the reason, integrating data is really important, but it is extremely difficult and labor-intensive today, and, in part, that is because data without meaningful structure and documentation is useless. It is just columns of numbers; you don’t know what it means, and the only person who really does know what it means is the person who created it, and maybe a handful of researchers who worked with them. Solving this problem is not something that third parties like libraries or publishers are going to be able to do after the fact. It has to be part of the research workflow somehow, and that requires better tools and some changes to current research practice. That doesn’t mean, by the way, that there’s not a role for libraries and publishers, and I’ll get to that in a little while.

    Reusable data is all of these things: structured, versioned, and well documented, so that you know exactly what you are getting; formatted for long-term access, so that you know it won’t disappear the next time you need it; archived somewhere, presumably in a library or an archive; findable and citable (to Mike’s point, you need ways of figuring out whether the data exists in the first place, which is not trivial); and legally unrestricted, or with a very clear usage policy, so that you know what you can do with that data once you find it.
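
    Those properties can be captured in something as simple as a structured deposit record. Below is an illustrative sketch only; the field names are invented for this example and do not follow any particular repository’s schema.

```python
# An illustrative checklist of reusability properties for a data set.
# Field names and values are made up for the sketch, not a standard.
dataset_record = {
    "identifier": "doi:10.1234/placeholder",   # findable and citable (placeholder DOI)
    "version": "1.2.0",                        # versioned
    "structure": "CSV, one row per observation, with a column dictionary",
    "documentation": "README and codebook describing collection methods",
    "format": "open, non-proprietary formats for long-term access",
    "archive": "deposited in a library or disciplinary archive",
    "license": "CC0 or another explicit, machine-readable usage policy",
}

missing = [field for field, value in dataset_record.items() if not value]
print("Ready to share" if not missing else f"Still missing: {missing}")
```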

    This brings us to the main concept of this talk, which is the data paper as a way of solving some of the problems I’ve just described. A data paper is "a formal publication whose primary purpose is to expose and describe data, as opposed to analyze and draw conclusions from it." This is a quote from a report on data publication from the NeuroCommons project, which is part of Science Commons, itself part of Creative Commons (http://neurocommons.org/report/data-publication.pdf).

    The point here is that data papers are like traditional research papers in some respects: they are formally accepted, they are peer-reviewed, they are citable entities, and so on. But in other respects they are very different from traditional research articles, because they are not about the research; they are about the data. If data papers catch on, we will start to see sets of papers about particular research projects, some of which are more analytical and some of which are more technical and descriptive. Just in case you think I’m inventing all of this, data papers have been around for quite a long time. For example, the Journal of Physical and Chemical Reference Data, a publication from the American Institute of Physics, started in the early 1970s to describe data about physical and chemical materials of general interest. This is still in publication; we subscribe to it at MIT. So the concept has actually been around for quite a while. But the older journals that date from the print era tend not to be particularly useful in the modern environment—or, not as much as they could be—because what they do is render the data in a print format and then publish that as a PDF page, so what you’re getting is a static visualization of the data rather than the data itself. But it’s getting at the concept that we are talking about.

    More recent forms of data papers are taking more advantage of the Internet and the web, like supporting data downloads. So, take Ecological Archives from the Ecological Society of America. It’s a modern publication. The data itself is open access, but what you see is that you can only download the data, that’s all you can do with it, and the documentation here is very complex and completely unstructured. This is not something a machine can help deal with; you just have to read this long, long, long description of the data and then download it, so we can do better.

    There is also an effort going on at the National Information Standards Organization (NISO) to come up with a new standard for supplementary files for peer-reviewed, published research articles. This is also a necessary step, but it’s really focused more on the paper, and the data is sort of a decoration of the paper in this case. It’s not really a first-class object of its own; the effort is just trying to help standardize how this particular linking gets done. This brings us to some recommendations for independent scholarly publication of data sets: what we can envision data sets becoming in the near future.

    This is a paper from Jonathan Rees at NeuroCommons (http://neurocommons.org/report/data-publication.pdf), and he is trying to identify the key components and requirements for a formal data publication. He claims and recommends that published data should have certain properties: it should be organized, peer-reviewed, and subject to established quality-control measures. This is not something new to the publishing world; we would expect this in anything that’s considered a formal publication. It needs to create a citable entity, something that other researchers can refer to and know will still be there in the future. It needs to establish cross-linking mechanisms with the traditional papers to enforce that they are different but related, the set that I was describing a moment ago. It needs to specify what documentation is required to make the data really usable, so these would be new standards for documentation metadata for data papers, in addition to the discovery metadata we’re familiar with. It would supply standard and, very importantly, interoperable legal licenses for the data sets; examples of those might be the Creative Commons CC0 waiver of rights, so there are no IP claims made on the data at all, or various kinds of attribution licenses, usage licenses, and other techniques. The point here is that it needs to be normative, so that people are sure that they can combine data legally. And then finally, we need an archiving strategy in place so that the data, like the papers and the metadata for the data, stays around long enough to become part of the scholarly record.

    One thing I think we can all agree on is that whatever this infrastructure is for data publishing, it has to be web-based. And to achieve the degree of data interoperability that we want, we need to look at linked data, the set of web standards underlying the semantic web that we were given such a good explanation of just a few moments ago. So what would that infrastructure look like? There are three kinds of infrastructure that I’m going to talk about now that are key to this idea of publishing data, that are already happening, and that we can invest our time and effort in leveraging and building out. The reason I’m here today is to kind of light a fire and see if we can get more progress in these areas.

    The first is identifiers. As Mike explained, the web requires identifiers for resources, or entities, on the web, and those are called URI’s. This is absolutely even truer for linked data than it was for traditional content on the web. In online journal publications we’ve seen some new identifier systems emerge that were developed for publications, like the CrossRef DOI’s, but for data papers we’re going to need more kinds of identifiers, in particular for people. Mike gave a very eloquent description of authority files from libraries, but the truth is that they’re not useful on the linked data web because they don’t have URI’s. Yet. There is an effort that has started called ORCID, the Open Researcher and Contributor ID, which will become a registry of people with globally unique URI’s associated with them that you can start to use in publications. This initiative, ORCID, actually came from the publishing community with the help of some libraries, including MIT, and it is launching next year. The idea here is that all universities and publishing houses would join ORCID and make sure that every researcher they are dealing with has one of these unique identifiers. What is behind this identifier is a profile for the researcher, and the profile data could include library authority data. That would be a fantastic way to seed this registry, but without the URI all that lovely authority data is not usable on the linked data web.
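
    As a hedged sketch of how such an identifier gets used, the code below resolves an ORCID iD to a name through the ORCID public API. The iD shown is ORCID’s well-known example record; the API path and response shape are assumptions based on the public v3.0 documentation and may differ in practice.

```python
# A hedged sketch: treat an ORCID iD as a URI for a person and look up the record.
# The API path and JSON structure are assumptions; verify against ORCID's docs.
import requests

orcid = "0000-0002-1825-0097"                 # ORCID's published example iD
person_uri = f"https://orcid.org/{orcid}"     # the URI you would cite in metadata

resp = requests.get(
    f"https://pub.orcid.org/v3.0/{orcid}/record",
    headers={"Accept": "application/json"},
)
if resp.ok:
    name = resp.json()["person"]["name"]
    print(person_uri, "->", name["given-names"]["value"], name["family-name"]["value"])
else:
    print("Could not resolve", person_uri, resp.status_code)
```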

    In addition to people, we need identifiers for institutions, and there is an effort at NISO called I2 Institutional Identifiers. I don’t think it’s quite as far along as ORCID, but it’s absolutely necessary because in order to apply credit to researchers, you’d need not only URIs for the individual researchers, but also for the institutions that they work for since they move around a lot.

    And finally, we are going to need identifiers for data sets, similar to articles but with some very important twists, like versions of databases, which we did have to deal with a little bit in the article world but which are much, much more prominent for data. And then you’ve got subsets of data sets, such as your big genome database from which you want to refer to just one gene or a set of records you pulled out. And you’ve also got data sets that were derived from multiple data sets, so aggregations. So, anyway, there are lots of variations of what you need to be able to name, but we need standard identifiers to do that, and fortunately there are two. CrossRef DOI’s can be assigned to data, and some data producers are doing that now, and then the DataCite initiative is one that the library community has invested quite a bit in, including the British Library and the California Digital Library. These are both good efforts; they both use the same underlying URI syntax of handle so it is
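
    A hedged sketch of how a dataset DOI can behave as an actionable identifier: DOI content negotiation at doi.org can return citation metadata for many DataCite-registered data sets. The DOI below is a placeholder, and the media type and response shape are assumptions that should be checked against the DataCite documentation.

```python
# A hedged sketch of resolving a dataset DOI to citation metadata via doi.org.
# The DOI is a placeholder; substitute a real dataset DOI before running.
import requests

doi = "10.1234/placeholder"
resp = requests.get(
    f"https://doi.org/{doi}",
    headers={"Accept": "application/vnd.datacite.datacite+json"},
    allow_redirects=True,
)
if resp.ok:
    record = resp.json()                      # DataCite-style metadata (assumed shape)
    print(record.get("titles"), record.get("publicationYear"))
else:
    print("No metadata available for", doi, resp.status_code)
```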
