
The Google deal (down on the Farm)

Edit: Caveat -- while this information was given at a staff meeting, it was announced by Andrew Herkovic (see below) that it was a 'public forum' and that employees could talk about any issues/facts/etc. that were communicated in a public forum with people outside of Stanford. On the basis of this, I posted the following information.

Today at work, there was an all-hands meeting for all library employees who wanted to learn more about the Google/Stanford deal. The talk was conducted by Catherine Tierney (Technical Services) and Andrew Herkovic (Foundation Relations & Strategic Projects) of SUL/AIR.

Google book digitization project -- 5 partners (Harvard, Stanford, UMich, NYPL, Oxford)
Stanford has NOT made a commitment to digitize all of its books, but we "will do as much as we can for as long as we can."
No exact number of books to be digitized given
Oversized materials (such as atlases) have been digitized in the prototype and have been discussed as part of the project, as have accompanying materials to books (maps, fold-outs, etc.), at least in theory

What Google will provide to the public --

  • Works in copyright won't be fully available

  • For copyrighted works -- there will be a click-through to the appropriate OCLC WorldCat record

Approximately 10% of Stanford's overall collection is clearly out of copyright; other material in the public domain (such as U.S. government documents) will be included in the project
Google will be responsible for determining what's in copyright and what's not if there are any questionable materials; copyright status will drive what will be fully displayed
There's no special provision to fully display material in the last 20 years of copyright
Foreign language texts, including non-Roman languages, will be included

Google will be digitizing Stanford's material on Google's property, using their equipment/protocols and with their staff; the company has not yet been forthcoming as to how the process of digitization will be implemented in detail; however, Google's process is characterized as "industrial-strength digitization"
Google will be responsible for quality control of the scans
A format for the scans has not been decided
De-duplication is not a part of the process, at this time; Stanford is interested in having multiple copies of the same material across various partners
Google is being "coy" about standards and specs; minimums have been given, but little to no fixed specs
We believe that Google will be doing full OCR and indexing of everything they scan for us
Stanford may not mount everything that Google gives to us, but we won't reject scans for having less than perfect accuracy, either

Stanford will receive copies of Stanford's books but won't necessarily be getting the scans from Google's other partners; SUL is under contractual obligations to Google, so we won't/can't give away the digital materials to other projects, such as the Internet Archive or Project Gutenberg; however, we may be able to share our copies with other educational institutions
SUL isn't sure how the digitization will impact ILL
Funding: the scanning will be funded by Google, but the transfer of books to and from Google will have separate library funding
There are currently no plans to publicize the protocols/process to outside institutions -- it will depend in part on the legal landscape that Google/Stanford faces prior to and during implementation

Factors in choosing which collections (or parts of collections) will be digitized:

  1. Current physical space of materials

  2. Percent of material out of copyright

  3. Will the collections end up in SAL3 [Stanford Auxiliary Library 3 in Livermore]

  4. How other projects (such as the Hoover monographs move) could be impacted

  5. Interest by publishers to make copyrighted works fully available

The plan is to start with just a few thousand books; the project will be implemented in stages
Material that is already in electronic format will not be excluded, ideally, but it may later become a factor in what material is chosen

In the short term, Stanford users will get access to the digitized texts via Google Print, just like general users -- in the long term, the scans/digital page images will be mounted on Stanford's servers with enhanced access, as part of SUL's Digital Repository

Impact on Stanford's users:
+ Materials to be scanned will be officially checked out; any books that aren't currently barcoded will have to be routed internally for barcoding before being sent to Google
+ SUL is considering arrangements for alternative access to materials that are in the process of being digitized, but there are no hard plans yet
+ Material may require metered access by Stanford users depending upon copyright issues
+ Each book that is digitized will be KEPT -- our patrons will continue to need those physical books and we will provide for them
+ We've retained the right to decide what we send, and to refuse to digitize material that we believe is too brittle/fragile to survive the process

More information about the project, including a FAQ, will be available at the SUL/AIR website in the near future



What leaps out at me is that usage or a survey of research need is not seen as a factor in choosing which collections will be digitised.

No doubt many people will access digitised items for the curiosity factor when they are available, but it would be interesting to know how many researchers genuinely benefit from this project, if space, not need, is the overriding factor for choosing what to digitise.

Actually, there was a question by one of the map librarians that was very similar to your concern, Fiona, i.e. the person asked how a balance would be achieved between the 'easily digitized' works that are out of copyright and not in high use versus high-demand recent works that would likely be in copyright (and thus wouldn't be fully displayed). Cath Tierney agreed that there is a need for such a balance, but there was no clear plan/protocol to ensure this. No mention was made of a usage survey, or evaluation of current usage statistics as a part of the selection process for the project.

Thanks for the update, Eli. I'm very curious how this might impact ILL in the long term -- seems like a great opportunity for them to streamline their workflow. On the other hand, would Stanford get ANY ILL requests once all their stuff is available through Google? Dunno!

How are they planning to handle math? Tables? Art (in the publishing sense, meaning "anything that isn't text")? What about works where the copyright is divided (e.g., the book got permissions for material included inside it, but said material has a different copyright owner)?

I don't know. I'm starting to get a little worried that Google hasn't thought a lot of things through. This has the feel of some of the gung-ho shortsighted weirdness of the ebook days.

As for non-text elements: there's no word on whether the specs for non-text scanning will be different from pure text. We don't know the protocols yet.

As for works with multiple copyrights, Google will decide whether they believe they can fully display the material. Supposedly, Google hopes to talk some publishers and other rights holders into allowing full displays of their material in Google Print, but there are no guarantees.

Thanks for the update, Eli. Answers some questions, but leaves many open. I'm particularly interested in the format to be used...whether they will be doing some sort of XML markup for the OCR, or some other sort of markup. Ideally they would make both the original scans and the OCR'd text available, but who knows.

And thanks for the note re: my grandfather. I appreciate it.

Hi I'm a librarian at a large university library and I really appreciate this update. It frankly confirms a lot of my suspicions, that this has been a poorly reported story from PR releases and overblown comments. I think the idea behind the project is excellent but I am very concerned that
a) Stanford is paying for any part of this (i.e., transportation to Google)
b) Stanford didn't make provisions to share this with non-profit projects like Internet Archive
c) no mention of preservation (of digital files) at all
d) no standards? could they possibly scan this stuff in some proprietary format? what a nightmare.

and that in general, the digital library community is just being totally shut out of learning anything about this. though I do appreciate that there has been some communication about this. I look forward to the FAQ.

that said, I still hope for the best. "May you live in interesting times" and all.

I enjoy all your comments. I believe most of you are from the library realm. I am a product of the business/marketing realm but now working at a University library digitization center.

Please note the big difference in this situation. Google is a for-profit business that recently went public. They are driven by shareholders, the fast-paced business market, and marketing/public relations, and they make their decisions that way. We at libraries, of all types, are driven by other goals. Very seldom will we understand each other until we think about those foundational differences. Often, doing the right thing is a trade-off against doing a profitable 'thing'.

Thanks for the inside information about this project! --SAA

Do you know if work has actually begun at Stanford to digitize items for Google?


I don't know offhand, but I can ask around the next time I visit the Farm.