The Google deal (down on the Farm)
Edit: Caveat -- while this information was given at a staff meeting, it was announced by Andrew Herkovic (see below) that it was a 'public forum' and that employees could talk about any issues/facts/etc. that were communicated in a public forum with people outside of Stanford. On the basis of this, I posted the following information.
Today at work, there was an all-hands meeting for all library employees who wanted to learn more about the Google/Stanford deal. The talk was conducted by Catherine Tierney (Technical Services) and Andrew Herkovic (Foundation Relations & Strategic Projects) of SUL/AIR.
Google book digitization project -- 5 partners (Harvard, Stanford, UMich, NYPL, Oxford)
Stanford has NOT made a commitment to digitize all of its books, but we "will do as much as we can for as long as we can."
No exact number of books to be digitized given
Oversized materials (such as atlases) have been digitized in the prototype and have been discussed as part of the project, as have accompanying materials to books (maps, fold-outs, etc.), at least in theory
What Google will provide to the public --
- Works in copyright won't be fully available
- For copyrighted works -- there will be a click-through to the appropriate OCLC WorldCat record
Approximately 10% of Stanford's overall collection is clearly out of copyright; other material in the public domain (such as U.S. government documents) will be included in the project
Google will be responsible for determining what's in copyright and what's not if there are any questionable materials; copyright status will drive what will be fully displayed
There's no special provision to fully display material in the last 20 years of copyright
Foreign language texts, including non-Roman languages, will be included
Google will be digitizing Stanford's material on Google's property, using their equipment/protocols and with their staff; the company has not yet been forthcoming as to how the process of digitization will be implemented in detail; however, Google's process is characterized as "industrial-strength digitization"
Google will be responsible for quality control of the scans
A format for the scans has not been decided
De-duplication is not a part of the process, at this time; Stanford is interested in having multiple copies of the same material across various partners
Google is being "coy" about standards and specs; minimums have been given, but little to no fixed specs
We believe that Google will be doing full OCR and indexing of everything they scan for us
Stanford may not mount everything that Google gives to us, but we won't reject scans for having less than perfect accuracy, either
Stanford will receive copies of Stanford's books but won't necessarily be getting the scans from Google's other partners; SUL is under contractual obligations to Google, so we won't/can't give away the digital materials to other projects, such as the Internet Archive or Project Gutenberg; however, we may be able to share our copies with other educational institutions
SUL isn't sure how the digitization will impact
Funding: the scanning will be funded by Google, but the transfer of books to and from Google will have separate library funding
There are currently no plans to publicize the protocols/process to outside institutions -- it will depend in part on the legal landscape that Google/Stanford faces prior to and during implementation
Factors in choosing which collections (or parts of collections) will be digitized:
- Current physical space of materials
- Percent of material out of copyright
- Will the collections end up in SAL3 [Stanford Auxiliary Library 3 in Livermore]
- How other projects (such as the Hoover monographs move) could be impacted
- Interest by publishers to make copyrighted works fully available
The plan is to start with just a few thousand books; the project will be implemented in stages
Material that is already in electronic format will not be excluded, ideally, but it may later become a factor in what material is chosen
In the short term, Stanford users will get access to the digitized texts via Google Print, just like general users -- in the long term, the scans/digital page images will be mounted on Stanford's servers with enhanced access, as part of SUL's Digital Repository
Impact on Stanford's users:
+ Materials to be scanned will be officially checked out; any books that aren't currently barcoded will have to be routed internally for barcoding before being sent to Google
+ SUL is considering arrangements for alternative access to materials that are in the process of being digitized, but there are no hard plans yet
+ Material may require metered access by Stanford users depending upon copyright issues
+ Each book that is digitized will be KEPT -- our patrons will continue to need those physical books and we will provide for them
+ We've retained the right to decide whether to send material, and to refuse digitization of anything that we believe is too brittle/fragile to survive the process
More information about the project, including a FAQ, will be available at the SUL/AIR website in the near future
Comments
What leaps out at me is that usage or a survey of research need is not seen as a factor in choosing which collections will be digitised.
No doubt many people will access digitised items for the curiosity factor when they are available, but it would be interesting to know how many researchers genuinely benefit from this project, if space not need is the overriding factor for choosing what to digitise.
Posted by: Fiona | January 8, 2005 12:11 PM
Actually, there was a question by one of the map librarians that was very similar to your concern, Fiona, i.e. the person asked how a balance would be achieved between the 'easily digitized' works that are out of copyright and not in high use versus high-demand recent works that would likely be in copyright (and thus wouldn't be fully displayed). Cath Tierney agreed that there is a need for such a balance, but there was no clear plan/protocol to ensure this. No mention was made of a usage survey, or evaluation of current usage statistics as a part of the selection process for the project.
Posted by: Eli | January 8, 2005 03:03 PM
Thanks for the update, Eli. I'm very curious how this might impact ILL in the long term -- seems like a great opportunity for them to streamline their workflow. On the other hand, would Stanford get ANY ILL requests once all their stuff is available through Google? Dunno!
Posted by: Vera | January 8, 2005 05:43 PM
How are they planning to handle math? Tables? Art (in the publishing sense, meaning "anything that isn't text")? What about works where the copyright is divided (e.g., the book got permissions for material included inside it, but said material has a different copyright owner)?
I don't know. I'm starting to get a little worried that Google hasn't thought a lot of things through. This has the feel of some of the gung-ho shortsighted weirdness of the ebook days.
Posted by: Dorothea Salo | January 9, 2005 06:26 PM
As for non-text elements: there's no word on whether the specs for non-text scanning will be different from pure text. We don't know the protocols yet.
As for works with multiple copyrights, Google will decide whether they believe they can fully display the material. Supposedly, Google hopes to talk some publishers and other rights holders into allowing full displays of their material in Google Print, but there are no guarantees.
Posted by: Eli | January 9, 2005 09:00 PM
Thanks for the update, Eli. Answers some questions, but leaves many open. I'm particularly interested in the format to be used...whether they will be doing some sort of XML markup for the OCR, or some other sort of markup. Ideally they would make both the original scans and the OCR'd text available, but who knows.
And thanks for the note re: my grandfather. I appreciate it.
Posted by: Jason | January 10, 2005 06:51 AM
Hi I'm a librarian at a large university library and I really appreciate this update. It frankly confirms a lot of my suspicions, that this has been a poorly reported story from PR releases and overblown comments. I think the idea behind the project is excellent but I am very concerned that
a) Stanford is paying for any part of this (ie transportation to Google)
b) Stanford didn't make provisions to share this with non-profit projects like Internet Archive
c) no mention of preservation (of digital files) at all
d) no standards? could they possibly scan this stuff in some proprietary format? what a nightmare.
and that in general, the digital library community is just being totally shut out of learning anything about this. though I do appreciate that there has been some communication about this. I look forward to the FAQ.
that said, I still hope for the best. "May you live in interesting times" and all.
Posted by: jen | January 10, 2005 07:45 PM
I enjoy all your comments. I believe most of you are from the library realm. I am a product of the business/marketing realm but now working at a University library digitization center.
Please note the big difference in this situation. Google is a for-profit business that recently went public. They are driven by shareholders, the fast-paced business market and marketing/public relations, and they make their decisions that way. We at libraries, all types, are driven by other goals. Very seldom will we ever understand each other until you think about those foundational differences. Often doing the right thing is a trade-off against doing a profitable 'thing'.
Posted by: Lou | January 11, 2005 06:47 AM
Thanks for the inside information about this project! --SAA
Posted by: Susan Ariew | February 18, 2005 12:37 PM
Do you know if work has actually begun at Stanford to digitize items for Google?
Posted by: Jill Hurst-Wahl | April 3, 2005 11:30 AM
Jill,
I don't know offhand, but I can ask around the next time I visit the Farm.
Posted by: Eli | April 4, 2005 11:38 AM