
Phantom citations

(Caveat lector: Feel free to follow Karen's advice and go read the CIPA analysis on First Monday. I can't think about the USA PATRIOT Act and CIPA at the same time without wanting to take to a fainting couch, so I'll be printing out a copy to read in about 6 weeks. I'll catch up with that much later.)

Time to start moving away from the personal (though intriguing) and back to the professional emphasis. This is sort of a halfway step in between.

I wrote in an earlier post that I had just received a galley for an article that required a lot of edits. Nearly 2 years ago, I wrote a paper about the Internet Archive for one of my library classes. In the course of writing, I learned that my co-workers knew nothing of the Archive, my fellow students hadn't heard anything about it, and there wasn't much press about it in the library literature. I saw a niche to be filled, and after a couple of misadventures (one editor never got back to me after receiving the paper, another misfiled the submission and didn't get my follow-up messages), the paper was finally accepted last summer and put on the back-burner for a future issue.

The linkrot was incredible. The Archive completely reorganized its website and went to a PHP system -- all those links had to be redone. And in the last go-round ... turns out that a couple of my sources had gone from the Open Web to the Deep Web, and one disappeared completely without a trace. The former I really should have recognized as a potential problem (especially given how often I've heard Gary Price on the subject). And I don't know how the citation styles (Chicago, in this case) typically deal with sources that are dead, closed, or behind subscription walls ... so I caved. I managed to find the article on microfiche and changed the citation to a print one.

The completely disappearing source ... that was a different story. It was a PDF of a state government document that had been catalogued by the State Library as being available only through the Archive. But it's gone from the Archive's servers, and there's a robot exclusion file that looks like it was added around the same time that we had an administration change in Sacramento (mind you, I don't know which administration did it). The Archive has a policy of removing any material at the request of the original site's owner, which keeps them out of legal trouble.

Quite frustrating for me. And it also makes me wonder about scholarship. If I understand the point of citation, it's not only for validation (i.e. I'm not making this stuff up), it's also for accessibility. Just as scientific results must be reproducible in order to be considered valid, scholarship requires that we be able to go beyond the piece in hand to read and evaluate what influenced it, whether the evidence offered is in proper context, etc.

Of course, there are many organizations that are working very hard to keep potentially ephemeral resources available to generations of scholars. But there's going to be a lot of material that may be of interest and relevance to you and your research, but may not be accessible to your audience by the time they read your research.

This raises, to me, two sets of questions:

1) Are we going to find a way to cite dead or inaccessible sources? Is the situation with lost electronic resources analogous to rare, out-of-print or non-published manuscripts and other works? Does the scholar say to his audience, "I saw this, I evaluated/critiqued it correctly, this citation was correct at the time it was generated, you have to trust me on this?"

2) Do we encourage all but the most advanced scholars/researchers to stay within the confines of library resources (which is not a guaranteed failsafe in itself)? Do we warn researchers off the Open Web, unless the material they find falls within certain parameters we find to indicate medium- or long-term stability? Do we as researchers play it safe in order to retain our credibility?


Important questions there, Eli, and despite this growing problem I haven't really seen much research addressing the issue. You've also brought up an important point that I've noticed too - sites changing their CMS and moving every document on the server. Almost as bad as linkrot.

1. As backwards as it sounds, my strategy while completing my thesis was to print each and every Internet reference that I used. That way even if it disappeared I would still have a physical copy, should anyone ask for it.

I think this involves too much trust on the part of the researcher though, and could be easily exploited. I have been thinking that maybe I should have downloaded all this info into an offline archive so it could be viewed on screen as I originally saw it, but that opportunity has passed.

2. Good question. I guess when verifying Internet sources to use in research in the first place we should also ask "what is the likelihood that this resource will exist in 2 years, or 5 years? Is the creator dedicated to making sure it's still around?"

If a resource I find does not fall into the categories of newspaper articles (though they may be removed from a site, you can always find a dead tree), research reports (OCLC, OECD, UN etc), or government information, I tend to exclude it from my references. No guarantee that this stuff will stick around any longer, but it's a better bet than most.

Congratulations on the publication. And I feel your pain on the link problem. Just a thought about the one that disappeared, though: if it really was a "state government document," there are probably laws preventing the government from making it disappear entirely (unless I'm misinterpreting what you mean by SGD). It's certainly no help for you given the short turnaround time, but the California ACLU should be able to help you navigate the process to get the state to cough that document back up.

You bring up a good point, Fiona ... I should be saving my online research. Maybe I'll burn it to CD once I'm through.

With regard to #2, and only citing newspaper or government information ... the scary thing is that government info is only slightly less vulnerable in some cases than non-government info. You have deliberate acts, such as the removal of 'sensitive data' after 9/11 and the incident I mention in the post. And then you have benign neglect -- there are so many times when I've gone to the website of an international agency to find out if a publication is still active and there are dead links, pages that haven't been updated in months or years, and absolutely no archives. The concept of e-sunshine is not on solid ground, especially when you look at archival access.

Oh my ... where did this soapbox come from and why is it so high?


Thank you. And, I'm not as well-versed on CA's sunshine laws as I should/would like to be. I don't think the doc was born digital ... thus a few libraries around the state should have it. Having something go out of print isn't a violation of sunshine laws around here (I think). If I remember to do so sometime this spring, I'd like to go to the catalogue entry from CSL's list of publications where my co-worker first saw it, and backtrack from there.

I don't mean to drive you further MAD on this topic, but I happen to think this problem would be a truly amazing topic for an article. How would you research it? Wheee, I don't know. I think an essay or a column on this as a problem, with citations when possible, might be more realistic. But I have to congratulate you for being the first person I know to identify this particular elephant, which is growing exponentially on a diet of the professional scholarship's Web citations!

Vera, just for suggesting it, I should make you a co-author.

Truth be told, I don't instinctively know where to start ... and I have plenty on my plate right now. Maybe it'll keep on the backburner until this fall (i.e. after Atlanta). And if someone jumps into the question in the meantime ... bully for them. I'll read their conclusions with relish.