Wayback Weirdness

Peter Suber recently linked to a post on the LibraryLaw blog which asked why the Wayback Machine does not seem to archive National Science Foundation pages:

I was just looking on the National Science Foundation's web site to try to find the Index of FOIA Frequently Requested Documents. The Index is mentioned in the NSF's Public Information Handbook. When I couldn't find the Index, I realized the Handbook was written in 1999, and perhaps an older version of the NSF website had a copy of the Index. So I went to the Internet Archive's trusty Wayback Machine, and put in the NSF's web address. Yesterday when I looked at the results page, there were no results, and the statement that the site had been blocked by robots.txt was the only information returned. Today, the Wayback Machine's results page shows each instance when the site was archived, from 1997 to 2005, but when you click on a link, the resulting page is empty and has this message: "We're sorry, access to http://www.nsf.gov/ has been blocked by the site owner via robots.txt."
I thought this was weird, and wrote the NSF webmaster, who wrote back to say this:
NSF blocks all indexing of the site between 7AM and 7PM ET, our peak traffic hours, for the convenience of our users. However, there is no block on the site from 7PM to 7AM ET. This is standard policy for most high traffic sites. The owner of [the Wayback Machine] need only comply with our policy in order to index our pages.
So that made me wonder whether archive.org is aware that NSF has this policy, or whether there might be some other error somewhere. Searching the Wayback Machine for "www.nsf.gov" or "nsf.gov" produces a list of archived pages. Clicking on any of those links earlier today produced a file location error, but right now (some hours later) it's working fine. The earliest available version of the relevant public information page says that the document Susan was looking for is "coming soon", but I couldn't find it even though I went through about six versions of the public information page from 1997 to 2005. The Public Info Handbook actually says:
An index of FOIA Frequently Requested Records will be published, if applicable, on the Home Page under "Public Information - FOIA and Privacy Act Requests." Where possible, this will include an electronic version of the actual records released.
(emphasis mine), so perhaps it was never added. Searching the current NSF site for "frequently requested" does not turn up the index in question, and neither does searching their publications for "FOIA", but I did find a recent management plan (pdf) which includes "Review Agency posting of statements of policy, administrative staff manuals and copies of frequently requested records" in a list of areas "identified for review". So perhaps it's still "coming soon", 9 years on. We are, after all, talking about a government agency.

Incidentally, the NSF's robots.txt file is right where it should be:

# robots.txt for http://www.nsf.gov/
# Change history:

User-agent: vspider
Disallow: /cgi-bin/
Disallow: /stats/
Disallow: /home/nsforg/
Disallow: /awards/
Disallow: /pubsys/data/
Disallow: /search97cgi/
Disallow: /seind98/topdemo.htm
Disallow: /nsf99338/topdemo.htm
Disallow: /home/ebulletin/archive/
Disallow: /sbe/srs/start.htm
Disallow: /web/
Disallow: /geo/
Disallow: /eng/
Disallow: /home/crssprgm/igert/survey/
Disallow: /staff/
Disallow: /ads-cgi/
Disallow: /awardsearch/

User-agent: *
Disallow: /cgi-bin/
Disallow: /nsf99338/topdemo.htm
Disallow: /home/ebulletin/archive/
Disallow: /home/crssprgm/igert/survey/
Disallow: /staff/
Disallow: /ads-cgi/
Disallow: /awardsearch/
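
One way to see what this file permits is to feed it to a robots.txt parser. Here's a minimal sketch using Python's standard urllib.robotparser. The agent name ia_archiver (the name Alexa's crawler is generally known by) and the sample paths I test are my own assumptions for illustration, not anything confirmed by NSF or archive.org.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt shown above, copied verbatim.
ROBOTS_TXT = """\
# robots.txt for http://www.nsf.gov/
# Change history:

User-agent: vspider
Disallow: /cgi-bin/
Disallow: /stats/
Disallow: /home/nsforg/
Disallow: /awards/
Disallow: /pubsys/data/
Disallow: /search97cgi/
Disallow: /seind98/topdemo.htm
Disallow: /nsf99338/topdemo.htm
Disallow: /home/ebulletin/archive/
Disallow: /sbe/srs/start.htm
Disallow: /web/
Disallow: /geo/
Disallow: /eng/
Disallow: /home/crssprgm/igert/survey/
Disallow: /staff/
Disallow: /ads-cgi/
Disallow: /awardsearch/

User-agent: *
Disallow: /cgi-bin/
Disallow: /nsf99338/topdemo.htm
Disallow: /home/ebulletin/archive/
Disallow: /home/crssprgm/igert/survey/
Disallow: /staff/
Disallow: /ads-cgi/
Disallow: /awardsearch/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, path):
    """Return True if the given user-agent may fetch the path on www.nsf.gov."""
    return parser.can_fetch(agent, "http://www.nsf.gov" + path)


if __name__ == "__main__":
    # Sample paths chosen for illustration only.
    for agent in ("vspider", "ia_archiver"):
        for path in ("/", "/geo/", "/staff/"):
            print(agent, path, allowed(agent, path))
```

Nothing in the file mentions a time of day, so a parser gives the same answers whenever it runs.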

The Wayback Machine uses Alexa crawlers, so as far as I can tell the file as shown gives vspider (a commercial spiderbot) more limited access, while every other robot can go to most of the site. The file doesn't change (I checked before and after 7PM ET; same file), so NSF must be implementing their block some other way. F'rinstance, an Apache .htaccess file can serve or block pages depending on the time of day.
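
I don't know how NSF actually does it, but the logic of a time-of-day block is simple enough to sketch. The Python below is purely hypothetical: the function names, the list of crawler tags, the 403 response and the choice to treat 7PM itself as outside the window are all my assumptions, and the caller is assumed to pass in the current Eastern local time.

```python
from datetime import datetime, time

# Peak-hours window quoted by the NSF webmaster: 7AM to 7PM ET.
BLOCK_START = time(7, 0)
BLOCK_END = time(19, 0)

# Hypothetical substrings used to recognise crawlers by user-agent.
CRAWLER_TAGS = ("ia_archiver", "spider", "bot")


def indexing_blocked(eastern_now):
    """True if the given Eastern local time falls in the peak-hours window."""
    return BLOCK_START <= eastern_now.time() < BLOCK_END


def status_for(user_agent, eastern_now):
    """Return an HTTP status: 403 for crawlers during peak hours, else 200."""
    is_crawler = any(tag in user_agent.lower() for tag in CRAWLER_TAGS)
    if is_crawler and indexing_blocked(eastern_now):
        return 403
    return 200


if __name__ == "__main__":
    noon = datetime(2006, 7, 8, 12, 0)
    evening = datetime(2006, 7, 8, 20, 0)
    print(status_for("ia_archiver", noon))
    print(status_for("ia_archiver", evening))
    print(status_for("Mozilla/5.0", noon))
```

A server-side check like this would leave robots.txt untouched, which would fit with the file being the same before and after 7PM.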

So, to sum up: NSF restricts indexing only during peak hours, and the Wayback Machine knows about this and archives the site just fine. The index of frequently requested FOIA records that Susan was looking for, however, does not appear to be available. The person to ask would appear to be the FOIA Officer.

miscellanea | Bill Hooker | 08 Jul, 2006 |
