open access/open science Category Archive



Monday, 15 March
estimating ullage

Ullage, the word for the empty space at the top of a wine bottle, is Peter Suber's term for the gap between a library's actual holdings and its patrons' access needs. That's a difficult thing to measure, but I might have found a way to estimate it with reference not to patron needs but to all published journals, as follows.

  • In 2003, Kathy Varjabedian at LANL compared the electronic holdings at 12 (large, well funded research) libraries with the ISI Journal Citation Report's top 100 most-cited journals for the previous year, producing an estimate for the ullage of those libraries of between 2% and a startling 54% (or 0% and 40%, if clinical titles were excluded).
  • Also in 2003, Carol Tenopir estimated that there were around 44,000 scholarly journals in publication, just over 21,000 of them "refereed", which is the best proxy that Ulrich's Periodicals Directory allows for "peer-reviewed". Repeating Tenopir's search just now returned 26,677 active, refereed, academic/scholarly journals.
  • Last year, I used a UCOSC dataset from 2004, a curated list of about 3000 titles, to estimate the average subscription price for a peer-reviewed scholarly journal (table 2 here) at $1238/title.

  • Here are some more data from the Library Journal Periodicals Price Survey:

    LJdata.JPG

    Sorry about the jpg, I still can't make MT cope with tables. The spreadsheet is here. In case the image goes awry: the dataset covers more than 5,000 titles from 30 disciplines, and mean price/title is $723 in 2003 and $791 in 2004.
  • The mean serials expenditure for an ARL member institution was around $5.5 million in 2003 and $5.8 million in 2004.

At $1200/journal, $5.8 million¹ would buy subscription access to about 4,800 titles, which is less than 23% of the number of active, refereed, academic/scholarly journals. At $700/journal, ARL members -- some of the largest and best funded libraries in America (indeed, in the world) -- are able to afford access to less than half of the scholarly literature.
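
For anyone who wants to check or vary the numbers, here is a minimal Python sketch of the arithmetic in the paragraph above. The figures are the ones quoted in this post; the function and variable names are made up for illustration. The post does not say which journal count the percentages refer to, so the sketch simply tries both the 2003 figure (rounded here to 21,000) and the more recent 26,677.

```python
# Figures quoted in the post; all names here are illustrative only.
ARL_SERIALS_BUDGET_2004 = 5_800_000  # mean ARL serials expenditure, USD
REFEREED_2003 = 21_000               # Tenopir's estimate ("just over 21,000"), rounded
REFEREED_NOW = 26_677                # repeat of Tenopir's search in Ulrich's


def affordable_share(budget, price_per_title, total_titles):
    """Return (number of titles the budget buys, fraction of total_titles)."""
    titles = budget / price_per_title
    return titles, titles / total_titles


for price in (1200, 700):
    for total in (REFEREED_2003, REFEREED_NOW):
        titles, share = affordable_share(ARL_SERIALS_BUDGET_2004, price, total)
        print(f"${price}/title vs {total:,} journals: "
              f"{titles:,.0f} titles, {share:.0%} of the total")
```

Whichever denominator is used, the affordable share stays well under half, which is the point of the estimate.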

This seems reasonably consistent with the earlier LANL estimate, given that Varjabedian looked only at the top 100 most-cited journals, which must surely be at the top of any research library's "must-have" list.

It's important to point out that what I'm estimating here is not ullage sensu Suber, but rather library holdings relative to all possible holdings. But I would argue that the access needs of all the scholars and other patrons served by ARL libraries are surely a decent proxy for "all possible journals", if not a significantly larger body of information! Put another way, here I am estimating the gap between current access levels and the information availability of a 100% Open Access world.




-------------
¹ This calculation assumes that 100% of the serials budget goes to scholarly journals. That's not true, but I've argued elsewhere that it's at least 90%.



Sunday, 14 March
an interesting mind

This entry is especially for those of my readers who do not work in science or related fields (librarians, publishers, etc), and who are not quite sure why I am so obsessed with Open Science. (Hi, Mom and Dad!)

This is Pawel Szczesny at TED Warsaw, describing for the lay public what Open Science is, and what it can mean. Pawel's is the interesting mind to which I refer in the title. I finally met him in person at Science Online earlier this year, but I have been following him around online for years. He never fails to come at a question or problem from an interesting and useful angle, and his TED talk is just the latest example.



What if?

What if I explain in simple words my research area? What if I point you to additional information so you could learn more and understand the topic I am working on? What if I make sure you have access to all relevant literature for free? What if I make sure you have access to all the data so you can play with it on your own? What if I take off this laboratory coat, so there is one artificial difference less between you and me? What if the only thing that mattered in this game of solving nature's mysteries was skills, knowledge and passion? We have a name for that utopian vision: it's called Open Science.

Do yourself a favour, watch the whole thing.



Sunday, 07 March
Where indeed?

AJ Cann has a post up that neatly summarizes the dilemma facing Open Science advocates/enthusiasts, and asks useful questions arising therefrom. In the current competition-focused environment, says Alan:

Open science is an iterated prisoner's dilemma, which is a messy and unpredictable business. Too unpredictable for most people to try to build a career on. Thinking about strategies which are likely to be successful leads me towards the concept of an open science community rather than unilateral complete openness - a long term multiplayer collaboration. Does such a community already exist? If not, how do we build one?
Having taken a job in biotech, I feel a bit cut off from any such community -- industry is notoriously protective of IP and fond of secrecy besides. I feel a bit of a fraud, for instance, taking part in discussions of Open Science issues on FriendFeed (such as the conversation kicked off by Alan's blog post), knowing that I can't talk openly about my own work. It doesn't keep me from shooting off my yap, of course, but it's a nagging icky feeling -- and I keep getting the meta-feeling that it doesn't have to be this way. Just as secrecy in academia only makes sense within the existing reward structure, secrecy in industry could be at least partly offset by policy decisions that recognize the gains in efficiency that collaboration can bring. I've heard multiple times from multiple sources that industry may close itself off from the rest of the world, but within a company, the teamwork ethic is amazing. Clearly, the value of co-operation is recognized. Why shouldn't that also work for (larger and larger) groups of companies? What you lose by not being the only company to know something from which profit can be made (call it X) is offset by the fact that you might never have learned X without the collaboration -- and in the meantime, the world gets X that much faster.

It seems clear, though, that such top-down decisions are more likely to be made in academia, and perhaps the nonprofit sector, than in profit-driven industry -- at least until there are enough concrete examples of success to tip the perceived balance of risk. If I'm -- if we Open Foo types are -- right, it's actually riskier to compete than to cooperate in the long term. Better to own a share of X sooner than to delay any return on your investment in the hope of owning X outright later. This is especially true when the resources required to try to own X could be used to get you shares in multiple other projects at the same time.

Even then, openness in industry seems to me unlikely to go beyond consortia. Complete openness (open notebook science) precludes patent protection, and in the dog-eat-dog world of business driven by the insatiable demands of disconnected shareholders, I don't think we are ever going to wean the beancounters off their patents. (We could improve the situation by overhauling the patent process so that teeny incremental changes were not granted full protection, of course; but I digress, and don't get me started.)

So to return to Alan's analogy, "multiplayer" means different things in academia (and perhaps the nonprofit sector) and in business. In business, it means defined communities of co-operation; in academia, I see no good reason why it shouldn't mean everyone (except, perhaps, where the two intersect and academics enter a business-defined collaboration¹).

In academia, communities with an open science focus are beginning to form. The best example is still the one which continues to coalesce around Jean-Claude Bradley's UsefulChem initiative, but it's no longer the only one as it was just a few years ago. Chemist Mat Todd has funding for an open science project to improve synthesis of the anti-schistosomiasis drug, praziquantel. Biophysicist Steve Koch has a labful of open science enthusiast grad students. And so on; there's a list of Open Notebook practitioners on Wikipedia, and my own feeling is that technical rather than philosophical barriers are keeping quite a few labs from that list. By being discoverable on the public web, all of these labs can do what Jean-Claude is doing: accumulate collaborators and get more work done. Try searching Google for "DNA tweezers kinesin" -- the second and fifth hits will hook you up with Steve Koch. "Praziquantel synthesis" -- the third hit will take you to the schisto community on The Synaptic Leap, where you'll soon meet Mat Todd, and the seventh hit will take you to a brief discussion of Mat's project on the UsefulChem blog. "Antimalarial Ugi" -- most of the first ten hits will introduce you to UsefulChem. If you're doing something that's in any way related to the work that goes on in these labs, you're one Google search away from a collaboration.

In business, too, more and more companies are recognizing the benefits of wider sharing. Details of private collaborations are hard to come by, but just try searching for "precompetitive sharing" -- even Big Pharma can see that they stand to make net gains from sharing their datasets. For an even better example, check out Sage Bionetworks. I was lucky enough to hear Stephen Friend speak at the Science Commons Symposium a couple of weeks ago, and one of the points he made was that the really big questions in biology require such immense amounts of data that the only way to collect them is to do it in the open. Any impediment at all, be it CC-BY attribution requirements or IP protections, will derail the whole process; the only answer in the end is the public domain.

So, the seeds are there. I think continued crystallization is inevitable, but it's certainly worthwhile to try to monitor and direct the process -- by way of questions like those Alan is asking.

-------------
¹ I don't buy the argument, by the way, that unless academics work in secret and enable strong patent protection they will never get industry partners. If you invent something from which profit can be made, someone will want to make that profit. If, without outright patent ownership, it's not enough money to tempt a Roche or an Intel, there will always be smaller, hungrier companies.



Friday, 19 February
Panton Principles for Open Data in Science

The Open Knowledge Foundation has just announced the Panton Principles for Open Data in Science. Here's the point-form version of the Principles (but do go and read the whole thing, including the concise but important preamble; and please consider endorsing):

Formally, we recommend adopting and acting on the following principles:

  1. When publishing data make an explicit and robust statement of your wishes.
  2. Use a recognized waiver or license that is appropriate for data.
  3. If you want your data to be effectively used and added to by others it should be open as defined by the Open Knowledge/Data Definition – in particular non-commercial and other restrictive clauses should not be used.
  4. Explicit dedication of data underlying published science into the public domain via PDDL or CCZero is strongly recommended and ensures compliance with both the Science Commons Protocol for Implementing Open Access Data and the Open Knowledge/Data Definition.

I've written elsewhere about my feeling that Open Data/Open Science will eventually need a set of core Declarations to do for the wider movement what the BBB definitions have done for Open Access. A set of widely accepted terms and definitions provides a framework within which ongoing discussions can be much more efficient, focused and useful, as well as a point of reference and a standard introduction for newcomers to a field. Kudos to OKF and partners for making a strong start in this direction.

I do have one small quibble. Following Peters Suber and Murray-Rust, I want Open licenses to be three things:

  • explicit
  • conspicuous
  • machine-readable

The Panton Principles come right out and say "explicit", and "machine-readable" is largely covered because the recommended licenses are available in machine-readable versions (though I'd have preferred to see that actual phrase in the text of the Principles). What's missing, to my mind, is "conspicuous". The point of Open licensing is to enable and promote re-use, so it's important to make your license as obvious as possible to potential users. This might seem trivial, but I think it bears spelling out.

My own Open Data mantra is:

  • where are the data?
  • can I have them?
  • what can I do with them?

and again, the PPs are 2 for 3 by my count. The licensing covers what I can have and what I can do with it, but there's no mention of where I can find it in the first place. When we're talking about a database, the question doesn't arise since the license is in the same place as the data. But if we're talking about data which underlie a published paper, those data are very often not in the same place as the paper, even if the license is there. So it's important to make sure that your data are available: find or build them a stable online home and then let potential users know where it is. There's not much point in placing something in the Public Domain if the only copy is on your desktop. I'd have liked to see an explicit discussion of storage, access and signposting in the Principles... though come to think of it, this is really a different (and enormous) set of questions. So perhaps "conspicuous" covers this as well, and the missing Principle is simply that there should be a highly visible link to the license and the data themselves in every place where they are used, mentioned or otherwise likely to be encountered.

Of course, there are always unresolved questions no matter how carefully you craft your Declarations and Statements and Principles -- which is why the OKF has wisely built a companion tool, the Is It Open Data? web service. This is a brilliant way to remove ambiguity once and for all, on a case by case basis, by making public enquiry into the openness or otherwise of specific data sets. You can browse previous enquiries, so as to avoid redundant questioning of data owners; and naturally, recipients of multiple enquiries can use the service in a different way, simply linking to the record of their first response by way of answer to subsequent queries. Searchability might be a concern once the database of enquiries starts to grow, but that functionality can be added as needed. A central public service for asking questions about data availability and archiving the answers could go a long way towards improving access to data, simply by making clear the level of demand for Openness, and the degree to which supply falls short.



Monday, 15 February
Science Commons Symposium, Redmond WA

I am going to follow Antony's lead here and shamelessly steal Cameron's post to introduce the topic:


... sometimes someone puts together a programme that means you just have to shift the rest of the world around to make sure you can get there. Lisa Green and Hope Leman have put together the biggest concentration of speakers in the Open Science space that I think I have ever seen for the Science Commons Symposium -- Pacific Northwest to be held on the Microsoft Campus in Redmond on 20 February. If you are in the Seattle area and have an interest in the future of science, whether pro- or anti- the "open" movement, or just want to hear some great talks you should be there. If you can't be there then watch out for the video stream.


Along with [Cameron Neylon] you'll get Jean-Claude Bradley, Antony Williams, Peter Murray-Rust, Heather Joseph, Stephen Friend, Peter Binfield, and John Wilbanks. Everything from policy to publication, software development to bench work, and from capturing the work of a single researcher to the challenges of placing several hundred million dollars' worth of drug discovery data into the public domain. All with a focus on how we make more science available and generate more innovation. Not to be missed, in person or online...


I'm going to be there, but don't let that put you off -- I'll be sitting quietly in the audience soaking up the amazing array of expertise on offer. You won't even notice me, I promise.

If you have any interest at all in Open Science (and why on Earth would you be reading me, if you didn't?), you should make every effort to attend this symposium. I'm a bit skeeved out by its being held on a Microsoft campus -- actually, I'm a lot skeeved out, and if it were any other lineup I probably wouldn't go for that reason alone. But this is simply too good to miss. Seriously, do yourself a favor and be there if you possibly can.




Monday, 19 October
"Guerrilla OA" done right.

I was reminded recently (when Graham Steel uploaded this photo) of something I've been meaning to write about for nearly two years.

For those who don't know him (which must surely exclude nearly everyone involved with Open Access!), Graham (blog, FF) is a patient advocate, which work has made him a staunch supporter of OA and all things Open. (Those of us who promote OA from an academic or research perspective sometimes, I think, forget about the incalculable value that OA offers other professionals and the lay public.)

Graham's first foray into "guerrilla OA" (most emphatically not to be confused with these well-meaning idiots) was in September 2007, when he attended a conference and ran a one-man unofficial promotional campaign. Do read his own description, but the basic strategy was to be a human signpost (wearing "Research Made Public" and "I'm Open" t-shirts) and distribute OA promotional materials in such a way as to give most of the delegates at least a brief exposure to the concept.

(Pause here to marvel at the dedication of the man whose belief in the possibilities of OA makes him willing, entirely at his own instigation, to arrange attendance, travel and accommodation, collect up the necessary materials and then physically go and do all this.)

Sadly, we can't yet clone Graham; but perhaps we can duplicate some of his efforts. I wonder how much it would cost to make "guerrilla OA" kits like the one Graham made for himself, but aimed at conference delegates so that researchers could turn into "Steel lite" activists at every conference we attend. Here are a few ideas:

  • t-shirts to start conversations
  • a badge instead of a t-shirt ("free your research, ask me how") for those who prefer more formal attire
  • "OA in a nutshell" cards the size and shape of regular business cards, for handing out in conversation and leaving on appropriate tables
  • slides for your talk: start with Cameron's "Presentation Rights" and end with a "Basics of OA" slide
  • equivalent add-ons for your poster, such as a Copyright Notice and an OA Basics placard, about the size of a postcard so they should fit on most posters as an afterthought and would be easy to incorporate into the poster itself

Here's another idea: it would only take half a dozen delegates to run an "OA stall", similar to the vendor stalls with which we are all only too familiar. This would mean working with conference administration, so maybe they would even help with "recruiting"; alternatively, it should be simple to set up a website where one can advertise for help in running such a stall at a particular conference. OA publishers could contribute materials (perhaps in return for help with costs), but I think transparent independence from any particular commercial effort would help tremendously in establishing credibility and producing a positive response. A prominent "who are we and why are we doing this?" banner might be a good idea. Flyers could include "OA: what's in it for you?", "Why the Impact Factor should be retired", and "Elsevier: just another greedy bottom-feeder, or SPAWN OF SATAN????". (OK, maybe not that last one... though a single page with this graph on it, or a reprint of this if I ever get around to publishing it, might be a good idea.)



Thursday, 23 July
Yes!

I'm swamped (new job), but just had to point this out: if you are interested in scholarly communication, Open Science, bibliometrics or anything related (and if you're not, why the hell would you read me?), then you must read this post from Deepak and the related post from PLoS ONE. Bora suggests that we mark our calendars; I think he's right, and this will prove to be one of those milestones whose importance will be clear in hindsight.

So -- what they said, especially Neil.



Tuesday, 14 July
Paying for toll access.

In response to the persuasive argument that online and peer reviewed journal audiences have significantly less than 100% overlap, I'm going to start trying to re-publish some of my Open Access writing. I'm considering submitting the draft below as a letter to the editor of Haematologica, in response to this editorial; comments, corrections and suggestions for where to send it are welcome.

In particular, I'd like input on the following: the draft letter is basically an abbreviated version of this post -- should I, instead, work the full post (including this revision of Phil Davis' Cornell study) up into a paper/essay/article and submit that somewhere?

If so, should I include only the self-reported figures for average TA journal author-side costs (see below), or should I pick a number of prominent journals and estimate their average author-side costs as I did in this post?

If the latter, obviously the methodology for the blog post is inadequate for a formal publication -- so how would one go about getting a reliable estimate -- that is, how many issues would one need to sample? And which journals should I include?

(I'm inclined to think I should just send the letter, because to do the paper properly would be a lot of work. The Davis update should be more than just plugging in new assumptions, it should really be repeated with the latest ARL numbers and new searches in Web of Science and/or Scopus. On top of that, estimating average page number and number of color figures for a single journal, let alone a selection, is an enormous task. So, frankly, it probably won't get done -- although I'm up for a collaboration if anyone out there is interested.)

Finally, one for the statisticians out there: if I do include the update to the Davis study, how would one go about a formal analysis of what is shown in Figure 3 of this post? The question is this: to what extent is high ranking on the list of predicted expenses in an all-OA world predictive of high ranking on lists of serials expenditure, enrolment or articles published? And is such an analysis (some kind of rank correlation, right?) really any better than the simple eyeball explanation I used in the linked post?
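
For what it's worth, here is a rough sketch of what "some kind of rank correlation" could look like in practice: Spearman's rho, computed by ranking each list (tied values share their mean rank) and then taking the ordinary Pearson correlation of the ranks. The five pairs of numbers are invented purely for illustration and are NOT the data behind Figure 3; the function names are likewise made up.

```python
def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values for five institutions (not taken from the post).
predicted_oa_cost = [9.1, 7.4, 5.0, 3.2, 1.8]
serials_spend = [8.0, 8.5, 4.1, 4.4, 2.0]
print(round(spearman_rho(predicted_oa_cost, serials_spend), 2))
```

A rho near 1 would say that institutions ranked high on one list tend to rank high on the other. If SciPy is available, `scipy.stats.spearmanr` computes the same statistic along with a p-value, which is more than the eyeball approach can offer.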


----draft letter-----


Dear Sir/Madam:

Last month's issue of Haematologica featured an editorial entitled "Paying for open access" [1]. I write to point out that subscription-model ("toll access", TA) journals also impose author-side fees such as page and color figure charges. In fact, in a 2005 survey, a greater proportion of TA journals than of Open Access (OA) journals charged such fees [2]. Recent financial and publishing estimates have made it possible to compare fees across the two models, as follows.

The NIH estimates that it spends $80 to $100 million/year [3] on the publication costs of some 80,000 papers [4], and approximately 5% of research publications worldwide are available through Gold OA with no embargo period [5]. On the overly conservative [6] assumption that the average author-side fee for Gold OA is triple the average author-side fee levied by toll access journals, the average publication charge paid by the NIH to toll access journals is between $909 and $1136 per paper.

Further, OA advocate Peter Suber has pointed out (pers. comm.) that this number is certainly an underestimate since some fraction of those 80,000 articles did not use NIH funds, either because they were published in no-fee journals or because the authors found other ways to pay. Bearing this in mind, the NIH estimate is consistent with the handful of self-reported figures I have been able to find:


journal ................................. avg. author-side fee
Molecular Biology of the Cell ................... $1829 [7]
American Physiological Society (14 journals) .... $1000 [8]
Molecular Biology and Evolution ................. $922  [9]
Molecular Plant-Microbe Interactions ............ $1275 [10]


For comparison, the last two issues of Haematologica included 33 original research articles. On the basis of current page and color figure charges (and including submission fees), I calculate that the authors of these papers paid an average of around €560 ($790) per paper. Though the sample is hardly representative, it seems likely that the average cost of a Haematologica paper is in this ballpark. Such a figure is consistent with fees charged by other Gold OA publishers [6].

Authors considering the affordability of OA fees should bear in mind that they may well pay as much or more in page and color charges at a toll access journal, and should also ask what it is that they are paying for. Readers of toll access journals must bear a further cost, either directly or through subscriptions, whereas OA articles are immediately and permanently free for anyone to read.


Sincerely,

me.
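
The $909 to $1136 range in the letter's second paragraph is just a weighted average solved for the TA fee. Here is a minimal sketch of that calculation, using only the figures quoted in the letter; the function and parameter names are invented for illustration.

```python
def ta_fee_per_paper(total_spend, n_papers, oa_share=0.05, oa_multiplier=3.0):
    """Average TA author-side fee x, from
    n_papers * ((1 - oa_share) * x + oa_share * oa_multiplier * x) = total_spend.
    """
    # Each Gold OA paper counts as `oa_multiplier` TA papers' worth of fees.
    weighted_papers = n_papers * ((1 - oa_share) + oa_share * oa_multiplier)
    return total_spend / weighted_papers


low = ta_fee_per_paper(80_000_000, 80_000)    # NIH's lower spending estimate
high = ta_fee_per_paper(100_000_000, 80_000)  # NIH's upper spending estimate
print(f"${low:.0f} to ${high:.0f} per paper")
```

Setting `oa_multiplier=1.0` (Gold OA fees no higher than TA fees) pushes the estimate up, which is why the "triple" assumption is the conservative one.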




Update 090717: corrected the calculation; you can grab the data here if you want to check my work or do something else with it. This is another argument for re-publishing: it makes you check your work! I got things wrong, and forgot to make the data available, the first time around.

I've submitted the letter; the full study I suggested is so much larger that I don't see it as salami publishing to submit that separately, if it ever gets done. Following a suggestion from Heather Morrison in comments, I'm going to try putting it up as a research project on the OAD and try to coordinate a team project. I felt compelled to point out this blog entry, the CCZero license and the fact that, if they accept the letter, I intend to use a CC/SPARC Author Addendum to retain enough rights from their copyright transfer (why does an OA journal need that?!) to offer CC-BY-NC. We'll see what happens.


References
[1] Paying for open access. Haematologica, Vol 94, Issue 6, p. 764 doi:10.3324/haematol.11505

[2] The Facts About Open Access. Kaufman-Wills Group, LLC 2005 URL:http://www.alpsp.org/ngen_public/article.asp?id=200&did=47&aid=270&st=&oaid=-1. Accessed: 2009-07-17. (Accessed July 16 2009)

[3] US House of Representatives Subcommittee on Courts, the Internet, and Intellectual Property Hearing on H.R. 6845, the "Fair Copyright in Research Works Act". Thursday 09/11/2008 URL:http://judiciary.house.gov/hearings/hear_080911_1.html. (Archived by WebCite® at http://www.webcitation.org/5iFx4iGru)

[4] US National Institutes of Health Public Access Frequently Asked Questions. URL:http://publicaccess.nih.gov/FAQ.htm#f4. (Archived by WebCite® at http://www.webcitation.org/5iFxByIMV)

[5] Björk B-C, Roos A and Lauri M. Global annual volume of peer reviewed scholarly articles and the share available via different open access options. Proceedings of the 12th International Conference on Electronic Publishing ISBN 978-0-7727-6315-0, 2008, pp. 178-186 (http://elpub.scix.net/cgi-bin/works/Show?_id=178_elpub2008)

[6] Comparison of BioMed Central's Article Processing Charges with those of other publishers. URL:http://www.biomedcentral.com/info/authors/apccomparison/. (Archived by WebCite® at http://www.webcitation.org/5iFxJ0Mex)

[7] American Society for Cell Biology Newsletter, April 2007: MBC and the Economics of Scientific Publishing. URL:http://www.ascb.org/files/mbc_cost_printing.pdf available from URL:http://www.ascb.org/index.php?option=com_content&view=article&id=64&Itemid=216. (Archived by WebCite® at http://www.webcitation.org/5iFxptGx6)

[8] American Physiological Society AuthorChoice Frequently Asked Questions. URL:http://www.the-aps.org/authorchoice/faq.html. Accessed: 2009-07-14. (Archived by WebCite® at http://www.webcitation.org/5iFy5XFnK)

[9] Society for Molecular Biology and Evolution: Editor's Annual Report 2008. URL:http://www.smbe.org/pdf/2008editreport.pdf available from URL:http://www.smbe.org/archive.php. (Archived by WebCite® at http://www.webcitation.org/5iFxdKLqV)

[10] American Phytopathological Society Reports of Publications 2000. URL:http://www.apsnet.org/members/gov/2000/Reports%20of%20Publications.htm. (Archived by WebCite® at http://www.webcitation.org/5iFwcmIBL)



Tuesday, 30 June
Perfect match?

Surely this:


doe.jpg

You may find a technical report that you want to share with others or you think worthy of making broadly available on the Web to support the advancement of science. When you search for important science information in your area of interest, you can choose to sponsor the digitization of any adoptable technical report. The cost is $85 (approximately the same cost as ordering a hard copy). Discounts for multiples of 5 or more adoptions may be available. If you are interested in a larger scale project, please contact (865) 576-5699.



is a job for this guy:


malamud.jpg
... Most recently, Malamud has set up the nonprofit public.resource.org, headquartered in Sebastopol, California, to work for the publication of public domain information from local, state, and federal government agencies. Among his victories have been digitizing 588 government films for the Internet Archive and YouTube, publishing a 5 million page crawl of the Government Printing Office, and persuading the state of Oregon to not assert copyright over its legislative statutes.

?


(CC-BY image of Carl Malamud from Joe Hall via Wikimedia)



Sunday, 21 June
OA vs TA costs: I think I have finally got this straight.

I made some errors in the last few posts, making the information somewhat scrambled -- my apologies. Here is what I hope is a clear picture of what we know about the relative costs of OA and TA publishing.

1. The NIH estimates that it pays $100 million/year in author-side charges, and supports the production of some 80,000 scholarly articles; that's an average of $1250/article.

Update: Peter Suber points out that some fraction of those 80,000 articles did not use NIH funds, either because they were published in no-fee journals or because the authors found other ways to pay. I can't think of any way to estimate the actual number of articles the $100 million paid for in order to adjust the estimated fee/article, but it's worth remembering that it's an underestimate.

2. Björk et al. found that less than 5% of all articles worldwide are available through no-embargo Gold OA. We don't know what proportion of the NIH's $100 million went to Gold OA fees, nor what the average such fee might be. In order to be conservative, let's assume that the average Gold OA fee is triple the average TA fee (it almost certainly isn't that high). Then (if that 5% is evenly distributed) the NIH paid for (0.95 x 80,000 =) 76,000 articles at the average TA fee and 4,000 articles at three times that fee. That is the equivalent of 76,000 + 12,000 = 88,000 TA fees, bringing the average author-side charge for a TA article to $100 million / 88,000 = $1136.

3. Philip Davis' 2004 library costs spreadsheet estimates the average subscription charge per scholarly article at between $970 and $1750, depending on what proportion of the library serials budget is allocated to scholarly publications.

subscriptionperarticle.jpg

Davis' original study estimated this proportion at 50% (on what basis I don't know), but I think the real value is closer to 90%. My reasoning is based on my observation (see Table 2) that the average unit cost of a curated list of scholarly journals from UCOSC is about ten times the average unit cost of "all serials" from ACRL, ARL and NCES datasets. If that result is broadly representative it means that scholarly journals must contribute either a small fraction or the vast majority of the cost (see here for a brief explanation).

So that gives an estimated fee of between $2106 and $2886 per toll-access article. That money isn't all coming from the same place -- the NIH is paying author-side fees and libraries are paying subscriptions -- but it's all going to the same place, publisher coffers.

I've added a current (under)estimate of NIH costs for author-side fees, adjusted for a 2006 estimate of %OA by article, to a 2004 estimate of subscription fee/article, but I'm confident that the real cost (if I could get up-to-the-minute figures for all inputs) would be in the same ballpark.

Sure puts one-time, up-front Gold OA fees in a different perspective, doesn't it? Here's a reminder (stupid Impact Factors in brackets just because I know a lot of people still think they mean something even though they don't):


average revenue per toll-access article (note 1) ......... $2100 - $2900

BioMed Central
Genome Biology (6.6) ..................................... $2250
BMC Biology (5.1) ........................................ $1950
Molecular Cancer (3.7) ................................... $1710
Retrovirology (4.0) ...................................... $1390
J. of Cardiovascular Magnetic Resonance (1.9) ............ $1195

Hindawi
Comparative and Functional Genomics (1.6) ................ $850
J. of Biomedicine and Biotechnology (1.9) ................ $975
Mediators of Inflammation (1.2) .......................... $975
Bioinorganic Chemistry and Applications (1.0) ............ $700

Public Library of Science
PLoS Biology (13.5), PLoS Medicine (12.6) ................ $2850
PLoS Pathogens (9.3), Neglected Tropical Diseases (n/a),
Genetics (8.7) and Comp Biol (6.2) ....................... $2200
PLoS ONE (n/a) ........................................... $1300

Other
J. Medical Internet Research (3.6, best in field) ........ $1590
Biological Procedures Online (1.2) ....................... $1250
J. of Clinical Investigation (16.9) ..................... ~$2500



1 Update: since D0r0th34 has already pointed out one dumb thing I did, neglecting other revenue streams available to TA but not OA publishers, I think that rather than continually update this post I'll just go ahead and embed the FriendFeed discussion right here:







Friday, 19 June
OA and strategy

Stuart Shieber recently gave a talk at Caltech, which prompted the following blogospheric exchange with Stevan Harnad (which I recommend highly if you are interested in Green vs Gold OA and the intricacies of OA mandate politics):

Harnad --> Shieber --> Harnad

followed by this related post on "proportion and strategy" from Prof Harnad, the main points of which he also left as a comment on a couple of my posts:

#1: The vast majority of current (peer-reviewed) journal articles are not OA (Open Access) (neither Green OA nor Gold OA).
#2: The vast majority of journals are not Gold OA.
#3: The vast majority of journals are Green OA.
#4: The vast majority of citations are to the top minority of articles (the Pareto/Seglen 90/10 rule).
#5: The vast majority of journals (or journal articles) are not among the top minority of journals (or journal articles).
#6: The vast majority of the top journals are not Gold OA.
#7: The vast majority of the top journals are Green OA.
#8: The vast majority of article authors would comply willingly with a Green OA mandate from their institutions and/or funders.
#9: The vast majority of institutions and funders do not yet mandate Green OA.
#10: The vast majority of Gold OA journals are not paid-publication journals.
#11: The vast majority of the top Gold OA journals are paid-publication journals.
#12: The vast majority of institutions do not have the funds to subscribe to all the journals their users need.

CONCLUSION I: The fact that the vast majority of Gold OA journals are not paid-publication journals is not relevant if we are concerned about providing OA to the articles in the top journals.

CONCLUSION II: Green OA, mandated by institutions and funders, is the vastly underutilized means of providing OA.

CONCLUSION III: It is vastly more productive (of OA) for universities and funders to mandate Green OA than to fund Gold OA.

I think there is a considerable strategic error embedded in those premises and the conclusions which follow, the basis of which is the emphasis on "the top minority of journals (or journal articles)". The 90/10 rule is not relevant: the goal of OA is 100% OA, not 10% -- not even "the top" 10% in which is concentrated 90% of whatever your metrics are measuring.

Much of the potential of OA lies in the provision of a comprehensive corpus of information on which to build the semantic web. Comprehensiveness matters, because just as re-use beyond the scope of the original author's imagination is a primary impetus for information sharing between humans, it is folly to imagine that we can determine ahead of time what will matter to machines -- that is, which articles will be crucial to finding new and unexpected connections in text- and data-mining initiatives. The more complete the corpus, the more likely we can refine from it insights that are currently unpredictable.

Also, in an odd bit of circularity, 100% OA is vital to the development of rich, fine-grained, multiply cross-validated metrics that will likely be more reliable than existing metrics in guiding management decisions and researcher information searches. If we focus on "the top" journals and articles, we hamstring our best strategy for improving the methods with which we identify quality in the first place.

It's also worth addressing claim #11 separately. For the direct argument against the assertion that most of the "top" Gold OA journals charge fees, see Peter Suber:

If this is a claim about quality, or about future submission patterns, as opposed to present submission patterns, then it's an assumption for which there is no evidence.  Nobody has done the studies. [...] In the absence of studies, this is all we know:
[T]here are strong and weak OA journals, just as there are strong and weak TA journals. Hence, any analysis focusing on weak OA journals and strong TA journals (as if to show the superiority of TA journals) would be as arbitrary as one focusing on weak TA journals and strong OA journals (as if to show the superiority of OA journals). Without some additional argument showing that the journals on which they focus are typical of their breeds, they would be guilty of cherry-picking and generalizing from an unrepresentative sample.

There is, however, a neglected and (in my opinion) important counter-argument: even if that assertion is true, it is surely equally or more the case that the vast majority of toll-access journals charge author-side fees in addition to subscription charges. A 2005 Kaufman-Wills study found that 75% of TA journals in their sample charged author-side fees. There is at least as much reason to suppose that the top-ranked TA journals are to be found among the fee-charging cohort as there is to suppose the same of OA journals.

The NIH estimates that it pays author-side fees to the tune of $100 million per year, and funds the publication of some 80,000 scholarly articles. Assuming, in order to be conservative, 5% Gold OA at fees that are triple the average TA fee, that averages out to $1136/article, but what's sauce for the TA goose is sauce for the OA gander: if the Kaufman-Wills figures are broadly representative then those TA journals that charge additional author-side fees are charging, on average, $1515 per article. That's more than PLoS ONE, more than most BMC journals and more than any Hindawi journal.

It follows that, since we are not -- that is, I argue that we should not be -- "concerned about providing OA to the articles in the top journals", the fact that most Gold OA journals do not charge fees is in fact relevant to all strategies for increasing OA to the research literature.

I think I disagree with the second conclusion also -- in the most comprehensive study so far, about 8% of articles published in 2006 were available via Gold OA, whereas a further 11% were available as self-archived copies. I agree, of course, that both are vastly underutilized relative to the goal of 100% OA, but it doesn't seem to me that Green suffers more neglect than Gold.

Given the flaws in some premises and the first two conclusions, I don't believe that conclusion 3 stands up either. I find Stuart Shieber's argument for the Harvard model compelling:

In summary, a university that commits to the open access compact1 will more easily be able to answer objections against green OA policies specifically because it has an approach to long-range support for gold OA publishing, not in spite of it. The two models are inextricably tied. I, like Professor Harnad, am interested in facilitating the adoption of green OA policies. I proposed the open access compact in large part because I expect that adoption of the compact will lead to more green OA policies. The open access compact is therefore contributory to the promotion of green OA, not a sidetrack to it. I of course encourage universities to adopt green OA policies before gold OA support, but given that dystopian fears of faculty are preventing adoption of such policies, an open access compact that might assuage these worries should not be delayed.

1 The compact simply states that "The university commits to underwrite reasonable article processing fees for open-access journals for which funds are not otherwise available".

Given all of the above, the optimal strategy seems to me to be the one adopted by Harvard: a Green OA mandate and careful (fiscally responsible) support for Gold OA.



Friday, 19 June
Update and correction re: cost to libraries and author-side fees

In comments below, Peter Suber points out that the NIH has amended its estimates to $100 million/yr spent on author-side charges and 80,000 manuscripts funded -- which brings the estimated average author-side fee to $1250, well in line with the individual journal estimates I made and the published figures I found. This is an important number because it is derived from a very large sample of the scholarly literature and casts a very different light on OA author-side fees than the one that TA publishers are wont to shine on their competitors. Compare, for instance, PLoS ONE at $1300, or the standard BMC charge of $1470 -- for a couple hundred dollars more than the average cost of a TA publication, you can make your work free for all users to access, immediately and permanently. (It would be interesting to know what proportion of the $100 million is going to OA fees, though I doubt it would be large enough to make a significant dent in the average TA charge. Edit: according to Björk et al., less than 5% of all articles are available via no-embargo Gold OA; taking this into account, and assuming that the average Gold OA fee is triple the average TA fee, gives an average of $1136/article.)

But! However! There is a flaw in my reasoning!

The problem is not with the estimate of author-side charges, but in my use of that estimate to update Philip Davis' library costs study. The point of that study was to look at what libraries would pay in an all-OA model, which is why I used the fractional cost matrix1 and graph in the first place. See the problem? Libraries don't pay the toll-access author-side charges, the NIH does! This makes the model a little artificial, perhaps, since *someone* has to pay those charges regardless of which journal levies them; nonetheless, the idea was to estimate practical library costs, so the TA author-side fees should not be included.

Here's what the updated situation looks like with the subscription/article estimate NOT adjusted for TA author-side fees (see my earlier post for details of the calculations):

Davisupdatecorrected.png

The fractional cost has to drop to 0.4 before there are no libraries predicted to pay more in the OA model -- as I pointed out in the original post, there are numerous realistic combinations that will result in a fractional cost of 0.4 or lower:

matrix2.png

The new figures also show that the fractional cost has to drop below 0.2 before all 113 libraries are predicted to save money in an OA model. That still seems to me to fall within a realistic range, given that 70% of journals in the DOAJ don't charge author-side fees (so 30% do) and 45% of researchers in a recent RCUK study had their OA fees covered by their research funders (so 55% did not), for a fractional cost of 0.55 x 0.30 = 0.165.

Nonetheless, it's worth taking a quick look at the libraries which are predicted to pay about the same in the OA and TA models. At a fractional cost of 0.4, they are: UC Davis, LA and San Diego, Univ Colorado, Cornell, Harvard, Johns Hopkins, McGill, Univ Massachusetts, Univ Maryland, MIT, Univ Toronto, Univ Washington and Univ Wisconsin. At a fractional cost of 0.3, only UC Davis, UCLA, Harvard, McGill, Maryland and Washington remain in the "pay about the same" category.

It's easy enough to guess what these universities have in common, and a simple analysis confirms it:

rank.png

Shading the top six yellow and the next 8 blue for visibility and ranking the libraries according to FTE, serials expenditure and "estimated scholarly articles published" reveals that the 14 "pay-same" libraries have only a slight tendency to be among the larger schools, but cluster very strongly at the high end of the "scholarly articles published" ranking. In other words, research-intensive schools that publish a lot may put more pressure on their libraries in the OA world (to the extent that libraries are likely to be asked to repurpose serials costs for OA charges).

Among other things, it was in order to examine this particular concern in detail that Davis carried out his original study, and for the same reason I have here updated it with more recent estimates and assumptions. The newer numbers show that a realistic worst-case scenario is that the libraries in question (14 out of 113 total) don't save any money in the OA model.

-------------
1 I neglected to mention in earlier posts that I got the %fee x %funded matrix idea (of which the fractional cost graph is an obvious extension) from Peter Suber. My apologies to Peter; I'm usually more careful about crediting sources.



Thursday, 18 June
Cost to libraries: OA vs TA

Note: important update/correction.

In 2004, Philip Davis carried out a study of library costs in which he estimated the average subscription cost/article for a subset of ARL libraries and compared this with a range of estimated author-side fees for Gold OA, in order to determine whether libraries might pay more or less if all journals switched to OA. Here I've tried to update that study using information that wasn't available back then.

Davis set the spreadsheet up to make it easy to update his assumptions and recalculate (kudos!), and Peter Suber (among others) pointed out that at least the following assumptions should be updated:

  1. all OA journals charge author-side fees
  2. the full cost of OA fees will be borne by libraries
  3. TA journals charge no author-side fees

We now have five different studies (one recently confirmed, improved and updated) showing that in fact the majority of OA journals do not charge author-side fees. The highest proportion of no-fee journals is in the DOAJ psychology subset (90%) and the lowest is in the chemistry subset (49-58%); the most recent analysis of the entire DOAJ showed 70% no-fee.

We also know that research funders are increasingly willing to foot the bill for OA. For example, HHMI has institutional agreements/memberships with BMC, Springer and Elsevier, and BMC's page of funder policies shows that a majority of UK funders either make additional funds available or allow publication charges to be treated as an indirect cost. A recent RCUK report showed that 45% of authors publishing in fee-based OA journals had their costs covered by their research funders.

Rather than pick a single number for either of these updates, I've plotted the fraction of the OA cost borne by libraries against the number of institutions at which OA is predicted to cost more than, the same as, or less than the TA model. The fractional cost borne by libraries is the product of two proportions: (100% - % of fees covered by funders) x (% of OA journals charging fees). (See Figs 1 and 2 below.)
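As a rough sketch of that product, with both inputs expressed as fractions between 0 and 1: the function name is hypothetical, and the example inputs are simply the DOAJ (70% no-fee, so 30% charging) and RCUK (45% funder-covered) figures quoted in this post, used for illustration rather than as a prediction.

```python
def fractional_cost(funder_covered, journals_charging):
    """Share of the nominal all-OA fee bill that would fall on libraries.

    funder_covered:    fraction of OA fees paid by research funders
    journals_charging: fraction of OA journals that charge author-side fees
    """
    return (1 - funder_covered) * journals_charging


# 45% of fees covered by funders, 30% of OA journals charging fees.
print(round(fractional_cost(0.45, 0.30), 3))  # 0.165
```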
 
We don't know much about author-side fees at toll-access journals, but we do have some information. Firstly, the 2005 Kaufman-Wills report showed that more than 75% of the 247 toll-access journals in their sample charged author-side fees in addition to subscriptions. Secondly, I just had a rough-and-ready look at a small number of TA journals and found average author-side fees ranging from $400 to almost $3000. Finally, the NIH estimates (scroll to section L) that it spends over $30 million/year in author-side fees and funds the production of around 60,000 manuscripts. This means that the NIH is paying, on average, about $500/article in page charges. Since this is the largest sample we have, I've used this figure to update the spreadsheet. I added $500/article to the calculated serials expenditure/article and compared this adjusted TA cost/article to the OA costs.

Update: this was a mistake! The point of the exercise was to compare existing library subscription costs with predicted OA costs, and libraries are not currently paying the TA author side fees. See this post for the correctly updated version of the Davis study.

I've updated two further aspects of Davis' spreadsheet. First, we now have better information about the actual range of author-side fees charged by those OA journals that do charge them. Rather than Davis' $2500 - $5000 range, I've used $1300 (PLoS ONE) to $3000 (most of the high-profile hybrid programs). If the adjusted TA cost/article falls within this range, the prediction is that the OA and TA models cost about the same from a library point of view.

Second, Davis assumed that the scholarly literature made up 50% of library serials expenditures. I don't know where this figure came from (the spreadsheet refers to a report which does not give any further information), but I think the real value is closer to 90%. My reasoning is based on my observation (see Table 2) that the average unit cost of a curated list of scholarly journals from UCOSC is about ten times the average unit cost of "all serials" from ACRL, ARL and NCES datasets. If that result is broadly representative it means that scholarly journals must contribute either a small fraction or the vast majority of the cost. Here's a simple explanation: suppose 1000 items at an average cost of $10; then average cost of the scholarly items must be about $100 if the "10 x all serials" rule is accurate. So you can either have 90 scholarly items and 910 non-scholarly items at about $1, or you can have one scholarly item and 999 non-scholarly items at about $10. What you can't have, for the averages to work out according to the "10 x" rule, is any ratio close to 50% scholarly/50% non-scholarly.
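The toy example above is easy to reproduce. This is only a sketch of the arithmetic, using the made-up numbers from the paragraph (1000 items, scholarly items at $100 each); the function name is hypothetical, and the split being tested is by number of titles.

```python
def mean_cost(n_scholarly, scholarly_price, n_other, other_price):
    """Average unit cost across a mixed list of serials."""
    total = n_scholarly * scholarly_price + n_other * other_price
    return total / (n_scholarly + n_other)


# The two scenarios from the text; both land close to the $10 average.
print(round(mean_cost(90, 100, 910, 1), 2))   # 9.91
print(round(mean_cost(1, 100, 999, 10), 2))   # 10.09

# An even split by title count cannot average $10, even if the
# non-scholarly half cost nothing at all.
print(mean_cost(500, 100, 500, 0))            # 50.0
```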

Summary of updates:

  1. plot fractional cost borne by libraries to account for %OA journals that don't charge fees and % OA costs borne by research funders (or other bodies)
  2. add $500/article to TA model costs to account for author-side fees charged in addition to subscriptions
  3. predicted OA fee range = $1300 to $3000
  4. assume scholarly literature makes up 90% of serials expenditure

The updated spreadsheet is here, and the end result is this:

Davisupdate_errornote.png

At a fractional cost of 0.8, there are no libraries at which OA is predicted to cost more than the TA model, and at a fractional cost of 0.3 the OA model is predicted to cost less than the TA model at all 113 libraries.

To see how the %fee and %funder proportions affect the fractional cost borne by libraries, I constructed a simple matrix and highlighted the two cutoff points shown on the graph above:

Davisupdate_fraction.png

As you can see, there are a number of perfectly reasonable combinations which result in a fractional cost of 0.3 or less, at which all the libraries in the sample would save money under the OA model. (This, by the way, is exactly what Peter Suber predicted.)

Update/correction: see this post.



Thursday, 18 June
Author-side fee comparison: OA vs TA.

I've posted a couple of times about the misconception that all OA journals charge author-side fees, and each time I've mentioned the Kaufman-Wills study which found that 75% of the toll-access journals they examined charged author-side fees in addition to subscription charges. I thought it would be useful to compare author-side fees charged by OA and TA journals.

It's easy to work out what OA and hybrid journals charge; BMC maintains a detailed list of publisher article processing charges.  Here are some examples: 

PLoS journals charge in three tiers:
PLoS ONE, $1300
PLoS Pathogens, NTDs, Genetics and Comp Biol, $2200
PLoS Biology and Medicine, $2850

BMC charges between $1105 and $2095 for most journals, and their standard charge is $1470

Hindawi charges between $275 and $850 for most of their journals, with a few titles up to $1400

Springer Open Choice, Wiley Funded Access and Elsevier's Sponsored Articles all cost $3000. (*cough*)

What is much more difficult to determine is how much the average author is paying in author-side fees at toll-access journals, because the charge for a given article depends on number of pages and/or color figures, and in some cases also on whether supplementary information is included.

Below are a few examples; in each case for which I calculated a figure, I extracted the page and figure counts manually from a single issue. This is far too small a sample to be representative, but I'm just trying to get some kind of feel for the numbers. Further, the published figures I managed to find (indicated by footnotes) are consistent with my "calculated guesses". Also, the NIH estimates (scroll to section L) that it spends "over $30 million annually in direct costs for publication and other page charges" and produces "roughly 50,000 - 70,000 manuscripts", which means that the NIH is paying, on average, about $500/article in page charges. If around 8% of all new articles are Gold OA, that number goes up to about $543/article. If the Kaufman-Wills 75% figure is representative, then the average author-side fee being charged is $666/article, or $724/article if the %OA is taken into account. (Note that the %OA adjustment might be spurious and the estimated average slightly off, because we don't know how much of the estimated $30 million is going to Gold OA fees.) Edit: according to Björk et al., only about 5% of all articles are available through Gold OA without an embargo period. Taking this into account, and assuming that the average Gold OA fee is triple the average TA fee, gives an average of $454/article, or $606/article on the Kaufman-Wills estimate.
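The chain of adjustments in the paragraph above can be reproduced as follows. This is a sketch under my reading of those adjustments, not a definitive model: the names are made up, 60,000 is the midpoint of the NIH's range, and the 8% case assumes that none of the money went to Gold OA articles. Results are truncated to whole dollars, which appears to be how the figures in the text were obtained.

```python
NIH_SPEND = 30e6            # "over $30 million annually", so a lower bound
NIH_ARTICLES = 60_000       # midpoint of "roughly 50,000 - 70,000 manuscripts"
FEE_CHARGING_SHARE = 0.75   # Kaufman-Wills: share of TA journals charging fees


def per_article(paying_article_equivalents):
    """Return (average fee, average fee at fee-charging journals only)."""
    average = NIH_SPEND / paying_article_equivalents
    return int(average), int(average / FEE_CHARGING_SHARE)


# No adjustment for Gold OA.
plain = per_article(NIH_ARTICLES)
# Assumption: 8% of articles are Gold OA and none of the money went to them.
gold_8 = per_article(NIH_ARTICLES * 0.92)
# Assumption: 5% are Gold OA, each paying triple the average TA fee.
gold_5_triple = per_article(NIH_ARTICLES * (0.95 + 3 * 0.05))

print(plain, gold_8, gold_5_triple)
# (500, 666) (543, 724) (454, 606)
```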

Update: In comments, Peter Suber points out that the NIH has amended its estimates to $100 million/yr spent on author-side charges and 80,000 manuscripts funded -- which brings the estimated average author-side fee to $1250, or $1136 after the same adjustment for 5% Gold OA at triple the average TA fee; if only 75% of TA journals are charging such fees, then they are charging on average $1515.

This section became way too cluttered, so I've put a summary here and the details are below:

journal .................................... average author side fee
PNAS ............................................... $1446
Science ............................................ $1019
Nature ............................................. $1669
Cell ............................................... $2031
Cell Cycle ......................................... $756
EMBO J ............................................. $2974
Mol Biol Cell ...................................... $1829 1
American Physiological Society (14 journals) ....... $1000 2
Journal of Nutrition ............................... $456
J Neuroscience ..................................... $850 + color charges 2
Molecular Biology and Evolution .................... $922 3
Molecular Plant-Microbe Interactions ............... $1275 4
J Natural Res & Life Sci Education ................. $400

1 official figures, 2006
2 official figures, current
3 official figures, 2008
4 official figures, 2000


The selection of journals is fairly random, just the first few that came to mind then whatever turned up when I was searching for things like "average page color charges". They range from prestige to niche, and even the cheapest charge fees that amount to a significant fraction of Gold OA author-side fees.

It would be very interesting to extend this half-baked pilot study, but I think it would also be unavoidably labor intensive. Except for rare cases where publishers provide the numbers, there's really no way to calculate average author-side fees based on page and figure counts except by doing those counts for a representative sample of issues in each journal. (Perhaps a passing statistician could help me figure out what would constitute a representative sample -- perhaps sqrt(issues/year)?) Then you have to select which journals to investigate -- perhaps high, middle and low ranked journals in a handful of broad categories? Finally, it's pretty slow going, so I don't think Mechanical Turk would be cost effective for this job -- even if you could solve the problem of giving Turkers access to the journals. In the end I think you'd have to inflict the counting task on some hapless grad student or intern, who would probably find it easiest to sit in a library with a stack of journals and a spreadsheet.







----------------------------------------details of "calculated guesses" and official figures----------------------------------------

PNAS: $70/page, $250 for supplementary information, $300 per color figure or table

March 17 2009 vol 106 issue 11: 88 papers, pp 4079 to 4570; mean = 5.6 pages
5.6 pages = $392
10 papers had no supplementary info so mean SI = 78/88 = 0.886 = $221
approx every 5-6th paper examined, 18 in total:

5 color figures ($1500) ii
4 color figures ($1200) iiiii i
3 color figures ($900)  ii
2 color figures ($600)  iiii
1 color figure  ($300)  ii
0 color figures ii

mean color cost = $833; mean total cost/article = $1446
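The PNAS estimate combines three components (pages, supplementary information and color figures). A minimal sketch of that calculation from the counts listed above; the variable names are mine, and values are truncated to whole dollars to match the figures in the text.

```python
PAGE_FEE, SI_FEE, COLOR_FEE = 70, 250, 300

# {color figures per paper: number of sampled papers}, from the tally above
color_tally = {5: 2, 4: 6, 3: 2, 2: 4, 1: 2, 0: 2}

sampled = sum(color_tally.values())
mean_color = sum(figs * COLOR_FEE * n for figs, n in color_tally.items()) / sampled
mean_page_cost = 5.6 * PAGE_FEE    # mean article length x page charge
mean_si = (78 / 88) * SI_FEE       # 78 of 88 papers had supplementary info
total = mean_page_cost + mean_si + mean_color

print(sampled, round(mean_page_cost), int(mean_si), int(mean_color), int(total))
# 18 392 221 833 1446
```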

In 2004 Cozzarelli et al. suggested that around $2000/article would be needed to cover PNAS'  costs without subscription income.


Science: $650 for the first color figure, $450/color figure thereafter

March 20 2009 vol 323 issue 5921: 2 research articles, 11 reports:

4 color figures ($2000) iii
3 color figures ($1550) i
2 color figures ($1100) iiii
1 color figure ($650) ii
0 color figures iii

mean color cost = mean cost/article = $1019

 
Nature: £735 ($1072) for the first colour figure and £262.50 ($383) for each additional figure (note: "Inability to pay this charge will not prevent publication of colour figures judged essential by the editors")

March 19 2009 vol 458 number 7236: 2 articles, 12 letters:

5 color figures ($2604) ii
4 color figures ($2221) iiii
3 color figures ($1838) iii
2 color figures ($1455) iii
1 color figure ($1072) i
0 color figures ii

mean color cost = mean cost/article = $1669


Cell: $1000 for the first color figure and $275 for each additional color figure. 

March 20 2009 vol 135 number 6: 12 articles:

7 color figures ($2650) iii
6 color figures ($2375) iii
5 color figures ($2100) ii
4 color figures ($1825)
3 color figures ($1550) ii
2 color figures ($1275)
1 color figure  ($1000) ii
0 color figures

mean color cost = mean cost/article = $2031
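Science, Nature and Cell all use the same fee structure: one price for the first color figure and a lower price for each additional one. Here is a small sketch that recomputes the three means from the tallies above. The function names are hypothetical, and results are truncated to whole dollars. Note that the Nature tally as listed adds up to 15 papers, and the mean is taken over those 15.

```python
def color_charge(figures, first, additional):
    """Charge for a paper: `first` for the first figure, `additional` for each extra."""
    if figures == 0:
        return 0
    return first + (figures - 1) * additional


def mean_charge(tally, first, additional):
    """Mean charge over a tally of {figure count: number of papers}."""
    papers = sum(tally.values())
    total = sum(color_charge(f, first, additional) * n for f, n in tally.items())
    return total / papers


science = mean_charge({4: 3, 3: 1, 2: 4, 1: 2, 0: 3}, first=650, additional=450)
nature = mean_charge({5: 2, 4: 4, 3: 3, 2: 3, 1: 1, 0: 2}, first=1072, additional=383)
cell = mean_charge({7: 3, 6: 3, 5: 2, 3: 2, 1: 2}, first=1000, additional=275)

print(int(science), int(nature), int(cell))  # 1019 1669 2031
```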


J Neurosci: $850 for regular manuscripts, $450 for brief communications, color figures are free "when color is judged essential by the editors and when the first and last authors are members of the Society for Neuroscience", otherwise $1,000 each.

March 18 2009 vol 29 issue 11: 28 articles; looked at 4 random articles, number of color figures = 6, 8, 5, 1.  Regular SfN membership is $160.  I'm guessing most authors are members but it's still impossible to tell how much each paper is being charged for color.


Landes Bioscience (all journals): four pages free, then $80/page; $340 for the first color page and $150 for each additional color page (in print -- color is free online)

Cell Cycle March 15 2009 vol 8 issue 6: 10 research reports, pp 870 - 949

pages = 5,12,6,5,6,6,8,5,8,9
pages charged = 1,8,2,1,2,2,4,1,4,5; total = 30, mean = 3 = $240

7 color figures ($1240)
6 color figures ($1090)
5 color figures ($940)
4 color figures ($790) iiii
3 color figures ($640) i
2 color figures ($490)
1 color figure  ($340) iiii
0 color figures i

mean color cost = $516; mean total cost/article = $756
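The Cell Cycle figure adds a page charge that only applies beyond four free pages. A sketch of that calculation from the counts above, with made-up variable names; like the tally, it treats each color figure as one color page, which is an assumption since the publisher charges per color page.

```python
FREE_PAGES, PAGE_FEE = 4, 80
FIRST_COLOR, EXTRA_COLOR = 340, 150

pages = [5, 12, 6, 5, 6, 6, 8, 5, 8, 9]
color_tally = {4: 4, 3: 1, 1: 4, 0: 1}   # {color figures: papers}

# Only pages beyond the free allowance are charged.
charged_pages = [max(0, p - FREE_PAGES) for p in pages]
mean_page_cost = sum(charged_pages) / len(pages) * PAGE_FEE


def color_cost(figs):
    """First color page at one price, each further page at a lower one."""
    return 0 if figs == 0 else FIRST_COLOR + (figs - 1) * EXTRA_COLOR


mean_color = (sum(color_cost(f) * n for f, n in color_tally.items())
              / sum(color_tally.values()))

print(sum(charged_pages), mean_page_cost, mean_color, mean_page_cost + mean_color)
# 30 240.0 516.0 756.0
```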


EMBO J: $250/page over 6 pages, plus color charges: $650/figure for the first three figures, $432/figure for the next two, $2928 for six figures and $326 per additional figure thereafter.

March 18 2009 vol 28 number 6: 15 articles

pages = 10,8,10,10,13,8,10,13,13,10,8,9,10,12,12
pages charged = 4,2,4,4,7,2,4,7,7,4,2,3,4,6,6; total = 66, mean = 4.4 = $1100

9 color figures ($3906) i
7 color figures ($3254) ii
6 color figures ($2928) ii
5 color figures ($2814) ii
4 color figures ($2382) i
3 color figures ($1950)
2 color figures ($1300) ii
1 color figure  ($650) ii
0 color figures iii

mean color cost = $1874; mean total cost/article = $2974


Molecular Biology of the Cell: according to the Am Soc Cell Biol, in their 2006 publication "MBC and the Economics of Scientific Publishing" (available as a pdf from the linked page):

The average article published in MBC in 2006 was 11.7 pages long and included 2.9 color figures. With the 20% discount on page and color charges now offered to ASCB members, publishing such an article would cost the author $1,829.
(Regular ASCB membership is $130.) Interestingly, the same publication gives the following details of budgeted (projected?) journal revenue for 2008:


MBC.png



I don't know how similar that breakdown would be for other journals, but it's interesting that subscription revenue is roughly equal to page OR color charges -- meaning that the average author would pay about 50% more if the journal switched to full cost recovery from author side fees.  This would put MBC's author side fees roughly on par with those charged by the top two PLoS tiers.


The American Physiological Society's Author Choice (hybrid OA) fee is $3000 for review articles and $2000 for research articles; according to their FAQ this is because:

For research articles, the Author Choice fee was determined by calculating the real average cost ($3,000) of publishing an article in an APS journal, and subtracting the actual average amount already paid by authors in author fees (page charges and color fees). The Author Choice fee for review articles is $3,000, because there are no other fees paid by authors of review articles. The Author Choice fee was designed to completely cover the cost of publishing an article.
which indicates that the average author-side fee for the 14 journals published by the APS is $1000.


Journal of Nutrition: in this editorial, AC Ross gave some figures regarding costs:

On average, each published page costs about $465, and pages with color, $1300! Each published manuscript costs, on average, $3233. Page charges (starting at $70) and color charges to authors ($400 per figure) are only a fraction of the actual costs of publication. Institutional subscriptions remain a key factor in the financial success of professional society journals like JN.
Page charges are currently $75/page for the first 7 pages and $120/page thereafter, and color charges are still $400/figure.


March 2009 volume 139 issue 3: 29 articles

pages = 5,4,7,4,8,5,6,7,5,6,6,4,6,7,5,4,6,6,7,5,6,7,5,5,5,3,4,7,5
mean page charge = $415

1 color figure  ($400) iii
0 color figures iiiii iiiii iiiii iiiii iiiii i

mean color charge = $41; mean total cost/article = $456
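The Journal of Nutrition page charge has two rates, so here is a sketch of how the means above can be reproduced from the page counts. The function name and parameter names are hypothetical; values are truncated to whole dollars.

```python
def page_charge(pages, base_rate=75, base_pages=7, extra_rate=120):
    """$75/page for the first 7 pages, $120/page after that."""
    return min(pages, base_pages) * base_rate + max(0, pages - base_pages) * extra_rate


pages = [5, 4, 7, 4, 8, 5, 6, 7, 5, 6, 6, 4, 6, 7, 5,
         4, 6, 6, 7, 5, 6, 7, 5, 5, 5, 3, 4, 7, 5]
COLOR_FEE = 400
papers_with_one_color_figure = 3

mean_page = sum(page_charge(p) for p in pages) / len(pages)
mean_color = papers_with_one_color_figure * COLOR_FEE / len(pages)

print(len(pages), int(mean_page), int(mean_color))  # 29 415 41
```

Adding the two truncated components gives the $456 quoted above.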


Molecular Biology and Evolution: in the 2008 Editor's Report (pdf available here) the Society for Mol Biol and Evolution provided the following figures for MBE in 2008:

average article length: 10.1 pages
average number of color figures per article: 0.927

Current charges are $50/page plus $450 per color figure, giving an average cost/article of $922.


Phytopathology and Plant Disease: $50 per printed page for the first six pages and $80 per printed page for each additional page for members of The American Phytopathological Society and $130 per printed page for nonmembers. In addition, there is a $20 fee charged for each black-and-white figure or line drawing. Color charges are $500 for the first illustration, $500 for the second illustration, and $250 for the third and each subsequent color illustration in one article.

Molecular Plant-Microbe Interactions: $150 for the first 6 pages, $150/page (or fraction thereof) thereafter; color charges are $500 for the first illustration, $500 for the second, and $250 for the third and each subsequent color illustration in one article. In addition, there is a $20 fee charged for each black-and-white figure or line drawing.

The Society's Reports of Publications from 2000 gives the following figures:

Phytopathology: average article = 7.3 pages, average color figs/article = ?
Plant Disease: average article = 5.4 pages, average color figs/article = ?
MPMI: average article = 9.4 pages, average color figs/article = 1.05; mean cost/article = $1275

(Regular membership in Am Phytopath Soc is $76.)
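The $1275 figure for Molecular Plant-Microbe Interactions can be reproduced under one particular reading of the fee schedule: a flat $150 covering the first 6 pages, $150 for each further page or part of a page, and $500 per color figure, with the black-and-white figure fee left out. That reading is an assumption; it matches the number, but the post does not spell out the calculation.

```python
import math


def mpmi_page_charge(pages, flat_fee=150, flat_pages=6, extra_rate=150):
    """Assumed reading: flat fee for the first 6 pages, then
    $150 per additional page, with any fraction of a page rounded up."""
    extra_pages = max(pages - flat_pages, 0)
    return flat_fee + math.ceil(extra_pages) * extra_rate


avg_pages = 9.4          # from the Society's report
avg_color_figs = 1.05    # from the Society's report
color_rate = 500         # rate for the first and second color illustration

mean_cost = mpmi_page_charge(avg_pages) + avg_color_figs * color_rate
print(f"mean cost/article = ${mean_cost:.0f}")
```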


Journal of Natural Resources and Life Sciences Education: $350/article, $10 per table and $10 per figure plus $100/color page (print only; color is free online).

Vol 36, 2007: 17 articles; number of figs/tables = 1,3,6,7,12,4,5,4,8,8,5,5,9,1,2,2,4; only a couple had color figures. Mean additional charge = $50, mean cost/article = $400
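A quick sketch of the same calculation, assuming every table and figure is billed at $10 and ignoring the handful of color pages:

```python
# Figures + tables per article, JNRLSE vol 36 (2007), from the list above.
items_per_article = [1, 3, 6, 7, 12, 4, 5, 4, 8, 8, 5, 5, 9, 1, 2, 2, 4]

base_fee = 350       # per article
per_item_fee = 10    # per table or figure

mean_additional = per_item_fee * sum(items_per_article) / len(items_per_article)
mean_cost = base_fee + mean_additional

print(f"mean additional charge = ${mean_additional:.2f}")
print(f"mean cost/article      = ${mean_cost:.2f}")
```

The unrounded mean additional charge is a little over $50, so $400/article is a slightly conservative round number.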



Sunday, 14 June
*bump*

On FriendFeed, items move back up the temporal sequence when they get "likes" and comments, giving them extra chances to be noticed. In addition, a "like" or comment from one of your friends will bring an item into view even if posted by someone whose stream you don't follow. The emerging mores of the system include leaving a one-word comment, bump, to indicate that one feels a particular item is worthy of wider attention -- "bumping" the item up the queue, as it were.

That's what I'm doing with this post. Richard Poynder is trying to put together a list of institutions and funding bodies which have established funds to pay for Gold Open Access:

I am trying to establish how many research institutions and funders have created Gold Open Access (Gold OA) authors funds, and would be grateful for input from others.

I am aware that the Wellcome Trust announced a scheme for paying OA publication fees for its grantees in 2006. But what other funders have introduced such schemes?

So far as research institutions are concerned, Peter Suber kindly provided me with the following list of those he knows have created Gold OA funds:

University of Amsterdam
University of Calgary
University of California, Berkeley
Delft University of Technology
ETH Zurich
Griffith University
University of Helsinki
Institute of Social Studies (Netherlands)
Lund University
University of North Carolina, Chapel Hill
University of Nottingham
University of Tennessee, Knoxville
Texas A&M University
Tilburg University
Wageningen University and Research Center
University of Wisconsin

However, I do not think this list is complete.

Richard also points out that it is probably useful to keep track of which Gold funds are complemented by a Green mandate, and makes the (imo excellent) suggestion of establishing a Gold Fund equivalent to ROARMAP, which tracks Green Mandates.

So -- *bump* -- please go read Richard's post, and help him out if you can.

Update: Peter Suber has created and pre-populated the Open Access Directory list of journal OA funds, so if you have information please add it there.



Friday, 05 June
That's the way you do it!

Via Peter Suber, I am delighted to find that Stuart Shieber has started a weblog, and even more delighted that in one of his first entries he has turned my long-ago author-side fees DOAJ hack into an actual, readily reproducible study:

Here are the results computed by my software, as of May 26, 2009:

Charges.......................951  (23.14%)
No charges....................2889 (70.29%)
Information missing...........270  (6.57%)
Hybrid........................1519 (26.99%)
Total.........................5629
The numbers are consistent with those of Hooker's study some 16 months earlier.
It's great to have the numbers confirmed, and even better to be able to make regular updates and construct time series. Thanks to Stuart for doing it right, and for making the code freely available.

(Note, had to reformat the quoted table into ugly text, because I still can't get MT to play nice. Grrr.)
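One thing worth noticing in the quoted table is that the percentages seem to use two different denominators: the first three rows add up to 100% of the non-hybrid journals, while the hybrid row is a share of the grand total. The sketch below checks that reading against the quoted counts. It is inferred from the numbers alone, not taken from Stuart's code.

```python
# Counts as quoted above (May 26, 2009).
non_hybrid = {
    "Charges": 951,
    "No charges": 2889,
    "Information missing": 270,
}
hybrid = 1519

non_hybrid_total = sum(non_hybrid.values())
grand_total = non_hybrid_total + hybrid

# Assumed denominators: non-hybrid total for the first three rows,
# grand total for the hybrid row.
shares = {k: round(100 * v / non_hybrid_total, 2) for k, v in non_hybrid.items()}
hybrid_share = round(100 * hybrid / grand_total, 2)

for name, pct in shares.items():
    print(f"{name}: {pct}%")
print(f"Hybrid: {hybrid_share}%")
print(f"Total: {grand_total}")
```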



Friday, 05 June
What use are research patents?

DrugMonkey has a conversation going about the ongoing kerfuffle over (micro)blogging of conference presentations (see also the FriendFeed discussion). I want to go off on a tangent from something that came up in his comment thread, so rather than derail it I thought I'd post here.

In his first comment in the thread, David Crotty made the following claim:

Lots of researchers support their families and labs through money generated by patents, and most universities are heavily dependent upon their patent portfolios for funding.

That doesn't accord with my (limited!) experience -- I know a few researchers who hold multiple patents, and none of them ever made any money that way -- and my general impression is that the return on investment for tech transfer offices and the like is fairly dismal.

This seems like the sort of beans that beancounters everywhere should be counting, so I asked on FriendFeed whether anyone knew of any data to address the question of whether universities really make much money from patents. Christina Pikas pointed me to the Association of University Technology Managers, whose 2007 Licensing Activity Survey is now available.

I extracted data for 154 universities and 27 hospitals and research institutions. Between them, in 2007, these institutions filed 11116 patent applications, were awarded 3512 patents, and gave rise to 538 start-up companies. I calculated licensing income as a percentage of research expenditure:


patents1.png

Apart from New York University (I wonder what they own that's so profitable?), it's clear that none of these universities are "heavily dependent upon their patent portfolios for funding". In fact, more than half of them (78/154) made less than 1% of their research expenditure back in licensing income, and the great majority (144/154) made less than 10%.

Licensing income for Massachusetts General Hospital and "City of Hope National Medical Ctr. & Beckman Research" (whoever they are) amounted to 65-70% of research expenditure, but none of the other hospitals or research institutions made more than 20%. More than half of this group (15/27) made less than 2%, and most of them (23/27) made less than 10%.
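The measure used here is simply licensing income divided by research expenditure, followed by a count of how many institutions fall under a threshold. A minimal sketch follows; the three institutions and their dollar amounts are made-up placeholders for illustration only and are not values from the AUTM survey (the real numbers are in the spreadsheet).

```python
def licensing_share(licensing_income, research_expenditure):
    """Licensing income as a percentage of research expenditure."""
    return 100 * licensing_income / research_expenditure


# HYPOTHETICAL figures (licensing income, research expenditure) in dollars.
# These are NOT from the AUTM data; they only show the shape of the calculation.
example = {
    "Institution A": (2.0e6, 400e6),
    "Institution B": (30e6, 250e6),
    "Institution C": (0.5e6, 90e6),
}

shares = {name: licensing_share(*amounts) for name, amounts in example.items()}
below_1_percent = sum(1 for s in shares.values() if s < 1)
below_10_percent = sum(1 for s in shares.values() if s < 10)

for name, pct in shares.items():
    print(f"{name}: {pct:.2f}%")
print(f"{below_1_percent}/{len(shares)} below 1%, "
      f"{below_10_percent}/{len(shares)} below 10%")
```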

The distribution looks just about as you would expect:


patents2.png

I also wondered whether there was any evidence that greater numbers of patents awarded, or more money spent per patent, resulted in higher licensing income. As you can see, the answer is no (insets show the same plots with the circled outliers removed):

patents3.png

patents4.png


I don't know how representative this dataset is; there are several thousand universities and colleges in the US, and surely even more hospitals and research institutions, so the sample size is relatively small. It does include some big names, though - Harvard, Johns Hopkins, MIT, Stanford, U of California -- and I would expect a list of schools answering the AUTM survey to be weighted towards those schools with an emphasis on tech transfer.

In any case, I'm not buying David's assertion that "most universities", or most hospitals or research institutes for that matter, rely heavily on licensing income. And that being so, I am also somewhat skeptical about the number of researchers' families being supported by patents.

What's the Open Science connection? Well, if you're interested in patenting the results of your research, there are a lot of restrictions on how you can disseminate your results. You can't keep an Open Notebook, or upload unprotected work to a preprint server or publicly-searchable repository, or even in many cases talk about the IP-related parts of your work at conferences. It seems from the data above that most universities would not be losing much if they gave up chasing patents entirely; nor would they be risking much future income, since so few seem to get significant funds from licensing. My own feeling is that any real or potential losses would be much more than offset by the gains in opportunities for collaboration and full exploitation of research data that come with an Open approach.

Updates:

1. Christina left a comment pointing out that patents may be required for more than simply making money from licensing:

...an extremely important reason universities patent [is] to protect their work so that they may exploit it for future research... it turns out that universities have to patent in life sciences - even if they don't actively market and license these patents - to be able to attract new research money from industry.

There are two distinct points here: first, that if you don't patent you may not attract industry partners, and second, that if you don't patent you may end up licensing your own tech back from someone else (I note that most tech licenses I know of are cheap or free "for research purposes" so the latter factor might not weigh so heavily). According to the 2007 AUTM data, industry investment in academic research amounted to about 7% of research expenditure and was up 15% over 2006.

2. David responded on DM's thread with some counter-evidence, on reading which I realise that the data above may (likely?) only show what the university received and not any money that went to the labs or researchers involved. Tech transfer may not be financially worth it for the university itself, but it might still be doing good things for individual labs and PIs, in which case it would constitute a support service the university offers its research community. It also strikes me that my experience, such as it is, is mainly with Australian researchers, whereas David's is in the US, so cultural differences may also apply.

3. More from Christina at her own place, here.

_____________
If you want the data, the spreadsheet I used is here.



Wednesday, 03 June
What happened to serials prices in 1986-87? (Update: probably nothing.)

This could be nothing but an artifact (e.g. of the way the data were collected), but if you look at Fig 1 from this post, there's a clear break in the serials expenses (EXPSER) curve that's not evident in any of the others. Here's the same plot reworked to emphasize what I'm talking about:


indices4.png

If you squint just right you can imagine a similar but much weaker effect, beginning a year or two later, in the total expenditures (TOTEXP) curve; and the salaries (TOTSAL) curve seems to start a similar upward trend at about the same time but then levels off after 1991 or so. I wouldn't put any weight on either of those observations though -- I'd never have noticed either if I hadn't been comparing carefully with the EXPSER curve.

I've added linear regression lines for the 1976-1986 and 1987-2003 sections of the EXPSER data, just to emphasize the change in rate of increase. For those of you who will twitch until they know, just 'cos, the regression coefficients of the two lines are 0.99 and 0.98 respectively. If you extrapolate from just the 76-86 section, TOTEXP exceeds the forecast for EXPSER after about 2000.
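For anyone who wants to repeat the exercise, here is a rough sketch of fitting separate least-squares lines to the two segments and extrapolating the earlier one. The series in it is a synthetic placeholder, not the ARL EXPSER data, and reporting Pearson's r is an assumption, since the text above does not say which statistic the 0.99 and 0.98 are.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept, r)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r


years = list(range(1976, 2004))
# SYNTHETIC placeholder series with a change of slope after 1986.
values = [10 * (y - 1976) if y <= 1986 else 100 + 30 * (y - 1986) for y in years]

early = [(y, v) for y, v in zip(years, values) if y <= 1986]
late = [(y, v) for y, v in zip(years, values) if y >= 1987]

slope_early, intercept_early, r_early = fit_line(*zip(*early))
slope_late, intercept_late, r_late = fit_line(*zip(*late))

# Extrapolate the 1976-1986 trend forward, as in the comparison with TOTEXP.
forecast_2003 = intercept_early + slope_early * 2003

print(f"1976-1986: slope={slope_early:.2f}, r={r_early:.3f}")
print(f"1987-2003: slope={slope_late:.2f}, r={r_late:.3f}")
print(f"1976-1986 trend extrapolated to 2003: {forecast_2003:.1f}")
```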

I have no idea if this means anything, but it is tempting to speculate. For instance: when did the big mergers begin in Big Publishing, and when did the big publishing companies start the odious practice of "bundling", that is, selling their subscriptions in packages so that libraries are forced to subscribe to journals they don't want just to get the ones they do?


Update: it's probably nothing; the curve simply shows an increasing rate of increase, and you can break it up into at least five reasonably convincing-looking segments with breaks at 86-87 and 94-95. It's possible there were two "pricing events" around those times, but I think this is most likely just an illustration of what can happen when you look a little too hard for patterns in your data!


indices6.png




Tuesday, 02 June
Every little bit counts.

There are so many good causes, and so many of them are not just good but urgent -- even assuming you have some money to spare, where are you to donate it? Everyone has their own solution to this problem. Mine is to try to hedge my bets: donate roughly equally to long- and short-term, local and global, human and environmental. I'm out of work and thoroughly skint right now, but I try to remember that by world standards I'm still living like a king; my budget includes some "don't go insane" funds for occasional movies or dinners out or whatever, and I can always skip one of those in order to give just a little to some good cause.

One such is the Open Knowledge Foundation, which is turning five and asking for support:

This month the Open Knowledge Foundation is five years old.

Over those last five years we've done much to promote open access to information -- from sonnets to stats, genes to geodata -- not only in the form of specific projects like Open Shakespeare and Public Domain Works but also in the creation of tools such as KnowledgeForge and the Comprehensive Knowledge Archive Network, standards such as the Open Knowledge Definition, and events such as OKCon, designed to benefit the wider open knowledge community. (More about what we've been up just over the last year can be found in our latest annual report).

While we have achieved a lot, we believe we can do much, much more. We are therefore reaching out to our community and asking you to help us take our vision further.

Our aim: at least a 100 supporters committed to making regular, ongoing donations of £5 (EUR 6, $7.50) or more a month.

These funds will be essential in expanding and sustaining our work by allowing us to invest in infrastructure and employ modest central support. To pledge yourself as one of those supporters all you need to do is take 30 seconds to sign up to our "100 supporters" pledge at:

http://www.pledgebank.org/support-okfn/

And if you want to act on the pledge right now (or make any other kind of donation), please visit: http://www.okfn.org/support/

We are and will remain a not-for-profit organization, built on the work of passionate volunteers but these additional fund are essential in maintaining and extending our effort. Become a supporter and help us take our work forward!

I'm in no position to make a regular commitment, but I skipped a movie and sent 'em ten quid. It's not much but it's my hope that small donations can be a powerful force in the internet age. The other thing I can donate is publicity, which is what this post is for.

Why donate to OKF? My belief is that openness is not only our best weapon in the unending battle against bad actors and free riders, it is the key to a radically more efficient scientific process, which in turn is the key to all material progress in human quality of life.

The OKF not only builds tools and standards for open exchange of information, but they are also part of the front line effort to make openness and transparency into a constant, widely adopted habit of mind and of behaviour. To choose a topical example, we won't have appropriate access to information about the spending habits of our elected officials until we are so in the habit of openness that it is a surprise and an affront to the average citizen to realise that such information is being kept secret. To choose my own bête noire as another example, we won't be free of "data not shown" in the scientific literature until the majority of scientists respond to that phrase with an immediate and indignant "why the hell not?".

So, support for the OKF is one of my long-term choices: an investment in a better future for everybody. If you have a couple of dollars to spare, please consider investing with me.



Monday, 01 June
Pick an index, any index.

Over at The Scholarly Kitchen, Philip Davis takes the ARL to task for comparing their serials expenditures with the Consumer Price Index:

By adopting the CPI as a general frame of reference, almost any industry that requires huge professional worker input will look like it is spiraling out of control. Perhaps this is the reason the ARL uses the Consumer Price Index as a reference for journal prices when it could have used the Higher Education Price Index, the Producer Price Index, or an index which more closely resembles professional knowledge production.

The CPI is an excellent tool for collective salary bargaining, for estimating who should be eligible for food stamps or free school lunches. It is a very bad tool for measuring the purchasing power of libraries or justifying a reinvention of the journal publication system.

Since I've just played around with updating the famous graph to which Davis takes exception, I thought I'd better take a closer look at the alternative indices he suggests.

From the Commonfund 2008 HEPI Report (pdf; linked from here) I extracted historical HEPI and CPI data from 1976 to 2003, and from the ARL stats interface at U Virginia I extracted the median values for serials expenditures (EXPSER), total salaries expenditures (TOTSAL) and total expenditures (TOTEXP) for the same period (it was limitations in the ARL data range that dictated the time period). I also extracted Producer Price Index data for "all commodities" (PPI ALL) over the same period from the Bureau of Labor Statistics. There are lots of choices for PPI data, but most of them don't go back as far as 1976. (I did try a couple of industries that I thought required "huge professional worker input", such as hospitals and book publishers, but the data weren't available for all the years I wanted -- and by eyeball it didn't look as though they showed much greater increase than the all commodities index.)
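Each series was converted to percent cumulative change so that indices and dollar amounts can share an axis. A minimal sketch of that conversion, assuming the first year in the series (1976 here) is the baseline; the sample values are placeholders, not figures from the spreadsheet:

```python
def percent_cumulative_change(series):
    """Percent change of each value relative to the first value in the series."""
    base = series[0]
    return [100 * (value - base) / base for value in series]


# Placeholder values for illustration only.
sample = [100, 110, 125]
print(percent_cumulative_change(sample))
```

Because every series starts at zero under this transformation, curves with very different units (an index, a median expenditure) can be compared directly by how fast they climb.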

Plotting percent cumulative change against time we see:


indices1.png

There isn't a lot of difference between the HEPI and the CPI, and the all commodities PPI index shows even less increase. Davis suggests that salaries, professional worker input, are at least part of the reason why the CPI is a poor choice for comparison with serials costs, but (to the extent that the HEPI is a better "professional worker weighted" measure) the data do not bear him out. Neither does his claim regarding librarian salaries fit the data I have to hand:
If we plotted academic librarian salaries against the CPI, we could claim that the profession was in crisis, that salary growth was unsustainable, and that the system was simply broken.

It's clear from the data, though, that library salary expenditures have outstripped the HEPI and CPI, but not by as much as total expenses and not by nearly as much as serials costs.

Remember, too, that this is still only part of the story: "serials" includes a great many publications whose costs have not increased at the same rate as the scholarly literature. The Abridged Index Medicus data I got from EBSCO only cover 1990 onwards, so I reworked the comparison to include the AIM data:


indices3.png

I used the AIM data because comparison with a much larger data set, broken down by individual discipline, showed that the AIM data gave what looks like a reasonable "middle value" -- and as you can see, scholarly journal price increases outstrip all others, including total serials, by a considerable margin.

Note also that there's little difference between "total salaries" and "professional salaries" -- the professional salary data series (SALPRF) only goes back to 1986, which is why I've included it in this second graph.

None of this is to say that the CPI is the ideal comparison index against which to measure increases in the cost of the scholarly literature. It seems from the comparisons above, though, that there's not much difference for this particular purpose between the CPI and the HEPI. While I don't have data for publishing industry salaries, library salaries hew fairly closely to the HEPI and to total library expenditures. It therefore doesn't seem that salaries have much to do with the much-bruited discrepancy between "general cost of living/doing business/whatever" increases and the rise and rise of the cost of scholarly literature.

If you want the data I used, the spreadsheet is here.



Tuesday, 26 May
Motes, beams &c.

A while back, Philip Davis over at The Scholarly Kitchen posted about a small but useful research project of his:

All I did was ask five librarians at institutions administrating Open Access publication charges two simple questions:

"Can you provide a list of Open Access articles that you have supported through your author support program," and "Have you rejected any requests to date?"

This is (to me) clearly information that such programs should be collating and reporting, and after two weeks Davis' results were not exactly stellar:

Two weeks after asking my simple questions, I received just two short responses. No list, no numbers, but at least a few details: There was some confusion on the part of faculty of what an OA article publication charge really was. Some faculty requests were actually for page charges in conventional subscription journals; one faculty submitted a request for reprint charges; others submitted invoices to the library when they should have been directed to the external granting agency (like the HHMI). To date, no bonafide requests have been denied.

That's useful information, as far as it goes, but it doesn't go very far. Davis plays the conspiracy theory card way too hard for my taste, with "dark secrets" in the post title and an opening paragraph that reeks of melodrama:

You would have thought I was requesting a field manual for interrogating prisoners of war or a list of members on Dick Cheney's Energy Taskforce. At least in those instances, I would have received a response that answering my questions violated national security or "executive privilege."

Whoa, cowboy, back up a minute. As commenter Amanda R pointed out, we don't know much about how Davis went about gathering the information:

As a point of clarification, were you directly refused data, or did libraries simply not respond? Did you contact them back and ask why there was no response, or if there was a reason they weren't providing the full data you wanted?

Obviously, you deserve a professional response from the libraries you contacted. But, as much as it pains me to say it, I could easily imagine a library in which a request for statistics was bumped around internally for a few weeks before actually being answered.

In a Friendfeed discussion, librarian Christina Pikas made a related point:

the worst part of this is figuring out who you would send a request like that to. It takes me 10 e-mails and 3 phone calls to find the right person at my mothership main library. Almost seems that he's taking confusion for malicious intent

as did commenter JQ Johnson:

when I in March queried the same institutions that Davis did, I got lots of cooperation. For example, UNC pointed me to a public letter (2/20/2009) to their vice chancellor that summarized in some detail the 12 requests they had funded to date. I'm puzzled why Davis got the response he did. Did he ask the wrong people?

Davis replied to both Amanda R and JQJ, but he gave non-answers containing no information about his methodology and insisted that what he had shown was a lack of transparency:

Whether the lack of response was caused by human error, technological barriers or internal policy, the result is a lack of transparency in how these author-support programs are performing.
[...]
These are all good questions but they skirt around the main issue of why I received only 2 responses, and why even these two responses were unable to provide me with any meaningful (even summarized or anonymized) data.

I found this very frustrating and left a comment1 aimed at clarifying why that was so:

JQJ's comments and questions do not seem to me to skirt the issue at all, but rather to speak directly to alternative explanations for the lack of response. Methodological concerns are not trivial here.
  • Whom did you contact?
  • Did you say explicitly that you were sensitive to confidentiality issues and happy with various forms of anonymized data?
  • Did you phone anyone, or simply email?
  • How do you know your emails didn't just end up in the spam bin?
  • Did you follow up (an unanswered question from Amanda, above)?
And so on. You have asked good questions, and have shown that routine reporting could be improved for such programs (already a useful outcome). But you need a good deal more evidence -- including a more transparent methodology -- before you go claiming there are "dark secrets" at work.

Now, it's been almost two weeks since I left that comment, and it hasn't appeared or been answered. What dark secrets is Philip Davis hiding? What dim, Crotty-esque ambitions of being the famous naysayer, the Nicholas Carr of Open Access, are forming even now in the troubled subconscious of this ---

Or, you know, I just got stuck in the spam queue. It happens. :-)

Davis finishes up by saying something relatively unexceptionable if taken out of the context of his insistence on ignoring both Occam's and Hanlon's razors:

Library Open Access policies cannot exist with secret budgets, ambiguous guidelines, and a practice of stonewalling requests for information.

Those who campaign for Open Access need to be held accountable just like everyone else, and budget transparency is the first step.

Exactly so -- everyone else, including bloggers who wish to hold librarian feet to the accountability fire.


1 I added the list formatting for this post, hoping for improved readability.




Monday, 11 May
The Semantic Web: a long and somewhat convoluted definition.

This1 is an attempt to define and explain the semantic web for a lay audience, though it should be remembered that I am a member of that audience myself...

It is a commonplace that we are drowning in information, and nowhere is this "information overload" more apparent than in scientific research. The National Library of Medicine's literature database, PubMed, is searched more than 60 million times a month and contains almost 19 million records from more than 5300 journals -- still only a fraction of the approximately 15,000 active, refereed, scientific journals listed in Ulrich's Periodicals Directory2. GenBank, the world's foremost repository of nucleic acid sequence information, contains roughly 100 billion bases in 100 million sequence records, and is growing at an exponentially increasing rate that is currently in excess of 50,000 records per day. Unlike PubMed and GenBank, which are cross-disciplinary databases, the Nucleic Acids Research Molecular Biology Database Collection is a carefully curated list of high-value specialist resources; it currently lists 1170 distinct, largely non-overlapping databases. I could go on, but you get the point3.

As things stand, researchers talk to researchers and use computers to facilitate that conversation; what we need is for computers to be able to talk to computers. To cope with (literally) inhuman volumes of data, we need that data to start making sense to machines, so that they can do something no human brain can do: process all of it. We need to make it possible for machines to transfer richly interconnected data among themselves, mix and remix it, generate new connections, filter it, process it, transform it, and output the results to formats and interfaces that make sense to human brains -- substrates on which we can carry out the sorts of synthetic, creative thinking that computers cannot do.

We need a man-machine partnership in which both partners can do what they do best, and that means we need the semantic web.

The semantic web is the outcome of processes and frameworks with which computers can manipulate data in a way that makes it accessible by human brains. It is built on the standards and metadata -- information about data -- that are required for automated data exchange and processing, which in turn is required to create machine-generated, human-scale summaries, skeletons, outlines and other representations of, and interfaces with, the entire knowledge corpus.

Here's an example. Human brains have no trouble processing the following data:

Another reason for opening access to research. Wilbanks J. BMJ. 333:1306-8 (2006).

To you, that's a reference; but to a computer, it's just a string of text. What a computer needs is information (metadata) about each substring:

Title: Another reason for opening access to research.
Author: Wilbanks, J
Journal: British Medical Journal
Volume: 333
Pages: 1306-8
Date: 2006

Now the computer "knows" which letters identify John, which constitute the title of the article, and so on. If you set the standards up properly, it even "knows" that Wilbanks is the surname and J the first initial, and so on into ever finer grained properties.

Now imagine you had, oh, say, about 19 million such records. A human brain cannot do anything useful with such a database, but a computer can -- which is exactly why we can ask PubMed human-scale questions like "how many papers did J Wilbanks publish between 2000 and 2009?", or "show me all the papers with "access to research" in the title".
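To make that concrete, here is a toy sketch of the two questions above asked of structured records. The field names and helper functions are my own illustrative choices and have nothing to do with how PubMed is actually implemented; the single record holds a subset of the fields from the example above.

```python
# One structured record; a real database would hold millions of these.
records = [
    {
        "title": "Another reason for opening access to research.",
        "author": {"surname": "Wilbanks", "initials": "J"},
        "journal": "British Medical Journal",
        "year": 2006,
    },
]


def papers_by(records, surname, initials, first_year, last_year):
    """Records by a given author published within an inclusive year range."""
    return [
        r for r in records
        if r["author"]["surname"] == surname
        and r["author"]["initials"] == initials
        and first_year <= r["year"] <= last_year
    ]


def title_contains(records, phrase):
    """Records whose title contains the phrase, ignoring case."""
    return [r for r in records if phrase.lower() in r["title"].lower()]


print(len(papers_by(records, "Wilbanks", "J", 2000, 2009)))
print([r["title"] for r in title_contains(records, "access to research")])
```

Neither query is possible against the bare citation string without guessing where the author ends and the title begins; the labels are what make the questions answerable.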

Now multiply that -- the ability to ask human-scale questions of a mass of information far too large for any human brain to absorb or process -- by thousands of different types of information (text, gene sequences, chemical formulae, microarray results, etc etc), millions of individual records within each data type, recorded in thousands of journals and databases, produced by hundreds of thousands of laboratories, libraries and garage hackers. Imagine what we could learn if we could query all of that information on a human scale.

There: that's a glimpse of the potential power of the semantic web.

-------------
1 This entry started life as an early draft of a letter in support of John Wilbanks' application for a TED fellowship. We didn't get enough signatures in time, so it never was even sent. My apologies to those people who did sign on; if John re-applies I'll try again, with better planning!

2 tickboxes = active, refereed, scholarly/academic; search = LC Classification Number for [Q* OR R* OR S* OR T* OR U* OR V*]

3 In fact, I'm always on the lookout for more good examples of the "data deluge" and the rapid progress of science and tech; post 'em (in comments) if you got 'em.



Saturday, 09 May
More on the "Australasian Journal of..." series.

On the basis of the evidence below, I believe the entire "Australasian journal of..." series from Excerpta Medica to be either nonexistent or fake, in the same sense of "fake" that Elsevier has already admitted applies to the following six titles from that series:

  • Australasian Journal of General Practice
  • Australasian Journal of Neurology
  • Australasian Journal of Cardiology
  • Australasian Journal of Clinical Pharmacy
  • Australasian Journal of Cardiovascular Medicine
  • Australasian Journal of Bone & Joint Medicine

WorldCat lists fourteen titles in the apparent series, thirteen of which are not among the six above:

  • Australasian journal of asthma
  • Australasian journal of bone & joint medicine
  • Australasian journal of dentistry
  • Australasian journal of depression
  • Australasian journal of gastroenterology
  • Australasian journal of hospital pharmacy
  • Australasian journal of infectious diseases
  • Australasian journal of musculoskeletal medicine
  • Australasian journal of obstetrics & gynaecology
  • Australasian journal of paediatrics
  • Australasian journal of pain management
  • Australasian journal of psychiatry
  • Australasian journal of respiratory medicine
  • Australasian journal of sexual health

I believe these all to be either nonexistent or fake because:


1a. Although WorldCat lists ISSNs for all titles, all but two include a note saying "ISSN prepublication record". The two entries which do not carry that note are also the only two titles listed as being held in any library:


1b. Only the "Australasian journal of musculoskeletal medicine" and the admitted fake "Australasian Journal of Bone & Joint Medicine" are listed as being held by any library in WorldCat.  Both are listed at the State Library of New South Wales:

Australasian journal of bone & joint medicine.
Chatswood, N.S.W. : Excerpta Medica Communications, 2002- 
v. : ill. ; 30 cm.
State Ref Library
NQ617.7005/1
Vol. 1, issue 2 (2002)-v. 4, issue 1 (2005)

Australasian journal of musculoskeletal medicine.
Chatswood, N.S.W. : Excerpta Media Communications, 2002. 
v. : ill. ; 30 cm.
State Ref Library
NQ617.7005/1
Vol. 1, issue 1 (2002).
I've written to the library to ask for a copy or photograph of either journal.


2. None of the series titles have websites that I can find.


3. None of them are listed in PubMed, Ulrich's Periodicals Directory, Elsevier's own Science Direct or Scopus (I'd be obliged if someone with access could check Web of Science). Update: Peter Murray checked, and couldn't find any of the titles in the WoS "publication name" field. Thanks Peter!


4a. A phrase search in Google Scholar returns hits only for the Australasian journal of psychiatry; all of these are citations, three of which are apparent self-citations to the same article:

Mellsop GW, Menkes DB, El-Badri S. Releasing Psychiatry from the Constraints of Categorical Diagnosis. Australasian Journal of Psychiatry. 2007;15:3-5. doi: 10.1080/10398560601083134
That DOI resolves to an article of the same name and with the same page numbers in Australasian Psychiatry, which is published by Informa Healthcare for The Royal Australian and New Zealand College of Psychiatrists.  I've written to the communicating author, Dr Mellsop, to ask for a reprint.

Of the remaining three hits, two are citations to other articles in the Australasian Journal of Psychiatry and one I cannot decipher without paying a fee to see the references of an obscure paper.  Of the two I can decipher, one resolves to a paper in Australasian Psychiatry from 2003; the same article is available from Informaworld.  The other is to an "in press" citation from 2007 (which also appears in 4b below).


4b. The same search on Google returns a number of hits, including the following:

  • from this page:
    M.I. Loh., & Restubog, S.L. (2007). Lecturers' and Students' Perceptions of Current Teaching Methods about Schizophrenia. Australasian Journal of Psychiatry, 15, 347-349.

    This does not seem to be related to the Informaworld journal Australasian Psychiatry since vol 15 p 347 is this, and I could only find these two papers by Jennifer Loh on the Informaworld site.
  • from this page:
    Langdon, R. (2003). Theory of mind and psychopathology: autism versus schizophrenia [Abstract]. Australasian Journal of Psychiatry.

  • from this page:
    Griffiths, K., Farrer, L., & Christensen, H. (2007). Clickety-Click: the e- trains on track. Australasian Journal of Psychiatry, 15(2), 100-108.
    also from here and here:
    Griffiths, K.; Farrer, L.; and Christensen, H. Clickety-click: The e-trains on track. Australasian Journal of Psychiatry, In press, accepted 10/06.

    This appears to be the same paper in the Informaworld journal, Australasian Psychiatry.
  • from here, here and here:
    Tarantola D (2007) The interface of mental health and human rights in Indigenous populations: triple jeopardy and triple opportunity Australasian Journal of Psychiatry, 15(Suppl):S10-S17

    Again, here's the same paper in Australasian Psychiatry.
  • from here and here:
    Cornes, A., & Napier, J. (in press). Challenges of mental health interpreting: Therapy has taught us that it's all our fault Australasian Journal of Psychiatry.

    And the same paper seems to appear in Australasian Psychiatry.
I've written to Drs Loh, Langdon, Griffiths, Tarantola and Napier to ask for copies.



Saturday, 09 May
Excerpta Medica in action

The Elsevier fake journal scandal is expanding in two directions. First, it's now "fake journals", plural. Elsevier has admitted to publishing six of these things:

  • Australasian Journal of General Practice
  • Australasian Journal of Neurology
  • Australasian Journal of Cardiology
  • Australasian Journal of Clinical Pharmacy
  • Australasian Journal of Cardiovascular Medicine
  • Australasian Journal of Bone & Joint Medicine

Only one, Bone & Joint Medicine, is on the list I posted yesterday of Excerpta Medica "Australasian journal of..." titles from WorldCat. That leaves thirteen titles in the same series, none of which are listed in PubMed, Science Direct, Ulrich's or (thanks to Peter Murray, see comments on that post) Scopus. Jonathan Rochkind has pointed out how to find the rest of their titles in WorldCat; there are around 50 all told.

That's the tip; I await the rest of the iceberg.

The second direction in which the scandal is expanding is towards ghostwriting: I think probably Laika was the first person to make this connection clear. This is a separate but related issue, and Excerpta Medica appears to be up to their armpits in this sleazy practice as well. There's quite a large literature on ghostwriting, so here are just a few quotes (mentioning Excerpta Medica) to whet your appetite (if indeed one could be said to have an 'appetite' for something so nauseating):

Anna Wilde Mathews, At medical journals, paid writers play big role

When articles are ghostwritten by someone paid by a company, the big question is whether the article gets slanted. That's what one former free-lance medical writer alleges she was told to do by a company hired by Johnson & Johnson.

Susanna Dodgson, who holds a doctorate in physiology, says she was hired in 2002 by Excerpta Medica, the Elsevier medical-communications firm, to write an article about J&J's anemia drug Eprex. A J&J unit had sponsored a study measuring whether Eprex patients could do well taking the drug only once a week. The company was facing competition from a rival drug sold by Amgen Inc. that could be given once a week or less.

Dr. Dodgson says she was given an instruction sheet directing her to emphasize the "main message of the study" -- that 79.3 percent of people with anemia had done well on a once-a-week Eprex dose. In fact, only 63.2 percent of patients responded well as defined by the original study protocol, according to a report she was provided. That report said the study's goal "could not be reached." Both the instruction sheet and the report were viewed by The Wall Street Journal. The higher figure Dr. Dodgson was asked to highlight used a broader definition of success and excluded patients who dropped out of the trial or didn't adhere to all its rules. The instructions noted that some patients on large doses didn't seem to do well with the once-weekly administration but warned that this point "has not been discussed with marketing and is not definitive!"

The Eprex study appeared last year in the journal Clinical Nephrology, highlighting the 79.3 percent figure without mentioning the lower one. The article didn't acknowledge Dr. Dodgson or Excerpta Medica. Dr. Dodgson, who now teaches medical writing at the University of the Sciences in Philadelphia, says she didn't like the Eprex assignment "but I had to earn a living."

The listed lead author, Paul Barre of McGill University in Montreal, says Excerpta Medica did "a lot of the scutwork" but he had "complete freedom" to change its drafts. Dr. Barre says he helped design the study and enroll patients in it. In statements, J&J and Excerpta Medica offered similar explanations of the process. J&J says it regularly uses outside firms "to expedite the development of independent, peer-reviewed publications."

Carl Elliott, Pharma goes to the laundry: public relations and the business of medical education

One of the most ingenious pieces of the Fen-Phen public relations strategy was its ghostwriting scheme. In 1996 Wyeth hired Excerpta Medica Inc, a New Jersey-based medical communications firm, to write ten articles for medical journals promoting obesity treatment. Wyeth paid Excerpta Medica $20,000 per article. In turn, Excerpta Medica paid prominent university researchers $1,000 to $1,500 to edit drafts of their articles and put their names on the published product. Wyeth kept each article under tight control, scrubbing drafts of any material that could damage sales. One draft article included sentences that read: "Individual case reports also suggest a link between dexfenfluramine and primary pulmonary hypertension." Wyeth had Excerpta delete it. (21)

What made Excerpta Medica such an inspired choice is that it is a branch of the academic publisher, Reed Elsevier Plc., which publishes many of the world's most prestigious science journals. Excerpta Medica manages two journals itself: Clinical Therapeutics and Current Therapeutic Research. According to court documents, Excerpta Medica planned to submit most of the articles it produced to Elsevier journals. In the actual event, Excerpta managed to publish only two articles before Fen-Phen was withdrawn from the market in 1997. One appeared in Clinical Therapeutics, the other in the American Journal of Medicine (another Elsevier journal). In neither case did the authors of the articles disclose that they were paid by Excerpta Medica. So clean was the laundering operation, in fact, that many of the authors did not even realize that Wyeth was involved. Richard Atkinson of the University of Wisconsin wrote a letter to Excerpta Medica congratulating them on the thoroughness and clarity of their article. "Perhaps I can get you to write all my papers for me!" he wrote. He did have one reservation about the piece he was signing: "My only general comment is that this piece may make dexfenfluramine sound better than it really is." (22)

Sergio Sismondo, Ghost Management: How Much of the Medical Literature Is Shaped Behind the Scenes by the Pharmaceutical Industry?

Several of the publication planning firms identified are owned by major publishing houses. For example, Excerpta Medica is "an Elsevier business" and writes that its "relationship with Elsevier allows... access to editors and editorial boards who provide professional advice and deep opinion leader networks" [40]. Wolters Kluwer Health draws attention to its publisher Lippincott Williams & Wilkins, with "nearly 275 periodicals and 1,500 books in more than 100 disciplines," and to Ovid and its other medical information providers, emphasizing the links it can make between its different arms [41]. Vertical integration is attractive in the industry as a whole: at least three of the world's largest advertising agencies own not only MECCs, but also CROs [contract research organizations] [13].




Wednesday, 06 May
No bottom to worse at Elsevier?

Like Dorothea, I haven't said anything about the slimy Merck/Elsevier fake publication deal, because I thought the blogosphere had plenty of coverage. Anyone who reads me would know all about the scandal.

The latest development, though, strikes me as something that should be shouted from every available rooftop: Elsevier simply must answer the questions raised.

Via Dorothea: Jonathan Rochkind has done a little "forensic librarianship" and raised astonishing questions about the entire imprint, Excerpta Medica, which published the fake journal that started all of this.

Go read Jonathan, but the bottom line is this: Excerpta Medica does not provide a straightforward list of its own publications or make clear which are, ahem, "industry-sponsored".

Jonathan says "WorldCat lists 50 publications by Excerpta Medica Communications"; I just tried a simple author search for that phrase and got only 21 results, including the recently-exposed-as-fake Australasian journal of bone & joint medicine. How many others are fake? How about the other thirteen "Australasian Journal of" titles in the same list:

  • Australasian journal of asthma
  • Australasian journal of bone & joint medicine
  • Australasian journal of dentistry
  • Australasian journal of depression
  • Australasian journal of gastroenterology
  • Australasian journal of hospital pharmacy
  • Australasian journal of infectious diseases
  • Australasian journal of musculoskeletal medicine
  • Australasian journal of obstetrics & gynaecology
  • Australasian journal of paediatrics
  • Australasian journal of pain management
  • Australasian journal of psychiatry
  • Australasian journal of respiratory medicine
  • Australasian journal of sexual health

Why, for one thing, are none of them indexed by Science Direct? The PubMed journal limit field contains only Australasian journals of dermatology, pharmacy and optometry; the latter two seem to be defunct and the first is published by Wiley.

Further obvious questions arising:

  • What exactly were the 11 "publications" mentioned in this case study, and where were they published?
    Excerpta Medica published more than 11 scientific publications, all offering medical education credits, and targeting medical specialties from the clinical pharmacist to the physician specialist and emergency nurse. Over 700,000 of these publications have been sent to medical professionals to build awareness...
  • Someone should take a close look at the publications (and faculty) mentioned in this case study:
    Excerpta Medica summarized the issues and recommendations from these ["faculty-led regional advisory board"] meetings and communicated them in a funneled approach, beginning with broad reach and comprehensive content, to more regionally focused publications.

    Excerpta Medica first created a full issue and subsequent supplement of Clinical Cornerstone™, the company's proprietary, peer-reviewed, indexed, continuing medical education (CME) journal distributed to 75,000 physicians. As a result, the data gained significant credibility within the larger physician community.

    The final published product from these regional meetings was a series of regional newsletters. The newsletters referenced the indexed Clinical Cornerstone publications and also highlighted the leading regional attendees on the cover to establish credibility and regional buy-in with the recipients. Approximately 2000 copies of each newsletter were sent to physicians in each region.

  • What exactly is the "company-sponsored journal" created in this case study? We're told that
    The quarterly publication was created to build awareness of the disease [targeted by the client's product] and prepare the specialist and primary markets for future indications. It was also designed to establish this client as one of the industry's authorities on cardiovascular disease.
    and that
    The clinical content was complemented with high-quality photographic images, giving each issue a very professional and attractive appearance.
    [...]
    The publication was launched in December 2004 and continues to run today. Circulation has increased from 10,000 at launch to 17,000 currently and includes such specialties as cardiology, diabetology, nephrology, internal medicine, and general practice.
    but not the name of the journal. Wanna bet it starts with "Australasian journal of..."?




Monday, 04 May
Alternative Connotea bookmarklets for OATP

Peter Suber launched the Open Access Tracking Project on April 16, and you can read a full description of it in this month's SPARC OA Newsletter.

I encourage anyone interested in contributing to the OATP to read the full description so as to make your contributions maximally useful. Here are the basics:

  • the project runs on Connotea, using shared tags
  • the only official tag right now is oa.new
  • use the oa.new tag for developments from the past six months or so
  • user-defined tags are encouraged and should use the same format: oa.foo, where foo can be any relevant subtopic

If you are pressed for time, and we all are, then it may help to have a Connotea bookmarklet with the oa.new tag (or oa.unclassified, if the item is older than six months) already filled in. That way you can just hit the bookmarklet, hit "add to my library" and be done. It's better if you have time to put in further classifying tags and a description, but at least this way the page will be recorded.

I guess the easiest way to do this would be to have three bookmarklets, the regular one and the "two click" bookmarklets I describe here. If you're using Firefox, here are the two-click versions; you can install them the same way as the regular one (drag to the toolbar) and, if you like, rename them using the "Organize Bookmarks" dialog box:

Connotea/oa.new

Connotea/oa.unclassified

This would obviously be better as a one-click than a two-click bookmarklet, but I failed dismally in my attempt to make it so because I don't actually know anything about JavaScript. I've previously suggested to the lazyweb that someone make a bookmarklet for another project, and nothing came of it; I'm hoping both that this little hack will be useful, and that it will inspire an actual programmer to improve it.
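
For anyone who wants to pick this up, here is a rough sketch of how a bookmarklet with a pre-filled tag could be generated. Everything specific in it is an assumption: the endpoint (https://example.org/add) is a placeholder, and the uri, title and tags parameter names are guesses for illustration, not Connotea's actual interface. Whether the result is truly one-click would depend on the service accepting a pre-filled tag and saving without a confirmation step.

```python
from urllib.parse import quote


def make_bookmarklet(add_url: str, tag: str) -> str:
    """Build a javascript: URL that opens `add_url` for the current page
    with `tag` already filled in. Parameter names are hypothetical."""
    return (
        "javascript:(function(){"
        "var e=encodeURIComponent;"
        "window.open('" + add_url + "?uri='+e(location.href)"
        "+'&title='+e(document.title)"
        "+'&tags=" + quote(tag, safe="") + "');"
        "})();"
    )


# Placeholder endpoint, not a real service address.
bookmarklet = make_bookmarklet("https://example.org/add", "oa.new")
print(bookmarklet)
```

The printed string is what would be saved as the bookmark's address; swapping "oa.new" for "oa.unclassified" gives the second variant.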




Saturday, 02 May
Congratulations to Harvard.

Harvard has been fortunate enough to secure the services of Peter Suber, who has been appointed a Berkman Fellow.

I cannot say it better so I will simply quote Stevan Harnad's comments accompanying the announcement:

A brilliant choice, and eminently well-deserved. Peter -- whose historic contributions to the growth of OA have been spectacularly successful -- will continue his invaluable OA work, but this Fellowship will also make it possible for him to begin writing the books on OA and related matters that are welling up in him, and that the world scholarly and scientific research community (as well as the historians of knowledge) are eagerly waiting to read, digest and learn from for years to come.

It is so gratifying to see true merit being rewarded occasionally, as it ought to be (although my guess is that this is just the beginning of the honors to be accorded to this selfless and sapient transformer of Gutenberg scholarship into PostGutenberg scholarship).





Friday, 01 May
Open Access, copyright transfer and NC licensing: caveat emptor!

When I was rummaging around in J Vis a while back, I noticed something that I've been meaning to blog about: why is an Open Access journal still requiring complete surrender of author copyright1?

I happen to know one answer to that question, though I don't know whether this is the case at J Vis. The deal is this: Big Publishing sells paper reprints, and not just of their own articles -- they pay fees where necessary in order to provide a one-stop shop (e.g. through Excerpta Medica or Ovid), mainly to the pharmaceutical industry. In order to blanket existing and potential customers with research favorable to their causes, pharm companies spend a great deal of money on these reprints -- some of which trickles down to small publishers, some of whom depend on that revenue. Such publishers therefore cannot afford to give up such rights as force the reprint traders to pay for their wares.

J Vis has a copyright notice which says, in part:

Users may view, reproduce or store copies of articles comprising the journal provided that the articles are used only for their personal, non-commercial use. [...] Any uses and or copies of Journal of Vision articles, either in whole or in part, must include the customary bibliographic citation, including author attribution, date, article title, journal name, DOI and/or URL, and copyright notice.

A closely related strategy is to use open(ish) licensing that contains a noncommercial (NC) clause. For instance, Springer Open Choice leaves copyright with authors, but uses its own license that is compatible with CC-BY-NC. That, like J Vis' copyright notice, puts their publications out of reach of the reprint traders, except for the little clause that says:

No term or provision of this License shall be deemed waived and no breach consented to unless such waiver or consent shall be in writing and signed by the party to be charged with such waiver or consent.

which allows the small publishers to waive the NC part for certain uses, in return for what amounts to royalties2.

Why do I care about this? Because it's another instance of the old "Free is not Open" argument, and the problems discussed here and here. Since digital repositories -- as far as I know, all existing digital repositories -- carry no blanket license, but leave intact the licensing of each individual digital object they contain, the effect is that there are no OA repositories that remove both price and permission barriers (that is, provide "strong" or "libre" OA to their contents).

The end result is the same problem that copyleft causes3: Reuse, Rework and Redistribute may not be powerfully affected, but Remix is killed outright.

Consider, for instance, PubMed Central, all the papers in which are free to read. What else can you do with them? Textmining, datamining? As far as I can tell, the answer is no, you can't do any of that -- because whatever you want to do, some papers will be licensed to allow it and some won't. Barring some way to reach agreements with dozens or perhaps hundreds of publishers and pre-sort millions of papers on the basis of licensing, the entire PMC barrel is spoiled by the copyrighted, NC and similar apples -- though there is a much smaller uncontaminated barrel available4.
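
To make the pre-sorting problem concrete, here is a minimal sketch of what splitting a collection by license would look like. The records, the license labels and the rule for what counts as freely reusable are all invented for illustration; this is not PMC's metadata format, and real licensing decisions are messier than a set lookup.

```python
# Illustrative rule only: which license labels we treat as permitting
# unrestricted reuse (text mining, remixing, commercial use).
ALLOWS_UNRESTRICTED_REUSE = {"CC-BY", "CC0"}


def partition(records):
    """Split records into (reusable, restricted) by their license label."""
    reusable = [r for r in records if r["license"] in ALLOWS_UNRESTRICTED_REUSE]
    restricted = [r for r in records if r["license"] not in ALLOWS_UNRESTRICTED_REUSE]
    return reusable, restricted


# Hypothetical records standing in for papers in a repository.
records = [
    {"id": "paper-1", "license": "CC-BY"},
    {"id": "paper-2", "license": "CC-BY-NC"},
    {"id": "paper-3", "license": "all-rights-reserved"},
    {"id": "paper-4", "license": "CC-BY"},
]

reusable, restricted = partition(records)
print("reusable:", [r["id"] for r in reusable])
print("restricted:", [r["id"] for r in restricted])
```

The code is trivial; the hard part is getting a reliable license label onto every one of millions of records in the first place, which is exactly the agreement-with-every-publisher problem.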

Which brings me, at long last, to my title. Why "caveat emptor"? Well, if you're buying Open Access -- that is, publishing with a journal that charges author-side fees (remember, most don't) -- make sure you're getting value for your money! If the journal demands your copyright, or slaps an NC license on your work before distributing it, you should know that many possible downstream uses for your work are being pre-emptively eliminated. Are you sure that's what you want?


-------------
1 From the copyright form, emphasis mine:


form.png


2 There's even a clause in the canonical definitions of OA that deals with this issue -- or at least I suspect that's what it's doing there. Budapest, which came first, says this:

The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.

But Bethesda and Berlin, both of which were written about two years later, include this in the definition of Open Access (emphasis mine):

The author(s) and right holder(s) of [OA works] grant(s) to all users a free, irrevocable, worldwide, right of access to, and a license to copy, use, distribute, transmit and display the work publicly and to make and distribute derivative works, in any digital medium for any responsible purpose, subject to proper attribution of authorship (community standards, will continue to provide the mechanism for enforcement of proper attribution and responsible use of the published work, as they do now), as well as the right to make small numbers of printed copies for their personal use.

I suspect, though of course I'm really just guessing, that the "small numbers" clause was inserted at least in part as a reaction to the gleeful scarfing-up of OA works for resale by reprint distributors, or to the threat of same.

It needs the force of law to be any use for that purpose, though, which is where licensing comes in -- using a noncommercial clause like the one in CC NC licenses is a bit like swatting a gnat with a bulldozer, but I know of no licenses which deal specifically with the volume reprint trade but allow other commercial uses.

Frankly, even if there were such a license, I wonder whether publishers who insist on NC now would switch to it. Springer's Open Choice, for example, charges $3000 per article. I would say they've already been paid and shouldn't much care if someone else, without restricting access to their content, makes further profit from it. The barrier (to such a view) seems to be a mindset that says "why shouldn't I get my cut?" -- and if any other downstream use should arise that starts to make serious money, they would want their cut of that too. To make sure they get it, just in case it ever comes into being, I expect that many publishers would be willing pre-emptively to kill off any smaller commercial innovations that might otherwise arise.

(Someone will no doubt argue that these fledgelings could always negotiate via the waiver clause, as above. The main problem there is that such negotiations themselves cost money, and since much of the promise of OA is in remix across a wide range of sources, that means negotiating with every publisher. Let me know how that works out for ya.)


3 In fact, although NC clauses don't require a particular license for derivative or collective works, they do exert a kind of de facto copyleft, because they are only downstream-compatible with other NC licenses -- see footnote 1 here, or play this game for a while.


4 Two things of note here: firstly, the NIH apparently agrees with me that OA by definition removes both price and permission barriers, since they refer to the uncontaminated barrel as Open Access and explicitly say that the rest of their content is free, not OA. Secondly, following on from Egon's and Antony's questions, I wonder: by permitting the spoilage, can databases violate the licensing terms of the CC-BY papers they also contain? The question hinges on this wording:

You may not offer or impose any terms on the Work that restrict the terms of this License or the ability of the recipient of the Work to exercise the rights granted to that recipient under the terms of the License. [...] When You Distribute or Publicly Perform the Work, You may not impose any effective technological measures on the Work that restrict the ability of a recipient of the Work from You to exercise the rights granted to that recipient under the terms of the License.

Egon and Antony are asking more directly technological questions, but I do think it could be argued that if they do not do as PMC has done and make available a libre OA subset, databases can be seen to be imposing terms that restrict, etc.



Tuesday, 28 April
Perpetuating an OA myth

Maxine at Nautilus posted a slightly shortened version of this letter to Nature from Raf Aerts; what caught my eye was the rearing of a familiar ugly head (emphasis mine):

...the [global recession] may also be affecting the publication output of research institutions in a more subtle way. It could be boosting the traditional reader-pays publication model for scientific journals at the expense of the author-pays, or open-access, model.

Open-access journals ask authors to pay for processing their manuscripts (which involves organizing a form of quality control, formatting and distribution) so that the final product becomes freely available, and free to use if properly attributed. [...]

This myth, that OA is synonymous with author-pays, is a toll-access publisher's delight. It simply is not true. See here for detail; briefly:

  • in 2005, the Kaufman-Wills group showed that "...more than half of DOAJ [Open Access] journals did not charge author-side fees of any type, whereas more than 75% of ALPSP, AAMC, and HW subset [Toll Access] journals did charge author-side fees." (Note that this study included only 248 journals from the DOAJ.)
  • in 2007, Peter Suber and Caroline Sutton showed that, of 450 OA journals published by 468 scholarly societies, only 75 -- fewer than 20% -- charged author-side fees
  • also in 2007, I showed that only 18% of the almost 3000 journals in the whole DOAJ charged author-side fees; 67% did not charge such fees, and the information was missing for 15%.
  • in March 2008, Heather Morrison showed that more than 90% of the psychology journals in the DOAJ charge no publication fee1
  • about a month ago, I showed that only 38 (42%) of the 90 full-OA chemistry journals in the DOAJ charged author-side fees (49% did not charge such fees, and information was missing for 9%).
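
A quick check of the shares quoted in the list above, using only the counts and percentages given there (the function and variable names are mine):

```python
def share(part, whole):
    """Percentage of `whole` represented by `part`."""
    return 100.0 * part / whole


# Suber and Sutton, 2007: 75 of 450 society OA journals charged fees.
society_fee_share = share(75, 450)

# Chemistry journals in the DOAJ: 38 of 90 charged fees.
chemistry_fee_share = share(38, 90)

# Quoted breakdowns (charge fees, no fees, information missing), in percent.
doaj_2007 = (18, 67, 15)
chemistry = (42, 49, 9)

print(f"society journals charging fees: {society_fee_share:.1f}%")
print(f"chemistry journals charging fees: {chemistry_fee_share:.1f}%")
print("DOAJ 2007 breakdown sums to", sum(doaj_2007))
print("chemistry breakdown sums to", sum(chemistry))
```

Both quoted breakdowns add up to 100%, and 75 of 450 comes out a little under 17%, consistent with "fewer than 20%".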

Raf goes on to say:

...few peer-reviewed open-access journals have so far had a high impact factor in their field, except for a small number such as those published by the Public Library of Science and BioMed Central. They are therefore struggling to emerge and to attract the most prestigious research findings.

This situation could deteriorate further if open-access journals are forced to move to (partial) site licensing in order to cover their production costs -- a shift recently undertaken by the Journal of Visualized Experiments, for example -- as authors become increasingly reluctant or unable to pay in the current financial climate.

I don't see why we should assume that anything will "deteriorate" if OA journals switch to new funding models, or that OA journals will have a harder time 'emerging' if they move to a model that is actually closer to the old, familiar toll-access model. After all, there already exists a wide variety of ways in which OA publications pay the bills: advertising, endowments, philanthropy, institutional subsidies, memberships, priced editions and more. In particular, hybrid journals (which is what JoVE has become) are popular with toll-access publishers as a way to establish a foothold in OA territory. Inter alia, Elsevier, Springer and Wiley all publish hybrid journals, and between them, those three account for more than 40% of the worldwide science/tech/medicine publishing market -- so the hybrid model is pretty well established.

There's more to say about authors' willingness and/or ability to pay, too. Firstly, it's almost never the author who pays, but the funding body paying for that author's research. At the moment, this can translate into using up precious grant money when there's a need to pay author-side fees, but with 77 funder, institutional and departmental OA mandates in place and more on the way, it seems reasonable to suppose that more and more of the mandating bodies will underwrite more and more of the costs of publishing. For example, HHMI has institutional agreements/memberships with BMC, Springer and Elsevier, and BMC's page of funder policies shows that a majority of UK funders either make additional funds available or allow publication charges to be treated as an indirect cost. Many OA journals also waive or reduce their fees on application; for instance, here are the PLoS (scroll down) and BMC policies.

Finally, remember that the Kaufman-Wills study showed that 75% of the toll-access journals surveyed charged author-side fees (page charges, colour charges, reprint charges, etc) in addition to their subscription charges. So when there are author-side fees involved, I'd like to know how those charged by OA journals (in return for which the work is freely available to everyone, forever) compare with those charged by toll-access journals (in return for which, authors often cannot retrieve their own work, and anyone who wants to read it must pay another fee).


1 updated 04/29 after reading this post from Peter Suber



Saturday, 18 April
Scholarly (scientific) journals vs total serials: % price increase 1990-2009

Following on from this post, I manually extracted historical data for average scholarly journal prices in a dozen broad disciplines from the Library Journal Annual Periodicals Price Surveys by Lee Van Orsdel and Kathleen Born, and compared these with three datasets from the earlier post: ARL libraries' median total serials expenditures (ARL all serials), Abridged Index Medicus average journal price (AIM) and the consumer price index (CPI):


LJ.png

My concern with the AIM dataset was that it was too small and specialized to support broad conclusions, but it turns out that the AIM data sit somewhere in the middle of the disciplines analysed. Astronomy is closest to the ARL all serials median, with math and computer science not much worse; general science is the worst offender, with engineering and technology, chemistry and food science not far behind. From 1990 to 2008, total price increases ranged from 238% (astronomy) to 537% (general science); that's 3.7 and 8.3 times the increase in the CPI, respectively.
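
As a quick check on those multiples, here is a minimal Python sketch. It assumes a CPI increase of about 65% for 1990-2008 (the figure quoted in the earlier post); the function name is mine and purely illustrative.

```python
# Rough check of the "times the CPI increase" multiples quoted above.
# Assumption: the CPI rose by about 65% over 1990-2008 (figure from the earlier post).
CPI_INCREASE_PCT = 65.0


def fold_vs_cpi(price_increase_pct, cpi_increase_pct=CPI_INCREASE_PCT):
    """Ratio of a percentage price increase to the CPI percentage increase."""
    return price_increase_pct / cpi_increase_pct


for discipline, increase in [("astronomy", 238.0), ("general science", 537.0)]:
    print(f"{discipline}: {fold_vs_cpi(increase):.1f}x the CPI increase")
```

Note that this compares total percentage increases, not annual rates, which is the same comparison the text makes.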

This dataset covers an average of around 3600 journals from 2005-2009, 3255 from 1997-2001 and 2655 from 1989-1990. I think this represents good evidence that historical price data for total serials, even though it shows a rate of increase far greater than that of the CPI, masks an even greater rate of increase among scholarly (scientific) journals. It's difficult to look at that graph and believe that scholarly publishers are playing fair, particularly when one remembers that online publishing, with its attendant cost reductions, came of age during the same period of time.

The Van Orsdel/Born surveys include a number of other scholarly disciplines (art, architecture, business, history, language, law, music, etc etc). If I have the time I'll work those up as well, to provide as broad a picture as possible. I should also include numbers of titles in each discipline, to give some idea of total influence. For instance: although general science (around 60 or 70 titles) shows the greatest increase, it likely contributes far less to the serials crisis than health sciences (more than 1500 titles).

(The data are available in this Excel spreadsheet.)



Friday, 17 April
Some wishes come true.

A while back, I posted about my discovery (new to me, though not new to many others) that the serials crisis should probably be called something like the "scholarly journals crisis". The term "serials" includes a wide range of publications, most of which are not peer-reviewed scholarly journals -- newspapers, government reports issued in series, yearbooks, magazines and more. Only about 1/10 of the serials in Ulrich's directory are peer-reviewed. The average scholarly journal costs around 10 times as much as the average serial, and while the cost of the scholarly literature continues to climb, median serial unit costs at ARL libraries have actually been falling for the last seven or eight years (Fig 1 below). It therefore appears that scholarly journals are the driving force behind the serials crisis.

At the time, I wished that I had some specific data to show the difference between scholarly and average serials -- hence the title of this post: via medinfo, I learned that EBSCO Information Services has released a brief report (pdf!) on the price history of well regarded clinical journals, using 117 titles from the NLM's Abridged Index Medicus (AIM). This is a curated list of biomed journals "of immediate interest to the practicing physician" and can be searched on PubMed as a subset limit named "core clinical journals".

As a reminder, here's that graph; it's from the ARL stats report from 2004-5 and the reason it's famous is the way that "Serials Expenditures" outstrips the Consumer Price Index (CPI) and other measures:


ARL.png



Here's a comparison of that data with the price history of the AIM journals; the line labeled "expser/ARL libraries all serials" shows the 1990-2005 subset of the "Serials Expenditures" data from Fig 1, and "EBSCO/core clinical journals" shows the AIM data:


EBSCO.png

Data labels (ARL data from here):

  • serpur: Current Serials Purchased, median value from all ARL libraries
  • expser: Expenditures for Serials, median etc
  • totsal: Total Salaries & Wages, median etc
  • serunit: Serial Unit Cost; median value of expser/serpur calculated for all ARL libraries
  • EBSCO: average price per journal in the Abridged Index Medicus set
  • CPI-U: Consumer Price Index, all urban consumers, annual average, not seasonally adjusted
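
One subtlety in those labels: as I read it, serunit is the median of each library's own expser/serpur ratio, which is not the same thing as median expser divided by median serpur. A minimal Python sketch of the difference follows; the three libraries and their figures are invented for illustration and are not ARL data, and the function names are mine.

```python
from statistics import median

# Invented figures for three hypothetical libraries, for illustration only;
# these are NOT ARL data.
libraries = [
    {"expser": 4_000_000, "serpur": 10_000},
    {"expser": 6_000_000, "serpur": 60_000},
    {"expser": 9_000_000, "serpur": 30_000},
]


def serunit(libs):
    """Median of each library's own expser/serpur ratio."""
    return median(lib["expser"] / lib["serpur"] for lib in libs)


def ratio_of_medians(libs):
    """Median expser divided by median serpur (a different statistic)."""
    return median(lib["expser"] for lib in libs) / median(lib["serpur"] for lib in libs)


print("median of per-library unit costs:", serunit(libraries))
print("median expser / median serpur:  ", ratio_of_medians(libraries))
```

The two statistics can diverge, so the serunit curve cannot simply be read off from the expser and serpur curves.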


This is exactly what I wished for, hard evidence of the difference between scholarly and average serials; and what that evidence strongly indicates is that price increases in scholarly journals are driving the serials crisis. Scholarly journals far outstrip total serials in terms of annual price increase, even though the latter shows a much more rapid increase than the CPI. In contrast, library salary expenditure follows the CPI closely, and median serial unit cost (all serials) has been dropping slowly since 2000.

Frankly, I'm tempted to name this the Big Fat Ripoff Graph. Between 1990 and 2008, the CPI increased by about 65%, whereas over the same period the average price of an AIM journal increased by 415%, a 6.4-fold difference. I've seen publishers try to defend the "total serials expenditures" vs CPI discrepancy by pointing out that journals are proliferating -- indeed, the "serials purchased" curve is headed upwards at an increasing rate, particularly over the last five years or so. But that defense is no good against the BFR Graph, on which the most damning curve shows average journal prices. I've also seen comments to the effect that if mean or median serial unit costs are dropping, publishers must be offering increasing value for money even if they are charging more in total. That might be true of the set of "all serials publishers", but it's apparent from the BFR Graph that scholarly journal publishers can make no such claim.
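
For anyone who prefers annual rates to totals, here is a minimal Python sketch that converts the two total increases above (415% and 65% over 1990-2008) into compound annual rates. It assumes smooth compounding over 18 years, which real prices certainly did not follow exactly, and the function name is mine.

```python
def annualized_rate(total_increase_pct, years):
    """Compound annual growth rate implied by a total percentage increase."""
    growth_factor = 1 + total_increase_pct / 100
    return growth_factor ** (1 / years) - 1


YEARS = 2008 - 1990  # 18 years

aim = annualized_rate(415, YEARS)  # AIM average journal price
cpi = annualized_rate(65, YEARS)   # Consumer Price Index

print(f"AIM journals: {aim:.1%} per year; CPI: {cpi:.1%} per year")
print(f"Ratio of total increases: {415 / 65:.1f}")
```

The last line reproduces the fold difference quoted in the text; the annual rates are a different view of the same two numbers.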

It must be remembered, of course, that we are only looking at a little over a hundred clinical journals here, a small and discipline specific subset. Nonetheless, the result is so striking that I think it is a considerable inducement to the gathering of more data. Since it seems my wishes for more work are coming true, I'll make another: now I want price history data for other, larger journal subsets in other scholarly disciplines. I wonder what the BFR Graph looks like for those datasets?

(P.S. If you want the numbers I used, or to check my work, the spreadsheet is here.)


Update: ha! I just got around to reading this article, linked by Peter Suber a couple of days ago; turns out it's full of annual price data, and Van Orsdel and Born have been doing these surveys for at least ten years. There doesn't seem to be a central collection or data collation, so I'll have to piece it together. Stay tuned!



Wednesday, 15 April
What's wrong with copyleft?

This FriendFeed thread regarding the Wikipedia licensing vote has stirred up an old hornet's nest of issues surrounding copyleft and noncommercial clauses in Open licenses. As I said in the thread, I get most of my ideas on this topic from David Wiley, and have posted about those ideas before. Herewith another attempt to organize and clarify my thoughts, as much for my own benefit as anything:


1. The purpose of Open licensing is to enable the following (this is straight from David's Open Education License draft, about which more later):

  • Reuse - Use the work verbatim, just exactly as you found it
  • Rework - Alter or transform the work so that it better meets your needs
  • Remix - Combine the (verbatim or altered) work with other works to better meet your needs
  • Redistribute - Share the verbatim work, the reworked work, or the remixed work with others


2. The purpose of restrictive clauses in such licensing is to prevent specific types of reuse, rework, remix and/or redistribution:

2a. Copyleft prevents future copyright lockup by requiring that all downstream (reworked or remixed) works be similarly licensed.

2b. Noncommercial clauses prevent profitmaking, and are complicated, and I'm not getting any further into it than that right now. (Maybe later, if my brain doesn't melt.)


3. Although copyleft and NC clauses achieve their own immediate goals, widespread license incompatibility1 means that they often (perhaps usually) defeat part of the larger purpose of Open licensing. The use case where this is most prominent is remix2, since reuse and redistribution of individual copylefted or NC-licensed works or their derivatives is usually just a matter of retaining the original license. But multiple works can only be recombined into new works if their respective licenses are compatible -- otherwise, there's no licensing option for the remix that doesn't violate the licensing terms of at least one of the ingredients. Not only that, but if any of the works in the mix carries a copyleft license, that license takes over the entire remix and everything downstream of it, thus propagating the incompatibility problem.
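
To make the remix problem concrete, here is a minimal Python sketch of the logic described in point 3. It is a simplified model of my own, not Creative Commons' official compatibility chart and certainly not legal advice: each license is treated as a bare set of elements (BY, NC, SA, ND), and the function name and rules are assumptions made for illustration.

```python
def remix_license(ingredients):
    """Return the license elements a remix would need, or None if nothing works.

    Each ingredient is a set drawn from {"BY", "NC", "SA", "ND"};
    an empty set stands for the public domain.
    """
    ingredients = [frozenset(lic) for lic in ingredients]
    # Simplifying assumption: an ND (no derivatives) work cannot be remixed at all.
    if any("ND" in lic for lic in ingredients):
        return None
    # The remix inherits every restriction carried by any ingredient.
    required = frozenset().union(*ingredients)
    # Copyleft: each SA ingredient insists that the remix carry exactly its own terms.
    for lic in ingredients:
        if "SA" in lic and lic != required:
            return None
    return required


BY, BY_SA = {"BY"}, {"BY", "SA"}
BY_NC, BY_NC_SA = {"BY", "NC"}, {"BY", "NC", "SA"}

print(remix_license([BY, BY_SA]))        # the copyleft license takes over the remix
print(remix_license([BY_NC, BY_SA]))     # None: no license satisfies both
print(remix_license([BY_SA, BY_NC_SA]))  # None: two copyleft licenses that disagree
```

In this toy model the copyleft ingredient either dictates the license of the whole remix or blocks it outright, which is the propagation problem described above.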


4. One last thing: could copyleft be saved from itself? What if someone wanted copyleft protection, without the compatibility issues? Creative Commons is already beginning to build the only solution I can think of: widespread interoperability agreements between existing and any newly developed copyleft licenses. CC-BY-SA 3.0 contains the following clause:

You may distribute, publicly display, publicly perform, or publicly digitally perform a Derivative Work only under: (i) the terms of this License; (ii) a later version of this License with the same License Elements as this License; (iii) either the Creative Commons (Unported) license or a Creative Commons jurisdiction license (either this or a later license version) that contains the same License Elements as this License (e.g. Attribution-ShareAlike 3.0 (Unported)); (iv) a Creative Commons Compatible License.
where (iv) is defined as
a license that is listed at http://creativecommons.org/compatiblelicenses
Sadly, the cupboard remains bare so far:
Please note that to date, Creative Commons has not approved any licenses for compatibility; however, we are hopeful that we may be able to do so in the future. If you would like to discuss the possible compatibility of your license with a Creative Commons license, please email us at info@creativecommons.org.

I am personally persuaded that the Public Domain is the best way out of the copyleft trap, which is why I use CCZero for everything I make.






-------------
1 Among CC licenses, there is only about 33% compatibility, and that drops to 20% among NC and SA versions -- including self-compatibility*:

cccompatibility.png


Restrictive (NC, SA) versions currently account for around 80% of worldwide CC license uptake. Once you start factoring in the dozens and dozens of other Open/Free licenses out there, it only gets worse. The FSF and OSI maintain lists of licenses and compatibilities (here and here, respectively), and Wikipedia includes a couple of fairly extensive comparison tables. Speaking of Wikipedia, the world's favourite online encyclopaedia is currently released under the GNU Free Documentation License, which is not compatible with any CC license except Public Domain, though it does allow transition to CC-BY-SA. If the current vote on that transition is "yes", that will be a step forward -- but it will still leave Wikipedia with the compatibility problems shown in the figure above. Exploration of compatibility issues with all the other Free/Open licenses is left as an exercise, etc.

* from here and here; green indicates compatibility, light green indicates possible compatibility -- some disagreement between sources.


2 This is why I consider David's "Four R's" formulation so important, because it makes a clear distinction between rework and remix that is essential to understanding the aims and implementation of Open licenses.



Monday, 13 April
Anniversary of sorts

This question from Antony Williams on FriendFeed:

Is PubChem Data Open or not? There are many discussions saying that PubChem data are Open but I see PubChem as a host and the disclaimer does not say "open": http://tinyurl.com/e78as

reminded me that it's almost a year to the day since Egon Willighagen asked a similar question about PubMed Central content:
I was wondering about this section in the CC license of much of the PMC content, such as our paper on userscripts (section 4a of the CC-BY 2.0):
    You may not distribute, publicly display, publicly perform, or publicly digitally perform the Work with any technological measures that control access or use of the Work in a manner inconsistent with the terms of this License Agreement.
CC-BY 3.0 reads differently, but has similar aims. [...] Peter [Murray-Rust, see here] indicates that the NIH has put in place 'technological measures to control access' to the distribution of our work on userscripts (the PMC entry). That is in clear violation of the CC license. [...] What the PMC website should indicate, instead, is that text mining is allowed for the PMC OAI subset, but that they would highly prefer to use the PMC OAI or PMC FTP routes. This is the least they have to do.

No matter what, I still have the feeling that any technical obstacles are disallowed by the CC-license. Any legal expert here, that can explain me if the CC license allows controlling how people have access to my material?

These are both very good questions, and I still don't have an answer for Egon's even after a year. I'm reluctant to go pestering John Wilbanks with every CC-related question I come across, so I'm reposting in the hope that someone will be able to save John from me.



Monday, 13 April
Lazy reporter, no donut.

Dennis Carter in an eCampus News article about NPG's Scitable:

Scitable's January launch came as elite universities across the United States are embracing open-access formats--making research articles available for free online. This marks an abrupt departure from the traditional model of printing research articles in academic journals, which can cost campuses as much as $20,000 annually, open-access experts say.
So, is it the traditional model that can cost campuses up to $20K/yr, or academic journals, each of which can cost etc?

It's only obvious that what is meant is $20K/yr per journal subscription if you already know that libraries spend millions of dollars per year on serials.

I'd expect a publication that wants you to register to read its content1 to bother making that content accurate and unambiguous.


-------------
1 Sure, registration is free. Registration also provides the publisher with a great bolus of immensely valuable marketing information, to say nothing of the slimy opt-out spam opportunity. Which is why I recommend [struck out: poisoning such databases with fake information] providing minimal information unless you get content that you really value from the site. (Two wrongs etc, hence the edit.)



Monday, 13 April
Someone else is fooling around with numbers.

Via Peter Suber, I came across this editorial in the Journal of Vision:

Measuring the impact of scientific articles is of interest to authors and readers, as well as to tenure and promotion committees, grant proposal review committees, and officials involved in the funding of science. The number of citations by other articles is at present the gold standard for evaluation of the impact of an individual scientific article. Online journals offer another measure of impact: the number of unique downloads of an article (by unique downloads we mean the first download of the PDF of an article by a particular individual). Since May 2007, Journal of Vision has published download counts for each individual article.
The author goes on to compare download vs citation (counts and rates, and downloads or citations over time). It's a pretty good analysis of an important topic, but something vital is missing:
Where are the data? Can I have them? What can I do with them?1
In fact, the data are approximately available here. Why "approximately"? Well, I can get a range of predigested overviews: DemandFactor (roughly, downloads/day/first 1000 days) Top 20, total downloads Top 20 and article distributions by DemandFactor and total downloads. I can also get the download information for any given article -- one article at a time, and once again predigested in the form of a graph from which I have to guesstrapolate if I want raw, re-useable data.

This is disappointing, for both general and specific reasons. It's always disappointing to see data locked away in a graph or a pdf or some similar digital or paper oubliette, there to languish un(re)used. It's also disappointing to see a journal getting way out ahead of the curve on something as important and valuable as download metrics (is there another journal besides J Vis that provides this information, even predigested?), and then missing an opportunity to continue to innovate by providing real Open Data.

It's also disappointing in this specific instance, because I have a question: why is Figure 1 plotted on a log scale and, more importantly, was the correlation coefficient calculated from log-transformed data? I could understand showing the log scale for aesthetic reasons, but I can't think of a reason to take logs of that kind of data -- and doing so can alter the apparent correlation. For instance, remember Fig 1 from this post? Here it is again, together with a plot of log-transformed data, both shown on natural and log scales:


logarithmssarehard.PNG
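
The same effect can be reproduced in a few lines of Python. This is a minimal sketch using synthetic data (an exact cubic), not the Journal of Vision data, which I don't have; the helper function is my own and computes an ordinary Pearson correlation coefficient.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Synthetic data, not from the J Vis paper: y is an exact cubic in x.
xs = [float(i) for i in range(1, 21)]
ys = [x ** 3 for x in xs]

r_raw = pearson_r(xs, ys)
r_log = pearson_r([math.log(x) for x in xs], [math.log(y) for y in ys])

print(f"raw r = {r_raw:.3f}, log-log r = {r_log:.3f}")
```

The underlying relationship is identical in both cases; only the transformation differs, and the coefficient moves with it. Whether that matters for the published figure is exactly what I can't tell without the data.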



I could answer my own question quickly and easily if I could get my hands on the underlying data -- which leads me right back to one of the primary general arguments for Open Data. If I, statistical ignoramus and newcomer to these sorts of analyses, have questions after a brief skim through the paper, what questions might a better equipped and more thorough reader have? It's simply not possible to know -- the only way to find out is to make the data openly available!

I realise it's not possible for journals to demand Open Data from their authors -- that's what funder-level mandates are for, though there's much discussion still to be had regarding whether Open Data mandates would be a good idea. Nonetheless, when journals publish analyses of their own data, it would be great to see them leading the way by providing unrestricted access to that data.

-------------
1 Astute readers, both of you, will remember that [struck out: howl of anguish] refrain from this post.



Saturday, 04 April
Why don't we share data? Not for the reasons Steven Wiley thinks we don't.

Via Peter Suber, I came across an editorial about data sharing in The Scientist. I disagree with the author, PNNL's Steven Wiley, on a number of points:

Despite the appeal of making all biological data accessible, there are enormous hurdles that currently make it impractical. For one, sharing all data requires that we agree on a set of standards. This is perhaps reasonable for large-scale automated technologies, such as microarrays, but the logistics of converting every western blot, ELISA, and protein assay into a structured and accessible data format would be a nightmare -- and probably not worth the effort.

Wiley is making two mistakes here: setting the perfect against the good, and vastly underestimating human ingenuity.

Standards are inarguably required for automated sharing and essential for the sharing of ALL data, but that doesn't mean that sharing SOME data, with evolving standards or even without any standards, has no utility. My pet example is the long standing practice of supporting scientific claims with the phrase "data not shown" in peer-reviewed papers, something I think should no longer be allowed. All scientific claims should be supported by data. "Data not shown" belongs to the print era, when space was limited and distribution relied on physical reproduction and transport. This is the era of the online supplement, to which no such restrictions apply.

Reasonable people might contend that I am stretching the concept of "data sharing" to cover my pet peeve there, but I chose the example deliberately as an edge case: there is, to me, clear utility in that kind of data sharing, even though it involves no standards, only some data, and only eyeball-by-eyeball access (whereas I myself frequently argue that the greater part of the value of Open distribution probably lies in the long term, in machine-to-machine access). I argue that more sharing, using -- despite their current flaws -- evolving standards, is likely to yield significant dividends well before reaching the eventual goal of sharing all data using universal standards.

This leads me to the second mistake. It seems odd to me to insist that because standards are difficult to develop and implement, the bulk of such work is futile. The key is the phrase "currently... impractical". The whole concept of the internet was probably considered "currently impractical" by a great many people, until someone went and built it. There are plenty of people still willing to pronounce Free/Open Source software "currently impractical", even as they (perhaps unwittingly) rely on it every time they go online or send email. Then-existing hurdles at various times surely made business on the internet "currently impractical", and banking on the internet "currently impractical", and -- need I go on?

Moreover, I am not the only one who disagrees about the value of creating standards for difficult-to-share data. If you think western blots would be a nightmare, how about biodiversity data -- like, say, museum specimens? How about anthropometric data, exchangeable biomaterials, neuroscience data, electron micrographs, magnetic resonance images or microscopy images? The MIBBI project has dozens of other examples, the Open Biomedical Ontologies Foundry is working on dozens more, and Bioformats.org might offer a lightweight solution to some of the same problems.

(In re: Wiley's specific examples: I was easily able to find efforts underway to enable sharing of gel electrophoresis data, protein affinity reagents and molecular interaction experiments; and I can't imagine ELISA data being much harder to share than microarray information -- surely MIAME, for instance, could readily be adapted if it wouldn't already serve? I'm not sure what kind of protein assay Wiley has in mind.)

I cannot begin to imagine how to build semantic and exchange standards for those kinds of data, but I'm not about to bet against the people currently trying to do so; nor do I believe that, once built, their systems will prove to have been "not worth the effort".

As I mentioned, reasonable people might disagree about various points above. But Wiley goes on to say:

Unfortunately, most experimental data is obtained ad hoc to answer specific questions and can rarely be used for other purposes.

which is just plain wrong. Much of the rationale for data sharing, the engine of much of its promise, is the simple observation that you cannot know what someone else will do with your data, particularly when they have access to lots of other people's data to go with it. Re-use beyond the scope of the original author's imagination is a primary impetus for data sharing, and innovative examples abound; for instance, just take a look at Tony Hirst's blog. (If there is a dearth of examples from biomedical research, I'd call that an argument in favor of more, not less, data sharing.)

"Can rarely be used" is an empirical claim, and those should be backed by data -- and I can think of only one way to get the relevant data in this case.

Wiley continues:

Good experimental design usually requires that we change only one variable at a time. There is some hope of controlling experimental conditions within our own labs so that the only significantly changing parameter will be our experimental perturbation. However, at another location, scientists might inadvertently do the same experiment under different conditions, making it difficult if not impossible to compare and integrate the results.

[...] In order to sufficiently control the experimental context to allow reliable data sharing, biologists would be forced to reduce the plethora of cell lines and experimental systems to a handful, and implement a common set of experimental conditions.

Experimental results are supposed to provide useful information about the world of sense-perception. If a result cannot be repeated by different hands in a different lab, then it is probably not telling us what we think it is telling us about the way the world works. If, on the other hand, a particular result does mean what we think it means about the underlying system, then we should be able to design different experiments to be carried out with different hands, conditions, equipment etc., and obtain data that supports the same conclusions. That's what we call a robust result, and standard practice is to aim for robust results.

Regarding integration and comparison of results from different conditions -- just what does meta-analysis mean, if not exactly that? As an example, if you were to knock Pin1 down in HeLa cells, you'd block their growth, but Pin1 knockout mice survive just fine. Comparison of those results is not only possible, but extremely interesting, and is the way we learned that mice have an active Pin1 isoform, Pin1L, which is present but potentially inactive in humans.

I think that variation in conditions between labs is a good reason to build finer-grained semantic structures, but no reason at all to throw up our hands and give up on linked data.

Wiley goes on to give as his sole concrete example the lack of uptake into published papers of data from the Alliance for Cell (sic) Signaling. It's actually the Alliance for Cellular Signaling1; their website lists 20 publications, NextBio finds 35 and Google Scholar (which covers a lot more than peer-reviewed papers) finds 440. Scholarly papers are a somewhat limited measure of research impact, but that's not at first glance an impressive showing. Consider, though, that the AfCS was established in the late 1990's, which puts it well ahead of its time, and then compare the first, second and ongoing third decades of the undisputed poster child of data sharing2:

genbankgrowth.PNG

There's more to Wiley's choice of example, though:

In my own case, I am interested in the EGF receptor and receptor tyrosine kinases. This aspect of cell signaling was not covered in their dataset, and thus it is of no interest to me.

I wish I had a dollar for every time I'd heard an argument against some new idea that boils down to: "I can't figure this out, or find a use for it myself; therefore it's no good and will never be any use to anyone". I'm sure there's a pithy Latin name for this particular logical fallacy.

Wiley continues in, as it turns out, a similar vein:

And soon, discussions about the importance of sharing may become moot, since the rapid pace of technology development is likely to eliminate much of the perceived need for sharing primary experimental data. High throughput analytical technologies, such as proteomics and deep sequencing, can yield data of extremely high quality and can produce more data in a single run than was previously obtained from years of work. It will thus become more practical for research groups to generate their own integrated sets of data than try to stitch together disparate information from multiple sources.

And just what does the PNNL's Biomolecular Systems Initiative (of which Wiley is director) do? By a strange coincidence, this:

advancing our high-resolution, high-throughput technologies by exploiting PNNL's strengths in instrument development and automation and applying these technologies to solve large-scale biological problems....

We are building a comprehensive computational infrastructure that includes software for bioinformatics, modeling, and information management. To be more competitive in obtaining programmatic funding, we will continue to invest in new capabilities and technologies such as cell fractionation, affinity reagents, high-speed imaging, affinity pull downs, and ultra-fast proteomics. This will help us build world-class expertise in the generation and analysis of large, heterogeneous sets of biological data. The ability to productively handle extremely large and complex datasets is a distinguishing feature of the biology program at PNNL.

The remainder of this post is left as an exercise for the reader; be sure to cover the question of how less well-heeled institutions are supposed to carry out work in proteomics and deep sequencing and so on, and don't forget to ask for evidence showing that it is not important to share data even between such high-fliers, since presumably they can extract every last conceivable piece of useful information from their own data...


-------------
1 You'd be amazed how many things share that acronym -- activity-friendly communities, antibody-forming cells, ataxia functional composite scale, antral follicle count, alveolar fluid clearance, age at first calving, amniotic something something -- that's where I gave up. Why oh why can't we have a decent text search? Even just "match case" would have solved much of my problem here. /rant

2 graph from here



Wednesday, 01 April
Fooling around with numbers, part 5b.

I've already assigned part 6 to a particular analysis in an effort to get me to actually do that work, but I felt that I just had to include this (via John Wilbanks) in the series:



Lemongraph.jpg



I'm just sayin'. (I may have to get that graph as a tattoo).


P.S. Never mind the date, this is not a trick; I hate online April Fool jokes with the fiery power of a thousand burning suns.




Tuesday, 24 March
Entry for Ada Lovelace Day

Today is Ada Lovelace Day:

Ada Lovelace Day is an international day of blogging to draw attention to women excelling in technology.


Women's contributions often go unacknowledged, their innovations seldom mentioned, their faces rarely recognised. We want you to tell the world about these unsung heroines. Entrepreneurs, innovators, sysadmins, programmers, designers, games developers, hardware experts, tech journalists, tech consultants. The list of tech-related careers is endless.

Since most of my role models who happen to be female are not really in any kind of tech career, I'm spared the need to write the enormous essay that it would take to cover them all. Instead I'll point to just two for whom I can reasonably make a tech connection: Rosie Redfield and Maureen Hoatlin.

I've never met Rosie, who is a PI in the Zoology Department at University of British Columbia, but she is one of the first biomed researchers -- if not the very first -- to embrace Open Science and I've been following her online presence for a couple of years now. From her lab's homepage you can read not just the usual list of publications and personnel, but also submitted research proposals and work in progress. The latter is communicated by blog: Rosie has one, and so do several other lab members. They discuss upcoming and ongoing experiments, work up data and think out loud about their research in general.

I met Maureen after we were both quoted in Mitch Waldrop's SciAm article on Open Science, and she realized that we worked on the same campus. Maureen is a PI in the Biochem Dept at OHSU. She tells a great story about neglecting her family one weekend while she sat in bed reading scientific articles online -- "this changes everything" was all she would say to their pleas for breakfast, etc. Well, Maureen meant what she said, and she's walking the walk. You can find the Hoatlin lab on OpenWetWare, along with a wiki-based, bottom-up, ongoing experiment in improving grad student education that she pioneered, and you can find Maureen on a range of social networking sites including FriendFeed and LinkedIn. Her lab has its own Twitter account.

Since I think this sort of open, collaborative model is very much the way of the future, if science is to have a future at all, I'd like to see Rosie and Maureen get their props for having been such early adopters. It's also worth mentioning that, in addition to still being a Boys' Club in many ways, research is a very conservative environment in which new ideas are usually met with scorn and active resistance. So, having made it up the foodchain in the face of irrational opposition, they are now confronting the same tribe with another set of new and threatening ideas. Both are worthy additions to the Ada Lovelace Day pantheon.



Tuesday, 24 March
New blog in town.

I don't normally promote new blogs, other than to add them to my blogroll if I think they are worth my readers' time, but I'll make an exception for PLoS ONE's new community blog, EveryONE:

Why a blog and why now? As of March 2009,  PLoS ONE, the peer-reviewed open-access journal for all scientific and medical research, has published over 5,000 articles, representing the work of over 30,000 authors and co-authors, and receives over 160,000 unique visitors per month. That's a good sized online community and we thought it was about time that you had a blog to call your own. This blog is for authors who have published with us and for users who haven't and it contains something for everyone.


Why did you call the blog everyONE? For three main reasons that encapsulate the mission of the journal:

Firstly, because PLoS ONE is for every rigorous research article that passes our peer-review process.

Secondly, because PLoS ONE is a forum for research in every scientific discipline (with a current emphasis on life and health sciences because of PLoS's history).

Thirdly, because PLoS ONE is a source of information for every inquisitive reader with an interest in high-quality scientific research.

I hope, and on my better days believe, that PLoS ONE is one of the leading models for the future of scientific journals:
  • they offer gold OA -- that is, free online to everyone everywhere from the moment of publication, including submission to PubMed Central
  • they offer a sustainable business model for OA: in the black after less than three years and with an author-side fee of $1300
  • their peer review process is as rigorous as any, but it does not ask reviewers to make guesses about what is "hot", or what is likely to be important at some time in the future: if it's solid science, PLoS ONE will publish it
  • they don't have an Impact Factor: homey don't play dat, as the kids around here say
  • that's not to say that they are not actively seeking rich measures of utility/impact for scientific publications: for instance, here's Bora's roundup of analyses of an experimental dataset that they passed around a while back, and an update from Euan
  • in the same vein, I can't find a link right now but there are plans afoot to release real-time access to such data as downloads, comment frequency and so on -- post-publication measures which can improve and speed up citation based measures; for another example, scroll down on this page for some self-measurement that represents a level of disclosure I have not seen from any other journal
  • they are responsive to and engaged with the community: for instance, both Bora Zivkovic (community manager -- how many journals have one of those?) and Peter Binfield (managing editor) are active on FriendFeed
  • they encourage and enable community input in the form of notes, comments and ratings on every article; I particularly like the option given to reviewers to have their reviews included as comments with the paper

EveryONE is another way for PLoS ONE to engage with their community of readers and contributors, and well worth a look.


DISCLAIMER: I consider Bora and Peter friends of mine, and I've previously applied to work at PLoS.



Saturday, 21 March
Should we talk about the "journals crisis" instead of the "serials crisis"?

I stumbled upon something new-to-me, and possibly even useful-to-others, in my fooling around with numbers (table 2 and discussion thereof here), but it's somewhat buried under all the "how I made this figure" and "where I got these data" details. For that reason, and because I didn't trust my idea until I had some external reinforcement, I thought I'd give it a separate post all its own.

Here's the thing: what is widely known as the serials crisis in library costs is probably driven largely by the pricing of scholarly journals. In library parlance, "serials" includes, inter no doubt many alia, newspapers, government reports issued in series, yearbooks and magazines (periodicals), in addition to the scholarly literature. Of the 225,000 or so periodicals in Ulrich's, only about 25,000 are peer reviewed. In the FriendFeed discussion started by my post, Walt Crawford said

...some of us have long argued that there isn't a serials crisis for library budgets, there's a scholarly journal crisis. Magazines (and there are about 1/4 million magazines as compared to about 25,000 scholarly journals) tend to have very low prices and very modest increases.
Although non-refereed serials dominate product counts (and, apparently, library collections), the situation is reversed for unit expenditures. The average unit cost for the UCOSC dataset, which is composed entirely of scholarly journals, is roughly ten times the average unit cost for any of the other datasets I used, all of which were general data that included all types of serial. Here's Walt again:
the 10:1 ratio for UC (that is, scholarly journals averaging 10x as expensive as all serials) sounds about right
When the numbers and Walt's experience began to line up, I became much more confident in my conclusion, that the serials crisis is really a scholarly journals crisis. It's not clear to me, in fact, why the phenomenon got the nickname it did; perhaps it's just that "serials crisis" is a punchier phrase.

I'm not at all sure that any of this is more than semantic nitpicking, but giving things their proper name can be important. Most researchers who only hear the name won't care about a "serials crisis" -- that's a library problem, nothing to do with us. But if they hear about a "scholarly literature crisis", it becomes clearer that the issue is the potential loss of access to resources necessary to do our jobs. I suspect most researchers who've heard of the serials crisis are aware that it is, at least in part, about journal pricing, but I wonder how many know that it's pretty much only about journal pricing? This little "discovery" of mine really did put things in a different perspective for me, and I'm probably more informed about library- and publishing-related issues than most benchmonkeys.

I doubt that an alternative name will catch on, and I'm not going to start campaigning for one -- but I think that from now on I'll at least occasionally refer to the "serials/scholarly literature" crisis, or something similar, if only to remind myself of my own little satori. (Question for the lazyweb: can anyone suggest a better phrase, one which would make it more apparent to researchers that they should care about this?)




Thursday, 19 March
Fooling around with numbers, part 5

As promised, here is the distribution of journal prices for the subsets of the Elsevier life sciences dataset which either have or don't have impact factors, and for the entire UCOSC dataset (in which all journals have IFs):

plusminusIF.PNG

Each interval is $500 wide ($0 to $499, $500 to $999, etc.), and datapoints are plotted at the midpoint of each interval.
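For anyone who wants to reproduce this kind of plot, here is a minimal sketch of the binning, assuming $500-wide intervals with each count keyed by the interval midpoint. The function name and the sample prices are made up for illustration; they are not taken from the Elsevier or UCOSC data, and this is not necessarily how the spreadsheet behind the figure was built.

```python
from collections import Counter

def bin_prices(prices, width=500):
    """Count prices per interval, keyed by the midpoint of each interval."""
    counts = Counter()
    for price in prices:
        lower = int(price // width) * width   # lower edge of the interval
        counts[lower + width / 2] += 1        # midpoint, e.g. 250.0 for $0-$499
    return dict(sorted(counts.items()))

# Hypothetical prices, for illustration only
sample = [120, 480, 510, 999, 1500, 1720, 4300]
for midpoint, n in bin_prices(sample).items():
    print(f"${midpoint:,.0f}: {n} title(s)")
```

Changing `width` to 1000 is a quick way to see how much the shape of a curve depends on the interval chosen.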

The conclusion is the same as in part 1, just a bit clearer now. Elsevier journals without an impact factor are priced lower than those which have an IF, and the price distributions are somewhat different between journals with and without an IF. Note, though, that if I'd used a $1000 interval instead of $500, the initial rise in the +IF curves would not appear; if these are power-law distributions the main difference is probably the scaling exponent. I think. (Math is not my friend.)

It almost looks as though low-end journals are shunted out of the lowest price bracket as soon as they get an IF, any IF, and then tend to increase in price as the IF goes up. Update: no it doesn't. I don't know what I was thinking there.


The rest of the series: part 1, part 2, part 3, part 4.



Tuesday, 17 March
Author-side fees in hybrid and OA chemistry journals

Peter Suber, responding to a J Cheminfo paper, wondered what proportion of chemistry journals in the DOAJ charge author-side fees. Since I was in that mode, as it were:


DOAJchem.png



Hybrid journals are those that offer OA-for-a-fee, so of course all of those charge fees. "Open" above refers to Gold OA journals, roughly half of which charge author-side fees in this chemistry subset. This is broadly consistent with the overall DOAJ listing (as of December 2007) and also with several other studies that Peter mentions.


I still can't solve the tables bug; if you want the numbers, view source -- I've commented out a simple table that displays fine unless Movable bloody Type gets hold of it. If you want to see how I generated the numbers, grab this spreadsheet. I first cut-and-pasted from the DOAJ subject listings into a text editor, then used the replace function to introduce tabs before "hybrid" or "open" and between "publication fee" and the entry for each journal. Then I used the replace function to delete all lines between "hybrid/open" and "publication fee", to simplify the Excel formula... you'll see what I mean if you look at the spreadsheet.



Tuesday, 17 March
Fooling around with numbers, part 4; or, those data -- you keep using them -- I don't think they mean what you think they mean...

At the end of part 3, having looked at some of the ways in which prices and price/use were distributed, I said I'd try to say something about what constituted a fair price. I hadn't thought that through at all, and it turns out that I really can't get much leverage against that question from the UCOSC dataset alone.

In addition to the graphs in parts 1-3, here's yet another way to look at the UCOSC data (again, this is a png from a screenshot because MT ate my balls perfectly good table1):


Table 1
MTsucksass.png


Perhaps Elsevier doesn't stand out quite so much as I might have expected -- they still dominate by virtue of market share, but in terms of cost/use or use/title, Springer looks the worst of the bunch. The mean ($0.76) and median ($1.89) costs per use don't mean much without context. I could argue that since libraries are having trouble keeping up with serials costs and usage is only likely to increase, those probably don't represent fair prices... but I don't know how much weight that argument would carry, and anyway you should go read Heather Morrison on why usage-based pricing is dangerous. (That's one of the benefits of thinking-out-loud like this; knowledgeable people come along and point out stuff you need to know. Yay lazyweb!)

So, I need context: let's start with, how many libraries are there? According to the American Library Association, there are more than 120,000 libraries in the USA -- but for my purposes, I'm really only interested in those which carry the scholarly literature. The US Dept of Education's National Center for Education Statistics runs a Library Statistics Program, which provides data specifically on academic libraries.

According to the ALA and the NCES, there are about 3700 academic libraries in the US. If all of them subscribed (at list price) to the 2904 journals in the UCOSC dataset, that would work out to $13,306,150,900 -- about $13 billion -- per year on scholarly journals alone. To put that into perspective, the entire NIH research budget for 2008 was less than $30 billion. I have been told that most libraries don't pay list price, because publishers offer all kinds of deals, but I wondered whether that $13 billion was at least in the right ballpark, so I went looking for more data.
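The arithmetic behind that figure can be unpacked in a few lines. This is only a sketch: the library count, title count and grand total are the ones quoted above, while the list price sum and mean price per title are back-calculated from them rather than quoted, and the variable names are mine.

```python
# Figures quoted in the post
n_libraries = 3700                         # approximate number of US academic libraries
n_titles = 2904                            # journals in the UCOSC dataset
total_if_all_subscribed = 13_306_150_900   # dollars per year, at list price

# Back-calculated (not quoted): sum of list prices and mean price per title
list_price_sum = total_if_all_subscribed / n_libraries
mean_list_price = list_price_sum / n_titles

print(f"implied list price sum: ${list_price_sum:,.0f}")
print(f"implied mean price per title: ${mean_list_price:,.0f}")

# Comparison against a $30 billion budget, used here as an upper bound
print(f"share of $30 billion: {total_if_all_subscribed / 30e9:.0%}")
```

The implied mean works out to a little over $1,200 per title, so the $13 billion is simply "every academic library buys every UCOSC title at list price".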

Since the UCOSC dataset covers 2003-4, I looked at the NCES report for 2004 (the spreadsheet I used is here). The ALA has another division, the Association of College and Research Libraries, which keeps its own records; alas, these are not free, but I could get nearly everything I wanted from the summaries -- again, I just looked at 2004. There's also the Association of Research Libraries, which is "a nonprofit organization of 123 research libraries at comprehensive, research-extensive institutions in the US and Canada that share similar research missions, aspirations, and achievements", mostly made up of very large libraries (think Harvard, Yale, etc). The ARL also compiles and makes available statistics on its members; I pulled out the 2004 data from the download page (spreadsheet here).

Finally, I added the UCOSC dataset for comparison, and for extra context I pulled out the University of California subset from the ARL data (Berkeley, Davis, Irvine, LA, Riverside, San Diego and Santa Barbara; I think these are the largest 7 of UC's 10 main campus libraries).  The resulting data look like this2:


Table 2
MTstillsucksass.png


Na, not applicable; cc, couldn't calculate. The ACRL data is derived mainly from two summaries, one showing expenditure (red) and one showing holdings (blue). The mean cost/serial is a fudge, since it was calculated using figures from both summaries, but I doubt it's significantly different from the value I would get if I had all the data, since the number of libraries included in each set is so similar. The other values in green are also approximations derived from summary reports3. Note that the "per library" figures for the UCOSC dataset are actually just for that subset of journals (hence the "<<1" entry for "no. libraries").

I've put some sanity checks -- do these data make sense? -- in a footnote4; to me, the data appear both externally and internally consistent.  I don't, in other words, appear to have done anything egregiously stupid. Not with the numbers, anyway:

Two things jump out at me from Table 2, which together are responsible for the subtitle of this entry. First, my $13 billion guess was way off -- the actual amount spent on serials by US academic libraries is probably closer to $1-2 billion.  Large (e.g. Ivy League) libraries might spend many tens of millions of dollars, small libraries maybe only a few hundred thousand.  That's still an enormous amount of money, but it's not half the NIH budget!  So why the discrepancy?

Quite apart from "list price" and "what libraries actually pay" being two very different things, I've been making a mistake in terminology.  When I think of "serials" in a library, I think of the peer-reviewed scholarly literature; I tend to use "journals" to mean the same thing.

This is very, very wrong.

(As, no doubt, any librarian could have told me, without the need to go ferreting through all those numbers.) From the NCES survey instrument used to collect their data (emphasis mine):

[expenditure]
Current serial subscriptions (ongoing commitments) (line 13) - Report expenditures for current subscriptions to serials in all formats. These are publications issued in successive parts, usually at regular intervals, and, as a rule, intended to be continued indefinitely. Serials include periodicals, newspapers, annuals (reports, yearbooks, etc.), memoirs, proceedings, and transactions of societies.
[...]
[holdings]
Current serial subscriptions (line 26) -- Report the total number of subscriptions in all formats. If the subscription comes in both paper and electronic form, count it twice. Count each individual title if it is received as part of a publisher's package (e.g., Project MUSE, JSTOR, Academic IDEAL). Report each full-text article database such as Lexis-Nexis, ABI/INFORM as one subscription in line 27. Include paper and microfilm government documents issued serially if they are accessible through the library's catalog.

From the ARL ditto:

Questions 4-5. Serials. Report the total number of subscriptions, not titles. Include duplicate subscriptions and, to the extent possible, all government document serials even if housed in a separate documents collection. Verify the inclusion or exclusion of document serials... Exclude unnumbered monographic and publishers' series. Electronic serials acquired as part of an aggregated package (e.g., Project MUSE, BioOne, ScienceDirect) should be counted by title. A serial is
a publication in any medium issued in successive parts bearing numerical or chronological designations and intended to be continued indefinitely. This definition includes periodicals, newspapers, and annuals (reports, yearbooks, etc.); the journals, memoirs, proceedings, transactions, etc. of societies; and numbered monographic series.

Oy vey. Newspapers, yearbooks, government documents and a whole bunch of other things that aren't scholarly journals are (or can be) serials too. "Periodicals" means National Geographic qualifies -- hell, so does Playboy magazine!

As of today (March 17), Ulrich's Periodicals Directory lists 224,151 "active" periodicals; of those, 65,461 are "academic/scholarly"; and of those, 25,425 are "refereed".

What do those things cost which aren't part of the peer-reviewed literature? How does their inclusion in library data impact the means and medians I've been looking at?

Which brings me to the second item of note from Table 2: the mean cost/serial is on the order of ten times higher for the UCOSC dataset than for the other sets.  Does that mean that the scholarly literature is actually the powerhouse of the serials crisis (pdf!), and if we could zero in on the peer-reviewed fraction of the serials data we would see an even more dramatic rise in price? Or does it have more to do with the fact that the UCOSC dataset is deliberately composed of relatively high-end journals, thus artificially inflating the apparent costs? If every library in the NCES set subscribed to those journals at even one-tenth of list price, it would still account for pretty much the entire serials expenditure -- so how many libraries subscribe to which journals? What of the roughly 22,000 peer-reviewed journals that aren't included in the UCOSC dataset?  If libraries are subscribing to anywhere from a few thousand serials to well over 100,000 (e.g. ARL 2007 numbers for Columbia, Harvard and Illinois/Urbana), what proportion of those subscriptions are to peer-reviewed journals -- or, conversely, to what proportion of the peer-reviewed literature does the average library subscribe?
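A quick check of two of the numbers in that paragraph, as a rough sketch. The inputs are figures quoted in this post and its footnotes; note that the $13.3 billion total assumed about 3700 libraries while the NCES summary covers 3653, so the comparison is only approximate. Variable names are mine.

```python
full_list_price_total = 13_306_150_900   # all academic libraries, all UCOSC titles, list price
nces_serials_spend = 1_363_671_792       # NCES 2004 summary, expenditure on all serials

# One-tenth of list price, compared with what the NCES libraries actually spent on serials
tenth = full_list_price_total / 10
print(f"one-tenth of list price: ${tenth:,.0f}")
print(f"as a share of NCES serials spend: {tenth / nces_serials_spend:.0%}")

# Refereed journals in Ulrich's that are not in the UCOSC dataset
refereed_total = 25_425
ucosc_titles = 2_904
print(f"refereed titles outside UCOSC: {refereed_total - ucosc_titles:,}")
```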

In other words, I've made no headway at all on the question of a "fair price"; all I've managed to do here is to find more questions.  I guess that's progress, because at least they are better-defined, more specific questions. Answering them will require much more fine-grained data, though: which libraries subscribe to which peer-reviewed journals, and at what cost?  I think the answers might be very useful to the research community, but collecting the data would be a full-time job. (I'm up for it, by the way, if anyone reading this is in a position to hire me to do it. Seriously, I'd love it. After all, look what I'm doing for fun.)

To return to where I started: there's another angle of attack on the "fair price" question, which is to look at things from the other side.  How much does it cost to publish a paper in the peer-reviewed literature, and how does that compare to actual income at publishing companies? This information is notoriously hard to come by, but I've been collecting links and notes for a while so in Part 6* I'll try to put them all together and see if I've got anything useful.

* I've just remembered something else I want to do first: Part 5 will take a look at journal price distributions with and without impact factor, using the Elsevier Life Sciences (see Part 1 Fig 3) and the UCOSC datasets.

Update: if you've read this far, go read the FriendFeed discussion, you'll like it.


-------------


1 If you want the data there's a comma-delimited text version of the table here and the spreadsheet from which the table is derived is here.

2 Comma-delimited text file here.

3 The following table shows the figures used to calculate the sum total library expenditure for the ACRL dataset.  Numbers in black are taken from the summaries provided, numbers in pink are calculated from them.

Table 3
MTsucksassforever.png

Mean total expenditure per library was calculated using an approximate average number of libraries of 1074.

4 Sanity checks:

Internal:

  • the ARL and ACRL subsets of the NCES libraries spend less in sum than the NCES set, but the mean and median expenditures/library are lower for the NCES set because it includes more, and smaller, libraries
  • the mean and median number of serials/library is similar between the ARL dataset and its UC subset, both figures being much larger than the mean serials/library for the NCES or ACRL sets (again, more and smaller libraries)
  • the mean and median cost/serial is similar throughout, except for the UCOSC dataset which is a curated subset of high-end scholarly journals (discussed above)

External:

Are those reasonable totals for the libraries to be spending?

  • The ARL 2004-5 report shows that member libraries spent $680,774,493, with a median per library of $5,904,464, on serials, and total library expenditure was $2,683,008,943 (median per library $20,210,171)
  • The NCES 2004 summary shows that 3653 libraries surveyed spent, in sum, $5,751,247,194 on total operating expenses, $1,363,671,792 on serials and $2,157,531,102 on information resources in general

Are those reasonable total numbers of journals per library?

  • OHSU (where I was until recently employed) has 20857 entries in its "journals" catalog
  • The NCES 2004 summary shows that, all together, 3653 academic libraries held 12,763,537 serials subscriptions
  • The ARL 2004-5 report shows that 113 member libraries held 4,658,493 subscriptions, with a median per library of 37,668

Are those reasonable mean and median costs per serial?

  • I could only find unit costs for serials in the ARL report, in the "analysis of selected variables", where the mean cost/serial is given as $247.55 per subscription (range $656.31 to $93.72, median $231.90, 88 libraries reporting).

So, at least in ballpark terms, the numbers in my tables appear to check out against summaries compiled by the various agencies from their own data (and the OHSU library catalog).  There are, e.g., no order-of-magnitude discrepancies -- except perhaps in cost/serial, as discussed above.
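For the record, the per-library and per-subscription figures implied by the summary totals above can be derived like this. It is a sketch only: these are ratios of totals, which need not match a mean of per-library ratios such as the ARL's own $247.55 (computed over 88 reporting libraries). Variable names are mine.

```python
# NCES 2004 summary totals quoted above
nces_libraries = 3653
nces_serials_spend = 1_363_671_792
nces_subscriptions = 12_763_537

nces_spend_per_library = nces_serials_spend / nces_libraries
nces_cost_per_subscription = nces_serials_spend / nces_subscriptions

# ARL 2004-5 subscription totals quoted above
arl_libraries = 113
arl_subscriptions = 4_658_493
arl_subs_per_library = arl_subscriptions / arl_libraries

print(f"NCES mean serials spend per library: ${nces_spend_per_library:,.0f}")
print(f"NCES cost per subscription (ratio of totals): ${nces_cost_per_subscription:,.2f}")
print(f"ARL mean subscriptions per library: {arl_subs_per_library:,.0f}")
```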






Monday, 16 March

Update the first: now I feel bad for not waiting (though I did put "read AFTER honeymoon!!!" in the subject line), but John Wilbanks wrote back right away to say that it will take him a while to get to it, but he will ferret out specific answers regarding the Science Commons work and interoperability.

Update the second: Peter Sefton has more here, including specific recommendations for working with Microsoft while avoiding "a new kind of format lock-in; a kind of monopolistic wolf in open-standards lambskin":

  • The product (eg a document) of the code must be interoperable with open software. In our case this means Word must produce stuff that can be used in and round tripped with OpenOffice.org and with earlier versions, and Mac versions of Microsoft's products. (This is not as simple as it could be when we have to deal with stuff like Sun refusing to implement import and preservation for data stored in Word fields as used by applications like EndNote.)

    The NLM add-in is an odd one here, as on one level it does qualify in that it spits out XML, but the intent is to create Word-only authoring so that rules it out -- not that we have been asked to work on that project other than to comment, I am merely using it as an example.

  • The code must be open source and as portable as possible. Of course if it is interface code it will only work with Microsoft's toll-access software but at least others can read the code and re-implement elsewhere. If it's not interface code then it must be written in a portable language and/or framework.





Friday, 13 March
Fooling around with numbers, part 3; or, why would anyone pay for these journals?

Following on from part 2, I thought I'd ask a couple more questions about price-per-use, based on the online usage stats in the UCOSC dataset. I started on this because I noticed that in Fig 2 of part 2, I'd missed a point: there is an even-further-out outlier above the Elsevier set I pointed out:

UCOSCpriceuse2.JPG

It's another Elsevier journal, Nuclear Physics B. In 2003, only 1001 online uses were reported to UC by the publisher, but the 2004 list price was $15,360. The companion journal Nuc Phys A is not much better, $10,121 for 1198 uses. Compare that with Nature, 286125 uses at just $1,280!
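To make the comparison concrete, here is price per use for those three titles, using the prices and use counts quoted above. It is a minimal sketch; the helper function is mine, and returning None for a title with no recorded uses is just one way to avoid dividing by zero.

```python
# (title, 2004 list price in dollars, 2003 online uses), as quoted above
titles = [
    ("Nuclear Physics B", 15_360, 1_001),
    ("Nuclear Physics A", 10_121, 1_198),
    ("Nature", 1_280, 286_125),
]

def price_per_use(price, uses):
    """List price divided by reported uses; None when no uses were recorded."""
    return price / uses if uses else None

for name, price, uses in titles:
    print(f"{name}: ${price_per_use(price, uses):,.4f} per use")
```

That puts the two Nuclear Physics titles at roughly $15 and $8 per use, against well under a cent for Nature.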

It gets worse, too, because I'm led to believe that anything that appears in a physics journal these days is available ahead of time from the arXiv. I tried to confirm that for Nuc Phys B, but either I'm missing something or the arXiv search function is totally for shit, so I couldn't do it systematically. I did go through the latest table of contents (Vol 813 issue 3) on the Science Direct page, and was easily able to find every paper in the arXiv -- mostly just by searching on author names, though in a couple of cases I had to put titles into Google Scholar. Still, they were all there, which leads me to wonder why any library would buy Nuc Phys B (or Nuc Phys A, assuming it's also covered by the arXiv). Prices haven't improved in the intervening 5 years, either:

[I had a table here but Movable Type keeps munging it. Piece of shit. Here's a jpg until I sort it.]

MTsucksass.jpg


That got me wondering how the rest of the journals are distributed by price/use and publisher:


UCOSCpriceusepublisher.JPG


The inset shows a zoomed view but even that wasn't particularly informative, so I zoomed in a bit further:


UCOSCpriceuseregression.JPG

The curve fits are for the whole of each dataset, even though it's a zoomed view; the Nature set excludes British Journal of Pharmacology, the only NPG title that recorded 0 uses, and Nature itself. Colour coding by publisher is the same for each figure in this post. As in part 2, the correlation between price and use is weak at best and doesn't change much from publisher to publisher. Also, each publisher subset shows a stronger correlation than the entire pooled set -- score another one for Bob O'Hara's suggestion that finer-grained analyses of this kind of data are likely to produce more robust results. Since cutoffs improved the apparent correlation for the pooled set, I tried that with the publisher subsets:


UCOSCpriceuseregression1.JPG


As in part 2, with uses restricted to 5000 or fewer there was improvement in price/use correlation in most cases, but nothing dramatic; I'm not sure why the Blackwell fit got worse. The Nature subset is close to being able to claim at least a modest fit to a straight line there, so not only does NPG boast some of the lowest prices and highest use rates, they are the closest of all the publishers to pricing their wares according to (at least one measure of) likely utility. Special note to Maxine Clarke, remember this post next time I tee off on Nature! :-)

Next, I broke the data out into intervals (for clarity the labels say 0-1, 1-2 etc, but the actual intervals used were 0-0.99, 1-1.99 etc):


UCOSCpriceuseintervals.JPG


Now it seems that we're looking at some kind of long-tailed distribution, which is hardly surprising. The majority of the titles fall into the first few price/use intervals, say less than about $6/use. Since most pay-per-view article charges are between $25 and $40, I more-or-less arbitrarily picked $30/use as a cutoff and asked how many titles from each publisher fall above that cutoff, and what proportion of the total expenditure (viz, list price sum) does that represent? The inset shows that 161 titles, most of them from Kluwer and Springer (whose figures I combined because Springer bought most of Kluwer's titles sometime after 2003), account for about 5% of the total in list price terms. That was a bit more useful, so I expanded it to ask the same question for each interval:


UCOSCpriceuselistpricesum.JPG


What becomes apparent now, I think, is that the UC librarians are doing a good job! Only 6% of the total number of journals (5% of the total list price cost) fall into the "more than $30/use" category, of which it could reasonably be said that the library might as well drop the subscription and just cover the pay-per-view costs of their patrons. Only a further 15% or so work out to more than $6/use, and around 80% of the collection (figured as titles or cost) comes in under $6/use, with around 30% less than $1/use.
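The banding behind those percentages can be sketched as follows, assuming each title comes with a list price and a use count. The cutoffs ($1, $6 and $30 per use) are the ones discussed above; the sample titles are invented, and skipping titles with zero recorded uses is my assumption, not necessarily what the spreadsheet does.

```python
import bisect

EDGES = [1, 6, 30]   # dollars per use
LABELS = ["under $1", "$1 to $6", "$6 to $30", "$30 and over"]

def summarize_bands(titles):
    """titles: iterable of (list_price, uses).

    Returns {band: (share of titles, share of summed list price)}.
    Titles with zero uses are skipped (an assumption made for this sketch).
    """
    used = [(price, uses) for price, uses in titles if uses > 0]
    total_titles = len(used)
    total_price = sum(price for price, _ in used)
    tallies = {label: [0, 0] for label in LABELS}
    for price, uses in used:
        label = LABELS[bisect.bisect_right(EDGES, price / uses)]
        tallies[label][0] += 1
        tallies[label][1] += price
    return {label: (n / total_titles, s / total_price)
            for label, (n, s) in tallies.items()}

# Hypothetical titles, for illustration only
sample = [(500, 1000), (1200, 400), (3000, 250), (2000, 50), (300, 0)]
for band, (title_share, price_share) in summarize_bands(sample).items():
    print(f"{band}: {title_share:.0%} of titles, {price_share:.0%} of list price")
```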

So, are these reasonable prices -- $1 per use, $6 per use? I'm not sure I can, but I'll try to say something about that question, using the UCOSC dataset, in Part 4.



Thursday, 12 March
Peters Murray-Rust and Sefton on "science and selfishness"

Peter Murray-Rust (welcome back to blogging!) has replied to Glyn Moody's post about semantic plugins being developed by Science Commons in collaboration with the Evil Empire, which I discussed in my last post. Peter MR takes the view, with which I concur, that it's more important to get scientists using semantic markup than to take an ideological stand against Microsoft:

Microsoft is "evil". I can understand this view - especially during the Hallowee'n document era. There are many "evil" companies - they can be found in publishing (?PRISM), pharmaceuticals (where I used to work) Constant Gardener) , petrotechnical, scientific software, etc. Large companies often/always? adopt questionable practices. [I differentiate complete commercial sectors - such as tobacco, defence and betting where I would have moral issues] . The difficulty here is that there is no clear line between an evil company and an acceptable one .

The monopoly exists and nowhere more than in in/organic chemistry where nearly all chemists use Word. We have taken the view that we will work with what scientists actually use, not what we would like them to use. The only current alternative is to avoid working in this field - chemists will not use Open Office.

Another, to my mind even more important, point was raised by Peter Sefton in a comment on Peter MR's entry:

I will have to talk about this at greater length but I think the issue is not working with Microsoft it's working in an interoperable way. The plugins coming out of MS Research now might be made by well meaning people but unless they encode their results in something that can interop with other word processors (the main one is OOo Writer) then the effect is to prolong the monopoly. There is a not so subtle trick going on here - MS are opening up the word processing format with one hand while building addons like the Ontology stuff and the NLM work which depend on Word 2007 to work with the other hand. I have raised this with Jim Downing and I hope you can get a real interop on Chem4Word.

(Peter S, btw, blogs here and works on a little thing called The Integrated Content Environment (ICE), which looks to me like a good candidate for an ideal Electronic Lab Notebook...)

There's a difference between the plugins being Open Source and the plugins being useful to the F/OSS community. If collaborators hold Microsoft to real interoperability, the "Evil Empire" concerns largely go away, because the project can simply fork to support any applications other than Word.

(I've emailed John Wilbanks to get his reaction to all this, but be patient because he's insanely busy in general, and right now he's on honeymoon!)




Wednesday, 11 March
On science and selfishness.

Glyn Moody has a nice post up about fraternizing with the enemy in Open Science; you should read the whole thing, but here's the gist:

One of the things that disappoints me is the lack of understanding of what's at stake with open source among some of the other open communities. For example, some in the world of open science seem to think it's OK to work with Microsoft, provided it furthers their own specific agenda. Here's a case in point:
John Wilbanks, VP of Science for Creative Commons, gave O'Reilly Media an exclusive sneak preview of a joint announcement that they will be making with Microsoft later today at the O'Reilly Emerging Technology Conference. [...] Microsoft will be releasing, under an open source license, Word plugins that will allow scientists to mark up their papers with scientific entities directly.

That might sound fine - after all, the plugins are open source, right? But no. Here's the problem:

Wilbanks said that Word is, in his experience, the dominant publishing system used in the life sciences [and] probably the place that most people prepare drafts. "almost everything I see when I have to peer review is in a .doc format."

In other words, he doesn't see any problem with perpetuating Microsoft's stranglehold on word processing. But it has consistently abused that monopoly [...]

Working with Microsoft on open source plugins might seem innocent enough, but it's really just entrenching Microsoft's power yet further in the scientific community [...]

It would have been far better to work with OpenOffice.org to produce similar plugins, making the free office suite even more attractive, and thus giving scientists yet another reason to go truly open, with all the attendant benefits, rather than making do with a hobbled, faux-openness, as here.

Let me say upfront that I mostly agree with Glyn here. Scientists should be at the forefront of abandoning closed for Open wherever possible, because in the long term Open strategies offer efficiencies of operation and scale that closed, proprietary solutions simply cannot match.

Having said that -- and most expressly without wishing to put words into John Wilbanks' mouth -- my response to Glyn's criticism is that I think he (Glyn) is seriously underestimating the selfish nature of most scientists. Or if you want to be charitable, the intense pressure under which they have to function. Let me unpack that:

Glyn talks about making Open Office more attractive and providing incentives for scientists to use Open solutions, but what he may not realize is that incentives mostly don't work in that tribe. Scientists will do nothing that doesn't immediately and obviously contribute to publications, unless forced to do so. Witness the utter failure of Open Access recommendations, suggestions and pleas vs the success of OA mandates. These are people who ignore carrots; you need a stick, and a big one.

For instance: I use Open Office in preference to Word because I'm willing to put up with a short learning curve and a few inconveniences, having (as they say here in the US) drunk the Open Kool-Aid. But I'm something of an exception. Faced with a single difficulty, one single function that doesn't work exactly like it did in Word, the vast majority of researchers will throw a tantrum and give up on the new application. After all, the Department pays the Word license, so it's there to be used, so who cares about monopolies and stifling free culture and all that hippy kum-ba-yah crap when I've got a paper to write that will make me the most famous and important scientist in all the world?

The last part is a (slight) exaggeration, but the tantrum/quit part is not. Researchers have their set ways of doing things, and they are very, very resistant to change -- I think this might be partly due to the kind of personality that ends up in research, but it's also a response to the pressure to produce. In science, only one kind of productivity counts -- that is, keeps you in a job, brings in funding, wins your peers' respect -- and that's published papers. The resulting pressure makes whatever leads to published papers urgent and limits everything else to -- at best -- important; and urgent trumps important every time. Remember the old story about the guy struggling to cut down a tree with a blunt saw? To suggestions that his work would go faster if he sharpened the saw, he replies that he doesn't have time to sit around sharpening tools, he's got a tree to cut down!

I said above that scientists should move from closed to Open wherever possible because of long term advantages. I think that's true, but like the guy with the saw, scientists are caught up in short-term thinking. Put the case to most of them, and they'll agree about the advantages of Open over closed -- for instance, I've yet to meet anyone who disagreed on principle that Open Access could dramatically improve the efficiency of knowledge dissemination, that is, the efficiency of the entire scientific endeavour. I've also yet to meet more than a handful of people willing to commit to sending their own papers only to OA journals, or even to avoiding journals that won't let them self-archive! "I have a job to keep", they say, "I'm not going to sacrifice my livelihood to the greater good"; or "that's great, but first I need to get this grant funded"; or my personal favourite, "once I have tenure I'll start doing all that good stuff". (Sure you will. But I digress.)

So to return to the question at hand: it's a fine thing to suggest that scientists should use Open Office, but I flat-out guarantee you that they never will unless somehow their funding comes to depend on it. Word is familiar and convenient; none of the advantages of Free/Open Source software are sufficiently important to overcome the urgency with which this paper or that grant has to be written up and sent.

It's also a great idea to get researchers to start thinking about, and using, markup and metadata and all that chewy Semantic Web goodness, but again I guarantee 100% failure unless you fit it into their existing workflow and habits. If you build your plugins for Open Office, that won't be another reason to use the Free application, it will be another reason to reject semantic markup: "oh yeah, the semantic web is a great idea, yeah I'd support it but there's no Word plugin so I'd have to install Open Office and I just don't have time to deal with that...".

When it comes to scientists, you don't just have to hand them a sharper saw, you have to force them to stop sawing long enough to change to the new tool. All they know is that the damn tree has to come down on time and they will be in terrible trouble (/fail to be recognized for their genius) if it doesn't.



Tuesday, 10 March
Fooling around with numbers, part 2

Following on from this post, and in the spirit of eating my own dogfood¹, herewith the first part of my analysis of the U Cali OSC dataset.

The dataset includes some 3137 titles with accompanying information about publisher, list price, ISI impact factor, UC online uses and average annual price increase; these measures are defined here. The spreadsheet and powerpoint files I used to make the figures below are available here: spreadsheet, ppt.

As a first pass, I've simply made pairwise comparisons between impact factor, price and online use. There's no apparent correlation between impact factor and price, for either the full set or a subset defined by IF and price cutoffs designed to remove "extremes", as shown in the inset figure:


UCOSCpriceIF.JPG


One other thing that stands out is the cluster of Elsevier journals in the high-price, low-impact quadrant, and the smaller cluster of NPG's highest IF titles at the opposite extreme. Note that n < 3137 because not all titles have impact factors, usage stats, etc. I've included the correlation coefficients mainly because their absence would probably be more distracting than having the (admittedly fairly meaningless) numbers available, at least for readers whose minds work like mine.
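For anyone who wants to try the same kind of pairwise comparison outside a spreadsheet, here is a minimal sketch in Python. It is not how the figures above were made. It assumes a plain Pearson coefficient, and the field names (`price`, `impact_factor`), the cutoff values and the sample rows are all made up for illustration. The helper drops titles that lack a value, which is why n shrinks, and optionally applies upper cutoffs to remove "extremes".

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / sqrt(var_x * var_y)


def correlate_with_cutoffs(rows, x_key, y_key, x_max=None, y_max=None):
    """Drop rows missing either value or above a cutoff, then return (r, n)."""
    kept = [
        (row[x_key], row[y_key])
        for row in rows
        if row.get(x_key) is not None
        and row.get(y_key) is not None
        and (x_max is None or row[x_key] <= x_max)
        and (y_max is None or row[y_key] <= y_max)
    ]
    xs = [x for x, _ in kept]
    ys = [y for _, y in kept]
    return pearson_r(xs, ys), len(kept)


# Made-up rows for illustration only; these are NOT values from the UCOSC dataset.
journals = [
    {"price": 300.0, "impact_factor": 1.2},
    {"price": 900.0, "impact_factor": 3.4},
    {"price": 2500.0, "impact_factor": 0.8},
    {"price": 1200.0, "impact_factor": None},  # no IF: excluded, so n shrinks
    {"price": 4000.0, "impact_factor": 28.0},  # an "extreme" title
]

r_all, n_all = correlate_with_cutoffs(journals, "price", "impact_factor")
r_cut, n_cut = correlate_with_cutoffs(
    journals, "price", "impact_factor", x_max=3000.0, y_max=10.0
)
print(f"full set:     r = {r_all:.2f} (n = {n_all})")
print(f"with cutoffs: r = {r_cut:.2f} (n = {n_cut})")
```

The cutoffs are applied before the coefficient is computed, so a single extreme title can't dominate the result. It also means the answer depends on where you draw the line, which is worth keeping in mind when reading any inset.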

Next I asked whether there was any clearer connection between price and online uses aggregated over all UC campuses:


UCOSCpriceuse.JPG


Again, not so much. I played about with various cutoffs, and the best I could get was a weak correlation at the low end of both scales (see inset). And again, note Elsevier in the "low value" quadrant, and Nature in a class of its own. Being probably the one scientific journal every lay person can name, in terms of brand recognition it's the Albert Einstein of journals. Interestingly, not even the other NPG titles come close to Nature itself on this measure, though they do when plotted against IF. I wonder whether that actually reflects a lay readership?

Finally (for the moment) I played the Everest ("because it's there") card and plotted use against impact factor:


UCOSCuseIF.JPG


The relationship here is still weak, but noticeably stronger than for the other two comparisons -- particularly once we eliminate the Nature outlier (see inset). I've seen papers describing 0.4 as "strong correlation", but I think for most purposes that's wishful thinking on the part of the authors. I do wish I knew enough about statistics to be able to say definitively whether this correlation is significantly greater than those in the first two figures. (Yes yes, I could look it up. The word you want is "lazy", OK?) Even if the difference is significant, and even if we are lenient and describe the correlation between IF and online use as "moderate", I would argue that it's a rich-get-richer effect in action rather than any evidence of quality or value. Higher-IF journals have better name recognition, and researchers tend to pull papers out of their "to-read" pile more often if they know the journal, so when it comes time to write up results those are the papers that get cited. Just for fun, here's the same graph with some of the most-used journals identified by name:


UCOSCtitles.JPG
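On the question raised above of whether one correlation is significantly greater than another: the textbook tool is Fisher's r-to-z transformation, and here is a minimal sketch of it in Python. Two loud caveats. First, this version assumes the two correlations come from independent samples, which is not true here, since the three comparisons share journals and variables; a test for dependent correlations would be the more defensible choice. Second, the r and n values below are placeholders, not the numbers from the figures.

```python
from math import atanh, erf, sqrt


def compare_correlations(r1, n1, r2, n2):
    """Fisher r-to-z test for two correlations from INDEPENDENT samples.

    Returns (z, two_sided_p). Assumes |r| < 1 and n > 3 for both samples.
    """
    if n1 <= 3 or n2 <= 3:
        raise ValueError("each sample needs more than 3 points")
    z1, z2 = atanh(r1), atanh(r2)  # Fisher transformation of each r
    se = sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))  # standard error of z1 - z2
    z = (z1 - z2) / se
    # Two-sided p-value from the standard normal CDF, written with erf.
    cdf = 0.5 * (1.0 + erf(abs(z) / sqrt(2.0)))
    return z, 2.0 * (1.0 - cdf)


# Placeholder inputs for illustration; substitute the real r and n values.
z, p = compare_correlations(0.40, 2000, 0.10, 2000)
print(f"z = {z:.2f}, two-sided p = {p:.3g}")
```

With thousands of titles, even a modest difference between two coefficients will come out "significant", so the result says more about sample size than about whether the relationship is strong enough to matter.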


Peter Suber has pointed out a couple of other (formal!) studies that have come to similar conclusions to those presented here. There are probably many such, because the relevant literature is dauntingly large. There's even a journal of scientometrics! The FriendFeed discussion of my earlier post has generated some interesting further questions, for instance Bob O'Hara's observation that a finer-grained analysis would be more useful. I'm not sure I'm up for manually curating the data, though, and I can't see any other way to achieve what Bob suggests... I might do it for the smaller Elsevier Life Sciences set. For the moment I think I'll concentrate more on slightly different questions regarding IF and price distributions, as in Fig 3 in my last post -- tune in next time for more adventures in inept statistical analysis!


-------------
1 I'm always on about Open Data and "publish early, publish often" collaborative models like Open Notebook Science, and it occurs to me that the ethos applies to blogging as much as to formal publications. So I'm going to try to post analyses like this in parts, so as to get earlier feedback, and of course I try to make all my data and methods available. Let me know if you think I'm missing any opportunities to practice what I preach.



Tuesday, 10 March
Fooling around with numbers

A while back, there was some buzz about a paper showing that, for a particular subset of journals, there was essentially no correlation between Impact Factor and journal subscription price. I think, though my google-fu has failed me, that the paper was Is this journal worth $US 1118? (pdf!) by Nick Blomley, and the journals in question were geography titles. Blomley found "no direct or straightforward relationship" between price and either Impact Factor or citation counts. He also looked at Relative Price Index, a finer-grained measure of journal value developed by McAfee and Bergstrom. He didn't plot that one out, so I will:

blomley.jpg

There is some circularity here, since RPI is calculated using price, but once again I'd call that no direct or straightforward relationship.

All this got me wondering about the same analyses applied to other fields and larger sets of journals. My first stop was Elsevier's 2009 price list, handily downloadable as an Excel spreadsheet. It doesn't include Impact Factors, but the linked "about" page for each journal displays the IF, if it has one, quite prominently. So I went through the Life Sciences journals by hand, copying in the IFs. I ended up with 141 titles with, and 90 titles without, Impact Factors. As with Blomley's set, there was no apparent correlation between IF and price:

Elsevier1.jpg

Interesting, no? If the primary measure of a journal's value is its impact -- pretty layouts and a good Employment section and so on being presumably secondary -- and if the Impact Factor is a measure of impact, and if publishers are making a good faith effort to offer value for money -- then why is there no apparent relationship between IF and journal prices? After all, publishers tout the Impact Factors of their offerings whenever they're asked to justify their prices or the latest round of increases in same.

There's even some evidence from the same dataset that Impact Factors do influence journal pricing, at least in a "we can charge more if we have one" kinda way. Comparing the prices of journals with or without IFs indicates that, within this Elsevier/Life Sciences set, journals with IFs are higher priced and less variable in price:

Elsevier2.jpg
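A comparison like this is easy to reproduce with Python's standard library. The sketch below is an illustration, not the method behind the figure: the prices are invented, and using the coefficient of variation (standard deviation divided by mean) as the measure of "less variable" is my assumption, chosen because it allows a fair comparison of spread between groups with different average prices.

```python
from statistics import mean, stdev


def summarize(prices):
    """Return n, mean, sample standard deviation and coefficient of variation."""
    m = mean(prices)
    s = stdev(prices)  # sample standard deviation; needs at least 2 values
    return {"n": len(prices), "mean": m, "stdev": s, "cv": s / m}


# Invented prices for illustration only; NOT taken from the Elsevier price list.
with_if = [1800.0, 2100.0, 1950.0, 2400.0, 2250.0]
without_if = [300.0, 2600.0, 900.0, 150.0, 1700.0]

for label, prices in (("with IF", with_if), ("without IF", without_if)):
    stats = summarize(prices)
    print(
        f"{label:>10}: n = {stats['n']}, mean = {stats['mean']:.0f}, "
        f"stdev = {stats['stdev']:.0f}, CV = {stats['cv']:.2f}"
    )
```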

About the time I was finishing this up, I came across a much larger dataset from U California's Office of Scholarly Communication. I've converted their html tables into a delimited text file, available here: UCOSC.txt. For my next trick I'll see what information I can squeeze out of a real dataset (there are about 3,000 titles in there).
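If you'd rather do that sort of conversion yourself, here is a minimal sketch using only Python's standard library. It is not the script behind UCOSC.txt. It assumes reasonably tidy markup (every cell and row has a closing tag), writes tab-delimited output, and the sample table is invented.

```python
import csv
import io
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace inside the cell to single spaces.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_tsv(html):
    """Return the table rows found in `html` as tab-delimited text."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


# Invented sample table for illustration only.
sample = """
<table>
  <tr><th>Title</th><th>Publisher</th><th>List price</th></tr>
  <tr><td>Journal of Examples</td><td>Example Press</td><td>1,238</td></tr>
</table>
"""
print(html_table_to_tsv(sample))
```

Tabs are a safer delimiter than commas for this kind of data, because journal titles and prices often contain commas themselves.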

Oh, and if anyone wants it, the Elsevier Life Sciences data are in this Excel file: ElsevierLifeSciPriceList.xls.



Sunday, 18 January
Another wonderful conference.

I'm sitting in the computer room at the Radisson RTP after Science Online '09 has wound down, and most of the attendees have left -- though I'm looking forward to dinner with a few fellow stragglers this evening.

Many thanks are due Anton, Bora, David and their various helpers, sponsors and assorted minions for running another wonderful conference. I was happier'n a pig in a puddle with this year's program, as I was able to attend an Open Something (or related) session in almost every slot. There's nothing quite like indulging an obsession with a crowd of like minds, especially when there remains enough diversity of opinion to (mostly) avoid the echo chamber effect. There was only one thing I can point to that wasn't essentially perfect, which is that the web connection, wifi or wire, was flaky and slow quite a lot of the time. That observation must be taken in context, though: although everyone commented, no one complained. It just isn't that sort of gathering.

My session with Björn went well (OK, I can't really judge that -- but I had fun!) -- although it would have gone better if I'd shut up sooner. Having not been to an unconference before, I wasn't strict enough with my introductory blurb and took up time that would have been better spent on the ensuing discussion, which was just terrific. I'll know next time -- and Björn was careful to learn from my mistake, limiting himself to a quick intro for his session with Peter Binfield and obstinately driving the discussion away from echo chamber territory, challenging the participants to come up with new ideas and ways forward. (If you're interested in the Impact Factor question -- that is, metrics and measurement in science -- there's a collaborative bibliography underway in a Google Doc here. I'll make it publicly editable as soon as I figure out how; in the meantime email me if you want an invite to collaborate.)

I definitely prefer the unconference format to a traditional lecture-style conference. When there is a subject that needs more intensive coverage by the speaker(s), the flexible format easily accommodates that -- for instance, John Wilbanks' talk on the semantic web was of necessity about half informal lecture and half rowdy discussion, simply because it's a complex topic about which few of us knew very much. (Before John got through, I mean, since it was an informative and inspiring look at the technology which will probably underpin the next truly radical leap forward in scientific capability.)

As Eva Amsen and Henry Gee both observed, the line between people I know online and people I've met in meatspace is getting very blurry these days. I was nonetheless pleased to meet Eva and Henry f2f for the first time, and also Björn, Peter and John, Cameron Neylon (more like "nylon" than "nay-lon"!), Victor Henning, Martin Fenner and a dozen others to whom I apologize for being too tired to remember you right now! I was of course no less happy to catch up with old friends, repeat offenders like me who were also at the 2007 and 2008 events.

And now it's too late for me to get a nap before dinner, so I think I'll go see if a shower will wake me up instead. More later as I process the many new ideas and insights I collected in the course of two very enjoyable days.




Monday, 12 January
What do you want to know about Open Access?

Science Online '09 is less than a week away, and I'm going to be co-moderating an unconference session with Björn Brembs, the theme of which is "Open Access publishing: present and future".

Björn has already put some notes up on the wiki, and there's an interesting contribution from Antony Williams of Chemspider. As both Björn's and Antony's notes make clear, we think the future of Open Access (indeed, all scholarly) publishing will feature prominently the long-overdue death of the Impact Factor. In fact, audience willing, we plan to use some of this session as a sort of preface for Björn's Sunday session with Peter Binfield, which is titled "Reputation, authority and incentives. Or: How to get rid of the Impact Factor".

It's difficult to overstate the extent to which that single figure has come to dominate scholarly and administrative decision making: where to publish, who to fund or promote, which candidate to hire, and so on. It's also difficult to overstate how bad an idea it is to put so much weight on a single journal-level metric derived by undisclosed calculations and decisions from a proprietary database.

But that's the future of publishing, about which much more from Björn and Peter. Regarding the past, I thought I would do a five-minute definition-plus-potted-history, cribbed almost entirely from my earlier talk and Peter Suber's timeline.

That leaves us with the present, and in the spirit of an unconference about science online, I thought I'd simply ask the audience: what do you want to know about Open Access?

There are two things I must clarify. Firstly, by audience I mean both online and on the day: if you're there, you can ask in person, but if you're not going to the meatspace conference you are welcome to ask your question here, on the conference wiki, or by email to me, at any time. Secondly, I'm not claiming I'll have the answer ready to hand -- but OA and related Open ideas are pretty much my hobby these days, and if you have a question I can't answer I'll be sure to find out and get back to you. (In addition, the conference will be packed with OA experts and I have no hesitation in bothering them for answers!)

So: what do you want to know about Open Access?



Saturday, 20 December
The serials crisis has a name, and it's Reed Elsevier.

It's notoriously difficult to get good numbers on publisher income, expense and profit -- even nonprofits like PLoS only publish what they have to¹ -- and so I'm always on the lookout for more data. If I had more spare time, I could dig out more information, but for now I rely on articles like this one (via OAN) from McGuigan and Russell at Penn State:

The Business of Academic Publishing: A Strategic Analysis of the Academic Journal Publishing Industry and its Impact on the Future of Scholarly Publishing

(Incidentally, in the unlikely circumstance that you've read this far and your eyes haven't glazed over, you will probably like my oa.numbers and serialscrisis tags on Simpy, which is where I keep my collection of such references.)

Interested persons should, as the kids say, RTWT, especially the nice readable introduction to scholarly publishing and the serials crisis; I just want to publicize this table of profit margins, comparing Elsevier S&M with the broader STM industry:

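Profit margins (%)

year | Elsevier Science and Medical | all Elsevier journals | all periodical publishers
-----|------------------------------|-----------------------|--------------------------
1998 | 35.9                         | 25.7                  | 4.9
1999 | 35.4                         | 23.4                  | 4.7
2000 | 36.4                         | 21.0                  | 4.3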


I am not going to pay over $100 for the Risk Management Assoc. data that McGuigan and Russell used, but I did download the UK Competition Commission report, wherein I found numbers supportive of the Elsevier figures in the table above.  The 2007 LJ Periodicals Price Survey says that commercial STM publishers' profit margins were "around 25 percent on average" for that year, so the figures for "all periodical publishers" would seem to include a variety of non-STM publishers.  Even so, Elsevier's science and medical division has a clear and commanding lead in the price-gouging stakes.

They also have a clear lead in market share. In one of McGuigan and Russell's references (a 2002 Morgan Stanley report that you can get in pdf format if you have half a clue about search), I found a table showing the proportion of the STM market (measured in number of journals and number of articles) enjoyed by a range of publishers. With a little digging (in the filthy muck of commerce, at that; you owe me, loyal readers!) I discovered that Bertelsmann is part of Springer's original name and they now own Kluwer Academic Publishing (as far as I can tell, most of Wolters Kluwer's journals except for Lippincott Williams & Wilkins) under the rubric of Springer Science+Business Media, and that Wiley bought Blackwell a few years ago.

With that in mind, here's an abbreviated version of the Morgan Stanley table of data:

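publisher                  | no. journals | % ISI journals | % articles
---------------------------|--------------|----------------|-----------
Elsevier Science           | 1347         | 18             | 25
Springer + Kluwer          | 878          | 11             | 11
Wiley + Blackwell          | 620          | 8              | 8
[15 other named companies] | 874          | 11             | 14
Others (2,028 publishers)  | 3716         | 48             | 40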


Although those figures are from 2002, the 2008 Library Journal Periodicals Price Survey estimated that

the top ten STM publishers pulled in 53 percent of the revenue in the $16.1 billion periodicals market in 2006
so the bottom line doesn't seem to have changed much.

Mind you, I don't mean to imply that we should launch another boycott; reining in Elsevier's profit margins and/or market share would do little to offset the serials crisis. The only answer to that, in the long term, is Open Access, because it scales where Toll access doesn't. No, this entry is not really about OA at all, it's just a little kick in the shins for my favorite Greedy Bastard Publishers.




------------------
1 I'd link to the GuideStar reports but I can't get them: I registered, but they haven't bothered sending me the verification email, and until they do I can't use their search.  What is this, amateur hour?




Wednesday, 26 November
Huh. I didn't suck.

A while ago, I mentioned that I was giving a talk at the Berglund Center. Well, now you can watch the whole thing on video, here (scroll down to Sept 9th).

I watched it myself, and despite seeing mostly room for improvement, was pleasantly surprised at just how much I didn't suck.

Many thanks to all those who offered suggestions on FriendFeed and on this blog. My slides are available here, and like everything I make they are intended for the public domain.



Wednesday, 26 November
Pop quiz!

Two unrelated quizzes that I recently took, and that might amuse some readers:

Via Peter Suber, Lund University's ten-question quickie on Open Access. And yes, I got 10/10.

Via 3 Quarks Daily: from the Intercollegiate Studies Institute, something that purports to be a Civics Quiz but which looks to me rather more like libertarian/capitalist propaganda. Of the roughly 2500 citizens who took the test as part of a survey, nearly three-quarters failed, and the average score was 49%. (I got 27/33, for those keeping score.)



Saturday, 22 November
Bizarre omission from my blogroll

I just noticed that Richard Poynder's blog Open and Shut? was missing from my blogroll -- which is weird, because I know it was on there at one time. I think that I didn't notice earlier because everything Richard writes gets covered multiple times across my "news network", simply because it's so damn good.

Anyway, the blog is back -- and if you read me because you are interested in Open Access and Open Science, and you're not already reading Richard, then do yourself a favour and start.



Monday, 17 November
Recommend OA to President Obama

Via Peter Suber and Bora: Obamacto is a new site where you can make recommendations to Obama's Chief Technology Officer and vote on recommendations made by others. Peter's suggestion was this:

Require open access for publicly-funded research

Require open access to the results of non-classified research funded by taxpayers. Extend the exemplary policy now in place at the NIH to all federal agencies.

You can vote anonymously, but registration is a snap -- seriously, the fastest and easiest online signup I've ever seen. Go vote!




Tuesday, 14 October
Open Access Day 2008

It's OA Day, and all the usual suspects are posting entries in the synchroblogging contest. I'm staying off the web except for 30 minutes or so mornings and evenings (because I desire and intend to finish the Project That Would Not Die by the end of the year), and that really only leaves me time to keep up with my feeds and friends.

So, that's my excuse for not having a contest entry (well, that and I dislike contests and prizes... a rant for another time). But I can't let OA Day go unremarked, so check out the official blog and the FriendFeed room. Here is the blog feed (sorry it's Flash, but I don't have time to test other widgets -- and it is pretty):

(Next year, I'm going to treat OA Day as a national holiday and take the day off work in celebration. Maybe one day everyone will do the same...)



Monday, 06 October
What she said.

With one alteration (viz I have had no differences with Richard Poynder), what Dorothea said goes for me as well. (For more background see Matt at Journalology: 1, 2.)

This is just a for-the-record, public statement that I fully support Richard Poynder's laudable and transparently conducted investigation of SJI and other publishers whose conduct threatens to bring Open Access into disrepute, and that if any such publishers take their legal bullying further than the bluff and bluster we are currently seeing from SJI, I will do what I can to help Richard fight back.

Update 081006: Peter Suber and Stevan Harnad have issued a joint statement in support of the investigative work of Richard Poynder. I was hesitant to do so when it was just me following Dorothea's lead, but now I would like to encourage everyone who is familiar with Richard's work and the SJI story to pick sides and do so publicly. (I have no doubt that every reasonable person will pick Richard's side!)



Tuesday, 26 August
Help me make the most of an opportunity.

Check me out:

ad.jpg

That means I've got about a week to put together a 30-40 minute talk. I won't have any trouble filling up the time, of course -- the real problem is what NOT to present. I aim to use the web instead of powerpoint, by creating a series of bookmarks that I can open in browser tabs (or from a History sidebar; haven't decided) and move through those like slides. I plan to follow the basic format of my old essays: we're all familiar with Free/Open Source software, the NIH just mandated a kind of Open Access so here's what that means and what that can do, and what else can be Open? leading into Open Data, Open Standards/semantic web, Open Licensing -- in short, Open Science.

The Berglund Center is affiliated with Pacific University, "a small, private university with a blend of liberal arts, education and health care". I attended the Center's Summer Institute this year at the kind invitation of the director, Jeffrey Barlow, after he read Mitch Waldrop's "Science 2.0" article and noticed that I was local. (Sadly, I could only attend one day, but it was both fun and productive. The whole thing was also filmed, so I'll make a note when the footage and transcripts are available.)

Pacific U's College of Arts and Sciences includes schools of biology, bioinformatics and chemistry, and all three strongly encourage undergraduate research. I hope to tailor the presentation somewhat in the hope of getting faculty in these schools enthused about Open Access and Open Science.

So, my question to you, dear LazyWeb, is essentially: what should I present? What are the basic, must-know tools and ideas of Open Science? How can I best introduce the possibilities of Open-ness to faculty and students at a small liberal arts college? Who has given really good presentations from which I can swipe ideas? I have an opportunity here to expand the Open Science community; help me make the most of it.


Update 080909: the slides -- after a suggestion from John Dupuis, I ended up using Google Presentations -- are here, and I'll post when the video becomes available.



Saturday, 19 July
An Open Access partisan's view of "Electronic Publication and the Narrowing of Science and Scholarship"

There's been a good deal of online chatter about this recent Science article that discusses the effects of online access on scholarship -- see, e.g., discussions here and here and blog entries noted therein.  The report is not available without paying a toll or subscription, but the abstract is freely visible:

Online journals promise to serve more information to more dispersed audiences and are more efficiently searched and recalled. But because they are used differently than print -- scientists and scholars tend to search electronically and follow hyperlinks rather than browse or peruse -- electronically available journals may portend an ironic change for science. Using a database of 34 million articles, their citations (1945 to 2005), and online availability (1998 to 2005), I show that as more journal issues came online, the articles referenced tended to be more recent, fewer journals and articles were cited, and more of those citations were to fewer journals and articles. The forced browsing of print archives may have stretched scientists and scholars to anchor findings deeply into past and present scholarship. Searching online is more efficient and following hyperlinks quickly puts researchers in touch with prevailing opinion, but this may accelerate consensus and narrow the range of findings and ideas built upon.
This seems thoroughly counter-intuitive to me, since I find a good deal more information by direct search now that I can do it online, and browsing has never played a significant role in my literature searching.  (And remember, I'm old -- I started out using Index Medicus!)  Who has time to browse probably-irrelevant journals and tables of contents on the offchance that something might be useful?  I'm far more likely to stumble across things I'd never have otherwise found when I'm relying on a variety of relevance-based search algorithms (PubMed's Related Articles, Google Scholar, NextBio, etc.).

For anyone who thinks that "forced browsing of print archives" makes a lick of sense: we'll pick a topic, then you spend a day or two browsing in meatspace, and I'll spend an hour searching online.  Who do you think is likely to come up with the best (most useful, most comprehensive) set of references?

Moreover, the article's conclusions seem to be based on a couple of unspoken assumptions with which I don't agree.

The first is that citing more and older references is somehow better -- that bit about "anchor[ing] findings deeply into past and present scholarship".  I don't buy it.  Anyone who wants to read deeply into the past of a field can follow the citation trail back from more recent references, and there's no point cluttering up every paper with every single reference back to Aristotle.  As you go further back there are more errors, mistaken models, lack of information, technical difficulties overcome in later work, and so on -- and that's how it's supposed to work.  I'm not saying that it's not worth reading way back in the archives, or that you don't sometimes find overlooked ideas or observations there, but I am saying that it's not something you want to spend most of your time doing.

Secondly, let's take the author at his word:

I show that as more journal issues came online, the articles referenced tended to be more recent, fewer journals and articles were cited, and more of those citations were to fewer journals and articles.
OK, suppose you do show that -- it's only a bad thing if you assume that the authors who are citing fewer and more recent articles are somehow ignorant of the earlier work.  They're not: as I said, later work builds on earlier.  Evans makes no attempt to demonstrate that there is a break in the citation trail -- that these authors who are citing fewer and more recent articles are in any way missing something relevant.  Rather, I'd say they're simply citing what they need to get their point across, and leaving readers who want to cast a wider net to do that for themselves (which, of course, they can do much more rapidly and thoroughly now that they can do it online).

If that means citing fewer articles now than researchers tended to cite 20 years ago, it probably has more to do with changes in the culture of science than in the electronic availability of research papers.  For instance, I think it far more likely -- to exaggerate, for the purposes of illustration, in the opposite direction to Evans -- that earlier authors, unable to rapidly and comprehensively scan the literature, cited everything they could get their hands on, padding their bibliographies well beyond anything useful in an attempt to lend weight to their arguments.

It's potentially worrisome if more citations are going to fewer journals, but once again I see no more reason to attribute that to increasing online availability than to attribute it to the sharply rising cost of scientific journals in any form.  It's well documented that as journal prices have continued to rise, researchers and institutions have had to cut back on the number of subscriptions they take.  It is not difficult to imagine that "long tail" and "preferential attachment" phenomena (see, for instance, Evans' own references 14 - 18, reproduced below) would drive the concentration of likely subscriptions towards a pool of "must have" journals.  Indeed, publishers actively promote the concept of such a pool and compete strongly to be seen to be part of it.

Finally, and to me most importantly, Evans seems to me to gloss over the question of what proportion of the online archives are freely available, and what effect that has on the phenomenon he is attempting to model.  Here's the crux of what he does say (fair use! fair use!):

Evansfig2C.JPG

I've rearranged the figure so that what were left, middle and right panels are now top, center and bottom panels; in all graphs the abscissae are "Years of journal issues online" and the ordinates are "Herfindahl citation concentration", which is explained as follows:

A concentration of 1 indicates that every citation to [a given] journal [or subfield] in a given year is to a single article; a concentration just less than 1 suggests a high proportion of citations pointing to just a few articles; and a concentration approaching zero implies that citations reach out evenly to a large number of articles.
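For anyone who hasn't met the Herfindahl index before, the standard form is the sum of the squared shares. The excerpt above doesn't spell out Evans' formula, so the following is only a minimal sketch assuming he uses that standard definition; the function name and the citation counts are invented for illustration.

```python
def herfindahl(citation_counts):
    """Standard Herfindahl index: the sum of squared citation shares.

    citation_counts holds the number of citations received by each
    article (or journal) in the group being measured. The counts used
    below are invented, not taken from Evans' data.
    """
    total = sum(citation_counts)
    if total == 0:
        raise ValueError("no citations to measure")
    return sum((count / total) ** 2 for count in citation_counts)


# Every citation goes to one article: concentration is 1.
single = herfindahl([50])

# Most citations go to one article: concentration is high (about 0.815).
skewed = herfindahl([90, 5, 5])

# Citations spread evenly over 100 articles: concentration is 1/100.
even = herfindahl([1] * 100)

print(f"{single:.3f} {skewed:.3f} {even:.3f}")
```

The three cases match the three situations in the quoted explanation: exactly 1, just under 1, and approaching zero as the citations spread out over more articles.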
Here's Evans' interpretation of that data:
Figure 2C illustrates the concurrent influence of commercial and free online provision on the concentration of citations to particular articles and journals. The left panel shows that the number of years of commercial availability appears to significantly increase concentration of citations to fewer articles within a journal. If an additional 10 years of journal issues were to go online via any commercial source, the model predicts that its citation concentration would rise from 0.088 to 0.105, an increase of nearly 20%. Free electronic availability had a slight negative effect on the concentration of articles cited within journals, but it had a marginally positive effect on the concentration of articles cited within subfields (middle panel) and appeared to substantially drive up the concentration of citations to central journals within subfields (right panel). Commercial provision had a consistent positive effect on citation concentration in both articles and journals. The collective similarity between commercial and free access for all models discussed suggests that online access -- whatever its source -- reshapes knowledge discovery and use in the same way.
Wait, what?  Let me unpack that with a rewrite from my point of view:
The number of years of commercial availability appears to significantly increase concentration of citations to fewer articles within a journal, whereas free electronic availability had a negative effect on the concentration of articles cited within journals. If an additional 10 years of journal issues were to go online via any commercial source, the model predicts that its citation concentration would rise from 0.088 to 0.105, an increase of nearly 20%. In contrast, if an additional 10 years of journal issues were to go online via any free source, the model predicts that its citation concentration would drop from 0.088 to just under 0.08 [I had to estimate this by eye, since the data are not available], a decrease of around 10%. Similarly, free electronic availability had only a marginally positive effect on the concentration of articles cited within subfields. Only when considering concentration to journals within a subfield did free availability cause a substantial increase, and even then this effect was considerably less than that driven by commercial availability, which had a consistent positive effect on citation concentration in both articles and journals.
In other words, I take issue with the final sentence of the paragraph I quoted: commercial and free access do not show "collective similarity".  On one of three measures they have the opposite effect, and on the other two measures commercial access has by far the stronger effect.
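The percentages in that rewrite are simple relative changes from the 0.088 baseline. Here is the arithmetic as a short sketch; note that the 0.080 figure is my by-eye estimate from the graph, not a number Evans reports, and the function name is just for illustration.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, as a percentage."""
    return (after - before) / before * 100.0


BASELINE = 0.088  # citation concentration before the extra 10 years go online

# Commercial access: 0.088 -> 0.105 (Evans' own figures).
commercial = percent_change(BASELINE, 0.105)

# Free access: 0.088 -> roughly 0.080 (estimated by eye from the figure).
free = percent_change(BASELINE, 0.080)

print(f"commercial: {commercial:+.1f}%  free: {free:+.1f}%")
```

The commercial figure comes out a little over 19%, which is Evans' "nearly 20%", and the free-access estimate comes out at roughly minus 9%, which is the "around 10%" decrease above. The two changes have opposite signs.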

What this suggests to me is that the driving force in Evans' suggested "narrow[ing of] the range of findings and ideas built upon" is not online access per se but in fact commercial access, with its attendant question of who can afford to read what.  Evans' own data indicate that if the online access in question is free of charge, the apparent narrowing effect is significantly reduced or even reversed.  Moreover, the commercially available corpus is and has always been much larger than the freely available body of knowledge (for instance, DOAJ currently lists around 3500 journals, approximately 10-15% of the total number of scholarly journals).  This indicates that if all of the online access that went into Evans' model had been free all along, the anti-narrowing effect of Open Access would be considerably amplified.

In fact, the comparison between print and online access is barely even possible when considering Open Access information.  The same considerations of cost -- who can afford to read what -- apply to commercial print and online publications, but free online information has essentially no print ancestor or equivalent.  Few if any scholarly journals were ever free in print, so there's a huge difference between conversion from commercial print to commercial online on the one hand, and from commercial print to Open Access on the other.

Indeed, I would suggest that if the entire body of scholarly literature were Openly available, so that every researcher could read everything they could find and programmers were free to build search algorithms over a comprehensive database to help the researchers do that finding, then in fact the opposite effect would obtain.  Perhaps it's true that the more commercial online access you have, the less widely a researcher's literature search net is cast, but as I mentioned above I see no reason to attribute that more to the mode of access than to its cost.

In support of this assertion, consider the expanding body of literature on the Open Access "citation advantage" -- studies which show that the likelihood of a given paper being cited is increased by up to several hundred percent if the paper is OA rather than commercially available.  There is some controversy over that literature, but it stands in direct contrast to the idea that online access of any kind tends to narrow citation reach.

There are more data in Evans' paper that speak to the free-vs-commercial issue, and some of those data show free access having a stronger "narrowing" effect than commercial access.  I'd go through it in detail, but I am probably already pushing the limits of fair use so I'll have to refer you to the published article -- in particular, Figure 2 panels A and B.  My response is much the same, that the apparent effect suffers from a loading in "favour" of commercial access, because of the wildly disparate sizes of the two different bodies of online literature. 



-----
refs 14-18 from Evans, JA Science 321:395, 2008:

A. L. Barabási, R. Albert, Science 286, 509 (1999).
R. K. Merton, Science 159, 56 (1968).
D. J. de Solla Price, Science 149, 510 (1965).
H. A. Simon, Biometrika 42, 425 (1955).
M. J. Salganik, P. S. Dodds, D. J. Watts, Science 311, 854 (2006).

Updates 080720:

1. I linked to the FriendFeed discussions but meant to emphasize -- in one of those conversations, Lars Juhl Jensen points out that the single biggest change is information volume:

I cannot help but wonder if this has anything to do with electronic publication, or if it is simply an effect of sheer volume. If researchers have to search through ten times as many articles (because of the exponential growth of the literature), is it really surprising that they don't make it as far back into the past as they used to do?
This is related to, though stronger than, my point about changes in the culture of research.

2. Bora reminded me of another conflicting study by Arthur Eger, this one showing that "a larger [online] content offering coincides with a dramatic increase in Full Text Article requests, and an increase in Full Text Article requests, after about 2 years, coincides with increased article publication". This is not necessarily inconsistent with Evans' claims, especially since the Eger study also showed that the effect of increasing backfile availability is "modest", but I would like to see those increased Full Text requests broken down by date of publication...

3. Tom Wilson doesn't necessarily agree with my (rather blithe?) assertion that researchers are indeed aware of preceding work:

would it were true that authors are not ignorant of earlier work. In my experience as an Editor and a PhD supervisor, I am continually amazed at the extent to which authors and students are unaware of pre-WWW work. It seems that if the work was done before 1995 it is assumed to have no relevance to the present day. In many cases, of course, that will be true and in some cases the research record is a record of building upon earlier work. In the case of many subfields in information science, however, it isn't the case. A great deal of work was done in the 1970s, which is now completely ignored. Researchers rediscover wheels again and again, when a search of the earlier literature would have revealed that what they think of as novel, was novel 50 years ago!
I think this points up my own biases, in that when I think of research I tend only to think of wet lab science, molecular biology in particular since that's what I do for a living. There are many other fields of research! It strikes me that if molecular biologists do in fact reinvent wheels less often than other disciplines, it is perhaps because our online records go back a long way: PubMed reaches back to 1966, and has some coverage all the way back to 1951. Since molecular biology can fairly be said to have come of age as a discipline in 1953, this suggests two things: that Evans may be more right than I think for disciplines outside my own, and that if those disciplines could digitize their archives efficiently it might go a long way towards solving the problem. In other words, the answer to the narrowing effect of online access on scholarship may be to broaden and deepen online access.



Thursday, 03 July
Lie down with pit bulls, wake up with a blogospheric flea in your ear.

This clumsy hatchet job from Nature reporter Declan Butler is beneath him, a poor excuse for journalism and an affront to the respect with which many of his colleagues are regarded by the research community.

Let's start with the title: "PLoS stays afloat with bulk publishing". Loaded rhetoric, anyone? The clear implications are that PLoS is floundering (Butler's own numbers show otherwise!), and that "bulk" is somehow inferior (to, one presumes, "boutique" or some such). PLoS is "following an haute couture model of science publishing" sniffs our correspondent, who goes on to clarify: "relying on bulk, cheap publishing of lower quality papers to subsidize its handful of high-quality flagship journals".

This emphasis on "quality" and the idea that the same somehow equates with scarcity continues throughout: "the company consciously decided to subsidize its top-tier titles by publishing second-tier community journals with high acceptance rates", "the flood of articles appearing in PLoS One (sic)", "difficult to judge the overall quality", "because of this volume, it's going to be considered a dumping ground", "introduces a sub-standard journal to their mix".

The intent is obvious, and the illogic is boggling. Where does Butler think the majority of science is published? Even if you buy into this nebulous idea of "quality" (one knows it when one sees it, does one not old chap? wot wot?) there can be no "great brand" journals without the denim-clad proletarian masses. All the painstaking, unspectacular groundwork for those big flashy headline-grabbing (and, dare I say it, all too often retracted) Nature front-pagers has got to go somewhere.

It gets much worse, though, when we get some measure of what Butler thinks "quality" means:

Papers submitted to PLoS One (sic) are sent to a member of its editorial board of around 500 researchers, who may opt to review it themselves or send it to their choice of referee. But referees only check for serious methodological flaws, and not the importance of the result.
That, along with an earlier remark about "a system of 'light' peer review", is a blatant and serious misrepresentation of PLoS ONE's review process. Here's the actual policy:
The peer review of each article concentrates on objective and technical concerns to determine whether the research has been sufficiently well conceived, well executed, and well described to justify inclusion in the scientific record. [...]

Unlike many journals which attempt to use the peer review process to determine whether or not an article reaches the level of 'importance' required by a given journal, PLoS ONE uses peer review to determine whether a paper is technically sound and worthy of inclusion in the published scientific record. [...]

To be considered for publication in PLoS ONE, any given manuscript must satisfy the following criteria:

  • Content must report on original research (in any scientific discipline).
  • Results reported have not been published elsewhere.
  • Experiments, statistics, and other analyses are performed to a high technical standard.
  • Conclusions are presented in an appropriate fashion and supported by the text.
  • Techniques used have been documented in sufficient detail to allow replication.
  • Reports are presented in an intelligible fashion and written in standard English.
  • Research meets all applicable standards, including the Helsinki Declaration, with regard to the ethics of human and animal experimentation, consent, and research integrity.
  • Report adheres to the relevant community standards for research, reporting, and deposition of data. (Standards PLoS promotes across its journals).
Which is to say that PLoS ONE* holds authors to exactly the same scientific standards that every journal should follow. Which is to say that any methodological flaws, not "only... serious" ones, will see a paper revised, or rejected if the flaws can't be overcome. Which is to say that PLoS ONE uses peer review to do what it was designed to do, not to create an artificial scarcity from which to milk profit with scant regard for the integrity of the scientific record. That's not "light" peer review, it's real peer review.

With this scurrilous parroting of anti-OA FUD, Nature makes pretty clear where its interests and its allies are.  Well, you know what happens when you lie down with pit bulls...

There's a lot more, but that was the issue that pushed my buttons the hardest. See Bora for a roundup of responses; here's a quick outline of some of the key issues:

Jan Velterop, responding to Butler's last "investigation" of PLoS finances two years ago, pointed out that it's ridiculous to expect a new journal with a new business model to break even in a few years, when new journals from established publishers take up to a decade to achieve the same goal; DrugMonkey also mentions the "so what" nature of this complaint. Jonathan Eisen remarks that somehow Butler gets from "PLoS ONE is doing well and making money" to "PLoS is a failure"; go read Jonathan to see how twisted your logic has to be to make that particular trip. (Jonathan also provides an important reminder, that we should not confuse Nature Publishing Group as a whole with their many talented and well intentioned employees!) Grrlscientist observes that, while Butler's piece makes it sound as though PLoS' reliance on donations were a bad thing, all journals rely on the donation of time and expertise by unpaid reviewers. DrugMonkey, Jonathan and Grrlscientist all make the point that Nature has its own stable of "second tier" journals with "lower barriers to entry" -- the same mechanism for which Butler criticizes PLoS. Stevan Harnad is famous for making the point (here, for example) that if the funds currently draining into subscriptions were used to pay OA costs, there would be an immense improvement in the utility of the scientific record even if there were no financial saving.

Finally, pretty much every commenter has pointed out the glaring lack of any "conflict of interest" statement on the Nature piece -- having said which, I'd better make one of my own. It's well known and obvious at a glance at this blog that my favorite drink is the Open Access Kool-Aid. I have personal friends who work for PLoS, and I've previously applied to work there myself.


* originally in lowercase -- so much for my snotty (sic)s!



Sunday, 11 May
OA and licensing: why not Public Domain?

This is an unpublished post that's so old (Aug '07) that I don't know why I didn't just post the damn thing; I've forgotten what I was intending to do with it. I'm posting it now because it contains pointers to useful thinking by David Wiley and others that is germane to the ongoing discussion of data licensing (see post below). I was reminded of this old draft of mine by Deepak's comment that copyleft may be harmful in the case of scientific data, a point David also makes in respect of his particular Open area, education. Much of what David says maps readily from his field to research, so without further ado:

David Wiley of Iterating Toward Openness has been blogging up a storm about open content licensing:

That's a lot to read, but it's all good stuff. David makes one very strong argument that I want to emphasize here, because it points up the difficult distinction between data and (creative) work.

In the post introducing his draft Open Education Licence, he provides a very useful outline of the aims of open content:

  • Reuse - Use the work verbatim, just exactly as you found it
  • Rework - Alter or transform the work so that it better meets your needs
  • Remix - Combine the (verbatim or altered) work with other works to better meet your needs
  • Redistribute - Share the verbatim work, the reworked work, or the remixed work with others

I really, really like that. David's "four R's" resemble the four fundamental freedoms of the Free Software Foundation but do a better job of discriminating between Rework and Remix. The Four R's make immediate sense to me and I will certainly be Reusing and Redistributing that idea.

David goes on to quote some believable numbers and points out that:

Since half of all CC licensed materials are licensed using a copyleft clause and all GFDL licensed materials are licensed using a copyleft clause, this means that over half of the world's open content is copylefted. And while the CC and GFDL copyleft clauses guarantee that all derivative works will be "open," they also guarantee that they can never be used in remixes with the majority of other copylefted works. You can't remix a GFDL work with a By-NC-SA work when the licenses require that the child be licensed exactly as the parent. Each parent had one and only one license - which license would the derivative use? It's just not possible to legally remix these materials; copyleft prevents this remixing. [see David's earlier explanation for details of the incompatibilities among various copyleft licenses]

While promoting rework at the expense of remix - in other words, taking the copyleft approach - is fine for software, it is problematic for content and extremely problematic for education. As educators, we are always remixing materials for use in our classrooms both in the "real" world and online. Your mileage may vary, but over my last 15 years of teaching I would estimate that my remixing activities outnumber my reworking activities 10:1 or more. If other teachers are like me in this regard, then, copyleft is a huge problem for open education.
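David's remix argument can be written down as a toy rule. The sketch below is deliberately simplified and is not legal advice: it assumes that every copyleft licence demands that a derivative carry exactly that same licence, and it ignores any compatibility provisions real licences may have. The function name and the licence set are chosen only for illustration.

```python
# Toy model: licences treated as copyleft ("child must be licensed
# exactly as the parent"). This list is an illustrative assumption.
COPYLEFT = {"GFDL", "CC-BY-SA", "CC-BY-NC-SA"}


def remix_licence(parent_licences):
    """Return the licence a remix must carry under the toy rule.

    Returns None when the parents impose conflicting copyleft terms,
    and "unconstrained by copyleft" when no parent is copylefted.
    """
    required = {lic for lic in parent_licences if lic in COPYLEFT}
    if len(required) > 1:
        return None  # two parents each insist on a different licence
    if required:
        return next(iter(required))
    return "unconstrained by copyleft"


print(remix_licence(["GFDL", "CC-BY-NC-SA"]))  # conflicting copyleft parents
print(remix_licence(["CC-BY", "CC-BY-SA"]))    # one copyleft parent
print(remix_licence(["CC-BY", "CC-BY"]))       # no copyleft parent
```

Under this rule the first remix has no licence it could legally carry, which is the copyleft problem David describes for educators.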

It's potentially a huge problem for scientists, too, because much of the potential of Open Science and Open Data (see here for an attempt at defining those terms) is in Remix. There are answers in existing datasets to questions their creators never thought to ask; as Alma Swan put it,
...exciting new developments in text-mining and data-mining are beginning to show what can be done to create new, meaningful scientific information from existing, dispersed information using computer technologies. Research articles and accompanying data files can be searched, indexed and mined using semantic technologies to put together pieces of hitherto unrelated information that will further science and scholarship in ways that we have yet to begin imagining.
This is why I join Peter Murray-Rust in being against copyleft for data:
I am not in favour of copyleft for data. I have no fundamental objection to creating a copyrighted work from data as long as there is significant added value. And copyleft is viral - deliberately. If any item in a system/collection/program etc. is copyleft, then the whole is (at least by the algorithm). [...]
I would argue that if I get factual information from WP [wikipedia] then it cannot carry a copyleft. I need the fundamental physical constants and get them from WP. I don't think that my data and programs are thereby copyleft. All algorithms are now slightly fuzzy.
So what do we mean by "data"? What I mean is "facts about the world of sense-perception", as distinct from the presentation and interpretation of those facts. So I might not be free to reproduce, say, a scan of a Western blot from a published paper -- but having looked at that image, I had better be completely free to do whatever I like with the information it gives me about the way the world works, or else science will grind to a halt. Similarly, if a review article (which contains no new facts, and is all reuse and remix) brings together the results of a number of studies to create new information, or a new hypothesis, about the way the world works, I am not free to copy the wording but I must be free to go into my lab and test the hypothesis.


See also (this was a note to myself in the draft, so caveat lector!):

CC-NC considered harmful (Kuro5hin)
When is OA not OA? (Catriona MacCallum in PLoS Biology)
CC, OA and moral rights (Thinh Nguyen, Science Commons blog)
Open Data and Moral Rights (Peter Murray-Rust)


-----
In the interests of full disclosure, I have a personal statement for this blog which I hope places the content squarely in the public domain, and for my columns on 3QuarksDaily I use CC-BY so that, if those pieces generate any interest, 3QD might at least get some traffic out of having generously offered me a spot on their roster.



Saturday, 10 May
Data are difficult.

Scientific data are not only hard to come by, they're almost as hard to share, mainly because the scientific infrastructure is armpit-deep and sinking fast in the quicksand of patents, copyrights and ever-multiplying licenses. See Peter Murray-Rust, Antony Williams and Egon Willighagen for the latest dust-up over data licensing; I just want to point out this clear-eyed commentary by John Wilbanks:

The public domain is not an "unlicensed commons". The public domain does not equal the BSD. It is not a licensing option.

It is the natural legal state of data.

It is a damn shame that we no longer think of the public domain as an option that is attractive. It's a sign of the victory of the content holders that the free licensing movements work against that something without a license -- something that is truly free, not just free "as in" -- is somehow thought to be worse. We've bought into their games if we allow the public domain to be defined as the BSD. The idea of the public domain has been subjected to continuous erosion thanks to both the big content companies and our own movements, to the point where we think freedom only comes in a contract.

The public domain is not contractually constructed. It just is. It cannot be made more free, only less free. And if we start a culture of licensing and enclosing the public domain (stuff that is actually already free, like the human genome) in the name of "freedom" we're playing a dangerous game.

There's a lot more to get at here.

Yes, there is, and you should read the rest of that entry (and keep up with John's blog) if you're at all interested. I'll add just one brief comment: back when John's current job was first advertised, I considered applying for it -- not that I thought I was qualified, but perhaps the SC would want to hire the new director an offsider of some sort. Having had a couple of years to start learning a bit about Open Access and Open Science, I would venture to say that we are all better off with me in the cheerleading section instead of on the field.




Sunday, 13 April
Term dilution; or, that phrase, you keep using it...

As the terminology wars between "Free Software" and "Open Source Software" aficionados demonstrate, as soon as you stick a label on what you are doing, someone will come along and co-opt it. Sometimes, as with F/OSS, there are real disagreements to be had by reasonable people; at other times, well, not so much. This:

"Open science" is liberated from methodological naturalism (MN), even though it begins with an MN position. That is, all scientists start their work in pursuit of natural explanations for events or natural solutions for problems. If evidence and logic point to an end of the road for natural explanations, on rare occasions a scientist using open science would be willing to consider an explanation which does not force him to a naturalistic conclusion. For instance, the genetic code stored in the DNA molecule has no precedent in naturalism, since all codes are the product of a mind. Open science would allow possible supernatural causation as a topic for further research. The scientist would not be restricted to naturalism as the only explanatory option. But alas! Professional scientists do not practice open science. They practice "closed science."
has most emphatically nothing whatsoever to do with Open Science in the sense in which I -- and my friends, colleagues and allies in the nascent movement, see e.g. blogroll to right -- use the term.



Sunday, 13 April
reminder

Over at Free Genes, Jason Kelly has a nice reminder for those of us who tend to be disheartened by slow rates of progress in our chosen field, be it Open Science or, in Jason's case, synthetic biology. I liked it so much I'm stealing it. This:


firsttransistorgif.jpg

is a transistor, circa 1948. Now you can buy the equivalent of many millions of these for pocket change, in a device that will fit on your keychain.



Saturday, 12 April
Good question.

Egon has an interesting angle on Peter Murray-Rust's observation that you can't mine PubMed Central:

I was wondering about this section in the CC license of much of the PMC content, such as our paper on userscripts (section 4a of the CC-BY 2.0):

    You may not distribute, publicly display, publicly perform, or publicly digitally perform the Work with any technological measures that control access or use of the Work in a manner inconsistent with the terms of this License Agreement.
CC-BY 3.0 reads differently, but has similar aims. [...] Peter indicates that the NIH has put in place 'technological measures to control access' to the distribution of our work on userscripts (the PMC entry). That is in clear violation of the CC license. [...] What the PMC website should indicate, instead, is that text mining is allowed for the PMC OAI subset, but that they would highly prefer to use the PMC OAI or PMC FTP routes. This is the least they have to do.

No matter what, I still have the feeling that any technical obstacles are disallowed by the CC-license. Any legal expert here, that can explain me if the CC license allows controlling how people have access to my material?
In other words, can they do that? Like Egon, I await legal advice... how 'bout it, Creative Commons?



Monday, 07 April
Removal of permission barriers is already part of the definition of OA

Heather Morrison points to this excellent post by Glen Newton, wherein Glen proposes that Open Access should explicitly include machine readability:

Open Access must include access by machines:

* At minimum one must allow crawls of the site/content or (to reduce the impact of badly configured crawlers) create a compressed XML file containing all metadata and either content, or direct links to content and make it available for download (and if bandwidth is still an issue put it on a P2P network like BitTorrent).
* Preferable is to offer some kind of API (OTMI) or protocol (OAI-PMH) to get at content and metadata and citations.
* Better is to offer access to the XML of the articles in addition to the PDF and/or HTML; if the XML actually has some semantic content, then we are approaching the optimum.

The end goal is to support and encourage text mining and analysis of the full-text (preferably semantically rich XML), metadata and citations to allow literature-based exploration and discovery in support of the scientific research process.

Most importantly: hear, hear!
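To make Glen's "protocol (OAI-PMH)" option a little more concrete, here is a minimal sketch of what machine access through it looks like. The ListRecords verb, the oai_dc metadata prefix and the XML namespaces come from the OAI-PMH and Dublin Core specifications; the repository address, the helper names and the sample record are all invented, and no network request is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical repository endpoint, for illustration only.
BASE_URL = "https://repository.example.org/oai"

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Invented response, shaped like an OAI-PMH ListRecords reply.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An invented example article</dc:title>
          <dc:rights>CC-BY</dc:rights>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build the URL a harvester would fetch to list a repository's records."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


def harvest(xml_text):
    """Extract (identifier, title, rights) tuples from a ListRecords reply."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        title = record.findtext(".//dc:title", namespaces=NS)
        rights = record.findtext(".//dc:rights", namespaces=NS)
        rows.append((identifier, title, rights))
    return rows


print(list_records_url(BASE_URL))
print(harvest(SAMPLE_RESPONSE))
```

Note that this only reaches metadata. Glen's stronger options, full-text XML with real semantic content, need the repository to expose the articles themselves, and the rights field is where a harvester would have to look to find out what it is permitted to do with them.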

I do, however, have a nitpick to make. Heather makes no comment on Glen's idea that this is an addition to the definition of OA, but in fact I think it's already built into the accepted BBB definition. Peter Suber refers to the removal of price and permission barriers, to distinguish Open from "merely" free access, which removes only price barriers; I've quoted him on this before, so here he is again:

The best-known part of the BBB definition is that OA content must be free of charge for all users with an internet connection. However, the BBB definition doesn't stop at free online access. It adds an extra dimension that isn't as easy to describe, and consequently is often dropped or obscured. This extra dimension gives users permission for all legitimate scholarly uses. It removes what I've called permission barriers, as opposed to price barriers. The Budapest statement puts the extra dimension this way:
By "open access" to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.
The Bethesda and Berlin statements put it this way: For a work to be OA, the copyright holder must consent in advance to let users "copy, use, distribute, transmit and display the work publicly and to make and distribute derivative works, in any digital medium for any responsible purpose, subject to proper attribution of authorship".

All three tributaries of the mainstream BBB definition agree that OA removes both price and permission barriers. Free online access isn't enough. "Fair use" ("fair dealing" in the UK) isn't enough.
Having said all that, though, I'll add that an explicit description of machine readability requirements would be an addition to the accepted definition of OA -- and one that I would welcome. Peter Murray-Rust recently noted that, according to the "price and permission barriers" view of Open Access, PubMed isn't OA -- even PubMed Central isn't OA.

I'll go even further: can anyone point me to a single Open Access repository? I don't know of even one such site that removes both price and permission barriers. Surely there must be some, but the Big Names (PubMed Central, arXiv, Cogprints, CiteSeer, RePEc, etc -- see ROAR) don't seem to qualify, because digital objects in these repositories carry their own copyrights, rather than being covered by a blanket license provided by the repository.

Can this be true? Five years after the BBB definition came together, more than ten years since Stevan Harnad's subversive proposal and on the first day of the NIH mandate -- widely referred to as an OA mandate! -- can it be that we really don't have a single truly OA repository in all the world? And if it is true, would it help to make the official definition more explicitly machine-friendly?




Wednesday, 06 February
Open Science Conference proposal

I'm probably too late with this to do any good, but Shirley Wu is putting together a proposal for an Open Science session at the Pacific Symposium on Biocomputing. You can read a draft of the proposal which already reads pretty well to me, and Shirley could do with letters of support:

One thing that would really help outside of the proposal itself is to have actual letters of support. That way the organizers will know there is serious interest and commitment for a session on Open Science - it's a gamble for them, frankly, but much less of one if there is a good crowd on board.

So if you would like to support this proposal and are willing to commit to participating should it get accepted, please send me an email to that effect (with as many details of your anticipated participation as you can provide at this time), and I will include all the emails as "supplementary material" next Friday.
Er, yes, that's this coming Friday... I did mention I was late with this, no?

So anyway, if you can come up with an idea for a presentation or can simply commit to attending, please drop Shirley a line. She's another graduate student who's caught the Open Science bug, and the more of them we have -- and the more we can do to help and encourage them -- the better.




Saturday, 12 January
Mitch Waldrop on Science 2.0

I'm way behind on this, but anyway: a while back, writer Mitch Waldrop interviewed me and a whole bunch of other people interested in (what I usually call) Open Science, for an upcoming article in Scientific American.  A draft of the article is now available for reading, but even better -- in a wholly subject matter appropriate twist, it's also available for input from readers.  Quoth Mitch:

Welcome to a Scientific American experiment in "networked journalism," in which readers -- you --get to collaborate with the author to give a story its final form.

The article, below, is a particularly apt candidate for such an experiment: it's my feature story on "Science 2.0," which describes how researchers are beginning to harness wikis, blogs and other Web 2.0 technologies as a potentially transformative way of doing science. The draft article appears here, several months in advance of its print publication, and we are inviting you to comment on it. Your inputs will influence the article's content, reporting, perhaps even its point of view.

So consider yourself invited. Please share your thoughts about the promise and peril of Science 2.0. -- just post your inputs in the Comment section below.

It's good to see Science 2.0 getting not just mainstream attention, but well-crafted and balanced mainstream attention.  It's also good to see a "Journalism 2.0" approach being tested, so if you have ideas or opinions, go participate.

On a personal note, I'm pleased but a little embarrassed to have been quoted by name in an article for which I know Mitch interviewed a lot of people who are actually *doing* Science 2.0, not just cheering from the sidelines like me.  It's hard to be critical of choices made in the face of space constraints (the article is destined for print), but there's no such limit online.  I wonder whether Mitch and his SciAm editors would consider putting a longer version online? 

In a similar vein, in comments here Bora asks whether we (John's "usual suspects") couldn't put together a longer article for publication somewhere.  I think I might have a better idea (though it's hardly original with me).  From my point of view, the best thing about my 3Quarks Open Science articles from about a year ago is that they are already wildly out of date.  The -- to me -- obvious way to update them and keep them up-to-date is to turn them into a wiki (probably starting from the Nodalpoint wiki's Open Science page).  I think the articles cover most of the main bases, and each section could relatively easily be turned into a wiki page; with a little attention to style, it should then be fairly easy to re-write the articles from the updated information.  I am, as usual, swamped with work, so I won't be able to wiki-ize anything any time soon -- I do intend to get to it eventually, but in the meantime the articles themselves are all CC-BY and my Simpy bookmarks, which should help with updating, are pub dom and I'd be happy to help if anyone else wanted to take a stab at it. 

Finally, if you enjoyed the SciAm article, you might also enjoy more of Mitch's writing: he has a blog, a new gig at Nature and has written three books to date: The Dream Machine (2001), Complexity (1992) and Man-Made Minds (1987). (I swiped his affiliate links, I hope they still work.)



Sunday, 06 January
Another clarification -- actually a correction.

Being careful with the language of the letter below made me see that, in earlier entries, I've fallen into one of the easy traps in which OA opponents would like to catch everyone:

...of these, 16 are listed as "grey" (won't allow archiving), 23 are "green" (allow refereed postprint archiving -- NIH mandate compliant) and 7 "pale green" (allow preprint archiving; many "pale green" publishers actually allow postprint archiving and are NIH compliant...

...at least 50% of PSP members are already complying with the NIH mandate, and a further 15% at least allow preprint archiving and may even be NIH-compliant.

The majority of journals for which information is readily available are already compliant with the new NIH mandate...

This phrasing is deeply misleading: it's not the journals or the publishers who must comply with the new NIH (or any other) Open Access mandate!

Publishers can choose to allow their authors to self-archive, or not. They are under no compulsion whatsoever. It's the authors -- who have taken public funding, and so are working for the public -- who must comply with the mandate to give the public full value for its money.

There is no such thing as an NIH-compliant, or non-compliant, journal or publisher. That's a phrase that comes readily to hand, a convenient shorthand perhaps, but we should not use it. The mandate simply does not concern itself with the actions of publishers. Beware the rhetorical frame in which the new law is cast as "the government telling publishers how to run their business"!

The obvious replacement phrase, when talking about journals or publishers and their policies, is "mandate-compatible", so I'll be careful to use that from now on.



Saturday, 05 January
They get letters. Maybe.

Peter Suber points out that no members of the AAP/PSP's ill-conceived PRISM "coalition" were ever identified, and that at least nine publishers publicly disavowed or distanced themselves from it; he then asks:

Has AAP/PSP ever consulted its members about its position on the NIH policy? Are AAP/PSP members willing to see their dues spent on a lawsuit to delay it?

I think it's worth finding that out.

Listed at the bottom of this entry are the "green" and "pale green" EPrints/RoMEO publishers listed as members by the PSP (links and names taken directly from the PSP website). On closer inspection, it seems that RoMEO proper lists all of the "pale green" publishers as yellow, and (with one or two caveats concerning journals with long embargo periods) gives them all a "compliant" rating in respect of NIH policy.

Here is a draft of the letter I have in mind to send to each of these publishers:

Dear [Publisher],

the Association of American Publishers' Professional and Scholarly Publishing division (AAP/PSP), which lists [your company] as a member [1], recently issued a press release [2] in response to the new NIH mandate [3] for Open Access to publicly funded research. The press release was highly critical and contained a number of mistaken and misleading assertions; for details, you can read a public, point-by-point rebuttal [4] by Prof Peter Suber, open access project director at Public Knowledge [5] and a senior researcher with the Scholarly Publishing and Academic Resources Coalition [6]. I'm sure you remember PRISM, the AAP/PSP's ill-considered campaign against Open Access [since your company publicly distanced itself from same]; this latest press release is similar in tone and apparent intent.

In stark contrast to the AAP/PSP's public stance, [your company] is listed by Project RoMEO [7] as a [yellow/green] publisher. This means that [your company] policy regarding self-archiving of journal articles was fully in line with the new law even before it became law, and there is absolutely no conflict between your business model and the NIH mandate. In fact, of the 46 PSP member companies indexed by Project RoMEO, 30 have no policy that conflicts with the new law; and of the approximately 6000 journals published by those 46 companies, around 5700 already allow their authors to comply with the NIH mandate.

I write, therefore, to ask: does the AAP/PSP accurately represent its members in its opposition to the NIH mandate? Was [your company], as a member of the Association, consulted before the AAP/PSP response was made public? Finally, if [your company] is not in agreement with the AAP/PSP on this matter, would you consider making a public statement to that effect [in the same way you did regarding PRISM]?

sincerely,

Me.

[1] http://www.pspcentral.org/index.cfm?left=member_companies&page=/home/member_companies.cfm
[2] http://www.pspcentral.org/publications/AAP_press_release_NIH_mandatory_policy.pdf
[3] http://thomas.loc.gov/cgi-bin/query/z?c110:H.R.2764:
[4] http://www.earlham.edu/~peters/fos/2008/01/aappsp-response-to-oa-mandate-at-nih.html
[5] http://www.publicknowledge.org/
[6] http://www.arl.org/sparc/
[7] http://www.sherpa.ac.uk/romeo.php

The most obvious thing missing from the draft is "who the hell am I, to be asking you this?" Now, I can send the letter as myself -- concerned citizen, professional research scientist, potential client of publishers -- but I am only an egg, and it would have a good deal more impact as an open letter from a variety of interested and concerned parties, and still more if it came from somewhere official (ARL, SPARC, I don't really know who would be appropriate here).

So -- anyone up for a multi-author open letter? Any other ideas?

Update 080310: decided not to send letters after all; see here, scroll to bottom of post.




The publishers in question:

Pale green:

Green:



Saturday, 05 January
Quick clarification

The publisher list I've been using in the last few posts actually comes from EPrints.org, using information from SHERPA/RoMEO. I'll refer to the EPrints interface as EPrints/RoMEO from now on.

This wouldn't cause any confusion and I wouldn't bother to point it out, except that RoMEO actually uses a four-colour scheme (green, blue, yellow, white) which EPrints has squished into three (green, pale green, grey).

Update: see Stevan Harnad's comment on the next entry.



Friday, 04 January
Does the AAP/PSP really represent its members?

Via Peter Suber, Dorothea Salo and Heather Morrison, I see that the AAP/PSP has responded to the new NIH mandate in typical, PRISM-esque fashion. For anything I might have said in response, and much more, read the linked entries -- especially Peter Suber's. I have something else in mind.

The PSP lists its members here ; it didn't take long to compare that list with the list of publishers indexed by SHERPA/RoMEO. Of the 355 publishers in the RoMEO database, 46 are members of PSP; of these, 16 are listed as "grey" (won't allow archiving), 23 are "green" (allow refereed postprint archiving -- NIH mandate compliant) and 7 "pale green" (allow preprint archiving; many "pale green" publishers actually allow postprint archiving and are NIH compliant, but are not listed as green because of various restrictions).

It's not possible to do what I wanted here -- which was to answer the title question. The problem is that the PSP lists about 100 members (102 by my first count) that aren't indexed by RoMEO. I found that somewhat surprising, particularly since the list includes names I'd have expected to find in RoMEO: FASEB, Stanford U Press, Yale U Press, Cold Spring Harbor Lab Press, NEJM, HighWire Press and others.

Nonetheless, we can say that if the RoMEO-indexed sample (46 of 148, 31%) is representative, then at least 50% of PSP members are already complying with the NIH mandate, and a further 15% at least allow preprint archiving and may even be NIH-compliant.

It's even more unbalanced if we compare the numbers of journals published by each company. Those 46 publishers account for 5901 journals; the grey publishers put out 222 (4%), the green publishers 4243 (72%) and the pale green publishers 1436 (24%).
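For anyone who wants to check the arithmetic, here is a short sketch that recomputes the percentages from the counts quoted in this entry. The dictionary names are mine; the numbers are the ones given above (16/23/7 publishers, 222/4243/1436 journals, and a sample of 46 out of 148 PSP members).

```python
# Counts by EPrints/RoMEO colour for the 46 PSP members found in RoMEO.
publishers = {"grey": 16, "green": 23, "pale green": 7}
journals = {"grey": 222, "green": 4243, "pale green": 1436}


def percent_shares(counts):
    """Return each category's share of the total, rounded to whole percent."""
    total = sum(counts.values())
    return {name: round(100 * n / total) for name, n in counts.items()}


publisher_shares = percent_shares(publishers)
journal_shares = percent_shares(journals)
sample_fraction = round(100 * 46 / 148)  # share of PSP members indexed by RoMEO

print(publisher_shares)
print(journal_shares)
print(sample_fraction)
```

The rounded journal shares come out at 4%, 72% and 24%, matching the figures in the paragraph above, and the publisher shares give the 50% green and 15% pale green used in the previous paragraph.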

If the PSP were honest and interested in fairly representing its members, I'd think they would find out (and make public) whether the remaining, non-RoMEO indexed members follow the same pattern. I won't hold my breath.

____
Full disclosure: the numbers above are not 100% accurate, since the comparison between the two lists was not always straightforward. For instance, RoMEO indexes "Yale Law School" and the PSP lists "Yale University Press" as a member. I tried to err on the side of the PSP -- for instance, Yale Law is grey, so I included them. There were a few such problematic instances; I very much doubt that they made any difference to the data expressed as percentages, I'd welcome correction and a better dataset, and if anybody wants the Excel files I used I'll be happy to provide them.
Update: see strikethroughs above; some of the overlap issues can be resolved by searching more carefully -- for instance, NEJM is published by Massachusetts Medical Society, which is in RoMEO, and I have no idea how I missed FASEB the first time around. But again, little or no change to the percentages.



Wednesday, 02 January
Public Domain Day

Via Dorothea Salo and Peter Suber, John Mark Ockerbloom reminds me that New Year's Day is also Public Domain Day -- the day on which, each year, a new batch of works enters the public domain:

In countries that use the "life plus 50 years" minimum standard of the Berne Convention, works by authors who died in 1957 enter the public domain today. That includes writers, artists, and composers like Nikos Kazantzakis, Diego Rivera, Dorothy L. Sayers, Jean Sibelius, and Laura Ingalls Wilder.

In countries that use the "life plus 70 years" term, works by authors who died in 1937 enter the public domain, including works by J. M. Barrie, Jean de Brunhoff, H. P. Lovecraft, Maurice Ravel, and Edith Wharton. [...]

In countries like the US and Australia, which are under 20-year freezes of all or most of the public domain, it's not quite as momentous a day. Here in the US, like Bill Murray in Groundhog Day, we're once again waking up to a public domain 1922, as we have since 1998. Our next mass expiration of copyrighted published material is scheduled for New Year's Day 2019, 11 years from now. [...]

Let's not just ask what the public domain can do for us; let's ask what we can do for the public domain. In particular, as of this year more than 14 years have passed since the Web started to explode into public consciousness, with NCSA's release of the Mosaic web browser in 1993. Many of us older Net users started creating web sites that year. And 14 years was the original term of copyright specified in the UK's Statute of Anne, and the US's first copyright law (with an optional renewal term).

As an advocate of more reasonable copyright terms, like those envisioned by our country's founders, I am therefore today dedicating the copyrights of all 1993 versions of my web sites into the public domain. These sites include The Online Books Page, which is still in operation, and Catholic Resources on the Net, which I stopped maintaining in 1999.

Many thanks to John Mark for the informative post, and also for his gift to the public domain. Like Dorothea, I have long since tried to make it clear that I consider my weblog to belong to the public domain. (Do read Dorothea's explanation.) As you can see from comments on my entry, though, an informal statement is suboptimal because people still have questions, and are not confident simply taking whatever they want from the site (as I intend that they should be). It turns out that it's not easy to put something into the public domain without waiting out the requisite copyright term -- it means giving something away for free, and the law is leery of that. So you need meatspace signatures and whatnot, and the Creative Commons Public Domain Dedication is not really much use, even within the USA. I've thought about ditching my homebrew dedication for a CC-BY license, but I don't actually want to place that restriction on the use of anything I post here. Fortunately, CC is on the ball and will soon offer CCZero, which I hope will turn out to be an effective way to dedicate something to the public domain, formally and officially and in a widely recognized and accepted manner. Once I have an option that puts the weight of Creative Commons behind the dedication I want, I'll switch to that. For now, just trust me -- take whatever you want from this site (so long as I made it, of course) and do with it as you please. I'd love to hear back about anything you do with something you found here, but you're under no obligation to inform me.



Thursday, 27 December
A new beginning; here's why.

Rich Apodaca asks whether the new NIH OA mandate marks a new beginning, or more of the same. His argument hinges on the (admittedly unfortunate) phrase "in a manner consistent with copyright law", and he concludes that

Neither HR 2764 nor any form of government intervention will bring widespread Open Access into being.
Here's why I think Rich is wrong.

Point the first: Rich claims that

Most of the journals in question will be hostile to the idea of having their copyrighted material deposited into PubMed Central and so understandably won't allow it to be done by the authors of papers or anyone else.
The available data do not support this. Of the 355 publishers indexed by SHERPA/RoMEO, 66% formally allow self-archiving; more importantly, 56% formally allow archiving after refereeing. (There's a big gap between "formally allow" and "formally forbid", too.) The numbers are even more OA-positive at the journal level. Those publishers between them account for 10199 journals, of which 91% are at least "pale green" -- that is, allow at least preprint archiving. Well over 6000 journals, 62% of the total, are "green" -- that is, allow self-archiving of refereed postprints. You can use the web interface to find out whether your favorite journal or publisher will allow you to self-archive; here's a quick look at the big names (> 50 journals) and a few usual suspects (sorry about the jpg, I can't make html tables to save myself):


romeo.JPG


Point the second: Rich goes on to give the following hypothetical:
Professor Gross at California University gets his manuscript approved for publication in the Journal of Nanoscale Devices (JND). Professor Gross is fully aware both of HR 2764 and JND's refusal to deposit manuscripts into PubMed Central - the reasons why Professor Gross would choose JND anyway are interesting, but not relevant here. Along with the acceptance letter, JND requests prompt return of a signed copyright transfer agreement. Professor Gross sends in the signed form and from that point on, all rights to his article belong to JND. As is their policy, JND refuses Professor Gross permission to deposit a copy of his paper into PubMed Central within 12 months after publication.

Unless I'm missing something, neither Professor Gross nor JND have violated any laws.

Does Professor Gross have to publish in JND? Pace Rich, the good Professor's reasons are relevant. Let's take a look at those publication-related sins through an OA lens:

  • Greed -- the OA advantage should drive the greedy to reject journals like JND which deny them the opportunity fully to profit from their own work
  • Envy -- if you want your publication record to be all it can be, publish OA (either by choosing OA journals, or by self archiving)
  • Pride -- if you want your science to have maximal impact, ditto
  • Wrath -- STM publishing is big business with big fat profit margins; as consumers and producers, let's at least get value for money (i.e., OA) and put the hurt on greedy publishers who won't at least allow us to make our own work OA
  • Gluttony, Lust -- see Greed, Envy, Pride
  • Sloth -- for just a few keystrokes, you can increase your research impact and professional standing; why would you not?

Given all that, will the good Professor continue to kowtow before the little godlings who publish JND? Or will he simply find himself a journal that will play ball?

Point the third: Rich continues:

The assumption made by proponents of the new law seems to be that to implement the new policy, the Director of NIH will forbid publication by grant recipients in journals that don't allow deposition of articles into PubMed Central.

How many influential scientist do you know of who would tolerate the government telling them which journals they can and can't publish in? The minute such a misguided policy is put in place, the national scientific outcry would more than overwhelm anything Open Access proponents could muster.

How many? All of them. When a funder says "jump", even "influential" scientists say "Was that high enough? Shall I try again?". (Besides which, this is not "the government telling them" anything, this is a funding body making a reasonable demand.) Where scientists do have some weight to throw around is with publishers: the NIH can always get another benchmonkey, but publishers need a steady supply of authors. So if I want to publish in the Journal of Dodgy Results, which won't allow repository archiving, and the NIH says "not if you take our money -- not until they comply with the mandate", I can: look for other funding (believe me, there ain't a lot); fight authority (see Mellencamp, J.C., 1983); or I can try to get the editors of JDR to let me put a copy in PubMed Central after 12 months. Identifying the path of least resistance is left as an exercise for the reader.

Here again, the data (though scanty) are on my side. A 2005 survey of nearly 1300 authors found 81% of respondents reporting that they would willingly comply with a green OA mandate; a further 13% replied that they would comply unwillingly, and 5% claimed they would not comply. Not only is 94% a great deal better than the roughly 4% compliance observed while the NIH policy was voluntary, but I've got five bucks right here that says those 5% are full of it. If push comes to shove, they won't be handing back any grants or handing in any letters of resignation. Most of them, confronted with the evidence, will do what scientists are supposed to do in such cases: say "oh, I was wrong", and change their views and behaviour. The few who don't do that will still comply, they'll just yell at a couple of editors to make themselves feel all tough again.

(Stevan Harnad and Alma Swan have both reported that Arthur Sale's ongoing study of institutional repositories in Australia corroborates these figures, showing that authors comply in much the same way that they claimed they would in the survey. What I've seen of Sale's data is certainly consistent with that notion... but more on that later perhaps.)

So, to recap:

1. The majority of journals for which information is readily available are already compliant with the new NIH mandate; I see no reason to assume that any significant proportion of the remainder will be hostile to the policy.

2. I disagree that the NIH will not be able to enforce the policy; faced with the evidence that OA is a good idea and the fait accompli of an NIH mandate, researchers will comply and journals will have to follow suit. To believe otherwise is, I think, to give the publishing industry too much credit for being able to cow their authors.

3. Voluntary reposit policies simply don't work; we have evidence to suggest that mandates will, and already do. (An aside: the new NIH policy joins 20 funder mandates, 11 institutional mandates, 3 departmental mandates, 5 proposed funder mandates, 1 proposed institutional mandate and 2 proposed multi-institutional mandates. Most of those include growth data in their ROARMAP entries. Why don't we have more data on the effects of mandates?)

Happily, I can finish up on a note of agreement with Rich, who says:

The only things that will change the status quo are: (1) the availability of tools for making it happen; and (2) the realization by individual investigators that continuing to give away their hard-earned copyright makes them far less competitive than their peers who don't.

Open Access proponents should forget about getting the Federal Government to fix the mess that modern scientific publication has become. Instead, they should focus on making Open Access-like options more attractive to scientists.

I've outlined my disagreements above, now let me agree with the more important points here:

1. It is vitally important that tools for OA (and Open Science) be built -- tools that researchers will want to use; to see a graphic illustration of this, listen to the forlorn cry of the repository-rat

2. OA provides a host of benefits, not least the boost to individual impact and standing; the clearer this becomes, the closer we get to 100% OA

3. Modern scientific publishing is a mess, and needs fixing. Making OA more attractive to the benchmonkeys is going to be an indispensable part of that fix (see also #1).

P.S. still on hiatus... sorta. Still haven't put that ms together so posting will remain infrequent at best.

P.P.S. see also Peter Murray-Rust's response to Rich's entry.



Sunday, 02 December
If it won't sink in, maybe we can pound it in...

Another brief un-hiatus, this one sparked by a question asked by Dave Munger at BPR3:

If you know of a peer-reviewed, open-access journal that does not charge a publication fee, let us know about it in the comments.
Practically every time I talk about OA, online or in meatspace, I hear "I'd like to support OA but I can't afford it, don't all those journals charge, like, $2500 per article?"

No. They don't.

Everyone seems to be thinking of PLoS, never mind that they waive their fees at the drop of a hat; the assumption that most OA journals charge (high) author-side fees is both widespread and completely wrong.

In fact, more than 2/3 of the journals listed in the Directory of Open Access Journals (DOAJ) and more than 80% of OA journals published by scholarly societies charge no author-side fees at all; in contrast, more than 75% of the 247 non-DOAJ journals in a 2005 survey do charge author-side fees (page charges, colour charges, reprint charges, etc) in addition to subscription charges.

Let's unpack those numbers a little (especially since I generated the first one myself, and so you should take a look at how I did that).

In October 2005, the Kaufman-Wills group published a commissioned survey of journal publishing practices, The Facts about Open Access. The study was initially designed to include only full OA journals (listed in the DOAJ, OA immediately upon publication) and delayed-OA ("embargo") journals from the HighWire Press stable, but was expanded to include the full range of financial models by inclusion of journals published by the Association of Learned and Professional Society Publishers (ALPSP) and the Association of American Medical Colleges (AAMC). The final report included responses from 248 DOAJ, 85 HighWire, 34 AAMC and 128 ALPSP journals and showed that:

52.8% of DOAJ journals charge no author-side fees at all. The percentage for subscription journals was much lower: ALPSP journals overall (23.4), ALPSP for-profit journals (44.9), ALPSP non-profit journals (10.1), AAMC journals (14.7), Highwire subset (17.6)
These are the figures that Kaufman and Wills summarize as "...more than half of DOAJ journals did not charge author-side fees of any type, whereas more than 75% of ALPSP, AAMC, and HW subset journals did charge author-side fees."

So -- not only do the majority of OA journals charge nothing on the author side, an even larger majority of non-OA journals do charge author-side fees. If the sample is representative, you're less likely to have to pay to publish if you choose an OA journal than if you don't.

When I first heard these numbers I thought, as Peter Suber did, that they should "recast the debate" around OA. In January 2006 Peter's regular yearly predictions included this forecast:

It will start to sink in that fewer than half of OA journals charge author-side fees and that many more subscription-based journals do so than OA journals.... People will stop talking about "the OA business model" for journals as if there were just one. People will talk less about how OA journals might exclude indigent authors and compromise on peer review and talk more about how toll-access journals do so. We'll start to document the range of models actually in use for OA journals... We'll get more creative in finding models that suit the range of niches...
He has since called this "the worst prediction I've ever made". I confess myself at something of a loss as to why the Kaufman-Wills study has not come to dominate and reconfigure the OA debate; I can only guess that profit-hungry lowlifes have successfully sidestepped it. In this year's predictions, Peter expects more of the same:
Because both Hindawi and Medknow have both been profitable for more than year, you'd think that the fact of their success would start to sink in, with corresponding effects on attitudes toward the sustainability of OA journals and interest in their business models. But well-documented truths about OA tend to sink in very, very slowly, because they have to compete with myths, misinformation, and misunderstanding. With regret, I predict more of the same.

In 2005 the Kaufman-Wills Group discovered that the majority of OA journals charged no publication fees at all. In 2006 I predicted that that fact would start to sink in. I was dead wrong. The fact still hasn't sunk in, and I've learned my lesson.

Caroline Sutton and I discovered last month that the OA journals published by learned societies follow same pattern as OA journals overall: most of them charge no publication fees. But while 52.8% of OA journals overall use no-fee business models (from Kaufman-Wills, 2005), we found that 83% of society OA journals use no-fee business models, a significantly greater fraction. However, I'm not predicting that this fact will sink in any time soon. Likewise, we found 425 societies publishing 450 OA journals, a much larger number than the societies known to oppose OA policies. But neither am I predicting that this fact will sink in any time soon. We'll continue to hear the unargued claim that society publishers are intrinsically vulnerable to OA and predominantly opposed to it.

The Kaufman-Wills study is not the only one of its kind, either. As discussed in the quote above, just last month Peter Suber and Caroline Sutton of Co-Action Publishing released preliminary findings from their ongoing study of OA journals published by scholarly societies. They identified 468 societies which publish, between them, 450 full OA journals and 73 hybrid ("pay-for-OA") journals. Of the full OA journals, only 75 charge author-side fees -- meaning that more than 80% of society OA journals do not charge such fees.
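For anyone who wants to check that "more than 80%" figure, here is the arithmetic as a tiny Python sketch. The two counts are the ones quoted above; the variable names are mine, purely for illustration.

```python
# Counts as reported in the Suber/Sutton preliminary findings quoted above.
full_oa_journals = 450   # full OA journals published by societies
charging_fees = 75       # of those, the ones charging author-side fees

# Share of society full-OA journals that charge no author-side fee.
no_fee_share = 100 * (full_oa_journals - charging_fees) / full_oa_journals

# Kaufman-Wills (2005) figure for OA journals overall, for comparison.
kaufman_wills_no_fee = 52.8

print(f"society OA journals with no fee: {no_fee_share:.1f}%")
print(f"difference from Kaufman-Wills: {no_fee_share - kaufman_wills_no_fee:.1f} points")
```

That works out to roughly 83%, which matches the figure Peter gives in the quoted prediction.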

Finally, there's me. All of the above got me to wondering what proportion of journals in the entire DOAJ database charge author-side fees (since Suber and Sutton showed that when the dataset was expanded, at least among society publishers, the no-fee percentage went up considerably).

Fortunately, the DOAJ now includes a metadata field indicating whether or not a particular journal charges author-side publication fees. Unfortunately, that field is not included in the downloadable comma-delimited metadata file they make available. Fortunately, it's not a whole lot of work to make a replacement file by copy-and-pasting from the "browse by title" page. Unfortunately, you have to do this from the new "for authors" section, because the front-page browsing interface doesn't include the "fee/no fee" field. What's unfortunate about that, for my purposes (though it's a wonderful thing overall), is that the "for authors" browse does include hybrid journals, in which regular articles are subscription-only but authors can pay extra to have their work made OA. (In fact, even the logo at the top is different; on the front page you are seeing the Directory of Open Access Journals, but in the "for authors" section it becomes the Directory of Open Access and Hybrid Journals.) The front page says 2971 journals are indexed, but if you browse by title from the "for authors" page, the totals add up to 4638, the database having apparently added 1667 hybrid journals.

There's probably a smarter way to do this using the OAI-PMH, but that syntax is as impenetrable to me as Ancient High Martian, so I simply pasted the browse-by-title pages into a text document and imported that (colon-delimited) into Excel. Then I filtered on "publication fees", sorted by yes/no/missing and read off the totals from the row numbers. Horrible hack, but it worked.
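If you'd rather script the tally than read totals off spreadsheet row numbers, the same filter-and-count can be sketched in a few lines of Python. This is a sketch under assumptions, not a description of the DOAJ's actual page layout: I'm pretending each pasted entry is a run of "Field: value" lines separated by a blank line, with a "Publication fee" field that says yes or no. The field name, the sample entries and the function name are all hypothetical.

```python
from collections import Counter

# Hypothetical sample of pasted browse-by-title text. The real DOAJ page
# layout may differ; this only illustrates the colon-delimited idea.
SAMPLE = """\
Title: Journal of Imaginary Results
Publication fee: No

Title: Annals of Made-Up Science
Publication fee: Yes

Title: Hypothetical Letters
"""

def tally_fees(text, field="publication fee"):
    """Count yes/no/missing values of `field` across blank-line-separated records."""
    counts = Counter()
    for record in text.strip().split("\n\n"):
        value = "missing"
        for line in record.splitlines():
            # Split on the first colon only, so titles containing colons survive.
            key, sep, rest = line.partition(":")
            if sep and key.strip().lower() == field:
                answer = rest.strip().lower()
                # Anything that is not a clear yes/no is treated as missing.
                value = answer if answer in ("yes", "no") else "missing"
        counts[value] += 1
    return counts

if __name__ == "__main__":
    for answer, n in tally_fees(SAMPLE).items():
        print(answer, n)
```

Counting records rather than sorted spreadsheet rows also makes it harder to be off by one when a header row sneaks in.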

Including hybrid journals, we get:

charge publication fees: 2185 (47%)
don't charge pub fees: 1998 (43%)
fee information missing: 455 (10%)
total no. of journals: 4638

Given the DOAJ definition of hybrid journal, those should obviously be excluded and the data reworked. This is where a smart person would have stopped and waited for the DOAJ to autogenerate the numbers, but I went ahead and deleted the hybrid entries by hand. (Shut up. I wanted to know, OK?) That yields:
charge publication fees: 534 (18%)
don't charge pub fees: 1980 (67%)
fee information missing: 453 (15%)
total no. of journals: 2967

The second total should presumably be 2971 and it would make sense for the "missing" totals to be the same in both sets, so either there are some errors in the database or I made a couple myself. In either case the errors appear small and make no difference to the percentages, and anyway did I mention this kept me up until 4 am? Actually I suspect some inconsistencies in the database, because the front-page total does not update as quickly as the actual entries, and because there are in fact hybrid journals which don't charge fees (e.g. Emerald Engineering's model).
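Here is a quick Python check of the two tallies above, including the claim that the stray four journals make no difference to the rounded percentages. The counts are the ones I read off the spreadsheet; the function and variable names are just illustrative, and the "worst case" test (dumping the whole shortfall into one category at a time) is my own assumption about how to bound the error.

```python
def shares(counts):
    """Return (total, {label: percentage rounded to a whole number})."""
    total = sum(counts.values())
    return total, {label: round(100 * n / total) for label, n in counts.items()}

# Tallies from the two lists above.
with_hybrids = {"fee": 2185, "no fee": 1998, "missing": 455}
full_oa_only = {"fee": 534, "no fee": 1980, "missing": 453}

total_all, pct_all = shares(with_hybrids)
total_oa, pct_oa = shares(full_oa_only)
print(total_all, pct_all)
print(total_oa, pct_oa)

# The DOAJ front page says 2971; the hand-pruned list comes up short.
FRONT_PAGE_TOTAL = 2971
gap = FRONT_PAGE_TOTAL - total_oa
print("journals unaccounted for:", gap)

# Worst case: assign every unaccounted journal to a single category
# and see whether any rounded percentage moves.
stable = True
for label in full_oa_only:
    adjusted = dict(full_oa_only)
    adjusted[label] += gap
    _, pct_adjusted = shares(adjusted)
    if pct_adjusted != pct_oa:
        stable = False
print("rounded percentages unchanged in every case:", stable)
```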

So now we have three studies (OK, two studies and one ungainly hack) showing that a (strong) majority of OA journals do not charge author-side fees, and one of those studies further showing that a strong majority of non-full-OA journals do in fact charge author-side fees in addition to subscription fees.

Now, can we please put to rest the myth/FUD/whatever that there is only one OA model, the author-side fees/PLoS model? While we're at it, let's have a few more closely related ideas go the way of the dodo: that OA journals discriminate against indigent authors (because they charge publication fees -- except that most of them don't); that OA journals will compromise on quality (in order to collect payment for manuscripts -- except that most of them don't); that if most journals went OA, universities would have to pay more in author-side fees (which, remember, most OA journals don't, but most non-OA journals do, charge) than they do now in subscription fees.

I swiped that list of candidates for memetic extinction from Peter Suber, and you should go read his full discussion, which offers a lot more detail, especially on that last point. Me, I'm going to take a nap and go back to my blog hiatus. But the next time you hear someone talk about the "cost" of publishing in OA journals, please point 'em here.




Thursday, 22 November
brief hiatus in my hiatus

I'm not ending my blogging break, but I simply couldn't let this from Cameron Neylon pass by without comment:

The UK Engineering and Physical Sciences Research Council currently has a call out for proposals to fund 'Network Activities' in e-science. This seems like an opportunity to both publicise and support the 'Open Science' agenda so I am proposing to write a proposal to ask for ~£150-200k to fund workshops, meetings, and visits between different people and groups. The money could fund people to come to meetings (including from outside the UK and Europe) but could not be used to directly support research activities. The rationale for the proposal would be as follows.

  • 'Open Science' has the potential to radically increase the efficiency and effectiveness of research world wide.
  • The community is disparate and dispersed with many groups working on different approaches that do not currently interoperate - agreeing some interchange or tagging standards may enable significant progress
  • Many of those driving the agenda are early career scientists including graduate students and postdocs who do not have independent travel funds and whose PI may not have resources to support attending meetings where this agenda is being developed
  • There is significant interest from academics, some publishers, software and tool developers, and research funders in making more data freely available but limited concensus on how to take this forward and thus far an insufficient committment of resources to make this possible in practice
This is a terrific opportunity to move Open Science forward; as Cameron points out, existing efforts are scattered and perhaps the most important thing right now is to make connections among the community. The whole idea is that a community approach will be vastly more efficient than the existing hypercompetitive model! This funding could move Open Science into the big time by driving the creation and adoption of working standards, possibly even a BBB-style declaration, and by creating a seed network of cooperative scientists out of which mainstream Open Science could emerge.

Cameron writes, in a followup:

I've made a start with an outline on a GoogleDoc which can be viewed here. I have tried to set out some general headings and areas to be fleshed out and added a little text. This is early days but if anyone wishes to add anything then please feel free. I have given editing rights to all those people who have comments on the original post (as of around 9:30 pm GMT on Thursday 22 November) so they should now have editing rights. I have set the document so that those people with invitations can cascade them to others (I hope). I will continue to issue invitations to anyone who comments on the original post. No need to feel obliged to add anything  - I'm not asking you to write the grant for me - but if you feel so inclined then the assistance will be very welcome.

What I will request is from those who are interested is a short letter stating your current post/position/ambitions, your interest in 'Open Science' and why you would like to be involved in this network. Either email to me at C [dot] Neylon [at] rl.ac.uk or simply drop it in as a comment.

Please, if you have anything to offer, step up. And I cannot emphasize this too strongly: if you're at all interested, you do have something very valuable to offer: a letter of support, as described. It is vital that the powers-that-be (that is, the powers-that-fund) see real commitment to these ideas, from real people. The deadline loometh (next Tuesday), so don't put this off. Your letter doesn't have to be a literary masterpiece -- just stand up and be counted.



Sunday, 21 October
Call yer congresscritters -- right now.

The bill to make the NIH OA policy mandatory instead of voluntary is in trouble. From the ATA via Peter Suber (with some editing by yours truly):

The Senate is currently considering the FY08 Labor-HHS Bill, which includes a provision (already approved by the House of Representatives and the full Senate Appropriations Committee), that directs the NIH to change its Public Access Policy so that participation is required (rather than requested) for researchers, and ensures free, timely public access to articles resulting from NIH-funded research. On Friday, Senator Inhofe (R-OK), filed two amendments (#3416 and #3417), which call for the language to either be stricken from the bill, or modified in a way that would gravely limit the policy's effectiveness.

Amendment #3416 would eliminate the provision altogether. Amendment #3417 is likely to be presented to your Senator as a compromise that "balances" the needs of the public and of publishers. In reality, the current language in the NIH public access provision accomplishes that goal. Passage of either amendment would seriously undermine access to this important public resource, and damage the community's ability to advance scientific research and discovery.

Please contact your Senators TODAY and urge them to vote NO on amendments #3416 and #3417. (Contact must be made before close of business on Monday, October 22).

Contact information and a tool to email your Senator are online [here]. No time to write? Call the U.S. Capitol switchboard at (202) 224-3121 to be patched through to your Senate office.

If you have written in support before, or when you do so today, please inform the Alliance for Taxpayer Access. Contact Jennifer McLennan through jennifer@arl.org or by fax at (202) 872-0884.

The ATA has provided a sample email, but I think they miss one important point: Inhofe's amendments are likely to be presented as compromises aimed at avoiding a presidential veto, and that is purely bullshit. (Note to self: find out how much money Inhofe gets from publishers.) Here's Peter Suber's extract from the White House Statement of Administration Policy:
The Administration strongly opposes S. 1710 because, in combination with the other FY 2008 appropriations bills, it includes an irresponsible and excessive level of spending and includes other objectionable provisions....

S. 1710 exceeds the President's request for programs funded in this bill by nearly $9 billion, part of the $22 billion increase above the President's request for FY 2008 appropriations. The Administration has asked that Congress demonstrate a path to live within the President's topline and cover the excess spending in this bill through reductions elsewhere, while ensuring the Department of Defense has the resources necessary to accomplish its mission. Because Congress has failed to demonstrate such a path, if S. 1710 were presented to the President, he would veto the bill.

The Administration strongly opposes provisions in this bill that overturn the President's policy regarding human embryonic stem cell research....

Public Access to Research Information. Provisions in the bill would require that manuscripts based on NIH-funded research be made available to the public within 12 months of publication. The Administration notes that NIH's current policy requesting the voluntary submission of manuscripts has only been in effect for 2 years, and the Administration believes there is opportunity to work with Congress to study the current policy and consider ways to encourage better participation. The Administration believes that any policy should balance the benefit of public access to taxpayer supported research against the possible impact that grant conditions could have on scientific research publishing, scientific peer review and on the United States' longstanding leadership in upholding strong standards of protection for intellectual property....

The Administration strongly opposes...the elimination of the longstanding definition of abstinence education that keeps these programs focused solely on abstinence....

Note that the real reason for the President's objection is the money he'd rather spend on his own priorities. The paragraph that deals directly with the NIH provision shows unsettling echoes of the PRISM propaganda but is really just waffle -- padding to make the list of objections look longer. In fact, as I noted earlier, the NIH estimates that it will cost about $3 million to implement the mandate -- not much of a dent in that $9 billion the President is complaining about. So, here's an alternative sample email, the one I just sent:
Dear Congresscritter,

I am a research scientist and about to become a US citizen. I have worked in the US for four years, having held an NIH T32 postdoctoral fellowship for two of those years. As a scientist and as a concerned member of the US public, I recently wrote to you in support of that portion of the Senate Appropriations Committee's FY 2008 Labor-HHS-Education appropriations bill (S.1710) which directs the NIH to change its policies from a request to a mandatory requirement for free, timely public access to NIH funded research. I have just learned of two last-minute amendments to this bill (#3416 and #3417) proposed by Sen. Inhofe (R-OK). The first of these amendments would eliminate the relevant portion of the bill altogether, and the second would cripple it.

I write now to urge you to oppose both of these amendments, which are likely to be presented to you as compromises aimed at avoiding a Presidential veto. They will do nothing of the sort: the President's primary objection to the bill, as a recent Statement of Administration Policy (1) makes clear, is the $9 billion in spending over and above the Administration's topline. The NIH recently estimated (2) the cost of implementing the mandatory public access requirement of S.1710 at less than $3 million per year -- hardly a significant reduction in a $9 billion overshoot!

As I wrote in my earlier letter, traditional scientific publishing sees the taxpayer pay for the research, pay to have it published, and then pay again to access it (or for the same researchers to access it!) through subscriptions to privately owned journals (3). Legislators have a practical, legal and moral obligation to end this inefficiency and waste, and the way to do that is through Open Access to publicly funded research. Open Access maximizes research efficiency (and thus the return on research investment) by removing obstacles to the acquisition of new results by researchers (4), and is essential for realizing the vast and virtually untapped potential of automated data- and text-mining (5,6).

Since the current voluntary policy has achieved only a 5% compliance rate in the two years since its instigation, a mandate is clearly required to fulfil Congress' obligation to maximize the return on public investment in research. The current language of S.1710 contains just such a mandate, and Sen. Inhofe's amendments #3416 and #3417 would eliminate it. Please oppose these amendments and approve without change that portion of the appropriations bill which changes the language of the NIH deposit policy from voluntary to mandatory.

Sincerely,

me.


-----references-----
(1) http://www.whitehouse.gov/omb/legislative/sap/110-1/s1710sap-s.pdf
(2) http://grants.nih.gov/grants/guide/notice-files/NOT-OD-05-022.html
(3) http://www.earlham.edu/%7Epeters/fos/newsletter/09-04-03.htm#taxpayer
(4) http://eprints.ecs.soton.ac.uk/10713/01/timcorr.htm
(5) http://eprints.ecs.soton.ac.uk/13028/01/AS-OA-final.pdf
(6) http://www.jneurosci.org/cgi/content/full/26/38/9606



Sunday, 14 October
A big step in the right direction.

This is excellent news:

We are delighted to announce that a reviewer discount now exits for all those who review manuscripts for Chemistry Central Journal, and this is linked to the rest of the the BMC series journals. The review must have been received on time, and during the last 12 months.

This means that if the submitting author has reviewed a manuscript for Chemistry Central Journal or any of the BMC series, they are entitled to a 20% discount off the article processing charge (APC) when submitting articles to any of these journals. We ask that qualifying authors request this discount at the time of submission.

The number of articles submitted to these journals continues to grow significantly, and we are grateful those who agree to review for our journals.

This is a terrific idea, and I hope BMC will extend a similar program across all of its own BMC series journals -- that is, if you review for any of them, you qualify for some level of discount when you submit a paper to any other. (I'm an idiot. At least the link's right.)

Recognition of the value of peer review is a Good Thing™ and long overdue; it gets plenty of lip service but this is the first time I've seen anyone put their money where their mouth is. Let's just hope that funding and tenure review committees find a way to do something similar.

(Hat-tip: Peter Suber.)



Monday, 10 September
Reply to Timo Hannay.

Timo Hannay on Nascent, branching off from a discussion of intemperate responses to PRISM:

A case in point is the criticism that my NPG colleague, Maxine Clarke, faced when talking about "open access" projects at NPG. Not everyone shared her definition of open access and she was accused by some bloggers of using the term as a marketing slogan. (Peter Murray-Rust, who made the original point, later recanted when he understood that Maxine was being genuine, so I don't take issue with him.)
Mr Hannay does, presumably, take issue with me. I will apply Hanlon's Razor and assume Mr Hannay did not bother to read beyond the post he linked, since the very next is this one:
In the entry below, I was not sufficiently careful to avoid Nature-bashing, or the implication that Maxine Clarke was morphing, werewolf-like, into some kind of publisher pitbull. Thanks to Pedro, bdf and RPM for responses which made this clear.

[...]

Let me finish, though, by pointing out that I do not wish to paint NPG as one of the unscrupulous publishers whose intentions worry me, nor Maxine Clarke as their sneaky shill. If NPG uses the term "open access" differently from me, I take that as a good-faith disagreement, and if Maxine uses the term in her employers' sense that is hardly "marketing". Specifically, I apologize for the phrase "if [Maxine] is going to start abusing [the term "OA"] as marketing for Nature", which contains an uncalled-for implication that I hope this entry will dispel.

The elision there includes the list of NPG's OA-related activities that Mr Hannay goes on to point out. The next post on my blog is this one in which I quote Peters Suber and Murray-Rust some more regarding OA definitions and conclude, in what I am happy to have readers interpret as a further step back:
I take Peter S to be saying that it's inevitable that "Open Access" will come to mean, in general use, more things to more people than strict BOAI, and we will not achieve anything by making arseholes of ourselves over it. (Even if that's not quite the way Peter S would put it, that's the way I've come to look at the situation.) There's no point in picking quarrels we don't have to have. It's enough to be more careful in our own usage, for which purposes suffixes a la Peter MR should prove very useful when we need extra precision. I don't think we need invent terms ("fuzzy") just yet -- "OA (specific licence, with hyperlink if writing online)" and "OA (free to read)" should cover most cases.

If we can get to the point where the average consumer -- basically, any researcher -- responds to an OA claim or label by asking "which licence?", we will have done an end-run around the problem of term dilution.

It seems to me entirely unfair and misleading to link to the first of my posts without also linking the next two.

I think Mr Hannay is also in error in describing this post from Jean-Claude as a "followup" to the posts above; I think that Jean-Claude was referring to much more recent and clear-cut abuses outlined by Peter Murray-Rust.

Mr Hannay also goes on to say that

Some people are just too quick to assume base motives, and employ words like "boycott" as if they were punctuation marks.
I do not know who that is aimed at, but as for my own reference to a boycott, I do not think it unreasonable or precipitous to consider such action against publishers who do not distance themselves from PRISM and similar efforts. Why should it be up to me to determine who is and is not part of PRISM? The PRISM organizers would certainly like me to assume that all AAP members are PRISM supporters. As Mr Hannay himself makes clear, publishers need scientists more than the other way around. If you want my manuscripts, you had better demonstrate to me that you are not part of the pack of corporate bloodsuckers and soulless spin doctors that is pushing the palpably dishonest, profit-driven PRISM agenda. (Not that I would, given a free choice, publish in Nature anyway, even after Mr Hannay made it clear NPG does not support PRISM and even if they'd have me -- because they're not OA.)



Update: Peter Murray-Rust did a better job than me of responding to the Nascent post: he rightly led with the important part, which is that Nature is not endorsing PRISM. That's no surprise, but I think it important to be explicit and public about who is and who is not backing PRISM.

Also, now I feel bad about the snotty "Mr Hannay" stuff. I use people's first names here as a rule, even when I've never met them, because a blog is an informal conversation and because I think it fosters a sense of civil fucking discourse. I know perfectly well that Timo is on the side of the angels (viz, on the side of science!) when it comes to scientific communication, and it follows that his comments -- and criticisms -- on this issue are made in good faith. So, er, *shuffles feet*, sorry Timo.



Wednesday, 05 September
Nature mission statement update

Since I spend a fair bit of time excoriating publishers, it's only fair that I take note of those who act in good faith. In response to the blogospheric reaction to the Nature mission statement, Maxine Clarke asked the appropriate persons to update the NPG web page (as you remember, Bob, the journal site already made clear the necessary distinction between the original and updated statements). Accordingly, the NPG page now reads:

Nature's original mission statement was published for the first time on 11 November 1869. The journal's original mission statement was revised in 2000. The original mission statement is reproduced below:
and there follows the same version of the original that was on the page last time I looked.

It's nitpicking to note that I prefer the way the journal does it, with the updated statement immediately visible and a link to the pdf of the original. The new page removes any confusion as to which mission statement now obtains.

Maxine also asked for the print edition of the journal to follow the online version and make both versions of the mission statement obvious. This will necessarily take more time than updating a web page, and I don't have the latest Nature to hand so I don't know if the print change has gone through yet. I will update again as soon as I find out.

So, many thanks to Maxine for responding to somewhat barbed criticism in such a constructive manner.



Wednesday, 05 September
PRISM and PMR

I'm swamped, but two quick points:

1. I'm not going to try to keep up with reactions to PRISM here, unless I think I have something potentially useful to add. If you want a news stream, read OAN or watch my PRISM tag on Simpy -- I'll grab everything I notice.

2. Peter Murray-Rust is blogging up a storm on publisher policies, copyright and Open Access.

There is a great deal of confusion regarding publisher policies and the rights of readers, scholars, institutions &c. I hope that publishers will agree with me that Peter MR is doing a sterling job of getting these issues out into the open, where they can be clarified -- to everyone's benefit.




Saturday, 01 September
More on PRISM: let's not take this lying down.

Jonathan Eisen has got the right idea, listing the entire members' directory of the AAP and calling on academics to consider a boycott if those entities will not at least request dissociation from the PRISM program (as Rockefeller University Press has done) or its discontinuation. You can also read the members' list on the AAP site, and Peter Suber points out that we should pay particular attention to their Professional and Scholarly Publishing division:

I suspect that AAP/PSP did not consult its members before launching PRISM. But in any case the members should know that the launch of PRISM tarnishes them, alienates authors, readers, and referees, and, if successful, will only harm science by entrenching rather than removing access barriers to the results of publicly-funded research.
Peter is commenting there in response to someone else who has got the right idea, Peter Murray-Rust, who (as a Cambridge faculty member) has written to Cambridge University Press; his letter is an excellent example of what everyone should do who has any connection, professional or personal, with any of the AAP/PSP member companies, so I quote it here in full:
Open Letter to Stephen Bourne, Chief Executive Cambridge University Press

Dear Stephen Bourne,

I am writing as an individual member of staff in the University (heavily
engaged in developing new approaches to scientific scholarly publishing) to
ask about CUP's involvement with the recently launched PRISM initiative
from the AAP (http://www.prismcoalition.org/). This initiative is an
undisguised coalition to discredit Open Access publishing and its launch a
few days ago has generated universal dismay and anger in many quarters
including several outside mainstream publishing. The press release was
reported in full by Peter Suber on his Open Access News blog
(http://feeds.feedburner.com/~r/earlham/dGCQ/~3/147374721/2007_08_19_fosblogarchive.html)
where he has objectively answered and dismissed the basis of PRISM and its
methods. As an example of the language of PRISM it implies that publishing
in Open Access journals (as I do on occasions) is "junk science". There is
much more from PRISM which is both deliberately factually incorrect and
misleading and I cannot see how a reputable scholarly organisation such as
CUP could be associated with it. Indeed at least one similar publisher
(Rockefeller University Press
http://feeds.feedburner.com/~r/earlham/dGCQ/~3/150207794/2007_08_26_fosblogarchive.html)
writes:

"I am writing to request that a disclaimer be placed on the PRISM website
indicating that the views presented on the site do not necessarily reflect
those of all members of the AAP. We at the Rockefeller University Press
strongly disagree with the spin that has been placed on the issue of open
access by PRISM." [rest of letter omitted here]

The purpose of my letter is simply to request factual information from CUP
about its involvement with PRISM. Since PRISM itself has not reacted to any
of the recent comment I can simply speculate that not all members of the
AAP (perhaps including yourselves) were consulted before PRISM made its
press release and new site. In particular it is unclear whether PRISM is de
facto composed of all the members of the AAP or whether it uses their
unsought goodwill to reinforce the apparent strength of the PRISM
organization.

This mail is an Open Letter (posted on my blog,
http://wwmm.ch.cam.ac.uk/blogs/murrayrust) and I would intend to publish
your reply in toto and unedited since your position (and those of similar
publishers) is of great public interest). If there is anything you would
not wish to be published, please indicate. Alternatively you may leave a
comment on the blog itself. (My blog itself, though strongly advocating
Open Access and particularly Open Data, attempts to be fair and accurate).

Thanks in advance

Peter Murray-Rust

This letter hits every necessary nail squarely on the head:
  • be polite
  • make clear the nature of your connection with the publisher to whom you are writing
  • keep the background brief and be sure to point to Peter Suber's rebuttal
  • explicitly request a specific response: did Publisher X know about PRISM, and does Publisher X support PRISM?
  • suggest that Publisher X should publicly distance themselves as RUP has done
  • if at all possible, do all of this in public: an open letter, on a blog
I don't know whether I have any direct connection with any AAP/PSP member companies, although I could certainly write to publishers of journals in which I have published papers. In a later entry I will dig through the list and try to find likely recipients of such letters -- for which Peter Murray-Rust has provided such a splendid template.

Update: in comments on Jonathan's post, CSHL Press has repudiated PRISM. Good for them, and I hope they will make a formal public statement to the same effect -- for instance, on their website.



Sunday, 26 August
a bit more on PRISM

If you haven't already, go read Peter Suber's initial response -- it is, as always, clear, calm, comprehensive and compelling. (I hope to meet Peter one day; I imagine him as a kind of unflappable, scholarly James Bond...) This is your one-stop anti-PRISM shop for the time being: if you read nothing else, read this; and whenever PRISM rears its ugly head, make sure Peter's response gets an airing too.

Peter has also responded to a Publishers Weekly article that simply repeats the PRISM propaganda. The by-line is Rachel Deahl, a senior news editor at PW. I wrote to her, as follows:

Dear Ms Deahl,

I write in response to your recent brief article in Publishers Weekly ("AAP Tries to Keep Government Out of Science Publishing", August 23, 2007), in which you quote or repeat several egregious errors of fact which are being propagated by the newly formed anti-Open Access disinformation factory, PRISM.

Briefly, there is no aspect of the Open Access publishing model which would force anyone to "turn over" anything to the government, nor will OA publishing damage peer review in any way. For a detailed and authoritative response to the PRISM campaign, I refer you to Peter Suber, Professor of Philosophy at Earlham College, on his weblog Open Access News:

http://www.earlham.edu/~peters/fos/2007_08_19_fosblogarchive.html#365179758119288416

In your article, you quoted PRISM and AAP members, but gave no space to the opposing point of view, which is simply that taxpayers should get what they have paid for: the results of the research they fund, and maximally efficient use of those results by the researchers whose salaries they also pay. I hope you will follow your initial report with a more balanced article that includes interviews with Open Access experts and advocates. In case it is of use in your research, I include here, in no particular order, a brief list of potential interviewees:

Peter Suber, as above (contact)
Paul Ginsparg, founder of arXiv (contact)
Barbara Cohen, Executive Editor, Public Library of Science (contact)
Mark Patterson, Director of Publishing, Public Library of Science (contact)
Matthew Cockerill, Publisher, BioMed Central (contact)

Finally, I should point out that I have also published this letter on my own weblog, and you are of course welcome to respond there (www.sennoma.net) at any time.

Best wishes,

me.

I'm not sure whether this will do any good -- William Walsh has pointed out that Publishers Weekly is owned, once removed, by Reed Elsevier, noted price-gougers and employers of the notorious Publisher's Pitbull, so Ms Deahl's options may be limited by her bosses. This is also a good place to point out that if you write to her, being a jerk about it will not only be pointless and stupid but will in fact damage the OA cause. (That should go without saying but these things do tend to get out of hand when emotions run high and email allows one to send in haste and repent at leisure...)



Thursday, 23 August
PRISM = Publishers Relying on Insidious Subversion Methods

From Peter Suber:

The AAP/PSP has launched PRISM (Partnership for Research Integrity in Science & Medicine).  I'm quoting today's press release in its entirety so that I can respond to it at length:


A new initiative was announced today to bring together like minded scholarly societies, publishers, researchers and other professionals in an effort to safeguard the scientific and medical peer-review process and educate the public about the risks of proposed government interference with the scholarly communication process.

[much egregious lying]

Anyone who wishes to sign on to the PRISM Principles may do so on the site.

Fortunately for us all, Peter has already responded; I won't excerpt his point-by-point rebuttal here, you should go read it all.

This is disgusting. This runs counter to everything that science, academia, scholarship (and scholarly publishing!) stand for.

There are no names on the PRISM site yet -- but I'm going to find as many as I can and publish them here. Sunlight is the best disinfectant, and I want to know just who is taking part in this revolting effort to steal from the commons and turn public goods into private profit.

(We can start with the AAP: their members page is essentially one long list of companies and organizations with whom I will assiduously avoid doing business until and unless they dissociate themselves from PRISM, and preferably from the AAP altogether.)

More later. Oh yes indeedy.



Tuesday, 21 August
Another note on terminology.

In a comment on one of my 3QuarksDaily columns about Open Access/Open Science, Matthias Röder points out that there are more kinds of research than scientific:

One thing that might be worth thinking about is the fact that Open Science is a term that excludes many projects in the humanities and social sciences. I think Open Research might be a good alternative.
By way of illustration he points to a Wikipedia entry on Open Research, which in turn points to a number of Open projects, including SCRIBE, with which Matthias is involved:
  1. SCRIBE is an open and peer-reviewed database with information on music copyists and samples of their handwriting.
  2. SCRIBE is a software tool for searching music manuscripts by handwriting characteristics.
He's got a point. I don't mean to be exclusionary, and am happy to accept Open Research as an umbrella term, a higher level taxon of which Open Science and Open Anything Else are subgroups.

That said, there's also no reason not to use the phylum name when you don't mean to speak for the entire kingdom. I don't know much about research outside of science; I've posted a little about it, but haven't looked into it with nearly the obsessive care with which I follow developments in Open Science. I'm a scientist; my focus is on science.

I'm happy to learn about efforts towards openness in other fields, of course, but I hope no one is surprised or offended to hear that I'll be thinking "how can we use this for science?" the whole time. So for now, I will continue to talk about "Open Science", and I hope that researchers from other fields will not feel excluded but will instead simply look to see whether anything I'm saying is of use in Open Whatever-It-Is-That-They-Do.



Monday, 20 August
What do we mean by open science?

(Addressed in absentia to "Tools for Open Science", Second Life, Aug 20 2007.  I am sorry I could not be there.)

I think we all know what we want, and I think we all want much the same thing, which boils down to just this: cooperation.  A way forward for science, a way out of the spiralling inefficiency of patent thickets, secret experiments and dog-eat-dog competition.  But we use a variety of terms, and probably mean slightly different things even when we use the same terms.  It might -- I am not sure -- be useful at this point to come together on an agreed definition for an agreed term or set of terms  -- something equivalent to the Berlin/Bethesda/Budapest Open Access Declarations.

If this does not seem like a "tool for open science", consider what the BBB definition has done for Open Access.  It provides cohesion, a point of reference and a standard introduction for newcomers, and acts as a nucleation center for an effective movement with clear and agreed goals.  Since this SL session takes off from SciFoo, and SciFoo is by all accounts very good at converting brainstorming sessions into practical outcomes, I thought perhaps the idea of a definition or declaration of Open Science might be a suitable topic.  In what I hope is the spirit of SciFoo, here are some ideas that might be useful in such a discussion.


Terms

Whatever this thing is, what should we call it?  There are a number of terms in use:

  • Open Science -- has the weight of Creative Commons/Science Commons behind it, via iCommons
  • Open Source Science -- Jamais Cascio, Chemists Without Borders
  • Open Source Biology -- Molecular Biosciences Institute
  • I think "biology" too narrow -- there seems little point in Open Chemistry, Open Microbiology, Open Foo all having different names.  I think Open Source Foo too likely to lead to confusion with software initiatives, and too likely to lead to pointless arguments about what the "source code" is.
  • That leaves Open Science, which would be my choice for an umbrella term.  A case can be made, though, for Open Research, on the same basis on which I argue against Open Biology etc -- see this comment from Matthias Röder
  • Another "inclusive" possibility is to focus on information -- Open Data, as per PMR's Wikipedia entry, or the broader Open Content.  In the same vein, the Open Knowledge Foundation provides a fairly comprehensive definition of Open Knowledge.
  • I have seen "Science 2.0" around quite a bit lately, though it's a bit too marketing-speak for my taste
  • Open Notebook Science is a very specific subset of Open Science: if your notebook is open to the world, there's not much confusion about access barriers!  It even comes with its own motto: "no insider information".  This is as Open as Open gets.


Sources and Models

We don't have to re-invent the wheel:



Flexibility

We don't want to start a cult, and we don't want to bog anyone down in semantics.  There's no purity test or loyalty oath.  My own view is that Open Science (or whatever we end up calling it) is not an ideology but an hypothesis: that openly shared, collaborative research models will prove more productive than the highly competitive "standard model" under which we now operate. 

Openness in scientific research covers a range of practices, from tentative explorations with a single small side-project all the way to Open Notebook Science à la Jean-Claude, and we should welcome every step away from the current hypercompetitive model.  Open Notebook Science provides a useful marker for the Open end of the spectrum; perhaps all a Declaration need do is identify the minimum requirements that mark the other end of the spectrum?


Conditions


What standards must a research project or programme meet in order to be considered Open?

  • obvious: Open Access publication
  • equally crucial: Open Data, that is, raw data as freely available (including machine access) as OA text
  • probably indispensable: Open Licensing so as to avoid confusion as to what is truly available and for what purposes; as per Peters Suber and Murray-Rust, this must be
    • explicit
    • conspicuous
    • machine-readable
  • Open Semantics: perhaps none of this will be much good without metadata and standards to allow interoperability and free flow of information
  • desirable: Free/Open Source Software
  • David Wiley: "four Rs" of Open Content (cf. Stallman's four fundamental freedoms for software):
    • Reuse - Use the work verbatim, just exactly as you found it
    • Rework - Alter or transform the work so that it better meets your needs
    • Remix - Combine the (verbatim or altered) work with other works to better meet your needs
    • Redistribute - Share the verbatim work, the reworked work, or the remixed work with others
  • OKF definition of Open Knowledge




Wednesday, 08 August
Yale vs. BMC

Yale's science libraries have stopped paying the article processing charges for Yale faculty who publish in BioMed Central journals. Yale says:

Starting with 2005, BioMed Central article charges cost the libraries $4,658, comparable to single biomedicine journal subscription. The cost of article charges for 2006 then jumped to $31,625. The article charges have continued to soar in 2007 with the libraries charged $29,635 through June 2007, with $34,965 in potential additional article charges in submission.
BMC responds:

The main concern expressed in the library's announcement is that the amount payable to cover the cost of publications by Yale researchers in BioMed Central's journals has increased significantly, year on year. Looking at the rapid growth of BioMed Central's journals, it is not difficult to see why that is the case. BioMed Central's success means that more and more researchers (from Yale and elsewhere) are submitting to our journals each year. [...]
An increase in the number of open access articles being submitted and going onto be published does lead to an increase in the total cost of the open access publishing service provided by BioMed Central, but the cost per article published in BioMed Central's journals represents excellent value compared to other publishers.

The increased cost arises because Yale researchers are submitting more and more work to BMC journals.  More manuscripts = higher costs, but if the cost per article has not gone up, then BMC's model scales effectively.  Here are some other ways to look at the numbers:

  • For around $65K, Yale gets about 40 articles published OA, that is, available free to everyone everywhere forever, plus a "subscription" (that is, Open Access, like everyone else) to 179 journals.  The average biomed journal subscription is around $1000-1500/yr; choosing the lower figure to be conservative, those subscription-equivalents are worth $179K/yr.  Even if Yale only wanted to subscribe to around a third of the BMC journals, that would still cost about the same as the OA charges -- and this comparison ignores the page, color and miscellaneous charges that many journals levy.  (An example: PNAS charges $70 per printed page, plus $325 for each color figure or table; $150 for each replacement or deletion of a color figure or table.)

  • Yale could publish those 40 articles elsewhere without paying anything (again, ignoring page etc. charges).  Assuming they don't subscribe to any of the journals they publish in, though, every time any Yale employee wants to read one of those articles they're on the hook for somewhere around $30; so it only takes 2166 person-articles, or an average of about 50 employees wanting to read each article, to get back to $65K -- without the benefit of OA.

  • Yale spent about $7.7 million on subscriptions in 2005-6; converted to OA author-side charges at $1600/article, that's about 4800 articles.  A PubMed search on "Yale" gives 2272 hits; "Yale in title/abstract" gives 131, leaving 2141 papers where "Yale" is probably in an author's address.  I can't find a quick way to break out Yale's subscription expenditure by field, so what proportion of the $7.7mil goes to biomed journals I couldn't say (though STM titles are the most expensive subscriptions for any academic library). If PubMed-indexed journals make up 44% (2141/4800) of Yale's subscription costs, which does not seem unlikely, then they're already paying $1600/article -- without the benefit of OA.
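
For anyone who wants to check my arithmetic, here is a minimal sketch of the three comparisons above. Every input is one of the rough figures quoted in the bullets; the variable names are mine and purely illustrative, and nothing here is more precise than the back-of-envelope numbers it starts from.

```python
# Back-of-envelope check of the three Yale/BMC comparisons.
# All inputs are the approximate figures quoted in the post.

oa_charges = 65_000        # approx. total BMC article charges, USD
oa_articles = 40           # approx. number of Yale articles published OA
bmc_journals = 179         # number of BMC journals
subscription_low = 1_000   # low end of the $1000-1500/yr subscription range

# 1. What the BMC journals would cost as subscriptions
all_journals_value = bmc_journals * subscription_low   # every journal
third_of_journals_value = all_journals_value / 3       # about a third of them

# 2. Pay-per-view break-even at roughly $30 per article read
pay_per_view = 30
person_articles = oa_charges // pay_per_view           # reads needed to spend $65K
readers_per_article = person_articles / oa_articles    # average readers per article

# 3. Yale's subscription budget expressed as author-side charges
subscription_budget = 7_700_000
apc = 1_600
equivalent_articles = subscription_budget / apc
pubmed_papers = 2272 - 131                             # hits minus "Yale" in title/abstract
pubmed_share = pubmed_papers / 4800                    # share of the ~4800 equivalent articles

print(f"Subscription-equivalent value: ${all_journals_value:,}")
print(f"A third of the journals:       ${third_of_journals_value:,.0f}")
print(f"Person-articles to reach $65K: {person_articles}")
print(f"Readers per article:           {readers_per_article:.0f}")
print(f"Articles at $1600 each:        {equivalent_articles:,.0f}")
print(f"PubMed share of those:         {pubmed_share:.0%}")
```

The integer division in step 2 simply drops the fraction of a read, which is why the count is 2166 rather than 2167.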

A quick fiddle with biology + medicine data from the Journal Cost-Effectiveness database gives an average price per article of around $12 for toll-access journals, but that's (one subscription)/(total no. articles).  The question is, how many subscriptions do they sell -- that is, what is their income/article?  We know what BMC makes per article: about $1600 on average.  If an average toll-access journal sells just 135 subscriptions per year, they're bringing in more per article than BMC.
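
The break-even point in that last comparison can be sketched as follows. The $12 and $1600 figures are the rough estimates from the paragraph above, and the function name is just something I made up for the illustration.

```python
import math

price_per_article = 12          # (one subscription)/(total articles), USD, rough estimate
bmc_income_per_article = 1_600  # approx. average BMC article processing charge, USD


def income_per_article(subscriptions: int) -> int:
    """Toll-access income per article if the journal sells this many subscriptions."""
    return subscriptions * price_per_article


# Smallest number of subscriptions at which income per article passes BMC's
break_even = math.ceil(bmc_income_per_article / price_per_article)

print(f"Break-even subscriptions: {break_even}")
print(f"Income per article at 135 subscriptions: ${income_per_article(135)}")
```

This treats every subscription as sold at the same list price, which is certainly not true in practice (bundles, discounts, institutional tiers), so it is only a rough guide to the order of magnitude.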

There's more, but that'll do for now.  Two questions arising:

1. what's the average page/colour/misc charge levied by toll-access journals?
2. how many subscriptions does an average journal sell each year?



An appendix of sorts: the BMC cost structure

Article Processing Charges

standard charge = $1600 (129 journals)
alternative charges: $2410 (2 journals)
$2310 (1 journal)
$2170 (1 journal)
$2010 (2 journals)
$1710 (1 journal)
$1970 (2 journals)
$1910 (4 journals)
$1810 (2 journals)
$1710 (5 journals)
$1505 (11 journals)
$1455 (1 journal)
$1305 (5 journals)
$1205 (2 journals)
$1005 (2 journals)
$805 (1 journal)
$725 (2 journals)
$500 (1 journal)
no charge (5 journals)
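
As a sanity check on the "about $1600 on average" figure, here is a small sketch that takes the list above exactly as transcribed (including the $1710 price that appears twice) and computes the mean list price per journal. This is a journal-weighted mean, not a per-article average: I don't have article counts per journal, so treating the two as close is an assumption.

```python
# (list price in USD, number of journals), transcribed from the list above
apc_schedule = [
    (1600, 129), (2410, 2), (2310, 1), (2170, 1), (2010, 2),
    (1710, 1), (1970, 2), (1910, 4), (1810, 2), (1710, 5),
    (1505, 11), (1455, 1), (1305, 5), (1205, 2), (1005, 2),
    (805, 1), (725, 2), (500, 1), (0, 5),
]

total_journals = sum(count for _, count in apc_schedule)
total_list_price = sum(price * count for price, count in apc_schedule)

# Mean list price across journals (each journal counted once)
mean_list_apc = total_list_price / total_journals

print(f"Journals: {total_journals}")
print(f"Mean list APC per journal: ${mean_list_apc:,.0f}")
```

The journal count matches the 179 journals mentioned earlier, and the mean comes out a little under $1600, so using $1600 as the working figure errs slightly against BMC if anything.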

Supporter's Membership
Supporter Members pay a flat rate annual Membership fee based on the number of biology, chemistry, physics and medical researchers and graduate students at the institution. Members of the institution are then given a 15% discount on the APC when publishing in our journals.
Institution size | Annual fee | Break-even total APC | Articles at $1600
Very small institution (21-500 faculty and postgraduate students in biology, chemistry and medicine) | $1994 | $13293 | 8.3
Small institution (501-1500 faculty etc.) | $3987 | $26580 | 16.6
Medium size institution (1501-2500 faculty etc.) | $5980 | $39867 | 24.9
Large institution (2501-5000 faculty etc.) | $7974 | $53160 | 33.2
Very large institution (5001-10000 faculty etc.) | $9967 | $66447 | 41.5

So if the flat fee is to cost less than the 15% discount saves, total APC must be at least the figure in column 3 (that is, fee / 0.15).  Since the average is likely to be close to $1600/article, dividing through gives the number of articles in column 4.
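
The break-even reasoning can be sketched like this. The fees are BMC's published Supporter Membership rates as quoted above; the 15% discount and the $1600 average APC are the same working figures used in the text, and the short tier labels and function name are my own.

```python
DISCOUNT = 0.15      # Supporter Member discount on each APC
AVERAGE_APC = 1_600  # assumed average APC, USD

# Flat annual membership fee (USD) by institution size
membership_fees = {
    "Very small": 1994,
    "Small": 3987,
    "Medium": 5980,
    "Large": 7974,
    "Very large": 9967,
}


def break_even(fee: float) -> tuple[float, float]:
    """Return (total APC, number of articles) at which the discount equals the fee."""
    total_apc = fee / DISCOUNT
    return total_apc, total_apc / AVERAGE_APC


for size, fee in membership_fees.items():
    total_apc, n_articles = break_even(fee)
    print(f"{size:<10} fee ${fee:>5}  break-even APC ${total_apc:>6,.0f}  articles {n_articles:.1f}")
```

Below the break-even number of articles an institution is paying more for the membership than the discount returns; above it, the membership pays for itself.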

Postpay Membership
...group members are invoiced in arrears for articles authored by their members that have published in our journals since the last invoice date. Invoice schedules are set on a monthly or quarterly cycle.
Prepay Membership
...enables an organization to cover the whole cost of publishing for their investigators when publishing in our open access journals. No additional fees will be paid by individual authors. This is an advance payment system whereby customers pay upfront for accepted articles authored by their investigators to be processed and published. Upon publication, the full Article-Processing-Charge (APC) for the journal in question, minus a loyalty discount, will be deducted from the account.

The higher the amount paid in advance, the greater the loyalty discount given on each APC.
No numbers seem to be available for the "loyalty discount".



Saturday, 21 July
OK, but I still don't want to see "Open Access" become the new "Low Fat".

Peter Suber commented on the last entry to clarify his position on the varying uses of the term "Open Access":

For me, OA in the strict sense removes both price barriers and permission barriers; all the major public definitions say so; and I'm only too glad to repeat this whenever it comes up. However, as a matter of word usage, the term now covers more territory than this and I've stopped fighting that fact. That is, the term is often used for content that is merely free-to-read.
Peter goes into more detail in a recent entry on his blog:
...many projects which remove price barriers alone, and not permission barriers, now call themselves OA. I often call them OA myself. This is only to say that the common use of the term has moved beyond than the strict definitions. But this is not always regrettable. For most users, removing price barriers alone solves the largest part of the problem with non-OA content, and projects that do so are significant successes worth celebrating. By going beyond [I would say "outside" -- BH] the BBB definition, the common use of the term has marked out a spectrum of free online content, ranging from that which removes no permission barriers (beyond those already removed by fair use) to that which removes all the permission barriers that might interfere with scholarship. This is useful, for we often want to refer to that whole category, not just to the upper end. When the context requires precision we can, and should, distinguish OA content from content which is merely free of charge. But we don't always need this extra precision.

In other words: Yes, most of us are now using the term "OA" in at least two ways, one strict and one loose, and yes, this can be confusing. But first, this is the case with most technical terms (compare "evolution" and "momentum"). Second, when it's confusing, there are ways to speak more precisely. Third, it would be at least as confusing to speak with this extra level of precision --distinguishing different ways of removing permission barriers from content that was already free of charge-- in every context. [...]

and in the Sept 2004 edition of the SPARC OA Newsletter:
One danger is the dilution of our term. That's why [this newsletter discusses] the BBB definition and its place in our history. But another danger is the false sharpening of our term. If we thought that the BBB definition settled matters that it doesn't settle, then we could prematurely close avenues of useful exploration, needlessly shrink the big tent of OA, and divisively instigate quarreling about who is providing "true OA" and who isn't.

The BBB definition functions as a usefully firm definition of "open access" even if it leaves room for variation. We should agree that OA removes some permission barriers (e.g. on copying, redistribution, and printing) even if it leaves different OA providers free to adopt different policies on others (e.g. on derivative works and commercial re-use). My personal preference, for example, is to permit derivative works and commercial re-use. But (as I wrote in FOSN for 1/30/02) I want to make this preference genial, or compatible with the opposite preference, so that we can recruit and retain authors on both sides of this question.

I've omitted a lot of good information to save space here; anyone interested in this issue should read all of the linked discussions. In particular, the SPARC newsletter goes into useful specifics about the OA-related activities of a number of publishers.

Peters Suber and Murray-Rust have both pointed out that one way to be specific about "levels" of openness is to be explicit about licensing -- PMR:

If the community wishes to continue to use "open access" to describe documents which do not comply with BOAI then I suggest the use of suffixes/qualifiers to clarify. For example:
  • "open access (CC-BY)" - explicitly carries CC-BY license
  • "open access (BOAI)" - author/site wishes to assert BOAI-nature of document(s) without specific license
  • "open access (FUZZY)" - fuzzy licence (or more commonly absence of licence) for document or site without any guarantee of anything other than human visibility at current time. Note that "Green" open access falls into this category. It might even be that we replace the word FUZZY by GREEN, though the first is more descriptive.
I take Peter S to be saying that it's inevitable that "Open Access" will come to mean, in general use, more things to more people than strict BOAI, and we will not achieve anything by making arseholes of ourselves over it. (Even if that's not quite the way Peter S would put it, that's the way I've come to look at the situation.) There's no point in picking quarrels we don't have to have. It's enough to be more careful in our own usage, for which purposes suffixes à la Peter MR should prove very useful when we need extra precision. I don't think we need invent terms ("fuzzy") just yet -- "OA (specific licence, with hyperlink if writing online)" and "OA (free to read)" should cover most cases.

If we can get to the point where the average consumer -- basically, any researcher -- responds to an OA claim or label by asking "which licence?", we will have done an end-run around the problem of term dilution.



Thursday, 19 July
In which our hero takes his customary couple steps backwards...

In the entry below, I was not sufficiently careful to avoid Nature-bashing, or the implication that Maxine Clarke was morphing, werewolf-like, into some kind of publisher pitbull. Thanks to Pedro, bdf and RPM for responses which made this clear.

Peter Suber provides a handy roundup of Nature's OA and free-to-read offerings:

[the Current Science partnership] won't be Nature's first OA journal.  Nature and EMBO publish Molecular Systems Biology, a full OA journal, along with a couple of hybrid OA journals.  Nature publishes another hybrid with the British Pharmacological Society.  It publishes a regular series of OA supplements to its flagship TA journal, and in January of this year began offering OA to the backfiles of its academic and society journals.

In addition, Nature has a raft of non-journal OA projects, including a self-archiving policy, a data sharing policy, a neuroscience gateway, a signaling gateway, a networking site, mixed journalism and research sites on climate change and stem cells, blogs, podcasts, job listings, a news aggregator, and a preprint exchange.

[Updated after talking to Timo Hannay to include] The Cell Migration Gateway, Dissect Medicine, The Functional Glycomics Gateway, GI Motility Online and The Pathway Interaction Database
It's worth noting that Peter uses the term OA for services and projects which I would describe as free-to-read (or free-to-use), but not OA. I would welcome clarification from Peter here, as I do not feel I am in a position to argue OA definitions with someone who helped draft its founding declarations! [update: see comments]

Even on my more restrictive reading, Nature does have a couple of full-OA journals and a handful of hybrids -- not "one barely-OA journal". Further, whether or not one considers them OA, the free-to-read/use projects and services include some important and useful innovations. (The list above doesn't even include Connotea, a science-centric social bookmark manager which I use myself.) Nature is head and shoulders above any of its toll-access competitors in terms of web savvy and willingness to experiment, and I think it's important to recognize this whenever one (quite rightly!) criticizes them for not (yet) being Open Access.

What bothers me about calling Nature's free-to-read/use publications and doohickeys "OA" is the Low Fat/Greenwashing Problem, which Peter Murray-Rust describes thus:

Publishers blaze around "free" "choice", etc. which confuse rather than inform. For a publisher "open" and "free" are to be used like "low fat" "energy food" "healthy" as a way of legitimising current practice.
Everyone is familiar with companies which label their products "environmentally sound" or "healthy choice" when in fact they are paying only underhanded lip service to those concepts. It seems to me entirely possible that unscrupulous publishers may try the same tricks with "open access", and that the best defense is to insist on the BBB definitions. A number of commenters have wondered (can't find a link right now) whether we need another term for Open Access sensu stricto -- something like "BBB-OA", perhaps. (If you say that "be-three-oh-ay" it's not so bad.)

Let me finish, though, by pointing out that I do not wish to paint NPG as one of the unscrupulous publishers whose intentions worry me, nor Maxine Clarke as their sneaky shill. If NPG uses the term "open access" differently from me, I take that as a good-faith disagreement, and if Maxine uses the term in her employers' sense that is hardly "marketing". Specifically, I apologize for the phrase "if [Maxine] is going to start abusing [the term "OA"] as marketing for Nature", which contains an uncalled-for implication that I hope this entry will dispel.


You can get to like the taste of crow... you just have to eat enough of it...




Tuesday, 17 July
"Open Access" is not a marketing phrase and you are not free to use it as you see fit.

Peter Murray-Rust recently pointed to Paul Wicks' (Nature Networks) blog article, "Is Publisher-Lead "open access" a swindle?", which refers to PMR's recent blog series on publisher licensing and permissions barriers in hybrid OA models. In comments on Paul's entry, Jennifer Rohn pointed out

The two dedicated open-access publishers (BioMed Central and Public Library of Science) don't have these problems. People who want to ensure their articles are truly going to be open access, published by companies who have put real thought into the publishing as well as business model, might want to look there.
PMR quoted that comment, to which Maxine Clarke replied (in a comment on PMR's entry) with what looks for all the world like classic publisher anti-OA FUD:
Hello, I declare conflict of interest as I am an editor at Nature, not in itself open access but our publisher has many open access projects and products.
In response to Jennifer's point: I agree that BMC has got an OA publishing/business model and indeed business, but the PLOS model is dependent on a large grant from a charitable foundation, so the jury is still out (in my opinion). As an editor I am concerned about the archiving and the preservation of the scientific record, for example.
I note the commendable upfront COI declaration and state for the record that I do not think Maxine was consciously engaging in FUD. It is nonetheless standard operating procedure for OA opponents to link PLoS to "charity" and cast vague aspersions on the ability of OA publishers to maintain the scientific record. PLoS was intended as a flagship-cum-icebreaker for OA; breaking even financially was always a secondary objective. Nay-sayers about the viability of OA in business are invited to explain the success of (at least) BioMed Central, Hindawi and Medknow. Persons who wish to claim that OA puts the record at risk are invited to explain how a proprietary archive in the hands of a for-profit publisher is safer than PubMed Central or the wide network of repositories linked by OAI-PMH. (Again, I don't think Maxine was making such anti-OA claims, but it bears pointing out that what she did say contains clear echoes of standard FUD.)

Peter MR's response to Maxine's comment was this entry, in which Peter sets out to find the "many open access projects and products" and gets no further than did Jonathan Eisen, who praised the establishment of Molecular Systems Biology (NPG's only OA journal) only to find that in fact the MSB license is the same as CC-BY-NC-ND, which is far too restrictive to call itself OA. As Chris Surridge (of PLoS) puts it in comments on Jonathan's entry,

'Free Advertising' isn't 'Open Access' in my book.
Maxine had this to say:
Nature Precedings, several database publications, Nature Reports publications (3), Nature Network, Scintilla, online daily news service, gateways, blogs, many individual articles and collections of articles are freely available ("projects and products" as I mentioned in my comment to your earlier post. MSB is to my knowledge NPG's only formal open access journal.)
Peter responded with another post, giving the necessary background and pointing out that, excepting MSB,
...the rest of [Maxine's] list completely muddies the "open access" debate. If Nature believe that "open access" applies to any freely visible information on their site, most not peer-reviewed, many without licences and many with the publisher's copyright, then they are making my life much harder.
This is clear and unexceptionable in the context of Peter's ongoing quest for clarity in publisher OA-related policies. That context, or at least its existence and importance to the entry in question, was made clear by the entry itself, and I take ordinary netiquette to involve being familiar with an ongoing conversation before taking part. Nonetheless, Maxine again:
frankly I was not responding to anything you have written in the past few weeks, I was responding to your request to give examples of NPG's "open access" or "free" material.
This is weak at best. Peter asked for "pointers to [Nature's] open access products and the licences which they carry"; see also netiquette, ongoing conversations and. Claims of a limited response made in ignorance of context are either disingenuous or, if made in good faith, still no excuse.

Maxine continues:

It is your perogative to define terms however you like, but not your perogative to enforce other people to use the same definitions - I know what I mean by "open" or "free" content and I don't need to be told off by you for having a different definition to whatever your definition is
I don't know and I don't care what Maxine means by "open" or "free". I care what the BBB Declarations mean. Peter is not defining terms however he likes; he is working with published, widely accepted definitions. He is well within his rights to expect that other people will indeed use the same definitions: that is, after all, the point of having developed and published them. Nature does NOT have "many open access projects and products", it has one (barely) OA journal and the excellent Precedings, together with a number of commendable free-to-read initiatives (blogs, Nature Network, the various free-to-read web special collections, etc). "Open Access" is not a fuzzy buzzword that Maxine is free to define as she sees fit, and if she is going to start abusing it as marketing for Nature then she most certainly does need telling off.

Peter has apologized for being "over-brusque", which is a handsome gesture but in my opinion no such apology was called for.



Friday, 13 July
Giving Open Notebook Science a Try

Openness is spreading, one researcher at a time: Jeremiah Faith, a Boston U graduate student in bioinformatics, has put his lab notes online:

Open Notebook Science [...] is a term coined by Jean-Claude Bradley. The idea is simply that the heart of every person's research - their lab notebook - should be open to the world.

Since most of our scientific work is funded by tax payers who expect their money to be well-spent, it's interesting that openness isn't required. Science typically builds on the body of available knowledge - the more knowledge available the faster science goes. It's striking when you visit other labs in person; you see all of their unpublished work, and you know that most of their results and data won't be available to the bulk of the scientific community until a year after each particular scientific project is finished. By the time papers are in print, it's old news to the insiders. More striking is when you visit labs whose work you've thought about replicating and expanding on. It's not too uncommon to find that only one person in the entire lab is able to get the technique to work, and even for him the technique only works on Wednesdays. This type of information would be useful to know before you embark on a useless three months trying to adapt their method. But scientific publications are covered in a thick coat of high-gloss finish, making these unacknowledged difficulties hard to detect.

Lab notebooks on the other hand are flat black. As long as people keep them regularly updated, they contain the good, the bad, and the completely nonsensical results.

Today I test the waters of Open Notebook Science.

The latest version of my lab notebook is now automatically posted on J's Lab Notebook Page each night. I've been using an electronic lab notebook for two years now, so there's quite a bit of data in there - good and bad (300+ pages).

This is simply fantastic. One of the things that Open Science advocates most sorely lack is concrete examples. Doing research in public, instead of in secret, is a new and somewhat unnerving idea for most scientists; early adopters like Jeremiah are essential to take the edge off that unfamiliarity.

(It's also, to be honest, just plain fun to snoop around in someone else's lab notes! I was amused to note that Jeremiah talks to and about himself in his notebook, the same way I do -- "if I weren't so stupid I'd...", "next time load the control first, doofus", etc. I wonder if everyone does that?)



Tuesday, 03 July
FINO

Once more unto the breach, dear friends, once more: the dreaded Free Is Not Open argument rears its ugly head again. I've made my position (indeed free != Open, and the distinction matters) clear elsewhere, and was gratified recently to find PMR agreeing; now it seems that the Open Medicine editorial team takes the same position:

The Canadian Medical Association Journal (CMAJ) has just published:

Here is our response:

Although the endorsement by CMAJ's editors of open access medical publishing is welcome, we would like to take this opportunity to clarify several points raised in their commentary.1 First, there is an important distinction between open versus free-access publication. Open Medicine has not only adopted the principle of free access, that is, making content fully available online, but endorses the definition of open access publication drafted by the Bethesda Meeting on Open Access Publishing. This definition stipulates that the copyright holder grants to all users a free, irrevocable, worldwide, perpetual right of access to, and a license to copy, use, distribute, transmit and display the work publicly and to make and distribute works derived from the original work, in any digital medium for any responsible purpose, subject to proper attribution of authorship. Given that CMAJ holds copyright and charges reprint and permission fees, it is not in fact an open access journal.

In comparison, Open Medicine does not assume the copyright of our authors' work. We believe that it is only fair and just that authors retain the ownership of their work; as such, Open Medicine does not charge reprint or permission fees, and our work is available for reproduction for educational and teaching purposes without copyright limitations or charges. We use a Creative Commons Copyright License that also ensures derivative works are available through an open access forum. It is through this creative and unlimited use of published material, with due attribution, that we believe scientific discourse can flourish. This truly open access forum also has a contribution to make to a journal's integrity, independence, and freedom. [...]

Chris Surridge of PLoS also agrees, and supplies an excellent analogy:
Free Access to scientific research is great, and all publishers who make their content free to read should be praised for doing so. But this is not Open Access. It is like giving a child a Lego car and telling them that they can look at it, perhaps touch it, but certainly not take it apart and make an aeroplane from it. The full potential of the work cannot be realised.
Where the OM team refer to Bethesda, Chris links to Berlin and goes on to enumerate
...the four unmistakable marks by which you may know, wheresoever you go, the warranted genuine Open Access publication:

1. Content is made freely and immediately accessible to all.
This basically means that you can get it on the internet without paying anything in addition to what it costs you to access the internet.

2. Authors retain the rights of attribution.
So the work is the authors [' property]. The author doesn't sign over the copyright to the publisher or anyone else. Rather the author allows the publisher to publish the work under licence. A licence which also ensures that:

3. Content can be distributed and reused without restriction.
So I or anyone else can take Open Access content and use it, in whole or in part, for any purpose including purposes that have not yet been dreamt of as long as I don't infringe the Authors rights of attribution.

4. Papers are deposited in a public online archive such as PubMed Central.
This ensures, as best as anyone can, that the above three conditions continue to apply to the Open Access content in perpetuity.

It's been my contention that in the absence of explicit, conspicuous and machine-readable Open licensing, condition 3 is violated because in this litigious age, the conscientious and the risk-averse will not download and derive without explicit permission. I got "explicit and conspicuous" from Peter Suber:
The newer definitions [of OA] recognize one further element: an explicit and conspicuous label that an open-access work is open access. Readers should be told when a work is free of price and permission barriers. They might be reading a copy forwarded from a friend and not know whether the publisher would like to charge for access. They might want to forward a copy to a friend and not know whether this kind of redistribution is permitted. When an article has no label, then conscientious users will seek permission for any copying that exceeds fair use. But this kind of delay and detour, with non-use as the consequence of a non-answer, are just the kinds of obstacles that open access seeks to eliminate. A good label will save users time and grief, prevent conscientious users from erring on the side of non-use, and eliminate a frustration that might nudge conscientious users into becoming less conscientious.
and "machine-readable" from Peter Murray-Rust:
For me, if my robots cannot read the articles then as a human I have no interest at all in reading the "fulltext".
Peter MR is not saying that free access for humans is useless, but that to realize the full potential of text- and data-mining, OA materials need to be machine-readable, which includes letting the machines know what they are allowed to have.
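
To make "letting the machines know what they are allowed to have" concrete, here is a minimal sketch of how a robot might check a page for a declared license. It assumes the publisher marks the license with a rel="license" link, as Creative Commons markup does; the function names, the sample HTML and the "no label, no reuse" rule are my own illustrative assumptions, not anyone's actual crawler.

```python
from html.parser import HTMLParser


class LicenseLinkFinder(HTMLParser):
    """Collect href values of <a> and <link> tags marked rel="license"."""

    def __init__(self):
        super().__init__()
        self.license_urls = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr_map = dict(attrs)
        # rel may hold several space-separated tokens, e.g. "license noopener"
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        href = attr_map.get("href")
        if "license" in rel_tokens and href:
            self.license_urls.append(href)


def find_license_urls(html_text):
    """Return every license URL explicitly declared in the page (hypothetical helper)."""
    finder = LicenseLinkFinder()
    finder.feed(html_text)
    finder.close()
    return finder.license_urls


def robot_may_reuse(html_text):
    """Conservative, assumed rule: reuse only when a license is explicitly declared."""
    return len(find_license_urls(html_text)) > 0


# Illustrative inputs: one labelled page, one that is merely free to read.
labelled = (
    '<p>Article text. <a rel="license" '
    'href="http://creativecommons.org/licenses/by/3.0/">CC BY</a></p>'
)
unlabelled = "<p>Free to read, but no license statement anywhere.</p>"

print(find_license_urls(labelled))
print(robot_may_reuse(unlabelled))
```

The second page is just as free to read as the first, but a conscientious robot has nothing to go on and so leaves it alone, which is exactly the non-use that an explicit label is meant to prevent.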

I must confess that finding my thoughts echoed by such leading OA proponents makes me feel better about being, on this issue, at odds with Stevan Harnad. I simply cannot agree that Open "comes with" Free, and the distinction bothers me. It should be relatively easy to convert Free to Open -- simply add a Creative Commons or similar license -- but I think it would be better to do that proactively. If we gloss over the difference between Free and Open at this relatively early stage of OA, we risk creating a (potentially enormous) body of Free text that must be updated to include complete, useful permissions when at last we realize that Free Is Not Open. (The game's afoot: / Follow your robots, and upon this license / Cry "Free is not Open"!)



Tuesday, 05 June
Mission-critical OA!

While you're over at Attila's blog (see the entry below), be sure to read this entry about surgeons in desperate need of information during an operation. Library staff were able to provide the required paper (at 3am!), but the connection with OA is inescapable. Attila:

Even if the surgeon found the title or abstract of the paper within seconds [...] would he/she be able to download the whole (copyrighted) content somehow within minutes too without an institutional subscription referring to informational and life emergency?

Could this exceptional information and life emergency be interpreted as a basic right with complementary duties? [...] What if a perfectly targeted Google app (call it Google Emergency) would be at hand, one that would be able to transiently abandon copyright issues for the sake of human help and solidarity?

That's a fine idea, but I hope that Open Access will render it moot, and that in the not-too-distant future no special application, only PubMed or Google Scholar, will be needed.



Tuesday, 05 June
Two small steps...

Two small but (I think) profound steps forward today, the common thread being movement towards openness:

(1) Attila Csordas will be editing his doctoral thesis "live" on his blog. He won't, at least for now, be including data or unpublished discussions, but he did check with several relevant persons about the "prior publication" status of whatever he does blog (and concluded that the blogging will not present a barrier to publication). Says Attila:

...no idea on how challenging, meaningful this project, a sub-series in Pimm, will be. What I know is that continuous experimentation with genres and frames is the essence of free blogging!
It's at the heart of Open Science, as well; bravo, Attila!


(2) In reference to my earlier post about the proposal to make referees' comments public, Heather points out that PLoS One already offers reviewers the option of having their reviews published, anonymously or signed, as a discussion linked directly from the article. Kudos to Heather for opting to have her review of this paper made openly available.



Sunday, 03 June
Petition for OA to Brazilian science.

Via Stevan Harnad, a petition to establish a self-archiving Open Access mandate for Brazilian research:

Hélio Kuramoto of IBICT has helped to formulate a Proposed Law (introduced by Rodrigo Rollemberg, Member of Brazil's House of Representatives) that would require all Brazil's public institutions of higher education and research units to create OA institutional repositories and self-archive all their technical-scientific output therein.
Once established, OA does not care about national boundaries: open is open. So every institute, funding body, nation or other group that adopts an OA mandate is helping to bring worldwide 100% OA closer.

I join Stevan in congratulating Kuramoto and Rollemberg on their initiative and in urging all OA supporters to sign the petition. (I am signature #31.) Thanks again to Stevan, here is an English translation of the petition text:

To: The Brazilian Scientific Community

On May 23 of 2007, Rodrigo Rollemberg, Member of Brazil's House of Representatives, introduced Proposed Law nº 1120/2007 concerning the dissemination of Brazil's technical-scientific output.

This is a pioneering initiative for this country and indeed for all of Latin America. Brazil can become the first Latin American country to establish a legal mandate for the deposit and distribution of Brazil's technical-scientific output. This Proposed Law represents a decisive and courageous step toward providing open access to Brazilian scientific research. If approved, the Law will contribute to eliminating access barriers to scientific information worldwide. In addition to being beneficial to the national economy, the Law will allow greater transparency in Brazil's investment in its scientific research, generating quantitative metrics to guide the planning and support of science and technology.

The first article proposes that all Brazil's public institutions of higher education, as well as all research units, should be required to establish institutional repositories in which all the technical-scientific output of their academic and researcher staff must be deposited. The intention is to ensure that this content will be made openly accessible on the Web.

The article proposes creating a High Level Committee co-ordinated by IBICT to design and direct whatever actions need to be taken to provide open access to scientific research. This Committee will have the mandate to discuss and formulate a National Policy of Open Access to Scientific Research Output.

It is incumbent on all members of the scientific community to promote open access in Brazil by fighting for the approval of Law nº 1120/2007 by the National Congress.

We hereby invite all those who support the Proposed Law to sign this petition here.





Thursday, 31 May
Damn good idea.

Via Peer-to-Peer, Ariberto Fassati in this week's Nature correspondence (sorry, toll access only):

Reviewers [of scientific publications] often make significant contributions in shaping discoveries. They suggest new experiments, propose novel interpretations and reject some papers outright. [...] It is well worth keeping a record of such work, for no history of science will be complete and accurate without it.

I therefore propose that journals' records should be made publicly available after an adequate lapse of time, including the names of reviewers and the confidential comments exchanged between editors and reviewers. The Nobel Foundation makes all its records available after 50 years, as do many governmental and other institutions. This delay may be reduced for scientific journals to, perhaps, 15 or 20 years.

Now that's a damn good idea: it's long past time that reviewing got its due as an essential part of a scientist's job, and opening the records should help to generate such recognition (to say nothing of the invaluable contribution to historiography of science).

My only quibble: why 15 years? If six months is long enough for an embargo on a closed-access paper, why is it not also long enough to keep the reviews secret? I presume the idea is to prevent retaliation for harsh reviews, but if all the information is public it would take a truly dedicated holder of a truly heinous grudge to follow up (in such a way as not to get caught doing it!) after six or twelve months. More to the point, we can dramatically reduce the risk of such retaliation by changing the community attitude towards reviewing. If peer review becomes a fully acknowledged part of the job, excellence in which is respected and rewarded -- and if everyone knows their reviews will be made public! -- then low quality (gratuitously mean, ill-informed, lazy, self-serving, etc) reviews should be a thing of the past.



Saturday, 26 May
Happy (blog) Birthday!

I usually try to keep my entries in this category entirely "serious", because then readers can avoid all the personal and other clutter in this blog, so I don't do a lot of birthdays and such.

It is, however, hardly unserious to take a moment to wish a happy birthday, and many happy returns, to the indispensable Open Access News, which turns five today -- and to extend a hearty thanks and congratulations to its indefatigable author, Peter Suber.

It's safe to say that Open Access would not be where it is today, nor expanding at its current rapid rate, without Peter and his blog. So thanks Peter, and happy (blog) birthday!



Tuesday, 22 May
Another new Open Science blog.

Speaking of new faces in the blogosphere, Heather Piwowar has a new blog, Research Remix, focusing on Open Data:

... the goal of this blog is to capture my notes as I flail around learning everything I can about data sharing and re-use, with the short-term goal of writing my biomedical informatics doctoral dissertation literature review. Taking notes here out in the open in case it interests anyone else along the way.
(Link not in original.) Bravo!

In one of her first posts, Heather points to a Nature editorial (sorry, closed access) calling for psychologists to move towards Open Data:

In psychology there is little tradition of making the data on which researchers base their statistical analyses freely available to others after publication. This makes it difficult for anyone to independently reanalyse research results, and prevents small data sets from being combined for meta-analysis, or large ones mined for fresh insights or perspectives.

Psychologists need to rethink their reluctance to share data.

Heather notes that the article only glances off the really interesting question:
Does the concept of sharing data generate unnecessary angst? Does it actually generate angst, or is it mostly laziness or selfishness or fear? If angst, is the angst indeed unwarranted? To what extent does sharing data in fact lead to additional stresses for authors?

I’d love to see research into the reasons why scientists do not share data, and whether their reasons are upheld by events. This knowledge would allow us to address the underlying issues deterring authors from making their data available, which is bound to be more effective for long-term goals than simply relying on requirements from funding agencies and journals.

The article touches on what I think is the most important reason for reluctance to share:
Like many researchers in other disciplines, psychologists fear that if different analytical approaches are brought to bear on their data, different conclusions could be drawn, casting doubt on their competence — or even their integrity.
In my field (biomed), it's not so much fear of being found out in a mistake or a lie (though I bet a fair proportion are worried about being caught in "normal misbehaviour"). The real killer is ego: what if someone else gets there first? The field has become so over-competitive that many (I'd say most) researchers seek to maximize any edge they can get. Everyone seems to think their Nobel is just around the corner, and they can't bear the idea of someone else getting it -- so they're willing to let data go underutilized rather than risk having to share credit (or being done out of it).

I think Heather is right about addressing underlying issues, but it does occur to me that the same researchers who won't share their data may also be unlikely to cooperate with research into the reasons why: those reasons frequently do not reflect well on individuals or the community. In the short term, mandates are probably the only effective mechanism for getting widespread adoption of open access and open data practices over the initial hump of apathy, fear of change, selfishness, laziness and so on. In the long term, I hope that as the mandates take effect, the increased efficiency of open science -- of collaboration over competition -- will become apparent, and the nature of the scientific community will change in an ever more open direction.



Tuesday, 22 May
Open Science news

Via Jean-Claude, the Open Science world welcomes another researcher, Sivappa Rasapalli of Totally Retrosynthetic. This is great news, since one of the primary obstacles to wider acceptance of Open Science ideas is the lack of working examples (real research, not just blabber on a blog like mine). In addition to the blog, Shiva also has a wikispace for his research proposals, and (when he is in a position to do so) plans to publish his research results openly as well. In his own words:

Basically, I want to
1. Avoid unnecessary duplication (thus protecting the ideas)
2. Reap the expertise of chemists out there thus improve the ideas further
3. Collaborate with researchers willing to try the ideas and give the credit
4. Help the folks with the research ideas, but no opportunity to execute them.

So feel free to pitch in and voice your opinions on the ideas.
So get on over there, O my tens of readers, and lean those giant brains of yours against Shiva's research questions. As Jean-Claude is fond of pointing out, what better way to get credit for your idea than to collaborate in real time with an Open Science advocate using documents "registered with third-party time stamps and efficiently indexed by the most popular search engine in the world"?

And besides, collaboration is fun. Discovery is the addiction that drives research -- it's the crackpipe hit, the rush, the thrill, that keeps us going through the down times and the plodding; but one of the best ways to alleviate the boredom and despondency that set in between fixes is to collaborate. Not only does it bring fresh perspectives and ideas, it reminds us that we're not in this alone.

(If you read my last post, that might seem at odds with the views I expressed there. What can I say? I have my bad days. But even on the worst of 'em, it's the possibilities of Open Science that keep me from throwing up my hands and leaving research altogether.)



Monday, 14 May
Another "why didn't I think of that?" moment.

Rich Apodaca provides my daily dose of "smack self in forehead":

Recently, I attended a talk given by Max Levchin, co-founder of PayPal, on the subject of product design. In it, he advised those seeking to create a successful startup to build products designed to enable users to commit one or more of the Seven Deadly Sins.

His reasoning was simplicity itself. The Seven Deadly Sins were those activities so universal, that people needed to be threatened with all kinds of bad things if they did them. Looking at it from a detached, secular perspective, most people seem hard-wired to want to commit one or more of the Seven Deadly Sins - repeatedly and without encouragement. Looking at it from a product designer perspective, cha-ching!

See Rich's post for a concise summary of the Seven Scientific Deadly Sins, and why they are not necessarily sins at all; the take-home point is this:
Why does any of this matter? For the simple reason that information technology and economics are in the process of rendering obsolete existing models of scientific publication. To build the systems of the future, it's essential to understand the motivations of those using the current one.
Rich is exactly right. Scientists have all kinds of reasons for publishing, and the particular exigencies of research mean that the nobler impulses tend to be pushed to the back of one's mind -- at the practical, day-to-day level, it's the Sins that win. This strikes me as an insight that open access/open science advocates would do well to keep in mind.



Wednesday, 25 April
Every time a traditional publisher puts their foot in it, an angel gets its wings.

Zuska alerted me to Shelley's recent run-in with Wiley, one of the big 7 -- or is it 6 now? -- science/tech/med publishers. Shelley reviewed a recent article in the Journal of the Science of Food and Agriculture (no link -- what would be the point, they won't let you read most of it), and in doing so reproduced a chart and one panel from one of more than 10 figures. Rather than see this as fair use and damn good publicity, Wiley sent a nastygram:

Re: Antioxidants in Berries Increased by Ethanol (but Are Daiquiris Healthy?) by Shelly Bats

http://scienceblogs.com/retrospectacle/2007/04/antioxidants_in_berries_increa.php

The above article contains copyrighted material in the form of a table and graphs taken from a recently published paper in the Journal of the Science of Food and Agriculture. If these figures are not removed immediately, lawyers from John Wiley & Sons will contact you with further action.

Regards,

[redacted]
Editorial Assistant
Journal of the Science of Food and Agriculture
Society of Chemical Industry
14-15 Belgrave Square
London UK
SW1X 8PS

T: [redacted]
F: [redacted]
E: [redacted]
W: www.soci.org

SCI - where science meets business

Register with Wiley Interscience to sign up for free contents alerts to SCI journals (Journal of the Science of Food and Agriculture, Journal of Chemical Technology and Biotechnology, Pest Management Science and Polymer International) by email. Visit http://www.interscience.wiley.com/alerts

Note that the flack doesn't even bother to spell Shelley's name properly; and can you believe that marketing boilerplate bullshit at the bottom there?

Shelley got around this hassle by re-creating the necessary figures for herself, but as she rightly notes, the point of science publishing is to disseminate information, not to threaten grad students who happen to be interested in a particular paper. Except that for Wiley, the point is profit, and apparently you do make that by threatening grad students. (Cue more flacks in my comments squealing about how Wiley is "your partner in research" or somesuch. Save your breath, weasels.)

Here's the bottom line: if you're a researcher, publish in Open Access journals whenever possible, and if you absolutely have to publish with a toll-access journal then use an Author Addendum to retain copyright in your paper and in your data, and deposit your article in an OA repository just as fast as you can find one to take it. Until the research community stands up and says "enough", we will continue to be held hostage in this fashion by greedy, oversized corporations -- but the good news is, we need only reach out and take that power back. In the Gutenberg era, publishers had leverage; in the Google age, they have none.

If this kerfuffle is the first you've heard of, or really thought about, Open Access publishing, please read Peter Suber's brief introduction or more detailed overview. If you have serious stamina/interest/masochistic tendencies, you could also read my 3QuarksDaily series on Open Access/Open Science (part 1, part 2, part 3).

Update: Shelley got a pretty standard-issue non-apology apology from further up the foodchain, and (having neither the time nor the money to waste on pursuing this further) is content to let it rest there. So, Shelley now has permission to reproduce the figures in question and no threat of attack lawyers, and Wiley has a public black eye; seems about right to me. Per Shelley's request, and because apparently some of the letters she received were less than polite, I've redacted the original flack's name and contact details above. (Obdisclosure: I wrote, but I was polite -- although I included a link to this entry, which isn't.)



Saturday, 07 April
Another early-career scientist goes on the public record intending to do open science.

I forgot to blog about this article in The Scientist when Bora first linked it, but now Jean-Claude has reminded me. The main focus is on Reed Cartwright's adventures in authorship (and do go read that link; it's a nice example of how science should work, and Comai is a class act), but Bora and Jean-Claude also get a mention; they've posted the relevant excerpts on their blogs. The bit that really grabbed me, and that I meant to write about, was this quote about/from Bora:

Zivkovic concedes that he has had less luck in convincing people that he should post his dissertation on his blog before he publishes it [than in convincing them to publish orphan data]. "But if and when I get to having my own lab I'd like to be completely open," he says, "having a live blog where everyone posts what happens in the lab every day."
Bravo, Bora! I've said the same thing, here and elsewhere, and of course Jean-Claude is actually doing it. It makes me wonder, who else is out there, hoping and planning to do open science? In comments here, Propter Doc (I wish I'd thought of that nick!) wishes there was a way to publish orphan data in the open (and Jean-Claude points to a couple of possibilities, including blogging). I have previously pointed to some other examples: bioinformatics work from Sandra Porter and Pedro Beltrao; chemoinformatics software from Egon Willighagen; organic syntheses from Org Prep Daily; and Rosie Redfield and her students blogging hypotheses, thinking-out-loud and even data. I recently noticed that Jonathan Eisen had started blogging his OA papers (reminding me that I must get my professional back catalog, such as it is, onto a repository somewhere).

There must be more. Who else is doing, or planning to do, open science? And further, how can we help each other?

My working hypothesis is that open, collaborative models should out-produce the current standard model of research, which involves a great deal of inefficiency in the form of secrecy and mistrust. Open science barely exists at the moment -- infancy would be an overly optimistic term for its developmental state. Right now, one of the most important things open science advocates can do is find and support each other (and remember, openness is inclusive of a range of practices -- there's no purity test; we share a hypothesis not an ideology).

So talk to me, putative ally and colleague! Who are you, where are you, how can I help you? I sure would like to hear from you.



Sunday, 18 March
Open Access Conference (Call) Series

If you are interested in Open Science, the following may be of interest. Chemists Without Borders is hosting a series of conference calls on Open Access, Open Source:

Thursday, April 5, 9:00 a.m. Pacific Time / Noon Eastern Time
Heather Joseph: Federal Research Public Access Act

Heather Joseph, Executive Director, Scholarly Publishing and Academic Resources Coalition (SPARC), will talk about the Federal Research Public Access Act (FRPAA). FRPAA is anticipated to be re-introduced this spring. The purpose of this bill is to require all U.S. Federal research granting agencies with portfolios of over $100 million (11 agencies altogether) to develop policies requiring open access to the results of the research they fund. FRPAA has been endorsed by many higher education leaders and the Alliance for Taxpayer Access. Chemists Without Borders is a member of the Alliance for Taxpayer Access; should we support FRPAA?
More information about FRPAA can be found on the SPARC website.
As the Executive Director of SPARC, Heather Joseph is very involved in advocacy for FRPAA. Before joining SPARC, Heather worked for many years in the publishing industry, and was formerly Executive Director of the BioOne publishing cooperative.

Thursday, June 7, 9:00 a.m. Pacific Time / Noon Eastern Time
Peter Suber: Open Access Questions & Answers

Peter Suber, Open Access Project Director, Public Knowledge Project, author of Open Access News
Peter Suber, one of the world's leading academics in the area of open access, will join Chemists Without Borders for a question and answer session on any aspect of open access.

Thursday, September 6, 9:00 a.m. Pacific Time / Noon Eastern Time
Jean-Claude Bradley: Open Source Chemistry

Chemists Without Borders' own Jean-Claude Bradley, Coordinator for E-Learning at the College of Arts and Sciences at Drexel University, will talk about the Useful Chemistry approach to open source chemistry, founded by Bradley.
Importantly, you don't have to be a member of CWB to participate:
Not a member? No problem - contact us and let us know you would like to participate. There is no charge, other than regular long distance rates, to join the teleconference.
I've asked, via the web form, whether I might participate in the conference call series; I'll update this entry when I hear back.


Update as promised: I took part in the July 5 call, and will try to make one of the bi-monthly calls from here on out. Initial impressions are all positive: these are people who are genuinely trying to make the world a better place. If you're interested in taking part, email me as well as using the web form, since there were a couple of snafus in getting me signed up.



Monday, 22 January
The Future of Science is Open, Part 3: An Open Science World

The third of my columns on Open Science is now up at 3QuarksDaily. I'm not sure why I'm bothering to announce it here, since if you read me you certainly should read 3QD (and it's not as though my teeny readership will register on the radar of a behemoth like 3QD). Still, for them as is interested:

In parts one and two, I talked about the scholarly practice of Open Access publishing, and about how the central concept of "openness", or knowledge as a public good, is being incorporated into other aspects of science. I suggested that the overall practice (or philosophy, or movement) might be called Open Science, by which I mean the process of discovery at the intersection of Open Access (publishing), Open Data, Open Source (software), Open Standards and Open Licensing (those last two being another way of saying semantic web, or Web 2.0, or whatever the kids are calling it these days).

Here I want to move from ideas to applications, and take a look at what kinds of Open Science are already happening and where such efforts might lead. Open Science is very much in its infancy at the moment; we don't know precisely what its maturity will look like, but we have good reason to think we'll like it.

You can read the rest at 3QD; as always, I won't repost and comments are off here so as not to split the conversation. In particular, please speak up if I've got something wrong, or missed something out.

open access/open science | Bill Hooker | 22 Jan, 2007


Monday, 08 January
Error notice.

I goofed. In my draft Open Data Addendum entry, I said:

Now remember that these highly unsatisfactory examples are drawn from the most prominent Open Access publishing houses, which might be expected to be much more supportive of Open Data than commercial publishers.
This implies, wrongly, that OA publishers are somehow not, or non-, commercial. I think BioMed Central, Hindawi and Medknow might all have something to say about that! As Peter Suber points out in his summary of OA developments in 2006:
Both the Hindawi and Medknow OA journal collections became profitable, an industry first. All the Hindawi OA journals use author-side fees and none of the Medknow journals do so. Together, therefore, they elegantly answered doubts about the business models for fee-based and no-fee OA journals.
It's actually a fairly common FUD tactic from OA opponents to claim that OA journals will never realize profits or even support themselves (so OA is going to destroy all academic publishing and the world will end in flames, etc.). This is, of course, nonsense, and I'm sorry to have lent it unwitting support. I've changed "commercial" to "traditional" in the offending paragraph, and linked to this entry.



Friday, 05 January
open science, data on blogs

Three things, file under "loosely related":

1. If you read my RSS feed, 'scuse the deluge; I went back and assigned a number of posts to the "open access/open science" category.

Speaking of old posts, here are two that never made it out of the "drafts" folder:

2. Peter Suber notices Jean-Claude Bradley's chemistry data blog and points to an entry on JCB's e-learning blog that I've been meaning to highlight:

I think that the part [of open access/open science] that we have yet to embrace is the posting of work fresh out of the test tube. As long as scientific research is published in an article format and its value is determined by a popularity contest of citations and peer-reviewed blessing, there will be little motivation to post work fresh out of the test tube. Especially when issues like competition and tenure are at stake.

The reality is that the impact of raw experimental data is usually unknowable at the time when it is generated. It may never be used by anyone (which is a guarantee if kept in a private lab notebook) or it may at some point answer a key question for an agent (human or otherwise) looking for a solution to an important problem.

My opinion at this point is that publishers or any kind of central repositories are not going to be as effective in communicating this kind of raw scientific data, unless it is readily available on the uberdatabases like Google or MSN. That's why Blogger makes an optimal vehicle to communicate raw experimental data: no cost, no gatekeeper and anyone looking on an uberdatabase will find your stuff.
Update: in comments, Jean-Claude points out that in fact, blogs are better for reporting milestones, overviews and so on. For the fresh-out-of-the-test-tube stuff, he's moved to a hosted wiki which provides version tracking with 3rd party timestamps. These features provide proof of priority, in case it is ever needed.

3. Blogging data/ideas: it's not just for science. Here's Rob Helpy-Chalk doing it in philosophy: I just had a think, and My presentation at the ISEE/IAEP.



Monday, 01 January
OA/Open Science resolutions.

Taking my cue from Jonathan Eisen, herewith the things I plan to do this year to the benefit of Open Science:

1. Get my act together in the lab and publish some quality papers in OA journals, complete with Open Data (even if I have to cobble something together to provide the data).

One of the most important things researchers can do is to increase awareness of the issues by making OA-centric choices with their own work. Jonathan's entry brings to mind the difference between what he -- that's Professor Eisen, with a CV as long's your arm! -- can do for OA, and what I can do. I think it could also be useful to have a lowly postdoc publicly choosing OA journals, refusing to deal with Elsevier, and so on. I've heard a number of colleagues say that such choices are the sort of thing they will put off "until after tenure" -- and I suppose Jonathan has heard "well, it's OK for you, you have a lab and tenure and so on, the risk is lower for you". Thing is, I don't think these choices add up to a risk. There are clear advantages to having my work available under Open Access conditions, and I think similar advantages will accrue as a result of my willingness to provide Open Data (and, when I can get colleagues to agree, Open Notebook access to my work). I think I've said this before, but I view it as a sort of experiment. My hypothesis is that Open Science will be good for my career, and there's only one way to test it! (I know, no control, yadda yadda. Call it "money where my mouth is" if you prefer.)

The rest of these are swiped from Jonathan's list, and from Peter Suber's "what you can do" list:

2. Find an OA, OAI-PMH-compliant repository for my existing postprints and future pre/postprints. In the case of published papers, I think I can get 'em into ePrintsUQ (as discussed here). In the case of future papers, I've already made tentative contact with the relevant people where I work, and I'm going to try to get an IR up and running. Further possibilities to discuss: everything on Peter Suber's list for administrators.

3. Review papers for OA journals (or do anything else they ask me to, pretty much), but for non-OA journals, decline and explain. (One exception: the boss sometimes asks me to pre-review a paper he's agreed to review, so as to speed up the process for him. I'll do that no matter which journal it is.)

4. Find a way to work at least a quick push for OA/Open Science into every presentation.

5. At least ask the administrators of any conference or meeting I attend about providing Open Access to proceedings.

6. Discuss OA/Open Science with colleagues (note to self: avoid hectoring!).

7. Discuss OA/Open Science with everyone; use blog for same. As Jonathan notes, public support is going to be necessary to get mandates and such working.

8. Sign the BOAI (you can do this as an individual, whereas Bethesda is closed and Berlin only open to organizations).



Sunday, 31 December
Does the "green road"1 lead off a cliff?

Further to my complaints about the copyright thicket in which data are being lost, Charles W Bailey Jr points out that, in fact, it's worse than that: a good deal of the potential functionality of existing Open Access archives is jammed up in the same thicket:

If... repositories could not be trusted, then libraries would have to attempt to archive the postprints in question themselves; however, since postprints are not by default under copyright terms that would allow this to happen (e.g., they are not under Creative Commons Licenses), libraries may be barred from doing so.
(Emphasis mine.) Charles is talking about the question of whether or not self-archiving of scholarly articles (the "green road" to Open Access) will cause libraries to cancel journal subscriptions. I touched on this issue in an earlier entry, and don't want to revisit it here. What interests me here is the fact -- which I initially had trouble grokking, as you'll see if you read the comments on Charles' entry, where he patiently explains it -- that digital objects in Open Access repositories carry their own copyrights, rather than being covered by a blanket license provided by the repository.  For instance, PubMed Central refers to Open Access (using the Bethesda Statement), and then says:
Note that this definition of open access goes beyond the simple free access that applies to all full-text content viewable directly in PubMed Central (PMC) from the National Institutes of Health (NIH).

A number of PMC journals make all or most of their contents available as open access publications. See the Open Access list for details.
So PMC is OAI-PMH-compliant, but contains digital objects that are not themselves Open Access. I suspect the same is also true of the majority of institutional and centralized repositories (though I only checked ePrintsUQ, arXiv.org and Cogprints, none of which make any mention of copyright at all).
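
To make that concrete, here is a minimal sketch (mine, not anything PMC or the other repositories provide) of how a harvester might look for a rights statement in a single record. It assumes the record arrives in the oai_dc format that OAI-PMH repositories expose, and that any rights statement would sit in a dc:rights element. The sample record, its identifier and the helper name rights_statements are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace for Dublin Core elements, as used inside oai_dc records.
DC = "{http://purl.org/dc/elements/1.1/}"

# A made-up record for illustration; real records come from a repository's
# OAI-PMH endpoint and will usually carry more fields than this.
SAMPLE_RECORD = """
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An example article</dc:title>
  <dc:identifier>oai:example.org:1234</dc:identifier>
</oai_dc:dc>
"""


def rights_statements(record_xml):
    """Return the text of every non-empty dc:rights element in one record."""
    root = ET.fromstring(record_xml.strip())
    return [el.text.strip() for el in root.iter(DC + "rights") if el.text]


found = rights_statements(SAMPLE_RECORD)
if found:
    print("Declared rights:", found)
else:
    print("No rights statement: permissions unknown")
```

A record with no dc:rights element tells a robot nothing either way, which is exactly the problem: the object is harvestable, but its permissions are not.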

To get an idea of what that actually means, read carefully this brief discussion by Peter Suber of the BBB definition of Open Access:
The best-known part of the BBB definition is that OA content must be free of charge for all users with an internet connection. However, the BBB definition doesn't stop at free online access. It adds an extra dimension that isn't as easy to describe, and consequently is often dropped or obscured. This extra dimension gives users permission for all legitimate scholarly uses. It removes what I've called permission barriers, as opposed to price barriers. The Budapest statement puts the extra dimension this way:
By "open access" to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.
The Bethesda and Berlin statements put it this way: For a work to be OA, the copyright holder must consent in advance to let users "copy, use, distribute, transmit and display the work publicly and to make and distribute derivative works, in any digital medium for any responsible purpose, subject to proper attribution of authorship".

All three tributaries of the mainstream BBB definition agree that OA removes both price and permission barriers. Free online access isn't enough. "Fair use" ("fair dealing" in the UK) isn't enough.
Because each digital object carries its own copyright, e-print repositories do not remove permission barriers.  Here's Peter Suber again:
Permission barriers are more difficult to discuss than price barriers.  First, there are many kinds of them, some arising from statute (copyright law), some from contracts (licenses), and some from hardware and software (DRM).  They are not like prices, which differ only in magnitude.  Second, their details are harder to discover and understand.  Third, different users in different times, places, institutions, and situations can face very different permission barriers for the same work.  Fourth, authors who deposit their articles in open-access archives bypass permission barriers even if they also publish the same articles in conventional journals protected by copyright, licenses, and DRM. 
As far as I can tell, that fourth point is simply not true of any existing archives.  If you want to do anything with an article in, say, PubMed Central, other than simply read it -- if you want to copy it and distribute the copies, if you want to make a derivative work, if you want to pass it to text-mining or other software -- you will have to determine, on an article-by-article basis, whether you are allowed to do that. 

Take, for example, the following paper from the lab I work in, available free from PubMed Central:
Deletion of Mnt leads to disrupted cell cycle control and tumorigenesis.
Peter J. Hurlin, Zi-Qiang Zhou, Kazuhito Toyo-oka, Sara Ota, William L. Walker, Shinji Hirotsune, and Anthony Wynshaw-Boris
Right above the title on the linked page is a copyright notice: "Copyright © 2003 European Molecular Biology Organization".  The link provided goes to a PMC page which makes it very clear that an article's presence in PMC tells you nothing about what rights the copyright holder(s) reserve or waive.  Searching the EMBO site for "copyright" brings up nothing useful, but the EMBO Journal (which is actually part of Nature Publishing Group) has this to say:
Nature Publishing Group does not require authors of original research papers to assign copyright of their published contributions. Authors grant NPG an exclusive licence to publish, in return for which they can re-use their papers in their future printed work. NPG's author licence page provides details of the policy and a sample form. Authors are encouraged to submit their version of the accepted, peer-reviewed manuscript to their funding body's archive, for public release six months2 after publication. In addition, authors are encouraged to archive their version of the manuscript in their institution's repositories (as well as on their personal web sites), also six months after the original publication.
Apart from the foul six-month embargo (Do you have any idea how many experiments I can do in six months?  But I digress.), this seems reasonable, and it leaves permissions up to the authors.  So "copyright EMBO" is misleading, and it's likely that EMBO J authors, having reposited their articles, wish them to be fully Open Access.  As it happens, in this case the corresponding author is my boss so I can assure you that he knows about Open Access and is all in favour. The point, though, is that you have to dig around to find out that it's up to Peter, and then you have to contact him to find out that he fully intends you to have the permissions you need. You are not going to be able to do that for more than a handful of papers; it certainly puts an effective brake on text-mining.

I think this brief example makes clear that, in practice, you cannot do anything much with repository content but read it ("fair use", of course, still applies).  You simply don't have the time to uncover the necessary permissions for anything else.  Which in turn means that there are no, or very few, actual Open Access repositories currently in existence.

I'll say it again: e-print repositories do not provide Open Access.  They provide free access to human eyes, one paper at a time; as the accepted definitions make clear, that's not at all the same thing.  Since self-archiving in such repositories is the current focus of many, if not most, efforts to provide 100% Open Access to the world's scholarly literature, this is a big deal.  There are two obvious solutions: 1, ignore the whole issue; and 2, start applying labels to digital objects.

In the short term and for individual researchers, solution 1 has considerable appeal.  There's even precedent: a recent study pointed out that patents do not slow research down much, mostly because researchers ignore them.  The majority of e-prints are probably in a repository because their authors want Open Access; the likelihood of running afoul of copyright and actually being called to account for it seems pretty low.  I think, however, that this head-in-the-sand approach is a very bad idea.  What authors want is not always what counts, as when the copyright is actually owned by a publisher.  I've been trying to think of the kinds of things you might do with a body of OA literature -- build a text-mining robot that offers novel ways to look for deep connections between ideas and among data, make a local database of papers on your research specialty, and so on -- but in fact, much of the point of Open Access is to make possible things I cannot think of.  Look what the Web has made possible, and ask yourself: how much of that could I have predicted in 1991?  It seems to me that anything which makes use of a substantial number of papers, or relies on being able to mine an entire corpus, runs the risk of being shut down or co-opted just when it starts to get interesting and useful.  Suppose, for instance, that I write that text-mining robot: while I am using it to feed ideas into my own benchwork, I'm OK, but the minute I give that robot to someone (or, as is my preference, everyone) else, I run the risk of being sued for copyright violations.

This is the same risk that researchers are already running when using patented technology without a license; you are fine until you come up with something good, but then if the patent owner notices what you've done, you can be in trouble.   "Trouble" means three things: legal sanctions, loss of the opportunity to profit from your invention, and removal of your invention from the commons.  The first seems pretty unlikely from an individual perspective -- what company is going to risk the PR nightmare of trying to recover fines from a researcher? -- but substantially more worrisome for universities and other institutions.  The "loss of profit" is of no interest to me; if I wanted to be rich I wouldn't be a scientist.  What really concerns me is the potential for patent/copyright owners to exert anti-commons, profit-taking control over research outcomes, and it's this risk that makes the Ostrich Option unacceptable to me.

In the longer term, for community-minded researchers and especially for institutions (which are typically more wary of litigation than individual researchers and, since Bayh-Dole, increasingly focused on profiting from research outcomes), solution 2 is a reasonable fix.  In principle, OA repositories could include labels (that is, metadata) specifying which uses are explicitly permitted or prohibited, so search engine users and text-mining robots could search only that portion of the database that allowed whatever rights they need.  In fact, the Bethesda and Berlin definitions of OA both include the requirement for every OA article to carry an explicit label regarding permissions.  Project RoMEO was intended to deal with precisely this issue, and produced (in addition to the valuable SHERPA/RoMEO database of publisher permissions for forward-looking authors) six surveys of the field and an XML-based implementation of the resulting rights management concepts, incorporating Creative Commons licenses. Unfortunately, there seems to have been zero uptake of the concepts or the technical implementation.  As far as I can tell there are no search interfaces which provide this kind of rights-based functionality, and every repository contains a mix of well-labelled, partially labelled and unlabelled objects.  In addition, the body of scholarly work in relevant repositories is already so large that adding the necessary rights metadata is an enormous task, one which grows larger and more forbidding by the day (I might call this the "backlog problem"). 
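
Here is a rough sketch of what a rights-based search could look like, assuming (hypothetically) that every record carried a label listing its explicitly permitted uses. The field names, the vocabulary of uses and the sample records are my own inventions for illustration; they are not Project RoMEO's schema or any repository's actual metadata.

```python
from dataclasses import dataclass


@dataclass
class Record:
    identifier: str
    # Uses the rights holder has explicitly permitted; empty means unlabelled.
    permitted: frozenset = frozenset()


def usable_for(records, needed):
    """Keep only records whose label explicitly permits every needed use."""
    needed = frozenset(needed)
    return [r for r in records if needed <= r.permitted]


# Hypothetical repository contents: well-labelled, partially labelled, unlabelled.
repository = [
    Record("paper-1", frozenset({"read", "copy", "distribute", "derive", "mine"})),
    Record("paper-2", frozenset({"read"})),
    Record("paper-3"),  # unlabelled: part of the backlog problem
]

minable = usable_for(repository, {"copy", "mine"})
print([r.identifier for r in minable])  # ['paper-1']
```

Note that the unlabelled record drops out not because mining it is forbidden, but because nobody has said that it is allowed; a cautious robot has to treat silence as "no".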

Nonetheless, the fundamental OA definitions include rights beyond simple reading access for good reason.  As I discussed in my earlier entry, rights management is going to be at the heart of Open Data, and I have argued elsewhere that licensing and standards/metadata are also going to be crucial to bringing the "openness" of Open Access to science as a whole.  I think the Open Science field is headed for some serious problems if permissions barriers are not given more attention.   I might concede that the most important thing to achieve right now is removal of access barriers to human eyeballs, but why make trouble for ourselves by -- as seems to be happening3 -- ignoring the rights issue?  There's no reason why the process of encouraging authors to self-archive, and building tools to make that easier, should not include information and tools that focus on rights management.  At the very least, we should be making authors who are already on-side, who are self-archiving and using the SPARC Author Addendum and so on, aware of the issue -- and giving them the tools to label their own papers with clear statements of the rights they wish to retain or waive. At least then the rate of growth of the backlog problem will begin to slow down, and should approach zero as we approach 100% OA (even on the green road) rather than continuing to grow unchecked.


-----
1 It's always good to get an "outside" perspective.  The Spousal Unit points out that "green road" is possibly a bad term for use in PR/advocacy/etc, as it brings to mind not OA but environmental issues.  Having read the title over my shoulder, she thought this entry was going to be interesting, but then found out that it was about the intricacies of scholarly publishing!

2 In fact, EMBO J deposits papers in PubMed Central for free access after 12 months, and most authors probably do not place copies in PMC, or anywhere else, at the 6-month mark. I know that these authors didn't, and I bet NPG is relying on this behavior to get an effective 12-month embargo. Bastards.

3 I am, of course, re-inventing the wheel here.  In comments on the entry that sparked all of this, Charles notes that he was debating this issue with OA movers and shakers before the BOAI went public.  Here's a 2005 article by Richard Poynder covering the same ground and then some.  I mentioned Project RoMEO above.  Pretty much anything I have to say about OA will have been said before, since I'm a newcomer to the field, but I write out my thoughts here in order to collect and organize them.



Sunday, 17 December
Where are the data? Can I have them? What can I do with them?

There's a new subversive proposal in town.  The original was Stevan Harnad's landmark call for self-archiving of the scientific ("esoteric") literature (see here for a ten-year update, and here for context).  Now, 12 years later, Open Access is gathering momentum and forward-looking advocates of knowledge as a public good are thinking about Open Data (some extra background here).  Peter Murray-Rust recently stepped up with a subversive proposal of his own:

The simplest thing that researchers can do [to promote Open Data] is to add a Creative Commons license to their data. It costs nothing, is a simple cut-and-paste, and could be trivially made a template in any data production tool. [...]

I think the effect of this would be dramatic. Scientists would start to see these messages and think: "Why should I give these data to the publisher?" And if the publisher simply adds a copyright notice saying "all these data are copyright the publisher - you cannot use them for X, Y, Z without permission" this would be in violation of the authors' license. The author would have to deliberately remove this statement to hand over the IPR to the publisher.

I think Peter's proposal is a good one, similar in form and effect to the SPARC author addendum.  Importantly, Science Commons also offers author addenda, and will soon offer them in the machine-, human- and lawyer-readable versions that come with all Creative Commons licenses; as Peter notes, the machine-readable version is crucial to full Open Data utility.  Use of the proposed Open Data addendum (in combination, where necessary, with an Open Access addendum) would clarify the legal status of an author's data, provided we get the wording right.  Herewith some thoughts on how to do that, based on the questions in the title.

First, note that papers do not usually contain raw (useful, useable) data. They contain, say, graphs made from such data, or bitmapped images of it -- as Peter says, the paper offers hamburger when what we want is the original cow.  Chris Surridge of PLoS puts it this way:

A figure in a paper is a way of representing the raw data in such a way to best illustrate the point the author is making. A figure then is the product of an operation upon the raw data, and that operation results in a loss of information.

The raw data could have been presented in a host of different ways possibly supporting other conclusions not thought of by the author. Equally if a reader had raw data compatible with that the author obtained wouldn't it be useful if it could be processed in the same way for comparison? Wouldn't it be much better for readers to have access not only to the figures in a paper but also to the underlying data and the transform that created it. In this way no information, neither implicit nor explicit, is lost.

So if authors want to make their data openly and usefully available, they will need to host it themselves or find someone to host it for them.  Many journals will host supplementary information, and many institutional repositories will take datasets as well as manuscripts.  I have been saying for some time that it should by now be de rigueur to make one's raw data available with each publication. This is very rarely done -- even supplementary information, when I have come across it, tends to be of the hamburger-rather-than-cow variety and so not very useful.  (The situation speaks sad volumes about the emphasis on competition over cooperation within the scientific community and, perhaps in many cases, about the quality of the raw data in question, if only one were ever able to see it; but I digress.)  Thus an effective Open Data addendum will first have to answer the question: where are the data?

Second, there is the issue of licensing ("Can I have them?  What can I do with them?").  In comments on Peter's proposal, Jonathan Eisen observes that publishing in Open Access journals should provide open access to data as well.  Peter replies that this is not always the case and points to Molbank as a problematic example, because they require a copyright transfer and it is simply not clear what rights they claim over raw data.  In fact, the situation is even worse.  In the same entry, Peter points approvingly to the BioMed Central OA charter, which is based on the Bethesda Statement:
Every peer-reviewed research article appearing in any journal published by BioMed Central is 'open access', meaning that:
  1. The article is universally and freely accessible via the Internet, in an easily readable format and deposited immediately upon publication, without embargo, in an agreed format - current preference is XML with a declared DTD - in at least one widely and internationally recognized open access repository (such as PubMed Central).
  2. The author(s) or copyright owner(s) irrevocably grant(s) to any third party, in advance and in perpetuity, the right to use, reproduce or disseminate the research article in its entirety or in part, in any format or medium, provided that no substantive errors are introduced in the process, proper attribution of authorship and correct citation details are given, and that the bibliographic details are not changed. If the article is reproduced or disseminated in part, this must be clearly and unequivocally indicated.
But what does that mean for Open Data?  Take any paper in any BMC journal: where are the data?  Can I have them?  What can I do with them?  It's true but it's simply not enough that, having published in BMC, the authors are probably amenable to giving me the data and allowing me to do with them as I please.  I need unfettered access to the data at the same time as I access the paper.  Even as a human I don't have time to chase down permission for every dataset I want to re-use, and if I'm data-mining by web crawler I need machine-readable licenses that tell my robot what it can have.  Policies regarding data and materials are journal-specific within the BMC group, but I browsed a few and it seems they all use a standard template, which includes the following:
Submission of a manuscript to [BMC Journal in question] implies that readily reproducible materials described in the manuscript, including all relevant raw data, will be freely available to any scientist wishing to use them for non-commercial purposes. Nucleic acid sequences, protein sequences, and atomic coordinates should be deposited in an appropriate database in time for the accession number to be included in the published article. In computational studies where the sequence information is unacceptable for inclusion in databases because of lack of experimental validation, the sequences must be published as an additional file with the article. [There follows a list of databases that can be used to deposit nucleotide and protein sequences and structures, chemical structures and assays, microarray data, computer models and plasmids.]
Note though that these policies are not strict demands, and I'll bet they are not policed in any way.  I think most journals include similar language in their instructions to authors, and have done for some time, but we still do not have widespread Open Data.  Further, the actual BMC license (which BMC says is identical to the Creative Commons Attribution License) refers only to "the work" which it defines as "the copyrightable work of authorship offered under the terms of this License".  That seems to me to allow an interpretation that excludes data, which sit in the grey zone between creative works that can be copyrighted and, er, things (like gene sequences and chemical structures of drugs) that can be patented.

So how about Public Library of Science and Hindawi, the other major OA publishers?  Well, Hindawi seems to say nothing about data whatsoever, only that authors retain copyright and articles are published under a CC Attribution license.  PLoS also publishes everything under a CC Attribution license, which says nothing about data, but if you dig a bit you find encouraging things in the editorial/publishing policies:
Publication is conditional upon the agreement of authors to make freely available any materials and information associated with their publication that are reasonably requested by others for the purpose of academic, noncommercial research.

Data Availability
Open access applies to both the scientific literature and the data used to establish that literature. Publication is contingent on making data integral to a manuscript freely available without restriction, provided that appropriate attribution is given and that suitable mechanisms exist for sharing the data used in a manuscript.

  1. Data for which public repositories have been established that are in general use should be deposited before publication, and the appropriate accession numbers or digital object identifiers published with the paper.
  2. If an appropriate repository does not exist, data should be provided as supporting information with the published paper. If this is not practical, data should be made freely available upon reasonable request.
  3. The conclusions of a study must not be dependent solely on the analysis of proprietary data. If proprietary data were used to reach a conclusion, and the authors are unwilling or unable to make these data public, then the paper must include an analysis of public data that validates the conclusions so that others can reproduce the analysis and build on the findings.
Note that any restrictions on the availability or on the use of datasets might be judged to diminish the significance of a paper and will therefore influence the decision about whether a paper should be published. These policies have been developed in accordance with the principles established in Sharing Publication-Related Data and Materials (National Academies Press, 2003).
That's better, stronger language -- but why is there no mention of data in the actual license, and why is there a need for warnings about restrictions that "might be judged to diminish the significance, etc" if publication is truly conditional on open access to data?  I suspect another toothless tiger.  It's not that I want the tiger to have teeth, that is, for journals to actively police data availability, but that I wonder why I have to go digging around the website just to find this wishy-washy nod in the general direction of Open Data.  To illustrate my point here, suppose I read a paper in PLoS Biology, and I want to get my hands on some raw data from that paper: where are they?  Can I have them?  What can I do with them?  All of these things are, basically, left up to the authors. 

Now remember that these highly unsatisfactory examples are drawn from the most prominent Open Access publishing houses, which might be expected to be much more supportive of Open Data than commercial traditional publishers.  Thus the power of Peter's Open Data addendum becomes apparent: it is attached directly to the paper, so readers do not have to go hunting through journal websites to find out the intellectual property status and location of interesting datasets.  It allows authors to take control.

To be effective, then, an Open Data addendum must at least answer my opening questions: it must point to the online, freely accessible location of the raw, un-hamburgered data; it should make clear that yes, you can have them; and it should state clearly what you can do with them.  The last question probably requires the creation of multiple addenda, since some people (like Jonathan Eisen) will want to effectively copyleft their data, whereas others will prefer less restrictive licenses.  My preferred answer is "anything you want, so long as you do not remove information or materials from the scientific commons".

So, finally, let me take a stab at a draft Open Data addendum.  This is based on, and largely copied from, the SPARC author addendum, and my idea is that it should, like (and if necessary with) the SPARC addendum, be submitted to the publisher together with their publication agreement.

AUTHOR'S ADDENDUM TO PUBLICATION AGREEMENT

THIS ADDENDUM hereby modifies and supplements the attached Publication Agreement concerning the following Article:

[manuscript title]

and the following Raw Data from which the Article was prepared:

[list of data sets, including permanent web address/es from which they can be obtained]

The parties to the Publication Agreement and to this Addendum are:

[list of authors, indicating corresponding author] (individually, or if more than one author, collectively, the Author), and

[publisher].

The parties agree that wherever there is any conflict between this Addendum and the Publication Agreement, the provisions of this Addendum are paramount and the Publication Agreement shall be construed accordingly.  Notwithstanding any terms in the Publication Agreement to the contrary, AUTHOR and PUBLISHER agree as follows:

1. Author's Retention of Rights. In addition to any rights under copyright retained by Author in the Publication Agreement, Author retains all rights to the Raw Data underlying the Article, including but not limited to: (i) the rights to reproduce, distribute and publicly display the Raw Data in any medium; and (ii) the right to authorize others to make any use of the Raw Data so long as Author receives credit as author and the journal in which the Article has been published is cited as the source of first publication of the Article and Raw Data.

2. Licensing of Raw Data.  Author hereby releases the Raw Data under the terms of a Creative Commons Attribution Share-Alike License [or insert whatever license you prefer], where "the work" is understood to mean the data sets listed above.  Publisher agrees to include in the Article this statement of licensing terms and the above list of data sets and web address/es from which they can be freely obtained.

3. Publisher's Acceptance of this Addendum. Author requests that Publisher demonstrate acceptance of this Addendum by signing a copy and returning it to the Author. However, in the event that Publisher publishes the Article in the journal identified herein or in any other form without signing a copy of the Addendum, Publisher will be deemed to have assented to the terms of this Addendum.

That's not perfect, not by a long shot -- most especially not for automated data mining, which requires machine-readable metadata and data. It should, however, do what Peter suggests: provide some relief from endless rounds of find-the-permissions, and get a much-needed conversation underway.
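
As a rough illustration of what "machine-readable" could mean for the addendum above, its list of data sets and its licensing statement might be expressed as a small JSON record. This is only a sketch under stated assumptions: the field names (`article_title`, `license`, `datasets`, `name`, `url`) and the helper `addendum_record` are hypothetical, invented here for illustration, and are not part of the SPARC addendum or of any existing metadata standard. The bracketed values are the same placeholders used in the draft.

```python
import json


def addendum_record(article_title, datasets, license_name):
    """Build a machine-readable summary of an Open Data addendum.

    All field names are hypothetical, chosen only for illustration.
    `datasets` is a list of (name, permanent web address) pairs.
    """
    if not datasets:
        raise ValueError("an Open Data addendum needs at least one data set")
    return {
        "article_title": article_title,
        "license": license_name,
        "datasets": [{"name": name, "url": url} for name, url in datasets],
    }


# Placeholder values, mirroring the bracketed fields in the draft addendum.
record = addendum_record(
    "[manuscript title]",
    [("[data set name]", "[permanent web address]")],
    "Creative Commons Attribution Share-Alike",
)
print(json.dumps(record, indent=2))
```

A record like this could sit alongside the human-readable statement that the addendum asks the publisher to print, so that a harvester can answer "where are the data, and what may I do with them?" without parsing prose.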



Sunday, 03 December
The bottom line, and an idea.

Relatively new addition to the blogroll Glyn Moody points out the bottom line in all "intellectual property" issues: it's not property, and anyone who tells you otherwise is lying for profit:

A very interesting transcript of a conversation between Reuters and Warner Music Chief Executive Edgar Bronfman. The latter [...] is revealed for what he is when he slips in the Big IP Lie:

Intellectual property is intellectual property, whether it's in the form of an avatar or a song or any such thing. These are the creations of someone's mind, and it's property as real as real estate.

No, Ed, no, no, no. What you call "intellectual property" is really an intellectual monopoly: it is a limited privilege, granted by the state, to encourage creativity. It is not property, however much you might like to claim it implicitly. It is a bargain, with a quid pro quo: it has to allow reasonable "fair use", and it has to be given up after a reasonable time. You and your industry seem to have forgotten both aspects.

(Quite a lot elided there, so do read the whole post.)

From there to an idea: Glyn pointed to Moving To Freedom, where I found Scott pointing back to Glyn's The Great Software Schism and sideways to his own thoughts on Free vs Open Source. I had a section on this in my "open access" essay for 3QD, but I cut it out in the interest of brevity, because the open source section was just there for background and I assumed most people reading 3QD would be at least somewhat familiar with it. It went like this:

Richard Stallman started the GNU Project in 1983/4 as a reaction against the rising influence of proprietary software, and a year or so later founded the Free Software Foundation, which "is dedicated to promoting computer users' rights to use, study, copy, modify, and redistribute computer programs."  What Stallman and the FSF mean by "free software" is famously summed up by the dictum, "free as in speech, not free as in beer"; more precisely, they mean "free" as in:

  • The freedom to run the program, for any purpose
  • The freedom to study how the program works, and adapt it to your needs
  • The freedom to redistribute copies
  • The freedom to improve the program and release your improvements to the public

Access to the source code is a precondition for these freedoms, and many advocates prefer that the "four fundamental freedoms" also be combined with some form of copyleft (basically a licence which explicitly disallows use of the original resource in any way that restricts the four freedoms for anyone else).

About a decade later the Open Source Initiative appeared, offering itself as a "more pragmatic" approach to free software.  The two definitions are pretty similar, though the OSI version allows some licensing that the FSF considers too restrictive of end users.  A common view of these two groups is that Open Source is a development methodology, whereas Free Software is a social movement.  (You can, if you care to really get into it, read Stallman on why free is better than open source and the OSI on why the term "free" is too ambiguous.  Oy.  Wikipedia is good on all of this if you want more details: open source, open source software, free software.)

So anyway, if you're not familiar with the "schism", there's some background. I've argued that the same sort of openness as brought to mind by Free/Open Source Software is vital for the future of science, and since a movement needs a name I've tentatively proposed Open Science as the banner under which open access, open data, open standards, open licensing and open source might assemble to their greatest mutual benefit. As it happens though, one of the earliest movements towards what I am calling "open" science was called the Free Science Campaign, run by Stefano Ghirlanda. (The page is offline now. I ran across it while doing my graduate studies, and it is an enduring regret that I never signed up.)

Here's the idea, then, for all that it opens up an awful can of worms: should we be calling the campaign to free up scientific information (text, data and software) "Free Science", for the same reasons Stallman insists on "Free Software"?

It would be rather too much to just toss that out there, so here's my view. While I have great sympathy with Stallman's arguments in favour of Free, and am personally committed to do as much of my science completely in the open as I can, I know my tribe. Scientists are a cynical, self-interested lot. For instance, I was scoffed at for recommending BioRoot to colleagues -- the whole idea of sharing tends to be seen as naive, asking to be taken advantage of. It's been my experience that the first response of most scientists to any "open" scheme (like BioRoot, or Open Notebook Science) is not "how cool!" but "what about bad actors? how will you keep from being robbed?". (Which says something about what the culture of science does to a person, but I digress.) To my mind, this largely explains why BioRoot hasn't taken off as I would have hoped/expected, and is something of which to be wary. I am concerned that "Free Science", particularly if explicitly connected with "idealistic" Stallman (as contrasted with the "pragmatic" OSI), might meet with a chorus of sneers from the people who need it most. So for now, I think we should stick with "Open Science".



Monday, 27 November
The Future of Science is Open, Part 2.

The second instalment of my series of posts on the future of science is up now on 3QuarksDaily. (Part 1 is here.) Conversation is already beginning in the comments there, so comments are off here and I won't cross-post.

open access/open science | Bill Hooker | 27 Nov, 2006 | | [Trackbacks](0)


Friday, 24 November
An odd oversight on the part of J Neurosci

For the last few months, the Journal of Neuroscience has been hosting a series of articles on Open Access and the future of scientific/scholarly publishing. Laudably, they are all freely available; inexplicably, you have to search for them one-by-one. The big red "Free Articles on Open Access" link on the front page goes to the editor-in-chief's editorial introducing the series, but the list of articles is not hyperlinked to the articles themselves.

So here, without further ado, is the list of articles complete with links to the full text:

Sept 6 Why Open Access to Research and Scholarship? John Willinsky

Sept 13 Will Research Sharing Keep Pace with the Internet? Richard K. Johnson

Sept 20 As We May Read. Paul Ginsparg

Sept 27 Reinventing the Biomedical Journal. Richard Smith

Oct 4 Open Access and the Future of the Scientific Research Article. Matthew Cockerill and Vitek Tracz



Wednesday, 15 November
Open Question on Open Access

In a comment on Scott's recent entry (discussed below), Mark D makes a good point, one that I've touched on previously and that bears repeating:

The problem is, I haven't seen any hard data that documents the cost of peer review, redaction, and publishing. Everyone throws numbers around as if they were confetti. We are all, supposedly (publishers and librarians) in the scientific/technical community, yet so very few people take a scientific approach to this issue.

The first step on the road to open access, should be a review of the processes and costs associated with scientific publication. Sounds like a good paper for the library association journal. Any librarians out there that want to tackle this paper?

And as for the publishers, if they really do wish a dialogue, then why don't they reveal their redaction costs? Any takers out there in the publishing world?

Online publication dramatically lowers costs relative to printed journals, but it is not free. Copyediting is still required, peer review must be co-ordinated even though the actual reviewing is done by authors for no charge, and the digital objects (articles, data, etc) must be created, archived and maintained in an accessible format. There are surely other important costs, too, that do not occur to me right now. All of this costs money, but the Big Question of OA is: how much money? According to a recent survey, publishers experimenting with optional OA charge author-side fees ranging from $85 to $3000, while fewer than half of full-OA journals charge any author-side fees at all (Peter Suber has a good discussion of no-fee OA here). Alternative revenue streams listed include member dues (e.g. for journals published by scholarly societies), industry support (I think this means/includes advertising), third-party licences, grants and subscriptions.

So, an open question: just what does it cost to run an OA scholarly journal?

The Public Library of Science charges $2500 for an article in its flagship journals PLoS Biology and PLoS Medicine and $2000 for its second-tier journals, about which they say this:

Ultimately, the fees that PLoS charge reflect the costs associated with publishing. We are not in this to make a profit - our bottom line is to make the literature a public resource.
According to this article, PNAS articles cost "up to $3800" to publish. BioMed Central charges about $1400 per article in most of its OA journals, with a few under $1000 and about a dozen in the $1500 - $1800 range. BMC also maintains a helpful comparison table of author-side fees, which shows that BMC is one of the less expensive options, with typical charges elsewhere in the $2000 - $3000 range. Hindawi is a fully OA publisher whose business model is based on page charges of about $60-$120/page (say around $500 for a typical article).
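
To put the quoted fees side by side, here is a minimal sketch that turns a per-page charge into a per-article range and lists it next to the flat fees. The dollar figures are the ones quoted above; the page count is an assumption (the post gives no typical article length), and the names `fees`, `page_charge_range` and `ASSUMED_PAGES` are made up for this example.

```python
# Flat per-article author-side fees quoted above, in USD.
fees = {
    "PLoS flagship": 2500,
    "PLoS second-tier": 2000,
    "PNAS (upper bound)": 3800,
    "BMC (most journals)": 1400,
}


def page_charge_range(pages, low_per_page=60, high_per_page=120):
    """Return the (low, high) total fee for an article billed per page."""
    return pages * low_per_page, pages * high_per_page


ASSUMED_PAGES = 6  # assumption: a typical article length is not given in the post
low, high = page_charge_range(ASSUMED_PAGES)
print(f"Hindawi, {ASSUMED_PAGES} pages: ${low}-${high}")

# List the flat fees from cheapest to most expensive.
for name, fee in sorted(fees.items(), key=lambda item: item[1]):
    print(f"{name}: ${fee}")
```

With six pages assumed, the per-page model brackets the "around $500" figure; a different page count shifts the range proportionally.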

I mention Hindawi specifically because they are already showing profit, and because of a recent comment by their senior publishing developer Paul Peters on the Nature newsblog. Responding to an article by Declan Butler (toll-access! see Declan's blog) focusing on PLoS finances and entitled "Open-access journal hits rocky times", Peters wrote:

Based on our experience as a publisher of both subscription-based journals and author-pays open access journals, I would not only argue that the author-pays publishing model is sustainable, but also that it has many economic advantages over the subscription model. Even though our open access journal collection is only a few years old, we have already achieved profitability for the collection as a whole. [...]

Opponents of open access publishing will most likely use the financial information that is available about the Public Library of Science to defend their stance that the author-pays business model is unsustainable. However, drawing conclusions about a business model based on the financial records of a single non-profit organization, whose stated purpose is that of an advocacy organization, seems like a rather weak argument. It is much more telling to look at a commercial publisher like Hindawi and ask why we would employ an author-pays business model, since our main objective, like that of all commercial enterprises, is financial success.

The emphasis is mine; and yes, it would be very informative to see inside the finances of a variety of OA publishers. Knowing what publishers charge, as above, does not tell us what it actually costs to run the journals. Beyond saying "we are showing profit", Hindawi does not seem to be forthcoming on that issue. I take it as read that for-profit ventures charge what the market will bear, but when the market in question is largely scientists and their allies (librarians, clinicians, &c.), it seems logical that the market should look for data on which to base decisions about just what it will bear. Commercial entities rarely have open-access balance sheets, but perhaps OA publishers could take the lead there as well?


Update: Peter Suber has some sensible things to say about this.



Wednesday, 15 November
Congratulations to Peter Suber

It seems just a trifle odd that a for-profit, mostly toll-access publisher has created a one-time special award for a "Non-Librarian Working for Our Cause" in order to recognize Peter Suber "for his excellent work in managing the influential SPARC Open Access Forum (blog) and the Open Access Newsletter".

Nonetheless, there you have it, and congratulations to Peter, who richly deserves all OA-related honors that come his way.

Hat-tip: Heather at OAL.



Tuesday, 14 November
Does this make me Orthodox?

T. Scott Plutchak describes himself as an OA heretic ("Martin Luther continued to believe in Jesus...") in decrying what he calls the "strong moralistic approach" to Open Access advocacy. He and I disagree fairly extensively (hence the entry title); to wit:


When one takes the strong moralistic approach, the open access all or nothing approach, and treats it as if it is the most important issue in scholarly publishing, then one is essentially absolved from the difficult consideration of social costs. If one feels that the social benefits of open access are clearly and completely overwhelming, then one is compelled to push for whatever solutions might point in that direction and let the chips fall where they may. But to righteously ignore the fact that some of those chips may fall very heavily indeed is irresponsible.
1. It's hardly fair to equate ethical arguments for OA with an "all or nothing" approach and to set up "if one feels that" strawmen, particularly if you're going to complain in the same entry that others' rhetoric "has been extremely damaging to the entire discussion".

2. Of the issues of which I am aware, OA is the most important issue in scholarly publishing, being at once one of the most pressing and one of the most readily solved. I'd be interested to hear of any that are more important. That being said, the most important issue is not the only important issue, and there is no shortage of reasonable OA advocates happy to acknowledge that.

3. "Social costs" is frequently shill-speak for "loss of profits" on the part of publishing companies. My heart does not bleed at this prospect. (Scott, I hasten to add, means something different -- he goes on to talk about re-allocation of research budgets to cover publishing costs; see below.)


In a mom and apple pie kind of way, the statement that taxpayers should have immediate access to the results of federally funded research is trivially true. But this could easily be met by having scientists write up the results of their work and post it to publicly available websites. This, however, is clearly not what those who are making the argument would be satisfied with -- they still want the benefits of the peer review and editing processes that are part of the publication system and that are not, under the traditional system, paid for by the taxpayers. It is the subscription system that currently pays for those added benefits.
1. The "mom and apple pie" clause and the word "trivially" are pure snark here. The statement is simply true.

2. What proportion of the libraries and institutions that form the bulk of the subscription market are publicly funded? I'd be surprised if, even in the US, the majority of toll-access revenues were not readily traceable back to public coffers. Moreover, the return on whatever public dollars are being spent on subscriptions could be increased by using that money to pay for OA. Yes, research and scholarly publication are separate costs, but it's a mistake (or a convenient strawman) to claim that the taxpayer access argument conflates them. It does not; Peter Suber has gone over this in some detail in the SPARC newsletter, issue #65.


...having funders pay for publication costs [...] seems perfectly reasonable and logical to me. It is not, however, without social costs, and the blithe response on the part of the advocates, who dismiss the concern about costs by saying it is such a tiny portion, maybe 2% or so, of overall NIH funding, is simply not sufficient. At a time when the NIH budget is flattening and competition for grants is becoming tighter and tighter (at present, NIH is funding just under 20% of approved applications), and promising young scientists are leaving academic careers because they're not able to get that all important first grant, shifting even 2% of the budget toward publication is not a trivial matter. Open access advocates need to do a much better job of making a compelling and detailed case for why the benefit is worth the cost.
1. OA advocates have put forward a great many compelling and detailed arguments regarding the benefits of OA; see also Peter Suber's response to Scott.

2. A 2% shift in a $25 billion budget is not a trivial amount of money, nor are the careers potentially cut short trivial, but we are not talking about absolute amounts and feel-good (or feel-bad) personal stories. We are talking about policy decisions at the federal government level. I stand a fair chance of being one of the young scientists (I scruple to describe myself as "promising"!) who will fail to establish an academic career as a result of tightening budgets. From that precipitous perspective, let me state for the record: if one of the costs of widespread OA is my research career, then so be it: the needs of the many, and all that. I got into science in the first place to try to make the world a better place.
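
For scale, the 2% figure works out as below. This is arithmetic only, using the $25 billion budget and 2% share mentioned above and the $2500 PLoS flagship fee quoted in an earlier entry; dividing one by the other is an illustrative assumption, not an estimate of how many articles NIH-funded research actually produces. The variable names are hypothetical.

```python
NIH_BUDGET_USD = 25_000_000_000   # budget figure used in the text
PUBLICATION_SHARE_PERCENT = 2     # share discussed in the quoted passage
FEE_PER_ARTICLE_USD = 2500        # PLoS flagship fee quoted in an earlier entry

# Integer arithmetic keeps the result exact.
shifted_usd = NIH_BUDGET_USD * PUBLICATION_SHARE_PERCENT // 100
articles_covered = shifted_usd // FEE_PER_ARTICLE_USD

print(f"2% of the budget: ${shifted_usd:,}")
print(f"Articles covered at ${FEE_PER_ARTICLE_USD:,} each: {articles_covered:,}")
```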


The taxpayer rights argument is the soundbite hook on which FRPAA hangs as well, and it is a soundbite that plays well with members of congress and in the press. But, of course, FRPAA itself is a compromise and doesn't provide any more immediate access than the Highwire publishers do independently. "Libraries aren't going to cancel subscriptions if there is an embargo," say the partisans. Since this seems so obvious to them, they accuse the publishers who are opposed to FRPAA of bad faith for claiming that they are concerned about the survivability of their organizations.
1. The access may not be any more immediate, but it is a lot cheaper. Further, if all the embargo is doing is protecting the profits of private corporations, governments would seem to me to have a compelling interest in mandating (and, yes, paying for) immediate OA.

2. In respect of subscriptions, I agree in part with Scott, in that I think OA -- especially once we get rid of the embargo by paying publication costs upfront -- will cause subscription cancellations. (Although I have yet to see any data that support this contention, it still seems intuitively likely to me.) The obvious impact is two-fold: for-profit publishers will have to adapt or die, and the science community is going to have to find new ways to carry out peer review. Between traditional publishers who are willing and able to adapt, existing OA publishers (PLoS, Hindawi, BMC) and numerous high-profile experiments in overhauling peer review, I am confident that no scholarly crisis is likely when the subscription model dies.



Monday, 30 October
Big Time

At Abbas' kind invitation, I've joined the team at 3 Quarks Daily as a guest columnist. It's intimidating company to be keeping, but I'll do my best not to lower the tone.

My first essay is up now: The Future of Science is Open, Part 1: Open Access. I won't reproduce it here, because I don't want to split the conversation that I hope it will spark.



Thursday, 05 October
I take exception!

In the course of promoting next year's Science Blogging Conference, Coturnix writes:

Jean-Claude Bradley is the pioneer in the use of blogs in science in the way that too many of us are still too scared to do - posting on a daily basis the ideas, methods and data from the lab.

Not all of us are scared. I have colleagues with legitimate claims on all of the work I am doing at the moment, and none of them are willing to go to open-notebook. I anticipate even having trouble with my refusal to deal with Elsevier and my intention to publish only in open-access journals.

I've been in this lab a year, so everything I'm doing is directly based on someone else's data and ideas -- that is, to such an extent that I do not feel I can insist on an open notebook. Recently, though, I applied for funding to start an entirely new project. This will not mean that I can suddenly ignore my colleagues' wishes, but it will put me in a stronger position to say, "well, this is my project, and I want to do it this way".

I think of it as just another experiment. If I'm right, open science is a better way to work, and the benefits of choosing a better model will become apparent to my colleagues, and so open science will spread from early adopters like Jean-Claude (and, soon, I hope, me). If I'm wrong, I'll fail -- but I'll fail on my own terms, and I can live with that.



Tuesday, 05 September
heads-up

Rob Helpy Chalk is a philosopher and a teacher and has a brain approximately the size of a planet (note: not a pluton), so I am very pleased to see that he's interested in scientific communication. I'll weigh in on his ideas later, when I have some time, but I'm posting this now to alert people I think will be interested. If you read me at all, you will probably be interested in Rob's thoughts on scientific communication and I'd really like it if you'd take a look at the linked post and give Rob some feedback. This has all the makings of a fun, useful conversation.

Comments are off here; go talk to Rob. Mind you don't put links in your comments though; his spam filter has got teeth. The comment I tried to post is below the cut; it's now up at Rob's post. PZ Myers' comment thread also has some good stuff.

Update: Arunn of Nonoscience has put together an alternative chart and generated even more good discussion.


open access/open science, science | Bill Hooker | 05 Sep, 2006 | | [Trackbacks](0)


Tuesday, 08 August
public service announcement

OK people, when I talk about publishing data on blogs this is most emphatically NOT what I am looking for.

Words fail me. (Hat-tip blame: Chad.)



Saturday, 01 July
snark

Peter Suber points to a new book from Chandos Publishing. It sounds interesting, but here's a free clue for Chandos: I'm not about to pay GBP 39.95 (USD 73.86) for the paperback edition of a 250-page book about -- you guessed it -- Open Access.

Title: Open Access; Key strategic, technical and economic aspects
Editor: Dr Neil Jacobs
The authors are some of the leading experts in open access, and will be familiar to anyone with even a passing acquaintance with the debates in the field. They include academic researchers, librarians and publishers, and are all strategic thinkers with both breadth and depth of knowledge in scholarly communication. They have subtly different views on open access, and these come across in the book, which therefore documents the open access movement at a critical point in its progress. The editor is an experienced information professional and researcher, currently managing the JISC (Joint Information Systems Committee) Digital Repositories development programme in the UK.

Peter says that many of the chapters are available, open access, on the web -- but Chandos doesn't seem to want me to know where they are. Look, I know there are no Publishing Fairies to leave free books under my pillow, and I don't begrudge anyone an honest living; but come on, seventy smackers for a paperback? I'm going to take some convincing before I see that as anything but gouging.



Wednesday, 28 June
More on OA, mandates and Harnad-vs-Velterop

In a followup today, Jan Velterop makes the point that he knows Stevan Harnad is a strong supporter of peer review, but their disagreement in fact hinges on another aspect of that system:

...I know that Stevan is a very strong supporter of peer review and so am I. In a way, that is precisely the problem. Stevan wants peer review, but not to pay for the process of formal publishing in peer-reviewed journals. That, he argues, should be done by librarians. I wouldn't dispute that, but according to him librarians should (or is it just will?) keep the subscription model going and in that way provide sustenance for the formal peer-reviewed journal system.
So in fact I misrepresented Jan's argument somewhat; mea culpa. All parties agree that peer review is indispensable, and what we are disagreeing about is how to pay for it. Jan1 is (I think) emphasizing the connection between formal publishing and peer review. These two processes are not necessarily inseparable, but Jan's argument has the advantage that the current peer review infrastructure is already in place, and we know it works. Stevan (again, on my reading) is arguing that we can achieve close to complete OA, at least of papers not currently in commercial archives, just by doing it using IRs -- and then wait to see whether journal subscriptions decline. If they do, journals will simply have to adjust to what the market wants, offering their services as co-ordinators of peer review plus whatever else they find their customers will buy (nice pdf files, advertising, job markets, etc).

Suppose, by way of Gedankenexperiment, we wave a magic Harnadian wand: all researchers from this moment will archive postprints of all their papers; journals that strictly disallow this practice will be shunned. Not only that, but the bulk of archived literature will also be, at the wave of our wand, deposited in IRs by authors working through their own back catalogues. (Note that we have sidestepped the quality control quagmire inherent in posting preprints, with or without corrigenda, and completely ignored the Kitten Herding problem of getting researchers to do, well, anything.)

OK, now what? I simply cannot see university and other administrations continuing to pay journal subscription fees, so what are the journals (and the companies that run them) going to do? Not being a publisher or any kind of businessperson, I don't know what they will actually do, but it seems to me that our magic wand has left them little recourse: lacking leverage, they will have to either go out of business or sell their remaining wares (including the co-ordination of peer review) for whatever the market will bear. A quick Google search on "serials crisis" reveals that, in many cases, the market will not likely bear what they are currently charging for subscriptions. Journals with significant cachet, like Nature or Science, will probably be able to charge a pretty penny, but smaller fish will have to either drop their rates or leave the pond. It's the latter that worries me: if researchers force this scenario, they risk losing diversity within the peer review infrastructure. It's also possible that, if significant numbers of journals close, their review boards will be effectively taken over by other organisations such as scholarly societies -- it even seems likely to me that researchers will figure "why pay at all? we can run peer review for ourselves". I don't like that idea either; in fact, the more I think about it the less I like it. There is value in having a sizeable chunk of the peer review process co-ordinated by outsiders: commercial publishers respond to different incentives than researchers, and I think that independence is a useful hedge against, well, corruption.

Ideally, from an entirely selfish point of view as a researcher, every existing journal would simply lay off staff and move offices and so on until it could subsist on what it could charge authors for having their work reviewed (plus advertising etc as above). I simply do not know whether the transition would work that way.

So where does that leave governments that are considering mandating OA to publicly funded research? Should they mandate OA without specifying mechanism and risk the problems discussed above, or mandate OA through the existing journal system and risk handing publishers, some (in my opinion, many) of whom are decidedly unscrupulous, even more leverage? I still prefer the former option, because it seems much more likely to me that publishers would gleefully gouge if handed the opportunity than that a mandate which did not specify mechanism would precipitate Peer Review Doomsday as above. For one thing, though I ignored it for convenience above, the Kitten Herding problem is significant. Researchers are conservative in their view of structural change, and reluctant to spend much time on anything but their chosen research problems. Journals will likely control access to their archives for the foreseeable future, so they will not lose all leverage at once, and researchers would much rather pay (what they see as) reasonable charges than have to deal with change. There are a number of different publishing models being tried out as alternatives to the subscription-based system. It seems to me that a mandate that doesn't specify mechanism will make these experiments more pressing for publishers without forcing them into any immediate decision.

Two quick final points in reply to Jan:

If anybody has problems with the harnadian solution, it's scholarly societies that publish journals.
I think this is because scholarly societies rely on journal income to fund other activities and are fretting over lost revenue, which tells me they do not think they could persuade their members to pay membership fees that would cover the loss. Perhaps, to be frank, we could afford to lose a few such societies.
It is just possible that there may be reasons why researchers are researchers and publishers publishers. Everybody can sow the seed to grow the wheat to grind the flour to bake the bread. Who, after all, needs farmers, millers and bakers?
Quite right, and I do not even want to be a farmer, miller, baker -- or publisher. I am happy to buy my bread and my peer review, I just want to pay a fair price.

(It now occurs to me that part of the problem is that, although I strongly suspect I am being overcharged at the moment, I do not know what would be a fair price. I will think and write more about that, but this entry is too damn long already.)




1 I hope it will not seem overly familiar if I follow Jan's lead and use first names; it sounds less combative, and this is not a fight, it's civil fucking discourse.



Tuesday, 27 June
"Alms race", hee hee. Funny, but not accurate.

(Attention Conservation Notice: this post assumes some familiarity with Open Access. I've included some background links but the essential introduction to OA is by Peter Suber: see his Open Access Overview and also the one-page version thereof.)


In his latest blog entry, Jan Velterop takes an entertaining but (I think) overstated swipe at Stevan Harnad for the latter's position on recent moves by publishers to get OA mandates to encompass the paid version, that is, where publishers offer OA on any manuscript in exchange for an extra fee. Harnad argues, if I read him right, that this amounts to a land grab by the publishers, since if their efforts were to succeed the governments in question would be mandating further payments to publishers, on top of the fees and subscriptions they already charge. Since authors can already self-archive, pace Velterop, this really is a kind of alms for the publishers. Velterop takes issue with what he sees as dismissal of the role of publishers in providing peer review and authentication:

Those with a 'harnadian' inclination should really not bother publishers at all with their articles. They should just 'archive' (read 'publish') them in some repository and move on. Shame the articles can't be labelled as having been published in a peer-reviewed journal, which would make them more valuable and be noticed and taken seriously, but hey, everybody can see them and the publishers just haven't been able to beg enough cash to publish them.
I think this is a mis-statement of Harnad's position in re: peer review. Neither he, nor any serious OA advocate that I know of, has ever indicated that they do not recognise the value in peer review. Harnad's point, in the post in question, is not about peer review at all but about co-opting a government mandate to direct research funds into publishers' pockets, when a free alternative is available.

That said, I think there's more to mandatory OA to publicly funded research (of which, let me state up front, I am strongly in favour) than a simple choice between for-pay or free models. TANSTAAFL. Velterop:

Subscriptions, on the whole, currently sustain the journal system. But they have a downside. They do not, by definition, provide open access. So that's why new publishing models have emerged that do.

Unfortunately, Stevan derisorily calls these new publishing models PPA, for 'Paid Publisher-Archiving'. As if 'archiving' is what publishers do. Nobody pays a publisher for archiving and no publisher asks for payment for archiving. Publishers ask for payment for having an article peer-reviewed and formally published in a reputable journal.

Well, yes, but remember that researchers provide a good part of the value of peer review -- the actual reviews! -- to publishers for nothing. Publishers are therefore charging money for co-ordinating the review process and for "formal publishing", which I take to include archiving and distribution. This is a legitimate charge for legitimate added value, but there is a limit to what the market will bear, particularly as costly print versions of journals move closer and closer to obsolescence. (I haven't picked up a print journal in years, excepting only those old publications which have not yet been digitized.) At some point it could conceivably become attractive for researchers -- perhaps through the NIH, or professional societies, for instance -- to co-ordinate the review process themselves, construct a robust search architecture that encompasses the vast majority of institutional repositories and thumb their noses once and for all at the STM publishing industry. (Indeed, when it comes to constructing a global virtual archive, the Open Archives Initiative has done much of the heavy lifting already.) So, let the publishing companies beware. Researchers, if I know my tribe, don't want the hassle (and we are talking about a mind-bendingly HUGE endeavour, if we are considering moving virtually ALL peer reviewing out of the current commercial infrastructure) -- but if pushed far enough, they'll act.

Harnad and other OA advocates make much of the fact that there is currently no evidence that OA reduces journal subscriptions. I think this may be somewhat disingenuous. Evidence drawn from physics/arXiv is of limited value in making predictions about the much larger and more lucrative (or, from a subscriber's point of view, expensive) biomed field. Further, if the subscription model still has a much greater market share than any OA based model, it may not mean much if current subscription levels are not (yet) falling as OA grows. Harnad himself has written about the transition from the current system to universal OA:

An alternative outcome [to 100% self-archiving] is that when the refereed literature is accessible online for free, users will prefer the free version (as so many physicists already do). Journal revenues will then shrink and institutional savings grow, until journals eventually have to scale down to providing only the essentials (the quality-control service), with the rest (paper version, online PDF version, other 'added values') sold as options.
This is the outcome that strikes me as likely: who is going to subscribe to a journal whose contents can be had for free? I have been saying since the days of E-biomed that OA would mean a significant reduction in the size, both physical and financial, of the STM publishing industry. Don't get me wrong, I shed no tears over the likely loss of a few companies and shrinkage of the rest; but the important question is how to maintain the integrity of the peer review and authentication system. Will journals (that is, the companies that publish them) simply accept the new order of things and quietly "downsize" until they provide basically just the quality control function? Harnad writes:
In none of these outcomes [see here] is peer-review itself compromised or put at risk; nor do authors have to give up, even temporarily, submitting to their established journals of choice.

Er, well, it may not be quite that simple. If I put my Doom&Gloom hat on, I can see a number of "established journals of choice" simply going out of business, leaving open the questions of what happens to their archives and, more importantly, how do we replace the resulting hole in the quality control infrastructure? We certainly do not want to allow a power vacuum into which will rush the remaining publishers, likely the biggest and some of the worst, gleefully wielding a new near-monopoly. Harnad writes:

Self-archiving could be done virtually overnight.
It could, but it won't. Getting researchers to work together, even for their own good, is worse than herding kittens. We (the research community, and open access advocates in particular) are NOT going to wave a magical open-access wand and present the publishers with a fait accompli in which they must quietly acquiesce. Government mandates of OA to publicly funded research will go a long way towards forcing their hand, though -- which, to return to the original point of this post, is why Velterop is mostly wrong and Harnad is mostly right: researchers should come out strongly against the attempt to have such mandates include what Harnad calls PPA.



Sunday, 25 June
We don' need no stinkin' ethics. Unless we do.

Dr Free-Ride has a good entry up about scientists and ethical behaviour. I have nothing to add to her basic point, which is that when ethics is seen as something imposed from outside, it is largely ignored; this idea will be entirely familiar to any researcher who has ever sat through the obligatory (!) ethics class or seminar or whatever their department requires.

Where I think Janet's discussion is missing something is in how to deal with this issue (and to be fair, she was mostly pointing out the problem, not trying to solve it):

To get "buy-in" from the scientists, they need to see how ethics are intimately connected to the job they're trying to get done. In other words, scientists need to understand how ethical conduct is essential to the project of doing science.
So OK, how exactly does that work? In a fairly straightforward sense, ethical conduct is demonstrably NOT essential to science or scientific progress. Science is being done now, often quite successfully (in terms of personal career advancement and, more importantly, in terms of real additions to the knowledge base), by unethical means. There is nothing about vivisection that makes it an inherently ineffective means of gathering information; many experiments that do not make it past IACUC would yield useful data. Further, if I successfully steal your ideas and publish them, I will have been doing science from the point of view of anyone (or anything, like the knowledge base itself) that doesn't know or doesn't care that I stole the ideas.

The trivial category here is unethical conduct like that of the Korean stem-cell team; this was dumb as well as wrong, because it produced bad data and was bound to be found out. The important category is unethical conduct that produces clean (useful, reproducible) data: what makes such conduct unethical, what aspect of its unethical nature makes it antithetical to doing science, and what is the mechanism of that opposition?

Within this category, we can distinguish between conduct that, if you get caught, will hamstring you within the scientific community (thieving) and conduct that, if you get caught, will cause the wider community to stop supporting you (vivisection). The key phrase here is "if you get caught"; that is, ethical judgement is community judgement. An individual cannot do much science without the scientific community; infrastructure needs alone make that clear. Neither, for even more obvious reasons, can the highly-specialized scientific community do anything without the support of the wider community. Unless you posit something like karma or divine retribution, I don't think you can find an unethical behaviour that both produces clean data AND is in and of itself "anti-scientific", that is, proof that ethical conduct is in and of itself essential to scientific progress -- unless, that is, you take into account the reliance of scientific research on community support.

In other words: what is ethical conduct? Whatever the community decides is ethical conduct. Why is ethical conduct essential to the project of doing science? Because community support is essential to that project.

I have, of course, sidestepped the larger question of HOW the community -- the scientific community, or society at large -- decides what constitutes ethical conduct. It's not true that vivisection is wrong only because if you get caught doing it your grant will be cut off (without anaesthesia, of course). Scientists are not just scientists, they are members of society at the same time. This is an enormous question, but a quick look at the scientific community will allow me to sketch my own view: why is it unethical for me to steal ideas? Because if everyone stole ideas, collaboration and other networks of trust would collapse. It's far more efficient to act in good faith and initially to assume the same of others. The same holds true for the wider community: whatever benefit I derive from someone else's disadvantage will eventually come back and bite me in the ass. On any but the short-term, immediate-future view, "do unto others as you would have them do unto you" is not a Divine Command but a sensible way to maximize one's own preferences.



Thursday, 11 May
Glee!

Bora recently asked whether anyone was using Connotea. I am, and I like it fine. It's open source and has a web API, there's a lively dev forum, and it's continually improving. You could use any bookmarking service, like Simpy, to collect your science/work-related links, of course, but Connotea offers the compelling advantages of auto-discovery of relevant fields (DOI, author list and so on), an improving ability to play nice with reference manager software, and a more focused community with whom to share tags, bookmarks and ideas.

Now, much to my glee, Connotea has started actively supporting citations to blog entries:

A lot of you are increasingly bookmarking articles from personal blogs alongside traditional journal-published articles. In response to this, Connotea now has experimental support for treating bookmarked blog posts as citations, and it will automatically import publication data for those articles wherever possible.
Hot damn, says I! Of course I had to try it out, on the obvious test post. Here's a screenshot, with a regular PubMed entry for comparison:
scrnsht.jpg
As you can see, Connotea correctly identified the blog, although it didn't grab the entry title (and I'm not the only one reading Science & Politics!).

This is the sort of thing that makes me feel that there really is an open science revolution underway. The internet is making possible real-time collaboration between large numbers of people with minimal regard to geography; as proprietary barriers to information flow are dismantled, this collaborative process can only accelerate and will, I believe, supplant traditional competitive models of research.



Tuesday, 25 April
Science blogging continued: more about scooping.

In something of an aside to his reply to Abel's musings about a medical wikipedia, Orac makes a couple of good points about publishing hypotheses on blogs and the "scooping" issue:

[...]most cases of scooping aren't nearly as blatant as the one [PZ Myers described]. Most are a lot more subtle, and the vast majority don't involve any chicanery at all. Indeed, in my experience, most cases involve multiple labs working on the same question. In such cases, one of these groups will inevitably succeed at publishing their results first, and the rest will be "scooped," no dishonesty or using ideas or experimental protocols without appropriate attribution necessary. [...] (In fact, I wouldn't even call it getting "scooped.")
The Grey Area Problem, yes. (As an aside: I quite agree, being beaten to publication by legitimate methods is not the same thing as "being scooped" as I mean the term, though of course "scooped" is used both ways. Perhaps we need a better term for the despicable version.) My main point about grey areas is that their inevitability is not a dealbreaker: we have the tools and infrastructure to deal with them. Orac goes on to say:
In an ideal world, Bill Hooker's concept would be the way things should work and any hint that labs might be scooping each other would result in offers of collaboration, but that isn't always how things actually work.
The gentle implication of naivete is, of course, perfectly reasonable, and the realpolitik of the science tribe is already forcing me away from any strong position I might have started staking out (see, e.g., this).

Nonetheless, I think there's a place for the naive position, and I'd like to keep it around, even if only to mark a boundary -- "OK, fine, that's too much trust, but how close to that can we get?". Here's the thing: that's the way it does work with me. I won't ever steal an idea from you, and if we are interested in the same questions I'd much rather share the work and the credit between us than turn science into some bullshit macho game. If you want to be famous, go ahead and be the guy on TV if our work is important enough to get coverage -- I don't give a rat's. I just want to do science without running out of funds every year or two, and I don't see why I should have to claw my way past my colleagues into one of the increasingly scarce tenure track positions to do it.



Saturday, 22 April
some scienceblogging tools

1. A comment on Pedro's post about Bora's post about scienceblogging led me to Stew, and reminded me about Postgenomic, which is Stew's creation. PG is a feed aggregator, but it's a feed aggregator with big ideas:

Postgenomic aggregates the feeds from life science blogs in order to do useful and interesting things with them. It's kind of like Technorati crossed with a really big hot papers meeting.

Its main uses - hopefully - are to:

  • List the current top life science news stories and the hottest recent papers (or the papers most often cited by bloggers, anyway)
  • Store and index reviews of papers
  • Store and collate reports from conferences
  • Help bloggers to share their expertise and, flipside of the same coin, to find useful papers on a given topic

[...]
Hopefully, as the site develops and the database grows the fourth point can be accomplished by organizing the papers by topic (perhaps using MeSH terms, or keywords, or the Technorati tags from the posts containing links to them). If you're looking for papers on, say, Bayesian networks in molecular biology but don't know where to start then you could fire up your browser, click on the appropriate tag in the Postgenomic index and be presented with a list of relevant papers and the blog posts that talk about them.
This is a great idea, and dovetails nicely with the current scienceblogconversation about what scienceblogging is, and what it might be good for. (You can add your blog to the postgenomic index by emailing Stew, and here are some ways to make sure the indexing goes smoothly.)
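
For the curious, here is a rough sketch of the kind of tag index Stew describes, written in Python. To be clear, this is my own guess at the idea and not Postgenomic's actual code: the post records, field names (`url`, `tags`, `papers`) and the example.org and DOI-style identifiers are all made up for illustration.

```python
from collections import defaultdict


def build_tag_index(posts):
    """Map each tag to the papers linked under it, and each paper to the posts citing it."""
    index = defaultdict(lambda: defaultdict(set))
    for post in posts:
        for tag in post["tags"]:
            for paper in post["papers"]:
                # A set means a post that links the same paper twice counts once.
                index[tag][paper].add(post["url"])
    return index


def hottest_papers(index, tag):
    """List (paper, number of citing posts) for one tag, most-blogged first."""
    papers = index.get(tag, {})
    counts = [(paper, len(urls)) for paper, urls in papers.items()]
    # Sort by descending count, then by identifier so ties come out in a stable order.
    return sorted(counts, key=lambda item: (-item[1], item[0]))


# Hypothetical aggregated posts; none of these URLs or identifiers are real.
posts = [
    {
        "url": "http://blog-a.example.org/post1",
        "tags": ["bayesian-networks"],
        "papers": ["doi:10.0000/example.1", "doi:10.0000/example.2"],
    },
    {
        "url": "http://blog-b.example.org/post7",
        "tags": ["bayesian-networks", "systems-biology"],
        "papers": ["doi:10.0000/example.1"],
    },
]

index = build_tag_index(posts)
for paper, n_posts in hottest_papers(index, "bayesian-networks"):
    print(paper, n_posts)
```

Clicking on a tag would then amount to calling `hottest_papers` for that tag and listing the citing posts stored alongside each paper.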


2. In the comment that sparked this post, Stew pointed to WebCite:

WebCite is an archiving system for webreferences (cited webpages and websites), which can be used by authors, editors, and publishers of scholarly papers and books, to ensure that cited webmaterial will remain available to readers in the future. If cited webreferences in journal articles, books etc. are not archived, future readers may encounter a "404 File Not Found" error when clicking on a cited URL.

A WebCite reference is an archived webcitation, and rather than linking to the live website (which can and probably will disappear in the future), authors of scholarly works will link to the archived WebCite copy on webcitation.org.

This not only provides a solution to the dead links problem, it also provides external timestamp authentication (which, as discussed elsewhere, is an issue when using blog posts to stake out academic/intellectual territory and avoid being scooped).


3. Stew found WebCite via Alf of HubLog. Alf discusses various solutions to the dead links/timestamp problem, including using Spurl (which is how I back up my Simpy archive) and his own cite bookmarklet. The bookmarklet allows you to grab a timestamped blockquote from another page, like so:

<blockquote cite="http://hublog.hubmed.org/archives/001243.html" title="HubLog: Creating a citable archive of a web page on Sat Apr 22 2006 15:59:48 GMT-0700 (Pacific Standard Time)">Academic papers or weblog posts often need to refer to external web pages; generally, you want people to see the external pages as they were when you wrote about them.

The simplest way to do this is a standard hyperlink, combined with a quote of the appropriate section of the text. If you're referencing long pages though, lots of lengthy quotes could get out of hand.</blockquote><cite><a href="http://hublog.hubmed.org/archives/001243.html">HubLog: Creating a citable archive of a web page</a></cite>.

Note: the original text included a link, which the bookmarklet doesn't preserve, but it's no big deal to add those back in (you could use "view selection source" if there were lots of links).
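
If you want to see the moving parts, here is a minimal Python sketch of how such a timestamped blockquote could be assembled. It is not Alf's bookmarklet (which runs in the browser); the function name `timestamped_blockquote` and the timestamp format are my own assumptions, chosen to resemble the output above.

```python
from datetime import datetime, timezone
from html import escape


def timestamped_blockquote(quote, url, page_title, when):
    """Wrap quoted text in a blockquote recording its source URL and capture time."""
    stamp = when.strftime("%a %b %d %Y %H:%M:%S %Z")
    # Escape anything that lands inside an attribute value or element text, so a
    # stray quote mark or angle bracket in the source cannot break the markup.
    cite_attr = escape(url, quote=True)
    title_attr = escape(f"{page_title} on {stamp}", quote=True)
    return (
        f'<blockquote cite="{cite_attr}" title="{title_attr}">'
        f"{escape(quote, quote=False)}</blockquote>"
        f'<cite><a href="{cite_attr}">{escape(page_title, quote=False)}</a></cite>'
    )


snippet = timestamped_blockquote(
    "Academic papers or weblog posts often need to refer to external web pages.",
    "http://hublog.hubmed.org/archives/001243.html",
    "HubLog: Creating a citable archive of a web page",
    datetime(2006, 4, 22, 22, 59, 48, tzinfo=timezone.utc),
)
print(snippet)
```

Bear in mind that a timestamp generated this way is only as trustworthy as whoever ran the code, which is why an external archive like WebCite is the stronger form of evidence.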



Saturday, 22 April
New to the blogroll: more meta-science

In comments below, Pedro Beltrao of Public Ramblings says:

What I disagree with is that we should go ahead and try to change things starting with the assumption of good faith. There is a percentage of people with bad intentions, this is clear, so we should plan for this. Open systems like wikipedia and digg are having problems and are taking steps to solve them. I suggest we keep an eye on these pioneering online social systems and see what solutions they come up with.
He's right, and it's an important point. When I said we should assume good faith, I wasn't clear. I didn't mean we should naively pretend there are no assholes in science. What I meant to convey was that, in addition to the sorts of measures we can learn from systems like wikipedia, we should do two things: 1, change the emphasis of the culture of science from suspicion to trust; and 2, have more faith in our ability to identify and deal with cases of bad faith as they arise. In other words, relax.

I think that we have good reason to approach fellow researchers as potential collaborators rather than potential scoopers (see below), and that when bad actors try to take advantage of that approach we also have, as a community and as individuals, the means to deal with them. When I say "the means to deal with them", I mean to include the sorts of checks and balances that Pedro is talking about.

Plentiful though they are, stories of scooping and other assholery are vastly outnumbered by the stories you don't hear, precisely because they are the stuff of every day:

  • the PI who lent you her unsubmitted grant so you could copy the format for your own
  • the postdoc who spent half a day digging through the -80 freezer to find the plasmid you wanted
  • the NIH staff scientist who sent you transgenic fibroblasts in response to an out-of-the-blue email
  • the paper you're an author on even though all you did was teach someone a technique they didn't end up needing1 ("we said you'd be an author, so you're an author")
and so on and on. Those are all true examples from my own experience, and I'd like to invite readers to add their own in comments. It would be nice to hear about the up side of the scientific community for a change.


1 I should clarify: an acknowledgement "for technical assistance" would have been more appropriate, and these days I would insist on that. At the time, I gave in and took the free ride. Mea culpa. I included the example just to point out that researchers are often generous even with that most precious commodity, publication credit.



Friday, 21 April
Quick followup on science blogging.

There's a lot of great discussion going on at the moment about science blogging, the community of science, publishing and so on. I don't have time for a comprehensive roundup (though Bora's updates here cover most of it), but I want to quickly follow up on a comment that Abel Pharmboy made:

Bill Hooker was most vocal in Bora's comments and in a separate post at his own Open Reading Frame on how "scoopers" should be shunned by the scientific community.
(This was sort of tangential to the main point of his post, which is why I'm doing this here instead of in his comments.)

The point I want to make is this: for all my talk of shunning, and for all that I'm absolutely serious about increasing the risk associated with "anti-collegial behaviour" like scooping, I'm aware that we don't want to start a program of witch hunts. There will be grey areas, hard-to-prove cases, and we'll just have to err on the side of trust -- be scrupulous about "innocent until proven guilty". Better ten scoopers get away with it than one innocent be labelled a scooper. We don't have to catch 'em all, just associate a greater cost with the activity.

Further, it's not so much about punishing wrongdoers as altering community attitudes. Scientists now tend to shrug and say, "that's how the game is played" or some such -- as though that's how it HAD to be. Worse, people are not inclined to speak up and say, "Hey, I thought of that some time ago", because the response will be along the lines of "too bad, I published it so it's MINE ALL MINE bwahahaha!". If someone says to me, "Hey look, here's a blog post of mine outlining the central theme of your paper six months before you submitted it", I'm not going to say "tough luck". At the very least, I'm going to invite that person to work with me on questions we're both interested in, so we can publish together in future -- and more, I'd be happy to have my published work updated to give credit for their independent discovery. For one thing, how does it hurt me to admit that someone else also came up with "my" ideas? It amounts to a "note added in proof" if there are independent data involved, and a pretty ordinary courtesy if it's just about the concepts. Further, I don't WANT credit for something I didn't do, only for things I did do (and I don't even care so much about that, so long as interesting questions keep getting answered1). If someone else came up with an idea or a result before I did, I want that known -- I'd feel like a fraud otherwise, if the community thought I was first but I knew otherwise.

In closing, let me just deal with one common objection to this idea of a more open system: that the world is full of assholes. Whenever I discuss openness, be it publishing data on blogs or being willing to share credit or listing one's bioreagents on BioRoot, I meet with a reaction that boils down to "what if someone takes advantage of me?". What if someone scoops me, what if someone fakes a blog post to get me to acknowledge them in a paper, what if someone keeps asking me for reagents and never gives any out? Well, to begin with it's a lot healthier (and, I'd argue, more productive in the long term) to start with an assumption of good faith than with the idea that everyone is out to cheat you. It's perfectly true that there will be assholes trying to take advantage, but here's the thing: they're doing that now, and the system we have is not hindering them much. In a more open system predicated on good faith interactions, assholery becomes harder to hide and get away with. As far as dealing with assholes as they appear, I return to a point from my last post: we're scientists, we present and evaluate evidence for a living. So if I'm going to accuse someone of scooping, for instance, I know -- it's my job to know -- what kind of evidence I need and how to get and present it. If I'm answering charges of assholery, I know what kind of evidence to demand, or to present in my defense. Give it a chance, I say: there aren't as many assholes as you think, and we already know how to cope with them.




1 To the extent that I do care, it's a job security issue: my ability to win funding and get or keep jobs in science is largely dependent on getting credit for my discoveries. That (job security) is a common lament among researchers, and it's a function of the career structure/hierarchy, which is another problem for the community to deal with; for instance, there's an interesting discussion here. For now, let me just point out that a system in which everyone gets the credit they've earned, because everyone is willing to give it (as in my personal thought-experiment above), seems to me to offer more security than a dog-eat-dog system.



Tuesday, 18 April
Science blogging: what's it all about? Part 1 of an ongoing series.

I've been posting pretty much nothing but verse, photos and linkdumps for a while now, partly because I've been exceedingly busy and, if I'm honest, mostly because serious original posts are a lot of work. The main reason, however, for the blog name change and the switch to my real name was that I want to start using this blog for talking about, thinking about, and even doing science, and recent posts by several other bloggers have prodded me into action.

I want to come back to issues and ideas raised by YoungFemaleScientist, Chad and Dr Free-Ride, but for today I'll mostly just point to Science and Politics.

Bora recently posted an elegant, scholarly, professional level discussion of Chossat's Effect in humans, complete with preliminary data, an hypothesis and an explicit request that the post be cited as a scientific communication; I noted this in a linklog and said he was helping to "usher in a new era of scientific publishing", and I wasn't kidding. I got online in about 1993, before there were blogs as we know them now, and my immediate reaction to this new medium was two-fold: "my people!" and "eee, publishing revolution!" I was right on the first count (even met the spousal unit online), and it's been slower than I'd have liked but I still think I was right on the second count as well. I'm not the first to observe that blogs are conversations, and conversations between scientists are where a lot of the creative action is; collaboration is a fun and powerful way to extend one's intellectual and practical reach. What better way to keep up with what's happening on relevant benches around the world than a well-connected network of lab weblogs (lablogs)?

Today, Bora has gone further with this idea. By way of answering the question "what are science blogs doing now?", he sets out a pretty comprehensive taxonomy of the current community. The category that interests me right now is "hypotheses and data", and I agree with Bora that there are two kinds of blog post in this category:

A) "This is my hypothesis and I am staking the territory here. I intend to test this hypothesis in the near future and you BETTER NOT try to scoop me!"
B) "This is my hypothesis, but I have no intention to follow it up with actual research. However, I'd love to see it tested. Please someone test it! And if you do, you will have to cite me in the list of references as your source for this hypothesis"
I would rewrite (A) to read: "This is my hypothesis and I plan to test it; if you can contribute, with ideas I haven't had or reagents I don't have or whatever it might be, great: let's collaborate. There's no need to steal when you can share."

Here we run into a personal bête noire of mine: "scooping". This means what it sounds like: taking advantage of someone else's work, to which the Scooper had advance (pre-publication) access by way of a conference presentation, visiting lecture, conversation, manuscript review, blog post or whatever, in order to slam a rapid publication into press ahead of the Scoopee, the person who actually had the idea. In Bora's comments, PZ Myers provides a personal example:

I got burned several years ago. I had a complete description of the protocols we were using in a teratology study, with some preliminary pictures of some of the results, all on the web. A few months later, my students found a paper published describing similar results in a fairly big name journal, and the protocols, which they had worked out by trial and error, were identical right down to the fraction of a percent of various reagents. It was damned obvious that they'd found our description and literally copied every step of our experiment...and there wasn't so much as an acknowledgment. The authors hadn't even bothered to contact us.

It was particularly galling to go to meetings afterwards and have people ask me, "Oh, so you're doing experiments like so-and-so?"

I've said elsewhere, I said in Bora's comments, and I'll say again: those assholes should be shunned. To do that to another researcher should basically mean the end of your career, by way of community opprobrium if not active sanction. I asked PZM what he did about his scoopage, and I'll be interested to hear his response. What typically happens is nothing: the scoopee shrugs and says something like "I couldn't prove they didn't think of it themselves, and it's too much trouble, and I don't want to rock the boat".

NNNNNNNGGGGGGGGHHHHHH!!! That galls me nearly as much as the initial assholery!

Of course, you don't want to smear "SCOOPER" all over an innocent researcher's reputation, and of course there will be grey areas and cases that are difficult to prove. But we are scientists, ferfucksake: we evaluate evidence for a living. It's what we do. Case in point: PZ lays out good-looking evidence of guilt in his comment, and as I said in reply:

As Bora points out, a blog post is a timestamped piece of evidence, a well-pissed-on territorial tree. It shouldn't take more than an hour or two with the lab books from the suspect lab to tell whether or not they stole your protocols -- unless they made up very careful fakes, which frankly would be more work than doing the damn experiments and not nearly as interesting.
You don't have to go screaming over to the offender's lab, punch him in the face and carve "SCUMBAG" into his forehead with a rusty scalpel. Simply contact the apparent scooper and lay out your evidence in a calm, straightforward manner. Frame it as an enquiry: my work shows considerable similarity to yours, how about we work together on some of these questions? If he blows you off, take it to the senior editor of the journal he published in; the journal has a vested interest in evaluating your claims, because they need a reputation for impartiality. While you're at it, cc: the apparent scooper's boss/es (dept head, dean of school, whatever).
If you're wrong, that should become clear pretty damn fast -- and you haven't carved anything into anyone's face, so a sincere apology is all that's required. (Speaking for myself, if I were the innocent apparent scooper, at this point I'd be happy to talk about future collaborations, and possibly adding an acknowledgement about independent prior art to the paper in question.) If you're right, you may or may not get active satisfaction in terms of having the paper rescinded, or your name added to it, but you will have taken a stand against an unacceptable but all-too-common practice and, in doing so, nailed a big stanky turd to the scooper's reputation.
Science, like all human endeavours, runs to a certain extent on reputation, so the mechanism is already in place to deal with this problem. The risk associated with scooping is currently very low; if you're willing to do it, you can probably get away with it. And there are always assholes in every field, so there will always be someone willing to do it. The good news is that collaborations are already CV fodder, in many cases regarded even more highly than individual efforts when it comes to promotions, grants and so on.
We therefore do not need to raise the risk associated with scooping very high -- we can be absolutely scrupulous about proof, and about avoiding witch hunts -- before sharing becomes a more attractive option than stealing.



Wednesday, 23 November
An idea whose time has come.

Orac has a post up about MacGyver science -- you know, supercolliders made out of toilet rolls and chewing gum, or in this case an electrophoresis rig made out of kitchen stuff. Orac concludes, sadly, that it's not a practical way to cut lab costs. He's right, but there are good ways to cut lab costs.

(There are bad ways, too. I've done the grow-your-own Taq thing that RPM mentions in comments; it's not worth it. Too much fiddling and no one else in the lab will trust their experiments to your crappy enzyme anyway.)

For instance, commenter Dave raises a good point about resource pooling. A colleague of mine, lab manager in the last lab I worked in, estimated that he saved the lab about 30% of its running costs just by instituting a central ordering system. Once all orders went through him, he could shop around for best prices and pool orders with other labs to save on shipping. The institute that lab was in also saved itself a ton of money by putting together a central Store, so they could buy in bulk.

(A brief digression. It occurs to me that most of my tens of readers won't be familiar with what it costs to do biomed research. Quite apart from salaries, on-costs and infrastructure, I'd guess that most labs spend at least $500/month/staff member just on reagents and consumables (such as disposable plasticware). For a medium sized lab of five people, that's $30K per year. On top of that, costs vary widely from experiment to experiment; for instance, the lab I'm in now probably spends at least a further $20K/year on facilities for transgenic mice. If, like Orac, you do a lot of qRT-PCR, that's spendy too -- I think it goes close to $0.5/reaction and "a lot" is thousands of reactions per month, if not per week. To take a less fine-grained view, the average cost of an NIH grant (1992-96) was $274,710/year. Them's your dollars, taxpayers, so you should be keeping an eye on us -- in fact, there's a whole nother post -- hell, a whole nother career -- right there.)
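For anyone who wants to poke at those numbers, here is a back-of-envelope sketch in Python. The dollar figures are the rough guesses from the digression above; the qRT-PCR volume of 2,000 reactions per month is a made-up illustration of "a lot", not a measured number, and the function names are mine.

```python
# Back-of-envelope lab running costs, using the rough guesses quoted above.
REAGENTS_PER_PERSON_PER_MONTH = 500   # USD, the "at least" figure from the text
LAB_SIZE = 5                          # a medium-sized lab
MOUSE_FACILITIES_PER_YEAR = 20_000    # USD, transgenic mouse facilities
QPCR_COST_PER_REACTION = 0.50         # USD, approximate


def annual_consumables(staff, per_person_per_month=REAGENTS_PER_PERSON_PER_MONTH):
    """Yearly spend on reagents and consumables for a lab of `staff` people."""
    return staff * per_person_per_month * 12


def annual_qpcr(reactions_per_month, cost_per_reaction=QPCR_COST_PER_REACTION):
    """Yearly spend on qRT-PCR at a given monthly reaction count."""
    return reactions_per_month * cost_per_reaction * 12


consumables = annual_consumables(LAB_SIZE)
qpcr = annual_qpcr(2_000)  # ASSUMPTION: hypothetical volume, for illustration only
total = consumables + MOUSE_FACILITIES_PER_YEAR + qpcr

print(f"consumables:      ${consumables:,.0f}/year")
print(f"mouse facilities: ${MOUSE_FACILITIES_PER_YEAR:,.0f}/year")
print(f"qRT-PCR:          ${qpcr:,.0f}/year")
print(f"total:            ${total:,.0f}/year")
```

Swap in your own lab's headcount and reaction volume; the point is only that the small monthly numbers add up quickly.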

Anyway, the whole point of this post is: there's another kind of resource pooling that is due for an internet-era upgrade: simple "hey have you got an antibody against X?" sharing. A while back, my current PI came up with the idea of a central database for sharing biological reagents; it's an idea best illustrated by example. (For non-scientists that is; labrats reading this will already be punching themselves and going "oh man why didn't I think of that, does it already exist, where is it gimme gimme gimme". Patience, I'll get to it.)

I happen to be interested at the moment in a protein called Smad3. We had an antibody to the molecule, but I also wanted to be able to distinguish between the phosphorylated (active) and non-phosphorylated (inactive) forms. You can buy an anti-phospho-Smad3 antibody, but it'll cost a bundle and you may be buying a lot more than you need. For instance, the one I linked comes in 40 µl lots for $110 (though most antibodies typically aren't sold in such small lots; the 100 µl/$250 size is much more usual). The company says that's enough for 4 blots, but I could probably stretch it to 40 -- if I wanted to run 40 blots, that is. Until I ran the first experiment, I didn't know whether I was going to pursue that line of inquiry, so I didn't want to toss 110 hard-earned taxpayer dollars (plus shipping and handling, and you really get screwed on that believe me) at something that might not pan out. (Plus, I wasn't too keen on the cross-reactivity with pSmad1, a related molecule, that the linked antibody displays.)
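The per-blot arithmetic is worth spelling out, so here is a tiny Python sketch. It uses only the price and blot counts quoted above, leaves out shipping and handling, and the function name is my own.

```python
# Cost per blot for the 40 µl / $110 lot described above (shipping excluded).
LOT_PRICE = 110.0  # USD


def cost_per_blot(lot_price, blots):
    """Price of one lot divided by the number of blots it covers."""
    return lot_price / blots


vendor_estimate = cost_per_blot(LOT_PRICE, 4)    # the company's "enough for 4 blots"
stretched = cost_per_blot(LOT_PRICE, 40)         # my guess at what it could stretch to
single_experiment = cost_per_blot(LOT_PRICE, 1)  # if the first look-see goes nowhere

print(f"vendor estimate: ${vendor_estimate:.2f}/blot")
print(f"stretched:       ${stretched:.2f}/blot")
print(f"one experiment:  ${single_experiment:.2f}/blot")
```

The last line is the one that matters here: if the first experiment doesn't pan out, the whole lot price lands on a single blot.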

In such cases, and there are MANY, MANY such cases, what you typically do is wander forlornly around the building, asking if anyone has the antibody (or plasmid, or yeast strain, or oligo, or whatever it is) that you want. I did that -- even sent a couple of emails to groups elsewhere on campus -- but no luck. So I did the next thing you do, which is I ran a few searches and read a few papers, and discovered that there were a couple of antibodies in the literature that fit my requirements nicely. One of these was made in the lab of Prof Ed Leof at the Mayo Clinic; promisingly, it was cited in several papers by other groups ("the anti-pSmad3 antibody was a gift from Ed Leof"). So I sent Prof Leof email, and about 24 hours later someone in his lab sent me enough of his antibody for at least 100 blots (Prof Leof, Dr Edens -- if you're reading this, thanks again, and FYI the Ab can be re-used at least ten times, just put azide in the dilution buffer). All it cost our lab was FedEx shipping for a small container on dry ice.

Now, that's the way it's supposed to work -- and in fact, in my experience, the majority of such requests are met with similar collegiality and generosity. For myself, I am always pleased when I can help a colleague out. But here's the thing -- there's probably a lab right here on campus that has an antibody I could have used. I tried the obvious suspects (labs working on systems in which Smad3 might play a role), but even though they didn't have any I bet there's someone on campus who does. It's even likely that they bit the bullet and coughed up for the antibody on spec, and it's been sitting in their -80°C freezer since that first experiment didn't go the way they hoped. That shit happens all the time, ask any researcher. I want to emphasize that: this whole example, from me wanting something for just one look-see experiment to the likelihood that it was available on campus but I just couldn't find it, happens all the time.

Enter the idea whose time has come: an online database into which labs everywhere input the biological reagents they're willing to share: antibodies, plasmids, viruses, bacteria, yeast, mutant model animals, peptides, oligos, primer sets, cytokines, spendy chemicals -- the list of potential shareables is enormous and ever-expanding. Some of this functionality exists -- for instance if the mouse you want already exists, Jackson Labs probably has it or knows about it, and you can always do the literature thing like I did -- but it's scattered and inefficient. Think how much easier my quest for an anti-pSmad3 antibody would have been made by such a tool: one search and up comes a list of labs and antibodies, pick an antibody, sort the resulting labs by location, email (or walk over to) the nearest one. Here's another example: I have a new search going right now -- I want some Smad2/3-null mouse embryo fibroblasts and a set of Smad2/3 expression plasmids. I've sent out seven or eight emails to colleagues I found in the literature; I've had one negative and one positive response, but the positive response depends on permission from someone else from whom I'm still waiting to hear. I'm still not sure I'm bothering the right people, it's been almost a week (and Thanksgiving's coming up), and dammit there's probably someone in Portland, or maybe at the Hutch in Seattle, who has what I want and would share it with me.
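To make the "one search, sort by location" idea concrete, here is a minimal Python sketch of the lookup I have in mind. Everything in it is an assumption for illustration: the `Listing` fields, the `find` function, and the sample labs, campuses and addresses are all invented, and a real system would need a proper database, accounts and privacy controls.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Listing:
    """One shareable reagent offered by one lab (hypothetical schema)."""
    reagent: str   # e.g. "anti-pSmad3 antibody"
    kind: str      # "antibody", "plasmid", "strain", "oligo", ...
    lab: str
    campus: str
    contact: str


def find(listings, query, my_campus):
    """Case-insensitive substring search, with labs on my campus listed first."""
    q = query.lower()
    hits = [x for x in listings if q in x.reagent.lower()]
    # False sorts before True, so same-campus labs come first,
    # then everything else grouped by campus and lab name.
    return sorted(hits, key=lambda x: (x.campus != my_campus, x.campus, x.lab))


# Made-up sample data; none of these labs or addresses are real.
LISTINGS = [
    Listing("anti-pSmad3 antibody", "antibody", "Lab A", "campus-2", "a@example.org"),
    Listing("Smad2 expression plasmid", "plasmid", "Lab B", "campus-1", "b@example.org"),
    Listing("anti-pSmad3 antibody", "antibody", "Lab C", "campus-1", "c@example.org"),
]

for hit in find(LISTINGS, "pSmad3", my_campus="campus-1"):
    print(hit.lab, hit.campus, hit.contact)
```

Sorting by "same campus or not" is the crudest possible stand-in for location; real distances would be better, but even this much would have saved me a week of emails.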

See what I mean? Happens all the time. I want that database and I want it now! Peter suggested I make it happen, at least initially on a limited, local scale -- start with the six labs in our institute, then expand to include the whole OHSU campus. Great idea, so as a first step I googled around to see whether anyone had already done it -- turns out they have:

Welcome to BioRoot Bioinformatics
BioRoot is a non-profit organization dedicated to fostering communication, collaboration, and increased productivity in the biological sciences through information exchange.
We provide centralized databases to collect, store, and disseminate information about commonly used molecular biology reagents: antibodies, plasmids, strains, and oligonucleotides.
Use of these BioReagent databases will cut costs, save time, and accelerate research benefiting the bench scientist, the PI, and the public.

W00t!

The guy behind it is David Nix, who clearly has the programming chops to go from "an Excel spreadsheet uploaded to my web space somewhere", which is probably where I was going to start, to a fully-functional database complete with privacy/security measures. Major, major kudos, dude. (What is slightly odd is that I found out about BioRoot by googling, and found David the same way. Why haven't I heard of this everywhere? It's the best thing since PubMed, it should be huge. Apart from one subscriber-only article in The Scientist, I couldn't find anything.)

I've sent BioRoot an email, so we'll see how things work out. If it's what it looks like (and why wouldn't it be?), I'm going to become a hardcore BioRoot evangelist.



Friday, 28 May
scooped again

Two Spanish researchers have shown (original here) that two leading journals routinely publish statistical errors:

The analysis revealed that at least one error appeared in 38 per cent of the Nature papers and 25 per cent of the British Medical Journal papers looked at. Furthermore, the study estimates that four per cent of results reported to be statistically "significant" may not be significant after all.
Yet again, the Spanish study is an example of someone actually doing something I thought of some time ago. (Fortunately for me, I'm usually only pleased when this happens, because I know perfectly well that I'll never do anything with the idea.) I am woefully ignorant of statistics, and probably have published overly simplistic analyses myself (though I am careful about claims of significance, and am confident that I've made no errors there). This sorry state is much more prevalent among biomed researchers than it ought to be, so I'm not surprised by the study's findings. García-Berthou and Alcaraz also make another point upon which I've been known to wax shrewish:
As well as warning researchers and editors to be more careful with data, they also urge the publication of raw data online. "If we had that, we could check the results," García-Berthou says. "Some journals already publish supplements online, but it's rare, and I think it should become commonplace."
I think it should by now be viewed as low-rent not to make your raw data available online. There's no reason not to do it, unless you're hiding something; if the journal doesn't provide the option, the server space and bandwidth costs are well within reach of any research institution. I'm convinced that it will become a standard part of scientific publishing. (Obdisclosure: I haven't made any of my raw data available online, even though it was about the first thing I thought of when I came across the net, way back in 1993. I could never convince the higher-ups that it was a good idea. I'll start doing it as soon as I'm high enough on the food chain to insist on it, which I hope will be from the next paper onwards, paying for the hosting myself if need be.)



CC0
To the extent possible under law, I have waived all copyright and related or neighboring rights to this weblog. This work is published from the United States. Further information.


