open access/open science Category Archive



Tuesday, 30 June
Perfect match?

Surely this:


doe.jpg

You may find a technical report that you want to share with others or you think worthy of making broadly available on the Web to support the advancement of science. When you search for important science information in your area of interest, you can choose to sponsor the digitization of any adoptable technical report. The cost is $85 (approximately the same cost as ordering a hard copy). Discounts for multiples of 5 or more adoptions may be available. If you are interested in a larger scale project, please contact (865) 576-5699.



is a job for this guy:


malamud.jpg
... Most recently, Malamud has set up the nonprofit public.resource.org, headquartered in Sebastopol, California, to work for the publication of public domain information from local, state, and federal government agencies. Among his victories have been digitizing 588 government films for the Internet Archive and YouTube, publishing a 5 million page crawl of the Government Printing Office, and persuading the state of Oregon to not assert copyright over its legislative statutes.

?


(CC-BY image of Carl Malamud from Joe Hall via Wikimedia)



Sunday, 21 June
OA vs TA costs: I think I have finally got this straight.

I made some errors in the last few posts, making the information somewhat scrambled -- my apologies. Here is what I hope is a clear picture of what we know about the relative costs of OA and TA publishing.

1. The NIH estimates that it pays $100 million/year in author-side charges, and supports the production of some 80,000 scholarly articles; that's an average of $1250/article.

Update: Peter Suber points out that some fraction of that 80,000 articles did not use NIH funds, either because they were published in no-fee journals or because the authors found other ways to pay. I can't think of any way to estimate the actual number of articles the $100 million paid for in order to adjust the estimated fee/article, but it's worth remembering that it's an underestimate.

2. Björk et al. found that less than 5% of all articles worldwide are available through no-embargo Gold OA. We don't know what proportion of the NIH's $100 million went to Gold OA fees, nor what the average such fee might be. In order to be conservative, let's assume that the average Gold OA fee is triple the average TA fee (it almost certainly isn't that high). Then (if that 5% is evenly distributed) the NIH paid for (0.95x80000=) 76,000 articles at $average and 4,000 articles at 3x$average, bringing the average author-side charge for a TA article to $1136.

3. Philip Davis' 2004 library costs spreadsheet estimates the average subscription charge per scholarly article at between $970 and $1750, depending on what proportion of the library serials budget is allocated to scholarly publications.

subscriptionperarticle.jpg

Davis' original study estimated this proportion at 50% (on what basis I don't know), but I think the real value is closer to 90%. My reasoning is based on my observation (see Table 2) that the average unit cost of a curated list of scholarly journals from UCOSC is about ten times the average unit cost of "all serials" from ACRL, ARL and NCES datasets. If that result is broadly representative it means that scholarly journals must contribute either a small fraction or the vast majority of the cost (see here for a brief explanation).

So that gives an estimated fee of between $2106 and $2886 per toll-access article. That money isn't all coming from the same place -- the NIH is paying author-side fees and libraries are paying subscriptions -- but it's all going to the same place, publisher coffers.
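As a rough check on the arithmetic in points 1 to 3, here is a minimal Python sketch. The function and variable names are my own (hypothetical), the inputs are the figures quoted above, and the 3x Gold OA multiplier is the deliberately conservative assumption from point 2, not a measured value.

```python
def ta_author_fee(total_spend, n_articles, gold_share=0.05, gold_multiplier=3.0):
    """Average TA author-side fee, assuming a fraction `gold_share` of the
    articles paid `gold_multiplier` times the TA average (an assumption)."""
    ta_articles = n_articles * (1 - gold_share)
    gold_articles = n_articles * gold_share
    # total_spend = ta_articles * fee + gold_articles * gold_multiplier * fee
    return total_spend / (ta_articles + gold_multiplier * gold_articles)

raw_average = 100e6 / 80000           # point 1: unadjusted NIH average
ta_fee = ta_author_fee(100e6, 80000)  # point 2: adjusted for 5% Gold OA at 3x

# Point 3: add the low and high subscription cost/article estimates.
low_total = ta_fee + 970
high_total = ta_fee + 1750

# Truncated to whole dollars, as in the text.
print(int(raw_average), int(ta_fee), int(low_total), int(high_total))
# prints: 1250 1136 2106 2886
```

Changing `gold_share` or `gold_multiplier` shows how little the headline range moves under less conservative assumptions.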

I've added a current (under)estimate of NIH costs for author-side fees, adjusted for a 2006 estimate of %OA by article, to a 2004 estimate of subscription fee/article, but I'm confident that the real cost (if I could get up-to-the-minute figures for all inputs) would be in the same ballpark.

Sure puts one-time, up-front Gold OA fees in a different perspective, doesn't it? Here's a reminder (stupid Impact Factors in brackets just because I know a lot of people still think they mean something even though they don't):


average revenue per toll-access article 1 ................ $2100 - $2900

BioMed Central
Genome Biology (6.6) ..................................... $2250
BMC Biology (5.1) ........................................ $1950
Molecular Cancer (3.7) ................................... $1710
Retrovirology (4.0) ...................................... $1390
J. of Cardiovascular Magnetic Resonance (1.9) ............ $1195

Hindawi
Comparative and Functional Genomics (1.6) ................ $850
J. of Biomedicine and Biotechnology (1.9) ................ $975
Mediators of Inflammation (1.2) .......................... $975
Bioinorganic Chemistry and Applications (1.0) ............ $700

Public Library of Science
PLoS Biology (13.5), PLoS Medicine (12.6) ................ $2850
PLoS Pathogens (9.3), Neglected Tropical Diseases (n/a),
Genetics (8.7) and Comp Biol (6.2) ....................... $2200
PLoS ONE (n/a) ........................................... $1300

Other
J. Medical Internet Research (3.6, best in field) ........ $1590
Biological Procedures Online (1.2) ....................... $1250
J. of Clinical Investigation (16.9) ..................... ~$2500



1 Update: since D0r0th34 has already pointed out one dumb thing I did, neglecting other revenue streams available to TA but not OA publishers, I think that rather than continually update this post I'll just go ahead and embed the FriendFeed discussion right here:







Friday, 19 June
OA and strategy

Stuart Shieber recently gave a talk at Caltech, which prompted the following blogospheric exchange with Stevan Harnad (which I recommend highly if you are interested in Green vs Gold OA and the intricacies of OA mandate politics):

Harnad --> Shieber --> Harnad

followed by this related post on "proportion and strategy" from Prof Harnad, the main points of which he also left as a comment on a couple of my posts:

#1: The vast majority of current (peer-reviewed) journal articles are not OA (Open Access) (neither Green OA nor Gold OA ).
#2: The vast majority of journals are not Gold OA.
#3: The vast majority of journals are Green OA.
#4: The vast majority of citations are to the top minority of articles (the Pareto/Seglen 90/10 rule).
#5: The vast majority of journals (or journal articles) are not among the top minority of journals (or journal articles).
#6: The vast majority of the top journals are not Gold OA.
#7: The vast majority of the top journals are Green OA.
#8: The vast majority of article authors would comply willingly with a Green OA mandate from their institutions and/or funders.
#9: The vast majority of institutions and funders do not yet mandate Green OA.
#10: The vast majority of Gold OA journals are not paid-publication journals.
#11: The vast majority of the top Gold OA journals are paid-publication journals.
#12: The vast majority of institutions do not have the funds to subscribe to all the journals their users need.

CONCLUSION I: The fact that the vast majority of Gold OA journals are not paid-publication journals is not relevant if we are concerned about providing OA to the articles in the top journals.

CONCLUSION II: Green OA, mandated by institutions and funders, is the vastly underutilized means of providing OA.

CONCLUSION III: It is vastly more productive (of OA) for universities and funders to mandate Green OA than to fund Gold OA.

I think there is a considerable strategic error embedded in those premises and the conclusions which follow, the basis of which is the emphasis on "the top minority of journals (or journal articles)". The 90/10 rule is not relevant: the goal of OA is 100% OA, not 10% -- not even "the top" 10% in which is concentrated 90% of whatever your metrics are measuring.

Much of the potential of OA lies in the provision of a comprehensive corpus of information on which to build the semantic web. Comprehensivity matters, because just as re-use beyond the scope of the original author's imagination is a primary impetus for information sharing between humans, it is folly to imagine that we can determine ahead of time what will matter to machines -- that is, which articles will be crucial to finding new and unexpected connections in text- and data-mining initiatives. The more complete the corpus, the more likely we can refine from it insights that are currently unpredictable.

Also, in an odd bit of circularity, 100% OA is vital to the development of rich, fine-grained, multiply cross-validated metrics that will likely be more reliable than existing metrics in guiding management decisions and researcher information searches. If we focus on "the top" journals and articles, we hamstring our best strategy for improving the methods with which we identify quality in the first place.

It's also worth addressing claim #11 separately. For the direct argument against the assertion that most of the "top" Gold OA journals charge fees, see Peter Suber:

If this is a claim about quality, or about future submission patterns, as opposed to present submission patterns, then it's an assumption for which there is no evidence.  Nobody has done the studies. [...] In the absence of studies, this is all we know:
[T]here are strong and weak OA journals, just as there are strong and weak TA journals. Hence, any analysis focusing on weak OA journals and strong TA journals (as if to show the superiority of TA journals) would be as arbitrary as one focusing on weak TA journals and strong OA journals (as if to show the superiority of OA journals). Without some additional argument showing that the journals on which they focus are typical of their breeds, they would be guilty of cherry-picking and generalizing from an unrepresentative sample.

There is, however, a neglected and (in my opinion) important counter-argument: even if that assertion is true, it is surely equally or more the case that the vast majority of toll-access journals charge author-side fees in addition to subscription charges. A 2005 Kaufman-Wills study found that 75% of TA journals in their sample charged author-side fees. There is at least as much reason to suppose that the top-ranked TA journals are to be found among the fee-charging cohort as there is to suppose the same of OA journals.

The NIH estimates that it pays author-side fees to the tune of $100 million per year, and funds the publication of some 80,000 scholarly articles. Assuming, in order to be conservative, 5% Gold OA at fees that are triple the average TA fee, that averages out to $1136/article, but what's sauce for the TA goose is sauce for the OA gander: if the Kaufman-Wills figures are broadly representative then those TA journals that charge additional author-side fees are charging, on average, $1515 per article. That's more than PLoS ONE, more than most BMC journals and more than any Hindawi journal.

It follows that, since we are not -- that is, I argue that we should not be -- "concerned about providing OA to the articles in the top journals", the fact that most Gold OA journals do not charge fees is in fact relevant to all strategies for increasing OA to the research literature.

I think I disagree with the second conclusion also -- in the most comprehensive study so far, about 8% of articles published in 2006 were available via Gold OA, whereas a further 11% were available as self-archived copies. I agree, of course, that both are vastly underutilized relative to the goal of 100% OA, but it doesn't seem to me that Green suffers more neglect than Gold.

Given the flaws in some premises and the first two conclusions, I don't believe that conclusion 3 stands up either. I find Stuart Shieber's argument for the Harvard model compelling:

In summary, a university that commits to the open access compact1 will more easily be able to answer objections against green OA policies specifically because it has an approach to long-range support for gold OA publishing, not in spite of it. The two models are inextricably tied. I, like Professor Harnad, am interested in facilitating the adoption of green OA policies. I proposed the open access compact in large part because I expect that adoption of the compact will lead to more green OA policies. The open access compact is therefore contributory to the promotion of green OA, not a sidetrack to it. I of course encourage universities to adopt green OA policies before gold OA support, but given that dystopian fears of faculty are preventing adoption of such policies, an open access compact that might assuage these worries should not be delayed.

1 The compact simply states that "The university commits to underwrite reasonable article processing fees for open-access journals for which funds are not otherwise available".

Given all of the above, the optimal strategy seems to me to be the one adopted by Harvard: a Green OA mandate and careful (fiscally responsible) support for Gold OA.



Friday, 19 June
Update and correction re: cost to libraries and author-side fees

In comments below, Peter Suber points out that the NIH has amended its estimates to $100 million/yr spent on author-side charges and 80,000 manuscripts funded -- which brings the estimated average author-side fee to $1250, well in line with the individual journal estimates I made and the published figures I found. This is an important number because it is derived from a very large sample of the scholarly literature and casts a very different light on OA author-side fees than the one that TA publishers are wont to shine on their competitors. Compare, for instance, PLoS ONE at $1300, or the standard BMC charge of $1470 -- for a couple hundred dollars more than the average cost of a TA publication, you can make your work free for all users to access, immediately and permanently. (It would be interesting to know what proportion of the $100 million is going to OA fees, though I doubt it would be large enough to make a significant dent in the average TA charge. Edit: according to Björk et al., less than 5% of all articles are available via no-embargo Gold OA; taking this into account, and assuming that the average Gold OA fee is triple the average TA fee, gives an average of $1136/article.)

But! However! There is a flaw in my reasoning!

The problem is not with the estimate of author-side charges, but in my use of that estimate to update Philip Davis' library costs study. The point of that study was to look at what libraries would pay in an all-OA model, which is why I used the fractional cost matrix1 and graph in the first place. See the problem? Libraries don't pay the toll-access author-side charges, the NIH does! This makes the model a little artificial, perhaps, since *someone* has to pay those charges regardless of which journal levies them; nonetheless, the idea was to estimate practical library costs, so the TA author-side fees should not be included.

Here's what the updated situation looks like with the subscription/article estimate NOT adjusted for TA author-side fees (see my earlier post for details of the calculations):

Davisupdatecorrected.png

The fractional cost has to drop to 0.4 before there are no libraries predicted to pay more in the OA model -- as I pointed out in the original post, there are numerous realistic combinations that will result in a fractional cost of 0.4 or lower:

matrix2.png

The new figures also show that the fractional cost has to drop below 0.2 before all 113 libraries are predicted to save money in an OA model. That still seems to me to fall within a realistic range, given that 70% of journals in the DOAJ don't charge author-side fees (so 30% do) and 45% of researchers in a recent RCUK study had their OA fees covered by their research funders (so 55% did not), for a fractional cost of 0.30 x 0.55 = 0.165.

Nonetheless, it's worth taking a quick look at the libraries which are predicted to pay about the same in the OA and TA models. At a fractional cost of 0.4, they are: UC Davis, LA and San Diego, Univ Colorado, Cornell, Harvard, Johns Hopkins, McGill, Univ Massachusetts, Univ Maryland, MIT, Univ Toronto, Univ Washington and Univ Wisconsin. At a fractional cost of 0.3, only UC Davis, UCLA, Harvard, McGill, Maryland and Washington remain in the "pay about the same" category.

It's easy enough to guess what these universities have in common, and a simple analysis confirms it:

rank.png

Shading the top six yellow and the next 8 blue for visibility and ranking the libraries according to FTE, serials expenditure and "estimated scholarly articles published" reveals that the 14 "pay-same" libraries have only a slight tendency to be among the larger schools, but cluster very strongly at the high end of the "scholarly articles published" ranking. In other words, research-intensive schools that publish a lot may put more pressure on their libraries in the OA world (to the extent that libraries are likely to be asked to repurpose serials costs for OA charges).

Among other things, it was in order to examine this particular concern in detail that Davis carried out his original study, and for the same reason I have here updated it with more recent estimates and assumptions. The newer numbers show that a realistic worst-case scenario is that the libraries in question (14 out of 113 total) don't save any money in the OA model.

-------------
1 I neglected to mention in earlier posts that I got the %fee x %funded matrix idea (of which the fractional cost graph is an obvious extension) from Peter Suber. My apologies to Peter; I'm usually more careful about crediting sources.



Thursday, 18 June
Cost to libraries: OA vs TA

Note: important update/correction.

In 2004, Philip Davis carried out a study of library costs in which he estimated the average subscription cost/article for a subset of ARL libraries and compared this with a range of estimated author-side fees for Gold OA, in order to determine whether libraries might pay more or less if all journals switched to OA. Here I've tried to update that study using information that wasn't available back then.

Davis set the spreadsheet up to make it easy to update his assumptions and recalculate (kudos!), and Peter Suber (among others) pointed out that at least the following assumptions should be updated:

  1. all OA journals charge author-side fees
  2. the full cost of OA fees will be borne by libraries
  3. TA journals charge no author-side fees

We now have five different studies (one recently confirmed, improved and updated) showing that in fact the majority of OA journals do not charge author-side fees. The highest proportion of no-fee journals is in the DOAJ psychology subset (90%) and the lowest is in the chemistry subset (49-58%); the most recent analysis of the entire DOAJ showed 70% no-fee.

We also know that research funders are increasingly willing to foot the bill for OA. For example, HHMI has institutional agreements/memberships with BMC, Springer and Elsevier, and BMC's page of funder policies shows that a majority of UK funders either make additional funds available or allow publication charges to be treated as an indirect cost. A recent RCUK report showed that 45% of authors publishing in fee-based OA journals had their costs covered by their research funders.

Rather than pick a single number for either of these updates, I've plotted the fraction of the OA cost borne by libraries against the number of institutions at which OA is predicted to cost more than, the same as, or less than the TA model. The fractional cost borne by libraries is the product (100% - %covered by funders) x (%OA journals charging fees), expressed as a fraction between 0 and 1. (See Figs 1 and 2 below.)
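To make the fractional cost concrete, here is a small Python sketch of that product. The function name and the particular grid of percentages are my own choices for illustration; the 45% funder coverage and 30% fee-charging figures (that is, 70% no-fee) are the RCUK and DOAJ numbers mentioned in these posts.

```python
def fractional_cost(pct_funded, pct_fee_charging):
    """Fraction of the all-OA bill that would land on libraries:
    (100 - %covered by funders) x (%OA journals charging fees)."""
    return (100 - pct_funded) * pct_fee_charging / 10000

# A small %fee x %funded matrix (grid values chosen only for illustration).
fee_levels = [30, 50, 100]
funded_levels = [0, 45, 70]
print("fee/funded", *funded_levels, sep="\t")
for fee in fee_levels:
    row = [f"{fractional_cost(funded, fee):.3f}" for funded in funded_levels]
    print(fee, *row, sep="\t")

# RCUK 45% funder coverage combined with DOAJ 30% fee-charging journals.
example = fractional_cost(45, 30)
print(round(example, 3))  # prints: 0.165
```

The two extremes behave as expected: with no funder support and every journal charging, libraries bear the full cost (1.0), and with full funder support they bear none.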
 
We don't know much about author-side fees at toll-access journals, but we do have some information. Firstly, the 2005 Kaufman-Wills report showed that more than 75% of the 247 toll-access journals in their sample charged author-side fees in addition to subscriptions. Secondly, I just had a rough-and-ready look at a small number of TA journals and found average author-side fees ranging from $400 to almost $3000. Finally, the NIH estimates (scroll to section L) that it spends over $30 million/year in author-side fees and funds the production of around 60,000 manuscripts. This means that the NIH is paying, on average, about $500/article in page charges. Since this is the largest sample we have, I've used this figure to update the spreadsheet. I added $500/article to the calculated serials expenditure/article and compared this adjusted TA cost/article to the OA costs.

Update: this was a mistake! The point of the exercise was to compare existing library subscription costs with predicted OA costs, and libraries are not currently paying the TA author side fees. See this post for the correctly updated version of the Davis study.

I've updated two further aspects of Davis' spreadsheet. First, we now have better information about the actual range of author-side fees charged by those OA journals that do charge them. Rather than Davis' $2500 - $5000 range, I've used $1300 (PLoS ONE) to $3000 (most of the high-profile hybrid programs). If the adjusted TA cost/article falls within this range, the prediction is that the OA and TA models cost about the same from a library point of view.

Second, Davis assumed that the scholarly literature made up 50% of library serials expenditures. I don't know where this figure came from (the spreadsheet refers to a report which does not give any further information), but I think the real value is closer to 90%. My reasoning is based on my observation (see Table 2) that the average unit cost of a curated list of scholarly journals from UCOSC is about ten times the average unit cost of "all serials" from ACRL, ARL and NCES datasets. If that result is broadly representative it means that scholarly journals must contribute either a small fraction or the vast majority of the cost. Here's a simple explanation: suppose 1000 items at an average cost of $10; then average cost of the scholarly items must be about $100 if the "10 x all serials" rule is accurate. So you can either have 90 scholarly items and 910 non-scholarly items at about $1, or you can have one scholarly item and 999 non-scholarly items at about $10. What you can't have, for the averages to work out according to the "10 x" rule, is any ratio close to 50% scholarly/50% non-scholarly.
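The 1000-item example can be checked with a few lines of Python. This is only a sketch of the toy numbers above (1000 items, $10 overall average, scholarly items at ten times that); the function name is hypothetical and nothing here comes from the actual datasets.

```python
def implied_nonscholarly_price(n_scholarly, n_total=1000, avg_all=10.0, multiple=10.0):
    """Average price the non-scholarly items must have so that the overall
    average stays at `avg_all` while scholarly items cost `multiple` x avg_all."""
    scholarly_price = multiple * avg_all
    remaining_budget = n_total * avg_all - n_scholarly * scholarly_price
    return remaining_budget / (n_total - n_scholarly)

for n in (1, 90, 500):
    price = implied_nonscholarly_price(n)
    share = n * 100.0 / (1000 * 10.0)  # scholarly share of total spend
    verdict = "possible" if price >= 0 else "impossible (negative price)"
    print(n, round(price, 2), round(share, 2), verdict)
# prints:
# 1 9.91 0.01 possible
# 90 1.1 0.9 possible
# 500 -80.0 5.0 impossible (negative price)
```

With 90 scholarly items they account for 90% of spending; with 500 the non-scholarly items would need a negative price, which is why an even split of titles cannot satisfy the "10 x" rule.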

Summary of updates:

  1. plot fractional cost borne by libraries to account for %OA journals that don't charge fees and % OA costs borne by research funders (or other bodies)
  2. add $500/article to TA model costs to account for author-side fees charged in addition to subscriptions
  3. predicted OA fee range = $1300 to $3000
  4. assume scholarly literature makes up 90% of serials expenditure

The updated spreadsheet is here, and the end result is this:

Davisupdate_errornote.png

At a fractional cost of 0.8, there are no libraries at which OA is predicted to cost more than the TA model, and at a fractional cost of 0.3 the OA model is predicted to cost less than the TA model at all 113 libraries.

To see how the %fee and %funder proportions affect the fractional cost borne by libraries, I constructed a simple matrix and highlighted the two cutoff points shown on the graph above:

Davisupdate_fraction.png

As you can see, there are a number of perfectly reasonable combinations which result in a fractional cost of 0.3 or less, at which all the libraries in the sample would save money under the OA model. (This, by the way, is exactly what Peter Suber predicted.)

Update/correction: see this post.



Thursday, 18 June
Author-side fee comparison: OA vs TA.

I've posted a couple of times about the misconception that all OA journals charge author-side fees, and each time I've mentioned the Kaufman-Wills study which found that 75% of the toll-access journals they examined charged author-side fees in addition to subscription charges. I thought it would be useful to compare author-side fees charged by OA and TA journals.

It's easy to work out what OA and hybrid journals charge; BMC maintains a detailed list of publisher article processing charges.  Here are some examples: 

PLoS journals charge in three tiers:
PLoS ONE, $1300
PLoS Pathogens, NTDs, Genetics and Comp Biol, $2200
PLoS Biology and Medicine, $2850

BMC charges between $1105 and $2095 for most journals, and their standard charge is $1470

Hindawi charges between $275 and $850 for most of their journals, with a few titles up to $1400

Springer Open Choice, Wiley Funded Access and Elsevier's Sponsored Articles all cost $3000. (*cough*)

What is much more difficult to determine is how much the average author is paying in author-side fees at toll-access journals, because the charge for a given article depends on number of pages and/or color figures, and in some cases also on whether supplementary information is included.

Below are a few examples; in each case for which I calculated a figure, I extracted the page and figure counts manually from a single issue. This is far too small a sample to be representative, but I'm just trying to get some kind of feel for the numbers. Further, the published figures I managed to find (indicated by footnotes) are consistent with my "calculated guesses". Also, the NIH estimates (scroll to section L) that it spends "over $30 million annually in direct costs for publication and other page charges" and produces "roughly 50,000 - 70,000 manuscripts", which means that the NIH is paying, on average, about $500/article in page charges. If around 8% of all new articles are Gold OA, that number goes up to about $543/article. If the Kaufman-Wills 75% figure is representative, then the average author-side fee being charged is $666/article, or $724/article if the %OA is taken into account. (Note that the %OA adjustment might be spurious and the estimated average slightly off, because we don't know how much of the estimated $30 million is going to Gold OA fees.) Edit: according to Björk et al., only about 5% of all articles are available through Gold OA without an embargo period. Taking this into account, and assuming that the average Gold OA fee is triple the average TA fee, gives an average of $454/article, or $606/article on the Kaufman-Wills estimate.

Update: In comments, Peter Suber points out that the NIH has amended its estimates to $100 million/yr spent on author-side charges and 80,000 manuscripts funded -- which brings the estimated average author-side fee to $1136; if only 75% of TA journals are charging such fees, then they are charging on average $1515.
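Since the chain of estimates above is easy to lose track of, here is a short Python sketch that reproduces each figure. The variable names are mine, the $30 million / 60,000 and $100 million / 80,000 inputs are the NIH estimates quoted above, and the 8%, 5%, 3x and 75% factors are the assumptions stated in the text.

```python
KW_SHARE = 0.75  # Kaufman-Wills: share of TA journals charging author-side fees

# Original NIH estimate: $30 million over roughly 60,000 manuscripts.
spend, articles = 30e6, 60000
base = spend / articles                                # all articles share the bill
no_gold = spend / (articles * (1 - 0.08))              # 8% Gold OA drew nothing from it
gold_3x = spend / (articles * (0.95 + 3 * 0.05))       # 5% Gold OA at triple the TA fee

for label, fee in [("base", base), ("8% Gold OA", no_gold), ("5% at 3x", gold_3x)]:
    # Second number: the same fee if only 75% of TA journals charge anything.
    print(label, int(fee), int(fee / KW_SHARE))
# prints:
# base 500 666
# 8% Gold OA 543 724
# 5% at 3x 454 606

# Amended NIH estimate: $100 million over 80,000 manuscripts, 5% Gold OA at 3x.
amended = 100e6 / (80000 * (0.95 + 3 * 0.05))
print(int(amended), int(amended / KW_SHARE))  # prints: 1136 1515
```

All values are truncated to whole dollars, which is how the figures in the text appear to have been rounded.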

This section became way too cluttered, so I've put a summary here and the details are below:

journal .................................... average author side fee
PNAS ............................................... $1446
Science ............................................ $1019
Nature ............................................. $1669
Cell ............................................... $2031
Cell Cycle ......................................... $756
EMBO J ............................................. $2974
Mol Biol Cell ...................................... $1829 1
American Physiological Society (14 journals) ....... $1000 2
Journal of Nutrition ............................... $456
J Neuroscience ..................................... $850 + color charges 2
Molecular Biology and Evolution .................... $922 3
Molecular Plant-Microbe Interactions ............... $1275 4
J Natural Res & Life Sci Education ................. $400

1 official figures, 2006
2 official figures, current
3 official figures, 2008
4 official figures, 2000


The selection of journals is fairly random, just the first few that came to mind then whatever turned up when I was searching for things like "average page color charges". They range from prestige to niche, and even the cheapest charge fees that amount to a significant fraction of Gold OA author-side fees.

It would be very interesting to extend this half-baked pilot study, but I think it would also be unavoidably labor intensive. Except for rare cases where publishers provide the numbers, there's really no way to calculate average author-side fees based on page and figure counts except by doing those counts for a representative sample of issues in each journal. (Perhaps a passing statistician could help me figure out what would constitute a representative sample -- perhaps sqrt(issues/year)?) Then you have to select which journals to investigate -- perhaps high, middle and low ranked journals in a handful of broad categories? Finally, it's pretty slow going, so I don't think Mechanical Turk would be cost effective for this job -- even if you could solve the problem of giving Turkers access to the journals. In the end I think you'd have to inflict the counting task on some hapless grad student or intern, who would probably find it easiest to sit in a library with a stack of journals and a spreadsheet.







----------------------------------------details of "calculated guesses" and official figures----------------------------------------

PNAS: $70/page, $250 for supplementary information, $300 per color figure or table

March 17 2009 vol 106 issue 11: 88 papers, pp 4079 to 4570; mean = 5.6 pages

5.6 pages = $392

10 papers had no supplementary info so mean SI = 78/88 = 0.886 = $221

approx every 5-6th paper examined, 18 in total:

5 color figures ($1500) ii
4 color figures ($1200) iiiii i
3 color figures ($900)  ii
2 color figures ($600)  iiii
1 color figure  ($300)  ii
0 color figures ii

mean color cost = $833; mean total cost/article = $1446
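For anyone who wants to repeat this kind of "calculated guess", here is a minimal Python sketch of the PNAS calculation. The rates and tallies are the ones listed above; the variable names and the dictionary layout for the tally are just my own way of writing it down.

```python
PAGE_RATE = 70    # dollars per page
SI_FEE = 250      # flat fee for supplementary information
COLOR_FEE = 300   # dollars per color figure or table

papers = 88
mean_pages = round((4570 - 4079 + 1) / papers, 1)   # 492 pages over 88 papers
page_cost = mean_pages * PAGE_RATE

papers_with_si = papers - 10                         # 10 papers had no SI
si_cost = papers_with_si / papers * SI_FEE

# Tally from the 18 sampled papers: {number of color figures: number of papers}
color_tally = {5: 2, 4: 6, 3: 2, 2: 4, 1: 2, 0: 2}
sampled = sum(color_tally.values())
mean_color = sum(n * COLOR_FEE * count for n, count in color_tally.items()) / sampled

total = page_cost + si_cost + mean_color
print(mean_pages, round(page_cost), int(si_cost), int(mean_color), int(total))
# prints: 5.6 392 221 833 1446
```

The same pattern works for the color-only journals: replace the flat per-figure fee with that journal's pricing and drop the page and SI terms.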

In 2004 Cozzarelli et al. suggested that around $2000/article would be needed to cover PNAS'  costs without subscription income.


Science: $650 for the first color figure, $450/color figure thereafter

March 20 2009 vol 323 issue 5921: 2 research articles, 11 reports:

4 color figures ($2000) iii
3 color figures ($1550) i
2 color figures ($1100) iiii
1 color figure ($650) ii
0 color figures iii

mean color cost = mean cost/article = $1019

 
Nature: £735 ($1072) for the first colour figure and £262.50 ($383) for each additional figure (note: "Inability to pay this charge will not prevent publication of colour figures judged essential by the editors")

March 19 2009 vol 458 number 7236: 2 articles, 12 letters:

5 color figures ($2604) ii
4 color figures ($2221) iiii
3 color figures ($1838) iii
2 color figures ($1455) iii
1 color figure ($1072) i
0 color figures ii

mean color cost = mean cost/article = $1669


Cell: $1000 for the first color figure and $275 for each additional color figure. 

March 20 2009 vol 135 number 6: 12 articles:

7 color figures ($2650) iii
6 color figures ($2375) iii
5 color figures ($2100) ii
4 color figures ($1825)
3 color figures ($1550) ii
2 color figures ($1275)
1 color figure  ($1000) ii
0 color figures

mean color cost = mean cost/article = $2031


J Neurosci: $850 for regular manuscripts, $450 for brief communications, color figures are free "when color is judged essential by the editors and when the first and last authors are members of the Society for Neuroscience", otherwise $1,000 each.

March 18 2009 vol 29 issue 11: 28 articles; looked at 4 random articles, number of color figs = 6, 8, 5, 1.  Regular SfN membership is $160.  I'm guessing most authors are members but it's still impossible to tell how much each paper is being charged for color.


Landes Bioscience (all journals): four pages free, then $80/page; $340 for the first color page and $150 for each additional color page (in print -- color is free online)

Cell Cycle March 15 2009 vol 8 issue 6: 10 research reports, pp 870 - 949

pages = 5,12,6,5,6,6,8,5,8,9
pages charged = 1,8,2,1,2,2,4,1,4,5; total = 30, mean = 3 = $240

7 color figures ($1240)
6 color figures ($1090)
5 color figures ($940)
4 color figures ($790) iiii
3 color figures ($640) i
2 color figures ($490)
1 color figure  ($340) iiii
0 color figures i

mean color cost = $516; mean total cost/article = $756


EMBO J: $250/page over 6 pages, plus color charges: $650/figure for the first three figures, $432/figure for the next two, $2928 for six figures and $326 per additional figure thereafter.

March 18 2009 vol 28 number 6: 15 articles

pages = 10,8,10,10,13,8,10,13,13,10,8,9,10,12,12
pages charged = 4,2,4,4,7,2,4,7,7,4,2,3,4,6,6; total = 66, mean = 4.4 = $1100

9 color figures ($3906) i
7 color figures ($3254) ii
6 color figures ($2928) ii
5 color figures ($2814) ii
4 color figures ($2382) i
3 color figures ($1950)
2 color figures ($1300) ii
1 color figure  ($650) ii
0 color figures iii

mean color cost = $1874; mean total cost/article = $2974


Molecular Biology of the Cell: according to the Am Soc Cell Biol, in their 2006 publication "MBC and the Economics of Scientific Publishing" (available as a pdf from the linked page):

The average article published in MBC in 2006 was 11.7 pages long and included 2.9 color figures. With the 20% discount on page and color charges now offered to ASCB members, publishing such an article would cost the author $1,829.
(Regular ASCB membership is $130.) Interestingly, the same publication gives the following details of budgeted (projected?) journal revenue for 2008:


MBC.png



I don't know how similar that breakdown would be for other journals, but it's interesting that subscription revenue is roughly equal to page OR color charges -- meaning that the average author would pay about 50% more if the journal switched to full cost recovery from author side fees. This would put MBC's author side fees roughly on par with those charged by the top two PLoS tiers.


The American Physiological Society's Author Choice (hybrid OA) fee is $3000 for review articles and $2000 for research articles; according to their FAQ this is because:

For research articles, the Author Choice fee was determined by calculating the real average cost ($3,000) of publishing an article in an APS journal, and subtracting the actual average amount already paid by authors in author fees (page charges and color fees). The Author Choice fee for review articles is $3,000, because there are no other fees paid by authors of review articles. The Author Choice fee was designed to completely cover the cost of publishing an article.
which indicates that the average author-side fee for the 14 journals published by the APS is $1000.


Journal of Nutrition: in this editorial, AC Ross gave some figures regarding costs:

On average, each published page costs about $465, and pages with color, $1300! Each published manuscript costs, on average, $3233. Page charges (starting at $70) and color charges to authors ($400 per figure) are only a fraction of the actual costs of publication. Institutional subscriptions remain a key factor in the financial success of professional society journals like JN.
Page charges are currently $75/page for the first 7 pages and $120/page thereafter, and color charges are still $400/figure.


March 2009 volume 139 issue 3: 29 articles

pages = 5,4,7,4,8,5,6,7,5,6,6,4,6,7,5,4,6,6,7,5,6,7,5,5,5,3,4,7,5
mean page charge = $415

1 color figure  ($400) iii
0 color figures iiiii iiiii iiiii iiiii iiiii i

mean color charge = $41; mean total cost/article = $456
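
For anyone who wants to check the Journal of Nutrition numbers, here is a minimal sketch of the calculation. The page counts, tally and rates are the ones quoted above; the function names are hypothetical, and the sketch assumes the two-tier page rate applies per article ($75/page up to 7 pages, $120/page beyond that).

```python
pages = [5, 4, 7, 4, 8, 5, 6, 7, 5, 6, 6, 4, 6, 7, 5, 4, 6, 6, 7, 5,
         6, 7, 5, 5, 5, 3, 4, 7, 5]
color_tally = {1: 3, 0: 26}  # {color figures per article: number of articles}

def page_charge(n_pages, base_rate=75, base_pages=7, extra_rate=120):
    """$75/page for the first 7 pages, $120/page thereafter."""
    base = min(n_pages, base_pages) * base_rate
    extra = max(n_pages - base_pages, 0) * extra_rate
    return base + extra

def color_charge(n_figures, per_figure=400):
    """Flat $400 per color figure."""
    return n_figures * per_figure

n_articles = len(pages)
mean_page_charge = sum(page_charge(p) for p in pages) / n_articles
mean_color_charge = sum(color_charge(n) * c for n, c in color_tally.items()) / n_articles

print(n_articles, round(mean_page_charge), round(mean_color_charge))
```

Adding the two rounded means gives the $456 quoted above; summing before rounding would give a figure a dollar higher.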


Molecular Biology and Evolution: in the 2008 Editor's Report (pdf available here) the Society for Mol Biol and Evolution provided the following figures for MBE in 2008:

average article length: 10.1 pages
average number of color figures per article: 0.927

Current charges are $50/page plus $450 per color figure, giving an average cost/article of $922.


Phytopathology and Plant Disease: $50 per printed page for the first six pages and $80 per printed page for each additional page for members of The American Phytopathological Society and $130 per printed page for nonmembers. In addition, there is a $20 fee charged for each black-and-white figure or line drawing. Color charges are $500 for the first illustration, $500 for the second illustration, and $250 for the third and each subsequent color illustration in one article.

Molecular Plant-Microbe Interactions: $150 for the first 6 pages, $150/page or fraction of thereafter; Color charges are $500 for the first illustration, $500 for the second, and $250 for the third and each subsequent color illustration in one article. In addition, there is a $20 fee charged for each black and white figure or line drawing.

The Society's Reports of Publications from 2000 gives the following figures:

Phytopathology: average article = 7.3 pages, average color figs/article = ?
Plant Disease: average article = 5.4 pages, average color figs/article = ?
MPMI: average article = 9.4 pages, average color figs/article = 1.05; mean cost/article = $1275
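
The $1275 figure for Molecular Plant-Microbe Interactions is not broken down above. One reading that reproduces it is sketched below; it is an assumption, not a documented calculation. It treats the "fraction" rule as rounding extra pages up, and prices the 1.05 average color figures at the $500 first-figure rate. The function name is hypothetical.

```python
import math

def mpmi_page_charge(n_pages, flat=150, included_pages=6, extra_rate=150):
    """$150 covers the first 6 pages; each further page, or fraction
    of a page, costs another $150 (assumed: fractions round up)."""
    extra_pages = max(math.ceil(n_pages - included_pages), 0)
    return flat + extra_pages * extra_rate

avg_pages = 9.4          # from the Society's 2000 report, as quoted above
avg_color_figs = 1.05    # likewise

page_cost = mpmi_page_charge(avg_pages)
color_cost = avg_color_figs * 500   # assumption: all priced at the first-figure rate
mean_cost = page_cost + color_cost

print(page_cost, round(color_cost), round(mean_cost))
```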

(Regular membership in Am Phytopath Soc is $76.)


Journal of Natural Resources and Life Sciences Education: $350/article, $10 per table and $10 per figure plus $100/color page (print only; color is free online).

Vol 36, 2007: 17 articles, number of figs/tables = 1,3,6,7,12,4,5,4,8,8,5,5,9,1,2,2,4 only a couple had color figures; mean additional charge = $50, mean cost/article = $400



Sunday, 14 June
*bump*

On FriendFeed, items move back up the temporal sequence when they get "likes" and comments, giving them extra chances to be noticed. In addition, a "like" or comment from one of your friends will bring an item into view even if posted by someone whose stream you don't follow. The emerging mores of the system include leaving a one-word comment, bump, to indicate that one feels a particular item is worthy of wider attention -- "bumping" the item up the queue, as it were.

That's what I'm doing with this post. Richard Poynder is trying to put together a list of institutions and funding bodies which have established funds to pay for Gold Open Access:

I am trying to establish how many research institutions and funders have created Gold Open Access (Gold OA) authors funds, and would be grateful for input from others.

I am aware that the Wellcome Trust announced a scheme for paying OA publication fees for its grantees in 2006. But what other funders have introduced such schemes?

So far as research institutions are concerned, Peter Suber kindly provided me with the following list of those he knows have created Gold OA funds:

University of Amsterdam
University of Calgary
University of California, Berkeley
Delft University of Technology
ETH Zurich
Griffith University
University of Helsinki
Institute of Social Studies (Netherlands)
Lund University
University of North Carolina, Chapel Hill
University of Nottingham
University of Tennessee, Knoxville
Texas A&M University
Tilburg University
Wageningen University and Research Center
University of Wisconsin

However, I do not think this list is complete.

Richard also points out that it is probably useful to keep track of which Gold funds are complemented by a Green mandate, and makes the (imo excellent) suggestion of establishing a Gold Fund equivalent to ROARMAP, which tracks Green Mandates.

So -- *bump* -- please go read Richard's post, and help him out if you can.

Update: Peter Suber has created and pre-populated the Open Access Directory list of journal OA funds, so if you have information please add it there.



Friday, 05 June
That's the way you do it!

Via Peter Suber, I am delighted to find that Stuart Shieber has started a weblog, and even more delighted that in one of his first entries he has turned my long-ago author-side fees DOAJ hack into an actual, readily reproducible study:

Here are the results computed by my software, as of May 26, 2009:

Charges.......................951  (23.14%)
No charges....................2889 (70.29%)
Information missing...........270  (6.57%)
Hybrid........................1519 (26.99%)
Total.........................5629
The numbers are consistent with those of Hooker's study some 16 months earlier.
It's great to have the numbers confirmed, and even better to be able to make regular updates and construct time series. Thanks to Stuart for doing it right, and for making the code freely available.
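
A note on reading the quoted table: the percentages do not all share a denominator. The sketch below simply redoes the arithmetic from the counts quoted above; it is not Stuart's software, and the label "non-hybrid" for the first three rows is my assumption, based on the fact that they sum to the total minus the hybrid count.

```python
counts = {"Charges": 951, "No charges": 2889, "Information missing": 270}
hybrid = 1519

non_hybrid_total = sum(counts.values())   # first three rows only
grand_total = non_hybrid_total + hybrid   # matches the quoted Total

def pct(part, whole):
    """Percentage rounded to two decimal places, as in the quoted table."""
    return round(100 * part / whole, 2)

for label, n in counts.items():
    print(label, n, pct(n, non_hybrid_total))
print("Hybrid", hybrid, pct(hybrid, grand_total))
print("Total", grand_total)
```

So the first three percentages are shares of the 4110 non-hybrid journals, while the hybrid percentage is a share of all 5629.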

(Note, had to reformat the quoted table into ugly text, because I still can't get MT to play nice. Grrr.)



Friday, 05 June
What use are research patents?

DrugMonkey has a conversation going about the ongoing kerfuffle over (micro)blogging of conference presentations (see also the FriendFeed discussion). I want to go off on a tangent from something that came up in his comment thread, so rather than derail it I thought I'd post here.

In his first comment in the thread, David Crotty made the following claim:

Lots of researchers support their families and labs through money generated by patents, and most universities are heavily dependent upon their patent portfolios for funding.

That doesn't accord with my (limited!) experience -- I know a few researchers who hold multiple patents, and none of them ever made any money that way -- and my general impression is that the return on investment for tech transfer offices and the like is fairly dismal.

This seems like the sort of beans that beancounters everywhere should be counting, so I asked on FriendFeed whether anyone knew of any data to address the question of whether universities really make much money from patents. Christina Pikas pointed me to the Association of University Technology Managers, whose 2007 Licensing Activity Survey is now available.

I extracted data for 154 universities and 27 hospitals and research institutions. Between them, in 2007, these institutions filed 11116 patent applications, were awarded 3512 patents, and gave rise to 538 start-up companies. I calculated licensing income as a percentage of research expenditure:


patents1.png

Apart from New York University (I wonder what they own that's so profitable?), it's clear that none of these universities are "heavily dependent upon their patent portfolios for funding". In fact, more than half of them (78/154) made less than 1% of their research expenditure back in licensing income, and the great majority (144/154) made less than 10%.

Licensing income for Massachusetts General Hospital and "City of Hope National Medical Ctr. & Beckman Research" (whoever they are) amounted to 65-70% of research expenditure, but none of the other hospitals or research institutions made more than 20%. More than half of this group (15/27) made less than 2%, and most of them (23/27) made less than 10%.
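
The calculation behind these counts is simple enough to sketch. The numbers below are invented placeholders, NOT the AUTM survey data, and the names are hypothetical; the point is only to show licensing income expressed as a percentage of research expenditure and then counted against the 1% and 10% thresholds used above.

```python
# Hypothetical rows: (institution, licensing income, research expenditure), in $ millions
institutions = [
    ("Univ A", 0.4, 120.0),
    ("Univ B", 3.0, 250.0),
    ("Univ C", 45.0, 300.0),
    ("Univ D", 0.9, 95.0),
    ("Univ E", 12.0, 800.0),
]

def licensing_pct(income, expenditure):
    """Licensing income as a percentage of research expenditure."""
    return 100 * income / expenditure

percentages = {name: licensing_pct(inc, exp) for name, inc, exp in institutions}

under_1 = sum(1 for p in percentages.values() if p < 1)
under_10 = sum(1 for p in percentages.values() if p < 10)

print(f"{under_1}/{len(institutions)} under 1%, {under_10}/{len(institutions)} under 10%")
```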

The distribution looks just about as you would expect:


patents2.png

I also wondered whether there was any evidence that greater numbers of patents awarded, or more money spent per patent, resulted in higher licensing income. As you can see, the answer is no (insets show the same plots with the circled outliers removed):

patents3.png

patents4.png


I don't know how representative this dataset is; there are several thousand universities and colleges in the US, and surely even more hospitals and research institutions, so the sample size is relatively small. It does include some big names, though -- Harvard, Johns Hopkins, MIT, Stanford, U of California -- and I would expect a list of schools answering the AUTM survey to be weighted towards those schools with an emphasis on tech transfer.

In any case, I'm not buying David's assertion that "most universities", or most hospitals or research institutes for that matter, rely heavily on licensing income. And that being so, I am also somewhat skeptical about the number of researchers' families being supported by patents.

What's the Open Science connection? Well, if you're interested in patenting the results of your research, there are a lot of restrictions on how you can disseminate your results. You can't keep an Open Notebook, or upload unprotected work to a preprint server or publicly-searchable repository, or even in many cases talk about the IP-related parts of your work at conferences. It seems from the data above that most universities would not be losing much if they gave up chasing patents entirely; nor would they be risking much future income, since so few seem to get significant funds from licensing. My own feeling is that any real or potential losses would be much more than offset by the gains in opportunities for collaboration and full exploitation of research data that come with an Open approach.

Updates:

1. Christina left a comment pointing out that patents may be required for more than simply making money from licensing:

...an extremely important reason universities patent [is] to protect their work so that they may exploit it for future research... it turns out that universities have to patent in life sciences - even if they don't actively market and license these patents - to be able to attract new research money from industry.

There are two distinct points here: first, that if you don't patent you may not attract industry partners, and second, that if you don't patent you may end up licensing your own tech back from someone else (I note that most tech licenses I know of are cheap or free "for research purposes" so the latter factor might not weigh so heavily). According to the 2007 AUTM data, industry investment in academic research amounted to about 7% of research expenditure and was up 15% over 2006.

2. David responded on DM's thread with some counter evidence, on reading which I realise that the data above may (likely?) only show what the university received and not any money that went to the labs or researchers involved. Tech transfer may not be financially worthwhile for the university itself, yet it might still be doing good things for individual labs and PIs, and so would constitute a support service the university offers its research community. It also strikes me that my experience, such as it is, is mainly with Australian researchers, whereas David's is in the US, so cultural differences may also apply.

3. More from Christina at her own place, here.

_____________
If you want the data, the spreadsheet I used is here.



Wednesday, 03 June
What happened to serials prices in 1986-87? (Update: probably nothing.)

This could be nothing but an artifact (e.g. of the way the data were collected), but if you look at Fig 1 from this post, there's a clear break in the serials expenses (EXPSER) curve that's not evident in any of the others. Here's the same plot reworked to emphasize what I'm talking about:


indices4.png

If you squint just right you can imagine a similar but much weaker effect, beginning a year or two later, in the total expenditures (TOTEXP) curve; and the salaries (TOTSAL) curve seems to start a similar upward trend at about the same time but then levels off after 1991 or so. I wouldn't put any weight on either of those observations though -- I'd never have noticed either if I hadn't been comparing carefully with the EXPSER curve.

I've added linear regression lines for the 1976-1986 and 1987-2003 sections of the EXPSER data, just to emphasize the change in rate of increase. For those of you who will twitch until they know, just 'cos, the regression coefficients of the two lines are 0.99 and 0.98 respectively. If you extrapolate from just the 76-86 section, TOTEXP exceeds the forecast for EXPSER after about 2000.
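
For the curious, fitting separate lines to two sections of a series looks roughly like this. It is a sketch only: the series is made up (NOT the ARL data), the function name is hypothetical, and it reports Pearson's r as the goodness-of-fit value, which is an assumption about what the 0.99 and 0.98 above measure.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares line through (xs, ys).
    Returns (slope, intercept, Pearson r)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r

# Made-up index: slow growth to 1986, faster growth afterwards
years = list(range(1976, 2004))
index = [10 * (y - 1976) if y <= 1986 else 100 + 25 * (y - 1986) for y in years]

early = [(y, v) for y, v in zip(years, index) if y <= 1986]
late = [(y, v) for y, v in zip(years, index) if y >= 1987]

early_slope, early_intercept, early_r = linear_fit(*zip(*early))
late_slope, late_intercept, late_r = linear_fit(*zip(*late))

# Extrapolate the 1976-1986 trend to 2003 and compare with the later value
forecast_2003 = early_intercept + early_slope * 2003
print(early_slope, late_slope, forecast_2003, index[-1])
```

Comparing the extrapolated early line with the later values is what makes a change in the rate of increase visible.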

I have no idea if this means anything, but it is tempting to speculate. For instance: when did the big mergers begin in Big Publishing, and when did the big publishing companies start the odious practice of "bundling", that is, selling their subscriptions in packages so that libraries are forced to subscribe to journals they don't want just to get the ones they do?


Update: it's probably nothing; the curve simply shows an increasing rate of increase, and you can break it up into at least five reasonably convincing-looking segments with breaks at 86-87 and 94-95. It's possible there were two "pricing events" around those times, but I think this is most likely just an illustration of what can happen when you look a little too hard for patterns in your data!


indices6.png




Tuesday, 02 June
Every little bit counts.

There are so many good causes, and so many of them are not just good but urgent -- even assuming you have some money to spare, where are you to donate it? Everyone has their own solution to this problem. Mine is to try to hedge my bets: donate roughly equally to long- and short-term, local and global, human and environmental. I'm out of work and thoroughly skint right now, but I try to remember that by world standards I'm still living like a king; my budget includes some "don't go insane" funds for occasional movies or dinners out or whatever, and I can always skip one of those in order to give just a little to some good cause.

One such is the Open Knowledge Foundation, which is turning five and asking for support:

This month the Open Knowledge Foundation is five years old.

Over those last five years we've done much to promote open access to information -- from sonnets to stats, genes to geodata -- not only in the form of specific projects like Open Shakespeare and Public Domain Works but also in the creation of tools such as KnowledgeForge and the Comprehensive Knowledge Archive Network, standards such as the Open Knowledge Definition, and events such as OKCon, designed to benefit the wider open knowledge community. (More about what we've been up just over the last year can be found in our latest annual report).

While we have achieved a lot, we believe we can do much, much more. We are therefore reaching out to our community and asking you to help us take our vision further.

Our aim: at least a 100 supporters committed to making regular, ongoing donations of £5 (EUR 6, $7.50) or more a month.

These funds will be essential in expanding and sustaining our work by allowing us to invest in infrastructure and employ modest central support. To pledge yourself as one of those supporters all you need to do is take 30 seconds to sign up to our "100 supporters" pledge at:

http://www.pledgebank.org/support-okfn/

And if you want to act on the pledge right now (or make any other kind of donation), please visit: http://www.okfn.org/support/

We are and will remain a not-for-profit organization, built on the work of passionate volunteers but these additional fund are essential in maintaining and extending our effort. Become a supporter and help us take our work forward!

I'm in no position to make a regular commitment, but I skipped a movie and sent 'em ten quid. It's not much but it's my hope that small donations can be a powerful force in the internet age. The other thing I can donate is publicity, which is what this post is for.

Why donate to OKF? My belief is that openness is not only our best weapon in the unending battle against bad actors and free riders, it is the key to a radically more efficient scientific process, which in turn is the key to all material progress in human quality of life.

The OKF not only builds tools and standards for open exchange of information, but they are also part of the front line effort to make openness and transparency into a constant, widely adopted habit of mind and of behaviour. To choose a topical example, we won't have appropriate access to information about the spending habits of our elected officials until we are so in the habit of openness that it is a surprise and an affront to the average citizen to realise that such information is being kept secret. To choose my own bête noire as another example, we won't be free of "data not shown" in the scientific literature until the majority of scientists respond to that phrase with an immediate and indignant "why the hell not?".

So, support for the OKF is one of my long-term choices: an investment in a better future for everybody. If you have a couple of dollars to spare, please consider investing with me.



Monday, 01 June
Pick an index, any index.

Over at The Scholarly Kitchen, Philip Davis takes the ARL to task for comparing their serials expenditures with the Consumer Price Index:

By adopting the CPI as a general frame of reference, almost any industry that requires huge professional worker input will look like it is spiraling out of control. Perhaps this is the reason the ARL uses the Consumer Price Index as a reference for journal prices when it could have used the Higher Education Price Index, the Producer Price Index, or an index which more closely resembles professional knowledge production.

The CPI is an excellent tool for collective salary bargaining, for estimating who should be eligible for food stamps or free school lunches. It is a very bad tool for measuring the purchasing power of libraries or justifying a reinvention of the journal publication system.

Since I've just played around with updating the famous graph to which Davis takes exception, I thought I'd better take a closer look at the alternative indices he suggests.

From the Commonfund 2008 HEPI Report (pdf; linked from here) I extracted historical HEPI and CPI data from 1976 to 2003, and from the ARL stats interface at U Virginia I extracted the median values for serials expenditures (EXPSER), total salaries expenditures (TOTSAL) and total expenditures (TOTEXP) for the same period (it was limitations in the ARL data range that dictated the time period). I also extracted Producer Price Index data for "all commodities" (PPI ALL) over the same period from the Bureau of Labor Statistics. There are lots of choices for PPI data, but most of them don't go back as far as 1976. (I did try a couple of industries that I thought required "huge professional worker input", such as hospitals and book publishers, but the data weren't available for all the years I wanted -- and by eyeball it didn't look as though they showed much greater increase than the all commodities index.)

Plotting percent cumulative change against time we see:


indices1.png

There isn't a lot of difference between the HEPI and the CPI, and the all commodities PPI index shows even less increase. Davis suggests that salaries, professional worker input, are at least part of the reason why the CPI is a poor choice for comparison with serials costs, but (to the extent that the HEPI is a better "professional worker weighted" measure) the data do not bear him out. Neither does his claim regarding librarian salaries fit the data I have to hand:
If we plotted academic librarian salaries against the CPI, we could claim that the profession was in crisis, that salary growth was unsustainable, and that the system was simply broken.

It's clear from the data, though, that library salary expenditures have outstripped the HEPI and CPI, but not by as much as total expenses and not by nearly as much as serials costs.
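
"Percent cumulative change" here just means each year's value expressed as a percentage change from the first year in the series. A minimal sketch, assuming the base year is the first data point (1976 in the plot above); the index values are invented for illustration and the function name is hypothetical:

```python
def percent_cumulative_change(series):
    """Change of each value relative to the first value, in percent."""
    base = series[0]
    return [100 * (value - base) / base for value in series]

# Hypothetical index values, not taken from the HEPI, CPI or ARL data
index_values = [100.0, 106.0, 115.0, 130.0]

print(percent_cumulative_change(index_values))
```

Putting every series on this footing is what lets indices with different units and starting levels share one axis.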

Remember, too, that this is still only part of the story: "serials" includes a great many publications whose costs have not increased at the same rate as the scholarly literature. The Abridged Index Medicus data I got from EBSCO only cover 1990 onwards, so I reworked the comparison to include the AIM data:


indices3.png

I used the AIM data because comparison with a much larger data set, broken down by individual discipline, showed that the AIM data gave what looks like a reasonable "middle value" -- and as you can see, scholarly journal price increases outstrip all others, including total serials, by a considerable margin.

Note also that there's little difference between "total salaries" and "professional salaries" -- the professional salary data series (SALPRF) only goes back to 1986, which is why I've included it in this second graph.

None of this is to say that the CPI is the ideal comparison index against which to measure increases in the cost of the scholarly literature. It seems from the comparisons above, though, that there's not much difference for this particular purpose between the CPI and the HEPI. While I don't have data for publishing industry salaries, library salaries hew fairly closely to the HEPI and to total library expenditures. It therefore doesn't seem that salaries have much to do with the much-bruited discrepancy between "general cost of living/doing business/whatever" increases and the rise and rise of the cost of scholarly literature.

If you want the data I used, the spreadsheet is here.



Tuesday, 26 May
Motes, beams &c.

A while back, Philip Davis over at The Scholarly Kitchen posted about a small but useful research project of his:

All I did was ask five librarians at institutions administrating Open Access publication charges two simple questions:

"Can you provide a list of Open Access articles that you have supported through your author support program," and "Have you rejected any requests to date?"

This is (to me) clearly information that such programs should be collating and reporting, and after two weeks Davis' results were not exactly stellar:

Two weeks after asking my simple questions, I received just two short responses. No list, no numbers, but at least a few details: There was some confusion on the part of faculty of what an OA article publication charge really was. Some faculty requests were actually for page charges in conventional subscription journals; one faculty submitted a request for reprint charges; others submitted invoices to the library when they should have been directed to the external granting agency (like the HHMI). To date, no bonafide requests have been denied.

That's useful information, as far as it goes, but it doesn't go very far. Davis plays the conspiracy theory card way too hard for my taste, with "dark secrets" in the post title and an opening paragraph that reeks of melodrama:

You would have thought I was requesting a field manual for interrogating prisoners of war or a list of members on Dick Cheney's Energy Taskforce. At least in those instances, I would have received a response that answering my questions violated national security or "executive privilege."

Whoa, cowboy, back up a minute. As commenter Amanda R pointed out, we don't know much about how Davis went about gathering the information:

As a point of clarification, were you directly refused data, or did libraries simply not respond? Did you contact them back and ask why there was no response, or if there was a reason they weren't providing the full data you wanted?

Obviously, you deserve a professional response from the libraries you contacted. But, as much as it pains me to say it, I could easily imagine a library in which a request for statistics was bumped around internally for a few weeks before actually being answered.

In a FriendFeed discussion, librarian Christina Pikas made a related point:

the worst part of this is figuring out who you would send a request like that to. It takes me 10 e-mails and 3 phone calls to find the right person at my mothership main library. Almost seems that he's taking confusion for malicious intent

as did commenter JQ Johnson:

when I in March queried the same institutions that Davis did, I got lots of cooperation. For example, UNC pointed me to a public letter (2/20/2009) to their vice chancellor that summarized in some detail the 12 requests they had funded to date. I'm puzzled why Davis got the response he did. Did he ask the wrong people?

Davis replied to both Amanda R and JQJ, but he gave non-answers containing no information about his methodology and insisted that what he had shown was a lack of transparency:

Whether the lack of response was caused by human error, technological barriers or internal policy, the result is a lack of transparency in how these author-support programs are performing.
[...]
These are all good questions but they skirt around the main issue of why I received only 2 responses, and why even these two responses were unable to provide me with any meaningful (even summarized or anonymized) data.

I found this very frustrating and left a comment1 aimed at clarifying why that was so:

JQJ's comments and questions do not seem to me to skirt the issue at all, but rather to speak directly to alternative explanations for the lack of response. Methodological concerns are not trivial here.
  • Whom did you contact?
  • Did you say explicitly that you were sensitive to confidentiality issues and happy with various forms of anonymized data?
  • Did you phone anyone, or simply email?
  • How do you know your emails didn't just end up in the spam bin?
  • Did you follow up (an unanswered question from Amanda, above)?
And so on. You have asked good questions, and have shown that routine reporting could be improved for such programs (already a useful outcome). But you need a good deal more evidence -- including a more transparent methodology -- before you go claiming there are "dark secrets" at work.

Now, it's been almost two weeks since I left that comment, and it hasn't appeared or been answered. What dark secrets is Philip Davis hiding? What dim, Crotty-esque ambitions of being the famous naysayer, the Nicholas Carr of Open Access, are forming even now in the troubled subconscious of this ---

Or, you know, I just got stuck in the spam queue. It happens. :-)

Davis finishes up by saying something relatively unexceptionable if taken out of the context of his insistence on ignoring both Occam's and Hanlon's razors:

Library Open Access policies cannot exist with secret budgets, ambiguous guidelines, and a practice of stonewalling requests for information.

Those who campaign for Open Access need to be held accountable just like everyone else, and budget transparency is the first step.

Exactly so -- everyone else, including bloggers who wish to hold librarian feet to the accountability fire.


1 I added the list formatting for this post, hoping for improved readability.




Monday, 11 May
The Semantic Web: a long and somewhat convoluted definition.

This1 is an attempt to define and explain the semantic web for a lay audience, though it should be remembered that I am a member of that audience myself...

It is a commonplace that we are drowning in information, and nowhere is this "information overload" more apparent than in scientific research. The National Library of Medicine's literature database, PubMed, is searched more than 60 million times a month and contains almost 19 million records from more than 5300 journals -- still only a fraction of the approximately 15,000 active, refereed, scientific journals listed in Ulrich's Periodicals Directory2. GenBank, the world's foremost repository of nucleic acid sequence information, contains roughly 100 billion bases in 100 million sequence records, and is growing at an exponentially increasing rate that is currently in excess of 50,000 records per day. Unlike PubMed and GenBank, which are cross-disciplinary databases, the Nucleic Acids Research Molecular Biology Database Collection is a carefully curated list of high-value specialist resources; it currently lists 1170 distinct, largely non-overlapping databases. I could go on, but you get the point3.

As things stand, researchers talk to researchers and use computers to facilitate that conversation; what we need is for computers to be able to talk to computers. To cope with (literally) inhuman volumes of data, we need that data to start making sense to machines, so that they can do something no human brain can do: process all of it. We need to make it possible for machines to transfer richly interconnected data among themselves, mix and remix it, generate new connections, filter it, process it, transform it, and output the results to formats and interfaces that make sense to human brains -- substrates on which we can carry out the sorts of synthetic, creative thinking that computers cannot do.

We need a man-machine partnership in which both partners can do what they do best, and that means we need the semantic web.

The semantic web is the outcome of processes and frameworks with which computers can manipulate data in a way that makes it accessible by human brains. It is built on the standards and metadata -- information about data -- that are required for automated data exchange and processing, which in turn is required to create machine-generated, human-scale summaries, skeletons, outlines and other representations of, and interfaces with, the entire knowledge corpus.

Here's an example. Human brains have no trouble processing the following data:

Another reason for opening access to research. Wilbanks J. BMJ. 333:1306-8 (2006).

To you, that's a reference; but to a computer, it's just a string of text. What a computer needs is information (metadata) about each substring:

Title: Another reason for opening access to research.
Author: Wilbanks, J
Journal: British Medical Journal
Volume: 333
Pages: 1306-8
Date: 2006

Now the computer "knows" which letters identify John, which constitute the title of the article, and so on. If you set the standards up properly, it even "knows" that Wilbanks is the surname and J the first initial, and so on into ever finer grained properties.
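Here is a minimal sketch of that step in Python, turning the raw string into labelled fields. The `Reference` class and `parse_reference` function are names I made up for illustration, the pattern handles only the one citation layout shown above, and I am treating 333 as the volume number. Real reference formats vary far too much for a single regular expression.

```python
import re
from dataclasses import dataclass


@dataclass
class Reference:
    title: str
    surname: str
    initials: str
    journal: str
    volume: str
    pages: str
    year: int


# Matches only the layout "Title. Surname Initials. Journal. Volume:Pages (Year)."
PATTERN = re.compile(
    r"^(?P<title>.+?)\.\s+"
    r"(?P<surname>[A-Z][\w'-]+)\s+(?P<initials>[A-Z]+)\.\s+"
    r"(?P<journal>[^.]+)\.\s+"
    r"(?P<volume>\d+):(?P<pages>[\d-]+)\s+"
    r"\((?P<year>\d{4})\)\.?$"
)


def parse_reference(text: str) -> Reference:
    """Split one citation string into named fields, or raise ValueError."""
    match = PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised citation layout: {text!r}")
    fields = match.groupdict()
    fields["year"] = int(fields["year"])  # store the year as a number, not text
    return Reference(**fields)


ref = parse_reference(
    "Another reason for opening access to research. Wilbanks J. BMJ. 333:1306-8 (2006)."
)
print(ref.surname, ref.initials, ref.year)
```

Once the substrings carry labels, the surname and the initial are separate values that a program can compare, sort and count.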

Now imagine you had, oh, say, about 19 million such records. A human brain cannot do anything useful with such a database, but a computer can -- which is exactly why we can ask PubMed human-scale questions like "how many papers did J Wilbanks publish between 2000 and 2009?", or "show me all the papers with "access to research" in the title".
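To make the two questions concrete, here is a rough Python sketch of them run against a toy list of records. This is not how PubMed works internally. The `Record` class and both function names are hypothetical, and every record except the Wilbanks one is an invented placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    title: str
    surname: str
    initials: str
    year: int


RECORDS = [
    Record("Another reason for opening access to research.", "Wilbanks", "J", 2006),
    Record("A placeholder title about gene sequences.", "Example", "A", 2004),      # invented
    Record("A placeholder title on access to research data.", "Example", "B", 1999),  # invented
]


def papers_by(records, surname, initials, first_year, last_year):
    """Records by one author published within an inclusive range of years."""
    return [
        r for r in records
        if r.surname == surname
        and r.initials == initials
        and first_year <= r.year <= last_year
    ]


def title_contains(records, phrase):
    """Records whose title contains the phrase, ignoring case."""
    phrase = phrase.lower()
    return [r for r in records if phrase in r.title.lower()]


print(len(papers_by(RECORDS, "Wilbanks", "J", 2000, 2009)))
print([r.title for r in title_contains(RECORDS, "access to research")])
```

The same two functions would work unchanged on millions of records; only the storage and indexing would need to be smarter.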

Now multiply that -- the ability to ask human-scale questions of a mass of information far too large for any human brain to absorb or process -- by thousands of different types of information (text, gene sequences, chemical formulae, microarray results, etc etc), millions of individual records within each data type, recorded in thousands of journals and databases, produced by hundreds of thousands of laboratories, libraries and garage hackers. Imagine what we could learn if we could query all of that information on a human scale.

There: that's a glimpse of the potential power of the semantic web.

-------------
1 This entry started life as an early draft of a letter in support of John Wilbanks' application for a TED fellowship. We didn't get enough signatures in time, so it never was even sent. My apologies to those people who did sign on; if John re-applies I'll try again, with better planning!

2 tickboxes = active, refereed, scholarly/academic; search = LC Classification Number for [Q* OR R* OR S* OR T* OR U* OR V*]

3 In fact, I'm always on the lookout for more good examples of the "data deluge" and the rapid progress of science and tech; post 'em (in comments) if you got 'em.



Saturday, 09 May
More on the "Australasian Journal of..." series.

On the basis of the evidence below, I believe the entire "Australasian journal of..." series from Excerpta Medica to be either nonexistent or fake, in the same sense of "fake" that Elsevier has already admitted applies to the following six titles from that series:

  • Australasian Journal of General Practice
  • Australasian Journal of Neurology
  • Australasian Journal of Cardiology
  • Australasian Journal of Clinical Pharmacy
  • Australasian Journal of Cardiovascular Medicine
  • Australasian Journal of Bone & Joint Medicine

WorldCat lists fourteen titles in the apparent series, thirteen of them in addition to the six above:

  • Australasian journal of asthma
  • Australasian journal of bone & joint medicine
  • Australasian journal of dentistry
  • Australasian journal of depression
  • Australasian journal of gastroenterology
  • Australasian journal of hospital pharmacy
  • Australasian journal of infectious diseases
  • Australasian journal of musculoskeletal medicine
  • Australasian journal of obstetrics & gynaecology
  • Australasian journal of paediatrics
  • Australasian journal of pain management
  • Australasian journal of psychiatry
  • Australasian journal of respiratory medicine
  • Australasian journal of sexual health

I believe these all to be either nonexistent or fake because:


1a. Although WorldCat lists ISSNs for all titles, all but two include a note saying "ISSN prepublication record". The two entries which do not carry that note are also the only two titles listed as being held in any library:


1b. Only the "Australasian journal of musculoskeletal medicine" and the admitted fake "Australasian Journal of Bone & Joint Medicine" are listed as being held by any library in WorldCat.  Both are listed at the State Library of New South Wales:

Australasian journal of bone & joint medicine.
Chatswood, N.S.W. : Excerpta Medica Communications, 2002- 
v. : ill. ; 30 cm.
State Ref Library
NQ617.7005/1
Vol. 1, issue 2 (2002)-v. 4, issue 1 (2005)

Australasian journal of musculoskeletal medicine.
Chatswood, N.S.W. : Excerpta Media Communications, 2002. 
v. : ill. ; 30 cm.
State Ref Library
NQ617.7005/1
Vol. 1, issue 1 (2002).
I've written to the library to ask for a copy or photograph of either journal.


2. None of the series titles have websites that I can find.


3. None of them are listed in PubMed, Ulrich's Periodicals Directory, Elsevier's own Science Direct or Scopus (I'd be obliged if someone with access could check Web of Science). Update: Peter Murray checked, and couldn't find any of the titles in the WoS "publication name" field. Thanks Peter!


4a. A phrase search in Google Scholar returns hits only for the Australasian journal of psychiatry; all of these are citations, three of which are apparent self-citations to the same article:

Mellsop GW, Menkes DB, El-Badri S. Releasing Psychiatry from the Constraints of Categorical Diagnosis. Australasian Journal of Psychiatry. 2007;15:3-5. doi: 10.1080/10398560601083134
That DOI resolves to an article of the same name and with the same page numbers in Australasian Psychiatry, which is published by Informa Healthcare for The Royal Australian and New Zealand College of Psychiatrists.  I've written to the communicating author, Dr Mellsop, to ask for a reprint.

Of the remaining three hits, two are citations to other articles in the Australasian Journal of Psychiatry and one I cannot decipher without paying a fee to see the references of an obscure paper.  Of the two I can decipher, one resolves to a paper in Australasian Psychiatry from 2003; the same article is available from Informaworld.  The other is to an "in press" citation from 2007 (which also appears in 4b below).


4b. The same search on Google returns a number of hits, including the following:

  • from this page:
    M.I. Loh., & Restubog, S.L. (2007). Lecturers' and Students' Perceptions of Current Teaching Methods about Schizophrenia. Australasian Journal of Psychiatry, 15, 347-349.

    This does not seem to be related to the Informaworld journal Australasian Psychiatry since vol 15 p 347 is this, and I could only find these two papers by Jennifer Loh on the Informaworld site.
  • from this page:
    Langdon, R. (2003). Theory of mind and psychopathology: autism versus schizophrenia [Abstract]. Australasian Journal of Psychiatry.

  • from this page:
    Griffiths, K., Farrer, L., & Christensen, H. (2007). Clickety-Click: the e- trains on track. Australasian Journal of Psychiatry, 15(2), 100-108.
    also from here and here:
    Griffiths, K.; Farrer, L.; and Christensen, H. Clickety-click: The e-trains on track. Australasian Journal of Psychiatry, In press, accepted 10/06.

    This appears to be the same paper in the Informaworld journal, Australasian Psychiatry.
  • from here, here and here:
    Tarantola D (2007) The interface of mental health and human rights in Indigenous populations: triple jeopardy and triple opportunity Australasian Journal of Psychiatry, 15(Suppl):S10-S17

    Again, here's the same paper in Australasian Psychiatry.
  • from here and here:
    Cornes, A., & Napier, J. (in press). Challenges of mental health interpreting: Therapy has taught us that it's all our fault Australasian Journal of Psychiatry.

    And the same paper seems to appear in Australasian Psychiatry.
I've written to Drs Loh, Langdon, Griffiths, Tarantola and Napier to ask for copies.



Saturday, 09 May
Excerpta Medica in action

The Elsevier fake journal scandal is expanding in two directions. First, it's now "fake journals", plural. Elsevier has admitted to publishing six of these things:

  • Australasian Journal of General Practice
  • Australasian Journal of Neurology
  • Australasian Journal of Cardiology
  • Australasian Journal of Clinical Pharmacy
  • Australasian Journal of Cardiovascular Medicine
  • Australasian Journal of Bone & Joint Medicine

Only one, Bone & Joint Medicine, is on the list I posted yesterday of Excerpta Medica "Australasian journal of..." titles from WorldCat. That leaves thirteen titles in the same series, none of which are listed in PubMed, Science Direct, Ulrich's or (thanks to Peter Murray, see comments on that post) Scopus. Jonathan Rochkind has pointed out how to find the rest of their titles in WorldCat; there are around 50 all told.

That's the tip; I await the rest of the iceberg.

The second direction in which the scandal is expanding is towards ghostwriting: I think probably Laika was the first person to make this connection clear. This is a separate but related issue, and Excerpta Medica appears to be up to their armpits in this sleazy practice as well. There's quite a large literature on ghostwriting, so here are just a few quotes (mentioning Excerpta Medica) to whet your appetite (if indeed one could be said to have an 'appetite' for something so nauseating):

Anna Wilde Mathews, At medical journals, paid writers play big role

When articles are ghostwritten by someone paid by a company, the big question is whether the article gets slanted. That's what one former free-lance medical writer alleges she was told to do by a company hired by Johnson & Johnson.

Susanna Dodgson, who holds a doctorate in physiology, says she was hired in 2002 by Excerpta Medica, the Elsevier medical-communications firm, to write an article about J&J's anemia drug Eprex. A J&J unit had sponsored a study measuring whether Eprex patients could do well taking the drug only once a week. The company was facing competition from a rival drug sold by Amgen Inc. that could be given once a week or less.

Dr. Dodgson says she was given an instruction sheet directing her to emphasize the "main message of the study" -- that 79.3 percent of people with anemia had done well on a once-a-week Eprex dose. In fact, only 63.2 percent of patients responded well as defined by the original study protocol, according to a report she was provided. That report said the study's goal "could not be reached." Both the instruction sheet and the report were viewed by The Wall Street Journal. The higher figure Dr. Dodgson was asked to highlight used a broader definition of success and excluded patients who dropped out of the trial or didn't adhere to all its rules. The instructions noted that some patients on large doses didn't seem to do well with the once-weekly administration but warned that this point "has not been discussed with marketing and is not definitive!"

The Eprex study appeared last year in the journal Clinical Nephrology, highlighting the 79.3 percent figure without mentioning the lower one. The article didn't acknowledge Dr. Dodgson or Excerpta Medica. Dr. Dodgson, who now teaches medical writing at the University of the Sciences in Philadelphia, says she didn't like the Eprex assignment "but I had to earn a living."

The listed lead author, Paul Barre of McGill University in Montreal, says Excerpta Medica did "a lot of the scutwork" but he had "complete freedom" to change its drafts. Dr. Barre says he helped design the study and enroll patients in it. In statements, J&J and Excerpta Medica offered similar explanations of the process. J&J says it regularly uses outside firms "to expedite the development of independent, peer-reviewed publications."

Carl Elliott, Pharma goes to the laundry: public relations and the business of medical education

One of the most ingenious pieces of the Fen-Phen public relations strategy was its ghostwriting scheme. In 1996 Wyeth hired Excerpta Medica Inc, a New Jersey-based medical communications firm, to write ten articles for medical journals promoting obesity treatment. Wyeth paid Excerpta Medica $20,000 per article. In turn, Excerpta Medica paid prominent university researchers $1,000 to $1,500 to edit drafts of their articles and put their names on the published product. Wyeth kept each article under tight control, scrubbing drafts of any material that could damage sales. One draft article included sentences that read: "Individual case reports also suggest a link between dexfenfluramine and primary pulmonary hypertension." Wyeth had Excerpta delete it. (21)

What made Excerpta Medica such an inspired choice is that it is a branch of the academic publisher, Reed Elsevier Plc., which publishes many of the world's most prestigious science journals. Excerpta Medica manages two journals itself: Clinical Therapeutics and Current Therapeutic Research. According to court documents, Excerpta Medica planned to submit most of the articles it produced to Elsevier journals. In the actual event, Excerpta managed to publish only two articles before Fen-Phen was withdrawn from the market in 1997. One appeared in Clinical Therapeutics, the other in the American Journal of Medicine (another Elsevier journal). In neither case did the authors of the articles disclose that they were paid by Excerpta Medica. So clean was the laundering operation, in fact, that many of the authors did not even realize that Wyeth was involved. Richard Atkinson of the University of Wisconsin wrote a letter to Excerpta Medica congratulating them on the thoroughness and clarity of their article. "Perhaps I can get you to write all my papers for me!" he wrote. He did have one reservation about the piece he was signing: "My only general comment is that this piece may make dexfenfluramine sound better than it really is." (22)

Sergio Sismondo, Ghost Management: How Much of the Medical Literature Is Shaped Behind the Scenes by the Pharmaceutical Industry?

Several of the publication planning firms identified are owned by major publishing houses. For example, Excerpta Medica is "an Elsevier business" and writes that its "relationship with Elsevier allows... access to editors and editorial boards who provide professional advice and deep opinion leader networks" [40]. Wolters Kluwer Health draws attention to its publisher Lippincott Williams & Wilkins, with "nearly 275 periodicals and 1,500 books in more than 100 disciplines," and to Ovid and its other medical information providers, emphasizing the links it can make between its different arms [41]. Vertical integration is attractive in the industry as a whole: at least three of the world's largest advertising agencies own not only MECCs, but also CROs [contract research organizations] [13].




Wednesday, 06 May
No bottom to worse at Elsevier?

Like Dorothea, I haven't said anything about the slimy Merck/Elsevier fake publication deal, because I thought the blogosphere had plenty of coverage. Anyone who reads me would know all about the scandal.

The latest development, though, strikes me as something that should be shouted from every available rooftop: Elsevier simply must answer the questions raised.

Via Dorothea: Jonathan Rochkind has done a little "forensic librarianship" and raised astonishing questions about the entire imprint, Excerpta Medica, which published the fake journal that started all of this.

Go read Jonathan, but the bottom line is this: Excerpta Medica does not provide a straightforward list of its own publications or make clear which are, ahem, "industry-sponsored".

Jonathan says "WorldCat lists 50 publications by Excerpta Medica Communications"; I just tried a simple author search for that phrase and got only 21 results, including the recently-exposed-as-fake Australasian journal of bone & joint medicine; how many others are fake? How about the other thirteen "Australasian Journal of" titles in the same list:

  • Australasian journal of asthma
  • Australasian journal of bone & joint medicine
  • Australasian journal of dentistry
  • Australasian journal of depression
  • Australasian journal of gastroenterology
  • Australasian journal of hospital pharmacy
  • Australasian journal of infectious diseases
  • Australasian journal of musculoskeletal medicine
  • Australasian journal of obstetrics & gynaecology
  • Australasian journal of paediatrics
  • Australasian journal of pain management
  • Australasian journal of psychiatry
  • Australasian journal of respiratory medicine
  • Australasian journal of sexual health

Why, for one thing, are none of them indexed by Science Direct? The PubMed journal limit field contains only Australasian journals of dermatology, pharmacy and optometry; the latter two seem to be defunct and the first is published by Wiley.

Further obvious questions arising:

  • What exactly were the 11 "publications" mentioned in this case study, and where were they published?
    Excerpta Medica published more than 11 scientific publications, all offering medical education credits, and targeting medical specialties from the clinical pharmacist to the physician specialist and emergency nurse. Over 700,000 of these publications have been sent to medical professionals to build awareness...
  • Someone should take a close look at the publications (and faculty) mentioned in this case study:
    Excerpta Medica summarized the issues and recommendations from these ["faculty-led regional advisory board"] meetings and communicated them in a funneled approach, beginning with broad reach and comprehensive content, to more regionally focused publications.

    Excerpta Medica first created a full issue and subsequent supplement of Clinical Cornerstone™, the company's proprietary, peer-reviewed, indexed, continuing medical education (CME) journal distributed to 75,000 physicians. As a result, the data gained significant credibility within the larger physician community.

    The final published product from these regional meetings was a series of regional newsletters. The newsletters referenced the indexed Clinical Cornerstone publications and also highlighted the leading regional attendees on the cover to establish credibility and regional buy-in with the recipients. Approximately 2000 copies of each newsletter were sent to physicians in each region.

  • What exactly is the "company-sponsored journal" created in this case study? We're told that
    The quarterly publication was created to build awareness of the disease [targeted by the client's product] and prepare the specialist and primary markets for future indications. It was also designed to establish this client as one of the industry's authorities on cardiovascular disease.
    and that
    The clinical content was complemented with high-quality photographic images, giving each issue a very professional and attractive appearance.
    [...]
    The publication was launched in December 2004 and continues to run today. Circulation has increased from 10,000 at launch to 17,000 currently and includes such specialties as cardiology, diabetology, nephrology, internal medicine, and general practice.
    but not the name of the journal. Wanna bet it starts with "Australasian journal of..."?




Monday, 04 May
Alternative Connotea bookmarklets for OATP

Peter Suber launched the Open Access Tracking Project on April 16, and you can read a full description of it in this month's SPARC OA Newsletter.

I encourage anyone interested in contributing to the OATP to read the full description so as to make your contributions maximally useful. Here are the basics:

  • the project runs on Connotea, using shared tags
  • the only official tag right now is oa.new
  • use the oa.new tag for developments from the past six months or so
  • user-defined tags are encouraged and should use the same format: oa.foo, where foo can be any relevant subtopic

If you are pressed for time, and we all are, then it may help to have a Connotea bookmarklet with the oa.new tag (or oa.unclassified, if the item is older than six months) already filled in. That way you can just hit the bookmarklet, hit "add to my library" and be done. It's better if you have time to put in further classifying tags and a description, but at least this way the page will be recorded.
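The choice between the two pre-filled tags can be written down as a tiny rule. This Python sketch is only an illustration: the function name is made up, and I am assuming "six months or so" means about 183 days, which is my reading rather than anything in the project description.

```python
from datetime import date, timedelta

# Assumption: "the past six months or so" is taken as roughly 183 days.
SIX_MONTHS = timedelta(days=183)


def default_oatp_tag(item_date: date, today: date) -> str:
    """Pick the fallback tag for an item when there is no time to classify it further."""
    return "oa.new" if today - item_date <= SIX_MONTHS else "oa.unclassified"


print(default_oatp_tag(date(2009, 4, 16), date(2009, 5, 4)))
print(default_oatp_tag(date(2008, 1, 1), date(2009, 5, 4)))
```

Either way, more specific oa.foo tags can still be added by hand afterwards.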

I guess the easiest way to do this would be to have three bookmarklets, the regular one and the "two click" bookmarklets I describe here. If you're using Firefox, here are the two-click versions; you can install them the same way as the regular one (drag to the toolbar) and, if you like, rename them using the "Organize Bookmarks" dialog box:

Connotea/oa.new

Connotea/oa.unclassified

This would obviously be better as a one-click than a two-click bookmarklet, but I failed dismally in my attempt to make it so because I don't actually know anything about JavaScript. I've previously suggested to the lazyweb that someone make a bookmarklet for another project, and nothing came of it; I'm hoping both that this little hack will be useful, and that it will inspire an actual programmer to improve it.




Saturday, 02 May
Congratulations to Harvard.

Harvard has been fortunate enough to secure the services of Peter Suber, who has been appointed a Berkman Fellow.

I cannot say it better so I will simply quote Stevan Harnad's comments accompanying the announcement:

A brilliant choice, and eminently well-deserved. Peter -- whose historic contributions to the growth of OA have been spectacularly successful -- will continue his invaluable OA work, but this Fellowship will also make it possible for him to begin writing the books on OA and related matters that are welling up in him, and that the world scholarly and scientific research community (as well as the historians of knowledge) are eagerly waiting to read, digest and learn from for years to come.

It is so gratifying to see true merit being rewarded occasionally, as it ought to be (although my guess is that this is just the beginning of the honors to be accorded to this selfless and sapient transformer of Gutenberg scholarship into PostGutenberg scholarship).





Friday, 01 May
Open Access, copyright transfer and NC licensing: caveat emptor!

When I was rummaging around in J Vis a while back, I noticed something that I've been meaning to blog about: why is an Open Access journal still requiring complete surrender of author copyright1?

I happen to know one answer to that question, though I don't know whether this is the case at J Vis. The deal is this: Big Publishing sells paper reprints, and not just of their own articles -- they pay fees where necessary in order to provide a one-stop shop (e.g. through Excerpta Medica or Ovid), mainly to the pharmaceutical industry. In order to blanket existing and potential customers with research favorable to their causes, pharm companies spend a great deal of money on these reprints -- some of which trickles down to small publishers, some of whom depend on that revenue. Such publishers therefore cannot afford to give up such rights as force the reprint traders to pay for their wares.

J Vis has a copyright notice which says, in part:

Users may view, reproduce or store copies of articles comprising the journal provided that the articles are used only for their personal, non-commercial use. [...] Any uses and or copies of Journal of Vision articles, either in whole or in part, must include the customary bibliographic citation, including author attribution, date, article title, journal name, DOI and/or URL, and copyright notice.

A closely related strategy is to use open(ish) licensing that contains a noncommercial (NC) clause. For instance, Springer Open Choice leaves copyright with authors, but uses their own license that is compatible with CC-BY-NC. That, like J Vis' copyright notice, puts their publications out of reach of the reprint traders, except for the little clause that says:

No term or provision of this License shall be deemed waived and no breach consented to unless such waiver or consent shall be in writing and signed by the party to be charged with such waiver or consent.

which allows the small publishers to waive the NC part for certain uses, in return for what amounts to royalties2.

Why do I care about this? Because it's another instance of the old "Free is not Open" argument, and the problems discussed here and here. Since digital repositories -- as far as I know, all existing digital repositories -- carry no blanket license, but leave intact the licensing of each individual digital object they contain, the effect is that there are no OA repositories that remove both price and permission barriers (that is, provide "strong" or "libre" OA to their contents).

The end result is the same problem that copyleft causes3: Reuse, Rework and Redistribute may not be powerfully affected, but Remix is killed outright.

Consider, for instance, PubMed Central, all the papers in which are free to read. What else can you do with them? Textmining, datamining? As far as I can tell, the answer is no, you can't do any of that -- because whatever you want to do, some papers will be licensed to allow it and some won't. Barring some way to reach agreements with dozens or perhaps hundreds of publishers and pre-sort millions of papers on the basis of licensing, the entire PMC barrel is spoiled by the copyrighted, NC and similar apples -- though there is a much smaller uncontaminated barrel available4.
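The "pre-sort on the basis of licensing" step is easy to sketch in Python, provided every paper carries a machine-readable license label. That proviso is precisely the assumption that fails in practice. The records below are invented placeholders, and treating only CC-BY as safe for remixing is my simplification, not a legal judgement.

```python
from collections import Counter

# Assumption: only these license labels allow unrestricted text mining and remixing.
REMIX_OK = {"CC-BY"}

# Invented placeholder records; a real collection would hold millions of entries.
papers = [
    {"id": "paper-1", "license": "CC-BY"},
    {"id": "paper-2", "license": "CC-BY-NC"},
    {"id": "paper-3", "license": "all-rights-reserved"},
    {"id": "paper-4", "license": "CC-BY"},
]


def minable_subset(papers):
    """Keep only the papers whose license label is on the remix-safe list."""
    return [p for p in papers if p["license"] in REMIX_OK]


def license_counts(papers):
    """Count how many papers fall under each license label."""
    return Counter(p["license"] for p in papers)


print([p["id"] for p in minable_subset(papers)])
print(license_counts(papers))
```

The filter is trivial; the hard part is getting a trustworthy license label onto every paper in the first place.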

Which brings me, at long last, to my title. Why "caveat emptor"? Well, if you're buying Open Access -- that is, publishing with a journal that charges author-side fees (remember, most don't) -- make sure you're getting value for your money! If the journal demands your copyright, or slaps an NC license on your work before distributing it, you should know that many possible downstream uses for your work are being pre-emptively eliminated. Are you sure that's what you want?


-------------
1 From the copyright form, emphasis mine:


form.png


2 There's even a clause in the canonical definitions of OA that deals with this issue -- or at least I suspect that's what it's doing there. Budapest, which came first, says this:

The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.

But Bethesda and Berlin, both of which were written about two years later, include this in the definition of Open Access (emphasis mine):

The author(s) and right holder(s) of [OA works] grant(s) to all users a free, irrevocable, worldwide, right of access to, and a license to copy, use, distribute, transmit and display the work publicly and to make and distribute derivative works, in any digital medium for any responsible purpose, subject to proper attribution of authorship (community standards, will continue to provide the mechanism for enforcement of proper attribution and responsible use of the published work, as they do now), as well as the right to make small numbers of printed copies for their personal use.

I suspect, though of course I'm really just guessing, that the "small numbers" clause was inserted at least in part as a reaction to the gleeful scarfing-up of OA works for resale by reprint distributors, or to the threat of same.

It needs the force of law to be any use for that purpose, though, which is where licensing comes in -- using a noncommercial clause like the one in CC NC licenses is a bit like swatting a gnat with a bulldozer, but I know of no licenses which deal specifically with the volume reprint trade but allow other commercial uses.

Frankly, even if there were such a license, I wonder whether publishers who insist on NC now would switch to it. Springer's Open Choice, for example, charges $3000 per article. I would say they've already been paid and shouldn't much care if someone else, without restricting access to their content, makes further profit from it. The barrier (to such a view) seems to be a mindset that says "why shouldn't I get my cut?" -- and if any other downstream use should arise that starts to make serious money, they would want their cut of that too. To make sure they get it, just in case it ever comes into being, I expect that many publishers would be willing pre-emptively to kill off any smaller commercial innovations that might otherwise arise.

(Someone will no doubt argue that these fledgelings could always negotiate via the waiver clause, as above. The main problem there is that such negotiations themselves cost money, and since much of the promise of OA is in remix across a wide range of sources, that means negotiating with every publisher. Let me know how that works out for ya.)


3 In fact, although NC clauses don't require a particular license for derivative or collective works, they do exert a kind of de facto copyleft, because they are only downstream-compatible with other NC licenses -- see footnote 1 here, or play this game for a while.


4 Two things of note here: firstly, the NIH apparently agrees with me that OA by definition removes both price and permission barriers, since they refer to the uncontaminated barrel as Open Access and explicitly say that the rest of their content is free, not OA. Secondly, following on from Egon's and Antony's questions, I wonder: by permitting the spoilage, can databases violate the licensing terms of the CC-BY papers they also contain? The question hinges on this wording:

You may not offer or impose any terms on the Work that restrict the terms of this License or the ability of the recipient of the Work to exercise the rights granted to that recipient under the terms of the License. [...] When You Distribute or Publicly Perform the Work, You may not impose any effective technological measures on the Work that restrict the ability of a recipient of the Work from You to exercise the rights granted to that recipient under the terms of the License.

Egon and Antony are asking more directly technological questions, but I do think it could be argued that if they do not do as PMC has done and make available a libre OA subset, databases can be seen to be imposing terms that restrict, etc.



Tuesday, 28 April
Perpetuating an OA myth

Maxine at Nautilus posted a slightly shortened version of this letter to Nature from Raf Aerts; what caught my eye was the rearing of a familiar ugly head (emphasis mine):

...the [global recession] may also be affecting the publication output of research institutions in a more subtle way. It could be boosting the traditional reader-pays publication model for scientific journals at the expense of the author-pays, or open-access, model.

Open-access journals ask authors to pay for processing their manuscripts (which involves organizing a form of quality control, formatting and distribution) so that the final product becomes freely available, and free to use if properly attributed. [...]

This myth, that OA is synonymous with author-pays, is a toll-access publisher's delight. It simply is not true. See here for detail; briefly:

  • in 2005, the Kaufman-Wills group showed that "...more than half of DOAJ [Open Access] journals did not charge author-side fees of any type, whereas more than 75% of ALPSP, AAMC, and HW subset [Toll Access] journals did charge author-side fees." (Note that this study included only 248 journals from the DOAJ.)
  • in 2007, Peter Suber and Caroline Sutton showed that, of 450 OA journals published by 468 scholarly societies, only 75 -- fewer than 20% -- charged author-side fees
  • also in 2007, I showed that only 18% of the almost 3000 journals in the whole DOAJ charged author-side fees; 67% did not charge such fees, and the information was missing for 15%.
  • in March 2008, Heather Morrison showed that more than 90% of the psychology journals in the DOAJ charge no publication fee1
  • about a month ago, I showed that only 38 (42%) of the 90 full-OA chemistry journals in the DOAJ charged author-side fees (49% did not charge such fees, and information was missing for 9%).
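For anyone who wants to check the arithmetic in the list above, here is a quick Python sketch. It uses only the counts and percentages quoted in the bullets; the helper name `share` is my own.

```python
def share(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


societies = share(75, 450)  # society-published OA journals charging author-side fees
chemistry = share(38, 90)   # full-OA chemistry journals charging author-side fees

# The three-way splits quoted for the whole DOAJ and for chemistry should each total 100%.
doaj_total = 18 + 67 + 15
chemistry_total = 42 + 49 + 9

print(societies, chemistry, doaj_total, chemistry_total)
```

The 75 of 450 figure comes out comfortably under the "fewer than 20%" quoted, and 38 of 90 rounds to the 42% given.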

Raf goes on to say:

...few peer-reviewed open-access journals have so far had a high impact factor in their field, except for a small number such as those published by the Public Library of Science and BioMed Central. They are therefore struggling to emerge and to attract the most prestigious research findings.

This situation could deteriorate further if open-access journals are forced to move to (partial) site licensing in order to cover their production costs -- a shift recently undertaken by the Journal of Visualized Experiments, for example -- as authors become increasingly reluctant or unable to pay in the current financial climate.

I don't see why we should assume that anything will "deteriorate" if OA journals switch to new funding models, or that OA journals will have a harder time 'emerging' if they move to a model that is actually closer to the old, familiar toll-access model. After all, there already exist a wide variety of ways in which OA publications pay the bills: advertising, endowments, philanthropy, institutional subsidies, memberships, priced editions and more. In particular, hybrid journals (which is what JoVE has become) are popular with toll-access publishers as a way to establish a foothold in OA territory. Inter alia, Elsevier, Springer and Wiley all publish hybrid journals, and between them, those three account for more than 40% of the worldwide science/tech/medicine publishing market -- so the hybrid model is pretty well established.

There's more to say about authors' willingness and/or ability to pay, too. Firstly, it's almost never the author who pays, but the funding body paying for that author's research. At the moment, this can translate into using up precious grant money when there's a need to pay author-side fees, but with 77 funder, institutional and departmental OA mandates in place and more on the way, it seems reasonable to suppose that more and more of the mandating bodies will underwrite more and more of the costs of publishing. For example, HHMI has institutional agreements/memberships with BMC, Springer and Elsevier, and BMC's page of funder policies shows that a majority of UK funders either make additional funds available or allow publication charges to be treated as an indirect cost. Many OA journals also waive or reduce their fees on application; for instance, here are the PLoS (scroll down) and BMC policies.

Finally, remember that the Kaufman-Wills study showed that 75% of the toll-access journals surveyed charged author-side fees (page charges, colour charges, reprint charges, etc) in addition to their subscription charges. So when there are author-side fees involved, I'd like to know how those charged by OA journals (in return for which the work is freely available to everyone, forever) compare with those charged by toll-access journals (in return for which, authors often cannot retrieve their own work, and anyone who wants to read it must pay another fee).


1 updated 04/29 after reading this post from Peter Suber



Saturday, 18 April
Scholarly (scientific) journals vs total serials: % price increase 1990-2009

Following on from this post, I manually extracted historical data for average scholarly journal prices in a dozen broad disciplines from the Library Journal Annual Periodicals Price Surveys by Lee Van Orsdel and Kathleen Born, and compared these with three datasets from the earlier post: ARL libraries' median total serials expenditures (ARL all serials), Abridged Index Medicus average journal price (AIM) and the consumer price index (CPI):


LJ.png

My concern with the AIM dataset was that it was too small and specialized to support broad conclusions, but it turns out that the AIM data sit somewhere in the middle of the disciplines analysed. Astronomy is closest to the ARL all serials median, with math and computer science not much worse; general science is the worst offender, with engineering and technology, chemistry and food science not far behind. From 1990 to 2008, total price increases ranged from 238% (astronomy) to 537% (general science); that's 3.7 and 8.3 times the increase in the CPI, respectively.

This dataset covers an average of around 3600 journals from 2005-2009, 3255 from 1997-2001 and 2655 from 1989-1990. I think this represents good evidence that historical price data for total serials, even though it shows a rate of increase far greater than that of the CPI, masks an even greater rate of increase among scholarly (scientific) journals. It's difficult to look at that graph and believe that scholarly publishers are playing fair, particularly when one remembers that online publishing, with its attendant cost reductions, came of age during the same period of time.

The Van Orsdel/Born surveys include a number of other scholarly disciplines (art, architecture, business, history, language, law, music, etc etc). If I have the time I'll work those up as well, to provide as broad a picture as possible. I should also include numbers of titles in each discipline, to give some idea of total influence. For instance: although general science (around 60 or 70 titles) shows the greatest increase, it likely contributes far less to the serials crisis than health sciences (more than 1500 titles).

(The data are available in this Excel spreadsheet.)



Friday, 17 April
Some wishes come true.

A while back, I posted about my discovery (new to me, though not new to many others) that the serials crisis should probably be called something like the "scholarly journals crisis". The term "serials" includes a wide range of publications, most of which are not peer-reviewed scholarly journals -- newspapers, government reports issued in series, yearbooks, magazines and more. Only about 1/10 of the serials in Ulrich's directory are peer-reviewed. The average scholarly journal costs around 10 times as much as the average serial, and while the cost of the scholarly literature continues to climb, median serial unit costs at ARL libraries have actually been falling for the last seven or eight years (Fig 1 below). It therefore appears that scholarly journals are the driving force behind the serials crisis.

At the time, I wished that I had some specific data to show the difference between scholarly and average serials -- hence the title of this post: via medinfo, I learned that EBSCO Information Services has released a brief report (pdf!) on the price history of well regarded clinical journals, using 117 titles from the NLM's Abridged Index Medicus (AIM). This is a curated list of biomed journals "of immediate interest to the practicing physician" and can be searched on PubMed as a subset limit named "core clinical journals".

As a reminder, here's that graph; it's from the ARL stats report from 2004-5 and the reason it's famous is the way that "Serials Expenditures" outstrips the Consumer Price Index (CPI) and other measures:


ARL.png



Here's a comparison of that data with the price history of the AIM journals; the line labeled "expser/ARL libraries all serials" shows the 1990-2005 subset of the "Serials Expenditures" data from Fig 1, and "EBSCO/core clinical journals" shows the AIM data:


EBSCO.png

Data labels (ARL data from here):

  • serpur: Current Serials Purchased, median value from all ARL libraries
  • expser: Expenditures for Serials, median etc
  • totsal: Total Salaries & Wages, median etc
  • serunit: Serial Unit Cost; median value of expser/serpur calculated for all ARL libraries
  • EBSCO: average price per journal in the Abridged Index Medicus set
  • CPI-U: Consumer Price Index, all urban consumers, annual average, not seasonally adjusted
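
One detail in the labels above is easy to miss: serunit is described as the median of each library's own expser/serpur ratio, which is not the same as dividing the median expser by the median serpur. The sketch below uses three made-up libraries (the names and figures are hypothetical, not ARL data) to show that the two calculations can disagree.

```python
from statistics import median

# Hypothetical (expser, serpur) pairs: serials expenditure in dollars and
# number of serials purchased. These are invented numbers, not ARL data.
libraries = {
    "Library A": (4_000_000, 10_000),
    "Library B": (6_000_000, 60_000),
    "Library C": (9_000_000, 30_000),
}

def median_unit_cost(data):
    """serunit as described above: median of the per-library expser/serpur ratios."""
    return median(expser / serpur for expser, serpur in data.values())

def ratio_of_medians(data):
    """A different quantity: median expser divided by median serpur."""
    med_expser = median(expser for expser, _ in data.values())
    med_serpur = median(serpur for _, serpur in data.values())
    return med_expser / med_serpur

serunit = median_unit_cost(libraries)
naive = ratio_of_medians(libraries)
print(f"median of ratios: {serunit:.0f}, ratio of medians: {naive:.0f}")
```

With these invented numbers the median of ratios is 300 while the ratio of medians is 200, so the serunit curve cannot simply be read off by dividing the expser curve by the serpur curve.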


This is exactly what I wished for, hard evidence of the difference between scholarly and average serials; and what that evidence strongly indicates is that price increases in scholarly journals are driving the serials crisis. Scholarly journals far outstrip total serials in terms of annual price increase, even though the latter shows a much more rapid increase than the CPI. In contrast, library salary expenditure follows the CPI closely, and median serial unit cost (all serials) has been dropping slowly since 2000.

Frankly, I'm tempted to name this the Big Fat Ripoff Graph. Between 1990 and 2008, the CPI increased by about 65%, whereas over the same period the average price of an AIM journal increased by 415%, a 6.4-fold difference. I've seen publishers try to defend the "total serials expenditures" vs CPI discrepancy by pointing out that journals are proliferating -- indeed, the "serials purchased" curve is headed upwards at an increasing rate, particularly over the last five years or so. But that defense is no good against the BFR Graph, on which the most damning curve shows average journal prices. I've also seen comments to the effect that if mean or median serial unit costs are dropping, publishers must be offering increasing value for money even if they are charging more in total. That might be true of the set of "all serials publishers", but it's apparent from the BFR Graph that scholarly journal publishers can make no such claim.
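
For anyone who wants to check the arithmetic, here is a minimal sketch of how the "fold difference" figures are obtained: the percentage increase for a set of journals divided by the percentage increase in the CPI over the same period. The function names are mine; the 65% and 415% figures are the ones quoted above, and 238% and 537% are the astronomy and general science figures from the Library Journal comparison elsewhere on this page (assuming the same 65% CPI increase applies there).

```python
def fold_difference(item_increase_pct, baseline_increase_pct):
    """Ratio of two percentage increases, e.g. journal prices vs the CPI."""
    return item_increase_pct / baseline_increase_pct

def price_level_ratio(item_increase_pct, baseline_increase_pct):
    """Ratio of the final price levels rather than of the increases."""
    return (1 + item_increase_pct / 100) / (1 + baseline_increase_pct / 100)

CPI_INCREASE = 65.0  # approximate CPI increase, 1990-2008, as quoted above

aim_fold = fold_difference(415.0, CPI_INCREASE)        # AIM journals
astro_fold = fold_difference(238.0, CPI_INCREASE)      # astronomy
gensci_fold = fold_difference(537.0, CPI_INCREASE)     # general science
aim_level = price_level_ratio(415.0, CPI_INCREASE)

print(round(aim_fold, 1), round(astro_fold, 1), round(gensci_fold, 1))
print(round(aim_level, 1))
```

Note that this compares the increases, not the resulting price levels: a 415% increase against a 65% increase leaves the AIM price level about 3.1 times higher than it would have been had it simply tracked the CPI.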

It must be remembered, of course, that we are only looking at a little over a hundred clinical journals here, a small and discipline specific subset. Nonetheless, the result is so striking that I think it is a considerable inducement to the gathering of more data. Since it seems my wishes for more work are coming true, I'll make another: now I want price history data for other, larger journal subsets in other scholarly disciplines. I wonder what the BFR Graph looks like for those datasets?

(P.S. If you want the numbers I used, or to check my work, the spreadsheet is here.)


Update: ha! I just got around to reading this article, linked by Peter Suber a couple of days ago; turns out it's full of annual price data, and Van Orsdel and Born have been doing these surveys for at least ten years. There doesn't seem to be a central collection or data collation, so I'll have to piece it together. Stay tuned!



Wednesday, 15 April
What's wrong with copyleft?

This FriendFeed thread regarding the Wikipedia licensing vote has stirred up an old hornet's nest of issues surrounding copyleft and noncommercial clauses in Open licenses. As I said in the thread, I get most of my ideas on this topic from David Wiley, and have posted about those ideas before. Herewith another attempt to organize and clarify my thoughts, as much for my own benefit as anything:


1. The purpose of Open licensing is to enable the following (this is straight from David's Open Education License draft, about which more later):

  • Reuse - Use the work verbatim, just exactly as you found it
  • Rework - Alter or transform the work so that it better meets your needs
  • Remix - Combine the (verbatim or altered) work with other works to better meet your needs
  • Redistribute - Share the verbatim work, the reworked work, or the remixed work with others


2. The purpose of restrictive clauses in such licensing is to prevent specific types of reuse, rework, remix and/or redistribution:

2a. Copyleft prevents future copyright lockup by requiring that all downstream (reworked or remixed) works be similarly licensed.

2b. Noncommercial clauses prevent profitmaking, and are complicated, and I'm not getting any further into it than that right now. (Maybe later, if my brain doesn't melt.)


3. Although copyleft and NC clauses achieve their own immediate goals, widespread license incompatibility1 means that they often (perhaps usually) defeat part of the larger purpose of Open licensing. The use case where this is most prominent is remix2, since reuse and redistribution of individual copylefted or NC-licensed works or their derivatives is usually just a matter of retaining the original license. But multiple works can only be recombined into new works if their respective licenses are compatible -- otherwise, there's no licensing option for the remix that doesn't violate the licensing terms of at least one of the ingredients. Not only that, but if any of the works in the mix carries a copyleft license, that license takes over the entire remix and everything downstream of it, thus propagating the incompatibility problem.
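
To make the remix problem concrete, here is a deliberately simplified sketch. It is my own toy model, not Creative Commons' official compatibility chart and certainly not legal advice: each license is treated as a set of elements (BY, NC, SA, ND), an ND work cannot be remixed at all, an SA work forces its own license onto the remix, and every restriction carried by any ingredient must survive in the result.

```python
from typing import FrozenSet, Iterable, Optional

License = FrozenSet[str]

def lic(*elements: str) -> License:
    """Build a license as a set of elements (toy model, not the legal text)."""
    return frozenset(elements)

BY = lic("BY")
BY_SA = lic("BY", "SA")
BY_NC = lic("BY", "NC")
BY_NC_SA = lic("BY", "NC", "SA")
BY_ND = lic("BY", "ND")

def remix_license(ingredients: Iterable[License]) -> Optional[License]:
    """Return a license the remix could carry, or None if no option exists."""
    works = list(ingredients)
    # Assumption: a no-derivatives work cannot be part of any remix.
    if any("ND" in w for w in works):
        return None
    # Copyleft: each SA work demands that the remix carry its exact license.
    copyleft = {w for w in works if "SA" in w}
    if len(copyleft) > 1:
        return None  # two different copyleft licenses cannot both win
    # Every restriction on any ingredient must be preserved downstream.
    required = frozenset().union(*works)
    if copyleft:
        (sa_license,) = copyleft
        return sa_license if required <= sa_license else None
    return required

print(remix_license([BY, BY_SA]))        # the copyleft license takes over
print(remix_license([BY_SA, BY_NC]))     # None: no license satisfies both
print(remix_license([BY_SA, BY_NC_SA]))  # None: competing copyleft terms
```

Even in this toy version, the two points made above fall out directly: a single SA ingredient dictates the license of the whole remix, and pairing it with an NC work (or a different SA license) leaves no licensing option at all.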


4. One last thing: could copyleft be saved from itself? What if someone wanted copyleft protection, without the compatibility issues? Creative Commons is already beginning to build the only solution I can think of: widespread interoperability agreements between existing and any newly developed copyleft licenses. CC-BY-SA 3.0 contains the following clause:

You may distribute, publicly display, publicly perform, or publicly digitally perform a Derivative Work only under: (i) the terms of this License; (ii) a later version of this License with the same License Elements as this License; (iii) either the Creative Commons (Unported) license or a Creative Commons jurisdiction license (either this or a later license version) that contains the same License Elements as this License (e.g. Attribution-ShareAlike 3.0 (Unported)); (iv) a Creative Commons Compatible License.
where (iv) is defined as
a license that is listed at http://creativecommons.org/compatiblelicenses
Sadly, the cupboard remains bare so far:
Please note that to date, Creative Commons has not approved any licenses for compatibility; however, we are hopeful that we may be able to do so in the future. If you would like to discuss the possible compatibility of your license with a Creative Commons license, please email us at info@creativecommons.org.

I am personally persuaded that the Public Domain is the best way out of the copyleft trap, which is why I use CCZero for everything I make.






-------------
1 Among CC licenses, there is only about 33% compatibility, and that drops to 20% among NC and SA versions -- including self-compatibility*:

cccompatibility.png


Restrictive (NC, SA) versions currently account for around 80% of worldwide CC license uptake. Once you start factoring in the dozens and dozens of other Open/Free licenses out there, it only gets worse. The FSF and OSI maintain lists of licenses and compatibilities (here and here, respectively), and Wikipedia includes a couple of fairly extensive comparison tables. Speaking of Wikipedia, the world's favourite online encyclopaedia is currently released under the GNU Free Documentation License, which is not compatible with any CC license except Public Domain, though it does allow transition to CC-BY-SA. If the current vote on that transition is "yes", that will be a step forward -- but it will still leave Wikipedia with the compatibility problems shown in the figure above. Exploration of compatibility issues with all the other Free/Open licenses is left as an exercise, etc.

* from here and here; green indicates compatibility, light green indicates possible compatibility -- some disagreement between sources.


2 This is why I consider David's "Four R's" formulation so important, because it makes a clear distinction between rework and remix that is essential to understanding the aims and implementation of Open licenses.



Monday, 13 April
Anniversary of sorts

This question from Antony Williams on FriendFeed:

Is PubChem Data Open or not? There are many discussions saying that PubChem data are Open but I see PubChem as a host and the disclaimer does not say "open": http://tinyurl.com/e78as

reminded me that it's almost a year to the day since Egon Willighagen asked a similar question about PubMed Central content:
I was wondering about this section in the CC license of much of the PMC content, such as our paper on userscripts (section 4a of the CC-BY 2.0):
    You may not distribute, publicly display, publicly perform, or publicly digitally perform the Work with any technological measures that control access or use of the Work in a manner inconsistent with the terms of this License Agreement.
CC-BY 3.0 reads differently, but has similar aims. [...] Peter [Murray-Rust, see here] indicates that the NIH has put in place 'technological measures to control access' to the distribution of our work on userscripts (the PMC entry). That is in clear violation of the CC license. [...] What the PMC website should indicate, instead, is that text mining is allowed for the PMC OAI subset, but that they would highly prefer to use the PMC OAI or PMC FTP routes. This is the least they have to do.

No matter what, I still have the feeling that any technical obstacles are disallowed by the CC-license. Any legal expert here, that can explain me if the CC license allows controlling how people have access to my material?

These are both very good questions, and I still don't have an answer for Egon's even after a year. I'm reluctant to go pestering John Wilbanks with every CC-related question I come across, so I'm reposting in the hope that someone will be able to save John from me.



Monday, 13 April
Lazy reporter, no donut.

Dennis Carter in an eCampus News article about NPG's Scitable:

Scitable's January launch came as elite universities across the United States are embracing open-access formats--making research articles available for free online. This marks an abrupt departure from the traditional model of printing research articles in academic journals, which can cost campuses as much as $20,000 annually, open-access experts say.
So, is it the traditional model that can cost campuses up to $20K/yr, or academic journals, each of which can cost etc?

It's only obvious that what is meant is $20K/yr per journal subscription if you already know that libraries spend millions of dollars per year on serials.

I'd expect a publication that wants you to register to read its content1 to bother making that content accurate and unambiguous.


-------------
1 Sure, registration is free. Registration also provides the publisher with a great bolus of immensely valuable marketing information, to say nothing of the slimy opt-out spam opportunity. Which is why I recommend providing minimal information unless you get content that you really value from the site. (I originally recommended poisoning such databases with fake information; two wrongs etc, hence the edit.)



Monday, 13 April
Someone else is fooling around with numbers.

Via Peter Suber, I came across this editorial in the Journal of Vision:

Measuring the impact of scientific articles is of interest to authors and readers, as well as to tenure and promotion committees, grant proposal review committees, and officials involved in the funding of science. The number of citations by other articles is at present the gold standard for evaluation of the impact of an individual scientific article. Online journals offer another measure of impact: the number of unique downloads of an article (by unique downloads we mean the first download of the PDF of an article by a particular individual). Since May 2007, Journal of Vision has published download counts for each individual article.
The author goes on to compare download vs citation (counts and rates, and downloads or citations over time). It's a pretty good analysis of an important topic, but something vital is missing:
Where are the data? Can I have them? What can I do with them?1
In fact, the data are approximately available here. Why "approximately"? Well, I can get a range of predigested overviews: DemandFactor (roughly, downloads/day/first 1000 days) Top 20, total downloads Top 20 and article distributions by DemandFactor and total downloads. I can also get the download information for any given article -- one article at a time, and once again predigested in the form of a graph from which I have to guesstrapolate if I want raw, re-useable data.

This is disappointing, for both general and specific reasons. It's always disappointing to see data locked away in a graph or a pdf or some similar digital or paper oubliette, there to languish un(re)used. It's also disappointing to see a journal getting way out ahead of the curve on something as important and valuable as download metrics (is there another journal besides J Vis that provides this information, even predigested?), and then missing an opportunity to continue to innovate by providing real Open Data.

It's also disappointing in this specific instance, because I have a question: why is Figure 1 plotted on a log scale and, more importantly, was the correlation coefficient calculated from log-transformed data? I could understand showing the log scale for aesthetic reasons, but I can't think of a reason to take logs of that kind of data -- and doing so can alter the apparent correlation. For instance, remember Fig 1 from this post? Here it is again, together with a plot of log-transformed data, both shown on natural and log scales:


logarithmssarehard.PNG
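
The same point can be made with a few lines of code. This is only a sketch with invented numbers (the "downloads" and "citations" below are hypothetical and follow a square-root relationship I made up; they are not Journal of Vision data): the Pearson correlation is computed once on the raw values and once on their logarithms.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Invented data: citations grow as the square root of downloads.
downloads = [10, 20, 40, 80, 160, 320, 640]
citations = [d ** 0.5 for d in downloads]

r_raw = pearson(downloads, citations)
r_log = pearson([math.log10(d) for d in downloads],
                [math.log10(c) for c in citations])

print(f"r on raw values:      {r_raw:.3f}")
print(f"r on log-transformed: {r_log:.3f}")
```

Because a power-law relationship becomes a straight line after taking logs, the log-transformed correlation here is essentially perfect while the raw one is not. Which of the two a paper reports therefore matters, and without the underlying data a reader cannot tell what the other would have been.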



I could answer my own question quickly and easily if I could get my hands on the underlying data -- which leads me right back to one of the primary general arguments for Open Data. If I, statistical ignoramus and newcomer to these sorts of analyses, have questions after a brief skim through the paper, what questions might a better equipped and more thorough reader have? It's simply not possible to know -- the only way to find out is to make the data openly available!

I realise it's not possible for journals to demand Open Data from their authors -- that's what funder-level mandates are for, though there's much discussion still to be had regarding whether Open Data mandates would be a good idea. Nonetheless, when journals publish analyses of their own data, it would be great to see them leading the way by providing unrestricted access to that data.

-------------
1 Astute readers, both of you, will remember that howl of anguish, er, refrain from this post.



Saturday, 04 April
Why don't we share data? Not for the reasons Steven Wiley thinks we don't.

Via Peter Suber, I came across an editorial about data sharing in The Scientist. I disagree with the author, PNNL's Steven Wiley, on a number of points:

Despite the appeal of making all biological data accessible, there are enormous hurdles that currently make it impractical. For one, sharing all data requires that we agree on a set of standards. This is perhaps reasonable for large-scale automated technologies, such as microarrays, but the logistics of converting every western blot, ELISA, and protein assay into a structured and accessible data format would be a nightmare -- and probably not worth the effort.

Wiley is making two mistakes here: setting the perfect against the good, and vastly underestimating human ingenuity.

Standards are inarguably required for automated sharing and essential for the sharing of ALL data, but that doesn't mean that sharing SOME data, with evolving standards or even without any standards, has no utility. My pet example is the long standing practice of supporting scientific claims with the phrase "data not shown" in peer-reviewed papers, something I think should no longer be allowed. All scientific claims should be supported by data. "Data not shown" belongs to the print era, when space was limited and distribution relied on physical reproduction and transport. This is the era of the online supplement, to which no such restrictions apply.

Reasonable people might contend that I am stretching the concept of "data sharing" to cover my pet peeve there, but I chose the example deliberately as an edge case: there is, to me, clear utility in that kind of data sharing, even though it involves no standards, only some data, and only eyeball-by-eyeball access (whereas I myself frequently argue that the greater part of the value of Open distribution probably lies in the long term, in machine-to-machine access). I argue that more sharing, using -- despite their current flaws -- evolving standards, is likely to yield significant dividends well before reaching the eventual goal of sharing all data using universal standards.

This leads me to the second mistake. It seems odd to me to insist that because standards are difficult to develop and implement, the bulk of such work is futile. The key is the phrase "currently... impractical". The whole concept of the internet was probably considered "currently impractical" by a great many people, until someone went and built it. There are plenty of people still willing to pronounce Free/Open Source software "currently impractical", even as they (perhaps unwittingly) rely on it every time they go online or send email. Then-existing hurdles at various times surely made business on the internet "currently impractical", and banking on the internet "currently impractical", and -- need I go on?

Moreover, I am not the only one who disagrees about the value of creating standards for difficult-to-share data. If you think western blots would be a nightmare, how about biodiversity data -- like, say, museum specimens? How about anthropometric data, exchangeable biomaterials, neuroscience data, electron micrographs, magnetic resonance images or microscopy images? The MIBBI project has dozens of other examples, the Open Biomedical Ontologies Foundry is working on dozens more, and Bioformats.org might offer a lightweight solution to some of the same problems.

(In re: Wiley's specific examples: I was easily able to find efforts underway to enable sharing of gel electrophoresis data, protein affinity reagents and molecular interaction experiments; and I can't imagine ELISA data being much harder to share than microarray information -- surely MIAME, for instance, could readily be adapted if it wouldn't already serve? I'm not sure what kind of protein assay Wiley has in mind.)

I cannot begin to imagine how to build semantic and exchange standards for those kinds of data, but I'm not about to bet against the people currently trying to do so; nor do I believe that, once built, their systems will prove to have been "not worth the effort".

As I mentioned, reasonable people might disagree about various points above. But Wiley goes on to say:

Unfortunately, most experimental data is obtained ad hoc to answer specific questions and can rarely be used for other purposes.

which is just plain wrong. Much of the rationale for data sharing, the engine of much of its promise, is the simple observation that you cannot know what someone else will do with your data, particularly when they have access to lots of other people's data to go with it. Re-use beyond the scope of the original author's imagination is a primary impetus for data sharing, and innovative examples abound; for instance, just take a look at Tony Hirst's blog. (If there is a dearth of examples from biomedical research, I'd call that an argument in favor of more, not less, data sharing.)

"Can rarely be used" is an empirical claim, and those should be backed by data -- and I can think of only one way to get the relevant data in this case.

Wiley continues:

Good experimental design usually requires that we change only one variable at a time. There is some hope of controlling experimental conditions within our own labs so that the only significantly changing parameter will be our experimental perturbation. However, at another location, scientists might inadvertently do the same experiment under different conditions, making it difficult if not impossible to compare and integrate the results.

[...] In order to sufficiently control the experimental context to allow reliable data sharing, biologists would be forced to reduce the plethora of cell lines and experimental systems to a handful, and implement a common set of experimental conditions.

Experimental results are supposed to provide useful information about the world of sense-perception. If a result cannot be repeated by different hands in a different lab, then it is probably not telling us what we think it is telling us about the way the world works. If, on the other hand, a particular result does mean what we think it means about the underlying system, then we should be able to design different experiments to be carried out with different hands, conditions, equipment etc., and obtain data that supports the same conclusions. That's what we call a robust result, and standard practice is to aim for robust results.

Regarding integration and comparison of results from different conditions -- just what does meta-analysis mean, if not exactly that? As an example, if you were to knock Pin1 down in HeLa cells, you'd block their growth, but Pin1 knockout mice survive just fine. Comparison of those results is not only possible, but extremely interesting, and is the way we learned that mice have an active Pin1 isoform, Pin1L, which is present but potentially inactive in humans.

I think that variation in conditions between labs is a good reason to build finer-grained semantic structures, but no reason at all to throw up our hands and give up on linked data.

Wiley goes on to give as his sole concrete example the lack of uptake into published papers of data from the Alliance for Cell (sic) Signaling. It's actually the Alliance for Cellular Signaling1; their website lists 20 publications, NextBio finds 35 and Google Scholar (which covers a lot more than peer-reviewed papers) finds 440. Scholarly papers are a somewhat limited measure of research impact, but that's not at first glance an impressive showing. Consider, though, that the AfCS was established in the late 1990s, which puts it well ahead of its time, and then compare the first, second and ongoing third decades of the undisputed poster child of data sharing2:

genbankgrowth.PNG

There's more to Wiley's choice of example, though:

In my own case, I am interested in the EGF receptor and receptor tyrosine kinases. This aspect of cell signaling was not covered in their dataset, and thus it is of no interest to me.

I wish I had a dollar for every time I'd heard an argument against some new idea that boils down to: "I can't figure this out, or find a use for it myself; therefore it's no good and will never be any use to anyone". I'm sure there's a pithy Latin name for this particular logical fallacy.

Wiley continues in, as it turns out, a similar vein:

And soon, discussions about the importance of sharing may become moot, since the rapid pace of technology development is likely to eliminate much of the perceived need for sharing primary experimental data. High throughput analytical technologies, such as proteomics and deep sequencing, can yield data of extremely high quality and can produce more data in a single run than was previously obtained from years of work. It will thus become more practical for research groups to generate their own integrated sets of data than try to stitch together disparate information from multiple sources.

And just what does the PNNL's Biomolecular Systems Initiative (of which Wiley is director) do? By a strange coincidence, this:

advancing our high-resolution, high-throughput technologies by exploiting PNNL's strengths in instrument development and automation and applying these technologies to solve large-scale biological problems....

We are building a comprehensive computational infrastructure that includes software for bioinformatics, modeling, and information management. To be more competitive in obtaining programmatic funding, we will continue to invest in new capabilities and technologies such as cell fractionation, affinity reagents, high-speed imaging, affinity pull downs, and ultra-fast proteomics. This will help us build world-class expertise in the generation and analysis of large, heterogeneous sets of biological data. The ability to productively handle extremely large and complex datasets is a distinguishing feature of the biology program at PNNL.

The remainder of this post is left as an exercise for the reader; be sure to cover the question of how less well-heeled institutions are supposed to carry out work in proteomics and deep sequencing and so on, and don't forget to ask for evidence showing that it is not important to share data even between such high-fliers, since presumably they can extract every last conceivable piece of useful information from their own data...


-------------
1 You'd be amazed how many things share that acronym -- activity-friendly communities, antibody-forming cells, ataxia functional composite scale, antral follicle count, alveolar fluid clearance, age at first calving, amniotic something something -- that's where I gave up. Why oh why can't we have a decent text search? Even just "match case" would have solved much of my problem here. /rant

2 graph from here



Wednesday, 01 April
Fooling around with numbers, part 5b.

I've already assigned part 6 to a particular analysis in an effort to get me to actually do that work, but I felt that I just had to include this (via John Wilbanks) in the series:



Lemongraph.jpg



I'm just sayin'. (I may have to get that graph as a tattoo).


P.S. Never mind the date, this is not a trick; I hate online April Fool jokes with the fiery power of a thousand burning suns.




Tuesday, 24 March
Entry for Ada Lovelace Day

Today is Ada Lovelace Day:

Ada Lovelace Day is an international day of blogging to draw attention to women excelling in technology.


Women's contributions often go unacknowledged, their innovations seldom mentioned, their faces rarely recognised. We want you to tell the world about these unsung heroines. Entrepreneurs, innovators, sysadmins, programmers, designers, games developers, hardware experts, tech journalists, tech consultants. The list of tech-related careers is endless.

Since most of my role models who happen to be female are not really in any kind of tech career, I'm spared the need to write the enormous essay that it would take to cover them all. Instead I'll point to just two for whom I can reasonably make a tech connection: Rosie Redfield and Maureen Hoatlin.

I've never met Rosie, who is a PI in the Zoology Department at University of British Columbia, but she is one of the first biomed researchers -- if not the very first -- to embrace Open Science and I've been following her online presence for a couple of years now. From her lab's homepage you can read not just the usual list of publications and personnel, but also submitted research proposals and work in progress. The latter is communicated by blog: Rosie has one, and so do several other lab members. They discuss upcoming and ongoing experiments, work up data and think out loud about their research in general.

I met Maureen after we were both quoted in Mitch Waldrop's SciAm article on Open Science, and she realized that we worked on the same campus. Maureen is a PI in the Biochem Dept at OHSU. She tells a great story about neglecting her family one weekend while she sat in bed reading scientific articles online -- "this changes everything" was all she would say to their pleas for breakfast, etc. Well, Maureen meant what she said, and she's walking the walk. You can find the Hoatlin lab on OpenWetWare, along with a wiki-based, bottom-up, ongoing experiment in improving grad student education that she pioneered, and you can find Maureen on a range of social networking sites including FriendFeed and LinkedIn. Her lab has its own Twitter account.

Since I think this sort of open, collaborative model is very much the way of the future, if science is to have a future at all, I'd like to see Rosie and Maureen get their props for having been such early adopters. It's also worth mentioning that, in addition to still being a Boys' Club in many ways, research is a very conservative environment in which new ideas are usually met with scorn and active resistance. So, having made it up the foodchain in the face of irrational opposition, they are now confronting the same tribe with another set of new and threatening ideas. Both are worthy additions to the Ada Lovelace Day pantheon.



Tuesday, 24 March
New blog in town.

I don't normally promote new blogs, other than to add them to my blogroll if I think they are worth my readers' time, but I'll make an exception for PLoS ONE's new community blog, EveryONE:

Why a blog and why now? As of March 2009,  PLoS ONE, the peer-reviewed open-access journal for all scientific and medical research, has published over 5,000 articles, representing the work of over 30,000 authors and co-authors, and receives over 160,000 unique visitors per month. That's a good sized online community and we thought it was about time that you had a blog to call your own. This blog is for authors who have published with us and for users who haven't and it contains something for everyone.


Why did you call the blog everyONE? For three main reasons that encapsulate the mission of the journal:

Firstly, because PLoS ONE is for every rigorous research article that passes our peer-review process.

Secondly, because PLoS ONE is a forum for research in every scientific discipline (with a current emphasis on life and health sciences because of PLoS's history).

Thirdly, because PLoS ONE is a source of information for every inquisitive reader with an interest in high-quality scientific research.

I hope, and on my better days believe, that PLoS ONE is one of the leading models for the future of scientific journals:
  • they offer gold OA -- that is, free online to everyone everywhere from the moment of publication, including submission to PubMed Central
  • they offer a sustainable business model for OA: in the black after less than three years and with an author-side fee of $1300
  • their peer review process is as rigorous as any, but it does not ask reviewers to make guesses about what is "hot", or what is likely to be important at some time in the future: if it's solid science, PLoS ONE will publish it
  • they don't have an Impact Factor: homey don't play dat, as the kids around here say
  • that's not to say that they are not actively seeking rich measures of utility/impact for scientific publications: for instance, here's Bora's roundup of analyses of an experimental dataset that they passed around a while back, and an update from Euan
  • in the same vein, I can't find a link right now but there are plans afoot to release real-time access to such data as downloads, comment frequency and so on -- post-publication measures which can improve and speed up citation based measures; for another example, scroll down on this page for some self-measurement that represents a level of disclosure I have not seen from any other journal
  • they are responsive to and engaged with the community: for instance, both Bora Zivkovic (community manager -- how many journals have one of those?) and Peter Binfield (managing editor) are active on FriendFeed
  • they encourage and enable community input in the form of notes, comments and ratings on every article; I particularly like the option given to reviewers to have their reviews included as comments with the paper

EveryONE is another way for PLoS ONE to engage with their community of readers and contributors, and well worth a look.


DISCLAIMER: I consider Bora and Peter friends of mine, and I've previously applied to work at PLoS.



Saturday, 21 March
Should we talk about the "journals crisis" instead of the "serials crisis"?

I stumbled upon something new-to-me, and possibly even useful-to-others, in my fooling around with numbers (table 2 and discussion thereof here), but it's somewhat buried under all the "how I made this figure" and "where I got these data" details. For that reason, and because I didn't trust my idea until I had some external reinforcement, I thought I'd give it a separate post all its own.

Here's the thing: what is widely known as the serials crisis in library costs is probably driven largely by the pricing of scholarly journals. In library parlance, "serials" includes, inter no doubt many alia, newspapers, government reports issued in series, yearbooks and magazines (periodicals), in addition to the scholarly literature. Of the 225,000 or so periodicals in Ulrich's, only about 25,000 are peer reviewed. In the FriendFeed discussion started by my post, Walt Crawford said:

...some of us have long argued that there isn't a serials crisis for library budgets, there's a scholarly journal crisis. Magazines (and there are about 1/4 million magazines as compared to about 25,000 scholarly journals) tend to have very low prices and very modest increases.

Although non-refereed serials dominate product counts (and, apparently, library collections), the situation is reversed for unit expenditures. The average unit cost for the UCOSC dataset, which is composed entirely of scholarly journals, is roughly ten times the average unit cost for any of the other datasets I used, all of which were general data that included all types of serial. Here's Walt again:

the 10:1 ratio for UC (that is, scholarly journals averaging 10x as expensive as all serials) sounds about right

When the numbers and Walt's experience began to line up, I became much more confident in my conclusion, that the serials crisis is really a scholarly journals crisis. It's not clear to me, in fact, why the phenomenon got the nickname it did; perhaps it's just that "serials crisis" is a punchier phrase.

I'm not at all sure that any of this is more than semantic nitpicking, but giving things their proper name can be important. Most researchers who only hear the name won't care about a "serials crisis" -- that's a library problem, nothing to do with us. But if they hear about a "scholarly literature crisis", it becomes clearer that the issue is the potential loss of access to resources necessary to do our jobs. I suspect most researchers who've heard of the serials crisis are aware that it is, at least in part, about journal pricing, but I wonder how many know that it's pretty much only about journal pricing? This little "discovery" of mine really did put things in a different perspective for me, and I'm probably more informed about library- and publishing-related issues than most benchmonkeys.

I doubt that an alternative name will catch on, and I'm not going to start campaigning for one -- but I think that from now on I'll at least occasionally refer to the "serials/scholarly literature" crisis, or something similar, if only to remind myself of my own little satori. (Question for the lazyweb: can anyone suggest a better phrase, one which would make it more apparent to researchers that they should care about this?)




Thursday, 19 March
Fooling around with numbers, part 5

As promised, here is the distribution of journal prices for the subsets of the Elsevier life sciences dataset which either have or don't have impact factors, and for the entire UCOSC dataset (in which all journals have IFs):

plusminusIF.PNG

Each interval is $500 wide ($0 to $499, $500 to $999, etc), and datapoints are plotted at the midpoint of each interval.
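For anyone who wants to try this kind of binning, here is a minimal sketch in Python. It treats each interval as $500 wide and keys each count by the interval midpoint. The prices in `sample` are invented placeholders, not values from the Elsevier or UCOSC datasets, and `bin_prices` is a hypothetical helper name used only for illustration.

```python
from collections import Counter


def bin_prices(prices, width=500):
    """Count prices per fixed-width interval, keyed by interval midpoint."""
    # Integer-divide each price by the width to get its interval index
    counts = Counter(int(p // width) for p in prices)
    # Convert each interval index to the midpoint of that interval
    return {i * width + width / 2: n for i, n in sorted(counts.items())}


# Hypothetical list prices in dollars (illustrative only, not real data)
sample = [120, 450, 499, 500, 980, 1250, 1499, 3100]
histogram = bin_prices(sample)

for midpoint, count in histogram.items():
    print(f"${midpoint:,.0f}: {count}")
```

Calling `bin_prices(sample, width=1000)` instead shows how a wider interval can merge the first two bins, which is the effect described in the next paragraph.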

The conclusion is the same as in part 1, just a bit clearer now. Elsevier journals without an impact factor are priced lower than those which have an IF, and the price distributions are somewhat different between journals with and without an IF. Note, though, that if I'd used a $1000 interval instead of $500, the initial rise in the +IF curves would not appear; if these are power-law distributions the main difference is probably the scaling exponent. I think. (Math is not my friend.)

It almost looks as though low-end journals are shunted out of the lowest price bracket as soon as they get an IF, any IF, and then tend to increase in price as the IF goes up. Update: no it doesn't. I don't know what I was thinking there.


The rest of the series: part 1, part 2, part 3, part 4.



Tuesday, 17 March
Author-side fees in hybrid and OA chemistry journals

Peter Suber, responding to a J Cheminfo paper, wondered what proportion of chemistry journals in the DOAJ charge author-side fees. Since I was in that mode, as it were:


DOAJchem.png



Hybrid journals are those that offer OA-for-a-fee, so of course all of those charge fees. "Open" above refers to Gold OA journals, roughly half of which charge author-side fees in this chemistry subset. This is broadly consistent with the overall DOAJ listing (as of December 2007) and also with several other studies that Peter mentions.


I still can't solve the tables bug; if you want the numbers, view source -- I've commented out a simple table that displays fine unless Movable bloody Type gets hold of it. If you want to see how I generated the numbers, grab this spreadsheet. I first cut-and-pasted from the DOAJ subject listings into a text editor, then used the replace function to introduce tabs before "hybrid" or "open" and between "publication fee" and the entry for each journal. Then I used the replace function to delete all lines between "hybrid/open" and "publication fee", to simplify the Excel formula... you'll see what I mean if you look at the spreadsheet.



Tuesday, 17 March
Fooling around with numbers, part 4; or, those data -- you keep using them -- I don't think they mean what you think they mean...

At the end of part 3, having looked at some of the ways in which prices and price/use were distributed, I said I'd try to say something about what constituted a fair price. I hadn't thought that through at all, and it turns out that I really can't get much leverage against that question from the UCOSC dataset alone.

In addition to the graphs in parts 1-3, here's yet another way to look at the UCOSC data (again, this is a png from a screenshot because MT ate my balls perfectly good table1):


Table 1
MTsucksass.png


Perhaps Elsevier doesn't stand out quite so much as I might have expected -- they still dominate by virtue of market share, but in terms of cost/use or use/title, Springer looks the worst of the bunch. Mean ($0.76) and median ($1.89) cost per use don't mean much without context. I could argue that since libraries are having trouble keeping up with serials costs and usage is only likely to increase, those probably don't represent fair prices... but I don't know how much weight that argument would hold, and anyway you should go read Heather Morrison on why usage-based pricing is dangerous. (That's one of the benefits of thinking-out-loud like this; knowledgeable people come along and point out stuff you need to know. Yay lazyweb!)

So, I need context: let's start with, how many libraries are there? According to the American Library Association, there are more than 120,000 libraries in the USA -- but for my purposes, I'm really only interested in those which carry the scholarly literature. The US Dept of Education's National Center for Education Statistics runs a Library Statistics Program, which provides data specifically on academic libraries.

According to the ALA and the NCES, there are about 3700 academic libraries in the US. If all of them subscribed (at list price) to the 2904 journals in the UCOSC dataset, that would work out to $13,306,150,900 -- about $13 billion -- per year on scholarly journals alone. To put that into perspective, the entire NIH research budget for 2008 was less than $30 billion. I have been told that most libraries don't pay list price, because publishers offer all kinds of deals, but I wondered whether that $13 billion was at least in the right ballpark, so I went looking for more data.
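The back-of-envelope arithmetic here can be sketched in a few lines of Python. The library count, journal count and $13.3 billion total come from the paragraph above; the list-price sum and the mean list price are back-calculated from them and are not figures stated in the post, so treat them as implied values only.

```python
# Figures quoted in the post
n_libraries = 3700                        # approximate US academic libraries
n_journals = 2904                         # journals in the UCOSC dataset
total_if_all_subscribe = 13_306_150_900   # dollars per year at list price

# Back-calculated (an assumption, not stated in the post):
# the list-price sum that one library would pay for the whole set
list_price_sum = total_if_all_subscribe / n_libraries
mean_list_price = list_price_sum / n_journals

print(f"Implied list-price sum per library: ${list_price_sum:,.0f}")
print(f"Implied mean list price per journal: ${mean_list_price:,.2f}")
```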

Since the UCOSC dataset covers 2003-4, I looked at the NCES report for 2004 (the spreadsheet I used is here). The ALA has another division, the Association of College and Research Libraries, which keeps its own records; alas, these are not free, but I could get nearly everything I wanted from the summaries -- again, I just looked at 2004. There's also the Association of Research Libraries, which is "a nonprofit organization of 123 research libraries at comprehensive, research-extensive institutions in the US and Canada that share similar research missions, aspirations, and achievements", mostly made up of very large libraries (think Harvard, Yale, etc). The ARL also compiles and makes available statistics on its members; I pulled out the 2004 data from the download page (spreadsheet here).

Finally, I added the UCOSC dataset for comparison, and for extra context I pulled out the University of California subset from the ARL data (Berkeley, Davis, Irvine, LA, Riverside, San Diego and Santa Barbara; I think these are the largest 7 of UC's 10 main campus libraries).  The resulting data look like this2:


Table 2
MTstillsucksass.png


Na, not applicable; cc, couldn't calculate. The ACRL data is derived mainly from two summaries, one showing expenditure (red) and one showing holdings (blue). The mean cost/serial is a fudge, since it was calculated using figures from both summaries, but I doubt it's significantly different from the value I would get if I had all the data, since the number of libraries included in each set is so similar. The other values in green are also approximations derived from summary reports3. Note that the "per library" figures for the UCOSC dataset are actually just for that subset of journals (hence the "<<1" entry for "no. libraries").

I've put some sanity checks -- do these data make sense? -- in a footnote4; to me, the data appear both externally and internally consistent.  I don't, in other words, appear to have done anything egregiously stupid. Not with the numbers, anyway:

Two things jump out at me from Table 2, which together are responsible for the subtitle of this entry. First, my $13 billion guess was way off -- the actual amount spent on serials by US academic libraries is probably closer to $1-2 billion.  Large (e.g. Ivy League) libraries might spend many tens of millions of dollars, small libraries maybe only a few hundred thousand.  That's still an enormous amount of money, but it's not half the NIH budget!  So why the discrepancy?

Quite apart from "list price" and "what libraries actually pay" being two very different things, I've been making a mistake in terminology.  When I think of "serials" in a library, I think of the peer-reviewed scholarly literature; I tend to use "journals" to mean the same thing.

This is very, very wrong.

(As, no doubt, any librarian could have told me, without the need to go ferreting through all those numbers.) From the NCES survey instrument used to collect their data (emphasis mine):

[expenditure]
Current serial subscriptions (ongoing commitments) (line 13) - Report expenditures for current subscriptions to serials in all formats. These are publications issued in successive parts, usually at regular intervals, and, as a rule, intended to be continued indefinitely. Serials include periodicals, newspapers, annuals (reports, yearbooks, etc.), memoirs, proceedings, and transactions of societies.
[...]
[holdings]
Current serial subscriptions (line 26) -- Report the total number of subscriptions in all formats. If the subscription comes in both paper and electronic form, count it twice. Count each individual title if it is received as part of a publisher's package (e.g., Project MUSE, JSTOR, Academic IDEAL). Report each full-text article database such as Lexis-Nexis, ABI/INFORM as one subscription in line 27. Include paper and microfilm government documents issued serially if they are accessible through the library's catalog.

From the ARL ditto:

Questions 4-5. Serials. Report the total number of subscriptions, not titles. Include duplicate subscriptions and, to the extent possible, all government document serials even if housed in a separate documents collection. Verify the inclusion or exclusion of document serials... Exclude unnumbered monographic and publishers' series. Electronic serials acquired as part of an aggregated package (e.g., Project MUSE, BioOne, ScienceDirect) should be counted by title. A serial is
a publication in any medium issued in successive parts bearing numerical or chronological designations and intended to be continued indefinitely. This definition includes periodicals, newspapers, and annuals (reports, yearbooks, etc.); the journals, memoirs, proceedings, transactions, etc. of societies; and numbered monographic series.

Oy vey. Newspapers, yearbooks, government documents and a whole bunch of other things that aren't scholarly journals are (or can be) serials too. "Periodicals" means National Geographic qualifies -- hell, so does Playboy magazine!

As of today (March 17), Ulrich's Periodicals Directory lists 224,151 "active" periodicals; of those, 65,461 are "academic/scholarly"; and of those, 25,425 are "refereed".
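To put those three Ulrich's counts in proportion, a quick sketch in Python (the counts are the ones quoted above; the variable names are just for illustration):

```python
# Counts quoted from Ulrich's Periodicals Directory in the post
active = 224_151      # "active" periodicals
scholarly = 65_461    # of those, "academic/scholarly"
refereed = 25_425     # of those, "refereed"

scholarly_share = scholarly / active
refereed_share = refereed / active
refereed_of_scholarly = refereed / scholarly

print(f"academic/scholarly as a share of active: {scholarly_share:.1%}")
print(f"refereed as a share of active: {refereed_share:.1%}")
print(f"refereed as a share of academic/scholarly: {refereed_of_scholarly:.1%}")
```

On these counts, refereed titles are a bit over a tenth of active periodicals, so a library's "serials" figure can be dominated by things that are not peer-reviewed journals.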

What do those things cost which aren't part of the peer-reviewed literature? How does their inclusion in library data impact the means and medians I've been looking at?

Which brings me to the second item of note from Table 2: the mean cost/serial is on the order of ten times higher for the UCOSC dataset than for the other sets.  Does that mean that the scholarly literature is actually the powerhouse of the serials crisis (pdf!), and if we could zero in on the peer-reviewed fraction of the serials data we would see an even more dramatic rise in price? Or does it have more to do with the fact that the UCOSC dataset is deliberately composed of relatively high-end journals, thus artificially inflating the apparent costs? If every library in the NCES set subscribed to those journals at even one-tenth of list price, it would still account for pretty much the entire serials expenditure -- so how many libraries subscribe to which journals? What of the roughly 22,000 peer-reviewed journals that aren't included in the UCOSC dataset?  If libraries are subscribing to anywhere from a few thousand serials to well over 100,000 (e.g. ARL 2007 numbers for Columbia, Harvard and Illinois/Urbana), what proportion of those subscriptions are to peer-reviewed journals -- or, conversely, to what proportion of the peer-reviewed literature does the average library subscribe?

In other words, I've made no headway at all on the question of a "fair price"; all I've managed to do here is to find more questions.  I guess that's progress, because at least they are better-defined, more specific questions. Answering them will require much more fine-grained data, though: which libraries subscribe to which peer-reviewed journals, and at what cost?  I think the answers might be very useful to the research community, but collecting the data would be a full-time job. (I'm up for it, by the way, if anyone reading this is in a position to hire me to do it. Seriously, I'd love it. After all, look what I'm doing for fun.)

To return to where I started: there's another angle of attack on the "fair price" question, which is to look at things from the other side.  How much does it cost to publish a paper in the peer-reviewed literature, and how does that compare to actual income at publishing companies? This information is notoriously hard to come by, but I've been collecting links and notes for a while so in Part 6* I'll try to put them all together and see if I've got anything useful.

* I've just remembered something else I want to do first: Part 5 will take a look at journal price distributions with and without impact factor, using the Elsevier Life Sciences (see Part 1 Fig 3) and the UCOSC datasets.

Update: if you've read this far, go read the FriendFeed discussion, you'll like it.


-------------


1 If you want the data there's a comma-delimited text version of the table here and the spreadsheet from which the table is derived is here.

2 Comma-delimited text file here.

3 The following table shows the figures used to calculate the sum total library expenditure for the ACRL dataset.  Numbers in black are taken from the summaries provided, numbers in pink are calculated from them.

Table 3
MTsucksassforever.png

Mean total expenditure per library was calculated using an approximate average number of libraries of 1074.

4 Sanity checks:

Internal:

  • the ARL and ACRL subsets of the NCES libraries spend less in sum than the NCES set, but the mean and median expenditures/library are lower for the NCES set because it includes more, and smaller, libraries
  • the mean and median number of serials/library is similar between the ARL dataset and its UC subset, both figures being much larger than the mean serials/library for the NCES or ACRL sets (again, more and smaller libraries)
  • the mean and median cost/serial is similar throughout, except for the UCOSC dataset which is a curated subset of high-end scholarly journals (discussed above)

External:

Are those reasonable totals for the libraries to be spending?

  • The ARL 2004-5 report shows that member libraries spent $680,774,493, with a median per library of $5,904,464, on serials, and total library expenditure was $2,683,008,943 (median per library $20,210,171)
  • The NCES 2004 summary shows that 3653 libraries surveyed spent, in sum, $5,751,247,194 on total operating expenses, $1,363,671,792 on serials and $2,157,531,102 on information resources in general

Are those reasonable total numbers of journals per library?

  • OHSU (where I was until recently employed) has 20857 entries in its "journals" catalog
  • The NCES 2004 summary shows that, all together, 3653 academic libraries held 12,763,537 serials subscriptions
  • The ARL 2004-5 report shows that 113 member libraries held 4,658,493 subscriptions, with a median per library of 37,668

Are those reasonable mean and median costs per serial?

  • I could only find unit costs for serials in the ARL report, in the "analysis of selected variables", where the mean cost/serial is given as $247.55 per subscription (range $656.31 to $93.72, median $231.90, 88 libraries reporting).

So, at least in ballpark terms, the numbers in my tables appear to check out against summaries compiled by the various agencies from their own data (and the OHSU library catalog).  There are, e.g., no order-of-magnitude discrepancies -- except perhaps in cost/serial, as discussed above.
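The external checks can be reproduced, roughly, from the totals quoted in this footnote. The sketch below divides pooled spending by pooled subscriptions, which is an assumption about how to compare them: it is not the same calculation as the ARL's own $247.55 figure, which is a mean of per-library ratios from only 88 reporting libraries, and the ARL spending and subscription totals may not cover identical sets of libraries.

```python
# Totals quoted in the sanity checks above
nces = {
    "libraries": 3653,
    "serials_spend": 1_363_671_792,   # dollars, NCES 2004 summary
    "subscriptions": 12_763_537,
}
arl = {
    "serials_spend": 680_774_493,     # dollars, ARL 2004-5 report
    "subscriptions": 4_658_493,
}


def cost_per_subscription(d):
    """Pooled ratio: total serials spending over total subscriptions."""
    return d["serials_spend"] / d["subscriptions"]


nces_cost = cost_per_subscription(nces)
arl_cost = cost_per_subscription(arl)
nces_subs_per_library = nces["subscriptions"] / nces["libraries"]
nces_spend_per_library = nces["serials_spend"] / nces["libraries"]

print(f"NCES pooled cost per subscription: ${nces_cost:,.2f}")
print(f"ARL pooled cost per subscription: ${arl_cost:,.2f}")
print(f"NCES mean subscriptions per library: {nces_subs_per_library:,.0f}")
print(f"NCES mean serials spend per library: ${nces_spend_per_library:,.0f}")
```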






Monday, 16 March

Update the first: now I feel bad for not waiting (though I did put "read AFTER honeymoon!!!" in the subject line), but John Wilbanks wrote back right away to say that it will take him a while to get to it, but he will ferret out specific answers regarding the Science Commons work and interoperability.

Update the second: Peter Sefton has more here, including specific recommendations for working with Microsoft while avoiding "a new kind of format lock-in; a kind of monopolistic wolf in open-standards lambskin":

  • The product (eg a document) of the code must be interoperable with open software. In our case this means Word must produce stuff that can be used in and round tripped with OpenOffice.org and with earlier versions, and Mac versions of Microsoft's products. (This is not as simple as it could be when we have to deal with stuff like Sun refusing to implement import and preservation for data stored in Word fields as used by applications like EndNote.)

    The NLM add-in is an odd one here, as on one level it does qualify in that it spits out XML, but the intent is to create Word-only authoring so that rules it out -- not that we have been asked to work on that project other than to comment, I am merely using it as an example.

  • The code must be open source and as portable as possible. Of course if it is interface code it will only work with Microsoft's toll-access software but at least others can read the code and re-implement elsewhere. If it's not interface code then it must be written in a portable language and/or framework.





Friday, 13 March
Fooling around with numbers, part 3; or, why would anyone pay for these journals?

Following on from part 2, I thought I'd ask a couple more questions about price-per-use, based on the online usage stats in the UCOSC dataset. I started on this because I noticed that in Fig 2 of part 2, I'd missed a point: there is an even-further-out outlier above the Elsevier set I pointed out:

UCOSCpriceuse2.JPG

It's another Elsevier journal, Nuclear Physics B. In 2003, only 1001 online uses were reported to UC by the publisher, but the 2004 list price was $15,360. The companion journal Nuc Phys A is not much better, $10,121 for 1198 uses. Compare that with Nature, 286125 uses at just $1,280!
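For a sense of scale, the price per use for those three titles works out as follows. This is a small Python sketch using only the list prices and use counts quoted above; note that it pairs 2004 list prices with 2003 use counts, as the post does.

```python
# (2004 list price in dollars, 2003 online uses reported to UC), as quoted above
journals = {
    "Nuclear Physics B": (15_360, 1001),
    "Nuclear Physics A": (10_121, 1198),
    "Nature": (1_280, 286_125),
}

price_per_use = {name: price / uses for name, (price, uses) in journals.items()}

# Print from most to least expensive per use
for name, ppu in sorted(price_per_use.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: ${ppu:.4f} per use")
```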

It gets worse, too, because I'm led to believe that anything that appears in a physics journal these days is available ahead of time from the arXiv. I tried to confirm that for Nuc Phys B, but either I'm missing something or the arXiv search function is totally for shit, so I couldn't do it systematically. I did go through the latest table of contents (Vol 813 issue 3) on the Science Direct page, and was easily able to find every paper in the arXiv -- mostly just by searching on author names, though in a couple of cases I had to put titles into Google Scholar. Still, they were all there, which leads me to wonder why any library would buy Nuc Phys B (or Nuc Phys A, assuming it's also covered by the arXiv). Prices haven't improved in the intervening 5 years, either:

[I had a table here but Movable Type keeps munging it. Piece of shit. Here's a jpg until I sort it.]

MTsucksass.jpg


That got me wondering how the rest of the journals are distributed by price/use and publisher:


UCOSCpriceusepublisher.JPG


The inset shows a zoomed view but even that wasn't particularly informative, so I zoomed in a bit further:


UCOSCpriceuseregression.JPG

The curve fits are for the whole of each dataset, even though it's a zoomed view; the Nature set excludes British Journal of Pharmacology, the only NPG title that recorded 0 uses, and Nature itself. Colour coding by publisher is the same for each figure in this post. As in part 2, the correlation between price and use is weak at best and doesn't change much from publisher to publisher. Also, each publisher subset shows a stronger correlation than the entire pooled set -- score another one for Bob O'Hara's suggestion that finer-grained analyses of this kind of data are likely to produce more robust results. Since cutoffs improved the apparent correlation for the pooled set, I tried that with the publisher subsets:


UCOSCpriceuseregression1.JPG


As in part 2, with uses restricted to 5000 or fewer there was improvement in price/use correlation in most cases, but nothing dramatic; I'm not sure why the Blackwell fit got worse. The Nature subset is close to being able to claim at least a modest fit to a straight line there, so not only does NPG boast some of the lowest prices and highest use rates, they are the closest of all the publishers to pricing their wares according to (at least one measure of) likely utility. Special note to Maxine Clarke, remember this post next time I tee off on Nature! :-)

Next, I broke the data out into intervals (for clarity the labels say 0-1, 1-2 etc, but the actual intervals used were 0-0.99, 1-1.99 etc):


UCOSCpriceuseintervals.JPG


Now it seems that we're looking at some kind of long-tailed distribution, which is hardly surprising. The majority of the titles fall into the first few price/use intervals, say less than about $6/use. Since most pay-per-view article charges are between $25 and $40, I more-or-less arbitrarily picked $30/use as a cutoff and asked how many titles from each publisher fall above that cutoff, and what proportion of the total expenditure (viz, list price sum) does that represent? The inset shows that 161 titles, most of them from Kluwer and Springer (whose figures I combined because Springer bought most of Kluwer's titles sometime after 2003), account for about 5% of the total in list price terms. That was a bit more useful, so I expanded it to ask the same question for each interval:


UCOSCpriceuselistpricesum.JPG


What becomes apparent now, I think, is that the UC librarians are doing a good job! Only 6% of the total number of journals (5% of the total list price cost) fall into the "more than $30/use" category, of which it could reasonably be said that the library might as well drop the subscription and just cover the pay-per-view costs of their patrons. Only a further 15% or so work out to more than $6/use, and around 80% of the collection (figured as titles or cost) comes in under $6/use, with around 30% less than $1/use.
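The cutoff question (what share of titles, and of list-price cost, sits above a given price per use?) can be sketched like this in Python. The `sample` data are invented for illustration and are not drawn from the UCOSC dataset; `share_above` is a hypothetical helper name, and skipping zero-use titles in the "above" count is an assumption of this sketch rather than anything described in the post.

```python
def share_above(titles, cutoff):
    """Return (fraction of titles, fraction of list-price sum) above cutoff.

    titles: list of (list_price, uses) pairs; cutoff: dollars per use.
    Titles with zero uses are not counted as above the cutoff (an assumption).
    """
    total_price = sum(price for price, _ in titles)
    above = [(p, u) for p, u in titles if u > 0 and p / u > cutoff]
    title_share = len(above) / len(titles)
    price_share = sum(p for p, _ in above) / total_price
    return title_share, price_share


# Hypothetical (list price in dollars, uses) pairs, illustrative only
sample = [(1200, 4000), (800, 100), (3000, 50), (500, 1000), (2500, 5000)]

for cutoff in (30, 6, 1):
    titles_frac, price_frac = share_above(sample, cutoff)
    print(f"> ${cutoff}/use: {titles_frac:.0%} of titles, {price_frac:.0%} of list price")
```

Reporting both fractions matters because a handful of expensive, little-used titles can be a small share of the title count but a larger share of the bill, or the reverse.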

So, are these reasonable prices -- $1 per use, $6 per use? I'm not sure I can, but I'll try to say something about that question, using the UCOSC dataset, in Part 4.



Thursday, 12 March
Peters Murray-Rust and Sefton on "science and selfishness"

Peter Murray-Rust (welcome back to blogging!) has replied to Glyn Moody's post about semantic plugins being developed by Science Commons in collaboration with the Evil Empire, which I discussed in my last post. Peter MR takes the view, with which I concur, that it's more important to get scientists using semantic markup than to take an ideological stand against Microsoft:

Microsoft is "evil". I can understand this view - especially during the Hallowee'n document era. There are many "evil" companies - they can be found in publishing (?PRISM), pharmaceuticals (where I used to work) Constant Gardener) , petrotechnical, scientific software, etc. Large companies often/always? adopt questionable practices. [I differentiate complete commercial sectors - such as tobacco, defence and betting where I would have moral issues] . The difficulty here is that there is no clear line between an evil company and an acceptable one .

The monopoly exists and nowhere more than in in/organic chemistry where nearly all chemists use Word. We have taken the view that we will work with what scientists actually use, not what we would like them to use. The only current alternative is to avoid working in this field - chemists will not use Open Office.

Another, to my mind even more important, point was raised by Peter Sefton in a comment on Peter MR's entry:

I will have to talk about this at greater length but I think the issue is not working with Microsoft it's working in an interoperable way. The plugins coming out of MS Research now might be made by well meaning people but unless they encode their results in something that can interop with other word processors (the main one is OOo Writer) then the effect is to prolong the monopoly. There is a not so subtle trick going on here - MS are opening up the word processing format with one hand while building addons like the Ontology stuff and the NLM work which depend on Word 2007 to work with the other hand. I have raised this with Jim Downing and I hope you can get a real interop on Chem4Word.

(Peter S, btw, blogs here and works on a little thing called The Integrated Content Environment (ICE), which looks to me like a good candidate for an ideal Electronic Lab Notebook...)

There's a difference between the plugins being Open Source and the plugins being useful to the F/OSS community. If collaborators hold Microsoft to real interoperability, the "Evil Empire" concerns largely go away, because the project can simply fork to support any applications other than Word.

(I've emailed John Wilbanks to get his reaction to all this, but be patient because he's insanely busy in general, and right now he's on honeymoon!)




Wednesday, 11 March
On science and selfishness.

Glyn Moody has a nice post up about fraternizing with the enemy in Open Science; you should read the whole thing, but here's the gist:

One of the things that disappoints me is the lack of understanding of what's at stake with open source among some of the other open communities. For example, some in the world of open science seem to think it's OK to work with Microsoft, provided it furthers their own specific agenda. Here's a case in point:
John Wilbanks, VP of Science for Creative Commons, gave O'Reilly Media an exclusive sneak preview of a joint announcement that they will be making with Microsoft later today at the O'Reilly Emerging Technology Conference. [...] Microsoft will be releasing, under an open source license, Word plugins that will allow scientists to mark up their papers with scientific entities directly.

That might sound fine - after all, the plugins are open source, right? But no. Here's the problem:

Wilbanks said that Word is, in his experience, the dominant publishing system used in the life sciences [and] probably the place that most people prepare drafts. "almost everything I see when I have to peer review is in a .doc format."

In other words, he doesn't see any problem with perpetuating Microsoft's stranglehold on word processing. But it has consistently abused that monopoly [...]

Working with Microsoft on open source plugins might seem innocent enough, but it's really just entrenching Microsoft's power yet further in the scientific community [...]

It would have been far better to work with OpenOffice.org to produce similar plugins, making the free office suite even more attractive, and thus giving scientists yet another reason to go truly open, with all the attendant benefits, rather than making do with a hobbled, faux-openness, as here.

Let me say upfront that I mostly agree with Glyn here. Scientists should be at the forefront of abandoning closed for Open wherever possible, because in the long term Open strategies offer efficiencies of operation and scale that closed, proprietary solutions simply cannot match.

Having said that -- and most expressly without wishing to put words into John Wilbanks' mouth -- my response to Glyn's criticism is that I think he (Glyn) is seriously underestimating the selfish nature of most scientists. Or if you want to be charitable, the intense pressure under which they have to function. Let me unpack that:

Glyn talks about making Open Office more attractive and providing incentives for scientists to use Open solutions, but what he may not realize is that incentives mostly don't work in that tribe. Scientists will do nothing that doesn't immediately and obviously contribute to publications, unless forced to do so. Witness the utter failure of Open Access recommendations, suggestions and pleas vs the success of OA mandates. These are people who ignore carrots; you need a stick, and a big one.

For instance: I use Open Office in preference to Word because I'm willing to put up with a short learning curve and a few inconveniences, having (as they say here in the US) drunk the Open Kool-Aid. But I'm something of an exception. Faced with a single difficulty, one single function that doesn't work exactly like it did in Word, the vast majority of researchers will throw a tantrum and give up on the new application. After all, the Department pays the Word license, so it's there to be used, so who cares about monopolies and stifling free culture and all that hippy kum-ba-yah crap when I've got a paper to write that will make me the most famous and important scientist in all the world?

The last part is a (slight) exaggeration, but the tantrum/quit part is not. Researchers have their set ways of doing things, and they are very, very resistant to change -- I think this might be partly due to the kind of personality that ends up in research, but it's also a response to the pressure to produce. In science, only one kind of productivity counts -- that is, keeps you in a job, brings in funding, wins your peers' respect -- and that's published papers. The resulting pressure makes whatever leads to published papers urgent and limits everything else to -- at best -- important; and urgent trumps important every time. Remember the old story about the guy struggling to cut down a tree with a blunt saw? To suggestions that his work would go faster if he sharpened the saw, he replies that he doesn't have time to sit around sharpening tools, he's got a tree to cut down!

I said above that scientists should move from closed to Open wherever possible because of long term advantages. I think that's true, but like the guy with the saw, scientists are caught up in short-term thinking. Put the case to most of them, and they'll agree about the advantages of Open over closed -- for instance, I've yet to meet anyone who disagreed on principle that Open Access could dramatically improve the efficiency of knowledge dissemination, that is, the efficiency of the entire scientific endeavour. I've also yet to meet more than a handful of people willing to commit to sending their own papers only to OA journals, or even to avoiding journals that won't let them self-archive! "I have a job to keep", they say, "I'm not going to sacrifice my livelihood to the greater good"; or "that's great, but first I need to get this grant funded"; or my personal favourite, "once I have tenure I'll start doing all that good stuff". (Sure you will. But I digress.)

So to return to the question at hand: it's a fine thing to suggest that scientists should use Open Office, but I flat-out guarantee you that they never will unless somehow their funding comes to depend on it. Word is familiar and convenient; none of the advantages of Free/Open Source software are sufficiently important to overcome the urgency with which this paper or that grant has to be written up and sent.

It's also a great idea to get researchers to start thinking about, and using, markup and metadata and all that chewy Semantic Web goodness, but again I guarantee 100% failure unless you fit it into their existing workflow and habits. If you build your plugins for Open Office, that won't be another reason to use the Free application, it will be another reason to reject semantic markup: "oh yeah, the semantic web is a great idea, yeah I'd support it but there's no Word plugin so I'd have to install Open Office and I just don't have time to deal with that...".

When it comes to scientists, you don't just have to hand them a sharper saw, you have to force them to stop sawing long enough to change to the new tool. All they know is that the damn tree has to come down on time and they will be in terrible trouble (/fail to be recognized for their genius) if it doesn't.



Tuesday, 10 March
Fooling around with numbers, part 2

Following on from this post, and in the spirit of eating my own dogfood1, herewith the first part of my analysis of the U Cali OSC dataset.

The dataset includes some 3137 titles with accompanying information about publisher, list price, ISI impact factor, UC online uses and average annual price increase; these measures are defined here. The spreadsheet and powerpoint files I used to make the figures below are available here: spreadsheet, ppt.

As a first pass, I've simply made pairwise comparisons between impact factor, price and online use. There's no apparent correlation between impact factor and price, for either the full set or a subset defined by IF and price cutoffs designed to remove "extremes", as shown in the inset figure:


UCOSCpriceIF.JPG


One other thing that stands out is the cluster of Elsevier journals in the high-price, low-impact quadrant, and the smaller cluster of NPG's highest IF titles at the opposite extreme. Note that n < 3137 because not all titles have impact factors, usage stats, etc. I've included the correlation coefficients mainly because their absence would probably be more distracting than having the (admittedly fairly meaningless) numbers available, at least for readers whose minds work like mine.
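
For readers who want to reproduce a pairwise comparison like this one, here is a minimal sketch of a Pearson correlation that drops incomplete pairs, which is why n can fall below the full title count. The `pearson_r` helper and the sample pairs are hypothetical; the figures themselves were made in a spreadsheet, and the exact coefficient used there is not stated.

```python
import math

def pearson_r(pairs):
    """Pearson correlation over (x, y) pairs, skipping any pair with a missing value.

    Returns (r, n), where n is the number of complete pairs actually used.
    """
    clean = [(x, y) for x, y in pairs if x is not None and y is not None]
    n = len(clean)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(x for x, _ in clean) / n
    mean_y = sum(y for _, y in clean) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in clean)
    sxx = sum((x - mean_x) ** 2 for x, _ in clean)
    syy = sum((y - mean_y) ** 2 for _, y in clean)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy), n

# Made-up (impact factor, price) pairs; None marks a title with no value recorded.
pairs = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (None, 5.0), (4.0, None)]
r, n = pearson_r(pairs)
print(r, n)
```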

Next I asked whether there was any clearer connection between price and online uses aggregated over all UC campuses:


UCOSCpriceuse.JPG


Again, not so much. I played about with various cutoffs, and the best I could get was a weak correlation at the low end of both scales (see inset). And again, note Elsevier in the "low value" quadrant, and Nature in a class of its own. Being probably the one scientific journal every lay person can name, in terms of brand recognition it's the Albert Einstein of journals. Interestingly, not even the other NPG titles come close to Nature itself on this measure, though they do when plotted against IF. I wonder whether that actually reflects a lay readership?

Finally (for the moment) I played the Everest ("because it's there") card and plotted use against impact factor:


UCOSCuseIF.JPG


The relationship here is still weak, but noticeably stronger than for the other two comparisons -- particularly once we eliminate the Nature outlier (see inset). I've seen papers describing 0.4 as "strong correlation", but I think for most purposes that's wishful thinking on the part of the authors. I do wish I knew enough about statistics to be able to say definitively whether this correlation is significantly greater than those in the first two figures. (Yes yes, I could look it up. The word you want is "lazy", OK?) Even if the difference is significant, and even if we are lenient and describe the correlation between IF and online use as "moderate", I would argue that it's a rich-get-richer effect in action rather than any evidence of quality or value. Higher-IF journals have better name recognition, and researchers tend to pull papers out of their "to-read" pile more often if they know the journal, so when it comes time to write up results those are the papers that get cited. Just for fun, here's the same graph with some of the most-used journals identified by name:


UCOSCtitles.JPG
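
On the question raised above of whether one correlation is significantly greater than another: one common approach is Fisher's z-transformation. The sketch below is only a rough guide, because it assumes the two correlations come from independent samples, whereas here they are computed on overlapping sets of journals (a test for dependent correlations, such as Steiger's, would be more appropriate). The function name and the input values are placeholders, not figures from the plots.

```python
import math

def compare_correlations(r1, n1, r2, n2):
    """Fisher z-test for the difference between two correlation coefficients.

    Assumes INDEPENDENT samples, which is only an approximation when both
    correlations are computed on the same titles.
    Returns (z, two-sided p-value).
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)          # Fisher transform
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))  # standard error of z1 - z2
    z = (z1 - z2) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))           # two-sided normal tail probability
    return z, p

# Placeholder inputs for illustration only.
z, p = compare_correlations(0.40, 1000, 0.10, 1000)
print(z, p)
```

With samples in the thousands, even modest differences between coefficients come out as statistically significant, which is a separate matter from whether either correlation is strong enough to mean much.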


Peter Suber has pointed out a couple of other (formal!) studies that have come to similar conclusions to those presented here. There are probably many such, because the relevant literature is dauntingly large. There's even a journal of scientometrics! The FriendFeed discussion of my earlier post has generated some interesting further questions, for instance Bob O'Hara's observation that a finer-grained analysis would be more useful. I'm not sure I'm up for manually curating the data, though, and I can't see any other way to achieve what Bob suggests... I might do it for the smaller Elsevier Life Sciences set. For the moment I think I'll concentrate more on slightly different questions regarding IF and price distributions, as in Fig 3 in my last post -- tune in next time for more adventures in inept statistical analysis!


-------------
1 I'm always on about Open Data and "publish early, publish often" collaborative models like Open Notebook Science, and it occurs to me that the ethos applies to blogging as much as to formal publications. So I'm going to try to post analyses like this in parts, so as to get earlier feedback, and of course I try to make all my data and methods available. Let me know if you think I'm missing any opportunities to practice what I preach.



Tuesday, 10 March
Fooling around with numbers

A while back, there was some buzz about a paper showing that, for a particular subset of journals, there was essentially no correlation between Impact Factor and journal subscription price. I think, though my google-fu has failed me, that the paper was Is this journal worth $US 1118? (pdf!) by Nick Blomley, and the journals in question were geography titles. Blomley found "no direct or straightforward relationship" between price and either Impact Factor or citation counts. He also looked at Relative Price Index, a finer-grained measure of journal value developed by McAfee and Bergstrom. He didn't plot that one out, so I will:

blomley.jpg

There is some circularity here, since RPI is calculated using price, but once again I'd call that no direct or straightforward relationship.

All this got me wondering about the same analyses applied to other fields and larger sets of journals. My first stop was Elsevier's 2009 price list, handily downloadable as an Excel spreadsheet. It doesn't include Impact Factors, but the linked "about" page for each journal displays the IF, if it has one, quite prominently. So I went through the Life Sciences journals by hand, copying in the IFs. I ended up with 141 titles with, and 90 titles without, Impact Factors. As with Blomley's set, there was no apparent correlation between IF and price:

Elsevier1.jpg

Interesting, no? If the primary measure of a journal's value is its impact -- pretty layouts and a good Employment section and so on being presumably secondary -- and if the Impact Factor is a measure of impact, and if publishers are making a good faith effort to offer value for money -- then why is there no apparent relationship between IF and journal prices? After all, publishers tout the Impact Factors of their offerings whenever they're asked to justify their prices or the latest round of increases in same.

There's even some evidence from the same dataset that Impact Factors do influence journal pricing, at least in a "we can charge more if we have one" kinda way. Comparing the prices of journals with or without IFs indicates that, within this Elsevier/Life Sciences set, journals with IFs are higher priced and less variable in price:

Elsevier2.jpg
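
One way to put numbers on "higher priced and less variable" is to compare the mean and the coefficient of variation (standard deviation divided by mean) of the two groups. This is a sketch under assumptions: the `summarize` helper and the prices are invented for illustration, and relative spread is only one of several reasonable readings of "less variable".

```python
import statistics

def summarize(prices):
    """Return (mean, coefficient of variation) for a list of prices."""
    mean = statistics.mean(prices)
    sd = statistics.stdev(prices)  # sample standard deviation
    return mean, sd / mean

# Made-up prices, NOT taken from the Elsevier price list.
with_if = [1800.0, 2000.0, 2200.0]
without_if = [300.0, 1000.0, 2300.0]

mean_with, cv_with = summarize(with_if)
mean_without, cv_without = summarize(without_if)
print(mean_with, cv_with)
print(mean_without, cv_without)
```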

About the time I was finishing this up, I came across a much larger dataset from U California's Office of Scholarly Communication. I've converted their html tables into a delimited text file, available here: UCOSC.txt. For my next trick I'll see what information I can squeeze out of a real dataset (there are about 3,000 titles in there).

Oh, and if anyone wants it, the Elsevier Life Sciences data are in this Excel file: ElsevierLifeSciPriceList.xls.



Sunday, 18 January
Another wonderful conference.

I'm sitting in the computer room at the Radisson RTP after Science Online '09 has wound down, and most of the attendees have left -- though I'm looking forward to dinner with a few fellow stragglers this evening.

Many thanks are due Anton, Bora, David and their various helpers, sponsors and assorted minions for running another wonderful conference. I was happier'n a pig in a puddle with this year's program, as I was able to attend an Open Something (or related) session in almost every slot. There's nothing quite like indulging an obsession with a crowd of like minds, especially when there remains enough diversity of opinion to (mostly) avoid the echo chamber effect. There was only one thing I can point to that wasn't essentially perfect, which is that the web connection, wifi or wire, was flaky and slow quite a lot of the time. That observation must be taken in context, though: although everyone commented, no one complained. It just isn't that sort of gathering.

My session with Björn went well (OK, I can't really judge that -- but I had fun!) -- although it would have gone better if I'd shut up sooner. Having not been to an unconference before, I wasn't strict enough with my introductory blurb and took up time that would have been better spent on the ensuing discussion, which was just terrific. I'll know next time -- and Björn was careful to learn from my mistake, limiting himself to a quick intro for his session with Peter Binfield and obstinately driving the discussion away from echo chamber territory, challenging the participants to come up with new ideas and ways forward. (If you're interested in the Impact Factor question -- that is, metrics and measurement in science -- there's a collaborative bibliography underway in a Google Doc here. I'll make it publicly editable as soon as I figure out how; in the meantime email me if you want an invite to collaborate.)

I definitely prefer the unconference format to a traditional lecture-style conference. When there is a subject that needs more intensive coverage by the speaker(s), the flexible format easily accommodates that -- for instance, John Wilbanks' talk on the semantic web was of necessity about half informal lecture and half rowdy discussion, simply because it's a complex topic about which few of us knew very much. (Before John got through, I mean, since it was an informative and inspiring look at the technology which will probably underpin the next truly radical leap forward in scientific capability.)

As Eva Amsen and Henry Gee both observed, the line between people I know online and people I've met in meatspace is getting very blurry these days. I was nonetheless pleased to meet Eva and Henry f2f for the first time, and also Björn, Peter and John, Cameron Neylon (more like "nylon" than "nay-lon"!), Victor Henning, Martin Fenner and a dozen others to whom I apologize for being too tired to remember you right now! I was of course no less happy to catch up with old friends, repeat offenders like me who were also at the 2007 and 2008 events.

And now it's too late for me to get a nap before dinner, so I think I'll go see if a shower will wake me up instead. More later as I process the many new ideas and insights I collected in the course of two very enjoyable days.




Monday, 12 January
What do you want to know about Open Access?

Science Online '09 is less than a week away, and I'm going to be co-moderating an unconference session with Björn Brembs, the theme of which is "Open Access publishing: present and future".

Björn has already put some notes up on the wiki, and there's an interesting contribution from Antony Williams of Chemspider. As both Björn's and Antony's notes make clear, we think the future of Open Access (indeed, all scholarly) publishing will feature prominently the long-overdue death of the Impact Factor. In fact, audience willing, we plan to use some of this session as a sort of preface for Björn's Sunday session with Peter Binfield, which is titled "Reputation, authority and incentives. Or: How to get rid of the Impact Factor".

It's difficult to overstate the extent to which that single figure has come to dominate scholarly and administrative decision making: where to publish, who to fund or promote, which candidate to hire, and so on. It's also difficult to overstate how bad an idea it is to put so much weight on a single journal-level metric derived by undisclosed calculations and decisions from a proprietary database.

But that's the future of publishing, about which much more from Björn and Peter. Regarding the past, I thought I would do a five-minute definition-plus-potted-history, cribbed almost entirely from my earlier talk and Peter Suber's timeline.

That leaves us with the present, and in the spirit of an unconference about science online, I thought I'd simply ask the audience: what do you want to know about Open Access?

There are two things I must clarify. Firstly, by audience I mean both online and on the day: if you're there, you can ask in person, but if you're not going to the meatspace conference you are welcome to ask your question here, on the conference wiki, or by email to me, at any time. Secondly, I'm not claiming I'll have the answer ready to hand -- but OA and related Open ideas are pretty much my hobby these days, and if you have a question I can't answer I'll be sure to find out and get back to you. (In addition, the conference will be packed with OA experts and I have no hesitation in bothering them for answers!)

So: what do you want to know about Open Access?



Saturday, 20 December
The serials crisis has a name, and it's Reed Elsevier.

It's notoriously difficult to get good numbers on publisher income, expense and profit -- even nonprofits like PLoS only publish what they have to1 -- and so I'm always on the lookout for more data. If I had more spare time, I could dig out more information, but for now I rely on articles like this one (via OAN) from McGuigan and Russell at Penn State:

The Business of Academic Publishing: A Strategic Analysis of the Academic Journal Publishing Industry and its Impact on the Future of Scholarly Publishing

(Incidentally, in the unlikely circumstance that you've read this far and your eyes haven't glazed over, you will probably like my oa.numbers and serialscrisis tags on Simpy, which is where I keep my collection of such references.)

Interested persons should, as the kids say, RTWT, especially the nice readable introduction to scholarly publishing and the serials crisis; I just want to publicize this table of profit margins, comparing Elsevier S&M with the broader STM industry:

year    Elsevier Science and Medical    all Elsevier journals    all periodical publishers
1998    35.9                            25.7                     4.9
1999    35.4                            23.4                     4.7
2000    36.4                            21.0                     4.3


I am not going to pay over $100 for the Risk Management Assoc. data that McGuigan and Russell used, but I did download the UK Competition Commission report, wherein I found numbers supportive of the Elsevier figures in the table above.  The 2007 LJ Periodicals Price Survey says that commercial STM publishers' profit margins were "around 25 percent on average" for that year, so the figures for "all periodical publishers" would seem to include a variety of non-STM publishers.  Even so, Elsevier's science and medical division has a clear and commanding lead in the price-gouging stakes.

They also have a clear lead in market share. In one of McGuigan and Russell's references (a 2002 Morgan Stanley report that you can get in pdf format if you have half a clue about search), I found a table showing the proportion of the STM market (measured in number of journals and number of articles) enjoyed by a range of publishers. With a little digging (in the filthy muck of commerce, at that; you owe me, loyal readers!) I discovered that Bertelsmann is part of Springer's original name and they now own Kluwer Academic Publishing (as far as I can tell, most of Wolters Kluwer's journals except for Lippincott Williams & Wilkins) under the rubric of Springer Science+Business Media, and that Wiley bought Blackwell a few years ago.

With that in mind, here's an abbreviated version of the Morgan Stanley table of data:

publisher                     no. journals    % ISI journals    % articles
Elsevier Science              1347            18                25
Springer + Kluwer             878             11                11
Wiley + Blackwell             620             8                 8
[15 other named companies]    874             11                14
Others (2,028 publishers)     3716            48                40


Although those figures are from 2002, the 2008 Library Journal Periodicals Price Survey estimated that

the top ten STM publishers pulled in 53 percent of the revenue in the $16.1 billion periodicals market in 2006
so the bottom line doesn't seem to have changed much.
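
For a rough sense of concentration in the Morgan Stanley figures above, the listed percentages can simply be added up. This is a minimal sketch: the rows are copied from the abbreviated table, `combined_share` is a hypothetical helper, and the rows are assumed to be ordered largest first, as they are in the table. Note that the listed percentages sum to 96 and 98 rather than 100.

```python
# (publisher group, % of ISI journals, % of articles), copied from the abbreviated table.
rows = [
    ("Elsevier Science", 18, 25),
    ("Springer + Kluwer", 11, 11),
    ("Wiley + Blackwell", 8, 8),
    ("15 other named companies", 11, 14),
    ("Others (2,028 publishers)", 48, 40),
]

def combined_share(rows, top_n):
    """Sum the journal and article percentages of the first `top_n` rows."""
    top = rows[:top_n]
    return sum(r[1] for r in top), sum(r[2] for r in top)

top3_journals, top3_articles = combined_share(rows, 3)    # three largest groups
named_journals, named_articles = combined_share(rows, 4)  # all named publishers
print(top3_journals, top3_articles)    # 37 44
print(named_journals, named_articles)  # 48 58
```

These are shares of journals and articles, not revenue, so they are not directly comparable with the Library Journal revenue figure; they only show that the concentration is of the same order.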

Mind you, I don't mean to imply that we should launch another boycott; reining in Elsevier's profit margins and/or market share would do little to offset the serials crisis. The only answer to that, in the long term, is Open Access, because it scales where Toll access doesn't. No, this entry is not really about OA at all, it's just a little kick in the shins for my favorite Greedy Bastard Publishers.




------------------
1 I'd link to the GuideStar reports but I can't get them: I registered, but they haven't bothered sending me the verification email, and until they do I can't use their search.  What is this, amateur hour?




Wednesday, 26 November
Huh. I didn't suck.

A while ago, I mentioned that I was giving a talk at the Berglund Center. Well, now you can watch the whole thing on video, here (scroll down to Sept 9th).

I watched it myself, and despite seeing mostly room for improvement, was pleasantly surprised at just how much I didn't suck.

Many thanks to all those who offered suggestions on FriendFeed and on this blog. My slides are available here, and like everything I make they are intended for the public domain.



Wednesday, 26 November
Pop quiz!

Two unrelated quizzes that I recently took, and that might amuse some readers:

Via Peter Suber, Lund University's ten-question quickie on Open Access. And yes, I got 10/10.

Via 3 Quarks Daily: from the Intercollegiate Studies Institute, something that purports to be a Civics Quiz but which looks to me rather more like libertarian/capitalist propaganda. Of the roughly 2500 citizens who took the test as part of a survey, nearly three-quarters failed, and the average score was 49%. (I got 27/33, for those keeping score.)



Saturday, 22 November
Bizarre omission from my blogroll

I just noticed that Richard Poynder's blog Open and Shut? was missing from my blogroll -- which is weird, because I know it was on there at one time. I think that I didn't notice earlier because everything Richard writes gets covered multiple times across my "news network", simply because it's so damn good.

Anyway, the blog is back -- and if you read me because you are interested in Open Access and Open Science, and you're not already reading Richard, then do yourself a favour and start.



Monday, 17 November
Recommend OA to President Obama

Via Peter Suber and Bora: Obamacto is a new site where you can make recommendations to Obama's Chief Technology Officer and vote on recommendations made by others. Peter's suggestion was this:

Require open access for publicly-funded research

Require open access to the results of non-classified research funded by taxpayers. Extend the exemplary policy now in place at the NIH to all federal agencies.

You can vote anonymously, but registration is a snap -- seriously, the fastest and easiest online signup I've ever seen. Go vote!




Tuesday, 14 October
Open Access Day 2008

It's OA Day, and all the usual suspects are posting entries in the synchroblogging contest. I'm staying off the web except for 30 minutes or so mornings and evenings (because I desire and intend to finish the Project That Would Not Die by the end of the year), and that really only leaves me time to keep up with my feeds and friends.

So, that's my excuse for not having a contest entry (well, that and I dislike contests and prizes... a rant for another time). But I can't let OA Day go unremarked, so check out the official blog and the FriendFeed room. Here is the blog feed (sorry it's Flash, but I don't have time to test other widgets -- and it is pretty):

(Next year, I'm going to treat OA Day as a national holiday and take the day off work in celebration. Maybe one day everyone will do the same...)



Monday, 06 October
What she said.

With one alteration (viz I have had no differences with Richard Poynder), what Dorothea said goes for me as well. (For more background see Matt at Journalology: 1, 2.)

This is just a for-the-record, public statement that I fully support Richard Poynder's laudable and transparently conducted investigation of SJI and other publishers whose conduct threatens to bring Open Access into disrepute, and that if any such publishers take their legal bullying further than the bluff and bluster we are currently seeing from SJI, I will do what I can to help Richard fight back.

Update 081006: Peter Suber and Stevan Harnad have issued a joint statement in support of the investigative work of Richard Poynder. I was hesitant to do so when it was just me following Dorothea's lead, but now I would like to encourage everyone who is familiar with Richard's work and the SJI story to pick sides and do so publicly. (I have no doubt that every reasonable person will pick Richard's side!)



Tuesday, 26 August
Help me make the most of an opportunity.

Check me out:

ad.jpg

That means I've got about a week to put together a 30-40 minute talk. I won't have any trouble filling up the time, of course -- the real problem is what NOT to present. I aim to use the web instead of powerpoint, by creating a series of bookmarks that I can open in browser tabs (or from a History sidebar; haven't decided) and move through those like slides. I plan to follow the basic format of my old essays: we're all familiar with Free/Open Source software, the NIH just mandated a kind of Open Access so here's what that means and what that can do, and what else can be Open? leading into Open Data, Open Standards/semantic web, Open Licensing -- in short, Open Science.

The Berglund Center is affiliated with Pacific University, "a small, private university with a blend of liberal arts, education and health care". I attended the Center's Summer Institute this year at the kind invitation of the director, Jeffrey Barlow, after he read Mitch Waldrop's "Science 2.0" article and noticed that I was local. (Sadly, I could only attend one day, but it was both fun and productive. The whole thing was also filmed, so I'll make a note when the footage and transcripts are available.)

Pacific U's College of Arts and Sciences includes schools of biology, bioinformatics and chemistry, and all three strongly encourage undergraduate research. I hope to tailor the presentation somewhat in the hope of getting faculty in these schools enthused about Open Access and Open Science.

So, my question to you, dear LazyWeb, is essentially: what should I present? What are the basic, must-know tools and ideas of Open Science? How can I best introduce the possibilities of Open-ness to faculty and students at a small liberal arts college? Who has given really good presentations from which I can swipe ideas? I have an opportunity here to expand the Open Science community; help me make the most of it.


Update 080909: the slides -- after a suggestion from John Dupuis, I ended up using Google Presentations -- are here, and I'll post when the video becomes available.



Saturday, 19 July
An Open Access partisan's view of "Electronic Publication and the Narrowing of Science and Scholarship"

There's been a good deal of online chatter about this recent Science article that discusses the effects of online access on scholarship -- see, e.g., discussions here and here and blog entries noted therein.  The report is not available without paying a toll or subscription, but the abstract is freely visible:

Online journals promise to serve more information to more dispersed audiences and are more efficiently searched and recalled. But because they are used differently than print -- scientists and scholars tend to search electronically and follow hyperlinks rather than browse or peruse -- electronically available journals may portend an ironic change for science. Using a database of 34 million articles, their citations (1945 to 2005), and online availability (1998 to 2005), I show that as more journal issues came online, the articles referenced tended to be more recent, fewer journals and articles were cited, and more of those citations were to fewer journals and articles. The forced browsing of print archives may have stretched scientists and scholars to anchor findings deeply into past and present scholarship. Searching online is more efficient and following hyperlinks quickly puts researchers in touch with prevailing opinion, but this may accelerate consensus and narrow the range of findings and ideas built upon.
This seems thoroughly counter-intuitive to me, since I find a good deal more information by direct search now that I can do it online, and browsing has never played a significant role in my literature searching.  (And remember, I'm old -- I started out using Index Medicus!)  Who has time to browse probably-irrelevant journals and tables of contents on the off chance that something might be useful?  I'm far more likely to stumble across things I'd never have otherwise found when I'm relying on a variety of relevance-based search algorithms (PubMed's Related Articles, Google Scholar, NextBio, etc.).

For anyone who thinks that "forced browsing of print archives" makes a lick of sense: we'll pick a topic, then you spend a day or two browsing in meatspace, and I'll spend an hour searching online.  Who do you think is likely to come up with the best (most useful, most comprehensive) set of references?

Moreover, the article's conclusions seem to be based on a couple of unspoken assumptions with which I don't agree.

The first is that citing more and older references is somehow better -- that bit about "anchor[ing] findings deeply into past and present scholarship".  I don't buy it.  Anyone who wants to read deeply into the past of a field can follow the citation trail back from more recent references, and there's no point cluttering up every paper with every single reference back to Aristotle.  As you go further back there are more errors, mistaken models, lack of information, technical difficulties overcome in later work, and so on -- and that's how it's supposed to work.  I'm not saying that it's not worth reading way back in the archives, or that you don't sometimes find overlooked ideas or observations there, but I am saying that it's not something you want to spend most of your time doing.

Secondly, let's take the author at his word:

I show that as more journal issues came online, the articles referenced tended to be more recent, fewer journals and articles were cited, and more of those citations were to fewer journals and articles.
OK, suppose you do show that -- it's only a bad thing if you assume that the authors who are citing fewer and more recent articles are somehow ignorant of the earlier work.  They're not: as I said, later work builds on earlier.  Evans makes no attempt to demonstrate that there is a break in the citation trail -- that these authors who are citing fewer and more recent articles are in any way missing something relevant.  Rather, I'd say they're simply citing what they need to get their point across, and leaving readers who want to cast a wider net to do that for themselves (which, of course, they can do much more rapidly and thoroughly now that they can do it online).

If that means citing fewer articles now than researchers tended to cite 20 years ago, it probably has more to do with changes in the culture of science than with the electronic availability of research papers.  For instance, I think it far more likely -- to exaggerate, for the purposes of illustration, in the opposite direction to Evans -- that earlier authors, unable to rapidly and comprehensively scan the literature, cited everything they could get their hands on, padding their bibliographies well beyond anything useful in an attempt to lend weight to their arguments.

It's potentially worrisome if more citations are going to fewer journals, but once again I see no more reason to attribute that to increasing online availability than to attribute it to the sharply rising cost of scientific journals in any form.  It's well documented that as journal prices have continued to rise, researchers and institutions have had to cut back on the number of subscriptions they take.  It is not difficult to imagine that "long tail" and "preferential attachment" phenomena (see, for instance, Evans' own references 14 - 18, reproduced below) would drive the concentration of likely subscriptions towards a pool of "must have" journals.  Indeed, publishers actively promote the concept of such a pool and compete strongly to be seen to be part of it.

Finally, and to me most importantly, Evans seems to me to gloss over the question of what proportion of the online archives are freely available, and what effect that has on the phenomenon he is attempting to model.  Here's the crux of what he does say (fair use! fair use!):

Evansfig2C.JPG

I've rearranged the figure so that what were left, middle and right panels are now top, center and bottom panels; in all graphs the abscissae are "Years of journal issues online" and the ordinates are "Herfindahl citation concentration", which is explained as follows:

A concentration of 1 indicates that every citation to [a given] journal [or subfield] in a given year is to a single article; a concentration just less than 1 suggests a high proportion of citations pointing to just a few articles; and a concentration approaching zero implies that citations reach out evenly to a large number of articles.
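For anyone who wants to get a feel for that measure, here is a minimal sketch. It assumes the textbook Herfindahl index (the sum of squared citation shares), which matches the behaviour described in the quote; the exact computation Evans used is not given here, and the function name and sample data are made up for illustration.

```python
from collections import Counter


def herfindahl(cited_articles):
    """Sum of squared citation shares, one share per distinct cited article.

    Assumption: this is the standard Herfindahl index, not necessarily
    the exact formula used in the Evans paper.
    """
    counts = Counter(cited_articles)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one citation")
    return sum((n / total) ** 2 for n in counts.values())


# Every citation points at the same article.
single = herfindahl(["a"] * 10)
# Nine of ten citations point at one article.
skewed = herfindahl(["a"] * 9 + ["b"])
# One hundred citations spread evenly over one hundred articles.
even = herfindahl([f"article{i}" for i in range(100)])

print(f"{single:.3f}")  # 1.000
print(f"{skewed:.3f}")  # 0.820
print(f"{even:.3f}")    # 0.010
```

The three cases line up with the three situations in the quoted explanation: exactly 1, just under 1, and approaching zero.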
Here's Evans' interpretation of that data:
Figure 2C illustrates the concurrent influence of commercial and free online provision on the concentration of citations to particular articles and journals. The left panel shows that the number of years of commercial availability appears to significantly increase concentration of citations to fewer articles within a journal. If an additional 10 years of journal issues were to go online via any commercial source, the model predicts that its citation concentration would rise from 0.088 to 0.105, an increase of nearly 20%. Free electronic availability had a slight negative effect on the concentration of articles cited within journals, but it had a marginally positive effect on the concentration of articles cited within subfields (middle panel) and appeared to substantially drive up the concentration of citations to central journals within subfields (right panel). Commercial provision had a consistent positive effect on citation concentration in both articles and journals. The collective similarity between commercial and free access for all models discussed suggests that online access -- whatever its source -- reshapes knowledge discovery and use in the same way.
Wait, what?  Let me unpack that with a rewrite from my point of view:
The number of years of commercial availability appears to significantly increase concentration of citations to fewer articles within a journal, whereas free electronic availability had a negative effect on the concentration of articles cited within journals. If an additional 10 years of journal issues were to go online via any commercial source, the model predicts that its citation concentration would rise from 0.088 to 0.105, an increase of nearly 20%. In contrast, if an additional 10 years of journal issues were to go online via any free source, the model predicts that its citation concentration would drop from 0.088 to just under 0.08 [I had to estimate this by eye, since the data are not available], a decrease of around 10%. Similarly, free electronic availability had only a marginally positive effect on the concentration of articles cited within subfields. Only when considering concentration to journals within a subfield did free availability cause a substantial increase, and even then this effect was considerably less than that driven by commercial availability, which had a consistent positive effect on citation concentration in both articles and journals.
In other words, I take issue with the final sentence of the paragraph I quoted: commercial and free access do not show "collective similarity".  On one of three measures they have the opposite effect, and on the other two measures commercial access has by far the stronger effect.
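The arithmetic behind those two percentages is easy to check. This is only a sketch: 0.088 and 0.105 come from the quoted paragraph, while 0.08 is my by-eye estimate from the figure, so the second result is approximate at best.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Commercial provision: values quoted from the paper.
commercial = percent_change(0.088, 0.105)
# Free provision: 0.08 is an estimate read off the figure, not published data.
free = percent_change(0.088, 0.08)

print(f"{commercial:+.1f}%")  # +19.3%
print(f"{free:+.1f}%")        # -9.1%
```

Since the free-access endpoint is "just under" 0.08, the true decrease would be a little larger than 9.1%, hence "around 10%".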

What this suggests to me is that the driving force in Evans' suggested "narrow[ing of] the range of findings and ideas built upon" is not online access per se but in fact commercial access, with its attendant question of who can afford to read what.  Evans' own data indicate that if the online access in question is free of charge, the apparent narrowing effect is significantly reduced or even reversed.  Moreover, the commercially available corpus is and has always been much larger than the freely available body of knowledge (for instance, DOAJ currently lists around 3500 journals, approximately 10-15% of the total number of scholarly journals).  This indicates that if all of the online access that went into Evans' model had been free all along, the anti-narrowing effect of Open Access would be considerably amplified.

In fact, the comparison between print and online access is barely even possible when considering Open Access information.  The same considerations of cost -- who can afford to read what -- apply to commercial print and online publications, but free online information has essentially no print ancestor or equivalent.  Few if any scholarly journals were ever free in print, so there's a huge difference between conversion from commercial print to commercial online on the one hand, and from commercial print to Open Access on the other.

Indeed, I would suggest that if the entire body of scholarly literature were Openly available, so that every researcher could read everything they could find and programmers were free to build search algorithms over a comprehensive database to help the researchers do that finding, then in fact the opposite effect would obtain.  Perhaps it's true that the more commercial online access you have, the less widely a researcher's literature search net is cast, but as I mentioned above I see no reason to attribute that more to the mode of access than to its cost.

In support of this assertion, consider the expanding body of literature on the Open Access "citation advantage" -- studies which show that the likelihood of a given paper being cited is increased up to several hundred percent if the paper is OA rather than commercially available.  There is some controversy over that literature, but it stands in direct contrast to the idea that online access of any kind tends to narrow citation reach.

There are more data in Evans' paper that speak to the free-vs-commercial issue, and some of those data show free access having a stronger "narrowing" effect than commercial access.  I'd go through it in detail, but I am probably already pushing the limits of fair use so I'll have to refer you to the published article -- in particular, Figure 2 panels A and B.  My response is much the same, that the apparent effect suffers from a loading in "favour" of commercial access, because of the wildly disparate sizes of the two different bodies of online literature. 



-----
refs 14-18 from Evans, JA Science 321:395, 2008:

A. L. Barabási, R. Albert, Science 286, 509 (1999).
R. K. Merton, Science 159, 56 (1968).
D. J. de Solla Price, Science 149, 510 (1965).
H. A. Simon, Biometrika 42, 425 (1955).
M. J. Salganik, P. S. Dodds, D. J. Watts, Science 311, 854 (2006).

Updates 080720:

1. I linked to the FriendFeed discussions but meant to emphasize -- in one of those conversations, Lars Juhl Jensen points out that the single biggest change is information volume:

I cannot help but wonder if this has anything to do with electronic publication, or if it is simply an effect of sheer volume. If researchers have to search through ten times as many articles (because of the exponential growth of the literature), is it really surprising that they don't make it as far back into the past as they used to do?
This is related to, though stronger than, my point about changes in the culture of research.

2. Bora reminded me of another conflicting study by Arthur Eger, this one showing that "a larger [online] content offering coincides with a dramatic increase in Full Text Article requests, and an increase in Full Text Article requests, after about 2 years, coincides with increased article publication". This is not necessarily inconsistent with Evans' claims, especially since the Eger study also showed that the effect of increasing backfile availability is "modest", but I would like to see those increased Full Text requests broken down by date of publication...

3. Tom Wilson doesn't necessarily agree with my (rather blithe?) assertion that researchers are indeed aware of preceding work:

would it were true that authors are not ignorant of earlier work. In my experience as an Editor and a PhD supervisor, I am continually amazed at the extent to which authors and students are unaware of pre-WWW work. It seems that if the work was done before 1995 it is assumed to have no relevance to the present day. In many cases, of course, that will be true and in some cases the research record is a record of building upon earlier work. In the case of many subfields in information science, however, it isn't the case. A great deal of work was done in the 1970s, which is now completely ignored. Researchers rediscover wheels again and again, when a search of the earlier literature would have revealed that what they think of as novel, was novel 50 years ago!
I think this points up my own biases, in that when I think of research I tend only to think of wet lab science, molecular biology in particular since that's what I do for a living. There are many other fields of research! It strikes me that if molecular biologists do in fact reinvent wheels less often than other disciplines, it is perhaps because our online records go back a long way: PubMed reaches back to 1966, and has some coverage all the way back to 1951. Since molecular biology can fairly be said to have come of age as a discipline in 1953, this suggests two things: that Evans may be more right than I think for disciplines outside my own, and that if those disciplines could digitize their archives efficiently it might go a long way towards solving the problem. In other words, the answer to the narrowing effect of online access on scholarship may be to broaden and deepen online access.



Thursday, 03 July
Lie down with pit bulls, wake up with a blogospheric flea in your ear.

This clumsy hatchet job from Nature reporter Declan Butler is beneath him, a poor excuse for journalism and an affront to the respect with which many of his colleagues are regarded by the research community.

Let's start with the title: "PLoS stays afloat with bulk publishing". Loaded rhetoric, anyone? The clear implications are that PLoS is floundering (Butler's own numbers show otherwise!), and that "bulk" is somehow inferior (to, one presumes, "boutique" or some such). PLoS is "following an haute couture model of science publishing" sniffs our correspondent, who goes on to clarify: "relying on bulk, cheap publishing of lower quality papers to subsidize its handful of high-quality flagship journals".

This emphasis on "quality" and the idea that the same somehow equates with scarcity continues throughout: "the company consciously decided to subsidize its top-tier titles by publishing second-tier community journals with high acceptance rates", "the flood of articles appearing in PLoS One (sic)", "difficult to judge the overall quality", "because of this volume, it's going to be considered a dumping ground", "introduces a sub-standard journal to their mix".

The intent is obvious, and the illogic is boggling. Where does Butler think the majority of science is published? Even if you buy into this nebulous idea of "quality" (one knows it when one sees it, does one not old chap? wot wot?) there can be no "great brand" journals without the denim-clad proletarian masses. All the painstaking, unspectacular groundwork for those big flashy headline-grabbing (and, dare I say it, all too often retracted) Nature front-pagers has got to go somewhere.

It gets much worse, though, when we get some measure of what Butler thinks "quality" means:

Papers submitted to PLoS One (sic) are sent to a member of its editorial board of around 500 researchers, who may opt to review it themselves or send it to their choice of referee. But referees only check for serious methodological flaws, and not the importance of the result.
That, along with an earlier remark about "a system of 'light' peer review", is a blatant and serious misrepresentation of PLoS ONE's review process. Here's the actual policy:
The peer review of each article concentrates on objective and technical concerns to determine whether the research has been sufficiently well conceived, well executed, and well described to justify inclusion in the scientific record. [...]

Unlike many journals which attempt to use the peer review process to determine whether or not an article reaches the level of 'importance' required by a given journal, PLoS ONE uses peer review to determine whether a paper is technically sound and worthy of inclusion in the published scientific record. [...]

To be considered for publication in PLoS ONE, any given manuscript must satisfy the following criteria:

  • Content must report on original research (in any scientific discipline).
  • Results reported have not been published elsewhere.
  • Experiments, statistics, and other analyses are performed to a high technical standard.
  • Conclusions are presented in an appropriate fashion and supported by the text.
  • Techniques used have been documented in sufficient detail to allow replication.
  • Reports are presented in an intelligible fashion and written in standard English.
  • Research meets all applicable standards, including the Helsinki Declaration, with regard to the ethics of human and animal experimentation, consent, and research integrity.
  • Report adheres to the relevant community standards for research, reporting, and deposition of data. (Standards PLoS promotes across its journals).
Which is to say that PLoS ONE* holds authors to exactly the same scientific standards that every journal should follow. Which is to say that any methodological flaws, not "only... serious" ones, will see a paper revised, or rejected if the flaws can't be overcome. Which is to say that PLoS ONE uses peer review to do what it was designed to do, not to create an artificial scarcity from which to milk profit with scant regard for the integrity of the scientific record. That's not "light" peer review, it's real peer review.

With this scurrilous parroting of anti-OA FUD, Nature makes pretty clear where its interests and its allies are.  Well, you know what happens when you lie down with pit bulls...

There's a lot more, but that was the issue that pushed my buttons the hardest. See Bora for a roundup of responses; here's a quick outline of some of the key issues:

Jan Velterop, responding to Butler's last "investigation" of PLoS finances two years ago, pointed out that it's ridiculous to expect a new journal with a new business model to break even in a few years, when new journals from established publishers take up to a decade to achieve the same goal; DrugMonkey also mentions the "so what" nature of this complaint. Jonathan Eisen remarks that somehow Butler gets from "PLoS ONE is doing well and making money" to "PLoS is a failure"; go read Jonathan to see how twisted your logic has to be to make that particular trip. (Jonathan also provides an important reminder, that we should not confuse Nature Publishing Group as a whole with their many talented and well intentioned employees!) Grrlscientist observes that, while Butler's piece makes it sound as though PLoS' reliance on donations were a bad thing, all journals rely on the donation of time and expertise by unpaid reviewers. DrugMonkey, Jonathan and Grrlscientist all make the point that Nature has its own stable of "second tier" journals with "lower barriers to entry" -- the same mechanism for which Butler criticizes PLoS. Stevan Harnad is famous for making the point (here, for example) that if the funds currently draining into subscriptions were used to pay OA costs, there would be an immense improvement in the utility of the scientific record even if there were no financial saving.

Finally, pretty much every commenter has pointed out the glaring lack of any "conflict of interest" statement on the Nature piece -- having said which, I'd better make one of my own. It's well known and obvious at a glance at this blog that my favorite drink is the Open Access Kool-Aid. I have personal friends who work for PLoS, and I've previously applied to work there myself.


* originally in lowercase -- so much for my snotty (sic)s!



Sunday, 11 May
OA and licensing: why not Public Domain?

This is an unpublished post that's so old (Aug '07) that I don't know why I didn't just post the damn thing; I've forgotten what I was intending to do with it. I'm posting it now because it contains pointers to useful thinking by David Wiley and others that is germane to the ongoing discussion of data licensing (see post below). I was reminded of this old draft of mine by Deepak's comment that copyleft may be harmful in the case of scientific data, a point David also makes in respect of his particular Open area, education. Much of what David says maps readily from his field to research, so without further ado:

David Wiley of Iterating Toward Openness has been blogging up a storm about open content licensing:

That's a lot to read, but it's all good stuff. David makes one very strong argument that I want to emphasize here, because it points up the difficult distinction between data and (creative) work.

In the post introducing his draft Open Education Licence, he provides a very useful outline of the aims of open content:

  • Reuse - Use the work verbatim, just exactly as you found it
  • Rework - Alter or transform the work so that it better meets your needs
  • Remix - Combine the (verbatim or altered) work with other works to better meet your needs
  • Redistribute - Share the verbatim work, the reworked work, or the remixed work with others

I really, really like that. David's "four R's" resemble the four fundamental freedoms of the Free Software Foundation but do a better job of discriminating between Rework and Remix. The Four R's make immediate sense to me and I will certainly be Reusing and Redistributing that idea.

David goes on to quote some believable numbers and points out that:

Since half of all CC licensed materials are licensed using a copyleft clause and all GFDL licensed materials are licensed using a copyleft clause, this means that over half of the world's open content is copylefted. And while the CC and GFDL copyleft clauses guarantee that all derivative works will be "open," they also guarantee that they can never be used in remixes with the majority of other copylefted works. You can't remix a GFDL work with a By-NC-SA work when the licenses require that the child be licensed exactly as the parent. Each parent had one and only one license - which license would the derivative use? It's just not possible to legally remix these materials; copyleft prevents this remixing. [see David's earlier explanation for details of the incompatibilities among various copyleft licenses]

While promoting rework at the expense of remix - in other words, taking the copyleft approach - is fine for software, it is problematic for content and extremely problematic for education. As educators, we are always remixing materials for use in our classrooms both in the "real" world and online. Your mileage may vary, but over my last 15 years of teaching I would estimate that my remixing activities outnumber my reworking activities 10:1 or more. If other teachers are like me in this regard, then, copyleft is a huge problem for open education.
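David's "one and only one license" argument can be written down as a toy model. This is a deliberately simplified sketch, not a statement of what any real license says: the set of share-alike licenses, the function name and the return values are all assumptions made for illustration, and real compatibility questions are for lawyers.

```python
# Illustrative set of copyleft (share-alike) licenses; an assumption for this sketch.
COPYLEFT = {"GFDL", "CC BY-SA", "CC BY-NC-SA"}


def remix_license(parent_licenses):
    """Return the license a remix must carry under the simplified rule
    "a copyleft parent requires the child to carry exactly its license".

    Returns None when two different copyleft parents make that impossible,
    and "no copyleft constraint" when no parent is copyleft.
    """
    required = {lic for lic in parent_licenses if lic in COPYLEFT}
    if len(required) > 1:
        return None  # the child cannot carry two different licenses at once
    if required:
        return next(iter(required))
    return "no copyleft constraint"


blocked = remix_license(["GFDL", "CC BY-NC-SA"])
inherited = remix_license(["GFDL", "public domain"])
free_mix = remix_license(["public domain", "CC BY"])

print(blocked)    # None
print(inherited)  # GFDL
print(free_mix)   # no copyleft constraint
```

The first case is the GFDL plus By-NC-SA remix from the quote; the other two show why public domain material never gets in the way of a remix under this model.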

It's potentially a huge problem for scientists, too, because much of the potential of Open Science and Open Data (see here for an attempt at defining those terms) is in Remix. There are answers in existing datasets to questions their creators never thought to ask; as Alma Swan put it,
...exciting new developments in text-mining and data-mining are beginning to show what can be done to create new, meaningful scientific information from existing, dispersed information using computer technologies. Research articles and accompanying data files can be searched, indexed and mined using semantic technologies to put together pieces of hitherto unrelated information that will further science and scholarship in ways that we have yet to begin imagining.
This is why I join Peter Murray-Rust in being against copyleft for data:
I am not in favour of copyleft for data. I have no fundamental objection to creating a copyrighted work from data as long as there is significant added value. And copyleft is viral - deliberately. If any item in a system/collection/program etc. is copyleft, then the whole is (at least by the algorithm). [...]
I would argue that if I get factual information from WP [wikipedia] then it cannot carry a copyleft. I need the fundamental physical constants and get them from WP. I don't think that my data and programs are thereby copyleft. All algorithms are now slightly fuzzy.
So what do we mean by "data"? What I mean is "facts about the world of sense-perception", as distinct from the presentation and interpretation of those facts. So I might not be free to reproduce, say, a scan of a Western blot from a published paper -- but having looked at that image, I had better be completely free to do whatever I like with the information it gives me about the way the world works, or else science will grind to a halt. Similarly, if a review article (which contains no new facts, and is all reuse and remix) brings together the results of a number of studies to create new information, or a new hypothesis, about the way the world works, I am not free to copy the wording but I must be free to go into my lab and test the hypothesis.


See also (this was a note to myself in the draft, so caveat lector!):

CC-NC considered harmful (Kuro5hin)
When is OA not OA? (Catriona MacCallum in PLoS Biology)
CC, OA and moral rights (Thinh Nguyen, Science Commons blog)
Open Data and Moral Rights (Peter Murray-Rust)


-----
In the interests of full disclosure, I have a personal statement for this blog which I hope places the content squarely in the public domain, and for my columns on 3QuarksDaily I use CC-BY so that, if those pieces generate any interest, 3QD might at least get some traffic out of having generously offered me a spot on their roster.



Saturday, 10 May
Data are difficult.

Scientific data are not only hard to come by, they're almost as hard to share, mainly because the scientific infrastructure is armpit-deep and sinking fast in the quicksand of patents, copyrights and ever-multiplying licenses. See Peter Murray-Rust, Antony Williams and Egon Willighagen for the latest dust-up over data licensing; I just want to point out this clear-eyed commentary by John Wilbanks:

The public domain is not an "unlicensed commons". The public domain does not equal the BSD. It is not a licensing option.

It is the natural legal state of data.

It is a damn shame that we no longer think of the public domain as an option that is attractive. It's a sign of the victory of the content holders that the free licensing movements work against that something without a license -- something that is truly free, not just free "as in" -- is somehow thought to be worse. We've bought into their games if we allow the public domain to be defined as the BSD. The idea of the public domain has been subjected to continuous erosion thanks to both the big content companies and our own movements, to the point where we think freedom only comes in a contract.

The public domain is not contractually constructed. It just is. It cannot be made more free, only less free. And if we start a culture of licensing and enclosing the public domain (stuff that is actually already free, like the human genome) in the name of "freedom" we're playing a dangerous game.

There's a lot more to get at here.

Yes, there is, and you should read the rest of that entry (and keep up with John's blog) if you're at all interested. I'll add just one brief comment: back when John's current job was first advertised, I considered applying for it -- not that I thought I was qualified, but perhaps the SC would want to hire the new director an offsider of some sort. Having had a couple of years to start learning a bit about Open Access and Open Science, I would venture to say that we are all better off with me in the cheerleading section instead of on the field.




Sunday, 13 April
Term dilution; or, that phrase, you keep using it...

As the terminology wars between "Free Software" and "Open Source Software" aficionados demonstrate, as soon as you stick a label on what you are doing, someone will come along and co-opt it. Sometimes, as with F/OSS, there are real disagreements to be had by reasonable people; at other times, well, not so much. This:

"Open science" is liberated from methodological naturalism (MN), even though it begins with an MN position. That is, all scientists start their work in pursuit of natural explanations for events or natural solutions for problems. If evidence and logic point to an end of the road for natural explanations, on rare occasions a scientist using open science would be willing to consider an explanation which does not force him to a naturalistic conclusion. For instance, the genetic code stored in the DNA molecule has no precedent in naturalism, since all codes are the product of a mind. Open science would allow possible supernatural causation as a topic for further research. The scientist would not be restricted to naturalism as the only explanatory option. But alas! Professional scientists do not practice open science. They practice "closed science."
has most emphatically nothing whatsoever to do with Open Science in the sense in which I -- and my friends, colleagues and allies in the nascent movement, see e.g. blogroll to right -- use the term.



Sunday, 13 April
reminder

Over at Free Genes, Jason Kelly has a nice reminder for those of us who tend to be disheartened by slow rates of progress in our chosen field, be it Open Science or, in Jason's case, synthetic biology. I liked it so much I'm stealing it. This:


firsttransistorgif.jpg

is a transistor, circa 1948. Now you can buy the equivalent of many millions of these for pocket change, in a device that will fit on your keychain.



Saturday, 12 April
Good question.

Egon has an interesting angle on Peter Murray-Rust's observation that you can't mine PubMed Central:

I was wondering about this section in the CC license of much of the PMC content, such as our paper on userscripts (section 4a of the CC-BY 2.0):

    You may not distribute, publicly display, publicly perform, or publicly digitally perform the Work with any technological measures that control access or use of the Work in a manner inconsistent with the terms of this License Agreement.
CC-BY 3.0 reads differently, but has similar aims. [...] Peter indicates that the NIH has put in place 'technological measures to control access' to the distribution of our work on userscripts (the PMC entry). That is in clear violation of the CC license. [...] What the PMC website should indicate, instead, is that text mining is allowed for the PMC OAI subset, but that they would highly prefer to use the PMC OAI or PMC FTP routes. This is the least they have to do.

No matter what, I still have the feeling that any technical obstacles are disallowed by the CC-license. Any legal expert here, that can explain me if the CC license allows controlling how people have access to my material?
In other words, can they do that? Like Egon, I await legal advice... how 'bout it, Creative Commons?



Monday, 07 April
Removal of permission barriers is already part of the definition of OA

Heather Morrison points to this excellent post by Glen Newton, wherein Glen proposes that Open Access should explicitly include machine readability:

Open Access must include access by machines:

* At minimum one must allow crawls of the site/content or (to reduce the impact of badly configured crawlers) create a compressed XML file containing all metadata and either content, or direct links to content and make it available for download (and if bandwidth is still an issue put it on a P2P network like BitTorrent).
* Preferable is to offer some kind of API (OTMI) or protocol (OAI-PMH) to get at content and metadata and citations.
* Better is to offer access to the XML of the articles in addition to the PDF and/or HTML; if the XML actually has some semantic content, then we are approaching the optimum.

The end goal is to support and encourage text mining and analysis of the full-text (preferably semantically rich XML), metadata and citations to allow literature-based exploration and discovery in support of the scientific research process.

Most importantly: hear, hear!
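
For anyone who hasn't met OAI-PMH, here is roughly what asking a repository for its records looks like. This is only a minimal sketch: the endpoint URL is a made-up placeholder (not PubMed Central's or anyone else's), and the helper name is my own invention. The verb `ListRecords`, the `metadataPrefix`, `set` and `resumptionToken` arguments and the `oai_dc` format come from the OAI-PMH protocol itself.

```python
from urllib.parse import urlencode

# Hypothetical endpoint, for illustration only; substitute a real repository's base URL.
BASE_URL = "https://repository.example.org/oai"


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None,
                     resumption_token=None):
    """Build an OAI-PMH ListRecords request URL (hypothetical helper)."""
    if resumption_token:
        # A resumptionToken is an exclusive argument: it replaces all the others.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    return base_url + "?" + urlencode(params)


first_page = list_records_url(BASE_URL)
next_page = list_records_url(BASE_URL, resumption_token="abc123")
```

A harvester would fetch the first URL, read the XML response, and keep following resumption tokens until the repository stops handing them out.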

I do, however, have a nitpick to make. Heather makes no comment on Glen's idea that this is an addition to the definition of OA, but in fact I think it's already built in to the accepted BBB definition. Peter Suber refers to the removal of price and permission barriers, to distinguish Open from "merely" free access, which removes only price barriers; I've quoted him on this before, so here he is again:

The best-known part of the BBB definition is that OA content must be free of charge for all users with an internet connection. However, the BBB definition doesn't stop at free online access. It adds an extra dimension that isn't as easy to describe, and consequently is often dropped or obscured. This extra dimension gives users permission for all legitimate scholarly uses. It removes what I've called permission barriers, as opposed to price barriers. The Budapest statement puts the extra dimension this way:
By "open access" to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.
The Bethesda and Berlin statements put it this way: For a work to be OA, the copyright holder must consent in advance to let users "copy, use, distribute, transmit and display the work publicly and to make and distribute derivative works, in any digital medium for any responsible purpose, subject to proper attribution of authorship".

All three tributaries of the mainstream BBB definition agree that OA removes both price and permission barriers. Free online access isn't enough. "Fair use" ("fair dealing" in the UK) isn't enough.
Having said all that, though, I'll add that an explicit description of machine readability requirements would be an addition to the accepted definition of OA -- and one that I would welcome. Peter Murray-Rust recently noted that, according to the "price and permission barriers" view of Open Access, PubMed isn't OA -- even PubMed Central isn't OA.

I'll go even further: can anyone point me to a single Open Access repository? I don't know of even one such site that removes both price and permission barriers. Surely there must be some, but the Big Names (PubMed Central, arXiv, Cogprints, CiteSeer, RePEc, etc -- see ROAR) don't seem to qualify, because digital objects in these repositories carry their own copyrights, rather than being covered by a blanket license provided by the repository.

Can this be true? Five years after the BBB definition came together, more than ten years since Stevan Harnad's subversive proposal and on the first day of the NIH mandate -- widely referred to as an OA mandate! -- can it be that we really don't have a single truly OA repository in all the world? And if it is true, would it help to make the official definition more explicitly machine-friendly?




Wednesday, 06 February
Open Science Conference proposal

I'm probably too late with this to do any good, but Shirley Wu is putting together a proposal for an Open Science session at the Pacific Symposium on Biocomputing. You can read a draft of the proposal which already reads pretty well to me, and Shirley could do with letters of support:

One thing that would really help outside of the proposal itself is to have actual letters of support. That way the organizers will know there is serious interest and commitment for a session on Open Science - it's a gamble for them, frankly, but much less of one if there is a good crowd on board.

So if you would like to support this proposal and are willing to commit to participating should it get accepted, please send me an email to that effect (with as many details of your anticipated participation as you can provide at this time), and I will include all the emails as "supplementary material" next Friday.
Er, yes, that's this coming Friday... I did mention I was late with this, no?

So anyway, if you can come up with an idea for a presentation or can simply commit to attending, please drop Shirley a line. She's another graduate student who's caught the Open Science bug, and the more of them we have -- and the more we can do to help and encourage them -- the better.




Saturday, 12 January
Mitch Waldrop on Science 2.0

I'm way behind on this, but anyway: a while back, writer Mitch Waldrop interviewed me and a whole bunch of other people interested in (what I usually call) Open Science, for an upcoming article in Scientific American.  A draft of the article is now available for reading, but even better -- in a wholly subject matter appropriate twist, it's also available for input from readers.  Quoth Mitch:

Welcome to a Scientific American experiment in "networked journalism," in which readers -- you --get to collaborate with the author to give a story its final form.

The article, below, is a particularly apt candidate for such an experiment: it's my feature story on "Science 2.0," which describes how researchers are beginning to harness wikis, blogs and other Web 2.0 technologies as a potentially transformative way of doing science. The draft article appears here, several months in advance of its print publication, and we are inviting you to comment on it. Your inputs will influence the article's content, reporting, perhaps even its point of view.

So consider yourself invited. Please share your thoughts about the promise and peril of Science 2.0. -- just post your inputs in the Comment section below.

It's good to see Science 2.0 getting not just mainstream attention, but well-crafted and balanced mainstream attention.  It's also good to see a "Journalism 2.0" approach being tested, so if you have ideas or opinions, go participate.

On a personal note, I'm pleased but a little embarrassed to have been quoted by name in an article for which I know Mitch interviewed a lot of people who are actually *doing* Science 2.0, not just cheering from the sidelines like me.  It's hard to be critical of choices made in the face of space constraints (the article is destined for print), but there's no such limit online.  I wonder whether Mitch and his SciAm editors would consider putting a longer version online? 

In a similar vein, in comments here Bora asks whether we (John's "usual suspects") couldn't put together a longer article for publication somewhere.  I think I might have a better idea (though it's hardly original with me).  From my point of view, the best thing about my 3Quarks Open Science articles from about a year ago is that they are already wildly out of date.  The -- to me -- obvious way to update them and keep them up-to-date is to turn them into a wiki (probably starting from the Nodalpoint wiki's Open Science page).  I think the articles cover most of the main bases, and each section could relatively easily be turned into a wiki page; with a little attention to style, it should then be fairly easy to re-write the articles from the updated information.  I am, as usual, swamped with work, so I won't be able to wiki-ize anything any time soon -- I do intend to get to it eventually, but in the meantime the articles themselves are all CC-BY and my Simpy bookmarks, which should help with updating, are pub dom and I'd be happy to help if anyone else wanted to take a stab at it. 

Finally, if you enjoyed the SciAm article, you might also enjoy more of Mitch's writing: he has a blog, a new gig at Nature and has written three books to date: The Dream Machine (2001), Complexity (1992) and Man-Made Minds (1987). (I swiped his affiliate links, I hope they still work.)



Sunday, 06 January
Another clarification -- actually a correction.

Being careful with the language of the letter below made me see that, in earlier entries, I've fallen into one of the easy traps in which OA opponents would like to catch everyone:

...of these, 16 are listed as "grey" (won't allow archiving), 23 are "green" (allow refereed postprint archiving -- NIH mandate compliant) and 7 "pale green" (allow preprint archiving; many "pale green" publishers actually allow postprint archiving and are NIH compliant...

...at least 50% of PSP members are already complying with the NIH mandate, and a further 15% at least allow preprint archiving and may even be NIH-compliant.

The majority of journals for which information is readily available are already compliant with the new NIH mandate...

This phrasing is deeply misleading: it's not the journals or the publishers who must comply with the new NIH (or any other) Open Access mandate!

Publishers can choose to allow their authors to self-archive, or not. They are under no compulsion whatsoever. It's the authors -- who have taken public funding, and so are working for the public -- who must comply with the mandate to give the public full value for its money.

There is no such thing as an NIH-compliant, or non-compliant, journal or publisher. That's a phrase that comes readily to hand, a convenient shorthand perhaps, but we should not use it. The mandate simply does not concern itself with the actions of publishers. Beware the rhetorical frame in which the new law is cast as "the government telling publishers how to run their business"!

The obvious replacement phrase, when talking about journals or publishers and their policies, is "mandate-compatible", so I'll be careful to use that from now on.



Saturday, 05 January
They get letters. Maybe.

Peter Suber points out that no members of the AAP/PSP's ill-conceived PRISM "coalition" were ever identified, and that at least nine publishers publicly disavowed or distanced themselves from it; he then asks:

Has AAP/PSP ever consulted its members about its position on the NIH policy? Are AAP/PSP members willing to see their dues spent on a lawsuit to delay it?

I think it's worth finding that out.

Listed at the bottom of this entry are the "green" and "pale green" EPrints/RoMEO publishers listed as members by the PSP (links and names taken directly from the PSP website). On closer inspection, it seems that RoMEO proper lists all of the "pale green" publishers as yellow, and (with one or two caveats concerning journals with long embargo periods) gives them all a "compliant" rating in respect of NIH policy.

Here is a draft of the letter I have in mind to send to each of these publishers:

Dear [Publisher],

the Association of American Publishers' Professional and Scholarly Publishing division (AAP/PSP), which lists [your company] as a member [1], recently issued a press release [2] in response to the new NIH mandate [3] for Open Access to publicly funded research. The press release was highly critical and contained a number of mistaken and misleading assertions; for details, you can read a public, point-by-point rebuttal [4] by Prof Peter Suber, open access project director at Public Knowledge [5] and a senior researcher with the Scholarly Publishing and Academic Resources Coalition [6]. I'm sure you remember PRISM, the AAP/PSP's ill-considered campaign against Open Access [since your company publicly distanced itself from same]; this latest press release is similar in tone and apparent intent.

In stark contrast to the AAP/PSP's public stance, [your company] is listed by Project RoMEO [7] as a [yellow/green] publisher. This means that [your company] policy regarding self-archiving of journal articles was fully in line with the new law even before it became law, and there is absolutely no conflict between your business model and the NIH mandate. In fact, of the 46 PSP member companies indexed by Project RoMEO, 30 have no policy that conflicts with the new law; and of the approximately 6000 journals published by those 46 companies, around 5700 already allow their authors to comply with the NIH mandate.

I write, therefore, to ask: does the AAP/PSP accurately represent its members in its opposition to the NIH mandate? Was [your company], as a member of the Association, consulted before the AAP/PSP response was made public? Finally, if [your company] is not in agreement with the AAP/PSP on this matter, would you consider making a public statement to that effect [in the same way you did regarding PRISM]?

sincerely,

Me.

[1] http://www.pspcentral.org/index.cfm?left=member_companies&page=/home/member_companies.cfm
[2] http://www.pspcentral.org/publications/AAP_press_release_NIH_mandatory_policy.pdf
[3] http://thomas.loc.gov/cgi-bin/query/z?c110:H.R.2764:
[4] http://www.earlham.edu/~peters/fos/2008/01/aappsp-response-to-oa-mandate-at-nih.html
[5] http://www.publicknowledge.org/
[6] http://www.arl.org/sparc/
[7] http://www.sherpa.ac.uk/romeo.php

The most obvious thing missing from the draft is "who the hell am I, to be asking you this?" Now, I can send the letter as myself -- concerned citizen, professional research scientist, potential client of publishers -- but I am only an egg, and it would have a good deal more impact as an open letter from a variety of interested and concerned parties, and still more if it came from somewhere official (ARL, SPARC, I don't really know who would be appropriate here).

So -- anyone up for a multi-author open letter? Any other ideas?

Update 080310: decided not to send letters after all; see here, scroll to bottom of post.




The publishers in question:

Pale green:

Green:



Saturday, 05 January
Quick clarification

The publisher list I've been using in the last few posts actually comes from EPrints.org, using information from SHERPA/RoMEO. I'll refer to the EPrints interface as EPrints/RoMEO from now on.

This wouldn't cause any confusion and I wouldn't bother to point it out, except that RoMEO actually uses a four-colour scheme (green, blue, yellow, white) which EPrints has squished into three (green, pale green, grey).

Update: see Stevan Harnad's comment on the next entry.



Friday, 04 January
Does the AAP/PSP really represent its members?

Via Peter Suber, Dorothea Salo and Heather Morrison, I see that the AAP/PSP has responded to the new NIH mandate in typical, PRISM-esque fashion. For anything I might have said in response, and much more, read the linked entries -- especially Peter Suber's. I have something else in mind.

The PSP lists its members here; it didn't take long to compare that list with the list of publishers indexed by SHERPA/RoMEO. Of the 355 publishers in the RoMEO database, 46 are members of PSP; of these, 16 are listed as "grey" (won't allow archiving), 23 are "green" (allow refereed postprint archiving -- NIH mandate compliant) and 7 "pale green" (allow preprint archiving; many "pale green" publishers actually allow postprint archiving and are NIH compliant, but are not listed as green because of various restrictions).

It's not possible to do what I wanted here -- which was to answer the title question. The problem is that the PSP lists about 100 (originally 102, since struck through) members that aren't indexed by RoMEO. I found that somewhat surprising, particularly since the list includes names I'd have expected to find in RoMEO: FASEB, Stanford U Press, Yale U Press, Cold Spring Harbor Lab Press, NEJM, Highwire Press and others.

Nonetheless, we can say that if the RoMEO-indexed sample (46 of 148, 31%) is representative, then at least 50% of PSP members are already complying with the NIH mandate, and a further 15% at least allow preprint archiving and may even be NIH-compliant.

It's even more unbalanced if we compare the numbers of journals published by each company. Those 46 publishers account for 5901 journals; the grey publishers put out 222 (4%), the green publishers 4243 (72%) and the pale green publishers 1436 (24%).
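
If you want to check my arithmetic without waiting for the Excel files, the percentages above can be recomputed from the counts given in this entry. This is just a sketch: the counts are the ones quoted above (including the original figure of 102 non-indexed members), while the variable and function names are made up for the occasion.

```python
# Counts quoted in this entry for PSP members indexed by EPrints/RoMEO.
publishers = {"grey": 16, "green": 23, "pale green": 7}
journals = {"grey": 222, "green": 4243, "pale green": 1436}
not_indexed = 102  # PSP members not found in RoMEO (original count)


def shares(counts):
    """Return each category's share of the total, as a rounded percentage."""
    total = sum(counts.values())
    return {name: round(100 * n / total) for name, n in counts.items()}


indexed = sum(publishers.values())
sample_pct = round(100 * indexed / (indexed + not_indexed))

publisher_shares = shares(publishers)
journal_shares = shares(journals)

print(f"sample: {indexed} of {indexed + not_indexed} ({sample_pct}%)")
print("publishers:", publisher_shares)
print("journals:", journal_shares)
```

Swapping in a better dataset only means changing the three numbers at the top of each dictionary.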

If the PSP were honest and interested in fairly representing its members, I'd think they would find out (and make public) whether the remaining, non-RoMEO indexed members follow the same pattern. I won't hold my breath.

____
Full disclosure: the numbers above are not 100% accurate, since the comparison between the two lists was not always straightforward. For instance, RoMEO indexes "Yale Law School" and the PSP lists "Yale University Press" as a member. I tried to err on the side of the PSP -- for instance, Yale Law is grey, so I included them. There were a few such problematic instances; I very much doubt that they made any difference to the data expressed as percentages, I'd welcome correction and a better dataset, and if anybody wants the Excel files I used I'll be happy to provide them.
Update: see strikethroughs above; some of the overlap issues can be resolved by searching more carefully -- for instance, NEJM is published by Massachusetts Medical Society, which is in RoMEO, and I have no idea how I missed FASEB the first time around. But again, little or no change to the percentages.



Wednesday, 02 January
Public Domain Day

Via Dorothea Salo and Peter Suber, John Mark Ockerbloom reminds me that New Year's Day is also Public Domain Day -- the day on which, each year, a new batch of works enters the public domain:

In countries that use the "life plus 50 years" minimum standard of the Berne Convention, works by authors who died in 1957 enter the public domain today. That includes writers, artists, and composers like Nikos Kazantzakis, Diego Rivera, Dorothy L. Sayers, Jean Sibelius, and Laura Ingalls Wilder.

In countries that use the "life plus 70 years" term, works by authors who died in 1937 enter the public domain, including works by J. M. Barrie, Jean de Brunhoff, H. P. Lovecraft, Maurice Ravel, and Edith Wharton. [...]

In countries like the US and Australia, which are under 20-year freezes of all or most of the public domain, it's not quite as momentous a day. Here in the US, like Bill Murray in Groundhog Day, we're once again waking up to a public domain 1922, as we have since 1998. Our next mass expiration of copyrighted published material is scheduled for New Year's Day 2019, 11 years from now. [...]

Let's not just ask what the public domain can do for us; let's ask what we can do for the public domain. In particular, as of this year more than 14 years have passed since the Web started to explode into public consciousness, with NCSA's release of the Mosaic web browser in 1993. Many of us older Net users started creating web sites that year. And 14 years was the original term of copyright specified in the UK's Statute of Anne, and the US's first copyright law (with an optional renewal term).

As an advocate of more reasonable copyright terms, like those envisioned by our country's founders, I am therefore today dedicating the copyrights of all 1993 versions of my web sites into the public domain. These sites include The Online Books Page, which is still in operation, and Catholic Resources on the Net, which I stopped maintaining in 1999.

Many thanks to John Mark for the informative post, and also for his gift to the public domain. Like Dorothea, I have long since tried to make it clear that I consider my weblog to belong to the public domain. (Do read Dorothea's explanation.) As you can see from comments on my entry, though, an informal statement is suboptimal because people still have questions, and are not confident simply taking whatever they want from the site (as I intend that they should be). It turns out that it's not easy to put something into the public domain without waiting out the requisite copyright term -- it means giving something away for free, and the law is leery of that. So you need meatspace signatures and whatnot, and the Creative Commons Public Domain Dedication is not really much use, even within the USA. I've thought about ditching my homebrew dedication for a CC-BY license, but I don't actually want to place that restriction on the use of anything I post here. Fortunately, CC is on the ball and will soon offer CCZero, which I hope will turn out to be an effective way to dedicate something to the public domain, formally and officially and in a widely recognized and accepted manner. Once I have an option that puts the weight of Creative Commons behind the dedication I want, I'll switch to that. For now, just trust me -- take whatever you want from this site (so long as I made it, of course) and do with it as you please. I'd love to hear back about anything you do with something you found here, but you're under no obligation to inform me.



Thursday, 27 December
A new beginning; here's why.

Rich Apodaca asks whether the new NIH OA mandate marks a new beginning, or more of the same. His argument hinges on the (admittedly unfortunate) phrase "in a manner consistent with copyright law", and he concludes that

Neither HR 2764 nor any form of government intervention will bring widespread Open Access into being.
Here's why I think Rich is wrong.

Point the first: Rich claims that

Most of the journals in question will be hostile to the idea of having their copyrighted material deposited into PubMed Central and so understandably won't allow it to be done by the authors of papers or anyone else.
The available data do not support this. Of the 355 publishers indexed by SHERPA/RoMEO, 66% formally allow self-archiving; more importantly, 56% formally allow archiving after refereeing. (There's a big gap between "formally allow" and "formally forbid", too.) The numbers are even more OA-positive at the journal level. Those publishers between them account for 10199 journals, of which 91% are at least "pale green" -- that is, allow at least preprint archiving. Well over 6000 journals, 62% of the total, are "green" -- that is, allow self-archiving of refereed postprints. You can use the web interface to find out whether your favorite journal or publisher will allow you to self-archive; here's a quick look at the big names (> 50 journals) and a few usual suspects (sorry about the jpg, I can't make html tables to save myself):


romeo.JPG


Point the second: Rich goes on to give the following hypothetical:
Professor Gross at California University gets his manuscript approved for publication in the Journal of Nanoscale Devices (JND). Professor Gross is fully aware both of HR 2764 and JND's refusal to deposit manuscripts into PubMed Central - the reasons why Professor Gross would choose JND anyway are interesting, but not relevant here. Along with the acceptance letter, JND requests prompt return of a signed copyright transfer agreement. Professor Gross sends in the signed form and from that point on, all rights to his article belong to JND. As is their policy, JND refuses Professor Gross permission to deposit a copy of his paper into PubMed Central within 12 months after publication.

Unless I'm missing something, neither Professor Gross nor JND have violated any laws.

Does Professor Gross have to publish in JND? Pace Rich, the good Professor's reasons are relevant. Let's take a look at those publication-related sins through an OA lens:

  • Greed -- the OA advantage should drive the greedy to reject journals like JND which deny them the opportunity fully to profit from their own work
  • Envy -- if you want your publication record to be all it can be, publish OA (either by choosing OA journals, or by self archiving)
  • Pride -- if you want your science to have maximal impact, ditto
  • Wrath -- STM publishing is big business with big fat profit margins; as consumers and producers, let's at least get value for money (i.e., OA) and put the hurt on greedy publishers who won't at least allow us to make our own work OA
  • Gluttony, Lust -- see Greed, Envy, Pride
  • Sloth -- for just a few keystrokes, you can increase your research impact and professional standing; why would you not?

Given all that, will the good Professor continue to kowtow before the little godlings who publish JND? Or will he simply find himself a journal that will play ball?

Point the third: Rich continues:

The assumption made by proponents of the new law seems to be that to implement the new policy, the Director of NIH will forbid publication by grant recipients in journals that don't allow deposition of articles into PubMed Central.

How many influential scientist do you know of who would tolerate the government telling them which journals they can and can't publish in? The minute such a misguided policy is put in place, the national scientific outcry would more than overwhelm anything Open Access proponents could muster.

How many? All of them. When a funder says "jump", even "influential" scientists say "Was that high enough? Shall I try again?". (Besides which, this is not "the government telling them" anything, this is a funding body making a reasonable demand.) Where scientists do have some weight to throw around is with publishers: the NIH can always get another benchmonkey, but publishers need a steady supply of authors. So if I want to publish in the Journal of Dodgy Results, which won't allow repository archiving, and the NIH says "not if you take our money -- not until they comply with the mandate", I can: look for other funding (believe me, there ain't a lot); fight authority (see Mellencamp, J.C., 1983); or I can try to get the editors of JDR to let me put a copy in PubMed Central after 12 months. Identifying the path of least resistance is left as an exercise for the reader.

Here again, the data (though scanty) are on my side. A 2005 survey of nearly 1300 authors found 81% of respondents reporting that they would willingly comply with a green OA mandate; a further 13% replied that they would comply unwillingly, and 5% claimed they would not comply. Not only is 94% a great deal better than the roughly 4% compliance observed while the NIH policy was voluntary, but I've got five bucks right here that says those 5% are full of it. If push comes to shove, they won't be handing back any grants or handing in any letters of resignation. Most of them, confronted with the evidence, will do what scientists are supposed to do in such cases: say "oh, I was wrong", and change their views and behaviour. The few who don't do that will still comply, they'll just yell at a couple of editors to make themselves feel all tough again.

(Stevan Harnad and Alma Swan have both reported that Arthur Sale's ongoing study of institutional repositories in Australia corroborates these figures, showing that authors comply in much the same way that they claimed they would in the survey. What I've seen of Sale's data is certainly consistent with that notion... but more on that later perhaps.)

So, to recap:

1. The majority of journals for which information is readily available are already compliant with the new NIH mandate; I see no reason to assume that any significant proportion of the remainder will be hostile to the policy.

2. I disagree that the NIH will not be able to enforce the policy; faced with the evidence that OA is a good idea and the fait accompli of an NIH mandate, researchers will comply and journals will have to follow suit. To believe otherwise is, I think, to give the publishing industry too much credit for being able to cow their authors.

3. Voluntary reposit policies simply don't work; we have evidence to suggest that mandates will, and already do. (An aside: the new NIH policy joins 20 funder mandates, 11 institutional mandates, 3 departmental mandates, 5 proposed funder mandates, 1 proposed institutional mandate and 2 proposed multi-institutional mandates. Most of those include growth data in their ROARMAP entries. Why don't we have more data on the effects of mandates?)

Happily, I can finish up on a note of agreement with Rich, who says:

The only things that will change the status quo are: (1) the availability of tools for making it happen; and (2) the realization by individual investigators that continuing to give away their hard-earned copyright makes them far less competitive than their peers who don't.

Open Access proponents should forget about getting the Federal Government to fix the mess that modern scientific publication has become. Instead, they should focus on making Open Access-like options more attractive to scientists.

I've outlined my disagreements above, now let me agree with the more important points here:

1. It is vitally important that tools for OA (and Open Science) be built -- tools that researchers will want to use; to see a graphic illustration of this, listen to the forlorn cry of the repository-rat

2. OA provides a host of benefits, not least the boost to individual impact and standing; the clearer this becomes, the closer we get to 100% OA

3. Modern scientific publishing is a mess, and needs fixing. Making OA more attractive to the benchmonkeys is going to be an indispensable part of that fix (see also #1).

P.S. still on hiatus... sorta. Still haven't put that ms together so posting will remain infrequent at best.

P.P.S. see also Peter Murray-Rust's response to Rich's entry.



Sunday, 02 December
If it won't sink in, maybe we can pound it in...

Another brief un-hiatus, this one sparked by a question asked by Dave Munger at BPR3:

If you know of a peer-reviewed, open-access journal that does not charge a publication fee, let us know about it in the comments.
Practically every time I talk about OA, online or in meatspace, I hear "I'd like to support OA but I can't afford it, don't all those journals charge, like, $2500 per article?"

No. They don't.

Everyone seems to be thinking of PLoS, never mind that they waive their fees at the drop of a hat; the assumption that most OA journals charge (high) author-side fees is both widespread and completely wrong.

In fact, more than 2/3 of the journals listed in the Directory of Open Access Journals (DOAJ) and more than 80% of OA journals published by scholarly societies charge no author-side fees at all; in contrast, more than 75% of the 247 non-DOAJ journals in a 2005 survey do charge author-side fees (page charges, colour charges, reprint charges, etc) in addition to subscription charges.

Let's unpack those numbers a little (especially since I generated the first one myself, and so you should take a look at how I did that).

In October 2005, the Kaufman-Wills group published a commissioned survey of journal publishing practices, The Facts about Open Access. The study was initially designed to include only full OA journals (listed in the DOAJ, OA immediately upon publication) and delayed-OA ("embargo") journals from the HighWire Press stable, but was expanded to include the full range of financial models by inclusion of journals published by the Association of Learned and Professional Society Publishers (ALPSP) and the Association of American Medical Colleges (AAMC). The final report included responses from 248 DOAJ, 85 HighWire, 34 AAMC and 128 ALPSP journals and showed that:

52.8% of DOAJ journals charge no author-side fees at all. The percentage for subscription journals was much lower: ALPSP journals overall (23.4%), ALPSP for-profit journals (44.9%), ALPSP non-profit journals (10.1%), AAMC journals (14.7%), HighWire subset (17.6%).
These are the figures that Kaufman and Wills summarize as "...more than half of DOAJ journals did not charge author-side fees of any type, whereas more than 75% of ALPSP, AAMC, and HW subset journals did charge author-side fees."

So -- not only do the majority of OA journals charge nothing on the author side, an even larger majority of non-OA journals do charge author-side fees. If the sample is representative, you're less likely to have to pay to publish if you choose an OA journal than if you don't.
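
To see where "more than 75%" comes from, here is a minimal Python sketch that turns the no-fee percentages quoted above into fee-charging percentages. It uses only the overall figures for each group (not the ALPSP for-profit/non-profit split), and the variable names are my own, purely for illustration.

```python
# No-fee percentages as quoted above from Kaufman-Wills (overall figures only)
no_fee_pct = {
    "DOAJ": 52.8,
    "ALPSP": 23.4,
    "AAMC": 14.7,
    "HighWire": 17.6,
}

# Whatever share does not charge "no fees" charges some author-side fee
fee_pct = {group: round(100 - pct, 1) for group, pct in no_fee_pct.items()}

for group, pct in fee_pct.items():
    print(f"{group}: {pct}% charge author-side fees")

# Size of the non-DOAJ part of the sample: HighWire + AAMC + ALPSP responses
non_doaj_journals = 85 + 34 + 128
print("non-DOAJ journals surveyed:", non_doaj_journals)
```

Each of the three subscription groups comes out above 75% fee-charging, while DOAJ comes out below 50%, and the three non-DOAJ response counts add up to the 247 journals mentioned earlier.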

When I first heard these numbers I thought, as Peter Suber did, that they should "recast the debate" around OA. In January 2006 Peter's regular yearly predictions included this forecast:

It will start to sink in that fewer than half of OA journals charge author-side fees and that many more subscription-based journals do so than OA journals.... People will stop talking about "the OA business model" for journals as if there were just one. People will talk less about how OA journals might exclude indigent authors and compromise on peer review and talk more about how toll-access journals do so. We'll start to document the range of models actually in use for OA journals... We'll get more creative in finding models that suit the range of niches...
He has since called this "the worst prediction I've ever made". I confess myself at something of a loss as to why the Kaufman-Wills study has not come to dominate and reconfigure the OA debate; I can only guess that profit-hungry lowlifes have successfully sidestepped it. In this year's predictions, Peter expects more of the same:
Because both Hindawi and Medknow have both been profitable for more than year, you'd think that the fact of their success would start to sink in, with corresponding effects on attitudes toward the sustainability of OA journals and interest in their business models. But well-documented truths about OA tend to sink in very, very slowly, because they have to compete with myths, misinformation, and misunderstanding. With regret, I predict more of the same.

In 2005 the Kaufman-Wills Group discovered that the majority of OA journals charged no publication fees at all. In 2006 I predicted that that fact would start to sink in. I was dead wrong. The fact still hasn't sunk in, and I've learned my lesson.

Caroline Sutton and I discovered last month that the OA journals published by learned societies follow same pattern as OA journals overall: most of them charge no publication fees. But while 52.8% of OA journals overall use no-fee business models (from Kaufman-Wills, 2005), we found that 83% of society OA journals use no-fee business models, a significantly greater fraction. However, I'm not predicting that this fact will sink in any time soon. Likewise, we found 425 societies publishing 450 OA journals, a much larger number than the societies known to oppose OA policies. But neither am I predicting that this fact will sink in any time soon. We'll continue to hear the unargued claim that society publishers are intrinsically vulnerable to OA and predominantly opposed to it.

The Kaufman-Wills study is not the only one of its kind, either. As discussed in the quote above, just last month Peter Suber and Caroline Sutton of Co-Action Publishing released preliminary findings from their ongoing study of OA journals published by scholarly societies. They identified 468 societies which publish, between them, 450 full OA journals and 73 hybrid ("pay-for-OA") journals. Of the full OA journals, only 75 charge author-side fees -- meaning that more than 80% of society OA journals do not charge such fees.

Finally, there's me. All of the above got me to wondering what proportion of journals in the entire DOAJ database charge author-side fees (since Suber and Sutton showed that when the dataset was expanded, at least among society publishers, the no-fee percentage went up considerably).

Fortunately, the DOAJ now includes a metadata field indicating whether or not a particular journal charges author-side publication fees. Unfortunately, that field is not included in the downloadable comma-delimited metadata file they make available. Fortunately, it's not a whole lot of work to make a replacement file by copy-and-pasting from the "browse by title" page. Unfortunately, you have to do this from the new "for authors" section, because the front-page browsing interface doesn't include the "fee/no fee" field. What's unfortunate about that, for my purposes (though it's a wonderful thing overall), is that the "for authors" browse does include hybrid journals, in which regular articles are subscription-only but authors can pay extra to have their work made OA. (In fact, even the logo at the top is different; on the front page you are seeing the Directory of Open Access Journals, but in the "for authors" section it becomes the Directory of Open Access and Hybrid Journals.) The front page says 2971 journals are indexed, but if you browse by title from the "for authors" page, the totals add up to 4638, the database having apparently added 1667 hybrid journals.

There's probably a smarter way to do this using the OAI-PMH, but that syntax is as impenetrable to me as Ancient High Martian, so I simply pasted the browse-by-title pages into a text document and imported that (colon-delimited) into Excel. Then I filtered on "publication fees", sorted by yes/no/missing and read off the totals from the row numbers. Horrible hack, but it worked.
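
For anyone who would rather not do this in Excel, the same tally can be sketched in a few lines of Python. This is only a sketch under assumptions: the layout of the pasted text (one "field: value" pair per line, records separated by blank lines) and the sample records are hypothetical, not the actual DOAJ page format.

```python
from collections import Counter

def tally_fees(text):
    """Count journals by the value of their 'publication fees' field.

    Assumed (hypothetical) layout: records separated by blank lines,
    each line of a record written as 'field: value'.
    """
    counts = Counter()
    for record in text.strip().split("\n\n"):
        if not record.strip():
            continue  # nothing pasted, nothing to count
        fields = {}
        for line in record.splitlines():
            key, sep, value = line.partition(":")
            if sep:  # keep only lines that actually contain a colon
                fields[key.strip().lower()] = value.strip().lower()
        status = fields.get("publication fees", "")
        counts[status if status in ("yes", "no") else "missing"] += 1
    return counts

# Made-up sample records, not real DOAJ entries
sample = """Title: Journal A
Publication fees: No

Title: Journal B
Publication fees: Yes

Title: Journal C
Subject: Biology

Title: Journal D
Publication fees: No"""

counts = tally_fees(sample)
print(counts)
```

Anything that is not a clear yes or no lands in "missing", which mirrors the yes/no/missing sort described above.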

Including hybrid journals, we get:

charge publication fees: 2185 (47%)
don't charge pub fees: 1998 (43%)
fee information missing: 455 (10%)
total no. of journals: 4638

Given the DOAJ definition of hybrid journal, those should obviously be excluded and the data reworked. This is where a smart person would have stopped and waited for the DOAJ to autogenerate the numbers, but I went ahead and deleted the hybrid entries by hand. (Shut up. I wanted to know, OK?) That yields:
charge publication fees: 534 (18%)
don't charge pub fees: 1980 (67%)
fee information missing: 453 (15%)
total no. of journals: 2967

The second total should presumably be 2971 and it would make sense for the "missing" totals to be the same in both sets, so either there are some errors in the database or I made a couple myself. In either case the errors appear small and make no difference to the percentages, and anyway did I mention this kept me up to 4 am? Actually I suspect some inconsistencies in the database, because the front-page total does not update as quickly as the actual entries, and because there are in fact hybrid journals which don't charge fees (e.g. Emerald Engineering's model).
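
The percentages and the small discrepancies can be rechecked from the counts alone. A minimal Python sketch, using only the numbers given in this post (the function and variable names are just illustrative):

```python
def summarize(counts):
    """Return (total, {category: rounded percentage}) for a tally."""
    total = sum(counts.values())
    return total, {name: round(100 * n / total) for name, n in counts.items()}

# Counts copied from the two tallies above
with_hybrids = {"fees": 2185, "no fees": 1998, "missing": 455}
full_oa_only = {"fees": 534, "no fees": 1980, "missing": 453}

total_all, pct_all = summarize(with_hybrids)
total_oa, pct_oa = summarize(full_oa_only)
print(total_all, pct_all)
print(total_oa, pct_oa)

# Rows removed by the hand deletion, per category
removed = {name: with_hybrids[name] - full_oa_only[name] for name in with_hybrids}
print(removed, sum(removed.values()))

# Gap between the front-page count and the full-OA total found here
print(2971 - total_oa)
```

If the only difference between the two tallies is the deleted hybrid rows, then 1671 rows were removed rather than the expected 1667, which accounts for the four-journal gap; 18 of the removed rows were no-fee entries and 2 had no fee information.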

So now we have three studies (OK, two studies and one ungainly hack) showing that a (strong) majority of OA journals do not charge author-side fees, and one of those studies further showing that a strong majority of non-full-OA journals do in fact charge author-side fees in addition to subscription fees.

Now, can we please put to rest the myth/FUD/whatever that there is only one OA model, the author-side fees/PLoS model? While we're at it, let's have a few more closely related ideas go the way of the dodo: that OA journals discriminate against indigent authors (because they charge publication fees -- except that most of them don't); that OA journals will compromise on quality (in order to collect payment for manuscripts -- except that most of them don't); that if most journals went OA, universities would have to pay more in author-side fees (which, remember, most OA journals don't, but most non-OA journals do, charge) than they do now in subscription fees.

I swiped that list of candidates for memetic extinction from Peter Suber, and you should go read his full discussion, which offers a lot more detail, especially on that last point. Me, I'm going to take a nap and go back to my blog hiatus. But the next time you hear someone talk about the "cost" of publishing in OA journals, please point 'em here.




Thursday, 22 November
brief hiatus in my hiatus

I'm not ending my blogging break, but I simply couldn't let this from Cameron Neylon pass by without comment:

The UK Engineering and Physical Sciences Research Council currently has a call out for proposals to fund 'Network Activities' in e-science. This seems like an opportunity to both publicise and support the 'Open Science' agenda so I am proposing to write a proposal to ask for ~£150-200k to fund workshops, meetings, and visits between different people and groups. The money could fund people to come to meetings (including from outside the UK and Europe) but could not be used to directly support research activities. The rationale for the proposal would be as follows.

  • 'Open Science' has the potential to radically increase the efficiency and effectiveness of research world wide.
  • The community is disparate and dispersed with many groups working on different approaches that do not currently interoperate - agreeing some interchange or tagging standards may enable significant progress
  • Many of those driving the agenda are early career scientists including graduate students and postdocs who do not have independent travel funds and whose PI may not have resources to support attending meetings where this agenda is being developed
  • There is significant interest from academics, some publishers, software and tool developers, and research funders in making more data freely available but limited concensus on how to take this forward and thus far an insufficient committment of resources to make this possible in practice
This is a terrific opportunity to move Open Science forward; as Cameron points out, existing efforts are scattered and perhaps the most important thing right now is to make connections among the community. The whole idea is that a community approach will be vastly more efficient than the existing hypercompetitive model! This funding could move Open Science into the big time by driving the creation and adoption of working standards, possibly even a BBB-style declaration, and by creating a seed network of cooperative scientists out of which mainstream Open Science could emerge.

Cameron writes, in a followup:

I've made a start with an outline on a GoogleDoc which can be viewed here. I have tried to set out some general headings and areas to be fleshed out and added a little text. This is early days but if anyone wishes to add anything then please feel free. I have given editing rights to all those people who have comments on the original post (as of around 9:30 pm GMT on Thursday 22 November) so they should now have editing rights. I have set the document so that those people with invitations can cascade them to others (I hope). I will continue to issue invitations to anyone who comments on the original post. No need to feel obliged to add anything  - I'm not asking you to write the grant for me - but if you feel so inclined then the assistance will be very welcome.

What I will request is from those who are interested is a short letter stating your current post/position/ambitions, your interest in 'Open Science' and why you would like to be involved in this network. Either email to me at C [dot] Neylon [at] rl.ac.uk or simply drop it in as a comment.

Please, if you have anything to offer, step up. And I cannot emphasize this too strongly: if you're at all interested, you do have something very valuable to offer: a letter of support, as described. It is vital that the powers-that-be (that is, the powers-that-fund) see real commitment to these ideas, from real people. The deadline loometh (next Tuesday), so don't put this off. Your letter doesn't have to be a literary masterpiece -- just stand up and be counted.



Sunday, 21 October
Call yer congresscritters -- right now.

The bill to make the NIH OA policy mandatory instead of voluntary is in trouble: from the ATA via Peter Suber (with some editing by yours truly):

The Senate is currently considering the FY08 Labor-HHS Bill, which includes a provision (already approved by the House of Representatives and the full Senate Appropriations Committee), that directs the NIH to change its Public Access Policy so that participation is required (rather than requested) for researchers, and ensures free, timely public access to articles resulting from NIH-funded research. On Friday, Senator Inhofe (R-OK), filed two amendments (#3416 and #3417), which call for the language to either be stricken from the bill, or modified in a way that would gravely limit the policy's effectiveness.

Amendment #3416 would eliminate the provision altogether. Amendment #3417 is likely to be presented to your Senator as a compromise that "balances" the needs of the public and of publishers. In reality, the current language in the NIH public access provision accomplishes that goal. Passage of either amendment would seriously undermine access to this important public resource, and damage the community's ability to advance scientific research and discovery.

Please contact your Senators TODAY and urge them to vote NO on amendments #3416 and #3417. (Contact must be made before close of business on Monday, October 22).

Contact information and a tool to email your Senator are online [here]. No time to write? Call the U.S. Capitol switchboard at (202) 224-3121 to be patched through to your Senate office.

If you have written in support before, or when you do so today, please inform the Alliance for Taxpayer Access. Contact Jennifer McLennan through jennifer@arl.org or by fax at (202) 872-0884.

The ATA has provided a sample email, but I think they miss one important point: Inhofe's amendments are likely to be presented as compromises aimed at avoiding a presidential veto, and that is purely bullshit. (Note to self: find out how much money Inhofe gets from publishers.) Here's Peter Suber's extract from the White House Statement of Administration Policy:
The Administration strongly opposes S. 1710 because, in combination with the other FY 2008 appropriations bills, it includes an irresponsible and excessive level of spending and includes other objectionable provisions....

S. 1710 exceeds the President's request for programs funded in this bill by nearly $9 billion, part of the $22 billion increase above the President's request for FY 2008 appropriations. The Administration has asked that Congress demonstrate a path to live within the President's topline and cover the excess spending in this bill through reductions elsewhere, while ensuring the Department of Defense has the resources necessary to accomplish its mission. Because Congress has failed to demonstrate such a path, if S. 1710 were presented to the President, he would veto the bill.

The Administration strongly opposes provisions in this bill that overturn the President's policy regarding human embryonic stem cell research....

Public Access to Research Information. Provisions in the bill would require that manuscripts based on NIH-funded research be made available to the public within 12 months of publication. The Administration notes that NIH's current policy requesting the voluntary submission of manuscripts has only been in effect for 2 years, and the Administration believes there is opportunity to work with Congress to study the current policy and consider ways to encourage better participation. The Administration believes that any policy should balance the benefit of public access to taxpayer supported research against the possible impact that grant conditions could have on scientific research publishing, scientific peer review and on the United States' longstanding leadership in upholding strong standards of protection for intellectual property....

The Administration strongly opposes...the elimination of the longstanding definition of abstinence education that keeps these programs focused solely on abstinence....

Note that the real reason for the President's objection is the money he'd rather spend on his own priorities. The paragraph that deals directly with the NIH provision shows unsettling echoes of the PRISM propaganda but is really just waffle -- padding to make the list of objections look longer. In fact, as I noted earlier, the NIH estimates that it will cost about $3 million to implement the mandate -- not much of a dent in that $9 billion the President is complaining about. So, here's an alternative sample email, the one I just sent:
Dear Congresscritter,

I am a research scientist and about to become a US citizen. I have worked in the US for four years, having held an NIH T32 postdoctoral fellowship for two of those years. As a scientist and as a concerned member of the US public, I recently wrote to you in support of that portion of the Senate Appropriations Committee's FY 2008 Labor-HHS-Education appropriations bill (S.1710) which directs the NIH to change its policies from a request to a mandatory requirement for free, timely public access to NIH funded research. I have just learned of two last-minute amendments to this bill (#3416 and #3417) proposed by Sen Inhofe (R-OK). The first of these amendments would eliminate the relevant portion of the bill altogether, and the second would cripple it.

I write now to urge you to oppose both of these amendments, which are likely to be presented to you as compromises aimed at avoiding a Presidential veto. They will do nothing of the sort: the President's primary objection to the bill, as a recent Statement of Administration Policy (1) makes clear, is the $9 billion in spending over and above the Administration's topline. The NIH recently estimated (2) the cost of implementing the mandatory public access requirement of S.1710 at less than $3 million per year -- hardly a significant reduction in a $9 billion overshoot!

As I wrote in my earlier letter, traditional scientific publishing sees the taxpayer pay for the research, pay to have it published, and then pay again to access it (or for the same researchers to access it!) through subscriptions to privately owned journals (3). Legislators have a practical, legal and moral obligation to end this inefficiency and waste, and the way to do that is through Open Access to publicly funded research. Open Access maximizes research efficiency (and thus the return on research investment) by removing obstacles to the acquisition of new results by researchers (4), and is essential for realizing the vast and virtually untapped potential of automated data- and text-mining (5,6).

Since the current voluntary policy has achieved only a 5% compliance rate in the two years since its instigation, a mandate is clearly required to fulfil Congress' obligation to maximize the return on public investment in research. The current language of S.1710 contains just such a mandate, and Sen. Inhofe's amendments #3416 and #3417 would eliminate it. Please oppose these amendments and approve without change that portion of the appropriations bill which changes the language of the NIH deposit policy from voluntary to mandatory.

Sincerely,

me.


-----references-----
(1) http://www.whitehouse.gov/omb/legislative/sap/110-1/s1710sap-s.pdf
(2) http://grants.nih.gov/grants/guide/notice-files/NOT-OD-05-022.html
(3) http://www.earlham.edu/%7Epeters/fos/newsletter/09-04-03.htm#taxpayer
(4) http://eprints.ecs.soton.ac.uk/10713/01/timcorr.htm
(5) http://eprints.ecs.soton.ac.uk/13028/01/AS-OA-final.pdf
(6) http://www.jneurosci.org/cgi/content/full/26/38/9606



Sunday, 14 October
A big step in the right direction.

This is excellent news:

We are delighted to announce that a reviewer discount now exits for all those who review manuscripts for Chemistry Central Journal, and this is linked to the rest of the the BMC series journals. The review must have been received on time, and during the last 12 months.

This means that if the submitting author has reviewed a manuscript for Chemistry Central Journal or any of the BMC series, they are entitled to a 20% discount off the article processing charge (APC) when submitting articles to any of these journals. We ask that qualifying authors request this discount at the time of submission.

The number of articles submitted to these journals continues to grow significantly, and we are grateful those who agree to review for our journals.

This is a terrific idea. [struck: and I hope BMC will extend a similar program across all of its own BMC series journals -- that is, if you review for any of them, you qualify for some level of discount when you submit a paper to any other.] (I'm an idiot. At least the link's right.)

Recognition of the value of peer review is a Good Thing™ and long overdue; it gets plenty of lip service but this is the first time I've seen anyone put their money where their mouth is. Let's just hope that funding and tenure review committees find a way to do something similar.

(Hat-tip: Peter Suber.)



Monday, 10 September
Reply to Timo Hannay.

Timo Hannay on Nascent, branching off from a discussion of intemperate responses to PRISM:

A case in point is the criticism that my NPG colleague, Maxine Clarke, faced when talking about "open access" projects at NPG. Not everyone shared her definition of open access and she was accused by some bloggers of using the term as a marketing slogan. (Peter Murray-Rust, who made the original point, later recanted when he understood that Maxine was being genuine, so I don't take issue with him.)
Mr Hannay does, presumably, take issue with me. I will apply Hanlon's Razor and assume Mr Hannay did not bother to read beyond the post he linked, since the very next is this one:
In the entry below, I was not sufficiently careful to avoid Nature-bashing, or the implication that Maxine Clarke was morphing, werewolf-like, into some kind of publisher pitbull. Thanks to Pedro, bdf and RPM for responses which made this clear.

[...]

Let me finish, though, by pointing out that I do not wish to paint NPG as one of the unscrupulous publishers whose intentions worry me, nor Maxine Clarke as their sneaky shill. If NPG uses the term "open access" differently from me, I take that as a good-faith disagreement, and if Maxine uses the term in her employers' sense that is hardly "marketing". Specifically, I apologize for the phrase "if [Maxine] is going to start abusing [the term "OA"] as marketing for Nature", which contains an uncalled-for implication that I hope this entry will dispel.

The elision there includes the list of NPG's OA-related activities that Mr Hannay goes on to point out. The next post on my blog is this one in which I quote Peters Suber and Murray-Rust some more regarding OA definitions and conclude, in what I am happy to have readers interpret as a further step back:
I take Peter S to be saying that it's inevitable that "Open Access" will come to mean, in general use, more things to more people than strict BOAI, and we will not achieve anything by making arseholes of ourselves over it. (Even if that's not quite the way Peter S would put it, that's the way I've come to look at the situation.) There's no point in picking quarrels we don't have to have. It's enough to be more careful in our own usage, for which purposes suffixes a la Peter MR should prove very useful when we need extra precision. I don't think we need invent terms ("fuzzy") just yet -- "OA (specific licence, with hyperlink if writing online)" and "OA (free to read)" should cover most cases.

If we can get to the point where the average consumer -- basically, any researcher -- responds to an OA claim or label by asking "which licence?", we will have done an end-run around the problem of term dilution.

It seems to me entirely unfair and misleading to link to the first of my posts without also linking the next two.

I think Mr Hannay is also in error in describing this post from Jean-Claude as a "followup" to the posts above; I think that Jean-Claude was referring to much more recent and clear-cut abuses outlined by Peter Murray-Rust.

Mr Hannay also goes on to say that

Some people are just too quick to assume base motives, and employ words like "boycott" as if they were punctuation marks.
I do not know who that is aimed at, but as for my own reference to a boycott, I do not think it unreasonable or precipitous to consider such action against publishers who do not distance themselves from PRISM and similar efforts. Why should it be up to me to determine who is and is not part of PRISM? The PRISM organizers would certainly like me to assume that all AAP members are PRISM supporters. As Mr Hannay himself makes clear, publishers need scientists more than the other way around. If you want my manuscripts, you had better demonstrate to me that you are not part of the pack of corporate bloodsuckers and soulless spin doctors that is pushing the palpably dishonest, profit-driven PRISM agenda. (Not that I would, given a free choice, publish in Nature anyway, even after Mr Hannay made it clear NPG does not support PRISM and even if they'd have me -- because they're not OA.)



Update: Peter Murray-Rust did a better job than me of responding to the Nascent post: he rightly led with the important part, which is that Nature is not endorsing PRISM. That's no surprise, but I think it important to be explicit and public about who is and who is not backing PRISM.

Also, now I feel bad about the snotty "Mr Hannay" stuff. I use people's first names here as a rule, even when I've never met them, because a blog is an informal conversation and because I think it fosters a sense of civil fucking discourse. I know perfectly well that Timo is on the side of the angels (viz, on the side of science!) when it comes to scientific communication, and it follows that his comments -- and criticisms -- on this issue are made in good faith. So, er, *shuffles feet*, sorry Timo.



Wednesday, 05 September
Nature mission statement update

Since I spend a fair bit of time excoriating publishers, it's only fair that I take note of those who act in good faith. In response to the blogospheric reaction to the Nature mission statement, Maxine Clarke asked the appropriate persons to update the NPG web page (as you remember, Bob, the journal site already made clear the necessary distinction between the original and updated statements). Accordingly, the NPG page now reads:

Nature's original mission statement was published for the first time on 11 November 1869. The journal's original mission statement was revised in 2000. The original mission statement is reproduced below:
and there follows the same version of the original that was on the page last time I looked.

It's nitpicking to note that I prefer the way the journal does it, with the updated statement immediately visible and a link to the pdf of the original. The new page removes any confusion as to which mission statement now obtains.

Maxine also asked for the print edition of the journal to follow the online version and make both versions of the mission statement obvious. This will necessarily take more time than updating a web page, and I don't have the latest Nature to hand so I don't know if the print change has gone through yet. I will update again as soon as I find out.

So, many thanks to Maxine for responding to somewhat barbed criticism in such a constructive manner.



Wednesday, 05 September
PRISM and PMR

I'm swamped, but two quick points:

1. I'm not going to try to keep up with reactions to PRISM here, unless I think I have something potentially useful to add. If you want a news stream, read OAN or watch my PRISM tag on Simpy -- I'll grab everything I notice.

2. Peter Murray-Rust is blogging up a storm on publisher policies, copyright and Open Access:

There is a great deal of confusion regarding publisher policies and the rights of readers, scholars, institutions &c. I hope that publishers will agree with me that Peter MR is doing a sterling job of getting these issues out into the open, where they can be clarified -- to everyone's benefit.




Saturday, 01 September
More on PRISM: let's not take this lying down.

Jonathan Eisen has got the right idea, listing the entire members' directory of the AAP and calling on academics to consider a boycott if those entities will not at least request dissociation from the PRISM program (as Rockefeller University Press has done) or its discontinuation. You can also read the members' list on the AAP site, and Peter Suber points out that we should pay particular attention to their Professional and Scholarly Publishing division:

I suspect that AAP/PSP did not consult its members before launching PRISM. But in any case the members should know that the launch of PRISM tarnishes them, alienates authors, readers, and referees, and, if successful, will only harm science by entrenching rather than removing access barriers to the results of publicly-funded research.
Peter is commenting there in response to someone else who has got the right idea, Peter Murray-Rust, who (as a Cambridge faculty member) has written to Cambridge University Press; his letter is an excellent example of what everyone should do who has any connection, professional or personal, with any of the AAP/PSP member companies, so I quote it here in full:
Open Letter to Stephen Bourne, Chief Executive Cambridge University Press

Dear Stephen Bourne,

I am writing as an individual member of staff in the University (heavily
engaged in developing new approaches to scientific scholarly publishing) to
ask about CUP's involvement with the recently launched PRISM initiative
from the AAP (http://www.prismcoalition.org/). This initiative is an
undisguised coalition to discredit Open Access publishing and its launch a
few days ago has generated universal dismay and anger in many quarters
including several outside mainstream publishing. The press release was
reported in full by Peter Suber on his Open Access News blog
(http://feeds.feedburner.com/~r/earlham/dGCQ/~3/147374721/2007_08_19_fosblogarchive.html)
where he has objectively answered and dismissed the basis of PRISM and its
methods. As an example of the language of PRISM it implies that publishing
in Open Access journals (as I do on occasions) is "junk science". There is
much more from PRISM which is both deliberately factually incorrect and
misleading and I cannot see how a reputable scholarly organisation such as
CUP could be associated with it. Indeed at least one similar publisher
(Rockefeller University Press
http://feeds.feedburner.com/~r/earlham/dGCQ/~3/150207794/2007_08_26_fosblogarchive.html)
writes:

"I am writing to request that a disclaimer be placed on the PRISM website
indicating that the views presented on the site do not necessarily reflect
those of all members of the AAP. We at the Rockefeller University Press
strongly disagree with the spin that has been placed on the issue of open
access by PRISM." [rest of letter omitted here]

The purpose of my letter is simply to request factual information from CUP
about its involvement with PRISM. Since PRISM itself has not reacted to any
of the recent comment I can simply speculate that not all members of the
AAP (perhaps including yourselves) were consulted before PRISM made its
press release and new site. In particular it is unclear whether PRISM is de
facto composed of all the members of the AAP or whether it uses their
unsought goodwill to reinforce the apparent strength of the PRISM
organization.

This mail is an Open Letter (posted on my blog,
http://wwmm.ch.cam.ac.uk/blogs/murrayrust) and I would intend to publish
your reply in toto and unedited since your position (and those of similar
publishers) is of great public interest). If there is anything you would
not wish to be published, please indicate. Alternatively you may leave a
comment on the blog itself. (My blog itself, though strongly advocating
Open Access and particularly Open Data, attempts to be fair and accurate).

Thanks in advance

Peter Murray-Rust

This letter hits every necessary nail squarely on the head:
  • be polite
  • make clear the nature of your connection with the publisher to whom you are writing
  • keep the background brief and be sure to point to Peter Suber's rebuttal
  • explicitly request a specific response: did Publisher X know about PRISM, and does Publisher X support PRISM?
  • suggest that Publisher X should publicly distance themselves as RUP has done
  • if at all possible, do all of this in public: an open letter, on a blog
I don't know whether I have any direct connection with any AAP/PSP member companies, although I could certainly write to publishers of journals in which I have published papers. In a later entry I will dig through the list and try to find likely recipients of such letters -- for which Peter Murray-Rust has provided such a splendid template.

Update: in comments on Jonathan's post, CSHL Press has repudiated PRISM. Good for them, and I hope they will make a formal public statement to the same effect -- for instance, on their website.



Sunday, 26 August
a bit more on PRISM

If you haven't already, go read Peter Suber's initial response -- it is, as always, clear, calm, comprehensive and compelling. (I hope to meet Peter one day; I imagine him as a kind of unflappable, scholarly James Bond...) This is your one-stop anti-PRISM shop for the time being: if you read nothing else, read this; and whenever PRISM rears its ugly head, make sure Peter's response gets an airing too.

Peter has also responded to a Publishers Weekly article that simply repeats the PRISM propaganda. The by-line is Rachel Deahl, a senior news editor at PW. I wrote to her, as follows:

Dear Ms Deahl,

I write in response to your recent brief article in Publishers Weekly ("AAP Tries to Keep Government Out of Science Publishing", August 23, 2007), in which you quote or repeat several egregious errors of fact which are being propagated by the newly formed anti-Open Access disinformation factory, PRISM.

Briefly, there is no aspect of the Open Access publishing model which would force anyone to "turn over" anything to the government, nor will OA publishing damage peer review in any way. For a detailed and authoritative response to the PRISM campaign, I refer you to Peter Suber, Professor of Philosophy at Earlham College, on his weblog Open Access News:

http://www.earlham.edu/~peters/fos/2007_08_19_fosblogarchive.html#365179758119288416

In your article, you quoted PRISM and AAP members, but gave no space to the opposing point of view, which is simply that taxpayers should get what they have paid for: the results of the research they fund, and maximally efficient use of those results by the researchers whose salaries they also pay. I hope you will follow your initial report with a more balanced article that includes interviews with Open Access experts and advocates. In case it is of use in your research, I include here, in no particular order, a brief list of potential interviewees:

Peter Suber, as above (contact)
Paul Ginsparg, founder of arXiv (contact)
Barbara Cohen, Executive Editor, Public Library of Science (contact)
Mark Patterson, Director of Publishing, Public Library of Science (contact)
Matthew Cockerill, Publisher, BioMed Central (contact)

Finally, I should point out that I have also published this letter on my own weblog, and you are of course welcome to respond there (www.sennoma.net) at any time.

Best wishes,

me.

I'm not sure whether this will do any good -- William Walsh has pointed out that Publishers Weekly is owned, once removed, by Reed Elsevier, noted price-gougers and employers of the notorious Publisher's Pitbull, so Ms Deahl's options may be limited by her bosses. This is also a good place to point out that if you write to her, being a jerk about it will not only be pointless and stupid but will in fact damage the OA cause. (That should go without saying but these things do tend to get out of hand when emotions run high and email allows one to send in haste and repent at leisure...)



Thursday, 23 August
PRISM = Publishers Relying on Insidious Subversion Methods

From Peter Suber:

The AAP/PSP has launched PRISM (Partnership for Research Integrity in Science & Medicine).  I'm quoting today's press release in its entirety so that I can respond to it at length:


A new initiative was announced today to bring together like minded scholarly societies, publishers, researchers and other professionals in an effort to safeguard the scientific and medical peer-review process and educate the public about the risks of proposed government interference with the scholarly communication process.

[much egregious lying]

Anyone who wishes to sign on to the PRISM Principles may do so on the site.

Fortunately for us all, Peter has already responded; I won't excerpt his point-by-point rebuttal here, you should go read it all.

This is disgusting. This runs counter to everything that science, academia, scholarship (and scholarly publishing!) stand for.

There are no names on the PRISM site yet -- but I'm going to find as many as I can and publish them here. Sunlight is the best disinfectant, and I want to know just who is taking part in this revolting effort to steal from the commons and turn public goods into private profit.

(We can start with the AAP: their members page is essentially one long list of companies and organizations with whom I will assiduously avoid doing business until and unless they dissociate themselves from PRISM, and preferably from the AAP altogether.)

More later. Oh yes indeedy.



Tuesday, 21 August
Another note on terminology.

In a comment on one of my 3QuarksDaily columns about Open Access/Open Science, Matthias Röder points out that there are more kinds of research than scientific:

One thing that might be worth thinking about is the fact that Open Science is a term that excludes many projects in the humanities and social sciences. I think Open Research might be a good alternative.
By way of illustration he points to a wikipedia entry on Open Research, which in turn points to a number of Open projects, including SCRIBE, with which Matthias is involved:
  1. SCRIBE is an open and peer-reviewed database with information on music copyists and samples of their handwriting.
  2. SCRIBE is a software tool for searching music manuscripts by handwriting characteristics.
He's got a point. I don't mean to be exclusionary, and am happy to accept Open Research as an umbrella term, a higher level taxon of which Open Science and Open Anything Else are subgroups.

That said, there's also no reason not to use the phylum name when you don't mean to speak for the entire kingdom. I don't know much about research outside of science; I've posted a little about it, but haven't looked into it with nearly the obsessive care with which I follow developments in Open Science. I'm a scientist; my focus is on science.

I'm happy to learn about efforts towards openness in other fields, of course, but I hope no one is surprised or offended to hear that I'll be thinking "how can we use this for science?" the whole time. So for now, I will continue to talk about "Open Science", and I hope that researchers from other fields will not feel excluded but will instead simply look to see whether anything I'm saying is of use in Open Whatever-It-Is-That-They-Do.



Monday, 20 August
What do we mean by open science?

(Addressed in absentia to "Tools for Open Science", Second Life, Aug 20 2007.  I am sorry I could not be there.)

I think we all know what we want, and I think we all want much the same thing, which boils down to just this: cooperation.  A way forward for science, a way out of the spiralling inefficiency of patent thickets, secret experiments and dog-eat-dog competition.  But we use a variety of terms, and probably mean slightly different things even when we use the same terms.  It might -- I am not sure -- be useful at this point to come together on an agreed definition for an agreed term or set of terms  -- something equivalent to the Berlin/Bethesda/Budapest Open Access Declarations.

If this does not seem like a "tool for open science", consider what the BBB definition has done for Open Access.  It provides cohesion, a point of reference and a standard introduction for newcomers, and acts as a nucleation center for an effective movement with clear and agreed goals.  Since this SL session takes off from SciFoo, and SciFoo is by all accounts very good at converting brainstorming sessions into practical outcomes, I thought perhaps the idea of a definition or declaration of Open Science might be a suitable topic.  In what I hope is the spirit of SciFoo, here are some ideas that might be useful in such a discussion.


Terms

Whatever this thing is, what should we call it?  There are a number of terms in use:

  • Open Science -- has the weight of Creative Commons/Science Commons behind it, via iCommons
  • Open Source Science -- Jamais Cascio, Chemists Without Borders
  • Open Source Biology -- Molecular Biosciences Institute
  • I think "biology" too narrow -- there seems little point in Open Chemistry, Open Microbiology, Open Foo all having different names.  I think Open Source Foo too likely to lead to confusion with software initiatives, and too likely to lead to pointless arguments about what the "source code" is.
  • That leaves Open Science, which would be my choice for an umbrella term.  A case can be made, though, for Open Research, on the same basis on which I argue against Open Biology etc -- see this comment from Matthias Röder
  • Another "inclusive" possibility is to focus on information -- Open Data, as per PMR's wikipedia entry, or the broader Open Content.  In the same vein, the Open Knowledge Foundation provides a fairly comprehensive definition of Open Knowledge.
  • I have seen "Science 2.0" around quite a bit lately, though it's a bit too marketing-speak for my taste
  • Open Notebook Science is a very specific subset of Open Science: if your notebook is open to the world, there's not much confusion about access barriers!  It even comes with its own motto: "no insider information".  This is as Open as Open gets.


Sources and Models

We don't have to re-invent the wheel:



Flexibility

We don't want to start a cult, and we don't want to bog anyone down in semantics.  There's no purity test or loyalty oath.  My own view is that Open Science (or whatever we end up calling it) is not an ideology but an hypothesis: that openly shared, collaborative research models will prove more productive than the highly competitive "standard model" under which we now operate. 

Openness in scientific research covers a range of practices, from tentative explorations with a single small side-project all the way to Open Notebook Science à la Jean-Claude, and we should welcome every step away from the current hypercompetitive model.  Open Notebook Science provides a useful marker for the Open end of the spectrum; perhaps all a Declaration need do is identify the minimum requirements that mark the other end of the spectrum?


Conditions


What standards must a research project or programme meet in order to be considered Open?

  • obvious: Open Access publication
  • equally crucial: Open Data, that is, raw data as freely available (including machine access) as OA text
  • probably indispensable: Open Licensing so as to avoid confusion as to what is truly available and for what purposes; as per Peters Suber and Murray-Rust, this must be
    • explicit
    • conspicuous
    • machine-readable
  • Open Semantics: perhaps none of this will be much good without metadata and standards to allow interoperability and free flow of information
  • desirable: Free/Open Source Software
  • David Wiley: "four Rs" of Open Content (cf. Stallman's four fundamental freedoms for software):
    • Reuse - Use the work verbatim, just exactly as you found it
    • Rework - Alter or transform the work so that it better meets your needs
    • Remix - Combine the (verbatim or altered) work with other works to better meet your needs
    • Redistribute - Share the verbatim work, the reworked work, or the remixed work with others
  • OKF definition of Open Knowledge




Wednesday, 08 August
Yale vs. BMC

Yale's science libraries have stopped paying the article processing charges for Yale faculty who publish in BioMed Central journals. Yale says:

Starting with 2005, BioMed Central article charges cost the libraries $4,658, comparable to single biomedicine journal subscription. The cost of article charges for 2006 then jumped to $31,625. The article charges have continued to soar in 2007 with the libraries charged $29,635 through June 2007, with $34,965 in potential additional article charges in submission.
BMC responds:

The main concern expressed in the library's announcement is that the amount payable to cover the cost of publications by Yale researchers in BioMed Central's journals has increased significantly, year on year. Looking at the rapid growth of BioMed Central's journals, it is not difficult to see why that is the case. BioMed Central's success means that more and more researchers (from Yale and elsewhere) are submitting to our journals each year. [...]
An increase in the number of open access articles being submitted and going onto be published does lead to an increase in the total cost of the open access publishing service provided by BioMed Central, but the cost per article published in BioMed Central's journals represents excellent value compared to other publishers.

The increased cost arises because Yale researchers are submitting more and more work to BMC journals.  More manuscripts = higher costs, but if the cost per article has not gone up, then BMC's model scales effectively.  Here are some other ways to look at the numbers:

  • For around $65K, Yale gets about 40 articles published OA, that is, available free to everyone everywhere forever, plus a "subscription" (that is, Open Access, like everyone else) to 179 journals.  The average biomed journal subscription is around $1000-1500/yr; choosing the lower figure to be conservative, those subscription-equivalents are worth $179K/yr.  Even if Yale only wanted to subscribe to around a third of the BMC journals, that would still cost about the same as the OA charges -- and this comparison ignores the page, color and miscellaneous charges that many journals levy.  (An example: PNAS charges $70 per printed page, plus $325 for each color figure or table; $150 for each replacement or deletion of a color figure or table.)

  • Yale could publish those 40 articles elsewhere without paying anything (again, ignoring page etc. charges).  Assuming they don't subscribe to any of the journals they publish in, though, every time any Yale employee wants to read one of those articles they're on the hook for somewhere around $30; so it only takes 2166 person-articles, or an average of about 50 employees wanting to read each article, to get back to $65K -- without the benefit of OA.

  • Yale spent about $7.7 million on subscriptions in 2005-6; converted to OA author-side charges at $1600/article, that's about 4800 articles.  A PubMed search on "Yale" gives 2272 hits; "Yale in title/abstract" gives 131, leaving 2141 papers where "Yale" is probably in an author's address.  I can't find a quick way to break out Yale's subscription expenditure by field, so what proportion of the $7.7mil goes to biomed journals I couldn't say (though STM titles are the most expensive subscriptions for any academic library). If PubMed-indexed journals make up 44% (2141/4800) of Yale's subscription costs, which does not seem unlikely, then they're already paying $1600/article -- without the benefit of OA.
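For anyone who wants to check my arithmetic, the three comparisons above can be reproduced in a few lines of Python. This is only a back-of-envelope sketch: every input is a figure quoted in this entry, the variable names are my own, and the $30 per-article price and the $1000 subscription price are the rough estimates used above, not measured values.

```python
# Back-of-envelope check of the three comparisons above.
# All inputs are figures quoted in the post; the names are illustrative only.

OA_SPEND = 65_000                # approx. BMC charges to Yale, USD
APC = 1_600                      # standard BMC article processing charge, USD
BMC_JOURNALS = 179               # number of BMC journals
SUBSCRIPTION_LOW = 1_000         # conservative biomed subscription price, USD/yr
PAY_PER_VIEW = 30                # rough price to buy a single article, USD
SUBSCRIPTION_BUDGET = 7_700_000  # Yale subscription spend 2005-6, USD
PUBMED_HITS = 2_272              # PubMed hits for "Yale"
PUBMED_TITLE_ABSTRACT = 131      # hits with "Yale" in title/abstract

# First comparison: what $65K buys in OA articles vs. subscription-equivalents
articles_funded = OA_SPEND / APC
subscription_equivalent = BMC_JOURNALS * SUBSCRIPTION_LOW
third_of_journals = subscription_equivalent / 3

# Second comparison: single-article purchases needed to spend the same $65K
purchases_to_match = OA_SPEND // PAY_PER_VIEW
readers_per_article = purchases_to_match / 40

# Third comparison: the subscription budget expressed in author-side charges
articles_at_apc = SUBSCRIPTION_BUDGET / APC
yale_papers = PUBMED_HITS - PUBMED_TITLE_ABSTRACT
pubmed_share = yale_papers / 4_800

print(f"articles funded by OA spend:   {articles_funded:.1f}")
print(f"value of 179 subscriptions:    ${subscription_equivalent:,}")
print(f"value of a third of them:      ${third_of_journals:,.0f}")
print(f"purchases to reach $65K:       {purchases_to_match}")
print(f"readers needed per article:    {readers_per_article:.0f}")
print(f"articles the budget would buy: {articles_at_apc:.0f}")
print(f"probable Yale-authored papers: {yale_papers}")
print(f"share of that article count:   {pubmed_share:.0%}")
```

The readers-per-article figure comes out a little above the round 50 I used, which does not change the argument.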

A quick fiddle with biology + medicine data from the Journal Cost-Effectiveness database gives an average price per article of around $12 for toll-access journals, but that's (one subscription)/(total no. articles).  The question is, how many subscriptions do they sell -- that is, what is their income/article?  We know what BMC makes per article: about $1600 on average.  If an average toll-access journal sells just 135 subscriptions per year, they're bringing in more per article than BMC.
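The break-even point can be worked out directly. The sketch below assumes the round $12 price per article and the $1600 BMC average quoted above; the function name is mine, and since $12 is only an approximate average the answer should be read as "somewhere around 135", not as a precise threshold.

```python
import math

PRICE_PER_ARTICLE = 12         # approx. toll-access price per article per subscription, USD
BMC_INCOME_PER_ARTICLE = 1600  # approx. BMC income per article, USD


def income_per_article(subscriptions: int, price: float = PRICE_PER_ARTICLE) -> float:
    """Income per article for a toll-access journal selling this many subscriptions."""
    return subscriptions * price


# Smallest number of subscriptions at which toll-access income per article
# reaches or exceeds BMC's income per article.
break_even = math.ceil(BMC_INCOME_PER_ARTICLE / PRICE_PER_ARTICLE)

print(break_even, income_per_article(break_even), income_per_article(135))
```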

There's more, but that'll do for now.  Two questions arising:

1. what's the average page/colour/misc charge levied by toll-access journals?
2. how many subscriptions does an average journal sell each year?



An appendix of sorts: the BMC cost structure

Article Processing Charges

standard charge = $1600 (129 journals)
alternative charges: $2410 (2 journals)
$2310 (1 journal)
$2170 (1 journal)
$2010 (2 journals)
$1710 (1 journal)
$1970 (2 journals)
$1910 (4 journals)
$1810 (2 journals)
$1710 (5 journals)
$1505 (11 journals)
$1455 (1 journal)
$1305 (5 journals)
$1205 (2 journals)
$1005 (2 journals)
$805 (1 journal)
$725 (2 journals)
$500 (1 journal)
no charge (5 journals)
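
As a rough check on the "about $1600 on average" figure, the list above can be averaged per journal. This is a sketch only: the prices and journal counts are copied exactly as listed (including the two separate $1710 lines), and the result is a per-journal mean, not a per-article mean, since I have no figures for how many articles each journal publishes.

```python
# (price in USD, number of journals), transcribed from the list above as printed.
APC_SCHEDULE = [
    (1600, 129), (2410, 2), (2310, 1), (2170, 1), (2010, 2),
    (1710, 1), (1970, 2), (1910, 4), (1810, 2), (1710, 5),
    (1505, 11), (1455, 1), (1305, 5), (1205, 2), (1005, 2),
    (805, 1), (725, 2), (500, 1), (0, 5),
]

total_journals = sum(count for _, count in APC_SCHEDULE)
mean_apc = sum(price * count for price, count in APC_SCHEDULE) / total_journals

print(f"{total_journals} journals, mean APC per journal ${mean_apc:.0f}")
```

The journal count agrees with the 179 journals mentioned earlier, and the per-journal mean lands a little under the standard $1600 charge.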

Supporter's Membership
Supporter Members pay a flat rate annual Membership fee based on the number of biology, chemistry, physics and medical researchers and graduate students at the institution. Members of the institution are then given a 15% discount on the APC when publishing in our journals.
Institution size | Annual fee | Break-even total APC | Break-even articles at $1600
Very small institution (21-500 faculty and postgraduate students in biology, chemistry and medicine) | $1994 | $13293 | 8.3
Small institution (501-1500 faculty etc.) | $3987 | $26580 | 16.6
Medium size institution (1501-2500 faculty etc.) | $5980 | $39867 | 24.9
Large institution (2501-5000 faculty etc.) | $7974 | $53160 | 33.2
Very large institution (5001-10000 faculty etc.) | $9967 | $66447 | 41.5

So if this fee is to be less than 15% of total APC, total APC must be at least the figure in column 3.  Since the average is likely to be close to $1600/article, dividing through gives the number of articles in column 4.
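
The figures in columns 3 and 4 can be reproduced as follows. This is a minimal sketch assuming the 15% member discount and the $1600 average charge stated above; the function and variable names are my own.

```python
DISCOUNT = 0.15     # Supporter Member discount on each APC
AVERAGE_APC = 1600  # assumed average article processing charge, USD

# Annual Supporter Membership fees by institution size, USD (from the table above)
FEES = {
    "very small": 1994,
    "small": 3987,
    "medium": 5980,
    "large": 7974,
    "very large": 9967,
}


def break_even(fee: float) -> tuple[int, float]:
    """Return (total APC, number of articles) at which the discount equals the fee."""
    total_apc = fee / DISCOUNT
    return round(total_apc), round(total_apc / AVERAGE_APC, 1)


for size, fee in FEES.items():
    total_apc, articles = break_even(fee)
    print(f"{size:>10}: fee ${fee}, break-even APC ${total_apc}, {articles} articles")
```

In other words, a very small institution needs to publish about nine BMC articles a year before the membership pays for itself, and a very large one about forty-two.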

Postpay Membership
...group members are invoiced in arrears for articles authored by their members that have published in our journals since the last invoice date. Invoice schedules are set on a monthly or quarterly cycle.
Prepay Membership
...enables an organization to cover the whole cost of publishing for their investigators when publishing in our open access journals. No additional fees will be paid by individual authors. This is an advance payment system whereby customers pay upfront for accepted articles authored by their investigators to be processed and published. Upon publication, the full Article-Processing-Charge (APC) for the journal in question, minus a loyalty discount, will be deducted from the account.

The higher the amount paid in advance, the greater the loyalty discount given on each APC.
No numbers seem to be available for the "loyalty discount".



Saturday, 21 July
OK, but I still don't want to see "Open Access" become the new "Low Fat".

Peter Suber commented on the last entry to clarify his position on the varying uses of the term "Open Access":

For me, OA in the strict sense removes both price barriers and permission barriers; all the major public definitions say so; and I'm only too glad to repeat this whenever it comes up. However, as a matter of word usage, the term now covers more territory than this and I've stopped fighting that fact. That is, the term is often used for content that is merely free-to-read.
Peter goes into more detail in a recent entry on his blog:
...many projects which remove price barriers alone, and not permission barriers, now call themselves OA. I often call them OA myself. This is only to say that the common use of the term has moved beyond than the strict definitions. But this is not always regrettable. For most users, removing price barriers alone solves the largest part of the problem with non-OA content, and projects that do so are significant successes worth celebrating. By going beyond [I would say "outside" -- BH] the BBB definition, the common use of the term has marked out a spectrum of free online content, ranging from that which removes no permission barriers (beyond those already removed by fair use) to that which removes all the permission barriers that might interfere with scholarship. This is useful, for we often want to refer to that whole category, not just to the upper end. When the context requires precision we can, and should, distinguish OA content from content which is merely free of charge. But we don't always need this extra precision.

In other words: Yes, most of us are now using the term "OA" in at least two ways, one strict and one loose, and yes, this can be confusing. But first, this is the case with most technical terms (compare "evolution" and "momentum"). Second, when it's confusing, there are ways to speak more precisely. Third, it would be at least as confusing to speak with this extra level of precision --distinguishing different ways of removing permission barriers from content that was already free of charge-- in every context. [...]

and in the Sept 2004 edition of the SPARC OA Newsletter:
One danger is the dilution of our term. That's why [this newsletter discusses] the BBB definition and its place in our history. But another danger is the false sharpening of our term. If we thought that the BBB definition settled matters that it doesn't settle, then we could prematurely close avenues of useful exploration, needlessly shrink the big tent of OA, and divisively instigate quarreling about who is providing "true OA" and who isn't.

The BBB definition functions as a usefully firm definition of "open access" even if it leaves room for variation. We should agree that OA removes some permission barriers (e.g. on copying, redistribution, and printing) even if it leaves different OA providers free to adopt different policies on others (e.g. on derivative works and commercial re-use). My personal preference, for example, is to permit derivative works and commercial re-use. But (as I wrote in FOSN for 1/30/02) I want to make this preference genial, or compatible with the opposite preference, so that we can recruit and retain authors on both sides of this question.

I've omitted a lot of good information to save space here; anyone interested in this issue should read all of the linked discussions. In particular, the SPARC newsletter goes into useful specifics about the OA-related activities of a number of publishers.

Peters Suber and Murray-Rust have both pointed out that one way to be specific about "levels" of openness is to be explicit about licensing -- PMR:

If the community wishes to continue to use "open access" to describe documents which do not comply with BOAI then I suggest the use of suffixes/qualifiers to clarify. For example:
  • "open access (CC-BY)" - explicitly carries CC-BY license
  • "open access (BOAI)" - author/site wishes to assert BOAI-nature of document(s) without specific license
  • "open access (FUZZY)" - fuzzy licence (or more commonly absence of licence) for document or site without any guarantee of anything other than human visibility at current time. Note that "Green" open access falls into this category. It might even be that we replace the word FUZZY by GREEN, though the first is more descriptive.
I take Peter S to be saying that it's inevitable that "Open Access" will come to mean, in general use, more things to more people than strict BOAI, and we will not achieve anything by making arseholes of ourselves over it. (Even if that's not quite the way Peter S would put it, that's the way I've come to look at the situation.) There's no point in picking quarrels we don't have to have. It's enough to be more careful in our own usage, for which purposes suffixes à la Peter MR should prove very useful when we need extra precision. I don't think we need invent terms ("fuzzy") just yet -- "OA (specific licence, with hyperlink if writing online)" and "OA (free to read)" should cover most cases.

If we can get to the point where the average consumer -- basically, any researcher -- responds to an OA claim or label by asking "which licence?", we will have done an end-run around the problem of term dilution.



Thursday, 19 July
In which our hero takes his customary couple steps backwards...

In the entry below, I was not sufficiently careful to avoid Nature-bashing, or the implication that Maxine Clarke was morphing, werewolf-like, into some kind of publisher pitbull. Thanks to Pedro, bdf and RPM for responses which made this clear.

Peter Suber provides a handy roundup of Nature's OA and free-to-read offerings:

[the Current Science partnership] won't be Nature's first OA journal.  Nature and EMBO publish Molecular Systems Biology, a full OA journal, along with a couple of hybrid OA journals.  Nature publishes another hybrid with the British Pharmacological Society.  It publishes a regular series of OA supplements to its flagship TA journal, and in January of this year began offering OA to the backfiles of its academic and society journals.

In addition, Nature has a raft of non-journal OA projects, including a self-archiving policy, a data sharing policy, a neuroscience gateway, a signaling gateway, a networking site, mixed journalism and research sites on climate change and stem cells, blogs, podcasts, job listings, a news aggregator, and a preprint exchange

[Updated after talking to Timo Hannay to include] The Cell Migration Gateway, Dissect Medicine, The Functional Glycomics Gateway, GI Motility Online and The Pathway Interaction Database
It's worth noting that Peter uses the term OA for services and projects which I would describe as free-to-read (or free-to-use), but not OA. I would welcome clarification from Peter here, as I do not feel I am in a position to argue OA definitions with someone who helped draft its founding declarations! [update: see comments]

Even on my more restrictive reading, Nature does have a couple of full-OA journals and a handful of hybrids -- not "one barely-OA journal". Further, whether or not one considers them OA the free-to-read/use projects and services include some important and useful innovations. (The list above doesn't even include Connotea, a science-centric social bookmark manager which I use myself.) Nature is head and shoulders above any of its toll-access competitors in terms of web savvy and willingness to experiment, and I think it's important to recognize this whenever one (quite rightly!) criticizes them for not (yet) being Open Access.

What bothers me about calling Nature's free-to-read/use publications and doohickeys "OA" is the Low Fat/Greenwashing Problem, which Peter Murray-Rust describes thus:

Publishers blaze around "free" "choice", etc. which confuse rather than inform. For a publisher "open" and "free" are to be used like "low fat" "energy food" "healthy" as a way of legitimising current practice.
Everyone is familiar with companies which label their products "environmentally sound" or "healthy choice" when in fact they are paying only underhanded lip service to those concepts. It seems to me entirely possible that unscrupulous publishers may try the same tricks with "open access", and that the best defense is to insist on the BBB definitions. A number of commenters have wondered (can't find a link right now) whether we need another term for Open Access sensu stricto -- something like "BBB-OA", perhaps. (If you say that "be-three-oh-ay" it's not so bad.)

Let me finish, though, by pointing out that I do not wish to paint NPG as one of the unscrupulous publishers whose intentions worry me, nor Maxine Clarke as their sneaky shill. If NPG uses the term "open access" differently from me, I take that as a good-faith disagreement, and if Maxine uses the term in her employers' sense that is hardly "marketing". Specifically, I apologize for the phrase "if [Maxine] is going to start abusing [the term "OA"] as marketing for Nature", which contains an uncalled-for implication that I hope this entry will dispel.


You can get to like the taste of crow... you just have to eat enough of it...




Tuesday, 17 July
"Open Access" is not a marketing phrase and you are not free to use it as you see fit.

Peter Murray-Rust recently pointed to Paul Wicks' (Nature Network) blog article, "Is Publisher-Lead "open access" a swindle?", which refers to PMR's recent blog series on publisher licensing and permissions barriers in hybrid OA models. In comments on Paul's entry, Jennifer Rohn pointed out:

The two dedicated open-access publishers (BioMed Central and Public Library of Science) don't have these problems. People who want to ensure their articles are truly going to be open access, published by companies who have put real thought into the publishing as well as business model, might want to look there.
PMR quoted that comment, to which Maxine Clarke replied (in a comment on PMR's entry) with what looks for all the world like classic publisher anti-OA FUD:
Hello, I declare conflict of interest as I am an editor at Nature, not in itself open access but our publisher has many open access projects and products.
In response to Jennifer's point: I agree that BMC has got an OA publishing/business model and indeed business, but the PLOS model is dependent on a large grant from a charitable foundation, so the jury is still out (in my opinion). As an editor I am concerned about the archiving and the preservation of the scientific record, for example.
I note the commendable upfront COI declaration and state for the record that I do not think Maxine was consciously engaging in FUD. It is nonetheless standard operating procedure for OA opponents to link PLoS to "charity" and cast vague aspersions on the ability of OA publishers to maintain the scientific record. PLoS was intended as a flagship-cum-icebreaker for OA; breaking even financially was always a secondary objective. Nay-sayers about the viability of OA in business are invited to explain the success of (at least) BioMed Central, Hindawi and Medknow. Persons who wish to claim that OA puts the record at risk are invited to explain how a proprietary archive in the hands of a for-profit publisher is safer than PubMed Central or the wide network of repositories linked by OAI-PMH. (Again, I don't think Maxine was making such anti-OA claims, but it bears pointing out that what she did say contains clear echoes of standard FUD.)

Peter MR's response to Maxine's comment was this entry, in which Peter sets out to find the "many open access projects and products" and gets no further than did Jonathan Eisen, who praised the establishment of Molecular Systems Biology (NPG's only OA journal) only to find that in fact the MSB license is the same as CC-BY-NC-ND, which is far too restrictive to call itself OA. As Chris Surridge (of PLoS) puts it in comments on Jonathan's entry,

'Free Advertising' isn't 'Open Access' in my book.
Maxine had this to say:
Nature Precedings, several database publications, Nature Reports publications (3), Nature Network, Scintilla, online daily news service, gateways, blogs, many individual articles and collections of articles are freely available ("projects and products" as I mentioned in my comment to your earlier post. MSB is to my knowledge NPG's only formal open access journal.)
Peter responded with another post, giving the necessary background and pointing out that, excepting MSB,
...the rest of [Maxine's] list completely muddies the "open access" debate. If Nature believe that "open access" applies to any freely visible information on their site, most not peer-reviewed, many without licences and many with the publisher's copyright, then they are making my life much harder.
This is clear and unexceptionable in the context of Peter's ongoing quest for clarity in publisher OA-related policies. That context, or at least its existence and importance to the entry in question, was made clear by the entry itself, and I take ordinary netiquette to involve being familiar with an ongoing conversation before taking part. Nonetheless, Maxine again:
frankly I was not responding to anything you have written in the past few weeks, I was responding to your request to give examples of NPG's "open access" or "free" material.
This is weak at best. Peter asked for "pointers to [Nature's] open access products and the licences which they carry"; see also netiquette, ongoing conversations and. Claims of a limited response made in ignorance of context are either disingenuous or, if made in good faith, still no excuse.

Maxine continues:

It is your perogative to define terms however you like, but not your perogative to enforce other people to use the same definitions - I know what I mean by "open" or "free" content and I don't need to be told off by you for having a different definition to whatever your definition is
I don't know and I don't care what Maxine means by "open" or "free". I care what the BBB Declarations mean. Peter is not defining terms however he likes; he is working with published, widely accepted definitions. He is well within his rights to expect that other people will indeed use the same definitions: that is, after all, the point of having developed and published them. Nature does NOT have "many open access projects and products", it has one (barely) OA journal and the excellent Precedings, together with a number of commendable free-to-read initiatives (blogs, Nature Network, the various free-to-read web special collections, etc). "Open Access" is not a fuzzy buzzword that Maxine is free to define as she sees fit, and if she is going to start abusing it as marketing for Nature then she most certainly does need telling off.

Peter has apologized for being "over-brusque", which is a handsome gesture but in my opinion no such apology was called for.



Friday, 13 July
Giving Open Notebook Science a Try

Openness is spreading, one researcher at a time: Jeremiah Faith, a Boston U graduate student in bioinformatics, has put his lab notes online:

Open Notebook Science [...] is a term coined by Jean-Claude Bradley. The idea is simply that the heart of every person's research - their lab notebook - should be open to the world.

Since most of our scientific work is funded by tax payers who expect their money to be well-spent, it's interesting that openness isn't required. Science typically builds on the body of available knowledge - the more knowledge available the faster science goes. It's striking when you visit other labs in person; you see all of their unpublished work, and you know that most of their results and data won't be available to the bulk of the scientific community until a year after each particular scientific project is finished. By the time papers are in print, it's old news to the insiders. More striking is when you visit labs whose work you've thought about replicating and expanding on. It's not too uncommon to find that only one person in the entire lab is able to get the technique to work, and even for him the technique only works on Wednesdays. This type of information would be useful to know before you embark on a useless three months trying to adapt their method. But scientific publications are covered in a thick coat of high-gloss finish, making these unacknowledged difficulties hard to detect.

Lab notebooks on the other hand are flat black. As long as people keep them regularly updated, they contain the good, the bad, and the completely nonsensical results.

Today I test the waters of Open Notebook Science.

The latest version of my lab notebook is now automatically posted on J's Lab Notebook Page each night. I've been using an electronic lab notebook for two years now, so there's quite a bit of data in there - good and bad (300+ pages).

This is simply fantastic. One of the things that Open Science advocates most sorely lack is concrete examples. Doing research in public, instead of in secret, is a new and somewhat unnerving idea for most scientists; early adopters like Jeremiah are essential to take the edge off that unfamiliarity.

(It's also, to be honest, just plain fun to snoop around in someone else's lab notes! I was amused to note that Jeremiah talks to and about himself in his notebook, the same way I do -- "if I weren't so stupid I'd...", "next time load the control first, doofus", etc. I wonder if everyone does that?)



Tuesday, 03 July
FINO

Once more unto the breach, dear friends, once more: the dreaded Free Is Not Open argument rears its ugly head again. I've made my position (indeed free != Open, and the distinction matters) clear elsewhere, and was gratified recently to find PMR agreeing; now it seems that the Open Medicine editorial team takes the same position:

The Canadian Medical Association Journal (CMAJ) has just published:

Here is our response:

Although the endorsement by CMAJ's editors of open access medical publishing is welcome, we would like to take this opportunity to clarify several points raised in their commentary.1 First, there is an important distinction between open versus free-access publication. Open Medicine has not only adopted the principle of free access, that is, making content fully available online, but endorses the definition of open access publication drafted by the Bethesda Meeting on Open Access Publishing. This definition stipulates that the copyright holder grants to all users a free, irrevocable, worldwide, perpetual right of access to, and a license to copy, use, distribute, transmit and display the work publicly and to make and distribute works derived from the original work, in any digital medium for any responsible purpose, subject to proper attribution of authorship. Given that CMAJ holds copyright and charges reprint and permission fees, it is not in fact an open access journal.

In comparison, Open Medicine does not assume the copyright of our authors' work. We believe that it is only fair and just that authors retain the ownership of their work; as such, Open Medicine does not charge reprint or permission fees, and our work is available for reproduction for educational and teaching purposes without copyright limitations or charges. We use a Creative Commons Copyright License that also ensures derivative works are available through an open access forum. It is through this creative and unlimited use of published material, with due attribution, that we believe scientific discourse can flourish. This truly open access forum also has a contribution to make to a journal's integrity, independence, and freedom. [...]

Chris Surridge of PLoS also agrees, and supplies an excellent analogy:
Free Access to scientific research is great, and all publishers who make their content free to read should be praised for doing so. But this is not Open Access. It is like giving a child a Lego car and telling them that they can look at it, perhaps touch it, but certainly not take it apart and make an aeroplane from it. The full potential of the work cannot be realised.
Where the OM team refer to Bethesda, Chris links to Berlin and goes on to enumerate
...the four unmistakable marks by which you may know, wheresoever you go, the warranted genuine Open Access publication:

1. Content is made freely and immediately accessible to all.
This basically means that you can get it on the internet without paying anything in addition to what it costs you to access the internet.

2. Authors retain the rights of attribution.
So the work is the authors [' property]. The author doesn't sign over the copyright to the publisher or anyone else. Rather the author allows the publisher to publish the work under licence. A licence which also ensures that:

3. Content can be distributed and reused without restriction.
So I or anyone else can take Open Access content and use it, in whole or in part, for any purpose including purposes that have not yet been dreamt of as long as I don't infringe the Authors rights of attribution.

4. Papers are deposited in a public online archive such as PubMed Central.
This ensures, as best as anyone can, that the above three conditions continue to apply to the Open Access content in perpetuity.

It's been my contention that in the absence of explicit, conspicuous and machine-readable Open licensing, condition 3 is violated because in this litigious age, the conscientious and the risk-averse will not download and derive without explicit permission. I got "explicit and conspicuous" from Peter Suber:
The newer definitions [of OA] recognize one further element: an explicit and conspicuous label that an open-access work is open access. Readers should be told when a work is free of price and permission barriers. They might be reading a copy forwarded from a friend and not know whether the publisher would like to charge for access. They might want to forward a copy to a friend and not know whether this kind of redistribution is permitted. When an article has no label, then conscientious users will seek permission for any copying that exceeds fair use. But this kind of delay and detour, with non-use as the consequence of a non-answer, are just the kinds of obstacles that open access seeks to eliminate. A good label will save users time and grief, prevent conscientious users from erring on the side of non-use, and eliminate a frustration that might nudge conscientious users into becoming less conscientious.
and "machine-readable" from Peter Murray-Rust:
For me, if my robots cannot read the articles then as a human I have no interest at all in reading the "fulltext".
Peter MR is not saying that free access for humans is useless, but that to realize the full potential of text- and data-mining, OA materials need to be machine-readable, which includes letting the machines know what they are allowed to have.
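
To make "letting the machines know what they are allowed to have" a little more concrete, here is a rough sketch of how a text-mining robot might look for a license label before touching an article. It assumes the publisher marks the license with the HTML rel="license" link type, which is one common convention but is not something either Peter specifies; the function names, the sample pages and the "no label, no reuse" policy are all my own illustrative inventions.

```python
from html.parser import HTMLParser


class LicenseFinder(HTMLParser):
    """Collect the href of every <a> or <link> tag marked rel="license"."""

    def __init__(self):
        super().__init__()
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr = dict(attrs)
        # rel may hold several space-separated tokens, e.g. "license nofollow"
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "license" in rel_tokens and attr.get("href"):
            self.licenses.append(attr["href"])


def find_licenses(html_text):
    """Return the license URLs declared in a page, in document order."""
    finder = LicenseFinder()
    finder.feed(html_text)
    return finder.licenses


def robot_may_reuse(html_text):
    # Hypothetical conservative policy: an unlabelled page is treated as
    # off limits, however freely readable it may be for humans.
    return len(find_licenses(html_text)) > 0


# Two made-up pages: one labelled, one merely free to read.
labelled = (
    '<html><head><link rel="license" '
    'href="https://creativecommons.org/licenses/by/4.0/"></head>'
    "<body><p>Article text.</p></body></html>"
)
unlabelled = "<html><body><p>Free to read.</p></body></html>"

print(find_licenses(labelled))
print(robot_may_reuse(unlabelled))
```

A real robot would also want to check which license it found, since a no-derivatives license is still not Open; the point here is only that an unlabelled page gives the machine nothing to go on.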

I must confess that finding my thoughts echoed by such leading OA proponents makes me feel better about being, on this issue, at odds with Stevan Harnad. I simply cannot agree that Open "comes with" Free, and the distinction bothers me. It should be relatively easy to convert Free to Open -- simply add a Creative Commons or similar license -- but I think it would be better to do that proactively. If we gloss over the difference between Free and Open at this relatively early stage of OA, we risk creating a (potentially enormous) body of Free text that must be updated to include complete, useful permissions when at last we realize that Free Is Not Open. (The game's afoot: / Follow your robots, and upon this license / Cry "Free is not Open"!)



Tuesday, 05 June
Mission-critical OA!

While you're over at Attila's blog (see the entry below), be sure to read this entry about surgeons in desperate need of information during an operation. Library staff were able to provide the required paper (at 3am!), but the connection with OA is inescapable. Attila:

Even if the surgeon found the title or abstract of the paper within seconds [...] would he/she be able to download the whole (copyrighted) content somehow within minutes too without an institutional subscription referring to informational and life emergency?

Could this exceptional information and life emergency be interpreted as a basic right with complementary duties? [...] What if a perfectly targeted Google app (call it Google Emergency) would be at hand, one that would be able to transiently abandon copyright issues for the sake of human help and solidarity?

That's a fine idea, but I hope that Open Access will render it moot, and that in the not-too-distant future no special application, only PubMed or Google Scholar, will be needed.



Tuesday, 05 June
Two small steps...

Two small but (I think) profound steps forward today, the common thread being movement towards openness:

(1) Attila Csordas will be editing his doctoral thesis "live" on his blog. He won't, at least for now, be including data or unpublished discussions, but he did check with several relevant persons about the "prior publication" status of whatever he does blog (and concluded that the blogging will not present a barrier to publication). Says Attila:

...no idea on how challenging, meaningful this project, a sub-series in Pimm, will be. What I know is that continuous experimentation with genres and frames is the essence of free blogging!
It's at the heart of Open Science, as well; bravo, Attila!


(2) In reference to my earlier post about the proposal to make referees' comments public, Heather points out that PLoS One already offers reviewers the option of having their reviews published, anonymously or signed, as a discussion linked directly from the article. Kudos to Heather for opting to have her review of this paper made openly available.



Sunday, 03 June
Petition for OA to Brazilian science.

Via Stevan Harnad, a petition to establish a self-archiving Open Access mandate for Brazilian research:

Hélio Kuramoto of IBICT has helped to formulate a Proposed Law (introduced by Rodrigo Rollemberg, Member of Brazil's House of Representatives) that would require all Brazil's public institutions of higher education and research units to create OA institutional repositories and self-archive all their technical-scientific output therein.
Once established, OA does not care about national boundaries: open is open. So every institute, funding body, nation or other group that adopts an OA mandate is helping to bring worldwide 100% OA closer.

I join Stevan in congratulating Kuramoto and Rollemberg on their initiative and in urging all OA supporters to sign the petition. (I am signature #31.) Thanks again to Stevan, here is an English translation of the petition text:

To: The Brazilian Scientific Community

On May 23 of 2007, Rodrigo Rollemberg, Member of Brazil's House of Representatives, introduced Proposed Law nº 1120/2007 concerning the dissemination of Brazil's technical-scientific output.

This is a pioneering initiative for this country and indeed for all of Latin America. Brazil can become the first Latin American country to establish a legal mandate for the deposit and distribution of Brazil's technical-scientific output. This Proposed Law represents a decisive and courageous step toward providing open access to Brazilian scientific research. If approved, the Law will contribute to eliminating access barriers to scientific information worldwide. In addition to being beneficial to the national economy, the Law will allow greater transparency in Brazil's investment in its scientific research, generating quantitative metrics to guide the planning and support of science and technology.

The first article proposes that all Brazil's public institutions of higher education, as well as all research units, should be required to establish institutional repositories in which all the technical-scientific output of their academic and researcher staff must be deposited. The intention is to ensure that this content wil