April 2009 Archive



Wednesday, 29 April
Did Facebook screw up, or did I?

motherfucker.png

Earlier today I got a notification from someone on my Facebook friends list, leading to a "President Obama Approval Poll". I voted, clicked away, and then a couple of hours later got an email to say I'd sent notifications of the same poll to my entire friends list.

I am usually pretty careful about permissions for Facebook apps, and I did not notice any opt-out for the spamming of my friends. I honestly think this thing has its privacy settings wrong (deliberately?) and does not give the usual options, just goes ahead and spams your friend list.

Sorry to anyone who got a "notification" from me -- either I got sloppy or, as I suspect, this thing is viral.



Tuesday, 28 April
Perpetuating an OA myth

Maxine at Nautilus posted a slightly shortened version of this letter to Nature from Raf Aerts; what caught my eye was the rearing of a familiar ugly head (emphasis mine):

...the [global recession] may also be affecting the publication output of research institutions in a more subtle way. It could be boosting the traditional reader-pays publication model for scientific journals at the expense of the author-pays, or open-access, model.

Open-access journals ask authors to pay for processing their manuscripts (which involves organizing a form of quality control, formatting and distribution) so that the final product becomes freely available, and free to use if properly attributed. [...]

This myth, that OA is synonymous with author-pays, is a toll-access publisher's delight. It simply is not true. See here for detail; briefly:

  • in 2005, the Kaufman-Wills group showed that "...more than half of DOAJ [Open Access] journals did not charge author-side fees of any type, whereas more than 75% of ALPSP, AAMC, and HW subset [Toll Access] journals did charge author-side fees." (Note that this study included only 248 journals from the DOAJ.)
  • in 2007, Peter Suber and Caroline Sutton showed that, of 450 OA journals published by 468 scholarly societies, only 75 -- fewer than 20% -- charged author-side fees
  • also in 2007, I showed that only 18% of the almost 3000 journals in the whole DOAJ charged author-side fees; 67% did not charge such fees, and the information was missing for 15%.
  • in March 2008, Heather Morrison showed that more than 90% of the psychology journals in the DOAJ charge no publication fee1
  • about a month ago, I showed that only 38 (42%) of the 90 full-OA chemistry journals in the DOAJ charged author-side fees (49% did not charge such fees, and information was missing for 9%).
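The percentages quoted in the list above follow directly from the reported counts. A minimal sketch for checking two of them (the `share` helper is a hypothetical name used only for illustration, not something from the studies):

```python
def share(part, whole):
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Counts as quoted in the list above.
society_oa = share(75, 450)   # Suber and Sutton, 2007: society OA journals charging fees
chemistry_oa = share(38, 90)  # full-OA chemistry journals in the DOAJ charging fees

print(f"society OA journals charging fees:   {society_oa}%")
print(f"chemistry OA journals charging fees: {chemistry_oa}%")
```

Both results sit below 50%, which is the point of the list: charging author-side fees is the minority case in each of these samples.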

Raf goes on to say:

...few peer-reviewed open-access journals have so far had a high impact factor in their field, except for a small number such as those published by the Public Library of Science and BioMed Central. They are therefore struggling to emerge and to attract the most prestigious research findings.

This situation could deteriorate further if open-access journals are forced to move to (partial) site licensing in order to cover their production costs -- a shift recently undertaken by the Journal of Visualized Experiments, for example -- as authors become increasingly reluctant or unable to pay in the current financial climate.

I don't see why we should assume that anything will "deteriorate" if OA journals switch to new funding models, or that OA journals will have a harder time 'emerging' if they move to a model that is actually closer to the old, familiar toll-access model. After all, there already exist a wide variety of ways in which OA publications pay the bills: advertising, endowments, philanthropy, institutional subsidies, memberships, priced editions and more. In particular, hybrid journals (which is what JoVE has become) are popular with toll-access publishers as a way to establish a foothold in OA territory. Inter alia, Elsevier, Springer and Wiley all publish hybrid journals, and between them, those three account for more than 40% of the worldwide science/tech/medicine publishing market -- so the hybrid model is pretty well established.

There's more to say about authors' willingness and/or ability to pay, too. Firstly, it's almost never the author who pays, but the funding body paying for that author's research. At the moment, this can translate into using up precious grant money when there's a need to pay author-side fees, but with 77 funder, institutional and departmental OA mandates in place and more on the way, it seems reasonable to suppose that more and more of the mandating bodies will underwrite more and more of the costs of publishing. For example, HHMI has institutional agreements/memberships with BMC, Springer and Elsevier, and BMC's page of funder policies shows that a majority of UK funders either make additional funds available or allow publication charges to be treated as an indirect cost. Many OA journals also waive or reduce their fees on application; for instance, here are the PLoS (scroll down) and BMC policies.

Finally, remember that the Kaufman-Wills study showed that 75% of the toll-access journals surveyed charged author-side fees (page charges, colour charges, reprint charges, etc) in addition to their subscription charges. So when there are author-side fees involved, I'd like to know how those charged by OA journals (in return for which the work is freely available to everyone, forever) compare with those charged by toll-access journals (in return for which, authors often cannot retrieve their own work, and anyone who wants to read it must pay another fee).


1 updated 04/29 after reading this post from Peter Suber



Sunday, 26 April
Caste in America (or: hell in a handbasket, yes indeedy)

I don't spend much time writing about politics any more -- my mental health just can't take it. But, data!

Via 3QuarksDaily: the Office of Management and Budget has a blog, to which director Peter Orszag posted an entry on "The Case for Reform in Education and Health Care". He describes a talk he gave to the Association of American Universities, and makes his slides available as a pdf. From those slides:

Whether you even start college depends as much on your family's income as on your ability (insofar as math scores are a decent proxy for such ability). For instance, if you're an average student (middle third math scores) you are about twice as likely to go to college if your family earned in the highest bracket, relative to your chances if your family earned in the lowest bracket. Similarly, if you're in one of the two lowest income brackets, you can roughly double your chances of going to college by getting your math score up from middle to highest third.


enrollment.png



If you do start college, whether or not you graduate also has a lot to do with family income: almost half of the students from the lowest income background do not finish college, whereas the noncompletion rate drops to less than 25% in the highest family income bracket.


completion.png



There is a vicious circle in operation: relative to a high school education, a college education returns roughly four times as much, making you that much more likely to contribute to your children's success, as shown above. (The ordinate shows the log of the ratio between the return to a college education and the return to a high school education: 10^0.6 is about 4.)
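The conversion from the chart's log scale back to a plain ratio can be sketched as follows. This assumes, as the parenthesis above does, that the ordinate is a base-10 log and that the value read off it is 0.6. The natural-log line is included only as a hedge, since the base cannot be confirmed from the slide description here.

```python
import math

log_ratio = 0.6  # value read off the ordinate, as quoted in the post

# Reading used in the post: base-10 log of (college return / high school return).
ratio_base10 = 10 ** log_ratio

# Alternative reading (an assumption, not from the post): a natural log.
ratio_natural = math.exp(log_ratio)

print(f"base-10 reading: ratio of about {ratio_base10:.2f}")
print(f"natural-log reading: ratio of about {ratio_natural:.2f}")
```

Under the base-10 reading the ratio is just under 4. Whether that counts as a "premium" of 300% or 400% depends on whether the word means the excess over the high school baseline or the ratio itself.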


wage premium.png



The vicious circle encompasses more than just school. If you have money, you're more likely to be insured and to have more formal education; both factors make you much more likely to take part in routine health screens, which in turn makes you more likely to stay healthy, which in turn keeps your earning potential up, and so on.


sick.png



In a similar vein, Ryan Avent adds this figure from the Pew Charitable Trusts' Economic Mobility Project, which shows that you're more likely to wind up in the top earning quintile if your parents were in that demographic but you didn't go to college, than if you did go to college but your parents were in the bottom quintile (click the image for a popup or go here):

gettingahead2.png

The rich get richer, the poor get the picture, but Garrett was wrong about one thing: when you're down so low, that's right where the bombs are most likely to land. Here's a little Vonnegut to take us to the news at the top of the hour:

America is the wealthiest nation on Earth, but its people are mainly poor, and poor Americans are urged to hate themselves. To quote the American humorist Kin Hubbard, "It ain't no disgrace to be poor, but it might as well be." It is in fact a crime for an American to be poor, even though America is a nation of poor. Every other nation has folk traditions of men who were poor but extremely wise and virtuous, and therefore more estimable than anyone with power and gold. No such tales are told by the American poor. They mock themselves and glorify their betters. The meanest eating or drinking establishment, owned by a man who is himself poor, is very likely to have a sign on its wall asking this cruel question: "If you're so smart, why ain't you rich?" [...]

Americans, like human beings everywhere, believe many things that are obviously untrue... Their most destructive untruth is that it is very easy for any American to make money. They will not acknowledge how in fact hard money is to come by, and, therefore, those who have no money blame and blame and blame themselves. This inward blame has been a treasure for the rich and powerful, who have had to do less for their poor, publicly and privately, than any other ruling class since, say, Napoleonic times.



word cloud CVs for dummies

Pierre and Pawel both did amazing things with word clouds for their CVs, using all kinds of black magic programming skills that I don't have. Just for fun, I thought I'd see what the version looked like that any doofus could create. I made a list of all the jobs I've had, then listed all the methods I used in each job -- making sure to call the same method by the same name each time it came up, so as to provide a basic weighting for the elements in the word cloud.

Here's what Wordle made of the resulting list:

wordleme.png

It's not horrible, though I can already see things I forgot to put in, and I do wish Wordle would keep phrases together1. I guess you could also try doing this with the texts of your published papers, or just the abstracts, or just the Materials and Methods.


1Update: thanks to Piotr, who left me a comment pointing out that Wordle can indeed keep phrases together, here's an alternative version; now that I see it with phrases intact I'm not sure which is better:

wordleme2.png



(Wordle settings for both versions: language: remove numbers, leave as spelled, remove common English words; font: Telephoto, layout: straighter edges, horizontal; color: Wordly, a little variance.)



Tuesday, 21 April
out of season

I've been meaning to post more verse -- my own, and other people's. Blame Jason for reminding me of this one, even though it's the wrong end of the season (and I'm nowhere near a campus these days):


Autumn Song for Alec

or, Who's that dirty old man leering out of the Chem Dept window?


Summer slowly fades away,
campus flowers even brighter:
hectic striplings seize the day,
shorts get shorter, tops crop tighter;

one last blaze of youth and skin,
refracted in the cooling prism
of early autumn; let's hope win-
ter eases Alec's priapism!








(How you doin' Alec? I hope you're OK.)



Saturday, 18 April
That bloody video.

still.png

This video annoyed me the first time I saw it, but I just figured, you know, not everything is made for me. Now it seems to be making another round of the social media stream; it ended up on my radar via FriendFeed, and this time I just had to say something.

First of all, that's five minutes you'll never get back. Five minutes isn't much, but when you only have 30 or 60 minutes a day to spend online -- as, e.g., I did in my last job -- you resent every stolen second. This is why I hate, with a fierce and curmudgeonly hate, multimedia without transcripts or text versions.

Secondly, here's the content -- in a form you can use at your own pace without needing pause and fast forward buttons:

  • if you're 1 in a million in China, there are 1300 people just like you
  • China will soon become the number 1 English speaking country in the world
  • the 25% of India's population with the highest IQ's is greater than the total population of the United States
  • translation: India has more honors kids than America has kids
  • the top 10 in-demand jobs in 2010 did not exist in 2004
  • we are currently preparing students for jobs that don't yet exist, using technologies that haven't been invented, in order to solve problems we don't even know are problems yet
  • US Dept of Labor estimates that today's learner will have 10-14 jobs by the age of 38
  • 1 in 4 workers have been with their current employer less than a year; 1 in 2 have been there less than five years
  • 1 in 8 couples married in the US last year met online
  • if MySpace were a country, its 200 million registered users would make it the 5th largest in the world, between Indonesia and Brazil
  • the #1 ranked country in broadband internet penetration is Bermuda; #19 the US; #22 Japan
  • we are living in exponential times
  • Google searches: 2008, 31 billion/month; 2006, 2.7 billion/month
  • to whom were these questions addressed Before Google?
  • the first commercial text message was sent in Dec 1992; today, the number of text messages sent and received every day exceeds the total population of the planet
  • years it took to reach a market audience of 50 million: radio 38 years; television 13 years; internet 4 years; iPod 3 years; facebook 2 years.
  • in 1984 there were 1,000 internet devices, in 1992 there were 1,000,000, in 2008 there were 1,000,000,000
  • there are about 540,000 words in the English language, 5 X as many as in Shakespeare's time
  • it is estimated that a week's worth of the NY Times contains more information than a person was likely to come across in a lifetime in the 18th century
  • it is estimated that 4 exabytes (4x10^19 bytes) of unique information will be generated this year -- more than in the previous 5,000 years
  • the amount of new technical information is doubling every 2 years; for students in a 4-year degree this means that half of what they learn in their first year of study will be outdated by their third year
  • NTT Japan has successfully tested a fiber optic cable that pushes 14 trillion bits/second down a single strand of fiber -- that is 2,660 CDs or 210 million phone calls every second
  • it is currently tripling every 6 months and expected to do so for the next 20 years
  • by 2013, a supercomputer will be built that exceeds the computational capabilities of the human brain
  • predictions are that by 2049, a $1000 computer will exceed the computational capabilities of the entire human species
  • during the course of this presentation (4:55), 67 babies were born in the US, 274 were born in China, 395 were born in India and 694,000 songs were downloaded illegally
  • credit: Karl Fisch, Scott McLeod, and Jeff Brenman

When you see it like that, not zooming out at you with a soundtrack and a bunch of twee effects, it becomes obvious that there's nothing much there, and what there is, is rather disjointed and incoherent. Many of the factoids look shaky to me, and there are only a couple of references or sources provided (why not provide the others?). I'm not going to bother with a fisking, but here are some obvious eyebrow-raisers:
  • All that stuff about China and India smacks of xenophobic scaremongering to me -- I very much doubt that's the intent, but there's nothing to tie it to the technological stuff, so it starts to sound like "flee, the brown people are coming!"
  • "We are currently preparing..." -- feels good, means nothing; it's just an overblown description of what good teachers have always done.
  • "We are living in exponential times" -- that word ("exponential"), I don't think it means what you think it means...
  • OK, the google searches, text messages and years-to-50-million stuff is neat, though I still want sources.
  • The prefix exa- denotes 10^18; even using the unofficial binary-base interpretation, 4 exabytes is about 4.61 x 10^18 bytes, not 4x10^19. (See what I did there, with the links to my sources? In a slideshow, you can do that with footnotes and a final slide.)
  • In any case, 4 or 40 exabytes of what? How do you define/count "unique information"?
  • Even if we gloss over "unique information", how do any of the other quoted rates of change square with "more than in the previous 5,000 years"? What would that mean for the following 1/5000th of a year (~1.75 hours)? In other words, we must have maxed out -- right?
  • If the optical fiber example needs a human-scale yardstick, so does 4 exabytes -- e.g. if you wrote that data to CD-ROM and covered a football field with the discs, the resulting stack would be about 16 m high, or roughly the height of a four-story house.
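The arithmetic behind the last few bullets can be sketched as below. The disc capacity, disc dimensions and field size are assumptions chosen for illustration (a nominal 700 MB CD, and an American football field including end zones), so the stack height is a ballpark figure only and will shift with different choices.

```python
# Exabyte arithmetic: SI prefix versus the binary-base interpretation.
SI_EXA = 10 ** 18
BINARY_EXA = 2 ** 60

si_bytes = 4 * SI_EXA
binary_bytes = 4 * BINARY_EXA
print(f"4 exabytes, SI:     {si_bytes:.2e} bytes")
print(f"4 exabytes, binary: {binary_bytes:.2e} bytes")

# 1/5000th of a year, in hours.
HOURS_PER_YEAR = 365.25 * 24
slice_hours = HOURS_PER_YEAR / 5000
print(f"1/5000th of a year: {slice_hours:.2f} hours")

# CD stack on a football field (every constant below is an assumption).
CD_CAPACITY_BYTES = 700 * 10 ** 6  # nominal 700 MB disc
CD_DIAMETER_M = 0.12
CD_THICKNESS_M = 0.0012
FIELD_LENGTH_M = 109.7             # 120 yards, end zones included
FIELD_WIDTH_M = 48.8               # 160 feet

discs_needed = si_bytes / CD_CAPACITY_BYTES
# Discs laid flat in a simple square grid, one layer at a time.
discs_per_layer = int(FIELD_LENGTH_M / CD_DIAMETER_M) * int(FIELD_WIDTH_M / CD_DIAMETER_M)
stack_height_m = discs_needed / discs_per_layer * CD_THICKNESS_M
print(f"stack height: roughly {stack_height_m:.0f} m")
```

With these particular assumptions the stack comes out in the high teens of metres, the same order as the figure of about 16 m given above.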

Update, written after all of the above:

It's important to note that although the version discussed above is the only one I'd ever seen before today, it is actually the third version on YouTube and was "remixed" by Sony BMG in August 2008. The original was made by Karl Fisch in August 2006; Scott McLeod's version dates to January 2007 (this was the first one to make it to YouTube and was responsible for the first viral wave); Jeff Brenman created a SlideShare version a couple of months later, and the official version 2.0 was made in consultation with XPLANE in June 2007.

In fairness to Fisch (sounds like a PETA chant), many of the shortcomings of the version that so annoyed me must be laid at the feet of the anonymous Sony drone responsible for the "remix".

Not only did Fisch provide a text version and a list of his sources with version 1.0, but version 2.0 does a better job than the Sony version of acknowledging the sources in the course of the presentation and even comes with its own wiki, mentioned in the presentation. Version 2.0 is also considerably more coherent and much nicer to look at, and does a (somewhat) better job of avoiding the "eek, brown people!" tone. (Fisch says in a couple of places that he and McLeod, in response to criticism, consciously worked to reduce that "us vs them" feeling, and points out here that he views it as largely an unforeseen side-effect of some of the changes between his original PowerPoint version, made for his immediate colleagues, and the first YouTube version.) Finally, kudos for choosing a Creative Commons license (even though I don't like copyleft): although the Sony version leaves this out, all versions are CC-BY-NC-SA (source files are available on the wiki).

In my opinion it's a damn shame that the Sony version took off (at the time of writing, there are two copies on YouTube with 4,458,229 and 29,828 views, respectively). If you come across someone talking about that version, do everyone a favour and point them to version 2.0.

Scholarly (scientific) journals vs total serials: % price increase 1990-2009

Following on from this post, I manually extracted historical data for average scholarly journal prices in a dozen broad disciplines from the Library Journal Annual Periodicals Price Surveys by Lee Van Orsdel and Kathleen Born, and compared these with three datasets from the earlier post: ARL libraries' median total serials expenditures (ARL all serials), Abridged Index Medicus average journal price (AIM) and the consumer price index (CPI):


LJ.png

My concern with the AIM dataset was that it was too small and specialized to support broad conclusions, but it turns out that the AIM data sit somewhere in the middle of the disciplines analysed. Astronomy is closest to the ARL all serials median, with math and computer science not much worse; general science is the worst offender, with engineering and technology, chemistry and food science not far behind. From 1990 to 2008, total price increases ranged from 238% (astronomy) to 537% (general science); that's 3.7 and 8.3 times the increase in the CPI, respectively.
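The "times the CPI" comparison is a simple ratio of percentage increases. A minimal sketch, assuming the CPI rose by about 65% over the period (the figure quoted in the earlier post on the AIM data):

```python
CPI_INCREASE_PCT = 65.0  # approximate 1990-2008 CPI increase, from the earlier post

# Total 1990-2008 price increases quoted above, in percent.
increases = {
    "astronomy": 238.0,
    "general science": 537.0,
}

# How many times larger each increase is than the CPI increase.
fold_vs_cpi = {
    discipline: round(pct / CPI_INCREASE_PCT, 1)
    for discipline, pct in increases.items()
}

for discipline, fold in fold_vs_cpi.items():
    print(f"{discipline}: {fold} times the CPI increase")
```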

This dataset covers an average of around 3600 journals from 2005-2009, 3255 from 1997-2001 and 2655 from 1989-1990. I think this represents good evidence that historical price data for total serials, even though it shows a rate of increase far greater than that of the CPI, masks an even greater rate of increase among scholarly (scientific) journals. It's difficult to look at that graph and believe that scholarly publishers are playing fair, particularly when one remembers that online publishing, with its attendant cost reductions, came of age during the same period of time.

The Van Orsdel/Born surveys include a number of other scholarly disciplines (art, architecture, business, history, language, law, music, etc etc). If I have the time I'll work those up as well, to provide as broad a picture as possible. I should also include numbers of titles in each discipline, to give some idea of total influence. For instance: although general science (around 60 or 70 titles) shows the greatest increase, it likely contributes far less to the serials crisis than health sciences (more than 1500 titles).

(The data are available in this Excel spreadsheet.)



Friday, 17 April
Some wishes come true.

A while back, I posted about my discovery (new to me, though not new to many others) that the serials crisis should probably be called something like the "scholarly journals crisis". The term "serials" includes a wide range of publications, most of which are not peer-reviewed scholarly journals -- newspapers, government reports issued in series, yearbooks, magazines and more. Only about 1/10 of the serials in Ulrich's directory are peer-reviewed. The average scholarly journal costs around 10 times as much as the average serial, and while the cost of the scholarly literature continues to climb, median serial unit costs at ARL libraries have actually been falling for the last seven or eight years (Fig 1 below). It therefore appears that scholarly journals are the driving force behind the serials crisis.

At the time, I wished that I had some specific data to show the difference between scholarly and average serials -- hence the title of this post: via medinfo, I learned that EBSCO Information Services has released a brief report (pdf!) on the price history of well-regarded clinical journals, using 117 titles from the NLM's Abridged Index Medicus (AIM). This is a curated list of biomed journals "of immediate interest to the practicing physician" and can be searched on PubMed as a subset limit named "core clinical journals".

As a reminder, here's that graph; it's from the ARL stats report from 2004-5 and the reason it's famous is the way that "Serials Expenditures" outstrips the Consumer Price Index (CPI) and other measures:


ARL.png



Here's a comparison of that data with the price history of the AIM journals; the line labeled "expser/ARL libraries all serials" shows the 1990-2005 subset of the "Serials Expenditures" data from Fig 1, and "EBSCO/core clinical journals" shows the AIM data:


EBSCO.png

Data labels (ARL data from here):

  • serpur: Current Serials Purchased, median value from all ARL libraries
  • expser: Expenditures for Serials, median etc
  • totsal: Total Salaries & Wages, median etc
  • serunit: Serial Unit Cost; median value of expser/serpur calculated for all ARL libraries
  • EBSCO: average price per journal in the Abridged Index Medicus set
  • CPI-U: Consumer Price Index, all urban consumers, annual average, not seasonally adjusted
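One detail in the labels above is easy to miss: serunit reads as a median of per-library ratios, which is not the same thing as dividing the median expenditure by the median number of serials purchased. A minimal sketch with made-up numbers (the three libraries below are hypothetical, not ARL data):

```python
from statistics import median

# Hypothetical libraries: serials expenditure (dollars) and serials purchased.
libraries = [
    {"expser": 6_000_000, "serpur": 20_000},
    {"expser": 4_000_000, "serpur": 25_000},
    {"expser": 9_000_000, "serpur": 45_000},
]

# serunit as described above: compute the unit cost per library, then take the median.
unit_costs = [lib["expser"] / lib["serpur"] for lib in libraries]
serunit = median(unit_costs)

# A different quantity: the ratio of the two medians.
ratio_of_medians = median(lib["expser"] for lib in libraries) / median(
    lib["serpur"] for lib in libraries
)

print(f"median of per-library unit costs: {serunit:.0f}")
print(f"ratio of medians:                 {ratio_of_medians:.0f}")
```

The two numbers differ for this toy data, so a curve built from serunit need not match what you would get by dividing the expser curve by the serpur curve.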


This is exactly what I wished for, hard evidence of the difference between scholarly and average serials; and what that evidence strongly indicates is that price increases in scholarly journals are driving the serials crisis. Scholarly journals far outstrip total serials in terms of annual price increase, even though the latter shows a much more rapid increase than the CPI. In contrast, library salary expenditure follows the CPI closely, and median serial unit cost (all serials) has been dropping slowly since 2000.

Frankly, I'm tempted to name this the Big Fat Ripoff Graph. Between 1990 and 2008, the CPI increased by about 65%, whereas over the same period the average price of an AIM journal increased by 415%, a 6.4-fold difference. I've seen publishers try to defend the "total serials expenditures" vs CPI discrepancy by pointing out that journals are proliferating -- indeed, the "serials purchased" curve is headed upwards at an increasing rate, particularly over the last five years or so. But that defense is no good against the BFR Graph, on which the most damning curve shows average journal prices. I've also seen comments to the effect that if mean or median serial unit costs are dropping, publishers must be offering increasing value for money even if they are charging more in total. That might be true of the set of "all serials publishers", but it's apparent from the BFR Graph that scholarly journal publishers can make no such claim.
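To put the totals above on a per-year footing, the cumulative increases can be converted to compound annual rates. This is a sketch only, assuming smooth compounding over the 18 years from 1990 to 2008; the `annualized` helper is a hypothetical name.

```python
def annualized(total_increase_pct, years):
    """Convert a cumulative percentage increase into a compound annual rate (percent)."""
    growth_factor = 1 + total_increase_pct / 100
    return (growth_factor ** (1 / years) - 1) * 100


YEARS = 2008 - 1990

aim_rate = annualized(415, YEARS)  # average AIM journal price
cpi_rate = annualized(65, YEARS)   # consumer price index
fold_difference = 415 / 65         # the ratio of total increases quoted above

print(f"AIM journals: about {aim_rate:.1f}% per year")
print(f"CPI:          about {cpi_rate:.1f}% per year")
print(f"fold difference in total increase: {fold_difference:.1f}")
```

Under that assumption the journals come out at a little under 10% a year, against a little under 3% a year for the CPI.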

It must be remembered, of course, that we are only looking at a little over a hundred clinical journals here, a small and discipline-specific subset. Nonetheless, the result is so striking that I think it is a considerable inducement to the gathering of more data. Since it seems my wishes for more work are coming true, I'll make another: now I want price history data for other, larger journal subsets in other scholarly disciplines. I wonder what the BFR Graph looks like for those datasets?

(P.S. If you want the numbers I used, or to check my work, the spreadsheet is here.)


Update: ha! I just got around to reading this article, linked by Peter Suber a couple of days ago; turns out it's full of annual price data, and Van Orsdel and Born have been doing these surveys for at least ten years. There doesn't seem to be a central collection or data collation, so I'll have to piece it together. Stay tuned!



Wednesday, 15 April
What's wrong with copyleft?

This FriendFeed thread regarding the Wikipedia licensing vote has stirred up an old hornet's nest of issues surrounding copyleft and noncommercial clauses in Open licenses. As I said in the thread, I get most of my ideas on this topic from David Wiley, and have posted about those ideas before. Herewith another attempt to organize and clarify my thoughts, as much for my own benefit as anything:


1. The purpose of Open licensing is to enable the following (this is straight from David's Open Education License draft, about which more later):

  • Reuse - Use the work verbatim, just exactly as you found it
  • Rework - Alter or transform the work so that it better meets your needs
  • Remix - Combine the (verbatim or altered) work with other works to better meet your needs
  • Redistribute - Share the verbatim work, the reworked work, or the remixed work with others


2. The purpose of restrictive clauses in such licensing is to prevent specific types of reuse, rework, remix and/or redistribution:

2a. Copyleft prevents future copyright lockup by requiring that all downstream (reworked or remixed) works be similarly licensed.

2b. Noncommercial clauses prevent profitmaking, and are complicated, and I'm not getting any further into it than that right now. (Maybe later, if my brain doesn't melt.)


3. Although copyleft and NC clauses achieve their own immediate goals, widespread license incompatibility1 means that they often (perhaps usually) defeat part of the larger purpose of Open licensing. The use case where this is most prominent is remix2, since reuse and redistribution of individual copylefted or NC-licensed works or their derivatives is usually just a matter of retaining the original license. But multiple works can only be recombined into new works if their respective licenses are compatible -- otherwise, there's no licensing option for the remix that doesn't violate the licensing terms of at least one of the ingredients. Not only that, but if any of the works in the mix carries a copyleft license, that license takes over the entire remix and everything downstream of it, thus propagating the incompatibility problem.


4. One last thing: could copyleft be saved from itself? What if someone wanted copyleft protection, without the compatibility issues? Creative Commons is already beginning to build the only solution I can think of: widespread interoperability agreements between existing and any newly developed copyleft licenses. CC-BY-SA 3.0 contains the following clause:

You may distribute, publicly display, publicly perform, or publicly digitally perform a Derivative Work only under: (i) the terms of this License; (ii) a later version of this License with the same License Elements as this License; (iii) either the Creative Commons (Unported) license or a Creative Commons jurisdiction license (either this or a later license version) that contains the same License Elements as this License (e.g. Attribution-ShareAlike 3.0 (Unported)); (iv) a Creative Commons Compatible License.
where (iv) is defined as
a license that is listed at http://creativecommons.org/compatiblelicenses
Sadly, the cupboard remains bare so far:
Please note that to date, Creative Commons has not approved any licenses for compatibility; however, we are hopeful that we may be able to do so in the future. If you would like to discuss the possible compatibility of your license with a Creative Commons license, please email us at info@creativecommons.org.

I am personally persuaded that the Public Domain is the best way out of the copyleft trap, which is why I use CCZero for everything I make.






-------------
1 Among CC licenses, there is only about 33% compatibility, and that drops to 20% among NC and SA versions -- including self-compatibility*:

cccompatibility.png


Restrictive (NC, SA) versions currently account for around 80% of worldwide CC license uptake. Once you start factoring in the dozens and dozens of other Open/Free licenses out there, it only gets worse. The FSF and OSI maintain lists of licenses and compatibilities (here and here, respectively), and Wikipedia includes a couple of fairly extensive comparison tables. Speaking of Wikipedia, the world's favourite online encyclopaedia is currently released under the GNU Free Documentation License, which is not compatible with any CC license except Public Domain, though it does allow transition to CC-BY-SA. If the current vote on that transition is "yes", that will be a step forward -- but it will still leave Wikipedia with the compatibility problems shown in the figure above. Exploration of compatibility issues with all the other Free/Open licenses is left as an exercise, etc.

* from here and here; green indicates compatibility, light green indicates possible compatibility -- some disagreement between sources.
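The remix problem described in point 3 and in this footnote can be illustrated with a toy model. This is a deliberately simplified sketch, not the method behind the figure above and not legal advice: it covers only the six standard CC licenses, ignores versions, jurisdictions and the interoperability clause, and uses three assumed rules (ND works cannot be remixed, SA works force the remix onto the same license, NC must be preserved downstream). The function names and the choice of "everything except plain BY" as the restrictive set are likewise assumptions.

```python
from itertools import product

# The six standard CC licenses, written as hyphen-joined license elements.
CC_LICENSES = ["BY", "BY-SA", "BY-NC", "BY-ND", "BY-NC-SA", "BY-NC-ND"]


def permits(source, target):
    """Can a work under `source` go into a remix released under `target`?"""
    src = set(source.split("-"))
    tgt = set(target.split("-"))
    if "ND" in src:                       # no derivatives: cannot be remixed at all
        return False
    if "SA" in src and target != source:  # share-alike: remix must carry the same license
        return False
    if "NC" in src and "NC" not in tgt:   # noncommercial must survive downstream
        return False
    return True


def remix_license(a, b):
    """Return the first license that satisfies both inputs, or None if there is none."""
    for candidate in CC_LICENSES:
        if permits(a, candidate) and permits(b, candidate):
            return candidate
    return None


def compatible_pairs(licenses):
    """Count ordered pairs (self-pairs included) that can be remixed together."""
    pairs = list(product(licenses, repeat=2))
    ok = sum(1 for a, b in pairs if remix_license(a, b) is not None)
    return ok, len(pairs)


all_ok, all_total = compatible_pairs(CC_LICENSES)
restricted = [lic for lic in CC_LICENSES if lic != "BY"]
res_ok, res_total = compatible_pairs(restricted)

print(f"all six licenses: {all_ok}/{all_total} pairs compatible")
print(f"without plain BY: {res_ok}/{res_total} pairs compatible")
```

Under these assumptions the toy model gives 12 of 36 pairs for all six licenses and 5 of 25 once plain BY is left out, which lines up with the rough 33% and 20% quoted at the top of this footnote. The copyleft takeover is visible as well: `remix_license("BY", "BY-SA")` returns "BY-SA", while `remix_license("BY-SA", "BY-NC")` returns None.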


2This is why I consider David's "Four R's" formulation so important, because it makes a clear distinction between rework and remix that is essential to understanding the aims and implementation of Open licenses.



Monday, 13 April
Anniversary of sorts

This question from Antony Williams on FriendFeed:

Is PubChem Data Open or not? There are many discussions saying that PubChem data are Open but I see PubChem as a host and the disclaimer does not say "open": http://tinyurl.com/e78as

reminded me that it's almost a year to the day since Egon Willighagen asked a similar question about PubMed Central content:
I was wondering about this section in the CC license of much of the PMC content, such as our paper on userscripts (section 4a of the CC-BY 2.0):
    You may not distribute, publicly display, publicly perform, or publicly digitally perform the Work with any technological measures that control access or use of the Work in a manner inconsistent with the terms of this License Agreement.
CC-BY 3.0 reads differently, but has similar aims. [...] Peter [Murray-Rust, see here] indicates that the NIH has put in place 'technological measures to control access' to the distribution of our work on userscripts (the PMC entry). That is in clear violation of the CC license. [...] What the PMC website should indicate, instead, is that text mining is allowed for the PMC OAI subset, but that they would highly prefer to use the PMC OAI or PMC FTP routes. This is the least they have to do.

No matter what, I still have the feeling that any technical obstacles are disallowed by the CC-license. Any legal expert here, that can explain me if the CC license allows controlling how people have access to my material?

These are both very good questions, and I still don't have an answer for Egon's even after a year. I'm reluctant to go pestering John Wilbanks with every CC-related question I come across, so I'm reposting in the hope that someone will be able to save John from me.

Lazy reporter, no donut.

Dennis Carter in an eCampus News article about NPG's Scitable:

Scitable's January launch came as elite universities across the United States are embracing open-access formats--making research articles available for free online. This marks an abrupt departure from the traditional model of printing research articles in academic journals, which can cost campuses as much as $20,000 annually, open-access experts say.
So, is it the traditional model that can cost campuses up to $20K/yr, or academic journals, each of which can cost etc?

It's only obvious that what is meant is $20K/yr per journal subscription if you already know that libraries spend millions of dollars per year on serials.

I'd expect a publication that wants you to register to read its content1 to bother making that content accurate and unambiguous.


-------------
1 Sure, registration is free. Registration also provides the publisher with a great bolus of immensely valuable marketing information, to say nothing of the slimy opt-out spam opportunity. Which is why I recommend [struck through: poisoning such databases with fake information] providing minimal information unless you get content that you really value from the site. (Two wrongs etc, hence the edit.)

Someone else is fooling around with numbers.

Via Peter Suber, I came across this editorial in the Journal of Vision:

Measuring the impact of scientific articles is of interest to authors and readers, as well as to tenure and promotion committees, grant proposal review committees, and officials involved in the funding of science. The number of citations by other articles is at present the gold standard for evaluation of the impact of an individual scientific article. Online journals offer another measure of impact: the number of unique downloads of an article (by unique downloads we mean the first download of the PDF of an article by a particular individual). Since May 2007, Journal of Vision has published download counts for each individual article.
The author goes on to compare download vs citation (counts and rates, and downloads or citations over time). It's a pretty good analysis of an important topic, but something vital is missing:
Where are the data? Can I have them? What can I do with them?1
In fact, the data are approximately available here. Why "approximately"? Well, I can get a range of predigested overviews: DemandFactor (roughly, downloads/day/first 1000 days) Top 20, total downloads Top 20 and article distributions by DemandFactor and total downloads. I can also get the download information for any given article -- one article at a time, and once again predigested in the form of a graph from which I have to guesstrapolate if I want raw, re-useable data.

This is disappointing, for both general and specific reasons. It's always disappointing to see data locked away in a graph or a pdf or some similar digital or paper oubliette, there to languish un(re)used. It's also disappointing to see a journal getting way out ahead of the curve on something as important and valuable as download metrics (is there another journal besides J Vis that provides this information, even predigested?), and then missing an opportunity to continue to innovate by providing real Open Data.

It's also disappointing in this specific instance, because I have a question: why is Figure 1 plotted on a log scale and, more importantly, was the correlation coefficient calculated from log-transformed data? I could understand showing the log scale for aesthetic reasons, but I can't think of a reason to take logs of that kind of data -- and doing so can alter the apparent correlation. For instance, remember Fig 1 from this post? Here it is again, together with a plot of log-transformed data, both shown on natural and log scales:


[Figure: logarithmssarehard.PNG -- the data from Fig 1 and their log transform, each shown on natural and log scales]
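
To make the point about logs concrete, here is a minimal sketch in Python. The numbers are entirely made up for illustration (they are NOT the Journal of Vision data, which is the whole problem), and `pearson_r` is just a hand-rolled helper, not anything the editorial's author is known to have used. It computes the Pearson correlation coefficient for the same paired counts before and after taking log10 of both variables:

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares about the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical (downloads, citations) pairs: seven ordinary articles
# and one runaway hit. Invented for illustration only.
downloads = [100, 120, 150, 180, 200, 250, 300, 10000]
citations = [5, 2, 8, 1, 6, 3, 7, 200]

# Correlation on the raw counts
r_raw = pearson_r(downloads, citations)

# Correlation after log-transforming both variables
r_log = pearson_r(
    [math.log10(d) for d in downloads],
    [math.log10(c) for c in citations],
)

print(f"r (raw counts):      {r_raw:.3f}")
print(f"r (log-transformed): {r_log:.3f}")
```

With these invented numbers the raw coefficient is dominated by the single runaway article, and taking logs pulls that point in and lowers the coefficient. Other data sets can move the other way, which is exactly why it matters which version was reported.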



I could answer my own question quickly and easily if I could get my hands on the underlying data -- which leads me right back to one of the primary general arguments for Open Data. If I, statistical ignoramus and newcomer to these sorts of analyses, have questions after a brief skim through the paper, what questions might a better equipped and more thorough reader have? It's simply not possible to know -- the only way to find out is to make the data openly available!

I realise it's not possible for journals to demand Open Data from their authors -- that's what funder-level mandates are for, though there's much discussion still to be had regarding whether Open Data mandates would be a good idea. Nonetheless, when journals publish analyses of their own data, it would be great to see them leading the way by providing unrestricted access to that data.

-------------
1 Astute readers, both of you, will remember that [struck through: howl of anguish] refrain from this post.



Saturday, 04 April
Why don't we share data? Not for the reasons Steven Wiley thinks we don't.

Via Peter Suber, I came across an editorial about data sharing in The Scientist. I disagree with the author, PNNL's Steven Wiley, on a number of points:

Despite the appeal of making all biological data accessible, there are enormous hurdles that currently make it impractical. For one, sharing all data requires that we agree on a set of standards. This is perhaps reasonable for large-scale automated technologies, such as microarrays, but the logistics of converting every western blot, ELISA, and protein assay into a structured and accessible data format would be a nightmare -- and probably not worth the effort.

Wiley is making two mistakes here: setting the perfect against the good, and vastly underestimating human ingenuity.

Standards are inarguably required for automated sharing and essential for the sharing of ALL data, but that doesn't mean that sharing SOME data, with evolving standards or even without any standards, has no utility. My pet example is the long-standing practice of supporting scientific claims with the phrase "data not shown" in peer-reviewed papers, something I think should no longer be allowed. All scientific claims should be supported by data. "Data not shown" belongs to the print era, when space was limited and distribution relied on physical reproduction and transport. This is the era of the online supplement, to which no such restrictions apply.

Reasonable people might contend that I am stretching the concept of "data sharing" to cover my pet peeve there, but I chose the example deliberately as an edge case: there is, to me, clear utility in that kind of data sharing, even though it involves no standards, only some data, and only eyeball-by-eyeball access (whereas I myself frequently argue that the greater part of the value of Open distribution probably lies in the long term, in machine-to-machine access). I argue that more sharing, using -- despite their current flaws -- evolving standards, is likely to yield significant dividends well before reaching the eventual goal of sharing all data using universal standards.

This leads me to the second mistake. It seems odd to me to insist that because standards are difficult to develop and implement, the bulk of such work is futile. The key is the phrase "currently... impractical". The whole concept of the internet was probably considered "currently impractical" by a great many people, until someone went and built it. There are plenty of people still willing to pronounce Free/Open Source software "currently impractical", even as they (perhaps unwittingly) rely on it every time they go online or send email. Then-existing hurdles at various times surely made business on the internet "currently impractical", and banking on the internet "currently impractical", and -- need I go on?

Moreover, I am not the only one who disagrees about the value of creating standards for difficult-to-share data. If you think western blots would be a nightmare, how about biodiversity data -- like, say, museum specimens? How about anthropometric data, exchangeable biomaterials, neuroscience data, electron micrographs, magnetic resonance images or microscopy images? The MIBBI project has dozens of other examples, the Open Biomedical Ontologies Foundry is working on dozens more, and Bioformats.org might offer a lightweight solution to some of the same problems.

(In re: Wiley's specific examples: I was easily able to find efforts underway to enable sharing of gel electrophoresis data, protein affinity reagents and molecular interaction experiments; and I can't imagine ELISA data being much harder to share than microarray information -- surely MIAME, for instance, could readily be adapted if it wouldn't already serve? I'm not sure what kind of protein assay Wiley has in mind.)

I cannot begin to imagine how to build semantic and exchange standards for those kinds of data, but I'm not about to bet against the people currently trying to do so; nor do I believe that, once built, their systems will prove to have been "not worth the effort".

As I mentioned, reasonable people might disagree about various points above. But Wiley goes on to say:

Unfortunately, most experimental data is obtained ad hoc to answer specific questions and can rarely be used for other purposes.

which is just plain wrong. Much of the rationale for data sharing, the engine of much of its promise, is the simple observation that you cannot know what someone else will do with your data, particularly when they have access to lots of other people's data to go with it. Re-use beyond the scope of the original author's imagination is a primary impetus for data sharing, and innovative examples abound; for instance, just take a look at Tony Hirst's blog. (If there is a dearth of examples from biomedical research, I'd call that an argument in favor of more, not less, data sharing.)

"Can rarely be used" is an empirical claim, and those should be backed by data -- and I can think of only one way to get the relevant data in this case.

Wiley continues:

Good experimental design usually requires that we change only one variable at a time. There is some hope of controlling experimental conditions within our own labs so that the only significantly changing parameter will be our experimental perturbation. However, at another location, scientists might inadvertently do the same experiment under different conditions, making it difficult if not impossible to compare and integrate the results.

[...] In order to sufficiently control the experimental context to allow reliable data sharing, biologists would be forced to reduce the plethora of cell lines and experimental systems to a handful, and implement a common set of experimental conditions.

Experimental results are supposed to provide useful information about the world of sense-perception. If a result cannot be repeated by different hands in a different lab, then it is probably not telling us what we think it is telling us about the way the world works. If, on the other hand, a particular result does mean what we think it means about the underlying system, then we should be able to design different experiments to be carried out with different hands, conditions, equipment etc., and obtain data that supports the same conclusions. That's what we call a robust result, and standard practice is to aim for robust results.

Regarding integration and comparison of results from different conditions -- just what does meta-analysis mean, if not exactly that? As an example, if you were to knock Pin1 down in HeLa cells, you'd block their growth, but Pin1 knockout mice survive just fine. Comparison of those results is not only possible, but extremely interesting, and is the way we learned that mice have an active Pin1 isoform, Pin1L, which is present but potentially inactive in humans.

I think that variation in conditions between labs is a good reason to build finer-grained semantic structures, but no reason at all to throw up our hands and give up on linked data.

Wiley goes on to give as his sole concrete example the lack of uptake into published papers of data from the Alliance for Cell (sic) Signaling. It's actually the Alliance for Cellular Signaling1; their website lists 20 publications, NextBio finds 35 and Google Scholar (which covers a lot more than peer-reviewed papers) finds 440. Scholarly papers are a somewhat limited measure of research impact, but that's not at first glance an impressive showing. Consider, though, that the AfCS was established in the late 1990s, which puts it well ahead of its time, and then compare the first, second and ongoing third decades of the undisputed poster child of data sharing2:

[Figure: genbankgrowth.PNG -- growth over the first, second and ongoing third decades]

There's more to Wiley's choice of example, though:

In my own case, I am interested in the EGF receptor and receptor tyrosine kinases. This aspect of cell signaling was not covered in their dataset, and thus it is of no interest to me.

I wish I had a dollar for every time I'd heard an argument against some new idea that boils down to: "I can't figure this out, or find a use for it myself; therefore it's no good and will never be any use to anyone". I'm sure there's a pithy Latin name for this particular logical fallacy.

Wiley continues in, as it turns out, a similar vein:

And soon, discussions about the importance of sharing may become moot, since the rapid pace of technology development is likely to eliminate much of the perceived need for sharing primary experimental data. High throughput analytical technologies, such as proteomics and deep sequencing, can yield data of extremely high quality and can produce more data in a single run than was previously obtained from years of work. It will thus become more practical for research groups to generate their own integrated sets of data than try to stitch together disparate information from multiple sources.

And just what does the PNNL's Biomolecular Systems Initiative (of which Wiley is director) do? By a strange coincidence, this:

advancing our high-resolution, high-throughput technologies by exploiting PNNL's strengths in instrument development and automation and applying these technologies to solve large-scale biological problems....

We are building a comprehensive computational infrastructure that includes software for bioinformatics, modeling, and information management. To be more competitive in obtaining programmatic funding, we will continue to invest in new capabilities and technologies such as cell fractionation, affinity reagents, high-speed imaging, affinity pull downs, and ultra-fast proteomics. This will help us build world-class expertise in the generation and analysis of large, heterogeneous sets of biological data. The ability to productively handle extremely large and complex datasets is a distinguishing feature of the biology program at PNNL.

The remainder of this post is left as an exercise for the reader; be sure to cover the question of how less well-heeled institutions are supposed to carry out work in proteomics and deep sequencing and so on, and don't forget to ask for evidence showing that it is not important to share data even between such high-fliers, since presumably they can extract every last conceivable piece of useful information from their own data...


-------------
1 You'd be amazed how many things share that acronym -- activity-friendly communities, antibody-forming cells, ataxia functional composite scale, antral follicle count, alveolar fluid clearance, age at first calving, amniotic something something -- that's where I gave up. Why oh why can't we have a decent text search? Even just "match case" would have solved much of my problem here. /rant

2 graph from here



Wednesday, 01 April
Fooling around with numbers, part 5b.

I've already assigned part 6 to a particular analysis in an effort to get me to actually do that work, but I felt that I just had to include this (via John Wilbanks) in the series:



Lemongraph.jpg



I'm just sayin'. (I may have to get that graph as a tattoo.)


P.S. Never mind the date, this is not a trick; I hate online April Fool jokes with the fiery power of a thousand burning suns.




CC0
To the extent possible under law, I have waived all copyright and related or neighboring rights to this weblog. This work is published from the United States. Further information.




Design thrown together haphazardly by frykitty.
Powered by the inimitable MovableType.