meta-science Category Archive
Friday, 05 June
What use are research patents?
DrugMonkey has a conversation going about the ongoing kerfuffle over (micro)blogging of conference presentations (see also the FriendFeed discussion). I want to go off on a tangent from something that came up in his comment thread, so rather than derail it I thought I'd post here. In his first comment in the thread, David Crotty made the following claim:

Lots of researchers support their families and labs through money generated by patents, and most universities are heavily dependent upon their patent portfolios for funding.

That doesn't accord with my (limited!) experience -- I know a few researchers who hold multiple patents, and none of them ever made any money that way -- and my general impression is that the return on investment for tech transfer offices and the like is fairly dismal. This seems like the sort of beans that beancounters everywhere should be counting, so I asked on FriendFeed whether anyone knew of any data to address the question of whether universities really make much money from patents. Christina Pikas pointed me to the Association of University Technology Managers, whose 2007 Licensing Activity Survey is now available.

I extracted data for 154 universities and 27 hospitals and research institutions. Between them, in 2007, these institutions filed 11116 patent applications, were awarded 3512 patents, and gave rise to 538 start-up companies. I calculated licensing income as a percentage of research expenditure: ![]()

Apart from New York University (I wonder what they own that's so profitable?), it's clear that none of these universities are "heavily dependent upon their patent portfolios for funding". In fact, more than half of them (78/154) made less than 1% of their research expenditure back in licensing income, and the great majority (144/154) made less than 10%. Licensing income for Massachusetts General Hospital and "City of Hope National Medical Ctr. & Beckman Research" (whoever they are) amounted to 65-70% of research expenditure, but none of the other hospitals or research institutions made more than 20%. More than half of this group (15/27) made less than 2%, and most of them (23/27) made less than 10%. The distribution looks just about as you would expect: ![]()

I also wondered whether there was any evidence that greater numbers of patents awarded, or more money spent per patent, resulted in higher licensing income. As you can see, the answer is no (insets show the same plots with the circled outliers removed):
I don't know how representative this dataset is; there are several thousand universities and colleges in the US, and surely even more hospitals and research institutions, so the sample size is relatively small. It does include some big names, though - Harvard, Johns Hopkins, MIT, Stanford, U of California -- and I would expect a list of schools answering the AUTM survey to be weighted towards those schools with an emphasis on tech transfer. In any case, I'm not buying David's assertion that "most universities", or most hospitals or research institutes for that matter, rely heavily on licensing income. And that being so, I am also somewhat skeptical about the number of researchers' families being supported by patents. What's the Open Science connection? Well, if you're interested in patenting the results of your research, there are a lot of restrictions on how you can disseminate your results. You can't keep an Open Notebook, or upload unprotected work to a preprint server or publicly-searchable repository, or even in many cases talk about the IP-related parts of your work at conferences. It seems from the data above that most universities would not be losing much if they gave up chasing patents entirely; nor would they be risking much future income, since so few seem to get significant funds from licensing. My own feeling is that any real or potential losses would be much more than offset by the gains in opportunities for collaboration and full exploitation of research data that come with an Open approach. Updates: 1. Christina left a comment pointing out that patents may be required for more than simply making money from licensing: ...an extremely important reason universities patent [is] to protect their work so that they may exploit it for future research... it turns out that universities have to patent in life sciences - even if they don't actively market and license these patents - to be able to attract new research money from industry. 
There are two distinct points here: first, that if you don't patent you may not attract industry partners, and second, that if you don't patent you may end up licensing your own tech back from someone else (I note that most tech licenses I know of are cheap or free "for research purposes" so the latter factor might not weigh so heavily). According to the 2007 AUTM data, industry investment in academic research amounted to about 7% of research expenditure and was up 15% over 2006. 2. David responded on DM's thread with some counter evidence, on reading which I realise that the data above may (likely?) only show what the university received and not any money that went to the labs or researchers involved. Tech transfer may not be financially worth it for the university, except that it might still be doing good things for individual labs and PIs, and so would constitute a support service the university offers its research community. It also strikes me that my experience, such as it is, is mainly with Australian researchers, whereas David's is in the US, so cultural differences may also apply. 3. More from Christina at her own place, here. _____________ Wednesday, 03 June
What happened to serials prices in 1986-87? (Update: probably nothing.)
This could be nothing but an artifact (e.g. of the way the data were collected), but if you look at Fig 1 from this post, there's a clear break in the serials expenses (EXPSER) curve that's not evident in any of the others. Here's the same plot reworked to emphasize what I'm talking about: ![]() If you squint just right you can imagine a similar but much weaker effect, beginning a year or two later, in the total expenditures (TOTEXP) curve; and the salaries (TOTSAL) curve seems to start a similar upward trend at about the same time but then levels off after 1991 or so. I wouldn't put any weight on either of those observations though -- I'd never have noticed either if I hadn't been comparing carefully with the EXPSER curve. I've added linear regression lines for the 1976-1986 and 1987-2003 sections of the EXPSER data, just to emphasize the change in rate of increase. For those of you who will twitch until they know, just 'cos, the regression coefficients of the two lines are 0.99 and 0.98 respectively. If you extrapolate from just the 76-86 section, TOTEXP exceeds the forecast for EXPSER after about 2000. I have no idea if this means anything, but it is tempting to speculate. For instance: when did the big mergers begin in Big Publishing, and when did the big publishing companies start the odious practice of "bundling", that is, selling their subscriptions in packages so that libraries are forced to subscribe to journals they don't want just to get the ones they do? ![]() Saturday, 18 April
Scholarly (scientific) journals vs total serials: % price increase 1990-2009
Following on from this post, I manually extracted historical data for average scholarly journal prices in a dozen broad disciplines from the Library Journal Annual Periodicals Price Surveys by Lee Van Orsdel and Kathleen Born, and compared these with three datasets from the earlier post: ARL libraries' median total serials expenditures (ARL all serials), Abridged Index Medicus average journal price (AIM) and the consumer price index (CPI): ![]() My concern with the AIM dataset was that it was too small and specialized to support broad conclusions, but it turns out that the AIM data sit somewhere in the middle of the disciplines analysed. Astronomy is closest to the ARL all serials median, with math and computer science not much worse; general science is the worst offender, with engineering and technology, chemistry and food science not far behind. From 1990 to 2008, total price increases ranged from 238% (astronomy) to 537% (general science); that's 3.7 and 8.3 times the increase in the CPI, respectively. This dataset covers an average of around 3600 journals from 2005-2009, 3255 from 1997-2001 and 2655 from 1989-1990. I think this represents good evidence that historical price data for total serials, even though it shows a rate of increase far greater than that of the CPI, masks an even greater rate of increase among scholarly (scientific) journals. It's difficult to look at that graph and believe that scholarly publishers are playing fair, particularly when one remembers that online publishing, with its attendant cost reductions, came of age during the same period of time. The Van Orsdel/Born surveys include a number of other scholarly disciplines (art, architecture, business, history, language, law, music, etc etc). If I have the time I'll work those up as well, to provide as broad a picture as possible. I should also include numbers of titles in each discipline, to give some idea of total influence. 
For instance: although general science (around 60 or 70 titles) shows the greatest increase, it likely contributes far less to the serials crisis than health sciences (more than 1500 titles). (The data are available in this Excel spreadsheet.) Friday, 17 April
Some wishes come true.
A while back, I posted about my discovery (new to me, though not new to many others) that the serials crisis should probably be called something like the "scholarly journals crisis". The term "serials" includes a wide range of publications, most of which are not peer-reviewed scholarly journals -- newspapers, government reports issued in series, yearbooks, magazines and more. Only about 1/10 of the serials in Ulrich's directory are peer-reviewed. The average scholarly journal costs around 10 times as much as the average serial, and while the cost of the scholarly literature continues to climb, median serial unit costs at ARL libraries have actually been falling for the last seven or eight years (Fig 1 below). It therefore appears that scholarly journals are the driving force behind the serials crisis.

At the time, I wished that I had some specific data to show the difference between scholarly and average serials -- hence the title of this post: via medinfo, I learned that EBSCO Information Services has released a brief report (pdf!) on the price history of well-regarded clinical journals, using 117 titles from the NLM's Abridged Index Medicus (AIM). This is a curated list of biomed journals "of immediate interest to the practicing physician" and can be searched on PubMed as a subset limit named "core clinical journals".

As a reminder, here's that graph; it's from the ARL stats report from 2004-5 and the reason it's famous is the way that "Serials Expenditures" outstrips the Consumer Price Index (CPI) and other measures: ![]()

Here's a comparison of that data with the price history of the AIM journals; the line labeled "expser/ARL libraries all serials" shows the 1990-2005 subset of the "Serials Expenditures" data from Fig 1, and "EBSCO/core clinical journals" shows the AIM data: ![]()

Data labels (ARL data from here):
This is exactly what I wished for, hard evidence of the difference between scholarly and average serials; and what that evidence strongly indicates is that price increases in scholarly journals are driving the serials crisis. Scholarly journals far outstrip total serials in terms of annual price increase, even though the latter shows a much more rapid increase than the CPI. In contrast, library salary expenditure follows the CPI closely, and median serial unit cost (all serials) has been dropping slowly since 2000. Frankly, I'm tempted to name this the Big Fat Ripoff Graph. Between 1990 and 2008, the CPI increased by about 65%, whereas over the same period the average price of an AIM journal increased by 415%, a 6.4-fold difference. I've seen publishers try to defend the "total serials expenditures" vs CPI discrepancy by pointing out that journals are proliferating -- indeed, the "serials purchased" curve is headed upwards at an increasing rate, particularly over the last five years or so. But that defense is no good against the BFR Graph, on which the most damning curve shows average journal prices. I've also seen comments to the effect that if mean or median serial unit costs are dropping, publishers must be offering increasing value for money even if they are charging more in total. That might be true of the set of "all serials publishers", but it's apparent from the BFR Graph that scholarly journal publishers can make no such claim. It must be remembered, of course, that we are only looking at a little over a hundred clinical journals here, a small and discipline specific subset. Nonetheless, the result is so striking that I think it is a considerable inducement to the gathering of more data. Since it seems my wishes for more work are coming true, I'll make another: now I want price history data for other, larger journal subsets in other scholarly disciplines. I wonder what the BFR Graph looks like for those datasets? (P.S. 
If you want the numbers I used, or to check my work, the spreadsheet is here.) Monday, 13 April
Someone else is fooling around with numbers.
Via Peter Suber, I came across this editorial in the Journal of Vision:

Measuring the impact of scientific articles is of interest to authors and readers, as well as to tenure and promotion committees, grant proposal review committees, and officials involved in the funding of science. The number of citations by other articles is at present the gold standard for evaluation of the impact of an individual scientific article. Online journals offer another measure of impact: the number of unique downloads of an article (by unique downloads we mean the first download of the PDF of an article by a particular individual). Since May 2007, Journal of Vision has published download counts for each individual article.

The author goes on to compare download vs citation (counts and rates, and downloads or citations over time). It's a pretty good analysis of an important topic, but something vital is missing: Where are the data? Can I have them? What can I do with them?

In fact, the data are approximately available here. Why "approximately"? Well, I can get a range of predigested overviews: DemandFactor (roughly, downloads/day/first 1000 days) Top 20, total downloads Top 20 and article distributions by DemandFactor and total downloads. I can also get the download information for any given article -- one article at a time, and once again predigested in the form of a graph from which I have to guesstrapolate if I want raw, re-useable data.

This is disappointing, for both general and specific reasons. It's always disappointing to see data locked away in a graph or a pdf or some similar digital or paper oubliette, there to languish un(re)used. It's also disappointing to see a journal getting way out ahead of the curve on something as important and valuable as download metrics (is there another journal besides J Vis that provides this information, even predigested?), and then missing an opportunity to continue to innovate by providing real Open Data.
It's also disappointing in this specific instance, because I have a question: why is Figure 1 plotted on a log scale and, more importantly, was the correlation coefficient calculated from log-transformed data? I could understand showing the log scale for aesthetic reasons, but I can't think of a reason to take logs of that kind of data -- and doing so can alter the apparent correlation. For instance, remember Fig 1 from this post? Here it is again, together with a plot of log-transformed data, both shown on natural and log scales: I could answer my own question quickly and easily if I could get my hands on the underlying data -- which leads me right back to one of the primary general arguments for Open Data. If I, statistical ignoramus and newcomer to these sorts of analyses, have questions after a brief skim through the paper, what questions might a better equipped and more thorough reader have? It's simply not possible to know -- the only way to find out is to make the data openly available! I realise it's not possible for journals to demand Open Data from their authors -- that's what funder-level mandates are for, though there's much discussion still to be had regarding whether Open Data mandates would be a good idea. Nonetheless, when journals publish analyses of their own data, it would be great to see them leading the way by providing unrestricted access to that data. ------------- Wednesday, 01 April
Fooling around with numbers, part 5b.
I've already assigned part 6 to a particular analysis in an effort to get me to actually do that work, but I felt that I just had to include this (via John Wilbanks) in the series: ![]() I'm just sayin'. (I may have to get that graph as a tattoo). P.S. Never mind the date, this is not a trick; I hate online April Fool jokes with the fiery power of a thousand burning suns. Saturday, 21 March
Should we talk about the "journals crisis" instead of the "serials crisis"?
I stumbled upon something new-to-me, and possibly even useful-to-others, in my fooling around with numbers (table 2 and discussion thereof here), but it's somewhat buried under all the "how I made this figure" and "where I got these data" details. For that reason, and because I didn't trust my idea until I had some external reinforcement, I thought I'd give it a separate post all its own.

Here's the thing: what is widely known as the serials crisis in library costs is probably driven largely by the pricing of scholarly journals. In library parlance, "serials" includes, inter no doubt many alia, newspapers, government reports issued in series, yearbooks and magazines (periodicals), in addition to the scholarly literature. Of the 225,000 or so periodicals in Ulrich's, only about 25,000 are peer reviewed. In the FriendFeed discussion started by my post, Walt Crawford said

...some of us have long argued that there isn't a serials crisis for library budgets, there's a scholarly journal crisis. Magazines (and there are about 1/4 million magazines as compared to about 25,000 scholarly journals) tend to have very low prices and very modest increases.

Although non-refereed serials dominate product counts (and, apparently, library collections), the situation is reversed for unit expenditures. The average unit cost for the UCOSC dataset, which is composed entirely of scholarly journals, is roughly ten times the average unit cost for any of the other datasets I used, all of which were general data that included all types of serial. Here's Walt again:

the 10:1 ratio for UC (that is, scholarly journals averaging 10x as expensive as all serials) sounds about right

When the numbers and Walt's experience began to line up, I became much more confident in my conclusion, that the serials crisis is really a scholarly journals crisis. It's not clear to me, in fact, why the phenomenon got the nickname it did; perhaps it's just that "serials crisis" is a punchier phrase.
I'm not at all sure that any of this is more than semantic nitpicking, but giving things their proper name can be important. Most researchers who only hear the name won't care about a "serials crisis" -- that's a library problem, nothing to do with us. But if they hear about a "scholarly literature crisis", it becomes clearer that the issue is the potential loss of access to resources necessary to do our jobs. I suspect most researchers who've heard of the serials crisis are aware that it is, at least in part, about journal pricing, but I wonder how many know that it's pretty much only about journal pricing? This little "discovery" of mine really did put things in a different perspective for me, and I'm probably more informed about library- and publishing-related issues than most benchmonkeys. I doubt that an alternative name will catch on, and I'm not going to start campaigning for one -- but I think that from now on I'll at least occasionally refer to the "serials/scholarly literature" crisis, or something similar, if only to remind myself of my own little satori. (Question for the lazyweb: can anyone suggest a better phrase, one which would make it more apparent to researchers that they should care about this?) Thursday, 19 March
Fooling around with numbers, part 5
As promised, here is the distribution of journal prices for the subsets of the Elsevier life sciences dataset which either have or don't have impact factors, and for the entire UCOSC dataset (in which all journals have IFs):

Each interval is $500 wide ($0 to $499, $500 to $999, etc), and datapoints are plotted at the midpoint of each interval. The conclusion is the same as in part 1, just a bit clearer now. Elsevier journals without an impact factor are priced lower than those which have an IF, and the price distributions are somewhat different between journals with and without an IF. Note, though, that if I'd used a $1000 interval instead of $500, the initial rise in the +IF curves would not appear; if these are power-law distributions the main difference is probably the scaling exponent. I think. (Math is not my friend.)
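For anyone who wants to try the binning themselves, here is a minimal sketch of the interval-and-midpoint approach described above. The prices are made up purely for illustration (they are not taken from the Elsevier or UCOSC data), the function name `bin_prices` is my own, and I'm assuming half-open intervals, so that a $500 bin covers $0 up to but not including $500 and is plotted at $250.

```python
from collections import Counter


def bin_prices(prices, width=500):
    """Count prices per interval and return (midpoint, count) pairs.

    Intervals are half-open: [0, width), [width, 2*width), and so on.
    Empty intervals are simply omitted from the result.
    """
    counts = Counter(int(p // width) for p in prices)
    return [((k + 0.5) * width, counts[k]) for k in sorted(counts)]


# Hypothetical list prices in dollars, for illustration only.
example_prices = [300, 600, 700, 900, 1200, 1400, 1800, 2600]

narrow = bin_prices(example_prices, width=500)
wide = bin_prices(example_prices, width=1000)

for label, bins in (("$500 bins", narrow), ("$1000 bins", wide)):
    print(label)
    for midpoint, count in bins:
        print(f"  midpoint ${midpoint:.0f}: {count} journal(s)")
```

With these invented numbers the $500 bins rise from the first interval to the second before falling away, whereas the $1000 bins fall from the start, which is the same kind of bin-width effect as the one noted for the +IF curves.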
Tuesday, 17 March
Fooling around with numbers, part 4; or, those data -- you keep using them -- I don't think they mean what you think they mean...
At the end of part 3, having looked at some of the ways in which prices and price/use were distributed, I said I'd try to say something about what constituted a fair price. I hadn't thought that through at all, and it turns out that I really can't get much leverage against that question from the UCOSC dataset alone. In addition to the graphs in parts 1-3, here's yet another way to look at the UCOSC data (again, this is a png from a screenshot because MT ate my
So, I need context: let's start with, how many libraries are there? According to the American Library Association, there are more than 120,000 libraries in the USA -- but for my purposes, I'm really only interested in those which carry the scholarly literature. The US Dept of Education's National Center for Education Statistics runs a Library Statistics Program, which provides data specifically on academic libraries. According to the ALA and the NCES, there are about 3700 academic libraries in the US. If all of them subscribed (at list price) to the 2904 journals in the UCOSC dataset, that would work out to $13,306,150,900 -- about $13 billion -- per year on scholarly journals alone. To put that into perspective, the entire NIH research budget for 2008 was less than $30 billion. I have been told that most libraries don't pay list price, because publishers offer all kinds of deals, but I wondered whether that $13 billion was at least in the right ballpark, so I went looking for more data. Since the UCOSC dataset covers 2003-4, I looked at the NCES report for 2004 (the spreadsheet I used is here). The ALA has another division, the Association of College and Research Libraries, which keeps its own records; alas, these are not free, but I could get nearly everything I wanted from the summaries -- again, I just looked at 2004. There's also the Association of Research Libraries, which is "a nonprofit organization of 123 research libraries at comprehensive, research-extensive institutions in the US and Canada that share similar research missions, aspirations, and achievements", mostly made up of very large libraries (think Harvard, Yale, etc). The ARL also compiles and makes available statistics on its members; I pulled out the 2004 data from the download page (spreadsheet here). 
Finally, I added the UCOSC dataset for comparison, and for extra context I pulled out the University of California subset from the ARL data (Berkeley, Davis, Irvine, LA, Riverside, San Diego and Santa Barbara; I think these are the largest 7 of UC's 10 main campus libraries). The resulting data look like this (see footnote 2):
I've put some sanity checks -- do these data make sense? -- in footnote 4; to me, the data appear both externally and internally consistent. I don't, in other words, appear to have done anything egregiously stupid. Not with the numbers, anyway:

Two things jump out at me from Table 2, which together are responsible for the subtitle of this entry. First, my $13 billion guess was way off -- the actual amount spent on serials by US academic libraries is probably closer to $1-2 billion. Large (e.g. Ivy League) libraries might spend many tens of millions of dollars, small libraries maybe only a few hundred thousand. That's still an enormous amount of money, but it's not half the NIH budget!

So why the discrepancy? Quite apart from "list price" and "what libraries actually pay" being two very different things, I've been making a mistake in terminology. When I think of "serials" in a library, I think of the peer-reviewed scholarly literature; I tend to use "journals" to mean the same thing. This is very, very wrong. (As, no doubt, any librarian could have told me, without the need to go ferreting through all those numbers.)

From the NCES survey instrument used to collect their data (emphasis mine):

[expenditure]

From the ARL ditto:

Questions 4-5. Serials. Report the total number of subscriptions, not titles. Include duplicate subscriptions and, to the extent possible, all government document serials even if housed in a separate documents collection. Verify the inclusion or exclusion of document serials... Exclude unnumbered monographic and publishers' series. Electronic serials acquired as part of an aggregated package (e.g., Project MUSE, BioOne, ScienceDirect) should be counted by title. A serial is

Oy vey. Newspapers, yearbooks, government documents and a whole bunch of other things that aren't scholarly journals are (or can be) serials too. "Periodicals" means National Geographic qualifies -- hell, so does Playboy magazine!
As of today (March 17), Ulrich's Periodicals Directory lists 224,151 "active" periodicals; of those, 65,461 are "academic/scholarly"; and of those, 25,425 are "refereed". What do those things cost which aren't part of the peer-reviewed literature? How does their inclusion in library data impact the means and medians I've been looking at? Which brings me to the second item of note from Table 2: the mean cost/serial is on the order of ten times higher for the UCOSC dataset than for the other sets. Does that mean that the scholarly literature is actually the powerhouse of the serials crisis (pdf!), and if we could zero in on the peer-reviewed fraction of the serials data we would see an even more dramatic rise in price? Or does it have more to do with the fact that the UCOSC dataset is deliberately composed of relatively high-end journals, thus artificially inflating the apparent costs? If every library in the NCES set subscribed to those journals at even one-tenth of list price, it would still account for pretty much the entire serials expenditure -- so how many libraries subscribe to which journals? What of the roughly 22,000 peer-reviewed journals that aren't included in the UCOSC dataset? If libraries are subscribing to anywhere from a few thousand serials to well over 100,000 (e.g. ARL 2007 numbers for Columbia, Harvard and Illinois/Urbana), what proportion of those subscriptions are to peer-reviewed journals -- or, conversely, to what proportion of the peer-reviewed literature does the average library subscribe? In other words, I've made no headway at all on the question of a "fair price"; all I've managed to do here is to find more questions. I guess that's progress, because at least they are better-defined, more specific questions. Answering them will require much more fine-grained data, though: which libraries subscribe to which peer-reviewed journals, and at what cost? 
I think the answers might be very useful to the research community, but collecting the data would be a full-time job. (I'm up for it, by the way, if anyone reading this is in a position to hire me to do it. Seriously, I'd love it. After all, look what I'm doing for fun.)

To return to where I started: there's another angle of attack on the "fair price" question, which is to look at things from the other side. How much does it cost to publish a paper in the peer-reviewed literature, and how does that compare to actual income at publishing companies? This information is notoriously hard to come by, but I've been collecting links and notes for a while so in Part * I've just remembered something else I want to do first: Part 5 will take a look at journal price distributions with and without impact factor, using the Elsevier Life Sciences (see Part 1 Fig 3) and the UCOSC datasets.

Update: if you've read this far, go read the FriendFeed discussion, you'll like it.

2 Comma-delimited text file here.

3 The following table shows the figures used to calculate the sum total library expenditure for the ACRL dataset. Numbers in black are taken from the summaries provided, numbers in pink are calculated from them.
Table 3
Mean total expenditure per library was calculated using an approximate average number of libraries of 1074.

4 Sanity checks: Internal:
External: Are those reasonable totals for the libraries to be spending?
Are those reasonable total numbers of journals per library?
Are those reasonable mean and median costs per serial?
So, at least in ballpark terms, the numbers in my tables appear to check out against summaries compiled by the various agencies from their own data (and the OHSU library catalog). There are, e.g., no order-of-magnitude discrepancies -- except perhaps in cost/serial, as discussed above. Monday, 16 March
Updates on "science and selfishness"
Update the first: now I feel bad for not waiting (though I did put "read AFTER honeymoon!!!" in the subject line), but John Wilbanks wrote back right away to say that it will take him a while to get to it, but he will ferret out specific answers regarding the Science Commons work and interoperability. Update the second: Peter Sefton has more here, including specific recommendations for working with Microsoft while avoiding "a new kind of format lock-in; a kind of monopolistic wolf in open-standards lambskin":
Friday, 13 March
Fooling around with numbers, part 3; or, why would anyone pay for these journals?
Following on from part 2, I thought I'd ask a couple more questions about price-per-use, based on the online usage stats in the UCOSC dataset. I started on this because I noticed that in Fig 2 of part 2, I'd missed a point: there is an even-further-out outlier above the Elsevier set I pointed out: It's another Elsevier journal, Nuclear Physics B. In 2003, only 1001 online uses were reported to UC by the publisher, but the 2004 list price was $15,360. The companion journal Nuc Phys A is not much better, $10,121 for 1198 uses. Compare that with Nature, 286125 uses at just $1,280! It gets worse, too, because I'm led to believe that anything that appears in a physics journal these days is available ahead of time from the arXiv. I tried to confirm that for Nuc Phys B, but either I'm missing something or the arXiv search function is totally for shit, so I couldn't do it systematically. I did go through the latest table of contents (Vol 813 issue 3) on the Science Direct page, and was easily able to find every paper in the arXiv -- mostly just by searching on author names, though in a couple of cases I had to put titles into Google Scholar. Still, they were all there, which leads me to wonder why any library would buy Nuc Phys B (or Nuc Phys A, assuming it's also covered by the arXiv). Prices haven't improved in the intervening 5 years, either: [I had a table here but Movable Type keeps munging it. Piece of shit. Here's a jpg until I sort it.]
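To put the Nuclear Physics and Nature figures quoted above on a common footing, here is a small sketch of the price-per-use arithmetic. The prices and use counts are the ones given in this post (2004 list price, 2003 online uses as reported to UC); the function name `price_per_use` and the choice to return `None` for a journal with no recorded uses are my own assumptions, not anything from the UCOSC dataset.

```python
def price_per_use(list_price, uses):
    """Return list price divided by recorded uses, or None if there were no uses."""
    if uses == 0:
        return None  # assumption: treat zero recorded uses as undefined
    return list_price / uses


# (2004 list price in dollars, 2003 online uses), as quoted in the post.
journals = {
    "Nuclear Physics B": (15360, 1001),
    "Nuclear Physics A": (10121, 1198),
    "Nature": (1280, 286125),
}

for name, (price, uses) in journals.items():
    ratio = price_per_use(price, uses)
    print(f"{name}: ${ratio:.4f} per use")
```

On these numbers Nuclear Physics B works out to roughly $15 per use and Nuclear Physics A to a bit over $8, while Nature comes in at well under a cent.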
The curve fits are for the whole of each dataset, even though it's a zoomed view; the Nature set excludes British Journal of Pharmacology, the only NPG title that recorded 0 uses, and Nature itself. Colour coding by publisher is the same for each figure in this post. As in part 2, the correlation between price and use is weak at best and doesn't change much from publisher to publisher. Also, each publisher subset shows a stronger correlation than the entire pooled set -- score another one for Bob O'Hara's suggestion that finer-grained analyses of this kind of data are likely to produce more robust results. Since cutoffs improved the apparent correlation for the pooled set, I tried that with the publisher subsets:
Next, I broke the data out into intervals (for clarity the labels say 0-1, 1-2 etc, but the actual intervals used were 0-0.99, 1-1.99 etc):
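The interval scheme described above (0-0.99, 1-1.99 and so on, labelled 0-1, 1-2 etc) can be sketched as a floor on the price-per-use value. This is a guess at the mechanics, not a record of what the spreadsheet did: it assumes each interval runs from a whole dollar up to, but not including, the next one, and the function name `interval_label` is hypothetical.

```python
import math


def interval_label(price_per_use):
    """Map a non-negative price-per-use to its whole-dollar interval label."""
    if price_per_use < 0:
        raise ValueError("price per use cannot be negative")
    lower = math.floor(price_per_use)  # 1.99 falls in the 1-2 interval
    return f"{lower}-{lower + 1}"


# Arbitrary illustrative values, not taken from the dataset.
examples = [0.5, 0.99, 1.0, 1.99, 6.25]
labels = [interval_label(x) for x in examples]
print(labels)
```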
So, are these reasonable prices -- $1 per use, $6 per use? I'm not sure I can, but I'll try to say something about that question, using the UCOSC dataset, in Part 4.
Thursday, 12 March
Peters Murray-Rust and Sefton on "science and selfishness"
Peter Murray-Rust (welcome back to blogging!) has replied to Glyn Moody's post about semantic plugins being developed by Science Commons in collaboration with the Evil Empire, which I discussed in my last post. Peter MR takes the view, with which I concur, that it's more important to get scientists using semantic markup than to take an ideological stand against Microsoft:

Microsoft is "evil". I can understand this view - especially during the Hallowee'n document era. There are many "evil" companies - they can be found in publishing (?PRISM), pharmaceuticals (where I used to work) (Constant Gardener), petrotechnical, scientific software, etc. Large companies often/always? adopt questionable practices. [I differentiate complete commercial sectors - such as tobacco, defence and betting where I would have moral issues]. The difficulty here is that there is no clear line between an evil company and an acceptable one.

Another, to my mind even more important, point was raised by Peter Sefton in a comment on Peter MR's entry:

I will have to talk about this at greater length but I think the issue is not working with Microsoft it's working in an interoperable way. The plugins coming out of MS Research now might be made by well meaning people but unless they encode their results in something that can interop with other word processors (the main one is OOo Writer) then the effect is to prolong the monopoly. There is a not so subtle trick going on here - MS are opening up the word processing format with one hand while building addons like the Ontology stuff and the NLM work which depend on Word 2007 to work with the other hand. I have raised this with Jim Downing and I hope you can get a real interop on Chem4Word.

(Peter S, btw, blogs here and works on a little thing called The Integrated Content Environment (ICE), which looks to me like a good candidate for an ideal Electronic Lab Notebook...)
There's a difference between the plugins being Open Source and the plugins being useful to the F/OSS community. If collaborators hold Microsoft to real interoperability, the "Evil Empire" concerns largely go away, because the project can simply fork to support any applications other than Word. (I've emailed John Wilbanks to get his reaction to all this, but be patient because he's insanely busy in general, and right now he's on honeymoon!)
Wednesday, 11 March
On science and selfishness.
Glyn Moody has a nice post up about fraternizing with the enemy in Open Science; you should read the whole thing, but here's the gist:

One of the things that disappoints me is the lack of understanding of what's at stake with open source among some of the other open communities. For example, some in the world of open science seem to think it's OK to work with Microsoft, provided it furthers their own specific agenda. Here's a case in point:

John Wilbanks, VP of Science for Creative Commons, gave O'Reilly Media an exclusive sneak preview of a joint announcement that they will be making with Microsoft later today at the O'Reilly Emerging Technology Conference. [...] Microsoft will be releasing, under an open source license, Word plugins that will allow scientists to mark up their papers with scientific entities directly.

Let me say upfront that I mostly agree with Glyn here. Scientists should be at the forefront of abandoning closed for Open wherever possible, because in the long term Open strategies offer efficiencies of operation and scale that closed, proprietary solutions simply cannot match. Having said that -- and most expressly without wishing to put words into John Wilbanks' mouth -- my response to Glyn's criticism is that I think he (Glyn) is seriously underestimating the selfish nature of most scientists. Or if you want to be charitable, the intense pressure under which they have to function. Let me unpack that:

For instance: I use Open Office in preference to Word because I'm willing to put up with a short learning curve and a few inconveniences, having (as they say here in the US) drunk the Open Kool-Aid. But I'm something of an exception. Faced with a single difficulty, one single function that doesn't work exactly like it did in Word, the vast majority of researchers will throw a tantrum and give up on the new application.
After all, the Department pays the Word license, so it's there to be used, so who cares about monopolies and stifling free culture and all that hippy kum-ba-yah crap when I've got a paper to write that will make me the most famous and important scientist in all the world? The last part is a (slight) exaggeration, but the tantrum/quit part is not.

Researchers have their set ways of doing things, and they are very, very resistant to change -- I think this might be partly due to the kind of personality that ends up in research, but it's also a response to the pressure to produce. In science, only one kind of productivity counts -- that is, keeps you in a job, brings in funding, wins your peers' respect -- and that's published papers. The resulting pressure makes whatever leads to published papers urgent and limits everything else to -- at best -- important; and urgent trumps important every time. Remember the old story about the guy struggling to cut down a tree with a blunt saw? To suggestions that his work would go faster if he sharpened the saw, he replies that he doesn't have time to sit around sharpening tools, he's got a tree to cut down!

I said above that scientists should move from closed to Open wherever possible because of long term advantages. I think that's true, but like the guy with the saw, scientists are caught up in short-term thinking. Put the case to most of them, and they'll agree about the advantages of Open over closed -- for instance, I've yet to meet anyone who disagreed on principle that Open Access could dramatically improve the efficiency of knowledge dissemination, that is, the efficiency of the entire scientific endeavour. I've also yet to meet more than a handful of people willing to commit to sending their own papers only to OA journals, or even to avoiding journals that won't let them self-archive! "I have a job to keep", they say, "I'm not going to sacrifice my livelihood to the greater good"; or "that's great, but first I need to get this grant funded"; or my personal favourite, "once I have tenure I'll start doing all that good stuff". (Sure you will. But I digress.)

So to return to the question at hand: it's a fine thing to suggest that scientists should use Open Office, but I flat-out guarantee you that they never will unless somehow their funding comes to depend on it. Word is familiar and convenient; none of the advantages of Free/Open Source software are sufficiently important to overcome the urgency with which this paper or that grant has to be written up and sent. It's also a great idea to get researchers to start thinking about, and using, markup and metadata and all that chewy Semantic Web goodness, but again I guarantee 100% failure unless you fit it into their existing workflow and habits. If you build your plugins for Open Office, that won't be another reason to use the Free application, it will be another reason to reject semantic markup: "oh yeah, the semantic web is a great idea, yeah I'd support it but there's no Word plugin so I'd have to install Open Office and I just don't have time to deal with that...".

When it comes to scientists, you don't just have to hand them a sharper saw, you have to force them to stop sawing long enough to change to the new tool. All they know is that the damn tree has to come down on time and they will be in terrible trouble (/fail to be recognized for their genius) if it doesn't.
Tuesday, 10 March
Fooling around with numbers, part 2
Following on from this post, and in the spirit of eating my own dogfood, herewith the first part of my analysis of the U Cali OSC dataset. The dataset includes some 3137 titles with accompanying information about publisher, list price, ISI impact factor, UC online uses and average annual price increase; these measures are defined here. The spreadsheet and powerpoint files I used to make the figures below are available here: spreadsheet, ppt. As a first pass, I've simply made pairwise comparisons between impact factor, price and online use. There's no apparent correlation between impact factor and price, for either the full set or a subset defined by IF and price cutoffs designed to remove "extremes", as shown in the inset figure:
Next I asked whether there was any clearer connection between price and online uses aggregated over all UC campuses:
Finally (for the moment) I played the Everest ("because it's there") card and plotted use against impact factor:
Tuesday, 10 March
Fooling around with numbers
A while back, there was some buzz about a paper showing that, for a particular subset of journals, there was essentially no correlation between Impact Factor and journal subscription price. I think, though my google-fu has failed me, that the paper was Is this journal worth $US 1118? (pdf!) by Nick Blomley, and the journals in question were geography titles. Blomley found "no direct or straightforward relationship" between price and either Impact Factor or citation counts. He also looked at Relative Price Index, a finer-grained measure of journal value developed by McAfee and Bergstrom. He didn't plot that one out, so I will:
There is some circularity here, since RPI is calculated using price, but once again I'd call that no direct or straightforward relationship. All this got me wondering about the same analyses applied to other fields and larger sets of journals. My first stop was Elsevier's 2009 price list, handily downloadable as an Excel spreadsheet. It doesn't include Impact Factors, but the linked "about" page for each journal displays the IF, if it has one, quite prominently. So I went through the Life Sciences journals by hand, copying in the IFs. I ended up with 141 titles with, and 90 titles without, Impact Factors. As with Blomley's set, there was no apparent correlation between IF and price:
Interesting, no? If the primary measure of a journal's value is its impact -- pretty layouts and a good Employment section and so on being presumably secondary -- and if the Impact Factor is a measure of impact, and if publishers are making a good faith effort to offer value for money -- then why is there no apparent relationship between IF and journal prices? After all, publishers tout the Impact Factors of their offerings whenever they're asked to justify their prices or the latest round of increases in same. There's even some evidence from the same dataset that Impact Factors do influence journal pricing, at least in a "we can charge more if we have one" kinda way. Comparing the prices of journals with or without IFs indicates that, within this Elsevier/Life Sciences set, journals with IFs are higher priced and less variable in price:
About the time I was finishing this up, I came across a much larger dataset from U California's Office of Scholarly Communication. I've converted their html tables into a delimited text file, available here: UCOSC.txt. For my next trick I'll see what information I can squeeze out of a real dataset (there are about 3,000 titles in there). Oh, and if anyone wants it, the Elsevier Life Sciences data are in this Excel file: ElsevierLifeSciPriceList.xls.
Sunday, 12 October
No one goes into science to get rich.
A while back, Heather posted an entry about salaries in France, and just came right out and said what she makes:

The beginning junior professor (maitre de conférences, or MdC) fresh out of the Ph.D. (which never happens anymore) gets approximately 1700 euros in their pocket after benefits withholding each month, and this measure will bring it up to about 1800 euros. [...] A MdC with 15 years' seniority on the Le Monde comment thread earns 2600 euros a month; I earn 2300. (Unlike the French, I have an American indifference to revealing my salary to all; what with the fluctuating exchange rate it's approximately equivalent to that of a tight-belted American high school teacher.)

I don't know that it's particularly American, but I've never minded telling everyone my income either. I understand that there are lots of reasons why one might be reticent to reveal this information, but by and large I've always felt that such reticence was mostly encouraged by those setting the salary levels, so that they could keep them as low as possible: divide and conquer.

Anyway, Heather's comments got me curious, and I've always been scornful of the numbers available from sites like salary.com as they seem ridiculously inflated to me. Further, most of the survey data I've seen have been like this set from the AAUP or this one (warning: Word doc) from CPST -- no mention of postdocs or grad students at all. When the CPST, for instance, reports a median salary of $80,000/year for "doctoral scientists", believe me when I tell you their numbers are skewed towards faculty! Similarly, The Scientist's annual life sciences survey for 2008 (free but requires registration) lists a median salary for academic scientists of $77,900. When you look at further breakdowns, though, you find that the median for scientists with no supervisory/managerial responsibilities is $49,400/year -- tell that to the next TA, grad student or (junior) postdoc you meet!
So, I went ahead and posted a question -- "how much money do you make?" -- to the Life Scientists room on FriendFeed. There's quite a conversation underway in that thread as I write this; Donnie pointed me to the AAUP survey I linked, others have posted reference material of various kinds, and Daniel reminded me of Mike Barton's bioinformatician survey, the data from which can be downloaded from here. Some workup is available on OpenWetWare, but there's not much there about salary so far, so I went ahead and did a little Excel spreadsheet-ing (shut up, ok, I'm just a biologist) of my own. (Pause here to applaud Mike for all his hard work in collecting this data, and even more loudly for his decision to make it Open.) I removed the entries with no salary information and made three arbitrary decisions: anyone reporting between $1K and $10K per year was actually reporting monthly salary, anyone under $1000/year was probably reporting monthly salary but who knows so I deleted them too, and anyone reporting between $10K and $20K/year didn't entirely make sense as monthly OR yearly so I deleted them too. (I couldn't be arsed to make case-by-case decisions by, for instance, looking at how many years each person had worked in the field.) This left me with n = 490 and a healthy appreciation for careful survey design (read: never give your respondents a free-form field if you can help it!). If you're really keen, you can download the spreadsheet I used from here. The basic outcomes are these: ![]() The categories are as follows:
![]() I dicked about with the outliers a little, but nothing I did improved the curve fit much -- unsurprising, given the spread, and almost certainly meaningless (note, for instance, that it extrapolates to a negative starting salary!). Anyway, there it is; if I get another wild hair I might break out the categories by industry/academia/government, but right now I'm too lazy. If all of this has whetted your appetite for more data, the NSF might have something for you (it's getting late, so I'm not going to dig around in there myself today). The most believable numbers I've seen (viz, the numbers which accord most closely with my experience!) come from the Sigma Xi postdoc survey. You can get hold of the Sigma Xi data; briefly, data were collected from ~7,600 postdocs at >40 institutions, median salary in 1995 = $28,000 ($34,700 in 2004 dollars) and median salary in 2004 = $38,000.
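Going back to the bioinformatician survey for a moment: the three arbitrary cleaning decisions described above amount to a small filter, roughly sketched here in Python. Two things are assumptions rather than anything stated in the post: that a "monthly" figure was converted to annual by multiplying by 12, and how values sitting exactly on the $1K, $10K and $20K boundaries were handled. The function name `clean_salary` is made up.

```python
def clean_salary(raw):
    """Return an annual salary in dollars, or None if the entry is dropped."""
    if raw is None:
        return None          # no salary information given
    if raw < 1_000:
        return None          # too ambiguous to interpret, so dropped
    if raw < 10_000:
        return raw * 12      # treated as a monthly figure (x12 is an assumption)
    if raw < 20_000:
        return None          # makes sense as neither monthly nor yearly
    return raw               # taken at face value as an annual salary


# Made-up entries purely to exercise each rule; these are not survey data.
reported = [None, 800, 3_000, 15_000, 45_000]
cleaned = [s for s in (clean_salary(r) for r in reported) if s is not None]
print(cleaned)
```

Writing the rules out like this also makes the moral plain: a constrained field (annual salary, fixed currency) would have made the whole filter unnecessary.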