March 2009 Archive
Tuesday, 24 March
Entry for Ada Lovelace Day
Today is Ada Lovelace Day: Ada Lovelace Day is an international day of blogging to draw attention to women excelling in technology.
Since most of my role models who happen to be female are not really in any kind of tech career, I'm spared the need to write the enormous essay that it would take to cover them all. Instead I'll point to just two for whom I can reasonably make a tech connection: Rosie Redfield and Maureen Hoatlin.
I've never met Rosie, who is a PI in the Zoology Department at University of British Columbia, but she is one of the first biomed researchers -- if not the very first -- to embrace Open Science and I've been following her online presence for a couple of years now. From her lab's homepage you can read not just the usual list of publications and personnel, but also submitted research proposals and work in progress. The latter is communicated by blog: Rosie has one, and so do several other lab members. They discuss upcoming and ongoing experiments, work up data and think out loud about their research in general.
I met Maureen after we were both quoted in Mitch Waldrop's SciAm article on Open Science, and she realized that we worked on the same campus. Maureen is a PI in the Biochem Dept at OHSU. She tells a great story about neglecting her family one weekend while she sat in bed reading scientific articles online -- "this changes everything" was all she would say to their pleas for breakfast, etc. Well, Maureen meant what she said, and she's walking the walk. You can find the Hoatlin lab on OpenWetWare, along with a wiki-based, bottom-up, ongoing experiment in improving grad student education that she pioneered, and you can find Maureen on a range of social networking sites including FriendFeed and LinkedIn. Her lab has its own Twitter account.
Since I think this sort of open, collaborative model is very much the way of the future, if science is to have a future at all, I'd like to see Rosie and Maureen get their props for having been such early adopters. It's also worth mentioning that, in addition to still being a Boys' Club in many ways, research is a very conservative environment in which new ideas are usually met with scorn and active resistance. So, having made it up the foodchain in the face of irrational opposition, they are now confronting the same tribe with another set of new and threatening ideas. Both are worthy additions to the Ada Lovelace Day pantheon.
New blog in town.
I don't normally promote new blogs, other than to add them to my blogroll if I think they are worth my readers' time, but I'll make an exception for PLoS ONE's new community blog, EveryONE:
Why a blog and why now? As of March 2009, PLoS ONE, the peer-reviewed open-access journal for all scientific and medical research, has published over 5,000 articles, representing the work of over 30,000 authors and co-authors, and receives over 160,000 unique visitors per month. That's a good sized online community and we thought it was about time that you had a blog to call your own. This blog is for authors who have published with us and for users who haven't and it contains something for everyone.
I hope, and on my better days believe, that PLoS ONE is one of the leading models for the future of scientific journals:
EveryONE is another way for PLoS ONE to engage with their community of readers and contributors, and well worth a look.
Saturday, 21 March
Should we talk about the "journals crisis" instead of the "serials crisis"?
I stumbled upon something new-to-me, and possibly even useful-to-others, in my fooling around with numbers (table 2 and discussion thereof here), but it's somewhat buried under all the "how I made this figure" and "where I got these data" details. For that reason, and because I didn't trust my idea until I had some external reinforcement, I thought I'd give it a separate post all its own.
Here's the thing: what is widely known as the serials crisis in library costs is probably driven largely by the pricing of scholarly journals. In library parlance, "serials" includes, inter no doubt many alia, newspapers, government reports issued in series, yearbooks and magazines (periodicals), in addition to the scholarly literature. Of the 225,000 or so periodicals in Ulrich's, only about 25,000 are peer reviewed. In the FriendFeed discussion started by my post, Walt Crawford said:
...some of us have long argued that there isn't a serials crisis for library budgets, there's a scholarly journal crisis. Magazines (and there are about 1/4 million magazines as compared to about 25,000 scholarly journals) tend to have very low prices and very modest increases.
Although non-refereed serials dominate product counts (and, apparently, library collections), the situation is reversed for unit expenditures. The average unit cost for the UCOSC dataset, which is composed entirely of scholarly journals, is roughly ten times the average unit cost for any of the other datasets I used, all of which were general data that included all types of serial. Here's Walt again:
the 10:1 ratio for UC (that is, scholarly journals averaging 10x as expensive as all serials) sounds about right
When the numbers and Walt's experience began to line up, I became much more confident in my conclusion, that the serials crisis is really a scholarly journals crisis. It's not clear to me, in fact, why the phenomenon got the nickname it did; perhaps it's just that "serials crisis" is a punchier phrase.
I'm not at all sure that any of this is more than semantic nitpicking, but giving things their proper name can be important. Most researchers who only hear the name won't care about a "serials crisis" -- that's a library problem, nothing to do with us. But if they hear about a "scholarly literature crisis", it becomes clearer that the issue is the potential loss of access to resources necessary to do our jobs. I suspect most researchers who've heard of the serials crisis are aware that it is, at least in part, about journal pricing, but I wonder how many know that it's pretty much only about journal pricing? This little "discovery" of mine really did put things in a different perspective for me, and I'm probably more informed about library- and publishing-related issues than most benchmonkeys. I doubt that an alternative name will catch on, and I'm not going to start campaigning for one -- but I think that from now on I'll at least occasionally refer to the "serials/scholarly literature" crisis, or something similar, if only to remind myself of my own little satori. (Question for the lazyweb: can anyone suggest a better phrase, one which would make it more apparent to researchers that they should care about this?)
Thursday, 19 March
Fooling around with numbers, part 5
As promised, here is the distribution of journal prices for the subsets of the Elsevier life sciences dataset which either have or don't have impact factors, and for the entire UCOSC dataset (in which all journals have IFs):
Each interval is $500 wide ($0 to $499, $500 to $999, etc.), and datapoints are plotted at the midpoint of each interval. The conclusion is the same as in part 1, just a bit clearer now. Elsevier journals without an impact factor are priced lower than those which have an IF, and the price distributions are somewhat different between journals with and without an IF. Note, though, that if I'd used a $1000 interval instead of $500, the initial rise in the +IF curves would not appear; if these are power-law distributions the main difference is probably the scaling exponent. I think. (Math is not my friend.)
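For anyone who wants to repeat the binning, here is a minimal sketch of the idea: count prices in fixed-width intervals ($0 to $499, $500 to $999, and so on) and key each count by the interval midpoint, which is where the datapoint gets plotted. The function name and the sample prices are made up for illustration; they are not taken from the Elsevier or UCOSC data, and this is not necessarily how the original figure was produced.

```python
from collections import Counter

def bin_prices(prices, width=500):
    """Count prices per interval of `width` dollars, keyed by interval midpoint."""
    counts = Counter()
    for price in prices:
        lower = int(price // width) * width   # lower edge of the interval
        midpoint = lower + width / 2          # x-position used for plotting
        counts[midpoint] += 1
    return dict(sorted(counts.items()))

# Hypothetical prices, for illustration only (not real journal data).
sample_prices = [120, 480, 510, 975, 1020, 2600]
histogram = bin_prices(sample_prices)             # $500 intervals
coarse_histogram = bin_prices(sample_prices, 1000)  # $1000 intervals
```

Calling the function with `width=1000` merges neighbouring $500 bins, which is the kind of change that can hide a feature such as an initial rise at the low end of a distribution.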
Where I live now: Google street view.
That's almost looking straight down the driveway, and at the end you can see the manager's office; we're two buildings back from that on the right. If the view had been shot from just a few feet to the left, you'd be able to see our parking space (the building is set back a bit too far). Try turning left (click and drag, or use the arrows at top left) and walking (click the arrows on the street, 13-14 times) up to my local 7-11, on the left at the intersection with Stark St, source of much late night soda pop and chocolate. If you keep walking along 148th Ave, you'll come to Burnside and the light rail, which is probably how I'll get to work once I get a job. If you turn left on Stark you'll pass by the site of the photos in the last few entries. The first photo in "new neighbourhood at night" is the building next to the porn shop, the next is the furniture place, the third is from a bit further down -- all on the left; "bird on a wire" was shot in front of the 7-11, looking towards the street. If you go far enough (about 25 blocks) in that direction, you'll find Fandango -- best Mexican food in Portland. If you turn right on Stark instead of left, you have only 14 blocks to Dutch Brothers -- best coffee in Portland. Neat. Almost sorta creepy, but neat.
Wednesday, 18 March
You cannot steal this weblog.
You can't steal it because in every sense that means anything, you already own it. My shiny new Creative Commons Zero license replaces this and this, and places the entire site in the Public Domain. You can take anything you find here, provided I made it myself and have not included it under someone else's terms, and do anything you want with it. You can do things I don't like, you can make money and not give me any, you can attribute the work to me or not, and you can tell me what you're up to or not, as you choose. You don't have to ask first. This site is a free cultural work, dedicated to the commonweal of all persons. It is free as in speech and free as in beer. (Also free as in puppies: don't blame me if something you find here turns out to be more trouble than it's worth.) Following Richard Stallman and David Wiley, you are free to:
new neighbourhood at night
bird on a wire
Being unemployed is not all bad. I like having time to make pictures.
How 'bout it, codemonkey? One for all you web app wizards out there.
A great opportunity has opened up for a code-savvy free culture type to earn a little good karma. Here's the thing:
Now, a bookmarklet seems to me even better than a badge, because it's independent of the blog you're reading, right there on your browser toolbar. When you think to yourself "this is such a good post that I should submit it to The Open Lab", rather than finding the submission form and filling it in or looking to see whether the blog has a badge, you can just hit the bookmarklet. Even better, the bookmarklet can be set up to autofill at least your details, and perhaps to extract information from the page you're on as well. In any case, the various submission mechanisms are not mutually exclusive: there's no reason not to have badges and bookmarklets and anything else the community can think of. I could build one, in principle, since I've hacked around with js a little, but it would take me literally days of screaming frustration to do a half-assed job. Surely there's some web app wizard out there who could whip up something over their lunch break? So -- how about it? Help the Open Laboratory, help the science blogging community in general: build Bora a bookmarklet.
As Bora intimates in his introduction, blogs are conversations and so they lose a certain liveliness when embalmed in a blook (blog + book; don't blame me, I didn't coin it!) like this. Nonetheless, there is some excellent writing in this thing, it is as perfect an introduction to science blogging as you're likely to see offline, and it's a fun read all on its own. True to the open nature of the original medium, you can of course surf over to Bora's blog and find the anthology entries listed there. No one will mind if you do, but I hope you will also consider buying the blook -- which, after all, unlike the internets, you can carry with you on the bus and leave on the break-room table at work. It's priced at cost and any incidental proceeds will go towards next year's edition.
Since then, there have been two subsequent editions, 2007 and 2008, and what I said of the 2006 incarnation remains true (except that incidental proceeds now go towards the Science Online conference). (Incidentally, if you follow those links you can read not only the posts that made it into each anthology but all the entries as well.)
Tuesday, 17 March
Author-side fees in hybrid and OA chemistry journals
Peter Suber, responding to a J Cheminfo paper, wondered what proportion of chemistry journals in the DOAJ charge author-side fees. Since I was in that mode, as it were:
Hybrid journals are those that offer OA-for-a-fee, so of course all of those charge fees. "Open" above refers to Gold OA journals, roughly half of which charge author-side fees in this chemistry subset. This is broadly consistent with the overall DOAJ listing (as of December 2007) and also with several other studies that Peter mentions.
Fooling around with numbers, part 4; or, those data -- you keep using them -- I don't think they mean what you think they mean...
At the end of part 3, having looked at some of the ways in which prices and price/use were distributed, I said I'd try to say something about what constituted a fair price. I hadn't thought that through at all, and it turns out that I really can't get much leverage against that question from the UCOSC dataset alone. In addition to the graphs in parts 1-3, here's yet another way to look at the UCOSC data (again, this is a png from a screenshot because MT ate my
So, I need context: let's start with, how many libraries are there? According to the American Library Association, there are more than 120,000 libraries in the USA -- but for my purposes, I'm really only interested in those which carry the scholarly literature. The US Dept of Education's National Center for Education Statistics runs a Library Statistics Program, which provides data specifically on academic libraries. According to the ALA and the NCES, there are about 3700 academic libraries in the US. If all of them subscribed (at list price) to the 2904 journals in the UCOSC dataset, that would work out to $13,306,150,900 -- about $13 billion -- per year on scholarly journals alone. To put that into perspective, the entire NIH research budget for 2008 was less than $30 billion. I have been told that most libraries don't pay list price, because publishers offer all kinds of deals, but I wondered whether that $13 billion was at least in the right ballpark, so I went looking for more data. Since the UCOSC dataset covers 2003-4, I looked at the NCES report for 2004 (the spreadsheet I used is here). The ALA has another division, the Association of College and Research Libraries, which keeps its own records; alas, these are not free, but I could get nearly everything I wanted from the summaries -- again, I just looked at 2004. There's also the Association of Research Libraries, which is "a nonprofit organization of 123 research libraries at comprehensive, research-extensive institutions in the US and Canada that share similar research missions, aspirations, and achievements", mostly made up of very large libraries (think Harvard, Yale, etc). The ARL also compiles and makes available statistics on its members; I pulled out the 2004 data from the download page (spreadsheet here). 
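The arithmetic behind that $13 billion figure is easy to check. A rough sketch, assuming (as the estimate does) that every library pays full list price for every title: the combined list price per library is not stated above, so it is back-calculated here from the quoted total, and the variable names are mine.

```python
N_LIBRARIES = 3700                  # approximate US academic libraries (ALA/NCES)
N_JOURNALS = 2904                   # titles in the UCOSC dataset
TOTAL_AT_LIST = 13_306_150_900      # total quoted in the text, in dollars

# Implied cost for one library to subscribe to every title at list price.
per_library = TOTAL_AT_LIST / N_LIBRARIES
# Implied mean list price per journal.
mean_list_price = per_library / N_JOURNALS

print(f"per library: ${per_library:,.0f}")
print(f"mean list price: ${mean_list_price:,.2f}")
```

In other words, the estimate implies about $3.6 million per library per year, or a mean list price of a little over $1,200 per journal.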
Finally, I added the UCOSC dataset for comparison, and for extra context I pulled out the University of California subset from the ARL data (Berkeley, Davis, Irvine, LA, Riverside, San Diego and Santa Barbara; I think these are the largest 7 of UC's 10 main campus libraries). The resulting data look like this (footnote 2):
I've put some sanity checks -- do these data make sense? -- in a footnote4; to me, the data appear both externally and internally consistent. I don't, in other words, appear to have done anything egregiously stupid. Not with the numbers, anyway: Two things jump out at me from Table 2, which together are responsible for the subtitle of this entry. First, my $13 billion guess was way off -- the actual amount spent on serials by US academic libraries is probably closer to $1-2 billion. Large (e.g. Ivy League) libraries might spend many tens of millions of dollars, small libraries maybe only a few hundred thousand. That's still an enormous amount of money, but it's not half the NIH budget! So why the discrepancy? Quite apart from "list price" and "what libraries actually pay" being two very different things, I've been making a mistake in terminology. When I think of "serials" in a library, I think of the peer-reviewed scholarly literature; I tend to use "journals" to mean the same thing. This is very, very wrong. (As, no doubt, any librarian could have told me, without the need to go ferreting through all those numbers.) From the NCES survey instrument used to collect their data (emphasis mine): [expenditure] From the ARL ditto: Questions 4-5. Serials. Report the total number of subscriptions, not titles. Include duplicate subscriptions and, to the extent possible, all government document serials even if housed in a separate documents collection. Verify the inclusion or exclusion of document serials... Exclude unnumbered monographic and publishers' series. Electronic serials acquired as part of an aggregated package (e.g., Project MUSE, BioOne, ScienceDirect) should be counted by title. A serial is Oy vey. Newspapers, yearbooks, government documents and a whole bunch of other things that aren't scholarly journals are (or can be) serials too. "Periodicals" means National Geographic qualifies -- hell, so does Playboy magazine! 
As of today (March 17), Ulrich's Periodicals Directory lists 224,151 "active" periodicals; of those, 65,461 are "academic/scholarly"; and of those, 25,425 are "refereed". What do those things cost which aren't part of the peer-reviewed literature? How does their inclusion in library data impact the means and medians I've been looking at? Which brings me to the second item of note from Table 2: the mean cost/serial is on the order of ten times higher for the UCOSC dataset than for the other sets. Does that mean that the scholarly literature is actually the powerhouse of the serials crisis (pdf!), and if we could zero in on the peer-reviewed fraction of the serials data we would see an even more dramatic rise in price? Or does it have more to do with the fact that the UCOSC dataset is deliberately composed of relatively high-end journals, thus artificially inflating the apparent costs? If every library in the NCES set subscribed to those journals at even one-tenth of list price, it would still account for pretty much the entire serials expenditure -- so how many libraries subscribe to which journals? What of the roughly 22,000 peer-reviewed journals that aren't included in the UCOSC dataset? If libraries are subscribing to anywhere from a few thousand serials to well over 100,000 (e.g. ARL 2007 numbers for Columbia, Harvard and Illinois/Urbana), what proportion of those subscriptions are to peer-reviewed journals -- or, conversely, to what proportion of the peer-reviewed literature does the average library subscribe? In other words, I've made no headway at all on the question of a "fair price"; all I've managed to do here is to find more questions. I guess that's progress, because at least they are better-defined, more specific questions. Answering them will require much more fine-grained data, though: which libraries subscribe to which peer-reviewed journals, and at what cost? 
I think the answers might be very useful to the research community, but collecting the data would be a full-time job. (I'm up for it, by the way, if anyone reading this is in a position to hire me to do it. Seriously, I'd love it. After all, look what I'm doing for fun.)
To return to where I started: there's another angle of attack on the "fair price" question, which is to look at things from the other side. How much does it cost to publish a paper in the peer-reviewed literature, and how does that compare to actual income at publishing companies? This information is notoriously hard to come by, but I've been collecting links and notes for a while so in Part * I've just remembered something else I want to do first: Part 5 will take a look at journal price distributions with and without impact factor, using the Elsevier Life Sciences (see Part 1 Fig 3) and the UCOSC datasets.
Update: if you've read this far, go read the FriendFeed discussion, you'll like it.
2 Comma-delimited text file here.
3 The following table shows the figures used to calculate the sum total library expenditure for the ACRL dataset. Numbers in black are taken from the summaries provided, numbers in pink are calculated from them. Table 3 Mean total expenditure per library was calculated using an approximate average number of libraries of 1074.
4 Sanity checks: Internal:
External: Are those reasonable totals for the libraries to be spending?
Are those reasonable total numbers of journals per library?
Are those reasonable mean and median costs per serial?
So, at least in ballpark terms, the numbers in my tables appear to check out against summaries compiled by the various agencies from their own data (and the OHSU library catalog). There are, e.g., no order-of-magnitude discrepancies -- except perhaps in cost/serial, as discussed above.
Monday, 16 March
Updates on "science and selfishness"
Update the first: now I feel bad for not waiting (though I did put "read AFTER honeymoon!!!" in the subject line), but John Wilbanks wrote back right away to say that it will take him a while to get to it, but he will ferret out specific answers regarding the Science Commons work and interoperability. Update the second: Peter Sefton has more here, including specific recommendations for working with Microsoft while avoiding "a new kind of format lock-in; a kind of monopolistic wolf in open-standards lambskin":
Saturday, 14 March
MT weirdness
1. Comments are working again. Thanks to everyone who told me about the problem -- I don't know what it was, but my technical consultant (Spousal Unit) turned off the spam firewall and things look fine.
2. Help me, lazyweb! I can enter html tables just fine, unless there's an image upstream -- then MT inserts a dozen or more <br> tags above the table! I've tried <br clear=all> and every kind of spacing between the table and the character right before it. No amount of text between the table and the image seems to have any effect. My stylesheet is here and you can view source to see the main index, but I can't see any obvious cause of the weirdness.
Friday, 13 March
Fooling around with numbers, part 3; or, why would anyone pay for these journals?
Following on from part 2, I thought I'd ask a couple more questions about price-per-use, based on the online usage stats in the UCOSC dataset. I started on this because I noticed that in Fig 2 of part 2, I'd missed a point: there is an even-further-out outlier above the Elsevier set I pointed out: It's another Elsevier journal, Nuclear Physics B. In 2003, only 1001 online uses were reported to UC by the publisher, but the 2004 list price was $15,360. The companion journal Nuc Phys A is not much better, $10,121 for 1198 uses. Compare that with Nature, 286125 uses at just $1,280! It gets worse, too, because I'm led to believe that anything that appears in a physics journal these days is available ahead of time from the arXiv. I tried to confirm that for Nuc Phys B, but either I'm missing something or the arXiv search function is totally for shit, so I couldn't do it systematically. I did go through the latest table of contents (Vol 813 issue 3) on the Science Direct page, and was easily able to find every paper in the arXiv -- mostly just by searching on author names, though in a couple of cases I had to put titles into Google Scholar. Still, they were all there, which leads me to wonder why any library would buy Nuc Phys B (or Nuc Phys A, assuming it's also covered by the arXiv). Prices haven't improved in the intervening 5 years, either: [I had a table here but Movable Type keeps munging it. Piece of shit. Here's a jpg until I sort it.]
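For reference, the price-per-use figures implied by the numbers quoted above can be worked out directly. This is a small sketch using only the three journals and figures mentioned in this post; the helper name is mine, and treating zero recorded uses as "no value" is my own choice rather than anything in the UCOSC data.

```python
# (2004 list price in dollars, 2003 online uses reported to UC), as quoted above
journals = {
    "Nuclear Physics B": (15_360, 1001),
    "Nuclear Physics A": (10_121, 1198),
    "Nature": (1_280, 286_125),
}

def price_per_use(price, uses):
    """Dollars paid per recorded online use; None when no uses were recorded."""
    return price / uses if uses else None

per_use = {title: price_per_use(p, u) for title, (p, u) in journals.items()}
for title, value in per_use.items():
    print(f"{title}: ${value:.4f} per use")
```

On these figures Nuclear Physics B comes out at roughly $15 per use and Nuclear Physics A at about $8.45, against less than half a cent per use for Nature.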
The curve fits are for the whole of each dataset, even though it's a zoomed view; the Nature set excludes British Journal of Pharmacology, the only NPG title that recorded 0 uses, and Nature itself. Colour coding by publisher is the same for each figure in this post. As in part 2, the correlation between price and use is weak at best and doesn't change much from publisher to publisher. Also, each publisher subset shows a stronger correlation than the entire pooled set -- score another one for Bob O'Hara's suggestion that finer-grained analyses of this kind of data are likely to produce more robust results. Since cutoffs improved the apparent correlation for the pooled set, I tried that with the publisher subsets:
Next, I broke the data out into intervals (for clarity the labels say 0-1, 1-2 etc, but the actual intervals used were 0-0.99, 1-1.99 etc):
So, are these reasonable prices -- $1 per use, $6 per use? I'm not sure I can, but I'll try to say something about that question, using the UCOSC dataset, in Part 4.
Thursday, 12 March
Peters Murray-Rust and Sefton on "science and selfishness"
Peter Murray-Rust (welcome back to blogging!) has replied to Glyn Moody's post about semantic plugins being developed by Science Commons in collaboration with the Evil Empire, which I discussed in my last post. Peter MR takes the view, with which I concur, that it's more important to get scientists using semantic markup than to take an ideological stand against Microsoft:
Microsoft is "evil". I can understand this view - especially during the Hallowee'n document era. There are many "evil" companies - they can be found in publishing (?PRISM), pharmaceuticals (where I used to work) Constant Gardener), petrotechnical, scientific software, etc. Large companies often/always? adopt questionable practices. [I differentiate complete commercial sectors - such as tobacco, defence and betting where I would have moral issues]. The difficulty here is that there is no clear line between an evil company and an acceptable one.
Another, to my mind even more important, point was raised by Peter Sefton in a comment on Peter MR's entry:
I will have to talk about this at greater length but I think the issue is not working with Microsoft it's working in an interoperable way. The plugins coming out of MS Research now might be made by well meaning people but unless they encode their results in something that can interop with other word processors (the main one is OOo Writer) then the effect is to prolong the monopoly. There is a not so subtle trick going on here - MS are opening up the word processing format with one hand while building addons like the Ontology stuff and the NLM work which depend on Word 2007 to work with the other hand. I have raised this with Jim Downing and I hope you can get a real interop on Chem4Word.
(Peter S, btw, blogs here and works on a little thing called The Integrated Content Environment (ICE), which looks to me like a good candidate for an ideal Electronic Lab Notebook...)
There's a difference between the plugins being Open Source and the plugins being useful to the F/OSS community. If collaborators hold Microsoft to real interoperability, the "Evil Empire" concerns largely go away, because the project can simply fork to support any applications other than Word. (I've emailed John Wilbanks to get his reaction to all this, but be patient because he's insanely busy in general, and right now he's on honeymoon!)
Wednesday, 11 March
On science and selfishness.
Glyn Moody has a nice post up about fraternizing with the enemy in Open Science; you should read the whole thing, but here's the gist:
One of the things that disappoints me is the lack of understanding of what's at stake with open source among some of the other open communities. For example, some in the world of open science seem to think it's OK to work with Microsoft, provided it furthers their own specific agenda. Here's a case in point:
John Wilbanks, VP of Science for Creative Commons, gave O'Reilly Media an exclusive sneak preview of a joint announcement that they will be making with Microsoft later today at the O'Reilly Emerging Technology Conference. [...] Microsoft will be releasing, under an open source license, Word plugins that will allow scientists to mark up their papers with scientific entities directly.
Let me say upfront that I mostly agree with Glyn here. Scientists should be at the forefront of abandoning closed for Open wherever possible, because in the long term Open strategies offer efficiencies of operation and scale that closed, proprietary solutions simply cannot match. Having said that -- and most expressly without wishing to put words into John Wilbanks' mouth -- my response to Glyn's criticism is that I think he (Glyn) is seriously underestimating the selfish nature of most scientists. Or if you want to be charitable, the intense pressure under which they have to function. Let me unpack that: For instance: I use Open Office in preference to Word because I'm willing to put up with a short learning curve and a few inconveniences, having (as they say here in the US) drunk the Open Kool-Aid. But I'm something of an exception. Faced with a single difficulty, one single function that doesn't work exactly like it did in Word, the vast majority of researchers will throw a tantrum and give up on the new application.
After all, the Department pays the Word license, so it's there to be used, so who cares about monopolies and stifling free culture and all that hippy kum-ba-yah crap when I've got a paper to write that will make me the most famous and important scientist in all the world? The last part is a (slight) exaggeration, but the tantrum/quit part is not. Researchers have their set ways of doing things, and they are very, very resistant to change -- I think this might be partly due to the kind of personality that ends up in research, but it's also a response to the pressure to produce. In science, only one kind of productivity counts -- that is, keeps you in a job, brings in funding, wins your peers' respect -- and that's published papers. The resulting pressure makes whatever leads to published papers urgent and limits everything else to -- at best -- important; and urgent trumps important every time. Remember the old story about the guy struggling to cut down a tree with a blunt saw? To suggestions that his work would go faster if he sharpened the saw, he replies that he doesn't have time to sit around sharpening tools, he's got a tree to cut down! I said above that scientists should move from closed to Open wherever possible because of long term advantages. I think that's true, but like the guy with the saw, scientists are caught up in short-term thinking. Put the case to most of them, and they'll agree about the advantages of Open over closed -- for instance, I've yet to meet anyone who disagreed on principle that Open Access could dramatically improve the efficiency of knowledge dissemination, that is, the efficiency of the entire scientific endeavour. I've also yet to meet more than a handful of people willing to commit to sending their own papers only to OA journals, or even to avoiding journals that won't let them self-archive! 
"I have a job to keep", they say, "I'm not going to sacrifice my livelihood to the greater good"; or "that's great, but first I need to get this grant funded"; or my personal favourite, "once I have tenure I'll start doing all that good stuff". (Sure you will. But I digress.) So to return to the question at hand: it's a fine thing to suggest that scientists should use Open Office, but I flat-out guarantee you that they never will unless somehow their funding comes to depend on it. Word is familiar and convenient; none of the advantages of Free/Open Source software are sufficiently important to overcome the urgency with which this paper or that grant has to be written up and sent. It's also a great idea to get researchers to start thinking about, and using, markup and metadata and all that chewy Semantic Web goodness, but again I guarantee 100% failure unless you fit it into their existing workflow and habits. If you build your plugins for Open Office, that won't be another reason to use the Free application, it will be another reason to reject semantic markup: "oh yeah, the semantic web is a great idea, yeah I'd support it but there's no Word plugin so I'd have to install Open Office and I just don't have time to deal with that...". When it comes to scientists, you don't just have to hand them a sharper saw, you have to force them to stop sawing long enough to change to the new tool. All they know is that the damn tree has to come down on time and they will be in terrible trouble (/fail to be recognized for their genius) if it doesn't. Tuesday, 10 March
Fooling around with numbers, part 2
Following on from this post, and in the spirit of eating my own dogfood, herewith the first part of my analysis of the U Cali OSC dataset. The dataset includes some 3137 titles with accompanying information about publisher, list price, ISI impact factor, UC online uses and average annual price increase; these measures are defined here. The spreadsheet and powerpoint files I used to make the figures below are available here: spreadsheet, ppt. As a first pass, I've simply made pairwise comparisons between impact factor, price and online use. There's no apparent correlation between impact factor and price, for either the full set or a subset defined by IF and price cutoffs designed to remove "extremes", as shown in the inset figure:
Next I asked whether there was any clearer connection between price and online uses aggregated over all UC campuses:
Finally (for the moment) I played the Everest ("because it's there") card and plotted use against impact factor:
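For anyone who would rather script these pairwise comparisons than click through a spreadsheet, here is a rough Python sketch. It is not how the figures in this post were made (those came from the spreadsheet). The column names `price`, `impact_factor` and `uc_uses` are hypothetical, the sample rows are invented, and Pearson's r is just one reasonable choice of correlation measure.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when one variable never varies
    return sxy / math.sqrt(sxx * syy)


def pairwise_correlations(rows, columns):
    """Correlate every pair of columns, skipping rows with missing values.

    `rows` is a list of dicts, e.g. from csv.DictReader on a delimited file.
    """
    results = {}
    for a, b in combinations(columns, 2):
        pairs = []
        for row in rows:
            try:
                pairs.append((float(row[a]), float(row[b])))
            except (KeyError, TypeError, ValueError):
                continue  # blank or non-numeric cell: leave this title out
        if len(pairs) >= 3:
            xs, ys = zip(*pairs)
            results[(a, b)] = pearson(xs, ys)
    return results


if __name__ == "__main__":
    # Invented rows, only to show the call; not taken from the UC dataset.
    sample = [
        {"price": "1118", "impact_factor": "1.2", "uc_uses": "340"},
        {"price": "450", "impact_factor": "3.4", "uc_uses": "2100"},
        {"price": "2300", "impact_factor": "", "uc_uses": "95"},
        {"price": "800", "impact_factor": "2.0", "uc_uses": "610"},
    ]
    columns = ["price", "impact_factor", "uc_uses"]
    for pair, r in pairwise_correlations(sample, columns).items():
        print(pair, round(r, 3))
```

Titles with a missing value are dropped per pair rather than from the whole table, so a journal with no impact factor still counts in the price versus use comparison.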
Fooling around with numbers
A while back, there was some buzz about a paper showing that, for a particular subset of journals, there was essentially no correlation between Impact Factor and journal subscription price. I think, though my google-fu has failed me, that the paper was Is this journal worth $US 1118? (pdf!) by Nick Blomley, and the journals in question were geography titles. Blomley found "no direct or straightforward relationship" between price and either Impact Factor or citation counts. He also looked at Relative Price Index, a finer-grained measure of journal value developed by McAfee and Bergstrom. He didn't plot that one out, so I will:
There is some circularity here, since RPI is calculated using price, but once again I'd call that no direct or straightforward relationship. All this got me wondering about the same analyses applied to other fields and larger sets of journals. My first stop was Elsevier's 2009 price list, handily downloadable as an Excel spreadsheet. It doesn't include Impact Factors, but the linked "about" page for each journal displays the IF, if it has one, quite prominently. So I went through the Life Sciences journals by hand, copying in the IFs. I ended up with 141 titles with, and 90 titles without, Impact Factors. As with Blomley's set, there was no apparent correlation between IF and price:
Interesting, no? If the primary measure of a journal's value is its impact -- pretty layouts and a good Employment section and so on being presumably secondary -- and if the Impact Factor is a measure of impact, and if publishers are making a good faith effort to offer value for money -- then why is there no apparent relationship between IF and journal prices? After all, publishers tout the Impact Factors of their offerings whenever they're asked to justify their prices or the latest round of increases in same. There's even some evidence from the same dataset that Impact Factors do influence journal pricing, at least in a "we can charge more if we have one" kinda way. Comparing the prices of journals with or without IFs indicates that, within this Elsevier/Life Sciences set, journals with IFs are higher priced and less variable in price:
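The with/without comparison boils down to a mean and a measure of spread for each group. A minimal sketch, assuming each journal is a `(price, impact_factor)` pair with `None` where there is no IF. The coefficient of variation (standard deviation divided by mean) is an assumption of this sketch, chosen as one way to express "less variable" independently of price level; it is not necessarily the measure behind the figure. The prices are invented.

```python
from statistics import mean, stdev


def summarize(prices):
    """Count, mean, sample standard deviation and coefficient of variation."""
    m = mean(prices)
    s = stdev(prices)  # needs at least two prices
    return {"n": len(prices), "mean": m, "stdev": s, "cv": s / m}


def compare_by_if(journals):
    """Split (price, impact_factor) pairs by whether an IF is present."""
    with_if = [price for price, impact in journals if impact is not None]
    without_if = [price for price, impact in journals if impact is None]
    return {"with_if": summarize(with_if), "without_if": summarize(without_if)}


if __name__ == "__main__":
    # Invented prices, only to show the call; not the Elsevier list.
    journals = [
        (1000.0, 2.1), (1200.0, 0.9), (1100.0, 4.3),
        (200.0, None), (800.0, None), (500.0, None),
    ]
    for group, stats in compare_by_if(journals).items():
        print(group, stats)
```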
About the time I was finishing this up, I came across a much larger dataset from U California's Office of Scholarly Communication. I've converted their html tables into a delimited text file, available here: UCOSC.txt. For my next trick I'll see what information I can squeeze out of a real dataset (there are about 3,000 titles in there). Oh, and if anyone wants it, the Elsevier Life Sciences data are in this Excel file: ElsevierLifeSciPriceList.xls.
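Turning html tables into delimited text is the sort of chore that can be scripted with Python's standard library alone. The sketch below is an illustration under assumptions, not the procedure used to make UCOSC.txt: it assumes simple, non-nested tables built from `tr`, `th` and `td` tags, and the class and function names are made up for this example.

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every table cell, one list per table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left unclosed in this row


def html_tables_to_tsv(html):
    """Return the table rows found in `html` as tab-delimited text."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


if __name__ == "__main__":
    # Invented markup, only to show the call.
    page = (
        "<table><tr><th>Title</th><th>Price</th></tr>"
        "<tr><td>Journal A</td><td>$1,118</td></tr></table>"
    )
    print(html_tables_to_tsv(page), end="")
```

Tabs rather than commas are used as the delimiter because prices and journal titles often contain commas.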