Methodology

Wednesday, July 9. 2008

Batch Deposits in Institutional Repositories (the SWORD protocol)

In the context of Nature's just-announced offer to do proxy deposits for its authors, Peter Suber has asked Les Carr of EPrints to comment on whether the software has the capability of downloading and uploading deposits automatically, in batch mode, rather than just singly, with the keystrokes done by hand.

Les Carr's reply is affirmative:

"Both EPrints and DSpace allow batch uploads, but more to the point, both of them support the new SWORD [Simple Web-service Offering Repository Deposit] protocol for making automatic deposits in repositories. We (the SWORD developers) very much hope that we will be able to work with established discipline [i.e., central] repositories to allow automatic feed through of deposits from Institutional Repositories into Discipline Repositories and vice versa."
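To make "batch mode" concrete, here is a minimal sketch of what preparing one deposit request per package might look like. It is only an illustration under stated assumptions: SWORD builds on the Atom Publishing Protocol, in which a deposit is an HTTP POST to a collection URL, but the endpoint URL, the `DepositRequest` type and the `build_batch` helper below are hypothetical, and the real protocol defines further headers that are not shown. Nothing is sent over the network.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass
class DepositRequest:
    """One HTTP POST that would carry a single package to a repository."""
    method: str
    url: str
    headers: dict
    body: bytes


def build_batch(collection_url, packages):
    """Turn a {filename: bytes} mapping into one POST request per package.

    collection_url is a hypothetical deposit endpoint; the requests are
    only constructed here, never sent.
    """
    requests = []
    for filename, payload in sorted(packages.items()):
        name = PurePosixPath(filename).name
        requests.append(DepositRequest(
            method="POST",
            url=collection_url,
            headers={
                # Generic HTTP headers only; protocol-specific ones omitted.
                "Content-Type": "application/zip",
                "Content-Disposition": f"filename={name}",
                "Content-Length": str(len(payload)),
            },
            body=payload,
        ))
    return requests
```

The point of the sketch is simply that a batch is a loop over single deposits, with no keystrokes done by hand.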

Posted by Stevan Harnad in Methodology at 03:36

Tuesday, July 8. 2008

Automatic search for OA versions of cited articles

Matt Cockerill (publisher of BioMed Central) makes the following comment on "The #1 Myth About Open Access":

You take issue with Mike Dunford's comment: "Just what is open access?... In an open access journal, there's no charge for reading articles..." and note that you feel that author deposit of manuscripts in open access repositories, in parallel to the existing subscription-based pay-to-access journals, is a faster and surer way to achieve open access.

But do you not agree that when a reader of an article spots an interesting item in a reference list, and clicks to follow a link to the article concerned, it does not "feel" like open access when they are faced with a publisher's pay-wall asking for a subscription or per-article fee to view the article. Of course, there are several ways they may be able to view a version of the article without paying. The would-be reader could search the net to see if they can track down a free copy of the article in a repository; they can send an email request to the author (who is hopefully not on holiday and has a legally sharable electronic version to hand); or they can try their luck down at their local library. Fair enough. But you must have some sympathy with a reader who would prefer simply to click a link and get straight to the article concerned, without being challenged to provide credit card details. It's not such a bad definition of open access.

That's exactly why Mike Jewell created Paracite. It would be a piece of cake to set up a bit of software that automatically transformed text that one highlights in a reference list into a Paracite or Google Scholar query. The only reason no one has yet bothered to create that piece of software is that most of that potential content is not yet OA. But Green OA self-archiving and mandates will take care of that...
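As a rough illustration of how little software is involved, the sketch below turns a highlighted reference string into a Google Scholar search URL. The function name `scholar_query_url` is hypothetical, only the Google Scholar case is shown, and the browser-side step of capturing the highlighted text is assumed to happen elsewhere.

```python
from urllib.parse import urlencode


def scholar_query_url(reference_text):
    """Build a Google Scholar search URL from a highlighted reference."""
    # Collapse line breaks and repeated spaces left over from the selection.
    cleaned = " ".join(reference_text.split())
    return "https://scholar.google.com/scholar?" + urlencode({"q": cleaned})


# Example: a reference as it might be highlighted in a reference list.
print(scholar_query_url(
    "Brody, T., Harnad, S. and Carr, L. (2006) Earlier Web Usage "
    "Statistics as Predictors of Later Citation Impact"
))
```

Whether the resulting search actually finds a free version depends, of course, on whether the article has been self-archived.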

Stevan Harnad
American Scientist Open Access Forum

Posted by Stevan Harnad in Methodology at 04:25

Thursday, October 18. 2007

Time to Update the BBB Definition of Open Access

[Update: See new definition of "Weak" and "Strong" OA, 29/4/2008]

SUMMARY: The definition of Open Access (OA) is still young, and not yet etched in stone; it stands only to benefit from a rational, corrective update. Parts of the (increasingly gilded) BBB formulation turn out to have been unnecessary, counterproductive, and even incoherent. The right to re-publish, re-sell and create derivative works may be essential for Free Online Scholarship (FOS), and for the Creative Commons, but they are not essential for OA, and it would be an unnecessary, self-imposed handicap to insist that they should be, merely raising barriers to OA where there are and need be none. It is a good idea for authors to retain extra rights for their published articles, wherever possible, but it is definitely not a necessary prerequisite for Green OA self-archiving, nor for Green OA self-archiving mandates.
   For the 62% of articles published in the Green journals that have explicitly endorsed the immediate OA self-archiving of the author's postprint, no further rights are needed to self-archive it, hence no further rights need to be negotiated as a precondition. And robotic harvesting and data-mining (Google, Scirus, OAIster) all come with the free online territory as surely as individual usage does.
   For the 38% of articles published in non-Green journals, authors can still deposit them in their Institutional Repositories (IRs), immediately upon acceptance for publication, setting access as Closed Access. With the help of the IR's "Email Eprint Request" Button, this provides for (1) accessing, (2) reading, (3) downloading, (4) storing, (5) printing-off, (6) individual data-mining, and (7) re-using content (but not text) in further publications. That is still not OA; it is only almost-OA: Missing is full-text (8*) robotic harvesting and (9*) robotic data-mining.
   If all or most universities already mandated exception-free immediate-deposit as above (Open Access for the Green 62% and Closed Access for the non-Green 38%), there would be no problem at all about then going still further and trying to negotiate the retention of more rights -- even unnecessary ones! But instead declaring successful rights retention to be a precondition (by BBB "definition") would simply hamstring both self-archiving and the adoption of self-archiving mandates, and hence the advent of OA itself.

On Mon, 15 Oct 2007, Frederick Friend wrote:

"I also agree with [Peter Suber, Peter Murray-Rust and Robert Kiley] that the UKPMC re-use agreement is vital for future academic developments. With hindsight we were too slow to pick up on the significance of the changes to copyright transfer agreements in the 1990s by which authors now assign all electronic rights to publishers. Blanket assigning of electronic rights has created and is still creating barriers in the electronic re-use of subscription content. We cannot afford to make the same mistake of neglect on the arrangements for academic re-use of OA content, whether green or gold."

I am afraid that this is more a matter of misunderstanding than of disagreement:

(1) The disagreement (with PS, PM-R and RK) was not about whether or not it is a good idea for the author to retain certain electronic rights. (It is a good idea for the author to do so, wherever possible. However, rights-retention is not a necessary prerequisite for Green OA self-archiving, nor for Green OA self-archiving mandates. Hence it would be a big mistake to imply otherwise: i.e., to imply that authors cannot self-archive, and/or their institutions/funders cannot mandate that they self-archive, until/unless the author successfully negotiates rights-retention. That would not only be incorrect, but it would be a gratuitous deterrent to self-archiving and to self-archiving mandates, hence to OA.)

(2) The disagreement was instead about:
(2a) whether or not certain electronic rights (not the same rights as in (1), above, by the way!), provided by certain Gold OA copyright agreements, were indeed necessary in order for research and researchers to derive the full benefits of OA (they are not)

and

(2b) whether or not there existed any further necessary rights or capabilities over and above those already inherent in Green OA self-archiving -- rights that therefore either had to be successfully negotiated with a non-OA publisher or had to be purchased from a Gold OA publisher in order to render an article OA (there are none)

I am sorry to sound like a pedant, but these details are devilishly important, and need to be understood quite explicitly:

Concerning (1) (i.e., rights retention as a prerequisite to Green OA self-archiving), what I said was that for the 62% of articles published in Green journals -- i.e., those that have explicitly endorsed the immediate OA self-archiving of the postprint (whether final draft or PDF) -- no further rights are needed to self-archive them, hence no further rights need to be negotiated as a precondition for self-archiving. The self-archived work is "protected" by standard copyright, and it is also OA, with all the attendant usage capabilities (of which I listed nine, covering all uses that research and researchers require, which are also all the self-same uses for which OA itself was proposed).

I also said that for the 38% of articles published in non-Green journals -- i.e., those journals that have not yet explicitly endorsed the immediate OA self-archiving, by the author, of the postprint (whether final draft or PDF) -- the strategy that I recommend is (a) mandated Immediate Deposit, Optional Closed Access and reliance on the semi-automatic "Email Eprint Request" Button to cover usage needs during the embargo.

I agreed, however, that it is possible to disagree on this strategic point, and to prefer instead (b) to try to negotiate rights retention with the non-Green publisher or else to (c) publish instead with a Gold OA publisher that provides the requisite rights. There is of course nothing at all wrong with strategy (b) and/or (c) as a matter of individual choice in each case. But strategy (a) is intended as the default strategy for facilitating exception-free self-archiving, and especially for facilitating the adoption of legal-objection-immune, exception-free self-archiving mandates.

So far, the only "right" at issue is the right to self-archive -- the right to provide immediate Green OA. It is that Green OA that I was arguing was sufficient to provide full OA (and the 62% of journals that are Green have already endorsed it).

But now we come to (2): certain "re-use" rights and capabilities that purportedly go beyond those that already come with the territory, with Green OA self-archiving. Now we are no longer speaking of the right to self-archive, obviously, but of the right to create certain kinds of "derivative works" that one may re-publish (and perhaps even re-sell).

What I said there was that the rights to re-publish, re-sell, and create derivative works for re-publication or re-sale are not part of OA. They are something extra (approaching certain kinds of Creative Commons Licenses). Most important, those extra rights are not necessary for research and researchers, they go far beyond OA, and they would handicap OA's already too-slow progress towards universality if added as a gratuitous extra precondition on counting as "full-blooded OA."

The very idea that these extra rights are needed comes not from the intuitions of the library community about how to include subscription content in course-packs -- those needs are trivially fulfilled by inserting the URLs of Green OA postprints in the course-packs, instead of inserting the documents themselves! -- but from intuitions about data-mining from (some sectors of) the biological and chemical community (inspired largely by the data-sharing of the human genome project as well as similar chemical-structure data-sharing in chemistry).

There are very valid concerns about research data sharing: note that such data are typically not contained in published articles, but are supplements to them that until the online era had no way of being published at all, because the data-sets were too big. So the concern is about licensing these data to make them openly accessible and to prevent their ever becoming subject to the same access-barriers as subscription content.

This is a very important and valid goal but, strictly speaking, it is not an OA matter, because these research data are not part of the published content of journal articles! So, yes, providing online access to these data does definitely require explicit rights licensing, but no one is stopping their authors (the holders of the data) from adopting those licenses! (The appropriate CC licenses exist.) And there's certainly no reason to pay a Gold OA publisher for those extra rights or rights agreements for data, which are hitherto unpublished content that can now be licensed and self-archived directly.

This brings us to the second case, the case that I suspect those who see an extra rights problem here have most in mind: It concerns the content of published journal articles, both inasmuch as the articles may indeed contain some primary data, as opposed to merely summaries, descriptions and analyses, and inasmuch as the article texts themselves can be seen as constituting potential data. This is where data-mining rights and derivative-works rights come in: "Naked" Green OA -- simply making these published full-texts accessible online, free for all -- is not enough (think these theorists) to guarantee that robots can data-mine their contents and that the results can be made accessible (published, or re-published) as "derivative works," unless those extra "rights" (to data-mine and create derivative works) are explicitly licensed.

My reply is very simple: robotic harvesting and data-mining come with the free online territory as surely as individual use does. Remember that we are talking about authors' self-archived postprints here, not the publishers' proprietary PDFs, whether Gray or Gold. If the journal is Green, it endorses the author's right to deposit the postprint in his OA IR. The rest (individual accessibility, Google, Scirus, OAIster, robotic harvestability, and data-mining) all come with that Green OA territory. So the contention is not about the Green OA self-archiving of the postprints published in the 62% of journals that are Green.

Is the contention then about the 38% of articles published in non-Green journals? I agree at once that if the author feels he cannot make those articles Green OA immediately, and instead deposits them as Closed Access, then, with the help of the IR's "Email Eprint Request" Button, only re-use capabilities (1)-(7) [(1) accessing, (2) reading, (3) downloading, (4) storing, (5) printing, (6) individual data-mining, and (7) re-using content (but not text) in further publications] are possible.

This is definitely not OA; it is merely almost-OA. Missing is full-text (8*) robotic harvesting and (9*) robotic data-mining. If, to try to avoid this outcome, an author who fully intends to deposit his postprint immediately upon acceptance, regardless of the outcome, first elects to try to negotiate the retention of more rights with his publisher -- or even elects to publish with a paid Gold publisher rather than deposit as Closed Access, with almost-OA -- that's just fine! That author is intent on self-archiving either way. The problem with holding out for and insisting upon more rights is (1) the author who would not deposit except if the publisher was Green (or Gold), and -- even more important -- (2) the institutions that would not mandate depositing except if all publishers were already Green (or Gold).

It is those authors and those institutions that are the main retardants on universal OA today. If most universities already mandated immediate-deposit either way (OA or CA), I would do nothing but applaud the efforts to negotiate the retention of more rights -- even unnecessary ones! -- But it would still remain true that no rights retention at all was necessary in order to deposit all postprints (and attain almost-OA), and that only a publisher endorsement of Green OA self-archiving was needed to attain full OA (1-9*). And it would remain true that re-publication, re-sale and "derivative-works" rights had nothing to do with either OA or the real needs of research and researchers.

[I am not, by the way, dear readers, "adulterating" OA; I am accelerating it, whereas those who are needlessly raising the barriers are (unintentionally) retarding it. Nor do, did or will I ever -- even should the string of B's get still longer! -- accept those parts of the increasingly gilded BBB "definition" of OA that are and ever have been unnecessary or incoherent, hence counterproductive for OA itself -- although I shame-facedly confess to having failed to pick up on that incoherence immediately in B1. That's what comes of being slow-witted. Blackballed from B2, I (with many others) was merely window-dressing at B3, which was really just, by now, ritually reiterating B2. If there is any "permission" barrier at all, it is a psychological one, and it pertains only to the "permission" to provide Green OA, no more -- something I always carefully call "endorse" (or sometimes "bless") rather than "permit" or "allow," because I think that's all just a matter of Wizard of Ozery too, and will be seen to have been such in hindsight, once this maddeningly molluscan trek to the optimal and inevitable is at long last behind us...]

One last point -- made in full respect and admiration for Peter Suber. Peter understands every word I am saying and always has. His position, of all the people on this planet, is closest to my own. But Peter in fact has grander goals than I do. His "FOS" (Free Online Scholarship) movement predated OA, and had a much bigger target: It included no less than all of scholarship, online: not just journal articles, but books, multimedia, teaching materials, everything. And the freedom was a greater freedom than freedom to access and use the scholarship.

I greatly value, and fully support Peter's wider goals. But I don't think they are just OA. They are FOS. (I shall be remembered only as an impatient, testy, parochial OA archivangelist, whereas Peter will be rightly recognised as the patient, temperate, ecumenical archangel of FOS.) But OA does have the virtue of being the easier, nearer, surer subgoal.

I think that every time a little divergence arises between Peter and me, it is always a variant of this: He still has his heart and mind set on FOS, and it is good that he does. Someone eventually has to fight that fight too. But OA is narrower than that, and it is also nearer; indeed it is within reach. Hence it is ever so important that we should not over-reach, trying to attain something that is further, and more complicated than OA, when we don't yet even have OA! For we thereby risk needlessly complicating and further delaying the already absurdly overdue attainment of OA.

I think that is what is behind our strategic difference on (1) whether OA requires the elimination of all "permission" barriers or (2) whether, after all, the elimination of all "price" barriers -- via Green OA self-archiving (which is and always has been my model, and my ever-faithful "intuition pump") -- does give us all the capabilities worth having, and worth holding out for. Re-publication rights and the right to create derivative works may be essential for FOS, and for the Creative Commons in general. But they are not essential for OA in particular; and it would be an unnecessary, self-imposed handicap to insist that they should be. That would merely raise barriers for OA where there are and need be none.

Stevan Harnad
American Scientist Open Access Forum

Posted by Stevan Harnad in Methodology at 02:24

Tuesday, June 12. 2007

Open Access: What Comes With the Territory

SUMMARY: Downloading, printing, saving and data-crunching come with the territory if you make your paper freely accessible online (Open Access). You may not, however, create derivative works out of the words of that text. It is the author's own writing, not an audio for remix. And that is as it should be. Its contents (meaning) are yours to data-mine and reuse, with attribution. The words themselves, however, are the author's (apart from attributed fair-use quotes). The frequent misunderstanding that what comes with the OA territory is somehow not enough seems to be based on conflating (1) the text of research articles with (2a) the raw research data on which the text is based, or with (2b) software, or with (2c) multimedia -- all the wrong stuff and irrelevant to OA.

Peter Murray-Rust's worries about OA are groundless. Peter worries he can't be sure that:

"I can save my own copy (the MIT [site] suggests you cannot print it and may not be allowed to save it)"

Pay no attention. Download, print, save and crunch (just as you could have done if you had keyed in the text from reading the pages of a paper book)! [Free Access vs. Open Access (Dec 2003)]

"that it will be available next week"

It will. The University OA IRs all see to that. That's why they're making it OA. [Proposed update of BOAI definition of OA: Immediate and Permanent (Mar 2005)]

"that it will be unaltered in the future or that versions will be tracked"

Versions are tracked by the IR software, and updated versions are tagged as such. Versions can even be DIFFed.

"that I can create derivative works"

You may not create derivative works. We are talking about someone's own writing, not an audio for remix. And that is as it should be. The contents (meaning) are yours to data-mine and reuse, with attribution. The words, however, are the author's (apart from attributed fair-use quotes). Link to them if you need to re-use them verbatim (or ask for permission).

"that I can use machines to text- or data-mine it"

Yes, you can. Download and crunch away.

This is all common sense, and all comes with the OA territory when the author makes his full-text freely accessible for all, online. The rest seems to be based on some conflation between (1) the text of research articles and (2a) the raw research data on which the text is based, and with (2b) software, and with (2c) multimedia -- all the wrong stuff and irrelevant to OA.

Stevan Harnad
American Scientist Open Access Forum

Posted by Stevan Harnad in Methodology at 01:00

Thursday, June 7. 2007

British Classification Society post-RAE Scientometrics

British Classification Society Meeting, "Analysis Methodologies for Post-RAE Scientometrics", Friday 6 July 2007, International Building, room IN244, Royal Holloway, University of London, Egham

The selection of appropriate and/or best data analysis methodologies is a result of a number of issues: the overriding goals of course, but also the availability of, and ease of access to, well-formatted data. The meeting will focus on the early stages of the analysis pipeline. An aim of this meeting is to discuss data analysis methodologies in the context of what can be considered as open, objective and universal in a metrics context of scholarly and applied research.

Les Carr and Tim Brody (Intelligence, Agents, Media group, Electronics and Computer Science, University of Southampton): "Open Access Scientometrics and the UK Research Assessment Exercise"

   Harnad, S. (2007) Open Access Scientometrics and the UK Research Assessment Exercise. In Proceedings of 11th Annual Meeting of the International Society for Scientometrics and Informetrics (in press), Madrid, Spain.
   Brody, T., Harnad, S. and Carr, L. (2006) Earlier Web Usage Statistics as Predictors of Later Citation Impact. Journal of the American Society for Information Science and Technology (JASIST) 57(8) pp. 1060-1072.
   Carr, L., Hitchcock, S., Oppenheim, C., McDonald, J. W., Champion, T. and Harnad, S. (2006) Extending journal-based research impact assessment to book-based disciplines.

Posted by Stevan Harnad in Methodology at 02:46

Saturday, May 26. 2007

Craig et al.'s Review of Studies on the OA Citation Advantage

Update Jan 1, 2010: See Gargouri, Y; C Hajjem, V Larivière, Y Gingras, L Carr, T Brody & S Harnad (2010) “Open Access, Whether Self-Selected or Mandated, Increases Citation Impact, Especially for Higher Quality Research”
Update Feb 8, 2010: See also "Open Access: Self-Selected, Mandated & Random; Answers & Questions"

SUMMARY: The thrust of Craig et al.'s critical review (which was proposed by the Publishing Research Consortium and conducted by the staff of three publishers) is that despite the fact that virtually all studies comparing the citation counts for OA and non-OA articles keep finding the OA citation counts to be higher, it has not been proved beyond a reasonable doubt that the relationship is causal.
   I agree: It is merely highly probable, not proved beyond a reasonable doubt. And I also agree that not one of the studies done so far is without some methodological flaw that could be corrected. But it is also highly probable that the results of the methodologically flawless versions of all those studies will be much the same as the results of the current studies. That's what happens when you have a robust major effect, detected by virtually every study, and only ad hoc methodological cavils and special pleading to rebut each of them with. Here is a common sense overview:
(1) Research quality is a necessary, but not a sufficient condition for citation impact: The research must also be accessible to be cited.
(2) Research accessibility is a necessary but not a sufficient condition for citation impact: The research must also be of sufficient quality to be cited.
(3) The OA impact effect is the finding that an article's citation counts are positively correlated with the probability that that article has been made OA: The more an article's citations, the more likely that that article has been made OA.
(4) That correlation has at least three (compatible) causal interpretations:
   (4a) OA articles are more likely to be cited.
   (4b) More-cited articles are more likely to be made OA.
   (4c) A third factor makes it more likely that certain articles will be both more cited and made OA.
(5) Each of these causal interpretations is probably correct, and hence a contributor to the OA impact effect:
   (5a) The better the article, the more likely it is to be cited, hence the more citations it gains if it is made more accessible (4a). (OA Article Quality Advantage, QA)
   (5b) The better the article, the more likely it is to be made OA (4b). (OA Article Quality Bias, QB)
   (5c) 10% of articles (and authors) receive 90% of citations. The authors of the better articles know they are better, and hence are more likely both to be cited and to make their articles OA, so as to maximize their visibility, accessibility and citations (4c). (OA Author QB and QA)
(6) In addition to QB and QA, there is an OA Early Access effect (EA): providing access earlier increases citations.
(7) The OA citation studies have not yet isolated and estimated the relative sizes of each of these (and other) contributing components. (OA also gives a Download Advantage (DA), and downloads are correlated with later citations; OA articles also have a Competitive Advantage (CA), but CA will vanish -- along with QB -- when all articles are OA).
(8) But the handwriting is on the wall as to the benefits of making articles OA, for those with eyes to see, and no conflicting interests to blind them.
   Given all of this, here is a challenge for Craig et al.: Instead of striving, like OJ Simpson's Dream Team, only to find flaws in the positive evidence for the OA impact differential, which is equally compatible with either interpretation (OA causes higher citations or higher citations cause OA), why don't Craig et al. do a simple study of their own? Since it is known that (in science) the top 10% of articles published receive 90% of the total citations made, why not test whether and to what extent the top 10% of articles published is over-represented among the c. 15% of articles that are being spontaneously made OA by their authors today? It is, after all, a logical possibility that all or most of the top 10% are already among the 15% that are being made OA: I think it's improbable; but it may repay Craig et al.'s effort to check whether it is so.
   For if it did turn out that all or most of the top-cited 10% of articles are already among the c.15% of articles that are already being made OA, then reaching 100% OA would be far less urgent and important than I have been arguing, and OA mandates would likewise be less important. I for one would no longer find it important enough to archivangelize if I knew it was just for the bottom 90% of articles, the top 10% of articles having already been self-archived, spontaneously and sensibly, by their top 10% authors without having to be mandated. But it is Craig et al. who think this is closer to the truth, not me. So let them go out and demonstrate it.
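The check proposed in the challenge above could be sketched roughly as follows. This is a descriptive sketch only, not a significance test and not anyone's actual study design: the function name `top_decile_oa_share` and the `(citation_count, is_oa)` input format are assumptions made for illustration.

```python
def top_decile_oa_share(articles):
    """Compare the OA rate among top-cited articles with the overall OA rate.

    articles: list of (citation_count, is_oa) pairs.
    Returns (overall_oa_rate, oa_rate_among_top_10_percent).
    """
    if not articles:
        raise ValueError("need at least one article")
    # Rank by citations, most-cited first, and take the top tenth.
    ranked = sorted(articles, key=lambda a: a[0], reverse=True)
    top_n = max(1, len(ranked) // 10)
    top = ranked[:top_n]
    overall = sum(1 for _, oa in ranked if oa) / len(ranked)
    top_share = sum(1 for _, oa in top if oa) / len(top)
    return overall, top_share
```

If the second number came out close to 1.0 while the first stayed near 0.15, the top-cited articles would indeed already be mostly OA; if the two numbers were similar, they would not be over-represented at all.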

I've read Craig et al.'s critical review concerning the OA citation impact effect and will shortly write a short, mild review. But first here is Sally Morris's posting announcing Craig et al.'s review, on behalf of the Publishing Research Consortium (which "proposed" the review), followed by a commentary from Bruce Royan on diglib, a few remarks from me, then commentary by JWT Smith on jisc-repositories, followed by my response, and, last, a commentary by Bernd-Christoph Kaemper on SOAF, followed by my response.

Sally Morris (Publishing Research Consortium):
Craig, Iain; Andrew Plume, Marie McVeigh, James Pringle & Mayur Amin (2007) Do Open Access Articles Have Greater Citation Impact? A critical review of the literature. Journal of Informetrics.
A new, comprehensive review of recent bibliometric literature finds decreasing evidence for an effect of 'Open Access' on article citation rates. The review, now accepted for publication in the Journal of Informetrics, was proposed by the Publishing Research Consortium (PRC) and is available at its web site at www.publishingresearch.net. It traces the development of this issue from Steve Lawrence's original study in Nature in 2001 to the most recent work of Henk Moed and others.

Researchers have delved more deeply into such factors as 'selection bias' and 'early view' effects, and begun to control more carefully for the effects of disciplinary differences and publication dates. As they have applied these more sophisticated techniques, the relationship between open access and citation, once thought to be almost self-evident, has almost disappeared.

Commenting on the paper, Lord May of Oxford, FRS, past president of the Royal Society, said 'In December 2005, the Royal Society called for an evidence-based approach to the scholarly communications debate. This excellent paper demonstrates that there is actually little evidence of a citation advantage for open access articles.'

The debate will certainly continue, and further studies will continue to refine current work. The PRC welcomes this discussion, and hopes that this latest paper may be a catalyst for a new round of informed scholarly exchange.

Sally Morris on behalf of the Publishing Research Consortium

Bruce Royan wrote on diglib:

Sally claims that according to this article "the relationship between open access and citation, once thought to be almost self-evident, has almost disappeared."

Now I'm no Informetrician, but my reading of the article is that the authors reluctantly acknowledge that Open Access articles do have greater citation impact, but claim that this is less because they are Open Access per se, and more because:
- they are available sooner than more conventionally published articles, or
- they tend to be better articles, by more prestigious authors

Sally's point of view is understandable, since she is employed by a consortium of conventional publishers. It's interesting to note that the employers of the authors of this article are Wiley-Blackwell, Thomson Scientific, and Elsevier.

Even more interesting is that, though this article has been accepted for publication in the conventional "Journal of Informetrics", a pdf of it (described as a summary, but there are 20 pages in JOI format, complete with diagrams, references etc) has already been mounted on the web for free download, in what might be mistaken for an example of green route open access.

Could this possibly be in order to improve the article's impact?

Professor Bruce Royan,
Concurrent Computing Limited.

It is notoriously tricky (at least since David Hume) to "prove" causality empirically. The thrust of the Craig et al. critique is that despite the fact that virtually all studies comparing the citation counts for OA and non-OA articles keep finding the OA citation counts to be higher, it has not been proven beyond a reasonable doubt that the relationship is causal.

I agree: It is merely highly probable, not proven beyond a reasonable doubt, that articles are more cited because they are OA, rather than OA merely because they are more cited (or both OA and more cited merely because of a third factor).

And I also agree that not one of the studies done so far is without some methodological flaw that could be corrected.

But it is also highly probable that the results of the methodologically flawless versions of all those studies will be much the same as the results of the current studies. That's what happens when you have a robust major effect, detected by virtually every study, and only ad hoc methodological cavils and special pleading to rebut each of them with.

But I am sure those methodological flaws will not be corrected by these authors, because -- OJ Simpson's "Dream Team" of Defense Attorneys comes to mind -- Craig et al.'s only interest is evidently in finding flaws and alternative explanations, not in finding out the truth -- if it goes against their clients' interests...

Iain D.Craig: Wiley-Blackwell
Andrew M.Plume, Mayur Amin: Elsevier
Marie E.McVeigh, James Pringle: Thomson Scientific

Here is a preview of my rebuttal. It is mostly just common sense, if one has no conflict of interest, hence no reason for special pleading and strained interpretations:

(1) Research quality is a necessary, but not a sufficient condition for citation impact: The research must also be accessible to be cited.

(2) Research accessibility is a necessary but not a sufficient condition for citation impact: The research must also be of sufficient quality to be cited.

(3) The OA impact effect is the finding that an article's citation counts are positively correlated with the probability that that article has been made OA: The more an article's citations, the more likely that that article has been made OA.

(4) This correlation has at least three causal interpretations that are not mutually exclusive:

(4a) OA articles are more likely to be cited.

(4b) More-cited articles are more likely to be made OA.

(4c) A third factor makes it more likely that certain articles will be both more cited and made OA.

(5) Each of these causal interpretations is probably correct, and hence a contributor to the OA impact effect:

(5a) The better the article, the more likely it is to be cited, hence the more citations it gains if it is made more accessible (4a). (OA Article Quality Advantage, QA)

(5b) The better the article, the more likely it is to be made OA (4b). (OA Article Quality Bias, QB)

(5c) 10% of articles (and authors) receive 90% of citations. The authors of the better articles know they are better, and hence are more likely both to be cited and to make their articles OA, so as to maximize their visibility, accessibility and citations (4c). (OA Author QB and QA)

(6) In addition to QB and QA, there is an OA Early Access effect (EA): providing access earlier increases citations.

(7) The OA citation studies have not yet isolated and estimated the relative sizes of each of these (and other) contributing components. (OA also gives a Download Advantage (DA), and downloads are correlated with later citations; OA articles also have a Competitive Advantage (CA), but CA will vanish -- along with QB -- when all articles are OA).

(8) But the handwriting is on the wall as to the benefits of making articles OA, for those with eyes to see, and no conflicting interests to blind them.

I do agree completely, however, with erstwhile (Princetonian and) Royal Society President Bob May's slightly belated call for "an evidence-based approach to the scholarly communications debate."

John Smith (JS) wrote in jisc-repositories:

I wonder if we can come at this discussion concerning the impact of OA on citation counts from another angle? Assuming we have a traditional academic article of interest to only a few specialists there is a simple upper bound to the number of citations it will have no matter how accessible it is.

That is certainly true. It is also true that 10% of articles receive 90% of the citations. OA will not change that ratio; it will simply allow the usage and citations of those articles that were not used and cited because they could not be accessed to rise to what they would have been if they could have been accessed.

JS: Also, the majority of specialist academics work in educational institutions where they have access to a wide range of paid for sources for their subject.

OA is not for those articles and those users that already have paid access; it is for those that do not. No institution can afford paid access to all or most of the 2.5 million articles published yearly in the world's 24,000 peer-reviewed journals, and most institutions can only afford access to a small fraction of them.

OA is hence for that large fraction (the complement of the small fraction) of those articles that most users and most institutions cannot access. The 10% of that fraction that merit 90% of the citations today will benefit from OA the most, and in proportion to their merit. That increase in citations also corresponds to an increase in scholarly and scientific productivity and progress for everyone.

JS: Therefore any additional citations must mainly come from academics in smaller institutions that do not provide access to all relevant titles for their subject and/or institutions in the poorer countries of the world.

It is correct that the additional citations will come from academics at the institutions that cannot afford paid access to the journals in which the cited articles appeared. It might be the case that the access denial is concentrated in the smaller institutions and the poorer countries, but no one knows to what extent that is true, and one can also ask whether it is relevant. For the OA problem is not just an access problem but an impact problem. And the research output of even the richest institutions is losing a large fraction of its potential research impact because it is inaccessible to the fraction to whom it is inaccessible, whether or not that missing fraction is mainly from the smaller, poorer institutions.

JS: Should it not be possible therefore to examine the citers to these OA articles where increased citation is claimed and show they include academics in smaller institutions or from poorer parts of the world?

Yes, it is possible, and it would be a good idea to test the demography of access denial and OA impact gain. But, again, one wonders: Why would one assign this question of demographic detail a high priority at this time, when the access and impact loss have already been shown to be highly probable, when the remedy (mandated OA self-archiving) is at hand and already overdue, and when most of the skepticism about the details of the OA impact advantage comes from those who have a vested interest in delaying or deterring OA self-archiving mandates from being adopted?

(It is also true that a portion of the OA impact advantage is a competitive advantage that will disappear once all articles are OA. Again, one is inclined to reply: So what?)

This is not just an academic exercise but a call to action to remedy a remediable practical problem afflicting research and researchers.

JS: However, even if this were done and positive results found there is still another possible explanation. Items published in both paid for and free form are indexed in additional indexing services including free services like OAIster and CiteSeer. So it may be that it is not the availability per se that increases citation but the findability? Those who would have had access anyway have an improved chance of finding the article. Do we have proof that the additional citers accessed the OA version (assuming there is both an OA and paid for version)?

Increased visibility and improved searching are always welcome, but that is not the OA problem. OAIster's usefulness is limited by the fact that it only contains the c. 15% of the literature that is being self-archived spontaneously (i.e., unmandated) today. CiteSeer is a better niche search engine because computer scientists self-archive a much higher proportion of their research. But the obvious benchmark today is Google Scholar, which is increasingly covering all cited articles, whether OA or non-OA. It is in vain that Google Scholar enhances the visibility of non-OA articles for those would-be users to whom they are not accessible. Those users could already have accessed the metadata of those articles from online indices such as Web of Science or PubMed, only to reach a toll-access barrier when it came to accessing the inaccessible full-text corresponding to the visible metadata.

JS: It is possible that my queries above have already been answered. If so a reference to the work will suffice as a response.

I am a supporter of OA but also concerned that it is not falsely praised. If it is praised for some advantage and that advantage turns out not to be there it will weaken the position of OA proponents.

Accessibility is a necessary (but not a sufficient) condition for usage and impact. There is no risk that maximising accessibility will fail to maximise usage and impact. The only barrier between us and 100% OA is a few keystrokes.

It is appalling that we continue to dither about this; it is analogous to dithering about putting on (or requiring) seat-belts until we have made sure that the beneficiaries are not just the small and the poor, and that seat-belts do not simply make drivers more safety-conscious.

JS: Even if the apparent citation advantage of OA turns out to be false it does not weaken the real advantages of OA. We should not be drawn into a time and effort wasting defence of it while there is other work to be done to promote OA.

The real advantage of Open Access is Access. The advantage of Access is Usage and Impact (of which citations are one indicator). The Craig et al. study has not shown that the OA Impact Advantage is not real. It has simply pointed out that correlation does not entail causation. Duly noted. I agree that no time or effort should be spent now trying to demonstrate causation. The time and effort should be used to provide OA.

Bernd-Christoph Kaemper (B-CK) wrote on SOAF:

Elsevier said that citation rates of their journals had gone up considerably because of the increased access through widespread online availability of their journals...

"Online availability clearly increased the IF [journal citation impact factor]. In the FUTON subcategory, there was an IF gradient favoring journals with freely available articles. ..."

I think it is quite obvious why sources available with open access will be used and cited more often than others...

So the usefulness of open access is a matter of daily experience, not so much of academic discussions whether there is any empirical proof for a citation advantage of open access that may be isolated by eliminating all possible confounders...

That open access leads to more visibility and thereby potentially more citations is trivial, but this relative open access advantage will vary from journal to journal...

Due to the multitude of possible confounding factors I would not believe any of the figures calculated by Stevan Harnad as the cumulated lost impact, or conversely, the possible gain.

I couldn't quite follow the logic of this posting. It seemed to be saying that, yes, there is evidence that OA increases impact, it is even trivially obvious, but, no, we cannot estimate how much, because there are possible confounding factors and the size of the increase varies.

All studies have found that the size of the OA impact differential varies from field to field, journal to journal, and year to year. The range of variation is from +25% to over +250%. But the differential is always positive, and mostly quite sizeable. That is why I chose a conservative overall estimate of +50% for the potential gain in impact if it were not just the current 15% of research that was being made OA, but also the remaining 85%. (If you think 50% is not conservative enough, use the lower-bound 25%: You'll still find a substantial potential impact gain/loss. If you think self-selection accounts for half the gain, split it in half again: there's still plenty of gain, once you multiply by 85% of total citations.)
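The arithmetic behind this estimate can be sketched in a few lines of Python. This is only an illustration: the function name `potential_gain` and its parameters are hypothetical, and it makes the simplifying assumption that the gain scales linearly with the share of research that is not yet OA.

```python
def potential_gain(oa_advantage, non_oa_share=0.85, causal_fraction=1.0):
    """Rough potential citation gain, as a fraction of current total citations.

    oa_advantage    -- assumed OA citation differential (0.50 means +50%)
    non_oa_share    -- share of research not yet OA (about 85% in the text)
    causal_fraction -- portion of the differential assumed to be causal
                       (0.5 if self-selection is taken to explain half of it)
    """
    return oa_advantage * causal_fraction * non_oa_share


# The three scenarios named in the paragraph above.
scenarios = {
    "conservative estimate (+50%)": potential_gain(0.50),
    "lower bound (+25%)": potential_gain(0.25),
    "lower bound, half due to self-selection": potential_gain(0.25, causal_fraction=0.5),
}

for label, gain in scenarios.items():
    print(f"{label}: {gain:.1%} potential gain")
```

Even the most pessimistic of these three scenarios leaves a potential gain of roughly a tenth of total citations.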

An interesting question that has since arisen (and could be answered by similar studies) is this:

Since it is known that (in science) the top 10% of articles published receive 90% of the total citations made (Seglen 1992), to what extent is the top 10% of articles published over-represented among the c. 15% of articles that are being spontaneously made OA by their authors today?

It is a logical possibility that all or most of the top 10% are already among the 15% that are being made OA: I rather doubt it; but it would be worth checking whether it is so. [Attention lobbyists against OA mandates! Get out your scissors here and prepare to snip an out-of-context quote...]

[snip]
If it did turn out that all or most of the top-cited 10% of articles are already among the c.15% of articles that are already being made OA, then reaching 100% OA would be far less urgent and important than I had argued, and OA mandates would likewise be less important. I for one would no longer find it important enough to archivangelize if I knew it was just for the bottom 90% of articles, the top 10% of articles having already been self-archived, spontaneously and sensibly, by their top 10% authors without having to be mandated.
[/snip]

The empirical studies of the relation between OA and impact have been mostly motivated by the objective of accelerating the growth of OA -- and thereby the growth of research usage and impact. Those who are persuaded that the OA impact differential is merely or largely a non-causal self-selection bias are encouraged to demonstrate that that is the case.

Note very carefully, though, that the observed correlation between OA and citations takes the form of a correlation between the number of OA articles, relative to non-OA articles, at each citation level. The more highly cited an article, the more likely it is OA. This is true within journals, and within and across years, in every field tested.

And this correlation can arise because more-cited articles are more likely to be made OA or because articles that are made OA are more likely to be cited (or both -- which is what I think is in reality the case). It is certainly not the case that self-selection is the default or null hypothesis, and that those who interpret the effect as OA causing the citation increase hence have the burden of proof: The situation is completely symmetric numerically; so your choice between the two hypotheses is not based on the numbers, but on other considerations, such as prima facie plausibility -- or financial interest.

Until and unless it is shown empirically that today's OA 15% already contains all or most of the top-cited 10% (and hence 90% of what researchers cite), I think it is a much more plausible interpretation of the existing findings that OA is a cause of the increased usage and citations, rather than just a side-effect of them, and hence that there is usage and impact to be gained by providing and mandating OA. (I can quite understand why those who have a financial interest in its being otherwise [Craig et al. 2007] might prefer the other interpretation, but clearly prima facie plausibility cannot be their justification.)

I also think that 50% of total citations is a plausible overall estimate of the potential gain from OA, as long as it is understood clearly that the 50% gain does not apply to every article made OA. Many articles are not found useful enough to cite no matter how accessible you make them. The 50% citation gain will mostly accrue to the top 10% of articles, as citations always do (though OA will no doubt also help to remedy some inequities and will sometimes help some neglected gems to be discovered and used more widely). In other words, the OA advantage to an article will be roughly proportional to that article's intrinsic citation value (independent of OA).
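A toy calculation shows what "roughly proportional" implies. The citation counts below are made up for illustration, the helper names are hypothetical, and the gain is assumed to be strictly proportional (the text only says roughly so).

```python
def proportional_gain(citations, oa_advantage=0.50):
    """Extra citations per article if the gain is strictly proportional."""
    return [c * oa_advantage for c in citations]


def top_share(values, top_fraction=0.10):
    """Share of the total held by the top `top_fraction` of items."""
    ranked = sorted(values, reverse=True)
    k = max(1, round(len(ranked) * top_fraction))
    total = sum(ranked)
    return sum(ranked[:k]) / total if total else 0.0


# Made-up counts for ten articles: one article holds 90 of the 100 citations.
citations = [90, 2, 2, 2, 1, 1, 1, 1, 0, 0]
gains = proportional_gain(citations)

print(top_share(citations))  # share of current citations held by the top article
print(top_share(gains))      # share of the gain going to that same article
print(min(gains))            # uncited articles gain nothing
```

Under this assumption the top article's share of the gain equals its share of current citations, and an article with no citations gains none.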

Other interesting questions: The top-cited articles are not evenly distributed among journals. The top journals tend to get the top-cited articles. It is also unlikely that journal subscriptions are evenly distributed among journals: The top journals are likely to be subscribed to more, and are hence more accessible.

So if someone is truly interested in these questions (as I am not!), they might calculate a "toll-accessibility index" (TAI) for each article, based on the number of researchers/institutions that have toll access to the journal in which that article is published. An analysis of covariance can then be done to see whether and how much the OA citation advantage is reduced if one controls for the article's TAI. (I suspect the answer will be: somewhat, but not much.)

B-CK: Could we do a thought experiment? From a representative group of authors, choose a sample of authors randomly and induce them to make their next article open access. Do you believe they will see as much gain in citations compared to their previous average citation levels as predicted from the various current "OA advantage" studies where several confounding factors are operating? Probably not - but what would remain of that advantage? -- I find that difficult to predict or model.

From a random sample, I would expect an increase of around 50% or more in total citations, 90% of the increased citations going to the top 10%, as always.

B-CK: As I learned from your posting, you seem to predict that it will anyway depend on the previous citedness of the members of that group (if we take that as a proxy for the unknown actual intrinsic citation value of those articles), in the sense that more-cited authors will see a larger percentage increase effect.

I don't think it's just a Matthew Effect; I think the highest quality papers get the most citations (90%), and the highest quality papers are apparently about 10% (in science, according to Seglen).

B-CK: To turn your argument around, most authors happily going open access in expectation of increased citation might be disappointed because the 50% increase will only apply to a small minority of them.

That's true; but you could say the same for most authors going into research at all. There is no guarantee that they will produce the highest quality research, but I assume that researchers do what they do in the hope that they will, if not this time, then the next time, produce the highest quality research.

B-CK: That was the reason why I said that (as an individual author) I would rather not believe in any "promised" values for the possible gain.

Where there is life, and effort, there is hope. I think every researcher should do research, and publish, and self-archive, with the ambition of doing the best quality work, and having it rewarded with valuable findings, which will be used and cited.

My "promise", by the way, was never that each individual author would get 50% more citations. (That would actually have been absurd, since over 50% of papers get no citations at all -- apart from self-citation -- and 50% of 0 is still 0.)

My promise, in calculating the impact gain/loss that you doubted, was to countries, research funders and institutions. On the assumption that the research output of each roughly covers the quality spectrum, they can expect their total citations to increase by 50% or more with OA, but that increase will be mostly at their high-quality end. (And the total increase is actually about 85% of 50%, as the baseline spontaneous self-archiving rate is about 15%.)

B-CK: That doesn't mean though that there are not enough other reasons to go for open access (I mentioned many of them in my posting).

There are other reasons, but researchers' main motivation for conducting and publishing research is to make a contribution to knowledge that will be found useful by, used by, and built upon by other researchers. There are pedagogic goals too, but I think they are secondary, and I certainly don't think they are strong enough to induce researchers to make their publications OA if the primary reason was not reason enough to induce them.

(Actually, I don't think any of the reasons are enough to induce enough researchers to provide OA, and that's why Green OA mandates are needed -- and being provided -- by researchers' institutions and funders.)

B-CK: With respect to the toll accessibility index, I completely agree. The occasional good article in an otherwise "obscure" journal probably has a lot to gain from open access, as many people would not bother to try to get hold of a copy should they find it among a lot of others in a bibliographic database search, if it doesn't look from the beginning like a "perfect match" of what they are looking for.

You agree with the toll-accessibility argument prematurely: There are as yet no data on it, whereas there are plenty of data on the correlation between OA and impact.

B-CK: An interesting question to look at would also be the effect of open access on non-formal citation modes like web linking, especially social bookmarking. Clearly NPG is interested in Connotea also as a means to enhance the visibility of articles in their own toll access articles. Has anyone already tried such investigations?

Although I cannot say how much of it is due to other kinds of links rather than to citation links themselves, the University of Southampton, the first institution with a (departmental) Green OA self-archiving mandate and hence the one with the longest-standing mandate, has a surprisingly high webmetric, university-metric and G-factor rank.

Stevan Harnad
American Scientist Open Access Forum

Bollen, J., Van de Sompel, H., Smith, J. and Luce, R. (2005) Toward alternative metrics of journal impact: A comparison of download and citation data. Information Processing and Management, 41(6): 1419-1440.

Brody, T., Harnad, S. and Carr, L. (2006) Earlier Web Usage Statistics as Predictors of Later Citation Impact. Journal of the American Society for Information Science and Technology (JASIST) 57(8) pp. 1060-1072.

Craig, Iain; Andrew Plume, Marie McVeigh, James Pringle & Mayur Amin (2007) Do Open Access Articles Have Greater Citation Impact? A critical review of the literature. Journal of Informetrics.

Davis, P. M. and Fromerth, M. J. (2007) Does the arXiv lead to higher citations and reduced publisher downloads for mathematics articles? Scientometrics 71: 203-215.
See critiques: 1 and 2.

Diamond, Jr., A. M. (1986) What is a Citation Worth? Journal of Human Resources 21: 200-215.

Eysenbach, G. (2006) Citation Advantage of Open Access Articles. PLoS Biology 4: 157.

Hajjem, C., Harnad, S. and Gingras, Y. (2005) Ten-Year Cross-Disciplinary Comparison of the Growth of Open Access and How it Increases Research Citation Impact. IEEE Data Engineering Bulletin 28(4) pp. 39-47.

Hajjem, C. and Harnad, S. (2006) Manual Evaluation of Robot Performance in Identifying Open Access Articles. Technical Report, Institut des sciences cognitives, Université du Québec à Montréal.

Hajjem, C. and Harnad, S. (2006) The Self-Archiving Impact Advantage: Quality Advantage or Quality Bias? Technical Report, ECS, University of Southampton.

Hajjem, C. and Harnad, S. (2007) Citation Advantage For OA Self-Archiving Is Independent of Journal Impact Factor, Article Age, and Number of Co-Authors. Technical Report, Electronics and Computer Science, University of Southampton.

Hajjem, C. and Harnad, S. (2007) The Open Access Citation Advantage: Quality Advantage Or Quality Bias? Technical Report, Electronics and Computer Science, University of Southampton.

Harnad, S. & Brody, T. (2004) Comparing the Impact of Open Access (OA) vs. Non-OA Articles in the Same Journals, D-Lib Magazine 10 (6) June

Harnad, S. (2005) Making the case for web-based self-archiving. Research Money 19(16).

Harnad, S. (2005) Maximising the Return on UK's Public Investment in Research. (Unpublished ms.)

Harnad, S. (2005) OA Impact Advantage = EA + (AA) + (QB) + QA + (CA) + UA. (Unpublished ms.)

Harnad, S. (2005) On Maximizing Journal Article Access, Usage and Impact. Haworth Press (occasional column).

Harnad, S. (2006) Within-Journal Demonstrations of the Open-Access Impact Advantage: PLoS, Pipe-Dreams and Peccadillos (LETTER). PLOS Biology 4(5).

Henneken, E. A., Kurtz, M. J., Eichhorn, G., Accomazzi, A., Grant, C., Thompson, D., and Murray, S. S. (2006) Effect of E-printing on Citation Rates in Astronomy and Physics. Journal of Electronic Publishing, Vol. 9, No. 2, Summer 2006

Henneken, E. A., Kurtz, M. J., Warner, S., Ginsparg, P., Eichhorn, G., Accomazzi, A., Grant, C. S., Thompson, D., Bohlen, E. and Murray, S. S. (2006) E-prints and Journal Articles in Astronomy: a Productive Co-existence. Learned Publishing.

Kurtz, M. J., Eichhorn, G., Accomazzi, A., Grant, C. S., Demleitner, M., Murray, S. S. (2005) The Effect of Use and Access on Citations. Information Processing and Management, 41 (6): 1395-1402.

Kurtz, Michael and Brody, Tim (2006) The impact loss to authors and research. In, Jacobs, Neil (ed.) Open Access: Key strategic, technical and economic aspects. Oxford, UK, Chandos Publishing.

Lawrence, S. (2001) Online or Invisible? Nature 411 (6837): 521.

Metcalfe, Travis S (2006) The Citation Impact of Digital Preprint Archives for Solar Physics Papers. Solar Physics 239: 549-553

Moed, H. F. (2006) The effect of 'Open Access' upon citation impact: An analysis of ArXiv's Condensed Matter Section (preprint)

Perneger, T. V. (2004) Relation between online 'hit counts' and subsequent citations: prospective study of research papers in the British Medical Journal. British Medical Journal 329:546-547.

Seglen, P. O. (1992) The skewness of science. Journal of the American Society for Information Science 43: 628-638.

Posted by Stevan Harnad in Methodology at 22:36 | Comments (0) | Trackbacks (0)

Tuesday, May 1. 2007

OA Citation Impact Study: No Conclusions Possible

Update Jan 1, 2010: See Gargouri, Y; C Hajjem, V Larivière, Y Gingras, L Carr, T Brody & S Harnad (2010) “Open Access, Whether Self-Selected or Mandated, Increases Citation Impact, Especially for Higher Quality Research”
Update Feb 8, 2010: See also "Open Access: Self-Selected, Mandated & Random; Answers & Questions"

Tonta, Yaşar and Ünal, Yurdagül and Al, Umut (2007) The Research Impact of Open Access Journal Articles. In Proceedings ELPUB 2007, the 11th International Conference on Electronic Publishing, Focusing on challenges for the digital spectrum, pp. 1-11, Vienna (Austria).

The above article compared average citation counts in several different fields for a sample of articles in a sample of OA journals in the Directory of Open Access Journals (DOAJ).

The average citation counts for articles in the OA journals were found to vary across fields. It was concluded that OA research impact varies across fields.

No comparison was made with non-OA journals in the same fields. Hence it is impossible to say whether any of these differences have anything to do with OA. Fields no doubt differ in their average number of citations. Journals no doubt differ too, in subject matter, quality, and citation impact, hence must be equated: It is not clear whether the OA journals in each field are the top, medium or bottom journals, relative to the non-OA journals.

No conclusions at all can be drawn from this study. The authors are encouraged to do the necessary controls.

Note also that Hajjem et al. 2005 (and others) report that the ratio of OA/non-OA articles is positively correlated with citation counts. This can mean that higher-quality articles are more likely to be made OA (Quality Bias), or that the OA impact advantage is greater for higher-quality articles (Quality Advantage) -- or, most likely, both (Hajjem & Harnad 2006).

Hajjem, C., Harnad, S. and Gingras, Y. (2005) Ten-Year Cross-Disciplinary Comparison of the Growth of Open Access and How it Increases Research Citation Impact. IEEE Data Engineering Bulletin 28(4) pp. 39-47.

Hajjem, C. and Harnad, S. (2006) The Open Access Impact Advantage: Quality Advantage or Quality Bias? Technical Report, ECS, University of Southampton.

Stevan Harnad
American Scientist Open Access Forum

Posted by Stevan Harnad in Methodology at 02:14 | Comments (0) | Trackbacks (0)

Sunday, January 21. 2007

The Open Access Citation Advantage: Quality Advantage Or Quality Bias?

Update Jan 1, 2010: See Gargouri, Y; C Hajjem, V Larivière, Y Gingras, L Carr, T Brody & S Harnad (2010) “Open Access, Whether Self-Selected or Mandated, Increases Citation Impact, Especially for Higher Quality Research”
Update Feb 8, 2010: See also "Open Access: Self-Selected, Mandated & Random; Answers & Questions"

SUMMARY: Many studies have now reported the positive correlation between Open Access (OA) self-archiving and citation counts ("OA Advantage," OAA). But does this OAA occur because articles that are self-archived are more likely to be cited ("Quality Advantage": QA) or because articles that are more likely to be cited are more likely to be self-archived ("Quality Bias," QB)? The probable answer is both. Three studies [by Kurtz and co-workers in astrophysics, Moed in condensed matter physics, and Davis & Fromerth in mathematics] had attributed the OAA to QB [and to EA, the Early Advantage of self-archiving the preprint before publication] rather than QA. These three fields, however, happen to be among the minority of fields that (1) make heavy use of prepublication preprints and (2) have less of a postprint access problem than most other fields. Chawki Hajjem has now analyzed preliminary evidence based on over 100,000 articles from multiple fields, comparing self-selected self-archiving with mandated self-archiving to estimate the contributions of QB and QA to the OAA. Both factors contribute, and the contribution of QA is greater.

This is a preview of some preliminary data (not yet refereed), collected by my doctoral student at UQaM, Chawki Hajjem. This study was done in part by way of response to Henk Moed's replies to my comments on Moed's (self-archived) preprint:

Moed, H. F. (2006) The effect of 'Open Access' upon citation impact: An analysis of ArXiv's Condensed Matter Section

Moed's study is about the "Open Access Advantage" (OAA) -- the higher citation counts of self-archived articles -- observable across disciplines as well as across years as in the following graphs from Hajjem et al. 2005 (red bars are the OAA):

FIGURE 1. Open Access Citation Advantage By Discipline and By Year.
Green bars are percentage of articles self-archived (%OA); red bars, percentage citation advantage (%OAA) for self-archived articles for 10 disciplines (upper chart) across 12 years (lower chart, 1992-2003). Gray curve indicates total articles by discipline and year.
Source: Hajjem, C., Harnad, S. and Gingras, Y. (2005) Ten-Year Cross-Disciplinary Comparison of the Growth of Open Access and How it Increases Research Citation Impact. IEEE Data Engineering Bulletin 28(4) pp. 39-47.

The focus of the present discussion is the factors underlying the OAA. There are at least five potential contributing factors, but only three of them are under consideration here: (1) Early Advantage (EA), (2) Quality Advantage (QA) and (3) Quality Bias (QB -- also called "Self-Selection Bias").

Preprints that are self-archived before publication have an Early Advantage (EA): they get read, used and cited earlier. This is uncontested.

Kurtz, Michael and Brody, Tim (2006) The impact loss to authors and research. In, Jacobs, Neil (ed.) Open Access: Key strategic, technical and economic aspects. Oxford, UK, Chandos Publishing.

In addition, the proportion of articles self-archived at or after publication is higher in the higher "citation brackets": the more highly cited articles are also more likely to be the self-archived articles.

FIGURE 2. Correlation between Citedness and Ratio of Open Access (OA) to Non-Open Access (NOA) Ratios.
The (OAc/TotalOAc)/(NOAc/TotalNOAc) ratio (across all disciplines and years) increases as citation count (c) increases (r = .98, N=6, p<.005). The more cited an article, the more likely that it is OA. (Hajjem et al. 2005)
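The ratio plotted in FIGURE 2 can be sketched in a few lines. This is only an illustration: the bracket counts below are invented placeholders, not the Hajjem et al. data, and the function name `oa_noa_ratio` is hypothetical. For each citation bracket c, the function divides the share of all OA articles that fall in that bracket by the share of all non-OA articles that fall in it.

```python
def oa_noa_ratio(oa_counts, noa_counts):
    """Return (OAc/TotalOAc) / (NOAc/TotalNOAc) for each citation bracket c."""
    total_oa = sum(oa_counts)
    total_noa = sum(noa_counts)
    ratios = []
    for oa_c, noa_c in zip(oa_counts, noa_counts):
        oa_share = oa_c / total_oa      # fraction of OA articles in bracket c
        noa_share = noa_c / total_noa   # fraction of non-OA articles in bracket c
        ratios.append(oa_share / noa_share)
    return ratios


# Hypothetical article counts for six citation brackets, lowest to highest.
oa_counts = [100, 80, 60, 40, 15, 5]
noa_counts = [800, 500, 300, 150, 40, 10]

ratios = oa_noa_ratio(oa_counts, noa_counts)
for bracket, ratio in enumerate(ratios, start=1):
    print(f"bracket {bracket}: ratio = {ratio:.2f}")
```

A ratio above 1 in a bracket means OA articles are over-represented there relative to non-OA articles; a ratio that rises with the bracket is the pattern the figure describes.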

The question, then, is about causality: Are self-archived articles more likely to be cited because they are self-archived (QA)? Or are articles more likely to be self-archived because they are more likely to be cited (QB)?

The most likely answer is that both factors, QA and QB, contribute to the OAA: the higher quality papers gain more from being made more accessible (QA: indeed the top 10% of articles tend to get 90% of the citations). But the higher quality papers are also more likely to be self-archived (QB).

As we will see, however, the evidence to date, because it has been based exclusively on self-selected (voluntary) self-archiving, is equally compatible with (i) an exclusive QA interpretation, (ii) an exclusive QB interpretation or (iii) the joint explanation that is probably the correct one.

The only way to estimate the independent contributions of QA and QB is to compare the OAA for self-selected (voluntary) self-archiving with the OAA for imposed (obligatory) self-archiving. We report some preliminary results for this comparison here, based on the (still small sample of) Institutional Repositories that already have self-archiving mandates (chiefly CERN, U. Southampton, QUT, U. Minho, and U. Tasmania).

FIGURE 3. Self-Selected Self-Archiving vs. Mandated Self-Archiving: Within-Journal Citation Ratios (for 2004, all fields).
S = citation counts for articles self-archived at institutions with (Sm) and without (Sn) a self-archiving mandate. N = citation counts for non-archived articles at institutions with (Nm) and without (Nn) mandate (i.e., Nm = articles not yet compliant with mandate). Grand average of (log) S/N ratios (106,203 articles; 279 journals) is the OA advantage (18%); this is about the same as for Sn/Nn (27972 articles, 48 journals, 18%) and Sn/N (17%); ratio is higher for Sm/N (34%), higher still for Sm/Nm (57%, 541 articles, 20 journals); and Sm/Sn = 27%, so self-selected self-archiving does not yield more citations than mandated; rather the reverse. (All six within-pair differences are significant: correlated sample t-tests.) (NB: preliminary, unrefereed results.)
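One plausible reading of "grand average of (log) S/N ratios" is sketched below. It is an assumption about the procedure, not a description of the actual analysis: for each journal, take the mean citation count of self-archived articles over the mean for non-archived articles, average the logs of those ratios across journals, and convert back to a percentage advantage. The function name and the citation counts are hypothetical.

```python
from math import exp, log


def mean(values):
    return sum(values) / len(values)


def grand_log_ratio_advantage(journals):
    """Percentage citation advantage from within-journal log S/N ratios.

    journals: list of (s_citations, n_citations) pairs, one pair per journal,
    where each element is a list of per-article citation counts.
    Assumes every journal has a nonzero mean in both groups.
    """
    log_ratios = [log(mean(s) / mean(n)) for s, n in journals]
    # Averaging logs and exponentiating gives the geometric mean of the ratios.
    return (exp(mean(log_ratios)) - 1) * 100


# Hypothetical data: three journals, each with an S/N ratio of 1.2.
journals = [
    ([6, 6], [4, 6]),
    ([12], [10]),
    ([3, 3], [2, 3]),
]
advantage = grand_log_ratio_advantage(journals)
print(f"OA advantage: {advantage:.1f}%")
```

Working within journals keeps each comparison between articles of the same venue and year, so differences between journals do not leak into the ratio.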

Summary: These preliminary results suggest that both QA and QB contribute to OAA, and that the contribution of QA is greater than that of QB.

Discussion: On Fri, 8 Dec 2006, Henk Moed [HM] wrote:

HM: "Below follow some replies to your comments on my preprint 'The effect of 'Open Access' upon citation impact: An analysis of ArXiv's Condensed Matter Section'...

"1. Early view effect. [EA] In my case study on 6 journals in the field of condensed matter physics, I concluded that the observed differences between the citation age distributions of deposited and non-deposited ArXiv papers can to a large extent - though not fully - be explained by the publication delay of about six months of non-deposited articles compared to papers deposited in ArXiv. This outcome provides evidence for an early view [EA] effect upon citation impact rates, and consequently upon ArXiv citation impact differentials (CID, my term) or Arxiv Advantage (AA, your term)."
SH: "The basic question is this: Once the AA (Arxiv Advantage) has been adjusted for the "head-start" component of the EA (by comparing articles of equal age -- the age of Arxived articles being based on the date of deposit of the preprint rather than the date of publication of the postprint), how big is that adjusted AA, at each article age? For that is the AA without any head-start. Kurtz never thought the EA component was merely a head start, however, for the AA persists and keeps growing, and is present in cumulative citation counts for articles at every age since Arxiving began".
HM: "Figure 2 in the interesting paper by Kurtz et al. (IPM, v. 41, p. 1395-1402, 2005) does indeed show an increase in the very short term average citation impact (my terminology; citations were counted during the first 5 months after publication date) of papers as a function of their publication date as from 1996. My interpretation of this figure is that it clearly shows that the principal component of the early view effect is the head-start: it reveals that the share of astronomy papers deposited in ArXiv (and other preprint servers) increased over time. More and more papers became available at the date of their submission to a journal, rather than on their formal publication date. I therefore conclude that their findings for astronomy are fully consistent with my outcomes for journals in the field of condensed matter physics."

The findings are definitely consistent for Astronomy and for Condensed Matter Physics. In both cases, most of the observed OAA came from the self-archiving of preprints before publication (EA).

Moreover, in Astronomy there is already 100% "OA" to all articles after publication, and this has been the case for years now (for the reasons Michael Kurtz and Peter Boyce have pointed out: all research-active astronomers have licensed access as well as free ADS access to all of the closed circle of core Astronomy journals: otherwise they simply cannot be research-active). This means that there is only room for EA in Astronomy's OAA. And that means that in Astronomy all the questions about QA vs QB (self-selection bias) apply only to the self-archiving of prepublication preprints, not to postpublication postprints, which are all effectively "OA."

To a lesser extent, something similar is true in Condensed-Matter Physics (CondMP): In general, research-active physicists have better access to their required journals via online licensing than other fields do (though one does wonder about the "non-research-active" physicists, and what they could/would do if they too had OA!). And CondMP too is a preprint self-archiving field, with most of the OAA differential again concentrated on the prepublication preprints (EA). Moreover, Moed's test for whether or not a paper was self-archived was based entirely on its presence/absence in ArXiv (as opposed to elsewhere on the Web, e.g., on the author's website or in the author's Institutional Repository).

Hence Astronomy and CondMP are fields that are "biassed" toward EA effects. It is not surprising, therefore, that the lion's share of the OAA turns out to be EA in these fields. It also means that the remaining variance available for testing QA vs. QB in these fields is much narrower than in fields that do not self-archive preprints only, or mostly.

Hence there is no disagreement (or surprise) about the fact that most of the OAA in Astronomy and CondMP is due to EA. (Less so in the slower-moving field of maths; see: "Early Citation Advantage?.")

SH: "The fact that highly-cited articles (Kurtz) and articles by highly-cited authors (Moed) are more likely to be Arxived certainly does not settle the question of cause and effect: It is just as likely that better articles benefit more from Arxiving (QA) as that better authors/articles tend to Arxive/be-Arxived more (QB)."
HM: "2. Quality bias. I am fully aware that in this research context one cannot assess whether authors publish [sic] their better papers in the ArXiv merely on the basis of comparing citation rates of archived and non-archived papers, and I mention this in my paper. Citation rates may be influenced both by the 'quality' of the papers and by the access modality (deposited versus non-deposited). This is why I estimated author prominence on the basis of the citation impact of their non-archived articles only. But even then I found evidence that prominent, influential authors (in the above sense) are overrepresented in papers deposited in ArXiv."

I agree with all this: The probable quality of the article was estimated from the probable quality of the author, based on citations for non-OA articles. Now, although this correlation, too, goes both ways (are authors' non-OA articles more cited because their authors self-archive more or do they self-archive more because they are more cited?), I do agree that the correlation between self-archiving-counts and citation-counts for non-self-archived articles by the same author is more likely to be a QB effect. The question then, of course, is: What proportion of the OAA does this component account for?

HM: "But I did more that that. I calculated Arxiv Citation Impact Differentials (CID, my term, or ArXiv Advantage, AA, your term) at the level of individual authors. Next, I calculated the median CID over authors publishing in a journal. How then do you explain my empirical finding that for some authors the citation impact differential (CID) or ArXiv Advantage is positive, for others it is negative, while the median CID over authors does not significantly differ from zero (according to a Sign test) for all journals studied in detail except Physical Review B, for which it is only 5 per cent? If there is a genuine 'OA advantage' at stake, why then does it for instance not lead to a significantly positive median CID over authors? Therefore, my conclusion is that, controlling for quality bias and early view effect, in the sample of 6 journals analysed in detail in my study, there is no sign of a general 'open access advantage' of papers deposited in ArXiv's Condensed Matter Section."

My interpretation is that EA is the largest contributor to the OAA in this preprint-intensive field (i.e., most of the OAA comes from the prepublication component) and that there is considerable variability in the size of the (small) residual (non-EA) OAA. For a small sample, at the individual journal level, there is not enough variance left for a significant OAA, once one removes the QB component too. Perhaps this is all that Henk Moed wished to imply. But the bigger question for OA concerns all fields, not just those few that are preprint-intensive and that are relatively well-heeled for access to the published version. Indeed, the fundamental OA and OAA questions concern the postprint (not the preprint) and the many disciplines that do have access problems, not the happy few that do not!
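The kind of Sign test Moed mentions can be sketched as follows, to show how a mix of positive and negative per-author differentials yields a median that is not significantly different from zero. This is a generic two-sided sign test, not necessarily the exact procedure in Moed's paper, and the differentials are invented for illustration.

```python
from math import comb


def sign_test_p(differentials):
    """Two-sided sign test of the hypothesis that the median differential is 0."""
    positives = sum(1 for d in differentials if d > 0)
    negatives = sum(1 for d in differentials if d < 0)
    n = positives + negatives          # zero differentials (ties) are dropped
    if n == 0:
        return 1.0
    k = min(positives, negatives)
    # Probability of k or fewer of the rarer sign under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical per-author citation impact differentials (CIDs).
mixed = [0.4, -0.3, 0.1, -0.2, 0.5, -0.1, 0.2, -0.4]
all_positive = [0.1] * 10

print(sign_test_p(mixed))         # evenly split signs
print(sign_test_p(all_positive))  # every author shows a positive CID
```

The test uses only the signs of the differentials, so with few authors per journal it has little power to detect a small residual advantage.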

The way to test the presence and size of both QB and QA in these non-EA fields is to impose the OA, preferably randomly, on half the sample, and then compare the size of the OAA for imposed ("mandated") self-archiving (Sm) with the size of the OAA for self-selected ("nonmandated") self-archiving (Sn), in particular by comparing their respective ratios to non-self-archived articles in the same journal and year (Sm/N vs. Sn/N).

If Sn/N > Sm/N then QB > QA, and vice versa. If Sn/N = 1, then QB is 0. And if Sm/N = 1 then QA is 0.
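The rule of thumb above can be written out as a small function. The function name and the tolerance are hypothetical, and converting the percentages reported in FIGURE 3 (Sn/N = 17%, Sm/N = 34%) into ratios of 1.17 and 1.34 is an assumption made only for the example.

```python
def interpret(sn_over_n, sm_over_n, tol=1e-9):
    """Apply the text's rule to two within-journal citation ratios.

    sn_over_n: self-selected self-archived vs. non-archived articles (Sn/N)
    sm_over_n: mandated self-archived vs. non-archived articles (Sm/N)
    """
    findings = []
    if sn_over_n - sm_over_n > tol:
        findings.append("QB > QA")
    elif sm_over_n - sn_over_n > tol:
        findings.append("QA > QB")
    if abs(sn_over_n - 1.0) <= tol:
        findings.append("QB is 0")
    if abs(sm_over_n - 1.0) <= tol:
        findings.append("QA is 0")
    return findings


# Ratios assumed from the FIGURE 3 percentages (preliminary, unrefereed).
print(interpret(1.17, 1.34))
```

The comparison only ranks the two effects; it does not by itself say how large either one is.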

It is a first approximation to this comparison that has just been done (FIGURE 3, above) by my doctoral student, Chawki Hajjem, across fields, for self-archived articles in five Institutional Repositories (IRs) that have OA self-archiving mandates, for 106,203 articles published in 276 biomedical journals in 2004.

The mandates are still very young and few, hence the sample is still small; and there are many potential artifacts, including selective noncompliance with the mandate as well as disciplinary bias. But the preliminary results so far suggest that (1) QA is indeed > 0, and (2) QA > QB.

[I am sure that we will now have a second round from die-hards who will want to argue for a selective-compliance effect, as a 2nd-order last gasp for the QB-only hypothesis, but of course that loses all credibility as IRs approach 100% compliance: We are analyzing our mandated IRs separately now, to see whether we can detect any trends correlated with an IR's %OA. But (except for the die-hards, who will never die), I think even this early sample already shows that the OA advantage is unlikely to be only or mostly a QB effect.]

HM: "3. Productive versus less productive authors. My analysis of differences in Citation Impact differentials between productive and less productive authors may seem "a little complicated". My point is that if one selects from a set of papers deposited in ArXiv a paper authored by a junior (or less productive) scientist, the probability that this paper is co-authored by a senior (or more productive) author is higher than it is for a paper authored by a junior scientist but not deposited in ArXiv. Next, I found that papers co-authored by both productive and less productive authors tend to have a higher citation impact than articles authored solely by less productive authors, regardless of whether these papers were deposited in ArXiv or not. These outcomes lead me to the conclusion that the observed higher CID for less productive authors compared to that of productive authors can be interpreted as a quality bias."

It still sounds a bit complicated, but I think what you mean is that (1) mixed multi-author papers (ML, with M = More productive authors, L = less productive authors) are more likely to be cited than unmixed multi-author (LL) papers with the same number of authors, and that (2) such ML papers are also more likely to be self-archived. (Presumably MM papers are the most cited and most self-archived of multi-author papers.)

That still sounds to me like a variant on the citation/self-archiving correlation, and hence interpretable as either QA or QB or both. (Chawki Hajjem has also found that citation counts are positively correlated with the number of authors an article has: this could either be a self-citation bias or evidence that multi-authored papers tend to be better ones.)

HM: "4. General comments. In the citation analysis by Kurtz et al. (2005), both the citation and target universe contain a set of 7 core journals in astronomy. They explain their finding of no apparent OA effect in his study of these journals by postulating that "essentially all astronomers have access to the core journals through existing channels". In my study the target set consists of a limited number of core journals in condensed matter physics, but the citation universe is as large as the total Web of Science database, including also a number of more peripherical journals in the field. Therefore, my result is stronger than that obtained by Kurtz at al.: even in this much wider citation universe, I do not find evidence for an OA advantage effect."

I agree that CondMP is less preprint-intensive, less accessible and less endogamous than Astrophysics, but it is still a good deal more preprint-intensive and accessible than most fields (and I don't yet know what role the exogamy/endogamy factor plays in either citations or the OAA: it will be interesting to study, among many other candidate metrics, once the entire literature is OA).

HM: "I realize that my study is a case study, examining in detail 6 journals in one subfield. I fully agree with your warning that one should be cautious in generalizing conclusions from case studies, and that results for other fields may be different. But it is certainly not an unimportant case. It relates to a subfield in physics, a discipline that your pioneering and stimulating work (Harnad and Brody, D-Lib Mag., June 2004) has analysed as well at a more aggregate level. I hope that more case studies will be carried out in the near future, applying the methodologies I proposed in my paper."

Your case study is very timely and useful. However, robot-based studies based on much larger samples of journals and articles have now confirmed the OAA in many more fields, most of them not preprint-based at all, and with access problems more severe than those of physics.

Conclusions

I would like to conclude with a summary of the "QB vs. QA" evidence to date, as I understand it:

(1) Many studies have reported the OA Advantage, across many fields.

(2) Three studies have reported QB in preprint-intensive fields that have either no postprint access problem or markedly less than other fields (astrophysics, condensed matter, mathematics).

(3) The author of one of these three studies is pro-OA (Kurtz, who is also the one who drew my attention to the QA counterevidence); the author of the second is neutral (Moed); and the author of the third might (I think -- I'm not sure) be mildly anti-OA (Davis -- now collaborating with a publisher to do a 4-year [sic!] long-term study on QA vs QB).
Henneken, E. A., Kurtz, M. J., Eichhorn, G., Accomazzi, A., Grant, C., Thompson, D., and Murray, S. S. (2006) Effect of E-printing on Citation Rates in Astronomy and Physics. Journal of Electronic Publishing, Vol. 9, No. 2, Summer 2006

Moed, H. F. (2006, preprint) The effect of 'Open Access' upon citation impact: An analysis of ArXiv's Condensed Matter Section

Davis, P. M. and Fromerth, M. J. (2007) Does the arXiv lead to higher citations and reduced publisher downloads for mathematics articles? Scientometrics, accepted for publication. See critiques: 1, 2
(4) So the overall research motivation for testing QB is not an anti-OA motivation.

(5) On the other hand, the motivation on the part of some publishers to put a strong self-serving spin on these three QB findings is of course very anti-OA and especially, now, anti-OA-self-archiving-mandate. (That's quite understandable, and no problem at all.)

(6) In contrast to the three studies that have reported what they interpret as evidence of QB (Kurtz in astro, Moed in cond-mat and Davis in maths), there are the many other studies that report large OA citation (and download) advantages, across a large number of fields. Those who have interests that conflict with OA and OA self-archiving mandates are ignoring or discounting this large body of studies, and instead just spinning the three QB reports as their justification for ignoring the larger body of findings.

This will all be resolved soon, and the outcome of our QA vs. QB comparison for mandated vs. self-selected self-archiving already heralds this resolution. I am pretty confident that the empirical facts will turn out to have been the following: Yes, there is a QB component in the OA advantage (especially in the preprinting fields, such as astro, cond-mat and maths). But that QB component is neither the sole factor nor the largest factor in the OA advantage, particularly in the non-preprint fields with access problems -- and those fields constitute the vast majority. That will be the outcome that is demonstrated, and eventually not only the friends of OA but the foes of OA will have no choice but to acknowledge the new reality of OA, its benefits to research and researchers, and its immediate reachability through the prompt universal adoption of OA self-archiving mandates.

Stevan Harnad & Chawki Hajjem
American Scientist Open Access Forum

Posted by Stevan Harnad in Methodology at 10:18 | Comments (0) | Trackbacks (0)

Wednesday, January 17. 2007

Citation Advantage For OA Self-Archiving Is Independent of Journal Impact Factor, Article Age, and Number of Co-Authors

Update Jan 1, 2010: See Gargouri, Y; C Hajjem, V Larivière, Y Gingras, L Carr, T Brody & S Harnad (2010) “Open Access, Whether Self-Selected or Mandated, Increases Citation Impact, Especially for Higher Quality Research”
Update Feb 8, 2010: See also "Open Access: Self-Selected, Mandated & Random; Answers & Questions"

SUMMARY: Eysenbach has suggested that the OA (Green) self-archiving advantage might just be an artifact of potential uncontrolled confounding factors such as article age (older articles may be both more cited and more likely to be self-archived), number of authors (articles with more authors might be more cited and more self-archived), subject matter (the subjects that are cited more, self-archive more), country (same thing), citation counts of authors, etc.
Chawki Hajjem (doctoral candidate, UQaM) had already shown that the OA advantage was present in all cases when articles were analysed separately by age, subject matter or country. He has now done a multiple regression analysis jointly testing (1) article age, (2) journal impact factor, (3) number of authors, and (4) OA self-archiving as separate factors for 442,750 articles in 576 (biomedical) journals across 11 years, and has shown that each of the four factors contributes an independent, statistically significant increment to the citation counts. The OA-self-archiving advantage remains a robust, independent factor.
Having successfully responded to his challenge, we now challenge Eysenbach to demonstrate -- by testing a sufficiently broad and representative sample of journals at all levels of the journal quality, visibility and prestige hierarchy -- that his finding of a citation advantage for Gold OA articles (published OA on the high-profile website of the only journal he tested, PNAS) over Green OA articles in the same journal (self-archived on the author's website) was not just an artifact of having tested only one very high-profile journal.

In May 2006, Eysenbach published "Citation Advantage of Open Access Articles" in PLoS Biology, confirming -- by comparing OA vs. non-OA articles within one hybrid OA/non-OA journal -- the "OA Advantage" (higher citations for OA articles than for non-OA articles) that had previously been demonstrated by comparing OA (self-archived) vs. non-OA articles within non-OA journals.

This new PLoS study was based on a sample of 1492 articles (212 OA, 1280 non-OA) published June-December 2004 in one very high-impact (i.e., high average citation rate) journal: Proceedings of the National Academy of Sciences (PNAS). The findings were useful because not only did they confirm the OA citation advantage, already demonstrated across millions of articles, thousands of journals, and over a dozen subject areas, but they showed that that advantage is already detectable as early as 4 months after publication.

The PLoS study also controlled for a large number of variables that could have contributed to a false OA advantage (for example, if more of the authors that chose to provide OA had happened to be in subject areas that happened to have higher citation counts). Eysenbach's logistic and multiple regression analyses confirmed that this was not the case for any of the potentially confounding variables tested, including the (i) country, (ii) publication count and (iii) citation count of the author and the (iv) subject area and (v) number of co-authors of the article.

However, both the Eysenbach article and the accompanying PLoS editorial considerably overstated the significance of all the controls that were done, suggesting that (1) the pre-existing evidence, based mainly on OA self-archiving ("green OA") rather than OA publishing ("gold OA"), had not been "solid" but "limited" because it had not controlled for these potential "confounding effects." They also suggested that (2) the PLoS study's finding that gold OA generated more citations than green OA in PNAS pertained to OA in general rather than just to high-profile journals like PNAS (and that perhaps green OA is not even OA!):

Eysenbach (2006): "[T]he [prior] evidence on the “OA advantage” is controversial. Previous research has based claims of an OA citation advantage mainly on studies looking at the impact of self-archived articles... (which some have argued to be different from open access in the narrower sense)... All these previous studies are cross-sectional and are subject to numerous limitations... Limited or no evidence is available on the citation impact of articles originally published as OA that are not confounded by the various biases and additional advantages [?] of self-archiving or “being online” that contribute to the previously observed OA effects."

PLoS Editorial (MacCallum & Parthasarathy 2006): "We have long argued that papers freely available in a journal will be more often read and cited than those behind a subscription barrier. However, solid evidence to support or refute such a claim has been surprisingly hard to find. Since most open-access journals are new, comparisons of the effects of open access with established subscription-based journals are easily confounded by age and reputation... As far as we are aware, no other study has compared OA and non-OA articles from the same journal and controlled for so many potentially confounding factors... The results... are clear: in the 4 to 16 months following publication, OA articles gained a significant citation advantage over non-OA articles during the same period... [Eysenbach's] analysis [also] revealed that self-archived articles are... cited less often than OA [sic] articles from the same journal."

When I pointed out in a reply that subject areas, countries and years had all been analyzed separately in prior within-journal comparisons based on far larger samples, always with the same outcome -- the OA citation advantage -- making it highly unlikely that any of the other potentially confounding factors singled out in the PLoS/PNAS study would change that consistent pattern, Eysenbach responded:

Eysenbach: "[T]o answer Harnad's question 'What confounding effects does Eysenbach expect from controlling for number of authors in a sample of over a million articles across a dozen disciplines and a dozen years all showing the very same, sizeable OA advantage? Does he seriously think that partialling out the variance in the number of authors would make a dent in that huge, consistent effect?' – the answer is “absolutely”."

My doctoral student, Chawki Hajjem, has accordingly accepted Eysenbach's challenge, and done the requisite multiple regression analyses, testing not only (3) number of authors, but (1) number of years since publication, and (2) journal impact factor. The outcome is that (4) the OA self-archiving advantage (green OA) continues to be present as a robust, independent, statistically significant factor, alongside factors (1)-(3):

Tested:
(1) number of years since publication (BLUE)
(2) journal impact factor (additional variable not tested by Eysenbach) (PURPLE)
(3) number of authors (RED)
(4) OA self-archiving (GREEN)

Already tested separately and confirmed:
(5) country (previously tested: OAA separately confirmed for all countries tested -- 1st author affiliation)
(6) subject area (previously tested: OAA separately confirmed in all subject areas tested)

Not tested:
(7) publication and citation counts for first and last authors (not tested, but see Moed 2006)

Irrelevant:
(8) article type (only relevant to PNAS sample)
(9) submission track (only relevant to PNAS sample)
(10) funding type (irrelevant)
Independent effects of (1) Year of Publication (purple), (2) Journal Impact Factor (blue), (3) Number of Authors (red) and (4) OA Self-Archiving (green) on citation counts: Beta weights derived from multiple regression analyses of (column 1) raw distribution, (column 2) log normalized distribution, (columns 3-6) separate Journal Impact Factor Quartiles, and (columns 7-10) separate Year of Publication Quartiles. In every case, OA Self-Archiving makes an independent, statistically significant contribution (highest for the most highly cited articles, column 6 "Groupe Dri": i.e., the QA/QB effect). (Biology, 1992-2003; 576 journals; 442,750 articles). For more details see Chawki Hajjem's website.

In order of size of contribution:

Article age (1) is of course the biggest factor: Articles' total citation counts grow as time goes by.

Journal impact factor (2) is next: Articles in high-citation journals have higher citation counts. This is not just a circular effect of the fact that journal citation counts are just average journal-article citation counts: It is a true QB selection effect (nothing to do with OA!), namely, the higher quality articles tend to be submitted to and selected by the higher quality journals!

The next contributor to citation counts is the number of authors (3): This could be because there are more self-citations when there are more authors; or it could indicate that multi-authored articles tend to be of higher quality.

But last, we have the contribution of OA self-archiving (4). It is the smallest of the four factors, but that is unsurprising, as surely article age and quality are the two biggest determinants of citations, whether the articles are OA or non-OA. (Perhaps self-citations are the third biggest contributor). But the OA citation advantage is present for those self-archived articles (and stronger for the higher quality ones, QA), refuting Eysenbach's claim that the green OA advantage is merely the result of "potential confounds" and that only the gold OA advantage is real.
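To make "independent contribution" concrete, here is a minimal sketch of a multiple regression that returns standardized beta weights for four predictors. It does not reproduce Hajjem's analysis: the data are synthetic, generated with coefficients chosen only so the example has something to recover, and the function names are hypothetical. It uses plain Python and the normal equations rather than a statistics package.

```python
import random


def standardize(values):
    """Rescale to mean 0 and standard deviation 1 (assumes nonzero variance)."""
    mean = sum(values) / len(values)
    sd = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / sd for v in values]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def beta_weights(predictors, outcome):
    """Standardized regression coefficients, one per predictor."""
    zs = [standardize(p) for p in predictors]
    zy = standardize(outcome)
    n, k = len(zy), len(zs)
    xtx = [[sum(zs[i][t] * zs[j][t] for t in range(n)) for j in range(k)]
           for i in range(k)]
    xty = [sum(zs[i][t] * zy[t] for t in range(n)) for i in range(k)]
    return solve(xtx, xty)


# Synthetic articles (illustration only, not real citation data).
rng = random.Random(0)
n_articles = 2000
age = [rng.uniform(1, 12) for _ in range(n_articles)]
impact = [rng.uniform(0.5, 10) for _ in range(n_articles)]
authors = [rng.randint(1, 10) for _ in range(n_articles)]
oa = [rng.randint(0, 1) for _ in range(n_articles)]
citations = [3.0 * a + 2.0 * j + 1.0 * k + 4.0 * o + rng.gauss(0, 5)
             for a, j, k, o in zip(age, impact, authors, oa)]

betas = beta_weights([age, impact, authors, oa], citations)
for name, beta in zip(["age", "impact factor", "authors", "OA"], betas):
    print(f"{name}: beta = {beta:.3f}")
```

Because all four predictors enter the same model, each beta weight reflects that predictor's contribution with the other three held constant, which is the sense in which a factor can be the smallest of the four and still be independent.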

I might add that the PLoS Editorial is quite right to say: "Since most open-access journals are new, comparisons of the effects of open access with established subscription-based journals are easily confounded by age and reputation": Comparability and confounding are indeed major problems for between-journal comparisons, comparing OA and non-OA journals (gold OA). Until Eysenbach's within-journal PNAS study, "solid evidence" (for gold OA) was indeed hard to find. But comparability and confounding are far less of a problem for the within-journal analyses of self-archiving (green OA), and with them, solid evidence abounds.

I might further add that the solid pre-existing evidence for the green OA advantage -- free of the limitations of between-journal comparisons -- is and always has been, by the same token, evidence for the gold OA advantage too, for it would be rather foolish and arbitrary to argue that free accessibility is only advantageous to self-archived articles, and not to articles published in OA journals!

Yet that is precisely the kind of generalization Eysenbach seems to want to make (in the opposite direction) in the special case of PNAS -- a very selective, high-profile, high-impact journal. PNAS articles that are freely accessible on the PNAS website were found to have a greater OA advantage than PNAS articles freely accessible only on the author's website. With just a little reflection, however, it is obvious that the most likely reason for this effect is the high profile of PNAS and its website: That effect is hence highly unlikely to scale to all, most, or even many journals; nor is it likely to scale in time, for as green OA grows, the green OA harvesters like OAIster (or even just Google Scholar) will become the natural way and place to search, not the journal's website.

Having taken up Eysenbach's challenge to test the independence of the OA self-archiving advantage from "potential confounds," we now challenge Eysenbach to test the generality of the PNAS gold/green advantage across the full quality hierarchy of journals, to show it is not merely a high-end effect.

Let me close by mentioning one variable that Eysenbach did not (and could not) control for, namely, author self-selection bias (Quality Bias, QB): His 212 OA authors were asked to rate the relative urgency, importance, and quality of their articles and there was no difference between their OA and non-OA articles in these self-ratings. But (although I myself am quite ready to agree that there was little or no Quality Bias involved in determining which PNAS authors chose which PNAS articles to make OA gold), unfortunately these self-ratings are not likely to be enough to convince the sceptics who interpret the OA advantage as a Quality Bias (a self-selective tendency to provide OA to higher quality articles) rather than a Quality Advantage (QA) that increases the citations of higher quality articles. Not even the prior evidence of a correlation between earlier downloads and later citations is enough. The positive result of a more objective test of Quality Bias (QB) vs. Quality Advantage (QA) (comparing self-selected vs. mandated self-archiving, and likewise conducted by Chawki Hajjem) is reported here.

REFERENCES

Brody, T., Harnad, S. and Carr, L. (2005) Earlier Web Usage Statistics as Predictors of Later Citation Impact. Journal of the American Society for Information Science and Technology (JASIST) 57(8) pp. 1060-1072.

Eysenbach G (2006) Citation Advantage of Open Access Articles. PLoS Biology 4(5) e157 DOI: 10.1371/journal.pbio.0040157

Hajjem, C., & Harnad, S. (2007) The Open Access Citation Advantage: Quality Advantage Or Quality Bias?

Hajjem, C., Harnad, S. & Gingras, Y. (2005) Ten-Year Cross-Disciplinary Comparison of the Growth of Open Access and How it Increases Research Citation Impact. IEEE Data Engineering Bulletin 28(4) pp. 39-47.

Harnad, S. (2006) PLoS, Pipe-Dreams and Peccadillos. PLoS Biology Responses.

MacCallum CJ & Parthasarathy H (2006) Open Access Increases Citation Rate. PLoS Biol 4(5): e176 DOI: 10.1371/journal.pbio.0040176

Moed, H. F. (2006) The effect of 'Open Access' upon citation impact: An analysis of ArXiv's Condensed Matter Section

Stevan Harnad & Chawki Hajjem
American Scientist Open Access Forum

Posted by Stevan Harnad in Methodology at 18:18 | Comments (0) | Trackbacks (0)

Wednesday, December 13. 2006

President-Elect of the Association of American University Presses (AAUP) on Open Access: An Exchange

SUMMARY: Sandy Thatcher, President-Elect of AAUP, is preparing a white paper on OA and asked about PRC's study "Self-Archiving and Journal Subscriptions: Co-existence or Competition? An International Survey of Librarians' Preferences." The PRC study tried to provide evidence, via simulation and modeling, on whether author self-archiving will cause librarians to cancel journals (because there is no evidence of this yet, and APS and IOPP have both reported that they can detect no correlation).
A methodological flaw in the PRC study made it impossible to make any relevant predictions because OA self-archiving (green) had been treated as if it were an OA journal (gold), suitable for cancellation. In reality, author self-archiving of individual articles is distributed and anarchic, with no sure way of knowing how much of a journal's contents have become OA, and when; moreover, self-archiving mandates affect all journals at once, roughly equally. So the journal versus journal acquisition/cancellation options presented in the PRC simulations have no bearing on the question of self-archiving and cancellation.
It is nevertheless likely that self-archiving will eventually induce cancellations, though no one can predict when, and how strong the pressure will be. What is certain is that journals can and will adapt; trying to deny research the demonstrated advantages of OA is no longer an option. Nor is there any need for authors' institutions or funders to pay for OA publication until and unless cancellation pressure makes subscriptions unsustainable; then, journals will cut costs, downsize, and convert to the OA publishing cost-recovery model. Till then, researchers need to provide immediate OA through self-archiving, and their institutions and funders need to mandate it. Journals too, will benefit from the enhanced impact. Ahead of us is a period of peaceful co-existence between mandated OA self-archiving (growing anarchically) and non-OA journal publishing, till we approach 100% OA; after that, the market itself will decide how long non-OA publishing and the subscription/licensing model remain sustainable, and whether and when there will need to be a transition to OA publishing. But meanwhile research will already be enjoying 100% OA, at long last.

On Tue, 12 Dec 2006, Sandy Thatcher wrote on liblicense:

ST: "Coming new to this list... as President-Elect of the AAUP (Association of American University Presses) charged with preparing a white paper on OA for the Association... [and] [n]ot knowing what may have been discussed previously, I begin by asking whether this list has focused any attention on the relatively new study from the Publishing Research Consortium titled 'Self-Archiving and Journal Subscriptions: Co-existence or Competition? An International Survey of Librarians' Preferences.'"

Dear Sandy,

Welcome to the list and to your new post!

Everything you wrote in your opening message has been enlightened and constructive, and I think we may be on the verge of a new era of fruitful cooperation and collaboration between the research and publishing community.

Let me reply to the questions you addressed to me. There has indeed been previous discussion of the PRC study on this list.

There was Chris Beckett's response to my critique of the PRC study and my reply to Chris's response.

The point of disagreement, in essence, was that one of the main objectives of the PRC study had been to gather evidence on whether or not librarians will cancel journals as a consequence of author self-archiving (because there exists as yet no evidence at all that self-archiving causes cancellations, and, as you note, two publishers in the fields with the longest and most extensive self-archiving, APS and IOPP, have both reported that they can detect no correlation).

The PRC study tried to predict, via simulation and modeling, whether librarians would cancel if authors self-archived.

(1) The lesser point of my critique was that even asking librarians directly -- "Please predict how much of a journal's content would have to be available free via self-archiving to induce you to cancel it?" -- would have generated speculative guesses rather than evidence, because:

(1a) There is no way to know how much of any particular journal's content is being self-archived, since author self-archiving is gradual, distributed and anarchic;

(1b) self-archiving mandates (by research institutions and funders) would not affect one journal's contents more than another's, so their effects would be global, not focussed on any individual journal, and

(1c) no librarian can really know today what their research faculty would advise, hence what they would do, under gradual, uncertain, anarchic growth of self-archiving, and when.

(2) My more critical point was a methodological one, concerning the indirect hypothetical choices and modeling used: To avoid bias (by mentioning either self-archiving or open access), the survey asked librarians for their preferences among various hypothetical competing journals with various hypothetical properties (among them: being free), and then used a model to extrapolate this to predict cancellations. This method actually made it impossible even to infer what librarians speculated they might do under the distributed anarchic conditions described above, because, as noted, no such journal-vs-journal information or options would ever be available to librarians: self-archiving does not grow on an individual journal-vs-journal basis, but on a global, distributed, anarchic, individual-article basis. The librarian's choice is hence never between cancelling a free journal in favour of another journal. (This sort of reasoning does fit "gold" OA journals, but it does not fit "green" OA self-archiving of individual articles by individual authors.)

Journals are acquired or cancelled on a comparative/competitive basis. Individual articles -- self-archived globally and anarchically by their individual authors across all journals -- are not the comparative/competitive journal acquisition/cancellation options that are familiar to acquisitions librarians, and that the PRC study was trying to simulate, and from which the model was trying to make predictions about the conditions that would cause cancellations. The model works for simulating actual comparative journal choices, but it fails for the special case of anarchic article self-archiving.

Hence the survey did not provide the evidence that still does not exist today: that self-archiving will cause cancellations.

Let me add, though, that I personally do believe that global self-archiving will eventually lead to cancellation pressure, but no one knows how much or when, as it will depend on how quickly global self-archiving and self-archiving mandates will grow. I must also add, though, that I do not believe that this likelihood of eventual cancellation pressure is any grounds for not self-archiving now, or for not mandating self-archiving now. Self-archiving brings substantial demonstrated benefits to research, researchers, their institutions, their funders, and the tax-paying public that funds the funders and institutions. OA is consequently optimal and inevitable for research (and already long overdue!). It is therefore publishing that will need to adapt to any eventual cancellation pressure that might arise from OA self-archiving; and publishing can, and will successfully adapt.

ST: "Another very interesting finding for me is that librarians care a lot that the material is peer-reviewed but care very little whether they have access to the final published version."

Yes. In fact that was the one substantive finding of the study. But the same considerations (about global anarchic growth) apply either way (whether the self-archived draft is the author's postprint or the publisher's PDF).

ST: "Librarians seem to place little or no value on the final processing of manuscripts after acceptance, which should be an eye-opener to publishers"

Yes! Hence this might be a region in which costs could already be cut, even before any cancellation pressure is felt.

ST: "Once we publishers think something is going to happen, we will act on those beliefs if they seem to be firmly supported, by such studies as the PRC's... behaviors will start to change based on beliefs, however erroneous they may be."

I am not sure what publishers are contemplating doing, but it seems to me that self-archiving and self-archiving mandates are in the hands of researchers, their institutions and their funders. So cooperating and adapting to this new PostGutenberg reality would, I think, be the optimal strategy.

Berners-Lee, T., De Roure, D., Harnad, S. and Shadbolt, N. (2005) Journal publishing and author self-archiving: Peaceful Co-Existence and Fruitful Collaboration.

ST: "(By the way, the PRC study directly confronts the "evidence" of the physics preprint archive not affecting cancellations of physics journals, by pointing out that the archive combines peer-reviewed and not peer-reviewed materials, thus making it less than fully reliable as a source of completely authenticated work in the field.)"

Indeed. And the same will be true of the global network of Institutional Repositories: They too will contain preprints as well as postprints.

ST: "I think the tipping phenomenon, which we know already to have shown itself operative in this arena when e-journals came to displace print journals as the main product in the marketplace (rather more quickly than many people anticipated), is extremely important to keep in mind here. This is what I see as a real possibility: enough of the major commercial journal publishers in an ever more consolidated market (after the purchase of Blackwell by Wiley) become convinced that their subscriptions will erode seriously (if, say, the FRPAA becomes law) and therefore decide to abandon the arena of STM journal publishing because they cannot sustain the expected profit margins under the new regime (as outlined by Dr. Harnad)."

As always, if a publisher decides to abandon a journal title, it can migrate to another publisher. There are now a growing number of new gold OA publishers, ready and willing to take over established titles (and to scale down to whatever there is still a market for, in the OA era).

But, to repeat, the growth of green OA via self-archiving is anarchic, not based on individual journals separately approaching 100% OA. Hence the "tipping point" is a global one, and still far away, and will approach gradually, so journals can adapt by phasing out goods and services for which there is no longer a market. There will always be a market for peer-review service provision. (And I wouldn't write off the market for the print edition, or even the publisher's enhanced PDF and copy-editing just yet!)

Sandy, I actually think you answer this question yourself, with:

ST: "I long ago predicted that university press journals would migrate to the electronic environment [and that it] was therefore much more possible, and more likely, that journals could spring up online without the support of publishers, if they went OA and did not have to bother about the complications of outsourcing printing and handling subscription fulfilment. (And a journal only has to be designed once, and the template followed thereafter, while marketing takes care of itself if the journal is aimed at a niche community anyway.)"
ST: "This could all happen very quickly, as "tipping" phenomena generally do. Where would that scenario leave the academy? With several thousand journals suddenly left to fend for themselves!"

Nothing sudden. And plenty of flexible ways to fend, in the portable online age!

ST: "the infrastructure of universities today is simply not prepared, in any shape or form, to deal with that "crisis" and find some way of sustaining those journals."

There is no evidence at all for such an impending crisis, just as there is as yet no evidence of self-archiving causing cancellations. (There is, however, plenty of evidence for the benefits of self-archiving.)

ST: "Self-publishing would then proliferate, and chaos would ensue for some time to come. Are librarians prepared to deal with the consequences?"

It's not up to librarians but to researchers. (And I'm afraid I have to say this sounds like hypothetical alarmism, rather than evidence-based reasoning and planning.) Titles will migrate, if need be. Peer review (done by and for researchers, for free, mediated and managed by the journal) will remain intact. And the self-archiving of peer-reviewed articles is not self-publishing.

ST: "I do not depict this nightmare scenario in order to defend the existing system... But I do think university faculty, administrators, and librarians need to think through these issues and possible scenarios very carefully and "worst-case" planning would probably be appropriate here."

I agree that cooperative planning for a possible eventual downsizing to peer-review service-provision alone and a transition to the OA cost-recovery model under cancellation pressure (and corresponding institutional windfall savings) would be an excellent idea -- and much more constructive than trying to wish away the proposed self-archiving mandates such as the FRPAA.

Please see:

"The Urgent Need to Plan a Stable Transition" (began Sep 1998!)

Annex to UK Select Committee Evidence

Berners-Lee, T., De Roure, D., Harnad, S. and Shadbolt, N. (2005) Journal publishing and author self-archiving: Peaceful Co-Existence and Fruitful Collaboration

Best wishes,

Stevan Harnad
American Scientist Open Access Forum

Posted by Stevan Harnad in Methodology at 19:41 | Comments (0) | Trackbacks (0)


Open Access Archivangelism
