Open Access Archivangelism

Thursday, October 12. 2006

CIHR Proposes 99.99% Optimal OA Self-Archiving Mandate

Canadian Institutes of Health Research (CIHR) has proposed a 99.99% optimal Open Access Self-Archiving Mandate:

CIHR grant and award holders must:
(1a) [deposit all] final peer-reviewed published articles or final peer-reviewed full-text manuscripts
(1b) in an appropriate [OAI]-compliant digital archive, such as PubMed Central, or an institutional repository
(1c) immediately upon publication.
(1d) (A publisher-imposed embargo on open accessibility of no more than 6 months is acceptable.)
[OR]
(2a) submit their manuscripts either to a journal that provides immediate open access to published articles (if a suitable journal exists)
[OR]
(2b) submit their manuscripts to a journal that allows authors to retain copyright and/or allows authors to archive journal publications in an open access archive within the six-month period following publication.

There is only one unnecessary and confusing clause in CIHR's policy: (2b). (2b) is redundant with [1]! (2b) says the author must publish in a journal that allows [1]. But that is already implicit in [1] -- it is not a sub-option of [2]. [1] is the requirement to self-archive immediately (and to set access as Open Access within 6 months). Alternative [2] is to publish in an Open Access journal. That covers all the alternatives! (2b) is completely redundant. So (2b) should simply be dropped.

That's all, really. There are still a few minor changes that would make the policy simpler, clearer, and more systematic and coherent. In order to encourage a uniform practice that will generalize and apply to all fields, whether or not funded by CIHR, it would be best if CIHR's uniform rule consisted of just these 5 components:

I. must deposit final peer-reviewed manuscript (or published version)
II. in the author's own IR (or other OAI-compliant repository)
III. immediately upon (acceptance for) publication
(IV. access to the deposit must be set as Open Access within 6 months at latest)
(V. where possible, publish in a suitable OA journal)

This way, everything gets deposited immediately, and access is OA within 6 months. The IR should be the preferred default locus, from which PubMed Central or other archives can harvest, but direct deposit elsewhere can be allowed as an option if the researcher has no institutional IR yet.

During any Closed Access embargo interval, IR's will have the EMAIL EPRINT REQUEST button to fulfill any individual requests for a single email copy -- Fair Use -- from would-be users who see the postprint's openly accessible metadata: available for DSpace IRs and for EPrints IRs.

(CIHR also requires making research data and materials available for reasonable requests: Might as well recommend -- but not require -- that they are self-archived too, wherever possible!)

Bravo CIHR!

Stevan Harnad
American Scientist Open Access Forum

PS: Note that, unlike the Wellcome Trust's Self-Archiving Mandate, CIHR's proposed mandate does not offer to fund option (2a) (publishing in an Open Access or hybrid "Open Choice" journal). Apparently CIHR did not feel it had the spare cash for this. This is quite understandable (although no doubt some publishers will complain vociferously about it): The fact is that all potential publication funds are currently tied up in covering the costs of institutional subscriptions, worldwide. If and when self-archiving should ever lead to institutional subscription cancellations that make the subscription model unsustainable, then those very institutional windfall savings themselves will be the natural source for the cash to cover OA publishing costs. No need to take it from research funds at this time, when it is unaffordable. OA is the immediate and urgent (and long-overdue) priority today, not pre-emptively cushioning a hypothetical transition to another publishing cost-recovery model (except where the spare cash is available).

Please note that a public consultation has been launched to seek comments on CIHR's proposed Policy on Access to Research Outputs.

Text of Proposal: English - French

Comments or questions on the draft policy should be sent by e-mail access@cihr-irsc.gc.ca or by mail to the address below.

Consultation on the Access to Research Outputs Policy
Canadian Institutes of Health Research
160 Elgin Street, 9th Floor
Address Locator 4809A
Ottawa, ON, K1A 0W9
Canada

Posted by Stevan Harnad in Self-Archiving Mandates at 03:29 | Comments (0) | Trackbacks (0)

Tuesday, October 10. 2006

Hypotheses Non Fingo

J.W.T.Smith (Templeton Library, University of Kent)
wrote in JISC-REPOSITORIES:
"My basic interpretation of the 'Harnad model' is that Stevan wants every researcher to locally (or remotely) make available an open copy... in parallel with the current journal model and [using] quality control services of existing journals. [This is] parasitic on the current model. What Stevan does not want to acknowledge is that this parasitism will ultimately destroy the current journal model... mandates (for self archiving) will not not only increase the number of research articles freely available (a good thing) but will also accelerate the end of the 'traditional' journal and force the evolution of a new form of academic publishing to replace it (in my opinion also a good thing...)."

Hypotheses non fingo. There is no "Harnad model":

      Research is published in c. 24K peer-reviewed journals (c. 2.5M articles annually).
            (Datum, not hypothesis.)
      Not all would-be users can access all those articles online.
             (Datum, not hypothesis.)
      Self-archiving supplements access, for those would-be users.
            (Datum, not hypothesis.)
      Self-archiving is correlated with higher and earlier download and citation impact.
            (Datum, not hypothesis.)
      Self-archiving is explicitly endorsed by 93% of journals.
            (Datum, not hypothesis.)
      Only c. 15% of annual articles are being spontaneously self-archived today.
            (Datum, not hypothesis)
      95% of researchers surveyed report they will self-archive if it is mandated.
            (Datum, not hypothesis.)
      When self-archiving is mandated, it rapidly rises toward 100%.
            (Datum, not hypothesis.)
       No evidence has been reported to date that self-archiving causes cancellations.
            (Datum, not hypothesis.)

[*Self-archiving might (or might not) eventually cause cancellations and a change in journal publishing model. (Hypothesis) Mea maxima culpa!]

Hypotheses non fingo. There is no "Harnad model."

Berners-Lee, T., De Roure, D., Harnad, S. and Shadbolt, N. (2005)
Journal publishing and author self-archiving:
Peaceful Co-Existence and Fruitful Collaboration.

Stevan Harnad
American Scientist Open Access Forum

Posted by Stevan Harnad in Publishing Reform at 14:27 | Comments (0) | Trackbacks (0)

Monday, October 9. 2006

Critique of EPS/RIN/RCUK/DTI "Evidence-Based Analysis of Data Concerning Scholarly Journal Publishing"

Update Jan 1, 2010: See Gargouri, Y; C Hajjem, V Larivière, Y Gingras, L Carr,T Brody & S Harnad (2010) “Open Access, Whether Self-Selected or Mandated, Increases Citation Impact, Especially for Higher Quality Research”
Update Feb 8, 2010: See also "Open Access: Self-Selected, Mandated & Random; Answers & Questions"

SUMMARY: This Report on UK Scholarly Journals was commissioned by RIN, RCUK and DTI, and conducted by ELS, but its questions, answers and interpretations are clearly far more concerned with the interests of the publishing lobby than with those of the research community.

The Report's two relevant overall findings are correct and stated very fairly in their summary form:
[1] "Overall, [self-archiving] of articles in open access repositories seems to be associated with both a larger number of citations, and earlier citations for the items deposited....The reasons for this [association] have not been clearly established - there are many factors that influence citation rates... Consistent longitudinal data over a period of years... would fill this gap."

[2] "There is no evidence as yet to demonstrate any relationship (or lack of relationship) between subscription cancellations and repositories... Proving or disproving a [causal] link between availability in self-archived repositories and cancellations will be difficult without long and rigorous research."
The obvious empirical and practical conclusion to draw from the findings -- that (1) all the self-archiving evidence to date is positive for research and that (2) none of the self-archiving evidence to date is negative for publishing -- would have been that the research community should now apply and extend these findings -- by applying and extending self-archiving (through self-archiving mandates) to all UK research output, along with consistent, rigorous longtitudinal studies over a period of years, to test (1) whether the positive effect on citations continues to be present (and why) and (2) whether the negative effect on subscriptions continues to be absent.

But instead, the two overall findings are hedged with volumes of special pleading, based mostly on wishful thinking, to the effect that (1') the observed relationship between self-archiving and citations may not be causal, and that (2') there may exist an as-yet-unobserved causal relationship between self-archiving and cancellations after all.

Even that would be alright, if this Report's conclusions were coupled with a clear endorsement of the proposed self-archiving mandates, so that the competing hypotheses can be put to a rigorous long-term test. But the only test the commissioners of this Report seem to be interested in conducting is "Open Option" publishing, i.e., authors paying publishers to make their article OA for them, instead of self-archiving it for themselves. This would certainly be a nice way to hold author self-archiving and institution/funder self-archiving mandates at bay for a few years more, while at the same time protecting publishers from undemonstrated risk of revenue loss. But it would also leave global unmandated self-archiving to continue to languish at the current spontaneous 15% rate that the self-archiving mandates had been meant to drive up to 100%. And it would leave research unprotected from its demonstrated risk of impact loss. The option of having to pay to provide OA is certainly not likely to enhance the unmandated rate of uptake by authors (though I'm sure publishers would have no quarrel with funder mandates to provide OA coupled with the funds to pay publishers' asking price for paid OA, as provided by the Wellcome Trust).

The longterm test will nevertheless be conducted, because five out of seven UK Research Councils have already mandated self-archiving (an eighth CCLRC, will soon merge with one of the mandators, PPARC). Their citation rates and their cancellation rates can then be compared with those for the two that have not mandated self-archiving (and whose authors hence only do it spontaneously, by "self-selection"). Alas this will be mostly comparing apples and oranges (e.g. MRC vs AHRC), and it will needlessly be depriving the oranges of several more years of potential growth enhancement. My guess is that all the other councils -- except possibly the paradoxical EPSRC (which evidently thinks, with the publishing lobby, that there's still some sort of pertinent pretesting to be done for a few more years here) -- will come to their senses long before that, unpersuaded by Reports like this one.

UK scholarly journals: 2006 baseline report
An evidence-based analysis of data concerning scholarly journal publishing.
Prepared on behalf of the Research Information Network, Research Councils UK and the UK Department of Trade and Industry.
By Electronic Publishing Services Ltd
In association with Professor Charles Oppenheim and LISU at Loughborough University Department of Information Science

This is a rather long and repetitious report, but it does contain a few nuggets. It is obviously biassed, but biassed in a restrained way, meaning it does not really try to conceal its biases, nor does it overstate biassed conclusions. It also (reluctantly, but in most cases candidly) acknowledges its own weaknesses.

(The Report was commissioned by RIN, RCUK and DTI, but it is glaringly obvious that the questions, answers and interpretations have been slanted toward the interests of the publishing lobby rather than those of the research community -- possibly because the research community has no lobby in this matter, apart from the OA movement itself! Nevertheless, there has been considerable circumspectness, at least in the summary and conclusion passages, with weak points and gaps usually pointed out explicitly rather than denied or concealed, and with the overall preoccupation with publishing interests rather than research interests very open too.)

Some quotes and comments:

Whilst some evidence does suggest that [self-archiving in] repositories [is] an important new factor in the journal cancellation decision process, and one which is growing in significance, there is no research reporting actual or even intended journal subscription cancellation as a consequence of the growth of OA self-archived repositories.

So far, this sounds fair and reasonable. (In fact, this is the gist of the Report! The rest is mostly special pleading.)

Subscriptions are reported to have been declining over a period of 10+ years, but for a number of reasons. Proving or disproving a link between availability in self-archived repositories and cancellations will be difficult without long and rigorous research. In this connection, the outcome of research recently announced by the Research Councils UK (RCUK) with the co-operation of Macmillan, Blackwell and Elsevier, will be eagerly awaited, even though a report is not due until late 2008.

With evidence of self-archiving's benefits to research mounting, and zero evidence yet of any negative effect at all on publisher revenue, publishers nevertheless seem quite willing to wait (and keep research waiting too), trying to fend off self-archiving and its potential benefits to research for a long time to come yet, in order to keep trying to find some evidence of negative causal effects on publisher revenue (or, failing that, to deny positive causal effects on research impact).

Note that whereas a link between OA self-archiving and subscription decline has not yet been "proved or disproved" (not for want of looking!) -- and it is for that reason that we are hearing these calls for "long and rigorous research" -- the vast preponderance of the evidence we do have has already "proved" a "link" between OA self-archiving and citation counts (a link that is almost certainly causal, despite the wishful thinking of some who have a vested interest in its all turning out to be merely a-causal self-selection and superstition on the part of authors).

The question that the research community accordingly needs to ask itself is whether self-archiving's evidence-based benefits to research should be held in abeyance still longer, and meanwhile interpreted by default as a-causal, in order to buy still more time to try to "prove/disprove" hypothetical subscription declines for which there is no evidence whatsoever to date, even in fields where self-archiving has been near 100% for years.

(Researchers should also go on to ask themselves whether the research benefits should be held in abeyance even if they are causally linked to a subscription decline: Is research impact to be sacrificed in the service of publisher revenue? Are we conducting and funding research in order to generate -- or to safeguard -- publisher revenue?)

There is no evidence as yet to demonstrate any relationship (or lack of relationship) between subscription cancellations and repositories. Work in this field would need sufficient, representative and balanced samples, and the collaboration of all stakeholders, including especially research institutions and publishers. Any such study will need to be maintained over a fairly extended period, with regular reports, since it seems likely that the position could change with time if the contents of self-archiving repositories become progressively more comprehensive.

This would be fine, if proposed as an extended research project to be conducted after self-archiving mandates are in place, to analyze their long-term effects on subscriptions.

But this would be an exceedingly self-serving suggestion on the part of the publishing community (and a methodologically empty one) if meant as a "pilot" study that must somehow be conducted before adopting self-archiving mandates. (And it would be exceedingly self-defeating of the research community to even consider accepting such a pre-emptive suggestion as a precondition, before adopting self-archiving mandates.)

There is some consistency in results that show more citations for articles self-archived in repositories as distinct from the same or similar articles available [only via journal] subscription (although there have also been a few contradictory results). Overall, deposit of articles in open access repositories seems to be associated with both a larger number of citations, and earlier citations for the items deposited.

This a fair summary -- except that immediately after stating it, this "association" is about to be deconstructed (much as the "association" between cigarette-smoking and lung cancer was deconstructed for years and years by the tobacco industry, claiming that only correlation had been demonstrated, and not causation). Read on:

The reasons for this [association] have not been clearly established - there are many factors that influence citation rates, including the reputation of the author, the subject-matter of the article, the self-citation rate, and, of course, how important or influential the repository is in its own right. The little existing evidence suggests that a possible [sic] reason for increased citation counts is not that the materials were free, or that they appeared more rapidly, but that authors put their best work into OA format. This research was limited to one discipline, however [astronomy], and more extensive evidence is required to validate this finding.

This (important) study by Kurtz et al in astronomy, however, is not what the vast majority of the evidence (no longer little!) shows: Moreover, as noted, this a-causal interpretation -- only one of the possible interpretations of the astronomy evidence -- also happens to be the interpretation that the publishing community prefers for all the self-archiving evidence, in all fields. The alternative interpretation is that the relationship is causal: that the OA advantage is not merely an arbitrary whim on the part of the better authors to make their work OA, to no causal effect at all (why on earth would they be doing it at all then?): They do it because making their work more accessible increases its accessibility, uptake, downloads, usage, applications, citations, impact -- exactly as the correlational evidence shows, without exception, in field after field.

(NB: The only methodologically unexceptionable way to demonstrate causation here, by the way, is to select a large enough random sample of articles, divide them in half randomly, mandate half of them to be self-archived and half not, and then compare their respective citation counts after a few years. No one is likely to do quite that study -- any more than it was likely that a large random sample of people would be divided in half randomly, with half mandated to smoke and half not! But we are in the process of doing an approximation to that causal study, by comparing the citation counts of articles in the IRs of the (few) institutions that have already mandated self-archiving with the average for other articles in the same journals/years in which those articles appeared, but that have not been self-archived; we will also compare the size of the OA advantage for mandated and comparable non-mandated self-archiving. [We do not believe for a moment that these data are necessary to demonstrate causation, as causation is a virtual certainty anyway, but we are ready to play the game, in order to try to cut short the absurd delay in doing the obvious: mandating self-archiving universally.])

Although quite a lot of evidence has been collected regarding the quantitative effect of OA on citation counts (whether in the form of OA journals or as self-archived articles), much of it is scattered, uses inconsistent methods and covers different subject areas.

Yet, despite this scatter, methodological inconsistency and diversity, virtually all of it keeps showing exactly the same consistent pattern: A citation (and download) advantage for the OA articles. (No amount of special pleading can make that stubborn pattern go away!)

Consistent longitudinal data over a period of years to measure IF trends in a representative range of journals would fill this gap

There is no gap! There is a growing body of studies, across all fields and all journals, that keeps showing exactly the same thing: the OA advantage (in article citations and article downloads: this is not about journal impact factors, especially because comparing different journals is comparing apples and oranges).

(There seems to be a confusion here between the existence of the correlation itself, between self-archiving and citation count counts -- this is found consistently, over and over -- and the question of the causal relation, which will not be answered by longtitudinal data (we have longtitudinal data already!) but by comparing mandated and unmandated self-archiving: if they both show the OA advantage, then the effect is causal and self-selection bias is a minor component.)

e.g., studying a range of journals that were toll-access and went OA (or vice versa). In the short-term, more data in different disciplines measuring the impact on citation counts of articles in hybrid journals or articles that are available in both forms versus articles that are only available in one of the forms will improve the evidence base.

No, the question about the reality and causality of the OA advantage will not be settled by OA journal vs. non-OA journal comparisons; that can always be dismissed as comparing apples with oranges, and, failing that, can always be attributed to self-selection bias (i.e., choosing to publish one's better work in an OA journal)!

And if we wait for the uptake of hybrid Open Choice -- i.e., paying the journal to self-archive the published PDF for you -- these "longtitudinal" studies are likely to take till doomsday (and any positive outcome can still be dismissed as self-selection bias in any case!).

What is needed is precisely the data already being gathered, on huge samples, across all disciplines, comparing citation counts for self-archived versus non-self-archived articles within the same journal and year. The result has been a consistent, high OA Advantage (which has elicited a lot of special pleading about causality).

So we will look at the mandated subset of the self-archived papers, to try to show that the OA advantage is not (only, or mostly) a self-selection effect (Quality Bias [QB]).

(There is undoubtedly a non-zero self-selection [QB] component in the OA advantage, but there are many other components as well, including a Quality Advantage [QA], an Early Access Advantage [EA], a Competitive Advantage [CA, which will, like QB, vanish once all articles are OA], and a Usage (Download) Advantage [UA]. At 100% OA, there will no longer be any QB or CA (or Arxiv Advantage [AA]), but EA, QA and UA will still be going strong. EA and UA components have already been confirmed by the Kurtz study in astronomy. QA is implied by the repeated finding of a positive correlation between citation count and the proportion of those articles with that citation count that are OA. The mandate study will try to show that this correlation is causal, i.e., QA, not QB.)

Harnad, S. (2005) OA Impact Advantage = EA + (AA) + (QB) + QA + (CA) + UA.
The whole area of the relationship between citation counts and scholarly communication channels is confused because of problems associated with quality bias [QB] (e.g., if scholars tend to self-archive only their best work, as suggested by Kurtz et al. [in astronomy]; alternatively, it may be that only the best journals are OA). In other words, differences in citation counts and IFs may simply reflect the quality of the materials under study rather than having anything to do with the channel by which the material is made available.

First, the issue is article citation counts, not journal Impact Factors (IFs).

Second, this is all special pleading. The biggest OA effects are based on comparing articles within the same journal/year. The size of the effect is indeed correlated with the quality of the article, because no amount of accessibility will generate citations for bad articles, whereas good articles benefit the most from a level playing field, with all affordability/accessibility barriers removed: that is the Quality Advantage [QA]. The idea that the Quality Advantage is merely a Quality (Self-Selection) Bias [QB], i.e., that the advantage is merely correlational, not causal, is of course a logical possibility, but it is also highly improbable (and would imply that accessibility/affordability barriers count for nothing in usage and citations, and that the better work is being made OA by its authors for purely superstitious reasons, because doing so has no effect at all!).

Overall, we concur with Craig's introduction that "the problems with measuring and quantifying an Open Access advantage are significant. Articles cannot be OA and non-OA at the same time."

They need not be. It is sufficient if we take a large enough sample of articles that are OA and non-OA from the same journals and years. Randomly imposing the self-archiving would be the only way to equate them completely (and our ongoing study on mandated self-archiving will approximate this).

(The analysis by Craig, commissioned by Blackwell Publishing, has not, so far as I know, been published.)

"Further, the variation of citation counts between articles can be extremely high, so making controlled comparisons of OA vs. non-OA articles nigh on impossible" [Craig, Blackwell Publishing]

(The way Analysis of Variance works is to compare variation between and within putatively different populations, to determine the probability that they are in reality the same population. The published comparisons show that the OA/non-OA differences are highly significant, despite the high variance.)

It would of course be absurd to try to compare citation counts for OA and non-OA articles having the same citation counts. But we can compare OA and non-OA article counts among articles having the same citation counts, in the same journals -- and what we find is a strong positive correlation between the citation count and the proportion of articles that are OA (just as Lawrence reported in 2001, but not only in computer science, but across all 12 disciplines studies so far, and with much bigger sample sizes):

Source 4.8: Hajjem, C., Harnad, S. and Gingras, Y. (2005) Ten-Year Cross-Disciplinary Comparison of the Growth of Open Access and How it Increases Research Citation Impact. IEEE Data Engineering Bulletin 28(4) pp. 39-47.

Note that the appendix to the Report under discussion here, states, in connection with the above study, which it cites:

"Harnad is THE advocate of OA and, thus, whilst expert in the field, is inevitably biased."

There is a bit of irony in the fact that in connection with another of the studies it cites:

Source 4.9: Harnad, S, Brody, T, Oppenheim, C et al, Comparing the impact of open access versus non open access articles in the same journals, D-Lib Magazine, 10,(6), 2004,

the appendix of the Report goes on to say:

"Harnad is THE exponent of OA, but, thus, potentially less objective."

Ironic (or, shall we say, conflicted, since this Report aspires to be a neutral one as between the interests of the research community and the publisher community), because the sole named collaborator on the Report is also a co-author of the above-cited study!

Let us agree that we all have views on the underlying issues, but that reliable data speak for themselves, qua data, and our data (and those of others) keep showing the same consistent OA Advantage. The disagreement is only on the interpretation: whether or not the consistent correlations are causal. And here, allegiances are tugging on both sides: Those favouring causality tend to come from the research community, those favouring a-causality tend to come from the publishing community. (Let us hope that the data from mandated self-archiving will soon settle the matter objectively.)

"[since] any Open Access advantage appears to be partly [sic] dependent on self-selection, the more articles that are {self-}archived... you'd expect to see any Open Access advantage reduce." [Craig, Blackwell Publishing]

Note that Craig carefully says "partly" -- and that we agree that self-selection is one of the many potential contributors to the OA advantage.

We also agree, of course, that once 100% OA is reached, the OA citation advantage -- in the form of an advantage of OA over concurrent non-OA articles -- will be reduced: indeed it will vanish! With all articles OA, there can no longer be either a Competitive Advantage [CA] or a Self-Selection Advantage (Quality Bias, QB) of OA over (non-existent) non-OA.

But the Quality Advantage [QA] will remain. (Higher quality articles will be used and cited more than they would have been if they had not been OA: this is not a competitive advantage but an absolute one.) And the Early Advantage [EA] as well as the Usage (Download) Advantage [UA] will remain too (as already shown by Kurtz's findings in Astronomy).

"Authors self-archiving in the expectant belief that each and every paper they archive will receive an Open Access advantage of several hundred percent are going to be sorely disappointed." [Craig, Blackwell Publishing]

This too is correct, but who on earth thought that OA would guarantee that all work would be used, whether or not it was any good? OA levels the playing field so merit can rise to the top, unconstrained by accessibility or affordability handicaps. But bad remains bad, and let's hope that researchers will continue to avoid trying to build on weak or invalid findings, whether or not they are OA.

The OA advantage is an average effect, not an automatic bonus for each and every OA article; moreover, the OA advantage is highly correlated with quality: The higher the quality, the higher the advantage. It is this effect that is open to the a-causal interpretation that the Quality Advantage [QA] is merely a Quality Bias [QB] (Self-Selection). But, equally (and, in my view, far more plausibly) it is open to the causal interpretation that OA causes wider usage and citation precisely because it removes all accessibility/affordability constraints that are currently limiting uptake and usage. That does not mean everything will be used more, regardless of quality ("usefulness"): But it will allow users (who are quite capable of exercising self-selection too!) to access and use the better work, selectively.

In addition, since the distribution of citations is not gaussian -- a small percentage of articles receives most of the citations and more than half of articles receive no citations at all -- it is almost axiomatic that the OA advantage will be strongest in the high-quality range

Finally, it is worth noting that all researchers in the field are agreed that if the vast majority of scholarly publications become available in OA form, no citation advantage to OA will be measurable.

It is a tautology that with 100% OA, the OA/NOA ratio is undefined! But EA will still be directly measurable, and it will be possible to infer UA and QA indirectly (UA by comparing downloads for articles of the same age, before and after OA for the same articles, and QA by doing the same with citations; the Kurtz study used such methods in Astronomy. But by that time (100% OA), not many people will still have any interest in the a-causal hypothesis.

Thus, what OA advantage there is will prove to be temporary if OA does become the standard mode of publication.

This, however, is simply incorrect. At 100% OA, the Competitive Advantage (CA) will be gone; the Self-Selection Advantage (Quality Bias, QB) will be gone; the method of comparing citation counts for OA and non-OA articles within the same journal and year will be gone. So much is true by definition.

But (as Kurtz has shown in Astronomy), the Early Advantage and the Usage Advantage will still be there. And the Quality Advantage, will still be there too; and that was what this was all about: Not just a horse-race for who can make his articles OA first, so as to reap the competitive advantage before 100% OA is reached (though that's not a bad idea!); not a guarantee that, no matter how bad your work, you can increase your citations by making them OA; but a guarantor that with access-barriers removed, quality will have the best chance to have its full potential impact, to the benefit of research productivity and progress itself, as well as the authors, institutions and funders of the high quality work.

(There is a bit of a [lurid] analogy here with saying that if only we can get everyone to smoke, it will be clear that smoking has no differential effects on human health! Perhaps the converse is a better way to look at it: if only we could get everyone to stop smoking, smoking will no longer have a differential effect on human health!)

(PS: OA is not a "mode of publication": OA publication is a mode of publication. OA itself is a mode of access-provision, which can be done in two ways, via OA publication or via OA self-archiving of non-OA publications.)

Self archived articles

It is this area that has been most studied, with numerous key publications. Most of these are focussed on the citation advantage of self-archived articles rather than of OA journals. Craig, in an as yet unpublished review, provides an excellent overview of the evidence collected to date. Lawrence (Source 4.13) is significant because it was the first major paper that identified a citation advantage for OA self-archived articles, and it has been widely cited ever since. However, it was based on a too small-scale a study to support general conclusions. Harnad et al. (Source 4.9) provides a useful summary of the state of play of OA advantage studies, while Hajjem et al. (Source 4.8 ) is fairly typical of the many articles produced by Harnad claiming that self-archiving leads to higher citation counts.

Let us be clear: The many OA vs. non-OA studies, ours and everyone else's, across more than a dozen different disciplines, many of them based on large-scale samples, all show the very same consistent pattern of positive correlation between OA and citation counts. Those are data, and they are not under dispute. The only "claim" under dispute is that that consistent correlation is causal...

Antelman (Source 4.1) is arguably the most carefully constructed study of the question. Articles in four disciplines were evaluated, and in each case it was found that open access articles had greater citation counts than non-open access articles.

One wonders why this particular small-scale study (of about 2000 articles in 4 fields) was singled out, but in any event, it shows exactly the same pattern as all the other studies (some of them based on hundreds of thousands of articles instead of just a few thousand, in three times as many fields).

Eysenbach challenges the notion that OA "green" articles (i.e., those in repositories) are more effective than OA "gold" (i.e., those published in OA journals, such as those produced by Public Library of Science) in obtaining high citation counts. It is this part of his paper that produced a furious response from Harnad, much of it focused on particular details.

The issue was not about OA green (self-archived) articles producing higher citation counts than OA gold (OA-journal)! No one had claimed one form of OA was more effective than the other in generating the OA Advantage before the Eysenbach study: It was Eysenbach who claimed to have shown gold was more effective than green -- indeed that green was only marginally effective at all!

And I think anyone reading the exchanges will see that all the fury is on the Eysenbach side. All I do is point out (rather patiently) where Eysenbach is overstating or misstating his case:

Harnad, S. (2006)PLoS, Pipe-Dreams and Peccadillos PLoS Biology eletters (May 16, 2006) [1] [2] [3] [4]

Eysenbach's study does find the OA advantage, as many others before it did. It certainly doesn't show that the gold OA advantage is bigger than the green OA advantage, in general. It simply shows that for the 1500-article sample in the one journal tested, Proceedings of the National Academy of Sciences (PNAS), a very high impact journal, both paid OA (gold) and green OA (free) increased citation counts over non-OA, but gold increased them more than green. That result is undisputed. Its extrapolation to other journals is:

The likely explanation of the PNAS result is very simple: PNAS is not a randomly chosen, representative journal: it is a very high-impact, very high visibility, interdisciplinary journal, one of very few like it (along with Nature and Science). Articles that pay for OA are immediately accessible at PNAS's own high-visibility website -- a website that probably has higher visibility than any single institution's IR today. So PNAS articles made freely accessible at PNAS's website get a bigger OA advantage than PNAS articles made made freely accessible by being self-archived in the author's own IR.

The reason it definitely does not follow from this that gold OA is bigger than green OA is very simple: Most journals are not PNAS, and do not have the visibility or average impact of PNAS articles! Hence Eysenbach's valid finding for one very high-impact journal simple does not generalize to all, most, or even many journals. Hence it is not a gold/green effect at all, but merely a very high-end special case.

Apart from the spurious gold/green advantage, Eysenbach did confirm, yet again, (1) the OA advantage itself, and confirmed it (2) within a very short time range. These are both very welcome results (but not warranting to be touted, as they were, by both the author and by the accompanying PLoS editorial, as either the first "solid evidence" of the OA advantage -- they certainly were not that -- or a demonstration that gold OA generates more citations than green OA: the very same method has to be tried on middle and low-ranking journals too, before drawing that conclusion!). (Nor are the PLoS/PNAS results any more exempt from the methodological possibility of self-selection bias [QB] than any of the many prior demonstrations of the OA advantage, as authors self-choose to pay PNAS for gold OA as surely as they self-choose to self-archive for green OA!)

The fury on Eysenbach's part came from my pointing out that his and PLoS's claim to primacy for demonstrating the OA advantage (and their claim of having demonstrated a general gold-over-green advantage) was unfounded (and might have been due to both PLoS's and Eysenbach's zeal to promote publication in gold journals: Eysenbach is the editor of one too, but not a high-end one like PNAS or PLoS): Eysenbach's was just the latest in a long (and welcome) series of confirmations of the OA advantage (beginning with Lawrence 2001), the prior ones having been based on far larger samples of articles, journals and fields (and there was no demonstration at all of a general gold over green advantage: just the one non-representative, hence non-generalisable special case of PNAS).

Both authors believe that OA produces a citation advantage, but Eysenbach has presented evidence that casts doubt on Harnad's notion that the "green" route is the preferred route to getting that increased impact.

Green may not be the preferred route to OA for editors of gold journals, but it is certainly the preferred route for the vast majority of authors, who either have no suitable gold journal to publish in, or lack the funds (or the desire) to pay the journal to do what they can do for free for themselves. The only case in which paid gold OA may bring even more citations than free green OA (even though both increase citations) is in the very highest quality journals, such as PNAS, today -- but that high-end reasoning certainly does not generalise to most journals, by definition. (And it will vanish completely when OA self-archiving is mandated, and the harvested IR contents become the locus classicus to access the literature for those whose institutions are not subscribed to the journal in which a particular article appeared -- whether or not it is a high-end journal.)

(There is also a conflation of the (less interesting) question of (1) whether green or gold generates a greater OA citation advantage [answer, for high-end journals like PNAS, gold does, but in general there is no difference] with the (far more important) question of (2) whether green or gold can generate more OA [answer: green can generate far more OA, far more quickly and easily, not just because it does not cost the author/institution anything, but because it can be mandated without needing either to find the extra funds to pay for it or to constrain the author's choice of which journal to publish in].

However, despite the intuitive attractiveness of the hypothesis that OA will lead to increased citations because of easier availability, the one systematic study of the reasons for the increased citations - by Kurtz (Source 4.12) - showed that in the field of astronomy at least, the primary reason was not that the materials were free, or that they appeared more rapidly, but that authors put their best work into OA format, and this was the reason for increased citation counts.

Astronomy is an interesting but anomalous field: It differs from most other fields in that:

(1) Astronomy consists of a small, closed circle of journals.

(2) Virtually all research-active astronomers (so I am told by the author) have institutional access to all those journals.

(3) For a number of years now, that full institutional access has been online access.

(4) So astronomy is effectively a 100% OA field.

(5) Hence the only room left for a directly measurable OA advantage in astronomy is (5a) to self-archive the paper earlier (at the preprint stage) [EA] or (5b) to self-archive it in Arxiv (which has evolved into a common central port of call, so it generates more downloads and citations -- mostly at the preprint stage, in astronomy).

(6) What Kurtz found, was that under these conditions, higher quality (higher citation-count) papers were more likely to be self-archived.

(7) This might be a quality self-selection effect (QB) (or it might not), but it is clearly occurring under very special conditions, in a 100% OA field.

(8) Kurtz did make another, surprising finding, which has bearing on the question of how much of a citation advantage remains once a field has reached 100% OA.

(9) By counting citations for comparable articles before and after the transition to 100% OA, Kurtz found that the citations per article had actually gone down (slightly) rather than up, with 100% OA.

(10) But a little reflection suggests a likely explanation: This slight drop is probably a shift in balance with a level playing field:

(11) With 100% OA (i.e., equal access to everything), authors don't cite more articles, they cite more selectively, able now to focus on the best, most relevant work, and not just on the work their institutions can afford to access.

(12) Higher quality articles get more citations, but lower quality articles of which there are far more (some perhaps previously cited by default, because of accessibility constraints) are cited less.

(13) On balance, total citations are slightly down, on this level playing field, in this special, small, closed-circle field (astronomy), once it reaches 100% OA.

(14) It remains to be seen whether total and average citations go up or down when other fields reach 100% OA.

(15) What Kurtz does report even in astronomy is that although total citations are slightly down, downloads are doubled.

(16) Downloads are correlated with later citations, but perhaps at 100% OA this is either no longer true, or true only for higher quality articles.

Similarly, more carefully conceived work on the impact of both OA journals and self-archiving on the quality of research communications, especially on the peer review system, will be required.

OA journals are peer-reviewed journals: What sort of impact are they feared to have on peer review?

And why on earth would the self-archiving of peer-reviewed, published postprints have any impact on the peer review system? The peers review for free. (Could this be just a veiled repetition of the question about the impact of self-archiving on journal revenues, yet again?)

Recently, the results of a study undertaken by Ware for ALPSP, which were published in March 2006 (Source 1.16, in Area 1), have provided at least some initial data on the question of the possible linkage between the availability of self-archived articles in an OA repository and journal subscription cancellations by libraries...: availability of articles in repositories was cited as either a "very important" or an "important" possible factor in journal cancellation by 54 per cent of respondents, even though ranking fourth after (i) decline of faculty need, (ii) reduced usage, and (iii) price. When respondents were invited to think forward five years, availability in a repository was still fourth-ranking factor, but the relevant percentage had risen to 81. Whilst this is not evidence of actual or even intended cancellation as a consequence of the growth of OA self-archiving repositories, it strongly suggests that such repositories are an important new factor in the decision process, and growing in significance.

Summary: No evidence of cancellations, but speculations by librarians to the effect that their currently fourth-ranking factor in cancellations might possibly become more important in the next five years...

Sounds like sound grounds for fighting self-archiving mandates and trying to deny research the benefit of maximized impact for yet another five years -- if one's primary concern is the possible impact of mandated self-archiving on publishers' revenue streams. But if one's primary concern is with the probable impact of mandated self-archiving on research impact, this sort of far-fetched reasoning has surely earned the right to be ignored by the research community as the self-serving interference in research policy that it surely is.

Stevan Harnad
American Scientist Open Access Forum

Posted by Stevan Harnad in Self-Archiving Mandates at 00:30 | Comments (2) | Trackbacks (0)

Friday, October 6. 2006

Responses to EC Self-Archiving Mandate Recommendation

The synthesis of the responses to the European Commission's (EC's) research-access related recommendations is alas so far still rather wishy-washy. One hopes that the EC will not lose sight of the fact that researchers (and their institutions and funders) are both the providers and the users of research (in generating further research, as well as research applications, for the benefit of the tax-paying public that funds the research). Research is not done, or funded, in order to support the publishing industry. EC Recommendation A1 was for an Open Access Self-Archiving Mandate. That is a matter for the European Research Community to decide upon. It would be a great strategic mistake to let the publishing industry decide what the research community does in order to maximize the European tax-paying public's return on the euros it invests in supporting research. They are not investing in the publishing industry, and far, far more is at stake than the publishing industry's concerns about possible risks to its revenue streams.

Stevan Harnad
American Scientist Open Access Forum

Posted by Stevan Harnad in Self-Archiving Mandates at 22:52 | Comments (0) | Trackbacks (0)

Preprints, Postprints, Peer Review, and Institutional vs. Central Self-Archiving

SUMMARY: Arxiv is a Central Repository (CR) in which physicists have been self-archiving their unrefereed preprints and their peer-reviewed postprints since 1991. There is now a growing movement toward distributed Institutional Repositories (IRs). Thanks to the OAI Protocol, all OAI-compliant IRs and CRs are now interoperable: their metadata can be harvested into search engines that treat all of their contents as if they were in one big virtual CR. What authors self-archive is their peer-reviewed publications, not just their unrefereed preprints. An archive is merely a repository, not a certifier of having met a peer-reviewed journal's quality standards.
Since the research institutions themselves are the primary research providers, with the direct interest in maximising the uptake and usage of their own research output, the natural place for them to deposit their own output is in their own IRs. Any central collections can be harvested via OAI. Institutions are also best placed to monitor and reward compliance with self-archiving mandates, both their own institutional mandates and those of the funders of their institutional research output. Arxiv has played an important role in getting us where we are, but it is likely that the era of CRs is coming to a close, and the era of distributed, interoperable IRs is now coming into its own in an entirely natural way, in keeping with the distributed nature of the Net/Web itself.

Comments on:

Ginsparg, Paul (2006) As We May Read. The Journal of Neuroscience, September 20, 2006, 26(38): 9606-9608 doi:10.1523/JNEUROSCI.3161-06.2006

"[A]rticles are deposited [in Arxiv] by researchers when they choose (either before, simultaneous with, or after peer review), and the articles are immediately available to researchers throughout the world."

Arxiv is a Central Repository (CR) in which physicists (mostly, and many mathematicians, and some computer scientists) have been self-archiving their unrefereed preprints and their peer-reviewed postprints since 1991. It is important to keep in mind that researchers self-archive preprints as well as postprints, because it makes a big difference whether one extrapolates from Arxiv as a preprint CR or a postprint CR, as we shall see below.

It is also pertinent to bear in mind that Arxiv is indeed a Central Repository (CR), because there is now a growing movement toward distributed Institutional Repositories (IRs). The IR movement was facilitated by the Open Archives Initiative (OAI) Protocol for Metadata Harvesting, which renders all IRs and CRs interoperable: the OAI Protocol was in turn created partly as a result of an initiative from Arxiv.

As a consequence of the OAI Protocol, all OAI-compliant IRs and CRs are interoperable: their metadata can be harvested into search engines that treat all of their contents as if they were in one big virtual CR.

"As a pure dissemination system, [Arxiv] operates at a factor of 100-1000 times lower [1.0% - 0.1%] in cost than a conventionally peer-reviewed system (Ginsparg, 2001)."

This is true, but it is tantamount to saying that as a pure dissemination system, photocopying the articles published in journals operates at a fraction of the cost of publishing a journal: A fraction, but a parasitic fraction, for without the journal, there would be nothing to either photocopy or distribute in Arxiv.

Nothing but the unrefereed preprint, that is. And this brings us face to face with the fundamental question: What are the true costs of peer review, and peer review alone? The peers (scarce, overused resource though they are) review for free, so it is not their services whose costs we are talking about, but the cost of implementing the peer review: processing the submissions, picking the referees, processing their reports, deciding what revisions need to be done to meet the journal's quality standards for acceptance, and deciding -- perhaps again by consulting the referees -- whether those revisions have been successfully done. The selection of referees and the decision as to what needs to be done is usually made by a qualified, answerable super-peer: the editor (or a board of editors). The editor(s) services, and the clerical services for processing submissions, communicating with referees, and processing referee reports are the costs involved -- and these include not just accepted papers, but rejected ones too (with some journals' rejection rates being over 90%).

In other words, peer-reviewed journal publishing is not a "pure dissemination system." Implementing the peer review costs some money too. There are estimates of what it costs (about $500 per paper was the average estimate a few years ago, which is between one-third and one-sixth of the charge per article that today's "Open Choice" journals are currently proposing -- although a few journals with high rejection rates have suggested a figure of $10,000 per article, without making it clear whether this represents their costs per article or their income per article).

The annual cost per paper in Arxiv, to Arxiv, has been estimated at about $10 (a few years ago), so this is indeed somewhere between 2% of the low-end estimate and 0.1% of the high-end estimate. If we include the cost of keying in the deposit to the depositor, it's a few pennies more.

But what do these figures mean? Why compare the cost of online dissemination alone with the cost of peer review (or any of the other values a journal adds, such as the print edition, copy-editing, reference-checking, and mark-up)?

"with many of the production tasks automatable or off-loadable to the authors, the editorial costs will then dominate the costs of an unreviewed distribution system by many orders of magnitude."

Translation: Online dissemination of unrefereed preprints alone costs a lot less than peer-reviewed publication. True, but what follows from that? Peer-reviewed publication costs a lot more than photo-copying too, but what authors photocopy and distribute is their peer-reviewed publications, not just their unrefereed preprints.

"Although the most recently submitted articles have not yet necessarily undergone formal review, the vast majority of the articles can, would, or do eventually satisfy editorial requirements somewhere.... [Arxiv's moderated] submissions are at least 'of refereeable quality'."

Every paper is first an unrefereed preprint -- and then, eventually, most are revised into peer-reviewed, accepted articles (postprints). Hence if preprints are deposited in Arxiv at all, it stands to reason that Arxiv's most recently deposited (sic) papers (sic) have not yet undergone peer review. Tune in a year later, and they will have been, with the revised postprint now also deposited.

Preprints and postprints are deposited rather than "submitted" to IRs or CRs, because an archive is merely a repository, not a certifier of having met a peer-reviewed journal's quality standards: let's reserve "submission" for the attempt to meet a journal's peer-review quality standards. Moreover, unrefereed preprints are merely papers, not articles; they become articles when they have been accepted for publication by a peer-reviewed journal. This is not pedantry or formalism. It is merely the sorting out of what has and has not met known quality control standards. The tag certifying this is currently the journal name, with its established quality level and track-record. A peer-reviewed journal (apart from its function as an access-provider) is a peer-review service-provider/certifier, publicly answerable for its quality standards with its own prestige and reputation. And authors are in turn answerable to the editor and referees, for meeting their standards for acceptance; revision is not optional but obligatory, a condition on acceptance for publication. Hence earning the tag certifying acceptance is a dynamic, interactive process, and not merely a pass/fail system.

Publication is even less like a pass/fail system in that in most fields there is a hierarchy of journals, with a range of peer-review standards, from the one or few most rigorous ones at the top (usually the ones with the highest rejection rates), all the way down to what is sometimes almost a vanity press at the bottom (little better than an unrefereed preprint). These differences in quality standards are known and relied upon in the field. And papers are not really published or unpublished: Most are published, eventually, but at their own quality level. The journals are all autonomous, independent of the authors and the authors' institutions, each dependent on its own established standards for quality and selectivity. Users are in turn dependent on each journal's public track record in deciding what to trust.

It is not at all clear what an IR's or CR's certification of which of its deposits is "of refereeable quality" might mean to busy researchers who need to know whether a paper is worth risking their limited time to read and try to use, apply and build upon. Users currently do this by seeing whether and where it has been published (with the journal name and track record serving as their indicator of the article's probable level of quality, reliability and validity). Unrefereed preprints have always been something handled with care, having only the author's name, institution and prior track-record as a guide to their reliability. Is Arxiv's tag of being "of refereeable quality" meant to serve as a further guide? or as a substitute for something?

"[P]roposed modifications of the peer review include a two-tier system (for more details, see Ginsparg, 2002), in which, on a first pass, only some cursory examination or other pro forma certification is given for acceptance into a standard tier. At some later point, a much smaller set of articles would be selected for more extensive evaluation."

This is a speculative hypothesis. It is no doubt being tested to see whether it works, whether it delivers results of quality and useability comparable to standard peer review, whether it is cost-effective, and whether it can replace journals. But as it stands, the hypothesis alone does not tell us whether and how well it will work; Arxiv is certainly not evidence for the validity of this hypothesis, since virtually all papers in Arxiv still undergo standard peer review. Arxiv is merely a CR that provides Open Access (OA) to both the preprints and the postprints.

"using standard search engines, more than one-third of the high-impact journal articles in a sample of biological/medical journals published in 2003 were found at nonjournal Web sites (Wren, 2005)."

This is very interesting. This is the higher end of a self-archiving rate that we have found to range between about 5% and 25% across disciplines. Physics is of course even higher (mostly because of Arxiv) and computer science higher still (see Citeseer, a google-style harvester of distributed locally deposited papers).

"at least 75% of the publications listed [in neuroscience] were freely available either via direct links from the above Web page or via a straightforward Web search for the article title."

This is even more interesting. It means that in such fields the majority of the articles -- note that we are almost certainly not talking about unrefereed preprints here but about peer-reviewed postprints -- are being self-archived already, so the only thing that remains to be done is to deposit (or harvest) them into the author's own OAI-compliant IR rather than a random website, to maximise visibility, harvestability, and impact.

"The enormously powerful sorts of data mining and number crunching that are already taken for granted as applied to the open-access genomics databases can be applied to the full text"

Indeed. And semantic and scientometric analyses too (though article texts are not quite the same thing as the research data on which the articles are based, hence the analogy with the genomics data base may be a bit misleading).

"it is likely that more research communities will join some form of global unified archive system without the current partitioning and access restrictions familiar from the paper medium"

What makes it most likely is the self-archiving mandates proposed or already adopted the world over (e.g., RCUK, Wellcome Trust, FRPAA, EC, plus individual institutional self-archiving mandates: CERN, Southampton, QUT, Minho).

But the deposits will not be done in one global CR, nor in a CR like Arxiv for each discipline or combination of disciplines. With the advent of the OAI protocol, all IRs and CRs are interoperable, and since the research institutions themselves are the primary research providers, with the direct interest inshowcasing their own research output as well as maximizing its uptake, usageand impact, the natural place for them to deposit their own output is in their own IRs. Any central collections can be gathered via OAI harvesting. Institutions are also best placed to monitor and reward compliance with self-archiving mandates, both their own institutional mandates and those of the funders of their institutional research output.

Arxiv has played an important role in getting us where we are, but it is likely that the era of CRs is coming to a close, and the era of distributed, interoperable IRs is now coming into its own in an entirely natural way, in keeping with the distributed nature of the Net/Web itself.

Stevan Harnad
American Scientist Open Access Forum

Posted by Stevan Harnad in Institutional Repositories at 05:36 | Comments (0) | Trackbacks (0)

Wednesday, October 4. 2006

France's HAL, OAI interoperability, and Central vs Institutional Repositories

SUMMARY: France's HAL is a large national repository for research output with enriched metadata. Like all other countries, France clearly needs self-archiving mandates from its research institutions and funders. The question is whether the mandates should be to deposit directly in HAL, or in each researcher's own Institutional Repository (from which HAL could then harvest the contents). Franck Laloë of CCSD replies to some questions about HAL.

On Mon, 2 Oct 2006, Franck Laloë [FL] wrote:

FL: "Hal (and I presume Prodinra [INRA]) are of course OAI-PMH compatible, and of course can be harvested within this protocol and others. This compatibility is a necessary condition for an archive to be useful to the scientific community. But a necessary condition is not always sufficient. We need more interoperability than just that possible within OAI-PMH; Hal meets this requirement. -- I know that Stevan and others will disagree with the last sentence above..."

But I don't disagree at all! The more interoperability the better! What I am still very keen to know is the following:

(1) How is it proposed to get all of France's research output into HAL?

(2) Why do the deposits need to be directly in HAL, rather than in each author's own Institutional Repository (IR), then harvested by HAL?

(3) Are there are any self-archiving mandates being proposed or planned in France (along the lines already adopted by RCUK and the Wellcome Trust in the UK, proposed by the FRPAA in the US, and recommended (A1) by the European Commission recommendations)?

(4) Where will such mandates, if any, require researchers to deposit: in HAL or in the their own institution's IR?

The most important question is (3), because what is becoming increasingly clear is that -- be they as interoperable as one could possibly wish -- near-empty repositories are not very useful! The spontaneous (unmandated) self-archiving rate worldwide is about 15% (with a few disciplines, e.g. physics, well above 15%, but most disciplines at or below 15%), whereas the mandated deposit rate climbs toward 100% within a few years of adoption (as predicted by Alma Swan's international surveys, and confirmed by Arthur Sale's analyses comparing mandated and unmandated deposit rates).

So where do things stand in France regarding self-archiving mandates (whether for local institutional deposit or national deposit in HAL)?

FL: "There are several reasons ["why... deposits need to be directly in HAL, rather than in each author's own Institutional Repository (IR), then harvested by HAL"]:

Hal requires a structure of metadata where each author is linked to one (or several) laboratories, which in turn are linked to one (or several institutions)."

But I don't understand: Can these complex affiliation metadata not be provided (rather trivially)?

FL: "This structure of metadata in Hal is richer than OAI-PMH (in its present state)."

I understand: HAL does richer tagging than the OAI-PMH IRs. But that tagging has to be added in any case: So why not add to the metadata harvested from the distributed deposits in each author's own IR?

FL: "If we harvested all OAI-PMH compliant archives, this would introduce redundancy"

So what? And there are ways to either remove or coalesce duplicates. (The problem today -- we cannot remind ourselves often enough -- is too few deposits [15%], not too many, nor too many duplicates!)

FL: "documents with incomplete metadata (most institutional repositories only mention their institution, not the contribution of others, etc..)."

We've agreed that if HAL aspires to have richer metadata than IRs with their OAI-PMH then the extra tags have to be added; this is not an answer to the question of why HAL insists on direct deposits, rather than harvesting from IRs.

And you already mentioned earlier the baroque intricacies of institutional affiliation -- but not why you don't think this trivial problem cannot be handled easily by software, with the help of IR affiliations, lists of co-authors and (why not) central authoritative lists chronicling all French authors' current (and past!) complexities of affiliation, turning them into explicit metadata tags...

FL: "Many OAI repositories do not guarantee sufficient quality, and even access to the full text."

The 1st, 2nd and Nth immediate problem today is lack of content, not low-quality metadata: The texts [85% of them] are not deposited at all. The OA movement, and OA self-archiving mandates, are endeavouring to get that content deposited. Authors' own IRs are the natural place to deposit it, and to mandate depositing it.

Then (as agreed above) HAL can, if it wishes, harvest that content, and improve its metadata. (Again, this is no argument against harvesting, or in favour of direct deposit in HAL.)

As to texts that are deposited in IRs but not made OA: I wish that were the only remaining problem, for I guarantee that if it were, it would solve itself in short order. (The OA metadata would elicit email eprint requests, and authors would soon tire of emailing eprints and would instead set access to their deposited full-texts as OA instead of Closed Access.)

But we do not have all or most or even much of the target literature -- the peer-reviewed research corpus -- deposited in IRs in Closed Access, with only their (low-quality) metadata accessible: At least 85% percent of OA's target content is not deposited at all.

So it seems to me HAL would benefit as much as everyone else from a self-archiving mandate that would get all that content deposited; so the only question is who will mandate it to be deposited, and where?

So far, the two natural candidate mandaters are the researchers' own institutions and funders. Clearly institutions have an interest in mandating that the deposit should be in their own IRs (for institutional visibility, prestige, and record-keeping). Funders (although some unthinkingly insist on central deposits today, e.g., in PubMed Central) are mostly indifferent to where their funded research is deposited, as long as it is OAI-compliant and OA. So many mandate depositing in the researcher's own IR too. And PubMed Central should be asking itself the same questions I am asking you about HAL: Why not deposit in each researcher's own OAI-compliant IR and simply harvest from there?

Institutions have a direct institutional interest in their own IRs; they are the ones that can best monitor and reward compliance with self-archiving mandates; and the spectrum of disciplines at research institutions (mostly universities) effectively cover all of OA's target content space (whereas central disciplinary and multidisciplinary repositories do not).

A national repository like HAL is a very good idea, but unless the problem of the means of mandating and monitoring direct self-archiving in HAL by all French researchers has an immediate solution, at the very least a hybrid deposit system would seem to be optimal: either researchers deposit in their own IRs (subsequently harvested and enhanced by HAL) or directly in HAL; but deposit they must.

FL: "Hal includes certification procedures (which we call "stamps") which do not exist in other open archives."

That's fine, but the non-existence that is the immediate problem is not certificates but deposits! At least 85% of French research output is not being self-archived at all. Institutional and funder self-archiving mandates can remedy this, but are all or most of France's research institutions more likely to agree to mandate and monitor depositing all of their own output in HAL, or in their own IRs?

The "stamps" could come either way, either via harvesting from IRs or via direct deposits.

FL: "In brief, we want Hal to be an homogeneous system, really usable by the reader, and by labs (even if they belong to several institutions) and institutions - all this through a single entry into the system."

If HAL can become a direct entry point for all French research institutions, and they all agree to a means of mandating and monitoring compliance, nolo contendere!

But what is sure is that a central repository and central depositing is not the only way to get an OA corpus usable by all (authors and users), in France and worldwide. On the contrary, the nature of the Internet, the Web, the OAI protocol and any other richer metadata tagging schemes is such that distributed interoperability -- rather than a central locus and central management -- is far more likely to prove to be the successful means of generating and using the OA corpus, in France and worldwide.

FL: "For instance, my lab belongs to 4 institutions, we do not want to put our articles into four open archives; one is enough."

First, if the 4 institutions don't want or need their research output to be deposited in their own IRs, there is no need to do it. Perhaps the lab itself will want to have its own IR. Moreover, harvesting works N ways: Once a paper and its (OAI) metadata are deposited in one IR, other IRs (as well as Central Repositories like HAL or PubMed Central or Arxiv) can harvest it; or the author can import/export it to his multiple IR affiliations. (And let us not forget that even direct deposit takes less than ten minutes worth of keystrokes!)

In other words, with OAI harvestability, yes, one deposit is enough.

FL: "I am just explaining what we do, and the strategy we chose (after much discussion!). I am not claiming that it is the best in the world, or even superior to others; actually, I know that you do not approve it, Stevan. But I personally believe in it, because I feel that it meets the quality that is necessary to build a real tool for research."

Franck, it has nothing to do with approval or disapproval. Whatever system results in 100% of French research output being made OA (soon!) -- whether by mandating direct deposit in HAL, or mandating local IR mandating, or even (mirabile dictu) by having all journals convert to OA publishing -- realises the goal of the OA movement: 100% OA for peer-reviewed research output, now.

But is HAL's policy of central deposit and metadata enhancement sufficient to generate that 100% self-archiving? For if not, then whatever other desiderata it may be providing, it is not providing OA's target content.

FL: "one can easily extract a local institutional repository from Hal, and even import all the data locally, if useful."

I don't doubt it. But you have not yet told me how you propose to get all that content deposited in HAL in the first place, so that institutions can then harvest back their own content from it: On the face of it, it would seem that the institutions should be depositing their own content in their own IRs directly, and HAL should be harvesting it, not vice versa. But if you do have a plan for a national mandate to deposit directly in HAL, I would say all this discussion is moot. Without such a plan, however, this discussion is beside the point (at least insofar as OA is concerned).

FL: "one can also transfer documents to Hal from local systems using the so called "webservice" techniques. In other words you can load documents into Hal from your local system for electronic documents, without knowing anything about Hal, provided that your metadata are Hal compatible. This is what several institutions are now doing in France."

The French institutions that have already succeeded in getting their research output into their own IRs -- whether merely OAI-compliant IRs or HAL-compliant IRs -- have already succeeded in solving the problem we (or at least I!) am discussing here, for whatever contents they have succeeded in getting deposited. My guess is that if these deposits are unmandated, than they represent about 15% of those institutions' annual research output, and we are back where we started.

The issue, au fond, is not where papers are deposited, but whether they are deposited. The only reason I keep harping on institutional IR depositing rather than central depositing is that institutions are the primary content providers, in all research disciplines, and hence their own IRs are the natural place to require their own researchers to self-archiving their own research output. Moreover, institutions cover all research disciplines, hence all of OA's target content space.

It is virtually certain that the only way to attain 100% OA self-archiving is via self-archiving mandates from researchers' institutions and funders. Hence the only real question about IR deposit vs. HAL deposit in France is whether the probability of a successful pandisciplinary, paninstitutional national mandate to deposit in HAL is greater in France than the probability of institutional and funder mandates to self-archive institutionally. What is best for France is whichever of these is in fact more likely.

FL: "Let me finally add that Hal has been conceived to combine the advantage of disciplinary open archives (what scientists want)."

I think that what you wanted to say, Franck, was that (many) scientists want to be able to search and access all and only the relevant research in their own disciplines. That they want it all to be in one discipline-based "archive," and that that archive must have been deposited in directly rather than harvested -- and even that the realisation of these wishes requires the full richness of HAL's proposed metadata -- is rather a theoretical assumption on the part of some, rather than an objective statement of "what scientists want"...

FL: "and institutional archives (which are indispensable if we want institutions to push scientists to deposit their [research output]). "

This, I think, is closer to assumption-free objectivity: Institutions do want their own output in their own IRs and not just in some external discipline-based collective database. But here I would agree with you: Harvesting could work in either direction, to give everyone what they want.

But harvesting will not get undeposited content deposited; only mandates will. So the question is whether institutions (and funders) are more likely to be pushed to push their researchers to deposit their research output in (1) international disciplinary archives like Arxiv or PubMed Central, (2) national omnibus archives like HAL, or (3) their own institutional IRs?

FL: "You can create portals of Hal that are institutional, with the logo, words, etc.. of the institution, for both upload and download."

I agree completely that harvesting can go either way, so if, mirabile dictu, HAL succeeded in getting all or most of French research output in all disciplines directly deposited in HAL, then it would be trivial to generate virtual IRs for each institution via back-harvesting.

But how do you propose to get the content deposited in HAL in the first place? You seem to be focussed on centrality and metadata enrichment, but we need to hear about how you plan to get the content (and how much of it): The target is 100% of French research output. The baseline today is 15% spontaneous self-archiving: How do you plan to get from 15% to 100%, and when?

FL: "But at the same time all the documents go to the same data base. This is technically possible, but requires the solid structure of metadata that I described above."

It requires something much harder to get the solid metadata structure: it requires 100% of the target content!

FL: "I hope that I have explained the situation clearly."

As Fermat (or the hopeful builder of the perpetuum mobile) would have conceded: there are still a few little details missing. In this case, the detail concerns how you plan to get HAL filled. For without that, we are talking about raising the quality standards and price for a product that does not yet have any customers (apart from the 15% spontaneous baseline)...

Stevan Harnad
American Scientist Open Access Forum

Posted by Stevan Harnad in Institutional Repositories at 23:28 | Comments (0) | Trackbacks (0)

Tuesday, October 3. 2006

The Wellcome Trust Open Access Self-Archiving Mandate at Age One

One year old this month, the Wellcome Trust's is still not the optimal Open Access (OA) self-archiving mandate because:

(1) it should instead require the depositing to be done in the author's own Institutional Repository (IR) (thereafter harvestable to PubMed Central therefrom) rather than requiring direct central deposit; and

(2) it should require the deposit to be done immediately upon acceptance for publication, permitting the 6-month delay only in the setting of Access to Open Access (versus Closed Access), rather than permitting the depositing itself to be delayed.

But it's a damn good mandate just the same, and an inspiration and encouragement to research funders and research institutions the world over (as long as it's upgraded to include (1) and (2))!

Stevan Harnad
American Scientist Open Access Forum

Posted by Stevan Harnad in Self-Archiving Mandates at 11:50 | Comments (0) | Trackbacks (0)

« previous page (Page 2 of 2, totaling 17 entries)

Entries from October 2006

Thursday, October 12. 2006

CIHR Proposes 99.99% Optimal OA Self-Archiving Mandate

Tuesday, October 10. 2006

Hypotheses Non Fingo

Monday, October 9. 2006

Critique of EPS/RIN/RCUK/DTI "Evidence-Based Analysis of Data Concerning Scholarly Journal Publishing"

Friday, October 6. 2006

Responses to EC Self-Archiving Mandate Recommendation

Preprints, Postprints, Peer Review, and Institutional vs. Central Self-Archiving

Wednesday, October 4. 2006

France's HAL, OAI interoperability, and Central vs Institutional Repositories

Tuesday, October 3. 2006

The Wellcome Trust Open Access Self-Archiving Mandate at Age One

EnablingOpenScholarship (EOS)

Federal Research Public Access Act (FRPAA)

Alliance for Taxpayer Access (ATA)

Creative Commons License:

Quicksearch

Syndicate This Blog

Materials You Are Invited To Use To Promote OA Self-Archiving:

Archives

Calendar

Categories

Blog Administration

Statistics

Top Referrers

Syndicate This Blog