Institutional Repositories

Friday, October 6. 2006

Preprints, Postprints, Peer Review, and Institutional vs. Central Self-Archiving

SUMMARY: Arxiv is a Central Repository (CR) in which physicists have been self-archiving their unrefereed preprints and their peer-reviewed postprints since 1991. There is now a growing movement toward distributed Institutional Repositories (IRs). Thanks to the OAI Protocol, all OAI-compliant IRs and CRs are now interoperable: their metadata can be harvested into search engines that treat all of their contents as if they were in one big virtual CR. What authors self-archive is their peer-reviewed publications, not just their unrefereed preprints. An archive is merely a repository, not a certifier of having met a peer-reviewed journal's quality standards.
Since the research institutions themselves are the primary research providers, with the direct interest in maximising the uptake and usage of their own research output, the natural place for them to deposit their own output is in their own IRs. Any central collections can be harvested via OAI. Institutions are also best placed to monitor and reward compliance with self-archiving mandates, both their own institutional mandates and those of the funders of their institutional research output. Arxiv has played an important role in getting us where we are, but it is likely that the era of CRs is coming to a close, and the era of distributed, interoperable IRs is now coming into its own in an entirely natural way, in keeping with the distributed nature of the Net/Web itself.

Comments on:

Ginsparg, Paul (2006) As We May Read. The Journal of Neuroscience, September 20, 2006, 26(38): 9606-9608 doi:10.1523/JNEUROSCI.3161-06.2006

"[A]rticles are deposited [in Arxiv] by researchers when they choose (either before, simultaneous with, or after peer review), and the articles are immediately available to researchers throughout the world."

Arxiv is a Central Repository (CR) in which physicists (mostly, and many mathematicians, and some computer scientists) have been self-archiving their unrefereed preprints and their peer-reviewed postprints since 1991. It is important to keep in mind that researchers self-archive preprints as well as postprints, because it makes a big difference whether one extrapolates from Arxiv as a preprint CR or a postprint CR, as we shall see below.

It is also pertinent to bear in mind that Arxiv is indeed a Central Repository (CR), because there is now a growing movement toward distributed Institutional Repositories (IRs). The IR movement was facilitated by the Open Archives Initiative (OAI) Protocol for Metadata Harvesting, which renders all IRs and CRs interoperable: the OAI Protocol was in turn created partly as a result of an initiative from Arxiv.

As a consequence of the OAI Protocol, all OAI-compliant IRs and CRs are interoperable: their metadata can be harvested into search engines that treat all of their contents as if they were in one big virtual CR.
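To make the "one big virtual CR" idea concrete, here is a minimal sketch of what a harvester does with OAI-PMH responses. It assumes the unqualified Dublin Core format (`oai_dc`) that the protocol requires every repository to support. The repository names, the sample record, and the function names are illustrative assumptions, not anything from Arxiv or from a particular harvester, and a real harvester would also fetch the responses over HTTP and follow resumption tokens for long result lists.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespaces used by OAI-PMH 2.0 responses and by unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build a ListRecords request for one repository's OAI-PMH endpoint."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def parse_records(xml_text, source):
    """Extract identifier, title and creators from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "source": source,
            "identifier": rec.findtext(
                "oai:header/oai:identifier", default="", namespaces=NS),
            "title": rec.findtext(".//dc:title", default="", namespaces=NS),
            "creators": [c.text for c in rec.iterfind(".//dc:creator", NS)],
        })
    return records


def virtual_central_index(responses):
    """Merge parsed records from many repositories into one searchable list."""
    index = []
    for source, xml_text in responses.items():
        index.extend(parse_records(xml_text, source))
    return index


# Hypothetical response, standing in for what an IR or CR would return.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:ir.example.org:42</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A sample postprint</dc:title>
          <dc:creator>Author, A.</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

index = virtual_central_index({"ir.example.org": SAMPLE, "cr.example.org": SAMPLE})
```

The point of the sketch is that the harvester neither knows nor cares whether a given source is an institutional or a central repository: both answer the same request and land in the same index.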

"As a pure dissemination system, [Arxiv] operates at a factor of 100-1000 times lower [1.0% - 0.1%] in cost than a conventionally peer-reviewed system (Ginsparg, 2001)."

This is true, but it is tantamount to saying that as a pure dissemination system, photocopying the articles published in journals operates at a fraction of the cost of publishing a journal: A fraction, but a parasitic fraction, for without the journal, there would be nothing to either photocopy or distribute in Arxiv.

Nothing but the unrefereed preprint, that is. And this brings us face to face with the fundamental question: What are the true costs of peer review, and peer review alone? The peers (scarce, overused resource though they are) review for free, so it is not their services whose costs we are talking about, but the cost of implementing the peer review: processing the submissions, picking the referees, processing their reports, deciding what revisions need to be done to meet the journal's quality standards for acceptance, and deciding -- perhaps again by consulting the referees -- whether those revisions have been successfully done. The selection of referees and the decision as to what needs to be done is usually made by a qualified, answerable super-peer: the editor (or a board of editors). The editors' services, and the clerical services for processing submissions, communicating with referees, and processing referee reports are the costs involved -- and these include not just accepted papers, but rejected ones too (with some journals' rejection rates being over 90%).

In other words, peer-reviewed journal publishing is not a "pure dissemination system." Implementing the peer review costs some money too. There are estimates of what it costs (about $500 per paper was the average estimate a few years ago, which is between one-third and one-sixth of the charge per article that today's "Open Choice" journals are currently proposing -- although a few journals with high rejection rates have suggested a figure of $10,000 per article, without making it clear whether this represents their costs per article or their income per article).

The annual cost per paper in Arxiv, to Arxiv, has been estimated at about $10 (a few years ago), so this is indeed somewhere between 2% of the low-end estimate and 0.1% of the high-end estimate. If we include the cost of keying in the deposit to the depositor, it's a few pennies more.
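The arithmetic behind those two percentages can be checked in a few lines. The dollar figures are the estimates quoted in the text; the variable and function names are only illustrative.

```python
def share(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


ARXIV_COST_PER_PAPER = 10    # dollars per paper per year (estimate quoted above)
PEER_REVIEW_LOW = 500        # dollars per paper (average estimate quoted above)
PEER_REVIEW_HIGH = 10_000    # dollars per article (high-end figure quoted above)

low_end_share = share(ARXIV_COST_PER_PAPER, PEER_REVIEW_LOW)
high_end_share = share(ARXIV_COST_PER_PAPER, PEER_REVIEW_HIGH)

print(f"{low_end_share:.1f}% of the low-end estimate")
print(f"{high_end_share:.1f}% of the high-end estimate")
```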

But what do these figures mean? Why compare the cost of online dissemination alone with the cost of peer review (or any of the other values a journal adds, such as the print edition, copy-editing, reference-checking, and mark-up)?

"with many of the production tasks automatable or off-loadable to the authors, the editorial costs will then dominate the costs of an unreviewed distribution system by many orders of magnitude."

Translation: Online dissemination of unrefereed preprints alone costs a lot less than peer-reviewed publication. True, but what follows from that? Peer-reviewed publication costs a lot more than photo-copying too, but what authors photocopy and distribute is their peer-reviewed publications, not just their unrefereed preprints.

"Although the most recently submitted articles have not yet necessarily undergone formal review, the vast majority of the articles can, would, or do eventually satisfy editorial requirements somewhere.... [Arxiv's moderated] submissions are at least 'of refereeable quality'."

Every paper is first an unrefereed preprint -- and then, eventually, most are revised into peer-reviewed, accepted articles (postprints). Hence if preprints are deposited in Arxiv at all, it stands to reason that Arxiv's most recently deposited (sic) papers (sic) have not yet undergone peer review. Tune in a year later, and they will have been, with the revised postprint now also deposited.

Preprints and postprints are deposited rather than "submitted" to IRs or CRs, because an archive is merely a repository, not a certifier of having met a peer-reviewed journal's quality standards: let's reserve "submission" for the attempt to meet a journal's peer-review quality standards. Moreover, unrefereed preprints are merely papers, not articles; they become articles when they have been accepted for publication by a peer-reviewed journal. This is not pedantry or formalism. It is merely the sorting out of what has and has not met known quality control standards. The tag certifying this is currently the journal name, with its established quality level and track-record. A peer-reviewed journal (apart from its function as an access-provider) is a peer-review service-provider/certifier, publicly answerable for its quality standards with its own prestige and reputation. And authors are in turn answerable to the editor and referees, for meeting their standards for acceptance; revision is not optional but obligatory, a condition on acceptance for publication. Hence earning the tag certifying acceptance is a dynamic, interactive process, and not merely a pass/fail system.

Publication is even less like a pass/fail system in that in most fields there is a hierarchy of journals, with a range of peer-review standards, from the one or few most rigorous ones at the top (usually the ones with the highest rejection rates), all the way down to what is sometimes almost a vanity press at the bottom (little better than an unrefereed preprint). These differences in quality standards are known and relied upon in the field. And papers are not really published or unpublished: Most are published, eventually, but at their own quality level. The journals are all autonomous, independent of the authors and the authors' institutions, each dependent on its own established standards for quality and selectivity. Users are in turn dependent on each journal's public track record in deciding what to trust.

It is not at all clear what an IR's or CR's certification of which of its deposits is "of refereeable quality" might mean to busy researchers who need to know whether a paper is worth risking their limited time to read and try to use, apply and build upon. Users currently do this by seeing whether and where it has been published (with the journal name and track record serving as their indicator of the article's probable level of quality, reliability and validity). Unrefereed preprints have always been something handled with care, having only the author's name, institution and prior track-record as a guide to their reliability. Is Arxiv's tag of being "of refereeable quality" meant to serve as a further guide? or as a substitute for something?

"[P]roposed modifications of the peer review include a two-tier system (for more details, see Ginsparg, 2002), in which, on a first pass, only some cursory examination or other pro forma certification is given for acceptance into a standard tier. At some later point, a much smaller set of articles would be selected for more extensive evaluation."

This is a speculative hypothesis. It is no doubt being tested to see whether it works, whether it delivers results of quality and useability comparable to standard peer review, whether it is cost-effective, and whether it can replace journals. But as it stands, the hypothesis alone does not tell us whether and how well it will work; Arxiv is certainly not evidence for the validity of this hypothesis, since virtually all papers in Arxiv still undergo standard peer review. Arxiv is merely a CR that provides Open Access (OA) to both the preprints and the postprints.

"using standard search engines, more than one-third of the high-impact journal articles in a sample of biological/medical journals published in 2003 were found at nonjournal Web sites (Wren, 2005)."

This is very interesting. This is the higher end of a self-archiving rate that we have found to range between about 5% and 25% across disciplines. Physics is of course even higher (mostly because of Arxiv) and computer science higher still (see Citeseer, a Google-style harvester of distributed locally deposited papers).

"at least 75% of the publications listed [in neuroscience] were freely available either via direct links from the above Web page or via a straightforward Web search for the article title."

This is even more interesting. It means that in such fields the majority of the articles -- note that we are almost certainly not talking about unrefereed preprints here but about peer-reviewed postprints -- are being self-archived already, so the only thing that remains to be done is to deposit (or harvest) them into the author's own OAI-compliant IR rather than a random website, to maximise visibility, harvestability, and impact.

"The enormously powerful sorts of data mining and number crunching that are already taken for granted as applied to the open-access genomics databases can be applied to the full text"

Indeed. And semantic and scientometric analyses too (though article texts are not quite the same thing as the research data on which the articles are based, hence the analogy with the genomics data base may be a bit misleading).

"it is likely that more research communities will join some form of global unified archive system without the current partitioning and access restrictions familiar from the paper medium"

What makes it most likely is the self-archiving mandates proposed or already adopted the world over (e.g., RCUK, Wellcome Trust, FRPAA, EC, plus individual institutional self-archiving mandates: CERN, Southampton, QUT, Minho).

But the deposits will not be done in one global CR, nor in a CR like Arxiv for each discipline or combination of disciplines. With the advent of the OAI protocol, all IRs and CRs are interoperable, and since the research institutions themselves are the primary research providers, with the direct interest in showcasing their own research output as well as maximizing its uptake, usage and impact, the natural place for them to deposit their own output is in their own IRs. Any central collections can be gathered via OAI harvesting. Institutions are also best placed to monitor and reward compliance with self-archiving mandates, both their own institutional mandates and those of the funders of their institutional research output.

Arxiv has played an important role in getting us where we are, but it is likely that the era of CRs is coming to a close, and the era of distributed, interoperable IRs is now coming into its own in an entirely natural way, in keeping with the distributed nature of the Net/Web itself.

Stevan Harnad
American Scientist Open Access Forum

Posted by Stevan Harnad in Institutional Repositories at 05:36

Wednesday, October 4. 2006

France's HAL, OAI interoperability, and Central vs Institutional Repositories

SUMMARY: France's HAL is a large national repository for research output with enriched metadata. Like all other countries, France clearly needs self-archiving mandates from its research institutions and funders. The question is whether the mandates should be to deposit directly in HAL, or in each researcher's own Institutional Repository (from which HAL could then harvest the contents). Franck Laloë of CCSD replies to some questions about HAL.

On Mon, 2 Oct 2006, Franck Laloë [FL] wrote:

FL: "Hal (and I presume Prodinra [INRA]) are of course OAI-PMH compatible, and of course can be harvested within this protocol and others. This compatibility is a necessary condition for an archive to be useful to the scientific community. But a necessary condition is not always sufficient. We need more interoperability than just that possible within OAI-PMH; Hal meets this requirement. -- I know that Stevan and others will disagree with the last sentence above..."

But I don't disagree at all! The more interoperability the better! What I am still very keen to know is the following:

(1) How is it proposed to get all of France's research output into HAL?

(2) Why do the deposits need to be directly in HAL, rather than in each author's own Institutional Repository (IR), then harvested by HAL?

(3) Are there any self-archiving mandates being proposed or planned in France (along the lines already adopted by RCUK and the Wellcome Trust in the UK, proposed by the FRPAA in the US, and recommended (A1) by the European Commission)?

(4) Where will such mandates, if any, require researchers to deposit: in HAL or in their own institution's IR?

The most important question is (3), because what is becoming increasingly clear is that -- be they as interoperable as one could possibly wish -- near-empty repositories are not very useful! The spontaneous (unmandated) self-archiving rate worldwide is about 15% (with a few disciplines, e.g. physics, well above 15%, but most disciplines at or below 15%), whereas the mandated deposit rate climbs toward 100% within a few years of adoption (as predicted by Alma Swan's international surveys, and confirmed by Arthur Sale's analyses comparing mandated and unmandated deposit rates).

So where do things stand in France regarding self-archiving mandates (whether for local institutional deposit or national deposit in HAL)?

FL: "There are several reasons ["why... deposits need to be directly in HAL, rather than in each author's own Institutional Repository (IR), then harvested by HAL"]:

Hal requires a structure of metadata where each author is linked to one (or several) laboratories, which in turn are linked to one (or several institutions)."

But I don't understand: Can these complex affiliation metadata not be provided (rather trivially)?

FL: "This structure of metadata in Hal is richer than OAI-PMH (in its present state)."

I understand: HAL does richer tagging than the OAI-PMH IRs. But that tagging has to be added in any case: So why not add to the metadata harvested from the distributed deposits in each author's own IR?

FL: "If we harvested all OAI-PMH compliant archives, this would introduce redundancy"

So what? And there are ways to either remove or coalesce duplicates. (The problem today -- we cannot remind ourselves often enough -- is too few deposits [15%], not too many, nor too many duplicates!)
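As a rough illustration of how duplicates can be coalesced, here is a minimal sketch that matches harvested records on DOI where one is present and otherwise on a normalised title. The record fields and the source names are assumptions made for the example, and a production service would need fuzzier matching (for instance, to pair a record that carries a DOI with a copy that does not).

```python
import re


def normalise_title(title):
    """Lower-case a title and collapse punctuation and whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def record_key(record):
    """Prefer a DOI as the match key; fall back to the normalised title."""
    doi = (record.get("doi") or "").strip().lower()
    if doi:
        return ("doi", doi)
    return ("title", normalise_title(record["title"]))


def coalesce(records):
    """Merge records that share a key, keeping every place a copy is held."""
    merged = {}
    for rec in records:
        key = record_key(rec)
        if key not in merged:
            merged[key] = {
                "title": rec["title"],
                "doi": rec.get("doi"),
                "locations": [],
            }
        if rec["source"] not in merged[key]["locations"]:
            merged[key]["locations"].append(rec["source"])
    return list(merged.values())


# Hypothetical harvest: the same two papers seen in several repositories.
harvested = [
    {"title": "As We May Read", "doi": "10.1523/JNEUROSCI.3161-06.2006", "source": "ir-a"},
    {"title": "As we may read.", "doi": "10.1523/jneurosci.3161-06.2006", "source": "hal"},
    {"title": "Another Paper", "source": "ir-b"},
    {"title": "another  paper", "source": "ir-a"},
]
unique = coalesce(harvested)
```

Note that the merged record keeps the list of locations, so redundancy becomes extra information about where copies are held rather than clutter.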

FL: "documents with incomplete metadata (most institutional repositories only mention their institution, not the contribution of others, etc..)."

We've agreed that if HAL aspires to have richer metadata than IRs with their OAI-PMH then the extra tags have to be added; this is not an answer to the question of why HAL insists on direct deposits, rather than harvesting from IRs.

And you already mentioned earlier the baroque intricacies of institutional affiliation -- but not why you think this trivial problem cannot be handled easily by software, with the help of IR affiliations, lists of co-authors and (why not) central authoritative lists chronicling all French authors' current (and past!) complexities of affiliation, turning them into explicit metadata tags...
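A minimal sketch of what such software might look like: a harvested record is enriched with author-to-laboratory and laboratory-to-institution links looked up in a central authority list. The lookup tables, names, and record layout are all hypothetical, and this is in no way a description of how HAL structures or assigns its metadata.

```python
# Hypothetical central authority lists (author -> labs, lab -> institutions).
AUTHOR_LABS = {
    "Dupont, M.": ["Lab X"],
    "Martin, C.": ["Lab X", "Lab Y"],
}
LAB_INSTITUTIONS = {
    "Lab X": ["Institution 1", "Institution 2"],
    "Lab Y": ["Institution 3"],
}


def enrich(record, author_labs, lab_institutions):
    """Return a copy of a harvested record with affiliation tags added."""
    affiliations = []
    for author in record["creators"]:
        # Authors missing from the authority list simply get no tags.
        for lab in author_labs.get(author, []):
            affiliations.append({
                "author": author,
                "laboratory": lab,
                "institutions": list(lab_institutions.get(lab, [])),
            })
    enriched = dict(record)
    enriched["affiliations"] = affiliations
    return enriched


# A record as it might arrive from an IR, carrying only basic metadata.
harvested = {
    "title": "A sample postprint",
    "creators": ["Dupont, M.", "Unknown, U."],
    "source": "ir.example.org",
}
tagged = enrich(harvested, AUTHOR_LABS, LAB_INSTITUTIONS)
```

The enrichment step runs after harvesting and leaves the original record untouched, so it is indifferent to where the paper was first deposited.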

FL: "Many OAI repositories do not guarantee sufficient quality, and even access to the full text."

The 1st, 2nd and Nth immediate problem today is lack of content, not low-quality metadata: The texts [85% of them] are not deposited at all. The OA movement, and OA self-archiving mandates, are endeavouring to get that content deposited. Authors' own IRs are the natural place to deposit it, and to mandate depositing it.

Then (as agreed above) HAL can, if it wishes, harvest that content, and improve its metadata. (Again, this is no argument against harvesting, or in favour of direct deposit in HAL.)

As to texts that are deposited in IRs but not made OA: I wish that were the only remaining problem, for I guarantee that if it were, it would solve itself in short order. (The OA metadata would elicit email eprint requests, and authors would soon tire of emailing eprints and would set access to their deposited full-texts as OA instead of Closed Access.)

But we do not have all or most or even much of the target literature -- the peer-reviewed research corpus -- deposited in IRs in Closed Access, with only their (low-quality) metadata accessible: At least 85% of OA's target content is not deposited at all.

So it seems to me HAL would benefit as much as everyone else from a self-archiving mandate that would get all that content deposited; so the only question is who will mandate it to be deposited, and where?

So far, the two natural candidate mandaters are the researchers' own institutions and funders. Clearly institutions have an interest in mandating that the deposit should be in their own IRs (for institutional visibility, prestige, and record-keeping). Funders (although some unthinkingly insist on central deposits today, e.g., in PubMed Central) are mostly indifferent to where their funded research is deposited, as long as it is OAI-compliant and OA. So many mandate depositing in the researcher's own IR too. And PubMed Central should be asking itself the same questions I am asking you about HAL: Why not deposit in each researcher's own OAI-compliant IR and simply harvest from there?

Institutions have a direct institutional interest in their own IRs; they are the ones that can best monitor and reward compliance with self-archiving mandates; and the spectrum of disciplines at research institutions (mostly universities) effectively covers all of OA's target content space (whereas central disciplinary and multidisciplinary repositories do not).

A national repository like HAL is a very good idea, but unless the problem of the means of mandating and monitoring direct self-archiving in HAL by all French researchers has an immediate solution, at the very least a hybrid deposit system would seem to be optimal: either researchers deposit in their own IRs (subsequently harvested and enhanced by HAL) or directly in HAL; but deposit they must.

FL: "Hal includes certification procedures (which we call "stamps") which do not exist in other open archives."

That's fine, but the non-existence that is the immediate problem is not certificates but deposits! At least 85% of French research output is not being self-archived at all. Institutional and funder self-archiving mandates can remedy this, but are all or most of France's research institutions more likely to agree to mandate and monitor depositing all of their own output in HAL, or in their own IRs?

The "stamps" could come either way, either via harvesting from IRs or via direct deposits.

FL: "In brief, we want Hal to be an homogeneous system, really usable by the reader, and by labs (even if they belong to several institutions) and institutions - all this through a single entry into the system."

If HAL can become a direct entry point for all French research institutions, and they all agree to a means of mandating and monitoring compliance, nolo contendere!

But what is sure is that a central repository and central depositing is not the only way to get an OA corpus usable by all (authors and users), in France and worldwide. On the contrary, the nature of the Internet, the Web, the OAI protocol and any other richer metadata tagging schemes is such that distributed interoperability -- rather than a central locus and central management -- is far more likely to prove to be the successful means of generating and using the OA corpus, in France and worldwide.

FL: "For instance, my lab belongs to 4 institutions, we do not want to put our articles into four open archives; one is enough."

First, if the 4 institutions don't want or need their research output to be deposited in their own IRs, there is no need to do it. Perhaps the lab itself will want to have its own IR. Moreover, harvesting works N ways: Once a paper and its (OAI) metadata are deposited in one IR, other IRs (as well as Central Repositories like HAL or PubMed Central or Arxiv) can harvest it; or the author can import/export it to his multiple IR affiliations. (And let us not forget that even direct deposit takes less than ten minutes' worth of keystrokes!)

In other words, with OAI harvestability, yes, one deposit is enough.

FL: "I am just explaining what we do, and the strategy we chose (after much discussion!). I am not claiming that it is the best in the world, or even superior to others; actually, I know that you do not approve it, Stevan. But I personally believe in it, because I feel that it meets the quality that is necessary to build a real tool for research."

Franck, it has nothing to do with approval or disapproval. Whatever system results in 100% of French research output being made OA (soon!) -- whether by mandating direct deposit in HAL, or mandating local IR deposit, or even (mirabile dictu) by having all journals convert to OA publishing -- realises the goal of the OA movement: 100% OA for peer-reviewed research output, now.

But is HAL's policy of central deposit and metadata enhancement sufficient to generate that 100% self-archiving? For if not, then whatever other desiderata it may be providing, it is not providing OA's target content.

FL: "one can easily extract a local institutional repository from Hal, and even import all the data locally, if useful."

I don't doubt it. But you have not yet told me how you propose to get all that content deposited in HAL in the first place, so that institutions can then harvest back their own content from it: On the face of it, it would seem that the institutions should be depositing their own content in their own IRs directly, and HAL should be harvesting it, not vice versa. But if you do have a plan for a national mandate to deposit directly in HAL, I would say all this discussion is moot. Without such a plan, however, this discussion is beside the point (at least insofar as OA is concerned).

FL: "one can also transfer documents to Hal from local systems using the so called "webservice" techniques. In other words you can load documents into Hal from your local system for electronic documents, without knowing anything about Hal, provided that your metadata are Hal compatible. This is what several institutions are now doing in France."

The French institutions that have already succeeded in getting their research output into their own IRs -- whether merely OAI-compliant IRs or HAL-compliant IRs -- have already succeeded in solving the problem we (or at least I!) am discussing here, for whatever contents they have succeeded in getting deposited. My guess is that if these deposits are unmandated, then they represent about 15% of those institutions' annual research output, and we are back where we started.

The issue, au fond, is not where papers are deposited, but whether they are deposited. The only reason I keep harping on institutional IR depositing rather than central depositing is that institutions are the primary content providers, in all research disciplines, and hence their own IRs are the natural place to require their own researchers to self-archive their own research output. Moreover, institutions cover all research disciplines, hence all of OA's target content space.

It is virtually certain that the only way to attain 100% OA self-archiving is via self-archiving mandates from researchers' institutions and funders. Hence the only real question about IR deposit vs. HAL deposit in France is whether the probability of a successful pandisciplinary, paninstitutional national mandate to deposit in HAL is greater in France than the probability of institutional and funder mandates to self-archive institutionally. What is best for France is whichever of these is in fact more likely.

FL: "Let me finally add that Hal has been conceived to combine the advantage of disciplinary open archives (what scientists want)."

I think that what you wanted to say, Franck, was that (many) scientists want to be able to search and access all and only the relevant research in their own disciplines. That they want it all to be in one discipline-based "archive," and that that archive must have been deposited in directly rather than harvested -- and even that the realisation of these wishes requires the full richness of HAL's proposed metadata -- is rather a theoretical assumption on the part of some, rather than an objective statement of "what scientists want"...

FL: "and institutional archives (which are indispensable if we want institutions to push scientists to deposit their [research output]). "

This, I think, is closer to assumption-free objectivity: Institutions do want their own output in their own IRs and not just in some external discipline-based collective database. But here I would agree with you: Harvesting could work in either direction, to give everyone what they want.

But harvesting will not get undeposited content deposited; only mandates will. So the question is whether institutions (and funders) are more likely to be pushed to push their researchers to deposit their research output in (1) international disciplinary archives like Arxiv or PubMed Central, (2) national omnibus archives like HAL, or (3) their own institutional IRs?

FL: "You can create portals of Hal that are institutional, with the logo, words, etc.. of the institution, for both upload and download."

I agree completely that harvesting can go either way, so if, mirabile dictu, HAL succeeded in getting all or most of French research output in all disciplines directly deposited in HAL, then it would be trivial to generate virtual IRs for each institution via back-harvesting.

But how do you propose to get the content deposited in HAL in the first place? You seem to be focussed on centrality and metadata enrichment, but we need to hear about how you plan to get the content (and how much of it): The target is 100% of French research output. The baseline today is 15% spontaneous self-archiving: How do you plan to get from 15% to 100%, and when?

FL: "But at the same time all the documents go to the same data base. This is technically possible, but requires the solid structure of metadata that I described above."

It requires something much harder to get than the solid metadata structure: it requires 100% of the target content!

FL: "I hope that I have explained the situation clearly."

As Fermat (or the hopeful builder of the perpetuum mobile) would have conceded: there are still a few little details missing. In this case, the detail concerns how you plan to get HAL filled. For without that, we are talking about raising the quality standards and price for a product that does not yet have any customers (apart from the 15% spontaneous baseline)...

Stevan Harnad
American Scientist Open Access Forum

Posted by Stevan Harnad in Institutional Repositories at 23:28
