SUMMARY: France's HAL is a large national repository for research output with enriched metadata. Like all other countries, France clearly needs self-archiving mandates from its research institutions and funders. The question is whether the mandates should be to deposit directly in HAL, or in each researcher's own Institutional Repository (from which HAL could then harvest the contents). Franck Laloë of CCSD replies to some questions about HAL.
On Mon, 2 Oct 2006, Franck Laloë [FL] wrote:
FL: "Hal (and I presume Prodinra [INRA]) are of course OAI-PMH compatible, and of course can be harvested within this protocol and others. This compatibility is a necessary condition for an archive to be useful to the scientific community. But a necessary condition is not always sufficient. We need more interoperability than just that possible within OAI-PMH; Hal meets this requirement. -- I know that Stevan and others will disagree with the last sentence above..."
But I don't disagree at all! The more interoperability the better! What I am still very keen to know is the following:
(1) How is it proposed to get all of France's research output into HAL?
(2) Why do the deposits need to be directly in HAL, rather than in each author's own Institutional Repository (IR), then harvested by HAL?
(3) Are there any self-archiving mandates being proposed or planned in France (along the lines already adopted by RCUK and the Wellcome Trust in the UK, proposed by the FRPAA in the US, and recommended (A1) by the European Commission)?
(4) Where will such mandates, if any, require researchers to deposit: in HAL or in their own institution's IR?
The most important question is (3), because what is becoming increasingly clear is that -- be they as interoperable as one could possibly wish -- near-empty repositories are not very useful! The spontaneous (unmandated) self-archiving rate worldwide is about 15% (with a few disciplines, e.g. physics, well above 15%, but most disciplines at or below 15%), whereas the mandated deposit rate climbs toward 100% within a few years of adoption (as predicted by Alma Swan's international surveys, and confirmed by Arthur Sale's analyses comparing mandated and unmandated deposit rates).
So where do things stand in France regarding self-archiving mandates (whether for local institutional deposit or national deposit in HAL)?
FL: "There are several reasons ["why... deposits need to be directly in HAL, rather than in each author's own Institutional Repository (IR), then harvested by HAL"]:
Hal requires a structure of metadata where each author is linked to one (or several) laboratories, which in turn are linked to one (or several institutions)."
But I don't understand: Can these complex affiliation metadata not be provided (rather trivially)?
FL: "This structure of metadata in Hal is richer than OAI-PMH (in its present state)."
I understand: HAL does richer tagging than the OAI-PMH IRs. But that tagging has to be added in any case: So why not add it to the metadata harvested from the distributed deposits in each author's own IR?
FL: "If we harvested all OAI-PMH compliant archives, this would introduce redundancy"
So what? And there are ways to either remove or coalesce duplicates. (The problem today -- we cannot remind ourselves often enough -- is too few deposits [15%], not too many, nor too many duplicates!)
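To make "coalesce duplicates" concrete, here is a minimal sketch, not a description of what HAL or any harvester actually does. It assumes harvested records arrive as simple dicts with hypothetical keys `title`, `year` and `source`, and it treats two records as the same paper when their normalised title and year match (a real service would presumably also compare identifiers and author lists):

```python
import re
from collections import defaultdict


def normalise_title(title):
    """Lower-case a title and collapse punctuation and whitespace so that
    trivially different copies of the same paper compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def coalesce_duplicates(records):
    """Group records by (normalised title, year) and merge each group.

    The record keys 'title', 'year' and 'source' are assumptions made for
    this illustration. The merged record keeps the first title seen and
    lists every repository in which a copy was found.
    """
    groups = defaultdict(list)
    for rec in records:
        key = (normalise_title(rec["title"]), rec["year"])
        groups[key].append(rec)

    merged = []
    for copies in groups.values():
        merged.append({
            "title": copies[0]["title"],
            "year": copies[0]["year"],
            "sources": sorted({c["source"] for c in copies}),
        })
    return merged


# Hypothetical harvest: the same paper deposited in two IRs, plus one other.
harvested = [
    {"title": "Open Access: A Survey", "year": 2006, "source": "ir-a"},
    {"title": "open access - a survey", "year": 2006, "source": "ir-b"},
    {"title": "Another Paper", "year": 2005, "source": "ir-a"},
]
unique = coalesce_duplicates(harvested)
```

Here the two copies of the first paper collapse into one record that remembers both source repositories, so redundancy costs a harvester a grouping step rather than anything fundamental.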
FL: "documents with incomplete metadata (most institutional repositories only mention their institution, not the contribution of others, etc..)."
We've agreed that if HAL aspires to have richer metadata than IRs with their OAI-PMH then the extra tags have to be added; this is not an answer to the question of why HAL insists on direct deposits, rather than harvesting from IRs.
And you already mentioned earlier the baroque intricacies of institutional affiliation -- but not why you think this trivial problem cannot be handled easily by software, with the help of IR affiliations, lists of co-authors and (why not) central authoritative lists chronicling all French authors' current (and past!) complexities of affiliation, turning them into explicit metadata tags...
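A rough sketch of what such software might look like, under loudly stated assumptions: the two lookup tables stand in for a central authoritative list (author to laboratories, laboratory to institutions), every name in them is invented, and the record layout is hypothetical rather than HAL's actual schema:

```python
# Hypothetical central registry; all names are placeholders.
AUTHOR_LABS = {
    "A. Author": ["Lab One"],
    "B. Coauthor": ["Lab One", "Lab Two"],
}
LAB_INSTITUTIONS = {
    "Lab One": ["Institution A", "Institution B"],
    "Lab Two": ["Institution C"],
}


def enrich_affiliations(record, author_labs, lab_institutions):
    """Return a copy of a harvested record with an explicit
    author -> labs -> institutions structure added.

    Authors missing from the registry get an empty lab list, which flags
    them for manual follow-up instead of guessing an affiliation.
    """
    enriched = dict(record)
    enriched["affiliations"] = [
        {
            "author": author,
            "labs": [
                {"lab": lab, "institutions": lab_institutions.get(lab, [])}
                for lab in author_labs.get(author, [])
            ],
        }
        for author in record["authors"]
    ]
    return enriched


# A record as it might be harvested from an IR with only basic metadata.
harvested = {"title": "A deposited paper", "authors": ["A. Author", "Unknown Person"]}
enriched = enrich_affiliations(harvested, AUTHOR_LABS, LAB_INSTITUTIONS)
```

The hard part, of course, is maintaining the registry, not applying it; but that work is the same whether the paper arrives by harvest or by direct deposit.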
FL: "Many OAI repositories do not guarantee sufficient quality, and even access to the full text."
The 1st, 2nd and Nth immediate problem today is lack of content, not low-quality metadata: The texts [85% of them] are not deposited at all. The OA movement, and OA self-archiving mandates, are endeavouring to get that content deposited. Authors' own IRs are the natural place to deposit it, and to mandate depositing it.
Then (as agreed above) HAL can, if it wishes, harvest that content, and improve its metadata. (Again, this is no argument against harvesting, or in favour of direct deposit in HAL.)
As to texts that are deposited in IRs but not made OA: I wish that were the only remaining problem, for I guarantee that if it were, it would solve itself in short order. (The OA metadata would elicit email eprint requests, and authors would soon tire of emailing eprints and would simply set access to their deposited full-texts to OA rather than Closed Access.)
But we do not have all or most or even much of the target literature -- the peer-reviewed research corpus -- deposited in IRs in Closed Access, with only their (low-quality) metadata accessible: At least 85% of OA's target content is not deposited at all.
So it seems to me HAL would benefit as much as everyone else from a self-archiving mandate that would get all that content deposited; so the only question is who will mandate it to be deposited, and where?
So far, the two natural candidate mandaters are the researchers' own institutions and funders. Clearly institutions have an interest in mandating that the deposit should be in their own IRs (for institutional visibility, prestige, and record-keeping). Funders (although some unthinkingly insist on central deposits today, e.g., in PubMed Central) are mostly indifferent to where their funded research is deposited, as long as it is OAI-compliant and OA. Hence many funders mandate depositing in the researcher's own IR too. And PubMed Central should be asking itself the same questions I am asking you about HAL: Why not deposit in each researcher's own OAI-compliant IR and simply harvest from there?
Institutions have a direct institutional interest in their own IRs; they are the ones that can best monitor and reward compliance with self-archiving mandates; and the spectrum of disciplines at research institutions (mostly universities) effectively covers all of OA's target content space (whereas central disciplinary and multidisciplinary repositories do not).
A national repository like HAL is a very good idea, but unless the problem of the means of mandating and monitoring direct self-archiving in HAL by all French researchers has an immediate solution, at the very least a hybrid deposit system would seem to be optimal: either researchers deposit in their own IRs (subsequently harvested and enhanced by HAL) or directly in HAL; but deposit they must.
FL: "Hal includes certification procedures (which we call "stamps") which do not exist in other open archives."
That's fine, but the non-existence that is the immediate problem is not certificates but deposits! At least 85% of French research output is not being self-archived at all. Institutional and funder self-archiving mandates can remedy this, but are all or most of France's research institutions more likely to agree to mandate and monitor depositing all of their own output in HAL, or in their own IRs?
The "stamps" could come either way, either via harvesting from IRs or via direct deposits.
FL: "In brief, we want Hal to be an homogeneous system, really usable by the reader, and by labs (even if they belong to several institutions) and institutions - all this through a single entry into the system."
If HAL can become a direct entry point for all French research institutions, and they all agree to a means of mandating and monitoring compliance, nolo contendere!
But what is sure is that a central repository and central depositing is not the only way to get an OA corpus usable by all (authors and users), in France and worldwide. On the contrary, the nature of the Internet, the Web, the OAI protocol and any other richer metadata tagging schemes is such that distributed interoperability -- rather than a central locus and central management -- is far more likely to prove to be the successful means of generating and using the OA corpus, in France and worldwide.
FL: "For instance, my lab belongs to 4 institutions, we do not want to put our articles into four open archives; one is enough."
First, if the 4 institutions don't want or need their research output to be deposited in their own IRs, there is no need to do it. Perhaps the lab itself will want to have its own IR. Moreover, harvesting works N ways: Once a paper and its (OAI) metadata are deposited in one IR, other IRs (as well as Central Repositories like HAL or PubMed Central or Arxiv) can harvest it; or the author can import/export it to his multiple IR affiliations. (And let us not forget that even direct deposit takes less than ten minutes' worth of keystrokes!)
In other words, with OAI harvestability, yes, one deposit is enough.
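For readers who have not seen harvesting in practice, here is a minimal sketch of the client side of an OAI-PMH harvest, using only the Python standard library. The repository address `https://ir.example.org/oai` and the sample response are invented for illustration; the `ListRecords` verb, the `oai_dc` metadata prefix, the `resumptionToken` mechanism and the XML namespaces are those of OAI-PMH 2.0 and unqualified Dublin Core. The sketch parses a response held in a string, so it makes no network request itself:

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces used by OAI-PMH 2.0 responses and Dublin Core elements.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, resumption_token=None):
    """Build a ListRecords request URL.

    A resumptionToken is an exclusive argument in OAI-PMH, so when one is
    supplied it is sent with the verb alone (no metadataPrefix).
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Extract (identifier, title) pairs and any resumption token
    from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": rec.findtext("oai:header/oai:identifier", namespaces=NS),
            "title": rec.findtext(".//dc:title", namespaces=NS),
        })
    token = root.findtext(".//oai:resumptionToken", namespaces=NS)
    return records, (token or None)


# Invented sample response from a hypothetical institutional repository.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:ir.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A deposited paper</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

url = list_records_url("https://ir.example.org/oai")
records, token = parse_list_records(SAMPLE)
```

A harvester would fetch `url`, parse the response, and repeat with the returned token until none is left. The depositing author does nothing beyond the original single deposit.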
FL: "I am just explaining what we do, and the strategy we chose (after much discussion!). I am not claiming that it is the best in the world, or even superior to others; actually, I know that you do not approve it, Stevan. But I personally believe in it, because I feel that it meets the quality that is necessary to build a real tool for research."
Franck, it has nothing to do with approval or disapproval.
Whatever system results in 100% of French research output being made OA (soon!) -- whether by mandating direct deposit in HAL, or mandating local IR deposit, or even (mirabile dictu) by having all journals convert to OA publishing -- realises the goal of the OA movement: 100% OA for peer-reviewed research output, now.
But is HAL's policy of central deposit and metadata enhancement sufficient to generate that 100% self-archiving? For if not, then whatever other desiderata it may be providing, it is not providing OA's target content.
FL: "one can easily extract a local institutional repository from Hal, and even import all the data locally, if useful."
I don't doubt it. But you have not yet told me how you propose to get all that content deposited in HAL in the first place, so that institutions can then harvest back their own content from it: On the face of it, it would seem that the institutions should be depositing their own content in their own IRs directly, and HAL should be harvesting it, not vice versa. But if you do have a plan for a national mandate to deposit directly in HAL, I would say all this discussion is moot. Without such a plan, however, this discussion is beside the point (at least insofar as OA is concerned).
FL: "one can also transfer documents to Hal from local systems using the so called "webservice" techniques. In other words you can load documents into Hal from your local system for electronic documents, without knowing anything about Hal, provided that your metadata are Hal compatible. This is what several institutions are now doing in France."
The French institutions that have already succeeded in getting their research output into their own IRs -- whether merely OAI-compliant IRs or HAL-compliant IRs -- have already succeeded in solving the problem we (or at least I!) am discussing here, for whatever contents they have succeeded in getting deposited. My guess is that if these deposits are unmandated, then they represent about 15% of those institutions' annual research output, and we are back where we started.
The issue, au fond, is not where papers are deposited, but whether they are deposited. The only reason I keep harping on institutional IR depositing rather than central depositing is that institutions are the primary content providers, in all research disciplines, and hence their own IRs are the natural place to require their own researchers to self-archive their own research output. Moreover, institutions cover all research disciplines, hence all of OA's target content space.
It is virtually certain that the only way to attain 100% OA self-archiving is via self-archiving mandates from researchers' institutions and funders. Hence the only real question about IR deposit vs. HAL deposit in France is whether the probability of a successful pandisciplinary, paninstitutional national mandate to deposit in HAL is greater in France than the probability of institutional and funder mandates to self-archive institutionally. What is best for France is whichever of these is in fact more likely.
FL: "Let me finally add that Hal has been conceived to combine the advantage of disciplinary open archives (what scientists want)."
I think that what you wanted to say, Franck, was that (many) scientists want to be able to search and access all and only the relevant research in their own disciplines. That they want it all to be in one discipline-based "archive," and that that archive must have been deposited in directly rather than harvested -- and even that the realisation of these wishes requires the full richness of HAL's proposed metadata -- is rather a theoretical assumption on the part of some, rather than an objective statement of "what scientists want"...
FL: "and institutional archives (which are indispensable if we want institutions to push scientists to deposit their [research output]). "
This, I think, is closer to assumption-free objectivity: Institutions do want their own output in their own IRs and not just in some external discipline-based collective database. But here I would agree with you: Harvesting could work in either direction, to give everyone what they want.
But harvesting will not get undeposited content deposited; only mandates will. So the question is whether institutions (and funders) are more likely to be pushed to push their researchers to deposit their research output in (1) international disciplinary archives like Arxiv or PubMed Central, (2) national omnibus archives like HAL, or (3) their own institutional IRs?
FL: "You can create portals of Hal that are institutional, with the logo, words, etc.. of the institution, for both upload and download."
I agree completely that harvesting can go either way, so if, mirabile dictu, HAL succeeded in getting all or most of French research output in all disciplines directly deposited in HAL, then it would be trivial to generate virtual IRs for each institution via back-harvesting.
But how do you propose to get the content deposited in HAL in the first place? You seem to be focussed on centrality and metadata enrichment, but we need to hear about how you plan to get the content (and how much of it): The target is 100% of French research output. The baseline today is 15% spontaneous self-archiving: How do you plan to get from 15% to 100%, and when?
FL: "But at the same time all the documents go to the same data base. This is technically possible, but requires the solid structure of metadata that I described above."
It requires something much harder to get than the solid metadata structure: it requires 100% of the target content!
FL: "I hope that I have explained the situation clearly."
As Fermat (or the hopeful builder of the perpetuum mobile) would have conceded: there are still a few little details missing. In this case, the detail concerns how you plan to get HAL filled. For without that, we are talking about raising the quality standards and price for a product that does not yet have any customers (apart from the 15% spontaneous baseline)...
Stevan Harnad
American Scientist Open Access Forum