SUMMARY: Data-mining robots like SciBorg can harvest Green OA full-texts, self-archived in their authors' Institutional Repositories (IRs), and “repurpose” them for better functionality. The postprint is the author’s own refereed, revised final draft. Green journal publishers endorse author posting of postprints in their own IR, free for all. The author can certainly revise that draft further, making additional corrections, updates and enhancements, including marking it up in XML and adding comments. Those corrections need not be done by the author's own hands: They could be done by a graduate student, a collaborator, a secretary, or a hired hand. The author could also have SciBorg “repurpose” his postprint -- under one trivial condition, easily fulfilled, which is that the locus of the enhanced postprint, the URL from which users must download it, remains the author’s own IR, not a 3rd-party website. It is not only unnecessary but would be highly inimical to the progress of Green OA mandates to insist instead that the Green publisher’s endorsement to self-archive the postprint in the author’s IR is "not enough" for full-blooded OA -- that the author must also successfully negotiate with the publisher the retention of the right to assign to 3rd-party harvesters like SciBorg the right to publish a “derivative work” derived from the author’s postprint.
In "Why Green Open Access does not support text- and data-mining", Peter Murray-Rust wrote:
PM-R: "...the first thing to do is to gather a corpus of documents... any other scientist should be able to have access to it. It therefore has to be freely distributable..."
Agreed. So far this is just bog-standard OA. If the original documents are self-archived as Green OA postprints in their authors' Institutional Repositories (IRs), your SciBorg robot can harvest them and data-mine them, and make the results freely accessible (but linking back to the postprint in the author's IR whenever the full-text needs to be downloaded).
PM-R: "[At SciBorg] we are interested in machines understanding science..."
Fine. Let your SciBorg machines harvest the Green OA full-texts and "repurpose" them as they see fit.
PM-R: "almost all articles are copyrighted and non-distributable. Publisher Copyright is a major barrier... you can’t just go out and compile a wordlist or whatever as you may infringe copyright or invisible publisher contracts (we found that out the hard way)..."
You can't do that if you are harvesting the publisher's proprietary text, but you can certainly do that if you are harvesting the author's Green OA postprints.
PM-R: "PDFs are so awful... we have to repurpose them by converting to HTML, XML and so on..."
Fine.
PM-R: "Now the corpus is annotated. Expert humans go through line by line...It is this annotated corpus which is of most use to the scientific community..."
Fine.
PM-R: "So suppose I find 50 articles in 50 different repositories, all of which claim to be Green Open Access. I now download them, aggregate them and [SciBorg] repurpose[s] them. What is the likelihood that some publisher will complain? I would guess very high..."
Complain about what, and to whom? A Green publisher has endorsed the author's posting of his own Green OA postprint in his own IR, free for all. The postprint is the author's own refereed, revised final draft. Now follow me: Having endorsed the posting of that draft, does anyone imagine that the publisher would have any grounds for objection if the author revised the draft further, making additional corrections and enhancements? Of course not. It's exactly the same thing: the author's Green OA postprint.
So what if the author decides to mark it up as XML and add comments? Any grounds for objection? Again, no. Corrections, updates and enhancements of the author's postprint are in complete conformity with posting his postprint.
Suppose the author did not do those corrections with his own hands, but had a colleague, a graduate student, a secretary, or a hired hand do them for him, and then posted the corrected postprint? Still perfectly fine.
Now suppose the author had your SciBorg "repurpose" his postprint: Any difference? None -- except a trivial condition, easily fulfilled, which is that the locus of the enhanced postprint, the URL from which users can download it, should again be the author's IR, not a 3rd-party website (which the publisher could then legitimately regard as a rival publisher -- especially if it was selling access to the "repurposed" text).
So the solution is quite obvious and quite trivial: It is fine for the SciBorg harvester to be the locus of the data-mining and enhancement of each Green OA postprint. It can also be the means by which users search and navigate the corpus. But SciBorg must not be the locus from which the user accesses the full-text: The "repurposed" full-text must be parked in the author's own IR, and retrieved from there whenever a user wants to read and download it (rather than just to search and surf the entire corpus via SciBorg).
Not only does this all sound silly: it really is silly. In the online age, it makes no functional difference at all where a document is actually physically located, especially if the document is OA! But we are still at the confused interface between the paper age and the OA era. So we have to be prepared to go through a few silly rituals, to forestall any needless fits of apoplexy, which would otherwise mean further dysfunctional delay (for OA).
So the ritual is this: It would be highly inimical to the progress of Green OA mandates to insist that the publisher's endorsement to self-archive the postprint in the author's IR is "not enough" -- that the author must also successfully negotiate with the publisher the retention of the right to assign to 3rd-party harvesters like SciBorg the right to publish a "derivative work" derived from the author's postprint. That would definitely be the tail wagging the dog, insofar as OA is concerned, and it would put authors off providing Green OA (and hence put their institutions off mandating it) for a long time to come.
Instead, when SciBorg harvests a document from a Green OA IR, SciBorg must make an arrangement with the author that the resultant "repurposed" draft will be deposited by the author in the author's own IR as an update of the postprint. Then, whenever a user of SciBorg wishes to retrieve the "repurposed" draft, the downloading site must always be the author's IR: no direct retrieval from the SciBorg site.
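To make the division of labour concrete, here is a minimal sketch of how a harvester might keep search and navigation on its own side while always handing the user back to the author's IR for the full-text. Everything in it is an assumption made for illustration (the CorpusEntry and Harvester names, the example hosts and URLs); it is not a description of how SciBorg is actually built.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class CorpusEntry:
    """One harvested postprint, as a hypothetical harvester might index it."""
    title: str
    ir_url: str  # where the (enhanced) postprint lives in the author's own IR
    annotations: list = field(default_factory=list)  # mined terms, used for search only


class Harvester:
    """Hypothetical harvester: indexes and searches, but never serves full-text."""

    def __init__(self, own_host):
        self.own_host = own_host
        self.entries = []

    def add(self, entry):
        # Refuse any entry whose full-text would be downloaded from the harvester itself.
        if urlparse(entry.ir_url).hostname == self.own_host:
            raise ValueError("full-text must be retrieved from the author's IR")
        self.entries.append(entry)

    def search(self, term):
        # Searching and surfing the corpus happens on the harvester's side.
        term = term.lower()
        return [e for e in self.entries
                if any(term == a.lower() for a in e.annotations)]

    def download_url(self, entry):
        # Reading and downloading always resolve to the author's IR.
        return entry.ir_url
```

In use, a reader would search via the harvester and then be sent to the URL returned by download_url, which by construction never points at the harvester's own site.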
This ritual is ridiculous, and of course it is functionally unnecessary, but it is pseudo-juridically necessary, during this imbecilic interregnum, to keep all parties (publishers, lawyers, IP specialists, institutions, authors) calm and happy -- or at least mutely resigned -- about the transition to the optimal and inevitable that is currently taking place. Once it's over, and we have 100% Green OA, all this papyrophrenic horseplay can be well-deservedly dropped for the nonsense it is.
Please, Peter, be prepared to adapt SciBorg to the exigencies of this all-important (and all too slow-footed) transitional phase, rather than trying to force-fit the status quo to SciBorg, at the cost of still more delays to OA.
PM-R: "Only a rights statement actually on each document would allow us to create a corpus for NLP without fear of being asked to take it down..."
No. Green OA authors with standard copyright agreements are not in a position to license republication rights to SciBorg or any other 3rd party. Let us be happy that they have provided Green OA at all, and let SciBorg be the one to adapt to it for now, rather than vice versa.
Brody, T., Carr, L., Gingras, Y., Hajjem, C., Harnad, S. and Swan, A. (2007) Incentivizing the Open Access Research Web: Publication-Archiving, Data-Archiving and Scientometrics. CTWatch Quarterly 3(3).
Stevan Harnad
American Scientist Open Access Forum