Date: Mon, 02 Aug 1999 15:57:40 +0100
From: Leslie Carr <lac@ecs.soton.ac.uk>
To: harnad@ecs.soton.ac.uk, sh94r@ecs.soton.ac.uk, wh@ecs.soton.ac.uk,
ginsparg@qfwfq.lanl.gov,
halpern@cs.cornell.edu, lagoze@cs.cornell.edu,
wya@cs.cornell.edu, ijones@bcs.org.uk,
eric@hellman.net,
lawrence@research.nj.nec.com,
vandesompel@rug.ac.be,
kurt@research.nj.nec.com,
friedman@highwire.stanford.edu,
giles@research.nj.nec.com,
hoc@SLAC.Stanford.EDU,
www-admin@xxx.lanl.gov
Subject: EprintLinks (Opcit) project: report on some early work
and directions
During July 1999 some proof-of-concept work was undertaken on the
EprintLinks (Opcit) project. This document is a brief report on the state
of this initial work. (For a simple diagram showing the kind of linking that
the project is trying to achieve, see
http://www.ecs.soton.ac.uk/~lac/EPrintLinkdemo.gif .)
XXX (ArXiv physics) Citation Strategies
=======================================
There are a number of ways to provide links between the citations and
the cited preprints. The essential part of this process is the ability
to map between a journal citation (e.g. Phys. Rev. D. 56, 6336) and the
appropriate XXX reference (e.g. astro-ph/907075). The responsibility for
declaring the relationship between a citation and the archive holdings
can be borne by:
(a) the author of the article directly citing the XXX reference code
(b) an external agency maintaining a manually entered database i.e.
(SLAC) SPIRES
(c) an external software agency maintaining a citation database derived
from the article contents
Once a process performing this mapping is defined, the appropriate hypertext
links can be embedded directly into a suitable (viewable) version of the
article, either by the Open Journal linking software or by adding hyperdvi
commands to the original source.
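As an illustration only, the mapping step can be pictured as a lookup table
keyed on a normalised journal citation. This is a minimal sketch and not the
project's software: the names citation_key, CITATION_TO_EPRINT and
lookup_eprint are hypothetical, and the single table entry is the example
pair quoted above.

```python
import re

def citation_key(journal, volume, page):
    """Normalise a journal citation so that spelling variants compare equal."""
    # Lowercase the journal abbreviation and drop punctuation and spaces,
    # so "Phys. Rev. D." and "Phys Rev D" produce the same key.
    journal_norm = re.sub(r"[^a-z]", "", journal.lower())
    return (journal_norm, str(volume).strip(), str(page).strip())

# Hypothetical mapping table; the only entry is the example from the text.
CITATION_TO_EPRINT = {
    citation_key("Phys. Rev. D.", 56, 6336): "astro-ph/907075",
}

def lookup_eprint(journal, volume, page):
    """Return the archive reference for a citation, or None if unknown."""
    return CITATION_TO_EPRINT.get(citation_key(journal, volume, page))

print(lookup_eprint("Phys Rev D", "56", "6336"))
print(lookup_eprint("Nucl. Phys. B", 1, 1))
```

Whichever of (a), (b) or (c) supplies the data, the result can be thought of
as filling a table of this kind; the linking step then only needs the lookup.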
Some initial work has been undertaken on the XXX archive using the processed
articles in the PostScript cache. This has allowed us to develop a prototype
of (c), the software that reads the articles, parses the references
and deduces the journal citations which are capable of being linked
to eprints. An example of this can be seen at
http://cogprints.ecs.soton.ac.uk/~lac/eprintlinks/NEW/astro-ph/9907075.pdf
(references section begins on p11). The five links on p11 have been
automatically added to the PDF document under the control of the software
reference reader.
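To give a flavour of what reference parsing involves, the sketch below pulls
archive codes and journal citations out of a single reference string. It is
not the prototype itself: both regular expressions and the function name
parse_reference are assumptions made for illustration, and the journal
pattern only covers the "Abbrev. Title Volume, Page" layout, not the many
other styles found in real reference sections.

```python
import re

# Assumed shape of an archive code as cited by authors, e.g. "astro-ph/9907075".
EPRINT_RE = re.compile(r"\b([a-z]+(?:-[a-z]+)*/\d{7})\b")

# Assumed shape of a journal citation: capitalised (possibly abbreviated)
# title words, then a volume, a comma and a page, e.g. "Phys. Rev. D 56, 6336".
JOURNAL_RE = re.compile(r"\b((?:[A-Z][A-Za-z]*\.?\s+)+)(\d+),\s*(\d+)")

def parse_reference(text):
    """Return the archive codes and journal citations found in one reference."""
    eprints = EPRINT_RE.findall(text)
    citations = [
        (journal.strip(), int(volume), int(page))
        for journal, volume, page in JOURNAL_RE.findall(text)
    ]
    return {"eprints": eprints, "citations": citations}

# Invented reference strings, used only to exercise the patterns.
print(parse_reference("A. Author and B. Writer, Phys. Rev. D 56, 6336 (1997)"))
print(parse_reference("C. Example, preprint astro-ph/9907075"))
```

A journal citation found this way still has to be mapped to an archive
holding before a link can be added; a directly cited archive code does not.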
A similar procedure can be followed using (b), software that reads the
WWW page of the SPIRES database and uses that data to automatically add
links. Identical results are obtained from these two methods for this
example article. In practice SPIRES will be more accurate, at least
compared with our current prototype; however, (c) can be used in areas
outside High Energy Physics and can be usefully applied to new eprint
submissions.
Next Stages
===========
Previous experience indicates that it may be easier to work with dvi/ps/pdf
than with the range of TeX input formats. (Most of the problems come from
the fact that TeX is a programming language, not a document description
language.) If this is truly the case then attention must be paid to the
fact that the PostScript/dvi cache constitutes only some 3% of the total
archive. The EprintLink (Opcit) software would need
to independently process the articles to work with the reference sections
of the full database.
An important premise of the EprintLink (Opcit) project is that the archive
readers should be able to navigate directly between viewable articles using
citation links. To accomplish this it will be necessary to provide PDF format
as standard. The results of the processing required above should be made
available directly to the users, and not just held as part of the internal
linking process.
Although (b) can be performed interactively when the user requests a particular article (as part of an Open Linking Service) it would be more helpful for developmental and user-testing purposes to use a mirrored copy of the SPIRES data.
The major part of the work will be in bringing (c) up to better levels of
performance. At the moment it is based on some very simple scripts which
do not deal correctly with all the complexities of the article formats
(e.g. two columns). Of the 2523 articles that we
are using from the cache, 80% have a recognisable references section. Of
those, 96% are successfully parsed into a database of references but
only 48% seem to yield citations to XXX (arXiv). We have a list of
6243 citation links generated by (c) together with a list of some 11000
direct XXX (arXiv) citations provided by (a).
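The percentages above can be turned into approximate article counts. The
short calculation below is a sketch under one assumption that the figures do
not state explicitly: that each percentage applies to the articles surviving
the previous stage. The function name funnel is hypothetical.

```python
TOTAL_ARTICLES = 2523   # articles used from the cache
CITATION_LINKS = 6243   # links generated by (c)

# (stage description, fraction of the previous stage that survives)
STAGES = [
    ("recognisable references section", 0.80),
    ("parsed into a database of references", 0.96),
    ("yield citations to XXX (arXiv)", 0.48),
]

def funnel(total, stages):
    """Apply each fraction in turn and return rounded counts per stage."""
    counts = []
    remaining = float(total)
    for label, fraction in stages:
        remaining *= fraction
        counts.append((label, round(remaining)))
    return counts

counts = funnel(TOTAL_ARTICLES, STAGES)
for label, count in counts:
    print(f"{count:5d}  {label}")

linked_articles = counts[-1][1]
print(f"about {CITATION_LINKS / linked_articles:.1f} links per linked article")
```

On that reading, roughly 930 of the 2523 articles yield citations to the
archive, at an average of about 6.7 generated links each.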
Currently (c) is intended to work directly within the confines of the LANL
archive. Applying SLinkS would also allow us to provide links to publishers'
holdings.
Other bibliometric work is in very early stages. About half of the archive
is declared as having been subsequently published in a journal (according
to the presence of a Journal-ref: field in the listing).
Initial analysis of the rate of change of the article meta-data indicates
that there is an average lag of about 11 months between submitting a preprint
and declaring it published. Exactly what is happening in an eprint's life-cycle
is not yet clear, and we would like to do a lot more work in this area
along with the development of new bibliographic measures.
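For concreteness, a lag of this kind could be computed along the following
lines. This is a sketch only: the record layout (submission date paired with
the date a Journal-ref: field was first seen) is an assumption, the function
names are hypothetical, and the three records are invented so that the
arithmetic can be followed; they are not archive data.

```python
from datetime import date

def months_between(start, end):
    """Whole calendar months from start to end, ignoring the day of month."""
    return (end.year - start.year) * 12 + (end.month - start.month)

def average_lag_months(records):
    """Mean lag in months over the records that have been declared published."""
    lags = [
        months_between(submitted, published)
        for submitted, published in records
        if published is not None
    ]
    return sum(lags) / len(lags) if lags else None

# Invented records: (date submitted, date a Journal-ref: field first appeared).
records = [
    (date(1997, 1, 15), date(1997, 11, 3)),   # 10 months
    (date(1998, 3, 2), date(1999, 3, 20)),    # 12 months
    (date(1999, 6, 1), None),                 # not declared published
]

print(average_lag_months(records))  # 11.0
```

Preprints with no Journal-ref: field are left out of the average, so a figure
produced this way describes only the part of the archive that has been
declared published.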
---
Leslie Carr
Tel: +44 1703 594479
Fax: +44 1703 592865
Email: L.Carr@ecs.soton.ac.uk URL: http://www.ecs.soton.ac.uk/~lac
ACM Member: 5135934
IEEE Member: 40323275
Dept of Electronics and Computer Science, University of Southampton
SO17 1BJ, UK