Early Opcit: The Les Carr archives

Preliminary user/citation analysis of arXiv


From: "Leslie Carr" <lac@ecs.soton.ac.uk>
To: "Stevan Harnad" <harnad@ecs.soton.ac.uk>,
        "Steve Hitchcock" <sh94r@ecs.soton.ac.uk>
Subject: citation patterns
Date: Mon, 26 Jul 1999 07:56:57 +0100

If you look by destination, then just over half of the explicit XXX (arXiv physics) citations are *to* self-confessed journal articles. The ones that aren't are distributed as follows by year....

   9 91
  28 92
 100 93
 108 94
 178 95
 292 96
 417 97
 831 98
 658 99

i.e. 831 direct XXX (arXiv) citations are to 'not-journals' added to the archive in 1998. Does this distribution reflect the growth in the use of the archive, or the publishing lag?

(Remember that all these numbers are based on an examination of the reference sections of a small subset of the archive data.)

---
Les
 

From: "Leslie Carr" <lac@ecs.soton.ac.uk>
To: "Steve Hitchcock" <sh94r@ecs.soton.ac.uk>
Subject: XXX usage analysis
Date: Mon, 8 Nov 1999 16:06:22 -0000

I finished the following rudimentary analysis of XXX (arXiv) usage literally hours before Julian 'accidentally' deleted the old XXX (arXiv) archive (cogprints) from which I gleaned the data.

http://www.ecs.soton.ac.uk/~lac/XXXmetadatadeltas.html

The main conclusion of interest is that only about 11% of people seem to update their articles when they update the metadata. This needs more investigating, as it is counter to Stevan's assumptions.
---
Les
 

From: Leslie Carr <lac@ecs.soton.ac.uk>
To: Stevan Harnad <harnad@ecs.soton.ac.uk>
Date: 26 November 1999 00:40
Subject: xxx deltas

Now that Zhuoan has an extra 40Gb on her workstation (60Gb total!) I have unpacked the diffs that Julian has been sending me every night. Here are some off-the-top-of-my-head comments:

between 5 November and 25 November there were 3634 changes (additions/alterations) made to the archive.

that's 173 per day.

about 1542 of these changes were for pre-november articles (73/day).
(I'm going to ignore the November articles: fresh additions AND alterations because I can't tell the difference at the mo).

57 articles had 2 or 3 changes made. (seems to be the case that the first change is the addition of a journal-ref and the next changes are slight changes to the formatting of the journal-ref).

510 articles had both content and meta-data updated
1020 just changed the metadata
7 updated the content without changing the metadata.

The changes of pre Nov99 articles fitted into the following categories
 25 articles didn't have a Journal-ref to start with, didn't add one and didn't change the contents.
451 articles didn't have a Journal-ref to start with, didn't add one BUT DID change the contents.
785 articles didn't have a Journal-ref to start with, added one and didn't change the contents.
 43 articles didn't have a Journal-ref to start with, added one AND DID change the contents.
215 articles already had a Journal-ref to start with and didn't change the contents.
 11 articles already had a Journal-ref to start with and did change the contents.

Of all those who (in this period) added a journal-ref, only 5% (43/828) changed the contents as well.
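
The 5% figure can be rechecked from the category counts above. This is a minimal sketch: the counts are copied from the list, while the dictionary layout and the function name are assumptions made for illustration, not the original analysis scripts.

```python
# Counts copied from the category list above.
# Key: (had_journal_ref, added_journal_ref, changed_contents) -> number of articles.
counts = {
    (False, False, False): 25,
    (False, False, True): 451,
    (False, True, False): 785,
    (False, True, True): 43,
    (True, False, False): 215,
    (True, False, True): 11,
}


def content_change_share(counts):
    """Among articles that added a journal-ref, how many also changed contents?"""
    total = sum(n for (_, added, _), n in counts.items() if added)
    changed = sum(n for (_, added, chg), n in counts.items() if added and chg)
    return changed, total, changed / total


changed, total, share = content_change_share(counts)
print(f"{changed}/{total} = {share:.1%}")  # 43/828 = 5.2%
```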

I think it is necessary to look at exactly what happens when an article is submitted, i.e. don't ignore the November data.
---
Les
 

Date: Fri, 26 Nov 1999 14:07:39 +0000
From: Leslie Carr <lac@ecs.soton.ac.uk>
To: sh94r@ecs.soton.ac.uk
Subject: how much change

Steve: I have looked at 40 of the articles that were changed when the journal-ref was added.

I have chosen to represent the "amount of change" as the ratio of "number of lines of old version changed" / "number of lines of old version". You could argue that material is more likely to be added than deleted, so perhaps it ought to be different. However, here are the results:

25% of the articles have <10% changes.
15% of the articles have >10% and < 20% changes.
30% of the articles have >20% and < 30% changes.
30% of the articles have >30% changes.

The *median* value is 21% changes.
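
One way to compute such a ratio is sketched below. It assumes a line-by-line comparison with Python's difflib, which is a stand-in technique and not necessarily the tool used for the figures above; the function name is hypothetical. Lines of the old version that were replaced or deleted count as changed, while pure insertions do not, which matches the caveat about added material.

```python
import difflib
from statistics import median


def change_ratio(old_lines, new_lines):
    """Lines of the old version changed, divided by lines in the old version."""
    if not old_lines:
        return 0.0
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    changed = sum(
        i2 - i1
        for tag, i1, i2, _j1, _j2 in matcher.get_opcodes()
        if tag in ("replace", "delete")  # insertions touch no old lines
    )
    return changed / len(old_lines)


# Toy example: one of four old lines is altered and one new line is appended.
old = ["a", "b", "c", "d"]
new = ["a", "B", "c", "d", "e"]
ratio = change_ratio(old, new)
print(ratio)  # 0.25

# The median over a sample of articles would then be median(list_of_ratios).
sample_median = median([0.0, ratio, 0.5])
```

Dividing by the length of the new version instead, or counting inserted lines as well, would give the alternative measure hinted at above.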
---
Les
 

From: "Leslie Carr" <lac@ecs.soton.ac.uk>
To: "Steve Hitchcock" <sh94r@ecs.soton.ac.uk>
Subject: Re: Quikchart
Date: Fri, 26 Nov 1999 17:27:37 -0000

Another random sample, another set of statistics.

Looking at the set of hep-th articles from Jan 97 -> Oct 99.

There are about 10600 articles all told.
45% of them (4802 articles) do NOT have a journal ref.
Of those without a journal ref, 38% do have some "publication clue" in the comments field e.g. the phrases "to appear in" or "submitted to" or "presented at" or "published in". The clue may indicate something other than journal publication, e.g. "talk given" or "proceedings" or "lecture".

The balance of comment fields simply give the number of pages and the TeX macro packages used for formatting.
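
A minimal sketch of the clue search described above, assuming each article is a record with hypothetical `journal_ref` and `comments` fields. The phrase list is the one quoted in the text; the sample records are invented purely to exercise the code.

```python
import re

# Phrases quoted above; matching is case-insensitive.
CLUE_PATTERN = re.compile(
    r"to appear in|submitted to|presented at|published in"
    r"|talk given|proceedings|lecture",
    re.IGNORECASE,
)


def has_publication_clue(comments):
    """True if the comments field contains one of the clue phrases."""
    return bool(CLUE_PATTERN.search(comments or ""))


def summarise(records):
    """Return (articles without a journal ref, how many of those carry a clue)."""
    no_ref = [r for r in records if not r.get("journal_ref")]
    with_clue = [r for r in no_ref if has_publication_clue(r.get("comments"))]
    return len(no_ref), len(with_clue)


# Invented sample records (field names are assumptions).
records = [
    {"journal_ref": "Nucl.Phys. B500", "comments": "20 pages"},
    {"journal_ref": "", "comments": "12 pages, LaTeX, uses epsf"},
    {"journal_ref": "", "comments": "Talk given at a workshop, 8 pages"},
    {"journal_ref": None, "comments": "Submitted to Phys. Lett. B"},
]
print(summarise(records))  # (3, 2)
```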
---
Les
 

From: "Leslie Carr" <lac@ecs.soton.ac.uk>
To: "Steve Hitchcock" <sh94r@ecs.soton.ac.uk>,
        "Stevan Harnad" <harnad@ecs.soton.ac.uk>
Subject: More XXX fascinating facts
Date: Tue, 14 Dec 1999 14:26:23 -0000

From an analysis of (reader) usage 5th January 1999.

There seem to be 1478 "user sessions", where 1 session is a use of the archive from 1 client. A handful of these correspond to proxies and mirrors. The rest seem genuine individuals.

On this day there are requests for (abstracts, sources or ps of) 3773 different articles.

There were...
3718 requests for abstracts
841 requests for the TeX sources
4031 requests for postscript

The distribution of the requested articles by year is as follows:
   1 1990
  10 1991
  45 1992
  94 1993
 190 1994
 185 1995
 271 1996
 551 1997
3878 1998
1322 1999

You can see the emphasis on the immediate past by looking at the number of downloads from each month for 1998 and Jan 1999 below.

  63 9712
  73 9801
  82 9802
 104 9803
  96 9804
 106 9805
 134 9806
 133 9807
 188 9808
 161 9809
 221 9810
 262 9811
2318 9812
1322 9901

Bear in mind that this is just the downloads for Jan 5th, and the Jan figure is HUGE!
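
The monthly breakdown can be cross-checked against the yearly one. The sketch below copies the monthly figures from the list above and confirms that the 1998 months add up to the 1998 yearly total. The helper that buckets identifiers assumes old-style arXiv identifiers of the form archive/YYMMNNN (e.g. hep-th/9812001); the function names are hypothetical.

```python
from collections import Counter

# Monthly download counts copied from the list above (YYMM -> downloads).
monthly = {
    "9712": 63, "9801": 73, "9802": 82, "9803": 104, "9804": 96,
    "9805": 106, "9806": 134, "9807": 133, "9808": 188, "9809": 161,
    "9810": 221, "9811": 262, "9812": 2318, "9901": 1322,
}


def yymm_of(identifier):
    """'hep-th/9812001' -> '9812' (assumes archive/YYMMNNN identifiers)."""
    return identifier.rsplit("/", 1)[-1][:4]


def downloads_by_month(identifiers):
    """Count requested articles per YYMM bucket."""
    return Counter(yymm_of(i) for i in identifiers)


total_1998 = sum(n for yymm, n in monthly.items() if yymm.startswith("98"))
print(total_1998)  # 3878, matching the 1998 row of the yearly table
```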

The average number of downloads (all abs/ps/tex) per session is 4.8. Ignoring the largest culprits (all proxies) then that drops to 3.8.
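
A sketch of how such a per-session average might be computed, assuming the log has been reduced to (client, kind, article) tuples and that one session means all requests from one client on the day. The tuple layout, the function name and the sample data are assumptions for illustration.

```python
from collections import Counter


def downloads_per_session(requests, drop_largest=0):
    """Mean requests per client, optionally ignoring the busiest clients."""
    per_client = Counter(client for client, _kind, _article in requests)
    counts = sorted(per_client.values(), reverse=True)[drop_largest:]
    return sum(counts) / len(counts) if counts else 0.0


# Invented sample: one heavy "proxy" client and two ordinary readers.
requests = (
    [("proxy", "ps", str(n)) for n in range(6)]
    + [("reader-a", "abs", "1"), ("reader-a", "ps", "1")]
    + [("reader-b", "abs", "2")]
)
print(downloads_per_session(requests))                  # 3.0
print(downloads_per_session(requests, drop_largest=1))  # 1.5
```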

Altogether there were 4031 occasions when a postscript file was requested. On 22% of those occasions the abstract was also downloaded. On 8% of them the TeX was also downloaded.

Altogether there were 3718 occasions when an abstract was requested.
On 24% of those occasions the postscript was also downloaded.
On 3% of those occasions the TeX was also downloaded.

It would appear that downloading an abstract only leads to "further reading" about a quarter of the time.
It would appear that downloading the postscript is infrequently prompted by reading the abstract. In fact,
many of the postscript downloads come immediately after reading the "current" summary list of articles.
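
The co-download percentages could be derived along the following lines. This is a sketch under stated assumptions: "also downloaded" is taken to mean that the same client requested both forms of the same article on the day, repeated requests are collapsed, and the (client, kind, article) tuple layout and function name are hypothetical.

```python
from collections import defaultdict


def also_downloaded(requests, primary, other):
    """Share of (client, article) pairs with a `primary` request that also have `other`."""
    kinds = defaultdict(set)
    for client, kind, article in requests:
        kinds[(client, article)].add(kind)
    with_primary = [k for k in kinds.values() if primary in k]
    if not with_primary:
        return 0.0
    return sum(1 for k in with_primary if other in k) / len(with_primary)


# Invented sample log.
log = [
    ("a", "abs", "1"), ("a", "ps", "1"),
    ("b", "ps", "1"), ("b", "abs", "2"),
    ("c", "ps", "3"), ("c", "tex", "3"),
    ("d", "ps", "4"),
]
print(also_downloaded(log, "ps", "abs"))  # 0.25
print(also_downloaded(log, "abs", "ps"))  # 0.5
```

Counting individual requests rather than distinct (client, article) pairs would change the denominators, so the choice needs to match how the "occasions" were counted.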

Coming soon: what is the most common means of accessing articles? "Search"? Reading the "current" list? Reading the list for a particular month? Can we tell if subsequent downloads in the same session are from articles cited in the initial download? (Use (SLAC) SPIRES and examples from hep-th.)
---
Les
 

Follow-up: Mining the social life of an eprint archive