How does Citation Latency vary with time?
Written by Tim Brody, last updated August 31 2000 14:33:19.
With the advent of digital archives, information can pass between researchers far more quickly than is possible in the world of the paper press.
Authors can deposit an unrefereed pre-print and be read on the same day by an international audience of like-minded researchers. Contrast this with the printed journal, which must first decide whether the paper's content is appropriate and may then publish only bi-annually, leading to a possible gap of upwards of a year between a paper being written and being read. By this time the paper may be obsolete.
The effect of digital archives on citation patterns is crucial and interesting: does the rapid availability and readership of papers reduce the time gap between a paper being deposited and it being cited? How does this relate to the citation of articles in the printed press? Will authors cite unrefereed preprints while waiting for the refereed postprint?
Using the Los Alamos National Laboratory Digital Physics Archive [arXiv] we can analyse the citation patterns within the digital domain; that is, we can analyse the citation "links" between two papers deposited in the archive.
Each paper in the archive is given a unique digital identity, formed from a subject area (for example "hep-th" is High Energy Physics - Theoretical), a date identifier (consisting of the year and month of deposition) and a three-digit index number. Given a list of arXiv papers we can generate pairs of citing paper and cited paper, together with the deposit date of the citing paper minus the deposit date of the cited paper, to an accuracy of one month. For example, astro-ph/9501044 (deposited January 1995) citing hep-ph/9408302 (deposited August 1994) gives a time difference of 5 months.
Latency of Citations
By looking retrospectively at the archive we obtain what look like erroneous results: negative time differences (i.e. a paper cited a paper that had not yet come into existence). These occurrences account for 6,289 of the 603,460 total identified citations. This can be explained by:
By finding the time difference between each paper and the papers it cites, a table of citation latencies can be built: the arXiv id of the citing paper, the arXiv id of the cited paper, and the citing paper's deposit date minus the cited paper's deposit date, in months.
paper             reference         time diff (months)
astro-ph/9501044  hep-ph/9408302     5
astro-ph/9501044  hep-ph/9408342     5
astro-ph/9501044  hep-ph/9406139     7
astro-ph/9501044  gr-qc/9302019     23
astro-ph/9501074  astro-ph/9311052  14
astro-ph/9501074  astro-ph/9311057  14
astro-ph/9501085  astro-ph/9312023  13
astro-ph/9501085  astro-ph/9311064  14
astro-ph/9501085  astro-ph/9311003  14
...
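The time differences listed above can be recovered from the identifiers alone. The following is a minimal Python sketch of that calculation, not the code used for the original analysis. The function names and the regular expression are illustrative, and the century rule for two-digit years is an assumption made so that identifiers from 1991 onwards sort correctly.

```python
import re

# Assumed identifier layout: <subject>/<YYMM><NNN>, e.g. "hep-th/9408302".
ARXIV_ID = re.compile(r"^(?P<subject>[a-z\-]+)/(?P<yy>\d{2})(?P<mm>\d{2})(?P<index>\d{3})$")


def deposit_month(arxiv_id):
    """Return the deposit date as an absolute month count (year * 12 + month - 1)."""
    match = ARXIV_ID.match(arxiv_id)
    if match is None:
        raise ValueError(f"unrecognised identifier: {arxiv_id!r}")
    yy = int(match["yy"])
    mm = int(match["mm"])
    # Assumption: two-digit years 91-99 mean 1991-1999, lower values mean 20yy.
    year = 1900 + yy if yy >= 91 else 2000 + yy
    return year * 12 + (mm - 1)


def citation_latency(citing_id, cited_id):
    """Citing paper's deposit month minus cited paper's deposit month."""
    return deposit_month(citing_id) - deposit_month(cited_id)
```

Because only the year and month are encoded in the identifier, the result is accurate to one month, and a negative value indicates a citation to a paper deposited after the citing paper.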
This set can then be broken down by the year in which the citing paper was deposited, building a picture of how citation behaviour may have changed over the period the archive has been active.
The archive has been active since 1991, so we can analyse papers deposited from 1992 to 1999, covering citations from 1991 up to the most recent.
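One way to sketch this breakdown is to build a separate latency histogram for each deposit year. The function name and input shape below are assumptions chosen for illustration; the input is taken to be a sequence of (year of citing paper, latency in months) pairs.

```python
from collections import Counter, defaultdict


def latency_by_deposit_year(pairs):
    """Group citation latencies by the deposit year of the citing paper.

    pairs: iterable of (citing_year, latency_months) tuples.
    Returns {citing_year: Counter({latency_months: citation_count})}.
    """
    histograms = defaultdict(Counter)
    for citing_year, latency in pairs:
        histograms[citing_year][latency] += 1
    return dict(histograms)
```

Each per-year histogram can then be plotted as citations against latency to compare the shape of the curve between years.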
After a paper is deposited there is a quick growth in its citations, a peak, and then a linear decrease that ends at the start of the archive (the limit of identifiable citations). In the early years of the archive, 1992-94, the initial rise in citations was slower, peaking at around 12 months. After 1994 the peak arrived sooner, at 2-4 months.
Deposit rates have been increasing linearly since the start of the archive, so the citation counts are higher for each successive year. The percentage of identifiable citations also falls towards the beginning of arXiv, as fewer authors directly cited articles in arXiv (so there is more reliance on identification by publication). However, this should have little effect on the peak of citations, as the significant change between years, the peak moving from 12 months to 2-4 months, lies within 12 months.
The linear decrease should be read in the context of the general archive behaviour: deposits have been growing linearly, so the older the citation, the less "chance" it has of being to a paper that is in the archive. We can plot this linear growth alongside the citation ages and take the ratio between the two, to try to "factor" out the effect of the changing background population.
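The ratio described above can be sketched as a division of the citation count at each latency by the number of papers that had been deposited that many months earlier. The function and argument names below are hypothetical, and skipping latencies with no background papers is a choice made for this sketch rather than something stated in the analysis.

```python
def normalise_by_background(citations_by_latency, deposits_by_age):
    """Divide citation counts by the number of papers available to be cited.

    citations_by_latency: {months: citations observed at that latency}
    deposits_by_age:      {months: papers deposited that many months earlier}
    Latencies with no background papers are skipped to avoid dividing by zero.
    """
    ratios = {}
    for months, citations in sorted(citations_by_latency.items()):
        background = deposits_by_age.get(months, 0)
        if background > 0:
            ratios[months] = citations / background
    return ratios
```

Where the background population is very small the ratio swings widely on a handful of citations, so the oldest latencies are the least reliable part of the normalised curve.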
Even after taking the background population of papers into account there is still a peak of citations at a latency of 2-4 months, but with a slightly flatter decrease after 36 months, suggesting that beyond 36 months the age of a paper is not as significant a factor. The results become erratic after 6 years (72 months) as the background population becomes too small.
This scaling for the background population can be applied to all the years; however, it does not account for the reduced number of direct citations to arXiv (and the reduced ability to identify citations) during the earlier period of the archive.
Within the field of bibliometrics great emphasis is placed on assessing "impact", that is, the number of citations that authors, journals and institutions receive. So far these analyses have only looked statically at citation data; for example, ISI analyse the total citations for a 2 year period. This does not answer the question of the nature of these citations: do high-impact authors receive citations over a longer period, i.e. do high-impact authors produce papers with a relatively longer shelf-life? By applying impact factors to the citation data from arXiv we can analyse the latency of citations by impact.
Citation pairs can be broken down by author impact. This graph covers papers that were deposited in 1999, with citations to papers by high, medium and low impact authors (i.e. "when have high impact authors been cited"). To define high, medium and low impact, the impact factor was found for each author (sum of citations divided by number of papers), ordered and plotted on a cumulative graph. The top 25% were defined as high impact, the middle 50% as medium impact and the bottom 25% as low impact.
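The classification can be sketched as follows. This is a minimal illustration under one stated assumption: the 25% / 50% / 25% split is taken over the ranked list of authors, which is one reading of the cumulative graph and may not match how the cut-offs were actually drawn. All names are hypothetical.

```python
def classify_authors(citations, papers):
    """Label each author 'high', 'medium' or 'low' impact.

    citations: {author: total citations received}
    papers:    {author: number of papers deposited}
    Impact factor = citations / papers, as defined in the text.
    """
    impact = {a: citations.get(a, 0) / papers[a] for a in papers if papers[a] > 0}
    ranked = sorted(impact, key=impact.get, reverse=True)
    n = len(ranked)
    high_cut = n // 4      # assumed: top 25% of authors by rank
    low_cut = n - n // 4   # assumed: bottom 25% of authors by rank
    classes = {}
    for rank, author in enumerate(ranked):
        if rank < high_cut:
            classes[author] = "high"
        elif rank < low_cut:
            classes[author] = "medium"
        else:
            classes[author] = "low"
    return classes
```

Citation pairs can then be grouped by the class of the cited paper's author before the latency histogram is built.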
Because authors cannot be readily identified prior to 1995 (when a separate "authors" meta-data field was introduced), this graph only covers the period from 1995 to 1999. When authors are split in this fashion there are fewer citations in total to high impact authors, but these citations are to a much smaller number of papers, i.e. the number of citations per paper is higher for high impact authors than it is for medium and low.
This graph can be scaled so that easier comparison can be made between the impact factors. High impact was scaled by 2.5 and medium impact scaled by 0.6.
This does not take into account the background population of papers; for example a citation to a high impact author is more significant because there are fewer papers to cite. The unscaled graph of citation latencies can be normalised with the deposit rates for each author impact (retrospective deposit rates for authors who are now high, medium and low).
By zeroing the deposit rates at 12/1999 and dividing the citations at each month difference by the number of deposits per month for each impact factor, the citation latency can be normalised, removing the bias between high, medium and low impact authors.
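A possible sketch of this normalisation is given below. It assumes that "zeroing at 12/1999" means re-indexing each calendar month of deposits by its distance in months from December 1999, so that deposits and latencies share the same axis. The function names and data shapes are illustrative rather than taken from the original analysis.

```python
def months_before(reference, year, month):
    """Whole months from (year, month) up to the reference (year, month)."""
    ref_year, ref_month = reference
    return (ref_year - year) * 12 + (ref_month - month)


def normalise_by_impact(citations, deposits, reference=(1999, 12)):
    """Citations per deposited paper, for each impact class and latency.

    citations: {impact: {latency_months: citation count}}
    deposits:  {impact: {(year, month): papers deposited in that month}}
    """
    result = {}
    for impact, by_latency in citations.items():
        # Re-index deposits by age so that the reference month is age 0.
        by_age = {}
        for (year, month), count in deposits.get(impact, {}).items():
            age = months_before(reference, year, month)
            by_age[age] = by_age.get(age, 0) + count
        # Skip latencies with no deposits to avoid dividing by zero.
        result[impact] = {
            latency: count / by_age[latency]
            for latency, count in by_latency.items()
            if by_age.get(latency, 0) > 0
        }
    return result
```

Dividing each class by its own deposit counts puts the three curves on a comparable per-paper scale without the hand-chosen scaling factors.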
It appears that, when measuring by author impact, the impact factor has little effect on when papers receive citations.