Harnad, S. (1985) Rational disagreement in peer review. Science,Technology and Human Values 10: 55 - 62.
http://www.cogsci.soton.ac.uk/~harnad/Papers/Harnad/harnad85.peerev.htm

Rational Disagreement in Peer Review

Stevan Harnad
Behavioral and Brain Sciences,
20 Nassau Street,
Princeton, NJ 08540
harnad@cogsci.soton.ac.uk
http://cogsci.soton.ac.uk/harnad


In most cases of what we are inclined to call "lying with statistics," it is not the statistics that "lie," but the way in which they are described and interpreted. If an advertisement claimed that "Of 100 people surveyed, three times as many of those expressing a preference preferred product A over product B," the statement would be a statistically truthful although highly misleading description of the outcome of a study in which three people preferred A, one preferred B, and 96 rated both products equal. In such cases, it is safe to assume that there is an intention to mislead. The motive is obvious. But in the case of the 1981 peer review study reported in Science by Jonathan R. Cole, Stephen Cole, and Gary A. Simon, there is certainly no intention to mislead, and the motivation is clearly honorable throughout. If there are nevertheless misleading features in the authors' description and interpretation of their otherwise very important and useful findings (as I will attempt to show below), these may in part issue from a disposition to describe a cup as half empty rather than half full, and perhaps also from an element of hyperbole (about reviewer unreliability) in the face of null findings on the more dramatic question addressed in their study (reviewer bias).
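
For concreteness, the arithmetic behind this hypothetical advertisement can be laid out in a few lines of code (a minimal sketch; the figures are simply those stipulated above):

  # Hypothetical survey figures stipulated in the text above.
  prefers_a, prefers_b, rate_equal = 3, 1, 96

  surveyed = prefers_a + prefers_b + rate_equal      # 100 people surveyed
  expressing = prefers_a + prefers_b                 # only 4 expressed a preference

  print(f"Of {surveyed} surveyed, {expressing} expressed a preference.")
  print(f"Among them, A was preferred {prefers_a // prefers_b} times as often as B.")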

Hyperbole is defined as "exaggeration not intended to deceive but to add force to a statement." In statistics, however, and for such delicate topics as scientific value judgments, hyperbole distorts rather than adds force - especially at a time when everyone seems to be looking for any plausible-sounding excuse to cut expenses, and when government and public are more seriously ill-informed and ill-disposed than ever concerning the normal, everyday activities of science (as opposed to its occasional sensational moments, inevitable but infrequent as they are). Unfortunately, some hyperbole is involved when Cole and his colleagues speak about "reversals" of funding decisions, about "chance," about the "luck of the draw," and about "consensus" in the peer review "lottery."

I will attempt to show that the findings of Cole et al. are considerably less dramatic than they appear; that they do not reflect badly on "the peer review system" (for which no one has yet proposed a coherent alternative in any case) but only underscore the need to be always looking to optimize science's self-monitoring, quality-control functions; and that Cole et al.'s analysis in terms of "funding decision reversals" (inferred from rank order changes) is based on an overinterpretation of results already contained in their simpler prior findings (National Science Foundation within-proposal rating variances). Because the Science report is incomplete on a number of crucial points, I had to make various assumptions in this critique, some of which may turn out to have been wrong; any reader of the Science report would be in the same uncertain position, however, and would likewise be forced to make educated guesses at some junctures. Wherever possible, an attempt has been made to secure the missing information from their 1978 report2 or from their later one,3 which also appeared in 1981.

In the Science study, Cole, Cole, and Simon took 25 funded and 25 unfunded U.S. National Science Foundation (NSF) grant proposals in each of three program areas (chemical dynamics, economics, and solid-state physics) and replicated NSF's mail review procedures under the auspices of the National Academy of Sciences' Committee on Science and Public Policy (COSPUP). They found that, although the COSPUP reviews did not appear to show any systematic "bias" effect relative to the NSF reviews,4 they did "reverse" the NSF's funding decision in 25% of the cases, from which, Cole et al. say, "we may conclude that the fate of a particular proposal is roughly half determined by the characteristics of the proposal and the principal investigator, and about half by apparently random elements which may be characterized as the 'luck of the reviewer draw'" (p. 885).

Sampling Problems

What were some of the bases for the study's findings and conclusions? First, the total sample of 150 proposals is described as having been provided in 1977 by the NSF from a set "upon which decisions had been made recently" (p. 881). The question of the sampling procedure immediately arises, and one is at first led to conjecture (incorrectly) that these proposals may have been a subset of the approximately 1,200 proposals on which the earlier (1978) study had been based; but those were themselves reported as having been a "stratified sample" out of the total number of proposals considered in 1975 (for which our best guess is 1,730 proposals in all).5 In any case, no matter how and when the sample was derived,6 a sampling problem appears to exist.

First, consider the factor of "stratification," i.e., selecting exactly half funded proposals and half unfunded ones. Cole et al. elsewhere describe their 1975 sample as "systematically random,"7 which (besides sounding a bit like a contradiction in terms) is not sufficiently informative. If we use their 1978 figures,8 the three programs analyzed in the Science article, chemical dynamics, economics, and solid-state physics, had 142, 196, and 172 new proposals (respectively) in 1975, of which 36%, 38%, and 41% (respectively) were funded.9 Assuming comparable figures for the actual source of Cole et al.'s sample, how were 25 funded and 25 unfunded proposals selected in each program? Figure 1 suggests that no sampling scheme, with or without range restriction, would be fully satisfactory to ensure that a symmetric stratified subsample will be representative of the total asymmetric population under these empirical conditions.10

A more serious problem, however, is the fact that the program directors and other NSF officers certainly are not dealing with symmetric stratified samples when they make their funding decisions; so how can one validly infer what "decisions" they would have made on the basis of the COSPUP reviews, given that the sample proposals were not embedded in the full (asymmetric) range of variation from which they had been drawn? This question already begins to suggest that an interpretation in terms of 25% "reversals" of NSF funding decisions by COSPUP reviewers may not be warranted under these sampling conditions.
 

Figure 1. Possible sampling schemes for deriving a 50%/50% stratification from a larger asymmetric (i.e., <50%/>50%) population. Solid lines represent the total population from which the sample (25 funded/25 unfunded) is derived. X's represent the range of the stratified subsample. Horizontal break represents hypothetical cutoff point for funding decision (but see critique in text). (a) Separate random samples from funded and unfunded proposals (unequal ranges, nonrepresentative variability). (b) Separate samples, symmetric range restriction (nonrepresentative ranges and variability). (c) Subsample of 25 immediately above and below cutoff point (nonrepresentative ranges and variability).
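
To make the sampling worry concrete, the following minimal simulation sketch applies scheme (a) of Figure 1 to a purely hypothetical pool loosely patterned on the 1975 program sizes cited above (roughly 170 proposals, about 38% funded, mean ratings on the 10-50 scale); none of these numbers are Cole et al.'s data. The point is simply that a forced 25/25 split cannot reproduce the funded/unfunded mix, and hence the decision context, of the asymmetric pool a program director actually faces.

  import random
  import statistics

  random.seed(1)

  # Purely hypothetical pool (NOT Cole et al.'s data): ~170 proposals,
  # ~38% funded, mean ratings on the 10-50 scale.
  funded = [random.gauss(38, 5) for _ in range(65)]
  unfunded = [random.gauss(29, 6) for _ in range(105)]
  population = funded + unfunded

  # Scheme (a) of Figure 1: separate random samples of 25 from each stratum.
  subsample = random.sample(funded, 25) + random.sample(unfunded, 25)

  # The forced 25/25 split necessarily misstates the funded share (50% vs. ~38%),
  # and the subsample's rating distribution need not match the full pool's.
  for name, ratings, n_funded in (("population", population, len(funded)),
                                  ("subsample", subsample, 25)):
      print(f"{name:10s}: {100 * n_funded / len(ratings):3.0f}% funded, "
            f"mean rating {statistics.mean(ratings):.1f}, "
            f"sd {statistics.stdev(ratings):.1f}")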

Rating/Funding Discrepancy

To examine the basis of the funding decisions themselves, let us first look at what the Cole et al. article says:
  There is a high correlation between reviewer ratings and grants made. If one attaches numerical values to the ratings, say from 10 for poor to 50 for excellent, the mean scores predict with a high degree of accuracy which proposals will be funded and which will be denied. Whether or not NSF program directors actually compute statistical averages from the ratings and use them in decision-making, the statistical average of the ratings turned out to be highly correlated with the actual decision rules employed by the program directors.11
If we truly have here a case of averages that predict funding "with a high degree of accuracy" without actually being computed (and an entire section of the 1978 report, pp. 86-108, indicates that they are not computed12), then either (a) program directors are superb intuitive statisticians or (b) inter-reviewer agreement must be very high, with minimal rating overlap between proposals. Because all of Cole et al.'s data indicate that the latter, (b), at least, is decidedly not the case, there is already reason to doubt the premise that average ratings are such accurate predictors of funding.

The rating/funding correlation coefficients are embedded in a table13 that raises two more problems: (a) some NSF programs, including economics, supplement mail ratings with panel evaluations, which were not replicated in the COSPUP review;14 and (b) some correlations were based on probit analysis and some on regression analysis, with the coefficients (the chief parameters of interest here) differing markedly in magnitude.15 The square of the correlation (which indicates the percentage of the variance [PV] in one variable that can be predicted from the other variable) will be used throughout this critique, with indications as to whether it is derived from probit or regression analysis where necessary.

Across all ten of the 1975 NSF programs sampled,16 the probit PVs ranged from 44% to 92% for mail reviews. Eliminating the five programs that supplement mail reviews with panels (including economics) leaves five programs, with PVs ranging from 57% to 92%. The specific PVs for the two nonpanel programs (of the three reported in the Science article) for which the comparisons have any meaning were 92% for chemical dynamics and 70% for solid-state physics.17 Now, 92% is an impressive figure, but it occurs only twice. The rest of the PVs in Table 35 are below 80%, and the only other interpretable one for the purposes of the Science study is 70%. To state it in Cole et al.'s language, 8% of the (probit) variation in chemical dynamics funding and 30% in solid-state physics cannot be accounted for by average ratings. (The uncertainty is still higher if one uses regression analysis.18) And all of these are 1975 figures. Their yearly variation would have to be known to make any generalizations.19
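
As a reminder of the conversion used throughout this critique (a trivial sketch, with one arbitrary illustrative coefficient): the PV is simply the square of the correlation, and the unaccounted-for share is its complement.

  def pv_from_r(r):
      """Percentage of variance (PV) in one variable predictable from the other."""
      return 100 * r ** 2

  # Illustrative conversion: a correlation of .66 corresponds to a PV of about 44%.
  print(f"r = 0.66 -> PV = {pv_from_r(0.66):.0f}%")

  # The unexplained share is just the complement; e.g., the probit PVs of 92% and
  # 70% cited above leave 8% and 30% of the variation unaccounted for.
  for pv in (92, 70):
      print(f"PV = {pv}% -> unaccounted for: {100 - pv}%")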

Comments and Causation

Funding decisions are not informed solely by ratings. Cole et al. state that their interest is in "whether the system as currently employed is ... a rational one" (p. 881). It would seem rational to pay some attention to the comments each reviewer provides with his ratings, if there is any point in soliciting comments at all; and indeed program directors repeatedly emphasize that they rely heavily on such comments.20 Nevertheless, Cole et al. insist that reviewer comments are not "the crucial variable" in determining the funding decision,21 arguing that the seeming contradiction can be explained: (a) Program directors may be exaggerating the importance of the comments in order to "enhance the importance of their [own] role"; (b) they may be focusing disproportionately on the rarer borderline cases where the comments do play a role; and (c) there is a "high correlation"22 between the ratings and the comments. Finally, Cole et al. do acknowledge that (d) because the rating/funding correlation is "not perfect," dissociation may also be responsible for some of the variance. (It was my impression in reading the three documents cited in notes 1, 2, and 3 that in certain problematic cases such as the latter one, the most reasonable explanations would indeed be touched upon, but then they would be either minimized or ignored.23)

We will have to assume that there was indeed a high correlation between the ratings and (some suitable quantification of the content of) the comments (one would hope that there would be), for the actual correlation coefficients do not appear to have been provided anywhere. The authors only report some qualitative samples together with their judgment that "there is a high degree of correspondence."24

One can perhaps surmise what led Cole et al. to minimize the role of the comments. On the one hand, they failed to find much of a correlation between funding decisions and such predictors as the applicant's seniority, institutional prestige, funding history, publications and citations; and they judged that much of the content of the comments reflected these apparently nonpredictive variables. On the other hand, they were evidently very influenced by the fact that ratings were "so highly associated with decisions."25 There may be some circularity in this reasoning, however, and perhaps some neglect of the usual correlation/causation cautions. 26

Can One Really Speak of Decision "Reversals" Under These Conditions?

Given that PVs ranging as low as 44% (probit) and 30% (regression) are described as being "high," how do Cole et al. characterize the observed correlations between the mean NSF and COSPUP ratings for the three programs tested? The corresponding (regression) correlation coefficients of .60, .66, and .62 (respectively) yield PVs of 36%, 44%, and 38%. (Ironically, the best of these is the incommensurable figure for economics proposals.) These correlations are described as "moderately high," with the "match" being "less than perfect" (p. 882). To be fair, the NSF funding decisions are likewise described as "highly, though not perfectly, correlated with ratings" (p. 882). In any case, this moderately high correlation is what is then subjected to the following transformation and interpretation. First, the rank order of the 50 mean NSF ratings is plotted against the rank order of the mean COSPUP ratings (yielding the scatter-plots of what look like perfectly respectable positive rank correlations; see Figure 1 on p. 883 of the Science article); and then every case that falls in the top half in one ranking and the bottom half in the other is designated a "reversal" of the NSF funding decision by COSPUP. It should be evident, however, that the scatter-plots and the reversal counts add no further information to the observed empirical value of the NSF/COSPUP correlation coefficients themselves. They are simply a way of dramatizing them; moreover, they become the source of the authors' "lottery" image. My critique of the sampling procedures and of the independent predictiveness of mean ratings should cast some doubt on the appropriateness of characterizing the 56%-64% free variance here as in any way reflecting "reversals."27
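
That the reversal counts merely repackage the correlations can also be illustrated by a Monte Carlo sketch: if two sets of mean ratings for 50 proposals are simply drawn as bivariate-normal variables correlated at roughly the observed level (r of about .6), top-half/bottom-half crossovers emerge automatically, with no funding "decisions" being reversed at all. The simulation below is illustrative only and uses none of Cole et al.'s data; in such runs the crossover rate typically comes out in the neighborhood of a quarter to a third, which is essentially the magnitude reported.

  import random

  random.seed(0)

  def expected_crossover_rate(r, n_proposals=50, n_trials=1000):
      """Average fraction of proposals landing in the top half of one ranking and
      the bottom half of the other, when the two sets of mean ratings are
      bivariate normal with correlation r."""
      switched = 0
      for _ in range(n_trials):
          x = [random.gauss(0, 1) for _ in range(n_proposals)]
          y = [r * xi + (1 - r ** 2) ** 0.5 * random.gauss(0, 1) for xi in x]
          half = n_proposals // 2
          top_x = set(sorted(range(n_proposals), key=lambda i: -x[i])[:half])
          top_y = set(sorted(range(n_proposals), key=lambda i: -y[i])[:half])
          switched += len(top_x ^ top_y)   # proposals that change halves
      return switched / (n_trials * n_proposals)

  # Correlations of the order reported for the NSF/COSPUP mean ratings.
  for r in (0.60, 0.66, 0.62):
      print(f"r = {r:.2f}: expected half-to-half crossover rate "
            f"~{expected_crossover_rate(r):.0%}")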

Reviewer Sample Size and Variability

What, then, aside from the confounding effects of stratification and the unmeasured effect of the reviewers' comments and of the program directors' judgments,28 might underlie this free variance? Cole et al. never specify precisely how much the number of mail reviewers varied,29 but from Table 4 in the Science article (p. 884), one can infer that both NSF and COSPUP averaged close to four (3.26-3.84) reviewers per proposal.30 On p. 884 of the Science report, Cole et al. provide the within-proposal individual reviewer variance estimates (e.g., NSF 48.93 and COSPUP 50.25 in solid-state physics). They even go on to add that "[o]f course, the average of several reviewers will have lower variance" (e.g., 12.23 and 12.56, respectively), "but," they go on to say, "these are still not tiny compared to the [between-]proposal variance [emphasis added]" (which was 243 in this case). On the next page (p. 885), they mention that "[t]he element of chance would, of course, be reduced by increasing the number of reviewers;" but they then proceed, without comment or conclusions, to a non sequitur discussion31 of the possibly "artificial" consensus arising from the panel review system used by NIH (National Institutes of Health), and how this system is in any case not comparable with the NSF system. But half the programs studied at NSF, including one of those analyzed in the Science article, supplement mail reviews with panel ratings.
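
Incidentally, the mean-rating variances quoted in this passage are just what the familiar variance-of-a-mean relation, Var(mean) = Var(individual)/n, yields for about four reviewers per proposal; the short check below simply recomputes the solid-state physics figures cited above (a consistency check, not a re-analysis).

  # Var(mean of n independent reviewers) = Var(individual reviewer) / n.
  n_reviewers = 4  # both NSF and COSPUP averaged close to four reviewers per proposal

  for agency, individual_var in (("NSF", 48.93), ("COSPUP", 50.25)):
      print(f"{agency}: individual variance {individual_var:.2f} -> "
            f"variance of a {n_reviewers}-reviewer mean {individual_var / n_reviewers:.2f}")

The results, 12.23 and 12.56, are exactly the mean-rating variances quoted above, consistent with samples of roughly four reviewers.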

Not only can mean rating variance be reduced (i.e., its reliability can be increased) by increasing reviewer sample size,32,33 but the correlation between the mean ratings of two independent samples of reviewers from the same population, rating the same proposals, can thereby likewise be commensurately raised. In other words, the correlation between mean NSF and COSPUP ratings (given that the demonstrated absence of selection bias34 indicates that the reviewers were drawn from the same population) can be improved considerably by simply using more reviewers.

The relation between individual reviewer variance and the NSF/COSPUP mean rating correlation is in a sense statistically tautological (given that there was no systematic bias). It only indicates, given the observed population variance for samples of size 4, what the covariance would be between the means of pairs of samples of size 4 (from the same population, rating the same proposals). Cole et al.'s "reversal" terminology is hence even more difficult to justify, since all rank cross-overs (from the top half to the bottom half) were already contained not only in the NSF/COSPUP correlations but also in the NSF individual reviewer variances themselves.35
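
The tautology can be written out. Under the idealized assumption that independent reviewer "error" is the only source of disagreement (no systematic bias, no drift between review rounds), the expected correlation between the mean ratings of two independent panels of n reviewers is between-proposal variance / (between-proposal variance + within-proposal variance / n), which rises mechanically as n grows; this is the sense in which simply using more reviewers would raise the NSF/COSPUP correlation. The sketch below uses purely hypothetical variance components to show the shape of that dependence; it is not a re-analysis of Cole et al.'s figures.

  def panel_mean_correlation(between_var, within_var, n_reviewers):
      """Expected correlation between the mean ratings of two independent panels
      of n reviewers rating the same proposals, under the idealized assumption
      that independent reviewer 'error' is the only source of disagreement."""
      return between_var / (between_var + within_var / n_reviewers)

  # Purely hypothetical variance components (NOT Cole et al.'s figures).
  between_var, within_var = 30.0, 50.0

  for n in (1, 2, 4, 8, 16):
      rho = panel_mean_correlation(between_var, within_var, n)
      print(f"{n:2d} reviewers per panel: expected mean-rating correlation {rho:.2f}")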

Peer Disagreement: Random or Rational?

Let us examine one last instance of hyperbole, namely Cole et al.'s use in the Science article of the term "consensus" (and its apparent opposite, "dissensus," a word that does not exist and that seems to serve only to call to mind its cognate "dissension"). Why should one desire or expect to get consensus - which means unanimity, 100% agreement - in scientific judgments at all? A case can be made for the vital role of "creative disagreement" in science.36

Cole et al. seem to have an ambivalent attitude to this question. In the last paragraph of their Science article (p. 885), they suggest that some disagreement might be a healthy thing, but they then go on to the question of "randomness" as if it were a synonym for disagreement.37

It is my conviction that a certain degree of disagreement is not only a healthy but an informative and even essential aspect of scientific activity38 and that it should be clearly represented as such, rather than as some capricious and arbitrary shortcoming of scientific judgment, responsible only for unfairness and waste. (Cole et al. certainly do not imply the latter, by the way, but their emphasis on chance, uncertainty, and randomness leaves the way open for their readers to do just that.) Without question, every effort should be made to strengthen peer review, not only in terms of its reliability (which, as has been suggested here, is only partly synonymous with low reviewer variance) but also in terms of its validity (which may well be improved by rational and creative disagreement). We should continue to do studies such as that undertaken by Cole, Cole, and Simon, not only on scientific funding, but also on the refereeing system in scientific publication.39 One can only applaud Cole et al.'s valuable empirical contributions to research on peer review; but one must also wish that they had been more circumspect in interpreting their findings.


Acknowledgment-Many thanks to Don Rubin for wise counsel on this paper, not all sapiently followed by the author. This analysis was supported in part by National Institutes of Health Grant LM 03539 from the National Library of Medicine.

Notes

1. Stephen Cole, Jonathan R. Cole, and Gary A. Simon, "Chance and Consensus in Peer Review," Science, Volume 214 (20 November 1981): 881.

2. Stephen Cole, Leonard Rubin, and Jonathan R. Cole, Peer Review in the National Science Foundation: Phase I of a Study (Washington, DC: National Academy of Sciences, 1978).

3. Jonathan R. Cole and Stephen Cole, Peer Review in the National Science Foundation: Phase II of a Study (Washington, DC: National Academy of Sciences, 1981).

4. The COSPUP reviewers' ratings did average 2.53 points (or about 7%) lower than the NSF ratings. The statistical significance of this main effect was not tested in Cole et al.'s analysis, but from the absence of an interaction with proposal variance (Cole, Cole, and Simon, op. cit., Table 3, p. 884), they conclude - rightly, I think - that there was no systematic bias in NSF versus COSPUP reviewer selection methods. Cole et al. acknowledge that the lower COSPUP ratings may have been due to the fact that these reviewers knew that their ratings were nonbinding. Considering the potential magnitude of such a crucial subjective difference between two groups using "exactly the same evaluational criteria" (and the possibility of a timeliness differential arising from the interval that elapsed between the two sets of reviews; see note 6 below), it is rather remarkable that the NSF and COSPUP results were actually as similar as they were.

5. The number 1,730 is achieved by adding together the number of "new grants" awarded and declined in Table B-2 (Cole, Rubin, and Cole, op. cit., p. 176). Although it is certainly possible to treat new grants independently in this way, it is not clear whether NSF program directors actually do so in their evaluations and rankings. If they do not, then the respective total populations from which the original 1,200 proposals and the present 150 are derived would be still larger, and a further confounding factor would be introduced into the Science study. But even if all these estimates were wrong, there would remain the basic sampling problem of how to make a stratified half/half split representative of the larger (and probably asymmetric, i.e., <half/>half) population from which it was derived (Figure 1).

6. We are informed in the 1981 report (p. 6) that they were NSF's "latest" 25 funded and declined proposals in each of the three programs in 1976. Timing-related questions still remain: Is there variation through the granting season, from early submissions to last-minute ones, and from early to late (i.e., perhaps lean) disbursements?

7. Cole, Rubin, and Cole, op. cit., p. 172.

8. Ibid., Table B-2, p. 176.

9. In the 1981 report (p. 6), Cole and Cole state (without data) that "[t]he actual funding percentage of NSF is close to 50." They further state that the "distribution of mean peer-review scores for the 50 proposals in each program closely approximated the distribution obtained for the same program in Phase 1" (i.e., in Cole, Rubin, and Cole, op. cit.).

10. Cole, Rubin, and Cole (op. cit., p. 173) discuss this as a "weighting" problem, attempting to show - using the NSF algebra program, an unspecified weighting scheme, and every predictor variable except ratings - that stratification makes no difference (Table B-1).

11. Cole, Cole, and Simon, op. cit., p. 881, emphasis added.

12. See note 26 below.

13. Cole, Rubin, and Cole, op. cit., Table 35, p. 111, which is misidentified in the introductions to both the 1978 (p. viii) and 1981 (p. 1) NAS reports.

14. This artifact is acknowledged in the 1981 report, where it is stated that it "limits our design in Economics" (p. 11).

15. Table B-4 in the 1978 report (p. 185) compares the (squared) probit and regression values. The probit correlations - based on untested and sometimes knowingly violated normality assumptions (p. 110) - are consistently inflated relative to the regression correlations (p. 186). Probits predict a hypothetical continuous variable ("true merit"? the program director's confidence?) rather than the funding decisions themselves, as the regression coefficients do. Hence the fairest comparison for the COSPUP/NSF rating correlations (and associated variance predictiveness) - which are regression values - may really be with the regression estimates of the funding/rating correlations rather than the probits. In that case, the variance estimates differ only by about 10 percent (compare Table B-4, p. 185, in the 1978 report with Table 1, p. 882, of the 1981 Science article).

16. Cole, Rubin, and Cole, op. cit., Table 35, p. 111.

17. All these 1975 PVs were based on probit analysis. The corresponding regression PVs ranged, for all 10 programs, from 30% to 54%; chemical dynamics and solid-state physics were 54% and 47%, respectively. From the data provided in Cole and Cole (op. cit., pp. 71-72 and 77-78), the corresponding regression PVs can also be calculated within the 1976 samples of 50 proposals. There is clearly considerable variability here, between programs, between years, between samples, and between probit and regression estimates. (See notes 28, 30, and 32, below.)

18. See the comments in notes 15 and 17, above.

19. See the comments in notes 30 and 32, below.

20. Cole, Rubin, and Cole, op. cit., Section 4, pp. 86-108.

21. Ibid., p. 112.

22. Ibid., p. 87 and passim.

23. See notes 33 and 37, below.

24. Cole, Rubin, and Cole, op. cit., p. 112 and passim.

25. Ibid., p. 141.

26. The extent to which this aspect of their interpretation seems to have assumed the status of a self-fulfilling prophecy for the authors is illustrated by some of the corrective admonitions they offer to program directors on the basis of their findings (Cole and Cole, op. cit., pp. 60-61): "Program directors cannot and ought not be completely bound by the numerical averages of the peer ratings...;" "it is somewhat disconcerting to note that the outcome in the great majority of cases nonetheless remains closely correlated with the numerical average of peer ratings...;" "in using peer review, program directors would be wise by and large to direct greater attention to the numerous qualitative considerations raised by the reviewers in their written comments, moving away from reliance on single numerical ratings." Yet this appears to be preaching to the converted, as the program directors interviewed in Section 4 of the 1978 NAS report (pp. 86-108) evidently tried in vain to indicate.

27. There are hints that Cole et al., too, have some doubts about their "reversal" interpretations. On pp. 27-28 of the 1981 report, they indicate that the real question their study may be posing is "How much would funding decisions change if they were wholly determined by COSPUP ranking?" (emphasis added). Later, they add "Since the NSF funding decision did not follow NSF ratings exactly, the findings reported below should be treated cautiously" (emphasis added); but then, inexplicably, they continue "and as indicative of levels of reversals that might be obtained were a set of independently selected reviewers used" (emphasis added), which seems to say: they were not reversals, but let us treat them as if they were.

Other indications that even the funded/unfunded dichotomy may not be appropriate and may mask an underlying continuum appear in the 1978 report (pp. 178, 181) and the 1981 report (p. 28), for example, with respect to the unexamined variable of percent funding (the granting of a lower amount than requested) (see my remarks in note 15, above).

28. It seems clear that it is the reliability (and validity) of the program directors' decisions, rather than the reviewers' ratings, that is really at issue in investigations of the peer review system. There are in fact two levels of variability here (reviewer and director), and the empirical evidence on what would optimize the system as a whole - i.e., "minimize random elements and maximize the influence of both the quality of the proposal and the ability of the principal investigator to perform the research" (Cole, Cole, and Simon, op. cit., p. 881) - may well turn out to indicate that, up to a point (see note 32, below), reviewer "unreliability" (i.e., variance) is informative, and hence strengthens the validity of the superordinate decision on which the various reviews converge. This is an empirical question, however, and Cole et al. can only offer conjectures on the matter - and ambivalent ones at that (see comments in notes 31 and 37, below). Analogous questions arise concerning editorial decisions in the journal review system (see note 39, below).

29. Cole, Rubin, and Cole, op. cit., p. 81; Cole and Cole, op. cit., p. 8.

30. The number of reviewers was apparently sometimes as low as two, or even one (Cole and Cole, op. cit., pp. 28-30). In view of the high individual rating variance, it is likely that the observed mean rating correlations (NSF/COSPUP) as well as the rating/funding correlations were sensitive to sample size. It is noteworthy that the 50 solid-state physics proposals, with their lower NSF rating/funding correlation, also had many more samples of size two or one. It may also be of some significance that reviewer sample size seems to have considerably more effect on actual decisions (i.e., rating/funding correlations) than on hypothetical "reversals" (i.e., NSF/COSPUP correlations).

Although the optimal number of reviewers per proposal is an empirical matter (see note 32, below), it seems worth pointing out that both the NSF program directors and Cole et al. (in their elaborate estimates of the size of the total potential "pool of eligible reviewers" for any given proposal; Cole, Cole, and Simon, op. cit., pp. 881-882 and 886; Cole and Cole, op. cit., pp. 19 and 66-67) are relying exclusively on word-of-mouth sources. Some individuals are now advocating, and experimenting with, the use of modern computer-aided bibliographic retrieval sources - searching current databases on the basis of keywords and citations - in order to increase the size (and perhaps the objectivity) of the word-of-mouth sample considerably (see the symposium cited in note 39, below).

31. This curious "give-and-take" style - giving a plausible interpretation and later taking it away again in favor of a less plausible one - occurs throughout the Cole et al. reports; e.g., on the ranking/funding discrepancy (Cole, Rubin, and Cole, op. cit., p. 141); on interpreting "reversals" (Cole and Cole, op. cit., pp. 27-28); on increasing the number of reviewers (Cole, Cole, and Simon, op. cit., p. 884); on the role of the panel (Cole, Rubin, and Cole, p. 141); on the role of reviewer comments (ibid., p. 86); on chance versus disagreement (Cole, Cole, and Simon, p. 885). In general, the 1981 report seems more circumspect than the 1978 report, which in turn seems more circumspect than the 1981 Science article.

32. In the 1978 report (p. 81), Cole et al. discuss a highly suggestive confounding factor in the observed relation between individual rating variance and the number of reviewers used, namely, that program directors tended to increase the sample size in cases of excessive disagreement among the initial four reviewers. This artificially produced a negative correlation between variance and sample size. The example is instructive, because it indicates that reviewer variance (i.e., reviewer disagreement) may be a source of information rather than noise to a program director; as supported by reviewers' comments, it is, after all, rational (not random) disagreement. Hence questions about what sample sizes and variances are desirable are empirical matters of optimization rather than a priori matters of "chance" versus "consensus." The comments, suitably weighted, probably serve to stabilize the rating variance. Moreover, since most reviewers can be expected to be motivated to improve proposals (and prove themselves) rather than just to praise them, most reviews should exhibit a negative or critical bias; this may in turn set an upper bound on how large a sample one should elicit, before the criticism converges (and becomes redundant) or diverges (and becomes irrelevant), depending on the particular proposal.

Other empirical questions about optimizing the peer review system could be directed at the role (i) of panels, (ii) of relative judgments on multiple proposals by the same reviewer, (iii) of formal criterion checklists (Cole, Cole, and Simon, op. cit., p. 885; Cole and Cole, op. cit., pp. 61-63; and see note below), and especially, (iv) of the background knowledge and the selection, evaluation, and weighting strategies of program directors.

33. C. Cicchetti, "We Have Seen the Enemy and He is Us," The Behavioral and Brain Sciences, Volume 5 (1982).

34. See note 4, above.

35. The "reversal" interpretation of rank-order changes is already considerably vitiated by (i) the high likelihood of variability around the midpoint (as Cole, Cole, and Simon acknowledge in the Science article, p. 882, and especially when one excludes the non-comparable economics data in Table 2, p. 883), and by (ii) the partial dissociation between NSF mean ratings and decisions (as implicitly acknowledged by the separate rank and decision analyses in Table 2). But now we see that (iii) using just the original individual rating variances (plus the proposal variances), one could in principle have predicted all the quintile reversal data in Table 2 a priori. A maximal quintile reversal need involve a rank change of only 26 - 10 = 16 (the standard deviation for 50 ranks is about 14.6). But one could even go on to calculate - if one were so inclined - the probability of a reversal from the very first rank to the fiftieth...
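
(For the record, the "about 14.6" figure is simply the sample standard deviation of the ranks 1 through 50, as the two-line check below confirms; nothing beyond that arithmetic is being claimed.)

  import statistics

  ranks = range(1, 51)
  print(f"sample sd of the ranks 1-50: {statistics.stdev(ranks):.2f}")    # ~14.58
  print(f"rank change cited for a maximal quintile reversal: {26 - 10}")  # 16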

36. "Do scientists agree? It is not only unrealistic to suppose that they do, but probably just as unrealistic to think that they ought to. Agreement is for what is already established scientific history. The current and vital ongoing aspect of science consists of an active and often heated interaction of data, ideas and minds, in a process one might call 'creative disagreement'. The 'scientific method' is largely a reconstruction based on selective hindsight. What actually goes on has much less the flavor of a systematic method than of trial and error, conjecture, chance [sic], competition and even dialectic.

"What makes this enterprise science is not any 'method,' but the fact that the entire process is at all times answerable to three fundamental constraints. The first of these is the most general one, namely, logical consistency: Science must not be self-contradictory. The second is testability: Hypotheses must be confirmable or falsiflable by experiment. The last constraint is that science must be a public, self-corrective process; not public in the sense of the general public-the role of popular opinion in science is an ethical rather than a scientific concern-but public in the sense of one's peers, one's fellow-scientists.

"Peer interaction, in the form of repeating and building upon one another's experiments, testing and elaborating one another's theories, and evaluating and criticizing one another's research, is the real medium for the sell-corrective aspect of science. It is in fact this medium that helps enforce science's other two constraints (consistency and testability) through peer review of both publication and funding of scientific research." Stevan Harnad, The Sciences, Volume 19, (1979): 18. (Note that even "chance" need not be a pejorative in science.)

37. More ambivalence on this topic can be found in Cole and Cole, op. cit., pp. 41-43, 55, and 57.

38. See the comments in notes 28 and 32, above.

39. To demonstrate the dynamics and informativeness of peer interaction in science, the journal The Behavioral and Brain Sciences, a project devoted to externalizing creative disagreement in the biobehavioral sciences, has produced a special self-reflective symposium, Peer Commentary on Peer Review: A Case Study in Scientific Quality Control, Stevan Harnad, ed. (New York: Cambridge University Press, 1983); reprinted from The Behavioral and Brain Sciences, Volume 5 (1982). In that issue, over 60 editors, grant officers, bibliometricians, sociologists of science, and investigators, critics, advocates, and reformers of peer review analyze, criticize, and elaborate a controversial study by Donald Peters and Steven Ceci: This study found that when 12 articles were resubmitted to the (psychology) journals that had already published them (with only title, author, institution, and a few cosmetic portions of text changed), only three were detected as having already been published, and eight were rejected on methodological grounds, with high inter-referee agreement. This study (which was not without methodological weaknesses of its own) served to provoke an unusually rich panorama of critical perspectives in and on peer review.

Unless otherwise noted, all page references within the text are to the 1981 Science article. See note 1.