Author's Response

The null-hypothesis significance-test procedure is still warranted

Siu L. Chow

Department of Psychology, University of Regina, Regina, Saskatchewan, Canada S4S 0A2.

Abstract: Entertaining diverse assumptions about empirical research, commentators give a wide range of verdicts on the NHSTP defence in Statistical Significance. The null-hypothesis significance-test procedure (NHSTP) is defended in a framework in which deductive and inductive rules are deployed in theory corroboration in the spirit of Popper's Conjectures and Refutations (1968b). The defensible hypothetico-deductive structure of the framework is used to make explicit the distinctions between (1) substantive and statistical hypotheses, (2) statistical alternative and conceptual alternative hypotheses, and (3) making statistical decisions and drawing theoretical conclusions. These distinctions make it easier to show that (1) H0 can be true, (2) the effect size is irrelevant to theory corroboration, and (3) "strong" hypotheses make no difference to NHSTP. Reservations about statistical power, meta-analysis, and the Bayesian approach are still warranted.

R1. Introduction

For ease of exposition, "NHSTP defence" is used to refer to the defence of the null-hypothesis significance-test procedure (NHSTP) presented in Statistical Significance. "Synopsis" refers to the synopsis of the NHSTP defence, and "rejoinder" refers to the present response to the commentaries. There are four main reasons for the wide range of verdicts on the NHSTP defence. First, commentators entertain disparate ideas about various aspects of empirical research. Second, logistic concerns of empirical research sometimes make it impossible to observe the nuances important to philosophers, logicians, or statisticians (Kraemer, Mayo). Such departures may be justified when no logical or mathematical rule is broken.

The third reason is the use of the collective terms "power analysts," "meta-analysis," and "Bayesian" to discuss critically particular versions of these techniques in order not to sound personal and to indicate that the criticisms are about the ideas, not their proponents. The fourth reason is the need to distinguish between a technique and the assumptions about empirical research held by experts who use it. If the NHSTP defence were a critique of Bayesian statistics, meta-analysis and power analysis at the technical level, it would be necessary to direct the criticisms at their more recent, mature versions. The critique, however, is about what power analysts, meta-analysts, and Bayesians think about empirical research.

It is argued in the NHSTP defence that the validity of the theory-corroborative experiment is assessed in terms of conceptual, theoretical, and methodological criteria, not a numerical index, an account accepted by some commentaries in general terms (Boklage, Hayes, Tassinary, Thyer, Vokey). Specifically, it is essential that the substantive explanatory (hence, not necessarily quantitative) hypothesis to be tested be consistent with the phenomenon to be understood in a nontautological way. The experimental hypothesis should be a valid deduction from the substantive hypothesis in the context of the specific experimental task. The exclusion of recognized alternative explanations is made possible when data are collected and analyzed according to the experimental design that satisfies the formal requirements of an inductive rule (e.g., Mill's [1973] method of difference). NHSTP is used to exclude chance influences as an alternative explanation of the data (i.e., to choose between chance and non-chance). Using the NHSTP outcome, the experimenter interprets the data with reference to the implicative relationships among the substantive, research, and experimental hypotheses (i.e., to isolate what the non-chance factor is). What the experimental data mean at the conceptual level is determined by the theoretical foundation of the experiment, not by statistics or any other nonconceptual concerns (e.g., the practical importance of the result).

The issues that will be discussed in the present response are (1) the propriety of using statistics in psychological research, (2) the formal structure of scientific investigation versus the sociology of science, (3) the differences between substantive and statistical hypotheses, (4) the validity of the hypothetico-deductive framework, and (5) some reservations about effect size, statistical power, meta-analysis, and Bayesian statistics.

R2. The historical perspective

The NHSTP defence is a rationalization of cognitive psychologists' modus operandi, namely, conducting theory-corroboration experiments in Popper's (1968a; 1968b) "conjectures and refutations" framework. The outcome of NHSTP is used to start the chain of syllogistic arguments that leads to the theoretical conclusion. The issue is whether or not this practice is wrong from the historical perspective.

R2.1. Neyman and Pearson and the inverse probability.

Reading too much into a statement made by Neyman and Pearson (1928), I suggested that they subscribed to using the inverse probability. This was a mistake (Gregson, Krueger, Kyburg, Mayo, Poitevineau & Lecoutre).

R2.2. Statistics and nomothetic research.

Statistics is used when the concern is about what can be said about a well-defined group of individuals (i.e., nomothetic) rather than of individuals as unique beings (i.e., idiographic). Stam & Pasay as well as Verplanck raise the possibility that the NHSTP defence perpetuates a historical error, namely, the error psychologists made when they were diverted from doing idiographic research as a result of using statistics. However, there are qualities that are found in all members of the group. Statistics is used to describe how these properties are distributed among the members of the group. For example, the mean of the group is sensitive to the magnitude of individual scores. To use the mean to represent the group is to describe how the individuals are distributed along the dimension in question, not to suggest that individual differences are not important. Statistics is used in nomothetic research to ascertain what is true of the group, despite individual differences.

The uniqueness of X is described in terms of how X differs from others in terms of demeanor, dress code, tastes, social skills, and the like. This description is meaningful only when there is information about the demeanor, dress code, tastes, social skills, and the like, common to the group. That is to say, statements in idiographic research are meaningful only against the backdrop of nomothetic research.

There is another reason why using statistics is not incompatible with idiographic research. Suppose that it is necessary to ascertain individual X's memory capacity. Not only is X different from other individuals, there are also intra-individual differences in different situations or on different occasions. It is legitimate to ask what X is really like, despite such intra-individual differences. Statistics (particularly NHSTP) can, and should, be used for such a purpose.

R2.3. The hybrid nature of NHSTP.

The hybrid NHSTP is faithful to neither the Fisherian account nor that of Neyman and Pearson (Gigerenzer 1993). For example, from Neyman and Pearson's approach is adopted the practice of fixing the critical associated probability (p) before data collection and of making a binary decision about H0. The features from Neyman and Pearson render NHSTP rigid and mechanical. From Fisher what is adopted is his appeal to only one numerically nonspecific H1 as a complement of H0. To critics, this Fisherian feature is responsible for discouraging numerically specific hypotheses in psychology when NHSTP is used. Hence, using NHSTP impedes theory development (Gigerenzer) and prevents researchers from engaging in two types of "statistical thinking."

The "rigid" and "mechanical" characterizations of NHSTP are correct. Nonetheless, "strict" is a better characterization than "rigid." The rigidity is necessary for inter-researcher agreement. The issue is really not that the decision is rigid, but whether it is well defined and appropriate. The meaning of the associated probability, p, of the test statistic is well defined in terms of the sampling distribution in question. It is appropriate as an index of the decision maker's strictness.

Three issues seem relevant when one asks why p = .048 is treated differently from p = .052 (Krueger, Vicente). First, the choice of α = .05 is called into question. The answer is simply that the rationale of NHSTP is not affected if any other value is chosen for α. Second, the importance of using α in a strict manner may be seen by considering why it is important for a teacher to maintain a consistent passing grade. The third issue concerns why it is necessary to fix the α level before data collection (Lewandowsky & Maybery). The reason is that all design and procedural decisions about the experiment are made with reference to α as the criterion of strictness used to reject chance. Had a different α value been used, concomitant changes in the design or procedure would have to be made.

It bears reiterating that "mechanical" is not a derogatory term. A mechanical procedure is one that guarantees the same outcome if it is carried out properly. Hence, using NHSTP does not render an experiment a "mindless" exercise. Cognitive psychologists do engage in Gigerenzer's Type I "statistical reasoning," namely, choosing among alternative statistical procedures in an informed way. Nor does using the mechanical NHSTP release the researcher from the need to consider other conceptual, theoretical, and methodological factors.

R3. The Popperian structure

NHSTP is defended by illustrating its important, though very restricted, role in the theory-corroboration experiment in Popper's (1968a, 1968b) "conjectures and refutations" perspective. It is not clear why Shafto denies NHSTP's contribution. The defence can be strengthened by settling the following issues suggested by Erwin, Glück & Vitouch, Gregson, Nester, and Waller & Johnson: (1) the effect of auxiliary assumptions on using modus tollens, (2) the sociology of science, (3) the invalidity of affirming the consequent, and (4) the neglect of theory discovery.

R3.1. The implication of auxiliary assumptions on modus tollens.

Glück & Vitouch refer to Folger's (1989) point that the experimental expectation is the implication of the conjunction of the substantive hypothesis and additional auxiliary assumptions. Hence, modus tollens does not guarantee the rejection of the experimental hypothesis because the experimenter may blame the auxiliary assumptions. Chow's (1989) reply was that a responsible, well-informed, and noncynical experimenter should have good reasons not to blame the auxiliary assumptions in the face of an unexpected result. The reasons are (1) the specificity of the substantive hypothesis, (2) the methodological assumptions commonly held by workers in the same area of research, and (3) the well-established theoretical ideas in cognate areas. If there are good reasons to suspect any of the auxiliary assumptions, the experimenter is obliged to conduct additional experiments to substantiate the suspicion. Using auxiliary assumptions does not mean being cynical or cavalier toward research.

R3.2. The formal structure versus the sociology of science.

The issue of the sociology of science is raised (Gregson) because scientists do not behave in the way envisaged in the Popperian perspective. This sentiment is reinforced by Glück & Vitouch's reference to Lakatos's rendition of Popper's framework. Note that the sociology argument does not distinguish between the formal requirement of a particular ideal of knowledge and the activities or psychology (Blaich) of the professionals in some disciplines. Popper's argument is a defensible account of the former if the ideal consists of the following features:

1. Knowledge evolves by reducing ambiguity.

2. Conclusions must be justified with valid empirical evidence.

3. There are well-defined criteria for settling inter-researcher disagreement.

4. Objectivity is achieved when the critical criteria are independent of theoretical preferences of the disputants.

R3.3. The invalidity of affirming the consequent.

Given a conditional syllogism, it is not possible to draw a definite conclusion about its antecedent when its consequent is affirmed. This is the case when H0 is rejected. The interim solution in the NHSTP defence is that recognized alternative explanations are excluded by virtue of the experimental controls. How does the researcher know what constitutes adequate experimental controls (Erwin)? The researcher is guided by the formal requirement of Mill's canons of induction.
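The logical point about the conditional syllogism can be checked mechanically. The following is a minimal sketch in Python, offered only as an illustration and not as part of the NHSTP defence; the helper names `implies` and `is_valid` are assumptions introduced here. It enumerates the truth table for "if A then C" and compares the two argument forms at issue, denying the consequent (modus tollens) and affirming the consequent.

```python
from itertools import product


def implies(a: bool, c: bool) -> bool:
    """Material conditional: 'if a then c' is false only when a is true and c is false."""
    return (not a) or c


def is_valid(premises, conclusion) -> bool:
    """An argument form is valid if no truth assignment makes every premise
    true while making the conclusion false."""
    for a, c in product([True, False], repeat=2):
        if all(p(a, c) for p in premises) and not conclusion(a, c):
            return False
    return True


# Modus tollens: if A then C; not C; therefore not A.
modus_tollens = is_valid(
    premises=[implies, lambda a, c: not c],
    conclusion=lambda a, c: not a,
)

# Affirming the consequent: if A then C; C; therefore A.
affirming_consequent = is_valid(
    premises=[implies, lambda a, c: c],
    conclusion=lambda a, c: a,
)

print(modus_tollens, affirming_consequent)
```

The counterexample for the second form is the row in which A is false and C is true. In the experimental setting, that row corresponds to data that match the expectation for a reason other than the hypothesis under test, which is why the exclusion of alternative explanations has to come from experimental controls rather than from the syllogism itself.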

Erwin may legitimately push the point further and ask what the researcher should do if a confounding variable is discovered despite having dealt with the "auxiliary assumption" issue (see sect. R3.1). The solution is not just to appeal to another statistical index in the study in question. It is to conduct another experiment designed specifically to examine the confounding variable. The three embedding syllogistic arguments, whose sufficiency as a means to establish "warranted assertibility" is called into question by Erwin, are also involved in the new experiment.

R3.4. The mechanical syllogistic arguments.

Drawing a theoretical conclusion from the data with a series of three embedding arguments is a mechanical exercise (Vicente). However, it does not follow that the hypothetico-deductive framework is antithetical to critical thinking. The experimenter's critical thinking, intuition, and subjective preferences have important roles to play in proposing the substantive hypothesis, devising the experimental task, choosing the appropriate experimental design, and the like. At the same time, it is important that the influences of these subjective factors are neutralized in data interpretation. It is for this reason that experimental controls, as a means of excluding alternative explanations, are necessary. It is also for this reason that the NHSTP defence is based on the theory-corroboration experiment (Waller & Johnson).

R3.5. Converging operations and replication.

The various theoretical properties of a hypothetical structure envisaged in a cognitive theory have to be substantiated. The series of experiments designed for such purposes constitute converging theory-corroboration operations. It is emphasized in the NHSTP defence that these theory-corroboration experiments are not literal replications of the original study because they differ from the original study as well as among themselves. What should also be said is that literal replication has a different role, a point made by Tassinary. Specifically, replication studies are essential for ensuring that the new discovery is not a fluke. NHSTP is used in each of the replication studies in the same way it is used in theory-corroboration studies, namely, to exclude chance influences as an explanation.

R4. The nature of the substantive hypothesis

Substantive hypotheses are speculative explanations proposed to explain phenomena. Research data are collected to substantiate these hypotheses. There arises the distinction between the instigating phenomenon and the evidential data of the hypothesis as well as the following issues: (1) alternative conceptual hypotheses versus a statistical alternative hypothesis, (2) the nature and role of the effect size, (3) the nature of a good explanatory hypothesis, and (4) the numerically nonspecific versus numerically specific statistical hypotheses.

R4.1. Instigating phenomenon versus evidential data.

Phenomenon P cannot provide the substantiating evidence for the hypothesis that explains P itself in a nontautological way. For example, the hypothesis "snake phobia" is proposed to account for the instigating phenomenon of an individual's irrational fear of snakes. [See Davey: "Preparedness and Phobias" BBS 18(2) 1995.] Consequently, this reaction to snakes cannot be used as evidence for the "snake phobia" explanation without rendering the argument circular. It is necessary to distinguish between the phenomenon of which the substantive hypothesis is an explanation (the instigating phenomenon) and the data that are used to substantiate the hypothesis (the evidential data).

R4.2. Alternative conceptual hypotheses versus alternative statistical hypotheses.

While the substantive hypothesis explains the instigating phenomenon, the experimental hypothesis describes what the experimental data should be like. When the experimental hypothesis is expressed in statistical terms, it is the statistical alternative hypothesis, H1. This is very different from Zumbo's recounting of Rozeboom's (1960) view of what H1 is. A successful substantive hypothesis (T) is one that makes the to-be-explained phenomenon understandable. The experimental hypothesis serves as the criterion of rejection of T in the sense that T is deemed untenable if data do not match what is said in the experimental hypothesis. The statistical null hypothesis is used to ascertain whether it is possible to exclude the explanation in terms of chance influences. Dar, however, finds the distinctions among the substantive, research, experimental, and statistical hypotheses unnecessary.

Of interest is the fact that the substantive hypothesis (e.g., the view that the short-term store retains acoustic-verbal-linguistic information) takes the form of delineating the nature of the psychological structures or mechanisms underlying the phenomenon of interest. These are qualitative specifications, not quantitative stipulations. Moreover, the experimental hypothesis is an implication of the hypothetical structure in a particular experimental context. This context is not quantitative (e.g., acoustically similar and acoustically dissimilar materials are used in the experimental and control conditions, respectively). Seen in this light, the insistence that a "strong" hypothesis is one that gives a specific numerical value to a parameter (Vicente, Waller & Johnson, Zumbo) raises several issues.

R4.3. The nature of a good theory.

First, what is the criterion of a "strong" hypothesis apart from the fact that it gives a specific nonzero parameter value? The characterization is not informative if the criterion for being "strong" is not independent of its being numerically specific. Second, a hypothesis that is "strong" in this sense is not necessarily a testable or a more informative hypothesis, as can be seen from Propositions [P1], [P2], [P3], and [P4]:

There will be 1 inch of snow. [P1]

There will be snow on Christmas day. [P2]

There will be 1 inch of snow on Christmas day. [P3]

There will be snow on Christmas day between 2 and 3 p.m. [P4]

Although [P1] is numerically specific, it is not testable because it is not clear under what condition [P1] can be shown to be wrong. [P2] is a testable hypothesis even though it is not numerically specific in the way [P1] is specific. It is true that [P3] is testable and more informative than [P2]. However, the superiority of [P3] over [P2] is a matter of specificity, not of being quantitative, for the following reason. [P4] is similarly superior to [P2] because it is more specific. [P3] is "stronger" than [P4] in the sense envisaged by critics of NHSTP. However, it is not possible to say which of [P3] and [P4] is more informative. What can be said is that they are informative in different senses. The moral of the story is that even if it were possible to ignore the circularity problem (see sect. R4.1), the "strong" characterization is not an appropriate, let alone the only, criterion for theory assessment (Waller & Johnson). A more serious objection to the "strong" characterization is that it is not clear that a "strong" hypothesis is necessarily an explanatory hypothesis, as may be seen from [P5]:

Every six hours of practice will improve performance by 3 points. [P5]

Suppose that [P5] is an empirical generalization of practical importance established after numerous meticulous replications, and it gives rise to a specific numeric parameter (Bookstein). Does it explain why such a functional relationship exists between practice and performance? [P5] itself invites an explanation. To explain functional relationships such as [P5], appeals to hypothetical mechanisms are inevitable. Specific theoretical properties are attributed to these mechanisms. Theory-corroboration experiments are conducted to substantiate these theoretical properties. This approach is different from, as well as superior to, the operationalization suggested by Verplanck. An insistence on using the "strong" criterion or indifference to hypothetical mechanisms (Bookstein) may actually impede theory development if psychologists stop asking the "Why" questions about statements like [P5].

Recognizing that the substantive hypothesis is more than a functional relationship between variables, the researcher would have to consider a number of issues in a light different from that envisaged by critics of NHSTP. For example, it becomes necessary to distinguish between an efficient cause and a material (or formal) cause. Hence, there are important differences among experimental, quasi-experimental, and nonexperimental research (Palm), as well as differences between utilitarian and theory corroboration experiments. Specifically, while the outcome of the experimental manipulation is the phenomenon of interest in the utilitarian experiment, it is not so in the theory-corroboration experiment. Although the efficient cause is the concern of the utilitarian experiment, only the material (or formal) cause is important to the theory corroboration experiment. These considerations change the complexion of the issues related to the nature of the statistical hypotheses, effect size, and statistical power.

R5. More ado about the null hypothesis

It is necessary to belabour the point that H1 is not the substantive hypothesis because it has not been taken up in the commentaries in the discussion of (1) the nature of H0, (2) the testing of a numerically specific difference between test conditions, and (3) how to ascertain that the parameters are the same in two different conditions.

R5.1. The nature of H0.

Some commentators reiterate the criticism that, as a result of using NHSTP, it is easy to support weak hypotheses. This criticism is predicated on the assumption that the null hypothesis is a straw man to be rejected because it is always false (Rindskopf, Swijtink, Waller & Johnson). Critics of NHSTP seem to have in mind a hypothesis that explains or describes a phenomenon that is a complement of the to-be-explained phenomenon. They are satisfied that H0 can never be true when it is shown that such a complementary phenomenon is not possible.

H0 is neither an explanation nor a description of a phenomenon that is complementary to the phenomenon to be explained. Rather, it is derived from the complement of the substantive hypothesis in exactly the same way that H1 is derived from the substantive hypothesis. H0 can be true (and should be true in a properly designed and conducted experiment; see Lewandowsky & Maybery) because it is a prescription of what the data should be like if what is said in the substantive hypothesis is not true and chance influences alone determine the pattern in the data.

Apart from the fact that H0 is not a straw man, it is never used as a categorical proposition. Instead, it appears as the consequent in [P6] and the antecedent in [P7]:

If chance factors alone influence the data, then H0 is true. [P6]

If H0 is true, then the sampling distribution of differences has a mean difference of zero. [P7]

The cogency of the commentaries is unclear when no attempt has been made to deal with [P6] and [P7]. For example, it is neither the case that [P6] is silly (Swijtink) nor that [P7] is inappropriate (Frick). [P6] is a statement about what should follow solely from chance influences in the case of the completely randomized one-factor, two-level experiment.
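What [P6] and [P7] jointly assert for the completely randomized one-factor, two-level experiment can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the normal population with mean 100 and standard deviation 15, the group size, the number of replications, and the function name `simulate_mean_differences` are all arbitrary choices made here, not values taken from the NHSTP defence. Two groups are drawn repeatedly from the same population, so that chance alone produces any difference between their means.

```python
import random
import statistics


def simulate_mean_differences(n_per_group=20, n_replications=5000, seed=1):
    """Repeatedly draw two groups from the SAME population and record the
    difference between their sample means."""
    rng = random.Random(seed)
    differences = []
    for _ in range(n_replications):
        # Both groups come from one hypothetical population (mean 100, SD 15),
        # so only chance influences distinguish them.
        group_1 = [rng.gauss(100.0, 15.0) for _ in range(n_per_group)]
        group_2 = [rng.gauss(100.0, 15.0) for _ in range(n_per_group)]
        differences.append(statistics.fmean(group_1) - statistics.fmean(group_2))
    return differences


differences = simulate_mean_differences()
mean_of_differences = statistics.fmean(differences)
sd_of_differences = statistics.stdev(differences)
print(round(mean_of_differences, 2), round(sd_of_differences, 2))
```

Under these assumptions the simulated differences cluster around zero, and their spread approximates the standard error of the difference. It is against this single distribution of differences that an obtained difference is judged too unlikely, or not, to be attributed to chance.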

R5.2. The case of expecting a numerically specific parameter.

An objection to NHSTP voiced in the commentaries is that in resting satisfied with "H1: μ ≠ 0" or "H1: μ > 0" or "H1: μ < 0," the researcher is distracted from developing a "stronger" hypothesis that makes it possible to say "H1: μE - μC = 5," instead of "H1: μE ≠ μC." Zumbo, as well as Harris, subscribes to Rozeboom's (1960) view that there are multiple H1's with specific nonzero values. A difficulty with this position (over and above the one discussed in sect. R4.3) may be seen by considering the situation that suggests "μE - μC = 5."

The experimenter is justified in expecting a difference of 5 between the experimental and control conditions when it is an implication of a functional relationship like [P5] or the result of a computer simulation (Lashley). However, the decision about statistical significance is made on the basis of one sampling distribution of differences. Hence, this expectation of "μE - μC = 5" is not represented by "H1: μE - μC = 5," but by "H0: (μE - μC) - 5 = 0" because the numerator of the t statistic is "(μ1 - μ2) - 5 = 0" (Kirk 1984), and the denominator is the standard error of the difference.
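The arithmetic behind this point can be sketched as follows. The code is a minimal illustration, assuming an independent-groups t statistic with a pooled variance estimate; the function name `t_for_hypothesized_difference` and the two small data sets are hypothetical and were chosen so that the observed difference between the means is exactly 5.

```python
import math
import statistics


def t_for_hypothesized_difference(experimental, control, hypothesized_difference=0.0):
    """Independent-groups t statistic (pooled variance) for
    H0: (mu_E - mu_C) - hypothesized_difference = 0."""
    n_e, n_c = len(experimental), len(control)
    mean_e, mean_c = statistics.fmean(experimental), statistics.fmean(control)
    var_e, var_c = statistics.variance(experimental), statistics.variance(control)
    pooled_variance = ((n_e - 1) * var_e + (n_c - 1) * var_c) / (n_e + n_c - 2)
    # Denominator: the standard error of the difference between two means.
    standard_error = math.sqrt(pooled_variance * (1 / n_e + 1 / n_c))
    # Numerator: the observed difference minus the difference stated in H0.
    return ((mean_e - mean_c) - hypothesized_difference) / standard_error


# Hypothetical scores; the group means are 16 and 11.
experimental = [15.0, 17.0, 16.0, 18.0, 14.0]
control = [10.0, 12.0, 11.0, 13.0, 9.0]

t_against_zero = t_for_hypothesized_difference(experimental, control, 0.0)
t_against_five = t_for_hypothesized_difference(experimental, control, 5.0)
print(t_against_zero, t_against_five)
```

In both calls the decision is referred to one sampling distribution of differences; the numerically specific expectation changes only where that distribution is centred. With these data the statistic is large when the hypothesized difference is 0 and is zero when it is 5, which is the sense in which the expectation of a difference of 5 occupies the position of H0 rather than H1.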

R5.3. Using H0 to ascertain the equivalence between two conditions.

Some commentators suggest that it may be too negative to characterize a nonsignificant result as a failure to reject chance influences (Frick). Instead, NHSTP can be used in a positive way to ascertain or accept a properly drawn null hypothesis (Bookstein). The suggestion to use NHSTP to accept a null hypothesis is reminiscent of Rogers et al.'s (1993) "non-equivalence null-hypothesis" approach, the purpose of which is to ascertain statistically that the parameters (e.g., the means) from two conditions are equivalent.

The "non-equivalence null-hypothesis" approach is debatable for the following reasons. First, in view of the role played by the sampling distribution of the test statistic in tests of significance, a significant result is one that is deemed too unlikely to be the result of chance influences. What can it mean to say that a test statistic is "significant by chance" (Rogers et al. 1993, p. 554)? Rogers et al. (1993) seem to have conceptualized their equivalence test at a level of abstraction different from that of tests of significance.

Second, a nonsignificant result is made unambiguous if statistical equivalence is achieved when the confidence interval is included in the equivalence interval (Rogers et al. 1993). At the same time, the equivalence between two conditions is deemed established when the confidence interval falls within the equivalence interval. The difficulty of this position is that the equivalence interval is determined in terms of practical or clinical criteria, not statistical ones. The width of the equivalence interval is sensitive to the context, which includes, among other things, the researcher's vested interests. Objectivity becomes a concern, especially if the equivalence interval is determined after the significance test is carried out.

In short, questions about the ambiguity of the result of NHSTP are questions about data stability, for example, whether or not (1) the measurements are made properly, (2) subjects are selected or assigned properly, and (3) subjects are given sufficient training and the like. These are not statistical concerns. Nor can they be quantified. Hence, the equivalence interval cannot disambiguate the ambiguity of the statistical decision.
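The dependence of the equivalence verdict on the chosen interval can be made concrete with a short sketch. It is a simplified illustration and should not be read as Rogers et al.'s (1993) exact procedure: it assumes a conventional pooled-variance confidence interval for the difference between two independent means, takes the critical t value as an input rather than computing it, and uses hypothetical data and function names.

```python
import math
import statistics


def confidence_interval_for_difference(group_1, group_2, critical_value):
    """Confidence interval for the difference between two independent means,
    using the pooled standard error. `critical_value` is the tabled t value
    for the chosen confidence level and n1 + n2 - 2 degrees of freedom."""
    n_1, n_2 = len(group_1), len(group_2)
    difference = statistics.fmean(group_1) - statistics.fmean(group_2)
    pooled_variance = (
        (n_1 - 1) * statistics.variance(group_1)
        + (n_2 - 1) * statistics.variance(group_2)
    ) / (n_1 + n_2 - 2)
    standard_error = math.sqrt(pooled_variance * (1 / n_1 + 1 / n_2))
    margin = critical_value * standard_error
    return difference - margin, difference + margin


def is_within_equivalence_interval(interval, equivalence_bound):
    """True when the whole confidence interval lies inside
    (-equivalence_bound, +equivalence_bound)."""
    lower, upper = interval
    return -equivalence_bound < lower and upper < equivalence_bound


# Hypothetical scores from two conditions.
group_1 = [10.0, 12.0, 11.0, 13.0, 9.0]
group_2 = [11.0, 12.0, 10.0, 13.0, 10.0]
# 2.306 is the two-tailed .05 critical t for 8 degrees of freedom.
interval = confidence_interval_for_difference(group_1, group_2, critical_value=2.306)

print(is_within_equivalence_interval(interval, equivalence_bound=1.0))
print(is_within_equivalence_interval(interval, equivalence_bound=3.0))
```

The same confidence interval fails the check with the narrower bound and passes it with the wider one. Nothing in the data or in the sampling distribution settles which bound is correct, which is the sense in which the criterion is practical or clinical rather than statistical.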

R5.4. Experimental expectation and H0.

An implication of Schneider and Shiffrin's (1977) model of automatic detection is that the subject's reaction time to a target is the same regardless of the set size (i.e., the number of items in the briefly shown visual display). In other words, the implication of the automaticity hypothesis is that there is no effect of set size (viz., μ1 = μ2 = ... = μk), and it is indistinguishable from the null hypothesis. Consequently, it seems that accepting the null hypothesis is more than accepting chance explanations (Bookstein, Frick).

There are two reservations. First, in view of the fact that NHSTP is based on the sampling distribution that is predicated on chance influences, it is inherently impossible to decide whether the absence of the set-size effect in Schneider and Shiffrin's (1977) study is the result of automatic, parallel detection or of chance influences. Findings of this kind become less ambiguous, however, when the H0-like experimental expectation is placed in a mutually exclusive and exhaustive relationship with the expectation of a competing hypothesis. For example, the expectation of the serial controlled search model of target identification is unlike H0. In such an event, the emphasis is on rejecting the serial controlled search, not on accepting the H0-like automaticity hypothesis. The second reservation is that it should be possible to derive an experimental expectation that is unlike H0. The experimenter's inability to do so indicates that the substantive hypothesis is not as well defined as it should be.

R6. The statistical alternative hypothesis, effect size, and statistical power

Apart from the issue of whether or not H1 is the substantive hypothesis, there is also the question of its exact role in NHSTP. It is not possible to talk about effect size or statistical power if H1 has no role in NHSTP. There are also two intertwining issues, namely, the graphical representation of effect-size or statistical power and the level of abstraction involved.

It is customary to discuss the effect size and statistical power in the context of the t test. Moreover, the discussion is carried out in the context of a distribution for the control condition and another one for the experimental condition. These two are distributions at the level of raw scores, as witnessed by the fact that the effect size is defined as the difference between the experimental and control means in units of the standard deviation of the control condition. They are labeled the H0 and H1 distributions, respectively. The effect size is represented by the distance between the means of the two distributions. Although the H1 distribution is not used in making the decision about statistical significance, it is essential in defining the effect size and statistical power. This account of the t test will be called the "customary account" henceforth.
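The definition of effect size used in the customary account can be written out directly. The sketch below is an illustration only; the function name `effect_size_control_sd` and the scores are hypothetical, and the sample standard deviation of the control condition is assumed as the unit.

```python
import statistics


def effect_size_control_sd(experimental, control):
    """Difference between the experimental and control means, expressed in
    units of the control condition's standard deviation."""
    mean_difference = statistics.fmean(experimental) - statistics.fmean(control)
    return mean_difference / statistics.stdev(control)


# Hypothetical raw scores for the two conditions.
experimental = [15.0, 17.0, 16.0, 18.0, 14.0]
control = [10.0, 12.0, 11.0, 13.0, 9.0]
effect_size = effect_size_control_sd(experimental, control)
print(round(effect_size, 3))
```

Every quantity in this computation is a property of the raw scores in the two conditions. No standard error, and hence no sampling distribution of differences, enters into it.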

What actually takes place in the t test is not what is described in the customary account. The α level is defined in terms of the sampling distribution of differences, not the distribution of population scores. Nor is this sampling distribution about, or based on, the control condition. This is true not because psychological theories are not "strong" or because psychologists are not "bold" when they propose their hypotheses (Gigerenzer's Type II "statistical reasoning"). Hence, any appeal to computer simulation for insights about the expected effect size becomes moot (Lashley). That is, even if the "strong" theory argument were not problematic for the reasons given in section R4.3, it is still not possible to represent the H1 distribution in a way that reflects properly the probability basis of the t test. In other words, "effect size" and "statistical power" cannot be defined at the level at which the statistical decision is carried out.

In short, to accept the customary account, it is necessary to show why the probability basis of the t test is not the theoretical sampling distribution of differences between two means. In the event that it is not possible for critics to do so, they have to provide a valid reconciliation between the customary and NHSTP accounts.

To the extent that the NHSTP account of the probabilistic basis of the t test is not refuted, the questions about the customary account raised in the NHSTP defence remain, particularly those about the dependence of effect size and statistical power on the H1 distribution. Lashley's commentary seems to be an attempt to reconcile the customary and NHSTP accounts by suggesting that the power analysis is meant to deal with a different stage in the inferential process. The power analytic argument is about stage 1 (at which two distributions are involved), whereas the test of significance is an exercise in stage 2 that utilizes only one distribution. Moreover, power analysis predicts the outcome of NHSTP.

Lashley's effort raises the following questions: Do the two stages belong to the same level of abstraction? What is the basis of the researcher's ability to predict the lone distribution in stage 2 from the two distributions from stage 1? How is the prediction possible when nonstatistical influences on data stability are not taken into account (e.g., the difficulty of the experimental task, the amount of practice available to the subjects, etc.)? What additional numerical information is provided by the effect size that is not provided by the test statistic? How can the sample size be determined in the mechanical way suggested in power analysis? How is the magnitude of the effect related to the validity of the theoretical statement about a material or formal cause?

R7. Some issues about the effect size

Effect size is of interest because it seems to indicate the amount of evidential support provided by the data for the hypothesis (the "evidential status of the data") or of the practical importance of the data (the issue of "practical validity"). The question of evidential status takes on a different complexion when distinctions are made between (1) the substantive and statistical hypotheses, and (2) the formal (or material) and efficient causes.

Consider Sternberg's (1969) study of short-term memory storage. He manipulated the memory set size (viz., 1, 2, or 4 digits) and found that subjects' correct reaction times increased linearly with increases in the size of the memory set. One can (and very often does) say that the manipulation of the set size was the cause of the increase in reaction times. Nonetheless, there is a less misleading way to describe the functional relationship between set size and correct reaction times.

In manipulating the memory set size, Sternberg (1969) provided the memory system with different contexts to operate. The increase in reaction times when given larger memory set sizes is a reflection of a property of the short-term store (viz., its inability to handle multiple items simultaneously). In other words, the observed functional relationship reveals what Aristotle would call a "material cause" or a "formal cause." This is different from an efficient cause (e.g., exerting force to move a stationary object), which is what critics of NHSTP have in mind when they talk about effect size. The material or formal cause of interest to cognitive psychologists is not ascertained by statistical significance or effect size (Lewandowsky & Maybery). Instead, it is determined by the validity of the series of embedding syllogisms. Furthermore, the effect size gives no information that is not available in the test statistic that is used to decide whether chance influences can be excluded. The conclusion is that effect size (i.e., the magnitude of an efficient cause) is irrelevant to theory corroboration.
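One way to read the statement that the effect size adds nothing to the test statistic is through the algebraic link between the two in the simplest case. The sketch below assumes two independent groups of equal size n and a difference standardized by the pooled standard deviation; this is an assumption made for illustration, since the customary account described earlier standardizes by the control-group standard deviation. Under that assumption the standardized difference d equals t multiplied by the square root of 2/n. The data and function names are hypothetical.

```python
import math
import statistics


def pooled_sd(a, b):
    """Pooled standard deviation of two independent samples."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    return math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))


def t_statistic(a, b):
    """Independent-groups t statistic with a pooled variance estimate."""
    se = pooled_sd(a, b) * math.sqrt(1 / len(a) + 1 / len(b))
    return (statistics.fmean(a) - statistics.fmean(b)) / se


def standardized_difference(a, b):
    """Mean difference in pooled-SD units (an assumed effect-size index)."""
    return (statistics.fmean(a) - statistics.fmean(b)) / pooled_sd(a, b)


if __name__ == "__main__":
    experimental = [5, 6, 7, 8, 9]   # hypothetical scores
    control = [3, 4, 5, 6, 7]        # hypothetical scores
    t = t_statistic(experimental, control)
    d = standardized_difference(experimental, control)
    n = len(control)
    print("t =", t)
    print("d computed directly =", d)
    print("d recovered from t  =", t * math.sqrt(2 / n))
```

With unequal group sizes or a different standardizer the conversion changes form, so the sketch should be read only as an instance of the general point, not as a statement about every effect-size index.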

Maher's suggestion to prepare and use actuarial tables for utilitarian research is appropriate for reasons of practical validity. The success of such an approach depends on having some valid, well-defined, nonstatistical criteria developed independently of the effect size itself (see also Kihlstrom). The more immediate lesson is that regardless of the index used, the effect size on its own (or any other statistical index) is informative of neither the practical validity nor the evidential status of the data (Nester).

R8. Statistical power

The validity of power analysis can still be questioned because it has not been shown why the NHSTP account is incorrect or how the NHSTP and customary accounts may be reconciled. For the sake of argument, assume that the customary account had not been problematic. Power analysis is meant to be used to disambiguate the decision about statistical significance. The ambiguity arises because statistical significance depends on sample size, effect size, and α level. The power of the test is used to determine the correct sample size for a required effect size at the chosen α level. The decision about statistical significance is said to be unambiguous under such circumstances.
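The mechanical character of the procedure can be seen in a sketch of the kind of calculation that lies behind a power table. The version below is only an approximation under stated assumptions that are not taken from the commentaries: two independent groups of equal size, a one-tailed test, and a normal approximation in place of the noncentral t distribution. The function names and the illustrative values (an effect size of 0.5 and a target power of .80) are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate one-tailed power for two equal, independent groups.

    Uses a normal approximation (an assumption); a power table based on
    the noncentral t distribution would give slightly different values.
    """
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha)
    shift = effect_size * (n_per_group / 2.0) ** 0.5
    return 1.0 - _STD_NORMAL.cdf(z_crit - shift)


def smallest_n_per_group(effect_size, target_power, alpha=0.05, n_max=100000):
    """Smallest group size whose approximate power reaches the target."""
    for n in range(2, n_max + 1):
        if approx_power(effect_size, n, alpha) >= target_power:
            return n
    raise ValueError("target power not reached within n_max")


if __name__ == "__main__":
    n = smallest_n_per_group(effect_size=0.5, target_power=0.80)
    print("group size from the approximation:", n)
    print("approximate power at that size:", round(approx_power(0.5, n), 3))
```

Nothing in the calculation refers to the task, the training of the subjects, or the number of trials; the only inputs are the expected effect size, the α level, and the desired power.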

Suppose that the sample size stipulated in the statistical power table is 25, given that the expected effect size is 0.85 and the power of the test is .95 with α set at .05. How can one be sure that the result is unambiguous when it is not known whether the 25 subjects have been given the proper training on the task? Are they given enough trials in the experimental session? These questions become more important, regardless of the decision about statistical significance, when fewer than 10 well-trained subjects are typically tested in multiple 300-trial sessions in the area of research in question (Lewandowsky & Maybery). It is simply not clear how the numerical index, statistical power, can confer validity on matters that are not quantitative in nature.

R9. A recapitulation of some reservations about the Bayesian approach

Psychologists often propose hypotheses to explain phenomena (see Bookstein for an exception). The minimal criterion for accepting an explanatory hypothesis is that it should be consistent with the phenomenon to be explained. Given the distinction between the instigating phenomenon and the evidential data in section R4.1, it can be seen that the data collection procedure is neither the criterion used to assess the validity of the hypothesis nor the unexplained phenomenon itself. Such a scenario is characterized as the "phenomenon -> hypothesis -> evidential-data" sequence in the NHSTP defence.

Also important in the defence of the NHSTP is the fact that, regardless of the experimenter's theoretical biases or preferences, the hypothesis to be corroborated is given the benefit of the doubt when the experimenter derives the research and experimental hypotheses, designs the experiment, and tabulates the data. That is, what the data indicate is not affected by what the experimenter thinks (or feels) about the hypothesis before the experiment. This is ensured partly by stipulating the size of the data set (viz., the number of subjects, the number of trials per session, the number of sessions, etc.). Nor is the data collection procedure adjusted as a result of periodic examinations of the data accumulated. This account of psychological research is different from the scenario Bayes had in mind.

Bayes's concern was with a situation in which the hypothesis was about the outcome of the data collection itself. The size of the data set was ill defined. There was nothing to explain, and there was no criterion for assessing whether the subjective degree of belief in the outcome of the data collection exercise is correct. Would a data analysis procedure based on such a scenario be applicable to the "phenomenon -> hypothesis -> evidential-data" sequence? Given the Bayesian overtone of Rozeboom's (1960) alternative, this issue also applies to Zumbo's suggestion.

The derivation of the posterior probability from the prior probability in the context of new data is not questioned in the NHSTP defence at the mathematical level. Instead, the issue raised concerns whether the Bayesian theorem is appropriate for analyzing data about the validity of the "phenomenon -> hypothesis -> evidential-data" sequence. How can this nonmathematical issue be dealt with in the new Bayesian developments? The more serious consideration is an implication of the Bayesian theorem about data interpretation.

It was Bayes's practice to obtain the posterior probability by adjusting the prior probability. The extent to which the data change the prior degree of belief in the hypothesis depends on the prior belief itself. The data have less weight the higher the prior degree of belief, a point not taken into account by Snow. In non-Bayesian terms, this practice amounts to saying that the theoretical importance of the data depends on the researcher's degree of belief in the hypothesis before the experiment. This is antithetical to objectivity, and it is shown in the NHSTP defence why there is no reason to do so.
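The dependence of the adjustment on the prior can be made concrete with a minimal sketch of the updating rule for a single hypothesis and its negation. The likelihood ratio of 3 and the three prior values are hypothetical, and measuring the "weight" of the data as the absolute change in probability is a choice made here for illustration; other measures (e.g., the change in odds) behave differently.

```python
def posterior(prior, likelihood_ratio):
    """Posterior P(H | data) from the prior P(H) and the likelihood ratio.

    likelihood_ratio is P(data | H) / P(data | not H).
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    weighted = prior * likelihood_ratio
    return weighted / (weighted + (1.0 - prior))


def belief_shift(prior, likelihood_ratio):
    """Absolute change in degree of belief produced by the same data."""
    return posterior(prior, likelihood_ratio) - prior


if __name__ == "__main__":
    lr = 3.0  # hypothetical: data three times as probable under H
    for prior in (0.5, 0.9, 0.99):
        print(prior, round(posterior(prior, lr), 4),
              round(belief_shift(prior, lr), 4))
```

For the priors listed, the same data move the degree of belief by a smaller amount the higher the prior, which is the feature of the practice to which the objection is directed.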

This objection would not apply to new Bayesian approaches if they no longer used the Bayesian theorem for such a purpose. Rouanet seems to suggest another possibility. The Bayesian exercise is still the derivation of the posterior probability from the prior probability. However, a "noninformative" Bayesian would assume a "state of ignorance" about parameters when choosing the "prior distribution." The posterior distribution then expresses the evidence provided by the data, presumably not contaminated by any nonzero prior probability.

There is a close relationship between the experimental design and the test of significance. For example, the t test and ANOVA are used for experiments that use the one-factor, two-level and the one-factor, multilevel design (or factorial designs), respectively. Are these issues important in the "noninformative" Bayesian approach? How are the various aforementioned meta-theoretical and methodological considerations met in the "noninformative" Bayesian approach?

R10. Further ado with meta-analysis - psychometric meta-analysis

Snow may be referring to Glass et al.'s (1981) meta-analytic approach when he suggests that specific information about the associated probability, p, may be useful. The issue of incommensurability was one of the difficulties of Glass et al.'s (1981) approach. Even though a group of studies is all about the same phenomenon, it is inappropriate to combine them in meta-analysis for an overall test of significance because it is inappropriate to mix apples and oranges.

Representing the psychometric meta-analytic orientation (Hunter & Schmidt 1990; Schmidt 1996), Hunter points out that studies dealing with the same independent and dependent variables enter into the meta-analysis only to obtain a better estimate for the parameter, not to do a test of significance. This does not overcome the "mixing apples and oranges" difficulty. For example, set size was the independent variable and correct reaction time was the dependent variable in both Schneider and Shiffrin's (1977) and Sternberg's (1969) studies. Be that as it may, it is not meaningful to include them in the same meta-analysis as Hunter recommends because there are other important differences between the two studies.

Researchers are advised by psychometric meta-analysts not to draw conclusions about substantive issues on the basis of data from single studies because the psychometric meta-analysis is more accurate and less ambiguous than individual tests of significance. An examination of the justification offered for this assertion is instructive. With 1,455 participants from 15 geographic sites, Schmidt et al. (1985) found a correlation coefficient of .22 between the performance on the test being validated and the ability to operate a special keyboard. The correlation coefficient was statistically significant. They then formed 21 random samples (without replacement) of 68 members each from the 1,455 participants. Each of these 68-member samples was treated as a ministudy. The correlation between task performance and keyboard operation was obtained for each of the 21 "ministudies." Statistical significance was found in only 8 of the 21 ministudies. The mean of the 8 ministudies was .33, which differed from the "true" correlation coefficient of .22. This is the reason psychometric meta-analysts find individual tests of significance misleading. They also conclude that the meta-analytic result is more accurate.
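The structure of the ministudy exercise can be sketched with synthetic data. The simulation below does not use Schmidt et al.'s (1985) data, which are not reproduced here; it assumes a bivariate normal population with a correlation of .22, ignores the 15 sites, and tests each ministudy correlation with the Fisher z approximation at the two-tailed .05 level. All of these are assumptions made for illustration, as are the function names and the random seed.

```python
import math
import random
import statistics


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def is_significant(r, n, z_crit=1.96):
    """Two-tailed test of r = 0 via the Fisher z approximation."""
    return abs(math.atanh(r)) * math.sqrt(n - 3) > z_crit


def run_ministudies(n_total=1455, n_studies=21, n_each=68, rho=0.22, seed=7):
    """Split one synthetic data set into ministudies without replacement."""
    rng = random.Random(seed)
    xs = [rng.gauss(0, 1) for _ in range(n_total)]
    ys = [rho * x + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1) for x in xs]
    order = list(range(n_total))
    rng.shuffle(order)  # disjoint slices give sampling without replacement
    rs = []
    for k in range(n_studies):
        idx = order[k * n_each:(k + 1) * n_each]
        rs.append(pearson_r([xs[i] for i in idx], [ys[i] for i in idx]))
    significant = [r for r in rs if is_significant(r, n_each)]
    return pearson_r(xs, ys), rs, significant


if __name__ == "__main__":
    overall, rs, significant = run_ministudies()
    print("overall r:", round(overall, 3))
    print("significant ministudies:", len(significant), "of", len(rs))
    if significant:
        print("mean significant r:", round(statistics.fmean(significant), 3))
```

Under this approximation a correlation from a 68-member sample has to exceed about .24 in absolute value to be significant, so an average taken over only the significant positive ministudies will lie above .22 whatever the data.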

Four things should be noted before accepting the argument for meta-analysis. First, the effective size of the population shrank as the number of ministudies increased because the samples were selected without replacement. This feature renders the independence among the ministudies suspect. The second point is that Schmidt et al. (1985) should have used cluster sampling, not simple random sampling, to form their ministudies in order to reflect the local characteristics of the 15 sites. That is, samples in their ministudies were not representative. Third, their "true" parameter (r = .22) was not theoretically informed. It was the measurement obtained from a complete enumeration of all the participants, and they assumed that a complete enumeration necessarily gives an accurate result. There is no clear reason why this should be the case; the opposite is more likely to be true. Given the same extent of the resources for conducting the research, the chance of making mistakes is higher if there are more units or participants to be measured (see Slonium 1960). Fourth, LeLorier et al. (1997) have reported that "the outcome of the 12 large randomized, controlled trials . . . were not predicted accurately 35 percent of the time by the meta-analyses published previously on the same topics" (p. 536).

Is meta-analysis a valid means of corroborating explanatory hypotheses (Rossi)? Hunter makes it clear that psychometric meta-analysts are not interested in explanatory hypotheses. Chow (1987c) has shown that Glass et al.'s (1981) approach was invalid as a theory-corroboration procedure. The objection to using meta-analysis is not that it may have been misused (Glück & Vitouch), but that some of its underlying meta-theoretical assumptions are debatable. The difficulty with resolving the discrepancies in studies of spontaneous recovery may partly be due to the fact that there is insufficient theoretical insight (Rossi). Alternatively, why should there be theoretical unanimity when the phenomenon may have multiple underlying material or formal causes?

R11. Summary and conclusions

The NHSTP defence is an attempt to rationalize the role of tests of significance in the theory-corroboration experiment. This approach was adopted because it has been acknowledged that the criticisms of NHSTP were not applicable to experimental studies in which all recognizable controls are properly instituted. There is no reason to believe that using NHSTP hinders theory development. There are difficulties with the characterization of the "strong hypothesis." The effect size has no evidential status. The more serious reservation about the effect size and statistical power is based on the fact that they are defined at a level of abstraction different from the level at which the decision about statistical significance is made. Without disputing the mathematics in the Bayesian or meta-analytic approaches, their role in theory corroboration may be questioned on methodological or conceptual grounds. In sum, there is as yet no reason to revise the NHSTP defence in any substantive way.