Harnad, S. Hanson, S.J. & Lubin, J. (1995) Learned Categorical Perception in Neural Nets: Implications for Symbol Grounding. In: V. Honavar & L. Uhr (eds) Symbol Processors and Connectionist Network Models in Artificial Intelligence and Cognitive Modelling: Steps Toward Principled Integration. Academic Press.
ABSTRACT: After people learn to sort objects into categories they see them differently. Members of the same category look more alike and members of different categories look more different. This phenomenon of within-category compression and between-category separation in similarity space is called categorical perception (CP). It is exhibited by human subjects, animals and neural net models. In backpropagation nets trained first to auto-associate 12 stimuli varying along a one-dimensional continuum and then to sort them into 3 categories, CP arises as a natural side-effect because of four factors: (1) Maximal interstimulus separation in hidden-unit space during auto-association learning, (2) movement toward linear separability during categorization learning, (3) inverse-distance repulsive force exerted by the between-category boundary, and (4) the modulating effects of input iconicity, especially in interpolating CP to untrained regions of the continuum. Once similarity space has been "warped" in this way, the compressed and separated "chunks" have symbolic labels which could then be combined into symbol strings that constitute propositions about objects. The meanings of such symbolic representations would be "grounded" in the system's capacity to pick out from their sensory projections the object categories that the propositions were about.
Categorical perception (CP) occurs when there is a perceived compression of within-category similarities and/or an enhancement of between-category differences so that members of the same category look more alike and members of different categories look more different than one would expect on the basis of their physical characteristics alone (Harnad 1987). This effect is known to occur with innate categories such as colors (Berlin & Kay 1969; Bornstein 1987), speech sounds (Howell & Rosen 1987), and facial expressions (Etcoff & Magee 1992), but its occurrence purely as a result of learning, first reported by Lane (1965), has only begun to be investigated.
Andrews, Livingston, Harnad & Fischer (in preparation) have found learned CP effects with artificially generated "textures" as well as with natural stimuli such as chick genitalia (Biederman 19XX). Human subjects were trained to sort the stimuli into two categories "correctly" (according to an experimenter-specified invariance, unknown to the subject, in the case of the textures, and a natural invariance in the case of chick sexing) through trial and error, with feedback indicating the correct category name. In the subjects who learned successfully, a CP effect (consisting of within-category compression and/or between-category separation) was found to occur when pairwise similarity judgments within and between categories were compared with those of subjects who had not been trained or who had not succeeded in learning to sort and label the stimuli correctly.
In a previous paper (Harnad, Hanson & Lubin 1991), we reported neural net simulations of category learning with stimuli that varied along a one-dimensional "continuum." The inputs consisted of 8 "lines" varying from the shortest (1) to the longest (8). The net's task was to sort the four shorter lines (1 - 4) into one category and the four longer lines (5 - 8) into another. Backpropagation nets (McClelland & Rumelhart 1986) were used; apart from the 8 input units they had 3 hidden units and 9 output units.
The input lines were either "place" coded (e.g., a line of length 4 would be 0 0 0 1 0 0 0 0) or "thermometer" coded (e.g., line 4 would be 1 1 1 1 0 0 0 0). The place code was interpreted as more arbitrary and the thermometer code more iconic, in that the thermometer code preserved, through multi-unit constraints, some of the analog structure of real lines (as they would appear if they were projected onto a sensory receptor surface) whereas the place code did not have any analog structure. In addition, the thermometer-coded lines and the place-coded lines could be discrete-coded (as above) or they could be coarse-coded, allowing some Gaussian spillover to adjacent units (e.g., line 4 coarse/place-coded might be 0 .001 .1 .99 .1 .001 0 0, and line 4 coarse/thermometer-coded might be .90 .99 .99 .90 .1 .001 0 0). Finally, because CP concerns the formation of boundaries between categories, a lateral inhibition coding was also tested, in which adjacent coarse-coded units were inhibited so as to enhance boundaries (e.g., line 4 lateral-inhibition/place-coded might be .1 .1 .001 .99 .001 .1 .1 .1, and line 4 lateral-inhibition/thermometer-coded might be .8 .9 .9 .99 .001 .1 .1 .1). Coarse coding was interpreted as more analog than the discrete binary coding, again because it preserved multi-unit constraints. Lateral inhibition coding was likewise more analog than the discrete code, but also more complicated, because the width and placement of the boundary effects from the lateral inhibition could in principle help or hinder the formation of a CP boundary, depending on whether the two effects happened to be in or out of phase.
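The discrete and coarse codings can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names are hypothetical, the Gaussian width `sigma` is an arbitrary choice, and the coarse codes peak at 1.0 rather than reproducing the example values quoted above (.99, .90, and so on). Lateral-inhibition coding is not sketched because its exact form is not specified here.

```python
import numpy as np

N_UNITS = 8  # one input unit per line length, as in the 8-line study

def place_code(length, n=N_UNITS):
    """Discrete place code: only the unit at position `length` is on."""
    v = np.zeros(n)
    v[length - 1] = 1.0
    return v

def thermometer_code(length, n=N_UNITS):
    """Discrete thermometer code: every unit up to `length` is on."""
    v = np.zeros(n)
    v[:length] = 1.0
    return v

def coarse_place_code(length, n=N_UNITS, sigma=0.5):
    """Place code with Gaussian spillover to neighbouring units.
    `sigma` is an assumed width, not a value taken from the paper."""
    positions = np.arange(1, n + 1)
    return np.exp(-((positions - length) ** 2) / (2.0 * sigma ** 2))

def coarse_thermometer_code(length, n=N_UNITS, sigma=0.5):
    """Thermometer code with Gaussian spillover past the last 'on' unit
    (an assumed construction, chosen only to give a graded edge)."""
    spill = coarse_place_code(length, n, sigma)
    return np.maximum(thermometer_code(length, n), spill)

if __name__ == "__main__":
    for code in (place_code, thermometer_code,
                 coarse_place_code, coarse_thermometer_code):
        print(code.__name__, np.round(code(4), 3))
```

With these definitions, neighbouring lines share active units under the thermometer and coarse codes but not under the discrete place code, which is the sense in which the former are treated as more iconic.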
The method used to generate the precategorization baseline was "auto-association" (Hanson & Kegl 1987; Cottrell, Munro & Zipser 1987). Different nets were trained (separately for each of the 6 input codings) to produce as output exactly the same pattern they received as input. For each net trained to a predefined criterion level of performance on auto-association, the interstimulus distances for all pairs of the 8 lines were then calculated as the Euclidean distance between the vectors of hidden unit activations for each pair of lines. For example, if there were four hidden units and their activation values after training for line X were (x1 x2 x3 x4) and for line Y (y1 y2 y3 y4), then the distance between the two inputs, and hence their discriminability for that net, would be the Euclidean distance between X and Y, i.e., the square root of (x1 - y1)^2 + (x2 - y2)^2 + (x3 - y3)^2 + (x4 - y4)^2 (see Hanson & Burr 1990 for prior work on using this internal measure of interstimulus distance).
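As a minimal sketch of this distance measure, the function below takes one row of hidden-unit activations per stimulus and returns the full matrix of pairwise Euclidean distances. The function name and the toy activation values are illustrative assumptions, not data from the simulations.

```python
import numpy as np

def interstimulus_distances(hidden):
    """Pairwise Euclidean distances between hidden-unit activation vectors.

    hidden: array of shape (n_stimuli, n_hidden), one row per stimulus.
    Returns an (n_stimuli, n_stimuli) symmetric matrix with a zero diagonal.
    """
    hidden = np.asarray(hidden, dtype=float)
    diff = hidden[:, None, :] - hidden[None, :, :]   # all pairwise differences
    return np.sqrt((diff ** 2).sum(axis=-1))

# Toy activations for three stimuli in a four-hidden-unit net (made up).
toy_hidden = [[0.0, 0.0, 0.0, 0.0],
              [1.0, 1.0, 1.0, 1.0],
              [1.0, 0.0, 0.0, 0.0]]
D = interstimulus_distances(toy_hidden)
```

Entry `D[i, j]` is the distance between stimuli `i` and `j`; the same function applies unchanged to the three-hidden-unit nets used in the simulations.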
After auto-association was learned, the trained weights for the connections between the hidden layer and the output layer were reloaded (and then all weights were left free to vary); the net was then given a double task: auto-association (again) and categorization, i.e., lines 1 - 4 had to be given one (arbitrary) "name" (e.g., "1") and lines 5 - 8 had to be given another (e.g., "0"). (In practice, this naming required one more bit on the output: the usual eight for the auto-association, and then one more for the categorization, initially seeded randomly with weights in the (-1.0, 1.0) range.)
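The two training phases can be sketched as follows. The 8-3-9 layout and the (-1.0, 1.0) seeding of the added category unit come from the description above; everything else is an assumption made for illustration: logistic units, squared error, full-batch gradient descent, the learning rate, the fixed number of epochs (instead of a criterion), discrete thermometer inputs, and the simplification of keeping all trained weights between phases. `TinyNet` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class TinyNet:
    """8 inputs -> 3 hidden -> n_out logistic outputs, plain backprop."""

    def __init__(self, n_in=8, n_hid=3, n_out=8):
        self.W1 = rng.uniform(-1.0, 1.0, (n_in, n_hid))
        self.b1 = np.zeros(n_hid)
        self.W2 = rng.uniform(-1.0, 1.0, (n_hid, n_out))
        self.b2 = np.zeros(n_out)

    def forward(self, X):
        H = sigmoid(X @ self.W1 + self.b1)   # hidden-unit representations
        Y = sigmoid(H @ self.W2 + self.b2)
        return H, Y

    def loss(self, X, T):
        _, Y = self.forward(X)
        return 0.5 * ((Y - T) ** 2).sum()

    def train_step(self, X, T, lr=0.1):
        H, Y = self.forward(X)
        dY = (Y - T) * Y * (1.0 - Y)             # output deltas (squared error)
        dH = (dY @ self.W2.T) * H * (1.0 - H)    # hidden deltas
        self.W2 -= lr * (H.T @ dY)
        self.b2 -= lr * dY.sum(axis=0)
        self.W1 -= lr * (X.T @ dH)
        self.b1 -= lr * dH.sum(axis=0)

    def add_category_unit(self):
        """Append one output unit seeded with weights in (-1.0, 1.0)."""
        new_w = rng.uniform(-1.0, 1.0, (self.W2.shape[0], 1))
        self.W2 = np.hstack([self.W2, new_w])
        self.b2 = np.append(self.b2, 0.0)

# Discrete thermometer codes for lines 1..8 (row i encodes line i + 1).
X = np.tril(np.ones((8, 8)))

# Phase 1: auto-association only (target = input).
net = TinyNet()
loss_start = net.loss(X, X)
for _ in range(3000):
    net.train_step(X, X)
loss_auto = net.loss(X, X)
H_auto, _ = net.forward(X)

# Phase 2: auto-association plus one category bit (lines 1-4 -> 1, 5-8 -> 0).
net.add_category_unit()
labels = np.array([[1.0]] * 4 + [[0.0]] * 4)
T = np.hstack([X, labels])
for _ in range(3000):
    net.train_step(X, T)
H_cat, _ = net.forward(X)
```

`H_auto` and `H_cat` hold the hidden-unit representations before and after categorization training, which is the pair of snapshots the distance comparison needs.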
What we found in these prior studies as a very reliable correlate of categorization training was a strong CP effect, consisting of compression of within-category similarities and enhancement of between-category differences, as measured by comparing, for each net, (a) all interstimulus distances (pairwise euclidean distance in hidden-unit activation space) after auto-association alone with (b) all interstimulus distances after categorization. This category-learning-induced "warping" of interstimulus similarity space was dramatic and ubiquitous, occurring with all six input codings, both the more arbitrary ones and the more analog ones.
In the study reported here, the number of stimuli was increased from 8 to 12, and the number of categories from 2 to 3 (so as to rule out the possibility that CP effects were either peculiar to dichotomies or mere artifacts of endpoint singularities). The three categories were "short" (lines 1 - 4), "medium" (lines 5 - 8) and "long" (lines 9 - 12). The results were again substantially the same for all six input codings. Figure 1 displays the characteristic pattern of results (using the coarse thermometer coding to illustrate). Figure 1a illustrates all the pairwise interstimulus distances for the 12 stimuli after auto-association learning and Figure 1b shows the change in interstimulus distances after categorization learning. Polarity is down for separation and up for compression. For ease of inspection, within-category distances are dark and between-category distances are light (for the 1-step, 2-step and 3-step comparisons; beyond that, all comparisons are between-category and dark signifies that not one but two categories are straddled by the comparison). Almost without exception, there is within-category compression and between-category separation. The same is true for the other five codings.
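The before/after comparison can be summarized numerically as in the sketch below. It follows the polarity stated for Figure 1b (positive change means compression, negative means separation). The function names and the one-dimensional toy activations are assumptions made purely for illustration; they are not results from the simulations.

```python
import numpy as np

def pairwise_distances(H):
    """Euclidean distances between all rows of H (n_stimuli x n_hidden)."""
    H = np.asarray(H, dtype=float)
    return np.linalg.norm(H[:, None, :] - H[None, :, :], axis=-1)

def cp_summary(H_before, H_after, categories):
    """Mean change in distance for within- and between-category pairs.

    Change is (distance after auto-association alone) minus (distance after
    categorization), so positive = compression, negative = separation.
    """
    categories = np.asarray(categories)
    change = pairwise_distances(H_before) - pairwise_distances(H_after)
    i, j = np.triu_indices(len(categories), k=1)   # each pair counted once
    same = categories[i] == categories[j]
    return change[i, j][same].mean(), change[i, j][~same].mean()

# Toy example: four stimuli, two categories, one hidden unit (made up).
before = [[0.2], [0.4], [0.6], [0.8]]
after = [[0.1], [0.2], [0.8], [0.9]]
within, between = cp_summary(before, after, [0, 0, 1, 1])
```

In this toy case the within-category mean is positive and the between-category mean is negative, which is the CP signature described above.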
In our earlier study on the 8-bit/2-category task, the time-course dynamics of this learned CP effect in hidden unit representational space were analyzed more closely to determine what causal role they might be playing in the nets' successful learning of the categorization task. Three factors -- (1) maximal interstimulus separation induced during auto-association learning, (2) stimulus movement in order to achieve linear separability during categorization learning, and (3) the inverse-distance "repulsive force" exerted by the plane separating the categories -- were found to play a direct role in generating CP and one further factor -- (4) the iconicity of the input codings -- was found to modulate CP. The same factors were found to be operative in the more general 12-bit/3-category task analyzed here:
(1) Maximal Interstimulus Separation
During auto-association learning, the hidden-unit representations of each of the stimuli, initially random, move as far apart from one another as possible, like an expanding universe. When successful auto-association has been attained, the representations have often reached maximal pairwise separation from one another in the unit cube (which bounds the maximum and minimum values of each of the hidden units). This "expanding universe" effect during auto-association learning can be seen in Figure 2a, which shows results for the most arbitrary input coding, discrete-place. The coordinates of each point are the activations of each of the three hidden units for that input. There are 12 inputs; the "smoke-trail" of each shows how their values evolved during the course of learning from their initial (random) value (light gray) to the final one (black), after successful learning.
It is evident from the final configuration of the hidden-unit representations after successful auto-association learning (Figure 2a) that if such a net is to learn anything further, something will have to "give," forcing the hidden-unit representations of some of the stimuli to move closer to one another than they would otherwise have "liked" to be, in order to achieve successful categorization (Figure 2b).
The expansion effect was maximal with the most arbitrary nets. With the more iconic nets (e.g., coarse thermometer), the "structure preserving" nature of the analog input codings (Factor 4, below) imposes an a priori ordering on the hidden unit representations, again forcing some stimuli from the very outset to be nearer to one another (see Figures 2c and 2d) than they would be with the more arbitrary input codings (2a, 2b).
(2) Movement to Achieve Linear Separability
Often the initial random "seeding" of the hidden-units or the configuration they happen to reach at the end of auto-association learning will not allow planes to be placed within the cube so as to separate the three categories from one another. To achieve this linear separability, more movement is needed during learning, and this movement too is in the direction of within-category compression and between-category separation, i.e., CP (see Figures 2b and 2d).
(3) Inverse-Distance Forces at the Category Boundary
Because of the error-metric of the equation governing the way these networks learn, each hidden-unit representation is "pushed" with a force that is inversely proportional to an exponential function of its distance from the planes separating the three categories. This effect can be seen in Figure 1b, where the separation/compression effects are greater near each category boundary; it can also be seen in Figures 2b and 2d, where the greatest movement can be seen to have occurred in the region of the boundaries if one visualizes the two planes that divide the cube into three categories (with four members each).
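One way to see where such a force law could come from is the following worked sketch. It assumes a single logistic category-output unit trained with squared error, which is an assumption of this illustration rather than something stated above. The symbols are likewise introduced here: h is the hidden-unit representation, w and b are the weights and bias of the category unit, t is the target, and d is the signed distance of h from the plane w.h + b = 0.

```latex
\begin{align*}
o &= \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad
z = \mathbf{w}\cdot\mathbf{h} + b, \qquad
E = \tfrac{1}{2}(t - o)^2, \\
\frac{\partial E}{\partial \mathbf{h}} &= -(t - o)\, o(1 - o)\, \mathbf{w}, \qquad
o(1 - o) = \frac{1}{e^{z} + 2 + e^{-z}}, \\
d &= \frac{z}{\lVert \mathbf{w} \rVert}
\;\Longrightarrow\;
\left\lVert \frac{\partial E}{\partial \mathbf{h}} \right\rVert
= \frac{\lvert t - o \rvert \, \lVert \mathbf{w} \rVert}
       {e^{\lVert \mathbf{w} \rVert d} + 2 + e^{-\lVert \mathbf{w} \rVert d}} .
\end{align*}
```

Under these assumptions the denominator grows like an exponential of the distance |d| from the plane, so the error signal reaching a hidden-unit representation is largest near the boundary and falls off rapidly away from it.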
(4) Iconicity
If we compare Figures 2b and 2d, it is clear that the learning in 2d has a "head-start" because of the iconic input coding, which has already ordered the stimuli in a way that makes partitioning them into three categories much easier. We consider the coarse thermometer code (2c, 2d) to be the most natural for sensory category learning and we have tested this further by investigating the capacity of these nets to interpolate categorization into regions of input space that they have not been explicitly trained on:
All data so far have been for nets in which there has been categorization training for all 12 of the stimuli. Figures 3a and 3b show the time-course of learning in hidden-unit space for a non-iconic (discrete place) net in which the hidden-unit representations for the interior members within a category are left free to vary and only the endpoints of the category range are trained (in other words, stimuli 1,4,5,8,9, and 12 are trained and stimuli 2,3,6,7,10 and 11 are free). This net cannot interpolate and hence the untrained generalization stimuli are not correctly categorized. By contrast, the iconic coarse-thermometer net (3c, 3d) does successfully interpolate to the untrained region because of the "spill-over" from the structure-preserving form of the thermometer and coarse coding.
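The endpoint-only training regime can be sketched as a mask over the category error. This is a sketch under assumptions: the category names are coded one-hot over three output units, and only the category error (not the auto-association error) is withheld for the interior stimuli. Neither detail is specified above, and the function name is hypothetical.

```python
import numpy as np

stimuli = np.arange(1, 13)                 # lines 1..12
category = (stimuli - 1) // 4              # 0 = short, 1 = medium, 2 = long
trained = np.isin(stimuli, [1, 4, 5, 8, 9, 12])   # category endpoints only

def masked_category_error(outputs, targets, trained_mask):
    """Category-output error, zeroed for stimuli given no category feedback."""
    return (outputs - targets) * trained_mask[:, None]

# Assumed one-hot category targets and placeholder outputs of 0.5 everywhere.
targets = np.eye(3)[category]
outputs = np.full((12, 3), 0.5)
err = masked_category_error(outputs, targets, trained)
```

Rows of `err` for stimuli 2, 3, 6, 7, 10 and 11 are zero, so those stimuli contribute no category gradient; whether they end up correctly categorized then depends on how the input coding places them relative to the trained endpoints.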
The arbitrary and iconic codings can also be compared by looking at the "receptive field" of each of the hidden units across the input space. Figure 4a shows the receptive field of the three hidden units of an arbitrary (discrete place) net after auto-association learning; Figure 4b shows their receptive fields after categorization learning. The "warping" of similarity space illustrated in Figures 2a and 2b is evident here too; the effect looks disorderly, reflecting the unstructured, arbitrary nature of this form of input coding.
In contrast, the receptive fields of the three hidden units of an iconic (coarse place) net after auto-association are shown in Figure 4c; the same receptive fields, now "warped" by the CP effect in the service of categorization, are shown in Figure 4d. Note that the receptive fields in 4c are somewhat reminiscent of the trichromatic "Red/Green/Blue (RGB)" receptive fields of mammalian retinal cones tuned to Red, Green and Blue selectivity (Figure 5). RGB trichromacy (plus Red/Green Blue/Yellow opponency) is known to be the mechanism underlying human color vision (Boynton 1979) -- including, of course, color CP. It is interesting that some of the iconic codings with 3 hidden units produce similar receptive field profiles.
Discussion. Factors 1 - 3 above, together with the modulating iconicity factor (4), suggest a mechanism for the learned CP effects reported by Andrews et al. (in preparation). Our next step will be to apply this neural net model to the same stimuli used by Andrews et al., to determine whether, apart from mimicking the general profile of CP with simple unidimensional stimuli (in which categorization depends only on learning a sensory threshold), such nets have the capacity to learn the complex invariants underlying the much more challenging categorization tasks that human beings are capable of mastering. We are also examining whether other kinds of neural nets (e.g., ARTMAP: Carpenter, Grossberg & Reynolds 1991; Lubin, Hanson & Harnad, in preparation) exhibit CP, and whether the four underlying factors we have isolated turn out to have their counterparts in other kinds of nets. It has been suggested that the hippocampus may be the locus of CP effects (Gluck & Myers 1993).
If neural nets do turn out to be powerful enough to do human-scale category learning then they, together with the analog mechanisms that transduce sensory input, may play an important role in a still more general mechanism that grounds the meanings of the symbols we use to name the objects in our environment through our capacity to categorize those objects on the basis of their sensory projections, thereby allowing us to go on to use combinations of those grounded symbols in grounded propositions that describe and explain that environment (Harnad 1990, 1992).
Andrews, J., Livingston, K., Harnad, S. & Fischer, U. (1992) Learned Categorical Perception in Human Subjects: Implications for Symbol Grounding.
Berlin, B. & Kay, P. (1969) Basic color terms: Their universality and evolution. Berkeley: University of California Press
Bornstein, M. H. (1987) Perceptual Categories in Vision and Audition. In: Harnad (1987)
Boynton, R. M. (1979) Human color vision. New York: Holt, Rinehart, Winston
Carpenter G.A., Grossberg S. & Reynolds J.H. (1991) ARTMAP - Supervised real-time learning and classification of nonstationary data by a self-organizing neural network. Neural Networks 4:565-588.
Cottrell, Munro & Zipser (1987) Image compression by back propagation: an example of extensional programming. ICS Report 8702, Institute for Cognitive Science, UCSD.
Etcoff, N.L. & Magee, J.M. (1992) Categorical perception of facial expressions. Cognition 44: 227 - 240.
Gibson, E. J. (1969) Principles of perceptual learning and development. Englewood Cliffs NJ: Prentice Hall
Gluck, M.A. & Myers, C.E. (1993) Hippocampal mediation of stimulus representation: A computational theory. Hippocampus (in press)
Hanson S.J. & Burr D.J. (1990) What connectionist models learn: Learning and Representation in connectionist networks. Behavioral and Brain Sciences 13: 471-518.
Hanson, S. J. and Kegl, J. (1987) Parsnip: A Connectionist Model that Learns Natural Language Grammar from Exposure to Natural Language Sentences. Proceedings of the Ninth Annual Cognitive Science Conference Seattle.
Harnad, S. (ed.) (1987) Categorical Perception: The Groundwork of Cognition. New York: Cambridge University Press.
Harnad, S. (1990a) The Symbol Grounding Problem. Physica D 42: 335-346.
Harnad, S. (1990b) Symbols and Nets: Cooperation vs. Competition. Review of: S. Pinker and J. Mehler (Eds.) (1988) Connections and Symbols Connection Science 2: 257-260.
Harnad, S. (1992) Connecting Object to Symbol in Modeling Cognition. In: A. Clarke and R. Lutz (Eds) Connectionism in Context. Springer Verlag.
Harnad, S., Hanson, S.J. & Lubin, J. (1991) Categorical Perception and the Evolution of Supervised Learning in Neural Nets. In: Working Papers of the AAAI Spring Symposium on Machine Learning of Natural Language and Ontology (DW Powers & L Reeker, Eds.) pp. 65-74. Presented at Symposium on Symbol Grounding: Problems and Practice, Stanford University, March 1991.
Howell P. & Rosen, S. (1984) Natural auditory sensitivities as universal determiners of phonemic contrasts. Linguistics 21: 205-235.
Lane, H. (1965) The motor theory of speech perception: A critical review. Psychological Review 72: 275 - 309.
Lawrence, D. H. (1950) Acquired distinctiveness of cues: II. Selective association in a constant stimulus situation. Journal of Experimental Psychology 40: 175 - 188.
Lubin, J., Hanson, S. & Harnad, S. (in preparation) Categorical Perception in ARTMAP Neural Networks.
McClelland, J.L., Rumelhart, D. E., and the PDP Research Group (1986) Parallel distributed processing: Explorations in the microstructure of cognition, Volume 1. Cambridge MA: MIT/Bradford.
Newell, A. (1980) Physical Symbol Systems. Cognitive Science 4: 135 - 83.
FIGURE LEGENDS
Figure 1
Pairwise interstimulus distances in hidden-unit space for coarse-thermometer coded nets. 12 inputs were sorted into 3 categories (1-4, 5-8, 9-12). a: After auto-association alone. b: Difference between auto-association alone and auto-association plus categorization: Upward deviation means compression, downward deviation means separation. Lowest level of pyramid, one-step comparisons (stimulus 1 vs. 2, 2 vs. 3, etc.); next level, two-step comparisons (1 vs. 3, etc.); highest level, eleven-step comparisons. Note that from level four upward all comparisons are between categories. Until that level (only), dark bars signify within-category comparisons and light bars signify between-category comparisons. Virtually without exception there is upward deviation (compression) within categories and downward deviation (separation) between categories. All deviations above level four are negative (as they should be) and here dark signifies that not one but two category boundaries are being straddled.
Figure 2
Evolution of the 12 stimulus representations in hidden-unit space for 3-hidden-unit nets during learning. Each stimulus's representation is displayed as a "smoke trail" of circles in the unit cube, its value on each axis corresponding to the activations of each of the hidden units. The trail begins as gray circles with the initial learning trial and ends as a black circle on the last trial. The diameter of the circles increases with stimulus number, from the smallest for stimulus 1 to the largest for stimulus 12. 2a: Discrete place net during auto-association learning. 2b: Same net during categorization learning. 2c: Coarse thermometer net during auto-association. 2d: Same net during categorization. Note maximal separation with arbitrary coding and constrained order with iconic coding.
Figure 3
Evolution of learning in nets where interior values in category (2,3,6,7,10,11) are untrained. 3a: Discrete place, auto-association. 3b: Discrete place, categorization. 3c: Coarse thermometer, auto-association. 3d: Coarse thermometer, categorization. Note that only the iconic net interpolates successfully.
Figure 4
Receptive fields of each of the three hidden units. 4a: Discrete place, auto-association. 4b: Discrete place, categorization. 4c: Coarse place, auto-association. 4d: Coarse place, categorization.
Figure 5
Human "RGB" curves: Normalized photopigment absorption. Note similarity to 4c and d.