Re: Pinker & Prince: Connectionism

From: Cove Stuart (
Date: Tue May 01 2001 - 21:23:27 BST

In this paper, Pinker and Prince approach the question of whether
knowledge about language is just a matter of storing mental rules, or
whether it's something more.
They use a connectionist example of language acquisition, developed by
Rumelhart and McClelland, who are influential figures in the research
of neural networks.
Their opinion is that language is far too complex and subtle for such a
connectionist model to learn, and they structure a critique of the
aforementioned system to explain the failings of a connectionist approach.

The paper's introduction explains the difference between the symbolic
and connectionist approaches to natural language acquisition.

>Some early cognitive models have assumed an underlying
>architecture inspired by the historical and technological accidents of current
>computer design, such as rapid reliable serial processing, limited bandwidth
>communication channels, or rigid distinctions between registers and memory. These
>assumptions are not only inaccurate as descriptions of the brain, composed as it is of
>slow, noisy and massively interconnected units acting in parallel, but they are
>unsuited to tasks such as vision where massive amounts of information must be
>processed in parallel.

This, the authors suggest, is one of the main reasons that "brain
style" computation seems a more appealing prospect for the construction
of a cognitive model. I agree with this: symbolic methods are clearly
not suited to adaptive learning tasks, falling foul of the frame problem.

The paper goes on to outline the structure and function of neural
networks, before digressing into the ways in which the connectionist
model will fit in with symbolic models of cognitive processes.

>In one, PDP models would occupy an intermediate level
>between symbol processing and neural hardware: they would characterize the
>elementary information processes provided by neural networks that serve as the
>building blocks of rules or algorithms. Individual PDP networks would compute the
>primitive symbol associations (such as matching an input against memory, or pairing
>the input and output of a rule), but the way the overall output of one network feeds
>into the input of another would be isomorphic to the structure of the symbol
>manipulations captured in the statements of rules. Progress in PDP modeling would
>undoubtedly force revisions in traditional models, because traditional assumptions
>about primitive mechanisms may be neurally implausible, and complex chains of
>symbol manipulations may be obviated by unanticipated primitive computational
>powers of PDP networks.

This possibility has definitely come to fruition. Many commercial
systems employing neural nets use this kind of hybrid design, with the
network's output being fed into more conventional algorithms for
post-processing. Neural networks are certainly best suited to a certain
kind of problem. The authors call this model "implementational
connectionism".

We then look at the possibility that PDP networks will replace
symbolic models completely.

>It would be impossible to find a principled mapping between
>the components of a PDP model and the steps or memory structures implicated by a
>symbol-processing theory, to find states of the PDP model that correspond to
>intermediate states of the execution of the program, to observe stages of its growth
>corresponding to components of the program being put into place, or states of
>breakdown corresponding to components wiped out through trauma or loss -- the
>structure of the symbolic model would vanish.

Solving the 'toy' problem of language acquisition, and deeming the
solution a success, does not rely on a purely connectionist approach.
However, if we consider the integration of language processing, and
other toy problems, into something that was attempting to pass TT3, the
use of symbolic tools only to monitor its development seems reasonable.
The fact that a PDP network's hidden units can often only give us a
heuristic clue to its internal state appears to be more analogous to
the way consciousness manifests itself in us.
The authors have called this "eliminative connectionism".

The third prospect is the consideration of all intermediaries between
the previous two extremes. In terms of real world solutions, both of
toy problems and the progression toward TT3, a mixture of models seems
more likely to provide the answers.
Although human consciousness is made of more than just symbol
manipulations, it is also clear that the human mind is adept at
computation, and that a proportion of our reasoning and knowledge is
produced via symbolic manipulation alone.

It is mentioned that, within many theoretical frameworks, language has
been the most concentrated-upon aspect of cognition.

>Many observers thus feel that connectionism, as a radical
>restructuring of cognitive theory, will stand or fall depending on its ability to
>account for human language.

I feel that this comment is not strictly true. As a statistical tool, a
PDP network's 'best fit' nature lends itself well to a number of
mathematical problems that have nothing to do with modelling cognition.

Next, the authors give a brief description of the language acquisition
PDP model defined by Rumelhart-McClelland (RM model).
The system is an attempt to map and learn the past tense
representations of present tense words, in much the same way a child
would. The system suffers from the many difficulties in mastering the
use of regular and irregular verbs that we have as youngsters.
We can see then that this is a toy subset of a toy problem: natural
language processing. The authors are quick to jump on some of the more
outlandish claims made by its creators. It is their view that these
claims are incorrect, and that the system is fundamentally flawed in a
number of areas.

> we analyze the assumptions and consequences of the RM
>model, as compared to those of symbolic theories, and point out the crucial tests that
>distinguish them. In particular, we seek to determine whether the RM model is
>viable as a theory of human language acquisition -- there is no question that it is a
>valuable demonstration of some of the surprising things that PDP models are
>capable of, but our concern is whether it is an accurate model of children.

This statement suggests that the paper is really a demonstration that
the RM model does not learn like a child.

>We will conclude that the claim that parallel distributed
>processing networks can eliminate the need for rules and for rule induction
>mechanisms in the explanation of human language is unwarranted. In particular, we
>argue that the shortcomings are in many cases due to central features of
>connectionist ideology and irremediable; or if remediable, only by copying tenets of
>the maligned symbolic theory. The implications for the promise of connectionism in
>explicating language are, we think, profound.

This statement, however, appears to be making a broader claim: that by
dissecting the RM model the authors can settle the symbols-v-nets
debate and deem the connectionist approach worthless.
Despite the paper's age, theoretically more sophisticated neural
networks were already appearing when it was written, and I definitely
disagree with the notion that by exposing the weaknesses of one system
you can dismiss them all.

The next main section of the paper gives an overview of English verbal
morphology. The importance of past tense acquisition is expanded upon,
and the motivation of determining to what extent we learn to structure
sentences 'parrot fashion' is considered. Rumelhart and McClelland
believe that if language is learnt largely by the manipulation of
lexical components, the idea that children are applying a more
rule-based approach appears more plausible.

>There is one set of "rules" inherent in the generation of the
>past tense in English that is completely outside the mapping that the RM model
>computes: those governing the interaction between the use of the past tense form and
>the type of sentence the verb appears in, which depends on semantic factors such as
>the relationship between the times of the speech act, referent event, and a reference
>point, combined with various syntactic and lexical factors such as the choice of a
>matrix verb in a complex sentence (I helped her leave/*left versus I know she
>left/*leave) and the modality and mood of a sentence (I went/*go yesterday versus I
>didn't go/*went yesterday; If my grandmother had/*has balls she'd be my
>grandfather).

This interesting problem exposes the first flaw in the RM model's claim
to be a purely connectionist model. The problem requires semantic
reasoning to solve, and demonstrates the system's vulnerability to the
frame problem under a shift in time or location.
Pinker and Prince highlight the fact that the RM model submits to this
problem, and that the system must characterise the situations in which
to use a certain tense as one set of semantic and symbolic rules. This
is certainly a major downfall of the RM model, but not of neural
solutions to natural language processing in general.

Pinker & Prince go on to describe a decomposition of the symbolic
representations of morphological and phonological components, before
describing the RM model in greater depth.
It is pointed out that the RM model uses a much simpler mapping, from
the phonetic representation of the lexicon given directly to the
phonetic representation of its past tense form. We also discover that
the RM model is a simple single-layer perceptron, having only a single
layer of weights between input and output neurons.
The authors appear very critical of this simplification of current
symbolic classifications, but I feel this is a bit unfair. As
mentioned, the system encompasses a subset of a subset of human skills,
and as such will have a unique and possibly distinct architecture from
a system that encompassed all natural language skills. Also, the
current rule-based ideas about the decomposition of stems into phonetic
representations may not be an accurate description of the actual
cognitive representation. I think the real measure of success is the
performance of the system in comparison to its human counterpart,
purely on a functional basis. It is impossible to judge its success at
modelling cognitive behaviour, as its usefulness is isolated from all
of the other useful stuff we do, and therefore from all distinguishable
cognitive activity (TT3).
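The single-layer architecture described above can be sketched in a few
lines. Everything here (feature-vector sizes, threshold, learning rate,
helper names) is illustrative, not RM's actual parameters:

```python
# Minimal sketch of a single-layer pattern associator: one weight matrix
# connects binary input features directly to binary output features, with
# no hidden layer in between. All sizes and values are illustrative.
import numpy as np

N_IN, N_OUT = 16, 16          # stand-ins for the 460 Wickelfeature units

W = np.zeros((N_OUT, N_IN))   # the single layer of weights

def predict(x, threshold=0.5):
    """Fire each output unit whose summed weighted input beats a threshold."""
    return (W @ x > threshold).astype(int)

def train_step(x, target, lr=0.1):
    """Perceptron-style update: strengthen links that reduce the error."""
    global W
    W += lr * np.outer(target - predict(x), x)

# One stem -> past-tense feature pairing, as binary feature vectors
stem = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0])
past = np.array([0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0])

for _ in range(20):
    train_step(stem, past)

assert (predict(stem) == past).all()
```

A single pattern like this is trivially learnable; the limits of the
architecture only show up on input-output relations that are not
linearly separable.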

During the next section the authors describe in detail the system the
RM model uses to encode the phonetic stems. This happens in a Fixed
Encoding Network that converts the letters of strings over an alphabet
into trigrams consisting of a combinatory pattern of three characters.
These are known as Wickelphones, and are the primitives of the system.
The RM model comprises 460 nodes, each representing a Wickelfeature,
which is taken from a subset of the complete Wickelphone
representation. Prince & Pinker point out an interesting feature of the
encoding system.

> The input encoder is deliberately designed to activate some incorrect
>Wickelfeatures in addition to the precise set of Wickelfeatures in the stem: specifically, a
>randomly selected subset of those Wickelfeatures that encode the features of the central phoneme
>properly but encode incorrect feature values for one of the two context phonemes. This "blurred"
>Wickelfeature representation cannot be construed as random noise; the same set of incorrect
>Wickelfeatures is activated every time a word is presented, and no Wickelfeature encoding an
>incorrect choice of the central feature is ever activated. Rather, the blurred representation fosters
>generalization. Connectionist pattern associators are always in danger of capitalizing too much on
>idiosyncratic properties of words in the training set in developing their mapping from input to
>output and hence of not properly generalizing to new forms. Blurring the input representations
>makes the connection weights in the RM model less likely to be able to exploit the idiosyncrasies
>of the words in the training set and hence reduces the model's tendency toward conservatism.

The forced error in the system does not appear to suggest a Hebbian
form of learning, and I think the authors feel that this imposes
rule-based classification on the system in quite a fundamental way. In
this case possibly so, because of the encoding method, but neural
networks are well known for over- or under-generalisation. In a
functional representation, it is often necessary to tune the bias and
variance to obtain a 'best fit' solution using a given training set.
Whether or not this kind of tuning is something that maps to the
cognitive process depends on where the distinctions between training
data, encoding and classification are placed.

The next main section of the paper deals with the representational
assumptions that the RM model makes.

> These are the fundamental linguistic assumptions of the RM model:
> - That the Wickelphone/Wickelfeature provides an adequate basis for phonological
> generalization, circumventing the need to deal with strings.
> - That the past tense is formed by direct modification of the phonetics of the root, so that
> there is no need to recognize a more abstract level of morphological structure.
> - That the formation of strong (irregular) pasts is determined by purely phonetic
> considerations, so that there is no need to recognize the notion 'lexical item' to serve as a
> locus of idiosyncrasy.
> - That the regular system is qualitatively the same as the irregular, differing only in the
> number and uniformity of their populations of exemplars, so that it is appropriate to handle
> the whole stem/past relation in a single, indissoluble facility.

The first assumption is addressed with an even more detailed
description of Wickelphonology, stating its major weaknesses.
I agree that this method of representing strings is inadequate. Both
the ambiguous encoding scheme and its inability to generalise and
specialise in an appropriate fashion make it unsuitable.
The most important weakness of Wickelphones is their ability to
represent the linguistically impossible.

> A quintessential unlinguistic map is relating a string to its mirror image
>reversal (this would relate pit to tip, brag to garb, dumb to mud, and so on); although neither
>physiology nor physics forbids it, no language uses such a pattern. But it is as easy to represent
>and learn in the RM pattern associator as the identity map. The rule is simply to replace each
>Wickelfeature ABC by the Wickelfeature CBA. In network terms, assuming link-weights from 0
>to 1, weight the lines from ABC --> CBA at 1 and all the (459) others emanating from ABC at 0.
>...The Wickelphone tells us as little about unnatural avenues of generalization as it does about the
>natural ones.

The authors feel that this is due to the way in which the Wickelphones
attempt to remove any rule-based context at the encoding stage. It is
far harder, they argue, to obtain a string representation of this same
Wickelfeature once it has been encoded than Rumelhart and McClelland
are willing to admit.
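The quoted reversal example is easy to make concrete: over
boundary-marked trigrams, string reversal is just the unit relabelling
ABC -> CBA, so a pattern associator can wire it up exactly as easily as
the identity map, even though no natural language uses such a pattern.
A small sketch (the helper names are mine):

```python
# String reversal as the trigram permutation ABC -> CBA.
def trigrams(word):
    padded = "#" + word + "#"          # '#' marks the word boundaries
    return [padded[i - 1: i + 2] for i in range(1, len(padded) - 1)]

def reverse_map(units):
    return [t[::-1] for t in units]    # the ABC -> CBA permutation

# The trigram set of "pit", pushed through the permutation, is exactly
# the trigram set of its mirror image "tip":
assert sorted(reverse_map(trigrams("pit"))) == sorted(trigrams("tip"))
```

In network terms this is one unit-to-unit wiring with weight 1 per pair,
which is the authors' point: the representation makes an unnatural map
no harder than a natural one.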
The second assumption is again down to the simplicity of the encoding
method. Prince and Pinker feel that the assumption that only a phonetic
description of input and output is necessary discards a level of
morphological abstraction used in human classification.

> Though much of the activation for the affix features eventually is
>contributed by some stem features that cut across many
>individual stems, such as those at the end
>of a word, not all of it is;
>some contribution from the word-specific stem features that are
>well-represented in the input sample can play a role as well. Thus the RM
>model could fail to generate
>any past tense form for a new stem if the stem
>did not share enough features with those stems that
>were encountered in the
>past and that thus grew their own strong links with past tense features.

I agree with this but, to remain faithful to my general opinion of this
critique, we must realise that this is only a weakness of this system,
not of connectionist models in general.

The last two assumptions focus on the differences between the strong and
regular formation systems.

>Morphological classification responds to fairly large-scale measures on
>word structure: is the word a monosyllable? does it rhyme with a key exemplar? does it alliterate
>(begin with the a similar consonant cluster) as an exemplar? Phonological rules look for different
>and much more local configurations: is this segment an obstruent that follows a voiceless
>consonant? are these adjacent consonants nearly identical in articulation? In many ways, the two
>vocabularies are kept distinct:
>we are not likely to find a morphological subclass holding together
>because its members each contain somewhere inside them a pair of
>adjacent obstruents; nor will
>we find a rule of voicing-spread that applies
>only in rhyming monosyllables. If an analytical
>engine is to generalize
>effectively over language data, it can ill afford to look upon morphological
>classification and phonological rules as processes of the same formal type.

This opinion makes sense for a number of reasons, most notably the way
in which we tend to memorise the past tense versions of strong verbs,
but correctly derive the past tense versions of regular forms by rule.

In the next part of the paper we find an evaluation of the system's
performance, which is a more reliable measure of success when talking
about toy problems.

> The bottom-line and most easily grasped claim of the RM
> model is that it
>succeeds at its assigned task: producing the correct past
> tense form. Rumelhart and McClelland
>are admirably open with their test
> data, so we can evaluate the model's achievement quite directly.

Clearly, the authors think it is reasonable, on certain grounds, to
claim the system a success.
These grounds are rightly based on performance, but the notion that a
cognitive process is being emulated is out of the question.
The real problem for the system is generalisation, as pointed out by
the authors:
> In sum, for 14 of the 18 stems yielding incorrect forms, the forms were
>quite removed from the confusions we might expect people to
>make. Taking these with the 6 no->shows, we have 20 out of the 72 test stems
>resulting in seriously wrong forms, a 28% failure rate.
>This is the state
>of the model after it has been trained 190-200 times on each item in a vocabulary
>of 336 regular verbs.
>What we have here is not a model of the mature system.

This is an unacceptable margin of error if the system is going to be
of any practical use.
I tend to think this is a weakness of the Wickelphone representation,
and of the simplicity of the network. Other systems, such as NETtalk,
feature more complex representations of the properties of language, and
are much more effective at converting strings to their morphological
and phonetic representations. Current PDP models also tend to use much
more sophisticated networks, such as multi-layer perceptrons, in which
error signals are propagated back through layers of hidden neurons.
This allows the network to generalise or specialise in a way more
intuitively similar to that in which we deal with credit/blame
assignment.
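As a sketch of what the extra layer buys, here is a tiny multi-layer
perceptron trained by backpropagation on XOR, a mapping no single-layer
net of the RM kind can represent. The layer sizes, seed and learning
rate are all illustrative:

```python
# A minimal two-layer perceptron with backpropagation: error at the output
# is used to assign blame to the hidden layer through the output weights.
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(0, 0.5, (4, 3))   # input (3) -> hidden (4) weights
W2 = rng.normal(0, 0.5, (1, 4))   # hidden (4) -> output (1) weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    h = sigmoid(W1 @ x)               # hidden-layer activations
    return sigmoid(W2 @ h), h

def train_step(x, target, lr=0.5):
    """One backpropagation step for squared error."""
    global W1, W2
    y, h = forward(x)
    delta_out = (y - target) * y * (1 - y)         # output error signal
    delta_hid = (W2.T @ delta_out) * h * (1 - h)   # blame pushed down a layer
    W2 -= lr * np.outer(delta_out, h)
    W1 -= lr * np.outer(delta_hid, x)

# XOR of the first two inputs (third input is a constant bias)
data = [([0, 0, 1], 0), ([0, 1, 1], 1), ([1, 0, 1], 1), ([1, 1, 1], 0)]

def total_error():
    return sum((forward(np.array(x, float))[0][0] - t) ** 2 for x, t in data)

before = total_error()
for _ in range(2000):
    for x, t in data:
        train_step(np.array(x, float), t)
assert total_error() < before   # training reduces error on the toy task
```

The credit/blame step is the `delta_hid` line: the hidden units are
corrected in proportion to how much they contributed to the output error.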

The next section deals with some common objections to arguments based
on linguistic evidence. The most interesting of these is the notion
that because a connectionist model is close to its human counterpart in
the way it makes errors, it must be a good approach to modelling
psychological processes.
>The ability to account for patterns of error is a useful criterion for
>evaluating competing theories each of which can account for successful performance equally well.
>But a theory that can only account for errorful or immature performance, with no account of why
>the errors are errors or how children mature into adults,
> is of limited value (Pinker, 1979, 1984;
>Wexler & Culicover, 1980; Gleitman & Wanner, 1982).

The theme of how children make mistakes is continued, and a thorough
analysis is given. It is suggested that the RM model does not replicate
these phenomena, and this is true. This is less a symptom of the
connectionist model, and more a symptom of the fact that without
grounding in the real world, the definition of what constitutes normal
error is very blurred. On the other hand, if we had a TT3 robot system
that displayed the same kind of language error patterns a child did,
would we have any better way of understanding why these errors occur
and how they are solved?

Finally, we are subjected to a general discussion of the claims made
about the RM model. Concentrating first on the claims about strong
verbs, we have seen already that the interpretation is weak, and not
really adequate for the task at hand.
The rest of the paper tries to question whether or not a connectionist
approach is really a good approach to take in solving this kind of problem.

> The point is that people's inductive inferences depend on variables assigned
>to sets of individuals that pick out some properties and completely ignore others, differently on
>different occasions, depending in knowledge-specific ways on the nature of the inductive
>inference to be made on that occasion.
> Furthermore the knowledge that can totally alter or reverse
>an inductive inference is not just another pattern of trained
> feature correlations, but depends
>crucially on the structured propositional content of the knowledge

This is representative of the symbol grounding problem again. The
knowledge-specific component cannot be given to a PDP system unless it
is anchored to some real-world meaning.
The authors do concede that more complicated PDPs may well provide more
understanding about language acquisition in the future, although a lot of
the negative points they raise about connectionism can only really be
attributed to the RM model.

This archive was generated by hypermail 2.1.4 : Tue Sep 24 2002 - 18:37:30 BST