Chapter 4: Vision

From: Harnad, Stevan (
Date: Fri Feb 28 1997 - 19:54:11 GMT

[These are just copies of my lecture notes: you need to read Chapter
4 of Green et al.'s Cognitive Science (1996 Blackwell) too.]


Vision is probably our most important sense. Over half the brain is
devoted to vision. The same is true of the brains of our primate relatives.

An explanation of what the mind can do with vision is therefore going to
be a big part of the explanation of the mind.

There are two kinds of research on vision:

MACHINE VISION is not intended to help us understand the seeing mind
but to get machines to do helpful things for us using visual input
(e.g., recognise signatures, recognise faces at cash machines, etc.)

The research on HUMAN VISION is intended to help explain
how WE see. Nevertheless, both approaches use computational modeling,
because once someone has a theory that is meant to explain how we see,
the best way to check whether the theory really works is to simulate
it on a computer.

When a computer is used for modeling, it is really like a very powerful
paper and pencil, checking whether the principles of our theory are
correct: checking whether, when given visual inputs of the kind that we
see, the model can do with them what we can do (including recognising
faces and signatures!)

So it should not be surprising that the work on machine vision and
on human vision has a lot in common.


One of the most important things the mind can do is to see an object
as the same object from many different views of it: close, far,
and moved around in many ways. This is called OBJECT CONSTANCY. It
seems trivial to us to be able to see that a cup is a cup no matter
what angle you look at it from, but once you try to get a machine to
do it, the difficulty of this task becomes apparent. Introspection tells
us it's easy, and that there's nothing to explain -- it's obvious that an
object is the same from many angles and distances. But trying to
explain HOW we manage to see that it's the same is very hard.

Remember that the shadow an object casts on your retina changes shape
depending on the position and distance of the object. How do we know
it's the same object? Here's one way we DON'T do it: By memorising
every single view we ever have of it:

The writer Jorge Luis Borges wrote a short story, "Funes the Memorious", about
someone who could remember everything: every instant of his life was
exactly recorded in his mind forever. That may sound like an advantage,
but actually it was a huge handicap -- so huge that a man like Funes
could only exist in fiction. Funes could give a different name to every
number as far as he could count -- 1 would be, say, "Pedro," 2 would be
"Luis," 1,246,937 would be "Jorge," etc. -- but he couldn't do calculations
with the numbers (how could you even add them if each one was unique
and had only its own proper name?).

Funes also had trouble with calling (what we would call) "the dog,
Rover," by the same name when it was in two different locations at two
different times, because for him every instant was unique and
remembered for ever. He could give a proper name to each "snapshot"
instant in his life, just as he could name each number, but he couldn't
find the INVARIANTS in those snapshots -- the things that stayed the
same from snapshot to snapshot.


The ability to detect invariants (the features that stay the same while
others are changing) is the basis of many of the things the mind can do.
Object constancy is based on detecting the invariants in the visual
input that do not change when the object changes its position. Your
brain uses those invariants to "recover" the shape of the object.

When you transform something in some way, some things will change and
other things will stay the same. The "invariant" is the part that stays
the same.

Here is an example. A square is a 2-dimensional shape with all four
sides equal in length. If you transform it by rotating it, the four
sides are still the same length, so the length of the sides is
invariant under a ROTATION transformation. If you move it up and down,
or sideways (TRANSLATION transformation), the length of the sides is
still invariant.

Some features of the square are not invariant under rotation: When
the corners point up and down, we see it as a diamond rather than a
square.

If the transformation were a SCALING transformation (making the object
grow or shrink) then there would still be an invariant: the length
of the lines would grow or shrink, but all 4 lengths would still
be equal to one another.
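
The square example can be checked with a few lines of code. What follows
is only a sketch: the function names (side_lengths, rotate, translate,
scale, all_equal) are made up for illustration and come from no
particular vision system. It shows that rotation and translation leave
the side lengths unchanged, whereas scaling changes them but keeps all
four equal to one another.

```python
import math

def side_lengths(points):
    """Lengths of the sides of a polygon given as a list of (x, y) corners."""
    n = len(points)
    return [math.dist(points[i], points[(i + 1) % n]) for i in range(n)]

def rotate(points, angle):
    """Rotate every corner about the origin by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

def translate(points, dx, dy):
    """Move every corner by the same offset."""
    return [(x + dx, y + dy) for x, y in points]

def scale(points, factor):
    """Grow or shrink the shape about the origin."""
    return [(factor * x, factor * y) for x, y in points]

def all_equal(values, tol=1e-9):
    """True if all values match the first one, up to rounding error."""
    return all(abs(v - values[0]) <= tol for v in values)

# A unit square, and three transformed copies of it.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
rotated = rotate(square, math.radians(45))   # the "diamond" orientation
shifted = translate(square, 3.0, -2.0)
doubled = scale(square, 2.0)

for name, shape in [("rotated", rotated), ("shifted", shifted), ("doubled", doubled)]:
    print(name, [round(s, 6) for s in side_lengths(shape)], all_equal(side_lengths(shape)))
```

The side lengths survive rotation and translation, and only their
equality survives scaling, which is the distinction drawn above.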


When children are very young, they do not know how an object looks from
a viewpoint other than their own. If you show them an object that
is green on one side and red on the other, and then put it down in front
of them with the green side facing them and ask them what it looks
like to a person looking at the other side, they say it's green.

Studies of the effect of brain damage have shown that there are two
kinds of spatial perception: VIEWPOINT-BASED (or egocentric)
and OBJECT-BASED (allocentric) spatial perception. These systems
are separate in the brain because one can be damaged while the other
is spared.

Children first see objects egocentrically (view-based): they do not
know how an object looks from a viewpoint other than their own.

How do we see 3-D shapes as invariant despite the many changes
in the shadows they cast on our retina?

According to a theorem in solid geometry, if you mark three points on
the surface of a 3-D object, and look at the shadow it casts in 2D,
then the whole shape of the object can be recovered from just 2
different 2-D views. This is a view-based invariance. It means that your
visual system (or a machine's) could tell what the shape of a 3-D object
was from just two views of it.


According to one theory of vision, the way we recognise shapes is that
we have TEMPLATES or PROTOTYPES of them stored in our brains, and
whenever we see a new shape, our brain tries to fit it to one of the
prototypes we have stored by deforming the prototype to fit the
shape. Whichever prototype required the least deformation: that's
what we see the object as.
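
The idea of picking the prototype that needs the least deformation can
be sketched in code. This toy version makes a big simplifying
assumption: shapes are tiny 3x3 grids, and "deformation" is just the
number of cells that would have to be flipped to turn the template into
the shape. The templates and the names mismatch and recognise are
invented for the illustration; real template-matching models use much
richer deformations.

```python
# Hypothetical stored prototypes: 3x3 grids, "X" = ink, "." = blank.
TEMPLATES = {
    "L": ["X..",
          "X..",
          "XXX"],
    "T": ["XXX",
          ".X.",
          ".X."],
    "I": [".X.",
          ".X.",
          ".X."],
}

def mismatch(shape, template):
    """Count the cells where the shape and the template differ.

    This count is a crude stand-in for 'amount of deformation'.
    """
    return sum(a != b
               for row_s, row_t in zip(shape, template)
               for a, b in zip(row_s, row_t))

def recognise(shape, templates=TEMPLATES):
    """Return the name of the template that needs the fewest changes."""
    return min(templates, key=lambda name: mismatch(shape, templates[name]))

# A sloppily drawn T: the bottom of the stem has slipped to the right.
sloppy_t = ["XXX",
            ".X.",
            "..X"]

print(recognise(sloppy_t))
```

Here the sloppy shape differs from the "T" template in 2 cells, from
"I" in 4 and from "L" in 6, so it is seen as a T.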

An example of this is alphabetic letter recognition: We all learned to
write cursive script by looking at templates or models of what the
letters look like. We imitated the models, so it is not surprising that
hand-drawn letters are recognisable by a system that has stored templates
for all the letters in the alphabet, and recognises any given letter by
which prototype requires the least deformation to match the letter.

The template matching model of shape recognition works for some shapes
better than others. It works well for facial expressions. We have a
prototypical "happy face" and "sad face," and we see a face as happy or
sad depending on which of the two prototypes it is closer to.


Galton put many faces on transparencies one on top of the other
to find the "prototype" criminal face.

Susan Brennan used a similar technique to do caricatures of faces, by
finding all the ways they departed from the average -- bigger/smaller
nose, wider-set/narrower-set eyes, etc. -- and then increasing the way
the face deviated from the average (slightly bigger than average nose
is made still bigger than average, etc.). Caricatures are more easily
recognised than the real face, suggesting that our brains may really be
doing template-matching in face recognition.
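
The caricature idea (find how a face departs from the average, then
push it further in the same direction) can be sketched as follows. This
is not Brennan's actual program: the faces here are reduced to two
made-up measurements, and the names average_face and caricature and the
exaggeration factor are assumptions for the illustration.

```python
def average_face(faces):
    """Feature-by-feature mean of a list of faces (dicts of measurements)."""
    n = len(faces)
    return {key: sum(face[key] for face in faces) / n for key in faces[0]}

def caricature(face, mean, exaggeration=1.5):
    """Move each feature further from the mean, in the direction it already deviates.

    exaggeration = 1.0 returns the face unchanged; larger values exaggerate more.
    """
    return {key: mean[key] + exaggeration * (face[key] - mean[key]) for key in face}

# Hypothetical measurements in arbitrary units.
faces = [
    {"nose_length": 4.0, "eye_spacing": 6.0},
    {"nose_length": 6.0, "eye_spacing": 6.0},
]
mean = average_face(faces)
print(caricature(faces[1], mean, exaggeration=2.0))
```

The second face has a nose one unit longer than average, so doubling
the deviation gives a nose two units longer than average; its eye
spacing is exactly average and so is left alone.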


According to the work of Irv Biederman (who spoke here last week)
the way we recognise shapes is by seeing them as a combination
of "geons" -- elementary geometric shapes that are invariant
under transformations like rotation, translation,
and scaling.

This can explain a lot of our visual ability, even chicken-sexing.
Chicken-sexing is very difficult; it is said that you have
to study with a "master" chicken-sexer (black belt) for years
to learn it, and it cannot be explained in words.

Biederman did a geon analysis of chicken abdomens and explained to
beginners what they should look for. Ten minutes instruction was
enough to get them to "brown belt" level. When the masters
were shown the geon solution and asked whether that was how they
did it, they said "yes, come to think of it."

This was probably a case of IMPLICIT learning (made EXPLICIT
by the geon analysis).

Geons have object-based, viewpoint-invariant features.

Geons are local features (parts). Shapes also have global features.

Edelman has shown that for unfamiliar shapes, a viewpoint-dependent
2-dimensional template plus a few 2-D views are enough to recover the
shape for both people and machines.

Edelman's theory of similarity is a viewpoint-based theory
rather than an object-based one like Biederman's geon theory.


Gibson stressed the role of invariants under sensorimotor
transformations, such as motion parallax: When you walk along, the
"shadows" of nearer things on your retina change faster than the
shadows of faraway things. The brain picks up that invariance
as part of its capacity to see depth. There are many such invariants
under the transformation of movement (things get bigger as you approach
them, etc.).
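
Motion parallax can be illustrated with a little trigonometry. The
sketch below assumes a simplified situation that is not spelled out
above: a point starts straight ahead of the observer, and the observer
then takes one step sideways. The function name angular_shift is made
up for the illustration.

```python
import math

def angular_shift(distance, sideways_step):
    """Angle (in radians) by which a point that was straight ahead appears
    to shift after the observer moves `sideways_step` to the side."""
    return math.atan2(sideways_step, distance)

# Same one-metre step, two objects at very different distances.
near = angular_shift(distance=2.0, sideways_step=1.0)
far = angular_shift(distance=200.0, sideways_step=1.0)

print(math.degrees(near), math.degrees(far))
```

For the same step, the near object shifts through a much larger angle
than the far one, so how fast something moves across the retina carries
information about how far away it is.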


Face perception is probably special. There are areas in the brain where
injury leaves the patient able to see, but unable to recognise faces
(human faces, and in some cases sheep faces too). "Agnosia" is an
inability to recognise. Prosopagnosia is the inability to recognise
faces and object agnosia is the inability to recognise objects.

Unlike geometrical shapes, which are equally well recognised right-side
up and upside-down, face recognition is almost zero for upside-down
faces: So face recognition is not viewpoint-invariant. This is already
true in newborn infants.
