How many words in an "average" person's vocabulary?

Mike Salovesh (t20mxs1@CORN.CSO.NIU.EDU)
Sat, 31 Aug 1996 04:56:39 -0500

It's probably true that most "estimates" of how many different words might
be found in the average person's vocabulary have an *extremely* flimsy
basis. Perhaps the first problem might come when you try to set up some
means of deciding whether two words are different, or one is really just a
derivative of the other. Not to mention a problem a well-trained
anthropologist should have learned to recognize long since: What kind of
sample is adequate for the purpose of discovering ANYTHING about ANY
so-called "average person"? How big should it be, who should be in it,
how do I keep the sampling pool constant and in reasonable correspondence
with what a statistician would do to get a "random sample"?

I did hear Martin Joos (if I remember the spelling correctly) offer what
sounded like a pretty good way to get a handle on a first approximation.
This was during a linguistics institute run during the summer of
1957 by the Linguistic Society of America at the U of Michigan, Ann
Arbor. (I give the long title to make sure nobody confuses the LSA
Linguistic Institutes with the Summer Institute of Linguistics, which is
an entirely different kettle of fish.)

His suggestion was to take the Merriam-Webster Unabridged Dictionary, a
reasonable instrument for speakers of American English, and use what the
dictionary itself says is the total number of listings it has as an
estimate of how many words there are in the language.

Joos pointed out that the estimate would have to be a pretty loose one,
because the editors made a conscious decision to leave out whole classes
of technical terms that they know are part of the language. A dictionary
would be too long for any practical use if it tried to list every separate
species name used by biological taxonomists, the spoken formulas
corresponding to every substance known to chemistry, the labels
astronomers use for every known galaxy, all the first names recorded in
every government registry, and other similar lists. But it's reasonable
to guess that the number of separate entries in the unabridged
Merriam-Webster is somewhere around the right order of magnitude for the
number of words in American English.

The next step would be to take a random number generator (for practical
purposes, a printed table of random numbers would do) and use it to grind
out numbers that could be assigned to page number, column number, and
place of a particular word in that column by a reasonably bias-free
process. (He suggested a whole series of technical refinements to get a
process that would produce a pretty defensible sample of the universe of
all the words listed in the dictionary. For example, he suggested a
neat way of letting a table of random numbers generate the starting point
within the table to be used as the source of the numbers to be used in
creating the sample.)
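As a rough sketch of what that paper-and-pencil procedure amounts to (the page, column, and entry counts below are made-up stand-ins, not the real dictionary's layout):

```python
import random

# Hypothetical layout figures for an unabridged dictionary -- stand-ins,
# not Merriam-Webster's actual page, column, or entry counts.
PAGES = 2700
COLUMNS_PER_PAGE = 3
ENTRIES_PER_COLUMN = 35

def random_location(rng):
    """Draw a (page, column, place-in-column) triple -- the job the
    printed table of random numbers did in 1957."""
    return (rng.randrange(1, PAGES + 1),
            rng.randrange(1, COLUMNS_PER_PAGE + 1),
            rng.randrange(1, ENTRIES_PER_COLUMN + 1))

# Seeding the generator stands in for Joos's trick of letting the table
# of random numbers pick its own starting point.
rng = random.Random(1957)
locations = [random_location(rng) for _ in range(5)]
print(locations)
```

(One of the refinements Joos would presumably have needed: columns don't all hold the same number of entries, so a triple drawn this way isn't perfectly uniform over words.)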

Now retrieve 100 words from the dictionary by Joos's suggested method.
The question then is "how many of these words does the person being
tested recognize?" A reasonable cross-check might be to ask that person
to define each word recognized. Settling on exactly how to separate "hit"
responses from "miss" responses might be difficult, but I have faith that
it could be done so as to produce an end product that is both reliable
and replicable.

Suppose I got 50% of the sample items right. OK, that seems to suggest
that my vocabulary includes something around 50% of the words in the M-W
Unabridged. Simple arithmetic -- multiplying whatever the number of
definitions M-W say they have in their dictionary by .50 -- should now
produce a number that is somewhere in the neighborhood of the number of
words in my vocabulary.
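The arithmetic itself is trivial; a sketch, with a made-up headword count standing in for whatever number Merriam-Webster actually claims:

```python
def estimate_vocabulary(total_entries, hits, sample_size):
    """Scale the fraction of sampled words recognized up to the whole dictionary."""
    return total_entries * hits / sample_size

# 50 hits out of a 100-word sample, against a hypothetical 450,000-entry
# unabridged dictionary:
print(estimate_vocabulary(450_000, 50, 100))  # 225000.0
```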

Joos claimed that he got fascinated with the process and did repeated runs
with himself as subject. He kept coming up with estimates of his own
vocabulary that ran around 200,000 to a quarter of a million words. While
at a cocktail party, Joos talked a stack of fellow linguists into playing
the game. (To keep the game from getting boring, he used lists that were
much shorter than 100 words. He prepared them beforehand, and had copies
of his word lists typed out for use on just such occasions.) Most of the
guests he tested early in the process of the cocktail party showed about
the same vocabulary size as Joos, give or take fifty thousand here or a
hundred thousand there.

Although the numbers for vocabulary sizes that came out of this occasion
sound much larger than I have seen people guess at elsewhere, that
shouldn't be surprising in the setting. Most of the guests were members
of the LSA Board of Directors, or past LSA Presidents, or people used to
running in such company. At that level of linguistics in those days, you
just had to be a lover and collector of words for their own sake. Anyplace
linguists gathered, before Chomsky, was like a convention of Words 'R' Us.
Joos claimed that when he did the same test informally back in Madison
with professorial colleagues from other disciplines at the U of Wisconsin,
he found their vocabulary sizes running in the 50,000 to 100,000 word
range, on average. The few student vocabularies he tested in the same way
ranged considerably lower.

=============== WARNING: At this point, I digress! ===============

I have been carrying an unconfessed burden for nearly forty years. It is
about time I revealed a secret I have kept ever since a notable night at
that Linguistics Institute of 1957.

The cocktail party where Joos trotted out his vocabulary-estimator was
given by Norman A. McQuown. I was there because I was Mac's captive
graduate student at the Institute. I joined in Joos's exercise for the
fun of it. (I was secretly proud that at that party, and in later reruns
I did for myself, my vocabulary tested right up there with all the Big
Linguists.) Mostly, it was my duty to mix and tend the punch, made
according to a favorite recipe of Al Marquardt's in honor of the fact that
we were at his university.

Or rather, Mac and I made the punch according to what we thought the
recipe had to mean. We got all the ingredients right except one: we
didn't know that the punch was supposed to age for at least half a day
after being poured over a fifty-pound chunk of ice. I think we put in
three or four trays of ice cubes instead, and they hadn't melted much when
the guests started showing up. In other words, the stuff had a pretty
incredible alcohol content that needed, but didn't get, a lot of dilution.
The punch was insidiously neutral-TASTING, the day was hot, and people
drank a LOT of it. WHAMMO.

The cocktail party was scheduled just before the annual dinner of the
Society. All those who would be at the head table were at the party,
including the guest of honor (whom I prefer not to name publicly).
If you know a reasonable sample of professional linguists, you know that
most of them could drink any two ordinary people under the table and keep
on going and going and going longer than that drum-banging Battery Bunny.
Not this time, though. By the time they left McQuown's to wander over to
the dinner, none of them were operating at the peak of their intellectual
or physical powers. I have never seen a finer example of mutual support
among professional colleagues: they were holding each other up because
they couldn't have gone anywhere otherwise. The guest of honor actually
fell asleep during dinner, with his head dropping into his untouched soup.
That went almost unnoticed at the head table, however, since the other
guests were having their own problems.

All of which is offered as an extended explanation of why Joos's check of
vocabulary sizes among the cocktail party guests could only be regarded
as more or less reliable for those tested VERY early in the party.

Now that I have revealed what happened to the LSA Board before the 1957
Linguistic Institute Dinner, my conscience can finally rest easy. With
that weight gone, perhaps I may even be able to explain why I'm laughing,
the next time I talk about some of the finest, most distinguished
linguists of the pre-Chomsky era.

(I still have, and occasionally use, the Salovesh/McQuown modified
recipe for that punch. I try not to serve it unless the celebration,
whatever it is, is planned to continue for lots of hours after the punch
runs out. I would not reveal the recipe, even under torture: it should
only be practiced by properly trained adepts.)

============== Whew! That ends the digression! =================

The testing process Joos described and put into use is complicated, and in
its day was quite time-consuming. I don't think that anybody has ever
tried to run it past a well-selected sample of the population. Until
somebody does Joos's test or something similar with all the controls that
survey research people would immediately plug in, we just won't know how
large or small the "average" vocabulary size is.

The best available substitutes are not based on ordinary speech, or on
Joos's kind of intensive probing of an individual's knowledge. Instead,
they are based on word-counts run on published documents: books,
magazines, texts of public speeches, and other written materials. In my
opinion, the basic source materials are not about the language as she is
spoken: they're about what a limited segment of the language community
does when writing for their impression of their target markets. There
are too many intervening factors for me to feel comfortable with that
approach. Nonetheless, that's the best we have.

(No, perhaps I should say that's the most dependable vocabulary counting
*I have heard about* in support of nearly all estimates of vocabulary size.
Maybe there are lots of linguists and other language scholars doing a
better job with different sources of their materials, and I just
haven't heard about it. I no longer consider myself a linguist, and I'm
far from familiar with current practice.)

Today, tests like those suggested by Joos could be done easily and quickly
out of a desktop computer and a good dictionary on CD-ROM. The basic
principles would remain the same. Implementation would be so much simpler
that people should be doing it all the time. In fact, I wouldn't be
surprised to learn that they already are.
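With an electronic word list, the whole sampling step collapses to one library call; a minimal sketch, where the tiny word list is a stand-in for a CD-ROM dictionary's headwords:

```python
import random

# A tiny stand-in for the headword list of a dictionary on CD-ROM.
headwords = ["aardvark", "bellwether", "cataphract", "dirigible",
             "entelechy", "fissiparous", "gallimaufry", "hegemony"]

# random.sample draws distinct entries uniformly -- no pages, columns,
# or printed tables of random numbers required.
sample = random.sample(headwords, 3)
print(sample)
```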

Anyhow, to my eye the Olde Emperors of Counting Words in People's
Vocabularies aren't wearing any fancy new duds for everyone to see. They
may not be wearing any clothes at all. Their published prejudices can't
be taken as the whole truth or nothing but the truth, and I'd rather not
be confused by the spurious numerical accuracy of statements like "the
average high school graduate in the U.S. has a vocabulary of 27,600 plus
or minus 800 words. The calculated value is statistically significant at
the level of p less than .05 per cent." The first thing I want to do with
statements like that is make their authors back down to a more reasonable
statement. Something like "most (or many) people in the U.S. who are over
18 years old use about <dingbat> thousand different words in their normal
daily speech. They are able to recognize (or define, or whatever) the
written form of something around <brumpty> thousand different words". I
see no justification in citing numbers with any more significant figures
than the two examples I just gave. "Thirty thousand", in the present
state of our knowledge, is preferable to "27,600 plus or minus 800"
because no matter how much nicer the latter figure looks on a report, it's
a fake exercise in number magic. I just don't believe in numerology,
that's all.

I do believe it's way past the time when I should have cut this off.

Besides, in precisely 1.76 pages more I will have used every English word
I know at least once, either for readin' or speakin'. How do I know that
I know the words I think I do? What happens when I run out of words I
know? Where was I, anyhow?

Good night, all. Or good morning. Whatever.

-- mike salovesh, anthropology department <>
northern illinois university PEACE !

In the words of "Eliza Doolittle", from My Fair Lady:

"Words, words, words. I get words all day through, first from him, then
from you. Is that all you blighters can do?"

"Never do I ever want to hear another word. There isn't one I haven't
heard."