10/95

DOCUMENTATION AND METHODOLOGY FOR FREQUENTLY OCCURRING NAMES IN THE U.S.--1990

The Census Bureau has a primary obligation to protect the
confidentiality of individual responses to the Census.  As part
of this confidentiality commitment, the Census Bureau does not
currently release individual census questionnaires (or any other
information that could identify an individual) until 72 years
after a Decennial Census was taken.  In 1992, the Census Bureau
released the 1920 Census schedules to National Archives.  In
fact, the Census Bureau is so concerned about confidentiality
that name has not been entered into the basic internal electronic data
used to tabulate census results.

However, there have been numerous demands for sunmary data on the
frequency of surnames for genealogical reasons.  Similar interest
has arisen for the frequency of first names by sex.  This data
set attempts to satisfy these demands while still providing
utmost confidentiality of individual results.


BACKGROUND

In the summer of 1990, immediately following the 1990 Decennial
Census, the United States Census Bureau conducted a large scale
survey to measure undercount in the 1990 Census.  This
independent post Census operation (the 1990 Post-Enumeration
Survey--PES) collected items of demographic data (race, sex, age
and NAME) from 377,000 persons living in 165,000 housing units in
5,300 predefined blocks (or block clusters). 

The information acquired from this independent (PES) operation
was matched against actual 1990 Census records for persons living
in those same 5300 blocks plus additional surrounding ring
blocks.  The PES blocks plus the surrounding ring--"the Search
Area"--contained 7.2 million census records replete with name. 
It is this Search Area data set that provides the impetus for the
three name files at this internet site.
  
FILE FORMATS

In July 1995, the Census Bureau placed abridged summary
information from the Search Area on its internet site.  Selected
data from these files have appeared in the print media with the
citation "source--Census Bureau".  Since the documentation
accompanying the original release of this information was
sketchy, we are supplying additional explanatory material about
the limitations of these data.


Each of the three files, (dist.all.last), (dist. male.first), and
(dist female.first) contain four items of data.  The four items
are:    

         (1).  A "Name"
         (2).  Frequency in percent
         (3).  Cumulative Frequency in percent 
         (4).  Rank

In the file (dist.all.last) one entry appears as:

        MOORE       0.312       5.312       9

In our Search Area sample, MOORE ranks 9th in terms of frequency. 
5.312 percent of the sample population is covered by MOORE and
the 8 names occurring more frequently than MOORE.  The surname,
MOORE, is possessed by 0.312 percent of our population sample.


EDITING

Producing that summary line for the name MOORE required a great
deal of program editing.  For example, we immediately realized
that it was necessary to convert the entries MOORE JR, MOORE SR,
and MOORE III in the last name field to MOORE.  For purposes of
consistency we also converted entries such as MOORE JONES or
MOORE-JONES to MOORE.

In addition to those rather simplistic edits, we also examined
each name entry for the possibility of an inversion. (eg: a first
name appearing in the last name field and a last name placed in
the first name area).  Consider a 2 person household with the
entries MOORE   ROBERT, and MOORE   CAROLYN in the name fields. 
From our sample name universe, we can empirically determine the
probability that the inversion (ROBERT   MOORE and CAROLYN  
MOORE) as a far greater probability of being "right" than the
keyed entry.  When the probability that the odds of an inversion
attained odds of 10,000 to 1, the inversion was done.  

Many names can be inverted and sound absolutely right.  For
example, there is absolutely no reason to suspect that HENRY  
THOMAS is wrong and THOMAS    HENRY is preferable.  However, if
HENRY  THOMAS had a spouse listed as HENRY   MARTHA  and a female
child named HENRY   SUSAN, that additional information suffices
to invert the name field for the entire family.

For first names, we considered concatenating entries but finally
decided against it.  Among males the combinations JOHN PAUL and
JOSE LUIS in the first name field were far more frequent than any
other set of spaced names.  We could possibly have formed them as
JOHNPAUL and JOSELUIS.  As a result the male first names JOHN and
JOSE may be marginally overstated. 

The one name that is most affected by our decision not to
concatenate is the grand old name of MARY.  The entries  (MARY
ANN, MARY BETH, MARY CATHERINE, MARY ELLEN, MARY FRANCES, MARY
GRACE  etc) wind up as MARY.  MARY may or may be the most common
first name among American women, but our decision to avoid
concatenation did add a significant number of MARY's.
                      
Finally, we came to the conclusion that the existence of a single
letter (an initial perhaps?) appearing in either the first or
last name field would not qualify as a name; but an initial in
one field would not disqualify the other name field.  For example
the 19th century financier (J      P  MORGAN) has a valid last
name but the letter J does not meet these standards for a first
name.  MUHAMMED    X is an example of a acceptable first name
with an unusable surname.


MISSING DATA

Although the search area contains 7.2 million persons, almost 15
percent of those persons do not provide enough information to
form a name.  In the previous paragraph we provided the situation
where we decided that a single a single letter would not
constitute a name.  Other situation are listed below.


     1. The respondents did not enter a "name" at the top of page
2 of the 1990 Census form, even though names might have appeared
on the roster of page 1.  A name must appear at the top of page 2
for the name to be keyed. 

     2. The respondent may have inadvertently left sex (gender)
off his or her Census record.  In that instance we accept the
last name, but we have no "certain" way of placing the first name
in the male or the female file.  We do not assume that JENNIFER
without a sex designator would be female even though common sense
suggests that this is indeed the case.

     3. A family may have put down a last name for the
householder but not for any other household member.   We may have
the following family JOHN    SMITH,   MARY    (blank),  JOHN JR   
(blank),  ROBERT     (blank),  JENNIFER      (blank),  SUSAN  
(blank).  In that family we have a first and last name for
householder John, but first names only for the remaining 5 family
members.

     4. The keyed name may not follow acceptable form.  Some
examples of invalid entries in either the first or last name
field are: BABY   GIRL,  MR    JONES,   DR    BROWN,  FILIPINO 
FEMALE.  

Each of these situations are responsible for limiting our
original sample of 7.2 million person records down to its present
size of 6.3 million.  The actual number of person records making
up the unabridged files are: 

     File Name             Valid Records      Unique Names  
                         
  1. dist.all.last          6.290,251            88,799 
  2. dist.female.first      3,184,399             4,275
  3. dist.male.first.       3,003,954             1.219 

For purposes of both confidentiality and elimination of data
noise we restricted the number of unique names available at this
internet site to the minimum number of entries that contain 90
percent of the population in that data file.  There is an
extremely small chance that an individual with a truly "unique"
name could have been captured in sample, and is far more likely
for surnames than for first names.   A second basis for limiting
entries is that a smattering of entries exist because of the
combination of bad handwriting coupled with poor typing. 
Consider the entry JOSEHP in dist.male.first.  Although JOSEHP
may be a name, it is much more likely that all of the JOSEHP
entries are really miskeys of JOSEPH.


LIMITATIONS

For the names at the top of the distribution, (SMITH, JOHNSON,
WILLIAMS, JONES etc),  or (MARY, PATRICIA, LINDA, BARBARA etc) or
(JAMES. JOHN, ROBERT, MICHAEL etc) the data speaks for 
themselves.  However as the sample thins, one might draw
conclusions about frequency that are not warranted.

The PES sample intentionally over sampled both Blacks and
Hispanics, and it is likely that the Search Area also contains an
excess of these two groups.  Thus the frequently occurring
surnames: GARCIA, RODRIGUEZ, GOMEZ and WASHINGTON as well as
first names: JUAN, JOSE, GUADALUPE and WILLIE might attain higher
rankings than their actual population numbers within the United
States would warrant.  

But the limitations due to sampling are much more noticeable when
looking at rarely occurring names--especially surnames.  Consider
a surname appearing 63 (out of 6.3 million entries) times in the
file dist.all.last.  Here the frequency would appear as 0.001
percent, but it is possible that that sample frequency may not be
close to "truth".

Ignoring clustering, (persons in the same household usually have
the same surname) the coefficient of variation on a number of
that magnitude would be approximately 12 percent.  But most
people who do not live alone share a last name with other people
in that household.  Thus the 63 persons with that rare name may
be the result of 16 households, which would raise the coefficient
of variation to approximately 25 percent.  

But we are not done.  Even in the last years of the 20th century,
families tend to live close to each other, and it is not
impossible to conceive a situation where all Americans with a
certain surname appear in sample.  Were that situation to occur
it would be possible to overstate the frequency of that surname
name by a factor of 40.  The number 40 arises because the number
of Census records in the sample (6,290,251) is approximately one
fortieth of the United States population.

The fact that a name doesn't appear in these three files does not
mean that it is non existent, only that it is reasonably rare.

In conclusion we do realize that misleading frequencies are much
less likely in  the files (dist.female first) and
(dist.male.first).  Although fathers and one son may share a
first name, brothers almost never share the same first name.


ADDITIONAL INFORMATION

Persons wanting or needing more information about the contents of
these three files can contact David L. Word (dword@census.gov)
301-457-2103 or Randy M. Klear (rklear@census.gov) 301-457-1727.


