Mark Clement, Brigham Young University
Joseph P. Price, Brigham Young University
Recent breakthroughs in handwriting recognition can improve the quality of the 1940 Census data and expand the set of fields available for research. Our handwriting recognition algorithm uses new data augmentation and normalization methods applied to a convolutional neural network that feeds into a long short-term memory (LSTM) network. We also have a unique advantage in having access to a training set of unprecedented size. Census records consist of a set of rows for each person and columns for each of the fields of information for that person. We have developed an algorithm to extract the sub-image in each cell of the census record and match these with the indexed data for that cell. This provides us with a labeled training set of 2.4 billion images from the 1940 census (18 fields × 132 million individuals). We are using our algorithm to re-index the 1940 census, fix mistakes made by the original human indexers, and expand the number of fields that are indexed. We conducted a pilot study on the 1930 census using a small training set and have already achieved a character error rate (CER) of 10.4% for names. We also make use of the FamilySearch Family Tree, a crowdsourced genealogical database that includes a substantial number of individuals linked to the 1940 census. These sources have often been attached to the Family Tree by family members who have access to additional information about these people, which improves the accuracy of the linkages to these sources. We use information from these sources to correct mistakes in the index of the 1940 census and to identify alternative name spellings and nicknames for each individual.
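To clarify the character error rate (CER) metric reported above, here is a minimal sketch: CER is the Levenshtein edit distance between the predicted and reference strings, divided by the reference length. The helper functions and the example names are illustrative assumptions, not code from the authors' pipeline.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(predicted: str, reference: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    return edit_distance(predicted, reference) / len(reference)

# Hypothetical example: two character substitutions in a 10-character name.
print(cer("Jhon Smith", "John Smith"))  # 2 errors / 10 chars -> 0.2
```

A CER of 10.4% for names thus means that, on average, about one character in ten of a transcribed name differs from the reference transcription.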
No extended abstract or paper available
Presented in Session 40. Automatic Handwriting Recognition