Christian Møller Dahl, University of Southern Denmark
Emil Sørensen, University of Southern Denmark
Data acquisition is the first step in all empirical research. Historical records are a particularly challenging source because they often require transcription. Manual transcription becomes infeasible as data requirements grow and the number of documents reaches an insurmountable level. We use machine learning and deep learning techniques to automate the digitization process. This provides several advantages over manual transcription: speed, reproducibility, scalability, and quantifiable error rates. This paper describes our pipeline for transforming paper documents into readily usable data with a minimum of human interaction. Our methods are tailored to tabular documents and the transcription of digit sequences. Digits are highly relevant for sources used in quantitative social science because they represent counts, amounts, dates, ages, area codes, etc. We show that coherent point drift methods are effective for segmentation, even for complex tables. We also develop a custom transcription model based on multi-label convolutional neural networks that are trained to transcribe complete digit sequences in one go, i.e. without an intermediate character-level segmentation step. The models are applied to both typed and handwritten sources: US mortality records and Danish death certificates. Our approach is reliable, fast, and scalable, with an end-to-end transcription rate of 100,000 sequences/hour without any performance optimizations. At this speed the models still yield accuracies of 99.65% for typewritten sequences and 96.78% for handwritten sequences with variation in script and clutter. Finally, we discuss techniques that reduce the need for a manually transcribed training sample: simple augmentation, synthetic sampling (sprites and generative adversarial networks), and transfer learning. We find that models trained on synthetic data can perform on par with those trained on large, manually constructed training sets.
In addition, models trained on US mortality data can generalize to UK data through transfer learning, which makes the training data requirements negligible.
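The idea of transcribing a complete digit sequence in one go can be illustrated with a minimal sketch. It assumes one common multi-label layout, in which the network emits a score vector over eleven classes (the digits 0 to 9 plus a blank) for each of a fixed number of positions. The abstract does not specify the actual encoding, so the names `BLANK`, `MAX_LEN`, `encode`, `decode`, `one_hot` and `sequence_accuracy` are hypothetical, and `one_hot` merely stands in for the output of a trained network. The accuracy function counts a sequence as correct only when every digit matches, which is an assumed reading of the reported accuracies.

```python
# Hypothetical layout: one label per position, 11 classes per label.
BLANK = 10    # assumed class index meaning "no digit at this position"
MAX_LEN = 5   # assumed maximum sequence length


def encode(seq: str, max_len: int = MAX_LEN) -> list[int]:
    """Map a digit string to one class label per position, right-padded with BLANK."""
    if len(seq) > max_len or any(c not in "0123456789" for c in seq):
        raise ValueError(f"cannot encode {seq!r}")
    return [int(c) for c in seq] + [BLANK] * (max_len - len(seq))


def one_hot(labels: list[int], n_classes: int = 11) -> list[list[float]]:
    """Stand-in for network output: a score vector per position."""
    return [[1.0 if k == label else 0.0 for k in range(n_classes)] for label in labels]


def decode(scores: list[list[float]]) -> str:
    """Pick the highest-scoring class at each position and drop blanks."""
    labels = [max(range(len(row)), key=row.__getitem__) for row in scores]
    return "".join(str(label) for label in labels if label != BLANK)


def sequence_accuracy(predicted: list[str], truth: list[str]) -> float:
    """Share of sequences transcribed without any error."""
    if len(predicted) != len(truth) or not truth:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)
```

Because every position is predicted jointly from the whole image, no character-level segmentation is needed in this layout; a blank class lets a fixed-size output represent sequences of varying length.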
No extended abstract or paper available
Presented in Session 39. Problems with Data and Measurement