Handwriting Text Recognition & Word Spotting Techniques to Build Individual-Level Historical Demographic Databases. The Barcelona Case.

Joana Maria Pujades Mora, Open University of Catalonia & Center for Demographic Studies, Universitat Autònoma de Barcelona
Alícia Fornés, Computer Vision Center
Josep Lladós, Computer Vision Center
Miquel Valls Fígols, Universitat Autònoma de Barcelona
Gabriel Brea-Martinez, Centre for Economic Demography-Economic History, Lund University

One of the great challenges facing Historical Demography today is integrating handwriting recognition techniques into data collection from primary sources, as a way of taking part in the Big Data revolution. This integration would reduce the time needed to collect and process data from large collections of documents and would offer ever-increasing arrays of information. The aim of this paper is to describe the main document image analysis techniques that have been developed to extract information from handwritten demographic sources in order to create the Barcelona Historical Marriage Database and the Baix Llobregat Demographic Database. These databases have been created by an interdisciplinary team of historians/demographers and computer scientists from the Center for Demographic Studies and the Computer Vision Center, both at the Autonomous University of Barcelona. The specific techniques applied are Keyword Spotting and Handwritten Text Recognition. Word spotting turns out to be more suitable when a document does not have a clear internal structure or when the handwriting style is new to the system. It has been approached in two ways: through structural, learning-free methods (graphs), and through statistical, learning-based methods when training data is available. To train the system, we have included the human in the loop through a bimodal crowdsourcing platform, which integrates two points of view: semantic information and ground-truthing for document analysis. When the document is legible and there is enough training data for a particular handwriting style, however, handwriting recognition techniques are applied. Once the words are recognized, the next step consists of assigning a semantic category to each of them. This process, known as Named Entity Recognition, is key to interpreting the contents and storing the information in semantically accessible knowledge databases. The automatic transcription has been validated through gamesourcing experiences.
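
As a rough illustration of the word spotting idea described above, the following minimal sketch treats it as query-by-example retrieval: every segmented word image is assumed to be already represented by a numeric descriptor, and candidates are ranked by their distance to the descriptor of a query word. The descriptors, the identifiers such as `page1_word07`, the function names, and the choice of cosine distance are all assumptions made for illustration; the abstract does not specify the features or matching scheme actually used by the authors.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector carries no direction; treat it as maximally distant.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def spot_word(query, candidates, top_k=3):
    """Rank candidate word images by descriptor distance to the query.

    `candidates` is a list of (word_id, descriptor) pairs. The descriptors
    are hypothetical: any fixed-length feature vector would do here.
    """
    ranked = sorted(candidates, key=lambda item: cosine_distance(query, item[1]))
    return [word_id for word_id, _ in ranked[:top_k]]


# Toy descriptors standing in for features extracted from word images.
candidates = [
    ("page1_word07", [0.9, 0.1, 0.0]),
    ("page1_word12", [0.1, 0.8, 0.3]),
    ("page2_word03", [0.8, 0.2, 0.1]),
]
query = [1.0, 0.1, 0.0]

hits = spot_word(query, candidates, top_k=2)
print(hits)
```

Because this ranking needs no transcribed training data, it corresponds loosely to the learning-free setting mentioned in the abstract; a learning-based variant would instead train the descriptor or the distance on annotated examples.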


 Presented in Session 40. Automatic Handwriting Recognition