I-Balsac: Completing Families with the Help of Automatic Text Recognition

Helene Vezina, Projet BALSAC, Université du Québec à Chicoutimi
Jean-Sebastien Bournival, Projet BALSAC, Université du Québec à Chicoutimi
Christopher Kermorvant, Teklia, Paris and Laboratoire LITIS, Université de Rouen
Marie-Laurence Bonhomme, Teklia, Paris

For the last 50 years, BALSAC has been reconstructing the genealogical lines of the Quebec population of French Canadian descent using marriage records. However, technical limitations have emerged: as we move forward in time, more events are recorded every year, and the task of integrating them into the database grows accordingly. It has become clear that, to pursue the development of the database, we can no longer rely exclusively on manual or semi-automatic operations to digitize, integrate, and link millions of records. Progress in machine learning opens up promising avenues for historical databases. Word recognition algorithms, especially those for handwritten text recognition (HTR), have improved significantly in the past few years. We have just initiated a new project relying on HTR for the transcription of Quebec civil registers. Ultimately, our goal is to process approximately 1.3 million pages of digitized records and extract about 6 million birth and death records from 1850 to 1917. We intend to identify and index the entities contained in each record: names and surnames (of the subject, parents, and spouse), dates, places, and occupations. One of the key issues we face is the quality of the digitized documents, in terms of both preservation and digitization. Moreover, since we cover the whole Quebec territory and a period spanning nearly 70 years, the great diversity across registers in wording and handwriting styles will also be a major challenge. In this paper, we describe the context that led us to rely on HTR for the transcription of Quebec civil records. We provide an overview of our approach, discussing the difficulties encountered and the choices made to overcome them and achieve the best possible results. Lastly, we present preliminary results on HTR operations.

No extended abstract or paper available

Presented in Session 40. Automatic Handwriting Recognition