Guenter Muehlberger, University of Innsbruck
Kurt Scharr, University of Innsbruck
Dirk Alverman, University of Greifswald
The paper will present results achieved with the open research platform Transkribus (http://transkribus.eu). Transkribus enables researchers to carry out Handwritten Text Recognition (HTR) on modern as well as historical handwritten and printed documents. Moreover, recognized documents can be searched with Keyword Spotting (KWS, Query by String). The goal of the paper is to show how these tools can be exploited for social science and history research. The first case study will deal with land registry documents (ca. 220,000 pages) from the first Austrian cadastre of the 19th century. These documents are in tabular form and contain information about the land owner, his profession, his properties (parcels) and their use (e.g. acre, field, woods, orchard). They are multi-writer documents written in German Kurrent. In parallel with the digitization process, several hundred pages are carefully transcribed and used as training material for creating HTR models. The goal is to achieve a Character Error Rate (CER) of around 5%. We will show the implications of different training strategies (one large model vs. several specialized models). The second case study deals with court records from Germany from the 17th and 18th centuries. The training strategy will be similar to that of the first case study; here, however, we will demonstrate how Keyword Spotting can be used to search across the complete collection of documents. KWS benefits from the internal variants stored by the neural networks of the HTR engine. These internal variants can be presented to the user, which means in practice that the user can decide either to prefer a high recall rate (and accept more false positives) or to focus on more precise results (and accept that some occurrences of the search term may be missed). We will compare the results achieved by KWS with a conventional full-text search.
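The two quantitative notions in the abstract, the Character Error Rate and the recall/precision trade-off in Keyword Spotting, can be illustrated with a minimal Python sketch. It makes several assumptions: CER is taken in its common form, the Levenshtein edit distance between reference and recognized text divided by the reference length; KWS hits are modelled as hypothetical (line id, confidence) pairs filtered by a user-chosen threshold; and all function names and sample strings are invented for illustration. None of this is the Transkribus API or the authors' evaluation code.

```python
from typing import Iterable, List, Set, Tuple


def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions and substitutions
    needed to turn hyp into ref (standard dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(ref, hyp) / len(ref)


Hit = Tuple[str, float]  # (line id, confidence in [0, 1]); an assumed layout


def filter_hits(hits: Iterable[Hit], threshold: float) -> List[Hit]:
    """Keep only hits whose confidence reaches the chosen threshold."""
    return [hit for hit in hits if hit[1] >= threshold]


def precision_recall(hits: Iterable[Hit], relevant: Set[str]) -> Tuple[float, float]:
    """Precision and recall of the returned hits against known true matches."""
    returned = {line_id for line_id, _ in hits}
    true_positives = len(returned & relevant)
    precision = true_positives / len(returned) if returned else 1.0
    recall = true_positives / len(relevant) if relevant else 1.0
    return precision, recall


if __name__ == "__main__":
    # Invented example: one wrong character in a 20-character line.
    print(cer("Wiese und Obstgarten", "Wiese und Obstgarlen"))

    # Invented KWS result list; l1, l2 and l4 really contain the keyword.
    hits = [("l1", 0.95), ("l2", 0.80), ("l3", 0.40), ("l4", 0.30)]
    relevant = {"l1", "l2", "l4"}
    for threshold in (0.75, 0.25):
        print(threshold, precision_recall(filter_hits(hits, threshold), relevant))
```

In this toy data, a high threshold returns only the confident hits (no false positives, but one true occurrence is missed), while a low threshold finds every occurrence at the cost of one false positive, which is the choice the abstract says can be left to the user.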
No extended abstract or paper available
Presented in Session 31. Emerging Methods: Computation/Spatial Econometrics