Using Machine Learning to Digitize a Unique Source of Data from Scandinavia: The Student Yearbooks

Paul Sharp, University of Southern Denmark
Kristin Ranestad, Lund University

Our presentation will discuss a unique historical data source from the Scandinavian countries and how we plan to digitize it using machine learning. The data are contained in student yearbooks, comprising hundreds of thousands of biographies covering the universe of graduates from secondary and tertiary education in Denmark, Norway, and, to a lesser extent, Sweden. Usually published 25 and 50 years after graduation, the yearbooks include detailed information about the careers and lives of graduates since their time at school, based on surveys sent out to entire cohorts. The total number of Danish (1820-1925) and Norwegian (1831-1943) high school biographies is estimated to be over 150,000, while the total number of Swedish, Danish, and Norwegian technical and engineering school biographies (between roughly 1860 and 1920) is estimated to be more than 12,000. The graduates themselves normally wrote the biographies, based on questionnaires, but editors collected, organised, and published the information. The share of graduates (or family members of deceased graduates) who answered the questionnaires was normally very high: up to 99% of the total cohort for the Norwegian yearbook of 1909, for example. It is thus possible to track who received an education, where they studied and travelled, where they worked, what their family background was, and a range of other variables for these individuals throughout their entire lives. These sources will then be matched with full-count census data, which provide coverage of the section of the population that did not receive secondary education, and they can also be linked to modern administrative data. Once the material is digitized, we plan to use these data to study how expertise was transmitted across time and space, and finally to examine the impact of the spread of expertise on socio-economic outcomes.

No extended abstract or paper available

Presented in Session 200. Textual Analysis of Disciplinary Histories: Economics, Sociology and Concepts of Development