Identification of Race and Ethnicity in Big Data

Hannah Brückner, NYU Abu Dhabi
Bedoor Alshebli, NYU Abu Dhabi
Julia P. Adams, Yale University

Advances in technology and infrastructure have opened up a wealth of new data sources that allow new approaches to the study of social structure, human behavior, and knowledge production. However, in contrast to survey and census data, big data often comes without information on the personal attributes of individuals, potentially obscuring inequalities of access and representation. For example, we know a fair amount about the gender gap on Wikipedia but almost nothing about the inclusion of people of color (Adams et al. 2019): algorithmic methods can identify gender from first names quite well, whereas the identification of race and ethnicity is less well developed. It is therefore important to assess the quality of existing tools that could help big data researchers pay more attention to race and ethnicity. We compare hand-coded data on a large sample of academics with the same sample classified by a tool that infers ethnicity from last names (Ambekar et al. 2009). We present descriptive results on the accuracy of the classification tool. We then use the two measures of ethnicity to predict the inclusion of academics on Wikipedia, with a focus on ethnic underrepresentation. We conclude with a discussion of the problems and pitfalls of studying diversity and inclusion with big data.

References

Adams, Julia, Hannah Brückner, and Cambria Naslund. Forthcoming. Who Counts as a Notable Sociologist on Wikipedia? Gender, Race and the ‘Professor Test’. Socius: Sociological Research for a Dynamic World.

Ambekar, A., Ward, C., Mohammed, J., Male, S., and Skiena, S. 2009. Name-ethnicity classification from open sources. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Paris, France, June 28 - July 01, 2009). KDD '09. ACM, New York, NY, 49-58. DOI: http://doi.acm.org/10.1145/1557019.1557032
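
As an illustration only, the comparison of hand-coded labels with name-based classifier output described in the abstract could be summarized roughly as follows. This is a minimal sketch, not the authors' procedure: the function name agreement_by_group, the group labels, and the toy data are all hypothetical, and the hand-coded labels are simply assumed to serve as the reference standard.

```python
from collections import Counter


def agreement_by_group(hand_coded, predicted):
    """Compare classifier labels against hand-coded labels.

    Returns the overall share of matching labels and, for each
    hand-coded group, the share of its members that the classifier
    labeled the same way. Hypothetical helper for illustration.
    """
    if len(hand_coded) != len(predicted):
        raise ValueError("label lists must have the same length")
    if not hand_coded:
        raise ValueError("label lists must not be empty")

    # Number of people per hand-coded group.
    totals = Counter(hand_coded)
    # Number of people per hand-coded group whose predicted label matches.
    hits = Counter(h for h, p in zip(hand_coded, predicted) if h == p)

    overall = sum(hits.values()) / len(hand_coded)
    per_group = {group: hits[group] / n for group, n in totals.items()}
    return overall, per_group


# Toy data (assumed, not from the study): six people, three groups.
hand = ["group_a", "group_a", "group_b", "group_b", "group_b", "group_c"]
pred = ["group_a", "group_b", "group_b", "group_b", "group_a", "group_c"]

overall, per_group = agreement_by_group(hand, pred)
print(f"overall agreement: {overall:.2f}")
for group, share in sorted(per_group.items()):
    print(f"{group}: {share:.2f}")
```

Reporting agreement separately for each hand-coded group matters here because a single overall figure can hide poor performance on small groups, which is the kind of underrepresentation the abstract is concerned with.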


 Presented in Session 17. Race and Methodological Inequalities