Speakers
Description
With more than 400 DDI-documented datasets, the catalogue of the Center for Socio-Political Data (CDSP) contains tens of thousands of variables, mainly quantitative survey data collected through structured questionnaires.
With the ultimate goal of producing accurate and consistent training data for a machine learning model (CamemBERT), the CDSP’s engineers launched a working group to tag variables with keywords from the French version of the European Language Social Science Thesaurus (ELSST).
Drawing on experiments with machine learning for classifying data at the variable level, this paper evaluates the ability of machines to process and classify large datasets, while emphasising the accuracy and contextual understanding that human experts provide.
This presentation reports on the methodology developed for the human tagging process, which was designed to minimise bias and produce a harmonised classification.