Speaker
Description
Microdata provide tremendous value for socioeconomic analysis. However, these data can be hard to discover when their metadata are not sufficiently rich, structured, and optimized. A particular issue for microdata is the semantic discoverability of the information contained in the variable-level metadata (the data dictionary). This paper presents an unsupervised framework that uses large language models (LLMs) to automatically generate Data Documentation Initiative (DDI) variable groups, and thematic descriptions of those groups, from a microdata data dictionary. The framework applies natural language processing (NLP) methods to enrich the context available to the LLM, and proposes self-consistent prompting to automate validation of the generated themes. An AI agent assesses the self-consistency of the LLM's output, automating the quality assurance (QA) process. The automatically generated thematic descriptions of the variables serve as input for lexical search, for generating embeddings that support semantic search, and for microdata recommendations.
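The abstract does not specify how self-consistency is scored, so the following is only a minimal sketch of one common reading: sample the theme assignment several times and measure how often the runs agree for each variable. The function name `self_consistency`, the 0.75 review threshold, and the variable and theme names are all hypothetical, and the LLM call itself is replaced by hard-coded sample outputs.

```python
from collections import Counter


def self_consistency(runs, threshold=0.75):
    """Score agreement across repeated theme assignments.

    runs: list of dicts, one per sampled LLM run, each mapping a
          variable name to the theme label assigned in that run.
    threshold: assumed cut-off below which a variable is flagged for review.
    """
    variables = sorted({var for run in runs for var in run})
    report = {}
    for var in variables:
        # Collect the label this variable received in every run that covered it.
        labels = [run[var] for run in runs if var in run]
        theme, votes = Counter(labels).most_common(1)[0]
        # Agreement is the share of all runs that chose the majority theme.
        agreement = votes / len(runs)
        report[var] = {
            "theme": theme,
            "agreement": agreement,
            "needs_review": agreement < threshold,
        }
    return report


# Hypothetical outputs from three independent prompts over the same dictionary.
runs = [
    {"hhsize": "Demographics", "inc_wage": "Income", "educ_yrs": "Education"},
    {"hhsize": "Demographics", "inc_wage": "Income", "educ_yrs": "Demographics"},
    {"hhsize": "Demographics", "inc_wage": "Employment", "educ_yrs": "Education"},
]

report = self_consistency(runs)
for var, result in report.items():
    print(var, result)
```

In this sketch, variables whose majority theme falls below the threshold are flagged so that a QA step, whether an agent or a human reviewer, can focus on them.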