Speaker
Description
The CESSDA Data Catalogue (CDC) has long supported metadata about Social Science studies in DDI Codebook 1.2.2 and 2.5 forms. Historically, however, metadata in DDI Lifecycle formats has not been supported. This was a pain point for CESSDA’s service providers who work with this format.
The CESSDA Metadata Office (MO) has created a mapping from DDI 3 metadata to CDC UI elements, which was the basis for this work.
For the CDC to ingest Lifecycle metadata, the CESSDA MO has implemented a parser. Implementing Lifecycle support is significantly more complex than Codebook. Simple parsing techniques, such as reading linearly through the document, are not sufficient for Lifecycle, because DDI 3 elements can reference other parts of the document and these references need to be resolved.
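As a rough illustration of why a single linear read is not enough, the sketch below resolves a reference in two passes: it first indexes every identifiable element, then follows the reference to its target. The XML is a simplified, hypothetical structure and not the real DDI Lifecycle schema, and the names `build_index` and `resolve` are invented for this example; it is not the CDC parser's actual code.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical XML: NOT the real DDI Lifecycle schema.
# The citation points at an organisation defined later in the document.
DOC = """
<Study>
  <Citation>
    <CreatorReference><ID>org-1</ID></CreatorReference>
  </Citation>
  <Organizations>
    <Organization id="org-1"><Name>Example Archive</Name></Organization>
  </Organizations>
</Study>
"""


def build_index(root):
    """First pass: map the id of every identifiable element to the element."""
    return {el.get("id"): el for el in root.iter() if el.get("id") is not None}


def resolve(reference, index):
    """Second pass: follow a reference to its target, or None if it dangles."""
    return index.get(reference.findtext("ID"))


root = ET.fromstring(DOC)
index = build_index(root)

creator = resolve(root.find("Citation/CreatorReference"), index)
creator_name = creator.findtext("Name") if creator is not None else None
print(creator_name)
```

Because the referenced element appears after the reference, a parser that only reads forward would reach the citation before it knows what `org-1` refers to; the index built up front avoids that.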
To implement this, the behaviour of the parser needed to be defined programmatically for each XPath. This differs from how the parser worked previously and required significant rewrites to introduce the necessary flexibility. Lambdas were used extensively to associate XPaths with parsing behaviour.
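The idea of associating paths with parsing behaviour can be sketched as a table of lambdas. This is a minimal Python illustration under stated assumptions: the field names, the element names, and the `parse_study` helper are hypothetical, and the paths use the limited XPath subset supported by the standard library's `xml.etree.ElementTree` rather than whatever XPath engine the CDC uses.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical XML: NOT the real DDI Lifecycle schema.
DOC = """
<Study>
  <Title lang="en">Example Survey 2020</Title>
  <Keyword>health</Keyword>
  <Keyword>employment</Keyword>
</Study>
"""

# Each (hypothetical) output field is paired with a path and a lambda that
# turns the list of matched elements into a value.
FIELD_PARSERS = {
    "title": ("Title", lambda els: els[0].text if els else None),
    "title_language": ("Title", lambda els: els[0].get("lang") if els else None),
    "keywords": ("Keyword", lambda els: [el.text for el in els]),
}


def parse_study(root, field_parsers):
    """Evaluate every path against the root and apply its behaviour."""
    return {
        field: behaviour(root.findall(path))
        for field, (path, behaviour) in field_parsers.items()
    }


record = parse_study(ET.fromstring(DOC), FIELD_PARSERS)
print(record)
```

Keeping the behaviour next to the path means that supporting another element is a matter of adding one table entry, and a different metadata format can supply a different table to the same parsing loop.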
Next steps will look at performance optimisations and at how memory use could be reduced when parsing large XML source files. Could this be accomplished with a streaming XML parser?
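To make the streaming question concrete, here is a small sketch using `iterparse` from Python's standard library, which emits elements as they are completed and lets the caller discard each subtree once it has been handled. The XML structure and the `stream_titles` function are hypothetical, and this only shows the general technique, not a proposed design for the CDC.

```python
import io
import xml.etree.ElementTree as ET

# Simplified, hypothetical XML: NOT the real DDI Lifecycle schema.
DOC = b"""<Studies>
  <Study><Title>First</Title></Study>
  <Study><Title>Second</Title></Study>
</Studies>"""


def stream_titles(source):
    """Yield one title per Study without keeping processed subtrees in memory."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "Study":
            yield elem.findtext("Title")
            elem.clear()  # discard the children of the study just handled


titles = list(stream_titles(io.BytesIO(DOC)))
print(titles)
```

One caveat for Lifecycle documents: a reference may point at an element that has not been read yet, so a streaming approach would probably still need a separate index of identifiable elements or a second pass over the file.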