Provenance Metadata Gathering and Cataloguing of EFIT++ Code Execution

UKAEA Opendata, 1st May 2015, 20th August 2018

CCFE-PR(17)40

I. Lupelli, D.G. Muir, L. Appel, R. Akers, M. Carr, P. Abreu

Preprint Published

Journal publications, as the final product of research activity, are the result of an extensive, complex modelling and data analysis effort. It is therefore of paramount importance to capture the origins and derivation of the published data in order to achieve high levels of scientific reproducibility, transparency, internal and external data reuse, and dissemination. A consequence of the modern research paradigm is that high performance computing and data management systems, together with metadata cataloguing, have become crucial elements of the nuclear fusion scientific data lifecycle. This paper describes an approach to automatically gathering and cataloguing provenance metadata, currently under development and testing at the Culham Centre for Fusion Energy. As a proof-of-principle test, the approach is being applied to EFIT++, a machine-agnostic code that calculates the axisymmetric equilibrium force balance in tokamaks. The proposed approach avoids any code instrumentation or modification. It is based on the observation and monitoring of input preparation, workflow and code execution, system calls, log file data collection and interaction with the version control system. Pre-processing, post-processing, and data export and storage are monitored during the code runtime. Input data signals are captured using a data distribution platform called IDAM. The final objective of the catalogue is to create a complete description of the modelling activity, including user comments, and of the relationship between the data output, the main experimental database and the execution environment. For an intershot or post-pulse analysis (~1000 time slices, 65x65 grid, MPI execution on n=8 cores) of a typical MAST pulse, the runtime overhead caused by the Provenance Metadata Gathering System is less than 10% and the metadata/data size ratio is about 20%, which we consider reasonable in light of the present literature.
A visualization interface based on Gephi for catalogue interrogation will be presented.
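
The idea of gathering provenance by observation, without instrumenting or modifying the code, can be illustrated with a minimal Python sketch. It is not the system described in the paper: the function names (`run_with_provenance`, `git_commit`) and the record fields are hypothetical, Git is assumed as the version control system, and system call tracing and IDAM signal capture are not shown. The wrapper runs an unmodified command, then records the execution environment, the code version, checksums of the input files and the captured log output.

```python
import hashlib
import json
import platform
import subprocess
import sys
import time
from datetime import datetime, timezone


def git_commit(path="."):
    """Return the current commit hash of the repository at `path`, or None."""
    try:
        result = subprocess.run(
            ["git", "-C", path, "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        # Git is not installed, or `path` is not inside a repository.
        return None


def run_with_provenance(command, input_files=(), repo_path="."):
    """Run `command` unmodified and return (return code, provenance record)."""
    # Fingerprint the inputs before the run so they can be identified later.
    inputs = {}
    for name in input_files:
        with open(name, "rb") as handle:
            inputs[name] = hashlib.sha256(handle.read()).hexdigest()

    started = datetime.now(timezone.utc).isoformat()
    t0 = time.perf_counter()
    # The monitored code is only observed from outside: stdout is captured.
    completed = subprocess.run(command, capture_output=True, text=True)
    elapsed = time.perf_counter() - t0

    record = {
        "command": list(command),
        "started_utc": started,
        "elapsed_seconds": elapsed,
        "return_code": completed.returncode,
        "host": platform.node(),
        "platform": platform.platform(),
        "code_version": git_commit(repo_path),
        "input_sha256": inputs,
        "stdout_lines": completed.stdout.splitlines(),
    }
    return completed.returncode, record


if __name__ == "__main__":
    # Stand-in command; a real run would launch the analysis code instead.
    code, record = run_with_provenance(
        [sys.executable, "-c", "print('equilibrium done')"]
    )
    print(json.dumps(record, indent=2))
```

A record of this kind is plain JSON, so it could be stored next to the output data and later loaded into a catalogue or a graph tool for interrogation.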