Metabolomic datasets in COVID-19 research: a systematic literature review of availability, characteristics, and methodologies
Abstract
The COVID-19 pandemic has accelerated the integration of metabolomics and machine learning (ML) in biomedical research, resulting in the creation of numerous datasets with high potential for reuse. However, information regarding their accessibility, quality, and usability remains scattered and inconsistent. This systematic review aims to identify and evaluate publicly available human metabolomic datasets related to COVID-19, providing detailed information on their main characteristics and how to access them, to inform their potential for reuse in future research. Following PRISMA guidelines and the Kitchenham methodology, we conducted a comprehensive search of the scientific literature and specialized metabolomics repositories, identifying 96 unique datasets. Each dataset was assessed on 15 variables related to data availability, accessibility, collection methodologies, sample sizes, and the extent of participant metadata provided. By offering a structured overview of dataset characteristics, this review aims to support researchers in identifying suitable resources, encourage data reuse, and promote best practices for data sharing and standardization in the context of COVID-19 and metabolomics. These datasets offer significant value for secondary analyses and ML applications, contributing to insights into disease mechanisms, early diagnosis, and patient stratification. Nonetheless, our findings reveal critical limitations, including the underuse of dedicated repositories, frequent unavailability of raw data, lack of standardization in processed data, and insufficient metadata, particularly regarding participant demographics and clinical information. Inconsistencies in data formats and reporting standards further hinder dataset findability, interoperability, and reuse.
To enhance the value and impact of future metabolomic research, we recommend adopting standardized reporting guidelines, improving metadata completeness, ensuring the availability of raw data, and promoting the use of interoperable repositories to facilitate reproducibility, integration, and broader application of shared datasets.