PeerJ Computer Science Preprints: Databases
https://peerj.com/preprints/index.atom?journal=cs&subject=9700
Databases articles published in PeerJ Computer Science Preprints

GENOMA: a multilevel platform for marine biology
https://peerj.com/preprints/27347 (2018-11-14)
Chiara Colantuono, Marco Miralto, Mara Sangiovanni, Luca Ambrosino, Maria Luisa Chiusano
Next-generation sequencing (NGS) technologies are greatly facilitating the sequencing of whole genomes, leading to the production of different gene annotations, often released both by reference resources (such as NCBI or Ensembl) and by specific consortia. These annotations are in general very heterogeneous and not cross-linked, presenting users with ambiguous knowledge. To give a quick overview of what is available, and to centralize all the genomic information on reference marine species, we set up GENOMA (GENOmes for MArine biology). GENOMA is a multilevel platform that includes all the available genome assemblies and gene annotations for 12 species (Acanthaster planci, Branchiostoma floridae, Ciona robusta, Ciona savignyi, Gasterosteus aculeatus, Octopus bimaculoides, Patiria miniata, Phaeodactylum tricornutum, Ptychodera flava and Saccoglossus kowalevskii). Each species has a dedicated JBrowse instance and web page that summarize the comparison between the available genome versions and gene annotations, and allow all the information to be downloaded directly. Moreover, an interactive table presenting the union of the different gene annotations can be consulted online. Finally, a query page that allows users to search for specific features in one or more annotations simultaneously is embedded in the platform. GENOMA is publicly available at http://bioinfo.szn.it/genoma/.
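The abstract does not describe how the union table is built; the following minimal sketch, with invented file names and a deliberately naive gene-ID matching scheme, only illustrates the general idea of recording which annotation sources report each gene from a set of GFF3 files:

```python
# Minimal sketch (not from the preprint): build a "union" table recording
# which annotation sources report each gene, from two GFF3 files.
# File names and the ID= attribute convention are assumptions.
from collections import defaultdict
import csv

def gene_ids(gff_path):
    """Yield gene IDs from a GFF3 file (rows whose feature type is 'gene')."""
    with open(gff_path) as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            cols = line.rstrip("\n").split("\t")
            if len(cols) == 9 and cols[2] == "gene":
                attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
                if "ID" in attrs:
                    yield attrs["ID"]

sources = {"NCBI": "ncbi.gff3", "Ensembl": "ensembl.gff3"}  # hypothetical paths
union = defaultdict(set)
for source, path in sources.items():
    for gid in gene_ids(path):
        union[gid].add(source)

with open("union_table.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["gene_id"] + list(sources))
    for gid, found_in in sorted(union.items()):
        writer.writerow([gid] + ["yes" if s in found_in else "no" for s in sources])
```

In practice, reconciling annotations from sources such as NCBI and Ensembl requires coordinate- or identifier-mapping rather than raw ID equality; the sketch sidesteps that deliberately.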
GeoSQL Journey - A gamified learning experience to introduce (or demystify) geospatial SQL queries
https://peerj.com/preprints/27247 (2018-10-02)
Romain Sandoz, Sarah Composto, Sandrine Divorne, Olivier Ertz, Jens Ingensand
In a digital world in the making, digital natives develop new learning profiles, interests, and ways of working. At the same time, teachers face students who lack engagement and motivation in rather traditional learning processes, which probably need to be reframed in light of the digital transformation of the education sector. This issue is acute for complex subjects of study such as geospatial SQL, which is used to manipulate the geospatial characteristics of data. Indeed, teachers at the HEIG-VD university have identified common difficulties in both the Media Engineering and Geomatics fields of study. The user-centered approach aims at creating digital products that closely match users' needs through techniques that improve the user experience. Various aspects have to be considered, including emotions. In education, gamification, along with user experience, interface design, and usability best practices, is one promising approach to increase learners' engagement, interest, and motivation. It implements game mechanics within a non-game context in order to motivate learners to accomplish a task and to increase their ability to learn new skills. A gamification layer within a given context, digital or not, acts as a motivational trigger: it helps provide a meaningful, enjoyable, and empowering experience. SQL Island, a project from Kaiserslautern University of Technology, illustrates very well a gamified learning experience for the SQL special-purpose programming language. The GeoSQL Journey project goes further, tackling geospatial SQL to teach in a fun way how to manipulate the geospatial characteristics of data. It is a gamified pedagogical application that introduces students to the practice of geospatial SQL during the first hours or days of a course. Serving as an initiation, it is designed to focus on intrinsic motivation (personal development, quest, challenge and fulfillment), with learning objectives determined and integrated into an engaging and coherent game world and narrative. This paper describes the early conceptual design work of the GeoSQL Journey project. The game mechanics and game interface have been conceived and brought together according to the literature in the domain and best practices on the matter. The next step for this project is to elaborate a testing method that does not yet require developing an application prototype (e.g. organizing a fairly raw tabletop game associated with a classic SQL console), so as to challenge the design with students and teachers and get their feedback. It is also envisioned to evaluate how well existing open source gamification tools and frameworks would suit the development of the first prototype, planned for the 2019-2020 academic year.
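The preprint contains no example queries; purely as a hedged illustration of the kind of geospatial SQL such a course targets, here is a minimal sketch running a PostGIS proximity query from Python (the poi table, its columns, and the connection string are all hypothetical):

```python
# Minimal sketch (not from the preprint): a typical geospatial SQL
# exercise, finding points of interest within 1 km of a location using
# PostGIS. The 'poi' table, its columns, and the DSN are hypothetical.
import psycopg2

QUERY = """
    SELECT name,
           ST_Distance(geom::geography,
                       ST_SetSRID(ST_MakePoint(%s, %s), 4326)::geography) AS meters
    FROM poi
    WHERE ST_DWithin(geom::geography,
                     ST_SetSRID(ST_MakePoint(%s, %s), 4326)::geography,
                     1000)
    ORDER BY meters;
"""

def nearby_pois(lon, lat, dsn="dbname=geosql user=student"):
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(QUERY, (lon, lat, lon, lat))
        return cur.fetchall()

# Example call; coordinates are approximately Yverdon-les-Bains.
for name, meters in nearby_pois(6.64, 46.78):
    print(f"{name}: {meters:.0f} m")
```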
A procedure to manage open access data for post-processing in GIS environment
https://peerj.com/preprints/27227 (2018-09-20)
Lorenzo Benvenuto, Roberto Marzocchi, Ilaria Ferrando, Bianca Federici, Domenico Sguerso
DataBases (DB) are a widespread source of data, useful for many applications in different scientific fields. The present contribution describes an automatic procedure to access, download and store open access data from different sources, to be processed in a GIS environment. In particular, it addresses the authors' specific need to manage both meteorological data (pressure and temperature) and GNSS (Global Navigation Satellite System) Zenith Total Delay (ZTD) estimates. Such data allow the production of Precipitable Water Vapor (PWV) maps, thanks to the so-called GNSS for Meteorology (G4M) procedure, developed with GRASS GIS software ver. 7.4, for monitoring over time and interpreting severe meteorological events. At present, the procedure handles meteorological pressure and temperature data coming from NOAA's Integrated Surface Database (ISD), whereas the ZTD data derive from the RENAG DB, which collects ZTD estimates for 181 GNSS Permanent Stations (PSs) from 1998 to 2015 in the French-Italian boundary region. Several Python scripts have been implemented to manage the download of data from the NOAA and RENAG DBs and their import into a PostgreSQL/PostGIS geoDB, as well as the data processing with GRASS GIS to produce PWV maps. The key features of the data management procedure are its scalability and its versatility across different data sources and contexts. As a future development, a web interface will allow users to interact with the procedure more easily, for both post-processing and real-time data. The data management procedure repository is available at https://github.com/gtergeomatica/G4M-data
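The authors' actual scripts are in the linked repository; purely as a hedged sketch of the download-and-import step, the following assumes the publicly documented ISD-Lite file layout, while the URL pattern, target table, DSN and field scaling are illustrative guesses rather than the authors' code:

```python
# Rough sketch (assumptions, not the authors' code): download one year of
# NOAA ISD-Lite observations for a station and load temperature/pressure
# into a PostgreSQL/PostGIS table.
import gzip
import io
import urllib.request
import psycopg2

BASE = "https://www.ncei.noaa.gov/pub/data/noaa/isd-lite"  # assumed layout

def fetch_isd_lite(usaf, wban, year):
    url = f"{BASE}/{year}/{usaf}-{wban}-{year}.gz"
    raw = urllib.request.urlopen(url).read()
    for line in io.TextIOWrapper(gzip.GzipFile(fileobj=io.BytesIO(raw))):
        f = line.split()
        temp = int(f[4])   # air temperature, tenths of deg C; -9999 = missing
        pres = int(f[6])   # sea-level pressure, tenths of hPa; -9999 = missing
        if temp != -9999 and pres != -9999:
            yield (f"{f[0]}-{f[1]}-{f[2]} {f[3]}:00", temp / 10.0, pres / 10.0)

def load(usaf, wban, year, lon, lat, dsn="dbname=g4m"):
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        for ts, temp_c, pres_hpa in fetch_isd_lite(usaf, wban, year):
            cur.execute(
                """INSERT INTO meteo_obs (ts, temp_c, pres_hpa, geom)
                   VALUES (%s, %s, %s, ST_SetSRID(ST_MakePoint(%s, %s), 4326))""",
                (ts, temp_c, pres_hpa, lon, lat),
            )
```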
Leveraging heterogeneous Cultural Heritage data to promote tourism
https://peerj.com/preprints/27189 (2018-09-11)
Marie-Noelle Bessagnet, Patrick Etcheverry, Anning Lacayrelle, Christophe Marquesuzaà, Landy Rajaonarivo, Philippe Roose, Andre Sales Fonteles, Christian Sallaberry
This paper presents part of the European FEDER project TCVPYR, which aims to promote tourism in the French Pyrenees region by leveraging elements of its cultural heritage. TCVPYR is a multidisciplinary project involving scientists from various domains: computer scientists, geographers, historians and anthropologists. To achieve its goal, some of the TCVPYR researchers are currently collecting georeferenced cultural heritage data in different areas of the Pyrenees. These data, together with data from local governments and open data sources, are intended to feed a mobile application that promotes tourism in the region. The mobile application will allow tourists, but also scientists, to access cultural heritage data in the form of points of interest (POI). Moreover, these POI are to be presented according to the user's profile and environmental context, including spatiotemporal aspects such as their current location. For example, the application may suggest to a tourist an itinerary with several POI, taking into account their interests, means of transport, available time and device features (battery level, internet connection, …). All data collected in this project, as well as the final application, will be published as open data and open source.
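The preprint stays at the architecture level; as a purely illustrative sketch of the kind of context-aware POI filtering it describes (the data, tags and distance threshold are invented):

```python
# Illustrative sketch only (not the TCVPYR implementation): rank points
# of interest by distance from the user and keep those matching the
# user's interests. All names, tags and thresholds are hypothetical.
import math

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance between two WGS84 points, in kilometers."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def suggest(pois, user_lon, user_lat, interests, max_km=5.0):
    """Return (distance_km, poi) pairs near the user, filtered by interest tags."""
    matches = [
        (haversine_km(user_lon, user_lat, p["lon"], p["lat"]), p)
        for p in pois
        if interests & set(p["tags"])
    ]
    return sorted((m for m in matches if m[0] <= max_km), key=lambda m: m[0])

pois = [
    {"name": "Romanesque church", "lon": 0.08, "lat": 43.23, "tags": ["architecture"]},
    {"name": "Thermal baths", "lon": 0.10, "lat": 43.10, "tags": ["spa"]},
]
for dist, poi in suggest(pois, 0.09, 43.22, {"architecture"}):
    print(f"{poi['name']}: {dist:.1f} km")
```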
20 GB in 10 minutes: A case for linking major biodiversity databases using an open socio-technical infrastructure and a pragmatic, cross-institutional collaboration
https://peerj.com/preprints/26951 (2018-05-22)
Anne E Thessen, Jorrit H Poelen, Matthew Collins, Jen Hammock
Biodiversity information is made available through numerous databases that each have their own data models, web services, and data types. Combining data across databases leads to new insights, but is not easy because each database uses its own system of identifiers. In the absence of stable and interoperable identifiers, databases are often linked using taxonomic names. This labor-intensive, error-prone, and lengthy process relies on accessible versions of nomenclatural authorities and fuzzy-matching algorithms.
To approach the challenge of linking diverse data, more than technology is needed. Making concrete progress on finding relationships between biodiversity datasets requires new social collaborations like the Global Unified Open Data Architecture (GUODA), which combines the skills of computer engineers from iDigBio, server resources from the Advanced Computing and Information Systems (ACIS) Lab, global-scale data presentation from EOL, and independent developers and researchers.
This paper discusses a technical solution developed by the GUODA collaboration for faster linking across databases, with a use case linking Wikidata and the Global Biodiversity Interactions database (GloBI). The GUODA infrastructure is a 12-node, high-performance computing cluster with about 192 threads, 12 TB of storage and 288 GB of memory. Using GUODA, 20 GB of compressed JSON from Wikidata was processed and linked to GloBI in about 10-11 minutes. Instead of comparing name strings or relying on a single identifier, Wikidata and GloBI were linked by comparing graphs of biodiversity identifiers external to each system. This method added 119,957 Wikidata links to GloBI, an increase of 13.7% in all outgoing name links in GloBI. Wikidata and GloBI were compared to the Open Tree Taxonomy to examine consistency and coverage. Parsing the Wikidata, Open Tree Taxonomy and GloBI archives and calculating consistency metrics took minutes on the GUODA platform. As a model collaboration, GUODA has the potential to revolutionize biodiversity science by bringing diverse technically minded people together with high-performance computing resources that are accessible from a laptop or desktop. However, participating in such a collaboration still requires basic programming skills.
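The preprint does not list its code here; the following conceptual PySpark sketch (input paths and column names are assumptions) shows the gist of linking two datasets by shared external taxon identifiers rather than name strings:

```python
# Conceptual sketch (not the GUODA code): link two datasets by shared
# external taxon identifiers instead of name strings, using PySpark.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.appName("id-graph-linking").getOrCreate()

# Hypothetical inputs: one row per taxon, with an array of external IDs
# such as "NCBI:9606" or "ITIS:180092".
wikidata = spark.read.json("wikidata_taxa.jsonl")  # columns: qid, external_ids
globi = spark.read.json("globi_taxa.jsonl")        # columns: globi_name, external_ids

wd = wikidata.select("qid", explode("external_ids").alias("ext_id"))
gb = globi.select("globi_name", explode("external_ids").alias("ext_id"))

# A GloBI name and a Wikidata item are linked when their identifier
# graphs share at least one external identifier.
links = wd.join(gb, on="ext_id").select("globi_name", "qid").distinct()
links.write.mode("overwrite").csv("globi_wikidata_links")
```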
Estimating article influence scores for open access journals
https://peerj.com/preprints/26586 (2018-03-01)
Bree Norlander, Peter Li, Jevin D. West
Motivated by a desire to curb "predatory" publishing, we created FlourishOA, a one-stop shop for authors, publishers, funders, librarians, and policy makers to find high-quality, cost-effective Open Access (OA) journals. FlourishOA provides Article Processing Charge and Article Influence (AI) score data for OA journals. AI scores are retrieved from InCites Journal Citation Reports (JCR). However, the FlourishOA database contains thousands of journals not indexed in JCR. In order to provide users with more data, our team gathered five years of citation counts from the Microsoft Academic Graph database via the Microsoft Cognitive Services Academic Knowledge API and used a log-transformed linear regression to predict over 2,500 additional 2015 AI scores.
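The abstract does not spell out the model; a minimal sketch of a log-transformed linear regression from citation counts to AI scores, with toy data and an assumed feature construction, might look like this:

```python
# Minimal sketch (assumed form, not the authors' exact model): fit a
# linear regression of log(AI score) on log(citation counts), then
# predict AI scores for journals that lack one. Data here are invented.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: five years of citation counts per journal
# (rows) and known 2015 Article Influence scores.
citations = np.array([[120, 150, 160, 170, 200],
                      [30, 25, 40, 45, 50],
                      [900, 950, 1000, 1100, 1200]])
ai_scores = np.array([0.8, 0.3, 2.5])

X = np.log1p(citations)   # log-transform the skewed citation counts
y = np.log(ai_scores)
model = LinearRegression().fit(X, y)

# Predict an AI score for a journal not indexed in JCR.
new_journal = np.log1p(np.array([[60, 70, 80, 85, 90]]))
predicted_ai = np.exp(model.predict(new_journal))[0]
print(f"Estimated 2015 AI score: {predicted_ai:.2f}")
```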
Visualizing mutation occurrence using Big Data
https://peerj.com/preprints/3325 (2017-10-05)
Silvana Iuliana Albert
The volume of collected genetic data has been growing exponentially in the past few years, and we need to improve the way we store, analyze and visualize it in order to draw relevant conclusions that could improve people's quality of life. Extracting patterns and predicting future mutations and their impact will rely heavily on the efficient use of Big Data. Often a mutation on its own cannot provide enough information about a disorder or disease. Only by combining the genetic information with the organism's environment can we draw conclusions about the penetrance and expressivity of a mutation. Because many genes can cause a single disease and, at the same time, a single gene can cause multiple diseases, we need to analyze the whole context of a person.
In this work, a distributed solution that provides demographics and metrics about diagnostics and mutations is proposed. Seeing the occurrence of a mutation in a particular geographic region can help medical specialists narrow down the search for a patient's mutations without sequencing the whole genome.
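As an illustration of the kind of per-region metric such a system would compute (the input schema is hypothetical, not the paper's), a PySpark aggregation could look like this:

```python
# Illustrative sketch (not the paper's system): aggregate mutation
# occurrence counts per geographic region with PySpark, the kind of
# metric a map visualization would consume. Schema is hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import count

spark = SparkSession.builder.appName("mutation-demographics").getOrCreate()

# Hypothetical records: one row per observed variant in a patient,
# with columns gene, variant, region, diagnosis.
records = spark.read.csv("variants.csv", header=True)

per_region = (records
              .groupBy("region", "gene", "variant")
              .agg(count("*").alias("occurrences"))
              .orderBy("occurrences", ascending=False))

per_region.write.mode("overwrite").json("mutation_occurrence_by_region")
```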
Procedure to integrate i2b2 and REDCap: a case study at ICSM
https://peerj.com/preprints/3294 (2017-09-28)
Valentina Tibollo, Mauro Bucalo, Danila Vella, Morena Stuppia, Nicola Barbarini, Riccardo Bellazzi
REDCap (Research Electronic Data Capture) is one of the most popular web-based applications for supporting data capture in research studies and registries. i2b2 (Informatics for Integrating Biology and the Bedside) is a widely adopted data warehouse for re-using clinical data for research purposes. A general procedure able to integrate these solutions could facilitate research activities in several institutions. Starting from the principles adopted by the SEINE approach, one of the most successful approaches to i2b2-REDCap integration, we propose a general and flexible ETL (Extract, Transform and Load) procedure for synchronizing an i2b2 project with a REDCap study.
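A hedged sketch of the extract/load steps (not the authors' ETL): export records from a REDCap project via its standard API and insert them as facts into a simplified i2b2 observation_fact table. The concept-code mapping, credentials, field names and reduced column set are assumptions:

```python
# Hedged sketch, not the authors' procedure. Real i2b2 loads also need
# encounter numbers, provider IDs, value types, etc.
import requests
import psycopg2

def export_redcap_records(api_url, token):
    """Export all records of a REDCap project as JSON via the standard API."""
    resp = requests.post(api_url, data={
        "token": token,
        "content": "record",
        "format": "json",
        "returnFormat": "json",
    })
    resp.raise_for_status()
    return resp.json()

def load_into_i2b2(records, dsn="dbname=i2b2"):
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        for rec in records:
            for field, value in rec.items():
                if field == "record_id" or value == "":
                    continue
                # Naive mapping: one concept code per REDCap field.
                cur.execute(
                    """INSERT INTO observation_fact
                       (patient_num, concept_cd, tval_char, start_date)
                       VALUES (%s, %s, %s, NOW())""",
                    (rec["record_id"], f"REDCAP:{field}", str(value)),
                )

records = export_redcap_records("https://redcap.example.org/api/", "API_TOKEN")
load_into_i2b2(records)
```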
Melanoma expression analysis with Big Data technologies
https://peerj.com/preprints/3260 (2017-09-26)
Alicia Fernandez-Rovira, Rocio Lavado-Valenzuela, Miguel Ángel Berciano Guerrero, Ismael Navas-Delgado, José F Aldana-Montes
Melanoma is a highly immunogenic tumor. Therefore, in recent years physicians have incorporated drugs that alter the immune system into their therapeutic arsenal against this disease, revolutionizing the treatment of patients in an advanced stage of the disease. This has led us to explore and deepen our knowledge of the immunology surrounding melanoma, in order to optimize its management. At present, immunotherapy for metastatic melanoma is based on stimulating an individual's own immune system through the use of specific monoclonal antibodies. The use of immunotherapy has meant that many patients with melanoma have survived, and it therefore constitutes a present and future treatment in this field. At the same time, drugs have been developed targeting specific mutations, specifically in BRAF, resulting in strong tumor-regression responses (set in this clinical study at 18 months), as well as a higher percentage of long-term survivors. Changes in gene expression, and their correlation with clinical changes, can be analyzed using the tools provided by the companies that supply gene expression platforms. The gene expression platform used in this clinical study is NanoString, which provides nCounter. However, nCounter has some limitations: the type of analysis is restricted to a predefined set, and introducing clinical features is a complex task. This paper presents an approach that collects the clinical information in a structured database, with a Web user interface for entering this information, including the results of the gene expression measurements, in order to go a step further than the nCounter tool. As part of this work, we present an initial analysis of changes in the gene expression of a set of patients before and after targeted therapy. This analysis was carried out using Big Data technologies (Apache Spark), with the final goal of scaling up to large numbers of patients, even though this initial study has a limited number of enrolled patients (12 in the first analysis). This is not yet a Big Data problem, but the underlying study aims at enrolling 20 patients per year just in Málaga, and the approach could be extended to analyze the 3,600 patients diagnosed with melanoma per year.
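The abstract does not show the analysis code; as a hedged sketch of a pre- versus post-treatment expression comparison in PySpark (input schema and normalization are assumptions, not the study's pipeline):

```python
# Hedged sketch (not the study's pipeline): compute per-gene log2 fold
# change between pre- and post-treatment NanoString counts with PySpark.
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, log2, col

spark = SparkSession.builder.appName("melanoma-expression").getOrCreate()

# Hypothetical input: patient_id, gene, timepoint ('pre'/'post'), count
counts = spark.read.csv("ncounter_counts.csv", header=True, inferSchema=True)

pre = (counts.filter(col("timepoint") == "pre")
       .groupBy("gene").agg(avg("count").alias("pre_mean")))
post = (counts.filter(col("timepoint") == "post")
        .groupBy("gene").agg(avg("count").alias("post_mean")))

fold_change = (pre.join(post, "gene")
               .withColumn("log2_fc", log2(col("post_mean") / col("pre_mean")))
               .orderBy("log2_fc", ascending=False))

fold_change.show(20)
```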
PGxO: A very lite ontology to reconcile pharmacogenomic knowledge units
https://peerj.com/preprints/3140 (2017-08-11)
Pierre Monnin, Clément Jonquet, Joël Legrand, Amedeo Napoli, Adrien Coulet
We present in this article a lightweight ontology named PGxO, together with a set of rules for its instantiation, which we developed as a frame for reconciling and tracing pharmacogenomic (PGx) knowledge. PGx studies how genomic variations impact variations in drug response phenotypes. Knowledge in PGx is typically composed of units that take the form of ternary relationships gene variant–drug–adverse event, stating that an adverse event may occur for patients carrying the gene variant when they are exposed to the drug. These knowledge units (i) are available in reference databases such as PharmGKB, (ii) are reported in the scientific biomedical literature, and (iii) may be discovered by mining clinical data such as Electronic Health Records (EHRs). Therefore, knowledge in PGx is heterogeneously described (i.e., with varying quality, granularity, vocabulary, etc.). It is consequently worthwhile to extract, and then compare, assertions from distinct resources. Using PGxO, one can represent multiple provenances for pharmacogenomic knowledge units, and reconcile duplicates when they come from distinct sources.
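As an illustration of the n-ary pattern such a ternary relationship with provenance suggests (the property and class names under the pgxo: prefix are hypothetical stand-ins, not the actual PGxO axioms):

```python
# Illustrative sketch (not the published PGxO): represent one
# gene variant–drug–adverse event knowledge unit as an n-ary relation
# with two provenance sources, using rdflib.
from rdflib import Graph, Namespace, URIRef, RDF

PGXO = Namespace("http://example.org/pgxo#")  # hypothetical namespace
PROV = Namespace("http://www.w3.org/ns/prov#")

g = Graph()
g.bind("pgxo", PGXO)
g.bind("prov", PROV)

unit = URIRef(PGXO["unit_1"])
g.add((unit, RDF.type, PGXO.PharmacogenomicRelationship))
g.add((unit, PGXO.hasVariant, PGXO["CYP2C9_star3"]))
g.add((unit, PGXO.hasDrug, PGXO["warfarin"]))
g.add((unit, PGXO.hasPhenotype, PGXO["bleeding"]))

# The same unit can carry multiple provenances, enabling the
# reconciliation of duplicates coming from distinct sources.
g.add((unit, PROV.wasDerivedFrom, URIRef("https://www.pharmgkb.org/")))
g.add((unit, PROV.wasDerivedFrom, PGXO["ehr_mining_study_1"]))

print(g.serialize(format="turtle"))
```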