Pixel: a digital lab assistant to integrate biological data in multi-omics projects

Background: In biology, high-throughput experimental technologies, also referred to as “omics” technologies, are increasingly used in research laboratories. Several thousand gene expression measurements can be obtained in a single experiment. Researchers routinely face the challenge of annotating, storing, exploring and integrating all the biological information at their disposal. We present here the Pixel web application (Pixel Web App), an original digital assistant to help people involved in a multi-omics biological project. Methods: The Pixel Web App is built with open source technologies and hosted on the collaborative development platform GitHub (https://github.com/Candihub/pixel). It is written in Python using the Django framework and stores all the data in a PostgreSQL database. It is developed in the open and licensed under the BSD 3-clause license. The Pixel Web App is also heavily tested, with both unit and functional tests, strong code coverage and continuous integration provided by CircleCI. To ease the development and deployment of the Pixel Web App, Docker and Docker Compose are used to bundle the application and its dependencies. Results: The Pixel Web App offers researchers an intuitive way to annotate, store, explore and mine their multi-omics results. It can be installed on a personal computer or on a server to fit the needs of many users. In addition, anyone can enhance the application to better suit their needs, either by contributing directly on GitHub (encouraged) or by extending Pixel on their own. Unlike other bioinformatics platforms such as Galaxy, the Pixel Web App does not provide any computational programs to analyze the data. Still, it makes it possible to rapidly integrate existing results and thus holds a strategic position in the management of research data.


In biology, high-throughput (HT) experimental technologies, also referred to as "omics", are routinely used by an increasing number of research teams. The financial costs associated with HT experiments have fallen considerably in the last decade [1], and the trend in HT sequencing (HTS) is now to acquire benchtop machines designed for individual research laboratories (for instance Illumina NextSeq500 or Oxford Nanopore Technologies MinION [2]). The number of HT applications in biology has grown so rapidly in the past decade that it is hard not to feel overwhelmed [3][4]. It seems possible to address any biological question in any organism from an "omics" perspective, provided that the right HT material and method are found. Although HTS is often put at the forefront of "omics" technologies (essentially genomics and transcriptomics [5]), other technologies must be considered. Mass spectrometry (MS), for instance, enables HT identification and quantification of proteins (proteomics). Metabolomics and lipidomics are other applications derived from MS that characterize quantitative changes in small-molecular-weight cellular components [6]. Together, they constitute complementary "omics areas", with the advantage of quantifying distinct levels of cellular components (transcripts, proteins, metabolites, etc.).
Integration of datasets produced by different HT technologies (termed multi-omics datasets) is a challenging task from a statistical and methodological point of view [7]. It involves the manipulation of two different types of data. The first type is the "primary data", which correspond to raw experimental results. These can be FASTQ files for sequencing technologies [8] or mzML files for MS [9]. These files can be stored in public repositories such as SRA [10], GEO [11], PRIDE [12] or PeptideAtlas [13].
Analyses of primary data rely on standard bioinformatics protocols that, for instance, perform quality controls, correct experimental biases or convert files from one format to another. A popular tool to analyse primary data is Galaxy [14], an open web-based platform. "Secondary data" are produced by the analysis of primary data. They can be read counts per gene for HTS results or abundance values per protein for MS results. In multi-omics dataset analysis, combining secondary data is essential to answer specific biological questions. Typical examples are the identification of differentially expressed genes (or proteins) between several cell growth conditions from transcriptomics (or proteomics) datasets, or the identification of cellular functions that are over-represented in a list of genes (or proteins). In that respect, secondary data can be analysed and re-analysed with a multitude of analytical strategies, which introduces the idea of a data analysis cycle. Researchers thus constantly face the challenge of properly annotating, storing, exploring, mining and integrating all the biological data at their disposal in a multi-omics project. This challenge is directly related to the ability to extract as much information as possible from the produced data, but also to the crucial question of doing reproducible research.
A survey published by Nature in 2016 indicates that more than 70% of the researchers questioned had already been unable to reproduce published results, and more than half of them had been unable to reproduce their own experiments [15]. This last point is intriguing. While experimental biology can be subject to random fluctuations that are difficult to control, computational biology should not be. Running the same software on the same input data is expected to give the same results.
In practice, replication in computational science is harder than people generally think (see [16] as an illustration).
We developed the Pixel web application (Pixel Web App) with these ideas in mind. It acts as a digital lab assistant that helps researchers involved in a multi-omics biological project to collaboratively mine and integrate their HT data. The Pixel Web App does not perform analyses on primary data. It focuses on the annotation, storage and exploration of secondary data (see Figure 1). These explorations are critical steps in answering biological questions, and they need to be carefully annotated and recorded so that they can be exploited further in the context of new biological questions. The Pixel Web App helps researchers to specify the information required to replicate multi-omics results. We added an original hierarchical system of tags, which makes it easy to explore and select multi-omics results stored in the

Stack overview
The Pixel Web App provides researchers with an intuitive way to annotate, store, explore and mine their secondary data analyses in multi-omics biological projects. It is built upon mainstream open source technologies (see Figure 2). Source code is hosted on the collaborative development platform GitHub¹ and continuous integration is provided by CircleCI². More precisely, the Pixel Web App uses the Python Django framework. This framework is based on a model-template-view architecture pattern, and data are stored in a PostgreSQL³ database. We have built a Docker image for the Pixel Web App. The other containers, Nginx (to serve the Django application) and PostgreSQL, rely on official Docker images. Each installation or deployment results in the creation and execution of three Docker instances: one for the Pixel Web App, one for the PostgreSQL database and one for the Nginx web server. In case of multiple installations, each trio of Docker instances is fully isolated, meaning that data are not shared across Pixel Web App installations.
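As an illustration of how a Django application is typically pointed at its own PostgreSQL container, the sketch below builds a Django `DATABASES` setting from environment variables. It is not taken from the Pixel source code: the `PIXEL_DB_*` variable names, the default values and the host names are assumptions made for this example. Only the `ENGINE` string is the standard Django PostgreSQL backend.

```python
import os


def database_settings(env=os.environ):
    """Build a Django DATABASES dict from environment variables.

    The PIXEL_DB_* names and their defaults are hypothetical.
    """
    return {
        "default": {
            # Standard Django backend for PostgreSQL.
            "ENGINE": "django.db.backends.postgresql",
            "NAME": env.get("PIXEL_DB_NAME", "pixel"),
            "USER": env.get("PIXEL_DB_USER", "pixel"),
            "PASSWORD": env.get("PIXEL_DB_PASSWORD", ""),
            # Host name of the database container (assumed to be "db").
            "HOST": env.get("PIXEL_DB_HOST", "db"),
            "PORT": env.get("PIXEL_DB_PORT", "5432"),
        }
    }


# Two installations, each configured with its own database host
# (made-up names), so that no data are shared between them.
lab_a = database_settings({"PIXEL_DB_HOST": "db_lab_a"})
lab_b = database_settings({"PIXEL_DB_HOST": "db_lab_b"})
```

With one set of variables per installation, each web container only ever connects to the database container of its own trio.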

Technical considerations
• Docker images
The Pixel Web App is built on the containerization paradigm (see Figure 2). It relies on Docker⁴, a tool that packages an application and its dependencies in an image that is run as a container. Docker helps developers to build self-contained images to run software. These images are downloaded to the host system and used to build the Pixel Web App.
➢ Start all instances (web, db and proxy), recreating the proxy and web instances. Collect all static files from the Django app. These files will be served by the proxy instance.
➢ Migrate the database schema if needed (to preserve existing data).
Note that further technical considerations and full documentation can be found in the GitHub repository associated with the Pixel project⁶.
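The deployment steps listed above can be sketched as a short Python script that assembles the corresponding Docker Compose and Django commands. This is an assumption-laden illustration, not the project's actual deployment script: the service names `web`, `db` and `proxy` come from the text, but the exact commands, the use of `python manage.py` inside the `web` container and the function names are hypothetical. The script only prints the commands unless `dry_run` is set to `False`.

```python
import subprocess


def deployment_commands(compose="docker-compose"):
    """Return, in order, commands matching the steps described in the text."""
    # Assumed way of calling Django management commands in the web container.
    manage = [compose, "exec", "-T", "web", "python", "manage.py"]
    return [
        # Start the database, then (re)create the web and proxy instances.
        [compose, "up", "-d", "db"],
        [compose, "up", "-d", "--force-recreate", "web", "proxy"],
        # Collect the Django static files that the proxy will serve.
        manage + ["collectstatic", "--noinput"],
        # Apply pending schema migrations; existing data are preserved.
        manage + ["migrate", "--noinput"],
    ]


def deploy(dry_run=True):
    """Print each command, and execute it unless dry_run is True."""
    for command in deployment_commands():
        print(" ".join(command))
        if not dry_run:
            # check=True stops the deployment at the first failing step.
            subprocess.run(command, check=True)


deploy(dry_run=True)
```

Keeping the command list separate from its execution makes the sequence easy to review before anything is run on the host.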

Definition of terms: Omics Unit, Pixel and Pixel Set
In the Pixel Web App, the term "Omics Unit" refers to any cellular component, from any organism, that is of interest to the user. The type of Omics Unit depends on the HT experimental technology (transcriptomic, proteomic, metabolomic, etc.) from which primary and secondary datasets were collected and derived (Figure 1A). In this context, classical Omics Units are transcripts or proteins, but any other cellular component can be defined, for instance genomic regions with "peaks" in the case of ChIPseq data analyses [23]. A "Pixel" refers to a quantitative measurement of a cellular activity associated with a single Omics Unit, together with a quality score (see Figure 1A). The quantitative measurement and the quality score result from statistical analyses performed on secondary datasets, e.g. a search for differentially expressed genes [24]. A set of Pixels obtained from a single secondary data analysis of HT experimental results is referred to as a "Pixel Set" (see Figure 1A). Pixel Sets are the central information in the Pixel Web App, and the functionalities to annotate, store, explore and mine multi-omics biological data were designed around this concept (see below).

Functionalities to annotate, store, explore and integrate Pixel Sets
Pixel Sets are obtained from secondary data analyses (see Figure 1A). Their manipulation with the Pixel Web App consists of (i) their annotation, (ii) their storage in a database, (iii) their exploration and (iv) their integration (or mining, see Figure 1C). This represents a cycle of multiple data analyses, which is essential in any multi-omics biological project. These different steps are detailed below.
Pixel Sets. We defined the minimal information necessary for relevant annotation of Pixel Sets (see Figure 3). "Species", "Strain", "Omics Unit Type" and "Omics Area" are mandatory information that must be specified before a new Pixel Set submission (highlighted in blue,
Web App (see Figure 4B). In this file, multiple-choice selections are proposed for the "Species", "Strain", "Omics Unit Type" and "Omics Area" fields. These choices reflect what is currently available in the database and can easily be expanded. The user must fill in the other annotation fields related to the "Experiment", "Analysis" and "Pixeler" information. The Excel file is then bundled into a ZIP archive with the secondary data file (in tab-separated values format) and the user notebook (R Markdown⁸ or Jupyter notebook⁹, for instance) that contains the code used to produce the Pixel Sets from the secondary data file.
Note that the procedure of importing metadata as an Excel file was inspired by the import procedure widely used in GEO [11].
table that comprises all Pixel Sets that match the filter criteria (see A). Particular Pixel Sets can be selected here (for instance "Pixel_C10.txt" and "Pixel_C60.txt"). They will therefore appear in the "Selection" list (see
The Pixel Web App aims to help researchers mine and integrate multiple Pixel Sets stored in the system. We developed a dedicated web interface to explore all the Pixel Sets stored in a particular Pixel instance (see Figure 5). The upper part, named "Selection", lists a group of Pixel Sets selected by the user for further exploration (Figure 5A). The middle part, named "Filters", lists the Pixel database contents with regard to the Species, Omics Unit Types, Omics Areas and Tags annotation fields. The user can select information (Candida glabrata and modified pH here), then search and filter the Pixel Sets stored in the database (Figure 5B).
In that respect, tags made it possible to rapidly retrieve them through the web interface by applying the keywords "Candida glabrata" and "alkaline pH" (Figure 6, Step 1). As we wanted to limit the analysis to the C. glabrata genes potentially involved in yeast pathogenesis, a filter could be used to retain only the Omics Units for which the keyword "pathogenicity" is written in their description field (see Figure 6, Step 2). As a result, a small number of Pixels were selected: 17 in Pixel Set A and 6 in Pixel Set B. The last step consists of integrating the mRNA and protein information (see Figure 6, Step 3). For that purpose, a table comprising the multiple Pixel Sets can be automatically generated and easily exported (this table is provided as supplementary data). We present in Table 1 five genes for which logFC values were obtained at both the mRNA and the protein levels, and for which statistical p-values were significant (< 0.05). Notably, two genes (CAGL0I02970g and CAGL0L08448g, lines 3 and 5 in Table 1
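The three steps of this case study (retrieval by tags, filtering on a keyword in the Omics Unit description, and integration into a multi-Pixel-Set table) can be sketched in plain Python. This is only an illustration of the logic, under stated assumptions: the class and function names are hypothetical and do not reflect the Pixel Web App's actual data model or code, and all identifiers, descriptions and values in the example are made up.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Pixel:
    """One measurement for one Omics Unit, with its quality score."""
    omics_unit: str       # e.g. a gene or protein identifier
    value: float          # quantitative measurement (logFC here)
    quality_score: float  # quality score (p-value here)


@dataclass
class PixelSet:
    """Pixels from one secondary data analysis, plus a few annotations."""
    name: str
    species: str
    omics_area: str
    tags: frozenset = frozenset()
    pixels: list = field(default_factory=list)


def select_by_tags(pixel_sets, species, tags):
    """Step 1: keep the Pixel Sets of one species that carry all the tags."""
    wanted = set(tags)
    return [ps for ps in pixel_sets
            if ps.species == species and wanted <= ps.tags]


def keep_by_keyword(pixel_set, descriptions, keyword):
    """Step 2: keep Pixels whose Omics Unit description has the keyword."""
    kw = keyword.lower()
    return [p for p in pixel_set.pixels
            if kw in descriptions.get(p.omics_unit, "").lower()]


def integrate(filtered, alpha=0.05):
    """Step 3: one row per Omics Unit found in every set with p < alpha.

    `filtered` maps a Pixel Set name to its list of retained Pixels.
    """
    if not filtered:
        return []
    by_unit = {name: {p.omics_unit: p for p in pixels}
               for name, pixels in filtered.items()}
    shared = set.intersection(*(set(d) for d in by_unit.values()))
    rows = []
    for unit in sorted(shared):
        hits = {name: d[unit] for name, d in by_unit.items()}
        if all(p.quality_score < alpha for p in hits.values()):
            row = {"omics_unit": unit}
            for name, p in hits.items():
                row[f"logFC_{name}"] = p.value
                row[f"pvalue_{name}"] = p.quality_score
            rows.append(row)
    return rows


# Made-up identifiers, descriptions and values, for illustration only.
descriptions = {
    "GENE_1": "Putative adhesin, role in pathogenicity",
    "GENE_2": "Ribosomal protein",
    "GENE_3": "Predicted role in pathogenicity",
}
set_a = PixelSet("A", "Candida glabrata", "transcriptomics",
                 frozenset({"alkaline pH"}),
                 [Pixel("GENE_1", 1.8, 0.01), Pixel("GENE_2", 0.2, 0.60),
                  Pixel("GENE_3", -1.1, 0.20)])
set_b = PixelSet("B", "Candida glabrata", "proteomics",
                 frozenset({"alkaline pH"}),
                 [Pixel("GENE_1", 1.2, 0.03), Pixel("GENE_3", -0.9, 0.04)])
set_c = PixelSet("C", "Candida albicans", "transcriptomics",
                 frozenset({"alkaline pH"}), [])

selected = select_by_tags([set_a, set_b, set_c],
                          "Candida glabrata", ["alkaline pH"])
filtered = {ps.name: keep_by_keyword(ps, descriptions, "pathogenicity")
            for ps in selected}
table = integrate(filtered)
```

In this toy example, `table` contains a single row for `GENE_1`, the only Omics Unit that passes the keyword filter and has a p-value below 0.05 in both sets. Requiring significance in every set is one possible design choice; a looser rule would only need a change to the `all(...)` condition.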

In this article, we introduced the principle and the main functionalities of the Pixel Web App. With this application, our aim was to develop a tool that supports, on a daily basis, biological data integration in our multi-omics research projects. In our experience, research studies in which HT experimental strategies are applied require much more time to analyse and interpret the data than to generate them experimentally. Testing multiple bioinformatics tools and statistical approaches is a critical step in fully understanding the meaning of a biological dataset. In this context, the annotation, the storage and the ability to easily explore all the results obtained in a laboratory can be decisive for the success of the entire multi-omics project.
The data model around which the Pixel Web App was developed was conceived as a compromise between a description of the data that is too detailed and precise (which could discourage researchers from systematically using the application after each of their analyses) and one that is too short and approximate (which could prevent anyone from perfectly reproducing the results). Particular attention was also paid to allowing heterogeneous data, i.e. different Omics Unit Types quantified in different Omics Areas, to be stored in a coherent and flexible way. Unlike other bioinformatics platforms such as Galaxy, the Pixel Web App does not provide any computational programs to analyse the data. Still, it makes it possible to explore existing results in a laboratory and to rapidly combine them for further investigations (using, for instance, the Galaxy platform or any other data analysis tool). Therefore, the Pixel Web App holds a strategic position in data management in a research laboratory, as both the starting point and the end point of all new data explorations.
It also ensures data analysis reproducibility and gives constant feedback on the frequency of the data analysis cycles, the nature of the imported and exported datasets, and the full associated annotations. The content of different Pixel Web App instances is thus expected to evolve with time, according to the type of information stored in the system and the scientific interests of the research team. This will be the case for our Pixel Web App instance¹² (from which the case study was obtained), which currently (July 2018) stores more than 20,000 Pixels arising from transcriptomics (microarray and RNAseq technologies) or proteomics (mass spectrometry) technologies applied to two different pathogenic yeasts, Candida glabrata and Candida albicans.
The Pixel Web App is freely available to anyone interested. The initial installation on a personal workstation requires IT support from a bioinformatician, but once this is done, all administration tasks can be performed through the web interface. This is of interest for users with few technical skills. We chose to work exclusively with open source technologies, and our GitHub repository is publicly accessible¹³. We thus hope that the overall quality of the Pixel Web App source code and documentation will be guaranteed over time through the shared contributions of other developers.