The merits of the serverless approach are well understood and have been applied to biomedical data for a number of years, from genomics (Wilkinson & Almeida, 2014) to image analysis in pathology (Almeida et al., 2012). However, until recently the approach came with the suspicion that either the analytical challenge was too computationally intensive to be tractable as a client-side application, or that a dedicated server-side indexing resource would have to help carry the load. Interestingly, this doubt about the performance of the “cloudification” (Bremer et al., 2016) of large data assets persists even when confronted with a favorable tabulation of execution times, as we did in that report at AMIA 2016. Instead, this architectural argument appears to be one that requires the development of “believe it when I see it” proof-of-concept applications that rely exclusively on the API of the data resource, along the lines recently detailed for the NCI Genomic Data Commons, GDC (Wilson et al., 2017). This argument, and the development of a validating application, were approached here by targeting the Open Health Data resources of the Department of Health of New York State (NY. State of New York-Open Data Health-Health Data NY, 2018). In that data-intensive infrastructure, the core Data Commons requirement, that APIs be able to consume functionalized query languages, is addressed by SoQL (Socrata, 2018). On the one hand, this still falls short of the full Backend-as-a-Service (BaaS) model pursued by Data Commons (Grossman et al., 2016). On the other hand, because of the real-world shortcomings of public health data discussed later in this report, Open Health Data offers the clearest practical assessment of the argument that the BaaS model is viable for any data resource with a REST API able to consume query languages. This argument is currently the subject of a number of novel BaaS implementations, as detailed in the Discussion section.
Although the tool described in this report is being used at Stony Brook University Academic Medical Center to track signs as diverse as opioid overprescription or child obesity in Clinical Informatics bootcamps (Clinical Informatics Bootcamp, 2018), the purpose of this report is solely to describe the implementation methodology. Accordingly, only data in the public domain will be used and all code is provided as open source. Success in achieving this goal will be measured by the ability to deploy the interactive analytics application without requiring the direct management or hosting of servers. This approach to cloud computing, in which the web services are managed and assembled as part of the cloud provision, is designated as “serverless” (Kanso & Youssef, 2017), in the sense that neither the application developer nor the user has to sustain them.
The architecture design for this application starts with OpenHealth (Almeida et al., 2015), which relies on in-browser constructs assembled on the fly by code injection, with the primary source of data served by remote HTTP-REST Application Programming Interfaces (APIs). That original implementation, recalled in Fig. 1B, followed the straightforward API Economy model (Brown, Fishenden & Thompson, 2014) of stateless integration by bringing together data from different sources via REST (Representational State Transfer) APIs. This is also the architecture where the ability to handle large amounts of heterogeneous data comes into question. As noted in the introductory section, this scaling challenge is best pursued with real-world health data sources, with real-world problems such as the lack of referential integrity that is often encountered in OpenData systems. Those practical challenges, the argument goes, would not be accurately assessed by applications targeting synthetic datasets or heavily engineered BigData.
The data used for this study are those of the New York State Statewide Planning and Research Cooperative System (SPARCS) (NY. State of New York-Open Data Health-Health Data NY, 2018), made publicly available by the state’s Department of Health via SoQL APIs (Socrata, 2018). As detailed in the program’s web page at www.health.ny.gov/statistics/sparcs at the time of this writing, “SPARCS is a comprehensive all-payer data reporting system established in 1979 as a result of cooperation between the healthcare industry and government. The system was initially created to collect information on discharges from hospitals. SPARCS currently collects patient level detail on patient characteristics, diagnoses and treatments, services, and charges for each hospital inpatient stay and outpatient (ambulatory surgery, emergency department, and outpatient services) visit; and each ambulatory surgery and outpatient services visit to a hospital extension clinic and diagnostic and treatment center licensed to provide ambulatory surgery services.”
The public tier of the SPARCS dataset accessed by the accompanying application documents 34 variables covering a range of parameters, from demographic and geographic to clinical, including payment information and identification of caregiver. Figure 2 provides a snapshot of the first entry of the over 2 million records for 2016. As the API section below details, this report and the accompanying application do not make any data available: they simply distribute an in-browser computational artifact that engages the application programming interfaces of the Department of Health on behalf of the user (not the application developer). The flat file export of the SPARCS data alone (Table 1) is about 15 GB. Indexing its 34 fields to satisfy joint parameter constraints could have produced a far larger volume. The combination of size and combinatorial indexing is far in excess of what would have been possible to handle through client-side processing alone, the approach followed by the original OpenHealth model (Fig. 1B).
API (application programming interface)
Table 1 lists all of the SoDA (Socrata, 2018) endpoints used by the accompanying application (see Availability). The document in reference details the API specification and the way in which Socrata provides interoperable Open Data infrastructure. For example, the record displayed in Fig. 2 can be obtained by dereferencing the address https://health.data.ny.gov/resource/gnzp-ekau.json?$limit=1.
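The address above can be assembled programmatically. The following is a minimal Python sketch, not the code of the accompanying application: the host and the dataset identifier `gnzp-ekau` come from the example address in the text, while the helper names `soda_url` and `fetch_records` are hypothetical and introduced only for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Host taken from the example address in the text.
BASE_URL = "https://health.data.ny.gov/resource"


def soda_url(dataset_id: str, **params) -> str:
    """Build a request URL; the keyword `limit=1` becomes `$limit=1`.

    `soda_url` is a hypothetical helper, not part of any Socrata library.
    """
    query = {f"${name}": value for name, value in params.items()}
    # safe="$" keeps the `$` prefix of the query parameters unescaped.
    return f"{BASE_URL}/{dataset_id}.json?{urlencode(query, safe='$')}"


def fetch_records(url: str) -> list:
    """Dereference the URL and parse the JSON array it returns (needs network)."""
    with urlopen(url) as response:
        return json.load(response)


print(soda_url("gnzp-ekau", limit=1))
```

Calling `fetch_records(soda_url("gnzp-ekau", limit=1))` would then be expected to return a one-element list holding the record of Fig. 2, assuming the endpoint is still served at that address.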
Availability of serverless application
At an architectural level, the SPARCS application was built on the foundations of the OpenHealth serverless model (Almeida et al., 2015). That architecture corresponds to a cached version of the Web 2.0 AJAX model described in Fig. 1B. As overviewed in the ‘Background’ section, the feasibility of that model is typically limited to applications that integrate moderate data volumes by operating the Data Layer API in a narrowly prescribed manner. This architecture was changed by creating a client-side object with attributes that map to the query language consumed by the SoQL API, as explained in Fig. 1C. The key role of the isomorphic mapping of client-side methods to data-intensive server-side operations is illustrated in Fig. 3 for the count method used to generate the data in Table 1.
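The idea of a client-side object whose attributes map onto query-language clauses can be sketched as follows. This is an illustrative Python analogue under stated assumptions, not the application's own in-browser implementation: the class name `SoqlQuery`, its attributes, and the column name `hospital_county` are hypothetical, and quote doubling is assumed to be the correct escaping for string literals.

```python
from dataclasses import dataclass, field
from urllib.parse import urlencode


@dataclass
class SoqlQuery:
    """Hypothetical client-side object; each attribute mirrors a query clause."""

    endpoint: str
    where: dict = field(default_factory=dict)  # column name -> required value

    def _where_clause(self) -> str:
        # Equality constraints are joined with AND. Single quotes are doubled,
        # the usual SQL-style escaping (assumed here to apply to SoQL as well).
        parts = []
        for name, value in self.where.items():
            escaped = str(value).replace("'", "''")
            parts.append(f"{name}='{escaped}'")
        return " AND ".join(parts)

    def count_url(self) -> str:
        """URL that asks the server to count matching rows, not return them."""
        params = {"$select": "count(*)"}
        if self.where:
            params["$where"] = self._where_clause()
        return f"{self.endpoint}?{urlencode(params, safe='$()*')}"


# Illustrative constraint; the column name is an assumption.
query = SoqlQuery(
    "https://health.data.ny.gov/resource/gnzp-ekau.json",
    {"hospital_county": "Suffolk"},
)
print(query.count_url())
```

The design point is that the client only describes the constraint, while the counting itself is delegated to the remote query engine, so the response stays small no matter how many records match.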
The snapshots in Figs. 3 and 4 illustrate the wide versatility of complex query constraints defined by the operation of the user interface, which is itself assembled in the user’s web browser without download or installation. That development versatility is the functionality that enables the BaaS model associated with the architecture described in Fig. 1C. However, the full measure of the BaaS model will be the operation of the APIs of remote data-intensive resources, as if they were local to the user’s own machine. That confirmation of scalability without loss of real-time interaction can only be verified by operating the application. See Availability in the ‘Methods’ section for the live web-based serverless application and demonstrative webcast video. The key role of the asynchronous NoSQL caching in the browser, IndexedDB, for web-based biomedical informatics has been noted by other researchers (Shi et al., 2015).
Comparison with existing software tools
The development of mobile-first software to traverse open health data is still relatively new. As detailed in our original report on OpenHealth applications (Almeida et al., 2015), this reflects the early stage of development of consumer-facing software for outcomes-driven assessment of Health Care services. The key change is the public availability of large volumes of data-intensive resources that would have been considered too sensitive for publication just 2 years ago, when the original OpenHealth tools were developed. Accordingly, two comparisons with existing tools are in order, on speed and on interactivity, both engaging the same SoQL API exposed by the Department of Health of the state of New York (health.data.ny.gov). The first comparison is straightforward: dereferencing a standard stateless application such as bit.ly/pqiSuffolk has a much longer assembly time, on the order of tens of seconds to a minute, than the approach presented here (Fig. 1C), bit.ly/loadsparcs, which takes less than 10 s and traverses a dataset over 100 times larger. The interactivity comparison is not as quantitatively straightforward because it requires the use of the analytical tools published with the data. That exercise can be approached by dereferencing, for example, health.data.ny.gov/Health/All-Payer-Hospital-Inpatient-Discharges-by-Facilit/srur-4jdu, and noting that the numerical results are not themselves linked to additional analysis where they are used as independent variables.
In summary, the proposed engagement of the data-intensive SPARCS dataset has a clear advantage over approaches that do not use the cached BaaS model. That advantage is proposed here as a definite argument for approaching data-intensive software Commons for research applications by using this model, that is, by mapping server-side to client-side abstractions as a generic backend that goes beyond the conventional stateless architecture of REST APIs. That conclusion, discussed at length in the next section, is particularly well aligned with recent developments in funding agencies promoting the use of interoperable cloud-hosted Research Commons infrastructure (Grossman, 2018b). Putting it plainly, the conventional “API economy” model (Figs. 1A–1B) simply does not work as a client-side application at the SPARCS scale, regardless of the resources available to the machine used to run the web application. By contrast, the new implementation (Fig. 1C) will work regardless of the machine, from high-end desktops to underpowered smartphones.
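The contrast between the two models can be made concrete with a back-of-the-envelope sketch in Python. It is illustrative only: the record total is the figure quoted in this report for the SPARCS data as a whole, the endpoint is simply the one used as an example in the API section, and the page size of 50,000 rows and the column name `age_group` are assumptions.

```python
import math
from urllib.parse import urlencode

ENDPOINT = "https://health.data.ny.gov/resource/gnzp-ekau.json"  # example endpoint
TOTAL_RECORDS = 19_907_183  # record total quoted in this report
PAGE_SIZE = 50_000          # assumed page size, for illustration only


def paged_urls(total: int, page_size: int):
    """Stateless model: one request per page, with rows tallied on the client."""
    for offset in range(0, total, page_size):
        page = {"$limit": page_size, "$offset": offset}
        yield f"{ENDPOINT}?{urlencode(page, safe='$')}"


def grouped_count_url(column: str) -> str:
    """Query-language model: the server groups and counts in a single request."""
    params = {"$select": f"{column},count(*)", "$group": column}
    return f"{ENDPOINT}?{urlencode(params, safe='$(),*')}"


requests_needed = math.ceil(TOTAL_RECORDS / PAGE_SIZE)
print(requests_needed, "paged requests versus 1 grouped request")
print(grouped_count_url("age_group"))
```

Under these assumptions, tallying a single variable in the stateless model means downloading every row across hundreds of requests, whereas the grouped query returns one row per distinct value.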
The Backend-as-a-Service (BaaS) model advanced by recent Data Commons infrastructure (Grossman et al., 2016) is recognized as the scalable route towards Precision Medicine (Jensen et al., 2017). Therefore, the choice of the combination of API language and query engine that would best serve that goal in a FAIR manner (Wilkinson et al., 2016) is a critical design decision. In this study, SoQL (see Methods) was found to provide the necessary read-only interoperability. Naturally, the full BaaS model would require a more comprehensive approach to schema definition and data presentation. While this discussion is beyond the scope of the present report, it may be informative to note that data submission to the NCI Genomic Data Commons, at the time of this writing (as per GDC v1.13.0, Feb 18, 2018), requires the use of GraphQL as the interoperability model of choice for 3rd generation Data Commons infrastructure (Grossman, 2018a). In any case, new longitudinal Population Studies such as the NIH All of Us Research Program (National Institutes of Health, NIH) are bound to require a new approach to interactive analytics able to tackle the scale, diverse data models, and wide institutional distribution of associated cloud-based infrastructure for data-intensive science.
The use of in-browser serverless applications (Web Apps calling data layer APIs directly) was tested with the real-world challenge of assembling web applications capable of traversing the 20 million patient records of the public SPARCS dataset served by New York’s Department of Health. The portability and security of the web app model are a good match to the principles of FAIR Data Commons. The real-world test was that of interactive and open-ended constraint satisfaction on this large data space of well over half a billion individual measurements (34 × 19,907,183 = 676,844,222), complicated by a significant lack of referential integrity. In spite of these obstacles, the isomorphic mapping of client-side operators to remote APIs supporting a full-fledged query language, combined with the native support for vectorized operators of the modern Web browser, was shown to achieve the performance levels required for real-time interactivity. It is therefore concluded that the emerging Data Commons frameworks are particularly well suited for ecosystems of Web applications. This BaaS behavior suggests a solution that overcomes the need for local, or even on-premise, implementations of Biomedical Informatics applications.