Bioinformatics workflows are typically composed of numerous programs and stages coupled together loosely by intermediate files. These workflows tend to be quite complex and require much computational time, hence a good workflow must be able to manage intermediate files, guarantee rentrability—the ability to re-enter a partially run workflow and continue from the latest point—and also facilitate easy description of workflows.
We present BioShake: a Haskell Embedded Domain Speciic Language (DSL) (EDSL) for bioinformatics workflows. The use of a language with strong types gives our framework several advantages over existing frameworks (Amstutz et al., 2016; Di Tommaso et al., 2017; Goodstadt, 2010; Köster & Rahmann, 2018; Leipzig, 2016; OpenWDL, 2012; Vivian et al., 2017) (see Table 1 for a high-level comparison):
|Strong static typing||–||✓|
|Cloud computing integration||✓||✓||✓||✓||–||–|
|Cluster integration (Torque)||✓||✓||✓||✓||–||–||✓|
|Cluster integration (Slurm)||✓||✓||✓||✓||–||–|
|Cluster integration (SGE)||✓||✓||✓||✓||–||–|
|Cluster integration (LSF)||✓||✓||✓||–||–|
|Cluster integration DRMAA||✓||–||–|
The type system is strongly leveraged to prevent errors in the workflow construction during compilation—that is during development of the workflow (compile time) and prior to actual execution of the workflow (execution time). Workflows that pass the type checking at compile time are guaranteed free of certain classes of errors that with other workflow frameworks would ordinarily only be detected during execution. Catching errors significantly earlier in the analysis process reduces debugging time, which is especially advantages for bioinformatics as the workflows tend to have long runtimes. To the best of our knowledge, this is the first bioinformatics workflow framework to use strong typing and type inference to prevent specification errors during compile time.
Naming of outputs at various stages of a workflow are abstracted by BioShake. Output at a stage can be explicitly named if they are desired outputs. Thus, the burden of constructing names for temporary files is alleviated. This is similar in spirit to Sadedin, Pope & Oshlack (2012) who also allow abstraction away from explicit filenames.
BioShake builds on top of Shake, an industrial strength build tool also implemented as an EDSL in Haskell. BioShake thus inherits the reporting features, robust dependency tracking, and resumption capabilities offered by the underlying Shake architecture.
Unlike underlying Shake that expects dependencies to be specified (i.e., in a DAG the arrows point from the target back towards the source(s)), BioShake allows forward specification of workflows (i.e., the arrows point forward). As bioinformatics workflows tend to be quite long and mostly linear, this eases the cognitive burden during workflow design and improves readability.
Non-linear workflows are constructed using typical Haskell constructs such as maps and folds. Combinators are available for the most common grouping of outputs together for a subsequent stage. However, as the main data type is recursively defined, outputs of a stage can always be referenced by subsequent stages without explicit non-linear constructs (i.e., the alignments used for variant calling are available for a subsequent variant annotation stage without explicitly introducing non-linearity).
BioShake, in essence, is an EDSL for specifying workflows that compiles down to an execution engine (Shake). In this respect, it resembles other specification languages such as Common Workflow Language (CWL) (Amstutz et al., 2016) and Workflow Description Language (WDL) (OpenWDL, 2012), but executes on top of Shake. Table 1 provides a high level feature overview of BioShake when compared to several other workflow specification language, workflow EDSLs, and execution engines. We will further elaborate on the unique features of BioShake:
Strong type-checking The use of a language with strong types gives our framework several advantages over existing frameworks (Amstutz et al., 2016; Di Tommaso et al., 2017; Goodstadt, 2010; Köster & Rahmann, 2018; Leipzig, 2016; OpenWDL, 2012; Sadedin, Pope & Oshlack, 2012; Vivian et al., 2017). Our framework leverages Haskell’s strong type-checker to prevent many errors that can arise in the specification of a workflow. As an example, file formats are statically checked by the type system to prevent specification of workflows with incompatible intermediate file formats. Furthermore, tags are implemented through Haskell type-classes to allow metadata tagging, allowing various properties of files—such as whether a Binary Alignment Map (BAM) file is sorted—to be statically checked. Thus, a misspecified workflow will simply fail to compile, catching these bugs well before the lengthy execution. As tags are represented in the type system, they are assumed to hold and are not validated at runtime (i.e., an output tagged as sorted and in Binary Alignment Map (BAM) format is not verified to be sorted nor a valid Binary Alignment Map (BAM) file at runtime). Consequently, no runtime overhead is incurred.
Intrinsic and extrinsic building Our framework builds upon the Shake EDSL (Mitchell, 2012), which is a make-like build tool. Similarly to make, dependencies in Shake are specified in an extrinsic manner (called implicit by Leipzig, 2016); that is, a build rule will define its input dependencies based on the output file path. Our EDSL compiles down to Shake rules, but allows the specification of workflows in an intrinsically, whereby the processing chain is explicitly stated and hence no filename based dependency graph needs to be specified. However, as BioShake compiles to Shake, both extrinsic and intrinsic rules can be mixed, allowing a choice to be made to maximise workflow specification clarity. For example, small “side” processing like generation of indices can be specified extrinsically, removing the need for an explicit index step in the workflow specification.
Furthermore, the use of explicit sequencing for defining workflows allows abstraction away from the filename level: intermediate files can be automatically named and managed by BioShake, removing the burden of naming the intermediate files, with only desired outputs requiring explicit naming.
The following examples illustrate various aspects of the BioShake EDSL. Examples 1 and 2 demonstrate the expression of a couple of workflows in the EDSL. Example 3 shows how an end-user interface may be constructed for end-users rather than workflow designers.
Core data types
BioShake is built using a tagless-final style (Carette, Kiselyov & Shan, 2009) around the following datatype:
data a → b where (→):: a → b → a → b infixl 1 →
This datatype represents the conjunction of two stages a and b. As we are compiling to Shake rules, the Buildable class represents a way to build thing of type a by producing Shake actions:
class Buildable a where build:: a → Action ()
Finally, as we are ultimately building files on disk, we use a typeclass to represent types that can be mapped to filenames:
class Pathable a where paths:: a → [FilePath]
A stage—for example align ing and sort ing—is a type in this representation. Such a type is an instance of Pathable as outputs from the stage are files, and also Buildable as the stage is associated with some Shake actions required to build the outputs. We give a simple example of declaring a stage that sorts bam files in Example 4.
Compiling to Shake rules
The workflows as specified by the core data types are compiled to Shake rules, with Shake executing the build process. The distinction between Buildable and Compilable types are that the former generate Shake Action s and the latter Shake Rules. The Compiler therefore extends the Rules monad, augmenting it with some additional state:
type Compiler = State T (S.Set [FilePath]) Rules
The state here captures rules we have already compiled. As the same stages may be applied in several concurrent workflows (i.e., the same preprocessing may be applied but different subsequent processing defined) the set of rules already compiled must be maintained. When compiling a rule, the state is checked to ensure the rule is new, and skipped otherwise. The rule compiler evaluates the state transformer, initialising the state to the empty set:
compileRules :: Compiler () → Rules () compileRules p = evalStateT p mempty
A compilable typeclass abstracts over types that can be compiled:
class Compilable a where compile:: a → Compiler ()
a → b is Compilable if the input and output paths are defined, the subsequent stage a is Compilable, and a → b is Buildable. Compilation in this case defines a rule to build the output paths with established dependencies on the input paths using the build function. For convenience BioShake provides a compileAndWant function that both compiles a workflow to Rules and requests Shake to build it.
These rules can then be executed by Shake using the (simplified type for clarity) function
bioShake :: Rules () → IO ()
Dynamic workflows are those where the number or type of output files for a stage is dependent on the input to the stage. The BioShake abstractions do not directly allow this, however it is still possible by splitting the workflow into two static workflows with Haskell logic in-between. Example 5 demonstates this approach.
BioShake uses tags to ensure type errors will be raised if stages are incompatible. Other workflow systems such as CWL have limited ability to tag outputs and verify inputs using file patterns, however this approach does not scale gracefully to many tags and tags carrying additional metadata. As encoding tags into filenames inheriently imposes an ordering between tags, parsing and generation of names is complicated, and with many tags filesystem limitations on length can be encountered.
We have already seen in Example 4 the use of IsBam to ensure the input file format of Sort is compatible. By convention, BioShake uses the file extension prefixed by Is as tags for filetype, e.g.,: IsBam, IsSam, IsVCF. Other types of metadata are used such as if a file is sorted (Sorted) or if duplicate reads have been removed (DeDuped) or marked (DupsMarked). These tags allow input requirements of sorting or deduplication to be captured when defining stages. It is easy to state implications, for example to consider files to have duplicates marked if duplicates have been removed:
instance Deduped a ⇒ DupsMarked a Properties, where appropriate, can also automatically propagate down the workflow; for example, once a file is DeDuped all subsequent outputs carry the DeDuped tag: instance Deduped a ⇒ Deduped (a → b)
Finally, the tags discussed so far have been empty type classes, however tags can easily carry more information. For example, BioShake uses a Referenced tag to represent the association of a reference genome. This tag is defined
class Referenced where getRef:: FilePath instance Referenced a ⇒ Referenced (a → b)
and allows stages to extract the path to the reference genome, automatically propagating down the workflow allowing identification of the reference at any stage.
EDAM (Ison et al., 2013) is an ontology containing terms and concepts that are prevalent in bioinformatics. As it is a formal ontology, the terms are organised into a hierarchical tree structure, with each term containing reference to parent terms. EDAM can be used with the flat tagging structure introduced in the previous section using Template Haskell (TH)—a metaprogramming language allowing Haskell code to be generated during compilation—to establish the tree. This is different from other systems that support EDAM for symantic annotation such as CWL as the ontology is represented at the type level and hence prevents a class of errors.
BioShake provides the EDAM ontology in the EDAM module. This module provides EDAM terms identified by their short name, along with some Template Haskell (TH) for associating EDAM terms to types. For example, the FASTQ-illumina term (http://edamontology.org/format_1931) is represented by the tag FastqIllumina and a type can be tagged using the is template Haskell function, for example:
import BioShake.EDAM data MyType = MyType $(is ’’MyType ’’FastqIllumina)
The is function declares the given type to be instances of all parents of the EDAM term, allowing tag matching at any level in the hierarchy. For instance, MyType in the above example will also carry the Fastq tag as FastqIllumina is a child of Fastq. EDAM types can be used similarly to tags as described in section ‘Tags’.
Abstracting the execution platform
In Example 4, the Shake function cmd is directly used to execute samtools and perform the build, however it is useful to abstract away from cmd directly to allow the command to be executed instead on (say) a cluster, cloud service, or remote machine. BioShake achieves this flexibility by using free monad transformers to provide a function run—the equivalent of cmd –but where the actual execution may take place via submitting a script to a cluster queue, for example.
To this end, the datatype for stages in BioShake are augmented by a free parameter to carry implementation specific default configuration –e.g., cluster job submission resources. In the running example of sorting a bed file, the augmented datatype is data Sort c = Sort c.
Much of the code necessary for defining a new stage can be automatically written using Template Haskell (TH) leading to succinct definitions of stages, increasing clarity of code and reducing boilerplate. BioShake has Template Haskell (TH) functions for generating instances of Pathable and Buildable, and for managing the tags. Example 6 shows how Template Haskell (TH) is used to simplify the stage definition presented in Example 4.
Results and Discussion
We have presented a framework for describing and executing bioinformatics workflows. The framework is an EDSL in Haskell built on Shake, allowing us to leverage the robustness of Shake and also the power of Haskell’s type system to prevent many types of errors in workflow construction. Preventing errors before execution is particularly beneficial for bioinformatics workflows as they tend to be long running and thus catching errors during compile reduces the debugging time significantly.
Though this library is built around Shake as the execution engine, the core value lies in the unique abstraction and use of types to capture metadata. It is feasible to compile a specification to a different backend instead of Shake, such as Toil (Vivian et al., 2017) or Cromwell (Cromwell, 2015) via CWL (Amstutz et al., 2016) or WDL (OpenWDL, 2012). This would allow leveraging of the cloud and containerisation facilities of Toil and Cromwell. The abstraction used may also be useful in other domains where long data-transformation stages are applied, such as data mining on large datasets.
Though many errors are currently caught by the type system, there are still classes of errors that are not. Notably, the Pathable class instance maps stages to lists of files with unknown length. Thus, the number of files expected to be exchanged between two stages may differ, causing a runtime error. Lists of typed length could be used to prevent this error, however this would increase the complexity for users. BioShake attempts to strike a balance between usability and type safe guarantees.
We have presented a unique EDSL in Haskell for specifying bioinformatics workflows. The Haskell type checker is used extensively to prevent specification errors, allowing many errors to be caught during compilation rather than runtime. To our knowledge, this is the first bioinformatics workflow framework in Haskell, as well as the first formalisation of bioinformatics workflows and their attributes in a type system from the Hindley–Milner family.