Comparing Task Runners

Python-Based

SCons

SCons Cookbook

Pros

  • python

  • order independent

Cons

  • documentation

  • not explicit
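The "order independent" point can be illustrated without SCons itself: when a build is declared as a dependency graph, the order in which targets are declared does not change the order in which they are built. A minimal sketch using the standard library's graphlib; the target names are hypothetical and nothing here is SCons's own API.

```python
from graphlib import TopologicalSorter

def build_order(declarations):
    """Return a valid build order for (target, prerequisites) pairs."""
    sorter = TopologicalSorter()
    for target, prerequisites in declarations:
        sorter.add(target, *prerequisites)
    return list(sorter.static_order())

# Hypothetical targets, declared "backwards": the program before its objects.
declarations = [
    ("app", ["main.o", "util.o"]),
    ("main.o", ["main.c"]),
    ("util.o", ["util.c"]),
]

order = build_order(declarations)
# Declaring the same targets in the opposite order still yields a valid build order.
reversed_order = build_order(reversed(declarations))
```

In both cases every prerequisite appears before the target that needs it, which is the property the declaration order cannot break.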

Toil

Toil GitHub

Concepts

  • leader : decides which jobs to run by traversing the job graph

  • job store : handles files shared between components, maintains state

  • worker : temporary process that runs a job and can go on to run its successors

  • batch system : schedules jobs

  • node provisioner : creates worker nodes

  • stats and logger :

  • jobs : atomic unit of work

  • workflow : extends job

  • jobDescription : metadata
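How the leader, workers, and job store fit together can be sketched in plain Python. This is a toy model under stated assumptions, not Toil's API: the function name, the job names, and the in-memory "job store" list are all hypothetical, and a real leader hands jobs to a batch system rather than calling them directly.

```python
from collections import deque

def run_workflow(successors, actions):
    """Run each job once all of its predecessors have finished.

    successors maps a job name to the jobs that may start after it;
    actions maps every job name to a zero-argument callable (the unit of work).
    Returns the job names in completion order, standing in for the state
    a job store would persist.
    """
    # Leader: count how many predecessors each job is still waiting on.
    waiting_on = {job: 0 for job in actions}
    for following in successors.values():
        for nxt in following:
            waiting_on[nxt] += 1

    ready = deque(job for job, count in waiting_on.items() if count == 0)
    completed = []
    while ready:
        job = ready.popleft()
        actions[job]()          # worker: run the job
        completed.append(job)   # job store: record that it finished
        for nxt in successors.get(job, ()):
            waiting_on[nxt] -= 1
            if waiting_on[nxt] == 0:
                ready.append(nxt)   # successor is now runnable
    return completed

# Hypothetical four-job workflow: fetch, then align and index, then report.
log = []
names = ["fetch", "align", "index", "report"]
actions = {name: (lambda n=name: log.append(n)) for name in names}
successors = {"fetch": ["align", "index"], "align": ["report"], "index": ["report"]}
completed = run_workflow(successors, actions)
```

The "report" job only becomes ready after both "align" and "index" have completed, which is the traversal the leader bullet describes.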

Pros

  • uses cwl, wdl, python

Cons

Example

from toil.common import Toil
from toil.job import Job

def helloWorld(message, memory="1G", cores=1, disk="1G"):
    return f"Hello, world!, here's a message: {message}"

if __name__ == "__main__":
    parser = Job.Runner.getDefaultArgumentParser()
    options = parser.parse_args()
    options.clean = "always"
    with Toil(options) as toil:
        output = toil.start(Job.wrapFn(helloWorld, "You did it!"))
    print(output)

Scrapy

Concepts

  • spiders

  • middleware

  • pipeline

  • runner

  • contracts

Dataflow

  1. The Engine gets the initial Requests to crawl from the Spider.

  2. The Engine schedules the Requests in the Scheduler and asks for the next Requests to crawl.

  3. The Scheduler returns the next Requests to the Engine.

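  4. The Engine sends the Requests to the Downloader, passing through the downloader middlewares (process_request).

  5. The Downloader fetches the page.

  6. The Downloader sends the Response back to the Engine, passing through the downloader middlewares (process_response).

  7. The Engine sends the Response to the Spider, passing through the spider middlewares (process_spider_input).

  8. The Spider returns scraped items and new Requests to the Engine, passing through the spider middlewares (process_spider_output).

  9. The Engine sends processed items to the Item Pipelines, sends processed Requests to the Scheduler, and asks for possible next Requests to crawl.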

  10. The process repeats (from step 3) until there are no more requests from the Scheduler.
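The loop in the steps above can be sketched as a synchronous toy in plain Python. It is only an illustration: the in-memory PAGES "web", the function names, and the duplicate filter are assumptions, and none of it is Scrapy's API (Scrapy itself runs this loop asynchronously).

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links)
PAGES = {
    "a": ("page A", ["b", "c"]),
    "b": ("page B", ["c"]),
    "c": ("page C", []),
}

def download(request):
    """Steps 4-6: stand-in for the downloader and its middlewares."""
    text, links = PAGES[request]
    return {"url": request, "text": text, "links": links}

def spider_parse(response):
    """Steps 7-8: the spider yields one scraped item and any new Requests."""
    yield {"item": response["text"]}
    for link in response["links"]:
        yield {"request": link}

def crawl(start_requests):
    scheduler = deque(start_requests)  # steps 1-2: initial Requests are scheduled
    seen = set(start_requests)         # assumed duplicate filter
    items = []                         # stand-in for the Item Pipelines
    while scheduler:                   # step 10: repeat until the Scheduler is empty
        request = scheduler.popleft()  # step 3: next Request
        response = download(request)
        for result in spider_parse(response):
            if "item" in result:
                items.append(result["item"])      # step 9: items go to the pipelines
            elif result["request"] not in seen:   # step 9: Requests go back to the Scheduler
                seen.add(result["request"])
                scheduler.append(result["request"])
    return items

items = crawl(["a"])
```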

Pros

  • non-blocking

  • modular

Cons

  • overrules logging

Twisted

Pros

Cons

Snakemake

Concepts

Pros

  • reproducible

  • linter

  • modular

  • auto install of dependencies

  • tool wrappers

  • cluster execution

  • tabular config

  • reports

  • generates unit tests

  • handover to other task runners

Cons

  • dsl, uncertain where python ends and snakemake begins

  • top down

Example

rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/A.fastq"
    output:
        "mapped_reads/A.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"

Luigi

Concepts

  • Target : has .exists(), possibly .open()

  • Task : .run(), .output(), .requires()

  • Parameter :

  • Events :

  • Event Handlers :

Pros

  • Straightforward command-line integration.

  • As little boilerplate as possible.

  • Focus on job scheduling and dependency resolution.

  • A file system abstraction where code doesn’t have to care about where files are located.

  • Atomic file system operations through this abstraction. If a task crashes it won’t lead to a broken state.

  • The dependencies are decentralized. No big config file in XML.

  • A web server that renders the dependency graph and does locking, etc for free.

  • Trivial to extend with new file systems, file formats, and job types.

  • Date algebra included.

  • Lots of unit tests of the most basic stuff.
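The atomic file operations point rests on a general technique: write to a temporary file, then rename it into place, so a crash never leaves a half-written output. A minimal standard-library sketch of that technique follows; atomic_write is a hypothetical helper, not Luigi's own code.

```python
import os
import tempfile

def atomic_write(path, text):
    """Write text to path so readers see either the old file or the complete new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the destination directory so the final
    # rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(text)
        os.replace(tmp_path, path)  # atomic when source and destination share a filesystem
    except BaseException:
        os.unlink(tmp_path)  # a failed write leaves no partial output behind
        raise
```

Because the target only appears once the write has finished, a check like Target.exists() can be treated as "the task completed".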

Cons

  • Its focus is on batch processing so it’s probably less useful for near real-time pipelines or continuously running processes.

  • The assumption is that each task is a sizable chunk of work. While you can probably schedule a few thousand jobs, it’s not meant to scale beyond tens of thousands.

  • Luigi does not support distribution of execution. When you have workers running thousands of jobs daily, this starts to matter, because the worker nodes get overloaded. There are some ways to mitigate this (trigger from many nodes, use resources), but none of them are ideal.

  • Luigi does not come with built-in triggering, and you still need to rely on something like crontab to trigger workflows periodically.

Example

import luigi

class MyTask(luigi.Task):
    param = luigi.Parameter(default=42)

    def requires(self):
        # May return a single Task or a list of Tasks.
        # SomeOtherTask stands in for another luigi.Task subclass.
        return SomeOtherTask(self.param)

    def run(self):
        with self.output().open('w'):
            ...

    def output(self):
        return luigi.LocalTarget("/temp/foo/bar-%s.txt" % self.param)

@luigi.Task.event_handler(luigi.Event.SUCCESS)
def celebrate_success(task):
    ...

Doit

Pros

  • just python

Cons

  • relies on raw dicts

Example

  def task_do_something():
      # Setup code here

      # Task Spec:
      return {
          'actions'  : [...],
          'file_dep' : [...],
          'targets'  : [...],
          }
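A filled-in version of the task spec might look like the sketch below, written as the contents of a dodo.py. The file names and the count_lines helper are hypothetical; the dict keys are the same three shown above, and the action uses doit's (callable, args) tuple form.

```python
def count_lines(source, destination):
    """Python action: write the number of lines in source to destination."""
    with open(source) as handle:
        total = sum(1 for _ in handle)
    with open(destination, "w") as handle:
        handle.write(f"{total}\n")

def task_count_lines():
    """Recount only when notes.txt changes (hypothetical file names)."""
    source, destination = "notes.txt", "notes.count"
    return {
        "actions": [(count_lines, [source, destination])],
        "file_dep": [source],
        "targets": [destination],
    }
```

The task function only returns metadata; the work happens when doit calls the action, which is why the raw dict ends up carrying the whole specification.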

Invoke

JavaScript-Based

Gulp

Concepts

  • gulpfile

  • tasks : async functions

  • public tasks

  • private tasks

Pros

  • combinator based

Cons

  • javascript

Example

function defaultTask(cb){
    // do stuff
    cb();
}

exports.default = defaultTask

Grunt

Concepts

  • package.json

  • gruntfile

  • alias tasks

  • multi tasks

  • basic tasks

  • custom tasks

Pros

  • plugins

Cons

  • javascript

Example

module.exports = function(grunt) {

  // Project configuration.
  grunt.initConfig({
    pkg: grunt.file.readJSON('package.json'),
    uglify: {
      options: {
        banner: '/*! <%= pkg.name %> <%= grunt.template.today("yyyy-mm-dd") %> */\n'
      },
      build: {
        src: 'src/<%= pkg.name %>.js',
        dest: 'build/<%= pkg.name %>.min.js'
      }
    }
  });

  // Load the plugin that provides the "uglify" task.
  grunt.loadNpmTasks('grunt-contrib-uglify');

  // Default task(s).
  grunt.registerTask('default', ['uglify']);

};

Go-Based

Task

Rust-Based

Just

TOML-Based

Mise

Cargo

Pros

Cons

XML-Based

Maven

Pros

Cons

Ant

Concepts

Pros

  • stdlib

Cons

  • java

  • xml

Custom DSLs

Ansible

Pros

Cons

CMake

Pros

Cons

Collective Knowledge

Pros

Cons

Common Workflow Language

Pros

Cons

  • yaml

Example

cwlVersion: v1.0
class: CommandLineTool
baseCommand: echo
stdout: output.txt
inputs:
  message:
    type: string
    inputBinding:
      position: 1
outputs:
  output:
    type: stdout

Gradle

Concepts

  • settings script

  • build script

  • project

  • subproject

  • actionable tasks

  • lifecycle tasks

  • plugins

  • artifact

  • capability

  • component

  • configuration

Pros

  • plugins

  • daemon

Cons

  • groovy

  • gradlew

  • unclear syntax

  • documentation

  • constrained to jvm projects

Jenkins

https://www.jenkins.io/doc/book/pipeline/syntax/

Concepts

  • jenkinsfile

  • pipelines

  • sections

  • directives

  • steps

  • agents

Pros

  • can be declarative or scripted

Cons

  • groovy

Example

pipeline {
    agent any
    options {
        // Timeout counter starts AFTER agent is allocated
        timeout(time: 1, unit: 'SECONDS')
    }
    stages {
        stage('Example') {
            steps {
                echo 'Hello World'
            }
        }
    }
}

Kubernetes

Concepts

Pros

Cons

OPA

Concepts

  • permissions

  • agents

  • roles

  • policy

  • rules

Pros

Cons

  • rego

Make

Pros

  • rule based

Cons

  • esoteric

  • relies on whitespace

  • complex var expansion

Example

# Each recipe line must begin with a tab character, not spaces.
objects = main.o kbd.o command.o display.o \
          insert.o search.o files.o utils.o

edit : $(objects)
        cc -o edit $(objects)
main.o : main.c defs.h
        cc -c main.c
kbd.o : kbd.c defs.h command.h
        cc -c kbd.c
command.o : command.c defs.h command.h
        cc -c command.c
display.o : display.c defs.h buffer.h
        cc -c display.c
insert.o : insert.c defs.h buffer.h
        cc -c insert.c
search.o : search.c defs.h buffer.h
        cc -c search.c
files.o : files.c defs.h buffer.h command.h
        cc -c files.c
utils.o : utils.c defs.h
        cc -c utils.c
clean :
        rm edit $(objects)
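Each rule above is re-run only when its target is out of date. The check Make applies can be sketched in a few lines of Python: a target needs rebuilding if it is missing or older than any of its prerequisites. needs_rebuild is a hypothetical helper for illustration, and it ignores details such as phony targets.

```python
import os

def needs_rebuild(target, prerequisites):
    """Return True if target is missing or older than any prerequisite."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(p) > target_mtime for p in prerequisites)
```

For the main.o rule, this corresponds to calling needs_rebuild("main.o", ["main.c", "defs.h"]) and running the recipe only when it returns True.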

Meson

https://mesonbuild.com/SimpleStart.html

Pros

Cons

Nix

Concepts

  • creates and composes file derivations

Pros

Cons

Rake

Pros

Cons

WDL

https://github.com/openwdl/wdl

https://openwdl.org/getting-started/

https://github.com/openwdl/wdl/blob/wdl-1.1/SPEC.md

Concepts

  • workflow

  • task

  • call

  • command

  • output

Pros

Cons

Example

workflow write_simple_file {
  call write_file
}
task write_file {
  String message
  command { echo ${message} > wdl-helloworld-output.txt }
  output { File test = "wdl-helloworld-output.txt" }
}

Dagger

TaskCluster

https://taskcluster-taskgraph.readthedocs.io/en/stable/index.html