Comparing Task Runners
Python-Based
SCons
Pros
python
order independent
Cons
documentation
not explicit
Toil
Concepts
leader : decides which jobs to run by traversing the job graph
job store : handles files shared between components and maintains workflow state
worker : temporary process that runs a job and can continue on to its successors
batch system : schedules jobs
node provisioner : creates worker nodes
stats and logger :
jobs : atomic unit of work
workflow : extends job
jobDescription : metadata
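The leader's role can be pictured with a small sketch. This is not Toil's implementation; it only assumes that a workflow is a graph in which each job lists the jobs it must follow, and it uses the standard library's graphlib to pick jobs whose predecessors are finished. The job names and the function run_in_dependency_order are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical job graph: each job maps to the set of jobs it must follow.
job_graph = {
    "align": {"download"},
    "index": {"download"},
    "report": {"align", "index"},
}

def run_in_dependency_order(graph):
    """Return the order in which a toy leader would issue jobs."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    issued = []
    while sorter.is_active():
        # Every job returned here has all of its predecessors finished,
        # so a real leader could hand the whole batch to workers at once.
        for job in sorted(sorter.get_ready()):
            issued.append(job)  # stand-in for a worker running the job
            sorter.done(job)    # lets the sorter release the job's successors
    return issued

order = run_in_dependency_order(job_graph)
print(order)
```

Jobs with no unmet dependencies come out in the same batch, which is where a batch system could run them in parallel.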
Pros
supports CWL, WDL, and Python workflows
Cons
Example
from toil.common import Toil
from toil.job import Job

def helloWorld(message, memory="1G", cores=1, disk="1G"):
    return f"Hello, world!, here's a message: {message}"

if __name__ == "__main__":
    parser = Job.Runner.getDefaultArgumentParser()
    options = parser.parse_args()
    options.clean = "always"
    with Toil(options) as toil:
        output = toil.start(Job.wrapFn(helloWorld, "You did it!"))
    print(output)
Scrapy
Concepts
spiders
middleware
pipeline
runner
contracts
Dataflow
The Engine gets the initial Requests to crawl from the Spider.
The Engine schedules the Requests in the Scheduler and asks for the next Requests to crawl.
The Scheduler returns the next Requests to the Engine.
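The Engine sends the Requests to the Downloader, passing through the downloader middlewares (process_request).
The Downloader fetches the page and generates a Response.
The Response returns to the Engine, passing through the downloader middlewares (process_response).
The Engine sends the Response to the Spider, passing through the spider middlewares (process_spider_input).
The Spider returns scraped items and new Requests to the Engine, passing through the spider middlewares (process_spider_output).
The Engine sends processed items to the Item Pipelines, then sends processed Requests to the Scheduler and asks for possible next Requests to crawl.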
The process repeats (from step 3) until there are no more requests from the Scheduler.
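The loop above can be imitated in a few lines of plain Python. This is a toy model, not Scrapy code: the in-memory pages dict stands in for the Downloader, the middleware hooks are left out, and the names toy_crawl and pages are made up for illustration.

```python
from collections import deque

def toy_crawl(start_urls, pages):
    """Simulate the engine loop on an in-memory 'web'.

    pages maps a URL to (title, links) and stands in for the Downloader.
    """
    scheduler = deque(start_urls)  # initial requests are handed to the scheduler
    seen = set(start_urls)
    items = []                     # stand-in for the item pipeline
    while scheduler:               # repeat until the scheduler has nothing left
        url = scheduler.popleft()  # the scheduler returns the next request
        title, links = pages[url]  # "download" the page
        items.append({"url": url, "title": title})  # scraped item goes to the pipeline
        for link in links:         # new requests go back to the scheduler
            if link in pages and link not in seen:
                seen.add(link)
                scheduler.append(link)
    return items

pages = {
    "/": ("Home", ["/a", "/b"]),
    "/a": ("A", ["/", "/b"]),
    "/b": ("B", []),
}
items = toy_crawl(["/"], pages)
print([item["url"] for item in items])
```

The real engine is non-blocking: it does not wait for one download to finish before asking the scheduler for the next request, which this sequential sketch does not capture.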
Pros
non-blocking,
modular
Cons
overrules logging
Twisted
Pros
Cons
Snakemake
Concepts
Pros
reproducible
linter
modular
auto install of dependencies
tool wrappers
cluster execution
tabular config
reports
generates unit tests
handover to other task runners
Cons
DSL: unclear where Python ends and Snakemake begins
top down
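Top down here means that a target file is requested first and rules are matched backwards from their outputs. A minimal sketch of that matching step, assuming a rule whose output pattern contains a {sample} wildcard. The function match_rule and the patterns are illustrative only and are not Snakemake's API; each wildcard is assumed to appear once in the output pattern.

```python
import re

def match_rule(output_pattern, input_patterns, requested):
    """Match a requested file against an output pattern and fill in the inputs."""
    # Split "mapped_reads/{sample}.bam" into literal text and wildcard names;
    # with one capturing group, odd positions hold the wildcard names.
    parts = re.split(r"\{(\w+)\}", output_pattern)
    regex = "".join(
        f"(?P<{part}>.+)" if i % 2 else re.escape(part)
        for i, part in enumerate(parts)
    )
    match = re.fullmatch(regex, requested)
    if match is None:
        return None  # this rule cannot produce the requested file
    wildcards = match.groupdict()
    return [pattern.format(**wildcards) for pattern in input_patterns]

inputs = match_rule(
    "mapped_reads/{sample}.bam",
    ["data/genome.fa", "data/samples/{sample}.fastq"],
    "mapped_reads/A.bam",
)
print(inputs)
```

A runner built this way repeats the same lookup for each returned input until it reaches files that already exist.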
Example
rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/A.fastq"
    output:
        "mapped_reads/A.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"
Luigi
Concepts
Target : has .exists(), possibly .open()
Task : .run(), .output(), .requires()
Parameter :
Events :
Event Handlers :
Pros
Straightforward command-line integration.
As little boilerplate as possible.
Focus on job scheduling and dependency resolution.
A file system abstraction where code doesn’t have to care about where files are located.
Atomic file system operations through this abstraction. If a task crashes it won’t lead to a broken state.
The dependencies are decentralized. No big config file in XML.
A web server that renders the dependency graph and does locking, etc for free.
Trivial to extend with new file systems, file formats, and job types.
Date algebra included.
Lots of unit tests of the most basic stuff.
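The atomic-write point can be illustrated with the standard library alone. This is a sketch of the general write-then-rename technique, assuming a local filesystem; Luigi's own mechanism may differ in detail, and the helper names atomic_write and is_complete are hypothetical.

```python
import os
import tempfile

def atomic_write(path, text):
    """Write text to path so that readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    # Write to a temporary file in the same directory, then rename it into place.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(text)
        # The rename is atomic when source and destination share a filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)  # a crashed write leaves no output behind
        raise

def is_complete(path):
    """Analogue of Target.exists(): the task counts as done once the output exists."""
    return os.path.exists(path)

demo_dir = tempfile.mkdtemp()
target = os.path.join(demo_dir, "out", "bar-42.txt")
before = is_complete(target)
atomic_write(target, "done\n")
after = is_complete(target)
print(before, after)
```

Because the output only appears once it is complete, a scheduler that checks for the file's existence never mistakes a half-written file for a finished task.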
Cons
Its focus is on batch processing so it’s probably less useful for near real-time pipelines or continuously running processes.
The assumption is that each task is a sizable chunk of work. While you can probably schedule a few thousand jobs, it’s not meant to scale beyond tens of thousands.
Luigi does not support distribution of execution. When you have workers running thousands of jobs daily, this starts to matter, because the worker nodes get overloaded. There are some ways to mitigate this (trigger from many nodes, use resources), but none of them are ideal.
Luigi does not come with built-in triggering, and you still need to rely on something like crontab to trigger workflows periodically.
Example
import luigi

class MyTask(luigi.Task):
    param = luigi.Parameter(default=42)

    def requires(self) -> "luigi.Task | list[luigi.Task]":
        # SomeOtherTask is a placeholder for an upstream task defined elsewhere
        return SomeOtherTask(self.param)

    def run(self):
        with self.output().open('w'):
            ...

    def output(self):
        return luigi.LocalTarget("/temp/foo/bar-%s.txt" % self.param)

@luigi.Task.event_handler(luigi.Event.SUCCESS)
def celebrate_success(task):
    ...
Doit
Pros
just python
Cons
relies on raw dicts
Example
def task_do_something():
    # Setup code here
    # Task Spec:
    return {
        'actions' : [...],
        'file_dep' : [...],
        'targets' : [...],
    }
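A filled-in version of the task spec, as a sketch of what a dodo.py might contain. The file names notes.txt and notes.count and the helper count_words are made up for illustration; the dict keys are the ones shown above.

```python
def count_words(source, target):
    """Python action: write the word count of source into target."""
    with open(source) as handle:
        words = len(handle.read().split())
    with open(target, "w") as handle:
        handle.write(f"{words}\n")

def task_count_words():
    """Hypothetical task: count the words in notes.txt."""
    return {
        # A Python action is a (callable, args) tuple; a shell command is a plain string.
        "actions": [(count_words, ["notes.txt", "notes.count"])],
        # Files the task reads; a change here makes the task out of date.
        "file_dep": ["notes.txt"],
        # Files the task produces.
        "targets": ["notes.count"],
    }
```

Since the spec is an ordinary dict returned by an ordinary function, it can be built with loops or helper functions, which is the upside of the "just python" approach and the downside of relying on raw dicts: a misspelled key is only caught when doit loads the task.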
Invoke
JavaScript-Based
Gulp
Concepts
gulpfile
tasks : async functions
public tasks
private tasks
Pros
combinator based
Cons
javascript
Example
function defaultTask(cb) {
  // do stuff
  cb();
}

exports.default = defaultTask;
Grunt
Concepts
package.json
gruntfile
alias tasks
multi tasks
basic tasks
custom tasks
Pros
plugins
Cons
javascript
Example
module.exports = function(grunt) {
  // Project configuration.
  grunt.initConfig({
    pkg: grunt.file.readJSON('package.json'),
    uglify: {
      options: {
        banner: '/*! <%= pkg.name %> <%= grunt.template.today("yyyy-mm-dd") %> */\n'
      },
      build: {
        src: 'src/<%= pkg.name %>.js',
        dest: 'build/<%= pkg.name %>.min.js'
      }
    }
  });

  // Load the plugin that provides the "uglify" task.
  grunt.loadNpmTasks('grunt-contrib-uglify');

  // Default task(s).
  grunt.registerTask('default', ['uglify']);
};
Go-Based
Task
Rust-Based
Just
TOML-Based
Mise
Cargo
Pros
Cons
XML-Based
Maven
Pros
Cons
Ant
Concepts
Pros
stdlib
Cons
java
xml
Custom DSLs
Ansible
Pros
Cons
CMake
Pros
Cons
Collective Knowledge
Pros
Cons
Common Workflow Language
Pros
Cons
yaml
Example
cwlVersion: v1.0
class: CommandLineTool
baseCommand: echo
stdout: output.txt
inputs:
  message:
    type: string
    inputBinding:
      position: 1
outputs:
  output:
    type: stdout
Gradle
Concepts
settings script
build script
project
subproject
actionable tasks
lifecycle tasks
plugins
artifact
capability
component
configuration
Pros
plugins
daemon
Cons
groovy
gradlew
unclear syntax
documentation
constrained to jvm projects
Jenkins
https://www.jenkins.io/doc/book/pipeline/syntax/
Concepts
jenkinsfile
pipelines
sections
directives
steps
agents
Pros
can be declarative or scripted
Cons
groovy
Example
pipeline {
    agent any
    options {
        // Timeout counter starts AFTER agent is allocated
        timeout(time: 1, unit: 'SECONDS')
    }
    stages {
        stage('Example') {
            steps {
                echo 'Hello World'
            }
        }
    }
}
Kubernetes
Concepts
Pros
Cons
OPA
Concepts
permissions
agents
roles
policy
rules
Pros
Cons
rego
Make
Pros
rule based
Cons
esoteric
relies on whitespace
complex var expansion
Example
objects = main.o kbd.o command.o display.o \
          insert.o search.o files.o utils.o

edit : $(objects)
	cc -o edit $(objects)

main.o : main.c defs.h
	cc -c main.c
kbd.o : kbd.c defs.h command.h
	cc -c kbd.c
command.o : command.c defs.h command.h
	cc -c command.c
display.o : display.c defs.h buffer.h
	cc -c display.c
insert.o : insert.c defs.h buffer.h
	cc -c insert.c
search.o : search.c defs.h buffer.h
	cc -c search.c
files.o : files.c defs.h buffer.h command.h
	cc -c files.c
utils.o : utils.c defs.h
	cc -c utils.c

clean :
	rm edit $(objects)
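Each rule above reruns its recipe only when the target is out of date. A rough Python sketch of that check, assuming plain modification-time comparison; the function name needs_rebuild is hypothetical, and real Make handles more cases (phony targets, implicit rules, missing prerequisites).

```python
import os

def needs_rebuild(target, prerequisites):
    """Return True if target is missing or older than any prerequisite."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    # A prerequisite modified after the target makes the target stale.
    return any(os.path.getmtime(p) > target_mtime for p in prerequisites)
```

For the rule `main.o : main.c defs.h`, the call would be `needs_rebuild("main.o", ["main.c", "defs.h"])`: editing defs.h makes every object file that lists it stale, and `edit` is relinked because those objects are then newer than it.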
Meson
https://mesonbuild.com/SimpleStart.html
Pros
Cons
Nix
Concepts
creates and composes file derivations
Pros
Cons
Rake
Pros
Cons
WDL
https://github.com/openwdl/wdl
https://openwdl.org/getting-started/
https://github.com/openwdl/wdl/blob/wdl-1.1/SPEC.md
Concepts
workflow
task
call
command
output
Pros
Cons
Example
workflow write_simple_file {
  call write_file
}
task write_file {
  String message
  command { echo ${message} > wdl-helloworld-output.txt }
  output { File test = "wdl-helloworld-output.txt" }
}
Dagger
TaskCluster
https://taskcluster-taskgraph.readthedocs.io/en/stable/index.html