Overview

Orchest is not here to reinvent your favorite editor. It is a web-based tool that works on top of your filesystem (similar to JupyterLab), so you can keep using your editor of choice. With Orchest you get to focus on building your data science pipelines.

A pipeline in Orchest can be thought of as a graph of executable files, e.g. notebooks or scripts, each running within its own isolated environment (powered by containerization). A visual pipeline editor lets you describe the execution order of the individual steps that represent those executable files. After coding your scripts, you can select and run any subset of the steps, and Orchest will respect the defined execution order of the pipeline.
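To make the idea of running a subset of steps in the defined order concrete, here is a minimal sketch using Python's standard-library graphlib module (Python 3.9+). It is not how Orchest is implemented internally; the step names, the dependency mapping, and the run_order helper are all hypothetical and chosen for illustration only.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps it depends on.
# The file names are illustrative and not taken from a real Orchest project.
pipeline = {
    "load.py": set(),
    "preprocessing.ipynb": {"load.py"},
    "model-training.ipynb": {"preprocessing.ipynb"},
    "report.ipynb": {"model-training.ipynb", "preprocessing.ipynb"},
}


def run_order(graph, selection):
    """Return the selected steps in an order that respects the full graph."""
    # Sort the whole graph first, then keep only the selected steps, so the
    # relative order of the selection always matches the pipeline's order.
    full_order = TopologicalSorter(graph).static_order()
    return [step for step in full_order if step in selection]


order = run_order(pipeline, {"report.ipynb", "preprocessing.ipynb"})
print(order)  # → ['preprocessing.ipynb', 'report.ipynb']
```

Sorting the full graph and then filtering keeps the sketch short: the selection never needs its own dependency analysis, because its order is inherited from the pipeline as a whole.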

Orchest essentially provides you with a development environment for your data science efforts without taking away the tools you know and love.

How Orchest works

Orchest runs as a collection of Docker containers and stores a global configuration file. The location for this config is ~/.config/orchest/config.json.
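If you want to inspect the global configuration from a script, a minimal sketch could look like the following. Only the file location comes from the text above; the load_config helper is hypothetical, and no particular keys are assumed to exist in the file.

```python
import json
from pathlib import Path

# Location stated in the text above; "~" is expanded to the home directory.
CONFIG_PATH = Path("~/.config/orchest/config.json").expanduser()


def load_config(path=CONFIG_PATH):
    """Return the parsed JSON config, or None if the file does not exist."""
    if not path.is_file():
        return None
    with path.open(encoding="utf-8") as f:
        return json.load(f)


config = load_config()
if config is None:
    print(f"No config found at {CONFIG_PATH}")
else:
    # Print the config as-is; its structure is not assumed here.
    print(json.dumps(config, indent=2))
```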

Orchest is powered by your filesystem. Upon launching, Orchest will mount the content of the orchest/userdir/ directory, where orchest/ is the install directory from GitHub, at different locations inside the Docker containers. Orchest stores its state and your scripts in the userdir/ on the host machine. The scripts that make up a pipeline, for example .ipynb and .py files, are stored inside the userdir/pipelines/ directory and are mounted in the container at /pipeline-dir. Additionally, the following files will be stored inside the .orchest/ directory at the pipeline level (and thus for each pipeline):

  • The Orchest SDK stores step outputs in the .orchest/data/ directory to pass data between pipeline steps (in the case where orchest.transfer.output_to_disk() is used).
  • Logs are stored in .orchest/logs/ to show STDOUT output from scripts in the pipeline view.
  • An autogenerated .orchest/pipeline.json file that defines the properties of the pipeline and its steps. This includes: execution order, names, images, etc. Orchest needs this pipeline definition file to work.

This gives a directory structure similar to the following:

.
├── preprocessing.ipynb
├── .ipynb_checkpoints/
├── .orchest
│   ├── data/
│   ├── logs/
│   └── pipeline.json
└── model-training.ipynb
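The layout above can be checked from a script. The sketch below assumes only the directory and file names shown in the tree; the helper names describe_pipeline_dir and step_files are hypothetical, and the contents of pipeline.json are deliberately not parsed because its schema is not described here.

```python
from pathlib import Path


def describe_pipeline_dir(pipeline_dir):
    """Report which of the .orchest/ entries described above exist."""
    orchest_dir = Path(pipeline_dir) / ".orchest"
    return {
        "data/": (orchest_dir / "data").is_dir(),
        "logs/": (orchest_dir / "logs").is_dir(),
        "pipeline.json": (orchest_dir / "pipeline.json").is_file(),
    }


def step_files(pipeline_dir):
    """List the .ipynb and .py files at the top level of the directory."""
    root = Path(pipeline_dir)
    return sorted(
        p.name
        for p in root.iterdir()
        if p.is_file() and p.suffix in {".ipynb", ".py"}
    )


# Inspect the current working directory as if it were a pipeline directory.
print(describe_pipeline_dir("."))
print(step_files("."))
```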

What can I use Orchest for?

With Orchest, you get to build pipelines where each step has its own isolated environment, allowing you to focus on a specific task, be it data engineering, model building, or lower-level tasks such as data transforms.

With Orchest you get to:

  • Visually construct pipelines, but interact with the pipelines through code.
  • Code your data science efforts in your editor of choice. Additionally, Orchest integrates deeply with JupyterLab so that you can directly edit the scripts that are part of your pipeline.
  • Modularize, i.e. split up, your (monolithic) notebooks.
  • Run any selection of pipeline steps.
  • Select specific notebook cells to skip when running a pipeline through the pre-installed celltags extension of JupyterLab.
  • Parametrize your data science pipelines to experiment with different modeling ideas.

What Orchest does for you:

  • Provides you with an interactive pipeline editing view.
  • Manages your dependencies and environments.
  • Runs your pipelines based on the defined execution order.
  • Passes data between your steps.