Scheduling

Scheduling and running PipelineWise tasks automatically is not part of the PipelineWise package, but any task scheduler that can run Unix CLI commands can trigger PipelineWise jobs. Both Single Server and Multi-Server Cluster installations are possible.
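In every case, the scheduler only has to invoke the regular run_tap CLI command, one call per pipeline:

$ pipelinewise run_tap --tap <tap_id> --target <target_id>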

Let’s say you have 5 microservice databases that you want to replicate to Amazon Redshift, and the pipelinewise status output looks like this:

$ pipelinewise status

Tap ID        Tap Type     Target ID   Target Type       Enabled    Status    Last Sync    Last Sync Result
------------  ----------   ----------  ----------------  ---------  --------  -----------  ------------------
microserv_1   tap-mysql    redshift    target-redshift   True       ready                  unknown
microserv_2   tap-mysql    redshift    target-redshift   True       ready                  unknown
microserv_3   tap-postgres redshift    target-redshift   True       ready                  unknown
microserv_4   tap-postgres redshift    target-redshift   True       ready                  unknown
microserv_5   tap-postgres redshift    target-redshift   True       ready                  unknown
5 pipeline(s)

Since every pipeline runs, logs and manages its state files independently, you’ll need to schedule 5 commands independently. For example, if using Unix Cron, you can create the following crontab:

*/5 *   * * * pipelinewise run_tap --tap microserv_1 --target redshift # Sync every 5 minutes
  0 *   * * * pipelinewise run_tap --tap microserv_2 --target redshift # Sync every hour
  0 */3 * * * pipelinewise run_tap --tap microserv_3 --target redshift # Sync every three hours
  0 0   * * * pipelinewise run_tap --tap microserv_4 --target redshift # Sync daily at midnight
  0 0   * * 6 pipelinewise run_tap --tap microserv_5 --target redshift # Sync every Saturday
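Cron runs jobs with a minimal environment, so in practice it often helps to wrap the call in a small shell script that sets up PATH and captures logs. Below is a minimal sketch, assuming pipelinewise is already on the cron user’s PATH; the script name and log directory are hypothetical, not part of PipelineWise:

#!/usr/bin/env bash
# run_tap.sh - hypothetical cron wrapper around "pipelinewise run_tap".
# Usage: run_tap.sh <tap_id> <target_id>
set -euo pipefail

TAP_ID="$1"
TARGET_ID="$2"

# Hypothetical log location; adjust to your environment.
LOG_DIR="${HOME}/pipelinewise-cron-logs"
mkdir -p "${LOG_DIR}"

# Append stdout and stderr of each run to a per-tap log file.
pipelinewise run_tap --tap "${TAP_ID}" --target "${TARGET_ID}" \
  >> "${LOG_DIR}/${TAP_ID}.log" 2>&1

With such a wrapper, the crontab entries become e.g. */5 * * * * /path/to/run_tap.sh microserv_1 redshift.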

PipelineWise is tested with, and can run under, at least the following schedulers:

  • Unix Cron - The simplest option for a Single Server installation.

  • Cicada Scheduler - A lightweight multi-server cron manager.

  • Cronicle - A relatively simple tool to schedule PipelineWise jobs in both Single Server and Multi-Server Cluster installations.

  • Apache Airflow - Airflow is a robust and mature tool to schedule and monitor workflows.

Multi-Server Cluster

Running a Multi-Server Cluster requires a Network File System that is accessible from every host in the PipelineWise cluster (Amazon EFS, Google Cloud Filestore or similar).

A Network File System is required because PipelineWise keeps its runtime configuration files in a common place on the host machine, in the ${HOME}/.pipelinewise directory. If you run PipelineWise commands on multiple nodes that operate on the same project, then every node has to read and write the same directory, no matter where the nodes are located. This is typically done by mounting ${HOME}/.pipelinewise on every node to a shared directory on NFS/EFS.
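As an illustration, on an EFS-backed cluster each node could wire up the shared directory like this; the mount point, file system endpoint and region below are placeholders for your own values:

# Mount the shared file system (standard NFS v4 mount for EFS;
# fs-12345678 and the region are placeholders).
sudo mkdir -p /mnt/efs
sudo mount -t nfs4 -o nfsvers=4.1 fs-12345678.efs.eu-west-1.amazonaws.com:/ /mnt/efs

# Point the PipelineWise runtime directory at the shared location.
# Assumes ${HOME}/.pipelinewise does not already exist as a regular directory.
mkdir -p /mnt/efs/pipelinewise
ln -sfn /mnt/efs/pipelinewise "${HOME}/.pipelinewise"

A bind mount of the shared directory onto ${HOME}/.pipelinewise works equally well if symlinks are undesirable.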