Scheduling
Scheduling and running PipelineWise tasks automatically is not part of the PipelineWise package, but any task scheduler that can run Unix CLI commands can trigger PipelineWise jobs. Both Single Server and Multi-Server Cluster installations are possible.
Let’s say you have 5 microservice databases that you want to replicate to Amazon Redshift, and the pipelinewise status output looks like this:
$ pipelinewise status
Tap ID        Tap Type      Target ID    Target Type      Enabled  Status  Last Sync  Last Sync Result
------------  ------------  -----------  ---------------  -------  ------  ---------  ----------------
microserv_1   tap-mysql     redshift     target-redshift  True     ready              unknown
microserv_2   tap-mysql     redshift     target-redshift  True     ready              unknown
microserv_3   tap-postgres  redshift     target-redshift  True     ready              unknown
microserv_4   tap-postgres  redshift     target-redshift  True     ready              unknown
microserv_5   tap-postgres  redshift     target-redshift  True     ready              unknown
5 pipeline(s)
Since every pipeline runs, logs, and manages state files independently, you’ll need to schedule 5 separate commands. For example, if using Unix Cron you can create the following crontab:
*/5 * * * * pipelinewise run_tap --tap microserv_1 --target redshift # Sync every 5 minutes
0 * * * * pipelinewise run_tap --tap microserv_2 --target redshift # Sync every hour
0 */3 * * * pipelinewise run_tap --tap microserv_3 --target redshift # Sync every three hours
0 0 * * * pipelinewise run_tap --tap microserv_4 --target redshift # Sync daily at midnight
0 0 * * 6 pipelinewise run_tap --tap microserv_5 --target redshift # Sync every Saturday
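
Note that a sync can occasionally take longer than its scheduling interval. As an optional safety net you can wrap an entry in flock(1) so cron skips a run while the previous one is still in progress. This is an illustrative addition, not something PipelineWise requires, and the lock file path below is arbitrary:

# -n makes flock give up immediately if the previous run still holds the lock
*/5 * * * * flock -n /tmp/pipelinewise_microserv_1.lock pipelinewise run_tap --tap microserv_1 --target redshift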
PipelineWise is tested with, and can run under, at least the following schedulers (a minimal wrapper script that any of them can invoke is sketched after the list):
Unix Cron - This is the simplest option for a single server installation.
Cicada Scheduler - A lightweight multi-server cron manager.
Cronicle - A relatively simple tool to schedule PipelineWise jobs in both Single Server and Multi-Server Cluster installations.
Apache Airflow - A robust and mature tool to schedule and monitor workflows.
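
Schedulers like Cronicle and Airflow typically shell out to the PipelineWise CLI and rely on the command’s exit status to mark a job as succeeded or failed. A minimal wrapper sketch that any of them can invoke (the script name is illustrative, and it assumes pipelinewise run_tap exits non-zero on failure):

#!/usr/bin/env bash
# run_tap.sh - run one PipelineWise pipeline; usage: run_tap.sh <tap_id> <target_id>
set -euo pipefail

TAP_ID="$1"
TARGET_ID="$2"

# The exit code of run_tap becomes the exit code of the script,
# so failed syncs surface as failed jobs in the scheduler.
pipelinewise run_tap --tap "${TAP_ID}" --target "${TARGET_ID}"

In Airflow, for example, one BashOperator task per pipeline can call this script, mirroring the crontab above.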
Multi-Server Cluster
Running a Multi-Server Cluster requires a network file system that is accessible from every host in the PipelineWise cluster (Amazon EFS, Google Filestore or similar).

A shared file system is required because PipelineWise keeps runtime configuration files in a common place on the host machine, in the ${HOME}/.pipelinewise directory. If you run PipelineWise commands on multiple nodes that operate on the same project, every node has to read and write the same directory, no matter where the nodes are located. This is typically done by mounting ${HOME}/.pipelinewise on every node to a shared directory on NFS/EFS.
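
As a sketch, on each node you would mount the shared file system and point the runtime directory at it. The mount point, NFS export and EFS file system ID below are placeholders; the -t efs variant additionally assumes the amazon-efs-utils mount helper is installed:

# Run on every node in the cluster
sudo mkdir -p /mnt/pipelinewise-shared

# Plain NFS (replace the server and export path with your own)...
sudo mount -t nfs4 nfs-server.example.com:/pipelinewise /mnt/pipelinewise-shared
# ...or Amazon EFS via the amazon-efs-utils mount helper
# sudo mount -t efs fs-12345678:/ /mnt/pipelinewise-shared

# Point ${HOME}/.pipelinewise at the shared directory so every node
# reads and writes the same configuration and state files
# (assumes ${HOME}/.pipelinewise does not already exist on the node)
mkdir -p /mnt/pipelinewise-shared/.pipelinewise
ln -s /mnt/pipelinewise-shared/.pipelinewise "${HOME}/.pipelinewise"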