Replication Methods
Replication Methods define the approach PipelineWise (more precisely, the Singer.io taps) takes when extracting data from a source during a replication job. Replication Methods can also affect how data is loaded into your destination and your overall row usage.
PipelineWise supports the following replication strategies to extract data from data sources.
Log Based: Replicates newly inserted, updated and deleted records.
Key Based Incremental: The Tap saves its progress via bookmarks. Only new or updated records are replicated during each sync.
Full Table: The Tap replicates all available records dating back to a start_date, defined in the tap config YAML, during every sync.
Fast Sync: Same functionality as Full Table, but optimised for data transfers between specific sources and targets by bypassing the Singer specification. Useful for the initial sync of large tables with hundreds of millions of rows, where Singer components would usually run for many hours or sometimes days.
Warning
Important: Replication Methods are one of the most important settings in PipelineWise. Defining a table’s Replication Method incorrectly can cause data discrepancies and latency. Before configuring the replication settings for a data pipeline, read through this guide so you understand how PipelineWise will replicate your data.
Log Based
Log-based Replication is a replication method in which modifications to records - including inserts, updates, and deletes - are identified using a database’s binary log files.
Warning
The Log Based replication method is available only for MySQL, PostgreSQL and MongoDB backend databases that support log replication.
Note
When using the Log Based replication method, table structure changes are detected automatically.
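To make the idea of replaying a log concrete, here is a minimal sketch in Python. It is not how PipelineWise or any Singer tap is implemented: the event layout (`op`, `key`, `row`) and the function name `apply_change_log` are assumptions made up for this illustration, and a real target may handle deletes differently (for example by marking rows rather than removing them).

```python
from typing import Any


def apply_change_log(
    target: dict[int, dict[str, Any]],
    events: list[dict[str, Any]],
) -> dict[int, dict[str, Any]]:
    """Apply insert/update/delete events, in log order, to an in-memory table copy."""
    for event in events:
        op, key = event["op"], event["key"]
        if op in ("insert", "update"):
            # Inserts and updates both carry the full new row image in this sketch.
            target[key] = event["row"]
        elif op == "delete":
            # Hypothetical hard delete; removes the row from the copy.
            target.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return target


# Current copy of the table, keyed by primary key.
target_table = {1: {"id": 1, "name": "alice"}}

# Hypothetical events as they might be read from a database log.
log_events = [
    {"op": "insert", "key": 2, "row": {"id": 2, "name": "bob"}},
    {"op": "update", "key": 1, "row": {"id": 1, "name": "alicia"}},
    {"op": "delete", "key": 2},
]

apply_change_log(target_table, log_events)
```

Because the log records every modification, the delete of row 2 reaches the copy, which is the property the other replication methods do not offer.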
Key Based Incremental
Key-based Incremental Replication is a replication method in which the Taps (Data Sources) identify new and updated data using a column called a Replication Key. A Replication Key is a timestamp, date-time, or integer column that exists in a source table.
When replicating a table using Key-based Incremental Replication, the following will happen:
During a replication job, PipelineWise stores the maximum value of a table’s Replication Key column.
During the next replication job, Taps (Data Sources) compare the saved value from the previous job to the Replication Key column values in the source.
Any rows in the table with a Replication Key greater than or equal to the stored value are replicated.
PipelineWise stores the new maximum value from the table’s Replication Key column.
Repeat.
Let’s use a SQL query as an example:
SELECT replication_key_column,
       column_you_selected_1,
       column_you_selected_2,
       [...]
  FROM schema.table
 WHERE replication_key_column >= 'last_saved_maximum_value'
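The steps and the query above can be sketched end to end with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: the `orders` table, its `updated_at` Replication Key and the `incremental_sync` helper are hypothetical names, and the bookmark is kept in a plain variable rather than in a tap's state.

```python
import sqlite3


def incremental_sync(conn, bookmark):
    """Return rows at or above the bookmark, plus the new bookmark value."""
    if bookmark is None:
        # First sync: no bookmark saved yet, so every row is extracted.
        rows = conn.execute(
            "SELECT id, updated_at FROM orders ORDER BY updated_at"
        ).fetchall()
    else:
        rows = conn.execute(
            "SELECT id, updated_at FROM orders "
            "WHERE updated_at >= ? ORDER BY updated_at",
            (bookmark,),
        ).fetchall()
    # Store the maximum Replication Key value seen; keep the old one if no rows.
    new_bookmark = max((row[1] for row in rows), default=bookmark)
    return rows, new_bookmark


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, updated_at INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10), (2, 20)])

first_rows, bookmark = incremental_sync(conn, None)

# A new row arrives between the two replication jobs.
conn.execute("INSERT INTO orders VALUES (3, 30)")
second_rows, bookmark = incremental_sync(conn, bookmark)
```

In this sketch the second job extracts row 2 again, because the comparison is greater than or equal to the stored value. A destination would therefore need to tolerate re-sent rows, for instance by updating on the primary key.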
If Log Based Replication isn’t feasible or available for a data source, Key-based Incremental Replication is the next best option.
Warning
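Key Based Incremental replication doesn’t detect deletes in the source.
Key Based Incremental replication from tables with long-running transactions can skip rows under certain conditions, for example when a transaction commits a row whose Replication Key value is lower than the value already stored.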
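Both limitations can be illustrated with a few lines of Python. This is a simplified sketch, not PipelineWise code: the rows, the `updated_at` Replication Key and the `rows_to_replicate` helper are hypothetical, and the late-committing transaction is simulated by appending a row with an older key value.

```python
def rows_to_replicate(source_rows, bookmark):
    """Select rows whose Replication Key is >= the saved bookmark."""
    return [row for row in source_rows if row["updated_at"] >= bookmark]


# State after a previous sync: the maximum key value, 100, was saved.
bookmark = 100
source = [
    {"id": 1, "updated_at": 90},
    {"id": 2, "updated_at": 100},
]

# Row 1 is deleted in the source. No remaining row signals that it is gone.
source = [row for row in source if row["id"] != 1]

# A long-running transaction commits late, with a key value assigned
# before the bookmark was saved.
source.append({"id": 3, "updated_at": 95})

picked_up = rows_to_replicate(source, bookmark)
```

Only row 2 is selected here: the delete of row 1 goes unnoticed, and row 3 is skipped because its key value is below the bookmark.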
Full Table
Full Table Replication is a replication method in which all rows in a table - including new, updated, and existing - are replicated during every replication job.
If a table doesn’t have a column suitable for Key Based Incremental and Log Based is unavailable, this method will be used to replicate data.
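The effect on extracted row volume can be seen in a small Python comparison. It is a rough sketch with hypothetical helper names (`full_table_extract`, `key_based_extract`) and made-up data, intended only to contrast the two extraction strategies on a table where nothing has changed since the last sync.

```python
def full_table_extract(source_rows):
    """Full Table: every row is extracted on every sync."""
    return list(source_rows)


def key_based_extract(source_rows, bookmark):
    """Key Based Incremental: only rows at or above the bookmark are extracted."""
    return [row for row in source_rows if row["updated_at"] >= bookmark]


# A table of 1000 rows whose Replication Key simply equals the row id.
source = [{"id": i, "updated_at": i} for i in range(1, 1001)]

# Bookmark saved by an earlier sync: the maximum key value in the table.
bookmark = 1000

full_rows = full_table_extract(source)
incremental_rows = key_based_extract(source, bookmark)
```

With no changes in the source, the Full Table extract still returns all 1000 rows, while the key-based extract returns only the single row at the bookmark.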
Fast Sync
Fast Sync Replication is functionally identical to Full Table replication, but it bypasses the Singer specification for optimised performance. The primary use case of Fast Sync is the initial sync or resync of large tables with hundreds of millions of rows, where Singer components would usually run for many hours or sometimes days.
Important: Fast Sync is not a selectable replication method in the YAML configuration. PipelineWise automatically detects when Fast Sync gives better performance than the Singer components and uses it whenever possible.
Warning
Fast Sync is not a generic component and is available only from specific data sources to specific targets. Check the Fast Sync section for the supported components.