Golden Gate: data pump and hub nodes 3/4

2020-11-23 By Denis Monari

You have your database ready and your integrated extractor running. Now you need a data pump to move your data around. The extractor stores its data in trails locally on the source node, so that unwanted network latency (or outright failures) cannot slow down the extraction process. A dedicated data pump process takes care of sending those trails to a remote Golden Gate host, and it is the one that pays the cost of any latency and/or network failure.

This step is really simple and, unless you need some kind of transformation to take place on the data (but you should avoid doing that in extractor and source node processes), the only special parameter you may need on a data pump process is:

PASSTHRU
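As a minimal sketch, a data pump parameter file on the source node could look like the one below. The process, host, trail and schema names are hypothetical; in PASSTHRU mode the pump needs no database connection:

EXTRACT PMPSRC
PASSTHRU
RMTHOST hub-node-1, MGRPORT 7809
RMTTRAIL ./dirdat/pa
TABLE APP.*;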

The source node is not a good place to split/duplicate your data and send it to multiple destinations. This is because your source node's main duty is to fetch data from the source database as fast and as reliably as possible and send it to the next node. Ensure you use COMPRESS (and ENCRYPT) on your RMTHOST parameters to increase network performance and security for each data pump process.
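For example, the RMTHOST line of the pump sketched above could be extended with compression and encryption; the key name is hypothetical and must match an entry in your ENCKEYS file:

RMTHOST hub-node-1, MGRPORT 7809, COMPRESS, ENCRYPT AES192, KEYNAME gg_key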

Hub nodes

The best place to split your data and send it around is a hub node. A hub node is nothing more than a standard Golden Gate installation, but (usually) with no connection to any database. It requires fewer resources than a source node, unless you need dozens of data pump processes.

Those nodes are like train stations where the data is received, stored and copied/split around. On a hub node you may have multiple data pump processes reading from the same trails and sending the data (all of it or a subset) to multiple destinations.
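As a sketch of that split, two pumps on a hub node could read the same incoming trail and ship it to different destinations. All process, host, trail and table names below are hypothetical:

-- GGSCI on the hub: two pumps reading the same local trail ./dirdat/in
ADD EXTRACT PMPALL, EXTTRAILSOURCE ./dirdat/in
ADD RMTTRAIL ./dirdat/ra, EXTRACT PMPALL
ADD EXTRACT PMPSUB, EXTTRAILSOURCE ./dirdat/in
ADD RMTTRAIL ./dirdat/rb, EXTRACT PMPSUB

-- PMPALL.prm: everything goes to the replicator node
EXTRACT PMPALL
PASSTHRU
RMTHOST replicat-node, MGRPORT 7809, COMPRESS
RMTTRAIL ./dirdat/ra
TABLE APP.*;

-- PMPSUB.prm: only a subset goes to another datacenter
EXTRACT PMPSUB
PASSTHRU
RMTHOST dc2-hub-node, MGRPORT 7809, COMPRESS, ENCRYPT AES192, KEYNAME gg_key
RMTTRAIL ./dirdat/rb
TABLE APP.ORDERS;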

The hub nodes are useful when you need a buffer too. Suppose you need to send the data over the internet. You will probably capture data faster than you can send it to your remote data center, so you need a buffer to absorb the unstable network performance. Your Manager process will take care of restarting your data pump processes when the network fails (through the AUTOSTART and AUTORESTART parameters) and will ensure the trails do not get deleted until all of your data pumps have been able to send their data to their destinations. Any issue sending data to the destinations will not slow down your data capture on the source.
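A minimal sketch of a hub Manager parameter file covering those duties, assuming the default port and a local ./dirdat trail directory (the retry and retention values are hypothetical):

PORT 7809
AUTOSTART EXTRACT *
AUTORESTART EXTRACT *, RETRIES 10, WAITMINUTES 2, RESETMINUTES 60
-- purge trails only once every pump has confirmed its checkpoints
PURGEOLDEXTRACTS ./dirdat/*, USECHECKPOINTS, MINKEEPHOURS 24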

But how many hubs? Well, you should have a hub node every time significant network hops are required, so whenever you move data between different datacenters and to/from the cloud or different cloud providers. A destination hub will help you keep receiving data even if your destination replicator node or destination target has issues. If you do not have a receiver hub node, your sender hub node will be unable to send the data, effectively wasting time, and because the network between datacenters usually has lower performance, more time will be required for the replicator to catch up.

The whole series:

Replication: moving data around in near real-time
Golden Gate: from Oracle Database to S3 bucket/HDFS architecture 1/4
Golden Gate: source node 2/4
Golden Gate: data pump and hub nodes 3/4
Golden Gate: replicator node AVRO/HDFS over a S3 Bucket 4/4