Replication: moving data around in near real-time

2018-02-09 · By Denis Monari

In today's world, data needs to be moved from one place to another. There are usually at least two main reasons to do so: offloading data from transactional databases to analytic/computational engines, and providing different data models for specific use cases. Even if your source database has plenty of computational power, you want to avoid tight dependencies between your main applications and your analytics. Moreover, that source database is only one of a plethora of sources feeding your analytic systems. Finally, time-to-market constraints usually add a real-time challenge to the recipe. So the need is to find a way to move data around, as fast as possible, from different sources to different destinations in a heterogeneous constellation of systems.

There are many tools to do this, but first you should write down and analyze the requirements you have:

  • compatibility matrix: list your source and target types and versions. You need to know what your sources are (files, which databases, streams, etc.) and their versions.
  • data types to be moved: you need to know what kind of data you want to move. Do you need to move only data, or also metadata (e.g. procedures, packages, code, etc.)?
  • data size: you have to know whether you need to move GB/s or KB/s. You may need compression, or even specific technologies.
  • security model: do you need to mask and/or encrypt your data?
  • purpose: is disaster recovery one of your objectives?
  • HA: is high availability required while moving data around? What are your RPO and RTO?

Only you know the answers to the above, but I will try to help with a POC you can build for yourself with a little bit of work.

In the next few posts I will show how to build a real-time replica from an Oracle Database to Avro files in an S3 bucket, seen as HDFS, using a set of Oracle Golden Gate EC2 machines.
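To give a feel for the target layout: once Golden Gate's HDFS handler writes its Avro output through the S3A connector, the bucket can be browsed with ordinary Hadoop filesystem commands. This is just a sketch with hypothetical bucket and path names, and it assumes the hadoop-aws (s3a) connector and your AWS credentials are already configured:

```shell
# Hypothetical bucket and paths -- a sketch, assuming the hadoop-aws (s3a)
# connector and AWS credentials are already set up on the node.

# List the Avro files written for a replicated table:
hdfs dfs -ls s3a://my-replica-bucket/ogg/MYSCHEMA.MYTABLE/

# Pull one file locally to inspect it (for example with avro-tools):
hdfs dfs -copyToLocal s3a://my-replica-bucket/ogg/MYSCHEMA.MYTABLE/part-00000.avro /tmp/
```

From Hadoop's point of view an s3a:// URI behaves like any other filesystem path, which is exactly what lets an S3 bucket stand in for a real HDFS cluster in this architecture.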

Knowledge of Oracle Database, and basic knowledge of Golden Gate and its console commands, are required, as this won't be a step-by-step guide covering each and every command and configuration (that would be so annoying!). You should also be able to configure your EC2 environment, along with roles and policies, on your own.

What I would like to show you are some specific details I found quite peculiar and likely to cause headaches.

The whole series:

Replication: moving data around in near real-time
Golden Gate: from Oracle Database to S3 bucket/HDFS architecture 1/4
Golden Gate: source node 2/4
Golden Gate: data pump and hub nodes 3/4
Golden Gate: replicator node AVRO/HDFS over a S3 Bucket 4/4