Golden Gate: from Oracle Database to S3 bucket/HDFS architecture 1/4
2020-11-23
As stated in previous posts, the whole picture consists of replicating content from an Oracle DB to an S3 bucket as Avro object container files. Golden Gate 12 can’t do that directly, but it can easily capture data changes from within an Oracle DB and write to many different (Big Data) systems. We will use two of them in tandem: Avro object container files and the HDFS filesystem. So, with the help of the Apache Hadoop libraries, we can make the S3 bucket appear as an HDFS filesystem and configure the Golden Gate handler to write into it. Why do we want to add complexity like wrapping HDFS around an S3 bucket? The answer is: if you want to save money, you can store data on an S3 bucket (with its various pricing/availability options) and spin up your HDFS systems on AWS only if and when you want to pay for computational power. The Avro object container file format lets you store your data in a ready-to-use, self-contained, self-describing file (it contains both data and metadata, with transactional information).
What do you need to test this for yourself?
- Download Oracle Golden Gate 12.2+
- Download Oracle Golden Gate for Big Data 12.3+
- Download Oracle Client 12.1+ (please remember to patch it too)
- An Oracle EE 12.1 instance on premise or on RDS
- check on MOS document Latest GoldenGate/Database (OGG/RDBMS) Patch recommendations (Doc ID 2193391.1) for bugfixes on both Golden Gate and Oracle.
- Download Oracle Java JDK 1.8 for RHEL Linux 64bit
- Hadoop 3 libraries for Linux
- Latest AWS SDK java for Linux
- three RHEL/OEL 7 boxes on premise or as EC2 instances on AWS. If you want to save some money, you can collapse all three machines into one. I will stick to three boxes to show you a best-practice architecture in a distributed infrastructure
- one S3 bucket
- Sufficient resources in terms of:
  - The database box should have at least 8 GB of memory and 2 CPUs
  - Each RHEL/OEL box should have at least 2 GB of memory, 20 GB of filesystem space available for software binaries, and a minimum amount of free space for replication trails and cache (more on this in the next chapters)
  - Ensure you have sufficient internet bandwidth if you are replicating between on-prem systems and the cloud
The three boxes will act as the db source node, the hub node and the target/replicator node. Like a stream, they will move data from the source database to an S3 bucket. You will be able to pause/stop/reboot nodes to do maintenance. For example, you can shut down the Golden Gate process on the hub node, or the whole machine, and both the db source node and the target node will continue to run. The db source node will start to accumulate captured data in local trail files, waiting for the hub node process to be back online. The target node process will simply stop receiving new data, but will remain ready to do its job at any time. This architecture will help you do maintenance easily without stopping everything.
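As a minimal sketch of that maintenance workflow, pausing and resuming the hub looks like this from GGSCI (the process name PMPHUB is illustrative, not from this series):

```
GGSCI> INFO ALL                      -- check the status of all processes
GGSCI> STOP EXTRACT PMPHUB           -- pause the hub pump; the source keeps capturing to local trails
-- ...do your maintenance, or reboot the box...
GGSCI> START EXTRACT PMPHUB          -- the pump catches up from where it left off
GGSCI> INFO EXTRACT PMPHUB, DETAIL   -- verify lag and checkpoint position
```

On restart, the pump resumes from its last checkpoint, so no captured data is lost while it was down.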
The db source node will have a Golden Gate installation with two main processes: an integrated extractor that fetches data from the database and puts it into a local trail, and a data pump extractor that reads that trail and sends it to the hub node. The purpose of this configuration is to capture data from the database as fast as possible, avoiding any unnecessary filters/transformations or other bottlenecks like network bandwidth/latency while sending data to a remote host. The data pump process's duty is to fetch from the local trail file and send it to the remote hub node, effectively addressing network performance without slowing down data capture.
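A sketch of the two parameter files on the db source node could look like this (process names, the ggadmin credential alias, the app schema and the hub hostname are all illustrative assumptions):

```
-- dirprm/extsrc.prm: integrated extractor writing changes to a local trail
EXTRACT extsrc
USERIDALIAS ggadmin
EXTTRAIL ./dirdat/lt
TABLE app.*;

-- dirprm/pmpsrc.prm: data pump reading the local trail and shipping it to the hub
EXTRACT pmpsrc
RMTHOST hub-node, MGRPORT 7809
RMTTRAIL ./dirdat/rt
PASSTHRU
TABLE app.*;
```

PASSTHRU tells the pump to forward trail records untouched, which matches the "no transformations at capture time" goal described above.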
The hub node will act as a central train station: it will receive data coming from remote sources and send it to remote destinations. Like the db source node, this node will have a Golden Gate installation too. In our example we have only one source, so we will have only one process reading from the receiving trails and sending the data to the target node. You may also have multiple destinations for the same receiving trail. For example, you may want to send the same data, coming from the same source, to two different systems: one replicating to HDFS like in our example, another could be your data warehouse staging database or a performance system. You may have captured a large set of tables with a single extractor group simply to have it split in two here, effectively lowering resource needs on the source database.
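On the hub, the forwarding process is again a simple pump; a sketch (names are illustrative):

```
-- dirprm/pmptgt.prm: hub pump forwarding the received trail to the target node
EXTRACT pmptgt
RMTHOST target-node, MGRPORT 7809
RMTTRAIL ./dirdat/tt
PASSTHRU
TABLE app.*;
```

Fanning the same receiving trail out to a second destination is just a second pump parameter file pointing at a different RMTHOST.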
The target node will host our final Golden Gate installation. Here we need to install Oracle Golden Gate for Big Data, because we need to write to a big data system (HDFS). This is the only node that requires a copy of the Hadoop 3 libraries and the latest AWS SDK for Java, so it can write to an S3 bucket.
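To give an idea of the wiring, here is a sketch of the HDFS handler properties (file paths, the bucket name and the classpath are assumptions for illustration; we will build the real files in part 4):

```
# dirprm/hdfs.props: HDFS handler writing Avro object container files
gg.handlerlist=hdfs
gg.handler.hdfs.type=hdfs
gg.handler.hdfs.format=avro_row_ocf
gg.handler.hdfs.rootFilePath=/gg/replicated
# Hadoop client jars and the AWS SDK must be visible to the handler
gg.classpath=/opt/hadoop/share/hadoop/common/*:/opt/hadoop/share/hadoop/common/lib/*:/opt/hadoop/share/hadoop/hdfs/*:/opt/hadoop/share/hadoop/tools/lib/*
```

The S3-as-HDFS trick happens in Hadoop's core-site.xml, where the default filesystem is pointed at the bucket through the s3a connector:

```
<!-- core-site.xml: make the "HDFS" filesystem actually be the S3 bucket -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>s3a://my-replication-bucket</value>
  </property>
</configuration>
```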
A word about high availability. Golden Gate and the Oracle database both have a wide range of HA configurations available. The database especially has an incredible set of tools and options that will let you build an unbreakable infrastructure. Golden Gate, on the other hand, can be coupled with the Oracle database to exploit a RAC configuration along with Data Guard. While this is a great and solid solution, it wouldn’t be easy to POC. In fact, if you don’t need extreme HA, you can build a solid system simply by exploiting the Oracle flashback capabilities and force logging to sustain small disruptions.
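As a sketch, that flashback/force-logging safety net boils down to a few statements on the source database, run as a privileged user (flashback additionally requires ARCHIVELOG mode):

```
-- guarantee redo completeness and allow point-in-time rewinds
ALTER DATABASE FORCE LOGGING;
ALTER DATABASE FLASHBACK ON;
-- minimum supplemental logging required by Golden Gate capture
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA;
-- allow the integrated extract to attach to the database
ALTER SYSTEM SET ENABLE_GOLDENGATE_REPLICATION=TRUE SCOPE=BOTH;
```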
Now that you are aware of the whole architecture, let’s start building the db source node and set up the database so that Golden Gate can connect to it.
The whole series:
Replication: moving data around in near real-time
Golden Gate: from Oracle Database to S3 bucket/HDFS architecture 1/4
Golden Gate: source node 2/4
Golden Gate: data pump and hub nodes 3/4
Golden Gate: replicator node AVRO/HDFS over a S3 Bucket 4/4