Golden Gate: replicator node AVRO/HDFS over an S3 Bucket 4/4

2020-11-23 · By Denis Monari

We have an integrated extractor capturing data from our source Oracle database. It stores the captured data in local trails, and a data pump process reads them and sends them to a hub node. The hub node works as a buffer, handling network issues without slowing down the extraction, and sending the received data to the next hop: the replicator node.

In the context of our scenario, the replicator node is the only one that requires Golden Gate for Big Data to be installed. We do not need Golden Gate core installed on that node.

Because we are going to replicate to AVRO OCF files on an S3 Bucket, we need a bit of a bluff: we need Golden Gate for Big Data to think it is writing to HDFS.

To do so, we need the Hadoop 3 libraries on our machine; you can download them from here. Save the uncompressed content to a folder on your replicator node.

You will also need the AWS SDK for Java. You can download it from here. Place it in another folder.

To use all the above (including Golden Gate for Big Data), you also need Java 1.8+. You can download it from here.

Setup access to S3 Bucket

You will need an AWS account, of course, with a test S3 Bucket. You will also need an EC2 instance with an instance role profile that grants it access to that bucket: the role will need permissions to write objects and to list the bucket. Unless you need to access your bucket from the Internet, you should restrict access to it to a VPC Endpoint.
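As a reference, a minimal IAM policy for the EC2 role could look like the sketch below. The bucket name is a placeholder, and the exact set of S3 actions the Hadoop S3A connector needs may vary slightly with your Hadoop version, so treat it as a starting point rather than a definitive policy.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ListTheBucket",
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
      "Resource": "arn:aws:s3:::my_bucket_name"
    },
    {
      "Sid": "WriteObjects",
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject"],
      "Resource": "arn:aws:s3:::my_bucket_name/*"
    }
  ]
}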

Take note of your bucket name, then navigate to the Hadoop 3 library folder and locate the file core-site.xml. Make a backup copy and then edit the file. Below you can find the five property sections you need to check.

  <property>
    <!-- URI of NN. Fully qualified. No IP.-->
    <name>fs.defaultFS</name>
    <value>s3a://my_bucket_name</value>
  </property>

  <property>
    <name>fs.s3a.multipart.size</name>
    <value>20M</value>
    <description>How big (in bytes) to split upload or copy operations up into.
      A suffix from the set {K,M,G,T,P} may be used to scale the numeric value.
    </description>
  </property>

  <property>
    <name>fs.s3a.multipart.threshold</name>
    <value>20M</value>
    <description>How big (in bytes) to split upload or copy operations up into.
      This also controls the partition size in renamed files, as rename() involves
      copying the source file(s).
      A suffix from the set {K,M,G,T,P} may be used to scale the numeric value.
    </description>
  </property>

  <property>
    <name>fs.s3a.connection.maximum</name>
    <value>15</value>
    <description>Controls the maximum number of simultaneous connections to S3.</description>
  </property>

  <property>
    <name>fs.s3a.endpoint</name>
    <value>s3.eu-west-1.amazonaws.com</value>
    <description>AWS S3 endpoint to connect to. An up-to-date list is
      provided in the AWS Documentation: regions and endpoints. Without this
      property, the standard region (s3.amazonaws.com) is assumed.
    </description>
  </property>

The only section you need to change is the first property, where you place your S3 Bucket name (in the s3a URI scheme). The other four property sections help you tune performance while transferring data to the S3 Bucket. You should never have performance issues when going through a VPC Endpoint anyway, but if you need to handle dozens of different table changes at the same time and you need extremely high throughput, you may want to test different values. Remember that writing to an S3 Bucket is much slower than writing to a real disk. Also, the way AVRO files need to be ingested is not the best solution if you want to use them as a real-time source.
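One more note on credentials: since we rely on the EC2 role, the S3A connector should pick up the instance profile credentials automatically through its default credential chain. If you prefer to make that explicit, you can pin the provider with a property like the sketch below (the class name comes from the AWS SDK v1 shipped alongside hadoop-aws; verify it against your library versions before relying on it).

  <property>
    <name>fs.s3a.aws.credentials.provider</name>
    <value>com.amazonaws.auth.InstanceProfileCredentialsProvider</value>
    <description>Use the EC2 instance profile (IAM role) credentials
      instead of static access keys.</description>
  </property>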

Golden Gate for Big Data handler

The replicator process in this Big Data version of Golden Gate works differently from the standard core version. First of all, there is no concept of “integrated”. Second, its configuration is composed of two files: a Golden Gate replicator parameter file and a handler properties file.

We need to set up a properties file for the handler. It should be placed in the same dirprm/ folder as the process parameter file. Here are the contents:

gg.handlerlist=hdfs

gg.handler.hdfs.type=hdfs
gg.handler.hdfs.fileNameMappingTemplate=${fullyQualifiedTableName}_${currentTimestamp}.avro
gg.handler.hdfs.pathMappingTemplate=/my/bucket/folder/${fullyQualifiedTableName}/year=${currentTimestamp[yyyy]}/month=${currentTimestamp[MM]}/day=${currentTimestamp[dd]}
gg.handler.hdfs.format=avro_row_ocf
gg.handler.hdfs.mode=op
gg.handler.hdfs.maxFileSize=20M
gg.handler.hdfs.fileRollInterval=30s
gg.handler.hdfs.inactivityRollInterval=30s
gg.handler.hdfs.partitionByTable=true
gg.handler.hdfs.rollOnMetadataChange=true
gg.handler.hdfs.compressionCodec=snappy

gg.handler.hdfs.format.enableDecimalLogicalType=true
gg.handler.hdfs.format.insertOpKey=I
gg.handler.hdfs.format.updateOpKey=U
gg.handler.hdfs.format.deleteOpKey=D
gg.handler.hdfs.format.truncateOpKey=T
gg.handler.hdfs.format.encoding=UTF-8
gg.handler.hdfs.format.pkUpdateHandling=delete-insert

goldengate.userexit.timestamp=utc
goldengate.userexit.writers=javawriter
javawriter.stats.display=TRUE
javawriter.stats.full=TRUE

gg.log=log4j
gg.log.level=INFO

gg.report.time=30sec

gg.classpath=dirprm/:/my/goldengate/folder/lib/*:/my/goldengate/folder/*:/my/hadoop/share/hadoop/common/*:/my/hadoop/share/hadoop/common/lib/*:/my/hadoop/share/hadoop/hdfs/*:/my/hadoop/share/hadoop/hdfs/lib/*:/my/hadoop/lib/native/*:/my/hadoop/lib/*:/my/awssdk/lib/*:/my/awssdk/third-party/lib/*:/my/hadoop/share/hadoop/tools/lib/*:

javawriter.bootoptions=-Xmx4096m -Xms2048m -Djava.class.path=ggjava/ggjava.jar -Djava.io.tmpdir=/my/dirtmp

The most important parameter is gg.handler.hdfs.format=avro_row_ocf, where we instruct Golden Gate to write in AVRO row OCF format. Everything else depends on your needs. Here you can see how big numbers are handled (the Oracle NUMBER data type is too big for Java primitive types and needs to be managed as a decimal logical type), you can define the max size of each Avro file, the files are stored in the bucket under /my/bucket/folder, each file name reflects the schema.table_name along with the timestamp of when it was created, and so on. The Oracle documentation (especially the Golden Gate 19c version) contains most of the information.

The really tricky part is the classpath, which must contain all the Golden Gate, Hadoop and AWS SDK library paths. Don’t forget the first entry: “dirprm/”.

Golden Gate for Big Data process parameter

After compiling the handler properties file, you can set up the Golden Gate for Big Data process parameter file. The options available depend on the kind of Big Data handler you need to configure, and they are quite limited compared to the Golden Gate core version. In our scenario, AVRO OCF, we cannot handle any DDL operation, but Golden Gate 12 includes by default the metadata structure of each table within the trail. This enables the handler to write AVRO OCF files without issues, provided we send both the before and the current image (we did that in our integrated extractor configuration, remember?). Anyway, besides the standard list of table mappings, the parameters that really make a difference are the following:

REPLACEBADCHAR NULL FORCECHECK
TARGETDB LIBFILE libggjava.so SET property=/full/path/of/my/avro.properties
GROUPTRANSOPS 1000

The REPLACEBADCHAR parameter is especially useful for sources that store data in different languages and symbol sets. It will save you a big headache, because not everything can really be written to a text file, especially if you need to follow format rules (Avro). This option replaces bad character codes with NULL. It only avoids crashing the replicator; it will not save you from losing data, so beware!

The TARGETDB parameter points the handler to the properties file we just wrote above.

GROUPTRANSOPS will help you tune your replicator. A good number is 1000: the replicator writes 1000 operations received from the trails before every checkpoint. Depending on the data flow, you may want to raise or lower that number. Just remember that in case of an ABEND, the replicator will go back to the first operation of the block and retry (if the Manager is configured to do so). If that block is big, it may take time to catch up.
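Putting it all together, a minimal replicator parameter file could look like the sketch below. The process name, property file path and schema mapping are placeholders for illustration; adapt them to your own environment.

REPLICAT ravro
-- Point the Java adapter to the handler properties file in dirprm/
TARGETDB LIBFILE libggjava.so SET property=dirprm/avro.properties
REPLACEBADCHAR NULL FORCECHECK
GROUPTRANSOPS 1000
-- Standard table mapping, exactly as in the core version
MAP MYSCHEMA.*, TARGET MYSCHEMA.*;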

A note about the transaction size

The Java handler needs to have the whole transaction in memory before writing it to the Avro file. So if your average transaction size is a few KB, 1000 of them will require a few MB of process memory (within the Xmx value defined in the handler file). But what if your average transaction size is big, or you receive one huge transaction? The answer is that the process will ABEND (and there is a chance the Manager will restart it, if you configured AUTORESTART). So you will need to ensure enough memory to sustain your average GROUPTRANSOPS number and to be able to handle an uncommonly big transaction. While you can lower GROUPTRANSOPS down to 1, there is no way to shrink a big transaction; you will have to increase your memory, or skip it and lose the data.
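For reference, the AUTORESTART behavior mentioned above lives in the Manager parameter file (mgr.prm). A hedged example, with retry values you should tune to your own environment:

-- Restart an abended replicator up to 3 times, waiting 5 minutes
-- between attempts; reset the retry counter after 60 minutes
AUTORESTART REPLICAT *, RETRIES 3, WAITMINUTES 5, RESETMINUTES 60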

Test the replication flow

Once you have compiled the handler properties file and the parameter file, you are ready to start the Manager, create the replicator process and start it. This works just like in the Golden Gate core version.
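A hedged GGSCI sequence might look like the one below; the replicat name and trail path are placeholders and must match the trail delivered by the hub’s data pump.

START MANAGER
ADD REPLICAT ravro, EXTTRAIL ./dirdat/rt
START REPLICAT ravro
INFO REPLICAT ravro, DETAIL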

Now everything should be in place: the extractor, the hub and the replicator. You are ready to test your fresh new Golden Gate replication flow that captures data from an Oracle database and writes it to an S3 Bucket in Avro OCF file format!

Conclusions

Golden Gate is a really powerful tool. Not only is it one of the main components of the Oracle Maximum Availability Architecture, it also helps your company make the most of its data. You can use the best analytics tools and platforms to analyze data, and the most powerful big data infrastructure to handle loads that no relational database can. All of those environments will be able to see the same data in near real-time thanks to a good Golden Gate architecture, and that will help your business be more reactive and proactive.

You can experiment with other topologies, sources and targets, like writing to Cassandra and MongoDB. The new version 19 of Golden Gate for Big Data should be able to handle S3 bucket writes natively, simplifying the architecture you have just read about, but this configuration is still valid. It can also write in other file formats like Parquet, and it can capture from Cassandra as well.

There is no excuse for you not to start using Golden Gate... besides its cost, of course! 😛

Enjoy your replication! 🙂

The whole series:

Replication: moving data around in near real-time
Golden Gate: from Oracle Database to S3 bucket/HDFS architecture 1/4
Golden Gate: source node 2/4
Golden Gate: data pump and hub nodes 3/4
Golden Gate: replicator node AVRO/HDFS over an S3 Bucket 4/4