Tips for running massive DML in Oracle

2020-12-09 By Denis Monari

When working on a DWH or BI system, massive deletes or, more generally, massive DML and ETL are business as usual. Chances are you can exploit partitioning, partition exchange and parallel DML, disabling constraints as you see fit.

But what about massive deletes where no such tools are available, for example where the table is not partitioned and the DML code is complex and takes forever? Maybe you also have tons of dependencies (maybe even a running GoldenGate replica) and you cannot easily exploit CTAS.

Usually there are two driving factors that guide your choices when you need to run massive DMLs:

  1. Complete the DML as fast as possible
  2. Consume as few database resources as possible

The first option is usually the realm of DWH/BI systems: you have a huge amount of resources and tools you can use in a well defined timeframe. So, no problem, right?

The second option is the realm of: I need to fix a small portion of data in my big OLTP database, and I cannot disrupt the customer experience, so I cannot go offline.

The amount of data is growing on a daily basis, and chances are that even in the realm of OLTP systems you need to stay always up and still fix data as fast as possible.

Enter Do-It-Yourself Parallelism

In Tom Kyte's book Expert Oracle Database Architecture various parallel techniques are described. When nothing else works, the last option to run massive DMLs is “Do-It-Yourself Parallelism”.
Trust me, you should always investigate the other options first, because you need to be well aware of what you are doing. You need to know how many resources you can consume without disrupting the customer experience (or the system).

Just like running a PL/SQL loop with a commit every 100k randomly deleted rows, you must be aware that you will NOT run that massive DML in one single transaction. That means that, while your database will of course always remain consistent, your business data will not. Keep that in mind when reviewing all the constraints and triggers you may have in your database. The DML will be executed on small chunks of your table, and every chunk will be committed before moving to the next one, until no more chunks are left to process.
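To make the comparison concrete, here is a minimal sketch of such a commit-every-100k-rows delete loop, reusing the hypothetical my_schema.my_big_table and the my_column_identifier = 1234 filter used in the examples below. Every iteration is its own transaction, which is exactly the business-consistency trade-off described above.

BEGIN
    LOOP
        -- Delete at most 100k matching rows per iteration.
        DELETE FROM my_schema.my_big_table
        WHERE my_column_identifier = 1234
        AND ROWNUM <= 100000;

        EXIT WHEN SQL%ROWCOUNT = 0;  -- nothing left to delete

        COMMIT;  -- each chunk is its own transaction
    END LOOP;
END;
/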

Suppose, for example, that your application requires every single item in its catalog to always have 10 rows, each representing a description in a different language, and you need to change the language ID on that multi-billion-row table. If the application can tolerate changes to a small portion of that table at a time, then fine: you can split your work in parallel. If not, well, you should change the application code as soon as possible!

So, assuming you can split the whole DML into small chunks and you are still convinced, you can go down the route of the DBMS_PARALLEL_EXECUTE package. First of all, you need to create a stored procedure containing your main DML.

CREATE OR REPLACE PROCEDURE MY_SERIAL_BIG_TASK (
    p_lo_rid  IN  ROWID,
    p_hi_rid  IN  ROWID
) IS
BEGIN
    -- Your complex DML goes here. For example, a chunked delete
    -- restricted to the ROWID range handed to this procedure:
    DELETE FROM my_schema.my_big_table a
    WHERE a.my_column_identifier = 1234
    AND ROWID BETWEEN p_lo_rid AND p_hi_rid;
END;
/

The above procedure will be used by DBMS_PARALLEL_EXECUTE to consume portions of the big table. Please note there are no commits within the stored procedure, and that it expects two input parameters representing the lower and higher ROWIDs we will use to filter our big table.

To split the table we need to choose among various options, but usually you want to chunk by data blocks or by rows. Here is how to split into chunks of 10,000 data blocks each:

BEGIN
    dbms_parallel_execute.create_task('MY_BIG_TABLE_TASK');
    dbms_parallel_execute.create_chunks_by_rowid(task_name   => 'MY_BIG_TABLE_TASK',
                                                 table_owner => 'MY_SCHEMA',
                                                 table_name  => 'MY_BIG_TABLE',
                                                 by_row      => false,
                                                 chunk_size  => 10000);
END;
/
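If block-based chunks do not fit your case, the package offers other splitting strategies. Here is a sketch of chunking by row count instead (same placeholder task and table names as above), with a commented-out alternative that chunks on a numeric column; MY_ID is a hypothetical numeric key used only for illustration.

BEGIN
    -- by_row => true: chunk_size is interpreted as a number of rows, not blocks.
    dbms_parallel_execute.create_chunks_by_rowid(task_name   => 'MY_BIG_TABLE_TASK',
                                                 table_owner => 'MY_SCHEMA',
                                                 table_name  => 'MY_BIG_TABLE',
                                                 by_row      => true,
                                                 chunk_size  => 100000);

    -- Or chunk on a numeric key instead of ROWIDs (the stored procedure would then
    -- take number parameters and filter on that column rather than on ROWID):
    -- dbms_parallel_execute.create_chunks_by_number_col(task_name    => 'MY_BIG_TABLE_TASK',
    --                                                   table_owner  => 'MY_SCHEMA',
    --                                                   table_name   => 'MY_BIG_TABLE',
    --                                                   table_column => 'MY_ID',
    --                                                   chunk_size   => 100000);
END;
/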

Then, when ready, start the task you just defined. Here you define the number of parallel processes that will be executed.

BEGIN
    dbms_parallel_execute.run_task(task_name      => 'MY_BIG_TABLE_TASK',
                                   sql_stmt       => 'begin MY_SERIAL_BIG_TASK( :start_id, :end_id ); end;',
                                   language_flag  => dbms_sql.native,
                                   parallel_level => 16);
END;
/
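Once RUN_TASK returns, you can check how the task ended and, if some chunks failed, retry just those chunks using the statement stored with the task. A minimal sketch, assuming the task name used above:

DECLARE
    l_status  NUMBER;
BEGIN
    l_status := dbms_parallel_execute.task_status('MY_BIG_TABLE_TASK');

    -- Re-process only the chunks that failed, reusing the stored statement.
    IF l_status = dbms_parallel_execute.finished_with_error THEN
        dbms_parallel_execute.resume_task('MY_BIG_TABLE_TASK');
    END IF;
END;
/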

You can monitor the status of the execution by checking the DBA_PARALLEL_EXECUTE_CHUNKS view, where you can see ASSIGNED, PROCESSED and UNASSIGNED chunks (along with errors and other statistics).
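For example, a quick way to follow progress while the task runs (assuming the same task name as above):

SELECT status, COUNT(*) AS chunks
FROM dba_parallel_execute_chunks
WHERE task_name = 'MY_BIG_TABLE_TASK'
GROUP BY status;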

In the Oracle documentation for DBMS_PARALLEL_EXECUTE you can find detailed examples on how to set up and use your parallel execution tasks.
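One last housekeeping note: the task and its chunk metadata stay in the data dictionary until you explicitly drop the task. A one-liner cleanup, again assuming the task name above:

BEGIN
    -- Removes the task and its associated chunk metadata.
    dbms_parallel_execute.drop_task('MY_BIG_TABLE_TASK');
END;
/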

Conclusions

You should always be careful running massive DMLs on your OLTP systems, of course. With DBMS_PARALLEL_EXECUTE you should double your tests, though: a good performance test should always be executed before running custom parallel code. You may corrupt data or consume too many resources in no time by running the wrong code!

Enjoy your massive DMLs! 😛