Optimizing Data Load in Oracle

person Jason Huangfolder_openOraclelocal_offer, access_time October 8, 2009

Forget where I get this useful article from.

It’s a very good article on how to optimize data loading / importing in Oracle.

Introduction

At OracleWorld 2002 I participated on the DBA User Expert panel, and a person in the audience asked the following question: “I am designing an application that will need to load 10 to 12 million rows of data per hour into Oracle tables from flat files. What Oracle features and techniques should I be looking at to achieve this performance requirement?” I thought this was an excellent question–one that DBAs and developers on almost any project could relate to.

In this paper, we will look at some of the Oracle features and techniques that can be employed to speed up data loads from flat files into Oracle tables. Which method is ideal for a particular situation will depend on many factors–factors we will discuss throughout this paper. To demonstrate and compare data loading methods, we will load a sample data set using each of the techniques we discuss. We will compare elapsed time, CPU time, and caveats of each method, but please note that timings can vary from one run to the next and are provided for illustrative purposes only.

A few of these techniques require Oracle 9i, but most work with Oracle 7.3 and later. One tip requires the Enterprise Edition of the database software, but the rest can be used with Standard Edition or Enterprise Edition. We will be looking at a few popular ways to load data into Oracle, but remember that there are many other tips and tricks available–this is not an exhaustive list!

The Sample Data Set

To demonstrate various techniques for loading flat file data into Oracle tables, we will work with a flat file that contains 100,000 records of data from a call center application. The file is approximately 6 Mb in size, and a few sample records in the file look like this:

        82302284384,04-18-2003:13:18:58,5001,COMPLAINT,SYS TICKET 183487923
        82302284385,04-18-2003:13:18:59,3352,INFO-REQUEST,PAYMENT ADDRESS
        82302284386,04-18-2003:13:19:01,3142,DOC-REQUEST,2002 YE CONSOL STMT

We will be loading the data into an Oracle table named CALLS. The CALLS table looks like this:

        Name         Null?     Type          Comment
        ------------ --------- ------------- -----------------
        CALL_ID      NOT NULL  NUMBER        Primary key
        CALL_DATE    NOT NULL  DATE          Non-unique index
        EMP_ID       NOT NULL  NUMBER
        CALL_TYPE    NOT NULL  VARCHAR2(12)
        DETAILS                VARCHAR2(25)

Single Row Inserts

About the simplest way for an application to load data into Oracle is to insert one row at a time using the INSERT statement with the VALUES clause. This technique is convenient for loading small amounts of data, such as a SQL*Plus script that creates seed data at the time of application installation. However, this approach is quite slow when compared to more sophisticated techniques, and therefore is not appropriate for loading large volumes of data into Oracle. Beware! This technique might be used behind the scenes if you write your load routine in perl, Java, or a scripting language that uses ODBC.

For comparison purposes, I wrote an application to load the 100,000 call center records into the CALLS table one row at a time. The data loaded in 172 seconds, and the database server process used 52 seconds of CPU time. (All of the timing information I provide in this paper will be the average of at least three test runs, and will be rounded off to whole seconds. Please use these figures for illustrative purposes only.)

Depending on the application and data volume involved, this most basic method for loading data into Oracle might give acceptable performance. If that is the case, then there is no need to over-engineer an unnecessarily complicated load program. However, if single row inserts are not fast enough for your situation, then there are more sophisticated data load methods to consider. In the following sections, we will find that it is possible to load the sample data set using less than 10% of the elapsed time and CPU resources of single row inserts.

The chief advantages of the single row insert method are simplicity and flexibility. You don’t need to do anything special to employ this technique, and there are a variety of application tools and languages that will use this technique behind the scenes by default. The key disadvantage of this technique is that it is the slowest and most CPU-resource intensive.

Data Loading Method Elapsed Seconds Database Server
CPU Seconds
Single row inserts 172 52

Single Row Inserts Without Indexes

When a row is inserted into a table, Oracle needs to update all indexes on the table at the same time. It is typically faster for Oracle to build an index from scratch after the table is populated rather than repeatedly update an index as rows are inserted. It is often true that the index will be more compact and efficient as well. These rules of thumb are especially true when bitmap indexes are involved.

From this, it follows that one way to speed up data loads would be to drop the indexes before loading the data and recreate the indexes after the load is complete. For primary and unique key constraints, you can get the same effect by disabling or dropping the constraint. (Be careful of the impact this might have on foreign key constraints in other tables.)

This technique might not be appropriate if you are loading data into a table that already has a lot of rows in it. Consider a table with 99 million rows and you wish to load 1 million more rows. The time saved in loading the 1 million new rows will be lost when Oracle has to rebuild an index on the 99 million existing rows plus the 1 million new rows. However, if you are loading data into a table that is empty or contains few rows relative to the number of rows being loaded, then the time savings from faster inserts will typically outweigh the time required to rebuild the indexes.

Speeding up index creation is another issue to consider. Briefly, you can reduce the amount of sorting that must take place on disk by increasing the sort_area_size in your session (if you are not using Oracle 9i PGA aggregate targets). This allows your session to perform more of the sorting during index creation in memory. Also, you can use the NOLOGGING keyword to reduce the amount of redo Oracle generates during the index build. NOLOGGING has significant impacts on recoverability and standby databases, so do your homework before using the NOLOGGING keyword.

I modified the application used in the last section to disable the primary key on the CALLS table and drop its one non-unique index before loading the data, putting both back after the load was complete. In this example, the CALLS table was empty before the data load. Factoring in the amount of time required to recreate the two indexes, elapsed time for the load dropped from 172 seconds to 130 seconds. CPU time used by the database server process dropped from 52 seconds to 35 seconds.

Dropping and rebuilding indexes before and after a data load can speed up the load and yield more efficient indexes. Some drawbacks include the added complexity and potential embedding of schema design information into the application code. (When you add another index to the table being loaded, will you have to update your application code?) Dropping indexes before a load could also have significant performance impacts if users need to be able to query the target table while the load is taking place. Finally, dropping or disabling primary or unique key constraints could cause difficulties if foreign key constraints reference them.

Data Loading Method Elapsed Seconds Database Server
CPU Seconds
Single row inserts without indexes
(time includes index rebuild using
NOLOGGING and large sort area)
130 35

Bulk Inserts Without Indexes

Way back in Oracle V6 an array interface was added to the Oracle Call Interface. Often called bulk binds, array processing allows an application to submit a SQL statement to the database server with many sets of values. Oracle then iterates through the data provided and executes the SQL statement once for each set of values. Bulk operations are much more efficient than repeated single row operations. This is because Oracle only parses the statement once, and bind operations and roundtrips between the application and the database server process are drastically reduced. Furthermore, Oracle learns up front that it will be doing the same thing many times in a row, and this gives Oracle the opportunity to engage economies of scale.

Any application that interacts with Oracle through the Oracle Call Interface (the majority of applications) has the ability to perform bulk operations. However, many application tools and scripting languages do not exploit this facility or make it readily available to application developers. If you are developing an application that will load large volumes of data into Oracle, you would do well to research the support for bulk operations in your development tools.

I modified the application used in the last example to load call center records into arrays in memory and perform bulk inserts 100 rows at a time instead of one row at a time. I retained the logic that dropped the indexes and rebuilt them after the data load. Elapsed time to load the data dropped from 130 seconds to 14 seconds, while CPU time used by the database server process dropped from 35 seconds to 7 seconds. Of course, the amount of time required to rebuild the indexes was not affected by the use of bulk inserts. The actual time to load the data (setting aside the index rebuild time) dropped by about 95%.

A clear advantage of the bulk insert technique is that load time drops dramatically, as does CPU usage. One disadvantage of this technique is slightly increased complexity in application coding and a greater opportunity for bugs. This disadvantage is very modest. However, a more significant shortcoming might be a lack of flexibility–not as many application tools and programming languages support the bulk facility in Oracle. Depending on which tools you use, this option might not be available to you.

Data Loading Method Elapsed Seconds Database Server
CPU Seconds
Bulk row inserts 100 rows at a time
without indexes (time includes
index rebuild using NOLOGGING
and large sort area)
14 7

CREATE as SELECT Using an Oracle 9i External Table

One of the intriguing new features in Oracle 9i is the external table facility. An external table looks somewhat like a regular table in that it has columns with standard Oracle data types and you can query it just like any other table. However, the table data is not stored in a table segment in the database. Instead, the external table is associated with a flat file on the database server. Whenever you query the external table, Oracle parses the flat file and returns the rows to you as if the data were stored in Oracle.

You can join external tables to other tables in your queries, but you cannot index external tables nor can you insert, update, or delete rows. If the flat file on the database server changes or is deleted, this will affect the outcome of future queries against the external table.

External tables open the door to new strategies for loading data into Oracle. You could simply put the flat file on the database server and create an external table in the database. Now you can instantly query the data as if it had been loaded into a conventional Oracle table. This will work for some situations but perhaps not all because the flat file data has not been truly loaded into the database. If the flat file gets removed or overwritten, the old contents will no longer be accessible in the database. Furthermore, your access to the data will be somewhat constrained since indexes cannot be created on external tables.

However, you could use an external table as a vehicle to copy data quickly from a flat file into a conventional Oracle table. For example, you could create a CALLS_EXTERNAL table like this:

        CREATE TABLE calls_external
        (
        call_id    NUMBER,
        call_date  DATE,
        emp_id     NUMBER,
        call_type  VARCHAR2(12),
        details    VARCHAR2(25)
        )
        ORGANIZATION EXTERNAL
        (
        TYPE oracle_loader
        DEFAULT DIRECTORY extract_files_dir
        ACCESS PARAMETERS
          (
          records delimited by newline
          fields terminated by ','
          missing field values are null
            (
            call_id, call_date char date_format date mask
            "mm-dd-yyyy:hh24:mi:ss",
            emp_id, call_type, details
            )
          )
        LOCATION ('calls.dat')
        );

Now, whenever you want to resynchronize the CALLS table with the flat file data, you could drop the CALLS table and recreate it with a statement like:

        CREATE TABLE calls
        (
        call_id    NOT NULL,
        call_date  NOT NULL,
        emp_id     NOT NULL,
        call_type  NOT NULL,
        details
        )
        TABLESPACE tab1 NOLOGGING
        AS
        SELECT call_id, call_date, emp_id, call_type, details
        FROM   calls_external;

Since the CALLS table is a conventional table, you can index it as you wish, and the data in the table will stay constant even after the flat file has been updated or deleted. Note the NOLOGGING keyword. As with index rebuilds, NOLOGGING speeds up the table creation by reducing redo generation. Again do your homework before using the NOLOGGING keyword as it has significant impact on recoverability and standby databases.

I wrote a SQL*Plus script to drop the CALLS table, recreate it from the CALLS_EXTERNAL table as shown above, and add the primary key and non-unique index. Populating the CALLS table in this way took 15 seconds, using 8 seconds of CPU time in the database server process. This case was slightly slower than the bulk insert method discussed in the last section, but you shouldn’t assume that copying from an external table is always slower than bulk inserts.

One advantage of this technique is that it achieves very good performance with very little coding effort. It is also a method that works entirely within the database, so you can even schedule data loads using the dbms_job scheduler. An important disadvantage of this method is that the target table gets dropped and recreated each time a load occurs. This makes the technique inappropriate for incremental data loads where each flat file contains new data to be added to existing data in the database. Also, users will get “table or view does not exist” errors if they try to access the data while the table is being rebuilt. Finally, this method is not available to you if you are using Oracle 8i or earlier.

Data Loading Method Elapsed Seconds Database Server
CPU Seconds
CREATE as SELECT from an
external table with NOLOGGING
(time includes index build using
NOLOGGING and large sort area)
15 8

INSERT Append as SELECT Using an Oracle 9i External Table

The previous section demonstrated how to create a conventional table in the database whose contents reflect the contents of a flat file on the database server. The table could be dropped and recreated at any time to resynchronize it if the contents of the flat file had changed. However, in many situations you want to be able to add data to an existing table without wiping out its previous contents. For these situations, there is the INSERT statement with the Append hint.

Consider the following SQL statement:

        INSERT /*+ Append */ INTO calls
        (
        call_id, call_date, emp_id, call_type, details
        )
        SELECT call_id, call_date, emp_id, call_type, details
        FROM   calls_external;

This statement will read the current contents of the flat file referenced by the CALLS_EXTERNAL table and add the rows into the CALLS table. The Append hint tells Oracle to use a faster mechanism for inserting rows. Rows are written to new data blocks above the segment’s high water mark instead of fussing with the free lists. If the NOLOGGING attribute has been set for the table, then minimal redo is generated and this speeds up the inserts significantly. (Please recall that NOLOGGING has significant impacts on recoverability and standby databases.)

I wrote a SQL*Plus script to disable the primary key and drop the non-unique index on the CALLS table, execute the above INSERT statement, and enable the primary key and recreate the non-unique index. The CALLS table started out empty before I ran the script to load data. Not surprisingly, the data load took about the same amount of time as in the previous section. If the CALLS table had contained data before the load began, however, the load time would have been longer (due to longer index rebuild), and the CREATE as SELECT outlined in the previous section would not have been possible.

As with the CREATE as SELECT from an external table, the INSERT with Append hint offers very good performance with minimal coding effort. If the target table has a large number of rows relative to the number of rows being loaded each time, you may not want to drop and recreate the indexes. Also, if recoverability is a concern, then you may not be able to use the NOLOGGING attribute. With the indexes in place during the load and with full logging, the INSERT with Append hint may not be as fast as desired.

Data Loading Method Elapsed Seconds Database Server
CPU Seconds
INSERT with Append hint as
SELECT from an external table
without indexes and with
NOLOGGING attribute (time
includes index rebuild using
NOLOGGING and large sort area)
15 8

SQL*Loader Conventional and Direct Path

SQL*Loader is a utility provided by Oracle specifically for the purpose of loading large volumes of data from flat files into Oracle tables. The tool has been around for a very long time, and Oracle has enhanced it significantly over the years. It is very powerful, flexible, and fast. However, the syntax is arcane and you must launch it from the operating system. (Invoking SQL*Loader within a dbms_job or a PL/SQL stored procedure is tricky.)

For all of its awkwardness, SQL*Loader still seems to be about the fastest and most efficient way to get flat file data into Oracle. By default, SQL*Loader uses what it calls “conventional path” loading–bulk inserts, basically. The performance is not phenomenal, and there are faster alternatives. However, with a simple “direct=true” on the command line, you can invoke “direct path” loading. In a direct path load, SQL*Loader writes rows directly into new data blocks above the table’s high water mark. This is like inserting with the Append hint as seen in the last section, but it’s even better. In a direct path load, SQL*Loader also updates indexes using a very efficient bulk method.

Loading the call center data using SQL*Loader with conventional path load took about 81 seconds. The database server process used about 12 seconds of CPU time. This was with indexes in place and SQL*Loader bulk inserting rows 100 at a time. With a direct path load, SQL*Loader got the data into the database in just 9 seconds, with only 3 CPU seconds used by the database server process. This was with indexes in place.

Direct path SQL*Loader is fast and efficient. It is even fast when indexes on the target table are not dropped before the load. As for disadvantages, direct path loading has implications for recovery akin to the NOLOGGING keyword. Also, indexes on the target table are unusable for a period during the load. This can impact users trying to access the target table while a load is in progress. Also, indexes can be left in an unusable state if the SQL*Loader session were to crash midway through the load.

Data Loading Method Elapsed Seconds Database Server
CPU Seconds
SQL*Loader conventional path
(indexes in place and rows=100)
81 12
SQL*Loader direct path (indexes
in place)
9 3

Partition Exchange

Most of the data loading techniques we’ve discussed restrict user access to the table being loaded until the load is complete. This restriction is particularly bothersome when flat file data needs to be added into an existing table on a regular basis. Table partitioning, and partition exchange in particular, can be used in conjunction with the data loading techniques discussed so far in order to reduce the impact on data availability to end users. Partitioning requires the Enterprise Edition of the database software–this is the only technique in this paper requiring the Enterprise Edition.

In a partitioned table, what appears to users as one table is actually a collection of multiple segments called “partitions.” Partitions may be added and dropped as needed over time. Partitions may also be exchanged. A partition exchange is where one partition of a partitioned table is swapped with an ordinary freestanding table. The swap occurs entirely within the Oracle data dictionary–no data is physically moved–and therefore partition exchange is extremely fast.

Going back to our call center example, suppose the CALLS table is partitioned. We could create an empty partition called PART04182003. This will hold the call data for April 18, 2003. We then create a table called CALLS_TEMP that has the same columns and data types as the CALLS table.

We now use one of the popular techniques for loading the 100,000 records into the CALLS_TEMP table. Since the application accesses the CALLS table, it does not matter that the CALLS_TEMP table is not accessible during the data load. We can wait until the data load is complete to add indexes and constraints on the CALLS_TEMP table, and this will not impact the users.

Once the data load is complete, the CALLS_TEMP table contains all of the call data for April 18, 2003. The CALLS table, meanwhile, has an empty partition called PART04182003. We execute a partition exchange with the following statement:

        ALTER TABLE calls
        EXCHANGE PARTITION part04182003 WITH TABLE calls_temp
        INCLUDING INDEXES WITHOUT VALIDATION;

The partition exchange only requires an update to the data dictionary, so it runs very quickly. Now the PART04182003 partition of the CALLS table contains the data we just loaded from the flat file, and the CALLS_TEMP table is empty. Assuming the CALLS table had local indexes instead of global indexes, the INCLUDING INDEXES clause of the partition exchange will ensure that the index partitions are exchanged as well and that the indexes remain usable. The WITHOUT VALIDATION clause indicates that we are assuring Oracle that all of the rows in this partition belong there.

Conclusion

There are many different ways to load data into Oracle. Each technique offers its own balance between speed, simplicity, scalability, recoverability, and data availability. Which method is the best for your particular application will depend on your needs in each of these areas. If there were one particular method that was always best, there probably wouldn’t be so many different ways to load data into Oracle.

We have looked at some of the most popular methods for loading data into Oracle. Because concrete examples often make concepts easier to understand, we loaded a sample set of 100,000 records from a call center application using each of the techniques discussed. To help drive home the benefits of certain techniques, we looked at sample run times in a test environment. To recap, here are all of the timing figures in one place:

Data Loading Method Elapsed Seconds Database Server
CPU Seconds
Single row inserts 172 52
Single row inserts without indexes
(time includes index rebuild using
NOLOGGING and large sort area)
130 35
Bulk row inserts 100 rows at a time
without indexes (time includes
index rebuild using NOLOGGING
and large sort area)
14 7
CREATE as SELECT from an
external table with NOLOGGING
(time includes index build using
NOLOGGING and large sort area)
15 8
INSERT with Append hint as
SELECT from an external table
without indexes and with
NOLOGGING attribute (time
includes index rebuild using
NOLOGGING and large sort area)
15 8
SQL*Loader conventional path
(indexes in place and rows=100)
81 12
SQL*Loader direct path (indexes
in place)
9 3

Please keep in mind that this paper does not contain an exhaustive list of techniques for loading data into Oracle. There are many more techniques, tips, and tricks that we haven’t covered here. For example, we did not even touch on the subject of parallelism in data loads. (Inserts with the Append hint can use parallelism in the Enterprise Edition of the Oracle software. Applications, meanwhile, can take a “divide and conquer” approach to loading subsets of data in parallel.)

Comment

  1. constantly i used to read smaller articles or reviews which as well clear their motive, and that
    is also happening with this piece of writing which I am reading at
    this place.

warningComments are closed.