Tips for Loading Data into RDBMS
Sometimes people set up a little database, perhaps put some data into it, point an application at it and let it run. Sometimes that's the right balance of pragmatism and perfection. Sometimes, however, there are reasons to be a little more diligent, if the situation calls for any of:
- large datasets
- testing (either behaviour verification or performance metrics)
- frequent (re)deployment, perhaps for archive instances, additional usage needs, etc.
This article therefore collects a few tips to improve the process of loading data into an RDBMS: some general tips and some suggestions for particular databases. I'm going to be lazy about explicitly stating my assumptions; this is mostly about listing some of the low-hanging fruit of improving database load times, not about squeezing out every last bit of performance.
General Principles
Since there's no point in making an optimization effort unless you can determine whether it helps, you want some visibility into your load process. time(1) and pv(1) are both good tools for adding a little of that visibility (see the snippet after the list below); if your needs are heavier duty, you might want to instrument the process more deeply. Here are some brief hints:
- mytop(1)
- a metrics tool such as ganglia with a module/plugin for mysql stats
- send data at "breakpoints" in the load process to statsd or graphite
Consider the following, but don't apply it blindly:
- If you know the data to be imported is good, turn off constraint checking: foreign key and uniqueness checks
- Create your indices after the load
- Turn off transactions
- Turn off/down logging
- Chunk your input data to limit the number of rows loaded by a single command/from a single file (see the sketch after this list)
- Have the input data as "close" as possible (hardware-wise/network-wise) to the database server
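As a sketch of the chunking point, GNU split(1) can break a large delimited file into pieces that are then loaded one at a time; the file, table and database names and the chunk size here are placeholders, and LOAD DATA LOCAL INFILE needs local_infile enabled:

# Split a large tab-delimited file into one-million-line chunks;
# "bigtable.txt" is a placeholder and the chunk size is arbitrary.
split -d -l 1000000 bigtable.txt bigtable.part.

# Load the chunks one at a time so each command stays a manageable size.
for chunk in bigtable.part.*; do
    mysql --local-infile=1 --password='badsecret' dbname \
        -e "LOAD DATA LOCAL INFILE '${chunk}' INTO TABLE bigtable;"
done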
MySQL
I'm assuming the InnoDB storage engine; other storage engines may have other or different issues and benefit from different approaches.
Dump Format
Dumping your databases to delimited-text format with mysqldump and reloading with mysqlimport is much faster than using the all-SQL format. Here is a basic example using a shared directory:
mysqldump -T /glusterfs/miscbackup/dbname \
    --opt --add-drop-database \
    --events --routines --triggers \
    --extended-insert --delayed-insert \
    --flush-logs --default-character-set=utf8 \
    <database name>

for table in $(ls /glusterfs/miscbackup/dbname/*.sql | sed 's/\.sql$//'); do
    mysql --password='badsecret' <database name> < ${table}.sql
    mysqlimport --use-threads=$(grep -c processor /proc/cpuinfo) \
        --password='badsecret' <database name> ${table}.txt
done
This import can equivalently be done in a mysql client session using LOAD DATA INFILE.
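For example, a single table from the dump above could be loaded like this; a sketch with placeholder database and table names, assuming the server can read the dump directory and relying on the tab-delimited defaults that mysqldump -T produces:

# LOAD DATA INFILE's defaults (tab-separated fields, newline-terminated rows)
# match mysqldump -T output; "dbname" and "tablename" are placeholders.
mysql --password='badsecret' dbname -e "
    LOAD DATA INFILE '/glusterfs/miscbackup/dbname/tablename.txt'
    INTO TABLE tablename;"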
Constraints
Disable foreign key and uniqueness checks for the session doing the load, and don't forget to turn them back on afterwards (a sketch follows the statements below):
- SET FOREIGN_KEY_CHECKS=0;
- SET UNIQUE_CHECKS=0;
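Since both of these are session variables, they have to reach the server in the same session as the statements doing the load. One way of arranging that, sketched with placeholder names for the database and dump file:

# Feed the SETs and the table data through a single mysql session;
# "dbname" and the dump path are placeholders.
{
    echo "SET FOREIGN_KEY_CHECKS=0; SET UNIQUE_CHECKS=0;"
    cat /glusterfs/miscbackup/dbname/tablename.sql
    echo "SET FOREIGN_KEY_CHECKS=1; SET UNIQUE_CHECKS=1;"
} | mysql --password='badsecret' dbname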
Logging
- SET sql_log_bin=0; (session-scoped, so issue it in the same session/stream as the load itself)
- SET GLOBAL slow_query_log = 0;
Flushing
The rate and method of flushing transactions and logs to disk affect performance. As a simplification, the less demanding mysqld/InnoDB is about flushing (that is, the more it lets the kernel, drivers and drive controllers take control), the better the performance you'll likely see, although this is not always true. The downside is basically decreased robustness in the face of a process/OS/hardware crash. Since we're loading a backup, I'm assuming that a crash just means we start over and there won't be any data loss anyway, so let's get as much performance as possible.
Don't flush logs upon commit:
set global innodb_flush_log_at_trx_commit = 0;
Don't sync the binary log:
set global sync_binlog = 0;
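Both of these are global settings, so once the load has finished it's worth putting back whatever your server normally runs with; a sketch, assuming the fully durable value of 1 for both:

# Restore durable flushing after the load; 1 is the fully durable setting
# for both variables, but check your my.cnf for what your server normally uses.
mysql --password='badsecret' -e "
    SET GLOBAL innodb_flush_log_at_trx_commit = 1;
    SET GLOBAL sync_binlog = 1;"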
Other Configuration
PostgreSQL
The PostgreSQL docs themselves are very clear and useful about populating a database, although there is no mention of pg_bulkload.
Briefly:
- COPY: COPY <table_name> FROM '/path/to/table.csv' DELIMITER ',' CSV HEADER; although if you're importing data created by pg_dump(1), the points about COPY, indices and constraints are handled automatically by psql(1) or pg_restore(1)
- Increase checkpoint_segments
- Disable WAL archival and streaming replication (wal_level = minimal), which lets bulk loads skip most WAL writes
- Increase maintenance_work_mem
- Run ANALYZE afterwards; this makes a huge difference for large loads (a sketch tying these together follows)
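A minimal sketch of a single-table load along these lines; the database name, table, column and CSV path are all placeholders, and server-side COPY needs appropriate privileges (psql's \copy is the client-side alternative):

# One psql session: raise maintenance_work_mem for this session, bulk load,
# build the index afterwards, then ANALYZE. All names and paths are placeholders.
psql dbname <<'SQL'
SET maintenance_work_mem = '1GB';
COPY items FROM '/path/to/items.csv' DELIMITER ',' CSV HEADER;
CREATE INDEX items_name_idx ON items (name);
ANALYZE items;
SQL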