Tips for Loading Data into RDBMS
Sometimes people set up a little database, perhaps put some data into it, point an application at it and let it run. Sometimes that's the right balance of pragmatism and perfection. Sometimes, however, there are reasons to be a little more diligent, if the situation calls for any of:
- large datasets
- testing (either behaviour verification or performance metrics)
- frequent (re)deployment, perhaps for archive instances, additional usage needs, etc.
This article therefore collects a few tips to improve the process of loading data into an RDBMS: some general tips and some suggestions for particular databases. I'm going to be lazy about explicitly stating my assumptions; this is mostly about listing some of the low-hanging fruit of improving database load times, not about squeezing out every last bit of performance.
General Principles
Since there's no point in making an optimization effort unless you can determine whether it helps, you want some visibility into your load process. time(1) and pv(1) are both good tools for adding a little of that visibility (see the snippet after the list below); if your needs are heavier duty, you might want to instrument the process more deeply. Here are some brief hints:
- mytop(1)
- a metrics tool such as ganglia with a module/plugin for mysql stats
- send data at "breakpoints" in the load process to statsd or graphite
Consider the following, but don't apply it blindly:
- If you know the data to be imported is good, turn off constraint checking: foreign key and uniqueness checks
- Create your indices after the load
- Turn off transactions
- Turn off/down logging
- Chunk your input data to limit the number of rows loaded by a single command/from a single file (see the sketch after this list)
- Have the input data as "close" as possible (hardware-wise/network-wise) to the database server
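As a sketch of the chunking point, GNU split(1) can break a large delimited file into pieces that are then loaded one at a time; the file, table and database names and the chunk size here are placeholders, and LOAD DATA LOCAL INFILE needs local_infile enabled:

# Split a large tab-delimited file into one-million-line chunks;
# "bigtable.txt" is a placeholder and the chunk size is arbitrary.
split -d -l 1000000 bigtable.txt bigtable.part.

# Load the chunks one at a time so each command stays a manageable size.
for chunk in bigtable.part.*; do
    mysql --local-infile=1 --password='badsecret' dbname \
        -e "LOAD DATA LOCAL INFILE '${chunk}' INTO TABLE bigtable;"
done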
MySQL
I'm assuming the InnoDB storage engine; other storage engines may have other or different issues and benefit from different approaches.
Dump Format
Dumping your databases to delimited-text format with mysqldump and reloading with mysqlimport is much faster than using the all-SQL format. Here is a basic example using a shared directory:
mysqldump -T /glusterfs/miscbackup/dbname \
    --opt --add-drop-database \
    --events --routines --triggers \
    --extended-insert --delayed-insert \
    --flush-logs --default-character-set=utf8 \
    <database name>

for table in $(ls /glusterfs/miscbackup/dbname/*.sql | sed 's/\.sql$//'); do
    mysql --password='badsecret' <database name> < ${table}.sql
    mysqlimport --use-threads=$(grep -c processor /proc/cpuinfo) \
        --password='badsecret' <database name> ${table}.txt
done
This import can equivalently be done in a mysql client session using LOAD DATA INFILE.
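For example, a single table from the dump above could be loaded like this; a sketch with placeholder database and table names, assuming the server can read the dump directory and relying on the tab-delimited defaults that mysqldump -T produces:

# LOAD DATA INFILE's defaults (tab-separated fields, newline-terminated rows)
# match mysqldump -T output; "dbname" and "tablename" are placeholders.
mysql --password='badsecret' dbname -e "
    LOAD DATA INFILE '/glusterfs/miscbackup/dbname/tablename.txt'
    INTO TABLE tablename;"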
Constraints
Disable foreign key and uniqueness checks for the session doing the load, and don't forget to turn them back on afterwards (a sketch follows the statements below):
- SET FOREIGN_KEY_CHECKS=0;
- SET UNIQUE_CHECKS=0;
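Since both of these are session variables, they have to reach the server in the same session as the statements doing the load. One way of arranging that, sketched with placeholder names for the database and dump file:

# Feed the SETs and the table data through a single mysql session;
# "dbname" and the dump path are placeholders.
{
    echo "SET FOREIGN_KEY_CHECKS=0; SET UNIQUE_CHECKS=0;"
    cat /glusterfs/miscbackup/dbname/tablename.sql
    echo "SET FOREIGN_KEY_CHECKS=1; SET UNIQUE_CHECKS=1;"
} | mysql --password='badsecret' dbname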
Logging
- SET sql_log_bin=0; (session-scoped, so issue it in the same session/stream as the load itself)
- SET GLOBAL slow_query_log = 0;
Flushing
The rate and method of flushing transactions and logs to disk affect performance. As a simplification, the less demanding mysqld/InnoDB is about flushing (that is, the more it lets the kernel, drivers and drive controllers take control), the better the performance you'll likely see, although this is not always true. The downside is basically decreased robustness in the face of a process/OS/hardware crash. Since we're loading a backup, I'm assuming that a crash just means we start over and there won't be any data loss anyway, so let's get as much performance as possible.
Don't flush logs upon commit:
set global innodb_flush_log_at_trx_commit = 0;
Don't sync the binary log:
set global sync_binlog = 0;
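Both of these are global settings, so once the load has finished it's worth putting back whatever your server normally runs with; a sketch, assuming the fully durable value of 1 for both:

# Restore durable flushing after the load; 1 is the fully durable setting
# for both variables, but check your my.cnf for what your server normally uses.
mysql --password='badsecret' -e "
    SET GLOBAL innodb_flush_log_at_trx_commit = 1;
    SET GLOBAL sync_binlog = 1;"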
Other Configuration
PostgreSQL
The PostgreSQL docs themselves are very clear and useful about populating a database, although there is no mention of pg_bulkload.
Briefly:
- COPY: COPY <table_name> FROM '/path/to/table.csv' DELIMITER ',' CSV HEADER; although if you're importing data created by pg_dump(1), the points about COPY, indices and constraints are handled automatically by psql(1) or pg_restore(1)
- Increase checkpoint_segments
- Disable WAL archival and streaming replication (wal_level = minimal), which lets bulk loads skip most WAL writes
- Increase maintenance_work_mem
- Run ANALYZE afterwards; this makes a huge difference for large loads (a sketch tying these together follows)
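A minimal sketch of a single-table load along these lines; the database name, table, column and CSV path are all placeholders, and server-side COPY needs appropriate privileges (psql's \copy is the client-side alternative):

# One psql session: raise maintenance_work_mem for this session, bulk load,
# build the index afterwards, then ANALYZE. All names and paths are placeholders.
psql dbname <<'SQL'
SET maintenance_work_mem = '1GB';
COPY items FROM '/path/to/items.csv' DELIMITER ',' CSV HEADER;
CREATE INDEX items_name_idx ON items (name);
ANALYZE items;
SQL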