Getting Started with PostgreSQL

PostgreSQL is a commonly requested RDBMS, and for good reason. I'm not going into advocacy or even advantages, I shall merely explain the steps for what I consider to be a basic setup; it's not minimal by any means, but heavy usage, large datasets and special cases such as tablespaces will not be considered, neither will high availability or replication.

Installation

PostgreSQL is available in binary package form such as rpm or deb for most major linux distributions. If the version provided by your distribution is not new or sparkly enough for you, please build the latest and greatest with the mods and tweaks you desire and then make a package for it.

Basic Configuration

PostgreSQL requires its own user and cluster creation must be done by that user:

$ mkdir -p /var/pgsql/data; chown -R postgres /var/pgsql
$ su - postgres
$ initdb -D /var/pgsql/data

Starting, testing:

$ postgres -D /var/pgsql/data > /var/pgsql/logfile 2>&1 &
$ createdb test
$ psql test

And don't forget you probably want it to start upon reboot as well.

You'll probably want to change a few settings in postgresql.conf:

shared_buffers
work_mem
maintenance_work_mem
checkpoint_timeout

You can see all settings with something like::: SELECT name,setting,unit from pg_settings;

Consider also:

putting logs/WAL on a different storage subsystem
asynchronous commits with synchronous commit=off (not as risky as fsync=off, because you will never get corruption, although you are risking loss)

Tuning

As is common in the world of database administration, there are many knobs to fiddle; some of them can easily change the experience from "this sucks" to "wow" (or the other way around). But it can get rather complex, sometimes controversial and is usually highly dependent on your environment and usage situation, so I'm not going to go into it here. I'll merely list a few places I've looked when I've wanted more detail.

FreeBSD

The above steps are taken care of for you by the port; after you install the port, all you need is:

/usr/local/etc/rc.d/postgresql initdb
/usr/local/etc/rc.d/postgresql start

CentOS

The Postgresql Development Group (PGDG) publishes yum repositories for up-to-date versions of postgresql for CentOS/RHEL, which makes it very easy to stay current:

rpm -Uvh http://yum.postgresql.org/9.1/redhat/rhel-6-x86_64/pgdg-centos91-9.1-4.noarch.rpm
yum install postgresql91{,-libs,-devel,-server}
service postgresql-9.1 initdb
service  postgresql-9.1 start

OSX

If you installed via macports:

/opt/local/etc/LaunchDaemons/org.macports.postgresql84-server/postgresql84-server.wrapper start
sudo launchctl load -w /Library/LaunchDaemons/org.macports.postgresql84-server.plist

If you installed via homebrew:

brew install postgresql
initdb /usr/local/var/postgres -E utf8
pg_ctl -D /usr/local/var/postgres -l /usr/local/var/postgres/server.log start

User Administration

Authentication/Access

pg_hba.conf specifies permissible authentication methods for users -> databases from specified hosts/networks.

Roles/Privileges

Create a role for ordinary usage such as for a webapp and grant it privileges:

CREATE ROLE rolename WITH LOGIN NOSUPERUSER NOCREATEROLE ENCRYPTED PASSWORD 'clear_text_password'
GRANT SELECT, INSERT, UPDATE, DELETE, REFERENCES ON tablename TO rolename

Roles allow nested/group privileges.

If you create sequences, you need to grant select,update on them in order to allow a user to use nextval.

Managing Databases

Create a new database, the rolename should already exist (defaults to the current user):

CREATE DATABASE dbname WITH OWNER rolename

Logging

Following the basic configuration above will result in a logfile at /var/pgsql/logfile. If you take out the shell redirection of stderr when starting postgres and set some parameters, you can have postgres take care of your logging and log rotation for you:

redirect_stderr = on (must be set at server startup)
log_filename postgresql_log.%a (default postgresql-%Y-%m-%d_%H%M%S.log)
log_rotation_age = 1440
log_truncate_on_rotation = on

This will keep 7 days of logs, one log file per day named server_log.Mon, server_log.Tue, etc, and automatically overwrite last week's log with this week's log.

Vacuum

From the manual:

"PostgreSQL's VACUUM command must be run on a regular basis for several reasons:

To recover or reuse disk space occupied by updated or deleted rows.
To update data statistics used by the PostgreSQL query planner.
To protect against loss of very old data due to transaction ID wraparound."

and

"There are two variants of the VACUUM command. The first form, known as "lazy vacuum" or just VACUUM, marks expired data in tables and indexes for future reuse; it does not attempt to reclaim the space used by this expired data unless the space is at the end of the table and an exclusive table lock can be easily obtained. Unused space at the start or middle of the file does not result in the file being shortened and space returned to the operating system. This variant of VACUUM can be run concurrently with normal database operations.

The second form is the VACUUM FULL command. This uses a more aggressive algorithm for reclaiming the space consumed by expired row versions. Any space that is freed by VACUUM FULL is immediately returned to the operating system. Unfortunately, this variant of the VACUUM command acquires an exclusive lock on each table while VACUUM FULL is processing it. Therefore, frequently using VACUUM FULL can have an extremely negative effect on the performance of concurrent database queries. "

For lazy vacuuming, some edits to postgresql.conf:

autovacuum = on
autovacuum_naptime = 1000min
stats_start_collector = on
stats_row_level = on
vacuum_cost_delay = 1000

For full vacuuming, I run the following from cron, at times when usage is low:
```
vacuumdb --all --full --analyze
```

For high usage databases, defined as highly I/O constrained databases, other values for these parameters or indeed other approaches should be considered.

Reindexing

Unless you have very active databases (heavy usage or heavy churn), reindexing is probably not necessary. Consider it a performance optimisation. However, there is one major exception: after a recovery operation (any time you replay the Write Ahead Log) you will need to reindex any hash indexes

Backup and Restore

pg_dump and pg_dumpall are the tools for dumping postgresql databases. If you use --format to specify an archive format, you will be able to selectively restore from the archive, but you will need to use pg_restore to do so. Otherwise the ordinary psql is your restore tool.

Basic usage

The best way to get postgresql to show the DDL for your tables is:

pg_dump -s <dbname> [-t tablename1 [-t tablenameN]]

which, more-or-less, gives effectively the same result as MySQL's SHOW CREATE TABLE.

This will dump all the databases, roles and tablespace information:

pg_dumpall --clean > dumpfile
psql -f dumpfile postgres

For individual databases:

pg_dump --clean *dbname* > dumpfile
psql *dbname* < dumpfile

To a remote destination:

pg_dump --clean *dbname* | ssh hostname 'psql *dbname*'

From the manual: "If your database schema relies on OIDs (for instance as foreign keys) you must instruct pg_dump to dump the OIDs as well. To do this, use the -o command line option."

AllGoodBits.org