PostgreSQL has become an increasingly viable database platform to serve as storage for applications, from classic corporate database use to the latest web apps. But getting the best performance from it has not been an easy subject to learn. You need just the right combination of rules of thumb to get started, solid monitoring, and maintenance to keep your system running well, suggestions for troubleshooting, and hints for add-on tools to add the features the core database doesn’t try to handle on its own.
In this article we will look at some tips and tricks for high performance of PostgreSQL. The tips and tricks covered in this article might provide the right background trivia to make your use of the database simpler when dealing with the smaller problems.
Upgrading without any replication software
Tip: A program originally called pg_migrator at http://pgfoundry.org/projects/pg-migrator/ is capable of upgrading from 8.3 to 8.4 without the dump and reload. This process is called in-place upgrading.
Minor version upgrades
Tip: One good way to check if you have contrib modules installed is to see if the pgbench program is available. That’s one of the few contrib components that installs a full program, rather than just the scripts you can use.
Using an external drive for a database
Tip: External drives connected over USB or Firewire can be quite crippled in their abilities to report SMART and other error information, due to both the limitations of the common USB/Firewire bridge chipsets used to connect them and the associated driver software. They may not properly handle write caching for similar reasons. You should avoid putting a database on an external drive using one of those connection methods. Newer external drives using external SATA (eSATA) are much better in this regard, because they’re no different from directly attaching the SATA device.
Implementing a software RAID
Tip: When implementing a RAID array, you can do so with special hardware intended for that purpose. Many operating systems nowadays, from Windows to Linux, include software RAID that doesn’t require anything beyond the disk controller on your motherboard.
Driver support for Areca cards
Tip: Driver support for Areca cards depends heavily upon the OS you’re using, so be sure to check this carefully. Under Linux for example, you may have to experiment a bit to get a kernel whose Areca driver is extremely reliable, because this driver isn’t popular enough to get a large amount of testing. The 2.6.22 kernel works well for several heavy PostgreSQL users with these cards.
Free space map (FSM) settings
Tip: Space left behind from deletions or updates of data is placed into a free space map by VACUUM, and then new allocations are done from that free space first, rather than by allocating new disk for them instead.
Using a single leftover disk
Tip: A typical use for a single leftover disk is to create a place to store non-critical backups and other working files, such as a database dump that needs to be processed before being shipped elsewhere.
Ignoring crash recovery
Tip: If you just want to ignore crash recovery altogether, you can do that by turning off the fsync parameter. This makes the value for wal_sync_method irrelevant, because the server won’t be doing any WAL sync calls anymore.
Disk layout guideline
Tip: Avoid putting the WAL on the operating system drive, because they have completely different access patterns and both will suffer when combined. Normally this might work out fine initially, only to discover a major issue when the OS is doing things like a system update or daily maintenance activity. Rebuilding the filesystem database used by the locate utility each night is one common source on Linux for heavy OS disk activity.
Splitting WAL on Linux systems running ext3
Tip: On Linux systems running ext3, where fsync cache flushes require dumping the entire OS cache out to disk, split the WAL onto another disk as soon as you have a pair to spare for that purpose.
Common tuning techniques for good performance
Tip: Increasing read-ahead, stopping updates to file access timestamps, and adjusting the amount of memory used for caching are common tuning techniques needed to get good performance on most operating systems.
Optimization of default memory size
Tip: The default memory sizes in the postgresql.conf are not optimized for performance or for any idea of a typical configuration. They are optimized solely so that the server can start on a system with low settings for the amount of shared memory it can allocate, because that situation is so common.
A handy system column to know about; ctid
Tip: ctid, which can still be used as a way to uniquely identify a row, even in situations where you have multiple rows with the same data in them. This provides a quick way to find a row more than once, and it can be useful for cleaning up duplicate rows from a database, too.
Don’t use pg_buffercache for regular monitoring
Tip: pg_buffercache requires broad locks on parts of the buffer cache when it runs. As such, it’s extremely intensive on the server when you run any of these queries. A snapshot on a daily basis or every few hours is usually enough to get a good idea how the server is using its cache, without having the monitoring itself introduce much of a load.
Tip: The preferred path to get a lot of data into the database is using the COPY command. This is the fastest way to insert a set of rows. If that’s not practical, and you have to use INSERT instead, you should try to include as many records as possible per commit, wrapping several into a BEGIN/COMMIT block.
External loading programs
Tip: If you’re importing from an external data source (a dump out of a non-PostgreSQL database for example), you should consider a loader that saves rejected rows while continuing to work anyway, like pgloader: http://pgfoundry.org/projects/pgloader/. pgloader will not be as fast as COPY, but it’s easier to work with on dirty input data, and it can handle more types of input formats too.
Tuning for bulk loads
Tip: The most important thing to do in order to speed up bulk loads is to turn off any indexes or foreign key constraints on the table. It’s more efficient to build indexes in bulk and the result will be less fragmented.
Skipping WAL acceleration
Tip: The purpose of the write-ahead log is to protect you from partially committed data being left behind after a crash. If you create a new table in a transaction, add some data to it, and then commit at the end, at no point during that process is the WAL really necessary.
Tip: PostgreSQL 8.4 introduced an automatic parallel restore that lets you allocate multiple CPU cores on the server to their own dedicated loading processes. In addition to loading data into more than one table at once, running the parallel pg_restore will even usefully run multiple index builds in parallel.
Post load cleanup
Tip: Your data is loaded, your indexes recreated, and your constraints active. There are two maintenance chores you should consider before putting the server back into production. The first is a must-do: make sure to run ANALYZE against all the databases. This will make sure you have useful statistics for them before queries start running.
Tip: One of the most effective ways to speed up queries against large data sets that are run more than once is to cache the result in a materialized view, essentially a view that is run and its output stored for future reference.