How are transactions organized physically
When InnoDB creates a new transaction it is not yet committed. The database has not yet made any promises to the application and so we do not really have to make anything persistent so far.
To be fast InnoDB tries to assemble the transaction in a memory buffer, the innodb_log_buffer. It should be sufficiently large that you actually can assemble such a transaction in memory without needing to write it out in part into the redo log. A size of 1M to 8M is normal.
Once a transaction is to be committed InnoDB has to read the page from disk which contains the image of the row that is being changed. It then has to actually make that change in memory. The changed page is cached in a memory pool called the innodb_buffer_pool. This pool also caches unchanged pages that have been accessed by read operations. All of these pages on disk and in memory are 16K in size and the innodb_buffer_pool_size determines how much RAM we will use as a cache for such pages – usually as much as we can spare.
The modified page in memory is generally NOT written back to disk immediately. Instead the change is recorded at the end of the current redo log on disk – this is your ib_logfile0 or ib_logfile1. The page is also marked as “dirty” (= to be written back to disk) in memory.
A dirty page is written out to the tablespace in three cases:
- The redo log, which is organized as a ring buffer, is full. To free up some space we will write out dirty pages in redo log order so that we can advance the trailing pointer of the redo log ring buffer an make some room.This situation is called an Innodb_log_wait and will be registered in the status counter of the same name.
- InnoDB requires a free page from the InnoDB buffer pool but cannot find one. Usually we can free a page in the buffer pool by giving up a page that is not marked dirty. When a page is not marked dirty its contents can be reloaded from disk at any time and so we can safely give it up in memory. But when the buffer pool holds only dirty pages this is impossible and we actually have to flush dirty pages to disk before we can free them up for other uses.This situation is called Innodb_buffer_pool_wait_free and will be registered in a status counter of the same name. InnoDB tries to avoid this situation: Whenever more than innodb_max_dirty_pages_pct percent many pages are marked dirty a checkpoint is forced and dirty pges will be written.
- InnoDB feels idle and will write out batches of 64 pages each to disk once a second.This is normal and will not be specifically registered (but will of course bump Innodb_pages_written like everything else).
Relevant config entries in my.cnf:
# assembly before commit
innodb_log_buffer_size = 8M# Percentage of buffer pool pages that
# may be dirty before forcing a
# Default = 90, ok
innodb_max_dirty_pages_pct = 90
Relevant counters from SHOW GLOBAL STATUS:
Innodb_log_waits# Counter for “buffer pool all dirty”
Choosing a useful redo log size
The redo log logs transactions and entries inside the redo log are proportional to the transactions size because we are logging rows not pages. The purpose of the log is to be able to delay writeback of those 16K pages to the tablespace. Often a page contains more than one row and many transactions close together in time will change the same page or even the same row. The write to the redo log will make all those changes sychronously persistent, but can do so using linear writes to disk. Writing back changes to the table space will involve disk seeks, and disk seeks are slow. Having a redo log minimizes or eliminates such random writes.
Usually the redo log should be large enough and never fill up. Consequently your Innodb_log_waits counter should be 0 or at least not move when you look at it twice. If you experience Innodb_log_wait events one of two situations exists: Your server has write bursts larger than your redo log – the redo log is too small and must be extended. Or your server has persistent high write load and the redo log will overflow no matter how large you make it. In this case, but more disks or choose other ways to distribute the write load to more spindles.
By default the redo log consists of two files (innodb_log_files_in_group), each of which is 5M in size (innodb_log_file_size), for a total of 10M. This is usually much to small. Ideally you should have two files which are 64M to 256M in size, resulting in a total redo log of 128M to 512M. In any case the redo log cannot be larger than 4096M = 4G, even if you are on a 64 bit box.
Before MySQL 5.0 it was important not to oversize your redo log: Recovery time after a server crash has been dependent on the size of the used part of the redo log because InnoDB first did the redo log recovery and then entered operational mode. With 5.0 this changed: Log recovery is done in background while the server is already online, so the redo log size is no longer controlling the servers recovery time.
If you change the innodb_log_file_size when ib_logfile0 and ib_logfile1 already exist your InnoDB will refuse to start and leave a message in your servers error log. The message is a complaint about real log file size and configured log file size disagreeing – and would you please kindly fix that?
So in order to change the size of your redo log proper procedure must be followed: Shut down your server cleanly. Double check the log and that there is no server process any more. Then move away the two existing ib_logfile? to some other location and change the my.cnf to reflect the desired new innodb_log_file_size. Restart the server. In your error log you will find messages about new ib_logfiles being created (and a number of really scary complaints all of which you must ignore). Once the server is online again and you have checked that your data is still present you may delete the old logfiles.
Relevant entries in my.cnf:
innodb_log_files_in_group = 2# Size of one ib_logfile?
innodb_log_file_size = 256M
How InnoDB stores data in files
As shown in an earlier article of this series, InnoDB has two modes of operation: If innodb_file_per_table = 0 all data is being stored in one or multiple ibdata central tablespace files. If innodb_file_per_table = 1 data is being stored in .ibd files. The central tablespace file still must exist, but stored only administrative data and the undo log.
The central tablespace files are being created in innodb_data_home_dir, unless you specify them with full pathnames. If innodb_data_home_dir is empty, datadir is being used – this is good, so leave it empty. The individual tablespace files are then specified with some magic syntax inside innodb_data_file_path – look it up in the manual for details. The default string is “ibdata1:10M:autoextend”, which will create a 10M file named ibdata1 in datadir.
Because of the autoextend parameter this file will be grown on demand. Growth will be in steps of innodb_autoextend_increment megabytes. The default here is 8, which is far to small if you are using innodb_file_per_table = 0.
You should be operating with innodb_file_per_table = 1 to get one file per table. This enables you to reclaim disk space at the operating system level with OPTIMIZE TABLE. At innodb_file_per_table = 1 the defaults of 10M initial file and 8M steps are ok, so do not change them.
If you need or want to run with innodb_file_per_table = 0 you will have a very large ibdata1 file. Make sure that your operating system and backup tools can conveniently deal with such large files. If that is not the case you might need a more complicated innodb_data_file_path statement specifying a large number if ibdata files – each of them for example 2G in size or whatever your limit is.
One of your ibdata files should be autoextending and the step size should be reasonably large in order to help the operating system to allocate disk space in a nonfragmented way. A good step size is 1% to 5% of your disk space so that your tablespace file will eventually fill the entire disk in no more than 20 to 100 steps. This is nicely finegrained to manage disk space but coarsely grained enough for efficient allocation. On a filesystem of 200G you would then choose an innodb_autoextend_increment of 2048 (2048M = 1% of 200G) or even 10240 (10G = 10240M = 5% of 200G).
When using innodb_file_per_table = 1 please consider the increased need for filehandles. Set innodb_open_files to provide at least one filehandle per table. You might need to up open_files_limit as well – this is also used to cache .frm files and MyISAM files, so it must be even larger.
Relevant entries in my.cnf (File per Table):
innodb_file_per_table = 1# Where to put the ibdata (default: datadir)
# How to create the ibdata
innodb_data_file_path = “ibdata1:10M:autoextend”
# ibdata growth step
innodb_autoextend_increment = 8
# More file handles
# One per table
innodb_open_files = 2048
# On linux, we might up this properly
open_files_limit = 32768
Relevant entries in my.cnf (Single tablespace file):
innodb_file_per_table = 0# Where to put the ibdata (default: datadir)
# How to create the ibdata
# We want a single 2G file
innodb_data_file_path = “ibdata1:2048M:autoextend”
# ibdata growth step
# Growing in 2G steps (1% of 200G)
innodb_autoextend_increment = 2048
How InnoDB paints data to disk
We have already seen that writing commands will cause only disk reads in InnoDB. Any INSERT or UPDATE will create a log buffer and will mark pages as dirty inside the buffer pool for data and undo log pages.
Once we commit the change is written to the redo log – at least if innodb_flush_log_at_trx_commit = 1. The commit will now perform a write system call to push data from the server into the operating system file system buffer cache. It will then issue a flush system call to force the operating system to flush the file system buffer cache to disk. Only then the data is persistent and the commit is done.
A linear write to the redo log is much faster than a seek and a write to the tablespace file, but even then this is a relatively slow operation which can incur wait times of several millisecons – unless you happen to have a disk controller with a battery buffers RAM (BBU RAM). If you do not care about your data you have the option to decouble the commit from the disk and avoid those waits at the expense of data persistence.
At innodb_flush_log_at_trx_commit = 2 a commit will trigger the write system call to push data out of the server into the file system buffer cache, but the flush to force the actual disk write will only happen once a second or so. This is much faster because we do no longer have to wait for the disk. If your mysql server process crashes, no data is lost – it is all inside the file system buffer cache and will eventually be written to disk. But if your mysql server hardware crashes, though, up to one second of redo log may be missing. There may be data where the application (or a user) has been signalled a successful write which now is lost.
Depending on the business case that is being implemented it may be that this is a relevant error or not. From the computer scientists point of view “innodb_flush_log_at_trx_commit = 2″ is a violation of the ACID principle of Codd and therefore wrong. From the business perspective the behaviour can still be right. That would be the case when the missing data can be otherwise reproduced or then the correction of the error through customer service is cheaper than the additional cost in hardware that would be necessary to deliver the required performance at innodb_flush_log_at_trx_commit = 1.
With innodb_flush_log_at_trx_commit = 0 write strategy is even more relaxed – “commit” is now a purely logical operation generating now write or flush system calls. Instead the redo log is written and flushed only once a second. This is slightly faster than mode “2″ but not a lot.
No matter what value you choose for this variable, the database server will recover correctly after a crash. In any case the database will return into a consistent transactional state. What is different is the point in time (the latest visible transaction number to which the system will recover) that will be reached when recovery finishes.
Another way to influence the behavior of InnoDB is the choise of a innodb_flush_method. In Unix the valid values are “fdatasync” (the default), “O_DSYNC”, “O_DIRECT”, “littlesync”, “nosync”, in Windows “normal” and “unbuffered” (the default) as well as “async_unbuffered” (the default in Windows XP and Windows 2000) are offered.
The major idea with O_DSYNC and O_DIRECT is to open the redo log file in a way that the file system buffer cache is disabled. The database then does all the buffering and each write system call will hit the disk immediately with no need for a flush system call ever. In Linux this can be achieves by using innodb_flush_method = O_DIRECT. InnoDB will then generate only write system calls and will never call flush. This is no longer necessary because writes will always write to disk unbuffered.
Relevant configuration for my.cnf (for Linux):
innodb_flush_log_at_trx_commit = 2
innodb_flush_method = O_DIRECT# Secure writing for ACID compliance
innodb_flush_log_at_trx_commit = 1
innodb_flush_method = O_DIRECT
InnoDB works better when we limit the number of threads concurrently operating inside the InnoDB storage engine proper. There can be innodb_thread_concurrency many threads inside the engine at once. Many formulas exist to calculate this number (”number of things you want to keep busy = cpus + disks”, “cores times two”), but it all comes down to the fact that current versions of InnoDB expose worse performance when this value is set to larger values than 16 or 32 depending on the workload.
If you have more concurrent transactions than innodb_thread_concurrency allows surplus threads will have to wait. Often a single thread will call into the storage engine through the handler interface many times in order to execute a primitive operation such as key lookup only to return into the MySQL general SQL layer. To answer a single SQL query there may be many transitions between the general SQL layer and the InnoDB engine.
To prevent our thread from waiting each time it want to do something inside the engine it will get innodb_concurrency_tickets many “Tickets” when it is granted initial entry to the engine. The tread can now enter and leave the engine that often without needing to ask for additional permission each time. Useful values to experiment if you have a sufficiently big and busy box are “number of records in one block”, “… in one 64 block segment” or “number of records we expect this query to read”.
A similar variable is innodb_commit_concurrency, which limits the number of threads committing in parallel. This parameter limits the resource usage inside the log buffer and regulates contention on the redo log.
For historic reaons there is a variable named thread_concurrency. The value here ends up in a call to pthread_setconcurrency(), which does nothing in current versions of Linux and Solaris. In Solaris8 it affected the internal mapping of User threads to kernel threads. Ignore it on current machines, it is a no-op.
Relevant entries in my.cnf:
innodb_thread_concurrency = 16
innodb_concurrency_tickets = 500
Meta data structures
Judging by what I find in customer configurations the variable innodb_additional_mem_pool_size is the one most often set to strange values. What that value does is to control the size of a buffer for meta data structures, a cache for the Innodb internal data dictionary. The default value is 1M and normal system never will need more than 8M. I know of one customer with 40.000 InnoDB tables who required 20M for this.
Relevant entries in my.cnf: