Informix

  • 1.  Checkpoints in the days of SSD drives

    Posted Fri November 07, 2025 04:53 PM

    I'm wondering about recommendations for checkpoints in today's era of SSD drives.

    Online (as opposed to sqlturbo) dates back to, what was it, '93? '94? And I'm assuming that's when the whole buffer cache and checkpoint scheme started.

    Disk I/O with a traditional HDD was very costly: you had to engage the drive, get it spinning at speed, and physically move the read/write head to the right point to read or write data. With all of those mechanical operations, I/O was going to take a while. And Informix had 5-minute checkpoints.

    I just installed 14.10.12.5 and it still sets the same default of 5-minute checkpoints.
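
    For reference, here are the two ONCONFIG parameters involved, with what I understand to be the classic defaults (so treat the values as illustrative, not necessarily what your install wrote):

        # Checkpoint interval in seconds (the classic 5-minute default)
        CKPTINTVL 300

        # Target recovery time in seconds; when non-zero, the server
        # schedules checkpoints to meet it and CKPTINTVL is ignored
        RTO_SERVER_RESTART 60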

    I just wrote about a client that wasn't doing regular checkpoints (RTO_SERVER_RESTART was set to 60). My boss/owner was shocked that data wasn't being written to disk immediately. Even 5 minutes of potential loss was shocking. Per his request (and before I figured out the cause), I created a job to do a manual checkpoint (onmode -c) every 60 seconds. Back in the day, you might say that's too often; it could still be doing one checkpoint when the next one starts.
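
    The job itself is nothing fancy; conceptually it's just the loop below (a sketch - ours actually runs as a Windows scheduled task rather than a shell loop, and the 60-second sleep is the part you'd tune):

        # Force a full checkpoint every 60 seconds.
        # onmode -c blocks until the checkpoint completes.
        while true; do
            onmode -c
            sleep 60
        done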

    But what about today with solid state drives?

    Back in the day, I was on a Unix system where you had raw and cooked spaces, and we used raw. So you had a contiguous section of disk and Informix took control of it. Sun Solaris didn't know anything about that area of the disk and didn't control it; Informix had full control.

    With the current system, it's Windows. So Informix goes through Windows for disk access.

    And it's a Windows file, not raw space carved out on the disk, so I'm not guaranteed my dbspace files are contiguous. Plus, these are operating system files, so to find where a dbspace file starts on disk (and all of its segments) you have to start at the root directory, navigate through its entries, find the subdirectory, navigate into it, and then look through it as well. So it's SSD, but with all of these complications. And if I recall, to get to the 75% point of a file that's not contiguous, you have to walk serially through each piece of the file to reach the specific piece you need. So there's all of that.

    AND, on top of that, we're on virtual servers. So the files that are the actual backing storage for the virtual drives could be scattered across the physical drive too. Even if it looks contiguous in Windows, it could be spread all over the disk of the machine hosting the VM.

    So, even though they're SSD drives, there could be several slowdowns.

    I'd also like to state that our database has at most about 3 GB of data. And I don't think any of our clients have more than 1 GB of memory allocated to Informix. Most probably have the install default, which might be 128 MB.

    If my boss wants to stick with the 1-minute checkpoint interval, is this something we should be able to do with SSD drives? If not, what is safe? Or what do I have to look at? I assume memory for cached pages (the more there is, the longer a checkpoint can take), probably the speed of the drives, and CPU on the VM.



    ------------------------------
    Joel Schumacher
    ------------------------------


  • 2.  RE: Checkpoints in the days of SSD drives

    Posted Sat November 08, 2025 04:40 PM
    Hi,

    It seems you only have databases without logging, and you rely solely on checkpoints to prevent data loss. As I described yesterday, databases are normally set up with logging so that not a single transaction is lost. Your setup is quite insecure against data loss; this does not depend on which type of disks you have, but on how the database is set up.

    We have databases of several hundred gigabytes running, of course with unbuffered logging, replication to a second (or third) instance in a different location, and applications that take care of transaction handling. It seems that the installation in your environment was done without respecting any recommended procedures.

    If your goal is to not lose any data due to an unclean shutdown, change the databases to at least buffered (better, unbuffered) logging. That can be done with ontape (look up the -B and -U options), even with a fake backup.
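
    A minimal sketch of that change (mydb is a placeholder name; with LTAPEDEV pointing at NUL or /dev/null, the level-0 archive this takes is the "fake backup" mentioned above):

        # Take a level-0 archive and switch 'mydb' to unbuffered logging
        ontape -s -L 0 -U mydb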
    In case the application has no notion of transactions, each single statement will be handled as a separate transaction by the server.
    I would not expect the database to run significantly slower because of the logging, since your data is not that huge. To do this, allocate some logical logs; check the number of logs you have and see how they fill between regular checkpoints (onstat -l is your friend). The logs are the storage where transactional data is kept, and they are what recovery uses: the server starts from the latest checkpoint and applies the logs still on disk in order to get to the latest healthy (consistent) state. You might want to back the logs up to a separate machine to allow a recovery in case of fatally losing the machine (mostly because of human error; Murphy is everywhere).
    If your primary concern is not the ability to restore up to the latest transaction after a machine loss, but simply that the (virtualized) machines should not lose data in an unclean shutdown, it might be sufficient to set LTAPEDEV to NUL (on Windows; /dev/null on Linux, which is our setup for test systems).
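
    In the ONCONFIG file that looks like this (Windows form shown; check the docs for whether your version lets you change it online):

        # Discard logical log backups - test setups only!
        LTAPEDEV NUL
        # on Linux: LTAPEDEV /dev/null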

    Then you do not need to bother with logical log backup and only rely on a full backup.
    Note that this is not a recommended setup, but better than the actual one, if I understand correctly what your current situation is.
    By the way, the RTO_SERVER_RESTART would work in this setup and generate checkpoints which are happening depending on the load.
    But anyway, in case of an unclean shutdown, the database would recover the transactions since the last checkpoint,
    which should give yourself (and your boss...) a better sleep.

    This should work with any server version (I understand you have an older 11.70, but the same applies to newer versions). 14.10 would be the recommended version for now, unless you are encountering issues with your current applications, which seem to be running with an old Client SDK (however, 14.10 should be able to communicate with an old client without real issues).

    Best,


    MARCUS HAARMANN






  • 3.  RE: Checkpoints in the days of SSD drives

    Posted Mon November 10, 2025 07:15 AM

    Joel:

    Not to beat a dead horse, as I know that others have made the same recommendation, but if you are losing data it is more because your databases are non-logged than because your checkpoints are too infrequent. I'll make some points:

    • If your databases are all logged, whether BUFFERED, UNBUFFERED, or ANSI MODE, then those logical log records (which other RDBMSs call redo or transaction log records) are always written out to disk and only live in buffers in memory for fractions of a second. That is especially true for UNBUFFERED and ANSI mode databases, where the logical log buffer is flushed as soon as any COMMIT or ROLLBACK record is written to it, guaranteeing that any committed transaction can be recovered during Fast Recovery after a crash or abnormal shutdown by rolling the logical log records forward from the last completed checkpoint. It does not matter whether the actual modified data pages were flushed before the crash; your transactions are guaranteed.
    • Caveat: Under BUFFERED LOG databases, the logical log buffer is not flushed until it has filled (or a checkpoint happens before that). That can leave a small number of committed transactions at risk of being rolled back during fast recovery. If your transactions are all large and your logical log buffers are not unreasonably large, that risk is tiny, but for most installations I strongly recommend using UNBUFFERED LOG databases exclusively.
    • You can easily change the logging status/mode of your databases using ontape (take an archive that includes the -B, -A, or -U option) or onbar (take an archive after running ondblog with the buf, unbuf, or ansi argument).
    • If you change your logging mode you will likely have to make some application changes, so I know this is non-trivial. There are consultants out there, myself included, who have experience doing this and can help. Note that it is possible to make these changes in ways that detect the logging mode of the database, so they can be implemented and rolled out before you change the logging mode, making the change relatively seamless and very low risk (obviously you will have to do lots of testing before implementation).
    • Some things to consider in your code once the databases are logged:
      • For multi-step operations, you will need to frame these between BEGIN WORK and COMMIT WORK (or ROLLBACK WORK) statements to manage the transaction (see the sketch after this list).
      • Like your current application, each standalone statement will still be its own transaction. However, you may recognize multi-step operations that you are currently handling in code that could be performed within a single transaction, eliminating the need to manually undo things when part of a multi-step operation fails.
      • If your code is trapping specific error codes, some of those errors will generate a different code when you are in a transaction than they do when you are not in a transaction.
      • You cannot issue a LOCK TABLE statement unless you first begin a transaction when connected to a logged database.
    • A one-minute (60-second) checkpoint interval is too short and will cause your applications to slow down, as the server will spend too much time in checkpoints. I do understand your data-safety concern; however, a shorter checkpoint interval is not the solution. Logging is.
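
    A minimal sketch of the transaction framing from the first sub-point above (the table and amounts are invented purely for illustration):

        -- Hypothetical two-step operation framed as one transaction:
        -- both updates commit together or both roll back.
        BEGIN WORK;
        UPDATE accounts SET balance = balance - 100 WHERE id = 1;
        UPDATE accounts SET balance = balance + 100 WHERE id = 2;
        COMMIT WORK;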

    Hope this is helpful.

    Art



    ------------------------------
    Art S. Kagel, President and Principal Consultant
    ASK Database Management Corp.
    www.askdbmgt.com
    ------------------------------



  • 4.  RE: Checkpoints in the days of SSD drives

    Posted Mon November 10, 2025 10:00 AM

    Great breakdown, Joel - this is one of those legacy behaviors that still reflects its HDD-era roots.

    Even with SSDs, checkpoint tuning still comes down to balancing write latency vs. recovery tolerance. A 1-minute checkpoint on a 3 GB database is usually fine, provided that:

    • The SSD latency remains consistent (no shared noisy neighbors at the VM host layer).
    • The buffers being flushed per checkpoint are small enough that writes complete well before the next interval starts.

    What's worth checking isn't the nominal checkpoint frequency, but the actual duration and I/O rate during checkpoints (onstat -g ckp and onstat -g ioq). If you see overlapping checkpoints or I/O queue spikes, you're too aggressive for the disk or VM you're on.
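
    For example (a sketch; run these in the instance's environment, and expect the exact column layout to vary by version):

        # Checkpoint history: check the duration and trigger columns,
        # and whether one checkpoint ever runs into the next.
        onstat -g ckp

        # Per-queue I/O backlog: watch the len/maxlen columns on the
        # chunk-write queues while a checkpoint is running.
        onstat -g ioq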

    In practice on SSDs, most modern Informix shops sit between 1 and 5 minutes depending on cache size and write activity. The bottleneck isn't so much the flash media now - it's VM-layer abstraction and non-contiguous file I/O through the Windows filesystem.

    So yes, you can safely run 1-minute checkpoints if they complete cleanly and your logs back up that performance. But if you ever see checkpoints exceeding ~60 seconds or overlapping, back off to 2-3 minutes.

    TL;DR → Measure your actual flush performance; don't rely solely on legacy defaults.



    ------------------------------
    Milad Rasouli
    ------------------------------