The move away from tape backups toward disk-based backups has been going on for a while now. Storing backups on disk generally means faster backups and faster restores. However, disk isn't as cheap as tape, and storing many terabytes or even petabytes of data on disk can lead to sprawling storage systems. To reduce the cost and physical footprint of backup disk, backup storage is commonly deduplicated. This can have a dramatic impact on the size of backed-up data, especially if your retention period includes multiple full backups. When deduplicated, all those copies of the same files get reduced to one copy, so the resulting storage ends up containing one full, plus the files or blocks that have changed during the retention period.
Deduplication is achieved using a system of pointers and a cryptographic technique called hashing. When data is written to the backup storage, each file or block is hashed, that is, it is run through a one-way hashing algorithm such as MD5 or SHA-1. The resulting hash string is a (theoretically) unique value that represents the data that was hashed. The backup storage system maintains a table of hash values for the data that has been previously written. If the new data's hash value matches a value already in the table, the new data is discarded as duplicate, and a pointer is written, pointing to the existing data. If there is no match, the new data is written and the new hash value added to the table.
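The hash-table-and-pointer mechanism can be sketched in a few lines of Python. This is purely illustrative, not any vendor's implementation; the fixed 4 KiB chunk size and the `store`/`pointers` structures are assumptions for the example.

```python
import hashlib

CHUNK_SIZE = 4096  # hypothetical fixed chunk size

store = {}     # hash -> unique chunk data (what actually occupies disk)
pointers = []  # ordered list of hashes, enough to reconstruct the stream

def write(data: bytes) -> None:
    """Chunk the data, hash each chunk, and store only unseen chunks."""
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha1(chunk).hexdigest()
        if digest not in store:   # no match in the table: write the data
            store[digest] = chunk
        pointers.append(digest)   # duplicate or not, record a pointer

def read() -> bytes:
    """Rebuild the original stream by following the pointers."""
    return b"".join(store[d] for d in pointers)

# Writing the same block twice stores it only once.
block = b"x" * CHUNK_SIZE
write(block)
write(block)
assert len(store) == 1 and len(pointers) == 2
assert read() == block + block
```

Real systems differ in chunking strategy (fixed vs. variable block) and in how the hash table is indexed, but the discard-and-point logic is the same.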
Hashing data is a time-consuming, processor-intensive process, which can reduce backup performance and drive up the cost of the backup system. To deal with this problem, deduplicated backup systems have been designed with three basic modes of operation: pre-process, inline, and post-process deduplication. Each has its pros and cons, and the operating mode is an important consideration when designing the backup system.
Inline Deduplication

In this mode of operation, data is hashed while it's being received by the backup system. The backup system's processors work hard to keep up with the flow of data entering the system, and when the system is connected via 10 Gb Ethernet or 8 Gb Fibre Channel, the processors can often become the bottleneck, limiting the overall throughput of the system. Performance can be improved through increased processing power and well-tuned hashing algorithms, but this mode of operation typically results in the slowest performance among the three modes. An example of a deduplicated backup system that uses inline dedupe is EMC Data Domain.
There are benefits of inline dedupe which will become clear as we discuss the other modes.
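You can get a feel for why the CPU becomes the bottleneck by measuring raw hash throughput on your own hardware. This quick sketch (sample size and numbers are arbitrary) shows the ceiling that inline hashing puts on ingest rate: if the network can deliver data faster than the CPU can hash it, throughput is capped by the hash.

```python
import hashlib
import time

# Measure how fast this machine can SHA-1 a buffer of data.
data = b"\x00" * (16 * 1024 * 1024)  # 16 MiB of sample data

start = time.perf_counter()
hashlib.sha1(data).hexdigest()
elapsed = time.perf_counter() - start

mb_per_s = 16 / elapsed
print(f"Single-core SHA-1 throughput: ~{mb_per_s:.0f} MB/s")
# If the inbound link delivers data faster than this (a 10 Gb link is
# roughly 1,200 MB/s), inline hashing becomes the bottleneck.
```

Production appliances parallelize hashing across many cores, but the principle holds: in inline mode, the hash computation sits directly on the ingest path.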
Post-Process Deduplication

In this mode of operation, data is written directly to the backup system's disk, undeduplicated, and the data is deduplicated later. The downside is that more disk space is required to receive the undeduped data, but the upside is that the performance bottleneck of inline dedupe is eliminated. Another downside is that during post-process deduplication, the backup system is typically put in a busy state, and so is not available for continued backup and restore operations. This may or may not be a problem, so long as there's enough time in the day to perform backups, run the dedupe operation, and be available for restores. An example of a backup system that uses post-process dedupe is HP StoreOnce.
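The two-phase nature of post-process dedupe can be sketched as follows. The `landing` area standing in for the extra raw-disk capacity, and the phase names, are assumptions for illustration only.

```python
import hashlib

landing = []   # raw, undeduplicated chunks (this is the extra disk required)
store = {}     # hash -> unique chunk
pointers = []

def ingest(chunk: bytes) -> None:
    # Phase 1: just land the data; no CPU-heavy hashing on the ingest path.
    landing.append(chunk)

def post_process() -> None:
    # Phase 2 (system "busy"): walk the landing area and deduplicate it.
    while landing:
        chunk = landing.pop(0)
        digest = hashlib.sha1(chunk).hexdigest()
        if digest not in store:
            store[digest] = chunk
        pointers.append(digest)

ingest(b"a" * 4096)
ingest(b"a" * 4096)   # a duplicate, but both copies occupy the landing area
post_process()
assert not landing and len(store) == 1 and len(pointers) == 2
```

The trade is visible in the code: ingest is as fast as an append, but until `post_process()` runs, duplicates consume disk, and while it runs, the system is occupied.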
Pre-Process Deduplication

In this mode of operation, data is deduplicated at the source, either by the backup agent running on the client system being backed up, or on an intermediate backup server. This moves the processing load away from the backup system, potentially distributing it across multiple systems. It also means that data is deduplicated before it even crosses the network. The downside is that this mode has a higher processor impact at the source. An example of a backup system that uses pre-process dedupe is EMC Avamar.
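One common way source-side dedupe saves network bandwidth is for the client to send hashes first and upload only the chunks the server doesn't already have. The protocol below is hypothetical (the `Server` class and method names are invented for the sketch), but it illustrates the idea that duplicates never cross the wire.

```python
import hashlib

class Server:
    """Stand-in for the backup target; stores only unique chunks."""
    def __init__(self):
        self.store = {}

    def unknown(self, digests):
        # Tell the client which hashes the server has never seen.
        return [d for d in digests if d not in self.store]

    def upload(self, digest, chunk):
        self.store[digest] = chunk

def backup(client_chunks, server):
    """Hash locally, then send only chunks the server lacks. Returns chunks sent."""
    digests = [hashlib.sha1(c).hexdigest() for c in client_chunks]  # CPU spent at the source
    needed = set(server.unknown(digests))
    sent = 0
    for digest, chunk in zip(digests, client_chunks):
        if digest in needed:
            server.upload(digest, chunk)
            needed.discard(digest)  # don't send the same chunk twice in one batch
            sent += 1
    return sent

srv = Server()
assert backup([b"a" * 4096, b"b" * 4096, b"a" * 4096], srv) == 2  # duplicate never sent
assert backup([b"a" * 4096], srv) == 0  # next backup: nothing new crosses the network
```

The second `backup()` call shows why this mode shines for repeat backups of mostly unchanged clients: after the first full, only new data moves.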
Making the Choice
Pre-process deduplication can lead to the fastest backups, especially if the backup client is relatively small. The backup client can track block-level changes between backups, hash and dedupe the changed blocks, and write the changes to the backup system, completing the process in minutes or even seconds, with minimal network bandwidth consumption. For larger workloads such as big databases, however, this process can become very processor-intensive and time-consuming. There's also typically a higher cost for pre-process products due to additional software complexity and testing requirements.
As stated earlier, post-process products require more disk space to accept the undeduplicated backup data, but as the cost of disk continues to decrease, this becomes less of a problem. Inline dedupe has limited performance, and may require more processing power (and therefore more cost) to maintain adequate performance. The overall performance differences of post-process and inline modes may be a wash if you include the post-process dedupe busy time in your backup window (that is, if you need your backup system available for restores during the business day).
To make the choice, you’ll need to understand the size of your backups, the ingest rate of the backup system, the length of your backup window, and the cost per GB of the backup system. It may be that you’ll choose to deploy more than one solution to fit your various workloads. I have two in my data center today, though I’m working to narrow it down to one.
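Those sizing inputs reduce to simple arithmetic: estimated backup time is data size divided by ingest rate, compared against the window. The numbers below are made-up examples, not vendor specifications.

```python
# Back-of-the-envelope check: does a full backup fit in the window?
backup_size_tb = 20        # assumed data to protect per night
ingest_rate_mb_s = 800     # assumed sustained ingest rate of the appliance
window_hours = 8           # assumed backup window

hours_needed = backup_size_tb * 1024 * 1024 / ingest_rate_mb_s / 3600
print(f"Estimated backup time: {hours_needed:.1f} h (window: {window_hours} h)")
```

For a post-process system, remember to add the dedupe busy time to `hours_needed` before comparing it to the window.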