The Truth About Deduplication

August 2, 2014

I recently went to a meeting with a potential customer, and we were talking about backup solutions. One of his questions was, “What dedup ratio do you get with your solution?” I gave him the classic, yet entirely accurate, consultant’s response: it depends.

Nowadays, many backup solutions make use of deduplication technology to efficiently store data on disk for long-term retention. Some vendors have gone the extra mile to squeeze a bit more deduplication out of their solution, using advanced techniques such as variable block size. Sure, a ten-year project at MIT may come up with tighter results than a Python program I wrote in my basement, but in the end, what is it going to save you? Maybe one SATA drive in your storage device? If that.

The reality is, the dedup ratio that you achieve will be largely dependent on one thing only: how much duplicate data are you writing? In the end, it will have little to do with the vendor’s claims. Well, maybe that’s the wrong way to say it. In fact, if the vendor claims a very high dedup ratio, that probably means that their solution ingests lots of duplicate data. That’s nothing to be proud of!

If you dig into the type of data that you’re writing to your backup target, assuming that you write one full backup to an empty device, the dedup ratio that you get will depend entirely on the type of data. For example, database or file system data may deduplicate to around 2:1. Virtual machine data, on the other hand, may dedup to around 5:1. This is because there’s a good amount of duplicate data in VMs (each VM is running one of a handful of the same operating systems, so those blocks dedup).
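To make that concrete, here is a minimal sketch of fixed-block deduplication, assuming a 4 KiB block size and SHA-256 fingerprints. It is not any vendor's algorithm, and the helper names (make_block, dedup_ratio) and the synthetic "VM" and "file" data are made up for illustration. The point is only that the same code gives very different ratios depending on how much duplicate data it is fed.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size, chosen for illustration


def make_block(label: str) -> bytes:
    """Build a deterministic 4 KiB block whose content depends on the label."""
    return hashlib.sha256(label.encode()).digest() * (BLOCK_SIZE // 32)


def dedup_ratio(data: bytes, block_size: int = BLOCK_SIZE) -> float:
    """Return logical blocks divided by unique blocks, using fixed-size chunks."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    unique = {hashlib.sha256(block).digest() for block in blocks}
    return len(blocks) / len(unique)


# Five hypothetical VMs: each carries the same 80 "OS" blocks
# plus 20 blocks of its own data.
os_image = b"".join(make_block(f"os-{i}") for i in range(80))
vms = b"".join(
    os_image + b"".join(make_block(f"vm{v}-data-{i}") for i in range(20))
    for v in range(5)
)

# Hypothetical file data with no repeated blocks at all.
files = b"".join(make_block(f"file-{i}") for i in range(100))

print(f"VM data:   {dedup_ratio(vms):.2f}:1")
print(f"File data: {dedup_ratio(files):.2f}:1")
```

The synthetic numbers here are not meant to match the 2:1 and 5:1 figures above. They just show the ratio moving with the input while the algorithm stays the same.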

Now, when you perform a second full backup, assuming only a small amount of change since the last backup, that second backup is going to get deduped almost completely. So if the dedup ratio of the first backup was 2:1, now you’re at nearly 4:1. Do four full backups and you’ll be at nearly 8:1. See? It’s all about writing duplicate data. It has very little to do with the vendor’s algorithm.
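The arithmetic behind those numbers can be sketched in a few lines. This is a simplified model, not a measurement: it assumes every full backup is the same logical size, that a fixed fraction of the data changes between fulls, and that the changed data reduces at the same ratio as the first full. The function name and parameters are made up for the example.

```python
def cumulative_dedup_ratio(full_size_gb: float, first_ratio: float,
                           change_rate: float, num_fulls: int) -> float:
    """Logical data written divided by physical data stored after num_fulls fulls."""
    logical = full_size_gb * num_fulls
    # The first full lands on an empty device and is reduced by first_ratio.
    stored_first = full_size_gb / first_ratio
    # Assumption: each later full stores only its changed data,
    # and that changed data reduces at the same ratio as the first full.
    stored_later = (num_fulls - 1) * full_size_gb * change_rate / first_ratio
    return logical / (stored_first + stored_later)


for fulls in (1, 2, 4):
    ratio = cumulative_dedup_ratio(1000, 2.0, 0.02, fulls)
    print(f"{fulls} full backup(s): {ratio:.2f}:1")
```

With a change rate of zero the ratio is exactly the first ratio multiplied by the number of fulls. Any real change between backups pulls it a little below that, which is where the "nearly" comes from.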

At the end of the day, you want to choose a backup solution that processes and stores as little data as possible. Over the years, vendors have come up with many ways to achieve this, with varying results. For instance, TSM does file-level incremental forever backups. Only files that have changed since the last backup are ingested into the system. Backing up Oracle to Data Domain will do a full backup (or an incremental), and chuck out the duplicate blocks before writing them to disk. Avamar chucks out the duplicate blocks before sending them from the source system to the backup target. Actifio does block-level incremental forever, then dedups on the back-end.

As you can imagine, the amount of duplicate data being ingested into all of these systems is wildly different. However, if you count every backup as a full (and it’s fair to do that because you can create a synthetic full from the latest incrementals and previous full), the ultimate dedup ratio will be roughly the same. Is it calculated the same way for each? Does it really matter? Not really. In the end, the question is, how much storage do I need to protect my environment, how long does it take to back up the data, and how long does it take to recover?
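Here is a rough sketch of that comparison, under simplifying assumptions: a fixed daily change rate, no reduction within a single backup, and perfect detection of unchanged data in both approaches. The two functions are hypothetical models of "repeated fulls into a dedup target" and "incremental forever", not descriptions of how any of the products above work internally.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    ingested_gb: float      # data the backup system has to process
    stored_gb: float        # data that ends up on disk
    effective_ratio: float  # every backup counted as a full, divided by stored


def repeated_fulls(size_gb: float, change_rate: float, backups: int) -> Outcome:
    """Send a full every time and let the target discard the duplicates."""
    ingested = size_gb * backups
    stored = size_gb + (backups - 1) * size_gb * change_rate
    return Outcome(ingested, stored, size_gb * backups / stored)


def incremental_forever(size_gb: float, change_rate: float, backups: int) -> Outcome:
    """Send one full, then only the changed data on each later backup."""
    ingested = size_gb + (backups - 1) * size_gb * change_rate
    stored = ingested
    return Outcome(ingested, stored, size_gb * backups / stored)


for model in (repeated_fulls, incremental_forever):
    result = model(size_gb=1000, change_rate=0.02, backups=30)
    print(f"{model.__name__}: ingested {result.ingested_gb:.0f} GB, "
          f"stored {result.stored_gb:.0f} GB, "
          f"effective ratio {result.effective_ratio:.1f}:1")
```

In this model both approaches store the same amount and report the same effective ratio. What differs is how much data had to be moved and processed to get there.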

That’s what the customer should be asking, and of course my answer would be Actifio. Why? I’ll save that for another post.

