Given that the same byte pattern may occur dozens, hundreds, or even thousands of times (the match frequency is dependent on the chunk size), the amount of data that must be stored or transferred can be greatly reduced.
For example, a typical email system might contain 100 instances of the same 1 MB (megabyte) file attachment.
Each time the email platform is backed up, all 100 instances of the attachment are saved, requiring 100 MB of storage space. With data deduplication, only one instance of the attachment is actually stored; the subsequent instances are referenced back to that saved copy, for a deduplication ratio of roughly 100 to 1.[3]
In computer code, deduplication is done by, for example, storing repeated information in variables so that it does not have to be written out individually and can instead be changed all at once at a single, centrally referenced location.
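As a minimal sketch of this idea (the names and values below are invented for illustration), a value that would otherwise be repeated throughout a program is defined once and referenced wherever it is needed, so that a change in one place propagates everywhere:

```python
# Hypothetical example: an address that appears in several messages is
# defined once and referenced, rather than being repeated literally.
SUPPORT_EMAIL = "support@example.com"

def signature() -> str:
    # Reuses the single definition instead of duplicating the string.
    return f"Questions? Contact {SUPPORT_EMAIL}."

def error_message(code: int) -> str:
    # Changing SUPPORT_EMAIL above updates this message as well.
    return f"Error {code}: please report this to {SUPPORT_EMAIL}."

print(signature())
print(error_message(404))
```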
Storage-based data deduplication reduces the amount of storage needed for a given set of files.
It is most effective in applications where many copies of very similar or even identical data are stored on a single disk.
Conventional techniques such as hard-linking unchanged files or storing differences between file versions do not capture all of this redundancy: hard-linking does not help with large files that have changed only in small ways, such as an email database, and differencing only finds redundancies between adjacent versions of a single file (consider a section that was deleted and later added back in, or a logo image included in many documents).
With post-process deduplication, new data is first written to the storage device and a separate process later analyzes it for duplicates. One potential drawback of this approach is that duplicate data may be unnecessarily stored for a short time, which can be problematic if the system is nearing full capacity.
Alternatively, deduplication hash calculations can be done in-line: synchronized as data enters the target device.
On the negative side, hash calculations may be computationally expensive, thereby reducing the storage throughput.
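As a rough sketch of the in-line approach (fixed 4 KiB chunks and SHA-256 are assumptions, not a reference design), the identifier of each chunk is computed while the data is being received, before anything is written to the target, which is what lets an in-line deduplicating system skip the write when the identifier is already known:

```python
import hashlib
import io

def inline_identifiers(source, chunk_size=4096):
    """Yield (digest, chunk) pairs as data arrives from `source`.

    The digest is available before the chunk is written out, so an
    in-line deduplicating target can decide at ingest time whether
    the chunk needs to be stored at all.
    """
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        yield hashlib.sha256(chunk).hexdigest(), chunk

# Two identical 4 KiB blocks arriving in sequence produce the same identifier.
incoming = io.BytesIO(b"A" * 8192)
digests = [digest for digest, _ in inline_identifiers(incoming)]
print(digests[0] == digests[1])  # True
```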
Backing up a deduplicated file system will often cause duplication to occur, resulting in the backups being bigger than the source data.
In the deduplication process, each chunk of data is assigned an identification, calculated by the software, typically using cryptographic hash functions.[11]
Depending on the implementation, the software either assumes that a chunk whose identification already exists in the deduplication namespace is identical to the stored chunk, or actually verifies that the two blocks of data are identical; in either case, it replaces the duplicate chunk with a link.
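A minimal sketch of this chunk-and-link scheme, assuming fixed-size chunks and SHA-256 identifications (real systems differ in chunking strategy, hash choice, and how links are represented):

```python
import hashlib

def deduplicate(data: bytes, store: dict, chunk_size: int = 4096) -> list:
    """Keep one copy of each unique chunk in `store` (identification -> bytes)
    and return the list of identifications that stands in for `data`."""
    links = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        ident = hashlib.sha256(chunk).hexdigest()  # the chunk's identification
        if ident not in store:
            store[ident] = chunk                   # first occurrence: store the chunk
        links.append(ident)                        # duplicates become links only
    return links

def reconstruct(links: list, store: dict) -> bytes:
    """Follow the links back to the stored chunks."""
    return b"".join(store[ident] for ident in links)

store = {}
original = b"logo" * 1024 * 3                      # three identical 4 KiB chunks
links = deduplicate(original, store)
print(len(store), "unique chunk stored for", len(links), "links")  # 1 ... 3
print(reconstruct(links, store) == original)       # True
```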
Backup applications in particular commonly generate significant amounts of duplicate data over time.
Data deduplication has been deployed successfully with primary storage in some cases where the system design does not require significant overhead or impact performance.
While data deduplication may work at a segment or sub-block level, single-instance storage works at the object level, eliminating redundant copies of objects such as entire files or email messages.[12] Single-instance storage can be used alongside (or layered upon) other data deduplication or data compression methods to improve performance, in exchange for an increase in complexity and (in some cases) a minor increase in storage space requirements.
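For contrast, an object-level (single-instance) store keys on whole objects rather than sub-file chunks; the following sketch uses the same hashing assumption and an invented interface:

```python
import hashlib

class SingleInstanceStore:
    """Keep at most one copy of each distinct object, such as a whole file
    or email message; additional copies are reduced to references."""

    def __init__(self):
        self._objects = {}  # object digest -> object bytes

    def put(self, payload: bytes) -> str:
        digest = hashlib.sha256(payload).hexdigest()
        # Only the first copy is kept; later identical objects reuse it.
        self._objects.setdefault(digest, payload)
        return digest       # the caller keeps this reference, not the payload

    def get(self, digest: str) -> bytes:
        return self._objects[digest]

store = SingleInstanceStore()
attachment = b"example attachment contents " * 1000
references = [store.put(attachment) for _ in range(100)]  # 100 identical copies
print(len(set(references)), "stored object for", len(references), "references")
```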
The system overhead associated with calculating and looking up hash values is primarily a function of the deduplication workflow.[13] Deduplication is implemented in some filesystems, such as ZFS and Write Anywhere File Layout, and in various disk array models.