To serve its intended purposes, a fingerprinting algorithm must be able to capture the identity of a file with virtual certainty.
For instance, in a typical business network, one usually finds many pairs or clusters of documents that differ only by minor edits or other slight modifications.
Rabin's fingerprinting algorithm [1] is fast and easy to implement, allows compounding, and comes with a mathematically precise analysis of the probability of collision.
Namely, the probability of two strings r and s yielding the same w-bit fingerprint does not exceed max(|r|,|s|)/2^(w-1), where |r| denotes the length of r in bits.
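As a rough illustration of this kind of fingerprint and of the collision bound, the following Python sketch reads a byte string as a polynomial over GF(2) and reduces it modulo a fixed polynomial. The function names `rabin_fingerprint` and `collision_bound` are hypothetical, and the modulus 0x11B (x^8 + x^4 + x^3 + x + 1, a well-known irreducible polynomial of degree 8) is a toy choice made only to keep the example small. The bound holds only when the irreducible polynomial is drawn at random, and practical uses would take a much larger w.

```python
def rabin_fingerprint(data: bytes, poly: int = 0x11B) -> int:
    """Remainder of `data`, read MSB-first as a GF(2) polynomial, modulo `poly`.

    `poly` is assumed to be irreducible; its degree w is the fingerprint width.
    """
    degree = poly.bit_length() - 1
    fp = 0
    for byte in data:
        for shift in range(7, -1, -1):
            # Multiply the running remainder by x and add the next message bit.
            fp = (fp << 1) | ((byte >> shift) & 1)
            # If the degree-w term appeared, subtract (XOR) the modulus.
            if fp >> degree:
                fp ^= poly
    return fp


def collision_bound(len_r_bits: int, len_s_bits: int, w: int) -> float:
    """Upper bound max(|r|, |s|) / 2^(w-1) on the collision probability."""
    return max(len_r_bits, len_s_bits) / 2 ** (w - 1)


if __name__ == "__main__":
    print(hex(rabin_fingerprint(b"hello world")))
    # Two 1 MB files (8,000,000 bits each) with a 64-bit fingerprint:
    print(collision_bound(8_000_000, 8_000_000, 64))
```

Because reduction modulo a polynomial is linear over GF(2), the fingerprint of the XOR of two equal-length strings equals the XOR of their fingerprints. The same linearity is what makes the scheme easy to manipulate once the polynomial is known.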
An adversarial agent can easily discover the key and use it to modify files without changing their fingerprint.
The HashKeeper database, maintained by the National Drug Intelligence Center, is a repository of fingerprints of "known to be good" and "known to be bad" computer files, for use in law enforcement applications (e.g. analyzing the contents of seized disk drives).
This method forms representative digests of documents by selecting a set of multiple substrings (n-grams) from them.
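A minimal sketch of the n-gram selection idea follows, under stated assumptions: the selection rule (keep an n-gram when its hash is 0 modulo a small number), the use of SHA-1 as the hash, and the names `ngram_digest` and `resemblance` are all illustrative choices, not details taken from the method described above.

```python
import hashlib


def ngram_digest(text: str, n: int = 5, modulus: int = 4) -> set:
    """Hash every character n-gram and keep those whose hash is 0 mod `modulus`.

    The selection rule is an assumption for illustration; it keeps roughly
    1/modulus of the n-grams, chosen by content rather than by position.
    """
    selected = set()
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        h = int.from_bytes(hashlib.sha1(gram.encode("utf-8")).digest()[:8], "big")
        if h % modulus == 0:
            selected.add(h)
    return selected


def resemblance(a: set, b: set) -> float:
    """Jaccard similarity of two digests (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    d1 = ngram_digest("the quick brown fox jumps over the lazy dog")
    d2 = ngram_digest("the quick brown fox jumped over the lazy dog")
    print(len(d1), len(d2), resemblance(d1, d2))
```

Since selection depends only on each n-gram's content, a small edit changes only the n-grams that overlap it, so two near-identical documents tend to share most of their selected hashes.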