Content-addressable storage

It has been used for high-speed storage and retrieval of fixed content, such as documents stored for compliance with government regulations.

CAS became a significant market during the 2000s, especially after the introduction of the 2002 Sarbanes–Oxley Act in the United States, which required enormous numbers of documents to be stored for long periods even though they would be retrieved only rarely.

On random-access media like a floppy disk, this is accomplished using a directory that consists of some sort of list of filenames and pointers to the data.

On more modern systems and larger formats like hard drives, the directory is itself split into many subdirectories, each tracking a subset of the overall collection of files.

[1] In the context of CAS, these traditional approaches are referred to as "location-addressed", as each file is represented by a list of one or more locations, the path and filename, on the physical storage.

The downside to this approach is that any change to the document produces a different key, which makes CAS systems unsuitable for files that are edited often.
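
The effect of an edit on the key can be illustrated with a minimal sketch. It assumes SHA-256 as the hash function, which is an illustrative choice rather than something any particular CAS product is stated to use, and the function name content_key is hypothetical.

```python
import hashlib

def content_key(data: bytes) -> str:
    """Derive a storage key from the content itself (assumed: SHA-256 hex digest)."""
    return hashlib.sha256(data).hexdigest()

original = b"Quarterly report, final version."
edited = b"Quarterly report, final version!"  # a one-character edit

key_original = content_key(original)
key_edited = content_key(edited)

# Hashing the same bytes again yields the same key, so identical
# documents are recognised as the same stored object.
key_again = content_key(b"Quarterly report, final version.")

print(key_original == key_again)   # same content, same key
print(key_original == key_edited)  # edited content, different key
```

Because the edited document receives an unrelated key, the store treats it as an entirely new object rather than as a revision of the old one.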

[3] Because the keys are not human-readable, CAS systems implement a second type of directory that stores metadata that will help users find a document.

But the directory will also include fields for common identification systems like ISBN or ISSN codes, user-provided keywords, time and date stamps, and full-text search indexes.
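
A metadata directory of this kind might look roughly like the following sketch. The record layout, the field names, and the find_keys helper are all hypothetical and serve only to show how human-friendly fields can lead back to an opaque content key.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

@dataclass
class CatalogEntry:
    key: str                      # opaque content key (not human-readable)
    title: str
    isbn: Optional[str] = None    # common identification code, if any
    keywords: Set[str] = field(default_factory=set)

def find_keys(catalog: List[CatalogEntry],
              isbn: Optional[str] = None,
              keyword: Optional[str] = None) -> List[str]:
    """Return the content keys of entries matching every supplied criterion."""
    matches = []
    for entry in catalog:
        if isbn is not None and entry.isbn != isbn:
            continue
        if keyword is not None and keyword not in entry.keywords:
            continue
        matches.append(entry.key)
    return matches

# Placeholder keys and identifiers, invented for the example.
catalog = [
    CatalogEntry(key="9f2c-example", title="Annual audit",
                 keywords={"audit", "finance"}),
    CatalogEntry(key="41ab-example", title="Reference manual",
                 isbn="000-0-00-000000-0", keywords={"manual"}),
]

audit_keys = find_keys(catalog, keyword="audit")
manual_keys = find_keys(catalog, isbn="000-0-00-000000-0")
```

The lookup returns keys rather than locations; retrieving the document itself is a separate step that goes through the content store.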

The primary difference is that a web search is generally performed on a topic basis using an internal algorithm that finds "related" content and then produces a list of locations.

In CAS, only the internal mapping from key to physical location changes, and this exists in only one place and can be designed for efficient updating.
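
The following toy store sketches that idea. The class name TinyCAS, the two simulated media tiers, and the use of SHA-256 are assumptions made for illustration and do not describe any real product's internals.

```python
import hashlib

class TinyCAS:
    """Toy content-addressed store with a single key-to-location index."""

    def __init__(self):
        # Simulated media: tier name -> {slot number -> stored bytes}
        self._media = {"disk": {}, "tape": {}}
        # The one internal mapping: content key -> (tier, slot)
        self._index = {}
        self._next_slot = 0

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        if key not in self._index:          # identical content is stored once
            slot = self._next_slot
            self._next_slot += 1
            self._media["disk"][slot] = data
            self._index[key] = ("disk", slot)
        return key

    def get(self, key: str) -> bytes:
        tier, slot = self._index[key]
        return self._media[tier][slot]

    def location(self, key: str):
        return self._index[key]

    def migrate(self, key: str, new_tier: str) -> None:
        """Move the bytes to another tier; only the index entry changes."""
        old_tier, slot = self._index[key]
        if old_tier == new_tier:
            return
        self._media[new_tier][slot] = self._media[old_tier].pop(slot)
        self._index[key] = (new_tier, slot)

store = TinyCAS()
key = store.put(b"archived contract")
store.migrate(key, "tape")
restored = store.get(key)   # the same key still retrieves the same bytes
```

Callers hold only the key, so moving the data between tiers requires no change on their side.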

In contrast, updating a file in a location-based system is highly optimized: only the internal list of sectors has to be changed, and many years of tuning have been applied to this operation.

In contrast, automatic deletion is a common feature, removing all files older than some legally defined requirement, say ten years.
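
A retention sweep of this kind could be sketched as follows. The function name purge_expired and the catalog layout are hypothetical, and the ten-year period is approximated as a fixed number of days rather than taken from any actual regulation.

```python
from datetime import datetime, timedelta, timezone

def purge_expired(stored_at: dict, now: datetime, retention: timedelta) -> list:
    """Delete entries stored before (now - retention); return the removed keys."""
    cutoff = now - retention
    expired = [key for key, timestamp in stored_at.items() if timestamp < cutoff]
    for key in expired:
        del stored_at[key]
    return expired

now = datetime(2024, 1, 1, tzinfo=timezone.utc)
catalog = {
    "key-a": datetime(2010, 6, 1, tzinfo=timezone.utc),  # older than ten years
    "key-b": datetime(2020, 6, 1, tzinfo=timezone.utc),  # still within retention
}

# Roughly ten years, expressed in days (an approximation for the example).
removed = purge_expired(catalog, now, timedelta(days=3653))
```

A real system would also have to remove the stored bytes and any metadata records; the sketch only drops the catalog entries.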

A hardware device called the Content Addressable File Store (CAFS) was developed by International Computers Limited (ICL) in the late 1960s and put into use by British Telecom in the early 1970s for telephone directory lookups.

The user-accessible search functionality was maintained by the disk controller with a high-level application programming interface (API) so users could send queries into what appeared to be a black box that returned documents.

Paul Carpentier and Jan van Riel coined the term CAS while working at a company called FilePool in the late 1990s.

[7] CAS was not associated with peer-to-peer applications until the 2000s, when rapidly proliferating Internet access in homes and businesses led to a large number of computer users who wanted to swap files, originally doing so on centrally managed services like Napster.

In order to function without a central federating server, these services rely heavily on CAS to enforce the faithful copying and easy querying of unique files.

At the same time, the growth of the open-source software movement in the 2000s led to the rapid proliferation of CAS-based services such as Git, a version control system that uses cryptographic hashing and Merkle tree structures to enforce data integrity between users and to allow for multiple versions of files with minimal disk and network usage.
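
As a concrete illustration, the identifier Git assigns to a file's contents (a "blob") in its default SHA-1 object format is the hash of a short header followed by the raw bytes. The sketch below reproduces that calculation; the function name git_blob_id is chosen here for the example and is not part of Git.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Compute the SHA-1 object ID Git gives a blob with this content."""
    # Header format: the word "blob", a space, the byte length, then a NUL byte.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

empty_id = git_blob_id(b"")
print(empty_id)  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

# Any change to the bytes yields a different object ID.
first_id = git_blob_id(b"version one\n")
second_id = git_blob_id(b"version two\n")
```

Two files with identical contents therefore share one stored object, regardless of their names or paths, which is how unchanged files can be kept across many versions at little extra cost.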

The rise of mobile computing and high capacity mobile broadband networks in the 2010s, coupled with increasing reliance on web applications for everyday computing tasks, strained the existing location-addressed client–server model commonplace among Internet services, leading to an accelerated pace of link rot and an increased reliance on centralized cloud hosting.

Within the Centera system, each content address actually represents a number of distinct data blobs, as well as optional metadata.

This provides for additional flexibility in disaster recovery situations as well as the ability to reduce storage costs by moving data off the disk to tape.