Btrfs is a computer storage format that combines a file system based on the copy-on-write (COW) principle with a logical volume manager (distinct from Linux's LVM), developed together.
"Scaling is not just about addressing the storage but also means being able to administer and to manage it with a clean interface that lets people see what's being used and makes it more reliable."
[17] The core data structure of Btrfs—the copy-on-write B-tree—was originally proposed by IBM researcher Ohad Rodeh at a USENIX conference in 2007.
[19] In 2008, the principal developer of the ext3 and ext4 file systems, Theodore Ts'o, stated that although ext4 has improved features, it is not a major advance; it uses old technology and is a stop-gap.
[23] Several Linux distributions began offering Btrfs as an experimental choice of root file system during installation.
[24][25][26] In July 2011, Btrfs automatic defragmentation and scrubbing features were merged into version 3.0 of the Linux kernel mainline.
Such cloned files are sometimes referred to as reflinks, in light of the proposed associated Linux kernel system call.
[59][60] Support for this Btrfs feature was added in version 7.5 of the GNU coreutils, via the --reflink option to the cp command.
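Reflink copying is exposed directly through cp; a minimal sketch (the file names are placeholders, and --reflink=auto is used so the command degrades to an ordinary copy on file systems without reflink support):

```shell
# Create a small file, then ask cp for a reflink clone.
# On Btrfs the clone shares extents with the original until either file
# is modified; with --reflink=auto, cp silently falls back to a normal
# copy on file systems that lack reflink support.
echo "hello" > original.txt
cp --reflink=auto original.txt clone.txt
cat clone.txt
```

With --reflink=always instead of auto, cp fails outright rather than falling back when the underlying file system cannot share extents.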
[64] The copy-on-write (CoW) nature of Btrfs means that snapshots are quickly created, while initially consuming very little disk space.
With LVM, a logical volume is a separate block device, while a Btrfs subvolume is not, and cannot be treated or used as one.
[64] Making dd or LVM snapshots of a Btrfs file system leads to data loss if either the original or the copy is mounted while both are on the same computer.
The send/receive feature effectively creates (and applies) a set of data modifications required for converting one subvolume into another.
[50][67] The send/receive feature can be used with regularly scheduled snapshots for implementing a simple form of file system replication, or for the purpose of performing incremental backups.
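The replication scheme described above can be sketched as the following command sequence. The paths and snapshot names are assumptions for illustration, and since the real commands require root privileges and a mounted Btrfs file system, a dry-run helper prints each command instead of executing it:

```shell
# Dry-run helper: print each command instead of executing it, so the
# sequence can be shown without a Btrfs file system present.
run() { echo "+ $*"; }

# 1. Take a read-only snapshot (required for send) of the live subvolume.
run btrfs subvolume snapshot -r /data /data/.snapshots/snap1

# 2. Send the full snapshot stream to the backup file system.
run "btrfs send /data/.snapshots/snap1 | btrfs receive /mnt/backup"

# 3. Later, take another read-only snapshot and send only the
#    difference against the previous one (-p names the parent).
run btrfs subvolume snapshot -r /data /data/.snapshots/snap2
run "btrfs send -p /data/.snapshots/snap1 /data/.snapshots/snap2 | btrfs receive /mnt/backup"
```

Repeating step 3 on a schedule, each time using the previous snapshot as the parent, yields an incremental backup chain in which only changed data crosses the wire.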
Because very little of its metadata is anchored in fixed locations, Btrfs can adapt to unusual spatial layouts of its backend storage devices.
However, as of December 2022, the btrfs documentation suggests that its --repair option be used only on the advice of "a developer or an experienced user".
[73] As of August 2022, the SLE documentation recommends using a Live CD, performing a backup and only using the repair option as a last resort.
[75][76] In normal use, Btrfs is mostly self-healing and can recover from broken root trees at mount time, thanks to periodic flushes of data to permanent storage (every 30 seconds by default).
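The flush interval can also be set explicitly with the commit= mount option; a hedged sketch of an /etc/fstab entry, where the UUID placeholder and mount point are assumptions:

```
# Set the periodic commit interval explicitly to the default 30 seconds
UUID=<fs-uuid>  /  btrfs  defaults,commit=30  0  0
```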
[78][79] Ohad Rodeh's original proposal at USENIX 2007 noted that B+ trees, which are widely used as on-disk data structures for databases, could not efficiently allow copy-on-write-based snapshots because their leaf nodes were linked together: if a leaf was copied on write, its siblings and parents would have to be as well, as would their siblings and parents and so on until the entire tree was copied.
The result would be a data structure suitable for a high-performance object store that could perform copy-on-write snapshots, while maintaining good concurrency.
This allowed all traversal and modifications to be funneled through a single code path, against which features such as copy on write, checksumming and mirroring needed to be implemented only once to benefit the entire file system.
Directory items together can thus act as an index for path-to-inode lookups, but are not used for iteration because they are sorted by their hash, effectively randomly permuting them.
This was a design flaw that limited the number of same-directory hard links to however many could fit in a single tree block.
Applications which made heavy use of multiple same-directory hard links, such as git, GNUS, GMame and BackupPC were observed to fail at this limit.
[83] The limit was eventually removed[84] (and as of October 2012 has been merged[85] pending release in Linux 3.7) by introducing spillover extended reference items to hold hard link filenames which do not otherwise fit.
(Ext3 block groups, however, have fixed locations computed from the size of the file system, whereas those in Btrfs are dynamic and created as needed.)
These bitmaps are persisted to disk (starting in Linux 2.6.37, via the space_cache mount option[86]) as special extents that are exempt from checksumming and copy-on-write.
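A hedged sketch of enabling the cache explicitly via /etc/fstab, where the UUID placeholder and mount point are assumptions (recent kernels enable a free-space cache by default):

```
# Mount with the free-space cache enabled explicitly
UUID=<fs-uuid>  /mnt/data  btrfs  defaults,space_cache  0  0
```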
The scrub job scans the entire file system for integrity, reporting and automatically attempting to repair any bad blocks it finds along the way.
Workloads heavy in fsync calls (such as a database, or a virtual machine whose running OS frequently calls fsync) could generate a great deal of redundant write I/O by forcing the file system to repeatedly copy-on-write and flush frequently modified parts of trees to storage.
To preserve sharing, an update-and-swap algorithm is used, with a special relocation tree serving as scratch space for affected metadata.
[94] Superblock mirrors are kept at fixed locations:[95] 64 KiB into every block device, with additional copies at 64 MiB, 256 GiB and 1 PiB.
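Those offsets are plain powers of two; as a quick arithmetic check (shell arithmetic only, nothing Btrfs-specific):

```shell
# Byte offsets of the Btrfs superblock mirrors quoted above
# (a given copy is only present when the device is large enough
# to contain it).
echo $((64 * 1024))                          # 64 KiB  = 65536
echo $((64 * 1024 * 1024))                   # 64 MiB  = 67108864
echo $((256 * 1024 * 1024 * 1024))           # 256 GiB = 274877906944
echo $((1024 * 1024 * 1024 * 1024 * 1024))   # 1 PiB   = 1125899906842624
```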