UTF-8

Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes.
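As a minimal Python sketch of this property (the character choices are illustrative only), the encoded length grows with the code point value:

```python
# UTF-8 uses 1, 2, 3, and 4 bytes respectively for these four characters,
# whose code points fall in successively higher ranges.
for ch in ["A",      # U+0041, ASCII
           "é",      # U+00E9, Latin-1 Supplement
           "€",      # U+20AC, elsewhere in the Basic Multilingual Plane
           "𝄞"]:     # U+1D11E, a supplementary-plane character
    print(f"U+{ord(ch):04X} -> {len(ch.encode('utf-8'))} byte(s)")
```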

Most software designed for any extended ASCII can read and write UTF-8 (including on Microsoft Windows), which results in fewer internationalization issues than any alternative text encoding.[3][4]

UTF-8 is the dominant encoding for all countries and languages on the internet, with over 99% use on average globally; it is used in most standards, often as the only allowed encoding, and is supported by all modern operating systems and programming languages.

The draft ISO 10646 standard contained a non-required annex called UTF-1 that provided a byte stream encoding of its 32-bit code points.

Dave Prosser of Unix System Laboratories submitted a proposal for one that had faster implementation characteristics and introduced the improvement that 7-bit ASCII characters would only represent themselves; multi-byte sequences would only include bytes with the high bit set.

The name File System Safe UCS Transformation Format (FSS-UTF)[5] and most of the text of this proposal were later preserved in the final specification.

A modification by Ken Thompson of the Plan 9 operating system group at Bell Labs made it self-synchronizing, letting a reader start anywhere and immediately detect character boundaries, at the cost of being somewhat less bit-efficient than the previous proposal.

In the following days, Rob Pike and Thompson implemented it and updated Plan 9 to use it throughout,[10] and then communicated their success back to X/Open, which accepted it as the specification for FSS-UTF.

Three bytes are needed for the remaining 61,440 code points of the Basic Multilingual Plane (BMP), including most Chinese, Japanese, and Korean characters.
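The byte counts follow directly from the encoding's bit layout. The following is a simplified Python sketch of that layout for a single code point; it deliberately omits the validity checks (surrogates, the U+10FFFF ceiling) that a real encoder performs:

```python
def encode_utf8(cp: int) -> bytes:
    """Pack one code point into UTF-8 bytes (no validity checks)."""
    if cp < 0x80:                    # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                   # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    if cp < 0x10000:                 # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | cp >> 12, 0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | cp >> 18, 0x80 | cp >> 12 & 0x3F,
                  0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])

assert encode_utf8(ord("中")) == "中".encode("utf-8")   # a 3-byte BMP character
```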

Unlike many earlier multi-byte text encodings such as Shift-JIS, it is self-synchronizing: searches for short strings or characters are possible, and the start of a code point can be found from a random position by backing up at most 3 bytes.
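Self-synchronization works because lead bytes and continuation bytes occupy disjoint ranges: every continuation byte matches the bit pattern 10xxxxxx. A short Python sketch (the function name is illustrative):

```python
def codepoint_start(buf: bytes, i: int) -> int:
    """Back up from an arbitrary byte offset to the start of the
    enclosing code point; at most 3 steps are ever needed."""
    while buf[i] & 0xC0 == 0x80:     # 0b10xxxxxx marks a continuation byte
        i -= 1
    return i

buf = "naïve".encode("utf-8")        # b'na\xc3\xafve'
print(codepoint_start(buf, 3))       # offset 3 is inside 'ï'; prints 2
```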

Overlong encodings (of ../ for example) have been used to bypass security validations in high-profile products including Microsoft's IIS web server[15] and Apache's Tomcat servlet container.

Carefully crafted invalid UTF-8 could make a decoder either skip or create ASCII characters such as NUL, slash, or quotes, leading to security vulnerabilities.
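For example, the byte pair C0 AF is an overlong two-byte encoding of the slash U+002F: a validator that rejects the literal string "../" but decodes overlong forms after the check can be bypassed. A conforming strict decoder, such as Python's, refuses the sequence:

```python
overlong_slash = b"\xc0\xaf"         # overlong encoding of '/' (U+002F)
try:
    overlong_slash.decode("utf-8")
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)   # a strict decoder treats it as an error
```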

It is also common to throw an exception or truncate the string at an error,[17] but this turns what would otherwise be harmless errors (e.g. "file not found") into a denial of service; for instance, early versions of Python 3.0 would exit immediately if the command line or environment variables contained invalid UTF-8.

"[19] The Unicode Standard requires decoders to: "... treat any ill-formed code unit sequence as an error condition.

Since RFC 3629 (November 2003), the high and low surrogates used by UTF-16 (U+D800 through U+DFFF) are not legal Unicode values, and their UTF-8 encodings must be treated as invalid byte sequences.
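For instance, the three bytes ED A0 80 would encode the high surrogate U+D800; a conforming decoder rejects them, and a strict encoder likewise refuses to produce them (a Python sketch; surrogatepass is an explicit opt-out):

```python
try:
    b"\xed\xa0\x80".decode("utf-8")      # would decode to lone surrogate U+D800
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)

# Producing the sequence requires explicitly opting out of strict checking:
print("\ud800".encode("utf-8", errors="surrogatepass"))   # b'\xed\xa0\x80'
```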

The Unicode Standard neither requires nor recommends the use of the BOM for UTF-8, but warns that it may be encountered at the start of a file trans-coded from another encoding.

A BOM can confuse software that isn't prepared for it but can otherwise accept UTF-8, e.g. programming languages that permit non-ASCII bytes in string literals but not at the start of the file.
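A minimal Python illustration: the plain utf-8 codec keeps a leading BOM as the character U+FEFF, while the utf-8-sig codec strips it:

```python
data = b"\xef\xbb\xbfhello"            # UTF-8 BOM followed by ASCII text
print(repr(data.decode("utf-8")))      # '\ufeffhello': BOM kept as U+FEFF
print(repr(data.decode("utf-8-sig")))  # 'hello': BOM stripped
```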

The primary advantage of UTF-16 is that the Windows API required it in order to access all Unicode characters (only recently has this been fixed).

UTF-8 has the advantages of being trivial to retrofit to any system that can handle an extended ASCII, of having no byte-order problems, and of taking about half the space for any language using mostly Latin letters.
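The space claim is easy to check: for mostly-Latin text, UTF-8 needs one byte per character where UTF-16 needs two (the sample string below is illustrative):

```python
text = "Mostly Latin text: résumé, naïve, café."
u8, u16 = text.encode("utf-8"), text.encode("utf-16-le")
print(len(u8), len(u16))   # 43 vs 78: UTF-8 takes roughly half the space
```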

The World Wide Web Consortium recommends UTF-8 as the default encoding in XML and HTML (not just using UTF-8, but also declaring it in metadata), "even when all characters are in the ASCII range ...".

Microsoft's SQL Server 2019 added support for UTF-8, and using it results in a 35% speed increase and a "nearly 50% reduction in storage requirements".

The UTF-8 Clean-8 variant implemented by Raku is an encoder/decoder that preserves bytes as-is (even illegal UTF-8 sequences) and allows for Normal Form Grapheme synthetics.
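Raku's variant is not reproduced here, but Python offers an analogous byte-preserving mechanism, the surrogateescape error handler of PEP 383, which smuggles undecodable bytes through as lone surrogates so they survive a decode/encode round trip:

```python
raw = b"legal text \xff illegal byte"    # 0xFF can never appear in valid UTF-8
s = raw.decode("utf-8", errors="surrogateescape")           # 0xFF becomes U+DCFF
assert s.encode("utf-8", errors="surrogateescape") == raw   # bytes preserved as-is
```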

Figure: Declared character set for the 10 million most popular websites since 2010.

Figure: Use of the main encodings on the web from 2001 to 2012 as recorded by Google,[26] with UTF-8 overtaking all others in 2008 and exceeding 60% of the web in 2012 (since then approaching 100%). UTF-8 is the only encoding of Unicode explicitly listed there; the rest provide only subsets of Unicode. The ASCII-only figure includes all web pages that contain only ASCII characters, regardless of the declared header.