C string handling

[1] The memory occupied by a string is always one more code unit than the length, as space is needed to store the zero terminator.
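The difference between length and storage can be illustrated with a short sketch. The names greeting, greeting_length, greeting_storage and storage_size are hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* An array initialized from a literal gets room for the terminator too. */
static const char greeting[] = "hello";

/* Length in code units, not counting the zero terminator. */
size_t greeting_length(void)
{
    return strlen(greeting);
}

/* Bytes of storage occupied by the array, terminator included. */
size_t greeting_storage(void)
{
    return sizeof greeting;
}

/* Hypothetical helper: bytes a caller must reserve to hold a copy of s. */
size_t storage_size(const char *s)
{
    assert(s != NULL);
    return strlen(s) + 1; /* one extra code unit for the terminator */
}
```

Here greeting_length() yields 5 while greeting_storage() yields 6, and even the empty string needs one byte of storage.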

C90 defines wide strings[1] whose code unit is of type wchar_t; on modern machines wchar_t is 16 or 32 bits wide.

String length and offsets are measured in bytes or wchar_t, not in "characters", which can be confusing to beginning programmers.

Some compilers or editors will require entering all non-ASCII characters as \xNN sequences for each byte of UTF-8, and/or \uNNNN for each word of UTF-16.

In fact, all lengths are defined as being in bytes, which is true in all implementations, and these functions work as well with UTF-8 as with single-byte encodings.
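As a small illustration of byte-based lengths, the following sketch stores a UTF-8 string using \xNN escapes and compares strlen with a count of code points. The function utf8_code_points is a hypothetical helper, not part of the standard library, and it assumes its input is valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* "café" encoded as UTF-8: the final letter is the two bytes 0xC3 0xA9. */
static const char cafe[] = "caf\xC3\xA9";

/* Hypothetical helper: counts code points by skipping UTF-8
   continuation bytes, which all have the bit pattern 10xxxxxx.
   Assumes s is valid UTF-8. */
size_t utf8_code_points(const char *s)
{
    size_t count = 0;
    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

With this data, strlen(cafe) reports 5 bytes, while utf8_code_points(cafe) reports 4 code points.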

Functions for handling memory buffers can process sequences of bytes that include null bytes as part of the data.
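The contrast with the string functions can be sketched as follows. The buffer packet and the two helper functions are hypothetical names used only for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Five bytes of data with a null byte in the middle. */
static const unsigned char packet[] = { 'a', 'b', 0, 'c', 'd' };

/* A string function stops at the first null byte it meets. */
size_t packet_length_as_string(void)
{
    return strlen((const char *)packet);
}

/* memcpy and memcmp take an explicit size, so the embedded
   null byte is treated like any other byte.
   Returns 1 if the full copy matches the original. */
int packet_copy_matches(void)
{
    unsigned char copy[sizeof packet];
    memcpy(copy, packet, sizeof packet);
    return memcmp(copy, packet, sizeof packet) == 0;
}
```

In this sketch strlen sees only the first 2 bytes, whereas memcpy and memcmp operate on all 5.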

Functions declared in string.h are extremely popular since, as a part of the C standard library, they are guaranteed to work on any platform that supports C. However, these functions have security issues, such as potential buffer overflows when they are used carelessly, which lead programmers to prefer safer and possibly less portable variants; some popular ones are listed below.

This was originally intended to track shift states in the multibyte encodings, but modern ones such as UTF-8 do not need this.

[96] Despite the well-established need to replace strcat[22] and strcpy[18] with functions that do not allow buffer overflows, no accepted standard has arisen.

This is partly due to the mistaken belief by many C programmers that strncat and strncpy have the desired behavior; however, neither function was designed for this (they were intended to manipulate null-padded fixed-size string buffers, a data format less commonly used in modern software), and the behavior and arguments are non-intuitive and often written incorrectly even by expert programmers.
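The two behaviors of strncpy that commonly surprise programmers can be demonstrated with a brief sketch. The helper names and the field sizes (4 and 8 bytes) are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if the first n bytes of buf contain a zero terminator. */
int has_terminator(const char *buf, size_t n)
{
    return memchr(buf, '\0', n) != NULL;
}

/* Copies src into a 4-byte field with strncpy and reports whether
   the result is a terminated string. When src has 4 or more
   characters, strncpy fills the field and writes no terminator. */
int field_is_terminated(const char *src)
{
    char field[4];
    assert(src != NULL);
    strncpy(field, src, sizeof field);
    return has_terminator(field, sizeof field);
}

/* Counts the zero bytes strncpy leaves in an 8-byte field that was
   pre-filled with 'x'. Short sources are padded with zeros all the
   way to the end of the field. */
size_t padding_bytes(const char *src)
{
    char field[8];
    size_t i, zeros = 0;
    assert(src != NULL);
    memset(field, 'x', sizeof field);
    strncpy(field, src, sizeof field);
    for (i = 0; i < sizeof field; ++i) {
        if (field[i] == '\0')
            ++zeros;
    }
    return zeros;
}
```

Copying "abcdef" into the 4-byte field leaves it unterminated, while copying "ab" into the 8-byte field writes six zero bytes after the two characters.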

[116][117] Even while glibc hadn't added support, strlcat and strlcpy have been implemented in a number of other C libraries including ones for OpenBSD, FreeBSD, NetBSD, Solaris, OS X, and QNX, as well as in alternative C libraries for Linux, such as libbsd, introduced in 2008,[118] and musl, introduced in 2011,[119][120] and the source code added directly to other projects such as SDL, GLib, ffmpeg, rsync, and even internally in the Linux kernel.
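For readers unfamiliar with these functions, the following is a minimal sketch of strlcpy-style semantics: the copy is truncated to fit, the result is always terminated when the size is nonzero, and the return value is the length of the source. The name sketch_strlcpy is hypothetical; this is not the code of any of the libraries mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative sketch of strlcpy-style copying (hypothetical name).
   Copies at most size - 1 bytes of src into dst and terminates dst
   whenever size is nonzero. Returns strlen(src), so a return value
   greater than or equal to size signals truncation. */
size_t sketch_strlcpy(char *dst, const char *src, size_t size)
{
    size_t src_len;
    assert(src != NULL);
    src_len = strlen(src);
    if (size != 0) {
        size_t n = src_len < size - 1 ? src_len : size - 1;
        assert(dst != NULL);
        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return src_len;
}
```

A caller can detect truncation by comparing the return value with the buffer size, for example `if (sketch_strlcpy(buf, s, sizeof buf) >= sizeof buf)`.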

[122] These functions were standardized as part of POSIX.1-2024;[123] Austin Group Defect Tracker ID 986 tracked some of the discussion about such plans for POSIX.

Sometimes memcpy[53] or memmove[55] is used instead, since they may be more efficient than strcpy because they do not repeatedly check for NUL (this is less true on modern processors).
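A minimal sketch of this approach, assuming the caller already knows the source length and has reserved enough space in the destination, might look like the following. The function name copy_known_length is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies a string whose length is already known.
   The caller must guarantee that dst has room for len + 1 bytes and
   that dst and src do not overlap (memmove would be needed otherwise). */
char *copy_known_length(char *dst, const char *src, size_t len)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, len + 1); /* the extra byte copies the terminator */
    return dst;
}
```

Because the length is passed in, the copy does not have to test each byte for NUL; the size check remains the caller's responsibility.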

[124] These functions were standardized with some minor changes as part of the optional C11 (Annex K) proposed by ISO/IEC WDTR 24731.

[129] Experience with these functions has shown significant problems with their adoption and errors in usage, so the removal of Annex K is proposed for the next revision of the C standard.