Post History
Answer
#3: Post edited
- ## Understanding the representation of text in C
- ### "Text" is a high level abstraction that C doesn't properly support
- Fundamentally, **C does not have any built-in "string" type**. It arguably doesn't even have a real *character* type. `char` is really a *numeric, integral* type which has a size of one byte. (For historical reasons, the signedness of this type is implementation-defined, and `char` is treated as a distinct type from both `signed char` and `unsigned char`.) An array of `char`s is essentially a raw memory buffer.
- There is neither an implicitly assumed text encoding internally (although a few functions like `tolower` assume that it is *some single-byte, ASCII-transparent* encoding), nor more than rudimentary standard library support for multibyte encodings (locale-dependent conversion functions such as `mbstowcs` exist, but they do not process text in any deeper sense). The Windows API provides some support for UTF-16, but even then you have to worry about surrogate pairs. Thus, a `char` doesn't really hold an arbitrary character of text; you can *at best pretend* that byte values 0..127 represent characters with matching Unicode code points, and other byte values represent some other set of 128 Unicode code points.
- <section class="notice is-warning">
- **Any serious, internationalized text processing** - such as resolving text direction markers, normalizing precomposed characters, collation (locale-aware sorting), inspecting character properties, clustering graphemes etc. etc. - **requires a heavyweight third-party library**, practically speaking.
- </section>
- ### C's conventions for pseudo-textual data
- As noted above, an array of `char` values is basically a raw memory buffer. C programs generally use these to represent text (in a restricted, pre-Unicode way) as *null-terminated sequences*, often called "null-terminated strings" in the C literature.
- That is, the text is represented by a sequence of bytes which ends in a zero value (called a "null byte" in the literature). This byte is generally understood to represent [Unicode code point 0](https://en.wikipedia.org/wiki/Null_character). That character, in turn, [is called "NUL" in the ASCII standard](https://en.wikipedia.org/wiki/C0_and_C1_control_codes), but [not, pedantically, named in Unicode](https://www.unicode.org/charts/PDF/U0000.pdf) (although "NULL" is recognized as a common alias). "Strings" in a running program are commonly approximated by passing around pointers to, or into, such arrays (or passing the arrays, which decay to pointers).
- <section class="notice is-warning">
- It's important to understand that *none of these uses of "null", "NUL", "NULL" etc. have anything at all to do with* pointers.
- </section>
- Standard library string-handling functions blindly assume the presence of a null terminator, and treat it as the end of string.
- <section class="notice is-warning">
- If you need to represent strings that contain embedded null characters, you will have to either use a third-party library or create the abstractions yourself (i.e., track string length separately and write your own manipulation functions - although the standard library functions may be useful as helpers).
- </section>
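- Single-quoted values are *character literals* (the standard calls them character constants); perhaps surprisingly, in C they have type `int` rather than `char` (unlike in C++). Double-quoted values are *string literals*, which have the array type `char[N]`, where `N` counts the implicit null terminator. (These names are unfortunate, but a consequence of the history.)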
- Keep in mind that a `char *` *does not "contain"* the textual data, but merely points at it.
- ### Expectations for memory management
- Since C *does not provide any built-in garbage collection or other memory management*, the programmer is responsible for matching up heap allocations and deallocations (`malloc`/`free`) and for understanding the lifetime of automatic (stack) allocations.
- <details><summary>Common pitfalls</summary>
- <section class="notice is-danger">
- 1. Returning a string with automatic storage does not work. This includes trying to return a pointer to or into a local array, or the array itself (which decays to a pointer). **After the function has returned**, that memory is automatically deallocated from the stack - therefore, **the pointer is dangling, and using it is undefined behavior**.
- The common ways to work around this are:
- * Make a new dynamic allocation and return that pointer, with the explicit understanding that the caller is responsible for deallocation. For example, `strdup` (long available via POSIX, and part of standard C since C23) does this.
- * Return a pointer to or into an existing, passed-in array. If the function modifies the array contents, it must have some way to ensure that the modification is safe - either by doing something that *can't lengthen* the sequence (and ensures that it remains null terminated), or by *expecting the caller* to be responsible for advertising the array size.
- * Modify an existing, passed-in array without returning any pointer (perhaps using an `int` return value for an error code).
- 1. For historical reasons, a string literal is not typed as `const`, but **modifying the array that a string literal denotes is still undefined behavior**. The implementation may place that array in non-modifiable data stored within the executable itself. (An ordinary `char` array that is merely *initialized from* a string literal is a separate copy, and may be modified.)
- 1. **Memory allocated to hold a "string" must include room for a null terminator**. Because of pointer decay, standard library functions have no way to be aware of the array bounds. (They can, at best, trust information that was passed in separately by the caller.) If no terminator is found within the allocated memory, undefined behavior will result.
- 1. Similarly, if a standard library function is expected to lengthen the data, the underlying allocation must be large enough to accommodate the result, including the final null terminator.
- 1. **A `char *` must point at a valid allocation** to be used "as a string". Unlike with `char []`, no allocation is implied. Doing something like `char* str; scanf("%s", str);` is undefined behavior because the pointer is uninitialized and does not point at any usable memory. A pointer *does not contain* the textual data, but merely *points at* it - hence the name.
- 1. **A `char[]` cannot be reassigned** (instead, its elements can be modified), and **a stack-allocated array may not be `free()`d**.
- 1. **Reassigning a `char*` risks a memory leak** if the previous referent was dynamically allocated and no other pointer to it remains, **or a double free** if the new referent is an allocation that some other pointer will also be used to free. Such assignment *does not* modify or copy the pointed-at memory, but just resets the pointer. Again, **the programmer is responsible for tracking allocations** and for arranging the program such that every dynamic allocation is eventually freed exactly once.
- </section>
- </details>
- <details><summary>Tips and Tricks</summary>
- <section class="notice is-success">
- 1. To measure the "length of a string" at runtime, use the standard library `strlen`.
- <section class="notice is-warning">
- Because a `char *` doesn't contain the data, and because `sizeof` is (variable-length arrays aside) evaluated at compile time, `sizeof` can't tell you the length of the data. `sizeof` on a pointer will report the size of pointer types on the platform. Similarly, `sizeof` on an array will report the *allocated size*, regardless of where the null terminator is (or whether it's present at all).
- </section>
- 1. In both string and character literals, the syntax `\0` represents a null character. (Of course, a `char` variable can also be assigned the integer value `0`, but `'\0'` is preferred.) Normally a string literal should not contain this, because standard library functions will ignore anything after that point. However, it can be useful to initialize a buffer that contains several consecutive null-terminated strings.
- <section class="notice is-warning">
- Because the null character has nothing to do with pointers, it is incorrect to assign `NULL` to a `char`. [Platforms have historically existed](https://c-faq.com/null/machexamp.html) where a "null pointer" does not consist of all unset bits, and the implicit conversion of a pointer to a single-byte integer will be platform-dependent (and should cause a compiler warning).
- </section>
- 1. When an array is initialized from a string literal, it is *not necessary to specify* the size. For example, `char text[] = "hello";` compiles, and the type of `text` will be `char[6]` - that is, *C will automatically account for the null terminator* and size the array to exactly enough space for the literal. (Alternately, you can say that the literal syntax implicitly specifies that terminator.)
- <section class="notice is-warning">
- However, be aware that brace-initialized arrays will *not* automatically have this null terminator - it must be specified explicitly:
- ```
- char text[] = {'h', 'e', 'l', 'l', 'o', '\0' /* needed! */};
- ```
- </section>
- </section>
- </details>
#2: Post edited
- ## Understanding the representation of text in C
- ### "Text" is a high level abstraction that C doesn't properly support
- Fundamentally, **C does not have any built-in "string" type**. It arguably doesn't even have a real *character* type. `char` is really a *numeric, integral* type which has a size of one byte. (For historical reasons, the signedness of this type is implementation-defined, and `char` is treated as a distinct type from both `signed char` and `unsigned char`.) An array of `char`s is essentially a raw memory buffer.
- There is neither an implicitly assumed text encoding internally (although a few functions like `tolower` assume that it is *some single-byte, ASCII-transparent* encoding), nor more than rudimentary standard library support for multibyte encodings (locale-dependent conversion functions such as `mbstowcs` exist, but they do not process text in any deeper sense). The Windows API provides some support for UTF-16, but even then you have to worry about surrogate pairs. Thus, a `char` doesn't really hold an arbitrary character of text; you can *at best pretend* that byte values 0..127 represent characters with matching Unicode code points, and other byte values represent some other set of 128 Unicode code points.
- <section class="notice is-warning">
- **Any serious, internationalized text processing** - such as resolving text direction markers, normalizing precomposed characters, collation (locale-aware sorting), inspecting character properties, clustering graphemes etc. etc. - **requires a heavyweight third-party library**, practically speaking.
- </section>
- ### C's conventions for pseudo-textual data
- As noted above, an array of `char` values is basically a raw memory buffer. C programs generally use these to represent text (in a restricted, pre-Unicode way) as *null-terminated sequences*, often called "null-terminated strings" in the C literature.
- That is, the text is represented by a sequence of bytes which ends in a zero value (called a "null byte" in the literature). This byte is generally understood to represent [Unicode code point 0](https://en.wikipedia.org/wiki/Null_character). That character, in turn, [is called "NUL" in the ASCII standard](https://en.wikipedia.org/wiki/C0_and_C1_control_codes), but [not, pedantically, named in Unicode](https://www.unicode.org/charts/PDF/U0000.pdf) (although "NULL" is recognized as a common alias). "Strings" in a running program are commonly approximated by passing around pointers to, or into, such arrays (or passing the arrays, which decay to pointers).
- <section class="notice is-warning">
- It's important to understand that *none of these uses of "null", "NUL", "NULL" etc. have anything at all to do with* pointers.
- </section>
- Standard library string-handling functions blindly assume the presence of a null terminator, and treat it as the end of string.
- <section class="notice is-warning">
- If you need to represent strings that contain embedded null characters, you will have to either use a third-party library or create the abstractions yourself (i.e., track string length separately and write your own manipulation functions - although the standard library functions may be useful as helpers).
- </section>
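- Single-quoted values are *character literals* (the standard calls them character constants); perhaps surprisingly, in C they have type `int` rather than `char` (unlike in C++). Double-quoted values are *string literals*, which have the array type `char[N]`, where `N` counts the implicit null terminator. (These names are unfortunate, but a consequence of the history.)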
- Keep in mind that a `char *` *does not "contain"* the textual data, but merely points at it.
- ### Expectations for memory management
- Since C *does not provide any built-in garbage collection or other memory management*, the programmer is responsible for matching up heap allocations and deallocations (`malloc`/`free`) and for understanding the lifetime of automatic (stack) allocations.
- <details><summary>Common pitfalls</summary>
- <section class="notice is-danger">
- 1. Returning a string with automatic storage does not work. This includes trying to return a pointer to or into a local array, or the array itself (which decays to a pointer). After the function has returned, that memory is automatically deallocated from the stack - therefore, the pointer is dangling, and using it is undefined behavior.
- The common ways to work around this are:
- * Make a new dynamic allocation and return that pointer, with the explicit understanding that the caller is responsible for deallocation. For example, `strdup` (long available via POSIX, and part of standard C since C23) does this.
- * Return a pointer to or into an existing, passed-in array. If the function modifies the array contents, it must have some way to ensure that the modification is safe - either by doing something that *can't lengthen* the sequence (and ensures that it remains null terminated), or by *expecting the caller* to be responsible for advertising the array size.
- * Modify an existing, passed-in array without returning any pointer (perhaps using an `int` return value for an error code).
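- 1. For historical reasons, a string literal is not typed as `const`, but modifying the array that a string literal denotes is still undefined behavior. The implementation may place that array in non-modifiable data stored within the executable itself. (An ordinary `char` array that is merely *initialized from* a string literal is a separate copy, and may be modified.)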
- 1. Memory allocated to hold a "string" must include room for a null terminator. Because of pointer decay, standard library functions have no way to be aware of the array bounds. (They can, at best, trust information that was passed in separately by the caller.) If no terminator is found within the allocated memory, undefined behavior will result.
- 1. Similarly, if a standard library function is expected to lengthen the data, the underlying allocation must be large enough to accommodate the result, including the final null terminator.
- 1. A `char *` must point at a valid allocation to be used this way. Unlike with `char []`, no allocation is implied. Doing something like `char* str; scanf("%s", str);` is undefined behavior because the pointer is uninitialized: it does not point at any storage that the program owns. A pointer *does not contain* the textual data, but merely *points at* it - hence the name.
- 1. A `char[]` cannot be reassigned (instead, its elements can be modified), and a stack-allocated array may not be `free()`d.
- 1. Reassigning a `char*` whose previous referent was dynamically allocated risks a memory leak (if no other copy of the old pointer remains) or a later double free (if the same allocation ends up being freed through two pointers). This holds even if the new value also points at a dynamic allocation. Such assignment *does not* modify the pointed-at memory, but just resets the pointer. Again, the programmer is responsible for tracking allocations and for arranging the program such that every dynamic allocation is eventually freed exactly once.
- </section>
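As a minimal sketch of the first two workarounds from pitfall 1, the following shows one function that returns a fresh dynamic allocation and one that fills a caller-supplied array. The names `copy_string` and `copy_into` are hypothetical, chosen only for illustration; they are not standard library functions.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a newly allocated copy of `src`.
 * The caller owns the result and must free() it exactly once.
 * Returns NULL if the allocation fails. */
char *copy_string(const char *src)
{
    size_t len = strlen(src);
    char *copy = malloc(len + 1);      /* +1 leaves room for the null terminator */
    if (copy == NULL) {
        return NULL;
    }
    memcpy(copy, src, len + 1);        /* copies the terminator as well */
    return copy;
}

/* Hypothetical helper: writes into an array supplied by the caller,
 * who advertises its total size in `cap`.
 * Returns 0 on success, or -1 (leaving `dst` untouched) if `src`
 * plus its terminator would not fit. */
int copy_into(char *dst, size_t cap, const char *src)
{
    size_t len = strlen(src);
    if (len >= cap) {                  /* need len + 1 bytes in total */
        return -1;
    }
    memcpy(dst, src, len + 1);
    return 0;
}
```

A caller would pair every successful `copy_string` call with one `free`, and would typically pass `sizeof buf` as `cap` when `buf` is a true array in the calling scope.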
- </details>
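The pitfalls about terminators and about lengthening data can be illustrated with a bounds-checked append. This is only a sketch under the assumption that the caller knows the total array size; `append_checked` is a hypothetical name, not a standard function.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: appends `src` to the null-terminated contents of
 * `dst`, where `cap` is the total size of the array `dst` points into.
 * Returns 0 on success. Returns -1, leaving `dst` unchanged, if no
 * terminator is found within the first `cap` bytes or if the result
 * plus its terminator would not fit. */
int append_checked(char *dst, size_t cap, const char *src)
{
    /* memchr never reads past `cap` bytes, unlike strlen. */
    const char *end = memchr(dst, '\0', cap);
    if (end == NULL) {
        return -1;
    }
    size_t used = (size_t)(end - dst);
    size_t add = strlen(src);
    if (add >= cap - used) {           /* need used + add + 1 <= cap */
        return -1;
    }
    memcpy(dst + used, src, add + 1);  /* also copies the new terminator */
    return 0;
}
```

The design choice here is to refuse the operation rather than truncate silently; a function such as `snprintf` takes the other approach and reports how much space would have been needed.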
- <details><summary>Tips and Tricks</summary>
- <section class="notice is-success">
- 1. To measure the "length of a string" at runtime, use the standard library `strlen`.
- <section class="notice is-warning">
- Because a `char *` doesn't contain the data, and because `sizeof` is a compile-time operator, `sizeof` can't tell you the length of the data. `sizeof` on a pointer will report the size of pointer types on the platform. Similarly, `sizeof` on an array will report the *allocated size*, regardless of where the null terminator is (or whether it's present at all).
- </section>
- 1. In both string and character literals, the syntax `\0` represents a null character. (Of course, a `char` variable can also be assigned the integer value `0`, but `'\0'` is preferred.) Normally a string literal should not contain this, because standard library functions will ignore anything after that point. However, it can be useful to initialize a buffer that contains several consecutive null-terminated strings.
- <section class="notice is-warning">
- Because the null character has nothing to do with pointers, it is incorrect to assign `NULL` to a `char`. [Platforms have historically existed](https://c-faq.com/null/machexamp.html) where a "null pointer" does not consist of all unset bits, and the implicit conversion of a pointer to a single-byte integer will be platform-dependent (and should cause a compiler warning).
- </section>
- 1. When an array is initialized from a string literal, it is *not necessary to specify* the size. For example, `char text[] = "hello";` compiles, and the type of `text` will be `char[6]` - that is, *C will automatically account for the null terminator* and size the array to exactly enough space for the literal. (Alternately, you can say that the literal syntax implicitly specifies that terminator.) Note that the brackets follow the variable name; `char[] text` is not valid C.
- <section class="notice is-warning">
- However, be aware that brace-initialized arrays will *not* automatically have this null terminator - it must be specified explicitly:
- ```
- char text[] = {'h', 'e', 'l', 'l', 'o', '\0' /* needed! */};
- ```
- </section>
- </section>
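The difference between `sizeof` and `strlen` described above can be checked with a short sketch. The function names `size_seen_by_callee` and `measure` are hypothetical and exist only for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* By the time a string reaches a function, it is only a pointer,
 * so sizeof measures the pointer rather than the caller's array. */
size_t size_seen_by_callee(const char *s)
{
    return sizeof s;               /* same as sizeof(const char *) */
}

/* Hypothetical demo: records three measurements of a local array. */
void measure(size_t out[3])
{
    char text[] = "hello";         /* type char[6]: five letters + terminator */
    out[0] = sizeof text;          /* allocated size: 6 */
    out[1] = strlen(text);         /* bytes before the terminator: 5 */
    text[2] = '\0';                /* cut the string short in place */
    out[2] = strlen(text);         /* now 2, while sizeof text is still 6 */
}
```

The value returned by `size_seen_by_callee` depends on the platform (commonly 4 or 8), which is exactly why it says nothing about the text.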
- </details>
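As one possible use of embedded `\0` in a literal, here is a sketch that walks a buffer holding several consecutive null-terminated strings. It assumes a layout in which the list ends with an empty string (that is, two null characters in a row); that convention and the name `count_packed` are assumptions made for this example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts the strings in a buffer that holds several
 * consecutive null-terminated strings and ends with an empty string. */
size_t count_packed(const char *packed)
{
    size_t count = 0;
    while (*packed != '\0') {          /* an empty string marks the end */
        packed += strlen(packed) + 1;  /* skip this string and its terminator */
        ++count;
    }
    return count;
}
```

Calling `count_packed("one\0two\0three\0")` visits three strings: the literal's implicit terminator follows the explicit final `\0`, which supplies the empty string that ends the list. Be careful when the character after `\0` is an octal digit, because it would be read as part of a longer octal escape.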