Welcome to Software Development on Codidact!
Common string handling pitfalls in C programming
This is a self-answered Q&A meant as a C string handling FAQ. It will ask several questions at once, which isn't ideal, but they are all closely related and I'd rather not fragment the post into several.
Code written by beginners to C, or found on C programming forums, frequently contains a few specific string handling bugs. Even experienced programmers coming from a higher level language and picking up C may make these mistakes.
These bugs seem to result from expecting C to have a built-in string class (like most languages do) which would handle all string operations and memory allocation for them.
Here are some frequently occurring bugs with corresponding questions:
Bug 1)
char str = "hello";
This will luckily not even compile if the compiler is configured correctly, see What compiler options are recommended for beginners learning C?
Question: Why doesn't this work? Does C have a string class?
Bug 2)
char str[5] = "hello";
Compiles just fine, yet when printing this there will be garbage printed or other strange behavior. This bug is related to character arrays and missing null termination.
Question: What exactly does a string consist of in C?
Bug 3)
char* str; scanf("%s", str);
Compiles just fine, though if lucky there can be warnings. This bug is related to memory allocation.
Question: Who is responsible for allocating memory for the string?
Bug 4)
char* str = malloc(5+1); str = "hello";
Compiles just fine, though there are memory leaks.
Question: How can a string get assigned a new value?
Bug 5)
char str[5+1] = "hello";
...
if(str == "hello")
Compiles just fine but gives the wrong results.
Question: How do you properly compare strings?
2 answers
The reader is assumed to understand how arrays and pointers work in C. You cannot understand pointers before you understand arrays, and you cannot understand strings before you understand both pointers and arrays. Decent C books therefore teach arrays, then pointers, then strings, in that order.
Some of the below text was taken from an original article written by me here on Stack Overflow.
Does C have a string class?
No, and `char` is definitely not such a class.
Therefore the code shown will not compile. You cannot assign a string to a single character, because a single character is what it sounds like, a single letter.
In C you have to handle everything manually: allocation, assignment, copies, comparisons. There is a standard library `string.h` which does contain some helpful functions though.
What exactly does a string consist of in C?
A C string is a character array that ends with a null terminator.
All characters have a symbol table value. The null terminator is the symbol value `0` (zero). It is used to mark the end of a string. This is necessary since the size of the string isn't stored anywhere. Therefore, every time you allocate room for a string, you must include sufficient space for the null terminator character.
The example code does not do this, it only allocates room for the 5 characters of `"hello"`. Correct code should be:
char str[6] = "hello";
Or equivalently, you can write self-documenting code for 5 characters plus 1 null terminator:
char str[5+1] = "hello";
But you can also use this and let the compiler do the counting and pick the size:
char str[] = "hello"; // Will allocate 6 bytes automatically
If you don't append a null terminator at the end of a string, then library functions expecting a string won't work properly and you will get "undefined behavior" bugs such as garbage output or program crashes. That's what happens if you attempt to print the string in Bug 2).
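As a small sketch of the sizes involved (the helper name `has_terminator` and the array names are made up for this illustration), the following checks whether a character array contains a null terminator within its bounds. It uses `memchr`, which never reads past the given size, so it is safe even on the unterminated array from Bug 2):

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true if a null terminator exists within the
   first `size` bytes of `buf`. memchr never reads beyond `size` bytes,
   so this is safe to call even on an unterminated array. */
static bool has_terminator(const char *buf, size_t size)
{
    return memchr(buf, '\0', size) != NULL;
}

/* Two of the correct forms from above, plus the buggy one from Bug 2). */
static const char ok_explicit[6]   = "hello"; /* 5 characters + terminator */
static const char ok_counted[]     = "hello"; /* compiler picks size 6 */
static const char bug_too_small[5] = "hello"; /* valid C, but no room for '\0' */
```

Calling `has_terminator(bug_too_small, sizeof bug_too_small)` returns false, which is exactly why passing that array to `printf("%s", ...)` or `strlen` is undefined behavior.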
The most common way to write a null terminator character in C is by using a so-called "octal escape sequence", looking like this: `'\0'`. This is 100% equivalent to writing `0`, but the `\` serves as self-documenting code to state that the zero is explicitly meant to be a null terminator. Code such as `if(str[i] == '\0')` will check if the specific character is the null terminator.
So you can even do the above examples explicitly, character by character:
char str[6] = {'h', 'e', 'l', 'l', 'o', '\0'};
Please note that the term null terminator has nothing to do with null pointers or the `NULL` macro! This can be confusing - very similar names but very different meanings. This is why the null terminator is sometimes referred to as `NUL` with one L, not to be confused with `NULL` or null pointers.
The `"hello"` part of the code is called a string literal. This is to be regarded as a read-only string. The `""` syntax means that the compiler will append a null terminator at the end of the string literal automatically. So if you print out `sizeof("hello")` you will get 6, not 5, because you get the size of the array including a null terminator.
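To make the difference between size and length concrete, here is a minimal sketch (the function names are invented for the example) contrasting `sizeof` and `strlen` on the same literal, and confirming that `'\0'` and `0` are the same value:

```c
#include <stddef.h>
#include <string.h>

/* sizeof on a string literal counts the terminator added by the compiler. */
static size_t literal_size(void)
{
    return sizeof("hello");
}

/* strlen counts characters up to, but not including, the terminator. */
static size_t literal_length(void)
{
    return strlen("hello");
}

/* '\0' and 0 are the same value; str[5] is the terminator here. */
static int terminator_is_zero(void)
{
    const char *str = "hello";
    return str[5] == '\0' && '\0' == 0;
}
```

Here `literal_size()` gives 6 while `literal_length()` gives 5.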
Who is responsible for allocating memory for the string?
You are - the C programmer.
That's why the code in Bug 3) misbehaves and typically crashes: you cannot just store a string wherever an uninitialized pointer happens to point. It needs to point at valid, allocated memory.
So you need to allocate an array somewhere, sufficiently large to hold the string, including null termination. You could do this as a local character array as in the above examples, or you can do this by determining the size in run-time.
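A possible fix for Bug 3) using a local character array could look like the sketch below. To keep it self-contained it parses a given text with `sscanf` instead of reading standard input with `scanf`; both take the same format string. The names `read_word` and `WORD_MAX`, and the limit of 15 characters, are assumptions made for this example only.

```c
#include <stdio.h>

#define WORD_MAX 15  /* hypothetical limit: longest word we accept */

/* Reads one whitespace-delimited word from `text` into `word`, which must
   have room for WORD_MAX characters plus the null terminator.
   Returns 1 if a word was read, 0 otherwise. */
static int read_word(const char *text, char word[WORD_MAX + 1])
{
    /* The field width 15 must match WORD_MAX: it stops sscanf from writing
       more than 15 characters plus the terminator into the buffer. */
    return sscanf(text, "%15s", word) == 1;
}
```

The caller owns the memory, for example `char word[WORD_MAX + 1];`, and the field width keeps an overly long input from overflowing it.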
When allocating memory for a string dynamically in run-time, remember to also allocate room for the null terminator:
char input[n] = ... ;
...
char* str = malloc(strlen(input) + 1);
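Spelled out as a complete function, the dynamic case might look like this. It is only a sketch: `copy_string` is a hypothetical helper name, and a real program would decide for itself how to handle allocation failure.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a heap-allocated copy of `input`, or NULL if
   allocation fails. The caller must free() the result. */
static char *copy_string(const char *input)
{
    size_t size = strlen(input) + 1;   /* +1 for the null terminator */
    char *str = malloc(size);
    if (str != NULL) {
        memcpy(str, input, size);      /* copies the terminator as well */
    }
    return str;
}
```

Forgetting the `+ 1` here would compile without complaint, but the copy would write one byte past the allocation.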
Notably, this array also has to be read/write memory. If we do something like
char* str = "hello"; str[0] = 'a';
then it compiles just fine but invokes undefined behavior and commonly crashes in run-time. This is because the string literal `"hello"` is a null-terminated character array which the compiler typically stores in read-only memory.
You can use string literals just as strings, but you can never write to them. Therefore it is strongly recommended to only point at them with a pointer to read-only data:
const char* str = "hello";
This pointer can however (unlike a pointer to dynamic memory, see Bug 4)) be safely set to point at a different string literal, so when dealing with a lot of string look-ups, an array of pointers to `const char` might be a sensible choice.
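Such a look-up table could be sketched as follows. The table contents and the names `color_names` and `color_name` are purely illustrative.

```c
#include <stddef.h>

/* Hypothetical look-up table: each element points at a string literal.
   The second const makes the pointers themselves read-only as well. */
static const char *const color_names[] = { "red", "green", "blue" };

/* Returns the name for `index`, or "unknown" if it is out of range. */
static const char *color_name(size_t index)
{
    const size_t count = sizeof color_names / sizeof color_names[0];
    return index < count ? color_names[index] : "unknown";
}
```

A variable declared as `const char* p = color_name(0);` can later be set to `color_name(2)` without any copying or leak, since only the pointer changes and the literals are never written to.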
How can a string get assigned a new value?
Either during initialization or by modifying the pointed-at memory.
The above examples show various different ways to create a string by allocating an array or by having a pointer point at a string allocated elsewhere. But if you need to change this string in run-time, you can't just write `str = "new value"`.
In case `str` in that example is an array, then it won't work because C simply wasn't designed to do assignment to arrays in run-time. In case `str` is a pointer, then it will work by having `str` point at a string literal as previously explained.
But it will forget all about where it previously pointed - if it for example previously pointed at dynamically allocated memory like in Bug 4), then we have a memory leak.
The normal way to assign a value to a string in run-time is to use the `strcpy` function (which is a perfectly safe function, see Is strcpy dangerous and what should be used instead?). It works as `strcpy(destination, source)`, where destination must be a valid memory area holding a large-enough character array. For details see man strcpy.
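Here is one way the two cases could be handled, as a sketch rather than a definitive recipe. Both function names are hypothetical: `assign_string` covers the array case by checking the size before calling `strcpy`, and `replace_string` covers the Bug 4) situation where the old value lives in dynamically allocated memory.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Copies `src` into the array `dst` of `dst_size` bytes, but only if it
   fits including the null terminator. Returns false (and leaves `dst`
   untouched) if it does not fit. */
static bool assign_string(char *dst, size_t dst_size, const char *src)
{
    if (strlen(src) + 1 > dst_size) {
        return false;
    }
    strcpy(dst, src);
    return true;
}

/* Replaces the heap-allocated string `*str` with a copy of `src`.
   The new block is allocated first and the old one freed afterwards, so
   nothing leaks and `*str` stays valid if allocation fails. */
static bool replace_string(char **str, const char *src)
{
    char *fresh = malloc(strlen(src) + 1);
    if (fresh == NULL) {
        return false;
    }
    strcpy(fresh, src);
    free(*str);   /* free(NULL) is a no-op, so *str may also be NULL */
    *str = fresh;
    return true;
}
```

With `char str[5+1] = "hello";`, a call like `assign_string(str, sizeof str, "bye")` succeeds, while a longer source string is rejected instead of overflowing the array.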
How do you properly compare strings?
Character by character.
Code such as Bug 5) with the `==` equality operator won't work because it doesn't compare the contents of the strings, just their addresses. So Bug 5) is just comparing the address of a local character array with the address of a string literal, which is nonsense.
Instead, the character arrays have to be compared character by character. Note that they can have different lengths too, so one needs to check for the null terminator of either character array while iterating through them.
The `strcmp()` function does all this in an efficient manner, so the easiest and most correct solution is just to call that one. It works as `strcmp(first_string, second_string)` and returns a value less than zero, greater than zero, or zero if the first string is considered less than, greater than, or equal to the second string, respectively. The `strcmp` implementation will likely just compare symbol values of the characters, so "less than" might mean alphabetically, though without care taken of things like lower/upper case, digits or punctuation. See man strcmp for details.
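To illustrate, the sketch below shows a corrected form of the Bug 5) comparison, plus a hand-written loop that spells out what "character by character" means. The names `strings_equal` and `compare_chars` are made up for the example; in real code, just call `strcmp`.

```c
#include <stdbool.h>
#include <string.h>

/* Fixed version of Bug 5): compare contents, not addresses. */
static bool strings_equal(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}

/* Illustrative hand-written comparison. Returns a value less than,
   equal to, or greater than zero, like strcmp. */
static int compare_chars(const char *a, const char *b)
{
    /* Advance while the characters match and the end is not reached.
       If `b` ends first, *a != *b stops the loop as well. */
    while (*a != '\0' && *a == *b) {
        a++;
        b++;
    }
    /* Compare as unsigned char, as strcmp is specified to do. */
    return (unsigned char)*a - (unsigned char)*b;
}
```

So the check in Bug 5) becomes `if(strings_equal(str, "hello"))`, or equivalently `if(strcmp(str, "hello") == 0)`.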
Understanding the representation of text in C
"Text" is a high level abstraction that C doesn't properly support
Fundamentally, C does not have any built-in "string" type. It arguably doesn't even have a real character type. `char` is really a numeric, integral type which has a size of one byte. (For historical reasons, the signedness of this type is implementation-defined, and `char` is treated as a distinct type from both `signed char` and `unsigned char`.) An array of `char`s is essentially a raw memory buffer.
There is neither an implicitly assumed text encoding internally (although a few functions like `tolower` assume that it is some single-byte, ASCII-transparent encoding), nor more than limited, locale-dependent standard library support for multibyte encodings. (The Windows API provides some support for UTF-16, but even then you have to worry about surrogate pairs.) Thus, a `char` doesn't really hold an arbitrary character of text; you can at best pretend that byte values 0..127 represent characters with matching Unicode code points, and other byte values represent some other set of 128 Unicode code points.
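A short sketch may help show that `char` is a small integer rather than "a character". The names below are invented for the example, and the two bytes are the UTF-8 encoding of the letter e with an acute accent (U+00E9), written as hex escapes so that the encoding of the source file does not matter.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Whether plain char is signed is implementation-defined;
   CHAR_MIN from <limits.h> reveals the choice made by the compiler. */
static bool plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* One letter (U+00E9) that occupies two bytes when encoded as UTF-8. */
static const char e_acute_utf8[] = "\xC3\xA9";

/* strlen counts bytes, not characters: this returns 2 for one letter. */
static size_t e_acute_byte_count(void)
{
    return strlen(e_acute_utf8);
}
```

Any code that treats one `char` as one character of text silently breaks on such input.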
Any serious, internationalized text processing - such as resolving text direction markers, normalizing precomposed characters, collation (locale-aware sorting), inspecting character properties, clustering graphemes etc. etc. - requires a heavyweight third-party library, practically speaking.
C's conventions for pseudo-textual data
As noted above, an array of `char` values is basically a raw memory buffer. C programs generally use these to represent text (in a restricted, pre-Unicode way) as null-terminated sequences, often called "null-terminated strings" in the C literature.
That is, the text is represented by a sequence of bytes which ends in a zero value (called a "null byte" in the literature). This byte is generally understood to represent Unicode code point 0. That character, in turn, is called "NUL" in the ASCII standard, but not, pedantically, named in Unicode (although "NULL" is recognized as a common alias). "Strings" in a running program are commonly approximated by passing around pointers to, or into, such arrays (or passing the arrays, which decay to pointers).
It's important to understand that none of these uses of "null", "NUL", "NULL" etc. have anything at all to do with pointers.
Standard library string-handling functions blindly assume the presence of a null terminator, and treat it as the end of string.
If you need to represent strings that contain embedded null characters, you will have to either use a third-party library or create the abstractions yourself (i.e., track string length separately and write your own manipulation functions - although the standard library functions may be useful as helpers).
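A minimal sketch of such a home-made abstraction is shown below. The type `counted_text` and its comparison function are hypothetical, not part of any library; the point is only that the length travels with the pointer, and that `memcmp` (unlike `strcmp`) does not stop at a null character.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical type: a byte sequence whose length is stored explicitly,
   so it may contain embedded null characters. */
struct counted_text {
    const char *data;
    size_t length;
};

/* Compares two counted sequences for equality using memcmp, which
   compares exactly `length` bytes regardless of their values. */
static bool counted_text_equal(struct counted_text a, struct counted_text b)
{
    return a.length == b.length && memcmp(a.data, b.data, a.length) == 0;
}

/* 5 bytes of payload with a null in the middle; sizeof includes one more
   byte for the terminator that the literal syntax appends. */
static const char with_null[] = "ab\0cd";
```

For this buffer, `strlen(with_null)` reports 2, while `sizeof with_null - 1` gives the real payload length of 5.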
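Single-quoted values are character literals; in C these have type `int` (not `char`, as they do in C++), although such a value is normally stored in a `char`. Double-quoted values are string literals, representing an array of type `char[N]`, where N counts the characters plus the null terminator. (These names are unfortunate, but a consequence of the history.)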
Keep in mind that a `char *` does not "contain" the textual data, but merely points at it.
Expectations for memory management
Since C does not provide any built-in garbage collection or other memory management, the programmer is responsible for matching up heap allocations and deallocations (`malloc`/`free`) and for understanding the lifetime of automatic (stack) allocations.
Common pitfalls
- Returning a string with automatic storage does not work. This includes trying to return a pointer to or into a local array, or the array itself (which decays to a pointer). After the function has returned, that memory is automatically deallocated from the stack - therefore, the pointer is dangling, and using it is undefined behavior.
  The common ways to work around this are:
  - Make a new dynamic allocation and return that pointer, with the explicit understanding that the caller is responsible for deallocation. For example, `strdup` (POSIX, and part of standard C since C23) does this.
  - Return a pointer to or into an existing, passed-in array. If the function modifies the array contents, it must have some way to ensure that the modification is safe - either by doing something that can't lengthen the sequence (and ensures that it remains null terminated), or by expecting the caller to be responsible for advertising the array size.
  - Modify an existing, passed-in array without returning any pointer (perhaps using an `int` return value for an error code).
- For historical reasons, a string literal is not typed as `const`, but modifying the array that a string literal denotes is still undefined behavior. The program may initialize these using non-modifiable data stored within the executable itself.
- Memory allocated to hold a "string" must include room for a null terminator. Because of pointer decay, standard library functions have no way to be aware of the array bounds. (They can, at best, trust information that was passed in separately by the caller.) If no terminator is found within the allocated memory, undefined behavior will result.
- Similarly, if a standard library function is expected to lengthen the data, the underlying allocation must be large enough to accommodate the result, including the final null terminator.
- A `char *` must point at a valid allocation to be used "as a string". Unlike with `char []`, no allocation is implied. Doing something like `char* str; scanf("%s", str);` is undefined behavior because the pointer is uninitialized. A pointer does not contain the textual data, but merely points at it - hence the name.
- A `char[]` cannot be reassigned (instead, its elements can be modified), and a stack-allocated array may not be `free()`d.
- Reassigning a `char*` risks a memory leak if the previous referent was dynamically allocated and no other pointer to it remains, and copying such pointers around risks a double free. Such assignment does not modify the pointed-at memory, but just resets the pointer. Again, the programmer is responsible for tracking allocations and for arranging the program such that every dynamic allocation is eventually freed exactly once.
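Two of the workarounds for returning a string can be sketched as follows. The function names and the greeting text are invented for the example: `make_greeting` returns a new dynamic allocation that the caller must free, while `fill_greeting` writes into a caller-provided buffer and reports success through an `int`.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Returns a new dynamic allocation holding the greeting, or NULL if
   allocation fails. The caller is responsible for calling free(). */
static char *make_greeting(const char *name)
{
    const char prefix[] = "Hello, ";
    /* sizeof prefix includes its terminator, so subtract 1, then add 1
       again for the terminator of the combined string. */
    size_t size = (sizeof prefix - 1) + strlen(name) + 1;
    char *result = malloc(size);
    if (result != NULL) {
        strcpy(result, prefix);
        strcat(result, name);
    }
    return result;
}

/* Writes the greeting into a caller-provided buffer of `size` bytes.
   Returns 0 on success, -1 if the result would not fit (in which case
   the buffer holds a truncated, still null-terminated string). */
static int fill_greeting(char *buffer, size_t size, const char *name)
{
    int needed = snprintf(buffer, size, "Hello, %s", name);
    return (needed >= 0 && (size_t)needed < size) ? 0 : -1;
}
```

The second form pushes the allocation decision to the caller, who can use a local array such as `char buf[16];` and pass `sizeof buf` along with it.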
Tips and Tricks
- To measure the "length of a string" at runtime, use the standard library `strlen`.
  Because a `char *` doesn't contain the data, and because `sizeof` is (apart from variable-length arrays) a compile-time operator, `sizeof` can't tell you the length of the data. `sizeof` on a pointer will report the size of pointer types on the platform. Similarly, `sizeof` on an array will report the allocated size, regardless of where the null terminator is (or whether it's present at all).
- In both string and character literals, the syntax `\0` represents a null character. (Of course, a `char` variable can also be assigned the integer value `0`, but `'\0'` is preferred.) Normally a string literal should not contain this, because standard library functions will ignore anything after that point. However, it can be useful to initialize a buffer that contains several consecutive null-terminated strings.
  Because the null character has nothing to do with pointers, it is incorrect to assign `NULL` to a `char`. Platforms have historically existed where a "null pointer" does not consist of all unset bits, and the implicit conversion of a pointer to a single-byte integer will be platform-dependent (and should cause a compiler warning).
- When an array is initialized from a string literal, it is not necessary to specify the size. For example, `char text[] = "hello";` compiles, and the type of `text` will be `char[6]` - that is, C will automatically account for the null terminator and size the array to exactly enough space for the literal. (Alternately, you can say that the literal syntax implicitly specifies that terminator.)
  However, be aware that brace-initialized arrays will not automatically have this null terminator - it must be specified explicitly:
  char text[] = {'h', 'e', 'l', 'l', 'o', '\0' /* needed! */};
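As an illustration of the buffer with several consecutive null-terminated strings mentioned above, here is a small sketch. The buffer contents and the name `count_strings` are assumptions for the example; the convention that an empty string ends the list is one common choice, not a rule of the language.

```c
#include <stddef.h>
#include <string.h>

/* A buffer holding three consecutive null-terminated strings. The literal
   syntax appends one more terminator, so the list ends in an empty string. */
static const char names[] = "one\0two\0three\0";

/* Counts the strings in such a buffer by hopping from terminator to
   terminator until the empty string that marks the end of the list. */
static size_t count_strings(const char *list)
{
    size_t count = 0;
    while (*list != '\0') {
        list += strlen(list) + 1;   /* skip the string and its terminator */
        count++;
    }
    return count;
}
```

Note how the three measurements differ here: `sizeof names` is 15 (the whole array), `strlen(names)` is 3 (only up to the first terminator), and `count_strings(names)` is 3 (the number of strings).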