Common string handling pitfalls in C programming


Preface: This is a self-answered Q&A meant as a C string handling FAQ. It asks several questions at once, which isn't ideal, but they are all closely related and I'd rather not fragment the post into several.


When reading C programming forums or code written by beginners, one finds a number of frequently recurring bugs related to string handling. These are written not only by complete beginners, but just as often by experienced programmers coming from a higher-level language and picking up C.

These bugs typically originate from the assumption that C, like most languages, has a built-in string class that takes care of all string handling and memory allocation. Here follow some frequently occurring bugs and their related questions:



  • Bug 2) char str[5] = "hello";.

    Compiles just fine, yet when printing this there will be garbage printed or other strange behavior. This bug is related to character arrays and missing null termination.

    Question: What exactly does a string consist of in C?


  • Bug 3) char* str; scanf("%s", str);

    Compiles just fine, though if lucky there can be warnings. This bug is related to memory allocation.

    Question: Who is responsible for allocating memory for the string?


  • Bug 4) char* str = malloc(5+1); str = "hello";

    Compiles just fine, though there are memory leaks.

    Question: How can a string get assigned a new value?


  • Bug 5) char str[5+1] = "hello"; ... if(str == "hello").

    Compiles just fine but gives the wrong results.

    Question: How do you properly compare strings?



1 answer


The reader is assumed to understand how arrays and pointers work in C. You cannot understand pointers before you understand arrays, and you cannot understand strings before you understand both pointers and arrays. Decent C books therefore teach arrays, then pointers, then strings, in that order.

Some of the text below was taken from an original article written by me on Stack Overflow.


Q: Does C have a string class?
A: No, it does not. In particular, char is not a string class: it holds one single character.

Therefore bug 1) will not compile. You cannot assign a string to a single character, because a single character is what it sounds like, a single letter.

In C you have to handle everything manually: allocation, assignment, copies, comparisons. The standard library header string.h does contain some helpful functions, though.


Q: What exactly does a string consist of in C?
A: A C string is a character array that ends with a null terminator.

All characters have a symbol table value. The null terminator is the symbol value 0 (zero). It is used to mark the end of a string. This is necessary since the size of the string isn't stored anywhere. Therefore, every time you allocate room for a string, you must include sufficient space for the null terminator character.

Bug 2) does not do this; it only allocates room for the 5 characters of "hello". Correct code should be:

char str[6] = "hello";

Or equivalently, you can write self-documenting code for 5 characters plus 1 null terminator:

char str[5+1] = "hello";

But you can also use this and let the compiler do the counting and pick the size:

char str[] = "hello"; // Will allocate 6 bytes automatically

If you don't append a null terminator at the end of a string, then library functions expecting a string won't work properly and you will get "undefined behavior" bugs such as garbage output or program crashes. That's what happens if you attempt to print the string in Bug 2).
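As a rough sketch of what "missing null termination" means in practice, the helper below checks whether a character array contains a null terminator anywhere within its bounds. The name has_terminator is made up for this illustration and is not a standard function; memchr is the standard function from string.h.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of the C standard library):
   returns true if the array buf of the given size contains a
   null terminator, meaning it can be treated as a C string. */
bool has_terminator(const char *buf, size_t size)
{
    return memchr(buf, '\0', size) != NULL;
}
```

For an array declared as in Bug 2), has_terminator(str, sizeof str) would return false, whereas for char str[] = "hello"; it would return true.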

The most common way to write a null terminator character in C is by using a so-called "octal escape sequence", looking like this: '\0'. This is 100% equivalent to writing 0, but the \ serves as self-documenting code to state that the zero is explicitly meant to be a null terminator. Code such as if(str[i] == '\0') will check if the specific character is the null terminator.

So you can even do the above examples explicitly, character by character:

char str[6] = {'h', 'e', 'l', 'l', 'o', '\0'};

Please note that the term null terminator has nothing to do with null pointers or the NULL macro! This can be confusing - very similar names but very different meanings. This is why the null terminator is sometimes referred to as NUL with one L, not to be confused with NULL or null pointers.

The "hello" part in Bug 2) is called a string literal. It is to be regarded as a read-only string. The "" syntax means that the compiler will automatically append a null terminator at the end of the string literal. So if you print out sizeof("hello") you will get 6, not 5, because you get the size of the array including the null terminator.
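To make the difference between the array size and the string length concrete, here is a minimal sketch of a function that walks a string until it finds the null terminator. The name count_chars is made up for this illustration; in real code you would most likely call the standard strlen() instead.

```c
#include <stddef.h>

/* Hypothetical helper: counts the characters that come before the
   null terminator, similar in spirit to the standard strlen(). */
size_t count_chars(const char *str)
{
    size_t i = 0;
    while (str[i] != '\0') /* stop at the null terminator */
    {
        i++;
    }
    return i;
}
```

With this, count_chars("hello") gives 5 since the terminator is not counted, while sizeof("hello") gives 6 since the terminator is part of the array.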


Q: Who is responsible for allocating memory for the string?
A: You are - the C programmer.

That's why Bug 3) typically causes a crash (formally it is undefined behavior): you cannot just store a string wherever an uninitialized pointer happens to point. The pointer needs to point at valid, allocated memory.

So you need to allocate an array somewhere, sufficiently large to hold the string, including null termination. You could do this as a local character array as in the above examples, or you can do this by determining the size at run-time.
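One possible fix for Bug 3), sketched under a few assumptions: the caller provides a local array of 16 characters, and the input comes from a string through sscanf rather than from stdin through scanf, so that the example can be tried without typing anything. The names read_word and WORD_MAX are made up for this illustration. The field width in "%15s" limits how many characters get stored, leaving room for the null terminator.

```c
#include <stdio.h>

#define WORD_MAX 15 /* hypothetical limit: max characters, excluding the terminator */

/* Hypothetical helper: reads one whitespace-delimited word from input
   into word, which must have room for WORD_MAX + 1 characters.
   Returns 1 if a word was read, 0 otherwise. */
int read_word(const char *input, char word[WORD_MAX + 1])
{
    /* The width 15 must match WORD_MAX; sscanf appends the null terminator. */
    return sscanf(input, "%15s", word) == 1;
}
```

The caller owns the memory: char word[WORD_MAX + 1]; read_word("hello world", word); stores "hello" in word. The same width trick applies to scanf("%15s", word) when reading from stdin.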

When allocating memory for a string dynamically at run-time, remember to also allocate room for the null terminator:

char input[n] = ... ;
...
char* str = malloc(strlen(input) + 1);
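A slightly fuller sketch of the same idea, wrapped in a function so that the allocation, the null terminator and the failure check are all visible in one place. The name duplicate_string is made up for this illustration; the caller is assumed to be responsible for calling free() on the result.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a dynamically allocated copy of input,
   or NULL if the allocation failed. The caller must free() the result. */
char *duplicate_string(const char *input)
{
    size_t size = strlen(input) + 1; /* +1 for the null terminator */
    char *str = malloc(size);
    if (str != NULL)
    {
        memcpy(str, input, size); /* copies the terminator too */
    }
    return str;
}
```

Usage would look like char* str = duplicate_string(input); followed by a NULL check, and eventually free(str);.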

Notably, this array also has to be read/write memory. If we do something like
char* str = "hello"; str[0] = 'a';

then it compiles just fine but typically crashes at run-time. This is because the string literal "hello" is a null-terminated character array that the compiler is allowed to store in read-only memory, and writing to it is undefined behavior.

You can use string literals just as strings, but you can never write to them. Therefore it is strongly recommended to only point at them with a pointer to read-only data:
const char* str = "hello";
This pointer can however (unlike a pointer to dynamic memory, see Bug 4)) be safely set to point at a different string literal, so when dealing with a lot of string look-ups, an array of pointers to const char might be a sensible choice.
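As an illustration of such a look-up table, here is a small sketch. The table contents and the names color_names and color_name are invented for this example only.

```c
#include <stddef.h>

/* Hypothetical look-up table: an array of pointers to const char,
   each pointing at a read-only string literal. */
static const char *const color_names[] = { "red", "green", "blue" };

/* Hypothetical helper: returns the name for the given index,
   or "unknown" if the index is out of range. */
const char *color_name(size_t index)
{
    const size_t count = sizeof color_names / sizeof color_names[0];
    return index < count ? color_names[index] : "unknown";
}
```

A pointer such as const char* str = color_name(0); can later be re-pointed with str = color_name(2); without any leak, since no dynamic memory is involved.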


Q: How can a string get assigned a new value?
A: Either during initialization or through strcpy().

The above examples show various different ways to create a string by allocating an array or by having a pointer point at a string allocated elsewhere. But if you need to change this string at run-time, you can't just write str = "new value".

If str in that example is an array, the code won't compile, because C does not allow assignment to arrays at run-time. If str is a pointer, the assignment compiles and makes str point at a string literal, as previously explained. But the pointer forgets all about where it previously pointed: if it previously pointed at dynamically allocated memory, as in Bug 4), then we have a memory leak.

The normal way to assign a value to a string at run-time is to use the strcpy function (which is a perfectly safe function, see "Is strcpy dangerous and what should be used instead?"). It works as strcpy(destination, source), where destination must be a valid memory area holding a large-enough character array. For details see man strcpy.
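A minimal sketch of how the "large-enough" requirement could be checked before calling strcpy. The name assign_string and its return convention are assumptions made for this illustration, not an established API.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies src into dst if it fits, including the
   null terminator. Returns 1 on success and 0 if dst is too small,
   in which case dst is left untouched. */
int assign_string(char *dst, size_t dst_size, const char *src)
{
    if (strlen(src) + 1 > dst_size)
    {
        return 0; /* not enough room for the characters plus terminator */
    }
    strcpy(dst, src);
    return 1;
}
```

For Bug 4), this means keeping the pointer returned by malloc(5+1) and writing strcpy(str, "hello"); or assign_string(str, 5+1, "hello"); instead of str = "hello";.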


Q: How do you properly compare strings?
A: By comparing them character by character, usually done with strcmp().

Code such as Bug 5) with the == equality operator won't work because it doesn't compare the contents of the strings, just their addresses. So Bug 5) is just comparing the address of a local character array with the address of a string literal, which is nonsense.

Instead, the character arrays have to be compared character by character. Note that they can have different lengths too, so one needs to check for the null terminator of either character array while iterating through them.
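The character-by-character comparison described here could be sketched roughly as follows. The name compare_strings is made up for this illustration, and the function is only meant to show the idea; it is not a claim about how any particular strcmp is implemented.

```c
#include <stddef.h>

/* Hypothetical helper: compares two strings character by character.
   Returns 0 if they are equal, a negative value if a sorts before b,
   and a positive value if a sorts after b. */
int compare_strings(const char *a, const char *b)
{
    size_t i = 0;
    /* Advance while the characters match and a has not ended. If b ends
       first, its terminator differs from a[i] and the loop stops. */
    while (a[i] != '\0' && a[i] == b[i])
    {
        i++;
    }
    return (unsigned char)a[i] - (unsigned char)b[i];
}
```

Note how strings of different lengths are handled: the shorter string's null terminator has the value 0, so it compares as less than any other character.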

The strcmp() function does all this in an efficient manner, so the easiest and most correct solution is just to call that one. It works as strcmp(first_string, second_string) and returns a value less than zero, zero, or greater than zero if the first string is considered less than, equal to, or greater than the second string, respectively. The strcmp implementation will likely just compare symbol values of the characters, so "less than" might mean alphabetically, though without any special care taken of things like lower/upper case, digits or punctuation. See man strcmp for details.
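A corrected version of the comparison in Bug 5) might look something like this sketch. The wrapper name is_hello is made up for this illustration; the important part is the strcmp(...) == 0 idiom.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: returns true if str holds exactly "hello".
   strcmp() returns 0 when the two strings have equal contents. */
bool is_hello(const char *str)
{
    return strcmp(str, "hello") == 0;
}
```

So instead of if(str == "hello"), write if(strcmp(str, "hello") == 0). Note that the comparison is case-sensitive, so "Hello" is not considered equal to "hello".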
