What happens to memory after '\0' in a C string?


C Problem Overview


Surprisingly simple/stupid/basic question, but I have no idea: Suppose I want to return a C string to the caller of my function, but I do not know its length at the beginning of the function. I can place only an upper bound on the length at the outset, and, depending on processing, the size may shrink.

The question is, is there anything wrong with allocating enough heap space (the upper bound) and then terminating the string well short of that during processing? i.e. If I stick a '\0' into the middle of the allocated memory, does (a.) free() still work properly, and (b.) does the space after the '\0' become inconsequential? Once '\0' is added, does the memory just get returned, or is it sitting there hogging space until free() is called? Is it generally bad programming style to leave this hanging space there, in order to save some upfront programming time computing the necessary space before calling malloc?

To give this some context, let's say I want to remove consecutive duplicates, like this:

input "Hello oOOOo !!" --> output "Helo oOo !"

... and some code below showing how I'm pre-computing the size resulting from my operation, effectively performing processing twice to get the heap size right.

#include <stdlib.h>
#include <string.h>

char* RemoveChains(const char* str)
{
    if (str == NULL) {
        return NULL;
    }
    if (strlen(str) == 0) {
        char* outstr = (char*)malloc(1);
        if (outstr != NULL) {
            *outstr = '\0';
        }
        return outstr;
    }
    const char* original = str; // for reuse
    char prev = *str++;         // [prev][str][str+1]...
    size_t outlen = 1;          // first char auto-counted

    // Determine length necessary by mimicking processing
    while (*str) {
        if (*str != prev) { // new char encountered
            ++outlen;
            prev = *str;    // restart chain
        }
        ++str; // step pointer along input
    }

    // Declare new string to be perfect size
    char* outstr = (char*)malloc(outlen + 1);
    if (outstr == NULL) {
        return NULL; // allocation failed
    }
    outstr[outlen] = '\0';
    outstr[0] = original[0];
    outlen = 1;

    // Construct output
    prev = *original++;
    while (*original) {
        if (*original != prev) {
            outstr[outlen++] = *original;
            prev = *original;
        }
        ++original;
    }
    return outstr;
}

C Solutions


Solution 1 - C

> If I stick a '\0' into the middle of the allocated memory, does (a.) free() still work properly, and

Yes.

> (b.) does the space after the '\0' become inconsequential? Once '\0' is added, does the memory just get returned, or is it sitting there hogging space until free() is called?

Depends. Often, when you allocate large amounts of heap space, the system first allocates virtual address space - as you write to the pages some actual physical memory is assigned to back it (and that may later get swapped out to disk when your OS has virtual memory support). Famously, this distinction between wasteful allocation of virtual address space and actual physical/swap memory allows sparse arrays to be reasonably memory efficient on such OSs.

Now, the granularity of this virtual addressing and paging is in memory page sizes - that might be 4k, 8k, 16k...? Most OSs have a function you can call to find out the page size. So, if you're doing a lot of small allocations then rounding up to page sizes is wasteful, and if you have a limited address space relative to the amount of memory you really need to use then depending on virtual addressing in the way described above won't scale (for example, 4GB RAM with 32-bit addressing). On the other hand, if you have a 64-bit process running with say 32GB of RAM, and are doing relatively few such string allocations, you have an enormous amount of virtual address space to play with and the rounding up to page size won't amount to much.
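As a rough illustration of the page-size query mentioned above, here is a minimal sketch that assumes a POSIX system, where sysconf(_SC_PAGESIZE) reports the page size (on Windows the counterpart would be GetSystemInfo). The helper names query_page_size and round_up_to_pages are made up for this example.

```c
#include <stddef.h>
#include <unistd.h>   /* sysconf, _SC_PAGESIZE (POSIX only) */

/* Hypothetical helper: page size in bytes, or 0 if it cannot be determined. */
size_t query_page_size(void)
{
    long n = sysconf(_SC_PAGESIZE);
    return n > 0 ? (size_t)n : 0;
}

/* Hypothetical helper: round a byte count up to a whole number of pages. */
size_t round_up_to_pages(size_t bytes, size_t page)
{
    if (page == 0) return bytes;                    /* unknown page size: no rounding */
    size_t pages = bytes / page + (bytes % page != 0);
    return pages * page;
}
```

Calling round_up_to_pages(n, query_page_size()) gives an estimate of how much memory ends up backing n touched bytes of a large buffer. It is only an estimate, since the allocator may place several small blocks on the same page.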

But - note the difference between writing throughout the buffer then terminating it at some earlier point (in which case the once-written-to memory will have backing memory and could end up in swap) versus having a big buffer in which you only ever write to the first bit then terminate (in which case backing memory is only allocated for the used space rounded up to page size).

It's also worth pointing out that on many operating systems heap memory may not be returned to the operating system until the process terminates: instead, the malloc/free library notifies the OS when it needs to grow the heap (e.g. using sbrk() on UNIX or VirtualAlloc() on Windows). In that sense, free()d memory is free for your process to re-use, but not free for other processes to use. Some operating systems do optimise this - for example, using a distinct and independently releasable memory region for very large allocations.

> Is it generally bad programming style to leave this hanging space there, in order to save some upfront programming time computing the necessary space before calling malloc?

Again, it depends on how many such allocations you're dealing with. If there are a great many relative to your virtual address space / RAM - you want to explicitly let the memory library know not all the originally requested memory is actually needed using realloc(), or you could even use strdup() to allocate a new block more tightly based on actual needs (then free() the original) - depending on your malloc/free library implementation that might work out better or worse, but very few applications would be significantly affected by any difference.
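Applied to the question's example, the realloc() route might look like the sketch below: allocate the upper bound (the input length), build the result in a single pass, then ask the allocator to shrink the block. The function name remove_chains_one_pass is hypothetical, and whether shrinking actually gives memory back depends on the malloc implementation.

```c
#include <stdlib.h>
#include <string.h>

/* Sketch: allocate the upper bound, fill in one pass, then shrink to fit. */
char *remove_chains_one_pass(const char *str)
{
    if (str == NULL) return NULL;

    size_t upper = strlen(str) + 1;      /* output is never longer than the input */
    char *out = malloc(upper);
    if (out == NULL) return NULL;

    size_t n = 0;                        /* characters written so far */
    for (const char *s = str; *s != '\0'; ++s) {
        if (n == 0 || out[n - 1] != *s)  /* keep only the first char of each run */
            out[n++] = *s;
    }
    out[n] = '\0';

    /* Shrinking is optional: if realloc fails, the original block is still valid. */
    char *fit = realloc(out, n + 1);
    return fit != NULL ? fit : out;
}
```

The caller frees the result with free() exactly as before. The second pass over the input is gone, at the cost of one realloc() call.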

Sometimes your code may be in a library where you can't guess how many string instances the calling application will be managing - in such cases it's better to provide slower behaviour that never gets too bad... so lean towards shrinking the memory blocks to fit the string data (a set number of additional operations so doesn't affect big-O efficiency) rather than having an unknown proportion of the original string buffer wasted (in a pathological case - zero or one character used after arbitrarily large allocations). As a performance optimisation you might only bother returning memory if unused space is >= the used space - tune to taste, or make it caller-configurable.
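One way to express the "only shrink when unused space is >= used space" rule is sketched below. The names should_shrink and shrink_if_wasteful are invented for illustration, and the threshold is the one suggested above rather than a recommended constant.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical policy: shrink only when at least half of the block is unused. */
int should_shrink(size_t used, size_t capacity)
{
    return used <= capacity && capacity - used >= used;
}

/* buf must hold a NUL-terminated string inside a malloc'd block of `capacity`
   bytes. Returns the pointer the caller should use (and later free). */
char *shrink_if_wasteful(char *buf, size_t capacity)
{
    if (buf == NULL) return NULL;

    size_t used = strlen(buf) + 1;       /* characters plus terminator */
    if (!should_shrink(used, capacity))
        return buf;                      /* not worth a realloc call */

    char *fit = realloc(buf, used);
    return fit != NULL ? fit : buf;      /* on failure the old block is still valid */
}
```

Making the threshold a parameter instead of hard-coding it would give the caller-configurable variant mentioned above.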

You comment on another answer:

> So it comes down to judging whether the realloc will take longer, or the preprocessing size determination?

If performance is your top priority, then yes - you'd want to profile. If you're not CPU bound, then as a general rule take the "preprocessing" hit and do a right-sized allocation - there's just less fragmentation and mess. Countering that, if you have to write a special preprocessing mode for some function - that's an extra "surface" for errors and code to maintain. (This trade-off decision is commonly needed when implementing your own asprintf() from snprintf(), but there at least you can trust snprintf() to act as documented and don't personally have to maintain it).
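The asprintf()-from-snprintf() pattern mentioned above is a compact instance of "measure first, then allocate exactly". The sketch below assumes a C99 vsnprintf(), which returns the length the output would have had when given a NULL buffer of size 0. The name format_alloc is hypothetical and deliberately differs from the real asprintf(), which is a GNU/BSD extension with a different signature.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Sketch: returns a freshly malloc'd, exactly sized formatted string,
   or NULL on failure. The caller frees the result. */
char *format_alloc(const char *fmt, ...)
{
    va_list args, args_copy;
    va_start(args, fmt);
    va_copy(args_copy, args);            /* a va_list cannot be traversed twice */

    int needed = vsnprintf(NULL, 0, fmt, args);   /* pass 1: measure only */
    va_end(args);

    char *out = NULL;
    if (needed >= 0) {
        out = malloc((size_t)needed + 1);
        if (out != NULL)
            vsnprintf(out, (size_t)needed + 1, fmt, args_copy);  /* pass 2: write */
    }
    va_end(args_copy);
    return out;
}
```

Here the "preprocessing mode" comes for free from the library, which is why the two-pass approach costs no extra code to maintain in this case.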

Solution 2 - C

> Once '\0' is added, does the memory just get returned, or is it sitting there hogging space until free() is called?

There's nothing magical about \0. You have to call realloc if you want to "shrink" the allocated memory. Otherwise the memory will just sit there until you call free.

> If I stick a '\0' into the middle of the allocated memory, does (a.) free() still work properly

Whatever you do in that memory, free will always work properly as long as you pass it the exact same pointer that malloc returned. Of course, if you write outside the allocated block, all bets are off.

Solution 3 - C

From the perspective of malloc and free, \0 is just one more character; they don't care what data you put in the memory. So free will still work whether you add \0 in the middle or don't add \0 at all. The extra space allocated will still be there; it is not released just because you wrote a \0 into the memory. I personally would prefer to allocate only the required amount of memory instead of allocating up to some upper bound, as the latter just wastes the resource.

Solution 4 - C

The \0 is purely a convention for interpreting character arrays as strings - it is independent of the memory management. I.e., if you want to get your money back, you should call realloc. The string does not care about memory (which is a source of many security problems).

Solution 5 - C

As soon as you get memory from the heap by calling malloc(), the memory is yours to use. Inserting \0 is like inserting any other character. This memory will remain in your possession until you free it or until the OS reclaims it.

Solution 6 - C

malloc just allocates a chunk of memory. It's up to you to use it however you want, and to call free with the pointer that malloc originally returned. Inserting '\0' in the middle has no consequence.

To be specific, malloc doesn't know what type of memory you want (it returns only a void pointer).

Let us assume you allocate 10 bytes of memory and get the addresses 0x10 to 0x19:

char * ptr = (char *)malloc(sizeof(char) * 10);

Inserting a null at the 5th position (0x14) does not free the memory from 0x15 onwards.

However, a free from 0x10 frees the entire chunk of 10 bytes.
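The 10-byte scenario can be checked with a small sketch like the following. The function name embedded_nul_demo and its out-parameter are invented for the example. It shows that string functions stop at the '\0', while the bytes after it still belong to the block until free() is called.

```c
#include <stdlib.h>
#include <string.h>

/* Sketch: writes a '\0' at index 4 of a 10-byte block, returns what strlen()
   sees, and stores the byte at index 5 in *tail to show it is still usable. */
size_t embedded_nul_demo(char *tail)
{
    char *ptr = malloc(10);
    if (ptr == NULL) return 0;

    memcpy(ptr, "ABCDEFGHI", 10);    /* 9 letters plus the terminator */
    ptr[4] = '\0';                   /* the "5th position" in the text above */

    size_t visible = strlen(ptr);    /* string functions stop at index 4 */
    *tail = ptr[5];                  /* bytes after the '\0' are still ours */

    free(ptr);                       /* releases the whole 10-byte block */
    return visible;
}
```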

Solution 7 - C

  1. free() will still work with a NUL byte in memory

  2. the space will remain wasted until free() is called, or unless you subsequently shrink the allocation

Solution 8 - C

Generally, memory is memory is memory. It doesn't care what you write into it. BUT it has a race, or if you prefer a flavor (malloc, new, VirtualAlloc, HeapAlloc, etc). This means that the party that allocates a piece of memory must also provide the means to deallocate it. If your API comes in a DLL, then it should provide a free function of some sort. This of course puts a burden on the caller right? So why not put the WHOLE burden on the caller? The BEST way to deal with dynamically allocated memory is to NOT allocate it yourself. Have the caller allocate it and pass it on to you. He knows what flavor he allocated, and he is responsible to free it whenever he is done using it.

How does the caller know how much to allocate? As many Windows APIs do, have your function return the required number of bytes when called with, e.g., a NULL pointer, then do the job when provided with a non-NULL pointer (using IsBadWritePtr to double-check accessibility if it is suitable for your case, though Microsoft's documentation now marks that function as obsolete and advises against using it).

This can also be much much more efficient. Memory allocations COST a lot. Too many memory allocations cause heap fragmentation and then the allocations cost even more. That's why in kernel mode we use the so called "look-aside lists". To minimize the number of memory allocations done, we reuse the blocks we have already allocated and "freed", using services that the NT Kernel provides to driver writers. If you pass on the responsibility for memory allocation to your caller, then he might be passing you cheap memory from the stack (_alloca), or passing you the same memory over and over again without any additional allocations. You don't care of course, but you DO allow your caller to be in charge of optimal memory handling.

Solution 9 - C

To elaborate on the use of the null terminator in C: you cannot allocate a "C string". You can allocate a char array and store a string in it, but malloc and free just see it as an array of the requested length.

A C string is not a data type but a convention for using a char array where the null character '\0' is treated as the string terminator. This is a way to pass strings around without having to pass a length value as a separate argument. Some other programming languages have explicit string types that store a length along with the character data to allow passing strings in a single parameter.

Functions that document their arguments as "C strings" are passed char arrays but have no way of knowing how big the array is without the null terminator, so if it is not there, things will go horribly wrong.

You will notice functions that expect char arrays that are not necessarily treated as strings will always require a buffer length parameter to be passed. For example if you want to process char data where a zero byte is a valid value you can't use '\0' as a terminator character.
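A minimal sketch of the second kind of function: because a zero byte is valid data here, the length has to travel as a separate argument. The helper name count_zero_bytes is made up for illustration.

```c
#include <stddef.h>

/* Hypothetical helper: counts zero bytes in a buffer of `len` bytes.
   A '\0' cannot mark the end here, so the caller must supply the length. */
size_t count_zero_bytes(const unsigned char *buf, size_t len)
{
    size_t zeros = 0;
    for (size_t i = 0; i < len; ++i) {
        if (buf[i] == 0)
            ++zeros;
    }
    return zeros;
}
```

Calling strlen() on the same data would stop at the first zero byte and miss everything after it.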

Solution 10 - C

You could do what some of the MS Windows APIs do where you (the caller) pass a pointer and the size of the memory you allocated. If the size isn't enough, you're told how many bytes to allocate. If it was enough, the memory is used and the result is the number of bytes used.

Thus the decision about how to efficiently use memory is left to the caller. They can allocate a fixed 260 bytes (common when working with paths in Windows) and use the result from the function call to know whether more bytes are needed (not the case with paths, since MAX_PATH is 260 unless you bypass the Win32 API) or whether most of the bytes can be ignored... The caller could also pass zero as the memory size and be told exactly how much needs to be allocated - not as efficient processing-wise, but could be more efficient space-wise.
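For the question's duplicate-removal task, a caller-allocates interface in this style could be sketched as follows. The name remove_chains_into and the truncate-on-overflow behaviour (modelled on snprintf) are assumptions made for the example, not something the Windows APIs prescribe.

```c
#include <stddef.h>

/* Hypothetical API: writes the de-duplicated string into buf if it fits and
   always returns the number of bytes required, including the terminator.
   Pass buf == NULL or size == 0 to query the required size only.
   If the return value exceeds size, the output was truncated. */
size_t remove_chains_into(const char *str, char *buf, size_t size)
{
    if (str == NULL) return 0;

    size_t needed = 0;                   /* characters the full result contains */
    char prev = '\0';
    for (const char *s = str; *s != '\0'; ++s) {
        if (needed == 0 || *s != prev) {
            if (buf != NULL && needed + 1 < size)
                buf[needed] = *s;        /* store only while room remains */
            ++needed;
            prev = *s;
        }
    }
    if (buf != NULL && size > 0)
        buf[needed < size ? needed : size - 1] = '\0';
    return needed + 1;
}
```

A caller can try a stack buffer first and fall back to malloc(required) only when the return value is larger than the buffer, so the common case needs no heap allocation at all.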

Solution 11 - C

You can certainly preallocate to an upper bound, and use all or something less. Just make sure you actually use all or something less.

Making two passes is also fine.

You asked the right questions about the tradeoffs.

How do you decide?

Use two passes, initially, because:

1. you'll know you aren't wasting memory.
2. you're going to profile to find out where
   you need to optimize for speed anyway.
3. upper bounds are hard to get right before
   you've written and tested and modified and
   used and updated the code in response to new
   requirements for a while.
4. simplest thing that could possibly work.

You might tighten up the code a little, too. Shorter is usually better. And the more the code takes advantage of known truths, the more comfortable I am that it does what it says.

#include <stdlib.h>

char* copyWithoutDuplicateChains(const char* str)
    {
    if (str == NULL) return NULL;

    const char* s = str;
    char prev = *s;               // [prev][s+1]...
    size_t outlen = 1;            // first character counted

    // Determine length necessary by mimicking processing
    // (the final increment accounts for the terminator)

    while (*s)
        { while (*++s == prev);  // skip duplicates
          ++outlen;              // new character encountered
          prev = *s;             // restart chain
        }

    // Construct output

    char* outstr = (char*)malloc(outlen);
    if (outstr == NULL) return NULL;
    char* out = outstr;           // write cursor; outstr keeps the start
    s = str;
    prev = *s;                    // reset after the counting pass
    *out++ = *s;                  // first character copied
    while (*s)
        { while (*++s == prev);   // skip duplicates
          *out++ = *s;            // copy new character (or the terminator)
          prev = *s;              // restart chain
        }

    // done

    return outstr;                // start of the block, so free() works
    }

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Erika Electra | View Question on Stackoverflow
Solution 1 - C | Tony Delroy | View Answer on Stackoverflow
Solution 2 - C | cnicutar | View Answer on Stackoverflow
Solution 3 - C | Naveen | View Answer on Stackoverflow
Solution 4 - C | Matthias | View Answer on Stackoverflow
Solution 5 - C | ScarletAmaranth | View Answer on Stackoverflow
Solution 6 - C | Anerudhan Gopal | View Answer on Stackoverflow
Solution 7 - C | Alnitak | View Answer on Stackoverflow
Solution 8 - C | Dimitrios Staikos | View Answer on Stackoverflow
Solution 9 - C | Ozone | View Answer on Stackoverflow
Solution 10 - C | Ian Yates | View Answer on Stackoverflow
Solution 11 - C | Jim Sawyer | View Answer on Stackoverflow