initializing std::string from char* without copy

C++, String, Memory Management, STL

C++ Problem Overview


I have a situation where I need to process large amounts of data (many GBs) as follows:

  1. build a large string by appending many smaller (C char*) strings
  2. trim the string
  3. convert the string into a C++ const std::string for processing (read only)
  4. repeat

The data in each iteration are independent.

I'd like to minimise (and if possible eliminate) heap allocations, since they are currently my largest performance problem.

Is there a way to convert a C string (char*) into an STL C++ string (std::string) without requiring std::string to internally allocate/copy the data?

Alternatively, could I use stringstreams or something similar to re-use a large buffer?

Edit: Thanks for the answers, for clarity, I think a revised question would be:

How can I efficiently build (via multiple appends) an STL C++ string? And if this is done in a loop, where each iteration is totally independent, how can I re-use the allocated space?

C++ Solutions


Solution 1 - C++

You can't actually form a std::string without copying the data. A stringstream would probably reuse the memory from pass to pass (though I think the standard is silent on whether it actually has to), but it still wouldn't avoid the copying.

A common approach to this sort of problem is to write the code which processes the data in step 3 to use a begin/end iterator pair; then it can easily process either a std::string, a vector of chars, a pair of raw pointers, etc. Unlike passing it a container type like std::string, it would no longer know or care how the memory was allocated, since it would still belong to the caller. Carrying this idea to its logical conclusion is boost::range, which adds all the overloaded constructors to still let the caller just pass a string/vector/list/any sort of container with .begin() and .end(), or separate iterators.
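As a minimal sketch of the iterator-pair idea, the function below counts a character in any range. The name count_char and the three wrapper functions are hypothetical, chosen only for illustration; the real processing in step 3 would take their place.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical processing step: counts occurrences of `target` in [first, last).
// It only reads through the iterators, so it never learns how the characters
// are stored or who owns them.
template <typename InputIt>
std::size_t count_char(InputIt first, InputIt last, char target) {
    std::size_t n = 0;
    for (; first != last; ++first) {
        if (*first == target) ++n;
    }
    return n;
}

// The same template serves three different owners of the data.
inline std::size_t count_in_string(const std::string& s) {
    return count_char(s.begin(), s.end(), 'a');
}

inline std::size_t count_in_vector(const std::vector<char>& v) {
    return count_char(v.begin(), v.end(), 'a');
}

inline std::size_t count_in_c_string(const char* p) {
    // Raw pointers are valid iterators; no std::string is constructed here.
    return count_char(p, p + std::strlen(p), 'a');
}
```

Only the last wrapper matters for the original problem: the char* buffer is processed in place, with no copy into a std::string.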

Having written your processing code to work on an arbitrary iterator range, you could then even write a custom iterator (not as hard as it sounds: basically just an object with some standard typedefs, and operators ++, *, =, == and != overloaded to get a forward-only iterator) that takes care of advancing to the next fragment each time it hits the end of the one it's working on, skipping over whitespace (I assume that's what you meant by trim). That way you would never have to assemble the whole string contiguously at all. Whether or not this would be a win depends on how many fragments you have and how large they are. This is essentially what the SGI rope mentioned by Martin York is: a string where append links fragments together instead of copying them into a contiguous buffer, which is thus suitable for much longer values.
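A rough sketch of such a fragment-walking iterator follows. The class name fragment_iterator, its begin/end factory functions, and the choice of a std::vector<const char*> of NUL-terminated fragments are all assumptions made for illustration. Whitespace skipping is left out to keep it short.

```cpp
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical forward iterator that walks the characters of several
// NUL-terminated fragments as if they were one contiguous string.
class fragment_iterator {
public:
    using iterator_category = std::forward_iterator_tag;
    using value_type = char;
    using difference_type = std::ptrdiff_t;
    using pointer = const char*;
    using reference = const char&;

    fragment_iterator() = default;

    static fragment_iterator begin(const std::vector<const char*>& frags) {
        fragment_iterator it;
        it.frags_ = &frags;
        it.pos_ = frags.empty() ? nullptr : frags[0];
        it.skip_exhausted();  // step over leading empty fragments
        return it;
    }

    static fragment_iterator end(const std::vector<const char*>& frags) {
        fragment_iterator it;
        it.frags_ = &frags;
        it.index_ = frags.size();
        return it;
    }

    reference operator*() const { return *pos_; }

    fragment_iterator& operator++() {
        ++pos_;
        skip_exhausted();
        return *this;
    }

    fragment_iterator operator++(int) {
        fragment_iterator old = *this;
        ++*this;
        return old;
    }

    friend bool operator==(const fragment_iterator& a, const fragment_iterator& b) {
        return a.frags_ == b.frags_ && a.index_ == b.index_ && a.pos_ == b.pos_;
    }
    friend bool operator!=(const fragment_iterator& a, const fragment_iterator& b) {
        return !(a == b);
    }

private:
    // When the current fragment's terminating NUL is reached, move on to the
    // next fragment; after the last one, settle on the end position.
    void skip_exhausted() {
        while (index_ < frags_->size() && *pos_ == '\0') {
            ++index_;
            pos_ = index_ < frags_->size() ? (*frags_)[index_] : nullptr;
        }
    }

    const std::vector<const char*>* frags_ = nullptr;
    std::size_t index_ = 0;
    const char* pos_ = nullptr;
};
```

Any algorithm written against an iterator pair can then consume fragment_iterator::begin(frags) and fragment_iterator::end(frags) directly, without the fragments ever being joined.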


UPDATE (since I still see occasional upvotes on this answer):

C++17 introduces another choice: std::string_view, which has replaced const std::string& in many function signatures, is a non-owning reference to character data. It is implicitly convertible from std::string, but can also be explicitly constructed from contiguous data owned somewhere else, avoiding the unnecessary copying std::string imposes.
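A small sketch of how this might look for the trim and read-only processing steps, assuming a C++17 compiler. The function names trim and count_spaces are hypothetical; only the std::string_view members used inside them are standard.

```cpp
#include <cstddef>
#include <string_view>

// Trimming a view just narrows it: no characters are copied or moved.
inline std::string_view trim(std::string_view text) {
    const char* const whitespace = " \t\r\n";
    const auto first = text.find_first_not_of(whitespace);
    if (first == std::string_view::npos) return {};  // all whitespace
    const auto last = text.find_last_not_of(whitespace);
    return text.substr(first, last - first + 1);
}

// Hypothetical read-only processing step that accepts a non-owning view.
inline std::size_t count_spaces(std::string_view text) {
    std::size_t n = 0;
    for (char c : text) {
        if (c == ' ') ++n;
    }
    return n;
}
```

The caller keeps ownership of the buffer, so the view must not outlive it. A view over a char buffer of known length can be built with std::string_view(ptr, length).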

Solution 2 - C++

Is it at all possible to use a C++ string in step 1? If you use string::reserve(size_t), you can allocate a large enough buffer to prevent multiple heap allocations while appending the smaller strings, and then you can just use that same C++ string throughout all of the remaining steps.

See the documentation for std::string::reserve for more information.
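A minimal sketch of that approach, with a hypothetical helper named rebuild. It relies on clear() leaving the capacity untouched, which is how the common standard library implementations behave, though it is worth verifying on your own platform.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: refills `buffer` with the fragments of one iteration.
// clear() resets the length to zero; on common implementations the capacity
// obtained from an earlier reserve() is kept, so append() does not need to
// allocate as long as the total fits.
inline void rebuild(std::string& buffer, const std::vector<const char*>& fragments) {
    buffer.clear();
    for (const char* fragment : fragments) {
        buffer.append(fragment);
    }
}
```

Typical use would be to declare the string once outside the loop, call reserve() with a size large enough for the biggest iteration, and then call rebuild() at the top of each pass.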

Solution 3 - C++

To help with really big strings SGI has the class Rope in its STL.
Non-standard, but may be useful.

http://www.sgi.com/tech/stl/Rope.html

Note that rope has not been adopted into the C++ standard, so it remains a non-standard extension.
Note the developer joke. A rope is a big string. (Ha Ha) :-)

Solution 4 - C++

This is a lateral thinking answer, not directly addressing the question but "thinking" around it. Might be useful, might not...

Readonly processing of std::string doesn't really require a very complex subset of std::string's features. Is there a possibility that you could do search/replace on the code that performs all the processing on std::strings so it takes some other type instead? Start with a blank class:

class lightweight_string { };

Then replace all std::string references with lightweight_string. Perform a compilation to find out exactly what operations are needed on lightweight_string for it to act as a drop-in replacement. Then you can make your implementation work however you want.
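Purely as an illustration of where that process might end up, here is a sketch of a non-owning lightweight_string. The set of members is an assumption; the compile errors from your own code would decide which operations are really needed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical result of the exercise: a read-only string that points at
// characters owned elsewhere instead of copying them.
class lightweight_string {
public:
    lightweight_string() = default;
    lightweight_string(const char* data, std::size_t size) : data_(data), size_(size) {}
    explicit lightweight_string(const char* c_str)
        : data_(c_str), size_(std::strlen(c_str)) {}

    std::size_t size() const { return size_; }
    bool empty() const { return size_ == 0; }
    const char* data() const { return data_; }
    const char* begin() const { return data_; }
    const char* end() const { return data_ + size_; }

    char operator[](std::size_t i) const {
        assert(i < size_);  // debug-only bounds check
        return data_[i];
    }

    friend bool operator==(const lightweight_string& a, const lightweight_string& b) {
        return a.size_ == b.size_ &&
               (a.size_ == 0 || std::memcmp(a.data_, b.data_, a.size_) == 0);
    }

private:
    const char* data_ = nullptr;  // not owned; must outlive this object
    std::size_t size_ = 0;
};
```

Because nothing is copied, the underlying buffer has to stay alive and unmodified for as long as the lightweight_string is in use.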

Solution 5 - C++

Is each iteration independent enough that you can use the same std::string for each iteration? One would hope that your std::string implementation is smart enough to re-use memory if you assign a const char * to it when it was previously used for something else.

Assigning a char * into a std::string must always at least copy the data. Memory management is one of the main reasons to use std::string, so you won't be able to override it.

Solution 6 - C++

In this case, might it be better to process the char* directly, instead of assigning it to a std::string?

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type       Original Author     Original Content on Stackoverflow
Question           Akusete             View Question on Stackoverflow
Solution 1 - C++   puetzk              View Answer on Stackoverflow
Solution 2 - C++   e.James             View Answer on Stackoverflow
Solution 3 - C++   Martin York         View Answer on Stackoverflow
Solution 4 - C++   Daniel Earwicker    View Answer on Stackoverflow
Solution 5 - C++   David Norman        View Answer on Stackoverflow
Solution 6 - C++   Alan                View Answer on Stackoverflow