Using strtok with a std::string

C++ Problem Overview

I have a string that I would like to tokenize. But the C strtok() function requires my string to be a char*. How can I do this simply?

I tried:

token = strtok(str.c_str(), " ");

which fails because it turns it into a const char*, not a char*

C++ Solutions

Solution 1 - C++

#include <iostream>
#include <string>
#include <sstream>
int main(){
    std::string myText("some-text-to-tokenize");
    std::istringstream iss(myText);
    std::string token;
    while (std::getline(iss, token, '-'))
    {
	    std::cout << token << std::endl;
    }
    return 0;
}

Or, as mentioned, use boost for more flexibility.
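One difference from strtok() is worth seeing in code: std::getline keeps the empty fields between adjacent delimiters, whereas strtok() skips them. The sketch below wraps the loop above in a helper; the name split_getline is made up for illustration and is not part of the original answer.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (illustration only): split `text` on a single
// delimiter character with std::getline and collect the pieces.
std::vector<std::string> split_getline(const std::string& text, char delim) {
    std::vector<std::string> tokens;
    std::istringstream iss(text);
    std::string token;
    while (std::getline(iss, token, delim)) {
        tokens.push_back(token);  // empty fields between adjacent delimiters are kept
    }
    return tokens;
}

// Usage: split_getline("a--b", '-') yields "a", "" and "b".
```

If empty fields are not wanted, they can be filtered out with a token.empty() check before the push_back.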

Solution 2 - C++

Duplicate the string, tokenize it, then free it.

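char *dup = strdup(str.c_str());  // writable copy (strdup is POSIX)
for (char *token = strtok(dup, " "); token != NULL; token = strtok(NULL, " ")) {
    /* use token here: it points into dup, so it is invalid after free() */
}
free(dup);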
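As a fuller sketch of the same idea, the helper below copies each token into a std::string before the duplicate is freed, so nothing dangles afterwards. The name tokenize_copy is an assumption for illustration, and strdup() is assumed to be available (it is POSIX rather than ISO C++).

```cpp
#include <stdlib.h>  // free
#include <string.h>  // strtok; strdup (POSIX, assumed available)
#include <string>
#include <vector>

// Hypothetical helper: tokenize a writable duplicate of `str`, leaving `str` intact.
std::vector<std::string> tokenize_copy(const std::string& str, const char* delims) {
    std::vector<std::string> tokens;
    char* dup = strdup(str.c_str());
    if (dup == NULL) {
        return tokens;  // allocation failed; nothing to tokenize
    }
    for (char* tok = strtok(dup, delims); tok != NULL; tok = strtok(NULL, delims)) {
        tokens.push_back(tok);  // copy the token while dup is still alive
    }
    free(dup);
    return tokens;
}

// Usage: tokenize_copy("a b  c", " ") yields "a", "b" and "c".
```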

Solution 3 - C++

If boost is available on your system (I think it's standard on most Linux distros these days), it has a Tokenizer class you can use.
If not, then a quick Google turns up a hand-rolled tokenizer for std::string that you can probably just copy and paste. It's very short.

And, if you don't like either of those, then here's a split() function I wrote to make my life easier. It'll break a string into pieces using any of the chars in "delim" as separators. Pieces are appended to the "parts" vector:

 void split(const string& str, const string& delim, vector<string>& parts) {
   size_t start, end = 0;
   while (end < str.size()) {
     start = end;
     while (start < str.size() && (delim.find(str[start]) != string::npos)) {
       start++;  // skip leading delimiters
     }
     end = start;
     while (end < str.size() && (delim.find(str[end]) == string::npos)) {
       end++; // skip to end of word
     }
     if (end-start != 0) {  // just ignore zero-length strings.
       parts.push_back(string(str, start, end-start));
     }
   }
 }

Solution 4 - C++

There is a more elegant solution.

With std::string you can use resize() to allocate a suitably large buffer, and &s[0] to get a pointer to the internal buffer.

At this point many fine folks will jump and yell at the screen. But this is a fact: about two years ago the library working group decided (at the Lillehammer meeting) that, just like std::vector, std::string should also formally, not just in practice, have a guaranteed contiguous buffer.

The other concern is whether strtok() increases the size of the string. The MSDN documentation says:

Each call to strtok modifies strToken by inserting a null character after the token returned by that call.

But this is not quite accurate. The function actually overwrites the first separator character after each token with \0, so the size of the string does not change. If we have this string:

one-two---three--four

we will end up with

one\0two\0--three\0-four
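This can be checked with a small sketch. The helper name after_strtok is made up for illustration; it runs strtok() over the string's own buffer (via &s[0], which assumes a contiguous, null-terminated buffer as guaranteed since C++11) and returns the string afterwards.

```cpp
#include <cstring>
#include <string>

// Hypothetical helper: tokenize a copy of the string in place and return it,
// so the caller can inspect what strtok did to the buffer.
std::string after_strtok(std::string s, const char* seps) {
    for (char* tok = std::strtok(&s[0], seps); tok != nullptr;
         tok = std::strtok(nullptr, seps)) {
        // Tokens are ignored here; only the side effect on the buffer matters.
    }
    return s;  // same size as before, some separators overwritten with '\0'
}

// Usage: after_strtok("one-two---three--four", "-") still has size() == 21,
// and the characters at indices 3, 7 and 15 are now '\0'.
```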

So my solution is very simple:


std::string str("some-text-to-split");
char seps[] = "-";
char *token;

token = strtok( &str[0], seps );
while( token != NULL )
{
/* Do your thing */
token = strtok( NULL, seps );
}

Read the discussion at http://www.archivum.info/comp.lang.c++/2008-05/02889/does_std::string_have_something_like_CString::GetBuffer

Solution 5 - C++

With C++17, std::string gains a data() overload that returns a pointer to a modifiable buffer, so the string can be used with strtok directly without any hacks:

#include <string>
#include <iostream>
#include <cstring>
#include <cstdlib>

int main()
{
	::std::string text{"pop dop rop"};
	char const * const psz_delimiter{" "};
	char * psz_token{::std::strtok(text.data(), psz_delimiter)};
	while(nullptr != psz_token)
	{
		::std::cout << psz_token << ::std::endl;
		psz_token = std::strtok(nullptr, psz_delimiter);
	}
	return EXIT_SUCCESS;
}

Output:

pop
dop
rop

Solution 6 - C++

EDIT: the const_cast is only used to demonstrate the effect of strtok() when applied to a pointer returned by string::c_str().

You should not use strtok() here, since it modifies the tokenized string, which may lead to undesired, if not undefined, behaviour, as the C string "belongs" to the string instance.

#include <string>
#include <iostream>
#include <cstring>  // strtok

int main(int ac, char **av)
{
    std::string theString("hello world");
    std::cout << theString << " - " << theString.size() << std::endl;
    
    //--- this cast *only* to illustrate the effect of strtok() on std::string 
    char *token = strtok(const_cast<char  *>(theString.c_str()), " ");

    std::cout << theString << " - " << theString.size() << std::endl;

    return 0;
}

After the call to strtok(), the space was "removed" from the string, or rather replaced with a non-printable null character, but the length remains unchanged.

>./a.out
hello world - 11
helloworld - 11

Therefore you have to resort to a native mechanism, duplication of the string, or a third-party library, as previously mentioned.
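To make the mismatch concrete, here is a hedged sketch that compares what std::string reports with what a C function sees after a single strtok() call. The helper name lengths_after_strtok is invented for illustration, and it uses &s[0] on a copy instead of the const_cast above.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <utility>

// Hypothetical helper: returns {s.size(), strlen(s.c_str())} after one
// strtok call on the string's own buffer (s is a copy of the argument).
std::pair<std::size_t, std::size_t> lengths_after_strtok(std::string s,
                                                         const char* seps) {
    std::strtok(&s[0], seps);  // overwrites the first separator after the token
    return {s.size(), std::strlen(s.c_str())};
}

// Usage: lengths_after_strtok("hello world", " ") gives {11, 5}:
// the std::string still holds 11 characters, but C functions stop at "hello".
```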

Solution 7 - C++

I suppose the language is C, or C++...

strtok, IIRC, replaces separators with \0. That's why it cannot take a const string. To work around that "quickly", if the string isn't huge, you can just strdup() it, which is wise anyway if you need to keep the string unaltered (as the const suggests).

On the other hand, you might want to use another tokenizer, perhaps hand-rolled, that is less violent on the given argument.
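One possible shape for such a hand-rolled tokenizer is sketched below. It only reads the input, using find_first_not_of and find_first_of, and like strtok() it treats any run of delimiter characters as a single separator. The function name tokenize is an assumption for illustration.

```cpp
#include <string>
#include <vector>

// Hypothetical non-destructive tokenizer: `input` is never modified.
std::vector<std::string> tokenize(const std::string& input,
                                  const std::string& delims) {
    std::vector<std::string> tokens;
    // Find the start of the first token (skipping leading delimiters).
    std::string::size_type start = input.find_first_not_of(delims);
    while (start != std::string::npos) {
        // The token runs up to the next delimiter, or to the end of the string.
        std::string::size_type end = input.find_first_of(delims, start);
        tokens.push_back(input.substr(start, end - start));
        // Skip the delimiter run; yields npos when nothing is left.
        start = input.find_first_not_of(delims, end);
    }
    return tokens;
}

// Usage: tokenize("20#6 5, 3", "#, ") yields "20", "6", "5" and "3".
```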

Solution 8 - C++

Assuming that by "string" you're talking about std::string in C++, you might have a look at the Tokenizer package in Boost.

Solution 9 - C++

Chris's answer is probably fine when using std::string; however, if you want to use std::basic_string with a character type such as char16_t, std::getline can't readily be used. Here is another possible implementation:

template <class CharT> bool tokenizestring(const std::basic_string<CharT> &input, CharT separator, typename std::basic_string<CharT>::size_type &pos, std::basic_string<CharT> &token) {
	if (pos >= input.length()) {
		// if input is empty, or ends with a separator, return an empty token when the end has been reached (and return an out-of-bound position so subsequent call won't do it again)
		if ((pos == 0) || ((pos > 0) && (pos == input.length()) && (input[pos-1] == separator))) {
			token.clear();
			pos=input.length()+1;
			return true;
		}
		return false;
	}
	typename std::basic_string<CharT>::size_type separatorPos=input.find(separator, pos);
	if (separatorPos == std::basic_string<CharT>::npos) {
		token=input.substr(pos, input.length()-pos);
		pos=input.length();
	} else {
		token=input.substr(pos, separatorPos-pos);
		pos=separatorPos+1;
	}
	return true;
}

Then use it like this:

std::basic_string<char16_t> s;
std::basic_string<char16_t> token;
std::basic_string<char16_t>::size_type tokenPos=0;
while (tokenizestring(s, (char16_t)' ', tokenPos, token)) {
	...
}

Solution 10 - C++

First off I would say use boost tokenizer.
Alternatively if your data is space separated then the string stream library is very useful.

But both the above have already been covered.
So as a third C-Like alternative I propose copying the std::string into a buffer for modification.

std::string   data("The data I want to tokenize");

// Create a buffer of the correct length:
std::vector<char>  buffer(data.size()+1);

// copy the string into the buffer
strcpy(&buffer[0],data.c_str());

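// Tokenize: keep calling strtok until it returns NULL
for (char *tok = strtok(&buffer[0], " "); tok != NULL; tok = strtok(NULL, " ")) {
    // use tok here; it points into buffer
}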

Solution 11 - C++

If you don't mind open source, you could use the subbuffer and subparser classes from https://github.com/EdgeCast/json_parser. The original string is left intact, there is no allocation and no copying of data. I have not compiled the following so there may be errors.

std::string input_string("hello world");
subbuffer input(input_string);
subparser flds(input, ' ', subparser::SKIP_EMPTY);
while (!flds.empty())
{
    subbuffer fld = flds.next();
    // do something with fld
}

// or if you know it is only two fields
subbuffer fld1 = input.before(' ');
subbuffer fld2 = input.sub(fld1.length() + 1).ltrim(' ');

Solution 12 - C++

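Typecasting to (char*) got it working for me! Note, however, that this writes through the pointer returned by c_str(), which the C++ standard does not allow, so it is safer to tokenize a copy.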

token = strtok((char *)str.c_str(), " ");

Solution 13 - C++

It fails because str.c_str() returns a pointer to a constant string, but char * strtok (char * str, const char * delimiters ) requires a modifiable (non-const) string. So you need to use const_cast<char *> in order to drop the const. Here is a complete but small program that tokenizes the string using the C strtok() function.

   #include <iostream>
   #include <string>
   #include <string.h> 
   using namespace std;
   int main() {
       string s="20#6 5, 3";
       // strtok requires a modifiable string, as it modifies the supplied string in order to tokenize it
       char *str=const_cast< char *>(s.c_str());    
       char *tok;
       tok=strtok(str, "#, " );     
       int arr[4], i=0;    
       while(tok!=NULL){
           arr[i++]=stoi(tok);
           tok=strtok(NULL, "#, " );
       }     
       for(int i=0; i<4; i++) cout<<arr[i]<<endl;
      
          
       return 0;
   }

NOTE: strtok may not be suitable in all situations, as the string passed to the function gets modified by being broken into smaller strings. See the next section for a better understanding of how strtok works.

How strtok works

Added a few print statements to better understand the changes happening to the string in each call to strtok and how it returns tokens.

#include <iostream>
#include <string>
#include <string.h> 
using namespace std;
int main() {
    string s="20#6 5, 3";
    char *str=const_cast< char *>(s.c_str());    
    char *tok;
    cout<<"string: "<<s<<endl;
    tok=strtok(str, "#, " );     
    cout<<"String: "<<s<<"\tToken: "<<tok<<endl;   
    while(tok!=NULL){
        tok=strtok(NULL, "#, " );
        cout<<"String: "<<s<<"\t\tToken: "<<(tok ? tok : "")<<endl;  // tok is NULL once no tokens remain
    }
    return 0;
}

Output:

string: 20#6 5, 3

String: 206 5, 3	Token: 20
String: 2065, 3		Token: 6
String: 2065 3		Token: 5
String: 2065 3		Token: 3
String: 2065 3		Token:

strtok iterates over the string. The first call finds the first non-delimiter character (2 in this case) and marks it as the token start, then continues scanning for a delimiter, replaces it with a null character (# gets replaced in the actual string), and returns the start pointer, which points to the first character of the token (i.e., it returns the token 20, which is now null-terminated). Each subsequent call starts scanning from the next character and returns a token if one is found, else null. Subsequently it returns the tokens 6, 5, and 3.
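The behaviour described above can be sketched as a simplified re-implementation. This is an illustration only, not how any particular C library actually implements strtok; the name my_strtok and the use of strspn/strcspn are assumptions made for the example.

```cpp
#include <cstring>

// Simplified, illustrative stand-in for strtok (hypothetical; real library
// implementations differ). Like strtok, it keeps hidden state between calls.
char* my_strtok(char* str, const char* delims) {
    static char* next = nullptr;        // where the next call resumes scanning
    if (str != nullptr) {
        next = str;                     // first call: start at the given string
    }
    if (next == nullptr) {
        return nullptr;                 // nothing left from earlier calls
    }

    next += std::strspn(next, delims);  // skip leading delimiters
    if (*next == '\0') {
        next = nullptr;                 // only delimiters remained
        return nullptr;
    }

    char* token = next;                              // token start
    char* end = token + std::strcspn(token, delims); // first delimiter after it
    if (*end != '\0') {
        *end = '\0';                    // overwrite the delimiter in place
        next = end + 1;                 // resume just past it next time
    } else {
        next = nullptr;                 // token ran to the end of the string
    }
    return token;
}
```

Called repeatedly on a writable buffer holding "20#6 5, 3" with the delimiters "#, ", this sketch returns 20, 6, 5, 3 and then a null pointer, matching the trace above.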

Content Type	Original Author	Original Content on Stackoverflow
Question	Bill	View Question on Stackoverflow
Solution 1 - C++	Chris Blackwell	View Answer on Stackoverflow
Solution 2 - C++	DocMax	View Answer on Stackoverflow
Solution 3 - C++	Todd Gamblin	View Answer on Stackoverflow
Solution 4 - C++	Martin Dimitrov	View Answer on Stackoverflow
Solution 5 - C++	user7860670	View Answer on Stackoverflow
Solution 6 - C++	philant	View Answer on Stackoverflow
Solution 7 - C++	PhiLho	View Answer on Stackoverflow
Solution 8 - C++	Sherm Pendley	View Answer on Stackoverflow
Solution 9 - C++	Jérôme	View Answer on Stackoverflow
Solution 10 - C++	Martin York	View Answer on Stackoverflow
Solution 11 - C++	Scott Yeager	View Answer on Stackoverflow
Solution 12 - C++	khushgrover	View Answer on Stackoverflow
Solution 13 - C++	maximus	View Answer on Stackoverflow