Most efficient way to iterate over all the chars in an NSString

Objective C Problem Overview


What's the best way to iterate over all the chars in an NSString? Would you loop over the length of the string and call this method:

[aNSString characterAtIndex:index];

or would you use a char buffer based on the NSString?

Objective C Solutions


Solution 1 - Objective C

I think it's important that people understand how to deal with Unicode, so I ended up writing a monster answer, but in the spirit of tl;dr I will start with a snippet that should work fine. If you want to know the details (which you should!), please continue reading after the snippet.

NSUInteger len = [str length];
unichar buffer[len+1];

[str getCharacters:buffer range:NSMakeRange(0, len)];

NSLog(@"getCharacters:range: with unichar buffer");
for(int i = 0; i < len; i++) {
  NSLog(@"%C", buffer[i]);
}

Still with me? Good!

The current accepted answer seems to be confusing bytes with characters/letters. This is a common problem when encountering Unicode, especially coming from a C background. Strings in Objective-C are represented as Unicode characters (unichar, a 16-bit UTF-16 code unit), which are bigger than bytes and shouldn't be used with the standard C string manipulation functions.

(Edit: This is not the full story! To my great shame, I'd completely forgotten to account for composable characters, where a "letter" is made up of multiple unicode codepoints. This gives you a situation where you can have one "letter" resolving to multiple unichars, which in turn are multiple bytes each. Hoo boy. Please refer to this great answer for the details on that.)

The proper answer to the question depends on whether you want to iterate over the characters/letters (as distinct from the type char) or the bytes of the string (what the type char actually means). In the spirit of limiting confusion, I will use the terms byte and letter from now on, avoiding the possibly ambiguous term character.

If you want to do the former and iterate over the letters in the string, you need to deal exclusively with unichars (sorry, but we're in the future now, you can't ignore it anymore). Finding the number of letters is easy: it's the string's length property. An example snippet is as such (same as above):

NSUInteger len = [str length];
unichar buffer[len+1];

[str getCharacters:buffer range:NSMakeRange(0, len)];

NSLog(@"getCharacters:range: with unichar buffer");
for(int i = 0; i < len; i++) {
  NSLog(@"%C", buffer[i]);
}

If, on the other hand, you want to iterate over the bytes in a string, it starts getting complicated and the result will depend entirely upon the encoding you choose to use. A decent default choice is UTF-8, so that's what I will show.

Doing this, you have to figure out how many bytes the resulting UTF-8 string will be, a step where it's easy to go wrong and use the string's -length. One main reason this is so easy to get wrong, especially for a US developer, is that a string whose letters all fall into the 7-bit ASCII range will have equal byte and letter lengths. This is because UTF-8 encodes 7-bit ASCII letters with a single byte, so a simple test string and basic English text might work perfectly fine.

The proper way to do this is to use the method -lengthOfBytesUsingEncoding:NSUTF8StringEncoding (or other encoding), allocate a buffer with that length, then convert the string to the same encoding with -cStringUsingEncoding: and copy it into that buffer. Example code here:

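NSUInteger byteLength = [str lengthOfBytesUsingEncoding:NSUTF8StringEncoding];
char proper_c_buffer[byteLength+1];
strncpy(proper_c_buffer, [str cStringUsingEncoding:NSUTF8StringEncoding], byteLength);
// strncpy does not add a terminator when it copies exactly byteLength bytes
proper_c_buffer[byteLength] = '\0';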

NSLog(@"strncpy with proper length");
for(int i = 0; i < byteLength; i++) {
  NSLog(@"%c", proper_c_buffer[i]);
}
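
To make the byte-versus-letter mismatch concrete, here is a minimal sketch in plain C (which also compiles as Objective-C). It assumes valid, NUL-terminated UTF-8 input, and the helper names utf8_byte_length and utf16_unit_count are made up for this illustration. The second helper computes the number of UTF-16 code units for the same text, which is the value I would expect -length to report.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of bytes in a NUL-terminated UTF-8 string. */
size_t utf8_byte_length(const char *s) {
    return strlen(s);
}

/* Hypothetical helper: number of UTF-16 code units needed for the same text.
   Assumes the input is valid UTF-8. */
size_t utf16_unit_count(const char *s) {
    size_t units = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p != 0; p++) {
        if ((*p & 0xC0) == 0x80) {
            continue; /* continuation byte, its lead byte was already counted */
        }
        /* A 4-byte lead (0xF0 and up) encodes a code point above U+FFFF,
           which takes two UTF-16 code units (a surrogate pair). */
        units += (*p >= 0xF0) ? 2 : 1;
    }
    return units;
}
```

For an ASCII string such as "hello" both helpers return 5, which is why ASCII-only test strings hide the bug. For буква the byte length is 10 while the unit count is 5.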

Just to drive the point home as to why it's important to keep things straight, I will show example code that handles this iteration in four different ways, two wrong and two correct. This is the code:

#import <Foundation/Foundation.h>

int main() {
  NSString *str = @"буква";
  NSUInteger len = [str length];

  // Try to store unicode letters in a char array. This will fail horribly
  // because getCharacters:range: takes a unichar array and will probably
  // overflow or do other terrible things. (the compiler will warn you here,
  // but warnings get ignored)
  char c_buffer[len+1];
  [str getCharacters:c_buffer range:NSMakeRange(0, len)];

  NSLog(@"getCharacters:range: with char buffer");
  for(int i = 0; i < len; i++) {
    NSLog(@"Byte %d: %c", i, c_buffer[i]);
  }

  // Copy the UTF string into a char array, but use the amount of letters
  // as the buffer size, which will truncate many non-ASCII strings.
  strncpy(c_buffer, [str UTF8String], len);

  NSLog(@"strncpy with UTF8String");
  for(int i = 0; i < len; i++) {
    NSLog(@"Byte %d: %c", i, c_buffer[i]);
  }

  // Do It Right (tm) for accessing letters by making a unichar buffer with
  // the proper letter length
  unichar buffer[len+1];
  [str getCharacters:buffer range:NSMakeRange(0, len)];

  NSLog(@"getCharacters:range: with unichar buffer");
  for(int i = 0; i < len; i++) {
    NSLog(@"Letter %d: %C", i, buffer[i]);
  }

  // Do It Right (tm) for accessing bytes, by using the proper
  // encoding-handling methods
  NSUInteger byteLength = [str lengthOfBytesUsingEncoding:NSUTF8StringEncoding];
  char proper_c_buffer[byteLength+1];
  const char *utf8_buffer = [str cStringUsingEncoding:NSUTF8StringEncoding];
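  // We copy here because the documentation tells us the string can disappear
  // under us and we should copy it. Just to be safe
  strncpy(proper_c_buffer, utf8_buffer, byteLength);
  // strncpy does not add a terminator here, so write it explicitly
  proper_c_buffer[byteLength] = '\0';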

  NSLog(@"strncpy with proper length");
  for(int i = 0; i < byteLength; i++) {
    NSLog(@"Byte %d: %c", i, proper_c_buffer[i]);
  }
  return 0;
}

Running this code will output the following (with NSLog cruft trimmed out), showing exactly HOW different the byte and letter representations can be (the two last outputs):

getCharacters:range: with char buffer
Byte 0: 1
Byte 1: 
Byte 2: C
Byte 3: 
Byte 4: :
strncpy with UTF8String
Byte 0: Ð
Byte 1: ±
Byte 2: Ñ
Byte 3: 
Byte 4: Ð
getCharacters:range: with unichar buffer
Letter 0: б
Letter 1: у
Letter 2: к
Letter 3: в
Letter 4: а
strncpy with proper length
Byte 0: Ð
Byte 1: ±
Byte 2: Ñ
Byte 3: 
Byte 4: Ð
Byte 5: º
Byte 6: Ð
Byte 7: ²
Byte 8: Ð
Byte 9: °
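
The odd-looking bytes in that output can be reproduced by hand. The following is a rough sketch in plain C, with helper names I made up for illustration. It encodes a single code point below U+10000 as UTF-8, and it splits a UTF-16 code unit into its two in-memory bytes, assuming a little-endian machine (which is what the first output suggests). Surrogates are not handled.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode a code point below U+10000 as UTF-8.
   Writes up to 3 bytes into out and returns how many were written.
   Surrogate values (U+D800..U+DFFF) are not rejected in this sketch. */
size_t utf8_encode_bmp(uint16_t cp, unsigned char out[3]) {
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Hypothetical helper: the two bytes of a UTF-16 code unit as they sit in
   memory on a little-endian machine (low byte first). */
void utf16le_bytes(uint16_t unit, unsigned char out[2]) {
    out[0] = (unsigned char)(unit & 0xFF);
    out[1] = (unsigned char)(unit >> 8);
}
```

For б (U+0431) the UTF-8 bytes come out as 0xD0 0xB1, which print as Ð and ± when each byte is shown as a Latin-1 character. The little-endian UTF-16 bytes are 0x31 0x04, that is the digit 1 followed by an unprintable control byte, which matches the first two lines of the char buffer output.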

Solution 2 - Objective C

While Daniel's solution will probably work most of the time, I think the solution is dependent on the context. For example, I have a spelling app and need to iterate over each character as it appears onscreen which may not correspond to the way it is represented in memory. This is especially true for text provided by the user.

Using something like this category on NSString:

- (void) dumpChars
{
	NSMutableArray	*chars = [NSMutableArray array];
	NSUInteger		len = [self length];
	unichar			buffer[len+1];

	[self getCharacters: buffer range: NSMakeRange(0, len)];
	for (int i=0; i<len; i++) {
		[chars addObject: [NSString stringWithFormat: @"%C", buffer[i]]];
	}

	NSLog(@"%@ = %@", self, [chars componentsJoinedByString: @", "]);
}

And feeding it a word like mañana might produce:

mañana = m, a, ñ, a, n, a

But it could just as easily produce:

mañana = m, a, n, ̃, a, n, a

The former will be produced if the string is in precomposed Unicode form and the latter if it's in decomposed form.

You might think this could be avoided by using the result of NSString's precomposedStringWithCanonicalMapping or precomposedStringWithCompatibilityMapping, but this is not necessarily the case as Apple warns in Technical Q&A 1225. For example a string like e̊gâds (which I totally made up) still produces the following even after converting to a precomposed form.

 e̊gâds = e, ̊, g, â, d, s

The solution for me is to use NSString's enumerateSubstringsInRange:options:usingBlock:, passing NSStringEnumerationByComposedCharacterSequences as the enumeration option. The earlier example can be rewritten to look like this:

- (void) dumpSequences
{
	NSMutableArray	*chars = [NSMutableArray array];

	[self enumerateSubstringsInRange: NSMakeRange(0, [self length]) options: NSStringEnumerationByComposedCharacterSequences
		usingBlock: ^(NSString *inSubstring, NSRange inSubstringRange, NSRange inEnclosingRange, BOOL *outStop) {
		[chars addObject: inSubstring];
	}];

	NSLog(@"%@ = %@", self, [chars componentsJoinedByString: @", "]);
}

If we feed this version e̊gâds then we get

e̊gâds = e̊, g, â, d, s

as expected, which is what I want.

The section of documentation on Characters and Grapheme Clusters may also be helpful in explaining some of this.
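
To give a feel for what grouping by composed character sequences involves, here is a deliberately simplified sketch in plain C. It is not how NSString does it: it only treats the Combining Diacritical Marks block (U+0300 to U+036F) as attaching to the previous unit, which is an assumption that happens to cover the examples above, and the function names are hypothetical. Real grapheme cluster rules are considerably more involved.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplifying assumption: only the Combining Diacritical Marks block
   (U+0300..U+036F) counts as "attaches to the previous unit". */
static int is_combining_mark(uint16_t unit) {
    return unit >= 0x0300 && unit <= 0x036F;
}

/* Hypothetical helper: count user-visible letters in a UTF-16 buffer by
   starting a new letter at every unit that is not a combining mark. */
size_t letter_count(const uint16_t *units, size_t len) {
    size_t letters = 0;
    for (size_t i = 0; i < len; i++) {
        if (i == 0 || !is_combining_mark(units[i])) {
            letters++;
        }
    }
    return letters;
}
```

With this, the precomposed mañana (6 units) and the decomposed mañana (7 units, with U+0303 after the n) both count as 6 letters, and e̊gâds (6 units, with U+030A after the e) counts as 5.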

Note: Looks like some of the unicode strings I used are tripping up SO when formatted as code. The strings I used are mañana, and e̊gâds.

Solution 3 - Objective C

Neither. The "Optimize Your Text Manipulations" section of the "Cocoa Performance Guidelines" in the Xcode Documentation recommends:

> If you want to iterate over the characters of a string, one of the things you should not do is use the characterAtIndex: method to retrieve each character separately. This method is not designed for repeated access. Instead, consider fetching the characters all at once using the getCharacters:range: method and iterating over the bytes directly.
>
> If you want to search a string for specific characters or substrings, do not iterate through the characters one by one. Instead, use higher level methods such as rangeOfString:, rangeOfCharacterFromSet:, or substringWithRange:, which are optimized for searching the NSString characters.

See this Stack Overflow answer on How to remove whitespace from right end of NSString for an example of how to let rangeOfCharacterFromSet: iterate over the characters of the string instead of doing it yourself.

Solution 4 - Objective C

I would definitely get a char buffer first, then iterate over that.

NSString *someString = ...

unsigned int len = [someString length];
char buffer[len];

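//This way (strncpy needs a length; note that len counts unichars, not UTF-8 bytes):
strncpy(buffer, [someString UTF8String], len);

//Or this way (preferred):
//Caution: getCharacters:range: writes unichars, so a char buffer is too small; see Solution 1

[someString getCharacters:buffer range:NSMakeRange(0, len)];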

for(int i = 0; i < len; ++i) {
   char current = buffer[i];
   //do something with current...
}

Solution 5 - Objective C

This is a slightly different solution to the question, but I thought it might be useful for someone. What I wanted was to iterate over the actual Unicode characters in an NSString. So, I found this solution:

NSString * str = @"hello 🤠💩";

NSRange range = NSMakeRange(0, str.length);
[str enumerateSubstringsInRange:range
                          options:NSStringEnumerationByComposedCharacterSequences
                       usingBlock:^(NSString *substring, NSRange substringRange,
                                    NSRange enclosingRange, BOOL *stop)
{
    NSLog(@"%@", substring);
}];
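
The reason plain unichar iteration falls short for a string like this is that each emoji is stored as a surrogate pair, so it occupies two unichars. As a rough illustration in plain C (the helper names are hypothetical, and uint16_t stands in for unichar), the following counts code points in a UTF-16 buffer and combines a surrogate pair into a single code point.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: count Unicode code points in a UTF-16 buffer,
   treating a high surrogate followed by a low surrogate as one code point. */
size_t code_point_count(const uint16_t *units, size_t len) {
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        int is_high = units[i] >= 0xD800 && units[i] <= 0xDBFF;
        if (is_high && i + 1 < len &&
            units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF) {
            i++; /* skip the low surrogate, it belongs to this code point */
        }
        count++;
    }
    return count;
}

/* Hypothetical helper: combine a surrogate pair into one code point. */
uint32_t combine_surrogates(uint16_t high, uint16_t low) {
    return 0x10000u + (((uint32_t)high - 0xD800u) << 10)
                    + ((uint32_t)low - 0xDC00u);
}
```

For the string above, the buffer holds 10 units (six for "hello " plus two per emoji) but only 8 code points. The pair 0xD83E 0xDD20 combines to U+1F920 and 0xD83D 0xDCA9 combines to U+1F4A9. Note that this counts code points, not composed character sequences, so it is still a narrower notion than what the enumeration option above handles.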

Solution 6 - Objective C

Although you would technically be getting individual NSString values, here is an alternative approach:

NSRange range = NSMakeRange(0, 1);
for (__unused int i = range.location; range.location < [aNSString length]; range.location++) {
  NSLog(@"%@", [aNSString substringWithRange:range]);
}

(The __unused int i bit is necessary to silence the compiler warning.)

Solution 7 - Objective C

Try enumerating the string with a block.

Create a category on NSString:

.h

@interface NSString (Category)

- (void)enumerateCharactersUsingBlock:(void (^)(NSString *character, NSInteger idx, bool *stop))block;

@end

.m

@implementation NSString (Category)

- (void)enumerateCharactersUsingBlock:(void (^)(NSString *character, NSInteger idx, bool *stop))block
{
    bool _stop = NO;
    for(NSInteger i = 0; i < [self length] && !_stop; i++)
    {
        NSString *character = [self substringWithRange:NSMakeRange(i, 1)];
        block(character, i, &_stop);
    }
}
@end

example

NSString *string = @"Hello World";
[string enumerateCharactersUsingBlock:^(NSString *character, NSInteger idx, bool *stop) {
        NSLog(@"char %@, i: %li",character, (long)idx);
}];

Solution 8 - Objective C

You should not use

NSUInteger len = [str length];
unichar buffer[len+1];

because a variable-length array lives on the stack, and a long string can overflow it. You should allocate the memory on the heap instead

NSUInteger len = [str length];
unichar *buffer = (unichar *) malloc((len+1) * sizeof(unichar));

and in the end use

free(buffer);

in order to avoid memory problems.
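
As a sketch of the same allocate, use, free pattern in plain C (uint16_t stands in for unichar here, and copy_units_to_heap is a name invented for this example), note the parentheses around the element count and the check for a failed allocation:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy len UTF-16 code units into a heap buffer with
   room for one extra terminating unit. uint16_t stands in for unichar.
   Returns NULL if the allocation fails; the caller must free() the result. */
uint16_t *copy_units_to_heap(const uint16_t *src, size_t len) {
    /* Parenthesize (len + 1) so the whole element count is multiplied
       by the element size. */
    uint16_t *buffer = malloc((len + 1) * sizeof *buffer);
    if (buffer == NULL) {
        return NULL;
    }
    memcpy(buffer, src, len * sizeof *buffer);
    buffer[len] = 0; /* terminating unit in the extra slot */
    return buffer;
}
```

The caller iterates over the returned buffer and then passes it to free() exactly once.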

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | aahrens | View Question on Stackoverflow
Solution 1 - Objective C | Daniel Bruce | View Answer on Stackoverflow
Solution 2 - Objective C | Casey Fleser | View Answer on Stackoverflow
Solution 3 - Objective C | ma11hew28 | View Answer on Stackoverflow
Solution 4 - Objective C | Jacob Relkin | View Answer on Stackoverflow
Solution 5 - Objective C | CodeOverRide | View Answer on Stackoverflow
Solution 6 - Objective C | Scott Gardner | View Answer on Stackoverflow
Solution 7 - Objective C | user1644430 | View Answer on Stackoverflow
Solution 8 - Objective C | marcusthierfelder | View Answer on Stackoverflow