How to iterate a UTF-8 string in PHP?


Php Problem Overview


How to iterate a UTF-8 string character by character using indexing?

When you access a UTF-8 string with the bracket operator (e.g. $str[0]), you get individual bytes, so a multi-byte character is spread over 2 or more elements.

For example:

$str = "Kąt";
$str[0] = "K";
$str[1] = "�";
$str[2] = "�";
$str[3] = "t";

but I would like to have:

$str[0] = "K";
$str[1] = "ą";
$str[2] = "t";

This is possible with mb_substr, but it is extremely slow, e.g.:

mb_substr($str, 0, 1) = "K"
mb_substr($str, 1, 1) = "ą"
mb_substr($str, 2, 1) = "t"

Is there another way to iterate the string character by character without using mb_substr?

Php Solutions


Solution 1 - Php

Use preg_split. With the "u" modifier, the pattern and the subject string are treated as UTF-8.

$chrArray = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
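
A minimal usage sketch, assuming the input is expected to be valid UTF-8. With the "u" modifier, preg_split returns false on malformed input, so the result is checked before iterating. The variable names and the choice of exception are illustrative only:

```php
<?php
$str = "Kąt";

// Split into an array of UTF-8 characters; PREG_SPLIT_NO_EMPTY drops the
// empty strings that the empty pattern would otherwise yield at both ends.
$chrArray = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);

if ($chrArray === false) {
    // With the "u" modifier, malformed UTF-8 makes preg_split() fail.
    throw new InvalidArgumentException('Input is not valid UTF-8');
}

foreach ($chrArray as $index => $chr) {
    echo $index, ': ', $chr, PHP_EOL;
}
```

Here the array index is the character index (0, 1, 2), not the byte offset.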

Solution 2 - Php

preg_split will fail on very large strings with a memory exception, and mb_substr is indeed slow, so here is a simple and effective function that you could use instead:

function nextchar($string, &$pointer){
	if(!isset($string[$pointer])) return false;
	$char = ord($string[$pointer]);
	if($char < 128){
		return $string[$pointer++];
	}else{
		if($char < 224){
			$bytes = 2;
		}elseif($char < 240){
			$bytes = 3;
		}else{
			$bytes = 4;
		}
		$str =  substr($string, $pointer, $bytes);
		$pointer += $bytes;
		return $str;
	}
}

I used this to loop through a multibyte string character by character. If I swap it for the mb_substr version below, the performance difference is huge:

function nextchar($string, &$pointer){
	if(!isset($string[$pointer])) return false;
	return mb_substr($string, $pointer++, 1, 'UTF-8');
}

Looping over a string 10000 times with the code below took about 3 seconds with the first version and 13 seconds with the second:

function microtime_float(){
	list($usec, $sec) = explode(' ', microtime());
	return ((float)$usec + (float)$sec);
}

$source = 'árvíztűrő tükörfúrógépárvíztűrő tükörfúrógépárvíztűrő tükörfúrógépárvíztűrő tükörfúrógépárvíztűrő tükörfúrógép';

$t = Array(
	0 => microtime_float()
);

for($i = 0; $i < 10000; $i++){
	$pointer = 0;
	while(($chr = nextchar($source, $pointer)) !== false){
		//echo $chr;
	}
}

$t[] = microtime_float();

echo ($t[1] - $t[0]).PHP_EOL.PHP_EOL;

Solution 3 - Php

In answer to comments posted by @Pekka and @Col. Shrapnel, I have compared preg_split with mb_substr.

The benchmark image (not reproduced here) shows that preg_split took 1.2s, while mb_substr took almost 25s.

Here is the code of the functions:

function split_preg($str){
	return preg_split('//u', $str, -1);		
}

function split_mb($str){
	$length = mb_strlen($str);
	$chars = array();
	for ($i=0; $i<$length; $i++){
		$chars[] = mb_substr($str, $i, 1);
	}
	$chars[] = "";
	return $chars;
}
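
The timing code itself is not shown in the answer. A rough sketch of how such a comparison could be run follows; the helper time_split, the test string, and the iteration count are all assumptions made for illustration, and the absolute numbers will differ by PHP version and machine:

```php
<?php
// Hypothetical timing helper (not from the original answer): runs $fn
// $iterations times on $str and returns the elapsed wall-clock seconds.
function time_split(callable $fn, string $str, int $iterations): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn($str);
    }
    return microtime(true) - $start;
}

// preg_split-based splitter (empty leading/trailing entries dropped).
$splitPreg = function (string $str): array {
    return preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
};

// mb_substr-based splitter with an explicit encoding.
$splitMb = function (string $str): array {
    $chars = [];
    $length = mb_strlen($str, 'UTF-8');
    for ($i = 0; $i < $length; $i++) {
        $chars[] = mb_substr($str, $i, 1, 'UTF-8');
    }
    return $chars;
};

// Test input and iteration count are arbitrary assumptions.
$str = str_repeat('árvíztűrő tükörfúrógép ', 20);

// Sanity check: both approaches should agree before they are timed.
$same = $splitPreg($str) === $splitMb($str);

printf("results identical: %s\n", $same ? 'yes' : 'no');
printf("preg_split: %.4f s\n", time_split($splitPreg, $str, 1000));
printf("mb_substr:  %.4f s\n", time_split($splitMb, $str, 1000));
```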

Solution 4 - Php

Using Lajos Meszaros' wonderful function as inspiration I created a multi-byte string iterator class.

// Multi-Byte String iterator class
class MbStrIterator implements Iterator
{
	private $iPos	= 0;
	private $iSize	= 0;
	private $sStr	= null;

	// Constructor
	public function __construct(/*string*/ $str)
	{
		// Save the string
		$this->sStr		= $str;

		// Calculate the size of the current character
		$this->calculateSize();
	}

	// Calculate size
	private function calculateSize() {

		// If we're done already
		if(!isset($this->sStr[$this->iPos])) {
			return;
		}

		// Get the character at the current position
		$iChar	= ord($this->sStr[$this->iPos]);

		// If it's a single byte, set it to one
		if($iChar < 128) {
			$this->iSize	= 1;
		}

		// Else, it's multi-byte
		else {

			// Figure out how long it is
			if($iChar < 224) {
				$this->iSize = 2;
			} else if($iChar < 240){
				$this->iSize = 3;
			} else if($iChar < 248){
				$this->iSize = 4;
			} else if($iChar < 252){
				$this->iSize = 5;
			} else {
				$this->iSize = 6;
			}
		}
	}

	// Current
	public function current() {

		// If we're done
		if(!isset($this->sStr[$this->iPos])) {
			return false;
		}

		// Else if we have one byte
		else if($this->iSize == 1) {
			return $this->sStr[$this->iPos];
		}

		// Else, it's multi-byte
		else {
			return substr($this->sStr, $this->iPos, $this->iSize);
		}
	}

	// Key
	public function key()
	{
		// Return the current position
		return $this->iPos;
	}

	// Next
	public function next()
	{
		// Increment the position by the current size and then recalculate
		$this->iPos	+= $this->iSize;
		$this->calculateSize();
	}

	// Rewind
	public function rewind()
	{
		// Reset the position and size
		$this->iPos		= 0;
		$this->calculateSize();
	}

	// Valid
	public function valid()
	{
		// Return if the current position is valid
		return isset($this->sStr[$this->iPos]);
	}
}

It can be used like so:

foreach(new MbStrIterator("Kąt") as $c) {
    echo "{$c}\n";
}

Which will output:

K
ą
t

Or, if you really want to know the position of the start byte as well:

foreach(new MbStrIterator("Kąt") as $i => $c) {
    echo "{$i}: {$c}\n";
}

Which will output:

0: K
1: ą
3: t
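
The same byte-walking idea can also be sketched as a generator, which avoids the Iterator boilerplate. This is an alternative to the class above, not part of the original answer. The function name utf8_chars is made up for illustration, and the input is assumed to be valid UTF-8 with sequences of at most 4 bytes:

```php
<?php
// Yields "start byte offset => character", assuming $str is valid UTF-8.
// Continuation bytes are not validated.
function utf8_chars(string $str): Generator
{
    $length = strlen($str);
    for ($pos = 0; $pos < $length; $pos += $size) {
        $byte = ord($str[$pos]);
        if ($byte < 0x80) {
            $size = 1;       // 0xxxxxxx: ASCII
        } elseif ($byte < 0xE0) {
            $size = 2;       // 110xxxxx
        } elseif ($byte < 0xF0) {
            $size = 3;       // 1110xxxx
        } else {
            $size = 4;       // 11110xxx
        }
        yield $pos => substr($str, $pos, $size);
    }
}

foreach (utf8_chars("Kąt") as $i => $c) {
    echo "{$i}: {$c}\n";
}
```

As with the class, the keys are byte offsets (0, 1, 3 for "Kąt"), not character indexes.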

Solution 5 - Php

You could parse each byte of the string and determine whether it is a single (ASCII) character or the start of a multi-byte character:

> The UTF-8 encoding is variable-width, with each character represented by 1 to 4 bytes. Each byte has 0–4 leading consecutive '1' bits followed by a '0' bit to indicate its type. 2 or more '1' bits indicates the first byte in a sequence of that many bytes.

You would walk through the string and, instead of increasing the position by 1, read the current character in full and then increase the position by the length of that character.

The Wikipedia article has the interpretation table for each character [retrieved 2010-10-01]:

   0-127 Single-byte encoding (compatible with US-ASCII)
 128-191 Second, third, or fourth byte of a multi-byte sequence
 192-193 Overlong encoding: start of 2-byte sequence, 
         but would encode a code point ≤ 127
  ........
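
A minimal sketch of that walk, using bit masks on the lead byte. The function names are hypothetical, and the check is deliberately shallow: it rejects continuation bytes in lead position and truncated sequences, but it does not detect overlong encodings (such as lead bytes 192-193) or verify the continuation bytes themselves:

```php
<?php
// Returns the sequence length announced by a lead byte, or 0 when the byte
// is a continuation byte (10xxxxxx) or has the form 11111xxx.
function utf8_sequence_length(int $byte): int
{
    if (($byte & 0x80) === 0x00) return 1; // 0xxxxxxx
    if (($byte & 0xE0) === 0xC0) return 2; // 110xxxxx
    if (($byte & 0xF0) === 0xE0) return 3; // 1110xxxx
    if (($byte & 0xF8) === 0xF0) return 4; // 11110xxx
    return 0;
}

// Walks $str, advancing by each character's byte length.
function utf8_walk(string $str): array
{
    $chars = [];
    $pos = 0;
    $length = strlen($str);
    while ($pos < $length) {
        $size = utf8_sequence_length(ord($str[$pos]));
        if ($size === 0 || $pos + $size > $length) {
            throw new UnexpectedValueException("Invalid UTF-8 at byte {$pos}");
        }
        $chars[] = substr($str, $pos, $size);
        $pos += $size;
    }
    return $chars;
}

print_r(utf8_walk("Kąt"));
```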

Solution 6 - Php

I had the same issue as OP and I try to avoid regex in PHP since it fails or even crashes with long strings. I used Mészáros Lajos' answer with some changes since I have mbstring.func_overload set to 7.

function nextchar($string, &$pointer, &$asciiPointer){
    if(!isset($string[$asciiPointer])) return false;
    $char = ord($string[$asciiPointer]);
    if($char < 128){
        $pointer++;
        return $string[$asciiPointer++];
    }else{
        if($char < 224){
            $bytes = 2;
        }elseif($char < 240){
            $bytes = 3;
        }elseif($char < 248){
            $bytes = 4;
        }elseif($char < 252){
            $bytes = 5;
        }else{
            $bytes = 6;
        }
        $str =  substr($string, $pointer++, 1);
        $asciiPointer+= $bytes;
        return $str;
    }
}

With mbstring.func_overload set to 7, substr actually calls mb_substr. So substr gets the right value in this case. I had to add a second pointer. One keeps track of the multi-byte char in the string, the other keeps track of the single-byte char. The multi-byte value is used for substr (since it's actually mb_substr), while the single-byte value is used for retrieving the byte in this fashion: $string[$index].

Obviously if PHP ever decides to fix the [] access to work properly with multi-byte values, this will fail. But also, this fix wouldn't be needed in the first place.

Solution 7 - Php

I think the most efficient solution would be to work through the string using mb_substr. In each iteration of the loop, mb_substr would be called twice (to find the next character and the remaining string). It would pass only the remaining string to the next iteration. This way, the main overhead in each iteration would be finding the next character (done twice), which takes only one to five or so operations, depending on the byte length of the character.

If this description is not clear, let me know and I'll provide a working PHP function.
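
One possible reading of that description, as a sketch only: the function name consume_chars is hypothetical and this is not the author's own implementation. Note that the second mb_substr call copies the remaining string on every step, so it is worth benchmarking before relying on the efficiency claim:

```php
<?php
// Hypothetical helper: repeatedly take the first character with mb_substr
// and keep only the remainder for the next iteration.
function consume_chars(string $str): array
{
    $chars = [];
    while ($str !== '') {
        $chars[] = mb_substr($str, 0, 1, 'UTF-8');    // next character
        $str     = mb_substr($str, 1, null, 'UTF-8'); // remaining string
    }
    return $chars;
}

print_r(consume_chars("Kąt"));
```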

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type      Original Author   Original Content on Stackoverflow
Question          czuk              View Question on Stackoverflow
Solution 1 - Php  vartec            View Answer on Stackoverflow
Solution 2 - Php  Lajos Mészáros    View Answer on Stackoverflow
Solution 3 - Php  czuk              View Answer on Stackoverflow
Solution 4 - Php  Chris Nasr        View Answer on Stackoverflow
Solution 5 - Php  Pekka             View Answer on Stackoverflow
Solution 6 - Php  Andrew            View Answer on Stackoverflow
Solution 7 - Php  David Spector     View Answer on Stackoverflow