How to count characters in a unicode string in C
CStringUnicodeAsciiC Problem Overview
Lets say I have a string:
char theString[] = "你们好āa";
Given that my encoding is utf-8, this string is 12 bytes long (the three hanzi characters are three bytes each, the latin character with the macron is two bytes, and the 'a' is one byte:
strlen(theString) == 12
How can I count the number of characters? How can i do the equivalent of subscripting so that:
theString[3] == "好"
How can I slice, and cat such strings?
C Solutions
Solution 1 - C
You only count the characters that have the top two bits are not set to 10
(i.e., everything less that 0x80
or greater than 0xbf
).
That's because all the characters with the top two bits set to 10
are UTF-8 continuation bytes.
See here for a description of the encoding and how strlen
can work on a UTF-8 string.
For slicing and dicing UTF-8 strings, you basically have to follow the same rules. Any byte starting with a 0
bit or a 11
sequence is the start of a UTF-8 code point, all others are continuation characters.
Your best bet, if you don't want to use a third-party library, is to simply provide functions along the lines of:
utf8left (char *destbuff, char *srcbuff, size_t sz);
utf8mid (char *destbuff, char *srcbuff, size_t pos, size_t sz);
utf8rest (char *destbuff, char *srcbuff, size_t pos;
to get, respectively:
- the left
sz
UTF-8 bytes of a string. - the
sz
UTF-8 bytes of a string, starting atpos
. - the rest of the UTF-8 bytes of a string, starting at
pos
.
This will be a decent building block to be able to manipulate the strings sufficiently for your purposes.
Solution 2 - C
Try this for size:
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
// returns the number of utf8 code points in the buffer at s
size_t utf8len(char *s)
{
size_t len = 0;
for (; *s; ++s) if ((*s & 0xC0) != 0x80) ++len;
return len;
}
// returns a pointer to the beginning of the pos'th utf8 codepoint
// in the buffer at s
char *utf8index(char *s, size_t pos)
{
++pos;
for (; *s; ++s) {
if ((*s & 0xC0) != 0x80) --pos;
if (pos == 0) return s;
}
return NULL;
}
// converts codepoint indexes start and end to byte offsets in the buffer at s
void utf8slice(char *s, ssize_t *start, ssize_t *end)
{
char *p = utf8index(s, *start);
*start = p ? p - s : -1;
p = utf8index(s, *end);
*end = p ? p - s : -1;
}
// appends the utf8 string at src to dest
char *utf8cat(char *dest, char *src)
{
return strcat(dest, src);
}
// test program
int main(int argc, char **argv)
{
// slurp all of stdin to p, with length len
char *p = malloc(0);
size_t len = 0;
while (true) {
p = realloc(p, len + 0x10000);
ssize_t cnt = read(STDIN_FILENO, p + len, 0x10000);
if (cnt == -1) {
perror("read");
abort();
} else if (cnt == 0) {
break;
} else {
len += cnt;
}
}
// do some demo operations
printf("utf8len=%zu\n", utf8len(p));
ssize_t start = 2, end = 3;
utf8slice(p, &start, &end);
printf("utf8slice[2:3]=%.*s\n", end - start, p + start);
start = 3; end = 4;
utf8slice(p, &start, &end);
printf("utf8slice[3:4]=%.*s\n", end - start, p + start);
return 0;
}
Sample run:
matt@stanley:~/Desktop$ echo -n 你们好āa | ./utf8ops
utf8len=5
utf8slice[2:3]=好
utf8slice[3:4]=ā
Note that your example has an off by one error. theString[2] == "好"
Solution 3 - C
The easiest way is to use a library like ICU
Solution 4 - C
Depending on your notion of "character", this question can get more or less involved.
First off, you should transform your byte string into a string of unicode codepoints. You can do this with iconv()
of ICU, though if this is the only thing you do, iconv()
is a lot easier, and it's part of POSIX.
Your string of unicode codepoints could be something like a null-terminated uint32_t[]
, or if you have C1x, an array of char32_t
. The size of that array (i.e. its number of elements, not its size in bytes) is the number of codepoints (plus the terminator), and that should give you a very good start.
However, the notion of a "printable character" is fairly complex, and you may prefer to count graphemes rather than codepoints - for instance, an a
with an accent ^
can be expressed as two unicode codepoints, or as a combined legacy codepoint â
- both are valid, and both are required by the unicode standard to be treated equally. There is a process called "normalization" which turns your string into a definite version, but there are many graphemes which are not expressible as a single codepoint, and in general there is no way around a proper library that understands this and counts graphemes for you.
That said, it's up to you to decide how complex your scripts are and how thoroughly you want to treat them. Transforming into unicode codepoints is a must, everything beyond that is at your discretion.
Don't hesitate to ask questions about ICU if you decide that you need it, but feel free to explore the vastly simpler iconv()
first.
Solution 5 - C
In the real world, theString[3]=foo;
is not a meaningful operation. Why would you ever want to replace a character at a particular position in the string with a different character? There's certainly no natural-language-text processing task for which this operation is meaningful.
Counting characters is also unlikely to be meaningful. How many characters (for your idea of "character") are there in "á"? How about "á"? Now how about "གི"? If you need this information for implementing some sort of text editing, you're going to have to deal with these hard questions, or just use an existing library/gui toolkit. I would recommend the latter unless you're an expert on world scripts and languages and think you can do better.
For all other purposes, strlen
tells you exactly the piece of information that's actually useful: how much storage space a string takes. This is what's needed for combining and separating strings. If all you want to do is combine strings or separate them at a particular delimiter, snprintf
(or strcat
if you insist...) and strstr
are all you need.
If you want to perform higher-level natural-language-text operations, like capitalization, line breaking, etc. or even higher-level operations like pluralization, tense changes, etc. then you'll need either a library like ICU or respectively something much higher-level and linguistically-capable (and specific to the language(s) you're working with).
Again, most programs do not have any use for this sort of thing and just need to assemble and parse text without any considerations to natural language.
Solution 6 - C
while (s[i]) {
if ((s[i] & 0xC0) != 0x80)
j++;
i++;
}
return (j);
This will count characters in a UTF-8 String... (Found in this article: Even faster UTF-8 character counting)
However I'm still stumped on slicing and concatenating?!?
Solution 7 - C
In general we should use a different data type for unicode characters.
For example, you can use the wide char data type
wchar_t theString[] = L"你们好āa";
Note the L modifier that tells that the string is composed of wide chars.
The length of that string can be calculated using the wcslen
function, which behaves like strlen
.
Solution 8 - C
One thing that's not clear from the above answers is why it's not simple. Each character is encoded in one way or another - it doesn't have to be UTF-8, for example - and each character may have multiple encodings, with varying ways to handle combining of accents, etc. The rules are really complicated, and vary by encoding (e.g., utf-8 vs. utf-16).
This question has enormous security concerns, so it is imperative that this be done correctly. Use an OS-supplied library or a well-known third-party library to manipulate unicode strings; don't roll your own.
Solution 9 - C
I did similar implementation years back. But I do not have code with me.
For each unicode characters, first byte describes the number of bytes follow it to construct a unicode character. Based on the first byte you can determine the length of each unicode character.
I think its a good UTF8 library. enter link description here
Solution 10 - C
A sequence of code points constitute a single syllable / letter / character in many other Non Western-European languages (eg: all Indic languages)
So, when you are counting the length OR finding the substring (there are definitely use cases of finding the substrings - let us say playing a hangman game), you need to advance syllable by syllable , not by code point by code point.
So the definition of the character/syllable and where you actually break the string into "chunks of syllables" depends upon the nature of the language you are dealing with. For example, the pattern of the syllables in many Indic languages (Hindi, Telugu, Kannada, Malayalam, Nepali, Tamil, Punjabi, etc.) can be any of the following
V (Vowel in their primary form appearing at the beginning of the word)
C (consonant)
C + V (consonant + vowel in their secondary form)
C + C + V
C + C + C + V
You need to parse the string and look for the above patterns to break the string and to find the substrings.
I do not think it is possible to have a general purpose method which can magically break the strings in the above fashion for any unicode string (or sequence of code points) - as the pattern that works for one language may not be applicable for another letter;
I guess there may be some methods / libraries that can take some definition / configuration parameters as the input to break the unicode strings into such syllable chunks. Not sure though! Appreciate if some one can share how they solved this problem using any commercially available or open source methods.