Upper vs Lower Case
Tags: String, Language Agnostic, Uppercase
String Problem Overview
When doing case-insensitive comparisons, is it more efficient to convert the string to upper case or lower case? Does it even matter?
It is suggested in this SO post that C# is more efficient with ToUpper because "Microsoft optimized it that way." But I've also read this argument that converting ToLower vs. ToUpper depends on what your strings contain more of, and that typically strings contain more lower case characters which makes ToLower more efficient.
In particular, I would like to know:
- Is there a way to optimize ToUpper or ToLower such that one is faster than the other?
- Is it faster to do a case-insensitive comparison between upper or lower case strings, and why?
- Are there any programming environments (e.g. C, C#, Python, whatever) where one case is clearly better than the other, and why?
String Solutions
Solution 1 - String
Converting to either upper case or lower case in order to do case-insensitive comparisons is incorrect due to "interesting" features of some cultures, particularly Turkey. Instead, use a StringComparer with the appropriate options.
MSDN has some great guidelines on string handling. You might also want to check that your code passes the Turkey test.
EDIT: Note Neil's comment around ordinal case-insensitive comparisons. This whole realm is pretty murky :(
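As a minimal sketch of the StringComparer approach (the class and method names here are made up for illustration, and ordinal versus culture-aware is a choice that depends on what the strings represent):

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class ComparerSketch
{
    // Ordinal comparison that ignores case; no culture rules are consulted.
    public static bool EqualOrdinalIgnoreCase(string a, string b) =>
        StringComparer.OrdinalIgnoreCase.Equals(a, b);

    // Linguistic comparison that ignores case under the given culture's rules.
    public static bool EqualInCulture(string a, string b, CultureInfo culture) =>
        StringComparer.Create(culture, ignoreCase: true).Equals(a, b);

    static void Main()
    {
        // The same comparer object can drive a case-insensitive collection.
        var names = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "Config.xml" };
        Console.WriteLine(names.Contains("CONFIG.XML"));

        Console.WriteLine(EqualOrdinalIgnoreCase("FILE", "file"));
        Console.WriteLine(EqualInCulture("FILE", "file", CultureInfo.InvariantCulture));
    }
}
```

Neither helper creates an upper or lower case copy of its inputs. Ordinal comparers usually suit identifiers, file names and protocol tokens, while culture-aware comparers suit text shown to users.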
Solution 2 - String
From Microsoft on MSDN:
> Best Practices for Using Strings in the .NET Framework
> ================================
>
> Recommendations for String Usage
>
> - Use the String.ToUpperInvariant method instead of the String.ToLowerInvariant method when you normalize strings for comparison.
Why? From Microsoft:
> Normalize strings to uppercase
> ================
>
> There is a small group of characters that when converted to lowercase cannot make a round trip.
What is an example of such a character that cannot make a round trip?
- Start: Greek Rho Symbol (U+03f1) ϱ
- Uppercase: Capital Greek Rho (U+03a1) Ρ
- Lowercase: Small Greek Rho (U+03c1) ρ
> ϱ , Ρ , ρ
Original: ϱ
ToUpper: Ρ
ToLower: ρ
That is why, if you want to do case-insensitive comparisons, you convert the strings to uppercase, and not lowercase.
So if you have to choose one, choose Uppercase.
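The rho example can be checked with a short sketch like the one below. The class and method names are illustrative, and the exact behaviour may depend on the runtime version and its globalization settings:

```csharp
using System;

static class RoundTripSketch
{
    // Compare two chars after normalizing both to uppercase.
    public static bool MatchUpper(char a, char b) =>
        char.ToUpperInvariant(a) == char.ToUpperInvariant(b);

    // Compare two chars after normalizing both to lowercase.
    public static bool MatchLower(char a, char b) =>
        char.ToLowerInvariant(a) == char.ToLowerInvariant(b);

    static void Main()
    {
        char rhoSymbol = '\u03F1'; // Greek rho symbol
        char smallRho = '\u03C1';  // Greek small letter rho

        // Uppercasing is expected to send both to capital rho (U+03A1);
        // lowercasing is expected to leave the rho symbol unchanged.
        Console.WriteLine("upper match: {0}", MatchUpper(rhoSymbol, smallRho));
        Console.WriteLine("lower match: {0}", MatchLower(rhoSymbol, smallRho));
    }
}
```

If the two lines disagree, the pair compares equal only after uppercase normalization, which is the situation the MSDN guidance describes.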
Solution 3 - String
According to MSDN it is more efficient to pass in the strings and tell the comparison to ignore case:
> String.Compare(strA, strB, StringComparison.OrdinalIgnoreCase)
> is equivalent to (but faster than) calling
>
> String.Compare(ToUpperInvariant(strA), ToUpperInvariant(strB), StringComparison.Ordinal).
>
> These comparisons are still very fast.
Of course, if you are comparing one string over and over again then this may not hold.
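A rough sketch of the two forms from the MSDN quote, plus the "compare one string over and over" case where normalizing a fixed key once might pay off. Names are hypothetical, and no timing claims are made here:

```csharp
using System;

static class CompareSketch
{
    // Preferred form: the comparison itself ignores case, no copies are made.
    public static int CompareIgnoreCase(string a, string b) =>
        string.Compare(a, b, StringComparison.OrdinalIgnoreCase);

    // The normalize-then-compare form; allocates two uppercase copies per call.
    public static int CompareViaUpper(string a, string b) =>
        string.Compare(a.ToUpperInvariant(), b.ToUpperInvariant(), StringComparison.Ordinal);

    static void Main()
    {
        Console.WriteLine(CompareIgnoreCase("Hello", "HELLO") == 0);
        Console.WriteLine(Math.Sign(CompareIgnoreCase("apple", "BANANA"))
                          == Math.Sign(CompareViaUpper("apple", "BANANA")));

        // A key that is compared many times can be normalized once up front.
        string key = "Hello".ToUpperInvariant();
        foreach (string candidate in new[] { "hello", "Help", "hELLo" })
            Console.WriteLine(string.Equals(key, candidate.ToUpperInvariant(), StringComparison.Ordinal));
    }
}
```

Only the sign of the result is compared, since the magnitude returned by string.Compare is not something to rely on.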
Solution 4 - String
Based on strings tending to have more lowercase entries, ToLower should theoretically be faster (lots of compares, but few assignments).
In C, or when using individually-accessible elements of each string (such as C strings or the STL's string type in C++), it's actually a byte comparison - so comparing UPPER is no different from lower.
If you were sneaky and loaded your strings into long arrays instead, you'd get a very fast comparison on the whole string because it could compare 4 bytes at a time. However, the load time might make it not worthwhile.
Why do you need to know which is faster? Unless you're doing a metric buttload of comparisons, one running a couple cycles faster is irrelevant to the speed of overall execution, and sounds like premature optimization :)
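To illustrate the "compare several elements at a time" idea in C#, here is a sketch that reinterprets UTF-16 data as 64-bit words, so it checks four chars per step rather than four bytes. It assumes a runtime with Span support and a platform that tolerates unaligned reads. The built-in ordinal comparisons are generally optimized along these lines already, so treat this as an illustration rather than a recommendation:

```csharp
using System;
using System.Runtime.InteropServices;

static class WideCompareSketch
{
    // Ordinal equality that checks four UTF-16 chars (one 64-bit word) per step.
    public static bool EqualsWide(ReadOnlySpan<char> a, ReadOnlySpan<char> b)
    {
        if (a.Length != b.Length) return false;

        // Reinterpret the char data as 64-bit words; leftover chars are excluded.
        ReadOnlySpan<ulong> wordsA = MemoryMarshal.Cast<char, ulong>(a);
        ReadOnlySpan<ulong> wordsB = MemoryMarshal.Cast<char, ulong>(b);
        for (int i = 0; i < wordsA.Length; i++)
            if (wordsA[i] != wordsB[i]) return false;

        // Compare the tail (at most three chars) one element at a time.
        for (int i = wordsA.Length * 4; i < a.Length; i++)
            if (a[i] != b[i]) return false;
        return true;
    }

    static void Main()
    {
        // The comparison cost does not depend on which case the data is in.
        Console.WriteLine(EqualsWide("UPPERCASE", "UPPERCASE"));
        Console.WriteLine(EqualsWide("lowercase", "lowercase"));
        Console.WriteLine(EqualsWide("lowercase", "lowercasE"));
    }
}
```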
Solution 5 - String
Microsoft has optimized ToUpperInvariant(), not ToUpper(). The difference is that the invariant version applies the same culture-independent casing rules regardless of the current culture. If you need to do case-insensitive comparisons on strings that may vary in culture, use Invariant, otherwise the performance of invariant conversion shouldn't matter.
I can't say whether ToUpper() or ToLower() is faster though. I've never tried it since I've never had a situation where performance mattered that much.
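To make the difference between the culture-sensitive and invariant conversions concrete, here is a small sketch. The helper names are invented, and the Turkish branch assumes tr-TR culture data is available to the runtime, which is not guaranteed:

```csharp
using System;
using System.Globalization;

static class InvariantSketch
{
    // Culture-independent uppercase.
    public static string UpperInvariant(string s) => s.ToUpperInvariant();

    // Culture-sensitive uppercase under an explicitly supplied culture.
    public static string UpperIn(string s, CultureInfo culture) => s.ToUpper(culture);

    static void Main()
    {
        Console.WriteLine(UpperInvariant("title"));
        try
        {
            // Assumption: Turkish culture data is installed; otherwise this may throw.
            CultureInfo turkish = CultureInfo.GetCultureInfo("tr-TR");
            string upper = UpperIn("title", turkish);
            Console.WriteLine(upper == "TITLE" ? "same as invariant" : "differs from invariant");
        }
        catch (CultureNotFoundException)
        {
            Console.WriteLine("tr-TR culture not available");
        }
    }
}
```

Under Turkish rules the lowercase i is expected to map to a dotted capital İ, which is why a culture-sensitive ToUpper() can disagree with the invariant result.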
Solution 6 - String
If you are doing string comparison in C# it is significantly faster to use .Equals() instead of converting both strings to upper or lower case. Another big plus for using .Equals() is that more memory isn't allocated for the 2 new upper/lower case strings.
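Presumably the overload that takes a StringComparison is what is meant here, since plain .Equals() is case-sensitive. A minimal sketch, with an illustrative helper name:

```csharp
using System;

static class EqualsSketch
{
    // Case-insensitive equality without creating converted copies of the inputs.
    public static bool SameText(string a, string b) =>
        string.Equals(a, b, StringComparison.OrdinalIgnoreCase);

    static void Main()
    {
        string input = "Yes";

        // Instance overload with an explicit comparison mode.
        Console.WriteLine(input.Equals("YES", StringComparison.OrdinalIgnoreCase));

        // Static overload, wrapped in the helper above.
        Console.WriteLine(SameText(input, "yes"));

        // Equals with no comparison argument is case-sensitive.
        Console.WriteLine(input.Equals("YES"));
    }
}
```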
Solution 7 - String
I wanted some actual data on this, so I pulled the full list of two-byte characters that fail with ToLower or ToUpper. I then ran the test below:
    using System;

    class Program {
        static void Main() {
            // Pairs of characters that should compare equal when case is ignored.
            char[][] pairs = {
                new[]{'\u00E5','\u212B'},new[]{'\u00C5','\u212B'},new[]{'\u0399','\u1FBE'},
                new[]{'\u03B9','\u1FBE'},new[]{'\u03B2','\u03D0'},new[]{'\u03B5','\u03F5'},
                new[]{'\u03B8','\u03D1'},new[]{'\u03B8','\u03F4'},new[]{'\u03D1','\u03F4'},
                new[]{'\u03B9','\u1FBE'},new[]{'\u0345','\u03B9'},new[]{'\u0345','\u1FBE'},
                new[]{'\u03BA','\u03F0'},new[]{'\u00B5','\u03BC'},new[]{'\u03C0','\u03D6'},
                new[]{'\u03C1','\u03F1'},new[]{'\u03C2','\u03C3'},new[]{'\u03C6','\u03D5'},
                new[]{'\u03C9','\u2126'},new[]{'\u0392','\u03D0'},new[]{'\u0395','\u03F5'},
                new[]{'\u03D1','\u03F4'},new[]{'\u0398','\u03D1'},new[]{'\u0398','\u03F4'},
                new[]{'\u0345','\u1FBE'},new[]{'\u0345','\u0399'},new[]{'\u0399','\u1FBE'},
                new[]{'\u039A','\u03F0'},new[]{'\u00B5','\u039C'},new[]{'\u03A0','\u03D6'},
                new[]{'\u03A1','\u03F1'},new[]{'\u03A3','\u03C2'},new[]{'\u03A6','\u03D5'},
                new[]{'\u03A9','\u2126'},new[]{'\u0398','\u03F4'},new[]{'\u03B8','\u03F4'},
                new[]{'\u03B8','\u03D1'},new[]{'\u0398','\u03D1'},new[]{'\u0432','\u1C80'},
                new[]{'\u0434','\u1C81'},new[]{'\u043E','\u1C82'},new[]{'\u0441','\u1C83'},
                new[]{'\u0442','\u1C84'},new[]{'\u0442','\u1C85'},new[]{'\u1C84','\u1C85'},
                new[]{'\u044A','\u1C86'},new[]{'\u0412','\u1C80'},new[]{'\u0414','\u1C81'},
                new[]{'\u041E','\u1C82'},new[]{'\u0421','\u1C83'},new[]{'\u1C84','\u1C85'},
                new[]{'\u0422','\u1C84'},new[]{'\u0422','\u1C85'},new[]{'\u042A','\u1C86'},
                new[]{'\u0463','\u1C87'},new[]{'\u0462','\u1C87'}
            };
            // Count how many pairs match after each kind of normalization.
            int upper = 0, lower = 0;
            foreach (char[] pair in pairs) {
                Console.Write(
                    "U+{0:X4} U+{1:X4} pass: ",
                    Convert.ToInt32(pair[0]),
                    Convert.ToInt32(pair[1])
                );
                if (Char.ToUpper(pair[0]) == Char.ToUpper(pair[1])) {
                    Console.Write("ToUpper ");
                    upper++;
                } else {
                    Console.Write(" ");
                }
                if (Char.ToLower(pair[0]) == Char.ToLower(pair[1])) {
                    Console.Write("ToLower");
                    lower++;
                }
                Console.WriteLine();
            }
            Console.WriteLine("upper pass: {0}, lower pass: {1}", upper, lower);
        }
    }
Results are below. Note that I also tested with the Invariant versions, and the results were exactly the same. Interestingly, one of the pairs fails with both. But based on this, ToUpper is the best option.
U+00E5 U+212B pass: ToLower
U+00C5 U+212B pass: ToLower
U+0399 U+1FBE pass: ToUpper
U+03B9 U+1FBE pass: ToUpper
U+03B2 U+03D0 pass: ToUpper
U+03B5 U+03F5 pass: ToUpper
U+03B8 U+03D1 pass: ToUpper
U+03B8 U+03F4 pass: ToLower
U+03D1 U+03F4 pass:
U+03B9 U+1FBE pass: ToUpper
U+0345 U+03B9 pass: ToUpper
U+0345 U+1FBE pass: ToUpper
U+03BA U+03F0 pass: ToUpper
U+00B5 U+03BC pass: ToUpper
U+03C0 U+03D6 pass: ToUpper
U+03C1 U+03F1 pass: ToUpper
U+03C2 U+03C3 pass: ToUpper
U+03C6 U+03D5 pass: ToUpper
U+03C9 U+2126 pass: ToLower
U+0392 U+03D0 pass: ToUpper
U+0395 U+03F5 pass: ToUpper
U+03D1 U+03F4 pass:
U+0398 U+03D1 pass: ToUpper
U+0398 U+03F4 pass: ToLower
U+0345 U+1FBE pass: ToUpper
U+0345 U+0399 pass: ToUpper
U+0399 U+1FBE pass: ToUpper
U+039A U+03F0 pass: ToUpper
U+00B5 U+039C pass: ToUpper
U+03A0 U+03D6 pass: ToUpper
U+03A1 U+03F1 pass: ToUpper
U+03A3 U+03C2 pass: ToUpper
U+03A6 U+03D5 pass: ToUpper
U+03A9 U+2126 pass: ToLower
U+0398 U+03F4 pass: ToLower
U+03B8 U+03F4 pass: ToLower
U+03B8 U+03D1 pass: ToUpper
U+0398 U+03D1 pass: ToUpper
U+0432 U+1C80 pass: ToUpper
U+0434 U+1C81 pass: ToUpper
U+043E U+1C82 pass: ToUpper
U+0441 U+1C83 pass: ToUpper
U+0442 U+1C84 pass: ToUpper
U+0442 U+1C85 pass: ToUpper
U+1C84 U+1C85 pass: ToUpper
U+044A U+1C86 pass: ToUpper
U+0412 U+1C80 pass: ToUpper
U+0414 U+1C81 pass: ToUpper
U+041E U+1C82 pass: ToUpper
U+0421 U+1C83 pass: ToUpper
U+1C84 U+1C85 pass: ToUpper
U+0422 U+1C84 pass: ToUpper
U+0422 U+1C85 pass: ToUpper
U+042A U+1C86 pass: ToUpper
U+0463 U+1C87 pass: ToUpper
U+0462 U+1C87 pass: ToUpper
upper pass: 46, lower pass: 8
Solution 8 - String
It really shouldn't ever matter. With ASCII characters, it definitely doesn't matter - it's just a few comparisons and a bit flip for either direction. Unicode might be a little more complicated, since there are some characters that change case in weird ways, but there really shouldn't be any difference unless your text is full of those special characters.
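For the ASCII case, the "few comparisons and a bit flip" can be sketched as follows. This handles only A-Z and a-z and leaves every other character untouched, and the names are illustrative:

```csharp
using System;

static class AsciiCaseSketch
{
    // ASCII upper and lower case letters differ only in bit 0x20.
    public static char ToUpperAscii(char c) =>
        (c >= 'a' && c <= 'z') ? (char)(c & ~0x20) : c;

    public static char ToLowerAscii(char c) =>
        (c >= 'A' && c <= 'Z') ? (char)(c | 0x20) : c;

    static void Main()
    {
        Console.WriteLine(ToUpperAscii('q')); // Q
        Console.WriteLine(ToLowerAscii('Q')); // q
        Console.WriteLine(ToUpperAscii('7')); // 7 (non-letters pass through)
    }
}
```

Both directions do the same amount of work: a range check followed by one bit operation.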
Solution 9 - String
Done right, there should be a small, insignificant speed advantage to converting to lower case, but this is, as many have hinted, culture dependent and is not inherent in the function but in the strings you convert (lots of lower case letters means few assignments to memory). Converting to upper case is faster if you have a string with lots of upper case letters.
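The "few assignments" argument can be modelled by counting how many characters each conversion would actually have to change. This is an ASCII-only sketch with made-up names. A real ToLower or ToUpper may copy the whole string regardless, so the counts only illustrate the reasoning:

```csharp
using System;

static class WriteCountSketch
{
    // Characters a lowercase conversion would have to change (ASCII only).
    public static int WritesForLower(string s)
    {
        int writes = 0;
        foreach (char c in s)
            if (c >= 'A' && c <= 'Z') writes++;
        return writes;
    }

    // Characters an uppercase conversion would have to change (ASCII only).
    public static int WritesForUpper(string s)
    {
        int writes = 0;
        foreach (char c in s)
            if (c >= 'a' && c <= 'z') writes++;
        return writes;
    }

    static void Main()
    {
        string text = "Hello World";
        Console.WriteLine("lower: {0} writes, upper: {1} writes",
            WritesForLower(text), WritesForUpper(text));
    }
}
```

For "Hello World" the lowercase conversion changes 2 characters and the uppercase conversion changes 8, which is the asymmetry the argument relies on for typical mostly-lowercase text.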