How are strings compared?

PythonStringComparison

Python Problem Overview


I'm wondering how Python does string comparison, more specifically how it determines the outcome when a less than < or greater than > operator is used.

For instance if I put print('abc' < 'bac') I get True. I understand that it compares corresponding characters in the string, however its unclear as to why there is more, for lack of a better term, "weight" placed on the fact that a is less thanb (first position) in first string rather than the fact that a is less than b in the second string (second position).

Python Solutions


Solution 1 - Python

From the docs:

> The comparison uses lexicographical > ordering: first the first two items > are compared, and if they differ this > determines the outcome of the > comparison; if they are equal, the > next two items are compared, and so > on, until either sequence is > exhausted.

Also:

> Lexicographical ordering for strings uses the Unicode code point number to order individual characters.

or on Python 2:

> Lexicographical ordering for strings uses the ASCII ordering for individual characters.

As an example:

>>> 'abc' > 'bac'
False
>>> ord('a'), ord('b')
(97, 98)

The result False is returned as soon as a is found to be less than b. The further items are not compared (as you can see for the second items: b > a is True).

Be aware of lower and uppercase:

>>> [(x, ord(x)) for x in abc]
[('a', 97), ('b', 98), ('c', 99), ('d', 100), ('e', 101), ('f', 102), ('g', 103), ('h', 104), ('i', 105), ('j', 106), ('k', 107), ('l', 108), ('m', 109), ('n', 110), ('o', 111), ('p', 112), ('q', 113), ('r', 114), ('s', 115), ('t', 116), ('u', 117), ('v', 118), ('w', 119), ('x', 120), ('y', 121), ('z', 122)]
>>> [(x, ord(x)) for x in abc.upper()]
[('A', 65), ('B', 66), ('C', 67), ('D', 68), ('E', 69), ('F', 70), ('G', 71), ('H', 72), ('I', 73), ('J', 74), ('K', 75), ('L', 76), ('M', 77), ('N', 78), ('O', 79), ('P', 80), ('Q', 81), ('R', 82), ('S', 83), ('T', 84), ('U', 85), ('V', 86), ('W', 87), ('X', 88), ('Y', 89), ('Z', 90)]

Solution 2 - Python

Python string comparison is lexicographic:

From Python Docs: http://docs.python.org/reference/expressions.html

>Strings are compared lexicographically using the numeric equivalents (the result of the built-in function ord()) of their characters. Unicode and 8-bit strings are fully interoperable in this behavior.

Hence in your example, 'abc' < 'bac', 'a' comes before (less-than) 'b' numerically (in ASCII and Unicode representations), so the comparison ends right there.

Solution 3 - Python

Python and just about every other computer language use the same principles as (I hope) you would use when finding a word in a printed dictionary:

(1) Depending on the human language involved, you have a notion of character ordering: 'a' < 'b' < 'c' etc

(2) First character has more weight than second character: 'az' < 'za' (whether the language is written left-to-right or right-to-left or boustrophedon is quite irrelevant)

(3) If you run out of characters to test, the shorter string is less than the longer string: 'foo' < 'food'

Typically, in a computer language the "notion of character ordering" is rather primitive: each character has a human-language-independent number ord(character) and characters are compared and sorted using that number. Often that ordering is not appropriate to the human language of the user, and then you need to get into "collating", a fun topic.

Solution 4 - Python

Take a look also at https://stackoverflow.com/questions/1097908/how-do-i-sort-unicode-strings-alphabetically-in-python where the discussion is about sorting rules given by the Unicode Collation Algorithm (http://www.unicode.org/reports/tr10/).

To reply to the comment

> What? How else can ordering be defined other than left-to-right?

by S.Lott, there is a famous counter-example when sorting French language. It involves accents: indeed, one could say that, in French, letters are sorted left-to-right and accents right-to-left. Here is the counter-example: we have e < é and o < ô, so you would expect the words cote, coté, côte, côté to be sorted as cote < coté < côte < côté. Well, this is not what happens, in fact you have: cote < côte < coté < côté, i.e., if we remove "c" and "t", we get oe < ôe < oé < ôé, which is exactly right-to-left ordering.

And a last remark: you shouldn't be talking about left-to-right and right-to-left sorting but rather about forward and backward sorting.

Indeed there are languages written from right to left and if you think Arabic and Hebrew are sorted right-to-left you may be right from a graphical point of view, but you are wrong on the logical level!

Indeed, Unicode considers character strings encoded in logical order, and writing direction is a phenomenon occurring on the glyph level. In other words, even if in the word שלום the letter shin appears on the right of the lamed, logically it occurs before it. To sort this word one will first consider the shin, then the lamed, then the vav, then the mem, and this is forward ordering (although Hebrew is written right-to-left), while French accents are sorted backwards (although French is written left-to-right).

Solution 5 - Python

This is a lexicographical ordering. It just puts things in dictionary order.

Solution 6 - Python

A pure Python equivalent for string comparisons would be:

def less(string1, string2):
    # Compare character by character
    for idx in range(min(len(string1), len(string2))):
        # Get the "value" of the character
        ordinal1, ordinal2 = ord(string1[idx]), ord(string2[idx])
        # If the "value" is identical check the next characters
        if ordinal1 == ordinal2:
            continue
        # It's not equal so we're finished at this index and can evaluate which is smaller.
        else:
            return ordinal1 < ordinal2
    # We're out of characters and all were equal, so the result depends on the length
    # of the strings.
    return len(string1) < len(string2)

This function does the equivalent of the real method (Python 3.6 and Python 2.7) just a lot slower. Also note that the implementation isn't exactly "pythonic" and only works for < comparisons. It's just to illustrate how it works. I haven't checked if it works like Pythons comparison for combined unicode characters.

A more general variant would be:

from operator import lt, gt

def compare(string1, string2, less=True):
    op = lt if less else gt
    for char1, char2 in zip(string1, string2):
        ordinal1, ordinal2 = ord(char1), ord(char1)
        if ordinal1 == ordinal2:
            continue
        else:
            return op(ordinal1, ordinal2)
    return op(len(string1), len(string2))

Solution 7 - Python

Strings are compared lexicographically using the numeric equivalents (the result of the built-in function ord()) of their characters. Unicode and 8-bit strings are fully interoperable in this behavior.

Solution 8 - Python

The comparison uses lexicographical ordering

For Python 3: Convert it into Unicode to get its order or number that is being used as "weightage" here For Python 2: Use ASCII to get its order or number that is being used as "weightage" here

People already answered it in detail, I am leaving it here for anyone who is in a hurry :)

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestiondaveluptView Question on Stackoverflow
Solution 1 - Pythonuser225312View Answer on Stackoverflow
Solution 2 - Python逆さまView Answer on Stackoverflow
Solution 3 - PythonJohn MachinView Answer on Stackoverflow
Solution 4 - PythonyannisView Answer on Stackoverflow
Solution 5 - PythonMichael J. BarberView Answer on Stackoverflow
Solution 6 - PythonMSeifertView Answer on Stackoverflow
Solution 7 - PythonSenthil KumaranView Answer on Stackoverflow
Solution 8 - Python0xWick.ethView Answer on Stackoverflow