Detecting if a string has unique characters: comparing my solution to "Cracking the Coding Interview?"

JavaStringAlgorithmBig OTime Complexity

Java Problem Overview


I am working through the book "Cracking the Coding Interview" and I have come across questions here asking for answers, but I need help comparing my answer to the solution. My algorithm works, but I am having difficulty understand the solution in the book. Mainly because I don't understand what some of the operators are really doing.

The task is: "Implement an algorithm to determine if a string has all unique characters. What if you cannot use additional data structures?"

This is my solution:

public static boolean checkForUnique(String str){
	boolean containsUnique = false;

	for(char c : str.toCharArray()){
		if(str.indexOf(c) == str.lastIndexOf(c)){
			containsUnique = true;
		} else {
			containsUnique = false;
		}
	}

	return containsUnique;
}

It works, but how efficient is this? I saw that the complexity of the index functions for String in Java are O(n*m)

Here is the solution from the book:

public static boolean isUniqueChars(String str) {
	if (str.length() > 256) {
		return false;
	}
	int checker = 0;
	for (int i = 0; i < str.length(); i++) {
		int val = str.charAt(i) - 'a';
		if ((checker & (1 << val)) > 0) return false;
		checker |= (1 << val);
	}
	return true;
}

A couple things I am not quite understanding with the solution. First, what does the "|=" operator do? Why is 'a' subtracted from the current character in the string for the value of "val"? I know "<<" is a bitwise left shift, but what does (checker & (1<<val)) do? I know it is bitwise and, but I am not understanding it since I am not understanding the line where checker gets a value.

I am just not familiar with these operations and unfortunately the book does not give an explanation of the solutions, probably because it assumes you already understand these operations.

Java Solutions


Solution 1 - Java

There are two separate questions here: what's the efficiency of your solution, and what is the reference solution doing? Let's treat each independently.

First, your solution:

public static boolean checkForUnique(String str){
    boolean containsUnique = false;

    for(char c : str.toCharArray()){
        if(str.indexOf(c) == str.lastIndexOf(c)){
            containsUnique = true;
        } else {
            containsUnique = false;
        }
    }

    return containsUnique;
}

Your solution essentially consists of a loop over all characters in the string (let's say there are n of them), checking on each iteration whether the first and last index of the characters are the same. The indexOf and lastIndexOf methods each take time O(n), because they have to scan across all the characters of the string to determine if any of them match the one you're looking for. Therefore, since your loop runs O(n) times and does O(n) work per iteration, its runtime is O(n2).

However, there's something iffy about your code. Try running it on the string aab. Does it work correctly on this input? As a hint, as soon as you determine that there are two or more duplicated characters, you're guaranteed that there are duplicates and you can return that not all characters are unique.

Now, let's look at the reference:

public static boolean isUniqueChars(String str) {
    if (str.length() > 256) { // NOTE: Are you sure this isn't 26?
        return false;
    }
    int checker = 0;
    for (int i = 0; i < str.length(); i++) {
        int val = str.charAt(i) - 'a';
        if ((checker & (1 << val)) > 0) return false;
        checker |= (1 << val);
    }
    return true;
}

This solution is cute. The basic idea is the following: imagine that you have an array of 26 booleans, each one tracking whether a particular character has appeared in the string already. You start with all of them false. You then iterate across the characters of the string, and each time you see a character you look into the array slot for that character. If it's false, this is the first time you've seen the character and you can set the slot to true. If it's true, you've already seen this character and you can immediately report that there's a duplicate.

Notice that this method doesn't allocate an array of booleans. Instead, it opts for a clever trick. Since there are only 26 different characters possible and there are 32 bits in an int, the solution creates an int variable where each bit of the variable corresponds to one of the characters in the string. Instead of reading and writing an array, the solution reads and writes the bits of the number.

For example, look at this line:

if ((checker & (1 << val)) > 0) return false;

What does checker & (1 << val) do? Well, 1 << val creates an int value that has all bits zero except for the valth bit. It then uses bitwise AND to AND this value with checker. If the bit at position val in checker is already set, then this evaluates to a nonzero value (meaning we've already seen the number) and we can return false. Otherwise, it evaluates to 0, and we haven't seen the number.

The next line is this:

checker |= (1 << val);

This uses the "bitwise OR with assignment" operator, which is equivalent to

checker = checker | (1 << val);

This ORs checker with a value that has a 1 bit set only at position val, which turns the bit on. It's equivalent to setting the valth bit of the number to 1.

This approach is much faster than yours. First, since the function starts off by checking if the string has length greater than 26 (I'm assuming the 256 is a typo), the function never has to test any string of length 27 or greater. Therefore, the inner loop runs at most 26 times. Each iteration does O(1) work in bitwise operations, so the overall work done is O(1) (O(1) iterations times O(1) work per iteration), which is significantly faster than your implementation.

If you haven't seen bitwise operations used this way, I'd recommend searching for "bitwise operators" on Google to learn more.

Hope this helps!

Solution 2 - Java

The book solution is one I don't like and I believe is dysfunctional..... templatetypedef has posted a comprehensive answer which indicates that the solution is a good one. I disagree, since the book's answer assumes that the string only has lower-case characters, (ascii) and does no validation to ensure that.

public static boolean isUniqueChars(String str) {
    // short circuit - supposed to imply that
    // there are no more than 256 different characters.
    // this is broken, because in Java, char's are Unicode,
    // and 2-byte values so there are 32768 values
    // (or so - technically not all 32768 are valid chars)
    if (str.length() > 256) {
        return false;
    }
    // checker is used as a bitmap to indicate which characters
    // have been seen already
    int checker = 0;
    for (int i = 0; i < str.length(); i++) {
        // set val to be the difference between the char at i and 'a'
        // unicode 'a' is 97
        // if you have an upper-case letter e.g. 'A' you will get a
        // negative 'val' which is illegal
        int val = str.charAt(i) - 'a';
        // if this lowercase letter has been seen before, then
        // the corresponding bit in checker will have been set and
        // we can exit immediately.
        if ((checker & (1 << val)) > 0) return false;
        // set the bit to indicate we have now seen the letter.
        checker |= (1 << val);
    }
    // none of the characters has been seen more than once.
    return true;
}

The bottom line, given templatedef's answer too, is that there's not actually enough information to determine whether the book's answer is right.

I distrust it though.

templatedef's answer on the complexity is one I agree with though.... ;-)

EDIT: As an exercise, I converted the book's answer to one that will work (albeit slower than the book's answer - BigInteger is slow).... This version does the same logic as the book's, but does not have the same validation and assumption problems (but it is slower). It is useful to show the logic too.

public static boolean isUniqueChars(String str) {
    if (str.length() > 32768) {
        return false;
    }
    BigInteger checker = new BigInteger(0);
    for (int i = 0; i < str.length(); i++) {
        int val = str.charAt(i);
        if (checker.testBit(val)) return false;
        checker = checker.setBit(val);
    }
    // none of the characters has been seen more than once.
    return true;
}

Solution 3 - Java

Since a char value can hold one of only 256 different values, any string that's longer than 256 characters must contain at least one duplicate.

The remainder of the code uses checker as a sequence of bits, with each bit representing one character. It seems to convert each character to an integer, starting with a = 1. It then checks the corresponding bit in checker. If it's set, it means that character has already been seen, and we therefore know that the string contains at least one duplicate character. If the character hasn't yet been seen, the code sets the corresponding bit in checker and continues.

Specifically, (1<<val) generates an integer with a single 1 bit in position val. For example, (1<<3) would be binary 1000, or 8. The expression checker & (1<<val) will return zero if the bit in position val is not set (that is, has value 0) in checker, and (1<<val), which is always non-zero, if that bit is set. The expression checker |= (1<<val) will set that bit in checker.

However, the algorithm seems to be flawed: it doesn't seem to account for the uppercase characters and punctuation (which generally come before the lowercase ones lexicographically). It would also seem to require a 256-bit integer, which is not standard.

As rolfl mentions in the comment below, I prefer your solution because it works. You can optimize it by returning false as soon as you identify a non-unique character.

Solution 4 - Java

The solution from the book is case insensitive. 'A' and 'a' is considered duplicate as per the implementation.

Explanation: for input string with char 'A', 'A' - 'a' is -32 so '1 << val' will be evaluated as 1 << -32. shift on any negative number will shift the bits in opposite direction. thus 1 << -32 will be 1 >> 32. Which will set the first bit to 1. This is also the case with char 'a'. Thus 'A' and 'a' are considered duplicate characters. Similarly for 'B' and 'b' second bit is set to 1 and so on.

Solution 5 - Java

6th edition update

    public static void main(String[] args) {
        System.out.println(isUniqueChars("abcdmc")); // false
        System.out.println(isUniqueChars("abcdm")); // true
        System.out.println(isUniqueChars("abcdm\u0061")); // false because \u0061 is unicode a
    }


    public static boolean isUniqueChars(String str) {
        /*
         You should first ask your interviewer if the string is an ASCII string or a Unicode string.
         Asking this question will show an eye for detail and a solid foundation in computer science.
         We'll assume for simplicity the character set is ASCII.
         If this assumption is not valid, we would need to increase the storage size.
         */
        // at 6th edition of the book, there is no pre condition on string's length
        /*
         We can reduce our space usage by a factor of eight by using a bit vector.
         We will assume, in the below code, that the string only uses the lowercase letters a through z.
         This will allow us to use just a single int.
          */
        // printing header to provide nice csv format log, you may uncomment
//        System.out.println("char,val,valBinaryString,leftShift,leftShiftBinaryString,checker");
        int checker = 0;
        for (int i = 0; i < str.length(); i++) {
            /*
                Dec Binary Character
                97	01100001	a
                98	01100010	b
                99	01100011	c
                100	01100100	d
                101	01100101	e
                102	01100110	f
                103	01100111	g
                104	01101000	h
                105	01101001	i
                106	01101010	j
                107	01101011	k
                108	01101100	l
                109	01101101	m
                110	01101110	n
                111	01101111	o
                112	01110000	p
                113	01110001	q
                114	01110010	r
                115	01110011	s
                116	01110100	t
                117	01110101	u
                118	01110110	v
                119	01110111	w
                120	01111000	x
                121	01111001	y
                122	01111010	z
             */
            // a = 97 as you can see in ASCII table above
            // set val to be the difference between the char at i and 'a'
            // b = 1, d = 3.. z = 25
            char c = str.charAt(i);
            int val = c - 'a';
            // means "shift 1 val numbers places to the left"
            // for example; if str.charAt(i) is "m", which is the 13th letter, 109 (g in ASCII) minus 97 equals 12
            // it returns 1 and 12 zeros = 1000000000000 (which is also the number 4096)
            int leftShift = 1 << val;
            /*
                An integer is represented as a sequence of bits in memory.
                For interaction with humans, the computer has to display it as decimal digits, but all the calculations
                are carried out as binary.
                123 in decimal is stored as 1111011 in memory.

                The & operator is a bitwise "And".
                The result is the bits that are turned on in both numbers.

                1001 & 1100 = 1000, since only the first bit is turned on in both.

                It will be nicer to look like this

                1001 &
                1100
                =
                1000

                Note that ones only appear in a place when both arguments have a one in that place.

             */
            int bitWiseAND = checker & leftShift;
            String leftShiftBinaryString = Integer.toBinaryString(leftShift);
            String checkerBinaryString = leftPad(Integer.toBinaryString(checker), leftShiftBinaryString.length());
            String leftShiftBinaryStringWithPad = leftPad(leftShiftBinaryString, checkerBinaryString.length());
//            System.out.printf("%s &\n%s\n=\n%s\n\n", checkerBinaryString, leftShiftBinaryStringWithPad, Integer.toBinaryString(bitWiseAND));
            /*
            in our example with string "abcdmc"
            0 &
            1
            =
            0

            01 &
            10
            =
            0

            011 &
            100
            =
            0

            0111 &
            1000
            =
            0

            0000000001111 &
            1000000000000
            =
            0

            1000000001111 &
            0000000000100
            =
            100
             */
//            System.out.println(c + "," + val + "," + Integer.toBinaryString(val) + "," + leftShift + "," + Integer.toBinaryString(leftShift) + "," + checker);
            /*
            char val valBinaryString leftShift leftShiftBinaryString checker
            a	0	    0	            1	    1	                    0
            b	1	    1	            2	    10	                    1
            c	2	    10	            4	    100	                    3
            d	3	    11	            8	    1000	                7
            m	12	    1100	        4096	1000000000000	        15
            c	2	    10	            4	    100	                    4111
             */
            if (bitWiseAND > 0) {
                return false;
            }
            // setting 1 on on the left shift
            /*
            0000000001111 |
            1000000000000
            =
            1000000001111
             */
            checker = checker | leftShift;
        }
        return true;
        /*
        If we can't use additional data structures, we can do the following:
        1. Compare every character of the string to every other character of the string.
            This will take 0( n 2 ) time and 0(1) space
        2. If we are allowed to modify the input string, we could sort the string in O(n log(n)) time and then linearly
            check the string for neighboring characters that are identical.
            Careful, though: many sorting algorithms take up extra space.

        These solutions are not as optimal in some respects, but might be better depending on the constraints of the problem.
         */
    }

    private static String leftPad(String s, int i) {
        StringBuilder sb = new StringBuilder(s);
        int charsToGo = i - sb.length();
        while (charsToGo > 0) {
            sb.insert(0, '0');
            charsToGo--;
        }
        return sb.toString();
    }

Solution 6 - Java

As referenced from 'Cracking the Coding Interview', an alternative solution exists:

boolean isUniqueChars(String str) {
  if(str.length() > 128) return false;

  boolean[] char_set = new boolean[128];
  for(int i = 0; i < str.length(); i++) {
    int val = str.charAt(i);
    
    if(char_set[val]) {
      return false;
    }
    char_set[val] = true;
  }
  return true;
}

Of course, to achieve better space complexity, please refer to the above example by @templatetypedef

Solution 7 - Java

This is the necessary correction to the book's code:

public static boolean checkForUnique(String str){
    boolean containsUnique = false;

    for(char c : str.toCharArray()){
        if(str.indexOf(c) == str.lastIndexOf(c)){
            containsUnique = true;
        } else {
            return false;
        }
    }

    return containsUnique;
}

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionSeephorView Question on Stackoverflow
Solution 1 - JavatemplatetypedefView Answer on Stackoverflow
Solution 2 - JavarolflView Answer on Stackoverflow
Solution 3 - JavaAdam LissView Answer on Stackoverflow
Solution 4 - Javaallen josephView Answer on Stackoverflow
Solution 5 - JavawhoopdedooView Answer on Stackoverflow
Solution 6 - JavaasusView Answer on Stackoverflow
Solution 7 - JavawolfieView Answer on Stackoverflow