What does 256 mean for the 128 unique characters in the ASCII table?

Ascii

Ascii Problem Overview


If I need to check whether a string has all unique characters, and we are only considering characters in the ASCII table, I understand that there are 128 of them.

However, why do we need a boolean array of size 256 to hold 128 characters when checking whether a character has already appeared in the string? Shouldn't a boolean array of size 128 be sufficient?

Here's a quote from the book "Cracking the Coding Interview":

if (str.length() > 128) return false;
boolean[] char_set = new boolean[256]; //which is strange since it clearly says over 128 its false

.....

Ascii Solutions


Solution 1 - Ascii

Basically, we mostly use only 128 characters when programming. Strictly speaking, standard ASCII is a 7-bit code that defines exactly those 128 characters (0 to 127); tables that list 256 characters (0 to 255) describe an 8-bit "extended ASCII" set. Codes 0 to 31 (32 characters in total) are the ASCII control characters. Codes 32 to 126 are the ASCII printable characters, and 127 is DEL, one more control character. Codes 128 to 255 are the extended ASCII codes.

Reference: http://www.ascii-code.com/

Most of the extended ASCII characters aren't present on a QWERTY (English) keyboard, which is why the author assumed 128 characters in total in that example from "Cracking the Coding Interview".
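The ranges described in this answer can be sketched as a small classifier. This is only an illustration: the class name `AsciiRanges` and the method `classify` are made up for the example, and it treats 127 (DEL) as a control character rather than a printable one.

```java
class AsciiRanges {
    // Classifies an 8-bit character code by the range it falls into.
    static String classify(int code) {
        if (code < 0 || code > 255) return "outside the 8-bit range";
        if (code <= 31 || code == 127) return "ASCII control character";
        if (code <= 126) return "ASCII printable character";
        return "extended (not part of 7-bit ASCII)";
    }

    public static void main(String[] args) {
        System.out.println(classify('A'));   // ASCII printable character
        System.out.println(classify('\n'));  // ASCII control character
        System.out.println(classify(0xE9));  // extended (not part of 7-bit ASCII)
    }
}
```

Only the first three branches describe ASCII itself; what a code in 128-255 actually means depends on which 8-bit character set is in use.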

Solution 2 - Ascii

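No. There are 256 characters if you count extended ASCII: the standard ASCII characters (0-127) plus the extended ASCII characters (128-255). Only the first 128 are ASCII proper; the upper half depends on which 8-bit extension is in use.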

For more information, please refer to: http://www.flexcomm.com/library/ASCII256.htm

Solution 3 - Ascii

Many people these days use the term "ASCII" in a sloppy fashion to describe ISO-8859-1 (also known as Latin-1), a character set that includes the [32 .. 126] printable-character values in the old-timey ASCII character set and also values in the range [128..255]. Latin-1 does a reasonably good job of covering Western European languages, whereas ASCII is limited to the non-accented characters used in basic English.

ASCII also includes control characters in the range [0..31] and at 127. These don't represent printable characters (although Unicode provides characters at those positions). They are return, linefeed, tab, ctrl-c, formfeed and the like. Some of them are holdovers from the olden days of teletype and telex machines.

[Image: Teletype with 8-channel paper tape punch and reader]

Notice how the paper tape has eight bit positions in each frame. Those are the bits of ASCII / Latin-1. "Delete" aka Rubout is 127 or 0111 1111. Why? Because it was possible to punch out all seven holes in the tape and so rub out a character.

That may account for the suggestion someone made to use a 256-position array to tabulate text in that kind of character set.
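To make the control codes mentioned in this answer concrete, here is a small sketch that prints a few of them with their bit patterns. The class name `ControlCodes` and the helper `describe` are invented for the illustration.

```java
class ControlCodes {
    // Formats a character as "<decimal> = <8-bit binary>".
    static String describe(char c) {
        String bits = String.format("%8s", Integer.toBinaryString(c)).replace(' ', '0');
        return (int) c + " = " + bits;
    }

    public static void main(String[] args) {
        System.out.println("ctrl-c   " + describe((char) 3));   // 3 = 00000011
        System.out.println("tab      " + describe('\t'));       // 9 = 00001001
        System.out.println("linefeed " + describe('\n'));       // 10 = 00001010
        System.out.println("formfeed " + describe('\f'));       // 12 = 00001100
        System.out.println("return   " + describe('\r'));       // 13 = 00001101
        System.out.println("delete   " + describe((char) 127)); // 127 = 01111111
    }
}
```

Every value fits in the low seven bits, and Delete is the one code with all seven set.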

Solution 4 - Ascii

I believe the use of 128 and 256 in the same function is a mistake in that book edition. In the newer 6th edition (2016), the code example states:

if (str.length() > 128) return false;
boolean[] char_set = new boolean[128];

and the author adds the comment:

> It's OK to assume 256 characters. This would be the case in extended ASCII.

So, use either 128 or 256, not both, for that book exercise.
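For reference, here is a minimal sketch of the whole check with a single, consistent size of 128. It assumes the input is plain 7-bit ASCII. The class name `UniqueChars`, the constant `ALPHABET_SIZE`, and the decision to throw on non-ASCII input are illustrative choices, not the book's code.

```java
class UniqueChars {
    // Assumption: input is 7-bit ASCII, so every char value is in 0..127.
    static final int ALPHABET_SIZE = 128;

    static boolean isUniqueChars(String str) {
        // Pigeonhole: more characters than distinct values forces a repeat.
        if (str.length() > ALPHABET_SIZE) return false;

        boolean[] seen = new boolean[ALPHABET_SIZE];
        for (int i = 0; i < str.length(); i++) {
            int code = str.charAt(i);
            if (code >= ALPHABET_SIZE) {
                throw new IllegalArgumentException("non-ASCII character at index " + i);
            }
            if (seen[code]) return false; // second occurrence of this character
            seen[code] = true;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isUniqueChars("abcdef")); // true
        System.out.println(isUniqueChars("hello"));  // false
    }
}
```

Switching to extended ASCII would mean changing `ALPHABET_SIZE` to 256 in one place, which keeps the length check and the array size in step.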

Solution 5 - Ascii

The author probably confused characters and bytes. You should also understand the related concept of encoding.

A byte is eight bits. A byte was traditionally often used to store a character, though very early computers only required 7 bits to store a character. The ASCII standard for encoding characters in 7 bits was ratified in 1963, though at the time there were also competing character encodings (of which EBCDIC still survives to this day).

When you only use 7 of the available 8 bits, you might have ideas for what to do with the spare bit. One of the common approaches was to encode additional non-standard characters which were not available in the ASCII standard. A large number of legacy 8-bit encodings have been defined, some of which have been published as standards as well. Some are still in popular use; some examples are ISO-8859-1 (aka Latin-1) and the Windows code pages (437, 850, and 1252 are still in common use in Western countries, despite their many drawbacks). Many of them are "extended ASCII" encodings which are compatible with ASCII in the first 128 characters; though the term "extended ASCII" is not really technically well-defined.

If you are processing a sequence of bytes, you do want to be able to cope with byte values in the range 0-255 and not just the ones which are defined in ASCII. On the other hand, if you have guarantees that none of the bytes you are going to process will have values above 127 (such as, for example, if your input is known to be ASCII because it comes from a source which is incapable of producing anything else), it is excessive to reserve room for values you know you are not going to need.
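When the input is raw bytes rather than known ASCII, a 256-slot table covers every possible value. The following is a sketch under that assumption; `ByteTally` and `hasUniqueBytes` are hypothetical names used only for this example.

```java
class ByteTally {
    // Returns true if no byte value occurs more than once in data.
    static boolean hasUniqueBytes(byte[] data) {
        // Only 256 distinct byte values exist, so a longer input must repeat one.
        if (data.length > 256) return false;

        boolean[] seen = new boolean[256];
        for (byte b : data) {
            int value = b & 0xFF; // Java bytes are signed; map -128..127 to 0..255
            if (seen[value]) return false;
            seen[value] = true;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] unique = {0x63, 0x61, 0x66, (byte) 0xE9};  // "café" in Latin-1
        byte[] repeated = {(byte) 0xE9, 0x41, (byte) 0xE9};
        System.out.println(hasUniqueBytes(unique));   // true
        System.out.println(hasUniqueBytes(repeated)); // false
    }
}
```

The `& 0xFF` mask matters in Java specifically: without it, any byte above 127 would become a negative array index.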

Going forward, most modern systems use Unicode in one form or another. On Windows, and apparently still in Java, you should expect UTF-16; elsewhere, UTF-8 is rapidly becoming the de facto standard. Both of these require your code to be able to handle 8-bit bytes cleanly, though the code points are not (necessarily, in UTF-8, or ever, in UTF-16) encoded in a single byte.
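The point that a character is not necessarily one byte can be checked directly in Java. This is a small sketch; the class name `EncodingSizes` and its helper methods are made up for the example, and the sample characters are written as Unicode escapes so the source file encoding does not matter.

```java
class EncodingSizes {
    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(java.nio.charset.StandardCharsets.UTF_8).length;
    }

    // Number of 16-bit code units in Java's internal UTF-16 view of the string.
    static int utf16UnitCount(String s) {
        return s.length();
    }

    // Number of Unicode code points, regardless of encoding.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // "A", e-acute, euro sign, and an emoji outside the Basic Multilingual Plane.
        String[] samples = {"A", "\u00e9", "\u20ac", "\uD83D\uDE00"};
        for (String s : samples) {
            System.out.println(codePointCount(s) + " code point(s), "
                    + utf16UnitCount(s) + " UTF-16 unit(s), "
                    + utf8Length(s) + " UTF-8 byte(s)");
        }
    }
}
```

Each sample is a single code point, yet the UTF-8 sizes are 1, 2, 3, and 4 bytes, and the last one takes two UTF-16 units. A table indexed by `charAt` values would therefore need more than 256 slots for general Java strings.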

As for the code you posted, you are correct that 128 character positions is enough if you discard any byte whose value is larger than 127. On the other hand, depending on what data you expect to process, discarding non-ASCII characters may not at all be the right thing to do; and then, if you don't discard anything, you do need to handle all 256.

Either way, if you only discard values larger than 128, you need 129 positions in the array (there are 129 integers in the range 0 through 128). This is probably just a silly off-by-one bug.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

| Content Type | Original Author | Original Content on Stackoverflow |
| --- | --- | --- |
| Question | Lydia | View Question on Stackoverflow |
| Solution 1 - Ascii | MOHIT M SHARMA | View Answer on Stackoverflow |
| Solution 2 - Ascii | ABU ABDULLAH JAVEED ASHRAF | View Answer on Stackoverflow |
| Solution 3 - Ascii | O. Jones | View Answer on Stackoverflow |
| Solution 4 - Ascii | onosendai | View Answer on Stackoverflow |
| Solution 5 - Ascii | tripleee | View Answer on Stackoverflow |