Least used delimiter character in normal text < ASCII 128

Ascii, Delimiter, Csv

Ascii Problem Overview


For coding reasons which would horrify you (I'm too embarrassed to say), I need to store a number of text items in a single string.

I will delimit them using a character.

Which character is best to use for this, i.e. which character is the least likely to appear in the text? Must be printable and probably less than 128 in ASCII to avoid locale issues.

Ascii Solutions


Solution 1 - Ascii

I would choose "Unit Separator" ASCII code "US": ASCII 31 (0x1F)

In the old, old days, most things were done serially, without random access. This meant that a few control codes were embedded into ASCII.

ASCII 28 (0x1C) File Separator - Used to indicate separation between files on a data input stream.
ASCII 29 (0x1D) Group Separator - Used to indicate separation between tables on a data input stream (called groups back then).
ASCII 30 (0x1E) Record Separator - Used to indicate separation between records within a table (within a group).  These roughly map to a tuple in modern nomenclature.
ASCII 31 (0x1F) Unit Separator - Used to indicate separation between units within a record. These roughly map to fields in modern nomenclature.

Unit Separator is part of ASCII, and Unicode provides a symbol for displaying it (U+241F, typically a small "US" drawn as a single glyph), but many fonts don't include it.

If you must display it, I would recommend displaying it in-application, after it was parsed into fields.
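As a rough illustration of this approach, here is a minimal Python sketch that joins items with the Unit Separator and splits them again. The helper names pack and unpack are made up for this example, and it assumes the items themselves never legitimately contain 0x1F (it raises an error if one does).

```python
US = "\x1f"  # ASCII 31, Unit Separator

def pack(items):
    """Join text items into one string, refusing items that contain US."""
    for item in items:
        if US in item:
            raise ValueError("item contains the unit separator")
    return US.join(items)

def unpack(packed):
    """Split a packed string back into its items."""
    return packed.split(US)

fields = ["alpha, beta", "pipe | caret ^", "tilde ~"]
packed = pack(fields)
restored = unpack(packed)
```

Commas, pipes, carets and tildes inside the items need no escaping here, because none of them is the delimiter.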

Solution 2 - Ascii

Assuming for some embarrassing reason you can't use CSV I'd say go with the data. Take some sample data, and do a simple character count for each value 0-127. Choose one of the ones which doesn't occur. If there is too much choice get a bigger data set. It won't take much time to write, and you'll get the answer best for you.
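The character count described above could look something like the following Python sketch. The function name unused_ascii and the sample text are invented for illustration; you would feed it your own data.

```python
from collections import Counter

def unused_ascii(sample, printable_only=True):
    """Return the ASCII characters that never occur in the sample text."""
    counts = Counter(sample)
    # 33..126 are the visible ASCII characters; 0..127 is the full range.
    codes = range(33, 127) if printable_only else range(128)
    return [chr(code) for code in codes if counts[chr(code)] == 0]

# Hypothetical sample; replace it with a representative chunk of real data.
sample_text = "Total: 3 items (a, b & c) cost $4.50 + tax; see notes_v2.txt!"
candidates = unused_ascii(sample_text)
```

A character missing from the sample is only a candidate, not a guarantee, which is why a bigger data set gives a more trustworthy answer.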

The answer will be different for different problem domains: | (pipe) is common in shell scripts, ^ is common in math formulae, and the same is probably true for most other characters.

I personally think I'd go for | (pipe) if given a choice but going with real data is safest.

And whatever you do, make sure you've worked out an escaping scheme!

Solution 3 - Ascii

When using different languages, this symbol: ¬

proved to be the best. However I'm still testing.

Solution 4 - Ascii

Probably | or ^ or ~. You could also combine two characters.

Solution 5 - Ascii

You said "printable", but that can include characters such as a tab (0x09) or form feed (0x0c). I almost always choose tabs rather than commas for delimited files, since commas can sometimes appear in text.

(Interestingly enough the ascii table has characters GS (0x1D), RS (0x1E), and US (0x1F) for group, record, and unit separators, whatever those are/were.)

If by "printable" you mean a character that a user could recognize and easily type in, I would go for the pipe | symbol first, with a few other weird characters (@ or ~ or ^ or \, or backtick which I can't seem to enter here) as a possibility. These characters +=!$%&*()-'":;<>,.?/ seem like they would be more likely to occur in user input. As for underscore _ and hash # and the brackets {}[] I don't know.

Solution 6 - Ascii

How about using a CSV-style format? Characters can be escaped in standard CSV, and there are already a lot of parsers written for it.
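For instance, Python's standard csv module can pack several items into a single string and unpack them again. This is only a sketch; the helper names pack_csv and unpack_csv are made up here, and it assumes one CSV record per stored string.

```python
import csv
import io

def pack_csv(items):
    """Write the items as one CSV record and return it without the line ending."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(items)
    return buffer.getvalue().rstrip("\r\n")

def unpack_csv(packed):
    """Parse one CSV record back into a list of items."""
    return next(csv.reader(io.StringIO(packed, newline="")))

items = ["a,b", 'say "hi"', "line1\nline2", ""]
packed = pack_csv(items)
restored = unpack_csv(packed)
```

The module takes care of quoting commas, doubling embedded quotes and preserving embedded newlines, so no hand-written escaping is needed.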

Solution 7 - Ascii

Can you use a pipe symbol? That's usually the next most common delimiter after comma or tab delimited strings. It's unlikely most text would contain a pipe, and ord('|') returns 124 for me, so that seems to fit your requirements.

Solution 8 - Ascii

For fast escaping I use something like this. Say you want to concatenate str1, str2 and str3; what I do is:

// Escape "@" first, then "|", so the delimiter never appears inside an item.
delimitedStr = str1.Replace("@", "@a").Replace("|", "@p") + "|"
             + str2.Replace("@", "@a").Replace("|", "@p") + "|"
             + str3.Replace("@", "@a").Replace("|", "@p");

then to retrieve original use:

splitStr=delimitedStr.Split("|".ToCharArray());
str1=splitStr[0].Replace("@p","|").Replace("@a","@");
str2=splitStr[1].Replace("@p","|").Replace("@a","@");
str3=splitStr[2].Replace("@p","|").Replace("@a","@");

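Note: the order of the replacements is important. When unescaping, "@p" must be restored before "@a"; otherwise an original "@p", which is stored as "@ap", would first become "@p" and then wrongly turn into "|".

It's unbreakable and easy to implement.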
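The same scheme can be wrapped in small helpers so it works for any number of items. The following is a Python port written for illustration, not the original code; the function names are hypothetical.

```python
def escape(item):
    # Escape the escape character "@" first, then the delimiter "|".
    return item.replace("@", "@a").replace("|", "@p")

def unescape(piece):
    # Reverse order: restore the delimiter first, then the escape character.
    return piece.replace("@p", "|").replace("@a", "@")

def join_items(items):
    """Escape every item and join them with the pipe delimiter."""
    return "|".join(escape(item) for item in items)

def split_items(packed):
    """Split on the pipe delimiter and unescape every piece."""
    return [unescape(piece) for piece in packed.split("|")]

items = ["a|b", "user@example.com", "@p and @a", ""]
packed = join_items(items)
restored = split_items(packed)
```

After escaping, every "@" in the packed string is followed by "a" or "p", which is why the two unescape steps cannot interfere with each other.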

Solution 9 - Ascii

Pipe for the win! |

Solution 10 - Ascii

We use ASCII 0x7F (DEL), which is pseudo-printable and hardly ever comes up in regular usage.

Solution 11 - Ascii

Well it's going to depend on the nature of your text to some extent but a vertical bar 0x7C doesn't crop up in text very often.

Solution 12 - Ascii

I don't think I've ever seen an ampersand followed by a comma in natural text, but you can check the file first to see if it contains the delimiter, and if so, use an alternative. If you want to always be able to know that the delimiter you use will not cause a conflict, then do a loop checking the file for the delimiter you want, and if it exists, then double the string until the file no longer has a match. It doesn't matter if there are similar strings because your program will only look for exact delimiter matches.
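A possible sketch of that loop in Python is shown below. The function name choose_delimiter and the max_doublings limit are assumptions added for this example. Instead of only searching each item for the delimiter, it checks that joining and splitting gives the original items back. That also catches an edge case at the boundaries: an item that ends with "&," runs into the delimiter that follows it, and no amount of doubling fixes that.

```python
def choose_delimiter(items, seed="&,", max_doublings=8):
    """Double the seed until joining and splitting a non-empty list round-trips."""
    delimiter = seed
    for _ in range(max_doublings + 1):
        # A round trip fails if the delimiter occurs inside an item
        # or overlaps the end of one item and the delimiter after it.
        if delimiter.join(items).split(delimiter) == list(items):
            return delimiter
        delimiter += delimiter  # "double the string"
    raise ValueError("no safe delimiter found; use an escaping scheme instead")

items = ["Tom & Jerry", "salt&, pepper", "plain"]
delimiter = choose_delimiter(items)
packed = delimiter.join(items)
```

The chosen delimiter has to be stored or transmitted alongside the data so the reader knows what to split on.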

Solution 13 - Ascii

This can be good or bad (usually bad) depending on the situation and language, but keep in mind that you can always Base64 encode the whole thing. You then don't have to worry about escaping and unescaping various patterns on each side, and you can simply separate and split strings on a character that isn't used in your Base64 charset.
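A minimal Python sketch of the idea, assuming each item is encoded separately as UTF-8 and the pieces are joined with a pipe (which is not part of the standard Base64 alphabet). The helper names are invented for this example.

```python
import base64

def pack_b64(items, delimiter="|"):
    """Base64 encode each item, then join with a character Base64 never emits."""
    encoded = (base64.b64encode(item.encode("utf-8")).decode("ascii") for item in items)
    return delimiter.join(encoded)

def unpack_b64(packed, delimiter="|"):
    """Split on the delimiter and decode each piece back to text."""
    return [base64.b64decode(piece).decode("utf-8") for piece in packed.split(delimiter)]

items = ["a|b", "naïve ¬ text", ""]
packed = pack_b64(items)
restored = unpack_b64(packed)
```

The cost is that the stored string grows by about a third and is no longer human-readable.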

I have had to resort to this solution when faced with putting XML documents into XML properties/nodes. Properties can't have CDATA blocks in them at all, and nodes escaped as CDATA obviously cannot have further CDATA blocks inside that without breaking the structure.

CSV is probably a better idea for most situations, though.

Solution 14 - Ascii

Both pipe and caret are the obvious choices. I would note that if users are expected to type the entire response, caret is easier to find on any keyboard than is pipe.

Solution 15 - Ascii

I've used double pipe and double caret before. The idea of a non-printable character works if you're not creating or modifying the file by hand. For quick random-access file storage and retrieval, fixed field widths are used instead. You don't even have to read the whole file; you're literally pulling from the file by offset. This is how databases do some of their storage, although they also manage the space between records and so on. It does introduce the problem of a maximum width for each data element. (In the old days, an index attached a header that defined the width of each element and its data type. Later, compression by remapping characters was introduced, which allows a text file to shrink to about 1/8 of its size in transmission.) Variable-length character encoding for the win.
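To make the fixed-width idea concrete, here is a small Python sketch that stores records in an in-memory byte stream and reads one field directly by offset. The two-field layout (10 and 20 bytes) and the helper names are hypothetical, and it assumes ASCII data padded with spaces.

```python
import io

FIELD_WIDTHS = (10, 20)          # hypothetical layout: name, city
RECORD_SIZE = sum(FIELD_WIDTHS)  # every record occupies exactly 30 bytes

def write_record(stream, fields):
    """Append one record, padding each field with spaces to its fixed width."""
    for value, width in zip(fields, FIELD_WIDTHS):
        data = value.encode("ascii")
        if len(data) > width:
            raise ValueError("field exceeds its fixed width")
        stream.write(data.ljust(width, b" "))

def read_field(stream, record_index, field_index):
    """Seek straight to one field without reading anything before it."""
    offset = record_index * RECORD_SIZE + sum(FIELD_WIDTHS[:field_index])
    stream.seek(offset)
    return stream.read(FIELD_WIDTHS[field_index]).rstrip(b" ").decode("ascii")

storage = io.BytesIO()
for row in [("Ada", "London"), ("Grace", "New York"), ("Linus", "Helsinki")]:
    write_record(storage, row)

city = read_field(storage, 1, 1)
```

The ValueError is the maximum-width problem in practice, and trailing spaces in a value are lost because they cannot be told apart from padding.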

Solution 16 - Ascii

make it dynamic : )

announce your control characters in the file header

for example

delimiter: ~
escape: \
wrapline: $
width: 19

hello world~this i$
s \\just\\ a sampl$
e text~$someVar$~h$
ere is some \~\~ma$
rkdown strikethrou$
gh\~\~ text

would give the strings
hello world
this is \just\ a sample text
$someVar$
here is some ~~markdown strikethrough~~ text

i have implemented something similar:
a plaintar text container format,
to escape and wrap utf16 text in ascii,
as an alternative to mime multipart messages.
see https://github.com/milahu/live-diff-html-editor
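a rough python sketch of a decoder for the sample format shown above (this is not the code from the linked project). it makes some assumptions: the header ends at the first blank line, a line is wrapped only when it is exactly width characters long and ends with the wrapline character, and the escape character makes the next character literal. the function name decode is made up for this example.

```python
SAMPLE = r"""delimiter: ~
escape: \
wrapline: $
width: 19

hello world~this i$
s \\just\\ a sampl$
e text~$someVar$~h$
ere is some \~\~ma$
rkdown strikethrou$
gh\~\~ text"""

def decode(text):
    """Parse the header, undo the line wrapping, then split on the delimiter."""
    header, _, body = text.partition("\n\n")
    settings = dict(line.split(": ", 1) for line in header.splitlines())
    delimiter, escape, wrap = settings["delimiter"], settings["escape"], settings["wrapline"]
    width = int(settings["width"])

    # Assumption: a full-width line ending in the wrap character continues
    # on the next line; any other line break is a real newline.
    pieces = []
    for line in body.split("\n"):
        if len(line) == width and line.endswith(wrap):
            pieces.append(line[:-1])
        else:
            pieces.append(line + "\n")
    joined = "".join(pieces)
    if joined.endswith("\n"):
        joined = joined[:-1]  # drop the newline added after the last line

    items, current, i = [], [], 0
    while i < len(joined):
        ch = joined[i]
        if ch == escape and i + 1 < len(joined):
            current.append(joined[i + 1])  # escaped character is taken literally
            i += 2
        elif ch == delimiter:
            items.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    items.append("".join(current))
    return items

strings = decode(SAMPLE)
```

because wrapping is decided by position, the $ signs in $someVar$ are left alone.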

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Too embarrassed to say | View Question on Stackoverflow
Solution 1 - Ascii | Edwin Buck | View Answer on Stackoverflow
Solution 2 - Ascii | Nick Fortescue | View Answer on Stackoverflow
Solution 3 - Ascii | Icarin | View Answer on Stackoverflow
Solution 4 - Ascii | SQLMenace | View Answer on Stackoverflow
Solution 5 - Ascii | Jason S | View Answer on Stackoverflow
Solution 6 - Ascii | Alex Fort | View Answer on Stackoverflow
Solution 7 - Ascii | Jay | View Answer on Stackoverflow
Solution 8 - Ascii | Mohammad Amin | View Answer on Stackoverflow
Solution 9 - Ascii | Eppz | View Answer on Stackoverflow
Solution 10 - Ascii | Joe | View Answer on Stackoverflow
Solution 11 - Ascii | Jackson | View Answer on Stackoverflow
Solution 12 - Ascii | Matthew Lynam | View Answer on Stackoverflow
Solution 13 - Ascii | Coxy | View Answer on Stackoverflow
Solution 14 - Ascii | Will Johnson | View Answer on Stackoverflow
Solution 15 - Ascii | Albert Podzunas | View Answer on Stackoverflow
Solution 16 - Ascii | Mila Nautikus | View Answer on Stackoverflow