What's the best separator/delimiter character(s) for a plaintext db file?

DatabaseSqliteDelimiter

Database Problem Overview


What's the best separator/delimiter character(s) for a plaintext db file?

I considered using |, ,, <TAB>, ;, etc. But they all seem to be possible to break when the nearby entries have special enough characters.

So, the experienced database users, what delimiter character(s) do you suggest to use?

Database Solutions


Solution 1 - Database

Well, there are few separator characters in US-ASCII, hex 1c, 1d, 1e and 1f. The plain text shouldn't contain them.

1c  FS  ␜  ^\  File Separator
1d  GS  ␝  ^]  Group Separator
1e  RS 	␞  ^^  Record Separator
1f  US 	␟  ^_  Unit Separator

Solution 2 - Database

No matter which character you choose as your separator, you'll want to escape any instance of that character in your data.

Perhaps tilde(~), or go to a high-ASCII character.

Either way, if there's any chance that it could sneak into your data, you'd want to escape it before writing to your plaintext file.

Solution 3 - Database

I think the best way to join string with a three cherries '@@@'.

Solution 4 - Database

For a particular data warehousing situation where we had control over the source file, but escaping and qualifying were onerous, we were able to make the business decision that one extended ASCII character would be stripped from the data (if it ever occurs, which it hasn't).

On creation of the delimited source file, we stripped out any instances of █ (alt+219) in the data and use that character for the delimiter. Bonus, that character is really easy to spot.

Solution 5 - Database

Actually, it depends on the type of data you are trying to separate, we needed a separator for the machine events data and a couple of them were proposed:

=) or ^_^.

We chose ^_^ because it actually worked based on the number of samples tested and it also looks cute!

Solution 6 - Database

Personally I like using « as a delimiter character to split data in CSV files, I don't think I've ever found a naturally occurring instance of « and » personally, so here are my two cents about it.

Solution 7 - Database

I usually prefer non-printable characters like "\u0001", for instance I use this as a column delimiter in most of my Azure Data Analytics U-SQL Scripts. That is assuming you can use a multi-character custom delimiter

Solution 8 - Database

You could use the special separator characters (hex 1c -> 1f), yet they are non-printable, and some technologies have issues processing data containing them.

So, plan B, if your data is in UTF-8, you could pick a random UTF-8 character that is extremely unlikely to appear in any source data you receive.

Yet, even then, if you want to be sure you'll not run into issues, you better always scan your entire dataset for this character, and if it appears, simply pick another UTF-8 character.

I tend to hate encapsulation with a passion, and avoid it whenever possible, as explained in my post under the chapter 'encapsulation' here: https://theonemanitdepartment.wordpress.com/2014/12/15/the-absolute-minimum-everyone-working-with-data-absolutely-positively-must-know-about-file-types-encoding-delimiters-and-data-types-no-excuses/

Solution 9 - Database

If you can't control the data being put into it, don't use a plain text db. There can be no generally right answer here. Without context or constraints this is a false question.

To wit: If I said I was only going to accept lower case letters as data, I could use any other symbol as a separator. Even, say, the number 9, and I'd be fine. No symbol other than a lower case character would be better than any other.

Conversely, if said I could accept any character, then I don't have any characters left for a separator, and I'd be left with a very sorry database that could only store a single value.

If you have to try too hard to get your db into plain text, you probably want a binary db. Have you looked at sqlite? It's pretty darned easy to use, is available in many contexts, and comes with a ton of benefits over a plain text db.

Solution 10 - Database

I propose the interrobang character "‽". More details: https://en.wikipedia.org/wiki/Interrobang

Solution 11 - Database

If you have the option of a string as column separator, use "" as delimiter. You can make up any string for that matter and gives you flexibility.

Solution 12 - Database

I've used an ePUB convertor before and the delimiter char was the notational quote character, anywhere it had been used it would be rewritten to file as @, simple but effective even if it did destroy the sample material being produced.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionMeng LuView Question on Stackoverflow
Solution 1 - DatabaseMichasView Answer on Stackoverflow
Solution 2 - Databasep.campbellView Answer on Stackoverflow
Solution 3 - DatabaseEmisView Answer on Stackoverflow
Solution 4 - DatabasejapageView Answer on Stackoverflow
Solution 5 - DatabaseCoder AbsoluteView Answer on Stackoverflow
Solution 6 - DatabaseCesar Bourdain CostaView Answer on Stackoverflow
Solution 7 - Databasedim_userView Answer on Stackoverflow
Solution 8 - DatabaseWouterView Answer on Stackoverflow
Solution 9 - DatabaseChris SubagioView Answer on Stackoverflow
Solution 10 - DatabaseFierascu GheorgheView Answer on Stackoverflow
Solution 11 - DatabasesvarghView Answer on Stackoverflow
Solution 12 - DatabaseEasY_COdingView Answer on Stackoverflow