Split a string by spaces -- preserving quoted substrings -- in Python

Python, Regex

Python Problem Overview


I have a string which is like this:

this is "a test"

I'm trying to write something in Python to split it up by space while ignoring spaces within quotes. The result I'm looking for is:

['this','is','a test']

PS. I know you are going to ask "what happens if there are quotes within the quotes?" Well, in my application, that will never happen.

Python Solutions


Solution 1 - Python

You want split, from the built-in shlex module.

>>> import shlex
>>> shlex.split('this is "a test"')
['this', 'is', 'a test']

This should do exactly what you want.

If you want to preserve the quotation marks, then you can pass the posix=False kwarg.

>>> shlex.split('this is "a test"', posix=False)
['this', 'is', '"a test"']
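As a small sketch of how shlex.split behaves at the edges (the input strings are made up for illustration): single-quoted substrings are kept together just like double-quoted ones, and an unbalanced quote raises ValueError rather than returning a partial result.

```python
import shlex

# Single-quoted substrings are kept together as well.
single = shlex.split("this is 'a test'")
print(single)

# An unterminated quote raises ValueError, so input that may be
# malformed probably needs a try/except around the call.
try:
    shlex.split('this is "a test')
    unbalanced_error = None
except ValueError as exc:
    unbalanced_error = str(exc)
    print("could not split:", unbalanced_error)
```

Whether to swallow the error or let it propagate depends on the application; the try/except here only shows where it surfaces.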

Solution 2 - Python

Have a look at the shlex module, particularly shlex.split.

>>> import shlex
>>> shlex.split('This is "a test"')
['This', 'is', 'a test']

Solution 3 - Python

I see regex approaches here that look complex and/or wrong. This surprises me, because regex syntax can easily describe "whitespace or thing-surrounded-by-quotes", and most regex engines (including Python's) can split on a regex. So if you're going to use regexes, why not just say exactly what you mean?:

import re

test = 'this is "a test"'  # or "this is 'a test'"
# pieces = [p for p in re.split("( |[\\\"'].*[\\\"'])", test) if p.strip()]
# From comments, use this:
pieces = [p for p in re.split("( |\\\".*?\\\"|'.*?')", test) if p.strip()]

Explanation:

[\\\"'] = double-quote or single-quote
.* = anything
( |X) = space or X
.strip() = remove space and empty-string separators

shlex probably provides more features, though.
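To make the explanation above concrete, here is a small sketch that prints the raw output of re.split before filtering. Because the pattern is a capturing group, the separators (including the quoted piece) are returned too, and the quoted piece keeps its quotes. The final strip step is an addition for illustration, not part of the original answer.

```python
import re

test = 'this is "a test"'
# Same pattern as above, written as a raw string for readability.
pattern = r"""( |".*?"|'.*?')"""

# The capturing group makes re.split return the separators as well.
raw = re.split(pattern, test)
print(raw)

# Drop the separators that are only whitespace or empty.
pieces = [p for p in raw if p.strip()]
print(pieces)

# Optional extra step (an assumption about what the caller wants):
# remove the surrounding quotes from each piece.
unquoted = [p.strip('"\'') for p in pieces]
print(unquoted)
```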

Solution 4 - Python

Depending on your use case, you may also want to check out the csv module:

import csv
lines = ['this is "a string"', 'and more "stuff"']
for row in csv.reader(lines, delimiter=" "):
    print(row)

Output:

['this', 'is', 'a string']
['and', 'more', 'stuff']
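Note that csv.reader treats only one quote character as special (the double quote by default). As a hedged sketch, its quotechar parameter can be switched to handle single-quoted input instead; the sample lines are invented.

```python
import csv

# Use the single quote as the quote character instead of the default '"'.
lines = ["this is 'a string'"]
rows = list(csv.reader(lines, delimiter=" ", quotechar="'"))
print(rows)

# Unlike str.split(), consecutive delimiters yield empty fields.
double_space = list(csv.reader(["a  b"], delimiter=" "))
print(double_space)
```

If the input can contain runs of spaces, the empty fields may need to be filtered out afterwards.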

Solution 5 - Python

I used shlex.split to process 70,000,000 lines of Squid logs, and it was too slow, so I switched to re.

Try this if you have performance problems with shlex:

import re

def line_split(line):
    return re.findall(r'[^"\s]\S*|".+?"', line)
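A short usage sketch for the function above, with the pattern precompiled since it is applied to many lines. Note that, unlike shlex.split, the quoted token keeps its surrounding quotes. The names TOKEN_RE and line_split_compiled are made up for this example.

```python
import re

# Compile once, reuse for every line.
TOKEN_RE = re.compile(r'[^"\s]\S*|".+?"')

def line_split_compiled(line):
    """Return unquoted words and double-quoted substrings (quotes kept)."""
    return TOKEN_RE.findall(line)

tokens = line_split_compiled('this is "a test"')
print(tokens)
```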

Solution 6 - Python

It seems that for performance reasons re is faster. Here is my solution using a non-greedy operator that preserves the outer quotes:

re.findall(r'(?:".*?"|\S)+', s)

Result:

['this', 'is', '"a test"']

It leaves constructs like aaa"bla blub"bbb together, as these tokens are not separated by spaces. If the string contains escaped characters, you can match like this:

>>> a = "She said \"He said, \\\"My name is Mark.\\\"\""
>>> a
'She said "He said, \\"My name is Mark.\\""'
>>> for i in re.findall(r'(?:".*?[^\\]"|\S)+', a): print(i)
...
She
said
"He said, \"My name is Mark.\""

Please note that this also matches the empty string "" by means of the \S part of the pattern.

Solution 7 - Python

The main problem with the accepted shlex approach is that it does not ignore escape characters outside quoted substrings, and gives slightly unexpected results in some corner cases.

I have the following use case, where I need a split function that splits input strings such that either single-quoted or double-quoted substrings are preserved, with the ability to escape quotes within such a substring. Quotes within an unquoted string should not be treated differently from any other character. Some example test cases with the expected output:

 input string        | expected output
 --------------------+--------------------------
 'abc def'           | ['abc', 'def']
 "abc \\s def"       | ['abc', '\\s', 'def']
 '"abc def" ghi'     | ['abc def', 'ghi']
 "'abc def' ghi"     | ['abc def', 'ghi']
 '"abc \\" def" ghi' | ['abc " def', 'ghi']
 "'abc \\' def' ghi" | ["abc ' def", 'ghi']
 "'abc \\s def' ghi" | ['abc \\s def', 'ghi']
 '"abc \\s def" ghi' | ['abc \\s def', 'ghi']
 '"" test'           | ['', 'test']
 "'' test"           | ['', 'test']
 "abc'def"           | ["abc'def"]
 "abc'def'"          | ["abc'def'"]
 "abc'def' ghi"      | ["abc'def'", 'ghi']
 "abc'def'ghi"       | ["abc'def'ghi"]
 'abc"def'           | ['abc"def']
 'abc"def"'          | ['abc"def"']
 'abc"def" ghi'      | ['abc"def"', 'ghi']
 'abc"def"ghi'       | ['abc"def"ghi']
 "r'AA' r'._xyz$'"   | ["r'AA'", "r'._xyz$'"]
 'abc"def ghi"'      | ['abc"def ghi"']
 'abc"def ghi""jkl"' | ['abc"def ghi""jkl"']
 'a"b c"d"e"f"g h"'  | ['a"b c"d"e"f"g h"']
 'c="ls /" type key' | ['c="ls /"', 'type', 'key']
 "abc'def ghi'"      | ["abc'def ghi'"]
 "c='ls /' type key" | ["c='ls /'", 'type', 'key']

I ended up with the following function, which produces the expected output for all of these input strings:

import re

def quoted_split(s):
    def strip_quotes(s):
        # Drop one pair of matching outer quotes, if present.
        if s and (s[0] == '"' or s[0] == "'") and s[0] == s[-1]:
            return s[1:-1]
        return s
    # A token is a run containing double-quoted parts, a run containing
    # single-quoted parts, or a plain run of non-whitespace characters.
    return [strip_quotes(p).replace('\\"', '"').replace("\\'", "'")
            for p in re.findall(r'(?:[^"\s]*"(?:\\.|[^"])*"[^"\s]*)+|(?:[^\'\s]*\'(?:\\.|[^\'])*\'[^\'\s]*)+|[^\s]+', s)]

It ain't pretty; but it works. The following test application checks the results of other approaches (shlex and csv for now) and the custom split implementation:

#!/bin/python2.7

import csv
import re
import shlex

from timeit import timeit


def test_case(fn, s, expected):
    # Print OK/FAIL for one input; exceptions count as failures.
    try:
        if fn(s) == expected:
            print('[ OK ] %s -> %s' % (s, fn(s)))
        else:
            print('[FAIL] %s -> %s' % (s, fn(s)))
    except Exception as e:
        print('[FAIL] %s -> exception: %s' % (s, e))


def test_case_no_output(fn, s, expected):
    # Silent variant used for benchmarking.
    try:
        fn(s)
    except:
        pass


def test_split(fn, test_case_fn=test_case):
    test_case_fn(fn, 'abc def', ['abc', 'def'])
    test_case_fn(fn, "abc \\s def", ['abc', '\\s', 'def'])
    test_case_fn(fn, '"abc def" ghi', ['abc def', 'ghi'])
    test_case_fn(fn, "'abc def' ghi", ['abc def', 'ghi'])
    test_case_fn(fn, '"abc \\" def" ghi', ['abc " def', 'ghi'])
    test_case_fn(fn, "'abc \\' def' ghi", ["abc ' def", 'ghi'])
    test_case_fn(fn, "'abc \\s def' ghi", ['abc \\s def', 'ghi'])
    test_case_fn(fn, '"abc \\s def" ghi', ['abc \\s def', 'ghi'])
    test_case_fn(fn, '"" test', ['', 'test'])
    test_case_fn(fn, "'' test", ['', 'test'])
    test_case_fn(fn, "abc'def", ["abc'def"])
    test_case_fn(fn, "abc'def'", ["abc'def'"])
    test_case_fn(fn, "abc'def' ghi", ["abc'def'", 'ghi'])
    test_case_fn(fn, "abc'def'ghi", ["abc'def'ghi"])
    test_case_fn(fn, 'abc"def', ['abc"def'])
    test_case_fn(fn, 'abc"def"', ['abc"def"'])
    test_case_fn(fn, 'abc"def" ghi', ['abc"def"', 'ghi'])
    test_case_fn(fn, 'abc"def"ghi', ['abc"def"ghi'])
    test_case_fn(fn, "r'AA' r'._xyz$'", ["r'AA'", "r'._xyz$'"])
    test_case_fn(fn, 'abc"def ghi"', ['abc"def ghi"'])
    test_case_fn(fn, 'abc"def ghi""jkl"', ['abc"def ghi""jkl"'])
    test_case_fn(fn, 'a"b c"d"e"f"g h"', ['a"b c"d"e"f"g h"'])
    test_case_fn(fn, 'c="ls /" type key', ['c="ls /"', 'type', 'key'])
    test_case_fn(fn, "abc'def ghi'", ["abc'def ghi'"])
    test_case_fn(fn, "c='ls /' type key", ["c='ls /'", 'type', 'key'])


def csv_split(s):
    return list(csv.reader([s], delimiter=' '))[0]


def re_split(s):
    def strip_quotes(s):
        if s and (s[0] == '"' or s[0] == "'") and s[0] == s[-1]:
            return s[1:-1]
        return s
    return [strip_quotes(p).replace('\\"', '"').replace("\\'", "'")
            for p in re.findall(r'(?:[^"\s]*"(?:\\.|[^"])*"[^"\s]*)+|(?:[^\'\s]*\'(?:\\.|[^\'])*\'[^\'\s]*)+|[^\s]+', s)]


if __name__ == '__main__':
    print('shlex\n')
    test_split(shlex.split)
    print('')

    print('csv\n')
    test_split(csv_split)
    print('')

    print('re\n')
    test_split(re_split)
    print('')

    iterations = 100
    setup = 'from __main__ import test_split, test_case_no_output, csv_split, re_split\nimport shlex, re'

    def benchmark(method, code):
        print('%s: %.3fms per iteration' % (method, (1000 * timeit(code, setup=setup, number=iterations) / iterations)))

    benchmark('shlex', 'test_split(shlex.split, test_case_no_output)')
    benchmark('csv', 'test_split(csv_split, test_case_no_output)')
    benchmark('re', 'test_split(re_split, test_case_no_output)')


Output:

shlex

[ OK ] abc def -> ['abc', 'def']
[FAIL] abc \s def -> ['abc', 's', 'def']
[ OK ] "abc def" ghi -> ['abc def', 'ghi']
[ OK ] 'abc def' ghi -> ['abc def', 'ghi']
[ OK ] "abc \" def" ghi -> ['abc " def', 'ghi']
[FAIL] 'abc \' def' ghi -> exception: No closing quotation
[ OK ] 'abc \s def' ghi -> ['abc \\s def', 'ghi']
[ OK ] "abc \s def" ghi -> ['abc \\s def', 'ghi']
[ OK ] "" test -> ['', 'test']
[ OK ] '' test -> ['', 'test']
[FAIL] abc'def -> exception: No closing quotation
[FAIL] abc'def' -> ['abcdef']
[FAIL] abc'def' ghi -> ['abcdef', 'ghi']
[FAIL] abc'def'ghi -> ['abcdefghi']
[FAIL] abc"def -> exception: No closing quotation
[FAIL] abc"def" -> ['abcdef']
[FAIL] abc"def" ghi -> ['abcdef', 'ghi']
[FAIL] abc"def"ghi -> ['abcdefghi']
[FAIL] r'AA' r'._xyz$' -> ['rAA', 'r._xyz$']
[FAIL] abc"def ghi" -> ['abcdef ghi']
[FAIL] abc"def ghi""jkl" -> ['abcdef ghijkl']
[FAIL] a"b c"d"e"f"g h" -> ['ab cdefg h']
[FAIL] c="ls /" type key -> ['c=ls /', 'type', 'key']
[FAIL] abc'def ghi' -> ['abcdef ghi']
[FAIL] c='ls /' type key -> ['c=ls /', 'type', 'key']

csv

[ OK ] abc def -> ['abc', 'def']
[ OK ] abc \s def -> ['abc', '\\s', 'def']
[ OK ] "abc def" ghi -> ['abc def', 'ghi']
[FAIL] 'abc def' ghi -> ["'abc", "def'", 'ghi']
[FAIL] "abc \" def" ghi -> ['abc \\', 'def"', 'ghi']
[FAIL] 'abc \' def' ghi -> ["'abc", "\\'", "def'", 'ghi']
[FAIL] 'abc \s def' ghi -> ["'abc", '\\s', "def'", 'ghi']
[ OK ] "abc \s def" ghi -> ['abc \\s def', 'ghi']
[ OK ] "" test -> ['', 'test']
[FAIL] '' test -> ["''", 'test']
[ OK ] abc'def -> ["abc'def"]
[ OK ] abc'def' -> ["abc'def'"]
[ OK ] abc'def' ghi -> ["abc'def'", 'ghi']
[ OK ] abc'def'ghi -> ["abc'def'ghi"]
[ OK ] abc"def -> ['abc"def']
[ OK ] abc"def" -> ['abc"def"']
[ OK ] abc"def" ghi -> ['abc"def"', 'ghi']
[ OK ] abc"def"ghi -> ['abc"def"ghi']
[ OK ] r'AA' r'._xyz$' -> ["r'AA'", "r'._xyz$'"]
[FAIL] abc"def ghi" -> ['abc"def', 'ghi"']
[FAIL] abc"def ghi""jkl" -> ['abc"def', 'ghi""jkl"']
[FAIL] a"b c"d"e"f"g h" -> ['a"b', 'c"d"e"f"g', 'h"']
[FAIL] c="ls /" type key -> ['c="ls', '/"', 'type', 'key']
[FAIL] abc'def ghi' -> ["abc'def", "ghi'"]
[FAIL] c='ls /' type key -> ["c='ls", "/'", 'type', 'key']

re

[ OK ] abc def -> ['abc', 'def']
[ OK ] abc \s def -> ['abc', '\\s', 'def']
[ OK ] "abc def" ghi -> ['abc def', 'ghi']
[ OK ] 'abc def' ghi -> ['abc def', 'ghi']
[ OK ] "abc \" def" ghi -> ['abc " def', 'ghi']
[ OK ] 'abc \' def' ghi -> ["abc ' def", 'ghi']
[ OK ] 'abc \s def' ghi -> ['abc \\s def', 'ghi']
[ OK ] "abc \s def" ghi -> ['abc \\s def', 'ghi']
[ OK ] "" test -> ['', 'test']
[ OK ] '' test -> ['', 'test']
[ OK ] abc'def -> ["abc'def"]
[ OK ] abc'def' -> ["abc'def'"]
[ OK ] abc'def' ghi -> ["abc'def'", 'ghi']
[ OK ] abc'def'ghi -> ["abc'def'ghi"]
[ OK ] abc"def -> ['abc"def']
[ OK ] abc"def" -> ['abc"def"']
[ OK ] abc"def" ghi -> ['abc"def"', 'ghi']
[ OK ] abc"def"ghi -> ['abc"def"ghi']
[ OK ] r'AA' r'._xyz$' -> ["r'AA'", "r'._xyz$'"]
[ OK ] abc"def ghi" -> ['abc"def ghi"']
[ OK ] abc"def ghi""jkl" -> ['abc"def ghi""jkl"']
[ OK ] a"b c"d"e"f"g h" -> ['a"b c"d"e"f"g h"']
[ OK ] c="ls /" type key -> ['c="ls /"', 'type', 'key']
[ OK ] abc'def ghi' -> ["abc'def ghi'"]
[ OK ] c='ls /' type key -> ["c='ls /'", 'type', 'key']

shlex: 0.335ms per iteration
csv: 0.036ms per iteration
re: 0.068ms per iteration

So performance is much better than shlex, and can be improved further by precompiling the regular expression, in which case it will outperform the csv approach.
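A minimal sketch of the precompiled variant mentioned above. The pattern is written out here as an assumption about the intended regular expression (runs containing quoted parts with backslash escapes, or plain non-whitespace runs); the names QUOTED_SPLIT_RE, _strip_quotes and quoted_split_compiled are hypothetical.

```python
import re

# Compiled once at import time instead of on every call.
QUOTED_SPLIT_RE = re.compile(
    r'(?:[^"\s]*"(?:\\.|[^"])*"[^"\s]*)+'   # runs containing "..." parts
    r"|(?:[^'\s]*'(?:\\.|[^'])*'[^'\s]*)+"   # runs containing '...' parts
    r'|[^\s]+'                               # plain non-whitespace runs
)

def _strip_quotes(token):
    # Remove one pair of matching outer quotes, if the token has them.
    if token and token[0] in '"\'' and token[0] == token[-1]:
        return token[1:-1]
    return token

def quoted_split_compiled(s):
    return [
        _strip_quotes(p).replace('\\"', '"').replace("\\'", "'")
        for p in QUOTED_SPLIT_RE.findall(s)
    ]

print(quoted_split_compiled('"abc def" ghi'))
print(quoted_split_compiled('c="ls /" type key'))
```

How much this helps in practice is worth measuring, since the re module also caches recently used patterns internally.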

Solution 8 - Python

Since this question is tagged with regex, I decided to try a regex approach. I first replace all the spaces in the quoted parts with \x00, then split by spaces, then replace the \x00 back to spaces in each part.

Both versions do the same thing, but splitter is a bit more readable than splitter2.

import re

s = 'this is "a test" some text "another test"'

def splitter(s):
    def replacer(m):
        return m.group(0).replace(" ", "\x00")
    parts = re.sub('".+?"', replacer, s).split()
    parts = [p.replace("\x00", " ") for p in parts]
    return parts
    
def splitter2(s):
    return [p.replace("\x00", " ") for p in re.sub('".+?"', lambda m: m.group(0).replace(" ", "\x00"), s).split()]

print(splitter2(s))

Solution 9 - Python

Speed test of different answers:

import re
import shlex
import csv

line = 'this is "a test"'

%timeit [p for p in re.split("( |\\\".*?\\\"|'.*?')", line) if p.strip()]
100000 loops, best of 3: 5.17 µs per loop

%timeit re.findall(r'[^"\s]\S*|".+?"', line)
100000 loops, best of 3: 2.88 µs per loop

%timeit list(csv.reader([line], delimiter=" "))
The slowest run took 9.62 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.4 µs per loop

%timeit shlex.split(line)
10000 loops, best of 3: 50.2 µs per loop
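The %timeit lines above only work in IPython. As a rough sketch, the same comparison can be run from a plain script with the standard timeit module; the iteration count is arbitrary, and the absolute numbers will differ by machine and Python version.

```python
import csv
import re
import shlex
import timeit

line = 'this is "a test"'

# One zero-argument callable per approach being compared.
candidates = {
    "re.split": lambda: [p for p in re.split(r"""( |".*?"|'.*?')""", line) if p.strip()],
    "re.findall": lambda: re.findall(r'[^"\s]\S*|".+?"', line),
    "csv.reader": lambda: list(csv.reader([line], delimiter=" ")),
    "shlex.split": lambda: shlex.split(line),
}

number = 10000  # arbitrary; raise it for more stable numbers
timings = {}
for name, fn in candidates.items():
    seconds = timeit.timeit(fn, number=number)
    timings[name] = seconds / number * 1e6  # microseconds per call
    print("%-12s %.2f usec per call" % (name, timings[name]))
```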

Solution 10 - Python

To preserve quotes use this function:

def getArgs(s):
    args = []
    cur = ''
    inQuotes = 0
    for char in s.strip():
        if char == ' ' and not inQuotes:
            args.append(cur)
            cur = ''
        elif char == '"' and not inQuotes:
            inQuotes = 1
            cur += char
        elif char == '"' and inQuotes:
            inQuotes = 0
            cur += char
        else:
            cur += char
    args.append(cur)
    return args
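As a variation on the same state-machine idea, here is a hedged sketch that also handles single quotes and collapses runs of whitespace instead of emitting empty arguments. The function name split_keep_quotes and its quote_chars parameter are made up for this example.

```python
def split_keep_quotes(s, quote_chars='"\''):
    """Split on whitespace, keeping quoted substrings (and their quotes) intact."""
    tokens = []
    cur = []
    open_quote = None  # the quote character we are currently inside, if any
    for ch in s.strip():
        if open_quote:
            cur.append(ch)
            if ch == open_quote:
                open_quote = None
        elif ch in quote_chars:
            open_quote = ch
            cur.append(ch)
        elif ch.isspace():
            # Only close a token if one is in progress, so repeated
            # spaces do not produce empty strings.
            if cur:
                tokens.append(''.join(cur))
                cur = []
        else:
            cur.append(ch)
    if cur:
        tokens.append(''.join(cur))
    return tokens

print(split_keep_quotes('this is "a test"'))
```

Like the original function, this does no escape handling and silently accepts an unterminated quote.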

Solution 11 - Python

Hmm, can't seem to find the "Reply" button... anyway, this answer is based on the approach by Kate, but correctly splits strings with substrings containing escaped quotes and also removes the start and end quotes of the substrings:

  [i.strip('"').strip("'") for i in re.split(r'(\s+|(?<!\\)".*?(?<!\\)"|(?<!\\)\'.*?(?<!\\)\')', string) if i.strip()]

This works on strings like 'This is " a \\\"test\\\"\\\'s substring"' (the insane markup is unfortunately necessary to keep Python from removing the escapes).

If the resulting escapes in the strings in the returned list are not wanted, you can use this slightly altered version (Python 2 only, because it relies on str.decode('string_escape')):

[i.strip('"').strip("'").decode('string_escape') for i in re.split(r'(\s+|(?<!\\)".*?(?<!\\)"|(?<!\\)\'.*?(?<!\\)\')', string) if i.strip()]
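For illustration, here is the first one-liner applied to an example string with escaped quotes, written as a raw string so the backslashes survive. The variable names text, pattern and parts are placeholders.

```python
import re

# Raw string: the backslashes before the inner quotes are kept literally.
text = r'This is " a \"test\"\'s substring"'

# Whitespace, or a quoted run whose delimiters are not preceded by a backslash.
pattern = r'(\s+|(?<!\\)".*?(?<!\\)"|(?<!\\)\'.*?(?<!\\)\')'

parts = [i.strip('"').strip("'") for i in re.split(pattern, text) if i.strip()]
print(parts)
```

The escapes themselves stay in the result, and str.strip removes every leading and trailing quote character, not just one pair.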

Solution 12 - Python

To get around the unicode issues in some Python 2 versions, I suggest:

from shlex import split as _split
split = lambda a: [b.decode('utf-8') for b in _split(a.encode('utf-8'))]

Solution 13 - Python

As an option try tssplit:

In [1]: from tssplit import tssplit
In [2]: tssplit('this is "a test"', quote='"', delimiter='')
Out[2]: ['this', 'is', 'a test']

Solution 14 - Python

I suggest:

test string:

s = 'abc "ad" \'fg\' "kk\'rdt\'" zzz"34"zzz "" \'\''

to capture also "" and '':

import re
re.findall(r'"[^"]*"|\'[^\']*\'|[^"\'\s]+',s)

result:

['abc', '"ad"', "'fg'", '"kk\'rdt\'"', 'zzz', '"34"', 'zzz', '""', "''"]

to ignore empty "" and '':

import re
re.findall(r'"[^"]+"|\'[^\']+\'|[^"\'\s]+',s)

result:

['abc', '"ad"', "'fg'", '"kk\'rdt\'"', 'zzz', '"34"', 'zzz']

Solution 15 - Python

Try this:

  def adamsplit(s):
    result = []
    inquotes = False
    for substring in s.split('"'):
      if not inquotes:
        result.extend(substring.split())
      else:
        result.append(substring)
      inquotes = not inquotes
    return result

Some test strings:

'This is "a test"' -> ['This', 'is', 'a test']
'"This is \'a test\'"' -> ["This is 'a test'"]

Solution 16 - Python

If you don't care about substrings, then a simple split() is enough:

>>> 'a short sized string with spaces '.split()

Performance:

>>> s = " ('a short sized string with spaces '*100).split() "
>>> t = timeit.Timer(stmt=s)
>>> print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
171.39 usec/pass

Or the string module (Python 2 only; string.split was removed in Python 3):

>>> from string import split as stringsplit; 
>>> stringsplit('a short sized string with spaces '*100)

Performance: String module seems to perform better than string methods

>>> s = "stringsplit('a short sized string with spaces '*100)"
>>> t = timeit.Timer(s, "from string import split as stringsplit")
>>> print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
154.88 usec/pass

Or you can use the re engine:

>>> from re import split as resplit
>>> regex = '\s+'
>>> medstring = 'a short sized string with spaces '*100
>>> resplit(regex, medstring)

Performance

>>> s = "resplit(regex, medstring)"
>>> t = timeit.Timer(s, "from re import split as resplit; regex='\s+'; medstring='a short sized string with spaces '*100")
>>> print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
540.21 usec/pass

For very long strings, you should not load the entire string into memory; instead, either split it line by line or use an iterative loop.
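One way to read that advice, as a minimal sketch: consume the input line by line and yield tokens lazily, so only one line is held in memory at a time. The generator name iter_tokens is made up, and io.StringIO merely stands in for a large file object.

```python
import io

def iter_tokens(lines):
    """Yield whitespace-separated tokens one line at a time."""
    for line in lines:
        for token in line.split():
            yield token

# io.StringIO stands in for a large file opened elsewhere.
fake_file = io.StringIO(u"a short sized\nstring with spaces\n")
tokens = list(iter_tokens(fake_file))
print(tokens)
```

Calling list() here is only for display; in real use the generator would be iterated directly.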

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Adam Pierce | View Question on Stackoverflow
Solution 1 - Python | Jerub | View Answer on Stackoverflow
Solution 2 - Python | Allen | View Answer on Stackoverflow
Solution 3 - Python | Kate | View Answer on Stackoverflow
Solution 4 - Python | Ryan Ginstrom | View Answer on Stackoverflow
Solution 5 - Python | Daniel Dai | View Answer on Stackoverflow
Solution 6 - Python | hochl | View Answer on Stackoverflow
Solution 7 - Python | Ton van den Heuvel | View Answer on Stackoverflow
Solution 8 - Python | elifiner | View Answer on Stackoverflow
Solution 9 - Python | har777 | View Answer on Stackoverflow
Solution 10 - Python | THE_MAD_KING | View Answer on Stackoverflow
Solution 11 - Python | user261478 | View Answer on Stackoverflow
Solution 12 - Python | moschlar | View Answer on Stackoverflow
Solution 13 - Python | Mikhail Zakharov | View Answer on Stackoverflow
Solution 14 - Python | hussic | View Answer on Stackoverflow
Solution 15 - Python | pjz | View Answer on Stackoverflow
Solution 16 - Python | Gregory | View Answer on Stackoverflow