Split a string by spaces -- preserving quoted substrings -- in Python
Problem Overview
I have a string which is like this:
this is "a test"
I'm trying to write something in Python to split it up by space while ignoring spaces within quotes. The result I'm looking for is:
['this','is','a test']
P.S. I know you are going to ask "what happens if there are quotes within the quotes?" Well, in my application, that will never happen.
Python Solutions
Solution 1 - Python
You want split, from the built-in shlex module.
>>> import shlex
>>> shlex.split('this is "a test"')
['this', 'is', 'a test']
This should do exactly what you want.
If you want to preserve the quotation marks, you can pass the posix=False keyword argument.
>>> shlex.split('this is "a test"', posix=False)
['this', 'is', '"a test"']
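To see the two modes side by side, here is a small sketch that uses only the calls shown above; the variable names are just for illustration.

```python
import shlex

line = 'this is "a test"'

# Default POSIX mode: quotes group words and are then removed.
posix_tokens = shlex.split(line)

# Non-POSIX mode: quotes still group words but stay in the token.
raw_tokens = shlex.split(line, posix=False)

print(posix_tokens)  # ['this', 'is', 'a test']
print(raw_tokens)    # ['this', 'is', '"a test"']
```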
Solution 2 - Python
Have a look at the shlex module, particularly shlex.split.
>>> import shlex
>>> shlex.split('This is "a test"')
['This', 'is', 'a test']
Solution 3 - Python
I see regex approaches here that look complex and/or wrong. This surprises me, because regex syntax can easily describe "whitespace or thing-surrounded-by-quotes", and most regex engines (including Python's) can split on a regex. So if you're going to use regexes, why not just say exactly what you mean?
import re
test = 'this is "a test"' # or "this is 'a test'"
# pieces = [p for p in re.split("( |[\\\"'].*[\\\"'])", test) if p.strip()]
# From comments, use this:
pieces = [p for p in re.split("( |\\\".*?\\\"|'.*?')", test) if p.strip()]
Explanation:
[\\\"'] = double-quote or single-quote
.* = anything
( |X) = space or X
.strip() = remove space and empty-string separators
shlex probably provides more features, though.
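It may help to look at what re.split returns before filtering. The sketch below writes the same pattern as a raw string; because the pattern is a capturing group, the separators themselves (spaces and quoted runs) come back in the result, and the quoted run keeps its quotes. The final quote-stripping step is my own addition, not part of the answer above.

```python
import re

test = 'this is "a test"'
pattern = r'''( |".*?"|'.*?')'''  # space, or a "..." run, or a '...' run

# The capturing group makes re.split return the separators too.
raw = re.split(pattern, test)
print(raw)  # ['this', ' ', 'is', ' ', '', '"a test"', '']

# Dropping blank pieces leaves the words and the quoted runs.
pieces = [p for p in raw if p.strip()]
print(pieces)  # ['this', 'is', '"a test"']

# Optional extra step (an assumption about what you want): drop the outer quotes.
unquoted = [p[1:-1] if len(p) >= 2 and p[0] == p[-1] and p[0] in '"\'' else p
            for p in pieces]
print(unquoted)  # ['this', 'is', 'a test']
```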
Solution 4 - Python
Depending on your use case, you may also want to check out the csv module:
import csv
lines = ['this is "a string"', 'and more "stuff"']
for row in csv.reader(lines, delimiter=" "):
    print(row)
Output:
['this', 'is', 'a string']
['and', 'more', 'stuff']
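Before relying on csv for this, it is worth checking how it treats input that is not quite CSV-shaped. The helper name split_with_csv below is hypothetical; the sketch only shows that, with default settings, repeated spaces produce empty fields and single quotes are not treated as quotes.

```python
import csv

def split_with_csv(line, delimiter=" "):
    """Split a single line with csv.reader (hypothetical helper name)."""
    return next(csv.reader([line], delimiter=delimiter))

print(split_with_csv('this is "a string"'))  # ['this', 'is', 'a string']
print(split_with_csv('a  b'))                # ['a', '', 'b']
print(split_with_csv("this is 'a string'"))  # ['this', 'is', "'a", "string'"]
```

If those two behaviours matter for your data, shlex or a regex is probably the better fit.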
Solution 5 - Python
I used shlex.split to process 70,000,000 lines of squid log, and it was very slow, so I switched to re.
Please try this if you have performance problems with shlex.
import re
def line_split(line):
    return re.findall(r'[^"\s]\S*|".+?"', line)
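Note that this pattern keeps the double quotes in the matched tokens. A minimal sketch of a variant that compiles the pattern once and strips the quotes afterwards; the names TOKEN_RE and line_split_unquoted are mine, not from the answer.

```python
import re

# Same pattern as above, compiled once so it is not looked up on every line.
TOKEN_RE = re.compile(r'[^"\s]\S*|".+?"')

def line_split_unquoted(line):
    # Only the second alternative can produce a token starting with a quote,
    # and such a token always ends with a quote as well.
    return [t[1:-1] if t.startswith('"') else t for t in TOKEN_RE.findall(line)]

print(line_split_unquoted('this is "a test"'))  # ['this', 'is', 'a test']
```

One limitation of the pattern: because of the `.+?`, an empty quoted string such as "" followed by a space is skipped rather than returned as an empty token.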
Solution 6 - Python
When performance matters, re seems to be faster. Here is my solution using a non-greedy operator that preserves the outer quotes:
re.findall("(?:\".*?\"|\S)+", s)
Result:
['this', 'is', '"a test"']
It leaves constructs like aaa"bla blub"bbb together, as these tokens are not separated by spaces. If the string contains escaped characters, you can match like this:
>>> a = "She said \"He said, \\\"My name is Mark.\\\"\""
>>> a
'She said "He said, \\"My name is Mark.\\""'
>>> for i in re.findall("(?:\".*?[^\\\\]\"|\S)+", a): print(i)
...
She
said
"He said, \"My name is Mark.\""
Please note that this also matches the empty string "" by means of the \S part of the pattern.
Solution 7 - Python
The main problem with the accepted shlex approach is that it does not ignore escape characters outside quoted substrings, and gives slightly unexpected results in some corner cases.
I have the following use case, where I need a split function that splits input strings such that either single-quoted or double-quoted substrings are preserved, with the ability to escape quotes within such a substring. Quotes within an unquoted string should not be treated differently from any other character. Some example test cases with the expected output:
input string | expected output
'abc def' | ['abc', 'def']
"abc \\s def" | ['abc', '\\s', 'def']
'"abc def" ghi' | ['abc def', 'ghi']
"'abc def' ghi" | ['abc def', 'ghi']
'"abc \\" def" ghi' | ['abc " def', 'ghi']
"'abc \\' def' ghi" | ["abc ' def", 'ghi']
"'abc \\s def' ghi" | ['abc \\s def', 'ghi']
'"abc \\s def" ghi' | ['abc \\s def', 'ghi']
'"" test' | ['', 'test']
"'' test" | ['', 'test']
"abc'def" | ["abc'def"]
"abc'def'" | ["abc'def'"]
"abc'def' ghi" | ["abc'def'", 'ghi']
"abc'def'ghi" | ["abc'def'ghi"]
'abc"def' | ['abc"def']
'abc"def"' | ['abc"def"']
'abc"def" ghi' | ['abc"def"', 'ghi']
'abc"def"ghi' | ['abc"def"ghi']
"r'AA' r'._xyz$'" | ["r'AA'", "r'._xyz$'"]
'abc"def ghi"' | ['abc"def ghi"']
'abc"def ghi""jkl"' | ['abc"def ghi""jkl"']
'a"b c"d"e"f"g h"' | ['a"b c"d"e"f"g h"']
'c="ls /" type key' | ['c="ls /"', 'type', 'key']
"abc'def ghi'" | ["abc'def ghi'"]
"c='ls /' type key" | ["c='ls /'", 'type', 'key']
I ended up with the following function to split a string such that the expected output results for all input strings:
import re

def quoted_split(s):
    def strip_quotes(s):
        # Remove one pair of matching outer quotes, if present.
        if s and (s[0] == '"' or s[0] == "'") and s[0] == s[-1]:
            return s[1:-1]
        return s
    # Unescape \" and \' after stripping the outer quotes.
    return [strip_quotes(p).replace('\\"', '"').replace("\\'", "'")
            for p in re.findall(r'(?:[^"\s]*"(?:\\.|[^"])*"[^"\s]*)+|(?:[^\'\s]*\'(?:\\.|[^\'])*\'[^\'\s]*)+|[^\s]+', s)]
It ain't pretty, but it works. The following test application checks the results of other approaches (shlex and csv for now) and the custom split implementation:
#!/bin/python2.7
import csv
import re
import shlex
from timeit import timeit

def test_case(fn, s, expected):
    try:
        if fn(s) == expected:
            print '[ OK ] %s -> %s' % (s, fn(s))
        else:
            print '[FAIL] %s -> %s' % (s, fn(s))
    except Exception as e:
        print '[FAIL] %s -> exception: %s' % (s, e)

# Same call pattern as test_case, but silent; used for benchmarking.
def test_case_no_output(fn, s, expected):
    try:
        fn(s)
    except Exception:
        pass

def test_split(fn, test_case_fn=test_case):
    test_case_fn(fn, 'abc def', ['abc', 'def'])
    test_case_fn(fn, "abc \\s def", ['abc', '\\s', 'def'])
    test_case_fn(fn, '"abc def" ghi', ['abc def', 'ghi'])
    test_case_fn(fn, "'abc def' ghi", ['abc def', 'ghi'])
    test_case_fn(fn, '"abc \\" def" ghi', ['abc " def', 'ghi'])
    test_case_fn(fn, "'abc \\' def' ghi", ["abc ' def", 'ghi'])
    test_case_fn(fn, "'abc \\s def' ghi", ['abc \\s def', 'ghi'])
    test_case_fn(fn, '"abc \\s def" ghi', ['abc \\s def', 'ghi'])
    test_case_fn(fn, '"" test', ['', 'test'])
    test_case_fn(fn, "'' test", ['', 'test'])
    test_case_fn(fn, "abc'def", ["abc'def"])
    test_case_fn(fn, "abc'def'", ["abc'def'"])
    test_case_fn(fn, "abc'def' ghi", ["abc'def'", 'ghi'])
    test_case_fn(fn, "abc'def'ghi", ["abc'def'ghi"])
    test_case_fn(fn, 'abc"def', ['abc"def'])
    test_case_fn(fn, 'abc"def"', ['abc"def"'])
    test_case_fn(fn, 'abc"def" ghi', ['abc"def"', 'ghi'])
    test_case_fn(fn, 'abc"def"ghi', ['abc"def"ghi'])
    test_case_fn(fn, "r'AA' r'._xyz$'", ["r'AA'", "r'._xyz$'"])
    test_case_fn(fn, 'abc"def ghi"', ['abc"def ghi"'])
    test_case_fn(fn, 'abc"def ghi""jkl"', ['abc"def ghi""jkl"'])
    test_case_fn(fn, 'a"b c"d"e"f"g h"', ['a"b c"d"e"f"g h"'])
    test_case_fn(fn, 'c="ls /" type key', ['c="ls /"', 'type', 'key'])
    test_case_fn(fn, "abc'def ghi'", ["abc'def ghi'"])
    test_case_fn(fn, "c='ls /' type key", ["c='ls /'", 'type', 'key'])

def csv_split(s):
    return list(csv.reader([s], delimiter=' '))[0]

def re_split(s):
    def strip_quotes(s):
        if s and (s[0] == '"' or s[0] == "'") and s[0] == s[-1]:
            return s[1:-1]
        return s
    return [strip_quotes(p).replace('\\"', '"').replace("\\'", "'")
            for p in re.findall(r'(?:[^"\s]*"(?:\\.|[^"])*"[^"\s]*)+|(?:[^\'\s]*\'(?:\\.|[^\'])*\'[^\'\s]*)+|[^\s]+', s)]

if __name__ == '__main__':
    print 'shlex\n'
    test_split(shlex.split)
    print

    print 'csv\n'
    test_split(csv_split)
    print

    print 're\n'
    test_split(re_split)
    print

    iterations = 100
    setup = 'from __main__ import test_split, test_case_no_output, csv_split, re_split\nimport shlex, re'

    def benchmark(method, code):
        print '%s: %.3fms per iteration' % (method, (1000 * timeit(code, setup=setup, number=iterations) / iterations))

    benchmark('shlex', 'test_split(shlex.split, test_case_no_output)')
    benchmark('csv', 'test_split(csv_split, test_case_no_output)')
    benchmark('re', 'test_split(re_split, test_case_no_output)')
Output:
shlex

[ OK ] abc def -> ['abc', 'def']
[FAIL] abc \s def -> ['abc', 's', 'def']
[ OK ] "abc def" ghi -> ['abc def', 'ghi']
[ OK ] 'abc def' ghi -> ['abc def', 'ghi']
[ OK ] "abc \" def" ghi -> ['abc " def', 'ghi']
[FAIL] 'abc \' def' ghi -> exception: No closing quotation
[ OK ] 'abc \s def' ghi -> ['abc \\s def', 'ghi']
[ OK ] "abc \s def" ghi -> ['abc \\s def', 'ghi']
[ OK ] "" test -> ['', 'test']
[ OK ] '' test -> ['', 'test']
[FAIL] abc'def -> exception: No closing quotation
[FAIL] abc'def' -> ['abcdef']
[FAIL] abc'def' ghi -> ['abcdef', 'ghi']
[FAIL] abc'def'ghi -> ['abcdefghi']
[FAIL] abc"def -> exception: No closing quotation
[FAIL] abc"def" -> ['abcdef']
[FAIL] abc"def" ghi -> ['abcdef', 'ghi']
[FAIL] abc"def"ghi -> ['abcdefghi']
[FAIL] r'AA' r'._xyz$' -> ['rAA', 'r._xyz$']
[FAIL] abc"def ghi" -> ['abcdef ghi']
[FAIL] abc"def ghi""jkl" -> ['abcdef ghijkl']
[FAIL] a"b c"d"e"f"g h" -> ['ab cdefg h']
[FAIL] c="ls /" type key -> ['c=ls /', 'type', 'key']
[FAIL] abc'def ghi' -> ['abcdef ghi']
[FAIL] c='ls /' type key -> ['c=ls /', 'type', 'key']

csv

[ OK ] abc def -> ['abc', 'def']
[ OK ] abc \s def -> ['abc', '\\s', 'def']
[ OK ] "abc def" ghi -> ['abc def', 'ghi']
[FAIL] 'abc def' ghi -> ["'abc", "def'", 'ghi']
[FAIL] "abc \" def" ghi -> ['abc \\', 'def"', 'ghi']
[FAIL] 'abc \' def' ghi -> ["'abc", "\\'", "def'", 'ghi']
[FAIL] 'abc \s def' ghi -> ["'abc", '\\s', "def'", 'ghi']
[ OK ] "abc \s def" ghi -> ['abc \\s def', 'ghi']
[ OK ] "" test -> ['', 'test']
[FAIL] '' test -> ["''", 'test']
[ OK ] abc'def -> ["abc'def"]
[ OK ] abc'def' -> ["abc'def'"]
[ OK ] abc'def' ghi -> ["abc'def'", 'ghi']
[ OK ] abc'def'ghi -> ["abc'def'ghi"]
[ OK ] abc"def -> ['abc"def']
[ OK ] abc"def" -> ['abc"def"']
[ OK ] abc"def" ghi -> ['abc"def"', 'ghi']
[ OK ] abc"def"ghi -> ['abc"def"ghi']
[ OK ] r'AA' r'._xyz$' -> ["r'AA'", "r'._xyz$'"]
[FAIL] abc"def ghi" -> ['abc"def', 'ghi"']
[FAIL] abc"def ghi""jkl" -> ['abc"def', 'ghi""jkl"']
[FAIL] a"b c"d"e"f"g h" -> ['a"b', 'c"d"e"f"g', 'h"']
[FAIL] c="ls /" type key -> ['c="ls', '/"', 'type', 'key']
[FAIL] abc'def ghi' -> ["abc'def", "ghi'"]
[FAIL] c='ls /' type key -> ["c='ls", "/'", 'type', 'key']

re

[ OK ] abc def -> ['abc', 'def']
[ OK ] abc \s def -> ['abc', '\\s', 'def']
[ OK ] "abc def" ghi -> ['abc def', 'ghi']
[ OK ] 'abc def' ghi -> ['abc def', 'ghi']
[ OK ] "abc \" def" ghi -> ['abc " def', 'ghi']
[ OK ] 'abc \' def' ghi -> ["abc ' def", 'ghi']
[ OK ] 'abc \s def' ghi -> ['abc \\s def', 'ghi']
[ OK ] "abc \s def" ghi -> ['abc \\s def', 'ghi']
[ OK ] "" test -> ['', 'test']
[ OK ] '' test -> ['', 'test']
[ OK ] abc'def -> ["abc'def"]
[ OK ] abc'def' -> ["abc'def'"]
[ OK ] abc'def' ghi -> ["abc'def'", 'ghi']
[ OK ] abc'def'ghi -> ["abc'def'ghi"]
[ OK ] abc"def -> ['abc"def']
[ OK ] abc"def" -> ['abc"def"']
[ OK ] abc"def" ghi -> ['abc"def"', 'ghi']
[ OK ] abc"def"ghi -> ['abc"def"ghi']
[ OK ] r'AA' r'._xyz$' -> ["r'AA'", "r'._xyz$'"]
[ OK ] abc"def ghi" -> ['abc"def ghi"']
[ OK ] abc"def ghi""jkl" -> ['abc"def ghi""jkl"']
[ OK ] a"b c"d"e"f"g h" -> ['a"b c"d"e"f"g h"']
[ OK ] c="ls /" type key -> ['c="ls /"', 'type', 'key']
[ OK ] abc'def ghi' -> ["abc'def ghi'"]
[ OK ] c='ls /' type key -> ["c='ls /'", 'type', 'key']

shlex: 0.335ms per iteration
csv: 0.036ms per iteration
re: 0.068ms per iteration
So performance is much better than shlex, and can be improved further by precompiling the regular expression, in which case it will outperform the csv approach.
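A minimal sketch of what precompiling could look like, assuming the pattern is "optional unquoted prefix, quoted run with backslash escapes, optional unquoted suffix" for each quote type, with a plain non-whitespace run as the fallback. The names _TOKEN_RE and quoted_split_compiled are hypothetical, and I have not benchmarked this variant.

```python
import re

# Compiled once at import time instead of on every call.
_TOKEN_RE = re.compile(
    r'(?:[^"\s]*"(?:\\.|[^"])*"[^"\s]*)+'   # token containing "..." runs
    r"|(?:[^'\s]*'(?:\\.|[^'])*'[^'\s]*)+"  # token containing '...' runs
    r'|\S+'                                 # plain run of non-whitespace
)

def _strip_quotes(token):
    # Remove one pair of matching outer quotes, if present.
    if token and token[0] in '"\'' and token[0] == token[-1]:
        return token[1:-1]
    return token

def quoted_split_compiled(s):
    return [_strip_quotes(t).replace('\\"', '"').replace("\\'", "'")
            for t in _TOKEN_RE.findall(s)]

print(quoted_split_compiled('this is "a test"'))    # ['this', 'is', 'a test']
print(quoted_split_compiled('c="ls /" type key'))   # ['c="ls /"', 'type', 'key']
print(quoted_split_compiled('"abc \\" def" ghi'))   # ['abc " def', 'ghi']
```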
Solution 8 - Python
Since this question is tagged with regex, I decided to try a regex approach. I first replace all the spaces in the quoted parts with \x00, then split by spaces, then replace the \x00 back to spaces in each part.
Both versions do the same thing, but splitter is a bit more readable than splitter2.
import re

s = 'this is "a test" some text "another test"'

def splitter(s):
    def replacer(m):
        # Hide spaces inside a quoted run behind a placeholder byte.
        return m.group(0).replace(" ", "\x00")
    parts = re.sub('".+?"', replacer, s).split()
    parts = [p.replace("\x00", " ") for p in parts]
    return parts

def splitter2(s):
    return [p.replace("\x00", " ") for p in re.sub('".+?"', lambda m: m.group(0).replace(" ", "\x00"), s).split()]

print(splitter2(s))
Solution 9 - Python
Speed test of different answers:
import re
import shlex
import csv
line = 'this is "a test"'
%timeit [p for p in re.split("( |\\\".*?\\\"|'.*?')", line) if p.strip()]
100000 loops, best of 3: 5.17 µs per loop
%timeit re.findall(r'[^"\s]\S*|".+?"', line)
100000 loops, best of 3: 2.88 µs per loop
%timeit list(csv.reader([line], delimiter=" "))
The slowest run took 9.62 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.4 µs per loop
%timeit shlex.split(line)
10000 loops, best of 3: 50.2 µs per loop
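The %timeit lines above need IPython. A rough equivalent with the standard timeit module might look like the sketch below; the dictionary layout and the iteration count are my own choices, and the numbers will differ by machine and Python version, so no expected output is shown.

```python
import csv
import re
import shlex
import timeit

line = 'this is "a test"'
number = 10000  # calls per candidate; an arbitrary choice

# Each entry wraps one of the approaches timed above.
candidates = {
    "re.split": lambda: [p for p in re.split(r'''( |".*?"|'.*?')''', line) if p.strip()],
    "re.findall": lambda: re.findall(r'[^"\s]\S*|".+?"', line),
    "csv.reader": lambda: list(csv.reader([line], delimiter=" ")),
    "shlex.split": lambda: shlex.split(line),
}

results = {}
for name, fn in candidates.items():
    seconds = timeit.timeit(fn, number=number)
    results[name] = seconds / number * 1e6  # microseconds per call
    print("%-12s %.2f us per call" % (name, results[name]))
```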
Solution 10 - Python
To preserve quotes use this function:
def getArgs(s):
    args = []
    cur = ''
    inQuotes = 0
    for char in s.strip():
        if char == ' ' and not inQuotes:
            args.append(cur)
            cur = ''
        elif char == '"' and not inQuotes:
            inQuotes = 1
            cur += char
        elif char == '"' and inQuotes:
            inQuotes = 0
            cur += char
        else:
            cur += char
    args.append(cur)
    return args
Solution 11 - Python
Hmm, can't seem to find the "Reply" button... anyway, this answer is based on the approach by Kate, but correctly splits strings with substrings containing escaped quotes and also removes the start and end quotes of the substrings:
[i.strip('"').strip("'") for i in re.split(r'(\s+|(?<!\\)".*?(?<!\\)"|(?<!\\)\'.*?(?<!\\)\')', string) if i.strip()]
This works on strings like 'This is " a \\\"test\\\"\\\'s substring"' (the insane markup is unfortunately necessary to keep Python from removing the escapes).
If the resulting escapes in the strings in the returned list are not wanted, you can use this slightly altered version of the function:
[i.strip('"').strip("'").decode('string_escape') for i in re.split(r'(\s+|(?<!\\)".*?(?<!\\)"|(?<!\\)\'.*?(?<!\\)\')', string) if i.strip()]
Solution 12 - Python
To get around the unicode issues in some Python 2 versions, I suggest:
from shlex import split as _split
split = lambda a: [b.decode('utf-8') for b in _split(a.encode('utf-8'))]
Solution 13 - Python
As an option try tssplit:
In [1]: from tssplit import tssplit
In [2]: tssplit('this is "a test"', quote='"', delimiter='')
Out[2]: ['this', 'is', 'a test']
Solution 14 - Python
I suggest:
test string:
s = 'abc "ad" \'fg\' "kk\'rdt\'" zzz"34"zzz "" \'\''
to capture also "" and '':
import re
re.findall(r'"[^"]*"|\'[^\']*\'|[^"\'\s]+',s)
result:
['abc', '"ad"', "'fg'", '"kk\'rdt\'"', 'zzz', '"34"', 'zzz', '""', "''"]
to ignore empty "" and '':
import re
re.findall(r'"[^"]+"|\'[^\']+\'|[^"\'\s]+',s)
result:
['abc', '"ad"', "'fg'", '"kk\'rdt\'"', 'zzz', '"34"', 'zzz']
Solution 15 - Python
Try this:
def adamsplit(s):
    result = []
    inquotes = False
    for substring in s.split('"'):
        if not inquotes:
            result.extend(substring.split())
        else:
            result.append(substring)
        inquotes = not inquotes
    return result
Some test strings:
'This is "a test"' -> ['This', 'is', 'a test']
'"This is \'a test\'"' -> ["This is 'a test'"]
Solution 16 - Python
If you don't care about quoted substrings, then a simple split() will do:
>>> 'a short sized string with spaces '.split()
Performance:
>>> s = " ('a short sized string with spaces '*100).split() "
>>> t = timeit.Timer(stmt=s)
>>> print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
171.39 usec/pass
Or use the string module (Python 2 only):
>>> from string import split as stringsplit;
>>> stringsplit('a short sized string with spaces '*100)
Performance: String module seems to perform better than string methods
>>> s = "stringsplit('a short sized string with spaces '*100)"
>>> t = timeit.Timer(s, "from string import split as stringsplit")
>>> print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
154.88 usec/pass
Or you can use the re engine:
>>> from re import split as resplit
>>> regex = '\s+'
>>> medstring = 'a short sized string with spaces '*100
>>> resplit(regex, medstring)
Performance
>>> s = "resplit(regex, medstring)"
>>> t = timeit.Timer(s, "from re import split as resplit; regex='\s+'; medstring='a short sized string with spaces '*100")
>>> print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
540.21 usec/pass
For very long strings, you should not load the entire string into memory; instead, either split it line by line or use an iterative loop.
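One way to process large input lazily is a generator that takes any iterable of lines (an open file, for example) and yields tokens one at a time. This is only a sketch: iter_tokens and TOKEN_RE are hypothetical names, and the pattern is a deliberately simple one that only understands double quotes at the start of a token.

```python
import io
import re

# A double-quoted run, or a run of non-whitespace characters.
TOKEN_RE = re.compile(r'"[^"]*"|\S+')

def iter_tokens(lines):
    """Yield tokens lazily, one input line at a time."""
    for line in lines:
        for match in TOKEN_RE.finditer(line):
            token = match.group(0)
            # Drop the outer double quotes of a quoted token.
            if len(token) >= 2 and token[0] == token[-1] == '"':
                token = token[1:-1]
            yield token

# io.StringIO stands in for a large file opened with open(...).
sample = io.StringIO('this is "a test"\nand "one more" line\n')
print(list(iter_tokens(sample)))
# ['this', 'is', 'a test', 'and', 'one more', 'line']
```

Only one line is held in memory at a time, so the same loop works on files that are far larger than available RAM.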