In Python, how do I split a string and keep the separators?

Python Problem Overview

Here's the simplest way to explain this. Here's what I'm using:

re.split('\W', 'foo/bar spam\neggs')
-> ['foo', 'bar', 'spam', 'eggs']

Here's what I want:

someMethod('\W', 'foo/bar spam\neggs')
-> ['foo', '/', 'bar', ' ', 'spam', '\n', 'eggs']

The reason is that I want to split a string into tokens, manipulate it, then put it back together again.

Python Solutions

Solution 1 - Python

>>> re.split('(\W)', 'foo/bar spam\neggs')
['foo', '/', 'bar', ' ', 'spam', '\n', 'eggs']

Solution 2 - Python

If you are splitting on newline, use splitlines(True).

>>> 'line 1\nline 2\nline without newline'.splitlines(True)
['line 1\n', 'line 2\n', 'line without newline']

(Not a general solution, but adding this here in case someone comes here not realizing this method existed.)

Solution 3 - Python

another example, split on non alpha-numeric and keep the separators

import re
a = "foo,bar@candy*ice%cream"
re.split('([^a-zA-Z0-9])',a)

output:

['foo', ',', 'bar', '@', 'candy', '*', 'ice', '%', 'cream']

explanation

re.split('([^a-zA-Z0-9])',a)

() <- keep the separators
[] <- match everything in between
^a-zA-Z0-9 <-except alphabets, upper/lower and numbers.

Solution 4 - Python

If you have only 1 separator, you can employ list comprehensions:

text = 'foo,bar,baz,qux'  
sep = ','

Appending/prepending separator:

result = [x+sep for x in text.split(sep)]
#['foo,', 'bar,', 'baz,', 'qux,']
# to get rid of trailing
result[-1] = result[-1].strip(sep)
#['foo,', 'bar,', 'baz,', 'qux']

result = [sep+x for x in text.split(sep)]
#[',foo', ',bar', ',baz', ',qux']
# to get rid of trailing
result[0] = result[0].strip(sep)
#['foo', ',bar', ',baz', ',qux']

Separator as it's own element:

result = [u for x in text.split(sep) for u in (x, sep)]
#['foo', ',', 'bar', ',', 'baz', ',', 'qux', ',']
results = result[:-1]   # to get rid of trailing

Solution 5 - Python

Another no-regex solution that works well on Python 3

# Split strings and keep separator
test_strings = ['<Hello>', 'Hi', '<Hi> <Planet>', '<', '']

def split_and_keep(s, sep):
   if not s: return [''] # consistent with string.split()

   # Find replacement character that is not used in string
   # i.e. just use the highest available character plus one
   # Note: This fails if ord(max(s)) = 0x10FFFF (ValueError)
   p=chr(ord(max(s))+1) 

   return s.replace(sep, sep+p).split(p)

for s in test_strings:
   print(split_and_keep(s, '<'))


# If the unicode limit is reached it will fail explicitly
unicode_max_char = chr(1114111)
ridiculous_string = '<Hello>'+unicode_max_char+'<World>'
print(split_and_keep(ridiculous_string, '<'))

Solution 6 - Python

You can also split a string with an array of strings instead of a regular expression, like this:

def tokenizeString(aString, separators):
    #separators is an array of strings that are being used to split the string.
    #sort separators in order of descending length
    separators.sort(key=len)
    listToReturn = []
    i = 0
    while i < len(aString):
        theSeparator = ""
        for current in separators:
            if current == aString[i:i+len(current)]:
                theSeparator = current
        if theSeparator != "":
            listToReturn += [theSeparator]
            i = i + len(theSeparator)
        else:
            if listToReturn == []:
                listToReturn = [""]
            if(listToReturn[-1] in separators):
                listToReturn += [""]
            listToReturn[-1] += aString[i]
            i += 1
    return listToReturn
    

print(tokenizeString(aString = "\"\"\"hi\"\"\" hello + world += (1*2+3/5) '''hi'''", separators = ["'''", '+=', '+', "/", "*", "\\'", '\\"', "-=", "-", " ", '"""', "(", ")"]))

Solution 7 - Python

# This keeps all separators  in result 
##########################################################################
import re
st="%%(c+dd+e+f-1523)%%7"
sh=re.compile('[\+\-//\*\<\>\%\(\)]')

def splitStringFull(sh, st):
   ls=sh.split(st)
   lo=[]
   start=0
   for l in ls:
	 if not l : continue
	 k=st.find(l)
	 llen=len(l)
	 if k> start:
	   tmp= st[start:k]
	   lo.append(tmp)
	   lo.append(l)
	   start = k + llen
	 else:
	   lo.append(l)
	   start =llen
   return lo
  #############################

li= splitStringFull(sh , st)
['%%(', 'c', '+', 'dd', '+', 'e', '+', 'f', '-', '1523', ')%%', '7']

Solution 8 - Python

One Lazy and Simple Solution

Assume your regex pattern is split_pattern = r'(!|\?)'

First, you add some same character as the new separator, like '[cut]'

new_string = re.sub(split_pattern, '\\1[cut]', your_string)

Then you split the new separator, new_string.split('[cut]')

Solution 9 - Python

replace all seperator: (\W) with seperator + new_seperator: (\W;)
split by the new_seperator: (;)

def split_and_keep(seperator, s):
  return re.split(';', re.sub(seperator, lambda match: match.group() + ';', s))

print('\W', 'foo/bar spam\neggs')

Solution 10 - Python

Here is a simple .split solution that works without regex.

This is an answer for https://stackoverflow.com/questions/7866128/python-split-without-removing-the-delimiter, so not exactly what the original post asks but the other question was closed as a duplicate for this one.

def splitkeep(s, delimiter):
    split = s.split(delimiter)
    return [substr + delimiter for substr in split[:-1]] + [split[-1]]

Random tests:

import random

CHARS = [".", "a", "b", "c"]
assert splitkeep("", "X") == [""]  # 0 length test
for delimiter in ('.', '..'):
    for _ in range(100000):
        length = random.randint(1, 50)
        s = "".join(random.choice(CHARS) for _ in range(length))
        assert "".join(splitkeep(s, delimiter)) == s

Solution 11 - Python

If one wants to split string while keeping separators by regex without capturing group:

def finditer_with_separators(regex, s):
    matches = []
    prev_end = 0
    for match in regex.finditer(s):
        match_start = match.start()
        if (prev_end != 0 or match_start > 0) and match_start != prev_end:
            matches.append(s[prev_end:match.start()])
        matches.append(match.group())
        prev_end = match.end()
    if prev_end < len(s):
        matches.append(s[prev_end:])
    return matches

regex = re.compile(r"[\(\)]")
matches = finditer_with_separators(regex, s)

If one assumes that regex is wrapped up into capturing group:

def split_with_separators(regex, s):
    matches = list(filter(None, regex.split(s)))
    return matches

regex = re.compile(r"([\(\)])")
matches = split_with_separators(regex, s)

Both ways also will remove empty groups which are useless and annoying in most of the cases.

Solution 12 - Python

install wrs "WITHOUT REMOVING SPLITOR" BY DOING

pip install wrs

(developed by Rao Hamza)

import wrs
text  = "Now inbox “how to make spam ad” Invest in hard email marketing."
splitor = 'email | spam | inbox'
list = wrs.wr_split(splitor, text)
print(list)

result: ['now ', 'inbox “how to make ', 'spam ad” invest in hard ', 'email marketing.']

Solution 13 - Python

May I just leave it here

s = 'foo/bar spam\neggs'
print(s.replace('/', '+++/+++').replace(' ', '+++ +++').replace('\n', '+++\n+++').split('+++'))

['foo', '/', 'bar', ' ', 'spam', '\n', 'eggs']

Solution 14 - Python

I had a similar issue trying to split a file path and struggled to find a simple answer. This worked for me and didn't involve having to substitute delimiters back into the split text:

my_path = 'folder1/folder2/folder3/file1'

import re

re.findall('[^/]+/|[^/]+', my_path)

returns:

['folder1/', 'folder2/', 'folder3/', 'file1']

Solution 15 - Python

I found this generator based approach more satisfying:

def split_keep(string, sep):
    """Usage:
    >>> list(split_keep("a.b.c.d", "."))
    ['a.', 'b.', 'c.', 'd']
    """
    start = 0
    while True:
        end = string.find(sep, start) + 1
        if end == 0:
            break
        yield string[start:end]
        start = end
    yield string[start:]

It avoids the need to figure out the correct regex, while in theory should be fairly cheap. It doesn't create new string objects and, delegates most of the iteration work to the efficient find method.

... and in Python 3.8 it can be as short as:

def split_keep(string, sep):
    start = 0
    while (end := string.find(sep, start) + 1) > 0:
        yield string[start:end]
        start = end
    yield string[start:]

Solution 16 - Python

Use re.split and also your regular expression comes from variable and also you have multi separator ,you can use as the following:

# BashSpecialParamList is the special param in bash,
# such as your separator is the bash special param
BashSpecialParamList = ["$*", "$@", "$#", "$?", "$-", "$$", "$!", "$0"]
# aStr is the the string to be splited
aStr = "$a Klkjfd$0 $? $#%$*Sdfdf"

reStr = "|".join([re.escape(sepStr) for sepStr in BashSpecialParamList])

re.split(f'({reStr})', aStr)

# Then You can get the result:
# ['$a Klkjfd', '$0', ' ', '$?', ' ', '$#', '%', '$*', 'Sdfdf']

reference: GNU Bash Special Parameters

Solution 17 - Python

Some of those answers posted before, will repeat delimiter, or have some other bugs which I faced in my case. You can use this function, instead:

def split_and_keep_delimiter(input, delimiter):
    result      = list()
    idx         = 0
    while delimiter in input:
        idx     = input.index(delimiter);
        result.append(input[0:idx+len(delimiter)])
        input = input[idx+len(delimiter):]
    result.append(input)
    return result

Content Type	Original Author	Original Content on Stackoverflow
Question	Ken Kinder	View Question on Stackoverflow
Solution 1 - Python	Commodore Jaeger	View Answer on Stackoverflow
Solution 2 - Python	Mark Lodato	View Answer on Stackoverflow
Solution 3 - Python	anurag	View Answer on Stackoverflow
Solution 4 - Python	Granitosaurus	View Answer on Stackoverflow
Solution 5 - Python	ootwch	View Answer on Stackoverflow
Solution 6 - Python	Anderson Green	View Answer on Stackoverflow
Solution 7 - Python	Moisey Oysgelt	View Answer on Stackoverflow
Solution 8 - Python	Yilei Wang	View Answer on Stackoverflow
Solution 9 - Python	kobako	View Answer on Stackoverflow
Solution 10 - Python	orestisf	View Answer on Stackoverflow
Solution 11 - Python	Dmitriy Sintsov	View Answer on Stackoverflow
Solution 12 - Python	Rao Mohammad	View Answer on Stackoverflow
Solution 13 - Python	Marat Zakirov	View Answer on Stackoverflow
Solution 14 - Python	Conor	View Answer on Stackoverflow
Solution 15 - Python	Chen Levy	View Answer on Stackoverflow
Solution 16 - Python	HobbyMarks	View Answer on Stackoverflow
Solution 17 - Python	Tayyebi	View Answer on Stackoverflow