How to do CamelCase split in Python

Python Problem Overview

What I was trying to achieve was something like this:

>>> camel_case_split("CamelCaseXYZ")
['Camel', 'Case', 'XYZ']
>>> camel_case_split("XYZCamelCase")
['XYZ', 'Camel', 'Case']

So I searched and found this perfect regular expression:

(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])

As the next logical step I tried:

>>> re.split("(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", "CamelCaseXYZ")
['CamelCaseXYZ']

Why does this not work, and how do I achieve the result from the linked question in Python?

Edit: Solution summary

I tested all provided solutions with a few test cases:

string:                 ''
AplusKminus:            ['']
casimir_et_hippolyte:   []
two_hundred_success:    []
kalefranz:              string index out of range # with modification: either [] or ['']

string:                 ' '
AplusKminus:            [' ']
casimir_et_hippolyte:   []
two_hundred_success:    [' ']
kalefranz:              [' ']

string:                 'lower'
all algorithms:         ['lower']

string:                 'UPPER'
all algorithms:         ['UPPER']

string:                 'Initial'
all algorithms:         ['Initial']

string:                 'dromedaryCase'
AplusKminus:            ['dromedary', 'Case']
casimir_et_hippolyte:   ['dromedary', 'Case']
two_hundred_success:    ['dromedary', 'Case']
kalefranz:              ['Dromedary', 'Case'] # with modification: ['dromedary', 'Case']

string:                 'CamelCase'
all algorithms:         ['Camel', 'Case']

string:                 'ABCWordDEF'
AplusKminus:            ['ABC', 'Word', 'DEF']
casimir_et_hippolyte:   ['ABC', 'Word', 'DEF']
two_hundred_success:    ['ABC', 'Word', 'DEF']
kalefranz:              ['ABCWord', 'DEF']

In summary, the solution by @kalefranz does not match the question (see the last case), and the solution by @casimir et hippolyte eats a single space, thereby violating the idea that a split should not change the individual parts. The only difference between the remaining two alternatives is that my solution returns a list containing the empty string for an empty input, while the solution by @200_success returns an empty list. I don't know where the Python community stands on that issue, so I am fine with either one. Since 200_success's solution is simpler, I accepted it as the correct answer.

Python Solutions

Solution 1 - Python

As @AplusKminus has explained, re.split() never splits on an empty pattern match (this was the behavior before Python 3.7). Therefore, instead of splitting, you should try finding the components you are interested in.

Here is a solution using re.finditer() that emulates splitting:

from re import finditer

def camel_case_split(identifier):
    # Lazily match up to the next case boundary or the end of the string
    matches = finditer('.+?(?:(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|$)', identifier)
    return [m.group(0) for m in matches]
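
As a quick sanity check, the function can be run against a few of the inputs from the question's summary. This is only a sketch that repeats the function so it runs on its own; the expected values in the comments are the ones the question reports for this solution.

```python
from re import finditer

def camel_case_split(identifier):
    matches = finditer('.+?(?:(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|$)', identifier)
    return [m.group(0) for m in matches]

for sample in ['', ' ', 'dromedaryCase', 'ABCWordDEF']:
    print(repr(sample), '->', camel_case_split(sample))
# '' -> []
# ' ' -> [' ']
# 'dromedaryCase' -> ['dromedary', 'Case']
# 'ABCWordDEF' -> ['ABC', 'Word', 'DEF']
```

The empty list for an empty input follows from `.+?` needing at least one character to match.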

Solution 2 - Python

Use re.sub() and split()

import re

name = 'CamelCaseTest123'
splitted = re.sub('([A-Z][a-z]+)', r' \1', re.sub('([A-Z]+)', r' \1', name)).split()

Result

'CamelCaseTest123' -> ['Camel', 'Case', 'Test123']
'CamelCaseXYZ' -> ['Camel', 'Case', 'XYZ']
'XYZCamelCase' -> ['XYZ', 'Camel', 'Case']
'XYZ' -> ['XYZ']
'IPAddress' -> ['IP', 'Address']
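
To make the two substitution steps easier to follow, here is a sketch that wraps the one-liner in a function. The name split_with_sub is made up for illustration; the regular expressions are the ones from the snippet above.

```python
import re

def split_with_sub(name):
    # Step 1: put a space in front of every run of upper-case letters.
    spaced = re.sub('([A-Z]+)', r' \1', name)
    # Step 2: put a space in front of every capitalized word, which separates
    # the last capital of a run such as 'XYZC' from the capitals before it.
    spaced = re.sub('([A-Z][a-z]+)', r' \1', spaced)
    # str.split() with no argument discards the surplus whitespace.
    return spaced.split()

print(split_with_sub('XYZCamelCase'))  # ['XYZ', 'Camel', 'Case']
print(split_with_sub('IPAddress'))     # ['IP', 'Address']
```

Because the final step is a plain str.split(), whitespace that was already in the input is discarded as well, so an input of ' ' yields an empty list.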

Solution 3 - Python

Most of the time, when you don't need to check the format of a string, a global search is simpler than a split (for the same result):

re.findall(r'[A-Z](?:[a-z]+|[A-Z]*(?=[A-Z]|$))', 'CamelCaseXYZ')

returns

['Camel', 'Case', 'XYZ']

To deal with dromedary too, you can use:

re.findall(r'[A-Z]?[a-z]+|[A-Z]+(?=[A-Z]|$)', 'camelCaseXYZ')

Note: (?=[A-Z]|$) can be shortened using a double negation (a negative lookahead with a negated character class): (?![^A-Z])
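
A small sketch to compare the two spellings of the lookahead on a few sample inputs. The constant names are made up for illustration, and the samples are chosen without trailing newlines.

```python
import re

WITH_ALTERNATION = r'[A-Z]?[a-z]+|[A-Z]+(?=[A-Z]|$)'
WITH_DOUBLE_NEGATION = r'[A-Z]?[a-z]+|[A-Z]+(?![^A-Z])'

for sample in ['camelCaseXYZ', 'XYZCamelCase', 'ABCWordDEF']:
    a = re.findall(WITH_ALTERNATION, sample)
    b = re.findall(WITH_DOUBLE_NEGATION, sample)
    print(sample, a, a == b)
# camelCaseXYZ ['camel', 'Case', 'XYZ'] True
# XYZCamelCase ['XYZ', 'Camel', 'Case'] True
# ABCWordDEF ['ABC', 'Word', 'DEF'] True
```

The two forms are not strictly identical in every situation: `$` also matches just before a trailing newline, while the negated class treats that newline as a non-uppercase character.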

Solution 4 - Python

Working solution, without regexp

I am not that good at regexp. I like to use them for search/replace in my IDE but I try to avoid them in programs.

Here is a quite straightforward solution in pure python:

def camel_case_split(s):
    idx = list(map(str.isupper, s))
    # mark change of case
    l = [0]
    for (i, (x, y)) in enumerate(zip(idx, idx[1:])):
        if x and not y:  # "Ul"
            l.append(i)
        elif not x and y:  # "lU"
            l.append(i+1)
    l.append(len(s))
    # for "lUl", index of "U" will pop twice, have to filter that
    return [s[x:y] for x, y in zip(l, l[1:]) if x < y]

And some tests

def test():
	TESTS = [
		("aCamelCaseWordT", ['a', 'Camel', 'Case', 'Word', 'T']),
		("CamelCaseWordT", ['Camel', 'Case', 'Word', 'T']),
		("CamelCaseWordTa", ['Camel', 'Case', 'Word', 'Ta']),
		("aCamelCaseWordTa", ['a', 'Camel', 'Case', 'Word', 'Ta']),
		("Ta", ['Ta']),
		("aT", ['a', 'T']),
		("a", ['a']),
		("T", ['T']),
		("", []),
		("XYZCamelCase", ['XYZ', 'Camel', 'Case']),
		("CamelCaseXYZ", ['Camel', 'Case', 'XYZ']),
		("CamelCaseXYZa", ['Camel', 'Case', 'XY', 'Za']),
	]
	for (q,a) in TESTS:
		assert camel_case_split(q) == a

if __name__ == "__main__":
	test()

Solution 5 - Python

I just stumbled upon this case and wrote a regular expression to solve it. It should work for any group of words, actually.

import re

RE_WORDS = re.compile(r'''
    # Find words in a string. Order matters!
    [A-Z]+(?=[A-Z][a-z]) |  # All upper case before a capitalized word
    [A-Z]?[a-z]+ |  # Capitalized words / all lower case
    [A-Z]+ |  # All upper case
    \d+  # Numbers
''', re.VERBOSE)

The key here is the lookahead on the first possible case. It will match (and preserve) uppercase words before capitalized ones:

assert RE_WORDS.findall('FOOBar') == ['FOO', 'Bar']
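
A few more inputs, as a sketch, to show how the alternatives interact when digits and acronyms are mixed in. The pattern is repeated so the snippet runs on its own; the sample strings are arbitrary.

```python
import re

RE_WORDS = re.compile(r'''
    [A-Z]+(?=[A-Z][a-z]) |  # All upper case before a capitalized word
    [A-Z]?[a-z]+ |          # Capitalized words / all lower case
    [A-Z]+ |                # All upper case
    \d+                     # Numbers
''', re.VERBOSE)

print(RE_WORDS.findall('CamelCaseTest123'))  # ['Camel', 'Case', 'Test', '123']
print(RE_WORDS.findall('XMLHttpRequest'))    # ['XML', 'Http', 'Request']
print(RE_WORDS.findall('parseHTML5'))        # ['parse', 'HTML', '5']
```

Note that digits always come out as separate items with this pattern.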

Solution 6 - Python

import re

# Splitting on a zero-width match requires Python 3.7 or newer
re.split('(?<=[a-z])(?=[A-Z])', 'camelCamelCAMEL')
# ['camel', 'Camel', 'CAMEL'] <-- result

# '(?<=[a-z])'         --> means preceding lowercase char (group A)
# '(?=[A-Z])'          --> means following UPPERCASE char (group B)
# '(group A)(group B)' --> 'aA' or 'aB' or 'bA' and so on
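
A short sketch of where this simpler pattern differs from the one in the question, assuming Python 3.7 or newer. The constant name SIMPLE is made up for illustration. Since the pattern only looks for a lowercase-to-uppercase transition, a leading acronym stays glued to the following word.

```python
import re

SIMPLE = '(?<=[a-z])(?=[A-Z])'

print(re.split(SIMPLE, 'CamelCaseXYZ'))  # ['Camel', 'Case', 'XYZ']
print(re.split(SIMPLE, 'XYZCamelCase'))  # ['XYZCamel', 'Case']
```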

Solution 7 - Python

The documentation for Python's re.split says:

> Note that split will never split a string on an empty pattern match.

When seeing this:

>>> re.findall("(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", "CamelCaseXYZ")
['', '']

it becomes clear why the split does not work as expected. The re module finds empty matches, just as intended by the regular expression.

Since the documentation states that this is not a bug, but rather intended behavior, you have to work around that when trying to create a camel case split:

from re import finditer

def camel_case_split(identifier):
    matches = finditer('(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])', identifier)
    split_string = []
    # index of beginning of slice
    previous = 0
    for match in matches:
        # get slice
        split_string.append(identifier[previous:match.start()])
        # advance index
        previous = match.start()
    # get remaining string
    split_string.append(identifier[previous:])
    return split_string
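
On Python 3.7 and newer, re.split() does split on empty matches, so the workaround is not needed on those versions. A minimal sketch under that assumption follows; the names BOUNDARY and camel_case_split_py37 are made up for illustration.

```python
import re

BOUNDARY = r'(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])'

def camel_case_split_py37(identifier):
    # Relies on re.split() splitting at zero-width matches (Python 3.7+).
    return re.split(BOUNDARY, identifier)

print(camel_case_split_py37('CamelCaseXYZ'))  # ['Camel', 'Case', 'XYZ']
print(camel_case_split_py37('XYZCamelCase'))  # ['XYZ', 'Camel', 'Case']
print(camel_case_split_py37(''))              # ['']
```

Like the finditer() version above, this returns a list containing the empty string for an empty input.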

Solution 8 - Python

This solution also supports numbers and spaces, and automatically removes underscores:

import re

def camel_terms(value):
    return re.findall('[A-Z][a-z]+|[0-9A-Z]+(?=[A-Z][a-z])|[0-9A-Z]{2,}|[a-z0-9]{2,}|[a-zA-Z0-9]', value)

Some tests:

tests = [
    "XYZCamelCase",
    "CamelCaseXYZ",
    "Camel_CaseXYZ",
    "3DCamelCase",
    "Camel5Case",
    "Camel5Case5D",
    "Camel Case XYZ"
]

for test in tests:
    print(test, "=>", camel_terms(test))

results:

XYZCamelCase => ['XYZ', 'Camel', 'Case']
CamelCaseXYZ => ['Camel', 'Case', 'XYZ']
Camel_CaseXYZ => ['Camel', 'Case', 'XYZ']
3DCamelCase => ['3D', 'Camel', 'Case']
Camel5Case => ['Camel', '5', 'Case']
Camel5Case5D => ['Camel', '5', 'Case', '5D']
Camel Case XYZ => ['Camel', 'Case', 'XYZ']

Solution 9 - Python

Simple solution:

re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", str(text))
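
Note that this expression returns a string with spaces inserted, not a list. A sketch of how it might be used follows; the wrapper name space_out is made up for illustration.

```python
import re

def space_out(text):
    # Insert a space between a lowercase letter or digit and a following capital.
    return re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", str(text))

print(space_out("CamelCaseXYZ"))          # Camel Case XYZ
print(space_out("CamelCaseXYZ").split())  # ['Camel', 'Case', 'XYZ']
print(space_out("XYZCamelCase").split())  # ['XYZCamel', 'Case']
```

As the last line shows, a leading run of capitals is not separated from the word that follows it.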

Solution 10 - Python

Here's another solution that requires less code and no complicated regular expressions:

def camel_case_split(string):
    bldrs = [[string[0].upper()]]
    for c in string[1:]:
        if bldrs[-1][-1].islower() and c.isupper():
            bldrs.append([c])
        else:
            bldrs[-1].append(c)
    return [''.join(bldr) for bldr in bldrs]

Edit

The above code contains an optimization that avoids rebuilding the entire string with every appended character. Leaving out that optimization, a simpler version (with comments) might look like

def camel_case_split2(string):
    # set the logic for creating a "break"
    def is_transition(c1, c2):
      return c1.islower() and c2.isupper()
    
    # start the builder list with the first character
    # enforce upper case
    bldr = [string[0].upper()]
    for c in string[1:]:
        # get the last character in the last element in the builder
        # note that strings can be addressed just like lists
        previous_character = bldr[-1][-1]
        if is_transition(previous_character, c):
            # start a new element in the list
            bldr.append(c)
        else:
            # append the character to the last string
            bldr[-1] += c
    return bldr
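
The question's summary mentions a modified variant that copes with the empty string and leaves the first letter untouched, but it does not show the modification. The following is only one possible sketch of such a variant, with a made-up function name; it is not necessarily what was actually tested.

```python
def camel_case_split_modified(string):
    # Assumption: return an empty list for empty input instead of raising IndexError.
    if not string:
        return []
    # Assumption: keep the first character as-is instead of upper-casing it.
    bldrs = [[string[0]]]
    for c in string[1:]:
        if bldrs[-1][-1].islower() and c.isupper():
            bldrs.append([c])
        else:
            bldrs[-1].append(c)
    return [''.join(bldr) for bldr in bldrs]

print(camel_case_split_modified(''))               # []
print(camel_case_split_modified('dromedaryCase'))  # ['dromedary', 'Case']
print(camel_case_split_modified('ABCWordDEF'))     # ['ABCWord', 'DEF']
```

The last case still differs from the other solutions, because only lowercase-to-uppercase transitions start a new word.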

Solution 11 - Python

I know that the question added the tag of regex. But still, I always try to stay as far away from regex as possible. So, here is my solution without regex:

def split_camel(text, char):
    if len(text) <= 1: # To avoid adding a wrong space in the beginning
        return text+char
    if char.isupper() and text[-1].islower(): # Regular Camel case
        return text + " " + char
    elif text[-1].isupper() and char.islower() and text[-2] != " ": # Detect Camel case in case of abbreviations
        return text[:-1] + " " + text[-1] + char
    else: # Do nothing part
        return text + char

from functools import reduce  # reduce is not a builtin in Python 3

text = "PathURLFinder"
text = reduce(split_camel, text, "")
print(text)
# prints "Path URL Finder"
print(text.split(" "))
# prints ['Path', 'URL', 'Finder']

EDIT: As suggested, here is the code to put the functionality in a single function.

from functools import reduce

def split_camel(text):
    def splitter(text, char):
        if len(text) <= 1:  # To avoid adding a wrong space in the beginning
            return text + char
        if char.isupper() and text[-1].islower():  # Regular Camel case
            return text + " " + char
        elif text[-1].isupper() and char.islower() and text[-2] != " ":  # Detect Camel case in case of abbreviations
            return text[:-1] + " " + text[-1] + char
        else:  # Do nothing part
            return text + char
    converted_text = reduce(splitter, text, "")
    return converted_text.split(" ")

split_camel("PathURLFinder")
# returns ['Path', 'URL', 'Finder']

Solution 12 - Python

Putting a more comprehensive approach out there. It takes care of several issues like numbers, strings starting with lower case, single-letter words, etc.

import re

def camel_case_split(identifier, remove_single_letter_words=False):
    """Parses CamelCase and Snake naming"""
    concat_words = re.split('[^a-zA-Z]+', identifier)

    def camel_case_split(string):
        bldrs = [[string[0].upper()]]
        string = string[1:]
        for idx, c in enumerate(string):
            if bldrs[-1][-1].islower() and c.isupper():
                bldrs.append([c])
            elif c.isupper() and (idx+1) < len(string) and string[idx+1].islower():
                bldrs.append([c])
            else:
                bldrs[-1].append(c)

        words = [''.join(bldr) for bldr in bldrs]
        words = [word.lower() for word in words]
        return words
    words = []
    for word in concat_words:
        if len(word) > 0:
            words.extend(camel_case_split(word))
    if remove_single_letter_words:
        subset_words = []
        for word in words:
            if len(word) > 1:
                subset_words.append(word)
        if len(subset_words) > 0:
            words = subset_words
    return words

Solution 13 - Python

My requirement was a bit more specific than the OP's. In particular, in addition to handling all of the OP's cases, I needed the following, which the other solutions do not provide:

- treat all non-alphanumeric input (e.g. !@#$%^&*() etc) as a word separator
- handle digits as follows:
  - cannot be in the middle of a word
  - cannot be at the beginning of the word unless the phrase starts with a digit

import re

def splitWords(s):
	new_s = re.sub(r'[^a-zA-Z0-9]', ' ',                  # not alphanumeric
	    re.sub(r'([0-9]+)([^0-9])', '\\1 \\2',            # digit followed by non-digit
	    	re.sub(r'([a-z])([A-Z])','\\1 \\2',           # lower case followed by upper case
	    		re.sub(r'([A-Z])([A-Z][a-z])', '\\1 \\2', # upper case followed by upper case followed by lower case
	    			s
	    		)
	    	)
	    )
	)
	return [x for x in new_s.split(' ') if x]

Output:

for test in ['', ' ', 'lower', 'UPPER', 'Initial', 'dromedaryCase', 'CamelCase', 'ABCWordDEF', 'CamelCaseXYZand123.how23^ar23e you doing AndABC123XYZdf']:
    print(test + ':' + str(splitWords(test)))

:[]
 :[]
lower:['lower']
UPPER:['UPPER']
Initial:['Initial']
dromedaryCase:['dromedary', 'Case']
CamelCase:['Camel', 'Case']
ABCWordDEF:['ABC', 'Word', 'DEF']
CamelCaseXYZand123.how23^ar23e you doing AndABC123XYZdf:['Camel', 'Case', 'XY', 'Zand123', 'how23', 'ar23', 'e', 'you', 'doing', 'And', 'ABC123', 'XY', 'Zdf']

Solution 14 - Python

I think the code below is the optimum:

import re

def count_word():
    return re.findall('[A-Z]?[a-z]+', input('please enter your string'))

print(count_word())
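
For comparison with the other answers, here is a sketch of the same pattern in a non-interactive form. The function name find_words is made up for illustration. Since every match must contain at least one lowercase letter, runs of capitals are dropped from the result.

```python
import re

def find_words(text):
    # Same pattern as above: an optional capital followed by lowercase letters.
    return re.findall('[A-Z]?[a-z]+', text)

print(find_words('CamelCase'))     # ['Camel', 'Case']
print(find_words('CamelCaseXYZ'))  # ['Camel', 'Case']
```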

Content Type	Original Author	Original Content on Stackoverflow
Question	AplusKminus	View Question on Stackoverflow
Solution 1 - Python	200_success	View Answer on Stackoverflow
Solution 2 - Python	Jossef Harush Kadouri	View Answer on Stackoverflow
Solution 3 - Python	Casimir et Hippolyte	View Answer on Stackoverflow
Solution 4 - Python	Setop	View Answer on Stackoverflow
Solution 5 - Python	emyller	View Answer on Stackoverflow
Solution 6 - Python	endusol	View Answer on Stackoverflow
Solution 7 - Python	AplusKminus	View Answer on Stackoverflow
Solution 8 - Python	mnesarco	View Answer on Stackoverflow
Solution 9 - Python	vbfh	View Answer on Stackoverflow
Solution 10 - Python	kalefranz	View Answer on Stackoverflow
Solution 11 - Python	thiruvenkadam	View Answer on Stackoverflow
Solution 12 - Python	datarpit	View Answer on Stackoverflow
Solution 13 - Python	mwag	View Answer on Stackoverflow
Solution 14 - Python	Ahmoody	View Answer on Stackoverflow