How to split long regular expression rules to multiple lines in Python

Python Problem Overview

Is this actually doable? I have some very long regex patterns that are hard to understand because they don't fit on the screen all at once. Example:

test = re.compile('(?P<full_path>.+):\d+:\s+warning:\s+Member\s+(?P<member_name>.+)\s+\((?P<member_type>%s)\) of (class|group|namespace)\s+(?P<class_name>.+)\s+is not documented' % (self.__MEMBER_TYPES), re.IGNORECASE)

Backslash or triple quotes won't work.

EDIT. I ended up using the VERBOSE mode. Here's how the regexp pattern looks now:

test = re.compile('''
  (?P<full_path>                                  # Capture a group called full_path
    .+                                            #   It consists of one or more characters of any type
  )                                               # Group ends                      
  :                                               # A literal colon
  \d+                                             # One or more numbers (line number)
  :                                               # A literal colon
  \s+warning:\s+parameters\sof\smember\s+         # An almost static string
  (?P<member_name>                                # Capture a group called member_name
    [                                             #   
      ^:                                          #   Match anything but a colon (so finding a colon ends group)
    ]+                                            #   Match one or more characters
   )                                              # Group ends
   (                                              # Start an unnamed group 
     ::                                           #   Two literal colons
     (?P<function_name>                           #   Start another group called function_name
       \w+                                        #     It consists of one or more alphanumeric characters
     )                                            #   End group
   )*                                             # This group is entirely optional and does not apply to C
   \s+are\snot\s\(all\)\sdocumented''',           # And line ends with an almost static string
   re.IGNORECASE|re.VERBOSE)                      # Let's not worry about case, because it seems to differ between Doxygen versions

Python Solutions

Solution 1 - Python

You can split your regex pattern by quoting each segment. No backslashes needed.

test = re.compile(('(?P<full_path>.+):\d+:\s+warning:\s+Member'
                   '\s+(?P<member_name>.+)\s+\((?P<member_type>%s)\) '
                   'of (class|group|namespace)\s+(?P<class_name>.+)'
                   '\s+is not documented') % (self.__MEMBER_TYPES), re.IGNORECASE)

You can also use the raw string prefix 'r', but you'll have to put it before each segment.

See the docs.
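
A minimal sketch of the raw-string variant. MEMBER_TYPES is a hypothetical stand-in for self.__MEMBER_TYPES from the question, and the sample warning line is made up for illustration:

```python
import re

# Hypothetical stand-in for self.__MEMBER_TYPES from the question.
MEMBER_TYPES = 'function|variable|typedef'

# Each segment carries its own r prefix. The parentheses let the adjacent
# literals span several lines; % then formats the whole concatenated pattern.
pattern = (r'(?P<full_path>.+):\d+:\s+warning:\s+Member'
           r'\s+(?P<member_name>.+)\s+\((?P<member_type>%s)\) '
           r'of (class|group|namespace)\s+(?P<class_name>.+)'
           r'\s+is not documented') % MEMBER_TYPES
test = re.compile(pattern, re.IGNORECASE)

# Made-up sample line, for illustration only.
line = 'src/foo.h:42: warning: Member bar (function) of class Foo is not documented'
m = test.match(line)
```

With the r prefix on every segment, sequences such as \d and \s reach the regex engine unchanged and Python does not try to interpret them as string escapes.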

Solution 2 - Python

From http://docs.python.org/reference/lexical_analysis.html#string-literal-concatenation:

> Multiple adjacent string literals (delimited by whitespace), possibly using different quoting conventions, are allowed, and their meaning is the same as their concatenation. Thus, "hello" 'world' is equivalent to "helloworld". This feature can be used to reduce the number of backslashes needed, to split long strings conveniently across long lines, or even to add comments to parts of strings, for example:

re.compile("[A-Za-z_]"       # letter or underscore
           "[A-Za-z0-9_]*"   # letter, digit or underscore
          )

> Note that this feature is defined at the syntactical level, but implemented at compile time. The ‘+’ operator must be used to concatenate string expressions at run time. Also note that literal concatenation can use different quoting styles for each component (even mixing raw strings and triple quoted strings).
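
As a small illustration of the last two points in that quote, the sketch below mixes quoting styles in one pattern and then builds the same pattern from variables, where an explicit + is required. The names identifier, head, tail and same are chosen here for the example:

```python
import re

# Adjacent literals are joined by the compiler, even with mixed quoting.
identifier = re.compile("[A-Za-z_]"        # plain double-quoted literal
                        r'[A-Za-z0-9_]*'   # raw single-quoted literal
                        r"""\Z""")         # raw triple-quoted literal

# Names are expressions, not literals, so they need an explicit +.
head = "[A-Za-z_]"
tail = r"[A-Za-z0-9_]*\Z"
same = re.compile(head + tail)
```

Writing head tail without the + would be a syntax error, because implicit concatenation only applies to literals.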

Solution 3 - Python

Just for completeness, the missing answer here is using the re.X or re.VERBOSE flag, which the OP eventually pointed out. Besides saving quotes, this method is also portable to other regex implementations such as Perl.

From https://docs.python.org/2/library/re.html#re.X:

re.X
re.VERBOSE

This flag allows you to write regular expressions that look nicer and are more readable by allowing you to visually separate logical sections of the pattern and add comments. Whitespace within the pattern is ignored, except when in a character class or when preceded by an unescaped backslash. When a line contains a # that is not in a character class and is not preceded by an unescaped backslash, all characters from the leftmost such # through the end of the line are ignored.

This means that the two following regular expression objects that match a decimal number are functionally equal:

a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)

b = re.compile(r"\d+\.\d*")
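
A quick way to convince yourself of that equivalence is to run both objects over the same inputs. This is only a sanity check on a handful of sample strings picked for the example, not a proof:

```python
import re

a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)
b = re.compile(r"\d+\.\d*")

# Sample inputs chosen for illustration.
samples = ["3.14", "42.", ".5", "abc"]
results_a = [bool(a.match(s)) for s in samples]
results_b = [bool(b.match(s)) for s in samples]
```

Both lists come out the same: the first two samples match, while ".5" and "abc" do not because \d+ needs at least one leading digit.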

Solution 4 - Python

Personally, I don't use re.VERBOSE because I don't like to escape blank spaces, and I don't want to put '\s' in place of blank spaces when '\s' isn't required.
The more precisely the symbols in a regex pattern describe the character sequences that must be caught, the faster the regex object works. I almost never use '\s'.

To avoid re.VERBOSE, you can do as has already been said:

test = re.compile(
'(?P<full_path>.+)'
':\d+:\s+warning:\s+Member\s+' # comment
'(?P<member_name>.+)'
'\s+\('
'(?P<member_type>%s)' # comment
'\) of '
'(class|group|namespace)'
#      ^^^^^^ underlining something to point out
'\s+'
'(?P<class_name>.+)'
#      vvv overlining something important too
'\s+is not documented'\
% (self.__MEMBER_TYPES),
 
re.IGNORECASE)

Pushing the strings to the left gives a lot of space to write comments.

But this approach isn't so good when the pattern is very long, because it isn't possible to write

test = re.compile(
'(?P<full_path>.+)'
':\d+:\s+warning:\s+Member\s+' # comment
'(?P<member_name>.+)'
'\s+\('
'(?P<member_type>%s)' % (self.__MEMBER_TYPES)  # !!!!!! INCORRECT SYNTAX !!!!!!!
'\) of '
'(class|group|namespace)'
#      ^^^^^^ underlining something to point out
'\s+'
'(?P<class_name>.+)'
#      vvv overlining something important too
'\s+is not documented',

re.IGNORECASE)

So when the pattern is very long, the number of lines between
the % (self.__MEMBER_TYPES) part at the end
and the string '(?P<member_type>%s)' to which it applies
can be large, and the pattern becomes harder to read.

That's why I like to use a tuple to write a very long pattern:

pat = ''.join((
'(?P<full_path>.+)',
# you can put a comment here, you see: a very very very long comment
':\d+:\s+warning:\s+Member\s+',
'(?P<member_name>.+)',
'\s+\(',
'(?P<member_type>%s)' % (self.__MEMBER_TYPES), # comment here
'\) of ',
# comment here
'(class|group|namespace)',
#       ^^^^^^ underlining something to point out
'\s+',
'(?P<class_name>.+)',
#      vvv overlining something important too
'\s+is not documented'))

This approach also lets you define the pattern as a function:

def pat(x):

    return ''.join((
'(?P<full_path>.+)',
# you can put a comment here, you see: a very very very long comment
':\d+:\s+warning:\s+Member\s+',
'(?P<member_name>.+)',
'\s+\(',
'(?P<member_type>%s)' % x , # comment here
'\) of ',
# comment here
'(class|group|namespace)',
#       ^^^^^^ underlining something to point out
'\s+',
'(?P<class_name>.+)',
#      vvv overlining something important too
'\s+is not documented'))

test = re.compile(pat(self.__MEMBER_TYPES), re.IGNORECASE)
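
Here is a self-contained sketch of the same idea that can be run outside a class. build_pattern and MEMBER_TYPES are hypothetical names standing in for pat and self.__MEMBER_TYPES, the fragments use raw strings, and the sample line is made up:

```python
import re

def build_pattern(member_types):
    """Join the fragments; member_types is an alternation such as 'a|b'."""
    return ''.join((
        r'(?P<full_path>.+)',
        r':\d+:\s+warning:\s+Member\s+',
        r'(?P<member_name>.+)',
        r'\s+\(',
        r'(?P<member_type>%s)' % member_types,  # formatted right where it is used
        r'\) of ',
        r'(class|group|namespace)',
        r'\s+',
        r'(?P<class_name>.+)',
        r'\s+is not documented',
    ))

MEMBER_TYPES = 'function|variable|typedef'  # hypothetical value
test = re.compile(build_pattern(MEMBER_TYPES), re.IGNORECASE)

# Made-up sample line; the different casing exercises re.IGNORECASE.
m = test.match('lib/bar.c:7: WARNING: member baz (variable) '
               'of namespace util is not documented')
```

Because each fragment is a separate tuple element, the % formatting sits on the one fragment that needs it, and the joined result is the same string as the single-line pattern.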

Solution 5 - Python

Either use string concatenation as in naeg's answer, or use re.VERBOSE/re.X. Be careful, though: with this option the regex ignores whitespace and comments. You have some spaces in your regex, so those would be ignored, and you need to either escape them or use \s.

So e.g.

test = re.compile("""(?P<full_path>.+):\d+: # some comment
    \s+warning:\s+Member\s+(?P<member_name>.+) #another comment
    \s+\((?P<member_type>%s)\)\ of\ (class|group|namespace)\s+
    (?P<class_name>.+)\s+is\ not\ documented""" % (self.__MEMBER_TYPES), re.IGNORECASE | re.X)
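
The whitespace pitfall is easy to reproduce in isolation. The sketch below uses a tiny made-up pattern and shows three ways to keep a separator significant under re.X; the variable names are chosen for the example:

```python
import re

# In verbose mode the unescaped space is dropped, so this matches "isnot".
loose = re.compile(r"is not", re.X)

# Three ways to require the separator: escape it, use a class, or use \s.
escaped = re.compile(r"is\ not", re.X)
in_class = re.compile(r"is[ ]not", re.X)
any_space = re.compile(r"is\s not", re.X)
```

The escaped and character-class forms accept only a literal space, while \s also accepts tabs and other whitespace, so pick the one that matches how strict you want to be.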

Solution 6 - Python

The Python compiler will automatically concatenate adjacent string literals. So one way you can do this is to break up your regular expression into multiple strings, one on each line, and let the Python compiler recombine them. It doesn't matter what whitespace you have between the strings, so you can have line breaks and even leading spaces to align the fragments meaningfully.

Content Type	Original Author	Original Content on Stackoverflow
Question	Makis	View Question on Stackoverflow
Solution 1 - Python	naeg	View Answer on Stackoverflow
Solution 2 - Python	N3dst4	View Answer on Stackoverflow
Solution 3 - Python	Thomas Guyot-Sionnest	View Answer on Stackoverflow
Solution 4 - Python	eyquem	View Answer on Stackoverflow
Solution 5 - Python	stema	View Answer on Stackoverflow
Solution 6 - Python	Ben	View Answer on Stackoverflow