How can I do multiple substitutions using regex?

Python, Regex, String

Python Problem Overview


I can use the code below to create a new file in which a is replaced with aa using regular expressions.

import re

with open("notes.txt") as text:
    new_text = re.sub("a", "aa", text.read())
    with open("notes2.txt", "w") as result:
        result.write(new_text)

I was wondering: do I have to use this line, new_text = re.sub("a", "aa", text.read()), multiple times, swapping in each of the other letters I want to change, in order to change more than one letter in my text?

That is, a --> aa, b --> bb and c --> cc.

So do I have to write that line for every letter I want to change, or is there an easier way, perhaps by creating a "dictionary" of translations? Should I put those letters into an array? I'm not sure how to look them up if I do.

Python Solutions


Solution 1 - Python

The answer proposed by @nhahtdh is valid, but I would argue it is less Pythonic than the canonical example, which uses code less opaque than his regex manipulations and takes advantage of Python's built-in data structures and anonymous functions.

A dictionary of translations makes sense in this context. In fact, that's how the Python Cookbook does it, as shown in this example (adapted from ActiveState http://code.activestate.com/recipes/81330-single-pass-multiple-replace/ )

import re

def multiple_replace(replacements, text):
  # Create a regular expression from the (escaped) dictionary keys
  regex = re.compile("(%s)" % "|".join(map(re.escape, replacements.keys())))

  # For each match, look up the corresponding value in the dictionary
  return regex.sub(lambda mo: replacements[mo.group(0)], text)

if __name__ == "__main__":

  text = "Larry Wall is the creator of Perl"

  replacements = {
    "Larry Wall" : "Guido van Rossum",
    "creator" : "Benevolent Dictator for Life",
    "Perl" : "Python",
  }

  print(multiple_replace(replacements, text))

So in your case, you could make a dict trans = {"a": "aa", "b": "bb"} and pass it to multiple_replace along with the text you want translated. The function builds one regex that is an alternation of all the (escaped) dictionary keys; whenever one of them matches, the lambda passed to regex.sub looks up the matched text in the dictionary and returns its replacement.
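For the letters in the question, a minimal sketch might look like the following. The name trans and the sample string are illustrative assumptions, and multiple_replace is repeated only so the snippet runs on its own. Because the text is scanned in a single pass, the inserted "aa" is not matched again.

```python
import re

def multiple_replace(replacements, text):
    # One alternation of the escaped keys, e.g. "(a|b|c)"
    regex = re.compile("(%s)" % "|".join(map(re.escape, replacements)))
    # Look up each matched key in the dictionary
    return regex.sub(lambda mo: replacements[mo.group(0)], text)

# Hypothetical translation table for the question's a/b/c example
trans = {"a": "aa", "b": "bb", "c": "cc"}
doubled = multiple_replace(trans, "abc cab")
print(doubled)
```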

You could use this function while reading from your file, for example:

# The translations from the question
replacements = {"a": "aa", "b": "bb", "c": "cc"}

with open("notes.txt") as text:
    new_text = multiple_replace(replacements, text.read())
with open("notes2.txt", "w") as result:
    result.write(new_text)

I've actually used this exact method in production, in a case where I needed to translate the months of the year from Czech into English for a web scraping task.

As @nhahtdh pointed out, one downside to this approach is that it assumes the keys are prefix-free: if one dictionary key is a prefix of another, the alternation matches whichever of the two appears first in the pattern, so the longer key may never be replaced.
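One common workaround, which is not part of the original recipe, is to sort the keys so that longer ones come first in the alternation. A minimal sketch, where multiple_replace_longest_first and the months table are hypothetical names chosen for illustration:

```python
import re

def multiple_replace_longest_first(replacements, text):
    # Put longer keys first so a key is tried before any of its prefixes
    keys = sorted(replacements, key=len, reverse=True)
    regex = re.compile("|".join(map(re.escape, keys)))
    return regex.sub(lambda mo: replacements[mo.group(0)], text)

# Hypothetical table in which "Jan" is a prefix of "January"
months = {"Jan": "01", "January": "1"}
result = multiple_replace_longest_first(months, "Jan and January")
print(result)
```

With the keys in their original order, "January" would instead be matched as "Jan" followed by the untouched "uary".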

Solution 2 - Python

You can use a capturing group and a backreference:

re.sub(r"([characters])", r"\1\1", text.read())

Put characters that you want to double up in between []. For the case of lower case a, b, c:

re.sub(r"([abc])", r"\1\1", text.read())

In the replacement string, you can refer to whatever was matched by a capturing group () with the \n notation, where n is a positive integer (0 excluded). \1 refers to the first capturing group. There is another notation, \g<n>, where n can be any non-negative integer (0 allowed); \g<0> refers to the whole text matched by the expression.
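The two notations can be compared side by side in a small sketch; the sample string and variable names are assumptions made for illustration:

```python
import re

text = "abcabc"

# \1 repeats the first capturing group
with_group = re.sub(r"([abc])", r"\1\1", text)

# \g<0> repeats the whole match, so no capturing group is needed
whole_match = re.sub(r"[abc]", r"\g<0>\g<0>", text)

# \g<1>0 means "group 1, then a literal 0"; \10 would mean group 10
group_then_zero = re.sub(r"(a)", r"\g<1>0", "a")

print(with_group, whole_match, group_then_zero)
```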


If you want to double up all characters except new line:

re.sub(r"(.)", r"\1\1", text.read())

If you want to double up all characters (new line included):

re.sub(r"(.)", r"\1\1", text.read(), flags=re.S)
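The effect of re.S (also spelled re.DOTALL) can be seen on a short string containing a newline. This is a small sketch with a made-up sample string:

```python
import re

sample = "ab\ncd"

# By default "." does not match "\n", so the newline is left alone
without_flag = re.sub(r"(.)", r"\1\1", sample)

# With re.S, "." also matches "\n", so the newline is doubled too
with_flag = re.sub(r"(.)", r"\1\1", sample, flags=re.S)

print(repr(without_flag), repr(with_flag))
```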

Solution 3 - Python

You can use the pandas library and its replace function. Here is an example with five replacements:

import pandas as pd

df = pd.DataFrame({'text': ['Billy is going to visit Rome in November', 'I was born in 10/10/2010', 'I will be there at 20:00']})

# Raw strings keep the backslashes in the regex patterns intact
to_replace = ['Billy', 'Rome', 'January|February|March|April|May|June|July|August|September|October|November|December', r'\d{2}:\d{2}', r'\d{2}/\d{2}/\d{4}']
replace_with = ['name', 'city', 'month', 'time', 'date']

print(df.text.replace(to_replace, replace_with, regex=True))

And the modified text is:

0    name is going to visit city in month
1                      I was born in date
2                 I will be there at time


Solution 4 - Python

Using tips from how to make a 'stringy' class, we can make an object identical to a string but for an extra sub method:

import re

class Substitutable(str):
  def sub(self, fro, to):
    # Return a Substitutable again so that calls can be chained
    return Substitutable(re.sub(fro, to, self))

This allows you to chain calls in a builder-like style, which looks nicer, but it only suits a pre-determined number of substitutions. If you apply the substitutions in a loop, there is no point in creating an extra class. For example:

>>> h = Substitutable('horse')
>>> h
'horse'
>>> h.sub('h', 'f')
'forse'
>>> h.sub('h', 'f').sub('f','h')
'horse'
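When the list of substitutions is only known at run time, a plain loop over (pattern, replacement) pairs does the same job without the extra class. A minimal sketch, where sub_all is a hypothetical helper name; note that, unlike the single-pass approaches, each substitution sees the output of the previous one.

```python
import re

def sub_all(pairs, text):
    # Apply each (pattern, replacement) pair in order
    for pattern, repl in pairs:
        text = re.sub(pattern, repl, text)
    return text

# Same two substitutions as the chained example
result = sub_all([("h", "f"), ("f", "h")], "horse")
print(result)
```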

Solution 5 - Python

The dictionary-based solutions above escape their keys with re.escape, so they do not work if your patterns are themselves regexes.

For that, you need:

import re

def multi_sub(pairs, s):
    def repl_func(m):
        # only one group will be present, use the corresponding match
        return next(
            repl
            for (patt, repl), group in zip(pairs, m.groups())
            if group is not None
        )
    pattern = '|'.join("({})".format(patt) for patt, _ in pairs)
    return re.sub(pattern, repl_func, s)

Which can be used as:

>>> multi_sub([
...     ('a+b', 'Ab'),
...     ('b', 'B'),
...     ('a+', 'A.'),
... ], "aabbaa")  # matches as (aab)(b)(aa)
'AbBA.'

Note that this solution does not allow you to put capturing groups in your regexes, or use them in replacements.
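If capturing groups inside the patterns are needed, one possible variant (a sketch, not the answer's method) records which group number wraps each pattern and relies on the match's lastindex attribute, which reports the outer wrapper because it is the last group to close. The name multi_sub_grouped is hypothetical, and numbered backreferences inside the patterns or replacements are still not supported.

```python
import re

def multi_sub_grouped(pairs, s):
    index_to_repl = {}  # wrapper group number -> replacement string
    parts = []
    next_index = 1
    for patt, repl in pairs:
        index_to_repl[next_index] = repl
        parts.append("({})".format(patt))
        # Skip the wrapper group plus any groups inside the pattern
        next_index += 1 + re.compile(patt).groups
    pattern = re.compile("|".join(parts))
    # lastindex is the wrapper group of the alternative that matched
    return pattern.sub(lambda m: index_to_repl[m.lastindex], s)

result = multi_sub_grouped([
    ("(a)+b", "Ab"),  # contains an inner capturing group
    ("b", "B"),
    ("a+", "A."),
], "aabbaa")
print(result)
```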

Solution 6 - Python

I found I had to modify Emmett J. Butler's code by changing the lambda function to use myDict.get(mo.group(1),mo.group(1)). The original code wasn't working for me; using myDict.get() also provides the benefit of a default value if a key is not found.

import re

OIDNameContraction = {
    'Fucntion': 'Func',
    'operated': 'Operated',
    'Asist': 'Assist',
    'Detection': 'Det',
    'Control': 'Ctrl',
    'Function': 'Func'
}

replacementDictRegex = re.compile("(%s)" % "|".join(map(re.escape, OIDNameContraction.keys())))

# oidDescriptionStr holds the text to process; unmatched lookups fall back to the matched text
oidDescriptionStr = replacementDictRegex.sub(lambda mo: OIDNameContraction.get(mo.group(1), mo.group(1)), oidDescriptionStr)
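The answer does not say why the plain lookup failed. One situation where the default matters, offered here only as an assumption, is a pattern compiled with re.IGNORECASE: the matched text can then differ in case from every key. A minimal sketch with a hypothetical contractions table:

```python
import re

# Hypothetical, shortened version of the table above
contractions = {"Function": "Func", "Control": "Ctrl"}
regex = re.compile(
    "(%s)" % "|".join(map(re.escape, contractions)), re.IGNORECASE
)

def contract(text):
    # Fall back to the matched text when its exact spelling is not a key
    return regex.sub(
        lambda mo: contractions.get(mo.group(1), mo.group(1)), text
    )

result = contract("Control FUNCTION")
print(result)
```

Here "Control" is replaced, while "FUNCTION" matches the pattern but is not a key, so it is left unchanged instead of raising a KeyError.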

Solution 7 - Python

If you are dealing with files, here is a simple Python script for this problem.

import re


def multiple_replace(dictionary, text):
    # Create a regular expression from the dictionary keys
    regex = re.compile("(%s)" % "|".join(map(re.escape, dictionary.keys())))

    # For each match, look up the corresponding value in the dictionary
    return regex.sub(lambda mo: dictionary[mo.group(0)], text)


if __name__ == "__main__":

    dictionary = {
        "Wiley Online Library": "Wiley",
        "Chemical Society Reviews": "Chem. Soc. Rev.",
    }

    with open('LightBib.bib', 'r') as bib_read:
        with open('Abbreviated.bib', 'w') as bib_write:
            # Process the input file line by line
            for line in bib_read:
                new_text = multiple_replace(dictionary, line)
                bib_write.write(new_text)

Attributions

All content for this solution is sourced from the original question on Stack Overflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Euridice01 | View Question on Stackoverflow
Solution 1 - Python | Emmett Butler | View Answer on Stackoverflow
Solution 2 - Python | nhahtdh | View Answer on Stackoverflow
Solution 3 - Python | George Pipis | View Answer on Stackoverflow
Solution 4 - Python | Leo | View Answer on Stackoverflow
Solution 5 - Python | Eric | View Answer on Stackoverflow
Solution 6 - Python | Jordan McBain | View Answer on Stackoverflow
Solution 7 - Python | Hamid Zaree | View Answer on Stackoverflow