Remove a Substring Using Python

Python, Regex, String

Python Problem Overview


I have already extracted some information from a forum. This is the raw string I have now:

string = 'i think mabe 124 + <font color="black"><font face="Times New Roman">but I don\'t have a big experience it just how I see it in my eyes <font color="green"><font face="Arial">fun stuff'

The parts I do not like are the substrings "<font color="black"><font face="Times New Roman">" and "<font color="green"><font face="Arial">". I want to keep the rest of the string and drop only these, so the result should look like this:

resultString = "i think mabe 124 + but I don't have a big experience it just how I see it in my eyes fun stuff"

How could I do this? I used Beautiful Soup to extract the string above from a forum, but now I would prefer to use a regular expression to remove these parts.

Python Solutions


Solution 1 - Python

>>> import re
>>> re.sub('<.*?>', '', string)
"i think mabe 124 + but I don't have a big experience it just how I see it in my eyes fun stuff"

The re.sub function takes a regular expression and replaces every match in the string with the second parameter, returning the result as a new string. In this case, we are searching for all tags ('<.*?>') and replacing them with nothing ('').
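As a minimal sketch of the same idea, the call can be wrapped in a small helper. The names strip_tags and TAG_PATTERN are illustrative and not part of the original answer; the sample string is the one from the question.

```python
import re

# Hypothetical names for illustration; the pattern is the one from the answer.
TAG_PATTERN = re.compile(r'<.*?>')

def strip_tags(text):
    """Return text with every <...> span replaced by an empty string."""
    return TAG_PATTERN.sub('', text)

string = 'i think mabe 124 + <font color="black"><font face="Times New Roman">but I don\'t have a big experience it just how I see it in my eyes <font color="green"><font face="Arial">fun stuff'

result = strip_tags(string)
print(result)
```

Note that '.' does not match a newline by default, so a tag that is split across lines would be left in place unless the pattern is compiled with re.DOTALL.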

The ? after the * makes the match non-greedy: each match stops at the first > it reaches instead of running on to the last > in the string.
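The difference is easiest to see by running both patterns on the same input. This is a small sketch with a made-up sample string, not one taken from the question.

```python
import re

# Made-up sample with text between an opening and a closing tag.
sample = 'a <b>bold</b> word'

# Greedy: .* grabs as much as possible, so a single match runs from the
# first '<' to the last '>' and the text between the tags is lost.
greedy = re.sub(r'<.*>', '', sample)

# Non-greedy: .*? stops at the first '>' it can, so each tag is matched
# separately and the text between them survives.
non_greedy = re.sub(r'<.*?>', '', sample)

print(repr(greedy))      # 'a  word'
print(repr(non_greedy))  # 'a bold word'
```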

See the Python documentation for more about the re module.

Solution 2 - Python

>>> import re
>>> st = " i think mabe 124 + <font color=\"black\"><font face=\"Times New Roman\">but I don't have a big experience it just how I see it in my eyes <font color=\"green\"><font face=\"Arial\">fun stuff"
>>> re.sub("<.*?>","",st)
" i think mabe 124 + but I don't have a big experience it just how I see it in my eyes fun stuff"

Solution 3 - Python

from bs4 import BeautifulSoup
BeautifulSoup(text, features="html.parser").text

Sorry to anyone who was looking for more detail in my answer.

I'll explain it.

Beautiful Soup is a widely used Python package that helps the user (developer) interact with HTML from within Python.

The line above just takes all the HTML text (text) and turns it into a BeautifulSoup object. That means that, behind the scenes, it parses everything (every HTML tag within the given text).

Once that is done, we just request all the text from within the parsed HTML object.
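To illustrate the "parse everything, then ask for the text" idea without installing anything, here is a rough sketch that uses the standard library's html.parser module, the same parser that features="html.parser" selects. It is not Beautiful Soup itself, and TextExtractor and extract_text are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the text nodes seen while parsing (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; the tags themselves never arrive here.
        self.parts.append(data)

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)

text = 'i think mabe 124 + <font color="black"><font face="Times New Roman">but I don\'t have a big experience it just how I see it in my eyes <font color="green"><font face="Arial">fun stuff'

print(extract_text(text))
```

Unlike the regular expression, a parser works from the tag structure rather than from raw angle brackets, which is the main reason to prefer it when the input is real HTML.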

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Wenhao.SHE | View Question on Stackoverflow
Solution 1 - Python | juliomalegria | View Answer on Stackoverflow
Solution 2 - Python | Abhijit | View Answer on Stackoverflow
Solution 3 - Python | Benny Elgazar | View Answer on Stackoverflow