How to use regex to parse a number from HTML?

Python, Regex

Python Problem Overview


I want to write a simple regular expression in Python that extracts a number from HTML. The HTML sample is as follows:

Your number is <b>123</b>

Now, how can I extract "123", i.e. the contents of the first bold text after the string "Your number is"?

Python Solutions


Solution 1 - Python

import re
# Raw string so that \d reaches the regex engine unchanged
m = re.search(r"Your number is <b>(\d+)</b>",
              "xxx Your number is <b>123</b>  fdjsk")
if m:
    print(m.group(1))  # the digits captured by (\d+)


Solution 2 - Python

Given s = "Your number is <b>123</b>" then:

 import re 
 m = re.search(r"\d+", s)

will work and give you

 m.group()
'123'

The regular expression looks for 1 or more consecutive digits in your string.

Note that in this specific case we knew that there would be a numeric sequence. Otherwise you would have to test the return value of re.search() to make sure that m is not None, because calling m.group() on None raises an AttributeError.
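A minimal sketch of that check, assuming a hypothetical helper named extract_number (the name is only for illustration and is not part of the original answer):

```python
import re

def extract_number(html):
    """Return the first run of digits in html as a string, or None if there is none."""
    m = re.search(r"\d+", html)
    # re.search returns None when nothing matches, so test m before calling group()
    return m.group() if m else None

print(extract_number("Your number is <b>123</b>"))  # 123
print(extract_number("No digits here"))             # None
```

Returning None instead of raising leaves it to the caller to decide how to handle input without a number.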

Of course if you are going to process a lot of HTML you want to take a serious look at BeautifulSoup - it's meant for that and much more. The whole idea with BeautifulSoup is to avoid "manual" parsing using string ops or regular expressions.
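BeautifulSoup is a third-party package, so as a rough illustration of the same idea (let a parser deal with the tags instead of a regular expression), here is a sketch that swaps in the standard library's html.parser module. The class name FirstBoldText is an assumption made for this example, and the sketch only handles simple, non-nested bold tags:

```python
from html.parser import HTMLParser

class FirstBoldText(HTMLParser):
    """Remembers the text of the first <b> element it sees."""

    def __init__(self):
        super().__init__()
        self.in_bold = False
        self.text = None

    def handle_starttag(self, tag, attrs):
        # Only start collecting if no bold text has been found yet
        if tag == "b" and self.text is None:
            self.in_bold = True

    def handle_endtag(self, tag):
        if tag == "b":
            self.in_bold = False

    def handle_data(self, data):
        if self.in_bold and self.text is None:
            self.text = data

parser = FirstBoldText()
parser.feed("Your number is <b>123</b>")
print(parser.text)  # 123
```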

Solution 3 - Python

import re
x = 'Your number is <b>123</b>'
# group(1) is the captured number
re.search(r'(?<=Your number is )<b>(\d+)</b>', x).group(1)

This searches for the number that follows the 'Your number is' string.
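With a lookbehind pattern like this one, it matters which group you ask for. A short sketch of the difference, using the same sample string:

```python
import re

x = 'Your number is <b>123</b>'
# The lookbehind (?<=...) must be satisfied but is not part of the match itself
m = re.search(r'(?<=Your number is )<b>(\d+)</b>', x)

print(m.group(0))  # <b>123</b>  (the whole match, tags included)
print(m.group(1))  # 123         (only what (\d+) captured)
```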

Solution 4 - Python

import re
print(re.search(r'(\d+)', 'Your number is <b>123</b>').group(0))

Solution 5 - Python

The simplest way is to just extract the digits:

re.search(r"\d+",text)

Solution 6 - Python

import re
val = "Your number is <b>123</b>"

Option 1:

m = re.search(r'(<.*?>)(\d+)(<.*?>)', val)
m.group(2)

Option 2:

re.sub(r'([\s\S]+)(<.*?>)(\d+)(<.*?>)', r'\3', val)

Solution 7 - Python

import re
# The pattern is case-sensitive, so it must start with a capital "Your"
found = re.search(r"Your number is <b>(\d+)</b>", "something.... Your number is <b>123</b> something...")

if found:
    print(found.group(1))

Here (\d+) is the capturing group. Since there is only one group, group(1) returns it. When there are several groups, pass the index of the group you want.
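To illustrate how group indices work when a pattern has more than one capturing group, here is a small sketch. The two-number sample string is made up for this example and is not from the question:

```python
import re

m = re.search(r"<b>(\d+)</b> and <b>(\d+)</b>",
              "Numbers: <b>12</b> and <b>34</b>")

print(m.group(0))  # <b>12</b> and <b>34</b>  (the whole match)
print(m.group(1))  # 12  (first capturing group)
print(m.group(2))  # 34  (second capturing group)
print(m.groups())  # ('12', '34')  (all capturing groups as a tuple)
```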

Solution 8 - Python

To extract the matches as a Python list, you can use findall:

>>> import re
>>> string = 'Your number is <b>123</b>'
>>> pattern = r'\d+'
>>> re.findall(pattern, string)
['123']
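findall becomes more useful when the text holds several numbers. A sketch with a made-up sample string (an assumption for illustration) shows that a bare \d+ picks up every number, while a pattern with a capturing group returns only the digits inside bold tags:

```python
import re

html = 'Order 7: your numbers are <b>123</b> and <b>456</b>'

# Every run of digits anywhere in the string
all_digits = re.findall(r'\d+', html)
# With one capturing group, findall returns just the captured part of each match
bold_only = re.findall(r'<b>(\d+)</b>', html)

print(all_digits)  # ['7', '123', '456']
print(bold_only)   # ['123', '456']
```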

Solution 9 - Python

You can use the following example to solve your problem:

import re

text = 'Your number is <b>123</b>'
match = re.search(r"\d+", text)  # match object for the first run of digits

print("Matched number:", match.group(0))
print("Starting index of digit:", match.start())
print("Ending index of digit:", match.end())

Solution 10 - Python

import re
x = 'Your number is <b>123</b>'
output = re.search(r'(?<=Your number is )<b>(\d+)</b>', x).group(1)
print(output)

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Saqib | View Question on Stackoverflow
Solution 1 - Python | Yevgen Yampolskiy | View Answer on Stackoverflow
Solution 2 - Python | Levon | View Answer on Stackoverflow
Solution 3 - Python | muffel | View Answer on Stackoverflow
Solution 4 - Python | Jacob Abraham | View Answer on Stackoverflow
Solution 5 - Python | Avinash Kumar | View Answer on Stackoverflow
Solution 6 - Python | user4613285 | View Answer on Stackoverflow
Solution 7 - Python | Sykam Sreekar Reddy | View Answer on Stackoverflow
Solution 8 - Python | Arun | View Answer on Stackoverflow
Solution 9 - Python | sadiq shah | View Answer on Stackoverflow
Solution 10 - Python | Anand K | View Answer on Stackoverflow