Beautiful Soup and extracting a div and its contents by ID

Python, Beautifulsoup

Python Problem Overview


soup.find("tagName", { "id" : "articlebody" })

Why does this NOT return the <div id="articlebody"> ... </div> tags and everything in between? It returns nothing, and I know for a fact the div exists because I'm staring right at it in the output of

soup.prettify()

soup.find("div", { "id" : "articlebody" }) also does not work.

(EDIT: I found that BeautifulSoup wasn't correctly parsing my page, which probably meant the page I was trying to parse isn't properly formatted in SGML or whatever)

Python Solutions


Solution 1 - Python

You should post your example document, because the code works fine:

>>> import BeautifulSoup
>>> soup = BeautifulSoup.BeautifulSoup('<html><body><div id="articlebody"> ... </div></body></html>')
>>> soup.find("div", {"id": "articlebody"})
<div id="articlebody"> ... </div>

Finding <div>s inside <div>s works as well:

>>> soup = BeautifulSoup.BeautifulSoup('<html><body><div><div id="articlebody"> ... </div></div></body></html>')
>>> soup.find("div", {"id": "articlebody"})
<div id="articlebody"> ... </div>

Solution 2 - Python

To find an element by its id:

div = soup.find(id="articlebody")
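To show what that call gives back and how to get at the contents, here is a minimal sketch. It assumes Beautiful Soup 4 (`bs4`) with the built-in `html.parser`, and the sample markup is made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; a real page would come from a file or an HTTP response.
html = '<html><body><div><div id="articlebody"><p>Hello</p> world</div></div></body></html>'
soup = BeautifulSoup(html, "html.parser")

div = soup.find(id="articlebody")       # a Tag, or None when nothing matches

if div is not None:
    outer_html = str(div)               # the <div> tag together with its contents
    inner_html = div.decode_contents()  # only the markup between the tags
    text = div.get_text()               # text with all tags stripped
    print(outer_html)
    print(inner_html)
    print(text)

missing = soup.find(id="no-such-id")    # no such element, so this is None
```

Checking for None before using the result avoids an AttributeError when the parser did not produce the element you expected.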

Solution 3 - Python

Beautiful Soup 4 supports most CSS selectors with the .select() method, therefore you can use an id selector such as:

soup.select('#articlebody')

If you need to specify the element's type, you can add a type selector before the id selector:

soup.select('div#articlebody')

The .select() method will return a collection of elements, which means that it would return the same results as the following .find_all() method example:

soup.find_all('div', id="articlebody")
# or
soup.find_all(id="articlebody")

If you only want to select a single element, then you could just use the .find() method:

soup.find('div', id="articlebody")
# or
soup.find(id="articlebody")
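The calls above can be compared side by side. This is only a sketch, assuming Beautiful Soup 4 with `html.parser` and a made-up two-div document:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = '<div id="sidebar">nav</div><div id="articlebody">story</div>'
soup = BeautifulSoup(html, "html.parser")

selected = soup.select("div#articlebody")            # collection of matches (CSS selector)
found_all = soup.find_all("div", id="articlebody")   # collection of matches (find_all)
first = soup.find("div", id="articlebody")           # single Tag or None
first_css = soup.select_one("#articlebody")          # CSS counterpart of find()

print(list(selected) == list(found_all))  # do both collections hold the same elements?
print(first == first_css)                 # do both single-element lookups agree?
```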

Solution 4 - Python

I think there is a problem when the 'div' tags are nested too deeply. I am trying to parse some contacts from a Facebook HTML file, and BeautifulSoup is not able to find "div" tags with class "fcontent".

This happens with other classes as well. When I search for divs in general, it returns only those that are not so deeply nested.

The HTML source can be any Facebook page showing the friends list of one of your friends (not your own friends list). If someone can test it and give some advice, I would really appreciate it.

This is my code, where I just try to print the number of "div" tags with class "fcontent":

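from bs4 import BeautifulSoup

# Local copy of the saved friends-list page.
with open('/Users/myUserName/Desktop/contacts.html') as f:
    soup = BeautifulSoup(f, 'html.parser')

# Collect every <div> with class "fcontent" and print how many were found.
fcontent_divs = soup.find_all('div', attrs={'class': 'fcontent'})
print(len(fcontent_divs))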

Solution 5 - Python

Most probably the default BeautifulSoup parser is having trouble with the page. Switch to a different parser, such as 'lxml', and try again.
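One way to check whether the parser is the culprit is to run the same lookup under each parser you have installed. The sketch below assumes Beautiful Soup 4; the markup is a made-up sloppy fragment, and `lxml` and `html5lib` are optional packages that are simply skipped when missing:

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical, deliberately sloppy markup (the <p> is never closed).
html = '<html><body><p>intro<div id="articlebody"> ... </div></body></html>'

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        soup = BeautifulSoup(html, parser)
    except FeatureNotFound:
        # Raised by bs4 when the requested parser library is not installed.
        results[parser] = "not installed"
        continue
    # Record whether this parser's tree contains the target div.
    results[parser] = soup.find("div", id="articlebody") is not None

for parser, outcome in results.items():
    print(parser, "->", outcome)
```

If the outcomes differ on your real page, the parsers are building different trees from the broken markup, and picking the one that finds the element is a reasonable fix.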

Solution 6 - Python

In the BeautifulSoup source, this line allows divs to be nested within divs, so the concern from lukas' comment shouldn't apply.

NESTABLE_BLOCK_TAGS = ['blockquote', 'div', 'fieldset', 'ins', 'del']

What I think you need to do is to specify the attrs you want such as

source.find('div', attrs={'id':'articlebody'})

Solution 7 - Python

Have you tried soup.findAll("div", {"id": "articlebody"})?

Sounds crazy, but if you're scraping stuff from the wild, you can't rule out multiple divs with the same id...
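A quick sketch of that situation, assuming Beautiful Soup 4 with `html.parser` and a made-up page that (invalidly) reuses the id:

```python
from bs4 import BeautifulSoup

# Hypothetical page that reuses the same id twice.
html = '<div id="articlebody">first</div><div id="articlebody">second</div>'
soup = BeautifulSoup(html, "html.parser")

matches = soup.find_all("div", {"id": "articlebody"})
print(len(matches))                      # number of divs sharing the id
print([m.get_text() for m in matches])   # text of each match, in document order

# find() stops at the first match, so the second div is silently ignored.
print(soup.find("div", {"id": "articlebody"}).get_text())
```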

Solution 8 - Python

I used:

soup.findAll('tag', attrs={'attrname':"attrvalue"})

That is the syntax I use for find/findAll. That said, unless there are other optional parameters between the tag name and the attribute dict, this shouldn't behave any differently from the call in the question.
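In Beautiful Soup 4 the attribute dict is the second positional parameter of find_all(), so passing it positionally or as attrs= should indeed be equivalent. A small sketch, assuming `bs4` with `html.parser` and made-up markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = '<div id="articlebody">story</div><div id="footer">end</div>'
soup = BeautifulSoup(html, "html.parser")

positional = soup.find_all("div", {"id": "articlebody"})      # dict as second positional argument
keyword = soup.find_all("div", attrs={"id": "articlebody"})   # same dict, passed by name
shorthand = soup.find_all("div", id="articlebody")            # attribute as a keyword argument

print(positional == keyword == shorthand)  # do all three spellings select the same element?
```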

Solution 9 - Python

Here is a code fragment:

from bs4 import BeautifulSoup
with open("index.html") as f:  # parse the local HTML file
    soup = BeautifulSoup(f, "html.parser")
titleList = soup.find_all('title')
divList = soup.find_all('div', attrs={"class": "article story"})

As you can see, I find all <title> tags and then I find all <div> tags with class="article story".
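Matching on a class value that contains a space has a subtlety worth seeing in action. The sketch below assumes Beautiful Soup 4 with `html.parser`; the three divs are made up to show how the exact string, a single class, and a CSS selector differ:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: same classes in different orders, plus a single-class div.
html = (
    '<div class="article story">a</div>'
    '<div class="story article">b</div>'
    '<div class="article">c</div>'
)
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all("div", attrs={"class": "article story"})  # exact attribute string only
single = soup.find_all("div", class_="article")                  # any div carrying "article"
both = soup.select("div.article.story")                          # both classes, in any order

print([d.get_text() for d in exact])
print([d.get_text() for d in single])
print([d.get_text() for d in both])
```

If the class order on the real page is not guaranteed, the CSS selector form is the more forgiving choice.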

Solution 10 - Python

Happened to me also while trying to scrape Google.
I ended up using pyquery.
Install:

pip install pyquery

Use:

from pyquery import PyQuery    
pq = PyQuery('<html><body><div id="articlebody"> ... </div></body></html>')
tag = pq('div#articlebody')

Solution 11 - Python

An id is supposed to be unique within a document, so you can search for it directly without even specifying the element name. If your elements have ids, that makes the content much easier to parse.

divEle = soup.find(id = "articlebody")

Solution 12 - Python

from bs4 import BeautifulSoup
from requests_html import HTMLSession

url = 'your_url'
session = HTMLSession()
resp = session.get(url)

# Render JavaScript only if the element with id "articlebody" is generated
# dynamically; otherwise this step can be skipped.
resp.html.render()

soup = BeautifulSoup(resp.html.html, "lxml")
soup.find("div", {"id": "articlebody"})

Solution 13 - Python

soup.find("tagName",attrs={ "id" : "articlebody" })

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Tony Stark | View Question on Stackoverflow
Solution 1 - Python | Lukáš Lalinský | View Answer on Stackoverflow
Solution 2 - Python | jfs | View Answer on Stackoverflow
Solution 3 - Python | Josh Crozier | View Answer on Stackoverflow
Solution 4 - Python | omar | View Answer on Stackoverflow
Solution 5 - Python | liang | View Answer on Stackoverflow
Solution 6 - Python | dagoof | View Answer on Stackoverflow
Solution 7 - Python | user106514 | View Answer on Stackoverflow
Solution 8 - Python | user257111 | View Answer on Stackoverflow
Solution 9 - Python | Recursion | View Answer on Stackoverflow
Solution 10 - Python | Shoham | View Answer on Stackoverflow
Solution 11 - Python | Iqra. | View Answer on Stackoverflow
Solution 12 - Python | bot8080 | View Answer on Stackoverflow
Solution 13 - Python | Shri narayan | View Answer on Stackoverflow