
Attempting A Nested Scrape Using Beautifulsoup

My code is as follows:

Hello

Solution 1:

If you were having problems with nextSibling, it's because your HTML actually looks like this:

<h1><a name="hello">Hello</a></h1>\n #<---newline
<div class="colmask">

See the newline after the </h1>? Even though a newline is invisible, it is still text, so it becomes a BeautifulSoup element (a NavigableString), and that element is the nextSibling of the <h1> tag.
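A minimal sketch of that point, assuming Beautiful Soup 4 (the bs4 package) on Python 3 rather than the BeautifulSoup 3 import used further down. In bs4 the attribute is spelled next_sibling, and the closing </div> is added here only to make the snippet complete:

```python
from bs4 import BeautifulSoup, NavigableString

# The two lines from the answer, with a closing </div> added for completeness.
html = '<h1><a name="hello">Hello</a></h1>\n<div class="colmask"></div>'
soup = BeautifulSoup(html, "html.parser")

h1 = soup.h1
sibling = h1.next_sibling  # bs4 spelling of nextSibling

print(repr(sibling))                         # the newline, not the <div>
print(isinstance(sibling, NavigableString))  # it is a text node
print(sibling.next_sibling.name)             # one more hop reaches the <div>
```

The first hop lands on the text node, which is why a single nextSibling does not return the <div>.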

Newlines can also present problems when trying to get, say, the third child of the following <div>:

<div>
  <div>hello</div>
  <div>world</div>
  <div>goodbye</div>
</div>

Here is the numbering of the children:

<div>\n #<---newline plus spaces at start of next line = child 0
  <div>hello</div>\n #<---newline plus spaces at start of next line = child 2
  <div>world</div>\n #<---newline plus spaces at start of next line = child 4
  <div>goodbye</div>\n #<---newline = child 6
</div>

The divs are actually children number 1, 3, and 5. If you are having trouble locating elements by position in parsed HTML, the newlines at the end of each line are very often what is tripping you up. The newlines always have to be accounted for and factored into your thinking about where things are located.
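The numbering can be checked with a short sketch, again assuming Beautiful Soup 4 with the standard-library "html.parser" backend (other parsers may wrap the fragment in extra tags). The names outer, tag_indexes, and third_div are only illustrative:

```python
from bs4 import BeautifulSoup, Tag

html = """<div>
  <div>hello</div>
  <div>world</div>
  <div>goodbye</div>
</div>"""

outer = BeautifulSoup(html, "html.parser").div

# Walk every direct child, whitespace text nodes included.
for index, child in enumerate(outer.contents):
    kind = "tag" if isinstance(child, Tag) else "text"
    print(index, kind)

# Positions of the real tags among the children.
tag_indexes = [i for i, c in enumerate(outer.contents) if isinstance(c, Tag)]
print(tag_indexes)  # [1, 3, 5]

# Asking for tags by name sidesteps the whitespace entirely.
third_div = outer.find_all("div", recursive=False)[2]
print(third_div.get_text())  # goodbye
```

Indexing into the children directly means counting the text nodes; filtering by tag name with recursive=False gives the "third div" without that bookkeeping.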

To get the <div> tag here:

<h1><a name="hello">Hello</a></h1>\n #<---newline
<div class="colmask">

...you could write:

h1.nextSibling.nextSibling

But to skip ALL the whitespace between tags, it's easier to use findNextSibling(), which allows you to specify the tag name of the next sibling you want to locate:

findNextSibling('div')
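As a rough comparison of the two approaches, here is a sketch assuming Beautiful Soup 4, where the methods are named next_sibling and find_next_sibling. The text inside the <div> is a placeholder added for this snippet:

```python
from bs4 import BeautifulSoup

html = '<h1><a name="hello">Hello</a></h1>\n<div class="colmask">target</div>'
soup = BeautifulSoup(html, "html.parser")
h1 = soup.h1

# Hop over the newline text node by hand...
by_hopping = h1.next_sibling.next_sibling

# ...or ask for the next sibling that is a <div>, skipping any text nodes.
by_name = h1.find_next_sibling("div")

print(by_hopping is by_name)  # True
print(by_name["class"])       # ['colmask']
```

Both expressions reach the same element here, but the second one keeps working if the amount of whitespace between the tags changes.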

Here is an example:

# Python 2 with BeautifulSoup 3
from BeautifulSoup import BeautifulSoup

with open('data2.txt') as f:
    html = f.read()

soup = BeautifulSoup(html)

for h1 in soup.findAll('h1'):
    # Skip the whitespace after </h1> and go straight to the next <div>
    colmask_div = h1.findNextSibling('div')

    # Each nested <div> holds one <h4> heading and several <ul> lists
    for box_div in colmask_div.findAll('div'):
        h4 = box_div.find('h4')

        for ul in box_div.findAll('ul'):
            print '{} : {} : {}'.format(h1.text, h4.text, ul.li.a.text)



--output:--
Hello : My Favorite Number is : 1
Hello : My Favorite Number is : 2
Hello : My Favorite Number is : 3
Hello : My Favorite Number is : 4
Hello : Your Favorite Number is : 1
Hello : Your Favorite Number is : 2
Hello : Your Favorite Number is : 3
Hello : Your Favorite Number is : 4
Goodbye : Their Favorite Number is : 1
Goodbye : Their Favorite Number is : 2
Goodbye : Their Favorite Number is : 3
Goodbye : Their Favorite Number is : 4
Goodbye : Our Favorite Number is : 1
Goodbye : Our Favorite Number is : 2
Goodbye : Our Favorite Number is : 3
Goodbye : Our Favorite Number is : 4
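For readers on Python 3, the same loop might look like the sketch below with Beautiful Soup 4. The contents of data2.txt are not shown in this answer, so the html string is a hypothetical, shortened stand-in whose structure (and the "box" class name) is inferred from the code above:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for data2.txt; the real file is not shown.
html = """
<h1><a name="hello">Hello</a></h1>
<div class="colmask">
  <div class="box">
    <h4>My Favorite Number is</h4>
    <ul><li><a href="#">1</a></li></ul>
    <ul><li><a href="#">2</a></li></ul>
  </div>
</div>
<h1><a name="goodbye">Goodbye</a></h1>
<div class="colmask">
  <div class="box">
    <h4>Their Favorite Number is</h4>
    <ul><li><a href="#">1</a></li></ul>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for h1 in soup.find_all("h1"):
    # Skip whitespace text nodes and land on the <div> after each <h1>.
    colmask_div = h1.find_next_sibling("div")
    if colmask_div is None:
        continue

    for box_div in colmask_div.find_all("div"):
        h4 = box_div.find("h4")
        if h4 is None:
            continue

        for ul in box_div.find_all("ul"):
            rows.append("{} : {} : {}".format(
                h1.get_text(), h4.get_text(), ul.li.a.get_text()))

for row in rows:
    print(row)
```

The None checks are an addition for safety; the original loop assumes every <h1> is followed by a <div> and every inner <div> has an <h4>.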
