To lessen my troubles, I stopped hanging out with vultures, and empty saviours like you

January 1st, 2005, By Duncan Gough

// Ugly/Beautiful, what a truly Beautiful Soup

I’ve long suspected it, and now I’m sure: the last couple of screen scrapers I’ve wanted to write have been surprisingly easy, thanks to BeautifulSoup.

Normally, when I have a project that needs to scrape some information from a web page, I abandon it, because the thought (never mind the practice) of writing yet another parser in whatever language is just too much.

Now, having twice parsed web pages whose standards compliance I couldn’t care less about (and have no control over), I know that it can be done in around 5 minutes. Note that this isn’t a w00t-Python post; I’m sure there are tools that do this just as well in Perl or PHP, but BeautifulSoup combined with the Python interactive prompt makes for some very fast work.

Now, as BeautifulSoup says:

“You didn’t write that awful page. You’re just trying to get some data out of it. Right now, you don’t really care what HTML is supposed to look like. Neither does this parser”.

It’s a little lacking in examples, so I thought I’d show you how I’ve used it:

1/ To get a web page into BeautifulSoup:

import urllib2
from BeautifulSoup import BeautifulSoup

url = 'http://www.example.com'
html = urllib2.urlopen(url).read()
soup = BeautifulSoup()
soup.feed(html)

2/ To pull out all the <a> links:

for link in soup('a'):
    print link

3/ To pull out all the table data that has (thankfully) been given a useful class:

for thumb in soup('td', {'class' : 'thumbs'}):
    print thumb

4/ To get all the data from a particular div:

for div in soup('div', {'class' : 'thumbs-filename'}):
    print div.contents[0]
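For anyone trying the four steps above on a current setup, here is a rough sketch of the same idea. It assumes Python 3 and the modern bs4 package rather than the BeautifulSoup module used in this post, and it parses a made-up sample page (SAMPLE_HTML and the scrape function are hypothetical names invented for illustration) instead of fetching a real URL.

```python
# Sketch only: assumes Python 3 and the third-party bs4 package,
# not the 2005-era BeautifulSoup API used in the snippets above.
from bs4 import BeautifulSoup

# Made-up sample markup standing in for a downloaded page. To fetch a real
# page in Python 3 you could use urllib.request.urlopen(url).read() instead.
SAMPLE_HTML = """
<html><body>
<table><tr>
<td class="thumbs"><a href="/photo/1.jpg">one</a>
<div class="thumbs-filename">1.jpg</div></td>
<td class="thumbs"><a href="/photo/2.jpg">two</a>
<div class="thumbs-filename">2.jpg</div></td>
<td class="other"><a href="/about">about</a></td>
</tr></table>
</body></html>
"""


def scrape(html):
    """Return (link hrefs, thumbs cells, filenames) found in an HTML string."""
    # 1/ Get the page into BeautifulSoup, using the stdlib-backed parser.
    soup = BeautifulSoup(html, "html.parser")

    # 2/ Pull out all the <a> links (here, just their href attributes).
    links = [a.get("href") for a in soup.find_all("a")]

    # 3/ Pull out the table cells that carry the useful class.
    thumbs = soup.find_all("td", {"class": "thumbs"})

    # 4/ Take the first child of each matching div, as in step 4 above.
    filenames = [
        str(div.contents[0])
        for div in soup.find_all("div", {"class": "thumbs-filename"})
    ]
    return links, thumbs, filenames


links, thumbs, filenames = scrape(SAMPLE_HTML)
print(links)
print(len(thumbs))
print(filenames)
```

Calling soup('td', ...) directly, as in the snippets above, should still work as shorthand for find_all in bs4; the sketch spells out find_all only to make each step easier to read.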

All of which forms part of a 20-odd line script I’ve just written to back up a couple of pages from fotopic.net (since their toolkit seems to be broken at the vital moment). I started writing it thinking it would take forever, that I’d be bogged down in parsing, splitting and validating the raw HTML. That, of course, turned out to be far, far away from my experience, so I just had to do something with all that spare time.

Thinking about it, I guess that the nearest tool I can find in PHP is the new Tidy extension, but that’s PHP 5 only.

All things considered, the best thing about BeautifulSoup is that it allows me to bastardise a Thelonious Monk standard: it makes the ‘ugly, beautiful’ ;-)

– http://www.frayed.org/crowes/songs/sometimes-salvation.html
