Linux Format

LETTER OF THE MONTH

Scraping the barrel


My heart sank when I noticed LXF309 had a tutorial on web-scraping in Python. Yes, sure enough, another article on using Requests and Beautiful Soup. Both are fine libraries, but together they do not constitute a scraping framework. The article claimed 70 lines of code; of the code that the excellent Scrapy framework gave me, I had to change only 35! You were right to caution against hitting sites too hard, but I solved that problem by uncommenting a single line to enable the autothrottle extension. Yes, Scrapy can use XPath, a powerful and portable domain-specific language for querying web pages, and it can be character-building at first, but it can also use Beautiful Soup. These days, the mere parsing of HTML is often neither sufficient nor necessary for web-crawling.
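For anyone who has not met it, the single line in question would normally sit in the settings.py file that Scrapy generates for a new project (that location is an assumption here; the letter does not say which file was edited). A rough sketch of the relevant section follows. The setting names are Scrapy's own, but the values shown are purely illustrative, not recommendations.

```python
# settings.py (fragment): sketch of the AutoThrottle section of a Scrapy project.
# Assumption: the project was created with "scrapy startproject", which ships
# these lines commented out.

AUTOTHROTTLE_ENABLED = True            # the one line that has to be switched on

# Optional tuning; illustrative values only.
AUTOTHROTTLE_START_DELAY = 5           # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 60            # upper limit on the delay when latency is high
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests per remote server
```

With the extension enabled, Scrapy adjusts the delay between requests according to how quickly the remote server responds, rather than relying on a hand-picked sleep.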

The real power of Scrapy is the separation of the code that makes requests from the code that parses the responses. By providing a hand-rail, it makes it much harder for novices to make a mess of things. For more advanced users, features like headless browsers, testing and monitoring are provided by well-integrated third-party plugins. I spent four years in the music industry crawling hundreds of thousands of accounts across dozens of sites; I know whereof I speak. https://github.com/augeas/lxf_archive
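The division of labour described above can be illustrated without Scrapy at all. The sketch below is not Scrapy's API: it uses only the Python standard library, and the names parse, crawl, fetch and PAGES, along with the example.test host, are hypothetical. The parsing function never touches the network, while the crawling loop owns the queue and never looks at HTML. In Scrapy, the framework takes on roughly the role played by crawl here, and a spider's callbacks the role of parse.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _TitleAndLinkParser(HTMLParser):
    """Collects the text of <h2> elements and the href of <a> elements."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self.links = []
        self._in_h2 = False

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
            self.titles.append("")
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False

    def handle_data(self, data):
        if self._in_h2:
            self.titles[-1] += data


def parse(url, html):
    """Parsing side: turns one page into items plus URLs worth following."""
    parser = _TitleAndLinkParser()
    parser.feed(html)
    items = [{"url": url, "title": title.strip()} for title in parser.titles]
    follow = [urljoin(url, href) for href in parser.links]
    return items, follow


def crawl(start_url, fetch, max_pages=10):
    """Requesting side: owns the queue, de-duplication and the page budget.

    fetch is any callable mapping a URL to HTML text, so a throttled HTTP
    client and a canned dictionary are interchangeable.
    """
    seen, queue, results = set(), [start_url], []
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        items, follow = parse(url, fetch(url))
        results.extend(items)
        queue.extend(u for u in follow if u not in seen)
    return results


# Canned pages stand in for the network, so the parsing can be tested offline.
PAGES = {
    "http://example.test/": '<h2>First</h2><a href="/page2">next</a>',
    "http://example.test/page2": '<h2>Second</h2><a href="/">back</a>',
}
results = crawl("http://example.test/", PAGES.__getitem__)
```

Because parse depends only on its arguments, it can be exercised against saved pages, and politeness measures such as throttling can be added to the fetching side without touching any parsing code.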

Dr Greenway

Neil says…

Thank you for your thoughtful reply; you obviously know a great deal more on the subject than I do (which wouldn’t be hard) or indeed David, who admits he knows nothing of Scrapy. We didn’t set out to write a scraping framework as such; I see our coding tutorials as basic standalone guides to encourage people to play around and experiment with coding. So, they tend to be just examples rather than best in class. There’s always more than one way to skin a cat… But Scrapy does seem to be something we should look into.

Neil says…

I suspect we’ll get back to looking at Vlang at some point, but right now we’re kicking off a multi-part C++ project creating a basic shell from, erm, basics… Meanwhile, I keep getting suggestions for fun Python projects, like web scraping. It’s so hard to fit it all in!

It’s V for Vlang, I’m not sure that’s how the old saying goes?
