
> It could just be me, but I prefer more control, and using the requests library gave me that

I had the exact same opinion until a month ago, when I started using scrapy + scrapyd[1] for a serious scraping task. It's a long-running spider, which is why I decided to use Scrapy in the first place. So far I have been really impressed with it in general and might consider it even for one-off crawling in the future. IMO, pipelines etc. are worth learning. I would also recommend enabling the httpcache middleware during development: it makes testing and debugging easier because you don't have to download the same content again and again.
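
For anyone who hasn't used the cache: a minimal sketch of what enabling it might look like in a project's settings.py. The setting names are Scrapy's own; the values are just assumptions that suit a development setup, not recommendations for production.

```python
# settings.py (Scrapy project settings), development-only HTTP cache.
# The values below are illustrative assumptions; tune them for your crawl.

HTTPCACHE_ENABLED = True  # turn on Scrapy's built-in HTTP cache middleware
HTTPCACHE_EXPIRATION_SECS = 0  # 0 means cached responses never expire
HTTPCACHE_DIR = "httpcache"  # relative paths live under the project's .scrapy dir
HTTPCACHE_IGNORE_HTTP_CODES = [500, 502, 503, 504]  # don't cache server errors
```

With this on, re-running the spider replays responses from disk, so you can iterate on selectors and pipelines without hitting the site each time.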

On a separate note, for one-off scraping it's also worth checking whether a Python script is required at all. For example, if you just need to download all the images from a site, wget with its recursive download option should work pretty well in most cases. A while back I wanted to download all the Game of Life configs (plain text files with a .cells extension) from a certain site[2]. Initially I wrote a Python script, but when I learnt about wget's -r option, I replaced it with the following one-liner:

$ wget -P ./cells -nd -r -l 2 -A cells http://www.bitstorm.org/gameoflife/lexicon/
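
Here -P sets the output directory, -nd avoids recreating the site's directory tree, -r with -l 2 recurses two levels deep, and -A cells keeps only files with that extension. For comparison, a rough stdlib-only sketch of the kind of script this replaces. It is not the original script; the names LinkCollector, cells_links and download_cells are made up for illustration, and it only handles a single index page rather than recursing two levels.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen, urlretrieve


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def cells_links(html, base_url, suffix=".cells"):
    """Return absolute URLs of the links in `html` that end with `suffix`."""
    parser = LinkCollector()
    parser.feed(html)
    return [urljoin(base_url, h) for h in parser.hrefs if h.endswith(suffix)]


def download_cells(index_url, dest="cells"):
    """Fetch one index page and save every .cells file it links to.

    Unlike `wget -r -l 2`, this does not follow links to further pages.
    """
    os.makedirs(dest, exist_ok=True)
    with urlopen(index_url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    for url in cells_links(html, index_url):
        urlretrieve(url, os.path.join(dest, url.rsplit("/", 1)[-1]))
```

Calling download_cells("http://www.bitstorm.org/gameoflife/lexicon/") would be the rough equivalent of the one-liner for a single page, which shows how much the wget flags buy you.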

[1]: http://scrapyd.readthedocs.org/en/latest/

[2]: http://www.bitstorm.org/gameoflife/lexicon/


