I thought it would be IO bound (that's why I started with Python in the first place), but since I was also extracting links and doing some work on the graph, it turned out to be more CPU intensive. Granted, maybe I could have written better code, used better libraries, or tried multiprocessing (though that would have been painful). I admit I didn't look much into how I could improve it within Python; I just went with Go because that was quicker for me.
Well, extracting links etc. is super fast with lxml's XPath. It is written in C, and I don't think a hand-written parser would be faster.
For example, to extract links from hacker news homepage, you would just do
xpath('//tr/td[@class="title"]/a/@href')
This will be really fast, and you can do even better with a more specific XPath. I extracted about 10k links a second from documents this way and was still network bound. Usually the real limit is websites throttling you.
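To make that concrete, here's a minimal, self-contained sketch of the lxml approach. The HTML snippet is a made-up stand-in for the HN homepage markup (the real page's structure may differ); the XPath is the one quoted above.

```python
# Sketch of link extraction with lxml's XPath.
# The markup below is a hypothetical stand-in for the HN homepage.
from lxml import html

page = html.fromstring("""
<table>
  <tr><td class="title"><a href="https://example.com/a">Story A</a></td></tr>
  <tr><td class="title"><a href="https://example.com/b">Story B</a></td></tr>
</table>
""")

# Attribute selection (/@href) returns the href strings directly.
links = page.xpath('//tr/td[@class="title"]/a/@href')
print(links)
```

Since the XPath evaluation happens in C (libxml2), the Python-level cost is essentially one call per document, which is why this stays network bound even at high throughput.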
I was using BeautifulSoup with the lxml backend, I believe; I should have mentioned that earlier. There was also some other graph manipulation, like favoring links with more inlinks, and keeping the crawler polite but still busy by rotating across other domains. That is probably more expensive than extracting links. I had a submission deadline, and whatever I tried in that time with Python didn't work. It was just easier to write faster code in Go (except maybe where regexes are involved; now I remember I used some Go markup parser instead, which is now in their library).
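The "polite but still busy" idea can be sketched as a per-domain frontier: enforce a minimum delay per domain, but instead of sleeping, pull the next URL from whichever domain is ready. This is a hedged sketch, not the author's actual code; the class name, delay value, and queue layout are all made up for illustration.

```python
# Hypothetical sketch: a crawl frontier that stays polite per domain
# but keeps the crawler busy by trying other domains' queues.
import time
from collections import defaultdict, deque
from urllib.parse import urlparse

class PoliteFrontier:
    def __init__(self, delay=1.0):
        self.delay = delay                # min seconds between hits to one domain
        self.queues = defaultdict(deque)  # domain -> pending URLs
        self.last_hit = {}                # domain -> time of last fetch

    def add(self, url):
        self.queues[urlparse(url).netloc].append(url)

    def next_url(self, now=None):
        """Return a URL from any domain whose delay has elapsed, else None."""
        now = time.monotonic() if now is None else now
        for domain, queue in self.queues.items():
            if queue and now - self.last_hit.get(domain, 0.0) >= self.delay:
                self.last_hit[domain] = now
                return queue.popleft()
        return None  # every non-empty domain is still throttled
```

A priority on inlink counts would slot in here too, e.g. by replacing each deque with a heap keyed on (negative) inlink count, at some extra bookkeeping cost, which is part of why this is pricier than the link extraction itself.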