I thought it would be IO bound (that's why I started with Python in the first place), but since I was also extracting links and doing some work on the graph, it turned out to be more CPU intensive. Granted, maybe I could have written better code, used better libraries, or tried multiprocessing (though that would have been painful). I admit I didn't look much into how I could improve it within Python; I just went with Go because that was quicker for me.
Well, extracting links etc. is super fast with lxml's XPath. It is written in C, and I don't think a hand-written parser would be faster.
For example, to extract links from hacker news homepage, you would just do
xpath('//tr/td[@class="title"]/a/@href')
This will be really fast, and you can do even better with a more specific XPath. I extracted about 10k links a second from documents this way and was still network bound. Usually the real limit is websites throttling you.
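To make that concrete, here's a minimal, self-contained sketch of the lxml approach. The HTML snippet is a made-up stand-in for the HN homepage markup (the real page's structure may differ); the XPath is the one quoted above.

```python
# Sketch of link extraction with lxml's XPath.
# The markup below is a hypothetical stand-in for the HN homepage.
from lxml import html

page = html.fromstring("""
<table>
  <tr><td class="title"><a href="https://example.com/a">Story A</a></td></tr>
  <tr><td class="title"><a href="https://example.com/b">Story B</a></td></tr>
</table>
""")

# Attribute selection (/@href) returns the href strings directly.
links = page.xpath('//tr/td[@class="title"]/a/@href')
print(links)
```

Since the XPath evaluation happens in C (libxml2), the Python-level cost is essentially one call per document, which is why this stays network bound even at high throughput.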
I was using BeautifulSoup with the lxml backend, I believe; I should have mentioned that earlier. There was also some other graph manipulation, like favoring links with more inlinks, and keeping the crawler polite but still busy by rotating across other domains. That is probably more expensive than extracting links. I had a submission deadline, and whatever I tried in that time with Python didn't work. It was just easier to write faster code in Go (except maybe where regexes are involved; now I remember I used some Go markup parser instead, which is now in their library).
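The "polite but still busy" idea can be sketched as a per-domain frontier: enforce a minimum delay per domain, but instead of sleeping, pull the next URL from whichever domain is ready. This is a hedged sketch, not the author's actual code; the class name, delay value, and queue layout are all made up for illustration.

```python
# Hypothetical sketch: a crawl frontier that stays polite per domain
# but keeps the crawler busy by trying other domains' queues.
import time
from collections import defaultdict, deque
from urllib.parse import urlparse

class PoliteFrontier:
    def __init__(self, delay=1.0):
        self.delay = delay                # min seconds between hits to one domain
        self.queues = defaultdict(deque)  # domain -> pending URLs
        self.last_hit = {}                # domain -> time of last fetch

    def add(self, url):
        self.queues[urlparse(url).netloc].append(url)

    def next_url(self, now=None):
        """Return a URL from any domain whose delay has elapsed, else None."""
        now = time.monotonic() if now is None else now
        for domain, queue in self.queues.items():
            if queue and now - self.last_hit.get(domain, 0.0) >= self.delay:
                self.last_hit[domain] = now
                return queue.popleft()
        return None  # every non-empty domain is still throttled
```

A priority on inlink counts would slot in here too, e.g. by replacing each deque with a heap keyed on (negative) inlink count, at some extra bookkeeping cost, which is part of why this is pricier than the link extraction itself.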