We actually do what you describe sometimes as well, in particular when scraping sites with robust bot counter-measures (to save on Crawlera [1] usage), or on crawls that take long enough that there's a genuine possibility the site will change before you're done.
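For what it's worth, the caching side of that is simple. A minimal sketch, assuming a file-based cache keyed by a URL hash (in Ruby, to match the Nokogiri discussion below; the names and cache location are made up, and real code would route the fetch through the proxy):

    require 'digest'
    require 'fileutils'
    require 'open-uri'

    CACHE_DIR = 'page_cache'  # hypothetical location

    # Fetch a URL, serving from the local cache when possible, so each
    # page costs at most one metered proxy request across re-runs.
    def fetch_cached(url)
      FileUtils.mkdir_p(CACHE_DIR)
      path = File.join(CACHE_DIR, Digest::SHA256.hexdigest(url))
      return File.read(path) if File.exist?(path)
      body = URI.open(url).read  # a real crawler would go through the proxy here
      File.write(path, body)
      body
    end

Keying on the full URL is the simplest scheme; anything fancier (normalizing query strings, expiring stale entries) can be layered on later.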
On the one hand, there's no shortage of users who want to crawl popular sites to monitor, e.g., search engine rankings or prices. That's kind of shady in some sense, or maybe not: when there's no API, there's no other way...
On the other hand, there are also areas of the web where crawlers are simply not welcome. For instance, DARPA uses a number of our technologies to monitor the dark web for criminal activity:
As an "early" programmer playing with web scraping with the Nokogiri gem, I've been wondering about this aspect (although haven't encountered it yet).
Are there legal implications to scraping a site that actively tries to prevent bots from scraping it? I mean, if the data is publicly accessible on the web, could they go after you?
I don't plan on doing this for anything malicious, and like I said, I haven't encountered it yet. It's just a "what if" thought: what would my legal risks be if I'm playing around with this, and could a site come after me?
> Are there legal implications to scraping a site that actively tries to prevent bots from scraping it? I mean, if the data is publicly accessible on the web, could they go after you?
When we do projects, the baseline is: if Google can see it, we can too. So from a legal standpoint, if Google is covered, so are we.
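One rough way to apply that baseline in code is to check whether the site's robots.txt allows Googlebot to fetch the path you're after. A simplified sketch (the helper is hypothetical, and a real parser should also handle Allow rules, wildcards, and grouped User-agent lines):

    require 'open-uri'

    # Returns true if robots.txt disallows `path` for `agent` (or for '*').
    def disallowed?(host, path, agent)
      rules = URI.open("https://#{host}/robots.txt").read
      current = nil
      rules.each_line do |line|
        line = line.split('#').first.to_s.strip
        if line =~ /\Auser-agent:\s*(.+)\z/i
          current = Regexp.last_match(1).strip
        elsif line =~ /\Adisallow:\s*(\S+)/i
          return true if ['*', agent].include?(current) &&
                         path.start_with?(Regexp.last_match(1))
        end
      end
      false
    end

    disallowed?('example.com', '/search', 'Googlebot')  # true or false depending on the site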
Firms do go after web scrapers, and they lose more often than not. The exception is when you're logged in while crawling: in that case you've implicitly accepted the terms of use. Some companies sue aggressively over scraping done while logged in, so it's best to stay on the safe side. Further reading on the topic:
[1]: http://crawlera.com