
I got so annoyed by this kind of tedious web scraping work (maintenance, proxies, etc.) that I'm now trying to fully automate it with LLMs. AI should automate repetitive and uncreative work, and web scraping definitely fits this description.

It's a boring but challenging problem.

I've started using LLMs to generate web scrapers and data processing steps on the fly that adapt to website changes. Using an LLM for every data extraction would be expensive and slow, but using LLMs to generate the scraper code once, and to regenerate it only when the website changes, is highly efficient.
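To make the cost argument concrete, here is a minimal sketch of the "generate once, regenerate on breakage" idea. It is not Kadoa's actual code: `AdaptiveScraper`, `fake_llm_generate`, and the regex-based extractor are hypothetical names, and the generator is a plain function standing in for a real LLM call.

```python
import re
from typing import Callable, Dict, Optional

# An extractor takes raw HTML and returns a record, or None if it no longer matches.
Extractor = Callable[[str], Optional[dict]]


def fake_llm_generate(sample_html: str) -> Extractor:
    """Hypothetical stand-in for an LLM call that writes scraper code from a sample page."""
    tag = "h1" if "<h1" in sample_html else "h2"
    pattern = re.compile(rf"<{tag}[^>]*>(.*?)</{tag}>", re.S)

    def extractor(html: str) -> Optional[dict]:
        match = pattern.search(html)
        return {"title": match.group(1).strip()} if match else None

    return extractor


class AdaptiveScraper:
    """Caches one generated extractor per site and regenerates it only when it fails."""

    def __init__(self, generate: Callable[[str], Extractor]):
        self._generate = generate
        self._cache: Dict[str, Extractor] = {}
        self.llm_calls = 0  # counts the expensive generation calls

    def extract(self, site: str, html: str) -> dict:
        extractor = self._cache.get(site)
        if extractor is not None:
            result = extractor(html)
            if result is not None:
                return result  # cheap path: no LLM involved
        # Cache miss or the layout changed: pay for one regeneration.
        self.llm_calls += 1
        extractor = self._generate(html)
        self._cache[site] = extractor
        result = extractor(html)
        if result is None:
            raise ValueError(f"regenerated scraper for {site!r} still fails")
        return result


scraper = AdaptiveScraper(fake_llm_generate)
first = scraper.extract("shop", "<h1>Widget</h1>")
second = scraper.extract("shop", "<h1>Gadget</h1>")   # reuses the cached extractor
third = scraper.extract("shop", "<h2>Gizmo</h2>")     # layout changed, regenerates
```

In this toy version the expensive call runs per layout change rather than per page, which is the whole point of the approach.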

The service uses many small AI agents that basically just pick the right strategy for a specific sub-task in our workflows. In our case, an agent is a medium-sized LLM prompt that has a) context and b) a set of functions available to call. Tasks include automatically deciding how to access a website (proxy, browser), navigating through pages, analyzing network calls, and transforming the data into the same structure.
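Roughly, "context plus a set of callable functions" can be sketched like this. Everything here is an assumption for illustration: `Agent`, `via_proxy`, `via_browser`, and `fake_choose` are made-up names, and `fake_choose` is a hard-coded rule where the real thing would be an LLM call that returns a function name.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Agent:
    context: str                                   # the prompt context for this sub-task
    tools: Dict[str, Callable[[str], str]]         # the only functions the agent may call
    choose: Callable[[str, str, List[str]], str]   # stand-in for the LLM picking a tool

    def run(self, task: str) -> str:
        name = self.choose(self.context, task, sorted(self.tools))
        # Tight constraint: anything outside the registered tool set is rejected.
        if name not in self.tools:
            raise ValueError(f"model chose unknown tool: {name!r}")
        return self.tools[name](task)


def via_proxy(url: str) -> str:
    return f"proxy:{url}"      # placeholder for a plain HTTP fetch through a proxy


def via_browser(url: str) -> str:
    return f"browser:{url}"    # placeholder for a headless-browser fetch


def fake_choose(context: str, task: str, tool_names: List[str]) -> str:
    # Hypothetical rule standing in for the model's decision.
    return "via_browser" if "spa." in task else "via_proxy"


access_agent = Agent(
    context="Decide how to access the given URL.",
    tools={"via_proxy": via_proxy, "via_browser": via_browser},
    choose=fake_choose,
)
static_result = access_agent.run("https://example.com/products")
dynamic_result = access_agent.run("https://spa.example.com/products")
```

Keeping each agent to one narrow decision and validating its choice against a fixed tool list is one way to read "tightly constrained".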

The main challenge:

We quickly realized that doing this for a few data sources with low complexity is one thing; doing it for thousands of websites in a reliable, scalable, and cost-efficient way is a whole different beast.

Combining tightly constrained agents with traditional engineering methods is what solved this for us.

Feel free to give it a try: https://www.kadoa.com/add



Kadoa looks great. For tool discovery/usage, are you using LangChain or something else?

Also, do you support scraping private sites, i.e., sites that require a login/password to access the data to scrape?

Thank you!


We found LangChain and other agentic frameworks to have too much overhead, so we built our own tailored orchestration layer. Authenticated scraping is currently in beta. Could you email me your use case (see my profile)?


+1 on the question about scraping behind authentication. One huge use case we have as an ecommerce store is crawling data from our vendors, which either have no export files or only incomplete ones.


Incredible product, will give it a spin soon. How do you do under volume? I tried it out with Google but it was quite slow.


The minimum extraction cost is 100 credits, so only 250 pages could be parsed with the regular plan?



