I got so annoyed by this kind of tedious web scraping work (maintenance, proxies, etc.) that I'm now trying to fully automate it with LLMs.
AI should automate repetitive and un-creative work, and web scraping definitely fits this description.
It's a boring but challenging problem.
I've started using LLMs to generate web scrapers and data processing steps on the fly that adapt to website changes. Using an LLM for every data extraction would be expensive and slow, but using LLMs to generate the scraper code and later adapt it to website modifications is highly efficient.
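A minimal Python sketch of that idea, not the service's actual implementation: the generated extractor is reused for every page, and the generator is only called again when the output fails a basic validity check. `fake_generate` is a hypothetical stand-in for the LLM call, and `SelfHealingScraper` and the `title` field are made up for illustration.

```python
import re
from typing import Callable, List, Optional

Extractor = Callable[[str], Optional[dict]]


def fake_generate(sample_html: str) -> Extractor:
    """Hypothetical stand-in for the LLM call that writes scraper code."""
    # Pretend the model inspected the sample page and picked the right tag.
    tag = "h1" if "<h1" in sample_html else "h2"
    pattern = re.compile(rf"<{tag}[^>]*>(.*?)</{tag}>", re.S)

    def extractor(html: str) -> Optional[dict]:
        match = pattern.search(html)
        return {"title": match.group(1).strip()} if match else None

    return extractor


class SelfHealingScraper:
    """Reuses a generated extractor; regenerates only when output is invalid."""

    def __init__(self, generate: Callable[[str], Extractor], required: List[str]):
        self.generate = generate
        self.required = required
        self.extractor: Optional[Extractor] = None
        self.generations = 0  # counts the (expensive) generator calls

    def _valid(self, record: Optional[dict]) -> bool:
        return record is not None and all(record.get(k) for k in self.required)

    def extract(self, html: str) -> dict:
        # Cheap path: run the cached extractor, no model involved.
        if self.extractor is not None:
            record = self.extractor(html)
            if self._valid(record):
                return record
        # Expensive path: the page layout is new or has changed, so regenerate.
        self.generations += 1
        self.extractor = self.generate(html)
        record = self.extractor(html)
        if not self._valid(record):
            raise ValueError("regenerated extractor still fails validation")
        return record


scraper = SelfHealingScraper(fake_generate, required=["title"])
for page in ["<h1>First</h1>", "<h1>Second</h1>", "<h2>Redesigned</h2>"]:
    print(scraper.extract(page), "generator calls so far:", scraper.generations)
```

Three pages are processed here with only two generator calls: one for the initial layout and one after the simulated redesign.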
The service is using many small AI agents that basically just pick the right strategy for a specific sub-task in our workflows. In our case, an agent is a medium-sized LLM prompt that has a) context and b) a set of functions available to call.
Tasks involve automatically deciding how to access a website (proxy, browser), navigating through pages, analyzing network calls, and transforming the data into the same structure.
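As a rough illustration of "a prompt with context and a set of functions", here is a small Python sketch. Everything in it is an assumption made for the example: the `Agent` class, the tool names, and `fake_choose`, which stands in for the LLM call with a simple rule. The point it shows is the constraint: the model only returns the name of a strategy, and plain code checks that name against the allowed set before running anything.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Agent:
    prompt: str                                       # instructions for the sub-task
    tools: Dict[str, Callable[[dict], str]]           # functions the agent may call
    choose: Callable[[str, dict, List[str]], str]     # stand-in for the LLM call

    def run(self, context: dict) -> str:
        choice = self.choose(self.prompt, context, sorted(self.tools))
        # Constrain the model: anything outside the allowed set is rejected.
        if choice not in self.tools:
            raise ValueError(f"model picked unknown tool: {choice!r}")
        return self.tools[choice](context)


# Hypothetical access strategies; real ones would perform network requests.
def plain_http(ctx: dict) -> str:
    return f"http:{ctx['url']}"


def proxy_http(ctx: dict) -> str:
    return f"proxy:{ctx['url']}"


def headless_browser(ctx: dict) -> str:
    return f"browser:{ctx['url']}"


def fake_choose(prompt: str, context: dict, tool_names: List[str]) -> str:
    """Rule-based placeholder for the model's decision."""
    if context.get("needs_js"):
        return "headless_browser"
    if context.get("blocked"):
        return "proxy_http"
    return "plain_http"


access_agent = Agent(
    prompt="Pick the cheapest access strategy that works for this site.",
    tools={
        "plain_http": plain_http,
        "proxy_http": proxy_http,
        "headless_browser": headless_browser,
    },
    choose=fake_choose,
)

print(access_agent.run({"url": "https://example.com", "needs_js": True}))
```

Keeping each agent this narrow means a bad model answer fails loudly at the validation step instead of triggering arbitrary behavior.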
The main challenge:
We quickly realized that doing this for a few low-complexity data sources is one thing; doing it for thousands of websites in a reliable, scalable, and cost-efficient way is a whole different beast.
The integration of tightly constrained agents with traditional engineering methods effectively solved this issue.
We found LangChain and other agentic frameworks to have too much overhead, so we built our own tailored orchestration layer. Authenticated scraping is currently in beta. Could you email me your use case (see my profile)?
+1 on the question about scraping behind authentication. One huge use case we have as an ecommerce store is crawling data from our vendors, which have no export files or only incomplete ones.
Feel free to give it a try: https://www.kadoa.com/add