Exactly. Semantically understanding the website structure is only one of many challenges in web scraping. Others include:
* Ensuring data accuracy (avoiding hallucination, adapting to website changes, etc.)
* Handling large data volumes
* Managing proxy infrastructure
* Elements of RPA to automate scraping tasks like pagination, login, and form-filling
At https://kadoa.com, we are spending a lot of effort solving each of these points with custom engineering and fine-tuned LLM steps.
Extracting a few data records from a single page with GPT is quite easy. Reliably extracting 100k records from 10 different websites on a daily basis is a whole different beast :)
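To make the hallucination point concrete, here is a minimal sketch of one possible accuracy check: flag any extracted field whose value does not literally appear in the page text. This is only an illustration, not a description of how Kadoa does it. The function names (`normalize`, `ungrounded_fields`) and the sample page and record are hypothetical.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so comparisons ignore formatting."""
    return re.sub(r"\s+", " ", text).strip().lower()


def ungrounded_fields(record: dict[str, str], page_text: str) -> list[str]:
    """Return the names of fields whose values do not appear in the page text."""
    haystack = normalize(page_text)
    return [name for name, value in record.items()
            if normalize(value) not in haystack]


# Hypothetical page text and a hypothetical LLM-extracted record.
page_text = "Acme Widget\n  Price: $19.99\nIn stock"
record = {"name": "Acme Widget", "price": "$19.99", "sku": "AW-1001"}

print(ungrounded_fields(record, page_text))  # ['sku']
```

A substring check like this only catches values invented from nothing. It will not catch a real value assigned to the wrong field, and it will wrongly flag values the model legitimately reformatted (dates, currencies), so at scale it would be one signal among several.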