
Exactly, semantically understanding the website structure is only one of many challenges in web scraping:

* Ensuring data accuracy (avoiding hallucination, adapting to website changes, etc.)

* Handling large data volumes

* Managing proxy infrastructure

* Elements of RPA to automate scraping tasks like pagination, login, and form-filling
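On the data-accuracy point, one common pattern (a hypothetical sketch, not Kadoa's actual pipeline) is to validate every LLM-extracted record against an expected schema and some plausibility rules, so hallucinated fields or values get flagged instead of silently entering the dataset. The field names and rules here are illustrative:

```python
# Hypothetical sketch: guarding against LLM hallucination when extracting
# structured records. Field names and plausibility rules are illustrative.

EXPECTED_FIELDS = {"title": str, "price": float, "url": str}

def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one extracted record."""
    problems = []
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"bad type for {field}: {type(record[field]).__name__}")
    # Reject values the source page could not plausibly contain --
    # a common symptom of hallucination.
    if isinstance(record.get("price"), float) and record["price"] < 0:
        problems.append("implausible negative price")
    return problems

record = {"title": "Widget", "price": -4.2}
print(validate_record(record))  # flags the missing url and the negative price
```

Records that fail validation can be retried, re-extracted with a different prompt, or routed to human review rather than shipped.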

At https://kadoa.com, we are spending a lot of effort solving each of these points with custom engineering and fine-tuned LLM steps.

Extracting a few data records from a single page with GPT is quite easy. Reliably extracting 100k records from 10 different websites on a daily basis is a whole different beast :)
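The easy single-page case might look something like the sketch below: build a prompt asking the model for JSON only, then parse its reply defensively, since models often wrap JSON in prose or code fences. The prompt wording and helper names are assumptions for illustration, not any particular product's API:

```python
import json
import re

def build_extraction_prompt(html: str, fields: list[str]) -> str:
    """Illustrative prompt asking the model to return only a JSON array."""
    return (
        "Extract every product from the HTML below as a JSON array of objects "
        f"with exactly these keys: {', '.join(fields)}. Return JSON only.\n\n"
        + html
    )

def parse_records(response_text: str) -> list[dict]:
    """Models often wrap JSON in prose or fences; pull out the first array."""
    match = re.search(r"\[.*\]", response_text, re.DOTALL)
    if not match:
        raise ValueError("no JSON array in model response")
    return json.loads(match.group(0))

# Simulated model reply (no API call here), showing the defensive parse.
reply = 'Sure, here you go:\n```json\n[{"title": "Widget", "price": 9.99}]\n```'
print(parse_records(reply))  # [{'title': 'Widget', 'price': 9.99}]
```

Doing this reliably at scale is where it gets hard: the parse can fail, the schema can drift as the site changes, and every failure mode needs retry and monitoring logic around it.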



Frustrating that the only way to learn more is to book a demo, and that things like the API documentation are dead ends: https://www.kadoa.com/kadoa-api

The landing page does not provide nearly enough information on how it works in practice. Is it automated or is custom code written for each site?



