Exactly, semantically understanding the website structure is only one challenge ... | Hacker News


hubraumhugo on March 25, 2023 | on: Experimental library for scraping websites using O...

Exactly, semantically understanding the website structure is only one challenge of many with web scraping:

* Ensuring data accuracy (avoiding hallucination, adapting to website changes, etc.)
* Handling large data volumes
* Managing proxy infrastructure
* Elements of RPA to automate scraping tasks like pagination, login, and form-filling

At https://kadoa.com, we are spending a lot of effort solving each of these points with custom engineering and fine-tuned LLM steps.

Extracting a few data records from a single page with GPT is quite easy. Reliably extracting 100k records from 10 different websites on a daily basis is a whole different beast :)
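(Illustration, not part of the comment above and not a description of how Kadoa works: a minimal sketch of one possible check for the "avoiding hallucination" point. It assumes an LLM has already returned a record as a dict of strings, and flags any field whose value does not appear verbatim in the page text. The function names, the sample page, and the sample record are all hypothetical.)

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def ungrounded_fields(record: dict[str, str], page_text: str) -> list[str]:
    """Return the names of fields whose values are not found verbatim in the page text."""
    haystack = normalize(page_text)
    return [name for name, value in record.items() if normalize(value) not in haystack]


# Hypothetical page text and extracted record; "sku" is not present on the page.
page_text = "Acme Widget\n  Price: $19.99\nIn stock"
record = {"name": "Acme Widget", "price": "$19.99", "sku": "AW-1234"}

print(ungrounded_fields(record, page_text))  # ['sku']
```

A verbatim check like this only catches values that were invented outright; it would not notice a real value assigned to the wrong field, and it would wrongly flag values the model legitimately reformatted (dates, currencies).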

ec109685 on March 27, 2023

It's frustrating that the only option to learn more is to book a demo, and that things like the API documentation are dead ends: https://www.kadoa.com/kadoa-api

The landing page does not provide nearly enough information on how it works in practice. Is it automated or is custom code written for each site?
