I'm not sure why you think the web is less parseable now. The HTML5 spec fully defines the parsing algorithm, error recovery included, and it's easy to get a compliant HTML5 parser for whatever language. Back in the day, people were just doing regex.
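For instance, a minimal Python sketch (the URL is a placeholder); html5lib implements the spec's parsing algorithm, so it chews through the same tag soup a browser would:

    import requests
    from bs4 import BeautifulSoup

    html = requests.get("https://example.com").text
    soup = BeautifulSoup(html, "html5lib")  # spec-compliant HTML5 parser
    for link in soup.find_all("a", href=True):
        print(link["href"])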
There's a separate issue that a lot of stuff requires JS, but the JS mostly just calls JSON endpoints, so that's easy to scrape. The tricky thing is scraping ASPX sites that jump through a bunch of hoops instead of having a simple backend API.
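Something like this usually does the job (a hypothetical sketch; the endpoint URL and response shape are made up, you'd find the real ones in your browser's dev tools):

    import requests

    resp = requests.get(
        "https://example.com/api/items",        # spotted in the network tab
        params={"page": 1},
        headers={"Accept": "application/json"},
    )
    resp.raise_for_status()
    for item in resp.json():                    # assuming a JSON array comes back
        print(item)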
Your timeline is completely confused. ASPX came out in 2002, after HTML4. HTML2 was the era when CGI scripts were dominant.
Next, HTML2+SGML were also well-designed, and people weren't just doing regexps. The mess didn't come in until HTML3, and even more so HTML4.
Today, it's easy to parse /specific/ pages. If I want to automate one web page, and it's well-formed, HTML5+AJAX makes that easy.
However, in contrast to HTML2, it's very hard to parse pages /generically/. That's why I gave the example of AltaVista and a11y tools, which need to work with any web site.
Try to make something like that today: a generic search spider, a web browser, or an a11y tool. See how far you get talking to JSON endpoints. They're often easy enough to reverse-engineer for a specific web site, but you need a human-in-the-loop for each web site. With HTML2, one would build tools which could work with /any/ web site.
And boy were there a lot of tools. Look at all the web browsers of the nineties, and the innovation there.
Oooo well said. One of the first programs I ever wrote was a scraper for such an ASPX site. Parsing state ids and reposting them over and over again... what a joy it was.
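For anyone who hasn't had the pleasure, the dance looks roughly like this (a sketch with a made-up URL and control name; the hidden fields are the standard WebForms ones):

    import requests
    from bs4 import BeautifulSoup

    URL = "https://example.com/results.aspx"    # placeholder
    session = requests.Session()
    soup = BeautifulSoup(session.get(URL).text, "html5lib")

    def hidden(name):
        # WebForms stashes serialized server-side state in hidden inputs
        tag = soup.find("input", {"name": name})
        return tag["value"] if tag else ""

    payload = {
        "__VIEWSTATE": hidden("__VIEWSTATE"),
        "__VIEWSTATEGENERATOR": hidden("__VIEWSTATEGENERATOR"),
        "__EVENTVALIDATION": hidden("__EVENTVALIDATION"),
        "__EVENTTARGET": "ctl00$gvResults$btnNext",  # hypothetical control id
        "__EVENTARGUMENT": "",
    }
    next_page = session.post(URL, data=payload)  # repost it all to "click" next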
Much as an argument confusing World War II Germany with the Holy Roman Empire might be called 'well-said.' You're confusing the early web era with the dot-com boom/bust period.
The early web was the era of dozens, perhaps hundreds, of competing web browsers, made possible by simple, well-engineered web standards. Pages were served statically or with CGI scripts. You had a whole swarm of generic spiders, crawlers, and bots which automated things on the web for you. Anyone could write a web browser, so many people did.
The dot-com boom/bust had companies doubling in size every few months, people who could barely code HTML making 6-figure salaries, Netscape imploding, early JavaScript (which, at the time, looked like a high schooler's attempt at a programming language), and web standards with every conceivable ill-thought-out idea grafted in.
If one of the first programs you ever wrote was a scraper for an ASPX site, you never saw the elegance of the early days. ASPX came out not just after HTML3, but after HTML4.
If you define the early web as pre-1998, then you’re essentially talking about five guys who all had computer science backgrounds. Yes, they were good at their jobs, but it was never going to last. Increasing the number of web developers 1000x necessarily dragged their average skill level down toward that of the population at large.
Most definitions of the early web include the PHP Cambrian explosion, because essentially all websites today got their start then, and only a few horseshoe-crab sites (mostly the homepages of CS profs!) predating it survive. Gopher sites were probably really easy to scrape too. ;-)
It was before your time, kid. (1) I think you underestimate the early web by quite a bit. It had a lot more awesome than you give it credit for, and if not for the dot-com bubble + bust, it would have evolved in a much more thoughtful way. (2) The dot-com boom and growing developers 1000x didn't need to involve the Netscape, Microsoft/IE, or W3C implosions of the time. Those were a question of management decisions and personalities.
But my original comment was 100% unambiguous: "I liked HTML2. I hated basically everything which went into HTML3 and HTML4."
Y'all responded by citing things from the HTML3/HTML4 era as examples of what went wrong...
---
Note: Before I get jumped on for "kid," it's the username.
Fair enough. I actually was a kid in 1998. I believe I started “programming” HTML in 1997 or so (copying view source and uploading to my internet host). There were some cool things like HotWired and Suck.com (and the bus on the MSN splash page!), but it was just a vastly smaller space than now. Even GeoCities doesn’t really make your cutoff, so it’s hard to compare.