The reality is that the HTML+CSS+JS is the canonical form, because it is the for...

pwg · 2025-12-14T20:18:51 1765743531

The reality is that the ratio of "total websites" to "websites with an API" is likely on the order of 1M:1 (a guess). From the scraper's perspective, the chances of even finding a website with an API is so low that they don't bother. Retrieving the HTML gets them 99% of what they want, and works with 100% of the websites they scrape.

Investing the effort to 1) recognize, without programmer intervention, that some random website has an API and then 2) automatically, without further programmer intervention, retrieve the website data from that API and make intelligent use of it, is just not worth it to them when retrieving the HTML just works every time.

edit: corrected inverted ratio

JimDabell · 2025-12-14T23:09:12 1765753752

I’ve implemented a search crawler before, and detecting and switching to the WordPress API was one of the first things I implemented because it’s such an easy win. Practically every WordPress website had it open and there are a vast number of WordPress sites. The content that you can pull from the API is far easier to deal with because you can just pull all the articles and have the raw content plus metadata like tags, without having to try to separate the page content from all the junk that whatever theme they are using adds.

> The reality is that the ratio of "total websites" to "websites with an API" is likely on the order of 1M:1 (a guess).

This is entirely wrong. Aside from the vast number of WordPress sites, the other APIs the article mentions are things like ActivityPub, oEmbed, and sitemaps. Add on things like Atom, RSS, JSON Feed, etc. and the majority of sites have some kind of alternative to HTML that is easier for crawlers to deal with. It’s nothing like 1M:1.

> Investing the effort to 1) recognize, without programmer intervention, that some random website has an API and then 2) automatically, without further programmer intervention, retrieve the website data from that API and make intelligent use of it, is just not worth it to them when retrieving the HTML just works every time.

You are treating this like it’s some kind of open-ended exercise where you have to write code to figure out APIs on the fly. This is not the case. This is just “Hey, is there a <link rel=https://api.w.org/> in the page? Pull from the WordPress API instead”. That gets you better quality content, more efficiently, for >40% of all sites just by implementing one API.

alsetmusic · 2025-12-15T12:23:51 1765801431

> Investing the effort to 1) recognize, without programmer intervention, that some random website has an API

Hrm…

>> Like most WordPress blogs, my site has an API.

I think WordPress is big enough to warrant the effort. The fact that AI companies are destroying the web isn't news. But they could certainly do it a with a little less jackass. I support this take.

danielheath · 2025-12-14T21:17:06 1765747026

Right - the scraper operators already have an implementation which can use the HTML; why would they waste programmers time writing an API client when the existing system already does what they need?

sdenton4 · 2025-12-14T20:32:24 1765744344

If only there were some convenient technology that could help us sort out these many small cases automatically...

Gud · 2025-12-14T20:39:12 1765744752

Then again, why bother?

junon · 2025-12-14T20:32:21 1765744341

1M:1 by the way, but I agree.

dlcarrier · 2025-12-14T19:59:43 1765742383

Not only is abandonment of the API possible, but hosts may restrict it on purpose, requiring paid access to use acessability/usability tools.

For example, Reddit encouraged those tools to use the API, then once it gained traction, they began charging exorbitant fees effectively blocking every blocking such tools.

culi · 2025-12-14T20:09:08 1765742948

That's a good point. Anyone who used the API properly were left with egg on their face and anyone who misused the site and just scraped HTML ended up unharmed

ryandrake · 2025-12-14T21:18:43 1765747123

Web developers in general have a horrible track record with many notable "rug pulls" and "lol the old API is deprecated, use the new one" behaviors. I'm not surprised that people don't trust APIs.

dolmen · 2025-12-14T21:34:01 1765748041

This isn't about people.

KK7NIL · 2025-12-14T23:08:21 1765753701

APIs are always about people, they're an implicit contract. This is also why API design is largely the only difficult part of software design (there are tough technical challenges too sometimes, but they are much easier to plan for and contain).

modeless · 2025-12-14T20:47:40 1765745260

I want AI to use the same interfaces humans use. If AIs use APIs designed specifically for them, then eventually in the future the human interface will become an afterthought. I don't want to live in a world where I have to use AI because there's no reasonable human interface to do anything anymore.

You know how you sometimes have to call a big company's customer support and try to convince some rep in India to press the right buttons on their screen to fix your issue, because they have a special UI you don't get to use? Imagine that, but it's an AI, and everything works that way.

sowbug · 2025-12-14T20:29:58 1765744198

I'm reminded of Larry Wall's advice that programs should be "strict in what they emit, and liberal in what they accept." Which, to the extent the world follows this philosophy, has caused no end of misery. Scrapers are just recognizing reality and being liberal in what they accept.

A1kmm · 2025-12-14T20:37:51 1765744671

I think it's Jon Postel who was the original source of the principle (it's often called Postel's Law). https://www.rfc-editor.org/rfc/rfc761#section-2.10 is an example dating back to 1980.

athenot · 2025-12-14T20:35:24 1765744524

This is Postel's Law, aka the Principle of Robustness:

    "be conservative in what you send, be liberal in what you accept"

https://en.wikipedia.org/wiki/Robustness_principle

llbbdd · 2025-12-14T19:23:12 1765740192

Yeah APIs exist because computers used to require very explicitly structured data, with LLMs a lot of the ambiguity of HTML disappears as far as a scraper is concerned.

swatcoder · 2025-12-14T20:07:24 1765742844

> LLMs a lot of the ambiguity of HTML disappears as far as a scraper is concerned

The more effective way to think about it is that "the ambiguity" silently gets blended into the data. It might disappear from superficial inspection, but it's not gone.

The LLM is essentially just doing educated guesswork without leaving a consistent or thorough audit trail. This is a fairly novel capability and there are times where this can be sufficient, so I don't mean to understate it.

But it's a different thing than making ambiguity "disappear" when it comes to systems that actually need true accuracy, specificity, and non-ambiguity.

Where it matters, there's no substitute for "very explicit structured data" and never really can be.

llbbdd · 2025-12-14T21:53:09 1765749189

Disappear might be an extremely strong word here, but yeah as you said as the delta closes between what a human user and an AI user are able to interpret from the same text, it becomes good enough for some nines of cases. Even if on paper it became mathematically "good enough" for high-risk cases like medical or government data structured data will still have a lot of value. I just think more and more structured data is going to be cleaned up from unstructured data except for those higher precision cases.

dmitrygr · 2025-12-14T19:43:49 1765741429

"computers used to require"

please do not write code. ever. Thinking like this is why people now think that 16GB RAM is to little and 4 cores is the minimum.

API -> ~200,000 cycles to get data, RAM O(size of data), precise result

HTML -> LLM -> ~30,000,000,000 cycles to get data, RAM O(size of LLM weights), results partially random and unpredictable

hartator · 2025-12-14T19:46:32 1765741592

If API doesn’t have the data you want, this point is moot.

dotancohen · 2025-12-14T19:52:13 1765741933

Not GP, but I disagree. I've written successful, robust web scrapers without LLMs for decades.

What do you think the E in perl stands for?

llbbdd · 2025-12-14T22:16:27 1765750587

This is probably just a parallel discussion. I written plenty of successful web scrapers without LLM's, but in the last couple years, I've written a lot more where I didn't need to look at the web markup for more than a few seconds first, if at all. Often you can just copy-paste an example page into the LLM and have it generate accurate, consistent selectors. It's not much different when integrating with a formal API, except that the API usually has more explicit usage rules, and APIs will also often restrict data that can very obviously be used competitively.

llbbdd · 2025-12-15T04:11:26 1765771886

Double-posting so I'm sorry but the more I read this the less it makes sense. The parent reply was talking about data that was straight-up not available via the API, how does perl help with that?

hartator · 2025-12-15T04:16:35 1765772195

Yeah, I don’t get it either. Someone trying AI to mass reply?

llbbdd · 2025-12-15T04:30:30 1765773030

Maybe Perl is more powerful than I've ever given it credit for.

llbbdd · 2025-12-14T21:50:29 1765749029

A lot of software engineering is recognizing the limitations of the domain that you're trying to work in, and adapting your tools to that environment, but thank you for your contribution to the discussion.

EDIT: I hemmed and hawed about responding to your attitude directly, but do you talk to people anywhere but here? Is this the attitude you would bring to normal people in your life?

Dick Van Dyke is 100 years old today. Do you think the embittered and embarrassing way you talk to strangers on the internet is positioning your health to enable you to live that long, or do you think the positive energy he brings to life has an effect? Will you readily die to support your animosity?

shadowgovt · 2025-12-14T20:19:55 1765743595

On the other hand, I already have an HTML parser, and your bespoke API would require a custom tool to access.

Multiply that by every site, and that approach does not scale. Parsing HTML scales.

swiftcoder · 2025-12-14T20:42:35 1765744955

You already have a JSON and XML parser too, and the website offers standardised APIs in both of those

shadowgovt · 2025-12-14T22:43:26 1765752206

Not standardized enough; I can't guarantee the format of an API is RESTful, I can't know apriori what the response format is (arbitrary servers on the internet can't be trusted to be setting content type headers properly) or How to crawl it given the response data, etc. we ultimately never solved the problem of universal self- describing APIs, so a general crawling service can't trust they work.

In contrast, I can always trust that whatever is returned to be consumed by the browser is in the format that is consumable by a browser, because if it isn't the site isn't a website. Html is pretty much the only format guaranteed to be working.

dmitrygr · 2025-12-14T20:25:54 1765743954

parsing html -> lazy but ok

using an llm to parse html -> please do not

llbbdd · 2025-12-14T22:07:31 1765750051

> Lazy but ok

You're absolutely welcome on your own free time to waste it on whatever feels right

> using an llm to parse html -> please do not

have you used any of these tools with a beginner's mindset in like, five years?

venturecruelty · 2025-12-14T19:50:57 1765741857

Weeping and gnashing of teeth because RAM is expensive, and then you learn that people buy 128 GB for their desktops so they can ask a chatbot how to scrape HTML. Amazing.

llbbdd · 2025-12-15T04:14:35 1765772075

The more I've thought about it the RAM part is hardly the craziest bit. Where the fuck do you even buy a computer with less than 4 cores in 2025? Pawn shop?

llbbdd · 2025-12-14T21:57:06 1765749426

isn't it ridiculous? This is hacker news. Nobody with the spare time to post here is living on the street. Buy some RAM or rent it. I can't believe honestly how many people on here I see bemoaning the fact that they haven't upgraded their laptops in 20 years and it's somehow anyone else's problem.

lechatonnoir · 2025-12-14T20:14:45 1765743285

it's kind of hard to tell what your position is here. should people not ask chatbots how to scrape html? should people not purchase RAM to run chatbots locally?

shadowgovt · 2025-12-14T23:00:41 1765753241

I may be out of the loop; is system RAM key for LLMs? I thought they were mostly graphics RAM constrained.

cr125rider · 2025-12-14T19:43:18 1765741398

Exactly. This parallels “the most accurate docs are the passing test cases”

btown · 2025-12-14T20:03:08 1765742588

I like to go a level beyond this and say: "Passing tests are fine and all, but the moment your tests mock or record-replay even the smallest bit of external data, the only accurate docs are your production error logs, or lack thereof."

handfuloflight · 2025-12-15T00:15:05 1765757705

Absolutely.

1718627440 · 2025-12-15T10:26:15 1765794375

This is something the XML ecosystem (which is now getting killed) actually got right and is the primary reason people don't want to have it killed.

echelon · 2025-12-14T19:36:28 1765740988

[flagged]

edent · 2025-12-14T19:45:04 1765741504

As I wrote:

> Like most WordPress blogs, my site has an API.

WordPress, for all its faults, powers a fair number of websites. The schema is identical across all of them.

gldrk · 2025-12-14T19:55:45 1765742145

If you decide to move your blog to another platform, are you going to maintain API compatibility?

sdenton4 · 2025-12-14T20:35:25 1765744525

Shouldn't the llm that all this scraping is powering be able to help figure out which websites have an API and how to use it?

_puk · 2025-12-14T20:48:22 1765745302

Is there a meta tag to point to the API / MCP?