The New York Times warns AI search engine Perplexity to stop using its content

mock-possum · on Oct 15, 2024

I don’t understand - why is the NYT making content publicly available, and then complaining about how the public consumes it? If they don’t want people looking at it then why are they putting it somewhere to be looked at? Why single out AI, or this company, in particular?

staticman2 · on Oct 15, 2024

Your framing is confused. A license to read a web site is not a license to publish or paraphrase the contents on your own web site.

blackeyeblitzar · on Oct 15, 2024

They need Google to crawl their site to get traffic and views. They can’t push around Google or extract rent from them. Perplexity is a smaller company that needs content like the NYT more than the NYT needs them. And they are easier to shake down. Plus AI tools tend to not bring traffic to their underlying sources. If you get your answer up front, why visit the site and deal with their ads?

So I view it as partially a way for them to make money and partially a way for them to reasonably defend their content. But I don’t know what the NYT or others will do when Google copies other AI products like Perplexity. They won’t be able to say ‘no’.

Axiverse · on Oct 15, 2024

Google has a separate user-agent token in robots.txt for AI products, which can be blocked separately from the search crawler. Many sites block the AI crawler while allowing the Google search crawler. AI use of content is controlled with the Google-Extended user-agent.

See https://developers.google.com/search/docs/crawling-indexing/...

In fact NYT does block google from using its content for GenAI products:

User-agent: Google-Extended
Disallow: /

As well as ChatGPT:

User-agent: GPTBot
Disallow: /

So idk why this would be unexpected.

See https://www.nytimes.com/robots.txt
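
For anyone who wants to see how rules like these are evaluated, here is a minimal sketch using Python's standard urllib.robotparser. The two blocking groups are the ones quoted in this comment; the trailing "User-agent: *" group, the helper name allowed_agents, and the example article path are assumptions added for illustration and are not taken from the real file.

```python
from urllib.robotparser import RobotFileParser

# The first two groups are the rules quoted in the comment. The final "*"
# group is an ASSUMPTION for illustration, not copied from the real
# robots.txt.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed_agents(robots_txt, agents, url):
    """Return {agent: bool} saying whether each agent may fetch url.

    Hypothetical helper: it parses the text in memory instead of
    downloading a live robots.txt.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


# The path below is a made-up example URL on the site.
result = allowed_agents(
    ROBOTS_TXT,
    ["Googlebot", "Google-Extended", "GPTBot"],
    "https://www.nytimes.com/section/technology",
)
print(result)
```

With this sample text, the search crawler falls through to the wildcard group and is allowed, while the two AI tokens match their own groups and are refused. Note that robots.txt is advisory: the check only matters if the crawler actually runs it before fetching.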

jatins · on Oct 15, 2024

they explicitly prohibit Perplexity in their robots.txt and one of the rules of being a "good citizen" of the web is that you respect robots.txt