Interestingly, I looked at github insights and found that this repo had 49 clone...

tonnydourado · 2026-01-27T14:02:06 1769522526

Particularly on GitHub, might not even be LLMs, just regular bots looking for committed secrets (AWS keypairs, passwords, etc.)

Phelinofist · 2026-01-27T13:54:00 1769522040

I selfhost Gitea. The instance is crawled by AI crawlers (checked the IPs). They never cloned, they just browse and take it directly from there.

Phelinofist · 2026-01-27T16:14:29 1769530469

For reference, this is how I do it in my Caddyfile:

   (block_ai) {
       @ai_bots {
           header_regexp User-Agent (?i)(anthropic-ai|ClaudeBot|Claude-Web|Claude-SearchBot|GPTBot|ChatGPT-User|Google-Extended|CCBot|PerplexityBot|ImagesiftBot)
       }

       abort @ai_bots
   }

Then, in a specific app block include it via

   import block_ai
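
A pattern like this can be sanity-checked against a few User-Agent strings before deploying it. The sketch below is only an illustration, not part of the Caddy config: the `is_blocked` helper and the sample User-Agent strings are made up, and Python's `re` module only approximates the Go regexp engine Caddy uses, although for a plain alternation like this one the two should agree.

```python
import re

# Same alternation as the header_regexp matcher above.
# (?i) makes the whole pattern case-insensitive.
AI_BOT_PATTERN = re.compile(
    r"(?i)(anthropic-ai|ClaudeBot|Claude-Web|Claude-SearchBot|GPTBot|"
    r"ChatGPT-User|Google-Extended|CCBot|PerplexityBot|ImagesiftBot)"
)


def is_blocked(user_agent: str) -> bool:
    """Return True if the User-Agent contains one of the listed bot names."""
    # search() matches anywhere in the string, like an unanchored regex.
    return AI_BOT_PATTERN.search(user_agent) is not None


if __name__ == "__main__":
    # Hypothetical sample strings; substitute lines from your own access logs.
    samples = [
        "Mozilla/5.0 (compatible; GPTBot/1.0)",
        "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0",
    ]
    for ua in samples:
        print(is_blocked(ua), ua)
```

This only catches crawlers that announce themselves; anything sending a browser-like User-Agent passes straight through.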

seba_dos1 · 2026-01-28T01:30:52 1769563852

Most of them pretend to be real users though and don't identify themselves with their user agent strings.

zaphar · 2026-01-27T18:38:56 1769539136

I have almost exactly this in my own caddyfile :-D The order of the items in the regex is a little different but mostly the same items. I just pulled them from my web access logs over time and update it every once in a while.

Zambyte · 2026-01-27T14:21:33 1769523693

i run a cgit server on an r720 in my apartment with my code on it and that puppy screams whenever sam wants his code

blocking openai ips did wonders for the ambient noise levels in my apartment. they're not the only ones obviously, but they're the only ones i had to block to stay sane
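
For anyone wanting to try the same thing, checking a client address against a list of blocked ranges can be sketched roughly like this. The ranges below are reserved documentation placeholders, not OpenAI's actual addresses, and `is_blocked_ip` is a hypothetical helper; real ranges would have to come from the crawler operator's published list or from your own logs.

```python
import ipaddress

# Placeholder ranges reserved for documentation (assumption: replace
# these with the real ranges you want to block).
BLOCKED_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/24", "2001:db8::/32")
]


def is_blocked_ip(addr: str) -> bool:
    """Return True if addr falls inside any blocked range."""
    ip = ipaddress.ip_address(addr)
    # Only compare addresses and networks of the same IP version.
    return any(
        ip.version == net.version and ip in net for net in BLOCKED_RANGES
    )


if __name__ == "__main__":
    for candidate in ("192.0.2.17", "203.0.113.5", "2001:db8::1"):
        print(candidate, is_blocked_ip(candidate))
```

In practice the same list would more likely go into a firewall or reverse proxy rule than into application code.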

MarsIronPI · 2026-01-27T14:54:18 1769525658

Have you considered putting it behind Anubis or an equivalent?

Zambyte · 2026-01-27T15:02:42 1769526162

Yes, but I haven't and would prefer not to

MarsIronPI · 2026-01-27T22:04:14 1769551454

Understandable. It's an outrage that we even have to consider such measures.

nerdponx · 2026-01-27T13:38:03 1769521083

Time to start including deliberate bugs. The correct version is in a private repository.

teiferer · 2026-01-27T14:47:41 1769525261

And what purpose would this serve, exactly?

adastra22 · 2026-01-27T15:39:33 1769528373

Spite.

below43 · 2026-01-27T20:07:50 1769544470

They used to do this with maps, e.g. fake islands, to pick up when they were copied.

program_whiz · 2026-01-27T15:06:54 1769526414

while I think this is a fun idea -- we are in such a dystopian timeline that I fear you will end up being prosecuted under a digital equivalent of various laws like "why did you attack the intruder instead of fleeing" or "you can't simply remove a squatter because it's your house, therefore you get an assault charge."

A kind of "they found this code, therefore you have a duty not to poison their model as they take it." Meanwhile if I scrape a website and discover data I'm not supposed to see (e.g. bank details being publicly visible) then I will go to jail for pointing it out. :(

nerdponx · 2026-01-27T19:32:41 1769542361

I think if we're at the point where posting deliberate mistakes to poison training data is considered a crime, we would be far far far down the path of authoritarian corporate regulatory capture, much farther than we are now (fortunately).

wredcoll · 2026-01-27T18:34:27 1769538867

Look, I get the fantasy of someday pulling out my musket^W ar15 and rushing downstairs to blow away my wife^W an evil intruder, but, like, we live in a society. And it has a lot of benefits, but it does mean you don't get to be "king of your castle" any more.

Living in a country with hundreds of millions of other civilians or a city with tens of thousands means compromising what you're allowed to do when it affects other people.

There's a reason we have attractive nuisance laws and you aren't allowed to put a slide on your yard that electrocutes anyone who touches it.

None of this, of course, applies to "poisoning" llms, that's whatever. But all your examples involved actual humans being attacked, not some database.

program_whiz · 2026-01-27T21:41:37 1769550097

Thanks, that was the term I was looking for: "attractive nuisance". I wouldn't be surprised if a tech company could make that case -- this user caused us tangible harm and cost (training, poisoned models) and left their data out for us to consume. It's the equivalent of putting poison candy on a park table, your honor!

teo_zero · 2026-01-27T23:02:31 1769554951

That reminds me of the protagonist of Charles Stross's novel "Accelerando", a prolific inventor who is accused by the IRS of having caused millions in losses because he releases all his ideas into the public domain instead of profiting from them and paying taxes on such profits.

0x696C6961 · 2026-01-27T14:05:46 1769522746

This was happening before LLMs too.

teiferer · 2026-01-27T14:40:58 1769524858

I don't really get why they need to clone in order to scrape ...?

> It feels weird to think that LLMs are being trained on my code, especially when I'm painfully aware of every corner I'm cutting.

That's very much expected. That's why the quality of LLM coding agents is like it is. (No offense.)

The "asking LLMs for advice" part is where the circular aspect starts to come into the picture. Not worse than looking at StackOverflow though which then links to other people who in turn turned to StackOverflow for advice.

storystarling · 2026-01-27T19:50:00 1769543400

Cloning gets you the raw text objects directly. If you scrape the web UI you're dealing with a lot of markup overhead that just burns compute during ingestion. For training data you usually want the structure to be as clean as possible from the start.
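
A toy illustration of that overhead: the sketch below compares a file as it would sit in a clone with a made-up web UI page for the same file. The HTML, the `TextExtractor` class and `text_from_html` are all hypothetical; real forge pages are far larger and messier than this.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of a page and drops all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def text_from_html(page: str) -> str:
    """Strip markup from page and return the remaining text."""
    parser = TextExtractor()
    parser.feed(page)
    parser.close()
    return "".join(parser.chunks)


# What a clone hands you: the file contents, nothing else.
raw_source = 'print("hello")\n'

# A made-up stand-in for the same file rendered by a web UI.
web_ui_page = (
    "<html><body><nav>repo / file.py</nav>"
    '<pre><span class="k">print</span>'
    '(<span class="s">&quot;hello&quot;</span>)\n</pre>'
    "</body></html>"
)

if __name__ == "__main__":
    scraped = text_from_html(web_ui_page)
    print(len(raw_source), len(web_ui_page))
    # Navigation text leaks into the scraped result alongside the code.
    print(repr(scraped))
```

Even in this tiny case the page is several times the size of the file, has to be parsed, and the extracted text still carries navigation residue that needs further cleaning.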

teiferer · 2026-01-28T06:46:11 1769582771

Sure, cloning a local copy. But why clone on github?

adastra22 · 2026-01-27T15:40:15 1769528415

The quality of LLM coding agents is pretty good now.