
>It’s not copyright violation to train ML on content.

I agree. It'd be a nice gesture to reach out to the creators of the training data, like is usual with web scrapers. But collecting and analyzing data publicly available on the web is ok.

>So the license doesn’t matter unless there’s some “can’t use this for ML training” license that I don’t know about (and doesn’t seem to be legal).

I disagree. While Copilot is, at heart, an ML model, the copyright trouble comes from its usage. It consumes copyrighted code (ok), analyzes copyrighted code (still ok), and then produces code which sometimes is a copy of copyrighted code (not ok). The only way it'd be ok is if Copilot followed all licensing requirements when it produced copies of other works.

Personally, I won't touch it for work until either Copilot abides by the licenses or there's robust case law.



> It'd be a nice gesture to reach out to the creators of the training data, like is usual with web scrapers.

I don’t think this is practical. And who notifies people of scraping content? I would’ve been annoyed if I got spam from sites that scraped my content.


I've contacted websites about scraping when it'd be a repeat thing and they didn't have a robots.txt file available. I've also done it when their stance on enforcing copyright was hazy (e.g. medical coding created by a non-profit). Sometimes, they pointed me toward an API I didn't know about.
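
For anyone wondering what checking robots.txt looks like in practice, here's a rough sketch using Python's standard `urllib.robotparser`. The rules, the bot name `example-bot`, the helper `may_fetch`, and the example.com URLs are all made up for illustration. It parses the rules from a string so it runs without touching the network; a real scraper would fetch the site's actual file first.

```python
from urllib.robotparser import RobotFileParser


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for illustration only: everything under /private/ is off limits.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(may_fetch(EXAMPLE_ROBOTS, "example-bot", "https://example.com/codes/list"))
print(may_fetch(EXAMPLE_ROBOTS, "example-bot", "https://example.com/private/data"))
```

If the site has no robots.txt at all, there's nothing to parse, which is the case where I'd email them instead.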

>I don’t think this is practical.

I don't like people ignoring things just because they're impractical for ML. That leads to crap like automated account banning without the possibility of talking to a living customer service representative.



