Open source developers need a new kind of license with an ML-model-training clause, so there's no ambiguity when they don't want their code used this way.
People have been suggesting this ever since Copilot was announced, and it doesn't work on any level. They're training on all code on GitHub, including repositories with no license at all, which you otherwise can't use for any purpose. Their reasoning is that training is fair use, and fair use supersedes any license or copyright claim in the US.
They only claimed that training the model was fair use. What about its output? I'd argue the output is still subject to the copyright of its inputs, the same way a compiler's output is subject to the copyright of its inputs.
That doesn't work: your suggestion applies at too late a stage in the flowchart. It looks like:
1. Do you need a license to use materials for training, or to use the output model?
2. If so, does the code's license allow this?
GitHub is claiming 'no' for #1: they do not need any sort of license for the training materials. This is reasonably standard in ML; it's also how GPT-3 and similar models were trained.
Now, whether a court will agree with their interpretation is an interesting question, but if they are correct then #2 doesn't come into play.
If the answer to #1 is 'no', then the GPL might as well not exist, because anyone can launder GPL code through Copilot and close it off, which is a rather distorted interpretation of "fair use" if you ask me.
"Dear copilot, I'm writing a Unix-like operating system...."
I made new licenses [1] [2] that attempt this. The problem with adding a clause against ML training is that training is (supposedly) fair use. My licenses concede that point but claim that the output of such models is still covered by the original copyright and license.
I hope that even if this wouldn't hold up, it puts enough doubt in companies' minds that they won't want to use a model trained on code under these licenses.