The question is ultimately going to come down to - "Is Copilot the same as a human programmer reading a lot of GPL code and rehashing, in a non-infringing way, the algorithms, functions, and designs used in a lot of FOSS? Or is Copilot performing more of a copy and paste pastiche of code that is protected by intellectual property law?"
On a tangential note, I always find the discussions surrounding FOSS licenses and copyright rather amusing in a sad way. There's a certain kind of entitlement a lot of people feel towards FOSS that they certainly do not express towards proprietary software, and I imagine this is a great source of the resentment and burn-out FOSS maintainers feel.
>> "Is Copilot the same as a human programmer reading a lot of GPL code and rehashing, in a non-infringing way, the algorithms, functions, and designs used in a lot of FOSS? Or is Copilot performing more of a copy and paste pastiche of code that is protected by intellectual property law?"
IANAL, but isn't the concept of "derived data" pretty standard? You don't need to copy data for it to be infringing. I've tackled derived-data clauses regularly when negotiating data contracts at work, and there is always verbiage and discussion around it (e.g., are we allowed to publish an average of the purchased data?).
An average is not related to the artistic aspects of the data and so can't be a derivative in the copyright sense (based on international law - one of the principal conventions is fully titled the "Berne Convention for the Protection of Literary and Artistic Works"; that's because copyright protects literary and artistic works).
Provided you have rights to access a body of statistics, then copyright has nothing to say -- save overreaching national caselaw (!) -- on your derivation of mathematical, technical, or scientific data from that work.
But a contractual clause, in general, doesn't care about copyright; if you've contracted not to derive data from a work then that's orthogonal to copyright.
IANA(IP)L, this is my opinion and unrelated to my employment.
> The question is ultimately going to come down to - "Is Copilot the same as a human programmer reading a lot of GPL code and rehashing, in a non-infringing way, the algorithms, functions, and designs used in a lot of FOSS? Or is Copilot performing more of a copy and paste pastiche of code that is protected by intellectual property law?"
Of course it isn't the same as a human programmer doing anything. It's a complex piece of software, which we happen to misuse the term "AI" to describe, but it is not intelligent.
Precisely. I don't see a difference between what Google indexes on its search engine and what CoPilot can recommend. Google has been, and still gets, slapped on the wrist when it doesn't respond to takedown requests. That mechanism seems to be missing from CoPilot currently, which will open them up to a number of lawsuits in the future if it continues to operate as it does.
Except Google isn’t creating anything new, Copilot is. I’ve had it kick out some very interesting short stories based on Sherlock Holmes, so if I published those would they infringe?
Google returned nothing which contained exact matches of some of the more interesting dialogue. It would be a serious find - worthy of a paper - to disprove that GPT-3 is generating novel text/code.
It's easy to prove that it sometimes regurgitates text verbatim — just play with it for a while. Having certainty that any given span of text/code is novel is extraordinarily difficult.
Each search results page is novel and unique. Even two different users making the same query will get different results thanks to the "search personalization" Google is doing these days.
Google search doesn't synthesize anything. It collects results and orders them according to an algorithm. Copilot and similar language models can synthesize new text. That's clearly different than just presenting existing text.
Copilot can't create novel concepts. In the end it is just a complex mathematical formula that returns a set of code references, with lengths determined by the math.
The illusion of creativity is similar to that of technology. Sufficiently advanced technology is indistinguishable from magic, and sufficiently advanced math is indistinguishable from intelligence. The relation between AI and math is the same as the relation between magic and technology.
All of the large language models can emit text that has never been seen in the training set (unless you go so far to consider each character to be a snippet).
> Infringement isn't about how the infringing system works, it's about the product of that work.
Exactly this. It makes zero difference that you produced your infringing work with the help of a program that happens to be extremely complex and marketed as "AI".
But that's the whole point of copyright. The same piece of code you copy from a Google search can legally be used by you if your developer came up with it, and not if Oracle came up with it. Where you copied it from is the entire point.
of course it's not intelligent, but we still have to decide how the law applies to the actions of software, or otherwise re-frame the whole thing to include the co-pilot developers doing the copyright infringement when they trained the model - not the current discussion which gives agency to the IDE plug-in "choosing" a code snippet to paste.
I don't see how the way the software was built is particularly relevant.
It's just a tool used by the developer; the onus is on the developer to ensure they don't infringe the licenses of the source code they incorporate in their software. Since Copilot makes it impossible to know where it's barfing code up from and what license that code is under, a developer who cares about not getting sued probably needs to avoid using Copilot.
Eh, the law is about making a copy. If my IDE plug-in fills in some code, the question is, did I copy the code, did the robot copy the code, or did the developers that wrote "cp github.db ~/trainingset" copy the code?
The authors of the tool created something that can be used for copyright infringement.
The tool itself lacks agency, it did what it was programmed to do.
If you took the tool's suggestions and proceeded to publish a derivative work, you may have infringed.
This really doesn't feel any different from P2P filesharing services. Rightsholders have targeted tool publishers in the past, because they are the largest single target and not anonymous; but ultimately the infringement is performed by the end user.
This isn't complicated at all. You copied the code, which isn't an issue until you then go on to do something which infringes the license (e.g. publish under a different license, publish binaries without publishing source, publish without attribution, whatever it is that the license requires).
A law that uses the archaic terms "copy and paste", referring to a time when people would make an analog photocopy of a document written using a typewriter, trim it out with scissors or a knife, and glue it to their book with the pasty remains from boiling animal collagen cannot be trusted to apply word-for-word in a time when technology has obsoleted the glue, typewriter, xerox machine, and even the paper.
It is not the same as a human, no, but it's not hard to choose a definition of the word "intelligent" that can accurately describe something that can be done by a program.
When a human walks around a puddle, are they demonstrating intelligence? When a horse avoids stepping in a hole, is the horse intelligent? When a robotic vacuum avoids a stairway, is it intelligent? When a self-driving car avoids a bollard, is that intelligent?
Whether or not there's a being inside the device that believes it experiences consciousness, the same outcome happens. Searle's Chinese Room producing copies of Chinese IP, a trained monkey doing so, or a human doing the same thing - the outcome is very similar.
Perhaps it's a little bit like employing a human programmer with an eidetic memory who occasionally remembers entire largish functions.
If he were able to remember a large enough piece of copyrighted code, and reused it, then it still wouldn't be fair use, even if he changed a variable name here or there, or the license message.
Yeah, that's definitely the impression I get from the few Copilot examples I've seen. I've not personally used Copilot so I refrained from making absolute statements about its behavior in my top comment.
But I think the conclusion most people are settling on is that it's definitely infringing.
A possible response that I'd predict from GitHub would be to attribute much/all of the responsibility to the user.
The argument would be along the lines of: you as the user are the one who asked the eidetic programmer (nice terminology, @bencollier49) to produce code for your project; all we did is make the programmer available to you.
Does GitHub own the code generated by GitHub Copilot?
GitHub Copilot is a tool, like a compiler or a pen. GitHub does not own the suggestions GitHub Copilot generates. The code you write with GitHub Copilot’s help belongs to you, and you are responsible for it. We recommend that you carefully test, review, and vet the code before pushing it to production, as you would with any code you write that incorporates material you did not independently originate.
Does GitHub Copilot recite code from the training set?
The vast majority of the code that GitHub Copilot suggests has never been seen before. Our latest internal research shows that about 1% of the time, a suggestion may contain some code snippets longer than ~150 characters that match the training set. Previous research showed that many of these cases happen when GitHub Copilot is unable to glean sufficient context from the code you are writing, or when there is a common, perhaps even universal, solution to the problem.
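For what it's worth, the kind of check that FAQ answer describes can be sketched in a few lines. This is purely a hypothetical illustration - the function name, the sliding-window approach, and treating the ~150-character figure as a threshold parameter are all my own; GitHub's actual filter is not public:

```python
def has_long_overlap(suggestion: str, corpus_docs: list[str],
                     threshold: int = 150) -> bool:
    """Return True if any run of `threshold` consecutive characters
    from the suggestion appears verbatim in a corpus document."""
    if len(suggestion) < threshold:
        return False
    # Every window of `threshold` characters from the suggestion.
    windows = {suggestion[i:i + threshold]
               for i in range(len(suggestion) - threshold + 1)}
    # Naive O(n*m) scan; a real system would index the corpus instead.
    return any(w in doc for doc in corpus_docs for w in windows)
```

A real implementation would of course use a suffix index or n-gram hashing over billions of lines rather than a linear scan, but the measurement being reported ("how often does a suggestion share a long verbatim span with the training set") is this simple in principle.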
I've used Copilot for months and honestly it's become one of my favorite inventions in all of programming - and this is key - even when it screws up (such as by suggesting Ruby-syntax code to autocomplete Elixir code). It tickles the "childlike joy" funnybone in me, the same one that got me into programming to begin with. I don't know how long it will take for typing "#ANSI yellow" (for example) and autocompleting to the right codes to get old, or every time it autocompletes anything considered "boilerplate," but it hasn't, yet!
You know, pretty much all of programming can be summed up as "tedious labor elimination," and this tool directs that same labor elimination at the work of programming itself (I no longer have to constantly google syntax idiosyncrasies etc.), and NOW coders are pissed? I don't get it. Eat your own dog food, people, because this is what it looks and tastes like.
As to the copyright infringement or licensing-violation claims, I have yet to see it autocomplete an entire algorithm correctly, or one copied verbatim from somewhere, although that could be mitigated. You still have to pay attention (kind of like Tesla autopilot), it's not going to eliminate your job.
No one is complaining about copilot making programming easier or automating it.
We're upset because it's quite literally infringing on intellectual property. Infringing on intellectual property that's been set aside for the exclusive use of the commons.
god bless AI for moving human society beyond silly notions like ideas-as-property
copyright was established to increase the innovation and creative will of the arts and sciences, what could increase that creative force more than an AI assistant who has seen every creative work ever made?
Except that is not what is happening here. The problem is that code which was provided to the commons under the explicit condition that anything built with it is also released under the same terms is now being fed to a magic mystery machine to produce code that can supposedly legally be withheld from the commons. The only code this affects is code that was already shared - you won't see Microsoft feeding Windows and Office source code into Copilot anytime soon.
Do you use it? Have you ever used it? How many people making negative comments about it here have actually used it? I don't actually believe many have. I suggest at least trying it out before lighting your torches.
If it infringes everyone equally and everyone equally benefits from the infringement, has a net wrong actually occurred? (which of course begs the "do the ends justify the means" question...)
I don't see how this is any different a form of "infringement" than me copying and pasting snippets of other people's code and then modifying it to suit my particular context, without specific attribution - except that the latter is a much more laborious and time-consuming process than copilot autocomplete, and programming is all about tedium elimination.
> If it infringes everyone equally and everyone equally benefits from the infringement, has a net wrong actually occurred?
It’s not done equally though. Copyleft code is extremely likely to be on GitHub somewhere, while internal proprietary code is often not. Copilot will thus have been trained more on the former than the latter.
> I don't see how this is any different a form of "infringement" than me copying and pasting snippets of other peoples' code, and then modifying it to suit my particular context, without specific attribution
It’s no different, but that is also copyright infringement.
so basically all of Stackoverflow is copyright infringement and has been for decades? Find me the programmer who has never either 1) copied and pasted directly from the internet, or 2) taken an idea found on the internet and massaged it for their own purposes. I mean... this is basically why programming is so lucrative IMHO. Everyone is piggybacking off of everyone else's work (at least in open source)
The tens of thousands of developers in a company I am familiar with have taken a basic training on intellectual property concepts and software licenses.
A typical case mentioned in the training is that code from StackOverflow is (probably) licensed under CC-BY-SA 4.0, and as such can never be copied into their proprietary-licensed code base.
(Recent) StackOverflow contributions are licensed under CC BY-SA 4.0 by default (though the author can of course release it under any additional licenses they choose): https://stackoverflow.com/help/licensing
If the code is really sufficiently trivial (and I’d guess that most code samples you’ll find on StackOverflow are) you may have a fair use argument in the US. Generally speaking though (and especially for anything nontrivial) you need to respect the license. CC BY-SA 4.0 is one-way compatible with GPLv3, though, so that helps if you’re including it in a GPLv3 codebase: https://creativecommons.org/2015/10/08/cc-by-sa-4-0-now-one-...
Even apart from the copyright aspect, it would be nice if we as programmers improved our attitude towards attribution. If researchers can cite the work that has influenced theirs without legal threats, then so can we.
Github explicitly leaves out proprietary code bases, including the Microsoft Windows source code (Microsoft owns Github and uses it for their own products).
If Microsoft included their own source code when training copilot then at least they would be intellectually honest, but they don't. They only consider GPL and other free and open source code to be up for grabs.
This kind of reminds me of when someone reverse engineers a piece of software to document interfaces, protocols or APIs for the purpose of writing compatible software. Then a second person not involved in the RE process implements compatible software from the documentation the first person wrote.
This is to avoid any contamination and verbatim copies of code. Once you have read a piece of code there is a risk of "contamination" and you will be influenced by it. It does not matter if you directly copy it, write it out from memory or use an AI to regurgitate it. It will be a copy of the code. To me this is very clear.
This sounds like “taint” in the M&A space. I’ve very limited experience of it and would be interested in hearing more from the better informed folks on this topic!
My limited experience: my then-employer opted not to acquire a company after doing due diligence. Ultimately we decided that the price of acquisition (both paid out, and also incurred in internal time) exceeded the cost of building a comparable product ourselves.
As the dev who did the tech portion of the due diligence I was now “tainted” by my knowledge of their system. As a result I could not work directly on the effort to build our own comparable solution.
A human who types out the fast inverse square root algorithm line by line won't be exempt from copyright/license infringement just because they remembered it off the top of their head. However, using the same concepts is likely to be fine outside silly jurisdictions where software patents are a thing.
The difference is that AI isn't able to grasp concepts, it's only capable of rehashing patterns. If it is able to understand concepts then it should be shut down and researched immediately, because it's either close to gaining consciousness or already has done so.
The core of copilot is a file or a block of memory laying out a bunch of floating-point numbers that get processed and turned into code. This arrangement of floats is derived from source code, with licenses and copyright notices.
I don't think it's any different from turning code into a compiled program. Any developer will understand that a compiled version of GPL code is a derived work and subject to the GPL license. Why would a compiler that turns code into floats be any different? Sure, those floats get mixed up with the floats from other source code, but linking to GPL'd code does something very similar and is also covered by the license.
It's possible to consider copilot similar to hashing: a SHA hash of a binary isn't subject to the binary's license, that'd be silly. However, hashes are inherently one-way, and copilot isn't.
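To make the one-way point concrete, here is a minimal sketch (the code string is a made-up stand-in for some licensed work):

```python
import hashlib

# A stand-in for some licensed source file.
source = b"float Q_rsqrt(float number) { /* ... */ }"

# SHA-256 maps the input to a fixed-size, 64-hex-character digest.
digest = hashlib.sha256(source).hexdigest()
print(digest)

# Nothing of the original text can be reconstructed from the digest:
# it is a one-way fingerprint. A language model's weights are also
# "just numbers" derived from the input, but unlike a hash they can be
# prompted to emit training text back out, sometimes verbatim - which
# is exactly the asymmetry the comment above points at.
```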
A question I'd like to ask Microsoft is "if I steal the Windows source code and train an AI on it, can that AI be freely distributed and used for Wine/ReactOS/etc?" If Microsoft sticks to the stance that AI isn't subject to the licenses on software then a leaked source AI should be fine, but if they want to protect their intellectual property then they will send cease and desist letters to anyone even thinking about using such an AI model for code completion. My expectation is that Microsoft will act against such an AI.
Regardless, the fact that Github did not ask permission or provide an opt out before training started is a huge middle finger to all open source developers. Even if they can get away with this stuff legally, this approach has surely offended many open source developers who want big tech companies to abide by their code licenses. I don't do much open source work myself but I've been offended by the whole process from the day copilot rolled out and I don't believe I'm alone in this.
> A human who will type out the fast inverse square root algorithm line by line won't be exempt from copyright/license infringement just because he remembered it from the top of their head.
A human would probably try to defend against a copyright infringement suit over that by arguing something like the following.
There isn't sufficient creative expression in fast inverse square root (FISR) to be copyrightable. There is plenty of creativity in that thing, but it is in things that are not copyrightable such as the underlying mathematics that it is using. Copyright covers expression of ideas, not use of ideas (that's patents) or the ideas themselves.
The expression in FISR that they probably are copying from is pretty much all just in choosing the names of variables, and most implementations I've seen just use pretty normal names that follow normal naming conventions that people use when they aren't putting any thought into naming their variables.
That level of expression is arguably not creative enough to support copyright, at least in the US after Feist Publications, Inc., v. Rural Telephone Service Co., 499 U.S. 340 (1991) [1].
(I'm assuming that the human didn't do anything stupid, like reproduce the comments too).
I think FISR is one of the few algorithms that I would actually consider creative enough to meet the creativity requirement. It's counterintuitive math that the vast majority of programmers would never be able to come up with. It's an elegant bit-twiddling algorithm that requires one or two blog posts to truly understand; it's not something you read and think "oh, that makes sense, moving on".
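For reference, the trick under discussion, transliterated to Python - `struct`-based bit reinterpretation stands in for the C pointer cast in the original (this is a sketch for illustration, not the verbatim Quake III Arena source):

```python
import struct

def fast_inverse_sqrt(x: float) -> float:
    # Reinterpret the 32-bit float's bits as an unsigned integer
    # (the C original does this with a pointer cast).
    i = struct.unpack('<I', struct.pack('<f', x))[0]
    i = 0x5F3759DF - (i >> 1)            # the famous magic constant
    # Reinterpret the integer's bits back as a float.
    y = struct.unpack('<f', struct.pack('<I', i))[0]
    # One Newton-Raphson refinement step.
    return y * (1.5 - 0.5 * x * y * y)

print(fast_inverse_sqrt(4.0))  # roughly 0.5
```

The counterintuitive part is the integer subtraction: shifting the float's bit pattern right by one approximately halves its exponent, and the magic constant corrects the bias, yielding a surprisingly good first guess at 1/sqrt(x) before any "real" math runs.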
Algorithms for generic mathematical operations such as the dot product or matrix multiplication are often trivial to deduce, though optimized vectorized versions perhaps less so. Most helper functions are unoriginal enough that no reasonable copyright law would protect them, which is also the case for (too) many instances of patented code.
The copyright question does ignore the code license question, though. If a complicated algorithm like FISR is not original enough, then what protects any boring old operating system code? What stands in the way of publicly hosting Microsoft's leaked sources, if the code is all quite trivial? There is very little in an operating system that other operating system developers haven't thought of, or wouldn't reasonably have come up with had they been constrained by the same restrictions.
The variable names are one thing, though they could have been chosen much more descriptively. However, the system also output the comment "// what the fuck?", which is not only terribly nondescriptive, it's also something the system couldn't have come up with had it actually learned from the code rather than memorized it.
The suit you linked is about the difference between information and creativity. However, the case surrounds a data set, something simply factual, rather than a composed piece of information such as code or a book. Code listed on Github is not similar to the listings in a phone book. If they were, all software copyright, proprietary or otherwise, goes down the drain. I think that's impractical to say the least.
FISR could have been patented (and would be in the public domain by now anyway), but only its specific implementation in Quake III is covered by copyright.
Also, your argument follows a composition fallacy: emergent properties exist, and thus you cannot simply say that because each individual piece of a whole is trivial, the whole is trivial. Heck, software pretty much by definition goes against that. For relevant precedent, there is no shortage of information that becomes classified when in aggregate. Knowing where a certain piece of infrastructure is isn't likely classified, but knowing where all the strategically important pieces of infrastructure are certainly is.
Which is why the question isn't whether the users of Copilot are infringing someone's GPL (they'd likely have a solid defense based on the individual piece not being sufficient to hold copyright protection), it's whether Copilot itself constitutes a derivative work of its input data, which it consumed as whole (copyrighted) works.
That's a philosophical question that nobody can know a definitive answer to.
Personally I'd say the difference is understanding why a certain pattern works rather than blindly inserting whatever works. It's the classic Chinese Room thought experiment.
Just reading certain code is enough to taint a human programmer though. Some companies have policies against hiring developers with experience on some OSS projects because they have their own clean room implementation they want to protect.
On your tangential note: I always assumed many in the FLOSS side are actually against most cases of copyright as applied to software, but since it is the regulatory standard, they put a strong emphasis on making it work for their purposes, thus the somewhat ironic “copyleft”.
It’s a “don’t hate the player hate the game” situation for them
This is definitely the orthodox take. If shared source code was the norm and software wasn't subject to copyright (or really if either of those two conditions were met), there'd be no need for FOSS as an ideology. The purpose of copyleft is to ensure that there's a permanent bulwark against code meant for the commons being co-opted by proprietary software vendors and having changes walled off from the community who created the software in the first place.
Source code is essential to FOSS, a public domain binary-only copy of Microsoft Windows definitely would not be FOSS. This is the second item of the open source definition.
Sure, that is a useful condition and a no-brainer to add if you need to leverage copyright anyway.
But would it be enough to spur the open source movement on its own if you could legally decompile all binaries and redistribute that? Probably not.
It's not like source vs. binary is a clear distinction - between code obfuscation, generated code, transpilation, etc., there is a lot of wiggle room as to what should or should not be OK.
The GPL makes it a pretty clear distinction: "preferred form for modification", decided on a case-by-case basis. Obfuscated code is not source, generated code is not source, transpilation is often not source but could be depending on how you use it afterwards, bitmap images are often not source but they can be, executables are usually not source but could be, videos are not source but could be. Some links discussing what source is here:
People on the FLOSS side are for software freedom, copyleft is just one of the tools we can use within the current regulatory framework of copyright. If copyright ever went away, we would have to use different tools but would have different opportunities too.
Those people these days are a vanishing minority. This is not the early-00s anymore.
The reality is that, nowadays, the overwhelming majority of developers touch FOSS code every day and just assume they're entitled to use it as they see fit. The folks who came up with "copyleft", or who care about licenses, are very much not in the driving seat. Blame FAANGs and their hatred of the GPL.
I think the problem goes a bit deeper than that. From an IP perspective, I think it's reasonable to consider that training an AI on some form of work is using said work to build a new one, just like it would be if it was manually copied in or reproduced.
The problem is that, iirc, GPL didn't consider this at all and still uses language focused on copying code, so something like copilot might slip through the cracks of those definitions.
Then again, the license uses this language when it allows usage of the code in the first place, so one could say that either a) this usage is covered by the license, in which case all conditions apply, or b) it is not covered by the license, in which case... github wouldn't be allowed to use the code at all.
To give an analogy: I think feeding code into an AI is essentially analogous to compiling the code. A machine turns it into something more usable and the original human-written content isn't part of the result anymore, but the intellectual property gets dragged through the process nonetheless. Why would it be any different just because the mechanism of transforming the code into executable software gets a bit more complicated through the usage of AI?
> Is Copilot the same as a human programmer reading a lot of GPL code and rehashing, in a non-infringing way, the algorithms, functions, and designs used in a lot of FOSS?
It literally can't do it "in a non-infringing way", as it wasn't made to do it that way.
People were able to get copy-pasted code verbatim. That means it does not know whether what it produces infringes the GPL or not.
Let's say you find a human who never knew anything about copyright, you show him a bunch of Disney movies and ask him to make you a movie, and he literally copies one of their movies. Does that make it non-infringing? (The funny thing is, even people aware of copyright infringe it... so yeah, it's hard to say whether even a machine could make only non-infringing content.)
The solution would be to at least make him aware of copyright and have him work with that, but first, is that even possible, and second, would it even be enough...
Sadly nothing will ever be done, at least not until we feed it Disney movies and it starts to affect their bottom line.
> On a tangential note, I always find the discussions surrounding FOSS licenses and copyright rather amusing in a sad way. There's a certain kind of entitlement a lot of people feel towards FOSS that they certainly do not express towards proprietary software and I imagine this a great source of the resentment and burn-out FOSS maintainers feel.
Definitely. Many of my acquaintances complaining about Github Copilot without trying it themselves regularly pirate movies, shows and music. They also always cheer if there is some court ruling against Facebook or Google, no matter what the actual case is even about.
> The question is ultimately going to come down to - "Is Copilot the same as a human programmer reading a lot of GPL code and rehashing, in a non-infringing way, the algorithms, functions, and designs used in a lot of FOSS? Or is Copilot performing more of a copy and paste pastiche of code that is protected by intellectual property law?"
It seems to me that the regurgitation only happens if you post the first half of the code, expecting the second half. I imagine that the software sees how several hundred repositories (which are all forks) have a very similar pattern and tells you the best fitting approximation of how they continue, which is again very similar.
In the future I can definitely see Github updating their license and some kind of exodus by FOSSers towards GitLab. But I believe that many open source projects will just put up with it, similar to how Youtubers and Twitch streamers want to stay on the premier platform.