Last week, GitHub Copilot, an AI code generation tool jointly launched by Microsoft, GitHub, and OpenAI, attracted enormous attention as soon as it was officially announced: what developer wouldn't want a "virtual programmer" like this to free up their hands?
So even though GitHub Copilot is still in an admittedly imperfect technical preview stage, many developers couldn't wait to try it out.
But once the testing began, a problem surfaced: why does the code generated by GitHub Copilot look so familiar, right down to the "original" comments? Is this plagiarism?
A real "Thor's hammer" of evidence
In fact, Microsoft addressed the question of GitHub Copilot directly copying code as early as the official announcement: "Only in about 0.1% of cases may the code suggestions provided by GitHub Copilot contain some verbatim characters or fragments from the training set."
But the "0.1% situation" Microsoft spoke of has already shown up.
A developer, @mitsuhiko, announced his findings on Twitter: when asked to generate a fast inverse square root function, GitHub Copilot produced code that was exactly the same as the "legendary code" in Quake III! (Note: the fast inverse square root algorithm, also known as the fast reciprocal square root algorithm, is famous for its appearance in the Quake III source code.)
This code is unmistakably copied: it not only contains the magic number 0x5f3759df from the fast inverse square root algorithm, which no one has fully explained to this day, but it even faithfully preserves the original Quake III developer's comment complaining about that very number.
In this way, the evidence that GitHub Copilot "plagiarizes code" is not just a real hammer but a veritable Thor's hammer: there is no explaining it away, and the copyright problems surrounding its generated code have only grown more serious.
Is GitHub Copilot a derivative work under the GPL?
GitHub Copilot's verbatim copy of the fast inverse square root algorithm exposes a contradiction: that code is open-sourced under the GNU GPL 2.0 license, while GitHub Copilot is slated to become a paid service.
(Note: The GNU GPL 2.0 license requires that any derivative work incorporating GPL-licensed code, even just a few lines, must make its full source code freely available along with the rights to modify and redistribute it.)
On this basis, a huge controversy has arisen: since this incident shows that GitHub Copilot must have used GPL-licensed code during training, should the works produced by the machine learning system, or even the machine learning system itself, be considered derivative works under the GPL?
If the answer is "no", does that mean developers can use GitHub Copilot to "launder" the GPL off their code and never have to comply with it again?
If the answer is "yes", then not only should GitHub Copilot itself be free and open source, but all of GitHub arguably should be: according to the GitHub blog, "during the early development of GitHub Copilot, as part of an internal trial, nearly 300 employees used it in their daily work." Those employees have likely integrated Copilot-generated code into every corner of GitHub, so GitHub itself would also have to become an open source project.
In response, Julia Reda, a long-time advocate on copyright issues and a strong promoter of open source and free software, wrote an article arguing firmly that GitHub Copilot does not infringe developers' copyrights.
She points out that no copyright permission is needed simply to read and process information. For example, if you pick up a book from a bookstore shelf and start reading, you are not infringing any copyright in the process; the same applies to the training of digital technologies such as artificial intelligence, which requires large amounts of content data.
As Julia Reda put it in the article: "There are indeed many conflicts between copyright and digital technology. Fortunately, policy makers and courts have long recognized that if every technical copy required a license, digital technology could not be developed or used at all."
As early as 2001, the European Union exempted such temporary copies made as part of a technical process from copyright restrictions, despite many objections at the time.
Later, in 2019, European copyright law was amended to explicitly permit so-called text and data mining, that is, the permanent storage of copyrighted works for automated analysis. In other words, under European copyright law it is legal to scrape GPL-licensed code, or any other copyrighted work, for this purpose, regardless of the license used.
Furthermore, Julia Reda argues that machine-generated code cannot be considered a derivative work:
First, it is unreasonable to argue that copying even the smallest excerpt of a copyrighted work constitutes infringement. The short code snippets GitHub Copilot reproduces from its training data are unlikely to meet the threshold of originality in the first place; and if such tiny fragments did count, wouldn't any two developers who happen to use the same code patterns in their respective programs generate endless disputes?
Second, copyright law applies only to intellectual creations: without a creator, there is no work. On this view, machine-generated code like GitHub Copilot's output does not qualify for copyright protection at all, and therefore cannot be a derivative work.
Amid the controversy, some developers even decided to quit GitHub
Despite Julia Reda's arguments, most developers aren't buying it. The copyright dispute over GitHub Copilot has left many people dissatisfied with GitHub, and some developers have even decided to leave the platform:
“I believe this is a serious violation of the rights of copyright holders, and therefore I cannot continue to rely on GitHub’s services.”
Other developers criticized GitHub Copilot for treating free code as raw material for a commercial AI application:
"GitHub Copilot, by their own admission, has been trained on a lot of GPL code, so I don't see why this isn't a form of turning open source code into a commercial work."