Can Artificial Intelligence Infringe Copyrights?

What happens when Microsoft grabs all GitHub repositories and uses them in its new paid plugin?

Otto, the Pilot (image under fair use).

Back in 2018, when Microsoft acquired GitHub, people started wondering whether it was plotting something, and many developers left the platform. But the majority, mostly because of the functionality GitHub offers and the network of people coding there, adopted a “wait and see” posture.

Then, when Microsoft/GitHub announced the release of GitHub Copilot, its new artificial intelligence plugin for MS Visual Studio, things became clearer. The GitHub community discovered, astonished, that it was one more trap set by Microsoft: get all the code from the platform for free, copyrighted or not, use it to train Copilot, its new artificial intelligence tool, and sell it to software development companies as a way of increasing programmers’ productivity and, why not, laying off the redundant ones. Microsoft’s explanation? Fair use, for humanity’s sake. But perhaps there is a much more nefarious strategy behind this situation.

What is the GitHub Copilot plugin?

Microsoft/GitHub Copilot is a machine learning model based on OpenAI’s GPT-3 natural language model. Such models are mere probabilistic, mathematical models: you start typing a line of code, a string or some instructions about what you want, and it tries to find in its data set what would most probably fit your query. It then copies or combines code that appears most often, or that belongs to higher-quality parts of the data set (or both). That is why GitHub’s website asks users to be as clear and detailed as possible in their prompts. The data fed to the computer must be “discrete, explicit and determinate”1; otherwise, it will not be able to process it. Hubert L. Dreyfus wrote this almost fifty years ago, and still today computers from one of the biggest technology corporations in the world need such detail to perform simple programming tasks.
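To picture the workflow described above, here is a hypothetical prompt-to-completion example (the function, its name and the suggested body are invented for illustration; they are not taken from GitHub’s documentation): the developer writes a detailed, explicit signature and docstring, and the plugin proposes a statistically likely body assembled from patterns in its training set.

```python
# Hypothetical example of prompt-driven completion (all names invented).
from datetime import date

# The developer types a detailed, explicit prompt...
def days_between(start_date: str, end_date: str) -> int:
    """Return the number of days between two ISO dates, e.g. '2021-07-01'."""
    # ...and the plugin suggests a statistically likely body, assembled from
    # patterns seen in its training data, not reasoned from first principles.
    start = date.fromisoformat(start_date)
    end = date.fromisoformat(end_date)
    return (end - start).days
```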

That Copilot repeats training data is confirmed by a paper released on Copilot’s website noting that, in a test of Copilot on Python code, it produced one “recitation” about every ten weeks of use, which GitHub acknowledged. Well, “recitation” is nothing more than a euphemism for outright plagiarism. Worse, they recognize that it filled in functions correctly only 43% of the time on the first try, and that it can suggest insecure code, offensive outputs and “... personal data verbatim from the training set. [...] email addresses, phone numbers, access keys, etc.”

If such personal data belongs to real people, what we have here is not only a copyright issue but also a data privacy one, because this data shouldn’t be openly available in a commercial plugin.

Never ask an AI program for the truth (Pinocchio, public domain picture from Wikimedia Commons).

Robert Dale, a professor of computing who tested GPT-3 (without Copilot), found that its outputs lacked “... semantic coherence, resulting in text that is gibberish and increasingly nonsensical as the output grows longer”, and that they “... may correspond to assertions that are not consonant with the truth.” That is why he concludes that “... question-answering or advice-giving systems, where it’s important that the resulting answer be true, are a risk too far.”

Bryan Lunduke, former Deputy Editor of Linux Journal, wrote that Copilot is “the worst thing to ever happen to computer programming.” Jeremy Howard, a programmer who tested Copilot, pointed out the risk of cognitive and anchoring biases: users may come to over-rely on Copilot and follow the path it sets. Because of the program’s high error rate and the temptation to simply accept suggested code without checking it, beginners risk “... learning less, learning slower, increasing technical debt, and introducing subtle bugs—are all things that you might well not notice, particularly for newer developers.”

Another tester took two hours to build an app that, done manually, wouldn’t have taken more than 15 minutes, because of the need to check every suggestion made by Copilot to ensure the expected outcome. But what if beginners, or coders less acquainted with the programming language, don’t know what that outcome should be?

Microsoft/GitHub, perhaps trying to dissipate fears of copyright violations, claims that Copilot is a code synthesizer and not a search engine. This is a half-truth because, as MIT professor Armando Solar-Lezama teaches in an introductory lesson on program synthesis, the element of search is exactly the distinguishing feature between a compiler and a synthesizer: “a synthesizer is generally understood to involve a search for the program that satisfies the stated requirements.” So, yes, Copilot is a synthesizer that, by its very nature, has a search element. Indeed, nowhere on Copilot’s website is it said that the plugin will create code, only that it will synthesize it. And synthesis is nothing more than the combination of elements into a whole. Copilot does not create code; it searches and combines snippets according to the prompt it is given (a toy sketch of this search-and-combine idea follows the quote below). That is why an Oxford professor and an IBM employee, after having tested GPT-3, wrote that:

Public Domain picture by George HodanCopilot’s great new technology

“GPT-3 is an extraordinary piece of technology, but as intelligent, conscious, smart, aware, perceptive, insightful, sensitive and sensible (etc.) as an old typewriter [...]. Hollywood-like AI can be found only in movies, like zombies and vampires.”
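To make the “search and combine” point concrete, here is a deliberately crude sketch of synthesis as search. It is illustrative only: Copilot’s real internals are a large neural network whose weights encode the training code statistically, not a literal snippet database, and every name below is invented.

```python
# A toy model of "synthesis as search" (illustrative only; every name here
# is invented, and Copilot's real internals are a neural network, not a
# literal snippet database).
SNIPPETS = {
    "reverse a string": "def reverse(s):\n    return s[::-1]",
    "read a file into a string": (
        "def read_file(path):\n"
        "    with open(path) as f:\n"
        "        return f.read()"
    ),
}

def synthesize(prompt: str) -> str:
    """Search the 'training set' for the snippet whose description shares
    the most words with the prompt: a caricature of probabilistic completion."""
    words = set(prompt.lower().split())
    best = max(SNIPPETS, key=lambda key: len(words & set(key.split())))
    return SNIPPETS[best]

print(synthesize("reverse a string in Python"))  # prints the stored snippet
```

The point of the caricature: nothing is created; the output is whatever stored material best matches the query.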

Therefore, with its probabilistic model, Copilot will, after some time, give preference to “reciting” higher-quality code, which will more likely belong to companies or professional developers, but without any acknowledgment of its authors or licenses. Such code can be under a copyright notice that forbids all use, or under some more or less restrictive license. As a consequence, plugin users will have no idea of the license terms of the code “spat out” by Copilot (or should it be “co-parrot”?) and risk, e.g., employing code under a more restrictive license, like the GPL, in a project released under the MIT license, which the GPL forbids unless the whole project is relicensed under the GPL.
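As a hypothetical illustration of that risk (the project, file and snippet below are invented, not a real case), the conflict would look like this: a verbatim suggestion silently carries GPL-covered code into a repository whose LICENSE file declares the MIT license.

```python
# my_mit_project/utils.py -- this project's LICENSE file declares MIT terms.
# (Hypothetical scenario: project, file and code are invented.)

# Suggested verbatim by the plugin; in this scenario the function originally
# appeared in a GPL-licensed repository.  Distributing it under MIT terms,
# without relicensing the whole project under the GPL, violates the GPL.
def rolling_average(values, window):
    """Return the rolling averages of `values` over a fixed window size."""
    return [
        sum(values[i : i + window]) / window
        for i in range(len(values) - window + 1)
    ]
```

Nothing in the editor would warn the user: the suggestion arrives with no author, no license header and no provenance.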

We have, then, indications from the machine learning model (GPT-3), from Copilot’s website and from testers that it repeats code. One could say that it does so at a very low rate. Indeed, but it still repeats someone’s code, and most likely high-quality code, where the author’s creative element is most embedded. This would result in copyright infringement. Could a fair use claim fit such a case?

Fair Use: A Tool to Break Free From the Rich, Not to Make Them Richer

Fair use was included in the U.S. Copyright Act as a way to prevent copyright abuse. With it, if a creator put unjustified barriers on the use of his work, e.g. demanding an abusive price or abusive licensing conditions, one could use the work without the creator’s authorization, but exclusively for purposes such as criticism, comment, teaching, scholarship and research.

A recent fair use case was decided in April 2021, in the legal battle between Google and Oracle, where Google had copied about 0.4% of Oracle’s Java APIs to make it easier for Android developers to create their apps. The U.S. Supreme Court recognized fair use in Google’s favor. Android and the APIs were free, as were most apps. Now it seems that Microsoft is trying to use this decision in its favor, but with a slight difference: while Google took a user interface (the APIs) from a big corporation and made it available for free to app developers, Microsoft is taking free code to make it available in a paid plugin, which, as we saw, deters creativity and possibly worsens the quality of the output code.

The Act establishes the following four guiding factors2 for determining whether we have a fair use case:

The purpose and character of the use

That is, is the original code available for a fee? And the code of the one claiming fair use? According to their website, the GitHub Copilot plugin will have a commercial version using the entirety of the free code as its data set. It is not known whether there will be a free version. Moreover, they state that the generated output code belongs to the user, which means the possibility of closing the source and even charging a fee for it. None of this is in conformity with the U.S. Supreme Court’s understanding: “every commercial use of copyrighted material is presumptively an unfair exploitation of the monopoly privilege that belongs to the owner of the copyright.”

Moreover, does it add any creative/transformative value? “The use must be productive and must employ the quoted matter in a different manner or for a different purpose from the original. A quotation of copyrighted material that merely repackages [...] the original is unlikely to pass the test.”3

The nature of the copyrighted work

Computer programs are considered equivalent to written works. As such, code is protected in its entirety. That is, even “... a shopping list, or a loanshark’s note on a debtor’s door saying ‘Pay me by Friday or I’ll break your goddamn arms’ are all protected by the copyright.”4

The amount and substantiality of the portion of the copyrighted work used

Microsoft appropriated the entirety of the code hosted on GitHub. Moreover, there is the substantiality factor, the qualitative aspect of the work.5 Given Copilot’s already mentioned risk of cognitive and anchoring biases, the more one relies upon it, the higher the possibility of multiple copyright offenses, especially when it recites high-quality code from some creator.

Its effect upon the potential market or value of the copyrighted work

Will it encroach on the gains the creator would have made with his work? Copilot will be generating thousands of programs that do not observe the licenses of the code from which they were created. The original author can be put at a disadvantage by the newer, copied code if his core lines are used in a project under a less restrictive license.

And Outside the U.S.?

As this is a worldwide issue, national rules on jurisdiction and intellectual property may deal with it differently. As a general rule, most IP regulations are based on the Berne Convention, the TRIPS Agreement and the WIPO Copyright Treaty, through the so-called “three-step test.” It “restricts the ability of states to introduce, and maintain, exceptions to the exclusive rights of authors and other right-holders. [...] exceptions are only permitted (1) in certain special cases; (2) which do not result in a conflict with the normal exploitation of a work and (3) which do not unreasonably prejudice the legitimate interests of the author (or other right-holder).”6 Here we see that, unlike in the American view of fair use, it is not the market or value that counts the most, but the interests of the author.

A Pirate Hub?

So far we have considered a rather optimistic scenario, that of the inexperienced user or the programmer pressed by a project deadline. But as Copilot will be a paid plugin, it will be used mainly by companies, and some of them may willfully put the synthesized code under full copyright, without any license, and with closed source code. Under such circumstances, checking whether generated code breaks license terms, or is even an outright act of piracy, will be still more difficult.

All this also presupposes that the content uploaded by GitHub’s millions of users doesn’t itself infringe copyright, which certainly isn’t always true. As Microsoft grabbed all of it, Copilot will also “help” its users with illegal code.

MS Pirate

The issues and facts exposed here lead us to conclude that neither Microsoft nor Copilot’s users are covered by fair use. Firstly, Microsoft will be charging users for the use of free code, and it is deceiving potential users when it asserts that Copilot’s output code belongs to them, without making any reference to the liability risks related to copyright claims. In addition, the GPT-3-based Copilot is not, as we saw, a code creator. It is a code synthesizer, combining and copying the code of GitHub’s millions of users, at times verbatim, especially when instructions are not detailed, which eliminates the novelty element of a creation. It behaves more like a parrot, unintelligently repeating what it has heard, than a copilot.

Secondly, they are facilitating and stimulating users to employ code without observing its licenses, and the new copyright terms of the synthesized code may prey on creators’ interests and markets.

Finally, there is no learning or research experience: it is all about reducing costs and time and increasing productivity, which certainly has nothing to do with spending time to learn new code or to develop skills. This is exactly the opposite of fair use.

Those considering using Copilot would be well counseled to avoid it. If a creator finds non-trivial lines of code of his authorship in code originating from Copilot, used without due respect for the license terms he chose, the user may be liable for license infringement.

Trying to destroy open source?

To summarize, it seems Microsoft has a strategy that goes beyond making money with this plugin. As we saw, Copilot still seems to be in beta. Why such a hurry? Is this just desperation for money? Or are they trying to create turmoil in the open source community? In fact, the situation may push many companies with content hosted on GitHub (or even elsewhere) to close the source of their code, as a means to avoid new issues with open repositories.

With such a step, Microsoft seems to be trying to create distrust of open source collaboration and to eliminate its (and Silicon Valley’s) biggest rival in the near future. Not merely a commercial rival, but a political one, because knowledge is intrinsically related to power. When free knowledge begins to spread throughout society, the ones who hold power today see their positions menaced. The attack on open source is an attack on freedom, an attempt to possess the monopoly of creation.