OpenAI vs. DeepSeek (Part 2): Fair Use or Foul Play? The AI Training Controversy

Jan 30, 2025

In the ring of AI innovation, OpenAI goes head-to-head with DeepSeek. United by a common goal, divided by country and ideologies. Who will come out on top?

AI companies have long faced criticism for using copyrighted data without explicit permission, often scraping publicly available content to train their models. Now, we are witnessing a shift where AI firms are accusing each other of similar practices. This battle signals an era where AI-generated content and proprietary model outputs themselves become contested intellectual property.

The ongoing controversy between DeepSeek and OpenAI over alleged intellectual property (IP) violations is not just a straightforward case of one company potentially infringing upon another's rights. It also brings to the fore a complex and somewhat hypocritical reality: OpenAI, while now claiming IP infringement by DeepSeek, has itself faced similar accusations of using others' work under the banner of "fair use." This creates a complex ethical and legal landscape, where the concept of "fair use" is being stretched and challenged in the context of generative AI. This situation invites a deeper examination of the nuances of IP rights in the rapidly evolving AI domain, specifically regarding data extraction, model distillation, and the ethical dimensions of open-source contributions.

As detailed in the previous analysis, DeepSeek is accused of using output from OpenAI's models, potentially obtained through unauthorized means, to train their competing models. The argument against DeepSeek rests heavily on OpenAI's Terms of Use, which restrict the use of their services for developing competing models and prohibit the programmatic extraction of output data. However, the counterargument being raised, that OpenAI has previously utilized the work of others while claiming "fair use", cannot be dismissed so easily.

OpenAI’s Precedent: A History of Fair Use Claims

OpenAI’s rise to dominance hinges on its use of massive datasets, including copyrighted books, articles, and web content, scraped from the internet. The company has defended this practice by invoking fair use, a U.S. legal doctrine permitting limited use of copyrighted material without permission for purposes such as criticism, research, or education. Key points of contention:

Scale vs. Intent: While fair use traditionally applies to small excerpts or transformative works, OpenAI’s models ingest entire copyrighted works, raising questions about proportionality.
Commercial Benefit: OpenAI’s shift to a profit-driven model (via ChatGPT Plus and enterprise APIs) undermines claims of purely non-commercial, research-oriented data usage.
Author Backlash: Writers, artists, and publishers have sued OpenAI for unauthorized use of their work, arguing that the company profits from their labor without compensation.

This history weakens OpenAI’s moral authority in accusing DeepSeek of similar practices.

The Fair Use Argument for DeepSeek

With this backdrop, DeepSeek could argue that if OpenAI's actions were considered fair use, then their distillation of OpenAI model outputs for training should also be considered fair use. The key points in such an argument would revolve around:

Transformative Use: DeepSeek might argue that their use of OpenAI outputs is "transformative," meaning that the outputs were not simply replicated, but used to create a new model with unique capabilities. This is a key aspect of the "fair use" defense. Courts have upheld fair use in cases where derivative works add “new expression, meaning, or message” (e.g., Google v. Oracle).
Non-Commercial Use: If DeepSeek can demonstrate that their use of OpenAI data was for non-commercial purposes, this could also support a fair use defense. However, given that DeepSeek is a for profit company, this would be a difficult position to defend. Moreover, OpenAI’s ToS explicitly bars commercial and non-commercial competitors, complicating this defense.
Impact on the Market: If DeepSeek can argue that their use of OpenAI's data did not substantially harm the market for OpenAI’s services or unfairly compete with it, that would be another important factor in making a fair use case, though this would also be difficult given that since the launch of Deepseek V3, investors dumped tech stocks as they worried that the emergence of a low-cost Chinese artificial intelligence model would threaten the dominance of AI leaders like Nvidia, evaporating $593 billion of the chipmaker's market value, a record one-day loss for any company on Wall Street. This was followed by the app version overtaking Open AI’s ChatGPT in downloads from Appke’s app store.

The "fair use" defense, however, is far from straightforward. The legal criteria for fair use are complex and context-dependent, varying across jurisdictions. It's not just a question of claiming "fair use" but of proving it under legal scrutiny. DeepSeek is alleged to have used automated methods to extract data and potentially circumvented API restrictions, which weakens its fair use argument, as fair use is often associated with a legitimate or licensed use of data in the first place.

The core of the issue lies in the ambiguity surrounding what constitutes a legitimate use of AI-generated outputs for training purposes. While OpenAI’s Terms of Use are clear, the ethical landscape is far murkier. The following points add to the complexity of the IP situation:

The Nature of AI Output: Unlike traditional copyrighted works, the output of AI models is not always a direct copy of the training data. It is a probabilistic creation based on a vast dataset. This makes it harder to define clear ownership boundaries. The question of whether AI-generated outputs are copyrightable is subject to many theories. The U.S. Copyright Office has stated that works lacking human authorship are not protected, but this remains untested in cases involving distillation.
The Value of Training Data: The training data used to create large language models, including the outputs of these models, is extraordinarily valuable. The ability of models such as DeepSeek-V3 to perform well in math and code, as highlighted in the technical report4, suggests that this data is indeed a valuable asset. Additionally, if AI outputs are deemed non-copyrightable, companies might use them as a “clean” dataset to sidestep infringement claims, a loophole regulators have yet to address.
The Asymmetry in Power: The situation also exposes an asymmetry of power, where large corporations like OpenAI, with their vast resources, might be seen as being able to set the rules of the game, possibly with less regard for open-source ethics.

If smaller AI companies are restricted from using techniques like distillation due to legal concerns, it could create an environment that benefits only large corporations. Big AI firms already have advantages, such as access to vast amounts of training data and computing power. Without the ability to optimise models efficiently, startups and researchers may struggle to compete, leading to a concentration of AI power in the hands of a few. This could slow down innovation, reduce diversity in AI development, and make it harder for new players to enter the market. Clear legal guidelines and policies that promote fair access to training methods and data could help prevent this imbalance.

Conclusion

The controversy between DeepSeek and OpenAI is not only a matter of legal interpretation but also an ethical battleground. The accusation that DeepSeek "distilled knowledge" out of OpenAI models using unauthorized methods clashes with the reality of AI's dependence on vast data sets, which are also often based on data that is generated by others. The controversy challenges us to consider the scope and limitations of "fair use" in the context of advanced AI. While OpenAI has every right to protect its IP, their own reliance on the fair use principle in the past opens them to accusations of hypocrisy. Courts must clarify whether training AI models on copyrighted data or another model’s outputs qualifies as transformative, setting boundaries for permissible use. Transparency could be a safeguard: Mandating disclosure of training data sources might mitigate disputes and align open-source ideals with ethical practices. Finally, global collaboration is critical to harmonize IP laws across jurisdictions, preventing forum shopping and inconsistent rulings. The resolution of this conflict will not only impact these two companies but may also set precedents for others in the industry

While fair use is central to this dispute, it’s just one piece of a much larger puzzle. Catch up on Part 1: Did DeepSeek Copy OpenAI? Model Distillation, IP Theft, and the Battle Over AI's Training Data for context, and look ahead to Part 3: Open-Source AI, Licensing Loopholes, and the Future of AI Governance to explore how AI governance will evolve in light of these controversies.

Bits and Bars

Discussion about this post