Is ChatGPT Use Of Web Content Fair?


Some people are concerned about how ChatGPT uses web content for training and learning.


  • There is a way to prevent your content from being used to train large language models like ChatGPT.
  • According to an intellectual property law specialist, technology has outpaced the ability of copyright law to keep up.
  • A search marketing expert wonders if it is fair for AI to use Internet content without permission.

Large Language Models (LLMs) like ChatGPT rely on vast amounts of information, including web content, to train their algorithms. That content can then be summarized into answers and articles, without attribution or financial benefit to the original publishers.

Search engines such as Google and Bing crawl and index website content in order to provide answers in the form of links back to the original websites. Importantly, website publishers can opt out of having their content crawled and indexed through the Robots Exclusion Protocol, commonly referred to as Robots.txt.

The Robots Exclusion Protocol is not an official Internet standard, but it is widely recognized and respected by legitimate web crawlers. The question, then, is whether web publishers should also be able to use Robots.txt to prevent large language models like ChatGPT from using their website content.
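As a sketch of how that existing opt-out works for search engines, a publisher can place rules like these in the robots.txt file at the site root (Googlebot is Google's real crawler name; the paths here are purely illustrative):

```
# Keep Googlebot out of an illustrative private section of the site
User-agent: Googlebot
Disallow: /private/

# Let every other well-behaved crawler fetch everything
User-agent: *
Disallow:
```

Compliant crawlers fetch this file before crawling and skip any path a matching Disallow rule covers.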

The argument for allowing publishers to opt out is that it prevents their content from being used without attribution or compensation. On the other hand, allowing LLMs access to that content can lead to better information dissemination and the creation of new knowledge.

It is important to note that while Robots.txt is widely used, it is not a foolproof way to keep content out of LLM training data, since there are other ways to access the content. Additionally, an LLM may use copyrighted content without permission, which can lead to legal issues.


In conclusion, there is no clear answer to whether web publishers should be able to use the Robots.txt protocol to prevent large language models from using their website content. The issue is complex and requires a delicate balance between protecting the rights of content creators and promoting access to information for the benefit of society. It is essential that this issue is discussed and evaluated within the framework of copyright laws, data protection, and privacy regulations.

Large Language Models Use Website Content Without Attribution

Some search marketers are concerned about how website data is used to train machines without providing anything in return, such as acknowledgment or traffic.

Hans Petter Blindheim (LinkedIn profile), Senior Expert at Curamando, shared his opinions with me.

Hans commented:

“When an author writes something after having learned something from an article on your site, they will more often than not link to your original work because it offers credibility and as a professional courtesy.

It’s called a citation.

But the scale at which ChatGPT assimilates content and does not grant anything back differentiates it from both Google and people.

A website is generally created with a business directive in mind.

Google helps people find the content, providing traffic, which has a mutual benefit to it.

But it’s not like large language models asked your permission to use your content, they just use it in a broader sense than what was expected when your content was published.

And if the AI language models do not offer value in return – why should publishers allow them to crawl and use the content?

Does their use of your content meet the standards of fair use?

When ChatGPT and Google’s own ML/AI models trains on your content without permission, spins what it learns there and uses that while keeping people away from your websites – shouldn’t the industry and also lawmakers try to take back control over the Internet by forcing them to transition to an “opt-in” model?”

The concerns that Hans expresses are certainly valid and worth considering in more depth. As technology continues to evolve at a rapid pace, it raises the question of whether laws concerning fair use should be reconsidered and updated to keep pace with these developments.

To gain a deeper understanding of the issue, I reached out to John Rizvi, a Registered Patent Attorney who is board certified in Intellectual Property Law. I asked him if Internet copyright laws are outdated, and if so, what changes he thinks should be made to address this.

John Rizvi explained that the current laws surrounding fair use and copyright on the internet have not kept up with the rapid pace of technological advancement. The laws were created in an age where the internet was still in its infancy and did not account for the way that technology has changed the way we consume and share information.

He believes that the laws should be updated to better reflect the current state of technology and how it is used. This could include changes to how copyright is enforced online and how fair use is interpreted in the digital age. Additionally, he suggests that laws should be put in place to clearly define the responsibilities of technology companies and internet service providers when it comes to copyright infringement.

It is clear that the laws concerning fair use and copyright on the internet are in need of reconsideration and updating to address the current technological landscape. It is important to consider the implications of any changes made to these laws, as they have the potential to greatly impact the way we consume and share information online.


John answered:

“Yes, without a doubt.

One major bone of contention in cases like this is the fact that the law inevitably evolves far more slowly than technology does.

In the 1800s, this maybe didn’t matter so much because advances were relatively slow and so legal machinery was more or less tooled to match.

Today, however, runaway technological advances have far outstripped the ability of the law to keep up.

There are simply too many advances and too many moving parts for the law to keep up.

As it is currently constituted and administered, largely by people who are hardly experts in the areas of technology we’re discussing here, the law is poorly equipped or structured to keep pace with technology…and we must consider that this isn’t an entirely bad thing.

So, in one regard, yes, Intellectual Property law does need to evolve if it even purports, let alone hopes, to keep pace with technological advances.

The primary problem is striking a balance between keeping up with the ways various forms of tech can be used while holding back from blatant overreach or outright censorship for political gain cloaked in benevolent intentions.

The law also has to take care not to legislate against possible uses of tech so broadly as to strangle any potential benefit that may derive from them.

You could easily run afoul of the First Amendment and any number of settled cases that circumscribe how, why, and to what degree intellectual property can be used and by whom.

And attempting to envision every conceivable usage of technology years or decades before the framework exists to make it viable or even possible would be an exceedingly dangerous fool’s errand.

In situations like this, the law really cannot help but be reactive to how technology is used…not necessarily how it was intended.

That’s not likely to change anytime soon, unless we hit a massive and unanticipated tech plateau that allows the law time to catch up to current events.”

It becomes increasingly clear that the issue of copyright laws in relation to AI training is a complex and multi-faceted one, with various considerations that must be weighed and balanced. There is no easy solution to this problem as it involves various legal, ethical, and technological aspects that need to be taken into account. It requires a nuanced and thoughtful approach, taking into account the impact on the creators, publishers, and the AI industry.

The challenge is to strike the right balance between protecting the rights of copyright holders and promoting the development of AI technology. It’s a delicate equilibrium that will require the input of legal experts, policymakers, and industry leaders to navigate successfully. It is not a task that can be underestimated or simplified.

OpenAI and Microsoft Sued

The recent case involving OpenAI and Microsoft's Copilot product highlights the complex nature of copyright law in AI development. Using open source code, which is freely available, has become common practice in the industry. But open source is not without obligations: many open source licenses carry an attribution requirement, meaning the creators of the code must be acknowledged and credited for their work.

This case raises important questions about the use of open source code in the development of AI products and the responsibility of companies to adhere to copyright laws. It highlights the need for clear guidelines and regulations on the use of open source code in the AI industry to ensure that creators are properly credited and protected. The case also serves as a reminder of the importance of due diligence and compliance in the use of open source code in the development of AI products.

According to an article published in a scholarly journal:

“Plaintiffs allege that OpenAI and GitHub assembled and distributed a commercial product called Copilot to create generative code using publicly accessible code originally made available under various “open source”-style licenses, many of which include an attribution requirement.

As GitHub states, ‘…[t]rained on billions of lines of code, GitHub Copilot turns natural language prompts into coding suggestions across dozens of languages.’

The resulting product allegedly omitted any credit to the original creators.”

The author of that article, a legal expert on copyrights, stated that many people see open source Creative Commons licenses as a “free-for-all.”

Some may consider the phrase “free-for-all” to be an accurate description of the datasets comprised of Internet content that is scraped and used to generate AI products such as ChatGPT.

Background on LLMs and Datasets

Large language models, such as ChatGPT, are trained on a wide variety of data sets that contain various types of content. These datasets can include emails, books, government data, Wikipedia articles, and even datasets created from websites that are linked in posts on Reddit with at least three upvotes.

One of the most widely used sources of data for training language models is the Common Crawl dataset. This dataset is created by a non-profit organization of the same name, and it is widely used as a starting point for creating other datasets. The Common Crawl dataset is a vast collection of web pages that have been crawled from across the Internet, and it is available for free download and use.

It’s worth noting that the Common Crawl dataset is not only used for training large language models, but also for other AI applications such as information retrieval, sentiment analysis, and natural language processing. Furthermore, the dataset is not only used by OpenAI, but also by other organizations, researchers, and academics.

One example of a model trained on a filtered version of the Common Crawl dataset is GPT-3. The filtered version was created to keep only the web pages most relevant to training, and it helped GPT-3 generate human-like text that is often difficult to distinguish from writing by a person.

In short, the Common Crawl dataset is a foundational resource for training large language models and a common starting point for derivative datasets.
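Common Crawl publishes a public CDX index API at index.commoncrawl.org, so a publisher can check whether pages from their domain already appear in a crawl. The following Python sketch builds such a query; the crawl label CC-MAIN-2023-06 is just one example of the many published crawls, and the actual lookup function makes a network call only when invoked:

```python
import json
import urllib.parse
import urllib.request

CC_INDEX_HOST = "https://index.commoncrawl.org"

def cc_index_query_url(domain, crawl="CC-MAIN-2023-06"):
    """Build a query URL for the Common Crawl CDX index API.

    The crawl label is an example; the current list of crawls is
    published at index.commoncrawl.org.
    """
    params = urllib.parse.urlencode({
        "url": f"{domain}/*",  # match every captured page on the domain
        "output": "json",
        "limit": "5",
    })
    return f"{CC_INDEX_HOST}/{crawl}-index?{params}"

def pages_in_common_crawl(domain, crawl="CC-MAIN-2023-06"):
    """Fetch up to five index records for the domain (network call)."""
    with urllib.request.urlopen(cc_index_query_url(domain, crawl)) as resp:
        return [json.loads(line) for line in resp.read().splitlines()]
```

Each returned JSON record identifies a captured URL and the crawl archive file in which its content is stored.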

Here is how the GPT-3 researchers described their use of the website data contained in the Common Crawl dataset:

“Datasets for language models have rapidly expanded, culminating in the Common Crawl dataset… constituting nearly a trillion words.

This size of dataset is sufficient to train our largest models without ever updating on the same sequence twice.

However, we have found that unfiltered or lightly filtered versions of Common Crawl tend to have lower quality than more curated datasets.

Therefore, we took 3 steps to improve the average quality of our datasets:

(1) we downloaded and filtered a version of CommonCrawl based on similarity to a range of high-quality reference corpora,

(2) we performed fuzzy deduplication at the document level, within and across datasets, to prevent redundancy and preserve the integrity of our held-out validation set as an accurate measure of overfitting, and

(3) we also added known high-quality reference corpora to the training mix to augment CommonCrawl and increase its diversity.”

The Common Crawl dataset is also the source of Google's C4 dataset (Colossal Clean Crawled Corpus), which was used to create the Text-to-Text Transfer Transformer (T5).

According to their research paper (PDF: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer):

“Before presenting the results from our large-scale empirical study, we review the necessary background topics required to understand our results, including the Transformer model architecture and the downstream tasks we evaluate on.

We also introduce our approach for treating every problem as a text-to-text task and describe our “Colossal Clean Crawled Corpus” (C4), the Common Crawl-based data set we created as a source of unlabeled text data.

We refer to our model and framework as the ‘Text-to-Text Transfer Transformer’ (T5).”

Google published an article on their AI blog that goes into greater detail about how Common Crawl data (content scraped from the Internet) was used to create C4.

They wrote:

“An important ingredient for transfer learning is the unlabeled dataset used for pre-training.

To accurately measure the effect of scaling up the amount of pre-training, one needs a dataset that is not only high quality and diverse, but also massive.

Existing pre-training datasets don’t meet all three of these criteria — for example, text from Wikipedia is high quality, but uniform in style and relatively small for our purposes, while the Common Crawl web scrapes are enormous and highly diverse, but fairly low quality.

To satisfy these requirements, we developed the Colossal Clean Crawled Corpus (C4), a cleaned version of Common Crawl that is two orders of magnitude larger than Wikipedia.

Our cleaning process involved deduplication, discarding incomplete sentences, and removing offensive or noisy content.

This filtering led to better results on downstream tasks, while the additional size allowed the model size to increase without overfitting during pre-training.”

Google, OpenAI, and even Oracle (with its Open Data initiative) use Internet content, including your content, to create datasets that are then used to build AI applications like ChatGPT.

Common Crawl Can Be Blocked

Blocking Common Crawl is an option for website owners who want to opt out of the datasets that are based on it. However, if the site has already been crawled, that data is already in existing datasets and cannot be removed.

The Robots.txt protocol can block future crawls by Common Crawl, but it cannot prevent researchers from using content that is already in the dataset. Note that blocking Common Crawl also keeps your content out of derivative datasets such as C4 and Open Data going forward.

In summary, while it is possible to block Common Crawl, it will not remove content that has already been indexed and will only prevent future crawls.

How to Block Common Crawl From Your Website

Within the limitations discussed above, blocking Common Crawl is possible using the Robots.txt protocol.

CCBot is the name of the Common Crawl bot.

It identifies itself with the User-Agent string CCBot/2.0, the most recent version.

Blocking CCBot with Robots.txt is the same as blocking any other bot.

Here’s how to use Robots.txt to stop CCBot.

User-agent: CCBot
Disallow: /

CCBot crawls from Amazon AWS IP addresses.

CCBot also adheres to the Robots nofollow meta tag:

<meta name="robots" content="nofollow">

What If You’re Not Blocking Common Crawl?

Web content can be downloaded without permission; that is simply how browsers work, fetching content for every page they visit. Neither Google nor anyone else needs permission to download and use content that is publicly available.

Website Publishers Have Limited Options

The question of whether it is ethical to train AI on web content does not appear to be part of any discussion about the ethics of AI technology development.

It appears to be assumed that Internet content can be downloaded, summarised, and converted into a product called ChatGPT.

Is that reasonable? The answer is complicated.
