New report: 60% of OpenAI model's responses contain plagiarism
A new report from plagiarism detector Copyleaks found that 60% of OpenAI's GPT-3.5 outputs contained some form of plagiarism.
Why it matters: Content creators from authors and songwriters to The New York Times are arguing in court that generative AI trained on copyrighted material ends up spitting out exact copies.
- Copyleaks is an AI-based text analysis company that began selling plagiarism-detection tools to businesses and schools long before ChatGPT's arrival.
- GPT-3.5 was the model powering ChatGPT when it debuted, but OpenAI has since moved on to the bigger and more capable GPT-4.
Between the lines: Plagiarism takes many forms beyond simply cutting and pasting full sentences and paragraphs.
- Copyleaks attempts to turn detecting plagiarism from "I know it when I see it" into an exact science.
- The company uses a proprietary scoring method that aggregates the rates of identical text, minor changes, paraphrased text and other factors, then assigns content a "similarity score."
- Per the report, for GPT-3.5, "45.7% of all outputs contained identical text, 27.4% contained minor changes, and 46.5% had paraphrased text."
- "A score of 0% signifies that all of the content is original, whereas a score of 100% means that none of the content is original," per the report.
Zoom in: Copyleaks asked GPT-3.5 for around a thousand outputs, each around 400 words, across 26 subjects.
- The individual GPT-3.5 output with the highest similarity score was in computer science (100%), followed by physics (92%) and psychology (88%).
- The lowest similarity scores appeared in theater (0.9%), humanities (2.8%) and English language (5.4%).
Yes, but: "Our models were designed and trained to learn concepts in order to help them solve new problems," OpenAI spokesperson Lindsey Held wrote in a statement to Axios. "We have measures in place to limit inadvertent memorization, and our terms of use prohibit the intentional use of our models to regurgitate content."
The intrigue: The New York Times lawsuit against Microsoft and OpenAI claims that their AI systems' "widescale copying" constitutes copyright infringement.
- OpenAI responded to the lawsuit by arguing that "regurgitation" is a "rare bug" and by accusing The New York Times of "manipulating prompts."
Editor's note: This story has been updated with comment from OpenAI.