Reddit Inc. has sued Perplexity AI Inc. for “industrial-scale data laundering”, as AI firms increasingly scrape websites—often without permission—for original content to train their large language models.

Three data-scraping companies—OxyLabs UAB, AWMProxy and SerpApi—have been illegally collecting Reddit data via Google Search for the purposes of selling it, according to the complaint filed in a federal court in Manhattan, New York. Perplexity was buying the data from at least one of the companies.
“AI companies are locked in an arms race for quality human content—and that pressure has fuelled an industrial-scale data laundering economy,” Reddit’s Chief Legal Officer Ben Lee said in a statement. “Reddit is a prime target because it’s one of the largest and most dynamic collections of human conversation ever created.”
At this juncture, it’s pertinent to explain what “data laundering” is, how it works, and why it threatens the existence of original ideas amid the AI slop sloshing across the internet.
What is data laundering?
“Data laundering” is a term used to describe the process of disguising the origin or legitimacy of data, much like money laundering hides the source of illegally obtained money.
It typically involves taking unlawfully obtained, stolen or unethically sourced data, transforming and repackaging it, and then using or selling it as if it were legitimate.
Essentially, data laundering is “cleaning up dirty data to make it look legal and trustworthy”.
How data laundering works
Data laundering happens in stages similar to money laundering (see the illustrative sketch after this list):
1. Acquisition: The data is obtained through illegal or unethical means, such as hacking, scraping or unauthorised sharing.
2. Transformation: The data is processed, aggregated, anonymised or mixed with legitimate datasets to obscure its origin.
3. Distribution: The “cleaned” data is sold, licensed or used in AI models, analytics, or marketing as if it were lawfully sourced.
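To make those three stages concrete, here is a minimal, purely illustrative Python sketch of the pattern. Every record, field name and function in it is hypothetical, and it is not a description of what any company named in the complaint actually does; the point is simply how stripping provenance fields and mixing scraped rows with legitimately licensed ones leaves a dataset that looks “clean” downstream.

```python
# Purely illustrative sketch of the three stages described above.
# All records, field names and functions are hypothetical.

import hashlib

# 1. Acquisition: records scraped without permission still carry provenance
#    (source URL, author) that would reveal where they came from.
scraped = [
    {"text": "User post about sourdough starters",
     "source_url": "https://example-forum.com/t/123", "author": "user_a"},
    {"text": "User post about bike repair",
     "source_url": "https://example-forum.com/t/456", "author": "user_b"},
]

# Legitimately licensed records, used for mixing.
licensed = [
    {"text": "Licensed encyclopedia entry on fermentation",
     "source_url": "https://licensed-partner.example", "author": "editorial"},
]

# 2. Transformation: strip or hash the fields that identify the origin,
#    so scraped rows become indistinguishable from licensed ones.
def launder(record):
    return {
        "text": record["text"],
        # provenance replaced by an opaque identifier
        "doc_id": hashlib.sha256(record["source_url"].encode()).hexdigest()[:12],
        "label": "open-source",  # the relabelling step the article describes
    }

clean_dataset = [launder(r) for r in scraped + licensed]

# 3. Distribution: the mixed, de-identified dataset is packaged and passed
#    on as if it were lawfully sourced end to end.
for row in clean_dataset:
    print(row)
```

Once the provenance fields are gone, a downstream buyer has no easy way to tell which rows were scraped and which were licensed, which is exactly what makes the practice so hard to audit.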
How data laundering takes place
A company, such as OxyLabs UAB, AWMProxy or SerpApi in the ‘Reddit vs Perplexity AI’ case, scrapes copyrighted or personal data from the web, cleans it and labels it as open-source. This “clean” data is then merged with legitimate customer lists and sold as a verified dataset, in this case to Perplexity AI to train its large language models.
This “clean” data does not cite its source the way ChatGPT or Google Gemini do. It’s worth noting here that Perplexity AI is building a rival to Google Search rather than just a chatbot, hence the alleged need for anonymised data.
The illegality of data laundering
The practice of data laundering flies in the face of data privacy laws globally, such as the GDPR in Europe and the Digital Personal Data Protection Act in India. It is also an immediate breach of intellectual property rights if copyrighted material is used.
Still, AI firms are able to “launder” scraped content through layers of anonymisation and/or aggregation and then claim compliance with data laws.
Legal cases against data laundering
Reddit vs Perplexity: Reddit is seeking monetary damages from Perplexity AI and a court order to stop the alleged scraping and use of its data in violation of federal copyright law. Notably, the discussion forum has already inked deals with OpenAI and Google to license its data for training their LLMs.
Perplexity has no such arrangement, but its spokesperson Beejoli Shah said the firm “will always fight vigorously for users’ rights to freely and fairly access public knowledge”.
NYT vs OpenAI: The New York Times Co. sued OpenAI and Microsoft Corp. in December 2023, claiming that the Sam Altman-led company used large quantities of NYT content to train its LLMs without proper licensing.
While the complaint is framed in terms of copyright and data-use rights rather than “data laundering”, the underlying issue is similar — large-scale ingestion of publicly available data without a licence.
OpenAI has been similarly sued by Canadian publishers for scraping and collating data for LLMs without a licence in place.
ANI vs OpenAI: In November 2024, ANI sued OpenAI in a New Delhi court, accusing the ChatGPT creator of using its content without permission to help train the AI chatbot, something OpenAI said it has stopped doing. ANI has also accused OpenAI’s ChatGPT of attributing fabricated news stories to the publication.
Then there is the case of Stability AI, which managed to get hold of millions of copyrighted artworks without paying a cent. There is also Studio Ghibli, and how AI ripped off the style Hayao Miyazaki spent 40 years perfecting.
“Data laundering” is a symptom of weak governance, iffy sourcing and the pressure to gather as much data as quickly as possible. The way out is a mix of technical, legal and market fixes that enforce consent and accountability end to end. Until then, it’s you and me in this AI slop world.