Google trains AI search tools on publisher content despite opt-outs

Nellius Irene
03/05/2025 05:40:00

Google is facing fresh scrutiny after a senior executive testified that the company’s search-specific artificial intelligence (AI) products, such as AI Overviews, are trained on publisher content, even when those publishers have explicitly opted out of AI training.

Eli Collins, Vice President at Google DeepMind, acknowledged in federal court on Friday that while publishers can prevent their content from being used to train AI models developed by DeepMind, such opt-outs do not apply to Google’s broader search organization.

“Once you take the Gemini [AI model] and put it inside the search org, the search org has the ability to train on the data that publishers had opted out of training, correct?” asked Diana Aguilar, a lawyer for the U.S. Department of Justice (DOJ).

Collins confirmed that the data could still be used “for use in search.”

This revelation comes amid a pivotal antitrust trial to determine how the tech firm must change its operations after a federal judge ruled last year that it had illegally monopolized the online search market. The Justice Department is now pressing for structural remedies, including forcing the firm to divest its Chrome browser and prohibiting deals that make it the default search engine across devices, a move that would also affect the company’s AI products, including Gemini.

Google powers AI tools with content that publishers say they didn’t permit

Google’s AI Overviews feature, which summarizes answers at the top of search results using AI-generated text, has already alarmed website publishers. Many argue that it reduces user clicks to original websites and hurts their revenue, yet the tech firm continues to use data from these sources.

In a DOJ-presented document dated August 26, 2024, titled “Search GenAI <> Gemini v3,” internal data showed that Google had filtered out about 80 billion tokens—essentially snippets of text—from its training corpus of 160 billion tokens in response to publisher opt-outs. However, the remaining 80 billion tokens could still include content that powers the Google Search AI features.

The same document also listed “search sessions data” and YouTube videos as additional sources to enhance AI training, raising concerns over the scope of user data being fed into the tech firm’s AI models.

When Judge Amit Mehta asked whether half the dataset was indeed removed due to publisher opt-outs, Collins confirmed: “That is correct.”

DOJ highlights internal interest in leveraging search data for AI

The DOJ further highlighted internal discussions within Google suggesting ambitions to train AI models using its vast troves of search data—rankings, queries, and user behavior.

One example was a briefing prepared for DeepMind CEO Demis Hassabis, who had raised the idea of training a Google AI model on comprehensive search data to measure the resulting performance gains.

Aguilar asked Collins if Google had built a model using search data. Collins responded that he was not aware of such a model being developed, though he acknowledged that Hassabis had shown interest in the concept.

Google’s legal team tried to downplay concerns of AI dominance, arguing that other AI companies can thrive without leveraging its search index. For example, sports chatbots can access real-time data via commercial partnerships with score providers, not web-crawled content.

Still, the DOJ maintains that Google’s long-standing dominance in search gives it an unfair edge in the AI space, particularly as it integrates Gemini into its search infrastructure.

Google faces further scrutiny of its ad business

Alphabet’s Google will also face a trial in September over antitrust enforcers’ proposals to force the company to sell off parts of its advertising technology business. The proposed changes aim to address the firm’s dominance over the tools used by online publishers to sell digital ads.

U.S. District Judge Leonie Brinkema in Alexandria, Virginia, set the trial date after hearing from Google and the DOJ about potential remedies. Both sides are expected to file detailed proposals by Monday.

The DOJ is seeking to have the tech firm divest its ad exchange and publisher ad server businesses—a process expected to take several years, according to DOJ attorney Julia Tarver Wood.

Google lawyer Karen Dunn countered that the company supports behavioral remedies, such as allowing real-time bids to be available to competitors. However, she argued that the DOJ cannot legally force the company to sell parts of its business. Dunn further asserted that such a move would harm internet users and face challenges due to a lack of interested buyers.
