No Data, No Party: Why Competition for High‑Quality Journalism Matters in the GenAI Race

If you follow debates about generative AI, you will have heard a lot about compute and talent. Cloud infrastructure, GPU clusters, and elite machine‑learning teams have become the standard shorthand for what really matters in the AI arms race. What is missing from that conversation is the third leg of the stool: data – and in particular, high‑quality, human‑generated content such as professional journalism. Data is not a secondary input. It is the ingredient that largely determines whether a model is reliable, useful, and competitive at all.

In this post, I argue three things. First, high‑quality news and media content is a decisive input in GenAI. Second, the way dominant firms access and use this content can distort competition, both for AI models and for journalism itself. Third, we already have a legal toolkit (copyright, competition law, the Digital Markets Act (DMA), and the AI Act) that can and should be used to protect competition for third‑party data, rather than treating the matter as a purely copyright problem.

1. The missing input in GenAI debates: data

Policy discussions have converged on a now‑familiar triad of inputs that drive competition in GenAI:

  • Compute: access to large‑scale cloud infrastructure and specialised chips.
  • Talent: the small pool of experienced ML researchers and engineers.
  • Data: the material on which models are trained, tuned, and grounded.

Compute and talent now feature in every competition, industrial‑policy, and security paper on AI. Data, by contrast, is often treated as a secondary or purely “copyright” issue – something to be sorted out between tech firms and rightholders in private negotiations.

That is a mistake. In practice, data quality and control are at least as important for competitive dynamics as compute capacity or headcount. Models trained on higher‑quality data are more accurate, less prone to hallucinations, and more attractive to users and developers. That is why major players openly acknowledge that data quality is the key determinant of performance.

And not all data is created equal. The most valuable datasets today share three features. They are contemporary, so that models stay “in step with the times”; human‑generated, rather than synthetic outputs recycled from other models; and professionally produced, with clear structure, correct language and factual grounding.

News and journalistic content sit squarely in this category. They are precious resources for GenAI: constantly updated, carefully edited, and closely tied to real‑world events.

That is why the question “who controls access to high‑quality media content?” is no longer just a copyright question. It is a competition question about a critical input in the AI value chain.

2. First‑party versus third‑party data: an uneven playing field

From a competition perspective, the relevant distinction is not (simply) between public and non‑public data, but between:

  • first‑party data, that is, content (mainly end‑user data) that a platform already controls because it runs the underlying consumer‑facing service; and
  • third‑party data, that is, content produced and owned by others, such as news and magazine publishers, broadcasters, and image libraries.

Large integrated tech firms sit on vast troves of first‑party data. They can update terms and conditions or privacy policies to fold those datasets into their AI pipelines, often without meaningful user choice. Smaller or newer AI developers simply do not have this option: they must negotiate licences or find lawful ways to access third‑party content.

This creates a structural asymmetry:

  • Incumbents have:
    • huge first‑party data stores from search, social, e‑commerce and other services;
    • strong bargaining power vis‑à‑vis publishers because they act as the main “gateways” to users;
    • vertical integration across the AI value chain (from cloud and data centres to consumer services).
  • Smaller / independent developers have:
    • no comparable first‑party reservoir;
    • dependence on third‑party content for training and grounding;
    • limited leverage in licensing negotiations and no ability to coerce access.

The result is a downward spiral for smaller players (inferior data → inferior model → lost users → even less data), while incumbents compound their advantages. Left unchecked, this dynamic risks reproducing in GenAI exactly the kind of concentrated market structures we saw with earlier digital platforms. In a nutshell, if we want a genuinely competitive GenAI marketplace, protecting competition for third‑party content, especially high‑quality journalism, is essential.

3. How AI developers get the data: crawlers, scraping and RAG

To understand what is at stake for publishers, it helps to demystify how AI developers actually obtain and use data.

Web crawling and scraping

Modern AI developers rely heavily on automated tools:

  • Web crawlers (“bots”) systematically traverse the web, following links outward from seed URLs to discover pages and build large indexes of them.
  • Web scrapers then extract specific information from those pages (e.g., text, images, structured data) and store it in datasets that can be cleaned, filtered and formatted for model training.

This infrastructure was originally built to power search engines and archiving. Today, the same techniques are used to assemble training corpora for foundation models: billions of tokens scraped from news sites, blogs, forums, code repositories, and more.
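The crawl‑then‑scrape pipeline described above can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular developer's tooling: the `PAGES` dictionary is a hypothetical in‑memory stand‑in for the web (a real crawler would fetch pages over HTTP and respect site policies), and the names `PageParser` and `crawl` are assumptions introduced here.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web": URL -> HTML. A real crawler would fetch over HTTP.
PAGES = {
    "https://news.example/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://news.example/a": '<h1>Story A</h1><p>Text of A.</p><a href="/b">B</a>',
    "https://news.example/b": '<h1>Story B</h1><p>Text of B.</p><a href="/">Home</a>',
}


class PageParser(HTMLParser):
    """Collects outgoing links (the crawler's job) and visible text (the scraper's job)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # Resolve relative links against the page URL so they can be queued.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

    def handle_data(self, data):
        # Keep non-empty text fragments only.
        if data.strip():
            self.text_parts.append(data.strip())


def crawl(seed, pages):
    """Breadth-first traversal from a seed URL; returns URL -> extracted text."""
    seen, queue, dataset = {seed}, deque([seed]), {}
    while queue:
        url = queue.popleft()
        html = pages.get(url)
        if html is None:  # page not available in our toy "web"
            continue
        parser = PageParser(url)
        parser.feed(html)
        parser.close()
        dataset[url] = " ".join(parser.text_parts)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return dataset


corpus = crawl("https://news.example/", PAGES)
```

Starting from a single seed URL, the sketch discovers every linked page and stores its text. That is the essential point for publishers: once one page is reachable, everything linked from it can end up in a dataset unless access is restricted.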

Training versus grounding

Once collected, content is used in two conceptually different ways:

  • Training: models internalise patterns, style and structure from the data; this is where copyright questions around reproduction and text‑and‑data mining exceptions are most acute.
  • Grounding / RAG (retrieval‑augmented generation): the model is connected to “live” content sources at inference time (for example via search indexes or proprietary feeds) so that it can pull in up‑to‑date material when answering a query.
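To make the grounding step concrete, here is a minimal sketch of retrieval‑augmented generation under simplifying assumptions. The `ARTICLES` snippets are invented, retrieval is plain word overlap rather than the vector search production systems typically use, and no model is actually called: the sketch stops at the prompt that would be sent to one. All function names are hypothetical.

```python
import re

# Hypothetical article snippets standing in for a live news index.
ARTICLES = [
    {"source": "news.example/energy",
     "text": "The regulator approved new offshore wind farms on Monday."},
    {"source": "news.example/sport",
     "text": "The home team won the cup final after extra time."},
    {"source": "news.example/rates",
     "text": "The central bank held interest rates steady this week."},
]


def tokens(text):
    """Lower-cased set of alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query, articles, k=1):
    """Rank articles by how many words they share with the query; keep the top k."""
    q = tokens(query)
    ranked = sorted(articles, key=lambda a: len(q & tokens(a["text"])), reverse=True)
    return ranked[:k]


def build_prompt(query, articles, k=1):
    """Assemble the prompt a model would receive at inference time."""
    context = "\n".join(
        f"[{a['source']}] {a['text']}" for a in retrieve(query, articles, k)
    )
    return f"Answer using only the sources below.\n{context}\nQuestion: {query}"


prompt = build_prompt("What did the central bank do with interest rates?", ARTICLES)
```

The model's weights are untouched here. The publisher's content is fetched and inserted into the prompt at query time, which is why grounding depends on continuous access to fresh sources rather than a one‑off copy.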

For journalism, both stages matter. News articles can be used (a) upstream, as training data that shapes the model’s general ability to write, argue and summarise; and (b) downstream, as sources for grounded answers in AI search or chat interfaces.

In both roles, high‑quality media content is central to making GenAI systems useful and trustworthy. That is exactly why the rules governing access to that content (whether via bots, APIs, or bespoke licensing) have direct competitive implications.

4. When copyright is not enough: data practices as competition problems

At first glance, disputes over AI training on news seem like a classic copyright problem: have developers copied protected works without authorisation, and do any statutory exceptions apply? That is indeed a big part of the story.

But if we stop there, we miss two things.

4.1. Copyright does not ensure fair bargaining

EU copyright rules give publishers powerful rights, including:

  • the reproduction right and communication to the public under the InfoSoc Directive;
  • the ancillary (“neighbouring”) right for press publications in Article 15 of the DSM Copyright Directive;
  • the ability to opt out of the general text‑and‑data‑mining exception in Article 4 of the DSM Copyright Directive through machine‑readable reservations of rights.
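As an illustration of what a machine‑readable reservation can look like in practice, the sketch below uses a robots.txt file, one common convention for signalling crawler permissions. It is an assumption of this example, not a statement of law, that such a directive would count as a valid reservation. The crawler names `ExampleAIBot` and `SearchBot`, the file contents and the helper `may_fetch` are hypothetical; `RobotFileParser` is from Python's standard library.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one AI crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(robots_txt, user_agent, url):
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


article = "https://news.example/2024/story"
ai_allowed = may_fetch(ROBOTS_TXT, "ExampleAIBot", article)
search_allowed = may_fetch(ROBOTS_TXT, "SearchBot", article)
```

The sketch only works as an opt‑out if the publisher can address the AI crawler separately from the search crawler. Where a single bot serves both purposes, blocking AI use means disappearing from search.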

What these regimes do not do is establish a duty on dominant platforms to:

  • negotiate, rather than impose “take‑it‑or‑leave‑it” terms; or
  • avoid tying access to basic distribution (search visibility) to consent for additional AI uses.

This is where competition law, and especially Article 102 TFEU, becomes relevant.

4.2. Data access as a parameter of competition

In Meta Platforms, the Court of Justice of the EU (CJEU) held that access to and processing of personal data are a significant parameter of competition in the digital economy. There is no reason in principle why the same reasoning could not apply to access to copyright‑protected data in GenAI markets.

If high‑quality media content is a decisive input for training and grounding, then:

  • denying or conditioning access to that content can foreclose rivals;
  • using dominance to impose free or unfair use of content can be an exploitative abuse; and
  • systematic breaches of copyright by a dominant firm can be evidence that it has departed from “competition on the merits”.

The ongoing investigation into how Google uses publishers’ content for AI Overviews is a good illustration. At issue is whether Google is imposing unfair trading conditions on publishers and leveraging its dominant position in search to extract uncompensated AI training and grounding data.

Competition law is therefore not a stranger in this discussion. It is a necessary part of the toolkit if we are serious about preserving competition for third‑party data.

5. An integrated legal toolkit: Article 102, the DMA and the AI Act

We should stop thinking in silos – “this is copyright, that is competition, the DMA is about platforms, the AI Act is about safety”. Instead, we should combine these instruments to protect competition for data.

5.1. Article 102 TFEU

Article 102 is flexible enough to capture several data‑related practices:

  • Unfair trading conditions (Article 102(a)), including forcing publishers to accept free use of their content for AI services as a condition for remaining visible in search, or unilaterally deciding that their content will be fed into AI Overviews without realistic opt‑out or remuneration.
  • Abuse through regulatory non‑compliance: building on Meta Platforms, competition authorities can treat systematic breaches of copyright rules by a dominant firm, where they allow it to accumulate a decisive data advantage, as a strong indication of abuse.
  • Self‑preferencing: when a dominant search engine places its own AI answers, trained on publishers’ content, at the top of results pages and thereby diverts traffic away from those very publishers.

The point is not to “turn Article 102 into copyright law by other means”, but to recognise that how a dominant firm accesses and uses third‑party content can be anti‑competitive, above and beyond the underlying IP questions.

5.2. The Digital Markets Act

The DMA is not an AI law, but its obligations for gatekeepers are highly relevant to AI ecosystems:

  • Article 6(5) DMA: Gatekeepers may not favour their own services in ranking and display. That provision can capture AI answer boxes or overviews that are integrated into search results and systematically placed ahead of organic publisher links.
  • Article 5(8) DMA: Gatekeepers cannot make access to one core platform service conditional on subscription or registration to another. By analogy, they should not be able to tie basic search indexing to consent for AI training and grounding, effectively forcing publishers to “pay” with their content if they want to remain findable.
  • Article 6(2) DMA: This limits the use of non‑public business‑user data to compete against those users. Although focused on non‑public data, the logic is instructive: platforms should not be able to use their role as intermediaries to appropriate data in ways that distort competition in adjacent markets, including GenAI.

These provisions do not create a licensing regime, but they constrain the ecosystem strategies through which gatekeepers might entrench data advantages in AI.

5.3. The AI Act

Finally, the AI Act brings a different kind of tool: transparency.

For providers of general‑purpose AI models, Article 53 requires:

  • a copyright‑compliance policy that respects rightholders’ reservations under the DSM Copyright Directive; and
  • a public summary of training data, designed to help rightholders understand if and how their content has been used.

In principle, this should help correct the massive informational asymmetry that currently exists: most publishers simply do not know whether their content is in any given training corpus. In practice, implementation matters. A narrow approach that only requires high‑level descriptions or partial domain lists will not give publishers the information they need to exercise their rights or to negotiate.

I argue for a more ambitious interpretation of the above obligations: summaries should be sufficiently detailed, including, where feasible, domain‑level and URL‑level information, to make copyright and competition rights enforceable in practice. Without that, transparency becomes a box‑ticking exercise.
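To show that domain‑level reporting is technically trivial once URL‑level records exist, here is a minimal sketch that rolls a list of crawled URLs up into per‑domain counts. The input URLs and the function name `domain_summary` are hypothetical, and the output format is purely illustrative: it is not the official summary template under the AI Act.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical URL-level log of pages included in a training corpus.
TRAINING_URLS = [
    "https://news.example/politics/story-1",
    "https://news.example/economy/story-2",
    "https://magazine.example/features/story-3",
    "https://NEWS.example/sport/story-4",
]


def domain_summary(urls):
    """Count how many training documents came from each domain."""
    return Counter(urlparse(url).netloc.lower() for url in urls)


summary = domain_summary(TRAINING_URLS)
for domain, count in summary.most_common():
    print(f"{domain}: {count} documents")
```

A publisher reading such a summary could see at a glance whether, and roughly how heavily, its domain features in the corpus, which is the minimum needed to start a licensing conversation or an enforcement action.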

6. Towards a competitive market for third‑party content

Putting this together, what should policy‑makers and enforcers aim for?

  1. Recognise third‑party content as both a copyright object and a competitive resource: High‑quality journalism is not just another line item in a licensing negotiation. It is a key input without which smaller AI developers cannot compete, and without which models will become less grounded, more hallucinatory, and less useful.
  2. Use existing tools before inventing new ones: There is substantial, under‑used scope in Article 102 TFEU, applied in conjunction with copyright rules; in the DMA’s bans on self‑preferencing and tying in search and other core platform services; and in the AI Act’s transparency and copyright‑compliance obligations.
  3. Design remedies with data in mind: When authorities look at abuses in search, social or other gatekeeper markets, they should ask not only “how do we fix ranking or access today?” but also “how do we prevent the same conduct from creating an insurmountable data advantage in GenAI tomorrow?”. That may mean integrating data‑access and data‑licensing remedies into antitrust or DMA cases, and hardwiring traceability obligations into AI regulation.
  4. Support the emergence of a real licensing market: Transparency about training data, combined with credible enforcement against zero‑remuneration “take‑it‑or‑leave‑it” terms, can help move us from opaque appropriation of content to a more normal market where different AI developers compete to license high‑quality datasets on fair terms.

If we fail to act, the likely outcome is not simply that journalists and publishers are under‑compensated (or not compensated at all). It is that competition in GenAI itself will be stunted. A handful of vertically integrated firms will control the key inputs (compute, talent, and data) and everyone else will be left trying to compete on inferior models trained on inferior content.
