The Cost of AI Licensing
Note started: October 2023
This note is a work in progress; I’ve updated it three times so far.
Some thoughts about the data in generative AI models and how we could make companies pay for it, based on a question Rose Eveleth asked over on Bluesky:
has anybody seen any good thinking/writing about how much money generative AI companies should be investing in licensing the content they're using for their models? How much would/should it cost to pay for the corpus?
Disclaimer: I’m not saying the following is good thinking. It’s fragments at most. But this question remains remarkably under-researched, at least from what I’ve found venturing around in the AI mines over the past few years. If you know otherwise, please answer Rose and/or send me a ping on Mastodon or by mail.
What to license
The fact that, at least for the commercial models, we don’t really know which data is actually in them, or what share of the total training data a given dataset accounts for, makes it hard(er) to make any argument about compensation in the first place. I don’t think that’s an accident.
We can look at open-access models to get a bit of an understanding. BLOOM was trained on almost 500 different datasets, but those needed to be pre-processed and cleaned. For Falcon LLM there’s a dedicated paper on that work.
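If the training data were disclosed, the basic quantity any pro-rata compensation scheme needs is easy to compute: each source’s share of the total training tokens. A minimal sketch, with invented token counts standing in for a real disclosure:

```python
# Toy example: compute each source's share of the training corpus,
# the number any pro-rata licensing scheme would start from.
# All token counts are invented for illustration.
corpus_tokens = {
    "common_crawl": 1_200_000_000_000,
    "books": 90_000_000_000,
    "news_archive": 45_000_000_000,
    "code": 160_000_000_000,
}
total = sum(corpus_tokens.values())
for source, tokens in corpus_tokens.items():
    print(f"{source}: {tokens / total:.2%} of training tokens")
```

The point being: the math is trivial. What’s missing is the disclosure.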
So, say we require OpenAI et al. to commit 20% of their profits to licensing the content in their datasets: OpenAI will not disclose what’s in their data (it’s a secret because they need to make more money, sorry, build safe systems).
So either it all stays completely opaque, or we force commercial AI vendors to be transparent.
For a general feeling of what such a system could look like, we should keep an eye on Adobe’s Firefly payouts relative to their revenue. Afaik they are the only company that pays the artists whose work is in their datasets. That they trained Firefly only on Adobe Stock and public-domain images, not on everything, certainly makes this easier.
But, honestly, I don’t know if that’s feasible for LLMs, as they require absurd amounts of data.
Sure, 90% of the companies currently adding an LLM to their products might not need one, but it’s the Valley Party Train and such arguments won’t stop it.
How to license
So we don’t know what to license, as we don’t know what’s in the datasets of the commercial models.
But even if we did know, how would you ensure payouts? Sure, it’s (kinda) easy for copyrighted works of art (music, books, paintings …), but a large part of CommonCrawl is just websites. Should only those in the dataset who run some kind of business get payouts?
And which license do you use? Do you pay per dataset entry or per training epoch? (Thanks to Sarah Moir for making this point.) Per model created, or per output that might contain licensed data? To do the latter, we need a much better understanding of how a model generates text and what the sources for a specific answer are.
These are notoriously difficult, unsolved problems. Anthropic posted about a possible direction in Decomposing Language Models Into Understandable Components, though that only helps explain model behaviour, not the data layer.
Would we end up in a situation similar to music streaming rights, where the payout per stream is so minuscule that it only matters for a small percentage of artists? Tbf, I see this happening in basically all licensing models I can think of. Are there protections against this?
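To get a feel for the orders of magnitude: a back-of-the-envelope sketch (every number here is invented) of what the different licensing units from above would pay out per rights-holder:

```python
# Back-of-the-envelope payout comparison. All numbers are invented;
# the point is the order of magnitude, not the specific values.
licensing_pool = 100_000_000       # hypothetical yearly pool in USD
dataset_entries = 3_000_000_000    # documents in the training corpus
training_epochs = 2                # passes over the corpus
models_trained = 4                 # models trained on the corpus that year

per_entry = licensing_pool / dataset_entries
per_entry_epoch = licensing_pool / (dataset_entries * training_epochs)
per_entry_model = licensing_pool / (dataset_entries * models_trained)

print(f"per entry:           ${per_entry:.4f}")        # ~3 cents
print(f"per entry and epoch: ${per_entry_epoch:.4f}")
print(f"per entry and model: ${per_entry_model:.4f}")
```

Even with a generous pool, a corpus of billions of documents pushes the per-entry payout toward streaming-style fractions of a cent.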
The Academic Veil of Commercial AI
Another thing to note here: if you look at the data layer only, AI vendors might argue that the datasets have been collected for scientific purposes. It’s a great excuse, as scientific usage grants generous copyright exemptions.
But in the end, this might be little more than an excuse. Take for example LAION, the non-profit that maintains the LAION-5B image dataset used to train Stable Diffusion. Stable Diffusion is open source and was published by researchers at the Ludwig Maximilian University in Munich. Stability AI, essentially the marketing wrapper and funding machine for these non-profit initiatives, has raised over $120 million in funding.
If you are interested in diving deeper into some of these challenges, Sarah Moir wrote about the legal challenges in the context of music-generating models.
Creating licensed data
To note here, too: you can by now opt out of crawling for some AI training (e.g. by Google, OpenAI, CommonCrawl). This doesn’t solve the problem; for anything resembling informed consent, these systems would need to be opt-in. And as more models spam more outputs onto more sites, internet crawls will become a much worse base to train a model on.
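In practice, opting out means listing the vendors’ documented crawler tokens in your robots.txt. A minimal example covering the three mentioned above:

```
# robots.txt — opt out of AI training crawls

User-agent: GPTBot           # OpenAI's training crawler
Disallow: /

User-agent: Google-Extended  # Google's AI training opt-out (not Search)
Disallow: /

User-agent: CCBot            # Common Crawl, which many training sets build on
Disallow: /
```

Note that this only affects future crawls: anything already collected stays in the existing datasets, and compliance is voluntary.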
Another thing that we are currently seeing – which will not replace training on huge crawls, but will probably augment them or take a more prominent role in the finetuning steps – is that data companies like Scale AI are hiring not only gig workers but e.g. poets to create high-quality, licensed training data.
Making models forget
There’s ongoing research in «model unlearning».
If this becomes a feasible technique, that’s a likely way for AI companies to cop out of paying licensing fees. «Oh, sorry. Thanks for bringing this to our attention. We didn’t mean to include your stuff, look, we removed it.»
Mind, the if in the previous sentence is doing a lot of heavy lifting!
If I want my model to be able to answer questions, it’s not helpful to remove all relevant pop culture. And a model might not perform significantly worse after removing Harry Potter, as Eldan and Russinovich show in «Who’s Harry Potter? Approximate Unlearning in LLMs». But what if you have to remove 500+ other books because their authors demand it?
As the authors of the study conclude:
Extending our approach to other types of content, particularly non-fiction or textbooks, presents its own set of challenges. Unlike the fictional universe of Harry Potter, non-fiction content will not possess the same density of unique terms or phrases. Furthermore, non-fictional texts often embed higher-level constructs such as ideas, concepts, or cultural perspectives. It remains uncertain to what extent our technique can effectively address and unlearn these more abstract elements. This would clearly necessitate adaptations of our technique.
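For intuition only, here’s a minimal sketch of the crudest form of unlearning: gradient ascent on a «forget set». To be clear, this is not the technique from the paper (which fine-tunes the model toward generic alternative completions); it just shows the basic shape of the idea. Model name and texts are placeholders:

```python
# Crude unlearning sketch: gradient ASCENT on a forget set, i.e. make
# the model assign lower probability to the passages to be removed.
# Not the paper's method; purely illustrative. Placeholders throughout.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

forget_texts = ["<passage whose influence should be removed>"]

model.train()
for text in forget_texts:
    batch = tokenizer(text, return_tensors="pt")
    # Negate the language-modelling loss so the update *raises* it.
    loss = -model(**batch, labels=batch["input_ids"]).loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The known failure mode, and part of why the «if» above is doing so much work: pushed too far, this degrades everything the model learned near the forgotten material, not just the material itself.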
That’s that for now. I’ll add to this piece as I find new reporting and worthwhile things.