Protecting art from generative AI is vital, now and for the future

Part 2 of our series on generative AI delves into the issues around how it's currently trained

Image credit: Rock Paper Shotgun

Feature by Michael Cook, Contributor

Published on March 19, 2024

Generative AI is all over the entertainment industries right now, and lots of people in games are making excited noises about finding new ways to integrate it into their products, from game developers and publishers such as Ubisoft and Square Enix to platform holders and hardware firms such as Epic and Nvidia. This new industry obsession is still taking shape, and there are lots of questions still to be answered about how much it might cost in the future, who will have access to it, and what it will actually help with, not to mention fears about job losses and other harms. But there’s a bigger question bubbling underneath all of this that threatens to burst the wobbly generative AI bubble: is the entire boom built on stolen labour?

Building and training machine learning systems often requires data – a lot of data. Sometimes we’re lucky, and we can make our own data. When OpenAI trained a bot to play 1v1 Mid matches in Dota 2, they did it through a process called reinforcement learning, having the bot play against itself over and over, with the only feedback being who won and who lost. This feedback, sometimes called a ‘reward’, helps a machine learning system play a sort of ‘hot and cold’ game, changing its internal wiring to try and get a better reward the next time it performs the task. If you’re old enough to remember the game Black & White (or, god forbid, Creatures), you’ll know those games worked on a similar principle. If your little animal did a good thing, you gave it a treat, and if it had just hurled several villagers into a lake, you told it off (or gave it a treat, if you’re into that).
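For the curious, here is a minimal sketch of that self-play loop in Python. It is not OpenAI's Dota 2 setup, which was vastly larger and more sophisticated. Instead it assumes a toy stand-in game (a pile of stones, each player removes one to three per turn, and whoever takes the last stone wins) and a simple lookup table of move scores. The names `train`, `choose_move` and `best_move`, and all the numbers, are made up for illustration. The only feedback the bot ever gets is +1 for the winner's moves and -1 for the loser's.

```python
import random
from collections import defaultdict

MAX_TAKE = 3  # toy rule (an assumption): remove 1, 2 or 3 stones per turn


def legal_moves(stones):
    """Moves available with this many stones left in the pile."""
    return range(1, min(MAX_TAKE, stones) + 1)


def choose_move(q, stones, epsilon, rng):
    """Mostly pick the best-scoring move, but sometimes explore at random."""
    moves = list(legal_moves(stones))
    if rng.random() < epsilon:
        return rng.choice(moves)
    return max(moves, key=lambda m: q[(stones, m)])


def train(episodes=20000, max_stones=12, alpha=0.1, epsilon=0.2, seed=0):
    """Self-play: one score table plays both sides, learning only from wins and losses."""
    rng = random.Random(seed)
    q = defaultdict(float)  # (stones, move) -> running estimate of the reward
    for _ in range(episodes):
        stones = rng.randint(1, max_stones)  # random start so every pile size gets practice
        history = {0: [], 1: []}             # the moves each player made this game
        player = 0
        while stones > 0:
            move = choose_move(q, stones, epsilon, rng)
            history[player].append((stones, move))
            stones -= move
            if stones == 0:
                winner = player              # taking the last stone wins
            player = 1 - player
        # The 'hot and cold' step: nudge every move towards the result of the game.
        for p in (0, 1):
            reward = 1.0 if p == winner else -1.0
            for key in history[p]:
                q[key] += alpha * (reward - q[key])
    return q


def best_move(q, stones):
    """The move the trained table currently rates highest."""
    return max(legal_moves(stones), key=lambda m: q[(stones, m)])


if __name__ == "__main__":
    table = train()
    for pile in range(1, 8):
        print(pile, "stones -> take", best_move(table, pile))
```

Nobody tells the bot the trick to this game, which is to always leave your opponent a multiple of four stones. The hope is that it stumbles onto it purely from being told who won. That works here because "winning" is easy to define, which is exactly what art lacks.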

Sometimes we can’t make our own data. If we want to train an AI to be an artist, we can’t have it just doodle and learn from the results, because we can’t easily define what the feedback should be. In Dota 2’s 1v1 Mid, if you die then it’s game over, and there are similar goals for Chess, Go, Starcraft and so many other games that AIs have tried to play. In art, defining winners and losers is considerably harder. So we need to find some data that already exists, a dataset of art that already looks like the kind of stuff we’d like our machine learning system to be able to do. But where do we find that? Where we find the answer to every problem in life: on random websites.

Zapping wizards in a Dota 2 screenshot. — Dota 2 is one of many games that have been used to train AI tools on how to win matches. | Image credit: Valve

Chances are, if you’ve heard of a machine learning system that generates art – Midjourney, Stable Diffusion, DALL-E – it has been trained on millions or billions of images scraped directly from the Internet. Most of these datasets are unfiltered, gathered from pages strewn across the web, especially sites where content is easily accessible such as Flickr, Reddit or stock image databases. They’re also massive, with one popular dataset, LAION-5B, containing over five billion images. These images are all gathered automatically, with minimal attempts to filter the contents. As a result, the datasets used to train these AI models are full of copyrighted content, illegal material, and personal information. And it’s all being fed into for-profit AI products that a huge proportion of us are using every day – including game developers, journalists and players themselves.

This is the chief reason why you’ll see so many people posting about AI ‘stealing’ content online. One of the best-known effects of these murky, legally questionable datasets is that you can ask an AI to mimic the style of a particular artist, like Greg Rutkowski, who has illustrated for games such as the Anno series and card games like Magic: The Gathering. Rutkowski’s work is beloved, widely shared online, and clearly labelled with his name, all of which means an AI is going to see a lot of examples of his work, as he discovered one day. But there are many more examples that you may never hear about, or that may never come to light. For example, AI Dungeon – which used OpenAI’s GPT-2 to generate RPG stories – took and used thousands of choose-your-own-adventure stories from an online community without permission, leading to a lot of disappointment from the original authors (note that almost all threads on that link are pretty yikes, but it’s here for your context anyway).

Theft is sometimes simple, and sometimes complicated. When I wander into a boss fight in Valheim that my friends have spent hours preparing for and I hoover up all the loot like a hungry, fantasy Roomba, that’s clearly not stealing – that’s just sharing the wealth. In the real world, where courts and legal systems get involved, theft is a lot greyer. Games are often criticised for ‘stealing’ things from other games or media – whether it’s creature appearances in Palworld, dances in Fortnite, or entire games, as happened to Vlambeer’s Ridiculous Fishing and Asher Vollmer’s Threes. But we don’t always agree on what theft is, especially in courtrooms. The lawsuits against Fortnite were all dismissed, while the small indie developers who had their work cloned had almost no legal recourse at all. Deciding on what constitutes theft is unfortunately often more about power than it is about justice.

Right now, there are several lawsuits running around the world targeting various AI models, companies and datasets for infringing all manner of laws and regulations. Some models have been shown to memorise personal information and leak it later, while others have been trained on harmful content or will reproduce copyrighted works. More technical legal arguments get deep into the details of these systems – that the very act of training on copyrighted material constitutes a breach of copyright, for instance. It’s not clear which of these arguments, if any, will break through in court. Companies claim that all of this falls under fair use, that harmful content is a temporary flaw of the system that can be fixed later, and that licensing agreements will help provide respite for artists in the future.

But often it’s not necessarily about what’s legal today, but about how we want the world to work in the future. There are many, many examples throughout the last century of us regulating technology not because it breaks existing laws, but because it lets people work around those laws in ways that no one could have expected. The arguments that defend this large-scale exploitation of creative work are missing the point of the problem. This isn’t a question of legality, but one of humanity. It makes sense to protect creative work and the people who work hard to make it, because it plays a really important role in society.

Palworld image of a Fuddler Pal — Palworld has recently drawn a lot of criticism over the design of its Pal monsters. | Image credit: Rock Paper Shotgun/Pocketpair

It’s this long-term damage that a lot of creative people and AI researchers fear the most. Just like the recent games industry layoffs, the effects of big changes in the industry take a while to fully hit. All the games that were due to come out in a given year will probably come out, and a lot of them will be fun, and maybe you’ll wonder if those layoffs really affected anything. But the impact of disruption like this can take years to be noticed, and decades to be reversed. The reason these generative AI systems could be built today is that they had decades, centuries even, of human creativity to look at on the Internet. If they play a role in devaluing or destabilising the jobs that helped those people make that art, what culture will there be to learn from at the end of this century? Even die-hard AI accelerationists have reason to worry: many fear that we have permanently infected the Internet with so much AI-generated content that it may be impossible to train an AI system on human-authored content ever again.

Our games industry is surprisingly fragile, even though in some ways it feels like it has only become more behemoth-like with each passing decade. Many of the most brilliant ideas from its history, many of the most celebrated creatives or beloved games, bubbled up from the fringes of the industry or from other mediums, often from people in economically vulnerable situations. Small changes that seem innocuous at the time often have wide-reaching effects – and there’s already evidence that generative AI has affected the quality and quantity of freelance creative work. Regardless of whether you think generative AI is good or bad, it seems disrespectful to ignore the concerns and complaints of people who worked so hard, for little reward, to create the rich and beautiful community that our hobby grew out of. Whatever the courts say, whatever regulation comes, whatever these companies look like when the dust has settled – a lot of damage may have already been done.

In Part 3, Better Living, we look at how generative AI must be centred around creators if we're going to achieve a best-case scenario future.

Missed Part 1? Here, we take a closer look at what generative AI means, and why it needs better language to describe how we use it.
