A poster’s guide to who’s selling your data to train AI

Those Tumblr, Reddit, and WordPress posts you never thought would see the light of day? Yep, them too.

In this photo illustration, the Reddit logo is seen behind a silhouette of a person typing.
Photo Illustration by Rafael Henrique/SOPA Images/LightRocket via Getty Images
A.W. Ohlheiser
A.W. Ohlheiser is a senior technology reporter at Vox, writing about the impact of technology on humans and society. They have also covered online culture and misinformation at the Washington Post, Slate, and the Columbia Journalism Review, among other places. They have an MA in religious studies and journalism from NYU.

If you’ve ever posted anything on the internet, chances are that your data has already been scraped, collected, and used to train AI systems like the ones powering ChatGPT, Midjourney, and Sora. Generative AI is designed to succeed as a generalist, and learning to do so, OpenAI has said, requires “internet-scale” data to train on.

You probably don’t need me to tell you what happened when companies used scraped public data — often without the permission of those who created it — from news articles, books, and creative projects to teach AI tools how to, say, generate news articles, books, and creative projects.

The New York Times is currently suing OpenAI for allegedly using its expansive archives without permission to train chatbots (in a recent filing, OpenAI accused the Times of hiring “someone to hack” ChatGPT in order to prove that the chatbot was stealing its content). Getty Images sued Stability AI, the company behind Stable Diffusion, for copyright infringement. Other lawsuits from authors and creators, angry to find that their works were used to train AI models, have faced setbacks in court.

Other companies have decided to make deals. The Associated Press has licensed part of its archives to OpenAI. Shutterstock, the stock photo archive, has signed a six-year deal with OpenAI to provide training data, which includes access to its photo, video, and music databases.

The ways AI systems use the work of journalists, musicians, and photographers have pretty consequential implications for our information and cultural ecosystem and for the people who work in the fields that AI companies seem dead-set on developing tools to replace. The need to gather more and more training data with as little fuss as possible means that anyone who’s an online poster, whether it’s a fandom Tumblr account, an active Reddit presence, or a personal blog, could see access to their content being sold by the platforms hosting it to one of these big AI companies.

Below is a quick guide to what we know right now about who might be selling your best posts as training data.

Tumblr and WordPress.com

Earlier this week, 404 Media reported that Automattic, the parent company of Tumblr and WordPress.com, was preparing to announce deals selling user data to OpenAI and Midjourney. According to 404’s reporting, which describes such a deal as “imminent,” the data seems likely to include user posts on Tumblr and on WordPress.com. On Wednesday, a day after 404’s report, Automattic announced a way for users to opt out of sharing their public content with third parties.

The Tumblr staff announcement on the change framed the whole thing as a sign that the company was working to protect its users. “We already discourage AI crawlers from gathering content from Tumblr and will continue to do so,” the announcement read, “save for those with which we partner.”

Automattic said in a statement that it was “working directly with select AI companies as long as their plans align with what our community cares about: attribution, opt-outs, and control,” but has not provided any further information on the reported deals with OpenAI and Midjourney.

Although Tumblr’s cultural heft has waned over the past decade, it’s still a pretty important platform for fandom content, including fanfiction and fan art. There are also plenty of artists who use Tumblr to host their original work and take commissions.

Reddit

Reddit’s enormous archives of posts are driven by the labor of volunteers: Unpaid subreddit moderators oversee communities of unpaid users. Their collective efforts on Reddit make the platform valuable.

So when Reddit announced that it was launching an IPO, the company reached out to a selection of mods and frequent posters to offer them the opportunity to buy stock early. Some of those who received the offer were not super enthusiastic about it. But Reddit does not need buy-in from its users to profit from their work: It has already sold access to their posts to Google.

Just before the IPO announcement, Reddit and Google entered into a $60 million deal that would give Google access to Reddit’s API in order to, among other things, train its generative AI models.

Everything else, to be honest

The reported deals above are just a couple that have become public. But this doesn’t mean that large AI models aren’t already being trained on your posts across the internet.

Last year, the Washington Post examined one of the massive data sets of scraped public internet data used to train generative AI models and found everything from World of Warcraft message boards to Patreon and Kickstarter and several huge repositories of personal blogs. And it should not be a surprise that Meta uses public posts from Facebook and Instagram to train its AI models.
