Those Tumblr, Reddit, and WordPress posts you never thought would see the light of day? Yep, them too.
A poster's guide to who's selling your data to train AI

If you've ever posted anything on the internet, chances are that your data has already been scraped, collected, and used to train AI systems like the ones powering ChatGPT, Midjourney, and Sora. Generative AI is designed to succeed as a generalist, and learning to do so, OpenAI has said, requires "internet-scale" data to train on.

You probably don't need me to tell you what happened when companies used scraped public data (often without the permission of those who created it) from news articles, books, and creative projects to teach AI tools how to, say, generate news articles, books, and creative projects. The New York Times is currently suing OpenAI for allegedly using its expansive archives without permission to train chatbots (in a recent filing, OpenAI accused the Times of hiring "someone to hack" ChatGPT in order to prove that the chatbot was stealing their content). Getty Images sued Stability AI, the maker of Stable Diffusion, for copyright infringement. Other lawsuits from authors and creators, angry to find that their works were used to train AI models, have faced setbacks in court.

Other companies have decided to make deals. The Associated Press has licensed part of its archives to OpenAI. Shutterstock, the stock photo archive, has signed a six-year deal with OpenAI to provide training data, which includes access to its photo, video, and music databases.

The ways AI systems use the work of journalists, musicians, and photographers have pretty consequential implications for our information and cultural ecosystem and for the people who work in the fields that AI companies seem dead-set on developing tools to replace.
The need to gather more and more training data with as little fuss as possible means that anyone who's an online poster, whether it's a fandom Tumblr account, an active Reddit presence, or a personal blog, could see access to their content being sold by the platforms hosting it to one of these big AI companies. Below is a quick guide to what we know right now about who might be selling your best posts as training data.

Tumblr and WordPress.com

Earlier this week, 404 Media reported that Automattic, the parent company of Tumblr and WordPress.com, was preparing to announce deals selling user data to OpenAI and Midjourney. According to 404's reporting, which describes such a deal as "imminent," the data seems likely to include user posts on Tumblr and on WordPress.com.

On Wednesday, a day after 404's report, Automattic announced a way for users to opt out of sharing their public content with third parties. The Tumblr staff announcement on the change framed the whole thing as a sign that the company was working to protect its users. "We already discourage AI crawlers from gathering content from Tumblr and will continue to do so," the announcement read, "save for those with which we partner."

Automattic said in a statement that it was "working directly with select AI companies as long as their plans align with what our community cares about: attribution, opt-outs, and control," but has not provided any further information on the reported deals with OpenAI and Midjourney.

Although Tumblr's cultural heft has waned over the past decade, it's still a pretty important platform for fandom content, including fanfiction and fan art. There are also plenty of artists who use Tumblr to host their original work and take commissions.

Reddit

Reddit's enormous archives of posts are driven by the labor of volunteers: Unpaid subreddit moderators oversee communities of unpaid users. Their collective efforts on Reddit make the platform valuable.
So when Reddit announced that it was launching an IPO, the company reached out to a selection of mods and frequent posters to offer them the opportunity to buy stock early. Some of those who received the offer were not super enthusiastic about it. But Reddit does not need buy-in from its users to profit from their work: It has already sold access to their posts to Google. Just before the IPO announcement, Reddit and Google entered into a $60 million deal that would give Google access to Reddit's API in order to, among other things, train its generative AI models.

Everything else, to be honest

The reported deals above are just a couple that have become public. But this doesn't mean that large AI models aren't already being trained on your posts across the internet. Last year, the Washington Post examined one of the massive data sets of scraped public internet data used to train generative AI models and found everything from World of Warcraft message boards to Patreon and Kickstarter and several huge repositories of personal blogs. And it should not be a surprise that Meta uses public posts from Facebook and Instagram to train its AI models.

- A.W. Ohlheiser, senior technology writer
Vox Media, 1201 Connecticut Ave. NW, Washington, DC 20036. Copyright © 2024. All rights reserved.