Newsletter Subject

Who's selling my data to train AI?

From

vox.com

Email Address

newsletter@vox.com

Sent On

Wed, Feb 28, 2024 10:35 PM

Email Preheader Text

Those Tumblr, Reddit, and WordPress posts you never thought would see the light of day? Yep, them to

Those Tumblr, Reddit, and WordPress posts you never thought would see the light of day? Yep, them too.

A poster’s guide to who’s selling your data to train AI

If you’ve ever posted anything on the internet, chances are that your data has already been scraped, collected, and used to train AI systems like the ones powering ChatGPT, Midjourney, and Sora. Generative AI is designed to succeed as a generalist, and learning to do so, OpenAI has said, requires “internet-scale” data to train on.

You probably don’t need me to tell you what happened when companies used scraped public data (often without the permission of those who created it) from news articles, books, and creative projects to teach AI tools how to, say, generate news articles, books, and creative projects. The New York Times is currently suing OpenAI for allegedly using its expansive archives without permission to train chatbots (in a recent filing, OpenAI accused the Times of hiring “someone to hack” ChatGPT in order to prove that the chatbot was stealing their content). Getty Images sued Stable Diffusion for copyright infringement. Other lawsuits from authors and creators, angry to find that their works were used to train AI models, have faced setbacks in court.

Other companies have decided to make deals. The Associated Press has licensed part of its archives to OpenAI. Shutterstock, the stock photo archive, has signed a six-year deal with OpenAI to provide training data, which includes access to its photo, video, and music databases.

The ways AI systems use the work of journalists, musicians, and photographers have pretty consequential implications for our information and cultural ecosystem and for the people who work in the fields that AI companies seem dead-set on developing tools to replace.

The need to gather more and more training data with as little fuss as possible means that anyone who’s an online poster (whether it’s a fandom Tumblr account, an active Reddit presence, or a personal blog) could see access to their content being sold by the platforms hosting it to one of these big AI companies. Below is a quick guide to what we know right now about who might be selling your best posts as training data.

Tumblr and WordPress.com

Earlier this week, 404 Media reported that Automattic, the parent company for Tumblr and WordPress, was preparing to announce deals selling user data to OpenAI and Midjourney. According to 404’s reporting, which describes such a deal as “imminent,” the data seems likely to include user posts on Tumblr and on WordPress.com.

On Wednesday, a day after 404’s report, Automattic announced a way for users to opt out of sharing their public content with third parties. The Tumblr staff announcement on the change framed the whole thing as a sign that the company was working to protect its users. “We already discourage AI crawlers from gathering content from Tumblr and will continue to do so,” the announcement read, “save for those with which we partner.” Automattic said in a statement that it was “working directly with select AI companies as long as their plans align with what our community cares about: attribution, opt-outs, and control,” but has not provided any further information on the reported deals with OpenAI and Midjourney.

Although Tumblr’s cultural heft has waned over the past decade, it’s still a pretty important platform for fandom content, including fanfiction and fan art. There are also plenty of artists who use Tumblr to host their original work and take commissions.

Reddit

Reddit’s enormous archives of posts are driven by the labor of volunteers: Unpaid subreddit moderators oversee communities of unpaid users. Their collective efforts on Reddit make the platform valuable.

So when Reddit announced that it was launching an IPO, the company reached out to a selection of mods and frequent posters to offer them the opportunity to buy stock early. Some of those who received the offer were not super enthusiastic about it. But Reddit does not need buy-in from its users to profit from their work: It has already sold access to their posts to Google. Just before the IPO announcement, Reddit and Google entered into a $60 million deal that would give Google access to Reddit’s API in order to, among other things, train its generative AI models.

Everything else, to be honest

The reported deals above are just a couple that have become public. But this doesn’t mean that large AI models aren’t already being trained on your posts across the internet. Last year, the Washington Post examined one of the massive data sets of scraped public internet data used to train generative AI models and found everything from World of Warcraft message boards to Patreon and Kickstarter and several huge repositories of personal blogs. And it should not be a surprise that Meta uses public posts from Facebook and Instagram to train its AI models.

A.W. Ohlheiser, senior technology writer

The Supreme Court appeared lost in a massive case about free speech online
The justices look likely to reinstate Texas and Florida laws that seize control of much of the internet, but not for long.

America’s first moon landing in 50 years, explained
The groundbreaking development speaks to the growing role of private companies in space.

What two years of AI development can tell us about Sora
If you want to know the future of OpenAI’s latest tool, take a look at Midjourney and DALL-E 2.

AI-generated video is here to awe and mislead
OpenAI’s Sora is designed to be a “world simulator.” Right now it’s having trouble breaking a glass.

Your brain needs a really good lawyer
Can new legislation protect us from the companies building tech to read our minds?

Support our work

Vox Technology is free for all, thanks in part to financial support from our readers. Will you join them by making a gift today?

Listen to This

“Make Argentina Great Again!”
US inflation feels bad until you look at Argentina’s, which is breaking 200 percent.
Today, Explained’s Sean Rameswaram reports from Buenos Aires, where residents are divided over their new anarcho-capitalist President Javier Milei’s shock therapy.
Listen to Apple Podcasts

This is cool

some beautiful music

Vox Media, 1201 Connecticut Ave. NW, Washington, DC 20036. Copyright © 2024. All rights reserved.

Email Content Statistics

Subject Line Length

Data show that subject lines of 6 to 10 words generate a 21 percent higher open rate.

Number of Words

The more words in the content, the more time the user will need to spend reading. Get straight to the point with catchy short phrases and interesting photos and graphics.

Number of Images

More images or large images might cause the email to load slower. Aim for a balance of words and images.

Time to Read

Longer reading time requires more attention and patience from users. Aim for short phrases and catchy keywords.

Spam Score

Spam score is determined by a large number of checks performed on the content of the email. For the best delivery results, it is advised to lower your spam score as much as possible.

Flesch reading score

Flesch reading score measures how complex a text is. The lower the score, the more difficult the text is to read. The Flesch readability score uses the average length of your sentences (measured by the number of words) and the average number of syllables per word in an equation to calculate the reading ease. Text with a very high Flesch reading ease score (about 100) is straightforward and easy to read, with short sentences and no words of more than two syllables. Usually, a reading ease score of 60-70 is considered acceptable/normal for web copy.
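
As a rough illustration of the equation described above, here is a minimal Python sketch of the standard Flesch reading ease formula, 206.835 - 1.015 × (words per sentence) - 84.6 × (syllables per word). The function names and the syllable counter are assumptions made for this example: syllables are approximated by counting groups of vowels, which is a common shortcut and will not match the tool used on this page exactly.

```python
import re


def count_syllables(word: str) -> int:
    """Approximate syllables by counting vowel groups (a heuristic, not a dictionary lookup)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text: str) -> float:
    """Return the Flesch reading ease score; higher means easier to read."""
    # Split on runs of sentence-ending punctuation and drop empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


if __name__ == "__main__":
    print(flesch_reading_ease("The cat sat on the mat."))
```

Short sentences built from one-syllable words can score above 100, while long sentences full of polysyllabic words push the score down, which is the behavior the description above relies on.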

Technologies

What powers this email? Every email we receive is parsed to determine the sending ESP and any additional email technologies used.

Copyright © 2019–2024 SimilarMail.