The tricky truth about how generative AI uses your data.
AI systems train on your data. What can you do about it?

When the White House revealed its list of voluntary safety and societal commitments signed by seven AI companies, one thing was noticeably missing: anything related to the data these AI systems collect and use to train this powerful technology. Including, very likely, yours.

There are many concerns about the potential harm that sophisticated generative AI systems have unleashed on the public. What they do with our data is one of them. We know very little about where these models get the petabytes of data they need, how that data is being used, and what protections, if any, are in place when it comes to sensitive information. The companies that make these systems aren't telling us much, and may not even know themselves.

You may be okay with all of this, or think the good that generative AI can do far outweighs whatever bad went into building it. But a lot of other people aren't. Two weeks ago, a viral tweet accused Google of scraping Google Docs for data on which to train its AI tools. In a follow-up, its author claimed that Google "used docs and emails to train their AI for years." The initial tweet has nearly 10 million views, and it's been retweeted thousands of times. The fact that this may not even be true is almost beside the point. (Google says it doesn't use data from its free or enterprise Workspace products, which include Gmail and Docs, to train its generative AI models unless it has user permission, though it does train some Workspace AI features like spellcheck and Smart Compose using anonymized data.)
"Up until this point, tech companies have not done what they're doing now with generative AI, which is to take everyone's information and feed it into a product that can then contribute to people's professional obsolescence and totally decimate their privacy in ways previously unimaginable," said Ryan Clarkson, whose law firm is behind class action lawsuits against OpenAI and Microsoft, and against Google.

Google's general counsel, Halimah DeLaine Prado, said in a statement that the company has been clear that it uses data from public sources, adding that "American law supports using public information to create new beneficial uses, and we look forward to refuting these baseless claims."

Exactly what rights we may have over our own information, however, is still being worked out in lawsuits, worker strikes, regulator probes, executive orders, and possibly new laws. Those might take care of your data in the future, but what can you do about what these companies already took, used, and profited from? The answer is probably not a whole lot.

Generative AI companies are hungry for your data. Here's how they get it.

Simply put, generative AI systems need as much data as possible to train on. The more they get, the better they can generate approximations of how humans sound, look, talk, and write. The internet provides massive amounts of data that's relatively easy to gobble up through web scraping tools and APIs. But that gobbling process doesn't distinguish between copyrighted works and personal data; if it's out there, it takes it.

"In the absence of meaningful privacy regulations, that means that people can scrape really widely all over the internet, take anything that is 'publicly available' (that top layer of the internet, for lack of a better term) and just use it in their product," said Ben Winters, who leads the Electronic Privacy Information Center's AI and Human Rights Project and co-authored its report on generative AI harms.
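The indiscriminate nature of that gobbling process can be sketched in a few lines of Python. This is a hypothetical, stripped-down illustration using only the standard library (real crawlers fetch millions of pages over the network); the point is that the extraction step keeps every string it finds, with no notion of whether a snippet is copyrighted prose or someone's personal details.

```python
from html.parser import HTMLParser

class TextScraper(HTMLParser):
    """Collects every piece of visible text from an HTML document."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Keep any non-empty text node, whatever it contains.
        text = data.strip()
        if text:
            self.chunks.append(text)

# A stand-in for a fetched "publicly available" page (hypothetical content).
page = "<html><body><p>My vacation photos</p><p>Email: jane@example.com</p></body></html>"

scraper = TextScraper()
scraper.feed(page)
print(scraper.chunks)  # → ['My vacation photos', 'Email: jane@example.com']
```

Note that the email address comes out alongside everything else: filtering personal information back out is a separate, optional step, and one that scrapers don't have to take.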
Which means that, unbeknownst to you and, apparently, several of the companies whose sites were being scraped, some startup may be taking and using your data to power a technology you had no idea was possible. That data may have been posted on the internet years before these companies existed. It may not have been posted by you at all. Or you may have thought you were giving a company your data for one purpose that you were fine with, but now you're afraid it was used for something else.

Many companies' privacy policies, which are updated and changed all the time, may let them do exactly that. They often say something along the lines of how your data may be used to improve their existing products or develop new ones. Conceivably, that includes generative AI systems.

Not helping matters is how cagey generative AI companies have been about revealing their data sources, often simply saying that they're "publicly available." Even Meta's more detailed list of sources for its first LLaMA model refers to things like Common Crawl, an open repository of web crawl data covering much of the internet, as well as sites like GitHub, Wikipedia, and Stack Exchange, which are also enormous repositories of information. (Meta hasn't been as forthcoming about the data used for the just-released Llama 2.) All of these sources may contain personal information.

OpenAI admits that it uses personal data to train its models, but says it comes across that data "incidentally" and only uses it to make "our models better," as opposed to building profiles of people to sell ads to them. Google and Meta have vast troves of personal user data they say they don't use to train their language models now, but we have no guarantee they won't do so in the future, especially if it means gaining a competitive advantage. We know that Google scanned users' emails for years in order to target ads (the company says it no longer does this).
Meta had a major scandal, and a $5 billion fine, when it shared data with third parties, including Cambridge Analytica, which then misused it. The fact is, these companies have given users plenty of reasons not to take their assurances about data privacy or commitments to produce safe systems at face value. "The voluntary commitments by big tech require a level of trust that they don't deserve and they have not earned," Clarkson said.

Copyrights, privacy laws, and "publicly available" data

For creators (writers, musicians, and actors, for instance), copyrights and image rights are a major issue, and it's pretty obvious why. Generative AI models have been trained on their work and could put them out of work in the future. That's why comedian Sarah Silverman is suing OpenAI and Meta as part of a class action lawsuit. She alleges that the two companies trained on her written work by using datasets that contained text from her book, The Bedwetter. There are also lawsuits over image rights and the use of open source computer code.

The use of generative AI is also one of the reasons why writers and actors are on strike, with both of their unions, the WGA and SAG-AFTRA, fearing that studios will train AI models on artists' words and images and simply generate new content without compensating the original human creators.

But you, the average person, might not have intellectual property to protect, or at least your livelihood may not depend on it. So your concerns might be more about how companies like OpenAI are protecting your privacy when their systems scoop your data up, remix it, and spit it back out. Regulators, lawmakers, and lawyers are wondering about this, too. Italy, which has stronger privacy laws than the US, even temporarily banned ChatGPT over privacy issues. Other European countries are looking into doing their own probes of ChatGPT.
The Federal Trade Commission has also set its sights on OpenAI, investigating it for possible violations of consumer protection laws. The agency has also made it clear that it will keep a close eye on generative AI tools. But the FTC can only enforce what the laws allow it to.

President Biden has encouraged Congress to pass AI-related bills, and many members of Congress have said they want to do the same. Congress is notoriously slow-moving, however, and has done little to regulate or protect consumers from social media platforms. Lawmakers may learn a lesson from this and act faster when it comes to AI, or they may repeat their mistake. The fact that there is interest in doing something relatively soon after generative AI's introduction to the general public is promising. "The pace at which people have introduced legislation and said they want to do something about [AI] is, like, 9 million times faster than it was with any of these other issues," Winters said.

But it's also hard to imagine Congress acting on data privacy. The US doesn't have a federal consumer online privacy law. Children under 13 do get some privacy protections, as do residents of states that have passed their own privacy laws. Some types of data are protected, too. That leaves a lot of adults across the country with very little by way of data privacy rights.

We will likely be looking to the courts to figure out how generative AI fits with the laws we already have, which is where people like Clarkson come in. "This is a chance for the people to have their voice heard, through these lawsuits," he said. "And I think that they're going to demand action on some of these issues that we haven't made much progress through the other channels thus far. Transparency, the ability to opt out, compensation, ethical sourcing of data: those kinds of things."
In some instances, Clarkson and Tim Giordano, a partner at Clarkson Law Firm who is also working on these cases, said there's existing law that doesn't explicitly cover people's rights with generative AI but which a judge can interpret to apply there. In others, there are things like California's privacy law, which requires companies that share or sell people's data to give them a way to opt out and delete their information. "There's currently no way for these models to delete the personal information that they've learned about us, so we think that that's a clear example of a privacy violation," Giordano said.

ChatGPT's opt-out and data deletion tools, for example, apply only to data collected from people using the ChatGPT service. OpenAI does offer a way for people in "certain jurisdictions" to ask that their data not be processed by its models, but it doesn't guarantee it will comply, and it requires that you provide evidence that your data was processed in the first place.

Although OpenAI recently changed its policy and stopped training models on data provided by its own customers, another set of privacy concerns crops up with how these models use the data you give them when you use them, and with the information they release into the wild. "Customers clearly want us not to train on their data," Sam Altman, CEO of OpenAI, told CNBC, an indicator that people aren't comfortable with their data being used to train AI systems, though only some are given the chance to opt out of it, and in limited circumstances.

Meanwhile, OpenAI has been sued for defamation over a ChatGPT response that falsely claimed that someone had defrauded and stolen money from a nonprofit. And this isn't the only time a ChatGPT response has levied false accusations against someone.

So what can you currently do about any of this? That's what's so tricky here.
A lot of the privacy issues now are the result of a failure to pass real, meaningful privacy laws in the past that could have protected your data before these datasets and technologies existed. You can always try to minimize the data you put out there now, but you can't do much about what's already been scraped and used. You'd need a time machine for that, and not even generative AI has been able to invent one yet.

Sara Morrison, senior reporter