AI tools were trained on scraped content full of toxicity and hate speech. Here's how we can fix it.
[❤️ Mozilla ❤️](
[Donate](

Hello,

ChatGPT and other generative AI tools were trained on a huge dataset full of toxic content and hate speech, according to new research by Mozilla.1 The huge dataset, totaling 9.5 million gigabytes and assembled by the small non-profit organisation Common Crawl, is the original data source for many of the large language models (LLMs) that make up the AI landscape of today's internet.

And now OpenAI, Microsoft and Google are rolling out AI tools to be used by people worldwide, built on data scraped from some of the worst parts of the internet. These tools are both biased, because they were trained on toxic content, and opaque, because we don't know exactly what content they were trained on.

Almost every other product we use or consume on a daily basis has a safety warning label or an ingredients list. As customers, why shouldn't we have the right to know what's inside the AI tools we are using?

Together, let's use our power as consumers and put pressure on OpenAI, Google, and Microsoft to tell us what's inside their AI.

[Sign Mozilla's petition and tell OpenAI, Google, and Microsoft to provide transparency about the data used to train their AI tools!](

[Sign Now →](

Common Crawl has been crawling and archiving the internet virtually undetected, and its data is now used to train AI. Today, it's the most influential nonprofit you've never heard of.

We're at an inflection point for AI, and Mozilla's investigation has uncovered structural flaws in the way Common Crawl is currently used to train AI models. The main problems are:

- Common Crawl's data represents only a fraction of the internet: it primarily captures English-language content, which means AI tools trained on it are helpful only for a narrow part of the population and have a biased perspective.
- Common Crawl's data contains hate speech and explicit content that is harmful when used to train consumer products without care.
- Common Crawl hands its dataset to companies and then walks away. That means companies like OpenAI, Google, and Microsoft are accountable for explaining how they filtered Common Crawl's data, what effect the data has on their AI products, and what measures they take to address harms from biased and explicit datasets.

When it comes to building trustworthy AI products, better is possible. We need to know the totality of how AI is trained so we understand its risks and limitations, and, most importantly, what needs to be improved to make it trustworthy and helpful for everyone on the internet. Better must start with more transparency from the big tech companies responsible for training AI models.

[Tell OpenAI, Google, and Microsoft to provide transparency about the data used to train their AI tools.](

[Sign Now →](

Thank you for all you do for the internet.

Christian Bock
Head of Supporter Engagement
Mozilla

---------------------------------------------------------------

More information:

1. Mozilla Foundation: [Training Data for the Price of a Sandwich: Common Crawl's Impact on Generative AI](. Written by Stefan Baack and Mozilla Insights. Published 6 February 2024.

Connect with us [Twitter]( [YouTube]( [Instagram](

Thanks for reading! You're receiving this email because you subscribed to Mozilla News. If you no longer want to receive our emails, we'll understand if you [unsubscribe](. You can also [update your email preferences]( at any time.

149 New Montgomery St, 4th Floor, San Francisco, CA 94105 USA

[Legal]( • [Privacy]( [Unsubscribe](