Inside LAION-5B, an AI training dataset of 5B+ images that has been unavailable for download after researchers found 3,000+ instances of CSAM in December 2023

All — The — Way — Down — This one is sooo good. I recommend this to anyone playing with #AI to understand the biases and the complexities. Oh and the discussion of alt text is amazing. …

Erik Jonker / @ErikJonker@mastodon.social : Just wow...amazing website/visualization about LAION-5B , a large dataset a lot models are trained on. — https://knowingmachines.org/ ... #AI #bigdata #LAION5B #trainingdata #CSAM

Larry O'Brien / @Lobrien@hachyderm.io : Absolutely wonderful article about how datasets are profoundly influenced by near-arbitrary magic numbers, self-selecting human reviewers, and WEIRD over-representation. https://knowingmachines.org/ ... #ML

Bastian Greshake Tzovaras / @gedankenstuecke … : «Models All The Way Down» is a really interesting exploration of how an image data set (in this case LAION-5B) was assembled to be used for training ML/"AI" models! — > “It contains less about how humans see the world than it does about how search engines see the world. …

X:

Ethan Zuckerman / @ethanz : This is both terrific research and absolutely beautiful presentation of data. Adding this to my “recommended readings” on AI and ethical implications - super smart and affecting work.

Kyle Geske / @stungeye : The LAION dataset is almost entirely machine-curated using ML models, often in faulty or biased ways. Where human rankings are used, they come from a very small number of usually western folks with niche tastes.

Maria Popova / @brainpicker : “Investigating training sets is an essential avenue to understanding how generative AI models work; the ways they see and re-create the world.” Models All the Way Down — fantastic (and haunting in its intimations) project by my friend @blprnt https://knowingmachines.org/ ...

@iethics : “There are models on top of models, and trainings sets on top of training sets. Omissions and biases and blind spots from these stacked-up models and training sets shape all of the resulting new models and new training sets”: https://knowingmachines.org/ ... #ethics #data #AI #research

@iethics : Another “truth about generative #AI: The [concept] of what is... visually appealing can be influenced in outsized ways by the tastes of a very small group of individuals, and the processes that are chosen by dataset creators to curate the datasets” https://knowingmachines.org/ ... #data

@iethics : “The tiniest of shifts in LAION's thresholds could have excluded or included hundreds of millions of images. What the images contain plays no role at all in deciding what stays and what goes.” https://knowingmachines.org/ ... #ethics #AI #data

Matt Ocko / @mattocko : Visually impactful & thoughtful piece on the serious problem of recursive bad data poisoning of large models' training data sets — and hence the models themselves https://knowingmachines.org/ ... H/t @mattgreenfield

Kate Crawford / @katecrawford : 👉Today we're launching this investigation into LAION-5B, the blockbuster dataset behind Midjourney and Stable Diffusion. It's a deep dive into how the dataset was made, and where the images come from. The brilliant @christo_buschek & @blprnt follow the models all the way down.

Madeeha Merchant / @madeehamerchant : Ever wondered about the anatomy of GenAI ? Amazing work, X-raying the LAION-5B dataset ! Always a fan of @blprnt

Christo Buschek / @christo_buschek : The AI field's goal is nothing less than to transform the world. But what are the foundations upon which this transformation is built? In this investigation, @blprnt and I looked at LAION-5B, the only open-source foundation dataset currently available.

LinkedIn:

Matt Greenfield : This is one of the best explanations I have seen of the way that small groups of humans and arbitrary selection criteria create not just bias …

Forums:

Hacker News : Models all the way down

Knowing Machines 2024-03-31

