TEXXR

Chronicles

The story behind the story


Stanford researchers: LAION-5B, a dataset of 5B+ images used by Stability AI and others, contains 1,008+ instances of CSAM, possibly helping AI to generate CSAM

Threads:
  • Alex Stamos / @alex.stamos: Lots of people have worried about CSAM in training sets, including the LAION team themselves, but David actually created a novel mechanism to detect it. Hopefully this will change the way these training sets are created in the future.
  • David Thiel / @elegant_wallaby: We used a combination of methods to determine this: perceptual hashing, cryptographic hashing, and k-nearest neighbors analysis using the image embeddings. Seeded from a small subset of the dataset, PhotoDNA identified hundreds of instances, the URLs of which were reported to NCMEC.
  • Alex Stamos / @alex.stamos: How does Stable Diffusion 1.5 know how to create CSAM? It turns out it was trained on thousands of illegal images contained in the extremely popular LAION-5B image set. I'm so incredibly proud of my friend and colleague @elegant_wallaby. Story: https://www.404media.co/... Paper: https://stacks.stanford.edu/...

Mastodon:
  • David Thiel / @det@hachyderm.io: As a follow-up to our work on computer-generated CSAM, we took a closer look at the training data used to train various generative models—most prominently, Stable Diffusion 1.5—to see to what degree CSAM itself might be present in the training data. https://purl.stanford.edu/...
  • Felix Reda / @senficon@ohai.social: This research highlights a ton of issues with poorly controlled training datasets, and how little care has gone into cleaning them up before public release: https://www.404media.co/...
  • Jason Koebler / @jasonkoebler@mastodon.social: A year ago, Motherboard published a story about nonconsensual porn and terrorist beheadings sitting in this dataset. The issue was being openly talked about in the development team's Discord. We asked about the issue and the team started deleting messages and said this: https://www.vice.com/...
  • Jason Koebler / @jasonkoebler@mastodon.social: A few things: 1) This is incredibly important research by Stanford, David Thiel, and the external folks who worked on this. 2) Sam & Emanuel have been trying to understand and quantify this problem for more than a year. It has been clear the LAION dataset has CSAM in it since at least 2021.

X:
  • Abeba Birhane / @abebab: the LAION dataset gave us a glimpse into corp datasets locked in corp labs like those in OpenAI, Meta, & Google. you can be sure, those closed datasets — rarely examined by independent auditors — are much worse than the open LAION dataset
  • @ttp_updates: .@StanfordIO has identified 3,000+ likely child sexual abuse images in a popular AI dataset. As @samleecole writes, these findings challenge previous explanations for AI-generated CSAM, which was assumed to combine explicit images of adults with non-explicit images of minors.
  • @ceehawk: AGAIN- AI, specifically AI art, is not some software magically using ur ideas to create something for u to consume like a Star Trek replicator, it's a mashup of existing images lifted from actual creators AND NOW it's including images that shouldn't even exist! Stop using it!
  • @lockdownurlife: Not once do they think about consequences. Because they keep repeating the same mistakes over and over, even when advocates tell them, please consider how this could cause harm, and build in safeguards.
  • Avijit Ghosh / @evijitghosh: At this point nothing short of a politically savvy regulatory machine will stop these people because these warnings have been given by RAI folks since the imagenet days (albeit with much less disgusting issues) and yet accelerationists love to repeat the same cycle of doom.
  • Hal / @halhod: isn't LAION inside every big model?
  • Peter Wells / @peterkwells: “If you have downloaded that full dataset for whatever purpose, for training a model for research purposes, then yes, you absolutely have CSAM, unless you took some extraordinary measures to stop it” https://www.404media.co/...
  • @iwillleavenow: You know how a bunch of us kept saying the GAI policy of indiscriminately scraping as much data as possible from the internet was a bad thing? It is a bad thing. https://www.404media.co/...
  • @stealcase: Whoa, this is an excellent (and horrible) point I didn't even consider! I thought it would be possible to “clean” the dataset, but this makes it clear it's too late. @laion_ai don't you dare put up the dataset again!
  • Davey Alba / @daveyalba: LAION-5B was released in 2022 and underpins multiple text-to-image AI models, the most popular of which is Stable Diffusion. But rumors about the dataset including CSAM have circulated in online forums and on social media for months https://www.bloomberg.com/...
  • Neil Turkewitz / @neilturkewitz: BREAKING: LAION recognizes that collecting & using “data” isn't necessarily ethical. Will the next domino to fall be a rejection of “data” that's been collected without consent? That would be monumental—a Revolution for the people, safeguarding the role of consent in a digital 🌏
  • Samantha Cole / @samleecole: LAION has known this is a problem for a long time. in 2021, its lead engineer said: “I guess distributing a link to an image such as child porn can be deemed illegal. We tried to eliminate such things but there's no guarantee all of them are out.” https://www.404media.co/...
  • Abeba Birhane / @abebab: not surprising, tbh. we found numerous disturbing and illegal content in the LAION dataset that didn't make it into our papers. this is a win for individuals, especially children in the dataset subject to sexual abuse, but an overall loss for dataset curation/audits/accountability.
  • Samantha Cole / @samleecole: big breaking news: LAION just removed its datasets, following a study from Stanford that found thousands of instances of suspected child sexual abuse material https://www.404media.co/...

LinkedIn:
  • Cal Al-Dhubaib: The real challenge in #GenAI development isn't just in training models, but in rigorously vetting the data we use. …

Forums:
  • Hacker News: Identifying and Eliminating CSAM in Generative ML Training Data and Models
  • Ars OpenForum: Child sex abuse images found in dataset training image generators, report says
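Thiel's post names three matching techniques: perceptual hashing, cryptographic hashing, and k-nearest-neighbor search over image embeddings. A minimal toy sketch of the first and third is below; all function names, hash sizes, and vectors are illustrative only, not the Stanford team's actual tooling (which relied on PhotoDNA and production embedding indexes):

```python
# Illustrative sketch only: a tiny average-hash perceptual hash and a
# brute-force cosine-similarity kNN over embedding vectors. Real pipelines
# use PhotoDNA-style hashes and approximate-nearest-neighbor indexes.

def average_hash(pixels):
    """Bit i is 1 if pixel i is above the image's mean brightness."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return sum(x != y for x, y in zip(a, b))

def knn(query, embeddings, k=3):
    """Indices of the k embeddings most cosine-similar to `query`."""
    def cos(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        nu = sum(x * x for x in u) ** 0.5
        nv = sum(y * y for y in v) ** 0.5
        return dot / (nu * nv)
    return sorted(range(len(embeddings)),
                  key=lambda i: -cos(query, embeddings[i]))[:k]

# Two near-identical 4x4 "images": small pixel noise, identical hash.
img_a = [[10, 200, 10, 200]] * 4
img_b = [[12, 198, 11, 199]] * 4
assert hamming(average_hash(img_a), average_hash(img_b)) == 0  # match

# Seed-based expansion: from known-bad embeddings, pull nearest neighbors.
seed = [1.0, 0.0, 0.0]
corpus = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [1.0, 0.05, 0.0], [0.0, 0.0, 1.0]]
print(knn(seed, corpus, k=2))  # → [2, 0]
```

In a real workflow the perceptual-hash matches would then be confirmed against known-content databases via cryptographic hashes, which is why the combination of the three methods matters: perceptual hashing tolerates re-encoding, cryptographic hashing gives certainty, and embedding kNN expands coverage from a seed set.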

Bloomberg

Discussion

  • @renee.diresta Renee DiResta on threads
    🚨 Really important - and impactful - child safety work out this morning by my SIO colleague David Thiel.  A significant data set for training AI models had some pretty terrible stuff in it; it has been taken down following the findings. …
  • @alexandra.levine Alexandra S. Levine on threads
    Stable Diffusion is *not* the only model trained on this dataset containing CSAM, called LAION-5B; Midjourney also uses it, and likely others in the space.  Stanford focused on Stable Diffusion because it's a large open source model that discloses its training data. …
  • @elegant_wallaby David Thiel on threads
    I'm not sure what the legal implications are for this; most CSAM possession laws were made with the assumption that only huge service providers would have this much storage of mixed data, and they generally have detection and reporting flows.  But all LAION-5B images can fit in a…
  • @elegant_wallaby David Thiel on threads
    Fixing this problem is going to be difficult.  The datasets are already out there, and the models are already trained.  While we've made good progress in getting content removed from the source URLs, removing it from public datasets gives people a map to CSAM and its associated i…