AI-Generated Images Are Tricky to Spot
In late November of 2022, I launched a research quiz on human recognition of AI-generated imagery, known simply as the AI Quiz. The quiz presented participants with thirty images randomly selected from a pool of both human-created and AI-generated art and photographs. The results I’ll discuss in this article were gathered between November 30, 2022 and December 8, 2022, inclusive.
The Setup
Participants were presented with the AI Quiz, which served them 30 questions. The images served in each quiz were kept at a relatively balanced human-to-AI ratio, and were randomly picked from a total pool of 143: 113 AI-generated images and 30 human-made images (or photographs). For the AI-generated images, I used Stable Diffusion 1.5 and AniPlus v2 (a homegrown model based on a few other Stable Diffusion models) in AUTOMATIC1111’s Stable Diffusion WebUI (commit 98947d173e3f1667eba29c904f681047dea9de90) with the settings eta: 0.69, steps: 15, sampler: DPM++ 2S a Karras. I also threw in 5 images from Midjourney’s popular page. Human-made images were grabbed from Artstation (before the place was invaded by AI-generated work) and Unsplash. There was no cherry-picking either; I took everything the AI generated that looked even remotely like the prompt.
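For reference, here is a rough sketch of what those generation settings look like in code. It uses the diffusers library rather than the WebUI the quiz images actually came from, and diffusers’ DPMSolverSinglestepScheduler is only an approximation of the WebUI’s “DPM++ 2S a Karras” sampler (it isn’t ancestral), so treat this as an illustration of the settings, not a reproduction of my pipeline. The prompt is a placeholder.

```python
import torch
from diffusers import StableDiffusionPipeline, DPMSolverSinglestepScheduler

# Stable Diffusion 1.5 via diffusers; the quiz images were actually made with AUTOMATIC1111's WebUI.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Closest diffusers analogue to "DPM++ 2S a Karras" (not ancestral, so not an exact match).
pipe.scheduler = DPMSolverSinglestepScheduler.from_config(
    pipe.scheduler.config, use_karras_sigmas=True
)

# 15 steps and eta 0.69 as in the quiz setup; eta only affects certain samplers in diffusers.
image = pipe(
    "placeholder prompt",  # the actual prompts aren't listed in this article
    num_inference_steps=15,
    eta=0.69,
).images[0]
image.save("sample.png")
```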
Each image was assigned tags based on its content (portrait/landscape, suit, sunrise, etc.), so I could identify which types of images people were able to classify well (or not). Human-made images only got one tag, though: the site they were sourced from.
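As an illustration of how a session could be assembled from that pool, here is a minimal sketch of balanced random sampling. The image records, the exact 15/15 split, and the draw_session helper are hypothetical; they just mirror the pool sizes and the “relatively balanced” ratio described above.

```python
import random

# Hypothetical image records mirroring the pool sizes above: 113 AI-generated, 30 human-made.
ai_pool = [{"id": f"ai_{i}", "source": "ai", "tags": ["portrait"]} for i in range(113)]
human_pool = [{"id": f"human_{i}", "source": "human", "tags": ["artstation"]} for i in range(30)]

def draw_session(n_questions: int = 30) -> list:
    """Draw one quiz session with a roughly balanced human/AI split (assumed 15/15)."""
    n_human = n_questions // 2
    session = random.sample(human_pool, n_human) + random.sample(ai_pool, n_questions - n_human)
    random.shuffle(session)  # so the order gives nothing away
    return session

print([img["source"] for img in draw_session()])
```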
Results at a Glance
In case you don’t want to read the whole article just to see what the results were, here’s a nice summary in table form.
| Demographic | % Identified Human Images | % Identified AI Images | % Total |
|---|---|---|---|
| Hacker News | 65.3 | 67.8 | 66.2 |
| College students | 70.4 | 69.2 | 69.3 |
| Stable Diffusion users | 73.2 | 81.6 | 76.8 |
| Discord users | 70.1 | 74.6 | 72.5 |
| Uncategorized | 67.5 | 72.1 | 69.5 |
| Overall | 65.9 | 68.7 | 67.0 |
It seems that the average person is only able to identify AI-generated images about two thirds of the time (and no, Hacker News is not any better at this than the average person). Of course, there’s one significant outlier: Stable Diffusion users. Amazingly, they identified 81.6% of AI-generated images, almost thirteen percentage points higher than the average.
It’s worth noting that these results are calculated from all submitted responses, including those from unfinished quiz sessions; however, excluding them would not change the average scores significantly: the difference is less than 1%.
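For those curious how a table like this falls out of the raw responses: group each answer by demographic and by what the image actually was, then average a correctness flag. The column names below are assumptions for illustration, not the schema of the CSV files linked at the end.

```python
import pandas as pd

# Hypothetical response rows; one row per answered question.
responses = pd.DataFrame({
    "demographic":  ["Hacker News", "Hacker News", "Stable Diffusion users"],
    "image_source": ["human", "ai", "ai"],   # what the image actually was
    "correct":      [1, 0, 1],               # 1 if the participant classified it correctly
})

# Percentage correctly identified, split by demographic and image source.
summary = (
    responses.groupby(["demographic", "image_source"])["correct"]
    .mean()
    .mul(100)
    .round(1)
    .unstack("image_source")
)
print(summary)
```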
“I give up?”
The quiz was started 10,873 times, yet only 4,410 of those sessions (40.6%) were actually completed.
| Demographic | Abandoned Quiz Sessions | Completed Quiz Sessions | Total |
|---|---|---|---|
| Hacker News | 5,314 | 3,406 | 8,720 |
| College students | 82 | 54 | 136 |
| Stable Diffusion users | 76 | 67 | 143 |
| Discord users | 1 | 7 | 8 |
| Uncategorized | 990 | 876 | 1,866 |
| Overall | 6,463 | 4,410 | 10,873 |
Almost sixty percent of people didn’t make it through the quiz! Perhaps it was simply longer than expected, but I suspect the difficulty may have been a factor. People who completed the quiz spent about 34 seconds per question, while those who abandoned it spent about 69 seconds per question: roughly twice as long. Still, people who abandoned the quiz didn’t perform much worse than those who finished it, so it’s a shame they quit early.
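The completion and timing figures above come down to simple per-session arithmetic; here is a sketch with assumed column names (completed, seconds_spent, questions_answered), not the actual schema of my data.

```python
import pandas as pd

# Hypothetical per-session rows.
sessions = pd.DataFrame({
    "completed":          [True, False, True],
    "seconds_spent":      [1020, 276, 1110],
    "questions_answered": [30, 4, 30],
})

completion_rate = sessions["completed"].mean() * 100
seconds_per_question = (
    sessions.assign(sec_per_q=lambda df: df["seconds_spent"] / df["questions_answered"])
    .groupby("completed")["sec_per_q"]
    .mean()
)
print(f"{completion_rate:.1f}% of sessions were completed")
print(seconds_per_question)  # mean seconds per question: abandoned (False) vs completed (True)
```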
Another Perspective
The results from this quiz also provide a way to benchmark the quality of different image generation models: the more often participants misclassify a model’s images as human-made, the better that model is at producing convincing output.
| Model | % Misclassified as Human-Made |
|---|---|
| Stable Diffusion 1.5 | 21.8 |
| AniPlus v2 | 36.0 |
| Midjourney v4 | 47.3 |
Notably, Midjourney fooled participants nearly half of the time. Vanilla Stable Diffusion only fooled participants ~20% of the time, which isn’t great, but isn’t terrible considering The Setup.
Closing Thoughts
We’ve seen 34.1% (100% - 65.9% correct) of human-made images misclassified as AI-generated. In some respects, this is arguably worse than misclassifying AI-generated images as human-made: artists are already being falsely accused of using AI. Some artists who are against AI may blame the models for the confusion, but starting online witch hunts is never acceptable regardless. That aside, ~34% is a pretty terrible error rate if a site has to task humans with moderating away AI-generated images. At this time, I’d expect automated detection systems to perform even worse.
On a more positive note, the high score of Stable Diffusion users shows that people who use image generation AIs are more likely to recognize their output. It’s no secret that Midjourney has a distinctive style, and other models (NovelAI, etc.) seem to as well. This is probably true for other types of generative AI too; for example, after using ChatGPT since its launch, I can now recognize its writing pretty quickly. Editing in Photoshop or a similar tool could of course sidestep this a little. Besides that, as models continue to improve, I don’t expect the majority of people to be able to identify their work reliably.
This quiz is based on a limited set of images generated with models that are now months out of date. Newer models such as Stable Diffusion 2.0 will almost certainly produce results that fool humans more often. Because of that, it’s too early to make any definitive statements on whether it’s possible to separate human-made and AI-generated images with high accuracy.
You can find the raw data as CSV files here.
#AI #Machine Learning #Image Generation #Stable Diffusion #Midjourney #Quiz