Exploring the LAION-Aesthetics Image Dataset (part 1)

Hello everyone. While a couple of my experiments are blocked for various reasons, I decided to look into LAION’s image datasets and explore what’s inside.

Why?

I have two main reasons for doing this:

It’s important to understand the different parts that make up Stable Diffusion (or other AI models), and the images used to train the model are at the very core of it. Understanding what went into training should help us optimize different workflows.
To sharpen my technical skills and get used to working with such databases for future tasks, whether that is fine-tuning or other possible use cases.

Potential Practical Use-cases

There is more to it than just learning about what’s inside the dataset. There are very practical things that these image datasets can be used for:

Regularization image sets: for Dreambooth fine-tunes in particular, one can scrape these datasets for specific class images and use them for regularization. We haven’t experimented much with this vs. using SD-generated images, but it looks like a good opportunity to achieve interesting results.
Fine-tuning on specific subsets: one can scrape images of a specific subject or object, select a high-quality subset, potentially re-caption them, and then fine-tune the model on this dataset. With a well-curated set, higher-quality output is likely.
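As a rough illustration of the subset idea above, here is a minimal sketch that narrows down metadata rows before any image is downloaded. It assumes each row is a dict with text, aesthetic, width, and height keys (mirroring the columns described later in this post). The function name, the sample rows, and the thresholds are all made up for illustration.

```python
def select_candidates(rows, keyword, min_aesthetic=7.5, min_side=512):
    """Return rows whose caption mentions `keyword` and that pass both cut-offs."""
    needle = keyword.lower()
    return [
        row for row in rows
        if needle in row["text"].lower()                   # caption mentions the subject
        and row["aesthetic"] >= min_aesthetic              # hypothetical quality cut-off
        and min(row["width"], row["height"]) >= min_side   # skip small images
    ]


# Made-up metadata rows, not taken from the real dataset.
sample_rows = [
    {"text": "Golden retriever puppy in a field", "aesthetic": 7.9, "width": 1024, "height": 768},
    {"text": "Puppy thumbnail", "aesthetic": 7.8, "width": 200, "height": 150},
    {"text": "Mountain lake at sunrise", "aesthetic": 8.2, "width": 1600, "height": 1200},
    {"text": "Sleepy PUPPY on a couch", "aesthetic": 7.1, "width": 800, "height": 800},
]

candidates = select_candidates(sample_rows, "puppy")
print([row["text"] for row in candidates])
```

A plain keyword match is crude (captions are noisy, as the similarity examples further down show), so a manual review of the selected rows would still be needed.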

Note: I won’t cover how to scrape images using this database in this post, but you can look into the EveryDream toolkit for a fairly straightforward way to do so. Link

What Are the LAION Dataset and LAION-Aesthetics

The most relevant part to mention here is that this is THE dataset that was used to create the Stable Diffusion model. Link

LAION 5B is a large-scale dataset for research purposes consisting of 5.85B CLIP-filtered image-text pairs. 2.3B contain English language, 2.2B are samples from 100+ other languages, and 1B samples have texts that do not allow a certain language assignment.

There are also several subsets generated from this large dataset. In this case, we will be exploring the so-called LAION-Aesthetics subsets. These subsets are meant to contain higher-quality images. To select them, a separate model was trained to predict the rating people gave when asked, “How much do you like this image on a scale from 1 to 10?”. As a result, each image in the LAION database got an aesthetic score. Link

For this particular exploration, we decided to use the Version 1 subset of English-captioned images that have an aesthetic score of 7 or higher. Link to the subset.

A few findings

The dataset has 52,068,913 rows, i.e., this subset is ~0.9% of the larger 5B dataset, so if you fail to find images of specific things here, consider using larger subsets.
The dataset has the following columns:

1. URL: public URL of the image (i.e., the images are not stored in this database itself; it only holds their URLs from all over the internet)
2. TEXT: caption of the image, which should describe it
3. WIDTH and HEIGHT: self-explanatory width and height of the image
4. Similarity: I assume this is a score that assesses how close the caption is to the image
5. Hash: hash of the image; I am unsure which type of hash it is
6. Punsafe: I assume this is an NSFW probability or assessment
7. Pwatermark: the probability that the image has a watermark on it
8. Aesthetic: the aesthetic score assigned by the scoring model described above
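To make the row count and the column list a bit more concrete, here is a small sketch that checks the ~0.9% figure and models one row as a typed record. The field names simply mirror the list above; their exact spelling, casing, and stored types in the actual files are assumptions, and the example row is made up.

```python
from dataclasses import dataclass

SUBSET_ROWS = 52_068_913       # row count reported above
LAION_5B_ROWS = 5_850_000_000  # 5.85B image-text pairs in the full dataset


@dataclass
class Row:
    url: str
    text: str
    width: int
    height: int
    similarity: float
    image_hash: str    # hash type is unknown, so the stored type is an assumption
    punsafe: float
    pwatermark: float
    aesthetic: float


def subset_share_percent(subset_rows: int, total_rows: int) -> float:
    """Share of the full dataset covered by the subset, in percent."""
    return 100.0 * subset_rows / total_rows


share = subset_share_percent(SUBSET_ROWS, LAION_5B_ROWS)
print(f"{share:.2f}% of LAION-5B")

# A made-up row, only to show the shape of the data.
example = Row(
    url="https://example.com/cat.jpg", text="a cat on a sofa",
    width=640, height=480, similarity=0.31, image_hash="0000",
    punsafe=0.01, pwatermark=0.05, aesthetic=7.2,
)
```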

The distribution of aesthetic scores looks like this, so we can see a very rapid drop in counts as the score goes up:
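A distribution like this can be tabulated by bucketing the scores into fixed-width bins. The sketch below is one way to do that in plain Python; the 0.5 bin width and the sample scores are made up, and for the full 52M rows a dataframe library would be the more practical choice.

```python
import math
from collections import Counter


def score_histogram(scores, bin_width=0.5):
    """Count scores per bin; each key is the lower edge of its bin."""
    counts = Counter(math.floor(s / bin_width) * bin_width for s in scores)
    return dict(sorted(counts.items()))


# Made-up scores, chosen only to mimic the "rapid drop" shape.
sample_scores = [7.0, 7.1, 7.2, 7.3, 7.4, 7.6, 7.8, 8.1]
hist = score_histogram(sample_scores)

for lower_edge, count in hist.items():
    print(f"{lower_edge:.1f}+  {'#' * count}")
```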

Another useful metric to look at is the frequency of different image resolutions. From this, we can deduce how some images were cropped during training and why we might see headless image generations or weapons that look like they are zoomed into the center. This also has implications for various fine-tuning needs. Here are the top 20 resolution combinations of this subset:
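A resolution ranking of this kind, together with the cropping effect it hints at, can be sketched as follows. The square center crop is an assumption made for illustration, not a confirmed detail of the training pipeline, and the sizes are made up.

```python
from collections import Counter


def top_resolutions(sizes, n=20):
    """Most common (width, height) pairs with their counts."""
    return Counter(sizes).most_common(n)


def center_crop_box(width, height):
    """(left, top, right, bottom) of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)


# Made-up (width, height) pairs.
sizes = [(1024, 768), (640, 480), (1024, 768), (600, 900), (1024, 768), (640, 480)]

top = top_resolutions(sizes, n=2)
box = center_crop_box(600, 900)

print(top)
print(box)
```

For the 600x900 portrait image, the square crop drops the top and bottom 150 pixels, which is the kind of cut that would remove a head from a full-body shot.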

Scores vs. Images

This section is more anecdotal than analytical, but I thought it would be fun to look at images with top and bottom scores based on the various fields in the database (similarity, aesthetic, Pwatermark, and Punsafe).

Top Aesthetic Scores

It’s interesting to see a certain pattern of what the V1 scorer thinks deserves the highest score:

Top Aesthetic

Bottom Aesthetic Scores

I’m unsure whether this subset is objectively worse, but the difference is apparent.

Bottom Aesthetic

Top Pwatermark Scores

As you can see, false positives are possible, especially if images contain some text, so I wouldn’t rely too heavily on this score to filter out images, especially for some specific use cases.

Top Watermark

Bottom Pwatermark Scores

On the other hand, I saw no false negatives among the lowest-scoring images, so it’s probably OK to rely on this score if we strictly want to find images that do not have a watermark on them.

Bottom Watermark
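If the goal is strictly to find watermark-free images, the low end of the score is the part to keep. Here is a minimal sketch, assuming rows are dicts with a pwatermark key; the 0.1 cut-off and the sample rows are made up, and a sensible threshold would need to be checked against real images.

```python
def likely_unwatermarked(rows, max_pwatermark=0.1):
    """Keep only rows whose watermark probability is at or below the cut-off."""
    return [row for row in rows if row["pwatermark"] <= max_pwatermark]


# Made-up rows, not taken from the real dataset.
sample_rows = [
    {"url": "https://example.com/a.jpg", "pwatermark": 0.02},
    {"url": "https://example.com/b.jpg", "pwatermark": 0.45},
    {"url": "https://example.com/c.jpg", "pwatermark": 0.97},
    {"url": "https://example.com/d.jpg", "pwatermark": 0.10},
]

clean = likely_unwatermarked(sample_rows)
print([row["url"] for row in clean])
```

Filtering in this direction trades recall for precision: some clean images with text in them will be dropped, which matches the false positives seen at the top of the score range.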

Top Punsafe Scores

I thought this part would be risky, but none of the top images were NSFW. Remember that this is a very small subset and an anecdotal observation, so do not assume there won’t be any.

Top Punsafe

Bottom Punsafe Scores

The main difference is that these images do not even contain humans, and nothing in them is remotely NSFW.

Bottom Punsafe

Top Similarity Scores

Judge for yourself, but I think this is quite telling about the quality of overall tagging and the room for improvement:

a modern twist on affogato, this dirty chai affogato drowns a generous scoop of homemade chai ice cream with a shot of hot espresso

Danville Bookcase With Doors 42 Wide

Isobel dress from Ohh by Gum €97.95 Connemara Life 2015 Seasons of Ireland on the Wild Atlantic Way

Paper Doll - Print Dress

Park Icon Sign Set Bear Chasing Man Into Trailer

Bottom Similarity Scores

To be honest, these are better than I expected, but I also observed a lot of digits-only captions among these images that I decided not to include here as examples. Interestingly, lower similarity scores seem to correlate with smaller image sizes:

031 copyblogthumbnail

1947 Plymouth

Balcony

South by Southwest Cornbread Salad
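The apparent link between low similarity and small image size is only an impression from a handful of examples. One way to check it would be a correlation between similarity and pixel area, sketched below with a hand-written Pearson coefficient. The rows are made up, so the number printed here says nothing about the real dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up rows, not taken from the real dataset.
sample_rows = [
    {"similarity": 0.10, "width": 100, "height": 100},
    {"similarity": 0.20, "width": 200, "height": 200},
    {"similarity": 0.30, "width": 300, "height": 300},
    {"similarity": 0.40, "width": 400, "height": 400},
]

similarities = [row["similarity"] for row in sample_rows]
areas = [row["width"] * row["height"] for row in sample_rows]

r = pearson(similarities, areas)
print(f"correlation between similarity and pixel area: {r:.2f}")
```

A value near zero on the real data would suggest the pattern in the examples above is a coincidence.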
