3 "Limitations" of Stable Diffusion and the Power of Fine-tuning

This week we want to make another framework-type post. A few weeks ago we proposed a framework for Maximizing SD Fine-tuning results (link).

Depiction of the “Three Limitations”

We received somewhat mixed feedback - some of you loved it but others prefer more hands-on, tactical explorations. However, we decided to keep this type of post since we find it useful to establish mental models and frameworks that are relevant regardless of the latest tactical development in the technology. Such frameworks help us develop intuition, broaden the sense of what’s possible, and simply do more impactful stuff.

This time we will talk about some limitations of the original Stable Diffusion and how fine-tuning can address them. For advanced users, all of this will likely seem obvious, but thinking about these limitations explicitly and separately, and actively considering how to address each one through fine-tuning, can lead to better outcomes. For example, we love what was accomplished by the Riffusion project (link), and we believe there are many more such applications to be discovered and developed. Thinking from different perspectives is a good way to discover and unlock them.

Three Limitations and the Power of Fine-tuning

1 - Absence of Certain Images and Subjects in the Training Data

You might know that the Stable Diffusion models were trained on subsets of the LAION-5B dataset (link), a dataset consisting of 5.8 billion(!) CLIP-filtered image-text pairs. Check out our post exploring the dataset (link).

5.8 billion images is a lot, but we need to realize that only a subset of these images was used to train, for example, the version 2.0+ models. First, an “aesthetic” subset was created by keeping only images that meet a minimum aesthetic score, and then further filters were applied to remove NSFW content.
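To make the two-stage filtering concrete, here is a minimal sketch of how such a subset could be selected from image metadata. The field names (`aesthetic_score`, `nsfw_probability`) and both thresholds are assumptions chosen for illustration, not values taken from the actual training pipeline.

```python
from dataclasses import dataclass


@dataclass
class Record:
    url: str
    caption: str
    aesthetic_score: float   # hypothetical field: predicted aesthetic rating
    nsfw_probability: float  # hypothetical field: predicted chance of NSFW content


def filter_records(records, min_aesthetic=4.5, max_nsfw=0.1):
    """Keep records that pass both the aesthetic filter and the NSFW filter."""
    return [
        r for r in records
        if r.aesthetic_score >= min_aesthetic and r.nsfw_probability <= max_nsfw
    ]


# Made-up sample records, only to show the behavior of the filter.
sample = [
    Record("a.jpg", "sunset over a lake", 6.1, 0.01),
    Record("b.jpg", "blurry receipt", 3.2, 0.02),    # fails the aesthetic filter
    Record("c.jpg", "studio portrait", 5.4, 0.80),   # fails the NSFW filter
]
kept = filter_records(sample)
```

Every image dropped at either stage is something the final model never saw, no matter how large the original dataset was.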

This means that despite the massive size of the training data, a huge number of possible images and subjects were never part of it. This is especially true for rare, domain-specific classes: spectrograms like those used by Riffusion, exhibits from a little-known museum in rural Africa, or something as specific as images of my neighbor’s newborn puppy.

Riffusion Spectrogram

Luckily, this is very easy to address with fine-tuning. We can assemble a targeted dataset of the new subjects that we want to introduce to the model, run one of the fine-tuning approaches, and end up with a new model capable of new things. These capabilities can be anything: generating new images of your face, creating music like Riffusion, pumping out game assets for your indie project, or flooding the world with images in your art style.
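As a rough illustration of what a "targeted dataset" can look like on disk, the sketch below pairs image files with captions in a JSON Lines file, one record per line. The file names, captions, and the `file_name`/`text` keys are assumptions; check which layout your fine-tuning tool actually expects.

```python
import json


def build_metadata(captions_by_file):
    """Return JSON Lines text with one image/caption record per line."""
    lines = []
    for file_name, caption in sorted(captions_by_file.items()):
        lines.append(json.dumps({"file_name": file_name, "text": caption}))
    return "\n".join(lines) + "\n"


# Hypothetical images of a new subject the base model has never seen.
captions = {
    "puppy_02.jpg": "a photo of my neighbor's puppy sleeping on a sofa",
    "puppy_01.jpg": "a photo of my neighbor's puppy playing in the garden",
}
metadata = build_metadata(captions)
```

The resulting text can be saved next to the images (for example as `metadata.jsonl`) and handed to whichever fine-tuning approach you choose.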

2 - Mis-labeled or Differently Captioned Images

Even if an image is in the training data, that doesn’t guarantee that its caption describes a given subject in it the way we would want. There are multiple reasons why: the emphasis of the caption might be on a different aspect of the image, the author of the image or photo might not have had the expertise to label it properly, and finally, the CLIP model used to select these text-image pairs might not be able to identify domain-specific things. Let’s take a look at an example (a made-up one to make a point).

The horrid ground-weaver, also known as Nothophantes horridus, is a tiny and extremely rare money spider. There are quite a few images of it on the internet, so chances are that the initial dataset contained some of them. With any web search tool, it is easy to retrieve images of this spider.

Google Search Results

However, if we use any of the LAION exploratory tools (example link), none of the combinations of the terms mentioned above retrieves images of this spider.

LAION Search Result

This could be because the images had generic captions, due to a lack of expertise or a different emphasis by the author (something like “image of a spider”). Alternatively, the CLIP model might not have recognized that an image of a spider captioned as “Nothophantes” is accurately captioned, and so it might have filtered the pair out.
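The second failure mode can be sketched as a similarity check: an image-text pair is kept only if the image embedding and the caption embedding are close enough. The three-dimensional vectors below are made up for illustration (real CLIP embeddings have hundreds of dimensions), and the 0.28 threshold is an assumption, not a value confirmed by this post.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def passes_clip_filter(image_emb, text_emb, threshold=0.28):
    """Keep the pair only if image and caption embeddings are similar enough."""
    return cosine_similarity(image_emb, text_emb) >= threshold


# Toy embeddings, NOT real CLIP outputs.
image_emb = [0.9, 0.1, 0.4]            # photo of the spider
generic_caption_emb = [0.8, 0.2, 0.5]  # "image of a spider"
rare_caption_emb = [0.1, 0.9, -0.3]    # "Nothophantes", a term the model barely knows

keeps_generic = passes_clip_filter(image_emb, generic_caption_emb)
keeps_rare = passes_clip_filter(image_emb, rare_caption_emb)
```

In this toy setup the generic caption survives while the accurate but obscure caption is discarded, which is exactly the situation described above.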

Regardless of the reasons, there is a lot that we can do with fine-tuning. We can either retrieve images from the LAION dataset that are not labeled the way we want, or collect a new set of images from the internet or our own sources, and then caption them accurately using domain-specific terms. That can be done manually or by captioners that have the necessary domain-specific capability and accuracy. Then we run the fine-tuning process and once again get an updated model with new capabilities, whether that means generating accurate domain-specific images or something more creative that comes from a creative approach to captioning.
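The re-captioning step can be sketched as follows: images that an expert has reviewed get a domain-specific caption, and everything else keeps its original one. The record layout, the `expert_labels` mapping, and the caption template are all hypothetical.

```python
def recaption(records, expert_labels, template="a photo of {label}"):
    """Return new records where expert-reviewed files get a domain-specific caption."""
    updated = []
    for record in records:
        label = expert_labels.get(record["file_name"])
        caption = template.format(label=label) if label else record["caption"]
        updated.append({"file_name": record["file_name"], "caption": caption})
    return updated


# Made-up records with generic captions.
records = [
    {"file_name": "spider_01.jpg", "caption": "image of a spider"},
    {"file_name": "spider_02.jpg", "caption": "small bug on a rock"},
    {"file_name": "leaf_01.jpg", "caption": "a green leaf"},
]
# Hypothetical labels supplied by someone with domain expertise.
expert_labels = {
    "spider_01.jpg": "a horrid ground-weaver spider (Nothophantes)",
    "spider_02.jpg": "a horrid ground-weaver spider (Nothophantes)",
}
recaptioned = recaption(records, expert_labels)
```

The function returns new records rather than editing in place, so the original captions stay available for comparison.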

3 - Under/Over Representation of Specific Subjects

Finally, let’s discuss the case where the images were part of the training data and were accurately captioned. Even then, not all subjects in the training data are equally represented. A Hollywood superstar or an extremely popular superhero is likely to be represented by thousands of images, while less popular, niche, domain-specific subjects might have just a handful. For example, searching for images of Johnny Depp in the LAION dataset (link) returns a seemingly endless number of images.

But searching for someone less famous yet still popular, like Alvaro Morte (the lead actor in “Money Heist”), returns only a few images (link).
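A quick way to check representation yourself is to count how often each subject is mentioned in a sample of captions. This is a minimal sketch using a simple case-insensitive substring match; the captions below are made up, and a real check would run over captions exported from the dataset.

```python
def count_mentions(captions, subjects):
    """Count captions that mention each subject (case-insensitive substring match)."""
    counts = {subject: 0 for subject in subjects}
    for caption in captions:
        lowered = caption.lower()
        for subject in subjects:
            if subject.lower() in lowered:
                counts[subject] += 1
    return counts


# Made-up captions standing in for a sample of dataset metadata.
captions = [
    "Johnny Depp at a film premiere",
    "portrait of Johnny Depp, 2015",
    "johnny depp as a pirate",
    "Alvaro Morte in Money Heist",
    "a spider on a leaf",
]
counts = count_mentions(captions, ["Johnny Depp", "Alvaro Morte"])
```

A large gap between the counts is a hint that the rarer subject may need extra attention during fine-tuning.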

As far as we know, the training process for Stable Diffusion didn’t involve any adjustment for such imbalance in representation. This would result in a model that is not capable of generating high-quality output for much less popular subjects.

The solution is once again fine-tuning, and the approach is very simple. For example, retrieving the desired images from the LAION dataset and “showing” them to the model through the fine-tuning process with a higher number of repeats can result in higher-quality output that is closer to what we want.
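The idea of "repeats" can be sketched as simple oversampling: each image of an under-represented subject appears several times in one pass over the data. The subject names, file names, and repeat counts below are hypothetical, and many fine-tuning tools expose an equivalent setting so you rarely need to build this list by hand.

```python
import random


def build_epoch(files_by_subject, repeats_by_subject, seed=0):
    """Build one epoch's file list, repeating files of selected subjects."""
    epoch = []
    for subject, files in files_by_subject.items():
        repeats = repeats_by_subject.get(subject, 1)  # default: show each file once
        epoch.extend(files * repeats)
    random.Random(seed).shuffle(epoch)  # seeded so the order is reproducible
    return epoch


files_by_subject = {
    "popular_actor": ["p1.jpg", "p2.jpg", "p3.jpg", "p4.jpg"],
    "niche_actor": ["n1.jpg"],
}
# Show the single niche image four times so both subjects get equal exposure.
epoch = build_epoch(files_by_subject, {"niche_actor": 4})
```

Very high repeat counts on a handful of images can lead to overfitting, so the number is worth tuning rather than maximizing.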

What’s more, the same approach can be used in multiple ways. For example, selecting a high-quality subset of the LAION dataset on a specific topic and fine-tuning the model on this data will likely result in a model with improved performance in that domain.

Conclusions

In the end, we want to once more highlight the power of fine-tuning when applied in a goal-driven, smart manner. If you are thinking about fine-tuning a model, then most likely you want the model to acquire some capability that the original model doesn’t have.

And if that is the case, the first step should be understanding why those capabilities are not present. Did the training data lack the appropriate images? Was the problem in the captions? Or is it just a matter of proper representation?

Knowing the reason should make solving the issue much easier, because it becomes easier to think about how to build training data that properly addresses it.
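As a rough summary of the three questions, the diagnosis can be written as a small decision function. The inputs (how many images of the subject exist, and how many of those are well captioned) and the `min_examples` cutoff are hypothetical; in practice you would estimate them by searching the dataset.

```python
def diagnose(image_count, well_captioned_count, min_examples=100):
    """Map rough dataset statistics to the most likely limitation and remedy."""
    if image_count == 0:
        return "absent: collect new images of the subject"
    if well_captioned_count == 0:
        return "mis-captioned: re-caption the existing images"
    if well_captioned_count < min_examples:
        return "under-represented: add images or increase repeats"
    return "well covered: fine-tuning may not be needed"


# Hypothetical numbers for a subject that exists but is never named in captions.
verdict = diagnose(image_count=40, well_captioned_count=0)
```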

We hope you find this write-up useful, and don’t forget to provide feedback and share the post with relevant communities!