FollowFox blog

Stable Diffusion Fine-tuning Experiments (6 more) with ED2.0 (Part 2)

This week we are continuing to experiment with the EveryDream 2 trainer (link), picking up right where we ended last week's post. We highly recommend checking that post (link) before reading this one - things will make a lot more sense.
This week we decided not to change anything in the training/experimental setup, for consistency, even though a couple of potential improvements were identified thanks to your feedback. To understand why, we want to state our goals for this series of experiments here one more time:

The goal:

The ultimate goal is to gain as much mastery and control over the fine-tuning process as possible. The truth is, optimal results will be achieved by a mix of different parameters matching the goals of the model and the available data. But before attempting to find those optimal blends of parameters, we are changing one parameter at a time and observing what happens as a result and in which direction the results shift; only after doing this will we move on to theory crafting and optimization. This is why it is ok if the starting comparison point is not perfect and looks slightly “overcooked” - all we are concentrating on is how each experiment compares to that starting model and in which direction we observe the changes.

Experiment 5 - Clip Skip 0, 1, 2, and 3

This option is all about the text encoder - the part that takes our input text, turns it into tokens, and subsequently into numbers that are fed to the next part of the model along with the noise data. And we know that Stable Diffusion uses the CLIP model for this part.
So the idea of skipping CLIP layers is that rather than taking the output of the very last layer of the text encoder, we use the outputs of earlier layers. We can do this in two places: 1 - when generating images (for example, Automatic1111 settings allow you to change this parameter) and 2 - when fine-tuning the model.
We can read a bit more about the Clip Skip parameter for ED in their docs (link). This needs to be confirmed, but given that values for ED start from 0 while for Auto1111 they start from 1, we suspect that Clip Skip = 0 in ED is the same as Clip Skip = 1 in Auto. So it's shifted by one. And this has implications if we want to coordinate matching this setting across fine-tunes and the inference stage.
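To make the idea concrete, here is a minimal sketch (not ED's or Automatic1111's actual code) of what skipping CLIP layers means, using the Hugging Face transformers text encoder; the model id and the 0-based clip_skip convention are our own assumptions for illustration.

```python
# Minimal sketch of "clip skip": use an earlier hidden layer of the CLIP
# text encoder instead of its final output. Not ED's or Auto1111's code;
# the model id and 0-based clip_skip convention are assumptions.
import torch
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "runwayml/stable-diffusion-v1-5"  # placeholder SD 1.x checkpoint
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")

tokens = tokenizer(
    "portrait photo of a person",
    padding="max_length",
    max_length=tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
)

with torch.no_grad():
    out = text_encoder(tokens.input_ids, output_hidden_states=True)

clip_skip = 1  # 0 = use the last layer; 1 = one layer back, and so on
if clip_skip == 0:
    embeddings = out.last_hidden_state
else:
    # Grab the hidden states of an earlier layer and re-apply the final
    # layer norm, the way inference UIs typically do it.
    hidden = out.hidden_states[-(clip_skip + 1)]
    embeddings = text_encoder.text_model.final_layer_norm(hidden)

print(embeddings.shape)  # (1, 77, 768) for SD 1.x
```

With clip_skip = 0 this reduces to the usual final-layer output, which is consistent with our suspicion that ED's 0 maps to Automatic1111's 1.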

Output Quality

It's hard to make definitive conclusions, and to us choosing among some of these combos felt almost arbitrary. One hypothesis (totally unconfirmed) is that clip skipping might help when we are dealing with a small dataset and overfitting. Some have suggested quite the opposite.
Reading the results: rows are different clip skip settings in Automatic (i.e., at the image generation stage) and columns are four different models: the baseline from the previous post with clip skip 0, and three new models with clip skip values of 1, 2, and 3.
We think that in this case the ED trainer Clip Skip 0 and 1 models (first two columns) are almost identical (though there are micro differences), and results for these two models get worse as we increase the inference clip skip value. Columns 3 and 4 have better results in rows 2 and 3.
Very similar results for the avatars, but with larger differences once the tuning clip skip is >1.
The Superman results are very interesting in this case, especially row 1 for column 2, and row 3 for the last two columns. More exploration with different types of data might reveal even more directional consistencies.

Performance

All the logs are identical to the initial run from the previous post (link).

Experiment 6 - Increasing Conditional Dropout

From ED docs:
Conditional dropout means the prompt or caption on the training image is dropped, and the caption is "blank". The theory is this can help with unconditional guidance, per the original paper and authors of Latent Diffusion and Stable Diffusion. The value is defaulted at 0.04, which means 4% conditional dropout. You can set it to 0.0 to disable it, or increase it. Many users of EveryDream 1.0 have had great success tweaking this, especially for larger models. You may wish to try 0.10. This may also be useful to really "force" a style into the model with a high setting such as 0.15. However, setting it very high may lead to bleeding or overfitting to your training data, especially if your data is not very diverse, which may or may not be desirable for your project.
So maybe this is more suitable for larger fine-tunes but we noticed some minor differences that are interesting to look at.
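For reference, the mechanism the docs describe is very simple; here is an illustrative sketch (not EveryDream's actual implementation) of dropping a caption with some probability during training:

```python
# Illustrative sketch of conditional (caption) dropout during training;
# not EveryDream's actual code, just the idea described in the docs.
import random

COND_DROPOUT = 0.04  # ED default: ~4% of examples train on a blank caption

def maybe_drop_caption(caption: str, p: float = COND_DROPOUT) -> str:
    """With probability p, replace the caption with an empty string so the
    model also learns the unconditional case used by classifier-free guidance."""
    return "" if random.random() < p else caption

# Hypothetical usage inside a data loader / training step:
# caption = maybe_drop_caption(example["caption"])
# text_embeddings = encode_prompt(caption)  # encode_prompt is a hypothetical helper
```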

Output Quality

While almost identical, there are some minor differences from the original run (for example, the earrings). It's almost impossible to call whether one is better than the other.
Same here - it feels like we need to try a larger dataset or more aggressive values.
Even in this example, where the original model seemed overtrained, it's hard to say whether this one is better or worse.

Logs

Everything else was the same, but we noticed a tiny difference in the loss graphs - the final loss value of the model with increased conditional dropout is a tiny bit higher than the baseline's.

Experiment 7 - Disabled Text Encoder

It is what it sounds like - the text encoder part of the model won't be trained. We cannot see any reason for using this method for the whole training - results are just worse and it doesn't bring a performance gain. We wonder if someone has found interesting use cases, like running this prior to training with the text encoder, or running additional steps with this setting on. One might suggest that a marginal improvement could be possible, given that it still trains the U-Net to some extent and keeps the model more intact than tuning with full settings.
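In practice, disabling text encoder training amounts to freezing the CLIP text encoder and optimizing only the U-Net. A rough sketch of that idea (not ED's actual code; the model id is a placeholder):

```python
# Sketch of "disable text encoder training": freeze the CLIP text encoder
# and give the optimizer only the U-Net parameters. Illustration only,
# not EveryDream's implementation; model id is a placeholder.
import torch
from diffusers import UNet2DConditionModel
from transformers import CLIPTextModel

model_id = "runwayml/stable-diffusion-v1-5"
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")

text_encoder.requires_grad_(False)  # captions still condition the U-Net,
text_encoder.eval()                 # but the encoder's weights never update

optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-6)
```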

Output Quality

As expected, results are worse. However, some tuning definitely happened, and that is interesting to observe.

Logs

Identical across the board except for a slight increase in the final loss value.

Experiment 8 - Disabled xformers

We saw zero differences in the results - the output was literally identical to the baseline. The training time and resource needs were also identical, but we observed that the logs didn't record some of the graphs, and the minutes-per-epoch graph was smoother, even though the total time difference was ~10 seconds.
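For context, here is a sketch of what toggling xformers looks like in the diffusers library (ED has its own config flag; this is just for illustration). Disabling it falls back to standard attention, so it should affect speed and VRAM rather than the outputs themselves, which is consistent with what we saw.

```python
# Sketch of toggling xformers memory-efficient attention in diffusers.
# ED has its own config flag; this just illustrates the idea.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

use_xformers = False  # mirrors the "disabled xformers" experiment
if use_xformers:
    pipe.enable_xformers_memory_efficient_attention()
else:
    pipe.disable_xformers_memory_efficient_attention()
```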