
Conclusions on Learning Rate Discovery and Cycling LR

Hello followfox readers!

PSA: This is the last post with our talented Damon Albarn's face.

We learned a lot experimenting on models with his face, but for now, it’s enough of the same face.

We will be featuring a new dataset for future experiments. If you have any suggestions or ideas about what dataset we should use (it can be a face, subject, style, whatever), let us know, and we will consider them as we make that decision.

At followfox, we believe in experimentation and run many experiments. Some are successful, and a lot are less successful or inconclusive. We also believe that writing about the unsuccessful ones is an important part of the learning journey.

However, we think leaving inconclusive experiments without a proper summary and takeaways is not okay. And in a way, our last two posts on finding optimal learning rates (link) and cycling learning rates (link) ended up in that unacceptable state. In both cases, we ended up with models that looked undertrained, and we confirmed that this was the case by applying the same methodology to different training datasets. We also received the same feedback from a few readers.

So in this post, we want to look into it one more time, document additional findings, and with that, close this mini-series in a proper way.

The Plan

Our models looked undertrained and less optimal when compared to our original EveryDream 2 post (link). So for starters, we wanted to retrain the model to make sure we are still able to replicate those results with the latest version of the trainer and use that as a baseline for the comparison.

For that reason, we had to disable AMP by setting "disable_amp": true (note that this is not recommended!). And indeed, we got pretty much an identical model.
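For reference, here is a minimal sketch of flipping this toggle in the trainer's JSON config. The file name train.json and the use of Python's json module are assumptions on our side; only the "disable_amp" key itself comes from our run.

```python
import json

# Minimal sketch: flip the AMP toggle in the EveryDream 2 JSON config.
# The "train.json" file name is assumed; only the "disable_amp" key is from our run.
with open("train.json", "r") as f:
    cfg = json.load(f)

# Not recommended in general; used here only to replicate the old baseline.
cfg["disable_amp"] = True

with open("train.json", "w") as f:
    json.dump(cfg, f, indent=2)
```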

Next, we re-tested our last protocol with a cycling rate at 1.13E-06 LR and confirmed that the model indeed looks undertrained when compared to the initial protocol. Here is an example:
As you can see, in the second column, Damon looks a bit fake and less like him.

From there, we generated a list of hypotheses why this might be the case and tried to explore each:

  • The learning rate finding approach might be recommending a value lower than the true optimal learning rate
  • We might have a subjective preference for overtrained, ‘cooked’ models over the optimally trained ones
  • The validation loss methodology might not be accurate enough for our small dataset
  • Some other reasons we are not accounting for

Revisiting Optimal Rate Discovery Methodology

There are a few factors to consider here: first of all, the approach is pretty old (it comes from the 2017 paper), and combined with the fact that ED2 uses additional optimizers, it might not be working as intended.
What’s more, the discovery method suggests that we gradually increase the learning rate while the model is being trained. However, SD models train quite quickly, even at low learning rates, and by the time we approach the higher learning rates, the model has already been training for quite a while.
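For context, here is a minimal sketch of what that gradual-increase range test looks like. The training_step helper is a hypothetical stand-in for one step of the actual trainer, so treat this as an illustration of the idea rather than ED2's code.

```python
# Sketch of the classic range test: one short run where the learning rate is
# ramped exponentially from lr_min to lr_max while the loss is recorded.
# training_step() is a hypothetical stand-in for a single training step.
def lr_range_test(training_step, lr_min=1e-7, lr_max=1e-4, num_steps=200):
    history = []
    for step in range(num_steps):
        # exponential ramp from lr_min to lr_max
        lr = lr_min * (lr_max / lr_min) ** (step / (num_steps - 1))
        loss = training_step(lr=lr)
        history.append((lr, loss))
    # usual recipe: plot loss vs. LR and pick a value just below the blow-up point
    return history
```

Note that the model keeps training throughout this single ramp, which is exactly the issue described above.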
For this reason, we decided to modify the approach a little bit. Instead of gradually increasing the learning rate, we test a few values with small runs at a constant rate, each time starting from a fresh base model. The idea is to find the highest value at which the loss continuously goes down during this small run.
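Here is a minimal sketch of that sweep, assuming a hypothetical train_short_run wrapper that launches a short run from the base checkpoint and returns its validation losses; the candidate values match the ones we tried below.

```python
# Modified discovery protocol: short constant-LR runs, each starting from a
# fresh base model, keeping the highest LR whose validation loss only goes down.
# train_short_run() is a hypothetical wrapper around the trainer.
def is_continuously_decreasing(losses):
    return all(later <= earlier for earlier, later in zip(losses, losses[1:]))

def pick_learning_rate(train_short_run, candidates=(1e-4, 1e-5, 7e-6, 5e-6, 2e-6)):
    for lr in candidates:  # ordered from most to least aggressive
        val_losses = train_short_run(lr=lr, epochs=20, resume_ckpt="base_model")
        if is_continuously_decreasing(val_losses):
            return lr  # highest value with a bump-free, decreasing loss curve
    return None
```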
We started with an aggressive value of 1e-4 for 20 epochs and saw that the model didn’t learn anything. Here is the validation loss graph:
Next, we tried 1e-5 followed by 7e-6 for 20 epochs each, and in both cases, we observed models learning, but at about 30+ steps, the loss values started going up:
Then we tried 5e-6, and it looked mostly smooth, with one tiny bump.
And finally, we saw that at 2e-6, the graph had no bumps at all. So we decided to run the full test with these two values.

With these two learning rate values (5e-6 and 2e-6), we applied the same cycling approach as before: 5.5 cycles, 110 epochs total.
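For illustration, here is a rough PyTorch sketch of what a 5.5-cycle schedule over 110 epochs (20 epochs per cycle) looks like when it peaks at the discovered rate. The triangular shape, the LR floor, and the tiny stand-in model are assumptions, not the trainer's exact scheduler settings.

```python
import torch

# Shape illustration only: 5.5 triangular cycles over 110 epochs, peaking at the
# discovered LR. The floor value and the stand-in model are assumptions; the real
# runs use the trainer's own scheduler.
peak_lr, total_epochs, epochs_per_cycle = 5e-6, 110, 20

model = torch.nn.Linear(4, 4)  # stand-in for the fine-tuned SD model
optimizer = torch.optim.SGD(model.parameters(), lr=peak_lr)
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer,
    base_lr=peak_lr / 25,                # assumed floor
    max_lr=peak_lr,
    step_size_up=epochs_per_cycle // 2,  # stepping the scheduler once per epoch
    mode="triangular",
    cycle_momentum=False,
)

lr_curve = []
for epoch in range(total_epochs):
    optimizer.step()                     # placeholder for one epoch of training
    scheduler.step()
    lr_curve.append(scheduler.get_last_lr()[0])
# lr_curve now ramps up and down 5.5 times across the 110 epochs
```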
And here are the result comparisons of the original baseline and the two models we created. For the full picture, we will display the results of all 4 models:

1st column is the baseline with a constant 1.5e-5

2nd column is the most aggressive cycling: 5e-6

3rd column is the less aggressive cycling: 2e-6

Finally, the 4th column is the undertrained 1.16e-6 cycling from the previous post, for comparison

All of the first three models look good; however, in the 2nd column (5e-6), the face looks very similar across generations, potentially indicating overfitting.
With avatar variations, overfitting is even more apparent for the 5e-6 version, while 2e-6 looks quite acceptable.
In conclusion, the modified learning rate discovery protocol seems promising so far, and we suggest you test it in your own training sessions. Basically, do a few small runs at various learning rates and use the highest one that results in continuously decreasing validation loss values.

Do we prefer overtrained, ‘cooked’ models?

Answering this one was rather easy, and the answer seems to be yes, assuming the validation loss graph is the right way to define an overtrained model.
To check it, we did the following.
We know that we like our initial model, the one that was trained for 100 epochs with a constant LR of 1.5e-6. So we just did the run again; this time, validation was enabled. Here is the graph that we got:
As we can see, for almost half of the training process, the loss values were going up, and we still got a model that we liked. So we probably have a tendency to like the output of a model that is cooked, at least as defined by the validation loss graph.

Does Validation Loss struggle with small datasets?

This one is tricky and needs a much more in-depth evaluation. However, here is one example that we think suggests this might be the case.
We did a model run with the 3e-6 cycling approach. The outputs were decent, slightly on the overtrained side. However, this is what we see from the validation loss graph:
If we are interpreting this correctly, the loss value of the generations at the end of the training was higher than at the start with the fresh model. However, if we prompt the initial model with the trained keyword, we get zero similarity with the subject. In contrast, the final version of the model does a decent job. This is an indication that the validation loss might not be doing an ideal job when working with just four images, i.e., a single batch to look at.
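To make the small-sample concern concrete, here is a toy sketch with purely synthetic numbers (not from our runs): the mean loss over a 4-image validation set swings much more between evaluations than the mean over a larger set, simply because of sampling noise.

```python
import random

# Toy illustration with synthetic per-image losses, not real training data.
random.seed(0)

def mean_val_loss(n_images, true_loss=0.12, per_image_noise=0.03):
    """Pretend each image's loss is the true value plus independent noise."""
    return sum(true_loss + random.gauss(0, per_image_noise) for _ in range(n_images)) / n_images

small = [mean_val_loss(4) for _ in range(10)]   # 4-image set, 10 evaluations
large = [mean_val_loss(40) for _ in range(10)]  # 40-image set, 10 evaluations

print("spread across evals, 4 images: ", round(max(small) - min(small), 4))
print("spread across evals, 40 images:", round(max(large) - min(large), 4))
```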