
Standardizing Testing for Generative Art AI Models: A Necessary Step Forward

Hello everyone,

This time I decided to pause some of my experiments and share thoughts on a topic that is likely to become more and more important. Figuring out protocols, methods, and scalable ways of testing to compare different approaches to generative AI could well be one of the areas where the next breakthrough occurs.

The comparison process is the current bottleneck

Even disputes like SD 2.0 vs. 1.x are not settled, despite almost every person who has been tinkering with SD trying to make a comparison and sharing their results. And there is a good explanation - it is a very complicated topic, it involves both objective and subjective testing, and there is close to zero alignment on what a standard testing protocol could look like.
The complexity of the task grows exponentially when we look at the community's fine-tuned models. There are hundreds, probably thousands, of them, and the number will only go up. More importantly, each of these fine-tunes could have been done in tens or hundreds of different ways depending on the parameters: which fine-tuning method to use, which model to use as the source ckpt, how many steps at what learning rate, and so on. And it doesn't end there - for each of these models, there are hundreds of permutations of possible optimal ways to generate images.
Solving the issue of comparing Generative Art AI Models could be a major breakthrough for the field. If, as a community, we can figure out how to standardize testing, it could significantly accelerate the progress and lead to major advancements in Generative Art AI.

The comparison is not straightforward…

In my previous experiments on this blog, I've been trying to create some basic, simple comparison protocols. And at each step, I find more unanswered questions than answers. Here are a few of the many examples worth thinking about:

Should we compare the average output or the best results of a model?

In theory, for most use cases, a model that outputs many mediocre results is worse than a model that outputs mostly bad results but occasional masterpieces. In my previous tests (link to an example), I've observed exactly that - a few gems came from models that were doing worse overall.
For now, I'm leaning toward comparing averages. The mental model I'm using is that the output quality of each model likely follows some sort of normal distribution, and that the occasional gems appear where these distributions overlap across models, due to the RNG nature of image generation. So comparing average results should mean that the model with the higher mean is also likely to give you even better gems after enough search (i.e., enough prompt crafting and parameter adjustment).
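To make that mental model concrete, here is a toy simulation (all numbers are made up for illustration): two models whose per-image quality follows a normal distribution with the same spread but different means. The model with the higher mean also has a higher expected "best of N" score, i.e., it produces better gems given enough search.

```python
import numpy as np

# Toy simulation, all numbers made up: two models whose per-image quality
# scores follow normal distributions with the same spread but different means.
rng = np.random.default_rng(0)
model_a = rng.normal(loc=6.0, scale=1.5, size=100_000)  # higher average quality
model_b = rng.normal(loc=5.5, scale=1.5, size=100_000)  # lower average quality

print(f"mean quality: A={model_a.mean():.2f}  B={model_b.mean():.2f}")

# Expected best image found after generating N images ("enough search"):
for n in (10, 100, 1000):
    best_a = model_a.reshape(-1, n).max(axis=1).mean()
    best_b = model_b.reshape(-1, n).max(axis=1).mean()
    print(f"expected best of {n}: A={best_a:.2f}  B={best_b:.2f}")
```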

Should we use standard prompts/settings across models?

I find this one even trickier. It's a fact that each model reacts differently to different prompts. Just because the same prompt didn't do as well on a given model doesn't mean that the average output of that model is necessarily worse. But if we try to craft a different prompt for each model, it becomes very hard to attribute the difference in results to the quality of the model itself versus the quality of the prompt crafting.
A few ideas that could be explored, but I’m very open to more suggestions here:
  • Pre-written prompts that are not optimized for any of the models being tested (see the sketch after this list). But then, are we defaulting to RNG and randomly penalizing objectively better models?
  • Timeboxing prompt crafting for each model (say, I spend 5 minutes per model to find the optimal prompt), but then there is so much subjectivity…
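For the first option, here is a minimal sketch of what a standardized generation protocol could look like, assuming the Hugging Face diffusers library. The second checkpoint ID and the prompts are placeholders, and the fixed settings (30 steps, CFG 7.5, three seeds) are illustrative choices, not a recommendation.

```python
import torch
from diffusers import StableDiffusionPipeline

# Checkpoints under comparison; the second ID is a placeholder for a fine-tune.
MODELS = ["runwayml/stable-diffusion-v1-5", "my-org/my-finetuned-model"]

# Pre-written prompts, deliberately not optimized for any specific model.
PROMPTS = [
    "a portrait photo of an astronaut, studio lighting",
    "a watercolor painting of a mountain village at dawn",
]
SEEDS = [1, 2, 3]       # same seeds for every model
STEPS, CFG = 30, 7.5    # same sampler settings for every model

for model_id in MODELS:
    pipe = StableDiffusionPipeline.from_pretrained(
        model_id, torch_dtype=torch.float16
    ).to("cuda")
    for p_idx, prompt in enumerate(PROMPTS):
        for seed in SEEDS:
            generator = torch.Generator("cuda").manual_seed(seed)
            image = pipe(prompt, num_inference_steps=STEPS, guidance_scale=CFG,
                         generator=generator).images[0]
            image.save(f"{model_id.split('/')[-1]}_prompt{p_idx}_seed{seed}.png")
```

Keeping the prompts, seeds, and sampler settings identical means any remaining differences in the outputs are easier to attribute to the checkpoints themselves rather than to the prompt crafting.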

What are the things we should care about?

Output for a specific task is one thing, but what about everything else? For example, how well did the model preserve its ability to generate everything else after fine-tuning? Or more practical considerations like file size, the time and effort required to do the fine-tuning, etc.

Potential mental model

In theory, AI could learn all the objective and subjective preferences of all humans and score models independently. But until we get to that hypothetical state, I propose splitting the process into three parts, ranked by how objective or subjective each one is:
  1. Metric-based objective assessment
  2. Objective differentiation of results using subjective methods
  3. Subjective, individual preferences

Metric-based objective assessment

I imagine this step happening before we even generate any images from a model. This could be the interpretation of loss graphs from the fine-tuning process or the implementation of some new metrics. For example, I once ran a Dreambooth fine-tune and accidentally set the learning rate to 0. The loss graph of that test showed zero change in loss over time - a clear indication that nothing got fine-tuned, so the model could be automatically discarded. The same would be true if the loss went up over time. However, I have not yet seen deeper, better metrics or interpretations of such metrics.
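As an illustration of the kind of automatic pre-check this could be, here is a minimal sketch (my own, not an established metric) that reads a logged loss history and discards runs whose loss is flat, as with the learning-rate-0 accident, or trending upward.

```python
import numpy as np

def loss_sanity_check(losses, flat_tol=1e-6):
    """Flag fine-tuning runs whose loss history looks broken.

    losses: sequence of loss values logged during fine-tuning.
    Returns a human-readable verdict.
    """
    losses = np.asarray(losses, dtype=float)
    first_half = losses[: len(losses) // 2].mean()
    second_half = losses[len(losses) // 2 :].mean()

    if np.ptp(losses) < flat_tol:
        return "discard: loss never changed (e.g. learning rate set to 0)"
    if second_half > first_half:
        return "discard: loss trended upward over training"
    return "keep: loss decreased; proceed to image-based evaluation"

# Hypothetical loss logs for three runs:
print(loss_sanity_check([0.12] * 50))                  # flat -> discard
print(loss_sanity_check(np.linspace(0.10, 0.18, 50)))  # rising -> discard
print(loss_sanity_check(np.linspace(0.15, 0.08, 50)))  # falling -> keep
```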

Objective differentiation of results

This is probably where we are going to see the majority of the effort. This step can be done manually or with some automated implementation, and it will most likely become more and more automated. The question is, how do we get there? A lot of effort is already being put in here. Examples: Midjourney asking users to rate results, my experiments to compare models (link), the creation of various automated aesthetic scorers (example), and many more.
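For the automated route, the comparison itself can be a simple loop once a scorer exists. Here is a minimal sketch, assuming a hypothetical aesthetic_score() function (which could wrap one of the existing aesthetic predictors) and one folder of generated images per model.

```python
from pathlib import Path
from statistics import mean

def aesthetic_score(image_path: Path) -> float:
    """Hypothetical scorer: wrap an existing aesthetic predictor of your choice here."""
    raise NotImplementedError

def compare_models(output_root: str) -> dict:
    """Expects one sub-folder of generated images per model, e.g. outputs/model_a/*.png,
    and returns the mean aesthetic score per model."""
    results = {}
    for model_dir in sorted(Path(output_root).iterdir()):
        if model_dir.is_dir():
            scores = [aesthetic_score(p) for p in model_dir.glob("*.png")]
            results[model_dir.name] = mean(scores) if scores else float("nan")
    return results
```

The interesting work is in the scorer; the loop just turns per-image scores into per-model averages that can be ranked.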
I will continue to explore this space, write about it, and even propose solutions. For example, I want to try a MaxDiff or Conjoint analysis type of implementation instead of brute-force scoring of every single image on a 1-10 scale.
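To sketch what a MaxDiff-style setup could look like (a simplified version: real MaxDiff designs balance how often items appear together, and scoring usually uses a choice model rather than raw counts): raters see small sets of images, pick the best and the worst in each set, and images are ranked by their best-minus-worst counts.

```python
import random
from collections import Counter

def build_maxdiff_sets(image_ids, set_size=4, n_sets=30, seed=0):
    """Randomly sample sets of images to show to a rater. (Proper MaxDiff designs
    balance item co-occurrence; random sampling is enough for a sketch.)"""
    rng = random.Random(seed)
    return [rng.sample(image_ids, set_size) for _ in range(n_sets)]

def score_responses(responses):
    """responses: one (best_id, worst_id) pick per shown set.
    Returns a best-minus-worst count per image; higher means more preferred."""
    best = Counter(b for b, _ in responses)
    worst = Counter(w for _, w in responses)
    return {i: best[i] - worst[i] for i in set(best) | set(worst)}
```

Compared to scoring every image on a 1-10 scale, this only asks raters for relative judgments, which tend to be faster and more consistent.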

Subjective, individual preferences

This is likely to stay with us for a while. At the end of the day, we are talking about art, and after going through the first two steps, we should in theory have models that output good results. However, based on specific tastes, needs, etc., folks will continue to prefer one over the other, which is completely normal. Maybe (or likely) in the future there will be individually optimized scoring approaches, but I see this as a cherry on top after solving the bottleneck issues mentioned above.

What’s next

  • First of all, I'd like to hear about the different approaches and models that you use for this purpose.
  • I'm also going to try some of the already existing tools in this space and implement them in my workflows.
  • Finally, I want to try a few new things, like a survey design with a MaxDiff implementation, to see the viability of such efforts.
Please share your thoughts!