Tools like Midjourney are taking the world by storm. Like many others, we see huge potential in this new field, but when putting these tools to use in projects we also run into some recurring challenges. For example, certain themes and objects might not be well represented in the model, as in the case of a new product. Another challenge is token or keyword ambiguity, which makes it hard to reference the particular thing you are after. This made us curious: how could we expand and tailor the capabilities of text-to-image systems? To explore this, we ran a small experiment, augmenting a generative model with custom training so that it could better represent an object that "was missing" - a specific 1960s lounge chair design.
First, let’s take a very simplified look under the hood of image-based generative AI. There are two core components: the Generator and the Model. The Model is distilled from a collection of digital imagery - typically a vast collection sourced online - but condensed and organized into a format that the Generator can easily understand.
While it's useful to have a wide visual reference model, there are scenarios where these tools will have a hard time being helpful. Consider the case of having an unreleased product - or a product that isn’t common on the internet or in the training data. Or when there is a problematic prompting situation, where you can’t reference the desired object or element through a unique product name. In both these cases it might be difficult or impossible to use generative AI.
As an example we tried to generate images of a 1960s lounge chair, designed by Bengt Ruda. This object has a very characteristic design, but the generated images don't look anything like the actual object. How could we improve the ability to generate accurate representations of this specific element?
Preparing the dataset
Using the method of “fine-tuning” one can graft a new dataset into an existing base model and achieve a far more exact output. We created a custom dataset which documented the product's features from a multitude of perspectives, lighting conditions and environments. The model needs to be provided with examples of how the object works and how it relates to the world, other objects and people. We used a mix of images from previous campaigns, product shots, but also customer UGC and online auction imagery. For the sake of the experiment we also created some additional mockups for unusual scenarios.
We used 40 images in our experiment, but around 100 would have been optimal. Then we created texts for the trainer. This step works as a type of backwards prompting, with one text describing each image. The more exact and descriptive the text, the more accurately it will be able to isolate the object and its qualities, materials, structure, size and behavior.
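A minimal sketch of how such a dataset could be checked before training, assuming each image sits next to a text file with the same name (for example `chair_01.jpg` and `chair_01.txt`). That naming convention, the folder layout and the function name `load_caption_pairs` are assumptions for illustration, not a description of the setup used in the experiment.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}  # assumed image formats


def load_caption_pairs(dataset_dir):
    """Return ({image name: caption}, [image names without a usable caption])."""
    captions = {}
    missing = []
    for image in sorted(Path(dataset_dir).iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip the caption files and anything else
        caption_file = image.with_suffix(".txt")  # assumed naming convention
        text = ""
        if caption_file.exists():
            text = caption_file.read_text(encoding="utf-8").strip()
        if text:
            captions[image.name] = text
        else:
            missing.append(image.name)  # no caption file, or an empty one
    return captions, missing
```

Calling `load_caption_pairs("dataset")` on a hypothetical `dataset` folder returns the captions found and a list of images that still need a description, which makes it easy to spot gaps before a training run that takes hours.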
Preparing the dataset, writing the descriptions and setting up the prerequisites for the training is by far the most time-consuming part of the process. It also takes some experience and proper planning to get right, since you are acting part director of photography (DoP), part behaviorist.
Training the model
Using Stable Diffusion Web UI and the Dreambooth extension, we trained the model for 3000 epochs and evaluated it to find a good middle ground where the model renders the object at its best. A weird aspect is that you can overtrain it, at which point the AI starts outputting some pretty crazy, over-exaggerated results. The training itself takes a couple of hours, and can be done much faster on powerful hardware, since most of the computing is done on a dedicated Nvidia GPU. We used a somewhat dated gaming rig from the office scrapyard, but there are also cloud GPU services available at a cost per minute.
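One way to look for that middle ground is to save intermediate versions of the model during training and render the same prompts with the same seed from each of them, so that the only thing that changes between images is the amount of training. The sketch below only builds such a comparison plan and does not call any image generator. The save interval of 500, the seed, the prompts and the function name `evaluation_plan` are assumptions for illustration.

```python
from itertools import product


def evaluation_plan(total_epochs, save_every, prompts, seed=1234):
    """List every (checkpoint epoch, prompt, seed) combination to render."""
    checkpoints = range(save_every, total_epochs + 1, save_every)
    # A fixed seed keeps the comparison fair: differences between images
    # then come from the amount of training, not from random noise.
    return [(epoch, prompt, seed) for epoch, prompt in product(checkpoints, prompts)]


plan = evaluation_plan(
    total_epochs=3000,
    save_every=500,  # assumed interval, giving six saved versions to compare
    prompts=["the chair in a living room", "the chair as a technical drawing"],
)
print(len(plan))  # 12 renders: 6 checkpoints x 2 prompts
```

Laying the results out as a grid, with one row per checkpoint, tends to make both undertrained and overtrained versions easy to spot by eye.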
One challenge in the training process, and when working with a generated model, is that the AI relies on general language models to interpret your text instructions, so it already has a general understanding of any word you can come up with. In our case it knew the general category of our product (high chair, children's chair, etc.), but it pictured it wrong.
If we were to keep training it on that name, the result would always be diluted by the base model’s general understanding of the generic product. Instead, to be precise, it’s better to start with a term with no prior meaning, and ideally to make it as short and unique as possible. There is of course a science to this, but simplified - the fewer hits on Google, the better. You will end up with combinations like ‘ohwx chair’ and ‘zshq style’, which you then use in your prompts when generating with the new model.
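As a small illustration of the idea, the sketch below swaps a product's real name for a unique identifier plus a generic class word in a training description. The product name "Acme lounge chair", the example caption and the function name `use_identifier` are hypothetical; only the ‘ohwx chair’ style of identifier comes from the text above.

```python
import re


def use_identifier(caption, product_name, identifier, class_word):
    """Replace the product's real name with '<identifier> <class_word>'."""
    replacement = f"{identifier} {class_word}"
    pattern = re.compile(re.escape(product_name), flags=re.IGNORECASE)
    # A function is used as the replacement so the text is inserted literally.
    return pattern.sub(lambda _match: replacement, caption)


caption = "Photo of the Acme lounge chair by a window"  # hypothetical caption
print(use_identifier(caption, "Acme lounge chair", "ohwx", "chair"))
# Photo of the ohwx chair by a window
```

The same identifier then has to be used consistently, both in every training description and later in every prompt.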
Once training is finished, a new model can be generated from the best training level and loaded into Stable Diffusion - and then used like any other model for text2image, inpainting, outpainting, image2image, etc. We were now able to place the lounge chair on top of a mountain, on the bridge of a Star Wars imperial cruiser, in 8-bit and as a technical drawing within a couple of minutes. This hugely increases the precision and speed with which teams can generate ideas and visuals in the creative process.
We believe custom-trained visual models offer many interesting benefits, relevant for the individual creative as well as for an agency or a brand. To exemplify with a few speculative uses:
Protect visual representation.
Proprietary models could play a role in ensuring the accurate visual representation of products and properties. Maybe they can provide a more flexible version of the brand guideline, or key art? Models could be handed off to partners, or plugged into platforms such as delivery networks or DAMs (all of which will likely be increasingly AI-driven for personalization, optimization etc).
The creative oracle.
A model can be trained on the visual universe of a brand or creator. Take the example of a fashion brand: imagine all the collective output from the organization through the years - mood boards, references, sketches, and the final products that make up the creative soul of the brand. The resulting model could be used as a creative oracle, providing stimuli and guidance for new collections - possibly augmented with other specialized models relevant to specific areas of inspiration.