Stable Diffusion XL Improvements and Limitations

Text-to-image tools are set to see remarkable progress thanks to a new model called Stable Diffusion XL (SDXL). A recent publication by Stability AI delves into the advancements and limitations of the new model, providing valuable insights. In this post, we will explore the key findings of Stability's research and list some of the advancements and limitations we can expect from SDXL.

Advancements in SDXL

  1. Better Image Quality: SDXL has made significant improvements in generating high-quality images. Compared to its previous versions, the model produces more realistic and visually appealing images that closely follow the prompts given.
  2. Staying True to Prompts: One important aspect of text-to-image synthesis is how faithfully the generated image matches the given prompt. SDXL outperforms its competitor, Midjourney v5.1, at accurately incorporating the provided input into the generated images. This means the model better understands and reflects the intended prompt, even a complex one.
  3. Improved Image Composition: SDXL excels in creating coherent and well-composed images. It effectively combines different elements within the image, resulting in more visually pleasing and contextually consistent images.
  4. Faster Image Generation: The speed of image generation is much faster compared to that of older Stable Diffusion models. This means less time spent waiting for your images to generate, as well as less work for your GPU to handle!

Now that SDXL 1.0 has been officially released, here are some of the most exciting improvements as described by the Stability AI team:

📷 The highest quality text-to-image model: SDXL generates images considered by blind testers to be best in overall quality and aesthetics across a variety of styles, concepts, and categories. Compared to other leading models, SDXL shows a notable bump in overall quality.

📷 Freedom of expression: Best-in-class photorealism, as well as an ability to generate high-quality art in virtually any art style. Distinct images are produced without any particular 'feel' imparted by the model, ensuring absolute freedom of style.

📷 Enhanced intelligence: Best-in-class ability to generate concepts that are notoriously difficult for image models to render, such as hands and text, or spatially arranged objects and persons (e.g., a red box on top of a blue box).

📷 Simpler prompting: Unlike other generative image models, SDXL requires only a few words to create complex, detailed, and aesthetically pleasing images. No more need for paragraphs of qualifiers.

📷 More accurate: Prompting in SDXL is not only simpler but also truer to the intention of prompts. SDXL's improved CLIP model understands text so effectively that a concept like 'The Red Square' is understood to be different from 'a red square'. This accuracy makes it possible to get much closer to the perfect image directly from text, even before using the more advanced features or fine-tuning that Stable Diffusion is famous for.

📷 All of the flexibility of Stable Diffusion: SDXL is primed for complex image-design workflows that include generation from text or a base image, inpainting (with masks), outpainting, and more. SDXL can also be fine-tuned for new concepts and used with ControlNets. Some of these features will arrive in forthcoming releases from Stability.

Source: Stability AI Discord

Limitations of SDXL

  1. Difficulty with Complex Subjects: Although SDXL has made great strides, it still faces challenges when generating intricate subjects such as… you guessed it, human hands! While the model can generate realistic-looking hands and fingers, getting the anatomy correct is still a struggle.
  2. Not Perfectly Photorealistic: While SDXL produces impressive results, it does not achieve perfect photorealism. Subtle details such as lighting effects or texture variations may not be accurately represented in the generated images. Whilst the average person may not notice these subtleties, a photographer or image expert should be able to tell that an image was generated using AI.
  3. Concept Blending: Concept blending refers to the unintentional merging or overlap of different visual elements in the generated images. SDXL, like other models, may exhibit this phenomenon, resulting in the blending of unrelated features and objects.
  4. Challenges in Rendering Text: Rendering long and legible text poses a challenge for SDXL. The model may struggle to maintain clarity and coherence when attempting to generate text, which can affect the quality of the generated images. SDXL can render some text, but it greatly depends on the length and complexity of the word.

Stable Diffusion XL has brought significant advancements to text-to-image generation and generative AI images in general, outperforming or matching Midjourney in many aspects. However, there are still limitations to address, and we hope to see further improvements to the model. Now that the model has been released to the public under an open-source license, we expect to see a surge in the number of custom models created with it.

Custom models made using SDXL will likely be where the true improvements are seen. There are already thousands of well-trained models for Stable Diffusion 1.5 through 2.1, each with its own strengths and weaknesses. Some of the photorealism models for 2.1 have already shown seriously impressive results; we expect that the benefits which come with SDXL will take these models to the next level!

Have you tried SDXL yet? If you haven’t, you can always check out our guide on how to access and use Stable Diffusion XL.
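If you prefer to experiment locally rather than through a hosted service, here is a minimal sketch of generating an image with SDXL 1.0 using Hugging Face's diffusers library. This assumes you have `diffusers`, `torch`, and a CUDA GPU with enough VRAM, and that the official base checkpoint (`stabilityai/stable-diffusion-xl-base-1.0`) downloads on first run; the prompt and settings below are just illustrative choices, not recommendations from Stability.

```python
# Minimal SDXL 1.0 sketch with Hugging Face diffusers.
# Assumes: diffusers, torch, a CUDA GPU, and network access to download
# the official base checkpoint (several GB) on first run.
import torch
from diffusers import StableDiffusionXLPipeline

# Load the SDXL base model in half precision to reduce VRAM usage.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
)
pipe = pipe.to("cuda")

# SDXL handles short prompts well -- no paragraphs of qualifiers needed.
image = pipe(
    prompt="a red box on top of a blue box, studio lighting",
    num_inference_steps=30,   # illustrative value
    guidance_scale=7.5,       # illustrative value
).images[0]

image.save("sdxl_output.png")
```

Note that Stability also released a separate refiner model alongside the base checkpoint, which can be applied as a second stage for finer detail; the sketch above uses the base model only.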