TL;DR
In this blog, we explore ControlNet's architecture, moving beyond the basics to explain how it works under the hood. ControlNet is a platform that adds external components, a Transformer (conditioning embedding) and a UNet, to a frozen diffusion model, enabling it to process additional inputs such as depth maps or edge maps. We break down the components, explain their connections, and show you how to modify the architecture for custom needs, such as adding new channels or enhancing input types.
ControlNets
By the end, you'll be equipped to understand and tweak ControlNet for your own projects. In this blog, we’ll go beyond the surface-level explanations of ControlNet and take a deep dive into its architecture. Instead of providing a simple high-level summary, we’ll break down the components of ControlNet and guide you through the technical and theoretical details needed to fully understand the platform. Whether you’re looking to modify the architecture, enhance inputs, or integrate new features, this blog will equip you with the knowledge to make those changes confidently.
We’ll start by exploring Why ControlNet Customizations Matter and sharing insights from my journey adapting ControlNet models for custom needs. Next, we’ll break down What Exactly ControlNet Does and move into a detailed explanation of the Four Fundamental Components of ControlNet, outlining how each one plays a crucial role in the platform.
From there, we’ll focus on the ControlNet Component, explain how these work together within the UNet and Transformer framework, and dive into how the T2I foundation Model and ControlNet UNet Are Connected. This section will also explore the concept of a Hyper-Network, explaining the close relationship between the foundation and ControlNet models.
As we progress, we’ll examine the Foundation Model and the critical Connections between the T2I model and ControlNet more closely. Finally, we’ll discuss the VAE and its role in the system, including a historical overview and technical breakdown. We’ll wrap up with a look at Training and a deep dive into the Transformer (ControlNet Conditioning Embedding), providing all the details you’ll need to modify and extend ControlNet for your projects.
This blog is intended to complement another, more technical blog I’ve written, How to Modify ControlNet Input: A Step-by-Step Guide for Input Modification, which focuses on the specific steps required to modify ControlNet. While that blog walks you through how to implement changes, this blog provides the why. It dives into the underlying theory and architecture to explain the reasoning behind those modifications. By understanding the foundations covered here, you’ll better appreciate the technical steps outlined in the other blog and be better equipped to make thoughtful and effective changes to ControlNet.
This blog isn't just another general post about ControlNet, filled with hand-waving over its importance and architecture.
There are plenty of great blogs that already do that. Here, I want to take you deeper into the intricacies of ControlNet so that by the end of this read, you'll be able to truly understand how everything works under the hood and make specific tweaks on your own.
After years of working with Gen AI technology, both in R&D and in my current role as VP of Gen AI Technology at Bria, one of my key goals is to help our customers tailor Bria’s open-source models to their specific needs. A common way to tailor these models is to use ControlNet and add conditional input, but to make such customizations, it's essential to understand the architecture deeply—not just on the surface.
You need to understand the nature of the changes you want to make (the why) and how to implement them effectively, both at a theoretical level and at a technical level.
This blog is designed to break down the topic and make the complexities of ControlNet easier to understand.
For a more technical walkthrough on modifying ControlNet, including adding new input channels or integrating components like VAE, you can refer to my other blogs, where I explain the specific steps in detail.
The way I see it, ControlNet isn’t just “another model”; it’s a platform that allows us to use the foundation model slightly differently. Interestingly, this aligns with the foundation model’s name: it is literally the foundation upon which the ControlNet platform is built.
The ControlNet platform is a strategy that leverages the preexisting weights of a diffusion foundation model to quickly build a different, smaller model capable of taking additional input as a condition. This is done by adding an external component, what I refer to as ControlNet, though it’s sometimes also called a hyper-network or an external model.
Remember, throughout the entire training process of the ControlNet platform, the original T2I (Text to Image ) diffusion model is locked—frozen like a deep freezer in Antarctica. It's not moving, not changing, and definitely not learning. Meanwhile, the ControlNet connected to it is doing all the heavy lifting, soaking up knowledge like a sponge, while the T2I stays as solid as a rock.
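If you want to see what this freezing looks like in practice, here’s a minimal sketch using diffusers-style classes (the model paths are placeholders, not real checkpoints):

```python
import torch
from diffusers import UNet2DConditionModel, ControlNetModel

# Placeholder paths; any diffusers-format T2I model and a matching ControlNet would do.
unet = UNet2DConditionModel.from_pretrained("path/to/t2i-model", subfolder="unet")
controlnet = ControlNetModel.from_pretrained("path/to/controlnet")

unet.requires_grad_(False)   # the T2I foundation model is frozen: it never updates
unet.eval()
controlnet.train()           # the ControlNet does all the learning

n_trainable_unet = sum(p.numel() for p in unet.parameters() if p.requires_grad)
n_trainable_cn = sum(p.numel() for p in controlnet.parameters() if p.requires_grad)
print(n_trainable_unet, n_trainable_cn)  # 0 for the frozen UNet, millions for the ControlNet
```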
The Four Fundamental Components of ControlNet: Breaking Down the Basics
So, let's dive in and break it down. We'll explore the components that make up the ControlNet platform.
Let's start with the basics—what do we have? What are the four fundamental components of the platform?
The ControlNet platform consists of four main components:
The four main components of the ControlNet platform
When we dive deeper into these four components, we see that they reveal even more underlying elements.
Let’s dive into the ControlNet component. The ControlNet component consists of two main parts (these are the parts trained during the ControlNet training process): the first is the Transformer, a.k.a. the “ControlNet Conditioning Embedding”; the second is the UNet.
It's a standard UNet, just like the one we know—only customized for ControlNet.
This UNet is a mini-copy (small but significant!) of our base T2I model.
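In fact, in the diffusers library this “mini-copy” relationship is explicit: a ControlNet can be initialized directly from the base UNet, inheriting its encoder weights. A small sketch, assuming a diffusers-format T2I checkpoint (the path is a placeholder):

```python
from diffusers import UNet2DConditionModel, ControlNetModel

# Load the base T2I UNet (placeholder path), then spawn a ControlNet from it.
unet = UNet2DConditionModel.from_pretrained("path/to/t2i-model", subfolder="unet")

# from_unet() copies the encoder (down-block) weights of the base UNet into the new
# ControlNet, which is why it behaves like a trainable mini-copy of the foundation model.
controlnet = ControlNetModel.from_unet(unet)
```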
The Transformer component converts the visual input (the “condition”) provided to the ControlNet platform into the latent space, ensuring that what enters the UNet is already adapted to the latent space. This little genius of a component takes something like a depth map in pixel space and turns it into a representation the ControlNet UNet can process properly.
This Transformer is a component that, let's put it gently, is easier to overlook :) But in reality, it's absolutely critical when it comes to making real changes in ControlNet.
To modify elements within ControlNet, you must understand this component (or at the very least, acknowledge its existence).
Now that we’ve established the importance of the Transformer, we'll take a closer look at how it technically works. In a dedicated section (Transformer (ControlNet Conditioning Embedding) Deep Dive), we’ll explore the conversion process, from feature extraction to creating tensors, and how the depth map gets transformed into meaningful data. We’ll also touch on why this transformation is key to optimizing ControlNet and setting up more advanced modifications.
The UNet processes the conditioned input after it passes through the Transformer, and its output is fed into the frozen T2I foundation model. Essentially, the ControlNet component interprets the conditioned input, transforms it into meaningful data, and directs it to the T2I foundation model in a way that enhances the final result.
In many ways, this ControlNet (UNet + Transformer) is the heart of the platform, the core element that really drives things forward. When we train ControlNet, this is the part that we're actually training. And when you download a ControlNet from Hugging Face, these are the components you're downloading and connecting to the original T2I model.
(The word "connecting" here hides quite a bit of hand-waving, which I hope you'll understand in depth by the end of this blog; I wrote a full section about it.)
The Frozen (Locked) Text-to-Image Foundation Model
The frozen foundation model remains unchanged and simply connects to the ControlNet model. In the original paper, the researchers chose to showcase and test ControlNet with Stable Diffusion, but at Bria we’ve also experimented with it using our own foundation models, and the results were incredible. It’s important to understand that when you train a ControlNet on a specific frozen foundation model, it will only work at inference time with that foundation model. You can’t take a ControlNet trained on the Bria 2.3 model and connect it to SDXL, or vice versa.
Why is that? It’s actually quite fascinating. Both Bria’s and Stable Diffusion’s models are trained on vast datasets and handle a wide range of visual content, including illustrations and photography, yet each foundation model creates a specific and exclusive bond with its hyper-network: the ControlNet’s weights are learned relative to that particular model’s internal representations, so the signals it injects only make sense when added to that model’s layers. This is why a ControlNet trained on Bria cannot be used with Stable Diffusion, and vice versa; they are tightly integrated and designed to work together.
The Concept of a Hyper-Network
The idea of a hyper-network, or an external model, isn’t new.
(See, for example, the original HyperNetworks paper: https://arxiv.org/pdf/1609.09106)
It’s based on the premise that you have a base foundational model that’s large, powerful, and highly intelligent, but it’s tailored to a very specific task (for example, converting text into images). Instead of retraining this large model for a new task (for example, converting text + a depth map into images), we create a hyper (external) network precisely adapted to the required task.
This hyper-network is much smaller than the foundation model and is connected to it, and we train only the hyper-network. The result is an efficient and effective solution that allows for precise adjustments without altering the foundation model itself.
This is exactly what’s being done here in the ControlNet platform.
In the foundation diffusion model (T2I), signals pass through layers of an encoder and decoder, with convolutional and attention layers shaping the final output.
ControlNet connects to this model in a clever way: it takes the visual input that will serve as a condition (like a depth map) and processes it. The output from the ControlNet UNet is then fed into the convolutional and attention layers of the T2I model, allowing the processed information to merge with the signals in the foundation model (the merging is quite simple: it’s just an element-wise addition).
This means that the ControlNet UNet introduces new information that influences the final outcome of the foundation T2I without altering the weights of the underlying T2I model, thereby maintaining its stability throughout the process.
The T2I foundation model and the ControlNet UNet connection, zoomed in
The ControlNet platform creates a mechanism that allows the ControlNet model (the UNet plus the Transformer) to channel the processed information into the foundation model. As a result, the foundation diffusion model can incorporate the new information without actually updating its weights.
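To make this channeling concrete, here’s a hedged sketch of a single denoising step using diffusers-style APIs: the ControlNet forward pass returns per-block residual tensors, and the frozen UNet accepts them as additional residuals that are simply added to its own feature maps. Variable names like noisy_latents, text_embeddings, and cond_image are illustrative and assumed to be prepared elsewhere.

```python
# Sketch of one denoising step with ControlNet residuals (diffusers-style models assumed;
# noisy_latents, timestep, text_embeddings, and cond_image are prepared elsewhere).
down_res_samples, mid_res_sample = controlnet(
    noisy_latents,
    timestep,
    encoder_hidden_states=text_embeddings,
    controlnet_cond=cond_image,          # e.g. the depth map, still in pixel space
    return_dict=False,
)

# The frozen T2I UNet simply adds these residuals to its own intermediate features.
noise_pred = unet(
    noisy_latents,
    timestep,
    encoder_hidden_states=text_embeddings,
    down_block_additional_residuals=down_res_samples,
    mid_block_additional_residual=mid_res_sample,
).sample
```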
So far, we’ve talked about how the hyper-network (ControlNet) connects to the frozen foundation model; now let’s see how the visual input (like a depth map) feeds into the ControlNet UNet. We will dive deep into these connections, because there’s a whole world of complexities and principles hidden there that are crucial to understand if you want to make any changes to the architecture.
There are four fundamental components in the ControlNet platform, and the connections between them are both intricate and crucial to fully grasping how the system works. How do we transition from image and text space to the latent space? How do we convert the output of the diffusion model back into an image? These connections are vital, and mastering them can significantly enhance results. For example, we’ve seen firsthand how integrating the VAE into additional connections within the platform improved our results.
Let’s start with a basic connection we're familiar with from diffusion models: the VAE and the Tokenizer. The input to a diffusion model is text and a noise image, but in reality, this doesn’t go directly into the model. The VAE and Tokenizer perform a conversion process, translating the information from pixel space and letter space into the latent space. The outputs of this conversion, the vectors generated by the Tokenizer (and text encoder) and by the VAE, are what actually enter the UNet of the diffusion model. In other words, these connections are fundamental to the process.
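As a concrete sketch of that conversion step (assuming a Stable-Diffusion-style pipeline with a CLIP text encoder; the paths are placeholders):

```python
import torch
from diffusers import AutoencoderKL
from transformers import CLIPTokenizer, CLIPTextModel

vae = AutoencoderKL.from_pretrained("path/to/t2i-model", subfolder="vae")
tokenizer = CLIPTokenizer.from_pretrained("path/to/t2i-model", subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained("path/to/t2i-model", subfolder="text_encoder")

# Pixel space -> latent space: a 512x512 RGB image becomes a 4x64x64 latent tensor.
image = torch.randn(1, 3, 512, 512)  # stand-in for a real, normalized image
latents = vae.encode(image).latent_dist.sample() * vae.config.scaling_factor

# Letter space -> embedding space: text becomes a sequence of token embeddings.
tokens = tokenizer("a monkey wearing sunglasses", padding="max_length",
                   max_length=tokenizer.model_max_length, return_tensors="pt")
text_embeddings = text_encoder(tokens.input_ids)[0]

# These latents and embeddings, not raw pixels or characters, are what enter the UNet.
```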
Now, let's add the connection component between ControlNet and the frozen model.
VAE and tokenizer integrated into the pipeline
Let’s focus on the VAE. The way I see it, the VAE has a dual role. First, it functions as a converter: through training, it learns how to represent visual input within the latent space. The VAE creates a latent space where visual concepts have meaning; that meaning is mathematical rather than visual, but it holds real significance.
The second role of this VAE transformer is to create meaning. It’s not just a simple conversion like JPEG compression—it’s a learning process that extracts features and generates a latent space with semantic meaning. Similar visual elements—like the same image from different angles—are represented in the latent space as points that are close to each other. Images with the same semantic meaning, or even with a similar style, will be close to each other in the latent space. This conversion is a learned process that creates a latent space with deep meaning.
To make this clearer, think about two emojis: one is laughing, and the other is wearing sunglasses. Initially, they are far apart from each other in the visual space. However, after passing through the VAE, the system learns how to represent them in the latent space. Emojis or objects that have similar visual features or styles, like these two, will be placed closer to each other in the latent space. This process doesn’t just compress the images, it creates a deeper, learned meaning.
Images with the same semantic meaning will be close to each other in the latent space. The image on the left shows the emojis in their original state, far from each other, and the image on the right shows them closer together after the VAE learning processes them in the latent space.
To expand on this idea, imagine a cluster of emojis—faces, hearts, and other familiar icons—all grouped together in the latent space because of their similar visual style. Now, let’s add a photorealistic image of a monkey. Unlike the emojis, this realistic image will be positioned far away from the cluster in the latent space, reflecting its distinct features and level of detail. But if we introduce an emoji of a monkey, it sits somewhere in between, sharing visual traits with both the emoji cluster and the photorealistic image. This demonstrates how the VAE learns to map out objects in the latent space, organizing them based on their visual or stylistic characteristics.
Images with the same semantic meaning will be close to each other in the latent space. On the left, a group of emojis is clustered closely together, representing their visual similarity in the latent space. On the right, a photorealistic image of a monkey is far from the emoji cluster, illustrating its distinct features. In between, an emoji of a monkey sits somewhere in the middle, bridging the gap between the photorealistic and emoji representations.
More About the VAE: The History of the VAE
The VAE, or Variational Autoencoder, is, in my view, one of the most central and influential tools in the field of deep learning and generative models. The VAE was first introduced in the groundbreaking paper by Kingma and Welling (2013), Auto-Encoding Variational Bayes (https://arxiv.org/abs/1312.6114), and it revolutionized the way we approach generative modeling.
The VAE introduced a new approach that combines representation learning with a probabilistic dimension. The VAE not only compresses raw data into a latent space but also allows for resampling of latent representations, enabling the creation of new data variations in a generative manner. Since its development, the VAE has become a crucial foundation in visual Gen AI. If you've grasped this component of the VAE, we can now incorporate this concept into our diagram.
For more information about the VAE, check out this great explanation: Variational Autoencoders.
Let's remember that we're talking about diffusion models! The noise prediction from the original T2I model is compared against the noise that was actually added, and the resulting update, instead of changing the parameters of the foundation model as it would when training a regular diffusion model, flows only into the ControlNet components (the UNet and the conditioning embedding).
ControlNet Training loop
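Putting the pieces together, here’s a minimal sketch of one such training step. It assumes the frozen unet, the trainable controlnet, the encoded latents and text_embeddings, the conditioning image cond_image, and a DDPM-style noise_scheduler have already been prepared; all names are illustrative.

```python
import torch
import torch.nn.functional as F

# Only the ControlNet parameters are given to the optimizer; the T2I UNet stays frozen.
optimizer = torch.optim.AdamW(controlnet.parameters(), lr=1e-5)

# One training step (latents, text_embeddings, cond_image come from the data pipeline).
noise = torch.randn_like(latents)
timestep = torch.randint(0, noise_scheduler.config.num_train_timesteps, (latents.shape[0],))
noisy_latents = noise_scheduler.add_noise(latents, noise, timestep)

down_res, mid_res = controlnet(noisy_latents, timestep,
                               encoder_hidden_states=text_embeddings,
                               controlnet_cond=cond_image, return_dict=False)
noise_pred = unet(noisy_latents, timestep,
                  encoder_hidden_states=text_embeddings,
                  down_block_additional_residuals=down_res,
                  mid_block_additional_residual=mid_res).sample

# Standard diffusion noise-prediction loss; gradients flow only into the ControlNet.
loss = F.mse_loss(noise_pred, noise)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```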
Now that we've established how crucial the connections are, let’s take a closer look at the Transformer component (AKA embedder), a key part of the ControlNet architecture. Understanding its role is essential to grasp how ControlNet processes additional inputs like depth maps, and how it integrates with the foundation model.
The Transformer is crucial because it serves as the bridge between the raw input (such as a depth map) and the ControlNet UNet. Without understanding how this transformation happens, it’s difficult to optimize or customize ControlNet effectively. The Transformer is more than just a simple processing layer; it dictates how the information is encoded and adapted into the latent space. By mastering the inner workings of this component, you’ll be able to make informed decisions about modifications that can dramatically improve performance.
Let’s break down how the Transformer works:
In the Transformer, the depth map is converted into a tensor through a series of convolutional layers that perform feature extraction from the original image, transforming it into a multi-dimensional data array. This tensor is then fed into the ControlNet UNet for further processing and integration with the information from the base model.
The conversion of a depth map into a tensor in ControlNet involves several stages based on Convolutional Neural Networks (CNNs). Let's break down the process in more technical detail:
The resulting tensor represents the abstracted information extracted from the 2D input and is now ready for further processing in the ControlNet UNet.
The paper doesn’t go into much detail about this component, but the ControlNet code in this context is relatively clear: the Transformer component is actually a small stack of convolutional layers with a relatively modest number of parameters. These parameters are initialized randomly and are learned during training.
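To make this concrete, here’s a simplified, illustrative reimplementation of such a conditioning embedding: a small stack of strided convolutions that takes a pixel-space condition (say, a 3-channel depth map at 512x512) down to the latent resolution and channel count expected by the ControlNet UNet. In common implementations the final convolution is zero-initialized, so at the start of training the condition contributes nothing; the channel sizes below are assumptions for illustration, not the exact production values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditioningEmbedding(nn.Module):
    """Illustrative sketch: pixel-space condition -> latent-resolution feature tensor."""
    def __init__(self, cond_channels=3, out_channels=320):
        super().__init__()
        self.conv_in = nn.Conv2d(cond_channels, 16, kernel_size=3, padding=1)
        # Each strided conv halves the spatial resolution: 512 -> 256 -> 128 -> 64.
        self.blocks = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=3, padding=1, stride=2), nn.SiLU(),
            nn.Conv2d(32, 96, kernel_size=3, padding=1, stride=2), nn.SiLU(),
            nn.Conv2d(96, 256, kernel_size=3, padding=1, stride=2), nn.SiLU(),
        )
        # Zero-initialized output conv: initially the condition adds nothing to the UNet.
        self.conv_out = nn.Conv2d(256, out_channels, kernel_size=3, padding=1)
        nn.init.zeros_(self.conv_out.weight)
        nn.init.zeros_(self.conv_out.bias)

    def forward(self, cond):
        x = F.silu(self.conv_in(cond))
        x = self.blocks(x)
        return self.conv_out(x)

embed = ConditioningEmbedding()
depth_map = torch.randn(1, 3, 512, 512)   # stand-in for a normalized depth map
print(embed(depth_map).shape)             # torch.Size([1, 320, 64, 64])
```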
Questioning the Role of the Transformer: Why Replace the VAE with a Simple Convolutional Layer?
There’s an inherent assumption here that this component will know how to take a depth map, extract features from it, and generate a new optimal representation of the depth map in the latent space. Now, this last point is a bit odd, isn’t it? We’ll address it in the next blog: why does it make sense to replace a component as significant and historically important as the VAE with a simple convolutional layer?
What’s Next?
In the next blog, we’ll dive into the following ideas:
Full disclosure: The author is the VP of Generative AI Technology at Bria, holds a Ph.D. in Computer Vision, and has many years of experience in generative AI. She has extensive expertise in training models (both fine-tuning and from scratch), writing pipelines, and conducting practical, theoretical, and applied research in various areas of computer vision.