How We Cut AI Inference Time by 50% in Just 2 Days - Without Losing Quality
2 min read
Michael Feinstein & Tair Chamiel · Aug 25, 2025
At BRIA, we build foundation models for visual generation. Performance is everything - our customers rely on our APIs in production, and milliseconds can determine whether an experience feels smooth or broken.
One day, we faced an urgent escalation:
“The API is too slow. We can’t go live like this.”
For a customer about to launch, inference latency was a showstopper. We had to move fast - really fast. The optimization was declared an all-hands emergency, with parallel efforts across multiple teams.
My mission: make inference as fast as possible. What followed was one of the most intense 48-hour optimization pushes we've ever experienced.
Stage 1: Torch Compile
Our first lever was Torch Compile - PyTorch’s built-in optimization for model inference.
Normally, this is straightforward. Torch’s JIT compilation can produce big wins with minimal code changes. But this wasn’t a normal case.
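To give a sense of how small the code change usually is, here is a minimal sketch using a public diffusers pipeline as a stand-in for our model (the model ID, prompt, and compile settings are illustrative, not our production configuration):

```python
import torch
from diffusers import DiffusionPipeline

# Illustrative stand-in pipeline; our production model and settings differ.
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Compile the denoising network, which dominates inference time.
# "max-autotune" trades longer compilation for faster steady-state inference.
pipe.unet = torch.compile(pipe.unet, mode="max-autotune", fullgraph=True)

# The first call triggers compilation (slow); later calls reuse the compiled graph.
image = pipe("a product photo of a ceramic mug", num_inference_steps=30).images[0]
```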
We ran on AWS G6e instances (NVIDIA L40S Tensor Core GPUs), which had their own quirks.
Just a month earlier, using Torch Compile with LoRA meant every new LoRA load triggered a full recompilation - sometimes making inference slower than running without it.
This time, I tried diffusers’ new LoRA hot-swapping feature. For the first time, it worked seamlessly. No more forced recompilation every time we loaded an adapter.
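The pattern looks roughly like the sketch below, assuming a recent diffusers release with LoRA hot-swapping support; the adapter repo names are placeholders, and exact method names and arguments may vary by version:

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Reserve space for the largest LoRA rank we expect, so later swaps
# don't change tensor shapes (which would force a recompile).
pipe.enable_lora_hotswap(target_rank=128)

# Load the first adapter (placeholder repo names), then compile once.
pipe.load_lora_weights("some-org/lora-style-a", adapter_name="style")
pipe.unet = torch.compile(pipe.unet, mode="max-autotune", fullgraph=True)

# Swapping in a different adapter reuses the compiled graph
# instead of triggering a full recompilation.
pipe.load_lora_weights("some-org/lora-style-b", adapter_name="style", hotswap=True)
```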
Still, Torch Compile wasn't plug-and-play: every new pod started cold, and warming up compiled graphs for our dynamic workloads could take hours. Clearly, this wouldn't scale.
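To make the problem concrete: warm-up means running representative requests through the compiled pipeline before serving traffic, so compilation cost is paid up front. A rough sketch (the shapes and step count are illustrative, not our production recipe):

```python
# Hypothetical warm-up pass: exercise the resolutions we expect in
# production so each graph variant is compiled before traffic arrives.
WARMUP_SHAPES = [(1024, 1024), (1216, 832), (832, 1216)]

def warm_up(pipe):
    for height, width in WARMUP_SHAPES:
        # Each new shape can trigger a fresh compilation, which is exactly
        # why cold-starting every pod this way is too slow.
        pipe(
            "warm-up prompt",
            height=height,
            width=width,
            num_inference_steps=2,
        )
```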
Torch Compile does cache compiled graphs. The trick was making the cache persistent across pods. Options included Redis or a shared filesystem; we went with the latter, since caches could grow to multiple gigabytes.
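In practice, pointing TorchInductor's on-disk cache at a shared mount can be as simple as setting environment variables before the model loads (a sketch based on PyTorch's Inductor cache settings; the mount path is illustrative):

```python
import os

# Point TorchInductor's on-disk cache at a shared volume mounted into every
# pod, so graphs compiled on one pod are reused by the others.
# The path is illustrative; any shared filesystem mount works.
os.environ["TORCHINDUCTOR_CACHE_DIR"] = "/mnt/shared/torchinductor-cache"

# Make sure the FX graph cache is enabled so compiled graphs are written to disk.
os.environ["TORCHINDUCTOR_FX_GRAPH_CACHE"] = "1"

# These must be set before the first torch.compile call in the process.
import torch  # noqa: E402
```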
At 4 AM, after countless iterations and 30% faster inference, deployment succeeded. I sent a quick note to our customer-facing team, then finally collapsed into bed.
A 30% improvement was huge, but not enough. Some edge cases - two requests hitting the same GPU - still doubled inference time, breaching SLA.
Time for our second optimization lever: Pruna AI.
Pruna is a GenAI inference optimization engine with flexible configuration and great support. Integrating it was surprisingly smooth, and in just hours, we achieved another 20–30% performance boost.
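The integration follows a compress-then-run pattern built around Pruna's smash API. A minimal sketch, with the caveat that the configuration key and value shown are assumptions based on Pruna's public documentation, not our production settings:

```python
import torch
from diffusers import DiffusionPipeline
from pruna import SmashConfig, smash

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Pick optimization algorithms via a SmashConfig; the key/value below are
# illustrative assumptions, not our exact production configuration.
smash_config = SmashConfig()
smash_config["compiler"] = "torch_compile"

# smash() returns an optimized pipeline with the same calling convention.
smashed_pipe = smash(model=pipe, smash_config=smash_config)

image = smashed_pipe("a product photo of a ceramic mug").images[0]
```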
29 hours later, with only 5 hours of sleep and deep into the weekend, the optimization was complete.
🚀 Results:
Inference latency cut by over 50%
Customer unblocked and delighted
Team exhausted but proud
This sprint reinforced key lessons for anyone building production GenAI infrastructure:
Leverage Torch Compile - but design warm-up carefully. Dynamic workloads need tailored warm-up recipes and caching strategies.
Use LoRA hot-swapping. It’s a game-changer for flexible, adapter-heavy deployments.
Don’t underestimate caching. Persisting compile caches across pods turned warm-up from hours into minutes.
Have multiple optimization levers. Torch Compile got us far; Pruna AI sealed the deal.
Team grit matters. When time is short, collaboration and perseverance are as important as tools.
Cutting inference time by 50% in under 2 days wasn’t easy - but with the right mix of tools, persistence, and teamwork, it was possible.
For teams working on GenAI in production, performance tuning is both an art and a science. Sometimes, it’s about knowing which levers to pull - and sometimes, it’s about having the grit to pull them at 4 AM.
Contact us for a deeper understanding of our Generative AI capabilities:
Start building with BRIA's platform with a free trial