2 min read
How We Cut AI Inference Time by 50% in Just 2 Days - Without Losing Quality
Michael Feinstein & Tair Chamiel · Aug 25, 2025

A real-world deep dive into optimizing GenAI inference latency with Torch Compile, LoRA hot-swapping, and Pruna AI.
The Challenge
At BRIA, we build foundation models for visual generation. Performance is everything - our customers rely on our APIs in production, and milliseconds can determine whether an experience feels smooth or broken.
One day, we faced an urgent escalation:
“The API is too slow. We can’t go live like this.”
For a customer about to launch, inference latency was a showstopper. We had to move fast - really fast. The optimization was declared an all-hands emergency, with parallel efforts across multiple teams.
My mission: make inference as fast as possible. What followed was one of the most intense 48-hour optimization pushes we've ever experienced.
Stage 1: Torch Compile
Our first lever was Torch Compile - PyTorch’s built-in optimization for model inference.
Normally, this is straightforward. Torch’s JIT compilation can produce big wins with minimal code changes. But this wasn’t a normal case.
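In the straightforward case, compiling the main denoiser is only a couple of lines. The sketch below assumes a diffusers-style pipeline with a transformer backbone; the model ID, dtype, and compile mode are placeholders rather than our production configuration.

```python
import torch
from diffusers import DiffusionPipeline

# Placeholder model ID and dtype; not our actual production setup.
pipe = DiffusionPipeline.from_pretrained(
    "your-org/your-model", torch_dtype=torch.bfloat16
).to("cuda")

# Compile the heaviest component (for UNet-based pipelines this would be pipe.unet).
# "max-autotune" spends more time at warm-up searching for fast kernels
# in exchange for lower steady-state latency.
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune")
```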
Why It Was Harder Than Usual
- We were working with a complex foundation model + auxiliary models.
- Models changed dynamically at runtime - especially with LoRA adapters.
- We ran on AWS G6e instances (NVIDIA L40S Tensor Core GPUs), which had their own quirks.
Just a month earlier, using Torch Compile with LoRA meant every new LoRA load triggered a full recompilation - sometimes making inference slower than running without it.
The Breakthrough: LoRA Hot-Swapping
This time, I tried diffusers’ new LoRA hot-swapping feature. For the first time, it worked seamlessly. No more forced recompilation every time we loaded an adapter.
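As I recall it from the diffusers documentation, the pattern looks roughly like this; the adapter paths, adapter name, and rank below are placeholders, so treat it as a sketch rather than our exact code.

```python
# Opt in to hot-swapping before loading the first adapter; target_rank must be
# at least the largest LoRA rank you plan to swap in later.
pipe.enable_lora_hotswap(target_rank=64)

# Load the first adapter normally, then compile once.
pipe.load_lora_weights("your-org/lora-style-a", adapter_name="default_0")
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune")

# Later adapters are swapped in place - no recompilation is triggered.
pipe.load_lora_weights("your-org/lora-style-b", hotswap=True, adapter_name="default_0")
```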
Still, Torch Compile wasn't plug-and-play:
- Warm-up was painful. With auxiliary models and changing parameters, Torch’s JIT needed to compile many “unique” runs.
- We found that 8 warm-up runs were necessary for stable performance. Each run took ~4 minutes → 45+ minutes just to warm up a pod.
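In practice, the warm-up recipe amounts to running the pipeline once for every combination of shapes and adapters the JIT will see in production. The configurations below are purely illustrative placeholders, not our actual warm-up matrix.

```python
# Hypothetical warm-up matrix; resolutions, step counts, and adapter names are placeholders.
WARMUP_CONFIGS = [
    {"height": 1024, "width": 1024, "lora": "style_a"},
    {"height": 768,  "width": 1344, "lora": "style_a"},
    {"height": 1024, "width": 1024, "lora": "style_b"},
    # ...one entry per unique graph the compiler will encounter
]

LORA_PATHS = {"style_a": "your-org/lora-style-a", "style_b": "your-org/lora-style-b"}

for cfg in WARMUP_CONFIGS:
    pipe.load_lora_weights(LORA_PATHS[cfg["lora"]], hotswap=True, adapter_name="default_0")
    pipe(prompt="warm-up", height=cfg["height"], width=cfg["width"], num_inference_steps=4)
```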
Clearly, this wouldn’t scale.
Caching to the Rescue
Torch Compile does cache compiled graphs. The trick was making the cache persistent across pods. Options included Redis or a shared filesystem; we went with the latter, since caches could grow to multiple gigabytes.
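Concretely, pointing Inductor's on-disk cache at a shared volume is mostly a matter of environment variables, set before anything gets compiled. The mount path below is a placeholder.

```python
import os

# Must be set before torch.compile runs (e.g. in the pod spec or at process start).
# Every pod that mounts the same volume reuses the artifacts compiled by the first one.
os.environ["TORCHINDUCTOR_CACHE_DIR"] = "/mnt/shared/torchinductor-cache"  # placeholder mount
os.environ["TORCHINDUCTOR_FX_GRAPH_CACHE"] = "1"  # persist FX graph cache entries to disk
```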
At 4 AM, after countless iterations, the deployment succeeded with inference 30% faster than before. I sent a quick note to our customer-facing team, then finally collapsed into bed.
Stage 2: Pruna AI
A 30% improvement was huge, but not enough. Some edge cases - two requests hitting the same GPU, for instance - still doubled inference time, breaching our SLA.
Time for our second optimization lever: Pruna AI.
Pruna is a GenAI inference optimization engine with flexible configuration and great support. Integrating it was surprisingly smooth, and in just hours, we achieved another 20–30% performance boost.
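Integration essentially means wrapping the pipeline with Pruna's smash API. The snippet below is a minimal sketch; the specific algorithm choices shown (a compiler plus a caching method) are illustrative assumptions, not the exact combination we shipped.

```python
from pruna import SmashConfig, smash

# Illustrative configuration - the algorithms worth enabling depend on the model.
smash_config = SmashConfig()
smash_config["compiler"] = "torch_compile"
smash_config["cacher"] = "deepcache"

# Returns an optimized pipeline with the same call signature as the original.
smashed_pipe = smash(model=pipe, smash_config=smash_config)
```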
The Outcome
29 hours later, with only 5 hours of sleep and deep into the weekend, the optimization was complete.
🚀 Results:
- Inference latency cut by over 50%
- Customer unblocked and delighted
- Team exhausted but proud
Lessons Learned
This sprint reinforced key lessons for anyone building production GenAI infrastructure:
- Leverage Torch Compile - but design warm-up carefully. Dynamic workloads need tailored warm-up recipes and caching strategies.
- Use LoRA hot-swapping. It’s a game-changer for flexible, adapter-heavy deployments.
- Don’t underestimate caching. Persisting compile caches across pods turned warm-up from hours into minutes.
- Have multiple optimization levers. Torch Compile got us far; Pruna AI sealed the deal.
- Team grit matters. When time is short, collaboration and perseverance are as important as tools.
Final Thoughts
Cutting inference time by 50% in under 2 days wasn’t easy - but with the right mix of tools, persistence, and teamwork, it was possible.
For teams working on GenAI in production, performance tuning is both an art and a science. Sometimes, it’s about knowing which levers to pull - and sometimes, it’s about having the grit to pull them at 4 AM.
Start building with BRIA's platform with a free trial