
How We Cut AI Inference Time by 50% in Just 2 Days - Without Losing Quality



A real-world deep dive into optimizing GenAI inference latency with Torch Compile, LoRA hot-swapping, and Pruna AI.

 

The Challenge

 

At BRIA, we build foundation models for visual generation. Performance is everything - our customers rely on our APIs in production, and milliseconds can determine whether an experience feels smooth or broken.

One day, we faced an urgent escalation:

“The API is too slow. We can’t go live like this.”

For a customer about to launch, inference latency was a showstopper. We had to move fast - really fast. The optimization was declared an all-hands emergency, with parallel efforts across multiple teams.

My mission: make inference as fast as possible. What followed was one of the most intense 48-hour optimization pushes we've ever experienced. 

 

Stage 1: Torch Compile

Our first lever was Torch Compile - PyTorch’s built-in optimization for model inference.

Normally, this is straightforward. Torch’s JIT compilation can produce big wins with minimal code changes. But this wasn’t a normal case.
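
For context, enabling it is normally a one-line change around the heaviest model component. The sketch below is illustrative only - the checkpoint name and settings are placeholders, not our production setup:

```python
import torch
from diffusers import DiffusionPipeline

# Placeholder checkpoint; our production models and settings differ.
pipe = DiffusionPipeline.from_pretrained(
    "some/diffusion-checkpoint", torch_dtype=torch.float16
).to("cuda")

# Compile the heaviest component; "max-autotune" trades a longer warm-up
# for faster steady-state kernels.
pipe.unet = torch.compile(pipe.unet, mode="max-autotune", fullgraph=True)

# The first call triggers JIT compilation; later calls reuse the compiled
# graph as long as shapes and code paths stay stable.
image = pipe("a product photo on a white background").images[0]
```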

 

Why It Was Harder Than Usual

  • We were working with a complex foundation model + auxiliary models.

  • Models changed dynamically at runtime - especially with LoRA adapters.

  • We ran on AWS G6e instances (NVIDIA L40S Tensor Core GPUs), which had their own quirks.

 

Just a month earlier, using Torch Compile with LoRA meant every new LoRA load triggered a full recompilation - sometimes making inference slower than running without it.

 

The Breakthrough: LoRA Hot-Swapping

 

This time, I tried diffusers’ new LoRA hot-swapping feature. For the first time, it worked seamlessly. No more forced recompilation every time we loaded an adapter.
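
Conceptually, the flow looks something like the sketch below. The API names follow recent diffusers releases as I understand them, and the checkpoint and adapter paths are placeholders - check the diffusers docs for the exact signatures.

```python
import torch
from diffusers import DiffusionPipeline

# Placeholder checkpoint; our production models differ.
pipe = DiffusionPipeline.from_pretrained(
    "some/diffusion-checkpoint", torch_dtype=torch.float16
).to("cuda")

# Reserve LoRA slots up to the largest rank we expect to load, so later
# swaps don't change tensor shapes (a shape change would force a recompile).
pipe.enable_lora_hotswap(target_rank=128)

# Load the first adapter, then compile once.
pipe.load_lora_weights("/path/to/lora_a.safetensors")  # placeholder path
pipe.unet = torch.compile(pipe.unet, mode="max-autotune", fullgraph=True)

# Later, swap a different adapter into the same slot - no recompilation.
pipe.load_lora_weights(
    "/path/to/lora_b.safetensors",   # placeholder path
    hotswap=True,
    adapter_name="default_0",        # default name given to the first adapter
)
```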

Still, Torch Compile wasn't plug-and-play:

  • Warm-up was painful. With auxiliary models and changing parameters, Torch’s JIT needed to compile many “unique” runs.

  • We found that 8 warm-up runs were necessary for stable performance. Each run took ~4 minutes → 45+ minutes just to warm up a pod.

Clearly, this wouldn’t scale.
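
The warm-up itself was nothing exotic: a scripted pass over the request shapes and parameter combinations we actually serve, so every graph variant gets compiled before the pod takes traffic. A simplified, hypothetical version:

```python
# Hypothetical warm-up routine; the shapes and step counts below are
# illustrative, not our production request mix.
WARMUP_CONFIGS = [
    {"width": 1024, "height": 1024, "num_inference_steps": 30},
    {"width": 1344, "height": 768, "num_inference_steps": 30},
    # ...one entry per shape/code path seen in production...
]

def warm_up(pipe):
    for cfg in WARMUP_CONFIGS:
        # The first call with a new shape triggers compilation; with a
        # persistent compile cache this becomes a fast cache hit instead.
        pipe("warm-up prompt", **cfg)
```

The real problem was paying the full compile cost on every new pod, which is where caching comes in.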

 

Caching to the Rescue

Torch Compile does cache compiled graphs. The trick was making the cache persistent across pods. Options included Redis or a shared filesystem; we went with the latter, since caches could grow to multiple gigabytes.
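
In practice this came down to pointing TorchInductor's on-disk cache at a volume mounted into every pod. A minimal sketch, assuming a shared mount at a placeholder path:

```python
import os

# Point TorchInductor's on-disk cache at a volume shared across pods.
# "/mnt/compile-cache" is a placeholder for the shared filesystem mount.
# These must be set before the first torch.compile call in the process.
os.environ["TORCHINDUCTOR_CACHE_DIR"] = "/mnt/compile-cache/inductor"

# Cache whole FX graphs, not just individual Triton kernels, so a pod that
# has already seen a warm-up shape skips recompilation entirely.
os.environ["TORCHINDUCTOR_FX_GRAPH_CACHE"] = "1"
```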

At 4 AM, after countless iterations, the deployment succeeded with inference roughly 30% faster. I sent a quick note to our customer-facing team, then finally collapsed into bed.

 

Stage 2: Pruna AI

A 30% improvement was huge, but not enough. Some edge cases - two requests hitting the same GPU - still doubled inference time, breaching our SLA.

Time for our second optimization lever: Pruna AI.

Pruna is a GenAI inference optimization engine with flexible configuration and great support. Integrating it was surprisingly smooth, and in just hours, we achieved another 20–30% performance boost.
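
The integration essentially wraps the pipeline with Pruna's smash call. A minimal sketch based on Pruna's public SmashConfig / smash interface - the algorithm choice below is illustrative, not our production configuration:

```python
from pruna import SmashConfig, smash

# Illustrative configuration only; Pruna exposes many algorithms
# (compilers, cachers, quantizers) and this is just one example value.
smash_config = SmashConfig()
smash_config["cacher"] = "deepcache"

# Wrap the existing diffusers pipeline; the smashed pipeline keeps the
# original call signature.
smashed_pipe = smash(model=pipe, smash_config=smash_config)
image = smashed_pipe("a product photo on a white background").images[0]
```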

 

The Outcome

29 hours later, with only 5 hours of sleep and deep into the weekend, the optimization was complete.

🚀 Results:

  • Inference latency cut by over 50%

  • Customer unblocked and delighted

  • Team exhausted but proud

Lessons Learned

This sprint reinforced key lessons for anyone building production GenAI infrastructure:

    1. Leverage Torch Compile - but design warm-up carefully. Dynamic workloads need tailored warm-up recipes and caching strategies.

    2. Use LoRA hot-swapping. It’s a game-changer for flexible, adapter-heavy deployments.

    3. Don’t underestimate caching. Persisting compile caches across pods turned warm-up from hours into minutes.

    4. Have multiple optimization levers. Torch Compile got us far; Pruna AI sealed the deal.

    5. Team grit matters. When time is short, collaboration and perseverance are as important as tools.

Final Thoughts

Cutting inference time by 50% in under 2 days wasn’t easy - but with the right mix of tools, persistence, and teamwork, it was possible.

For teams working on GenAI in production, performance tuning is both an art and a science. Sometimes, it’s about knowing which levers to pull - and sometimes, it’s about having the grit to pull them at 4 AM.


Start building with BRIA's platform with a free trial

 