At BRIA, we build foundation models for visual generation. Performance is everything - our customers rely on our APIs in production, and milliseconds can determine whether an experience feels smooth or broken.
One day, we faced an urgent escalation:
“The API is too slow. We can’t go live like this.”
For a customer about to launch, inference latency was a showstopper. We had to move fast - really fast. The optimization was declared an all-hands emergency, with parallel efforts across multiple teams.
My mission: make inference as fast as possible. What followed was one of the most intense 48-hour optimization pushes we've ever experienced.
Stage 1: Torch Compile
Our first lever was Torch Compile - PyTorch’s built-in compiler for speeding up model inference.
Normally, this is straightforward. Torch’s JIT compilation can produce big wins with minimal code changes. But this wasn’t a normal case.
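For context, the straightforward path in a diffusers-style pipeline is roughly a one-liner on the heaviest module. A minimal sketch, with a placeholder model ID rather than our actual checkpoint:

```python
import torch
from diffusers import DiffusionPipeline

# Placeholder model ID - our actual checkpoint isn't named in this post.
pipe = DiffusionPipeline.from_pretrained(
    "your-org/your-model", torch_dtype=torch.bfloat16
).to("cuda")

# Compile the denoiser, which dominates per-step cost. The first call triggers
# JIT compilation; later calls with the same shapes reuse the compiled graph.
# (Depending on the architecture, this attribute is `transformer` or `unet`.)
pipe.transformer = torch.compile(
    pipe.transformer, mode="max-autotune", fullgraph=True
)
```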
We ran on AWS G6e instances (NVIDIA L40S Tensor Core GPUs), which had their own quirks.
Just a month earlier, using Torch Compile with LoRA meant every new LoRA load triggered a full recompilation - sometimes making inference slower than running without it.
This time, I tried diffusers’ new LoRA hot-swapping feature. For the first time, it worked seamlessly. No more forced recompilation every time we loaded an adapter.
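The flow looks roughly like this - a sketch based on diffusers’ hot-swapping API, with made-up adapter paths and a hypothetical rank; exact method names and arguments depend on your diffusers version:

```python
# Continuing from the pipeline sketch above (`pipe` and `torch` already defined).

# If adapters have different ranks, allow for the largest one up front
# (hypothetical rank value).
pipe.enable_lora_hotswap(target_rank=128)

# Load the first adapter normally, then compile once.
pipe.load_lora_weights("lora_a.safetensors", adapter_name="default_0")
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)

# Subsequent loads swap the weights in place under the same adapter name,
# so the compiled graph stays valid - no recompilation.
pipe.load_lora_weights("lora_b.safetensors", adapter_name="default_0", hotswap=True)
```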
Still, Torch Compile wasn't plug-and-play: every new pod started cold and had to recompile from scratch during warm-up, which took hours per pod. Clearly, this wouldn’t scale.
Torch Compile does cache compiled graphs. The trick was making the cache persistent across pods. Options included Redis or a shared filesystem; we went with the latter, since caches could grow to multiple gigabytes.
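Concretely, that amounts to pointing PyTorch Inductor’s on-disk cache at the shared mount before anything compiles. A sketch, with an illustrative mount path:

```python
import os

# Point Inductor's on-disk cache at a volume mounted by every pod, and make sure
# the FX graph cache is on, so a graph compiled once is reused cluster-wide.
# (Illustrative mount path.)
os.environ["TORCHINDUCTOR_CACHE_DIR"] = "/mnt/shared/torchinductor-cache"
os.environ["TORCHINDUCTOR_FX_GRAPH_CACHE"] = "1"

import torch  # import after setting the variables so Inductor picks them up
```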
At 4 AM, after countless iterations and 30% faster inference, deployment succeeded. I sent a quick note to our customer-facing team, then finally collapsed into bed.
A 30% improvement was huge, but not enough. Some edge cases - two requests hitting the same GPU - still doubled inference time, breaching SLA.
Stage 2: Pruna AI
Time for our second optimization lever.
Pruna is a GenAI inference optimization engine with flexible configuration and great support. Integrating it was surprisingly smooth, and in just hours, we achieved another 20–30% performance boost.
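The integration essentially boils down to wrapping the pipeline with Pruna’s smash call and picking optimization algorithms in a SmashConfig. A rough sketch - the specific algorithm names here are illustrative placeholders, not the configuration we actually shipped:

```python
from pruna import SmashConfig, smash

# `pipe` as in the earlier sketches. The config keys and algorithm names below
# are illustrative; consult Pruna's docs for what fits your model.
smash_config = SmashConfig()
smash_config["compiler"] = "torch_compile"
smash_config["cacher"] = "deepcache"

# smash() returns an optimized version of the pipeline as a drop-in replacement.
pipe = smash(model=pipe, smash_config=smash_config)
```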
29 hours later, with only 5 hours of sleep and deep into the weekend, the optimization was complete.
🚀 Results:
Inference latency cut by over 50%
Customer unblocked and delighted
Team exhausted but proud
This sprint reinforced key lessons for anyone building production GenAI infrastructure:
Leverage Torch Compile - but design warm-up carefully. Dynamic workloads need tailored warm-up recipes and caching strategies.
Use LoRA hot-swapping. It’s a game-changer for flexible, adapter-heavy deployments.
Don’t underestimate caching. Persisting compile caches across pods turned warm-up from hours into minutes.
Have multiple optimization levers. Torch Compile got us far; Pruna AI sealed the deal.
Team grit matters. When time is short, collaboration and perseverance are as important as tools.
Cutting inference time by 50% in under 2 days wasn’t easy - but with the right mix of tools, persistence, and teamwork, it was possible.
For teams working on GenAI in production, performance tuning is both an art and a science. Sometimes, it’s about knowing which levers to pull - and sometimes, it’s about having the grit to pull them at 4 AM.