
How We Cut AI Inference Time by 50% in Just 2 Days - Without Losing Quality



A real-world deep dive into optimizing GenAI inference latency with Torch Compile, LoRA hot-swapping, and Pruna AI.

 

The Challenge

 

At BRIA, we build foundation models for visual generation. Performance is everything - our customers rely on our APIs in production, and milliseconds can determine whether an experience feels smooth or broken.

One day, we faced an urgent escalation:

“The API is too slow. We can’t go live like this.”

For a customer about to launch, inference latency was a showstopper. We had to move fast - really fast. The optimization was declared an all-hands emergency, with parallel efforts across multiple teams.

My mission: make inference as fast as possible. What followed was one of the most intense 48-hour optimization pushes we've ever experienced. 

 

Stage 1: Torch Compile

Our first lever was Torch Compile - PyTorch’s built-in optimization for model inference.

Normally, this is straightforward. Torch’s JIT compilation can produce big wins with minimal code changes. But this wasn’t a normal case.
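
For context, enabling it is normally a one-line change around the heaviest model component. The sketch below is illustrative only - the checkpoint name and settings are placeholders, not our production setup:

```python
import torch
from diffusers import DiffusionPipeline

# Placeholder checkpoint; our production models and settings differ.
pipe = DiffusionPipeline.from_pretrained(
    "some/diffusion-checkpoint", torch_dtype=torch.float16
).to("cuda")

# Compile the heaviest component; "max-autotune" trades a longer warm-up
# for faster steady-state kernels.
pipe.unet = torch.compile(pipe.unet, mode="max-autotune", fullgraph=True)

# The first call triggers JIT compilation; later calls reuse the compiled
# graph as long as shapes and code paths stay stable.
image = pipe("a product photo on a white background").images[0]
```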

 

Why It Was Harder Than Usual

  • We were working with a complex foundation model + auxiliary models.

  • Models changed dynamically at runtime - especially with LoRA adapters.

  • We ran on AWS G6e instances (NVIDIA L40S Tensor Core GPUs), which had their own quirks.

 

Just a month earlier, using Torch Compile with LoRA meant every new LoRA load triggered a full recompilation - sometimes making inference slower than running without it.

 

The Breakthrough: LoRA Hot-Swapping

 

This time, I tried diffusers’ new LoRA hot-swapping feature. For the first time, it worked seamlessly. No more forced recompilation every time we loaded an adapter.
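
Conceptually, the flow looks something like the sketch below. The API names follow recent diffusers releases as I understand them, and the checkpoint and adapter paths are placeholders - check the diffusers docs for the exact signatures.

```python
import torch
from diffusers import DiffusionPipeline

# Placeholder checkpoint; our production models differ.
pipe = DiffusionPipeline.from_pretrained(
    "some/diffusion-checkpoint", torch_dtype=torch.float16
).to("cuda")

# Reserve LoRA slots up to the largest rank we expect to load, so later
# swaps don't change tensor shapes (a shape change would force a recompile).
pipe.enable_lora_hotswap(target_rank=128)

# Load the first adapter, then compile once.
pipe.load_lora_weights("/path/to/lora_a.safetensors")  # placeholder path
pipe.unet = torch.compile(pipe.unet, mode="max-autotune", fullgraph=True)

# Later, swap a different adapter into the same slot - no recompilation.
pipe.load_lora_weights(
    "/path/to/lora_b.safetensors",   # placeholder path
    hotswap=True,
    adapter_name="default_0",        # default name given to the first adapter
)
```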

Still, Torch Compile wasn't plug-and-play:

  • Warm-up was painful. With auxiliary models and changing parameters, Torch’s JIT needed to compile many “unique” runs.

  • We found that 8 warm-up runs were necessary for stable performance. Each run took ~4 minutes → 45+ minutes just to warm up a pod.

Clearly, this wouldn’t scale.
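
The warm-up itself was nothing exotic: a scripted pass over the request shapes and parameter combinations we actually serve, so every graph variant gets compiled before the pod takes traffic. A simplified, hypothetical version:

```python
# Hypothetical warm-up routine; the shapes and step counts below are
# illustrative, not our production request mix.
WARMUP_CONFIGS = [
    {"width": 1024, "height": 1024, "num_inference_steps": 30},
    {"width": 1344, "height": 768, "num_inference_steps": 30},
    # ...one entry per shape/code path seen in production...
]

def warm_up(pipe):
    for cfg in WARMUP_CONFIGS:
        # The first call with a new shape triggers compilation; with a
        # persistent compile cache this becomes a fast cache hit instead.
        pipe("warm-up prompt", **cfg)
```

The real problem was paying the full compile cost on every new pod, which is where caching comes in.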

 

Caching to the Rescue

Torch Compile does cache compiled graphs. The trick was making the cache persistent across pods. Options included Redis or a shared filesystem; we went with the latter, since caches could grow to multiple gigabytes.
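
In practice this came down to pointing TorchInductor's on-disk cache at a volume mounted into every pod. A minimal sketch, assuming a shared mount at a placeholder path:

```python
import os

# Point TorchInductor's on-disk cache at a volume shared across pods.
# "/mnt/compile-cache" is a placeholder for the shared filesystem mount.
# These must be set before the first torch.compile call in the process.
os.environ["TORCHINDUCTOR_CACHE_DIR"] = "/mnt/compile-cache/inductor"

# Cache whole FX graphs, not just individual Triton kernels, so a pod that
# has already seen a warm-up shape skips recompilation entirely.
os.environ["TORCHINDUCTOR_FX_GRAPH_CACHE"] = "1"
```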

At 4 AM, after countless iterations, the deployment succeeded with inference roughly 30% faster. I sent a quick note to our customer-facing team, then finally collapsed into bed.

 

Stage 2: Pruna AI

A 30% improvement was huge, but not enough. Some edge cases - two requests hitting the same GPU - still doubled inference time, breaching our SLA.

Time for our second optimization lever: Pruna AI.

Pruna is a GenAI inference optimization engine with flexible configuration and great support. Integrating it was surprisingly smooth, and in just hours, we achieved another 20–30% performance boost.
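
The integration essentially wraps the pipeline with Pruna's smash call. A minimal sketch based on Pruna's public SmashConfig / smash interface - the algorithm choice below is illustrative, not our production configuration:

```python
from pruna import SmashConfig, smash

# Illustrative configuration only; Pruna exposes many algorithms
# (compilers, cachers, quantizers) and this is just one example value.
smash_config = SmashConfig()
smash_config["cacher"] = "deepcache"

# Wrap the existing diffusers pipeline; the smashed pipeline keeps the
# original call signature.
smashed_pipe = smash(model=pipe, smash_config=smash_config)
image = smashed_pipe("a product photo on a white background").images[0]
```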

 

The Outcome

29 hours later, with only 5 hours of sleep and deep into the weekend, the optimization was complete.

🚀 Results:

  • Inference latency cut by over 50%

  • Customer unblocked and delighted

  • Team exhausted but proud

Lessons Learned

This sprint reinforced key lessons for anyone building production GenAI infrastructure:

    1. Leverage Torch Compile - but design warm-up carefully. Dynamic workloads need tailored warm-up recipes and caching strategies.

    2. Use LoRA hot-swapping. It’s a game-changer for flexible, adapter-heavy deployments.

    3. Don’t underestimate caching. Persisting compile caches across pods turned warm-up from hours into minutes.

    4. Have multiple optimization levers. Torch Compile got us far; Pruna AI sealed the deal.

    5. Team grit matters. When time is short, collaboration and perseverance are as important as tools.

Final Thoughts

Cutting inference time by 50% in under 2 days wasn’t easy - but with the right mix of tools, persistence, and teamwork, it was possible.

For teams working on GenAI in production, performance tuning is both an art and a science. Sometimes, it’s about knowing which levers to pull - and sometimes, it’s about having the grit to pull them at 4 AM.


Start building with BRIA's platform with a free trial

 