At BRIA, we build foundation models for visual generation. Performance is everything - our customers rely on our APIs in production, and milliseconds can determine whether an experience feels smooth or broken.
One day, we faced an urgent escalation:
“The API is too slow. We can’t go live like this.”
For a customer about to launch, inference latency was a showstopper. We had to move fast - really fast. The optimization was declared an all-hands emergency, with parallel efforts across multiple teams.
My mission: make inference as fast as possible. What followed was one of the most intense 48-hour optimization pushes we've ever experienced.
Stage 1: Torch Compile
Our first lever was Torch Compile - PyTorch’s built-in compiler for speeding up model inference.
Normally, this is straightforward. Torch’s JIT compilation can produce big wins with minimal code changes. But this wasn’t a normal case.
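For context, the straightforward path in a diffusers-style pipeline is roughly a one-liner on the heaviest module. A minimal sketch, with a placeholder model ID rather than our actual checkpoint:

```python
import torch
from diffusers import DiffusionPipeline

# Placeholder model ID - our actual checkpoint isn't named in this post.
pipe = DiffusionPipeline.from_pretrained(
    "your-org/your-model", torch_dtype=torch.bfloat16
).to("cuda")

# Compile the denoiser, which dominates per-step cost. The first call triggers
# JIT compilation; later calls with the same shapes reuse the compiled graph.
# (Depending on the architecture, this attribute is `transformer` or `unet`.)
pipe.transformer = torch.compile(
    pipe.transformer, mode="max-autotune", fullgraph=True
)
```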
We ran on AWS G6e instances (NVIDIA L40S Tensor Core GPUs), which had their own quirks.
Just a month earlier, using Torch Compile with LoRA meant every new LoRA load triggered a full recompilation - sometimes making inference slower than running without it.
This time, I tried diffusers’ new LoRA hot-swapping feature. For the first time, it worked seamlessly. No more forced recompilation every time we loaded an adapter.
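The flow looks roughly like this - a sketch based on diffusers’ hot-swapping API, with made-up adapter paths and a hypothetical rank; exact method names and arguments depend on your diffusers version:

```python
# Continuing from the pipeline sketch above (`pipe` and `torch` already defined).

# If adapters have different ranks, allow for the largest one up front
# (hypothetical rank value).
pipe.enable_lora_hotswap(target_rank=128)

# Load the first adapter normally, then compile once.
pipe.load_lora_weights("lora_a.safetensors", adapter_name="default_0")
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)

# Subsequent loads swap the weights in place under the same adapter name,
# so the compiled graph stays valid - no recompilation.
pipe.load_lora_weights("lora_b.safetensors", adapter_name="default_0", hotswap=True)
```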
Still, Torch Compile wasn't plug-and-play: every new pod started cold and had to recompile from scratch during warm-up, which took hours per pod. Clearly, this wouldn’t scale.
Torch Compile does cache compiled graphs. The trick was making the cache persistent across pods. Options included Redis or a shared filesystem; we went with the latter, since caches could grow to multiple gigabytes.
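Concretely, that amounts to pointing PyTorch Inductor’s on-disk cache at the shared mount before anything compiles. A sketch, with an illustrative mount path:

```python
import os

# Point Inductor's on-disk cache at a volume mounted by every pod, and make sure
# the FX graph cache is on, so a graph compiled once is reused cluster-wide.
# (Illustrative mount path.)
os.environ["TORCHINDUCTOR_CACHE_DIR"] = "/mnt/shared/torchinductor-cache"
os.environ["TORCHINDUCTOR_FX_GRAPH_CACHE"] = "1"

import torch  # import after setting the variables so Inductor picks them up
```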
At 4 AM, after countless iterations and 30% faster inference, deployment succeeded. I sent a quick note to our customer-facing team, then finally collapsed into bed.
A 30% improvement was huge, but not enough. Some edge cases - two requests hitting the same GPU - still doubled inference time, breaching SLA.
Stage 2: Pruna AI
Time for our second optimization lever.
Pruna is a GenAI inference optimization engine with flexible configuration and great support. Integrating it was surprisingly smooth, and in just hours, we achieved another 20–30% performance boost.
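The integration essentially boils down to wrapping the pipeline with Pruna’s smash call and picking optimization algorithms in a SmashConfig. A rough sketch - the specific algorithm names here are illustrative placeholders, not the configuration we actually shipped:

```python
from pruna import SmashConfig, smash

# `pipe` as in the earlier sketches. The config keys and algorithm names below
# are illustrative; consult Pruna's docs for what fits your model.
smash_config = SmashConfig()
smash_config["compiler"] = "torch_compile"
smash_config["cacher"] = "deepcache"

# smash() returns an optimized version of the pipeline as a drop-in replacement.
pipe = smash(model=pipe, smash_config=smash_config)
```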
29 hours later, with only 5 hours of sleep and deep into the weekend, the optimization was complete.
🚀 Results:
Inference latency cut by over 50%
Customer unblocked and delighted
Team exhausted but proud
This sprint reinforced key lessons for anyone building production GenAI infrastructure:
Leverage Torch Compile - but design warm-up carefully. Dynamic workloads need tailored warm-up recipes and caching strategies.
Use LoRA hot-swapping. It’s a game-changer for flexible, adapter-heavy deployments.
Don’t underestimate caching. Persisting compile caches across pods turned warm-up from hours into minutes.
Have multiple optimization levers. Torch Compile got us far; Pruna AI sealed the deal.
Team grit matters. When time is short, collaboration and perseverance are as important as tools.
Cutting inference time by 50% in under 2 days wasn’t easy - but with the right mix of tools, persistence, and teamwork, it was possible.
For teams working on GenAI in production, performance tuning is both an art and a science. Sometimes, it’s about knowing which levers to pull - and sometimes, it’s about having the grit to pull them at 4 AM.