Pipeshift cuts GPU usage for AI inference by 75% with modular inference engine
DeepSeek’s release of R1 this week was a watershed moment in the field of AI. Few expected a Chinese startup to be the first to drop a reasoning model that matches OpenAI’s o1 and, at the same time, open-source it (in line with OpenAI’s original mission).
Enterprises can easily download R1’s weights via Hugging Face, but access has never been the problem: over 80% of teams are already using or planning to use open models. Deployment is the real hurdle. If you go with hyperscaler services like Vertex AI, you’re locked into a specific cloud. If you go solo and build in-house, you face resource constraints, since you have to set up a dozen different components just to get started, let alone optimize or scale downstream.
To address this challenge, Y Combinator- and SenseAI-backed Pipeshift is launching an end-to-end platform that allows enterprises to train, deploy and scale open-source generative AI models — LLMs, vision models, audio models and image models — across any cloud or on-prem GPUs. The company is competing in a rapidly growing space that includes Baseten, Domino Data Lab, Together AI and Simplismart.
The key value proposition? Pipeshift uses a modular inference engine that can quickly be optimized for speed and efficiency, helping teams not only deploy 30 times faster but achieve more with the same infrastructure, leading to as much as 60% cost savings.
Imagine running the inference workload of four GPUs on just one.
The orchestration bottleneck
When you have to run different models, stitching together a functional MLOps stack in-house — from accessing compute, training and fine-tuning to production-grade deployment and monitoring — becomes the problem. You have to set up 10 different inference components and instances to get things up and running and then put in thousands of engineering hours for even the smallest of optimizations.
“There are multiple components of an inference engine,” Arko Chattopadhyay, cofounder and CEO of Pipeshift, told VentureBeat. “Every combination of these components creates a distinct engine with varying performance for the same workload. Identifying the optimal combination to maximize ROI requires weeks of repetitive experimentation and fine-tuning of settings. In most cases, the in-house teams can take years to develop pipelines that can allow for the flexibility and modularization of infrastructure, pushing enterprises behind in the market alongside accumulating massive tech debts.”
While there are startups that offer platforms to deploy open models across cloud or on-premise environments, Chattopadhyay says most of them are GPU brokers, offering one-size-fits-all inference solutions. As a result, they maintain separate GPU instances for different LLMs, which doesn’t help when teams want to save costs and optimize for performance.
To fix this, Chattopadhyay started Pipeshift and developed a framework called modular architecture for GPU-based inference clusters (MAGIC), aimed at distributing the inference stack into different plug-and-play pieces. The work created a Lego-like system that allows teams to configure the right inference stack for their workloads, without the hassle of infrastructure engineering.
This way, a team can quickly add or interchange different inference components to piece together a customized inference engine that can extract more out of existing infrastructure to meet expectations for costs, throughput or even scalability.
For instance, a team could set up a unified inference system, where multiple domain-specific LLMs could run with hot-swapping on a single GPU, utilizing it to full benefit.
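Pipeshift has not published MAGIC’s internals, but the plug-and-play idea can be illustrated with a minimal sketch. Every name below (InferenceStack, ContinuousBatching and so on) is hypothetical, not Pipeshift’s actual API; the point is simply that each layer of the engine sits behind a small interface and can be swapped without rebuilding the rest.

```python
# Minimal sketch of a "plug-and-play" inference stack (hypothetical names,
# not Pipeshift's API). Each layer is a swappable component behind a small
# interface, so a team can recombine parts instead of rebuilding the engine.
from dataclasses import dataclass
from typing import Protocol


class Scheduler(Protocol):
    def batch(self, requests: list[str]) -> list[list[str]]: ...


class ContinuousBatching:
    """Groups incoming requests into micro-batches for the GPU."""
    def __init__(self, max_batch: int = 8):
        self.max_batch = max_batch

    def batch(self, requests: list[str]) -> list[list[str]]:
        return [requests[i:i + self.max_batch]
                for i in range(0, len(requests), self.max_batch)]


@dataclass
class InferenceStack:
    """One combination of components = one distinct engine."""
    runtime: str          # e.g. a serving runtime or compiled kernel set
    kv_cache: str         # e.g. paged vs. contiguous KV cache
    scheduler: Scheduler  # batching policy
    adapters: list[str]   # fine-tunes hot-swapped onto one base model

    def serve(self, requests: list[str]) -> None:
        for micro_batch in self.scheduler.batch(requests):
            # A real engine would dispatch to the GPU here; this sketch
            # only shows how the pieces compose.
            print(f"[{self.runtime}/{self.kv_cache}] serving {micro_batch}")


if __name__ == "__main__":
    stack = InferenceStack(
        runtime="fused-kernel-runtime",
        kv_cache="paged",
        scheduler=ContinuousBatching(max_batch=4),
        adapters=["support-bot-lora", "doc-extraction-lora"],
    )
    stack.serve(["classify this ticket", "extract invoice fields"])
```

Swap the scheduler, cache strategy or runtime and you get a different engine with a different performance profile for the same workload, which is the combinatorial search Chattopadhyay says takes in-house teams weeks of experimentation.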
Running four GPU workloads on one
Since claiming to offer a modular inference solution is one thing and delivering on it is entirely another, Pipeshift’s founder was quick to point out the benefits of the company’s offering.
“In terms of operational expenses…MAGIC allows you to run LLMs like Llama 3.1 8B at >500 tokens/sec on a given set of Nvidia GPUs without any model quantization or compression,” he said. “This unlocks a massive reduction of scaling costs as the GPUs can now handle workloads that are an order of magnitude 20-30 times what they originally were able to achieve using the native platforms offered by the cloud providers.”
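Token-throughput figures like this are straightforward to sanity-check on your own hardware. The rough harness below uses Hugging Face Transformers rather than Pipeshift’s engine (so it won’t approach the quoted numbers) and a model ID you would substitute for whatever checkpoint you have access to; it simply times generation and divides new tokens by wall-clock seconds.

```python
# Rough tokens-per-second check with Hugging Face Transformers (illustrative;
# this is not Pipeshift's engine and won't reach its quoted numbers).
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # substitute any causal LM you can access
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Explain KV caching in one paragraph."
inputs = tok(prompt, return_tensors="pt").to(model.device)

start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
elapsed = time.perf_counter() - start

new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/sec on this setup")
```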
The CEO noted that the company is already working with 30 companies on an annual license-based model.
One of these is a Fortune 500 retailer that initially used four independent GPU instances to run four fine-tuned open models for its automated support and document-processing workflows. Each of these GPU instances scaled independently, adding massive cost overhead.
“Large-scale fine-tuning was not possible as datasets became larger and all the pipelines were supporting single-GPU workloads while requiring you to upload all the data at once. Plus, there was no auto-scaling support with tools like AWS Sagemaker, which made it hard to ensure optimal use of infra, pushing the company to pre-approve quotas and reserve capacity beforehand for theoretical scale that only hit 5% of the time,” Chattopadhyay noted.
Interestingly, after shifting to Pipeshift’s modular architecture, all the fine-tunes were consolidated onto a single GPU instance that served them in parallel, without any memory partitioning or model degradation. That cut the GPU requirement for these workloads from four to one.
“Without additional optimizations, we were able to scale the capabilities of the GPU to a point where it was serving five-times-faster tokens for inference and could handle a four-times-higher scale,” the CEO added. In all, he said that the company saw a 30-times faster deployment timeline and a 60% reduction in infrastructure costs.
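Pipeshift hasn’t disclosed how it packs several fine-tunes onto one card, but when the fine-tunes are adapters of a shared base model, serving them side by side on a single GPU is an established pattern. As an illustration only, here is how it looks with the open-source vLLM engine’s multi-LoRA support; the base model, adapter names and paths are placeholders, and the API reflects recent vLLM releases, not Pipeshift’s stack.

```python
# Illustrative multi-adapter serving on one GPU with open-source vLLM
# (a stand-in for the pattern described, not Pipeshift's engine).
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# One base model loaded once; adapters are attached per request.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder base model
    enable_lora=True,
    max_loras=4,  # up to four fine-tunes resident at a time
)
params = SamplingParams(max_tokens=128)

# Hypothetical adapters for support and document workflows; paths are placeholders.
support = LoRARequest("support-bot", 1, "/adapters/support-lora")
invoices = LoRARequest("doc-extraction", 2, "/adapters/invoice-lora")

# Requests for different fine-tunes are served from the same GPU.
print(llm.generate(["Summarize this support ticket: ..."], params,
                   lora_request=support)[0].outputs[0].text)
print(llm.generate(["Extract the invoice total: ..."], params,
                   lora_request=invoices)[0].outputs[0].text)
```

Because the adapters share one set of base-model weights, the card stays busy across all the workloads instead of sitting mostly idle in four separate instances.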
With modular architecture, Pipeshift wants to position itself as the go-to platform for deploying all cutting-edge open-source AI models, including DeepSeek R1.
However, it won’t be an easy ride as competitors continue to evolve their offerings.
For instance, Simplismart, which raised $7 million a few months ago, is taking a similar software-optimized approach to inference. Cloud service providers like Google Cloud and Microsoft Azure are also bolstering their respective offerings, although Chattopadhyay thinks these CSPs will be more like partners than competitors in the long run.
“We are a platform for tooling and orchestration of AI workloads, like Databricks has been for data intelligence,” he explained. “In most scenarios, most cloud service providers will turn into growth-stage GTM partners for the kind of value their customers will be able to derive from Pipeshift on their AWS/GCP/Azure clouds.”
In the coming months, Pipeshift will also introduce tools to help teams build and scale their datasets, alongside model evaluation and testing. This should significantly speed up the experimentation and data-preparation cycle, enabling customers to leverage orchestration more efficiently.