Gimlet Labs and the Coming Multi‑Silicon Era of AI Inference
AI builders are discovering that buying more GPUs is no longer enough. Energy bills, supply constraints and regulatory scrutiny are forcing a new obsession: squeezing every last drop of performance out of the hardware they already own. That is exactly the pain point Gimlet Labs is attacking. The startup is not trying to ship yet another chip; it wants to become the software brain that coordinates all of them. In this piece we’ll look at what Gimlet is actually doing, why it threatens existing power structures in AI infrastructure, and why European players in particular should be paying attention.
The News in Brief
According to TechCrunch, Stanford adjunct professor and repeat founder Zain Asgar has raised an $80 million Series A round for Gimlet Labs, led by Menlo Ventures, bringing total funding to $92 million. The company claims to offer the first "multi-silicon inference cloud": software that can run a single AI workload across heterogeneous hardware – conventional CPUs, GPU accelerators and high‑memory systems – at the same time.
Instead of forcing developers to choose a single chip type or vendor, Gimlet’s orchestration layer slices AI workloads (including agentic, multi‑step pipelines) and assigns each segment to the most suitable hardware. The startup says this can deliver 3–10x faster inference for the same cost and power. Customers are large model labs and data‑center operators rather than typical app developers. As reported by TechCrunch, Gimlet launched publicly in October with eight‑figure revenue already and now counts a major model provider and a large cloud company among its clients. It has partnerships with NVIDIA, AMD, Intel, Arm, Cerebras and d‑Matrix, and employs around 30 people.
Why This Matters
Gimlet is going after the least glamorous but most economically important part of AI: inference. Training makes headlines; inference pays the bills. Every time a chatbot answers, a copilot writes code or an AI agent calls a tool, someone is burning GPU hours and power. If McKinsey’s projection of multi‑trillion‑dollar data‑center investment this decade is even directionally right, then tiny efficiency gains compound into enormous savings.
Today, utilization is shockingly low. Asgar estimates that deployed hardware is active only 15–30% of the time. Anyone who has stared at half‑idle GPU dashboards while jobs still queue will recognize the picture. Different steps of modern AI pipelines stress different resources: the prefill forward pass is compute‑bound, autoregressive token decoding is typically memory‑bandwidth‑bound, and tool calling or retrieval is dominated by network latency. No single chip is ideal for all of that.
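To see why utilization is the lever that matters, a back‑of‑the‑envelope calculation helps. The hourly cost and throughput figures below are illustrative assumptions, not Gimlet's or any vendor's numbers; the point is only the arithmetic: unit cost scales inversely with average utilization.

```python
# Illustrative back-of-the-envelope: how average utilization drives the
# effective cost of inference. All numbers are hypothetical assumptions.

HOURLY_COST = 4.00               # $ per accelerator-hour (assumed)
PEAK_TOKENS_PER_HOUR = 1_000_000 # throughput at 100% utilization (assumed)

def cost_per_million_tokens(utilization: float) -> float:
    """Effective $ per 1M tokens when the device is only partly busy."""
    effective_throughput = PEAK_TOKENS_PER_HOUR * utilization
    return HOURLY_COST / effective_throughput * 1_000_000

low = cost_per_million_tokens(0.20)   # ~20% busy: the status quo Asgar describes
high = cost_per_million_tokens(0.60)  # 60% busy after smarter orchestration

print(f"${low:.2f} vs ${high:.2f} per 1M tokens")  # prints "$20.00 vs $6.67 per 1M tokens"
```

Tripling utilization cuts unit cost by 3x on the same hardware and power budget, which is exactly the low end of the 3–10x gains Gimlet claims.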
Gimlet’s core bet is that the future of AI infrastructure is heterogeneous by default. Rather than waiting for a mythical "do‑everything" accelerator, it assumes data centers will contain a messy mix of old and new GPUs, CPUs with big RAM, and an expanding zoo of specialized accelerators. Whoever provides the abstraction layer that makes that chaos usable will control real strategic leverage.
Winners if Gimlet’s approach works:
- Data‑center operators and clouds: higher utilization, longer life for existing fleets, less pressure to over‑buy the newest GPUs.
- Model providers: cheaper inference and more predictable performance, especially for complex agents.
- Non‑NVIDIA silicon vendors: a neutral orchestration layer lowers the penalty for mixing and matching chips.
Potential losers:
- Proprietary, single‑vendor stacks that rely on lock‑in at the hardware–software boundary.
- GPU rental middlemen whose margin comes from arbitrage rather than true utilization gains.
In other words, this is less a story about one startup and more about whether AI compute becomes software‑defined in the same way networking and storage did.
The Bigger Picture
Gimlet fits squarely into a broader shift: the industry is moving from "more GPUs at any cost" to "smarter orchestration across many kinds of silicon." Over the last few years we’ve seen an explosion of AI‑specific hardware – Google TPUs, Amazon’s in‑house chips, Cerebras wafers, Graphcore, countless NPUs – but the software layer has lagged badly behind. Most production inference still assumes a relatively homogeneous GPU pool.
We’ve been here before. Virtualization turned bare‑metal servers into flexible resource pools. Kubernetes did the same for containers across clusters. The pattern is consistent: once hardware diversity grows and workloads become more complex, a neutral orchestrator emerges and eventually becomes indispensable. Gimlet is explicitly pitching itself as that orchestrator for AI inference.
This also intersects with the rise of agentic workflows. A simple single‑prompt/single‑response model call is easy to pin to one GPU. But an AI "agent" that reasons in multiple steps, calls external tools, queries vector stores and maybe hands off tasks to smaller specialist models — that is a distributed system. Treating it as such, and placing each step on the ideal hardware, is exactly where a multi‑silicon scheduler can shine.
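The placement logic a multi‑silicon scheduler performs can be sketched as matching each step's bottleneck resource to the device class that serves it best. The device names, resource profiles and pipeline steps below are illustrative assumptions, not Gimlet's actual API or catalog:

```python
# Toy multi-silicon placement: assign each pipeline step to the device class
# whose strength matches the step's bottleneck. Device names and scores are
# hypothetical assumptions for illustration only.

# Relative strength of each device class per resource (assumed scores).
DEVICES = {
    "gpu":            {"compute": 3, "mem_bandwidth": 2, "network": 1},
    "cpu_bigmem":     {"compute": 1, "mem_bandwidth": 1, "network": 3},
    "inference_asic": {"compute": 2, "mem_bandwidth": 3, "network": 1},
}

# Each agent step tagged with its dominant bottleneck.
PIPELINE = [
    ("prefill",   "compute"),        # parallel forward pass over the prompt
    ("decode",    "mem_bandwidth"),  # token-by-token, bandwidth-bound
    ("tool_call", "network"),        # waiting on external APIs
    ("retrieval", "network"),        # vector-store lookups
]

def place(pipeline, devices):
    """Assign each step to the device class strongest on its bottleneck."""
    return {
        step: max(devices, key=lambda d: devices[d][bottleneck])
        for step, bottleneck in pipeline
    }

print(place(PIPELINE, DEVICES))
# {'prefill': 'gpu', 'decode': 'inference_asic',
#  'tool_call': 'cpu_bigmem', 'retrieval': 'cpu_bigmem'}
```

A production scheduler would of course weigh cost, queue depth and data‑movement overhead rather than a static score table, but the core idea is the same: the pipeline is a distributed system, and each hop has a natural home.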
Major incumbents are not standing still. NVIDIA pushes Triton Inference Server and its full CUDA‑centric stack. Hyperscalers like AWS and Google offer their own compilers and orchestrators (Neuron, XLA, etc.) tuned for their chips. Open‑source projects like vLLM, Ray, and various inference servers also chip away at the problem from different angles.
What’s different in Gimlet’s thesis is the insistence on cross‑vendor, cross‑architecture orchestration as a product, not just as glue code inside one cloud. If they succeed, they don’t just optimize costs; they weaken the gravitational pull of any single hardware vendor. That is the kind of structural change that can reshape pricing power and innovation incentives across the industry.
The European / Regional Angle
For Europe, this isn’t an abstract cloud‑infrastructure story. It touches three very concrete pressure points: energy prices, regulatory requirements and strategic autonomy.
European data centers operate under stricter environmental and zoning constraints than many US facilities, and power is often more expensive. Several cities have flirted with data‑center moratoria. If AI continues to drive up electricity demand, regulators will increasingly ask not just how green the power mix is, but how efficiently it is used. Software that genuinely improves hardware utilization is one of the few levers operators have that doesn’t involve massive new capex.
Then there’s the EU AI Act and the Digital Services Act. Compliance for foundation‑model providers and large platforms will not be cheap. More logging, monitoring and safety tooling all add inference overhead. Any efficiency reclaimed at the silicon‑orchestration level helps absorb that cost without simply passing it on to customers.
Finally, Europe is still fighting for relevance in the AI compute race. Local cloud providers like OVHcloud, Scaleway, Hetzner or Deutsche Telekom cannot match the hyperscalers in sheer GPU volume. But they can differentiate on smarter, more frugal infrastructure and on data‑sovereign, regionally distributed deployments. A vendor‑neutral multi‑silicon layer makes it easier for them to compose fleets from a mix of GPUs, CPUs and emerging European chips (think projects like SiPearl) without rewriting their entire stack.
For European enterprises, especially in regulated sectors (finance, healthcare, public sector), the prospect of running sophisticated AI agents on hardware they already own — with better utilization and potentially lower latency because workloads stay local — is strategically interesting. The combination of sovereignty, compliance and efficiency is precisely where Europe can compete.
Looking Ahead
Over the next 12–24 months, the key question is whether "multi‑silicon inference" becomes a mainstream architectural pattern or remains a niche for a few bleeding‑edge labs.
Several signposts to watch:
- Benchmarks and transparency. Will independent tests confirm 3–10x gains in real‑world, messy workloads, not just idealized demos? If yes, procurement teams will pay attention fast.
- Ecosystem moves. If hyperscalers or chip vendors start offering similar cross‑silicon schedulers, that’s both validation and direct competition. An acquisition of Gimlet by a cloud or major semiconductor player is a plausible scenario.
- Standardization vs. lock‑in. Does Gimlet lean into open standards and APIs, positioning itself as a neutral layer, or drift toward a proprietary control plane that competes with clouds’ native services? The former makes it a potential "Kubernetes of inference"; the latter risks relegating it to yet another vertical SaaS.
- Edge and on‑prem integration. Many enterprises in Europe and Asia are investing in on‑prem GPU clusters for sovereignty reasons. A truly compelling multi‑silicon story will have to span cloud and edge, not just hyperscale data centers.
Risks are non‑trivial. NVIDIA and the big clouds have powerful incentives to keep orchestration close to their own stacks. If, for instance, licensing or API changes made it harder for third parties to orchestrate across generations of GPUs, startups like Gimlet would face headwinds. On the other hand, customer pressure for flexibility and cost control is growing just as fast.
The Bottom Line
Gimlet Labs is not just another AI startup chasing model hype; it is a bet that the real power in the AI era will sit in the orchestration layer that makes heterogeneous compute usable. If its multi‑silicon inference cloud delivers on promised efficiency gains, it could shift bargaining power away from single‑vendor stacks and toward more open, software‑defined infrastructure — a shift that plays to Europe’s strengths. The open question for readers building AI today: are you designing your architecture as if the hardware stack will stay homogeneous, or are you preparing for a messy, multi‑silicon future?