Jensen Huang has described the rise of a "token economy": a world where enterprises continuously generate and process tokens at massive scale. Embedding models sit at the heart of this, because every document indexed, every query processed, and every retrieval made depends on converting text into vectors. Today we look at exactly that conversion, covering how vector databases and embedding models run on IBM Power, and why the Spyre card turns out to be a natural fit for this workload.
What is a Vector Database?
Every piece of text has a meaning. A vector database makes that meaning mathematical. An embedding model reads a sentence and produces a list of numbers — a vector — that represents its meaning as a point in space. Sentences that mean similar things end up close together. Sentences that are unrelated end up far apart.
This is what makes semantic search possible. Instead of matching keywords, you match meaning. A user asking "how do I reset my password?" will find documents about "account recovery" even if those exact words never appear together.

In the diagram above, each dot is a document mapped to a point in vector space. The red dot is a user query — "login not working". The database doesn't look for those exact words. It finds the nearest neighbours in the space, which happen to be IT support documents about passwords and account recovery. That's semantic search.
In practice, these vectors have hundreds or thousands of dimensions — not two. But the principle is the same: meaning becomes geometry, and search becomes finding what's nearby.
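To make "search becomes finding what's nearby" concrete, here is a minimal sketch of a nearest-neighbour lookup using cosine similarity. The two-dimensional vectors and document titles are hand-made for illustration; in a real system they would come from an embedding model and have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbours(query, documents, k=2):
    """Return the k document titles whose vectors are most similar to the query."""
    ranked = sorted(
        documents,
        key=lambda title: cosine_similarity(query, documents[title]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical 2-D vectors; a real embedding model would produce these.
documents = {
    "Resetting your password": [0.9, 0.1],
    "Account recovery steps": [0.8, 0.3],
    "Quarterly sales report": [0.1, 0.9],
    "Office lunch menu": [0.2, 0.8],
}
query = [0.85, 0.2]  # stands in for the encoded query "login not working"

print(nearest_neighbours(query, documents))
```

The query shares no words with the two IT support titles it retrieves; only the geometry links them. Production vector databases typically replace the full sort with an approximate index, but the ranking idea is the same.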
The embedding model is the engine that produces these vectors. It runs at ingest time to encode your entire document library, and at query time to encode every user question before the search. That is why its performance and efficiency matter so much in production.
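The two call sites can be sketched as a toy in-memory index. Everything here is hypothetical: `TinyVectorIndex` is not a real library, and `toy_embed` is a keyword counter standing in for an actual embedding model. It only illustrates that the model is invoked once per document at ingest and once per question at query time.

```python
import math

class TinyVectorIndex:
    """Toy in-memory index; `embed` is any callable mapping text to a list of floats."""

    def __init__(self, embed):
        self.embed = embed
        self.entries = []  # list of (text, vector) pairs

    def ingest(self, texts):
        # Ingest time: one embedding call per document in the library.
        for text in texts:
            self.entries.append((text, self.embed(text)))

    def search(self, question, k=1):
        # Query time: the question is embedded before any search happens.
        q = self.embed(question)
        by_distance = sorted(self.entries, key=lambda entry: math.dist(q, entry[1]))
        return [text for text, _ in by_distance[:k]]

# Placeholder embedder (an assumption, not a real model): counts a few keywords
# and records how often it was called.
calls = {"count": 0}
KEYWORDS = ["password", "login", "account", "invoice", "payment"]

def toy_embed(text):
    calls["count"] += 1
    words = text.lower().split()
    return [float(words.count(keyword)) for keyword in KEYWORDS]

index = TinyVectorIndex(toy_embed)
index.ingest([
    "reset your password for account login",
    "pay an invoice with a payment card",
])
result = index.search("login to my account")
print(result, "embedding calls:", calls["count"])
```

Two documents plus one question give three embedding calls. With a library of millions of documents and a steady query stream, that per-call cost is what dominates.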
Why embedding models have different hardware requirements than LLMs
LLMs have two distinct phases of work. The first is the prefill phase: the model reads and processes the entire input prompt in one go, which is compute-intensive, much like an embedding model. The second is the decode phase: the model generates tokens one at a time, and this is where things get expensive. Each new token attends over the full history of previous tokens, stored in a structure called the KV cache, which grows with every token and must be read from memory again on every single generation step. At typical serving batch sizes, the GPU spends most of its time fetching the model weights and this cache, not computing. That's what makes LLM inference memory-bound during decode: arithmetic intensity drops to roughly 1–10 FLOPs/byte, and adding more compute to the chip barely helps.
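A rough sizing sketch shows how quickly the KV cache grows. The formula is the usual one (keys plus values, per layer, per head, per token). The model dimensions below are assumptions chosen for illustration and do not describe any specific model discussed in this post.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Approximate KV cache size in bytes.

    The factor 2 accounts for storing both keys and values.
    bytes_per_value=2 assumes 16-bit values (an assumption).
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size

# Hypothetical decoder: 32 layers, 32 KV heads, head dimension 128.
for seq_len in (1, 1024, 4096):
    size = kv_cache_bytes(32, 32, 128, seq_len)
    print(f"{seq_len:>5} tokens -> {size / 2**20:,.1f} MiB")
```

Under these assumptions the cache grows linearly with sequence length and batch size, and all of it has to be read for each generated token.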
Embedding models have no decode phase at all. They are encoder-only — one forward pass over the input, one vector out. No token-by-token loop, no KV cache, no growing memory footprint. When you batch 128, 256, or 512 sequences together, the same weights get reused across every sequence simultaneously through large matrix multiplications. Arithmetic intensity climbs to 100+ FLOPs/byte. The chip is fully busy computing, not waiting on memory. That shifts the relevant hardware metric entirely — from GB/s of memory bandwidth to FLOPS per watt of sustained compute.
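The effect of batching on weight reuse can be estimated for a single linear layer. This is a simplified sketch: it assumes each operand is read from memory once and the output written once, uses 16-bit values, and picks a 768-wide layer purely for illustration. Attention and activation functions are ignored.

```python
def matmul_arithmetic_intensity(tokens, d_in, d_out, bytes_per_value=2):
    """FLOPs per byte for multiplying a (tokens x d_in) input by a (d_in x d_out) weight."""
    flops = 2 * tokens * d_in * d_out            # one multiply and one add per term
    bytes_moved = bytes_per_value * (
        d_in * d_out                             # weights, read once
        + tokens * d_in                          # input activations
        + tokens * d_out                         # output activations
    )
    return flops / bytes_moved

# One token at a time, as in a decode step at batch size 1.
decode_like = matmul_arithmetic_intensity(tokens=1, d_in=768, d_out=768)

# A batch of 128 sequences of 512 tokens each, processed in one pass.
batched = matmul_arithmetic_intensity(tokens=128 * 512, d_in=768, d_out=768)

print(f"single token: {decode_like:.1f} FLOPs/byte")
print(f"batched:      {batched:.1f} FLOPs/byte")
```

With one token the weights are loaded and used once, so intensity stays near 1 FLOP/byte. With tens of thousands of tokens in flight the same weights are reused for every one of them, and intensity rises into the hundreds.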

The metric behind this difference is arithmetic intensity: the ratio of compute to memory traffic, that is, how many floating-point operations are performed for every byte fetched from memory. It determines which hardware fits which AI workload.
LLMs have low arithmetic intensity (~1–10 FLOPs/byte): the accelerator spends most of its time waiting for weights and the KV cache to arrive from memory, not computing. Embedding models flip this: processing a full batch in one pass reuses weights across every sequence simultaneously, pushing arithmetic intensity above 100 FLOPs/byte. The accelerator stays fully busy doing math. Same formula — very different hardware requirements.
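One common way to turn arithmetic intensity into a hardware statement is the roofline model, used here as an illustration. The accelerator numbers are hypothetical round figures and are not the specifications of Spyre or of any NVIDIA card.

```python
def attainable_flops(intensity, peak_flops, mem_bandwidth):
    """Roofline model: throughput is capped by compute or by memory, whichever is lower."""
    return min(peak_flops, intensity * mem_bandwidth)

def is_compute_bound(intensity, peak_flops, mem_bandwidth):
    """A workload is compute-bound once its intensity reaches the ridge point."""
    ridge_point = peak_flops / mem_bandwidth     # FLOPs per byte
    return intensity >= ridge_point

# Hypothetical accelerator: 100 TFLOP/s peak compute, 1 TB/s memory bandwidth.
PEAK = 100e12
BANDWIDTH = 1e12

for name, intensity in [("LLM decode", 5), ("embedding batch", 200)]:
    achieved = attainable_flops(intensity, PEAK, BANDWIDTH)
    kind = "compute-bound" if is_compute_bound(intensity, PEAK, BANDWIDTH) else "memory-bound"
    print(f"{name}: {achieved / PEAK:.0%} of peak, {kind}")
```

On this made-up chip the low-intensity workload can use only a small fraction of peak compute no matter how the kernels are tuned, while the high-intensity one reaches the compute ceiling. For the second kind of workload, extra memory bandwidth buys nothing and compute efficiency is what counts.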
This is exactly where the IBM Spyre cards have a structural advantage. Because embedding inference is compute-bound, what matters is the ratio of compute to memory bandwidth — and Spyre's architecture favors that quotient. The benchmark numbers reflect it.
The following results compare a single Spyre card against single NVIDIA cards, running the Granite-Embedding-125M model across varying sequence lengths. Raw throughput is one part of the picture; for production workloads that run continuously, throughput per watt tells the more important story.

In raw throughput, Spyre lands between the L40S and H100 — competitive with chips that draw significantly more power. The H100 leads on absolute sequences per second, but that lead evaporates when you factor in wattage: Spyre delivers 8.8× more useful work per watt. For a continuously running embedding pipeline, it's the efficiency number that drives your electricity bill, not the peak.
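The arithmetic behind a per-watt comparison is simple enough to sketch. The throughput and power figures below are placeholders invented for the example; they are not the measured benchmark values and do not correspond to any real card.

```python
def sequences_per_joule(throughput_seq_per_s, power_watts):
    """Throughput per watt; sequences/second per watt equals sequences per joule."""
    return throughput_seq_per_s / power_watts

def efficiency_ratio(card_a, card_b):
    """How many times more work per watt card_a delivers than card_b."""
    return sequences_per_joule(*card_a) / sequences_per_joule(*card_b)

# Placeholder (throughput in sequences/s, power in watts) pairs, NOT benchmark data.
card_a = (1000.0, 100.0)   # slower in absolute terms, low power draw
card_b = (2000.0, 400.0)   # twice the throughput, four times the power

print(f"card_a: {sequences_per_joule(*card_a):.1f} seq/J")
print(f"card_b: {sequences_per_joule(*card_b):.1f} seq/J")
print(f"card_a delivers {efficiency_ratio(card_a, card_b):.1f}x the work per watt")
```

In this made-up case the card with half the raw throughput still does twice the work per unit of energy, which is the kind of reversal the per-watt comparison is meant to expose.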
In this post we looked at how embedding models work, why their compute profile differs fundamentally from LLMs, and what that means for hardware selection. The benchmarks show that Spyre performs competitively on raw throughput while delivering a significant efficiency advantage — making it a workload-appropriate choice worth considering for embedding inference.