Working with Llama.cpp Embeddings
May 01, 2026
Overview
Working with machine learning embeddings in llama.cpp is straightforward once you understand that it exposes three different surfaces with slightly different behavior:
- the standalone `llama-embedding` CLI,
- the non-OpenAI `/embeddings` server route,
- the OpenAI-compatible `/v1/embeddings` server route.
That last one is where many people get justifiably tripped up. The endpoint borrows OpenAI-shaped conventions like `/v1/embeddings`, `model`, and `Authorization: Bearer ...`, but the naming and documentation do a poor job explaining what those fields actually mean in a local llama.cpp server. The bearer token is not required for embeddings, and it is not an OpenAI API key, even though the codebase is littered with references to "OpenAI API key."

For OpenAI compatibility alone, the API-key mechanism could be omitted from llama.cpp entirely: the server could accept `/v1/embeddings` requests with no Authorization header, or ignore a dummy bearer token supplied by OpenAI-compatible SDKs. The only real reason to define an API-key mechanism is separate from OpenAI compatibility: it lets server operators optionally gate access to the local endpoint, for example to prevent unauthorized use, meter requests, or put a simple access-control boundary in front of an exposed inference server.

That is a valid reason to support a token, but the interface blurs together three separate ideas: OpenAI client compatibility, optional local server authentication, and actual embedding/runtime configuration, with poor naming making the confusion worse. A better design would call it a server auth token or local access token, clearly state that it can be omitted when auth is disabled, explain that dummy values are only for SDK compatibility, and separately document that embedding behavior is still determined by startup configuration such as `--embeddings`, `--pooling`, batch sizes, context size, and the model loaded at startup.
A sharper observation is that authentication shouldn't be handled by llama.cpp at all: it should be handled at the edge of the system.
Note: I'm working on the v2 Llama embeddings endpoint https://github.com/ggml-org/llama.cpp/discussions/16957.
What Llama.cpp Embeddings Actually Gives You
At a high level:
- `llama-embedding` is the most direct local workflow.
- `llama-server /embeddings` is the more revealing HTTP surface if you need token-level embeddings or `--pooling none`.
- `llama-server /v1/embeddings` is the easiest path if you already have OpenAI-compatible tooling, but it is also the easiest place to form the wrong mental model.
The official server docs explicitly distinguish /embeddings from /v1/embeddings:
- `/embeddings` supports all pooling modes, including `--pooling none`.
- `/v1/embeddings` requires a pooling mode other than `none`.
- pooled outputs are Euclidean-normalized; with `pooling none`, `/embeddings` can return per-token embeddings instead. ([GitHub][1])
That distinction matters because, in machine learning, “embedding” can mean either:
- one vector per text (sentence/document retrieval), or
- one vector per token (more advanced analysis, alignment, custom pooling, token-level similarity).
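The link between the two meanings is simple: a pooled sentence vector is just a reduction over per-token vectors. A minimal sketch of mean pooling with invented numbers (this is an illustration of the idea, not llama.cpp's internal implementation):

```python
# Toy illustration: mean pooling collapses per-token vectors into one
# sentence vector. The numbers are made up; real token embeddings
# come from the model.
token_vectors = [
    [1.0, 0.0, 2.0],  # token 1
    [3.0, 2.0, 0.0],  # token 2
    [2.0, 1.0, 1.0],  # token 3
]

dim = len(token_vectors[0])
sentence_vector = [
    sum(vec[j] for vec in token_vectors) / len(token_vectors)
    for j in range(dim)
]
print(sentence_vector)  # [2.0, 1.0, 1.0]
```

With `--pooling none` you keep the per-token list; any pooled mode returns something shaped like `sentence_vector`.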
The Interface Trap: OpenAI Compatibility vs. Actual Control Surface
This is the nuance that deserves special emphasis.
The official docs show /v1/embeddings requests like:
```
Authorization: Bearer no-key
"model": "GPT-4"
```
Those fields are there because the route is OpenAI-compatible, not because llama.cpp is actually exposing the same abstraction boundary as OpenAI’s hosted embeddings service. On the llama.cpp side, whether embeddings are enabled at all, what pooling mode is used, how batching works, and what model is active are primarily controlled by server configuration. In router mode, requests are routed by the requested model name, but request-level overrides remain intentionally limited compared with what the OpenAI-shaped surface suggests. ([GitHub][1])
So the right mental model is:
- Use `/v1/embeddings` for client compatibility.
- Use `/embeddings` (and/or the CLI) when you want clearer control over what the server is actually doing.
That is not obvious from the interface alone, and it is a documentation weakness more than a user mistake.
Choosing the Right Embedding Model
For retrieval, search, clustering, semantic deduplication, or reranking pipelines, you usually want a dedicated embedding model, not a general chat/instruct checkpoint. The official embedding tutorial uses an embedding-specific model (Snowflake/snowflake-arctic-embed-s) and also notes support for embedding models such as BERT-family checkpoints. ([GitHub][2])
A practical way to think about model choice:
| Model Type | Usually Best For | Strength | Trade-off |
|---|---|---|---|
| Embedding-specific model | Search, retrieval, clustering, semantic matching | Trained to produce useful vector geometry | You need a separate generation model if you also want chat/completions |
| Chat / instruct model | Text generation | General-purpose language behavior | Embedding quality for retrieval is often not what you actually want |
| Reranker model | Final ranking after retrieval | Better precision on top-k candidates | Not a drop-in replacement for corpus-wide embedding generation |
A good rule of thumb:
- use an embedding model to embed your corpus and queries,
- optionally use a reranker after vector search,
- use a chat model only for downstream answer generation.
Quantized vs. Unquantized Models
The official embedding tutorial walks through both an F16 GGUF and a Q8_0 quantized version of the same embedding model. In that example, the quantized model is materially smaller on disk than the F16 version, which is exactly why quantization is so attractive for local retrieval systems. ([GitHub][2])
For embeddings, the practical trade-off is:
| Format | Pros | Cons | Usually a Good Fit For |
|---|---|---|---|
| F16 / BF16 | Highest fidelity; simplest baseline | Largest RAM/VRAM footprint | Benchmarking quality, high-end GPUs, smaller corpora |
| Q8 / Q6 | Often strong quality / memory trade-off | Still larger than aggressive low-bit quants | Most production local retrieval setups |
| Q4 and below | Much smaller; easier to fit on CPU boxes or weak GPUs | Higher risk of degrading retrieval quality | Memory-constrained experiments; validate carefully before production |
The important nuance: for embeddings, quality degradation may show up less as “the output looks weird” and more as recall/precision drift in search. That means you should validate quantization against a retrieval metric such as:
- Recall@k
- MRR
- NDCG
- or simply “did the right document stay in the top 5?”
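These metrics are cheap to compute yourself. A minimal sketch of Recall@k and MRR over a tiny hand-labeled set (the document ids and rankings here are invented; the ranked lists would come from your retrieval stack):

```python
# Each query has one relevant doc id and a ranked list of retrieved ids.
labeled = [
    {"relevant": "d1", "ranked": ["d1", "d7", "d3"]},
    {"relevant": "d4", "ranked": ["d9", "d4", "d2"]},
    {"relevant": "d5", "ranked": ["d8", "d2", "d6"]},  # complete miss
]

def recall_at_k(examples, k):
    """Fraction of queries whose relevant doc appears in the top k."""
    hits = sum(ex["relevant"] in ex["ranked"][:k] for ex in examples)
    return hits / len(examples)

def mrr(examples):
    """Mean reciprocal rank of the relevant doc (0 if not retrieved)."""
    total = 0.0
    for ex in examples:
        if ex["relevant"] in ex["ranked"]:
            total += 1.0 / (ex["ranked"].index(ex["relevant"]) + 1)
    return total / len(examples)

print(recall_at_k(labeled, 1))  # 1/3: only the first query hits at rank 1
print(recall_at_k(labeled, 2))  # 2/3: the second query hits at rank 2
print(mrr(labeled))             # (1 + 0.5 + 0) / 3 = 0.5
```

Running the same evaluation before and after quantization tells you directly whether the smaller model is costing you retrieval quality.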
Running Embeddings from the CLI
The official tutorial uses llama-embedding directly:
```sh
# F16
./llama-embedding -m model-f16.gguf -e -p "Hello world" --verbose-prompt -ngl 99

# Q8_0
./llama-embedding -m model-q8_0.gguf -e -p "Hello world" --verbose-prompt -ngl 99
```

And for a quick benchmark:

```sh
./llama-bench -m model-f16.gguf -r 10 -p 8,16,32,64,128,256,512 -n 0 -embd 1
```

Those commands are straight from the upstream embedding tutorial. ([GitHub][2])
What the main CLI flags mean
- `-m` → path to the GGUF model
- `-e` → run in embedding mode
- `-p` → the input text/prompt to embed
- `--verbose-prompt` → useful when you want to inspect tokenization behavior
- `-ngl 99` → offload as many layers as possible to GPU; use `-ngl 0` for CPU-only runs. The tutorial explicitly calls out `-ngl 99` as full offload and `-ngl 0` as CPU-only. ([GitHub][2])
Practical hardware guidance
- CPU-only: start with `-ngl 0`; focus on thread count and smaller quantized models
- Single GPU: `-ngl 99` (or equivalently "all available layers") is usually what you want
- Weak GPU / laptop GPU: Q8 or Q6 often gives a friendlier memory/throughput trade-off than F16
- Benchmark first: embedding throughput depends heavily on input length, not just model size
Running Embeddings from llama-server
The server README includes a dedicated embedding-server-style example:
```sh
llama-server -m model.gguf --embeddings --pooling cls -ub 8192
```

That is the official "serve an embedding model" example, and it already hints at the two knobs that matter most for embedding throughput on the server:

- `--pooling`
- `-ub` / `--ubatch-size` ([GitHub][3])
/v1/embeddings (OpenAI-compatible)
```sh
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{
    "input": "hello",
    "model": "GPT-4",
    "encoding_format": "float"
  }'
```

/embeddings (llama.cpp-native)
Use a similar request payload, but call /embeddings when you want the non-OpenAI route—especially if you need --pooling none, want token-level output, or want the llama.cpp-native response shape. The docs explicitly note that /embeddings supports all pooling modes, while /v1/embeddings requires pooling other than none and returns an OpenAI-compatible response shape. ([GitHub][1])
When to choose which route
| Route | Use It When | Strength | Limitation |
|---|---|---|---|
| /v1/embeddings | You want OpenAI SDK compatibility | Easy drop-in for existing clients | Poorer mental model of what llama.cpp is actually controlling |
| /embeddings | You want the llama.cpp-native embedding surface | Supports all pooling modes, including none | Not OpenAI-compatible |
Calling Embeddings from Python
Option 1: OpenAI client against /v1/embeddings
This is the easiest way if your app already uses the OpenAI Python client.
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="sk-no-key-required",  # placeholder if your server does not enforce auth
)

resp = client.embeddings.create(
    model="your-model-name-or-placeholder",
    input=["hello world", "semantic search is fun"],
    encoding_format="float",
)

for item in resp.data:
    print(len(item.embedding), item.index)
```

The official server docs explicitly show use of the OpenAI Python client with llama.cpp's compatible endpoints, and they also show the generic bearer token example. ([GitHub][1])
Option 2: Raw HTTP to /embeddings
This is a better fit when you want llama.cpp-native behavior.
```python
import requests

payload = {
    "input": ["hello world", "semantic search is fun"],
    "encoding_format": "float",
}

r = requests.post(
    "http://localhost:8080/embeddings",
    headers={"Content-Type": "application/json"},
    json=payload,
    timeout=60,
)
r.raise_for_status()
data = r.json()
print(data)
```

Option 3: Corpus + query example
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")

docs = [
    "Raft is a consensus algorithm used in distributed systems.",
    "Topological sort orders DAG nodes by dependency.",
    "Monotonic stacks solve nearest greater/smaller problems efficiently.",
]
query = "Which algorithm is used for consensus in distributed databases?"

doc_embeddings = client.embeddings.create(model="placeholder", input=docs, encoding_format="float").data
query_embedding = client.embeddings.create(model="placeholder", input=query, encoding_format="float").data[0].embedding

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# if vectors are normalized, dot product == cosine similarity
scored = []
for i, item in enumerate(doc_embeddings):
    score = dot(query_embedding, item.embedding)
    scored.append((score, docs[i]))

scored.sort(reverse=True)
for score, doc in scored:
    print(round(score, 4), doc)
```

If you are using pooled vectors from `/v1/embeddings` or pooled `/embeddings`, the docs say those embeddings are Euclidean-normalized, which is why dot product is a perfectly reasonable cosine shortcut there. ([GitHub][1])
Important Runtime Parameters
The server README surfaces the most important embedding-related knobs:
- `--embeddings` → enable embedding mode/endpoints
- `--pooling {none,mean,cls,last,rank}` → pooling type
- `-c`, `--ctx-size` → prompt context size
- `-b`, `--batch-size` → logical max batch size
- `-ub`, `--ubatch-size` → physical max batch size
- `-t`, `--threads` and `-tb`, `--threads-batch` → CPU thread counts
- `-ngl`, `--gpu-layers` → how many layers go to VRAM ([GitHub][1])
What these actually mean in practice
| Parameter | What It Controls | Usually Good For | Trade-off |
|---|---|---|---|
| `--pooling` | How token embeddings become one vector | `mean` or model default for general retrieval; `cls` for models trained that way; `none` for token-level work | Wrong pooling can quietly degrade retrieval quality |
| `-c` / `--ctx-size` | Maximum token context per input | Long documents and chunk-heavy pipelines | Bigger context raises memory use |
| `-b` / `-ub` | Logical vs. physical batching | Server throughput tuning for many inputs | Larger values can improve throughput but spike RAM/VRAM and can trigger OOM |
| `-t` / `-tb` | CPU parallelism | CPU-only boxes or hybrid prompt processing | Too many threads can hurt due to contention |
| `-ngl` | How much of the model is offloaded to GPU | GPU-equipped machines; usually as high as fits | Higher offload needs more VRAM |
Hardware-oriented starting points
- CPU-only server: start with quantized models, moderate `-ub`, and tune `-t`/`-tb`
- Consumer GPU: start with Q8/Q6 and high `-ngl`; only push `-ub` upward after watching VRAM
- Large GPU / workstation: F16 or Q8 with larger `-ub` can be worthwhile if your workload is many long chunks
- Laptop / constrained RAM: reduce context size and use smaller quantized models first
Search / Retrieval Workflow: Query + Corpus
This is another easy point to miss: for search, you do not just embed the user query. You embed:
- the corpus (offline or in batches),
- the incoming query (online),
- then compare vectors.
That means memory planning matters.
Memory footprint formula
If you store embeddings as float32:
```
memory ≈ number_of_vectors × dimension × 4 bytes
```

Examples:

- 100,000 docs × 768 dims × 4 bytes ≈ 307 MB
- 1,000,000 docs × 768 dims × 4 bytes ≈ 2.9 GB
If you store them as float16 after generation:
```
memory ≈ number_of_vectors × dimension × 2 bytes
```

That roughly halves the footprint, though you should validate whether reduced precision affects recall in your retrieval stack.
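Both formulas are easy to wire into a quick capacity check before you build the index (plain Python; the corpus sizes are the examples from above):

```python
def embedding_memory_bytes(n_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Approximate storage for a dense matrix of stored embeddings."""
    return n_vectors * dim * bytes_per_value

# float32 (4 bytes per value)
print(embedding_memory_bytes(100_000, 768) / 1e6)    # 307.2 (MB)
print(embedding_memory_bytes(1_000_000, 768) / 1e9)  # 3.072 (GB, ~2.9 GiB)

# float16 (2 bytes per value) halves the footprint
print(embedding_memory_bytes(1_000_000, 768, bytes_per_value=2) / 1e9)  # 1.536 (GB)
```

Remember this budget is in addition to the runtime memory of the model itself.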
Practical advice
- Precompute corpus embeddings once
- Use the same embedding model + pooling strategy for query and corpus
- Chunk long documents consistently
- Normalize consistently (or rely on the route’s normalization rules when applicable)
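On the last point: if your route or storage layer does not normalize for you, a minimal L2-normalization sketch looks like this (plain Python for clarity; in practice you would likely use numpy):

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length so dot product == cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)  # zero vector: nothing sensible to normalize
    return [x / norm for x in vec]

v = l2_normalize([3.0, 4.0])
print(v)                                   # [0.6, 0.8]
print(round(sum(x * x for x in v), 6))     # 1.0 (unit length)
```

The key discipline is doing this (or not doing it) identically for corpus and query vectors; mixing normalized and unnormalized vectors silently skews every score.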
What Similarity Scores Actually Mean
For normalized embeddings, cosine similarity / dot product is usually the main score you inspect.
But the numbers are not universal truth. They are:
- model-dependent
- domain-dependent
- chunking-dependent
- corpus-dependent
Still, for some normalized embedding models, these rough buckets can be a useful debugging intuition—not a product threshold:
| Cosine / Dot Score | Typical Interpretation | How to Treat It |
|---|---|---|
| 0.85+ | Near-duplicate or extremely strong semantic match | Often safe to trust, but still validate in-domain |
| 0.75–0.85 | Very strong match | Usually excellent retrieval territory |
| 0.60–0.75 | Good match, but more context-sensitive | Often useful for top-k retrieval |
| 0.40–0.60 | Weak-to-moderate relation | Can be noisy; rerank if the result matters |
| Below 0.40 | Often weak or unrelated | Do not use as a universal cutoff without validation |
The right way to choose thresholds is to build a small labeled set and evaluate:
- precision@k
- recall@k
- false-positive tolerance
- false-negative tolerance
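That calibration loop can be sketched in a few lines, assuming you have labeled query/doc pairs with precomputed similarity scores (every number here is invented):

```python
# Labeled (score, is_relevant) pairs from your own data; scores are invented.
pairs = [
    (0.91, True), (0.83, True), (0.78, False), (0.72, True),
    (0.66, False), (0.58, True), (0.45, False), (0.31, False),
]

def precision_recall_at_threshold(pairs, threshold):
    """Treat every pair scoring >= threshold as a predicted match."""
    predicted_pos = [rel for score, rel in pairs if score >= threshold]
    relevant_total = sum(rel for _, rel in pairs)
    if not predicted_pos:
        return 0.0, 0.0
    precision = sum(predicted_pos) / len(predicted_pos)
    recall = sum(predicted_pos) / relevant_total
    return precision, recall

# Sweep candidate thresholds and pick the one matching your
# false-positive / false-negative tolerance.
for t in (0.4, 0.6, 0.8):
    p, r = precision_recall_at_threshold(pairs, t)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
```

The shape of the output is the point: raising the threshold trades recall for precision, and where you sit on that curve should come from your labeled data, not from a table in a blog post.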
For production search, ranking quality matters more than any single universal threshold.
Common Pitfalls (and Fixes)
| Pitfall | Why It Happens | Fix |
|---|---|---|
| Treating `/v1/embeddings` as if it fully exposes model behavior | The OpenAI-shaped request makes that assumption feel natural | Remember that llama.cpp still fixes most behavior at server startup; use `/embeddings` when you need the native surface |
| Using a chat model for retrieval | “It can embed” gets confused with “it is a retrieval model” | Use a dedicated embedding checkpoint for search |
| Changing pooling between corpus and query | Different pipelines evolve independently | Lock model + pooling + normalization together for the whole system |
| Over-quantizing without validation | Smaller footprint looks attractive, but retrieval quality drifts silently | Compare Recall@k / MRR before and after quantization |
| Ignoring memory footprint of stored vectors | The model fit is planned, but the vector index footprint is not | Budget for both runtime model memory and corpus embedding storage |
| Using absolute similarity thresholds as doctrine | People copy thresholds from another model or blog post | Calibrate thresholds on your own data |
Conclusion
Working with llama.cpp embeddings is easiest when you separate client compatibility from actual embedding behavior. The CLI and /embeddings route expose the mechanics more honestly; /v1/embeddings is useful, but it can encourage the wrong assumptions if you read too much into its OpenAI-shaped surface. Use dedicated embedding models, validate quantization against retrieval metrics, tune batching to your hardware, and remember that retrieval requires embedding both the query and the corpus—not just one side of the comparison. The llama.cpp project already has the pieces; the real challenge is understanding which interface surface gives you which level of control. ([GitHub][1])
Key Takeaways
- llama.cpp exposes embeddings through the CLI, `/embeddings`, and `/v1/embeddings`
- `/v1/embeddings` is convenient, but the server configuration still determines most of the real behavior
- Use dedicated embedding models for search/retrieval
- Quantization is often worth it, but validate recall/precision, not just raw speed or memory
- For retrieval, embed both corpus and query, and budget memory for the stored vectors as well as the model itself
References
1: https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md "llama.cpp/tools/server/README.md at master · ggml-org/llama.cpp · GitHub"
2: https://github.com/ggml-org/llama.cpp/discussions/7712 "tutorial : compute embeddings using llama.cpp · ggml-org llama.cpp · Discussion #7712 · GitHub"
3: https://github.com/ggml-org/llama.cpp "GitHub - ggml-org/llama.cpp: LLM inference in C/C++ · GitHub"

