Working with Llama.cpp Embeddings
May 01, 2026
Overview
Working with machine learning embeddings in llama.cpp is straightforward once you understand that it exposes three different surfaces with slightly different behavior:
- the standalone `llama-embedding` CLI,
- the non-OpenAI `/embeddings` server route,
- the OpenAI-compatible `/v1/embeddings` server route.
That last one is where many people get justifiably tripped up. The endpoint borrows OpenAI-shaped conventions like `/v1/embeddings`, `model`, and `Authorization: Bearer ...`, but the naming and documentation do a poor job explaining what those fields actually mean in a local llama.cpp server. The bearer token is not required for embeddings, and it is not an OpenAI API key, even though the codebase is littered with references to "OpenAI API key."

For OpenAI compatibility alone, the API-key mechanism could be omitted from llama.cpp entirely: the server could accept `/v1/embeddings` requests with no Authorization header, or ignore a dummy bearer token supplied by OpenAI-compatible SDKs. The only real reason to define an API-key mechanism is separate from OpenAI compatibility: it lets server operators optionally gate access to the local endpoint, for example to prevent unauthorized use, meter requests, or put a simple access-control boundary in front of an exposed inference server.

That is a valid reason to support a token, but the interface blurs together three separate ideas: OpenAI client compatibility, optional local server authentication, and actual embedding/runtime configuration, with poor naming making the confusion worse. A better design would call it a server auth token or local access token, clearly state that it can be omitted when auth is disabled, explain that dummy values are only for SDK compatibility, and separately document that embedding behavior is still determined by startup configuration such as `--embeddings`, `--pooling`, batch sizes, context size, and the model loaded at startup.
A sharper observation is that authentication shouldn't be handled by llama.cpp at all: it should be handled at the edge of the system.
Note: I'm working on the v2 Llama embeddings endpoint https://github.com/ggml-org/llama.cpp/discussions/16957.
What Llama.cpp Embeddings Actually Gives You
At a high level:
- `llama-embedding` is the most direct local workflow.
- `llama-server /embeddings` is the more revealing HTTP surface if you need token-level embeddings or `--pooling none`.
- `llama-server /v1/embeddings` is the easiest path if you already have OpenAI-compatible tooling, but it is also the easiest place to form the wrong mental model.
The official server docs explicitly distinguish /embeddings from /v1/embeddings:
- `/embeddings` supports all pooling modes, including `--pooling none`.
- `/v1/embeddings` requires a pooling mode other than `none`.
- pooled outputs are Euclidean-normalized; with `pooling none`, `/embeddings` can return per-token embeddings instead. ([GitHub][1])
That distinction matters because, in machine learning, “embedding” can mean either:
- one vector per text (sentence/document retrieval), or
- one vector per token (more advanced analysis, alignment, custom pooling, token-level similarity).
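The link between the two meanings is simple: a pooled sentence vector is just a reduction over per-token vectors. A minimal sketch of mean pooling with invented numbers (this is an illustration of the idea, not llama.cpp's internal implementation):

```python
# Toy illustration: mean pooling collapses per-token vectors into one
# sentence vector. The numbers are made up; real token embeddings
# come from the model.
token_vectors = [
    [1.0, 0.0, 2.0],  # token 1
    [3.0, 2.0, 0.0],  # token 2
    [2.0, 1.0, 1.0],  # token 3
]

dim = len(token_vectors[0])
sentence_vector = [
    sum(vec[j] for vec in token_vectors) / len(token_vectors)
    for j in range(dim)
]
print(sentence_vector)  # [2.0, 1.0, 1.0]
```

With `--pooling none` you keep the per-token list; any pooled mode returns something shaped like `sentence_vector`.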
The Interface Trap: OpenAI Compatibility vs. Actual Control Surface
This is the nuance that deserves special emphasis.
The official docs show /v1/embeddings requests like:
```
Authorization: Bearer no-key
"model": "GPT-4"
```
Those fields are there because the route is OpenAI-compatible, not because llama.cpp is actually exposing the same abstraction boundary as OpenAI’s hosted embeddings service. On the llama.cpp side, whether embeddings are enabled at all, what pooling mode is used, how batching works, and what model is active are primarily controlled by server configuration. In router mode, requests are routed by the requested model name, but request-level overrides remain intentionally limited compared with what the OpenAI-shaped surface suggests. ([GitHub][1])
So the right mental model is:
- Use `/v1/embeddings` for client compatibility.
- Use `/embeddings` (and/or the CLI) when you want clearer control over what the server is actually doing.
That is not obvious from the interface alone, and it is a documentation weakness more than a user mistake.
Choosing the Right Embedding Model
For retrieval, search, clustering, semantic deduplication, or reranking pipelines, you usually want a dedicated embedding model, not a general chat/instruct checkpoint. The official embedding tutorial uses an embedding-specific model (Snowflake/snowflake-arctic-embed-s) and also notes support for embedding models such as BERT-family checkpoints. ([GitHub][2])
A practical way to think about model choice:
| Model Type | Usually Best For | Strength | Trade-off |
|---|---|---|---|
| Embedding-specific model | Search, retrieval, clustering, semantic matching | Trained to produce useful vector geometry | You need a separate generation model if you also want chat/completions |
| Chat / instruct model | Text generation | General-purpose language behavior | Embedding quality for retrieval is often not what you actually want |
| Reranker model | Final ranking after retrieval | Better precision on top-k candidates | Not a drop-in replacement for corpus-wide embedding generation |
A good rule of thumb:
- use an embedding model to embed your corpus and queries,
- optionally use a reranker after vector search,
- use a chat model only for downstream answer generation.
Quantized vs. Unquantized Models
The official embedding tutorial walks through both an F16 GGUF and a Q8_0 quantized version of the same embedding model. In that example, the quantized model is materially smaller on disk than the F16 version, which is exactly why quantization is so attractive for local retrieval systems. ([GitHub][2])
For embeddings, the practical trade-off is:
| Format | Pros | Cons | Usually a Good Fit For |
|---|---|---|---|
| F16 / BF16 | Highest fidelity; simplest baseline | Largest RAM/VRAM footprint | Benchmarking quality, high-end GPUs, smaller corpora |
| Q8 / Q6 | Often strong quality / memory trade-off | Still larger than aggressive low-bit quants | Most production local retrieval setups |
| Q4 and below | Much smaller; easier to fit on CPU boxes or weak GPUs | Higher risk of degrading retrieval quality | Memory-constrained experiments; validate carefully before production |
The important nuance: for embeddings, quality degradation may show up less as “the output looks weird” and more as recall/precision drift in search. That means you should validate quantization against a retrieval metric such as:
- Recall@k
- MRR
- NDCG
- or simply “did the right document stay in the top 5?”
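These metrics are cheap to compute yourself. A minimal sketch of Recall@k and MRR over a tiny hand-labeled set (the document ids and rankings here are invented; the ranked lists would come from your retrieval stack):

```python
# Each query has one relevant doc id and a ranked list of retrieved ids.
labeled = [
    {"relevant": "d1", "ranked": ["d1", "d7", "d3"]},
    {"relevant": "d4", "ranked": ["d9", "d4", "d2"]},
    {"relevant": "d5", "ranked": ["d8", "d2", "d6"]},  # complete miss
]

def recall_at_k(examples, k):
    """Fraction of queries whose relevant doc appears in the top k."""
    hits = sum(ex["relevant"] in ex["ranked"][:k] for ex in examples)
    return hits / len(examples)

def mrr(examples):
    """Mean reciprocal rank of the relevant doc (0 if not retrieved)."""
    total = 0.0
    for ex in examples:
        if ex["relevant"] in ex["ranked"]:
            total += 1.0 / (ex["ranked"].index(ex["relevant"]) + 1)
    return total / len(examples)

print(recall_at_k(labeled, 1))  # 1/3: only the first query hits at rank 1
print(recall_at_k(labeled, 2))  # 2/3: the second query hits at rank 2
print(mrr(labeled))             # (1 + 0.5 + 0) / 3 = 0.5
```

Running the same evaluation before and after quantization tells you directly whether the smaller model is costing you retrieval quality.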
Running Embeddings from the CLI
The official tutorial uses llama-embedding directly:
```sh
# F16
./llama-embedding -m model-f16.gguf -e -p "Hello world" --verbose-prompt -ngl 99

# Q8_0
./llama-embedding -m model-q8_0.gguf -e -p "Hello world" --verbose-prompt -ngl 99
```

And for a quick benchmark:

```sh
./llama-bench -m model-f16.gguf -r 10 -p 8,16,32,64,128,256,512 -n 0 -embd 1
```

Those commands are straight from the upstream embedding tutorial. ([GitHub][2])
What the main CLI flags mean
- `-m` → path to the GGUF model
- `-e` → run in embedding mode
- `-p` → the input text/prompt to embed
- `--verbose-prompt` → useful when you want to inspect tokenization behavior
- `-ngl 99` → offload as many layers as possible to GPU; use `-ngl 0` for CPU-only runs. The tutorial explicitly calls out `-ngl 99` as full offload and `-ngl 0` as CPU-only. ([GitHub][2])
Practical hardware guidance
- CPU-only: start with `-ngl 0`; focus on thread count and smaller quantized models
- Single GPU: `-ngl 99` (or equivalently "all available layers") is usually what you want
- Weak GPU / laptop GPU: Q8 or Q6 often gives a friendlier memory/throughput trade-off than F16
- Benchmark first: embedding throughput depends heavily on input length, not just model size
Running Embeddings from llama-server
The server README includes a dedicated embedding-server-style example:
```sh
llama-server -m model.gguf --embeddings --pooling cls -ub 8192
```

That is the official "serve an embedding model" example, and it already hints at the two knobs that matter most for embedding throughput on the server:

- `--pooling`
- `-ub` / `--ubatch-size` ([GitHub][3])
/v1/embeddings (OpenAI-compatible)
```sh
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{
    "input": "hello",
    "model": "GPT-4",
    "encoding_format": "float"
  }'
```

/embeddings (llama.cpp-native)
Use a similar request payload, but call /embeddings when you want the non-OpenAI route—especially if you need --pooling none, want token-level output, or want the llama.cpp-native response shape. The docs explicitly note that /embeddings supports all pooling modes, while /v1/embeddings requires pooling other than none and returns an OpenAI-compatible response shape. ([GitHub][1])
When to choose which route
| Route | Use It When | Strength | Limitation |
|---|---|---|---|
| /v1/embeddings | You want OpenAI SDK compatibility | Easy drop-in for existing clients | Poorer mental model of what llama.cpp is actually controlling |
| /embeddings | You want the llama.cpp-native embedding surface | Supports all pooling modes, including none | Not OpenAI-compatible |
Calling Embeddings from Python
Option 1: OpenAI client against /v1/embeddings
This is the easiest way if your app already uses the OpenAI Python client.
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="sk-no-key-required",  # placeholder if your server does not enforce auth
)

resp = client.embeddings.create(
    model="your-model-name-or-placeholder",
    input=["hello world", "semantic search is fun"],
    encoding_format="float",
)

for item in resp.data:
    print(len(item.embedding), item.index)
```

The official server docs explicitly show use of the OpenAI Python client with llama.cpp's compatible endpoints, and they also show the generic bearer token example. ([GitHub][1])
Option 2: Raw HTTP to /embeddings
This is a better fit when you want llama.cpp-native behavior.
```python
import requests

payload = {
    "input": ["hello world", "semantic search is fun"],
    "encoding_format": "float",
}

r = requests.post(
    "http://localhost:8080/embeddings",
    headers={"Content-Type": "application/json"},
    json=payload,
    timeout=60,
)
r.raise_for_status()
data = r.json()
print(data)
```

Option 3: Corpus + query example
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")

docs = [
    "Raft is a consensus algorithm used in distributed systems.",
    "Topological sort orders DAG nodes by dependency.",
    "Monotonic stacks solve nearest greater/smaller problems efficiently.",
]
query = "Which algorithm is used for consensus in distributed databases?"

doc_embeddings = client.embeddings.create(model="placeholder", input=docs, encoding_format="float").data
query_embedding = client.embeddings.create(model="placeholder", input=query, encoding_format="float").data[0].embedding

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# if vectors are normalized, dot product == cosine similarity
scored = []
for i, item in enumerate(doc_embeddings):
    score = dot(query_embedding, item.embedding)
    scored.append((score, docs[i]))

scored.sort(reverse=True)
for score, doc in scored:
    print(round(score, 4), doc)
```

If you are using pooled vectors from `/v1/embeddings` or pooled `/embeddings`, the docs say those embeddings are Euclidean-normalized, which is why dot product is a perfectly reasonable cosine shortcut there. ([GitHub][1])
Important Runtime Parameters
The server README surfaces the most important embedding-related knobs:
- `--embeddings` → enable embedding mode/endpoints
- `--pooling {none,mean,cls,last,rank}` → pooling type
- `-c`, `--ctx-size` → prompt context size
- `-b`, `--batch-size` → logical max batch size
- `-ub`, `--ubatch-size` → physical max batch size
- `-t`, `--threads` and `-tb`, `--threads-batch` → CPU thread counts
- `-ngl`, `--gpu-layers` → how many layers go to VRAM ([GitHub][1])
What these actually mean in practice
| Parameter | What It Controls | Usually Good For | Trade-off |
|---|---|---|---|
| `--pooling` | How token embeddings become one vector | `mean` or model default for general retrieval; `cls` for models trained that way; `none` for token-level work | Wrong pooling can quietly degrade retrieval quality |
| `-c` / `--ctx-size` | Maximum token context per input | Long documents and chunk-heavy pipelines | Bigger context raises memory use |
| `-b` / `-ub` | Logical vs. physical batching | Server throughput tuning for many inputs | Larger values can improve throughput but spike RAM/VRAM and can trigger OOM |
| `-t` / `-tb` | CPU parallelism | CPU-only boxes or hybrid prompt processing | Too many threads can hurt due to contention |
| `-ngl` | How much of the model is offloaded to GPU | GPU-equipped machines; usually as high as fits | Higher offload needs more VRAM |
Hardware-oriented starting points
- CPU-only server: start with quantized models, moderate `-ub`, and tune `-t`/`-tb`
- Consumer GPU: start with Q8/Q6 and high `-ngl`; only push `-ub` upward after watching VRAM
- Large GPU / workstation: F16 or Q8 with larger `-ub` can be worthwhile if your workload is many long chunks
- Laptop / constrained RAM: reduce context size and use smaller quantized models first
Search / Retrieval Workflow: Query + Corpus
This is another easy point to miss: for search, you do not just embed the user query. You embed:
- the corpus (offline or in batches),
- the incoming query (online),
- then compare vectors.
That means memory planning matters.
Memory footprint formula
If you store embeddings as float32:
```
memory ≈ number_of_vectors × dimension × 4 bytes
```

Examples:

- 100,000 docs × 768 dims × 4 bytes ≈ 307 MB
- 1,000,000 docs × 768 dims × 4 bytes ≈ 2.9 GB
If you store them as float16 after generation:
```
memory ≈ number_of_vectors × dimension × 2 bytes
```

That roughly halves the footprint, though you should validate whether reduced precision affects recall in your retrieval stack.
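Both formulas are easy to wire into a quick capacity check before you build the index (plain Python; the corpus sizes are the examples from above):

```python
def embedding_memory_bytes(n_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Approximate storage for a dense matrix of stored embeddings."""
    return n_vectors * dim * bytes_per_value

# float32 (4 bytes per value)
print(embedding_memory_bytes(100_000, 768) / 1e6)    # 307.2 (MB)
print(embedding_memory_bytes(1_000_000, 768) / 1e9)  # 3.072 (GB, ~2.9 GiB)

# float16 (2 bytes per value) halves the footprint
print(embedding_memory_bytes(1_000_000, 768, bytes_per_value=2) / 1e9)  # 1.536 (GB)
```

Remember this budget is in addition to the runtime memory of the model itself.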
Practical advice
- Precompute corpus embeddings once
- Use the same embedding model + pooling strategy for query and corpus
- Chunk long documents consistently
- Normalize consistently (or rely on the route’s normalization rules when applicable)
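On the last point: if your route or storage layer does not normalize for you, a minimal L2-normalization sketch looks like this (plain Python for clarity; in practice you would likely use numpy):

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length so dot product == cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)  # zero vector: nothing sensible to normalize
    return [x / norm for x in vec]

v = l2_normalize([3.0, 4.0])
print(v)                                   # [0.6, 0.8]
print(round(sum(x * x for x in v), 6))     # 1.0 (unit length)
```

The key discipline is doing this (or not doing it) identically for corpus and query vectors; mixing normalized and unnormalized vectors silently skews every score.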
What Similarity Scores Actually Mean
For normalized embeddings, cosine similarity / dot product is usually the main score you inspect.
But the numbers are not universal truth. They are:
- model-dependent
- domain-dependent
- chunking-dependent
- corpus-dependent
Still, for some normalized embedding models, these rough buckets can be a useful debugging intuition—not a product threshold:
| Cosine / Dot Score | Typical Interpretation | How to Treat It |
|---|---|---|
| 0.85+ | Near-duplicate or extremely strong semantic match | Often safe to trust, but still validate in-domain |
| 0.75–0.85 | Very strong match | Usually excellent retrieval territory |
| 0.60–0.75 | Good match, but more context-sensitive | Often useful for top-k retrieval |
| 0.40–0.60 | Weak-to-moderate relation | Can be noisy; rerank if the result matters |
| Below 0.40 | Often weak or unrelated | Do not use as a universal cutoff without validation |
The right way to choose thresholds is to build a small labeled set and evaluate:
- precision@k
- recall@k
- false-positive tolerance
- false-negative tolerance
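That calibration loop can be sketched in a few lines, assuming you have labeled query/doc pairs with precomputed similarity scores (every number here is invented):

```python
# Labeled (score, is_relevant) pairs from your own data; scores are invented.
pairs = [
    (0.91, True), (0.83, True), (0.78, False), (0.72, True),
    (0.66, False), (0.58, True), (0.45, False), (0.31, False),
]

def precision_recall_at_threshold(pairs, threshold):
    """Treat every pair scoring >= threshold as a predicted match."""
    predicted_pos = [rel for score, rel in pairs if score >= threshold]
    relevant_total = sum(rel for _, rel in pairs)
    if not predicted_pos:
        return 0.0, 0.0
    precision = sum(predicted_pos) / len(predicted_pos)
    recall = sum(predicted_pos) / relevant_total
    return precision, recall

# Sweep candidate thresholds and pick the one matching your
# false-positive / false-negative tolerance.
for t in (0.4, 0.6, 0.8):
    p, r = precision_recall_at_threshold(pairs, t)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
```

The shape of the output is the point: raising the threshold trades recall for precision, and where you sit on that curve should come from your labeled data, not from a table in a blog post.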
For production search, ranking quality matters more than any single universal threshold.
Common Pitfalls (and Fixes)
| Pitfall | Why It Happens | Fix |
|---|---|---|
| Treating `/v1/embeddings` as if it fully exposes model behavior | The OpenAI-shaped request makes that assumption feel natural | Remember that llama.cpp still fixes most behavior at server startup; use `/embeddings` when you need the native surface |
| Using a chat model for retrieval | “It can embed” gets confused with “it is a retrieval model” | Use a dedicated embedding checkpoint for search |
| Changing pooling between corpus and query | Different pipelines evolve independently | Lock model + pooling + normalization together for the whole system |
| Over-quantizing without validation | Smaller footprint looks attractive, but retrieval quality drifts silently | Compare Recall@k / MRR before and after quantization |
| Ignoring memory footprint of stored vectors | The model fit is planned, but the vector index footprint is not | Budget for both runtime model memory and corpus embedding storage |
| Using absolute similarity thresholds as doctrine | People copy thresholds from another model or blog post | Calibrate thresholds on your own data |
Conclusion
Working with llama.cpp embeddings is easiest when you separate client compatibility from actual embedding behavior. The CLI and /embeddings route expose the mechanics more honestly; /v1/embeddings is useful, but it can encourage the wrong assumptions if you read too much into its OpenAI-shaped surface. Use dedicated embedding models, validate quantization against retrieval metrics, tune batching to your hardware, and remember that retrieval requires embedding both the query and the corpus—not just one side of the comparison. The llama.cpp project already has the pieces; the real challenge is understanding which interface surface gives you which level of control. ([GitHub][1])
Key Takeaways
- llama.cpp exposes embeddings through the CLI, `/embeddings`, and `/v1/embeddings`
- `/v1/embeddings` is convenient, but the server configuration still determines most of the real behavior
- Use dedicated embedding models for search/retrieval
- Quantization is often worth it, but validate recall/precision, not just raw speed or memory
- For retrieval, embed both corpus and query, and budget memory for the stored vectors as well as the model itself
References
1: https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md "llama.cpp/tools/server/README.md at master · ggml-org/llama.cpp · GitHub"
2: https://github.com/ggml-org/llama.cpp/discussions/7712 "tutorial : compute embeddings using llama.cpp · ggml-org llama.cpp · Discussion #7712 · GitHub"
3: https://github.com/ggml-org/llama.cpp "GitHub - ggml-org/llama.cpp: LLM inference in C/C++ · GitHub"

