Working with Llama.cpp Embeddings

May 01, 2026

Table of Contents

Overview
What Llama.cpp Embeddings Actually Gives You
The Interface Trap: OpenAI Compatibility vs. Actual Control Surface
Choosing the Right Embedding Model
Quantized vs. Unquantized Models
Running Embeddings from the CLI
Running Embeddings from llama-server
Calling Embeddings from Python
Important Runtime Parameters
Search / Retrieval Workflow: Query + Corpus
What Similarity Scores Actually Mean
Common Pitfalls (and Fixes)
Conclusion
Key Takeaways

Overview

Working with machine learning embeddings in llama.cpp is straightforward once you understand that it exposes three different surfaces with slightly different behavior:

  1. the standalone llama-embedding CLI,
  2. the non-OpenAI /embeddings server route,
  3. the OpenAI-compatible /v1/embeddings server route.

That last one is where many people get justifiably tripped up. The endpoint borrows OpenAI-shaped conventions like /v1/embeddings, model, and Authorization: Bearer ..., but the naming and documentation do a poor job explaining what those fields actually mean in a local llama.cpp server. The bearer token is not required for embeddings, and it is not an OpenAI API key, even though the codebase is littered with references to “OpenAI API key.”

For OpenAI compatibility alone, the API-key mechanism could be omitted from llama.cpp entirely: the server could accept /v1/embeddings requests with no Authorization header, or ignore a dummy bearer token supplied by OpenAI-compatible SDKs. The only real reason to define an API-key mechanism is separate from OpenAI compatibility: it lets server operators optionally gate access to the local endpoint—for example, to prevent unauthorized use, meter requests, or put a simple access-control boundary in front of an exposed inference server.

That is a valid reason to support a token, but the interface blurs together three separate ideas: OpenAI client compatibility, optional local server authentication, and actual embedding/runtime configuration, with poor naming making that confusion worse. A better design would call it a server auth token or local access token, clearly state that it can be omitted when auth is disabled, explain that dummy values are only for SDK compatibility, and separately document that embedding behavior is still determined by startup configuration such as --embeddings, --pooling, batch sizes, context size, and the model loaded at startup.

A sharper observation is that authentication arguably shouldn't be handled by llama.cpp at all; it belongs at the edge of the system, for example in a reverse proxy or API gateway in front of the inference server.

Note: I'm working on the v2 llama.cpp embeddings endpoint: https://github.com/ggml-org/llama.cpp/discussions/16957.

What Llama.cpp Embeddings Actually Gives You

At a high level:

  • llama-embedding is the most direct local workflow.
  • llama-server /embeddings is the more revealing HTTP surface if you need token-level embeddings or --pooling none.
  • llama-server /v1/embeddings is the easiest path if you already have OpenAI-compatible tooling, but it is also the easiest place to form the wrong mental model.

The official server docs explicitly distinguish /embeddings from /v1/embeddings:

  • /embeddings supports all pooling modes, including --pooling none.
  • /v1/embeddings requires a pooling mode other than none.
  • pooled outputs are Euclidean-normalized; with pooling none, /embeddings can return per-token embeddings instead. ([GitHub][1])

That distinction matters because, in machine learning, “embedding” can mean either:

  • one vector per text (sentence/document retrieval), or
  • one vector per token (more advanced analysis, alignment, custom pooling, token-level similarity).

The Interface Trap: OpenAI Compatibility vs. Actual Control Surface

This is the nuance that deserves special emphasis.

The official docs show /v1/embeddings requests like:

  • Authorization: Bearer no-key
  • "model": "GPT-4"

Those fields are there because the route is OpenAI-compatible, not because llama.cpp is actually exposing the same abstraction boundary as OpenAI’s hosted embeddings service. On the llama.cpp side, whether embeddings are enabled at all, what pooling mode is used, how batching works, and what model is active are primarily controlled by server configuration. In router mode, requests are routed by the requested model name, but request-level overrides remain intentionally limited compared with what the OpenAI-shaped surface suggests. ([GitHub][1])

So the right mental model is:

  • Use /v1/embeddings for client compatibility
  • Use /embeddings (and/or the CLI) when you want clearer control over what the server is actually doing

That is not obvious from the interface alone, and it is a documentation weakness more than a user mistake.

Choosing the Right Embedding Model

For retrieval, search, clustering, semantic deduplication, or reranking pipelines, you usually want a dedicated embedding model, not a general chat/instruct checkpoint. The official embedding tutorial uses an embedding-specific model (Snowflake/snowflake-arctic-embed-s) and also notes support for embedding models such as BERT-family checkpoints. ([GitHub][2])

A practical way to think about model choice:

| Model Type | Usually Best For | Strength | Trade-off |
| --- | --- | --- | --- |
| Embedding-specific model | Search, retrieval, clustering, semantic matching | Trained to produce useful vector geometry | You need a separate generation model if you also want chat/completions |
| Chat / instruct model | Text generation | General-purpose language behavior | Embedding quality for retrieval is often not what you actually want |
| Reranker model | Final ranking after retrieval | Better precision on top-k candidates | Not a drop-in replacement for corpus-wide embedding generation |

A good rule of thumb:

  • use an embedding model to embed your corpus and queries,
  • optionally use a reranker after vector search,
  • use a chat model only for downstream answer generation.

Quantized vs. Unquantized Models

The official embedding tutorial walks through both an F16 GGUF and a Q8_0 quantized version of the same embedding model. In that example, the quantized model is materially smaller on disk than the F16 version, which is exactly why quantization is so attractive for local retrieval systems. ([GitHub][2])

For embeddings, the practical trade-off is:

| Format | Pros | Cons | Usually a Good Fit For |
| --- | --- | --- | --- |
| F16 / BF16 | Highest fidelity; simplest baseline | Largest RAM/VRAM footprint | Benchmarking quality, high-end GPUs, smaller corpora |
| Q8 / Q6 | Often strong quality / memory trade-off | Still larger than aggressive low-bit quants | Most production local retrieval setups |
| Q4 and below | Much smaller; easier to fit on CPU boxes or weak GPUs | Higher risk of degrading retrieval quality | Memory-constrained experiments; validate carefully before production |

The important nuance: for embeddings, quality degradation may show up less as “the output looks weird” and more as recall/precision drift in search. That means you should validate quantization against a retrieval metric (a minimal evaluation sketch follows this list), such as:

  • Recall@k
  • MRR
  • NDCG
  • or simply “did the right document stay in the top 5?”
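
To make that concrete, here is a minimal, dependency-free sketch of the kind of before/after check you might run. It assumes you have already embedded the same corpus and queries with each model variant (e.g., F16 and Q8_0), and that labels[i] holds the index of the relevant document for query i; all names are illustrative.

def top_k_indices(query_vec, doc_vecs, k):
    """Rank documents by dot product (equals cosine if vectors are normalized)."""
    scores = [sum(q * d for q, d in zip(query_vec, doc)) for doc in doc_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def recall_at_k(query_vecs, doc_vecs, labels, k=5):
    """Fraction of queries whose relevant document appears in the top k."""
    hits = sum(1 for qv, rel in zip(query_vecs, labels)
               if rel in top_k_indices(qv, doc_vecs, k))
    return hits / len(labels)

def mrr(query_vecs, doc_vecs, labels, k=10):
    """Mean reciprocal rank of the relevant document within the top k."""
    total = 0.0
    for qv, rel in zip(query_vecs, labels):
        ranking = top_k_indices(qv, doc_vecs, k)
        if rel in ranking:
            total += 1.0 / (ranking.index(rel) + 1)
    return total / len(labels)

# Run once per model variant and compare, e.g.:
# print("F16  Recall@5:", recall_at_k(queries_f16, docs_f16, labels))
# print("Q8_0 Recall@5:", recall_at_k(queries_q8, docs_q8, labels))

If the quantized run loses noticeably on these metrics for your own data, that is exactly the silent drift described above.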

Running Embeddings from the CLI

The official tutorial uses llama-embedding directly:

# F16
./llama-embedding -m model-f16.gguf -e -p "Hello world" --verbose-prompt -ngl 99

# Q8_0
./llama-embedding -m model-q8_0.gguf -e -p "Hello world" --verbose-prompt -ngl 99

And for a quick benchmark:

./llama-bench -m model-f16.gguf -r 10 -p 8,16,32,64,128,256,512 -n 0 -embd 1

Those commands are straight from the upstream embedding tutorial. In the llama-bench line, -r sets the number of repetitions, -p lists the prompt lengths to test, -n 0 disables token generation, and -embd 1 switches the benchmark to embedding mode. ([GitHub][2])

What the main CLI flags mean

  • -m → path to the GGUF model
  • -e → process escape sequences (e.g. \n, \t) in the prompt text; llama-embedding computes embeddings regardless
  • -p → the input text/prompt to embed
  • --verbose-prompt → useful when you want to inspect tokenization behavior
  • -ngl 99 → offload as many layers as possible to GPU; use -ngl 0 for CPU-only runs. The tutorial explicitly calls out -ngl 99 as full offload and -ngl 0 as CPU-only. ([GitHub][2])

Practical hardware guidance

  • CPU-only: start with -ngl 0; focus on thread count and smaller quantized models
  • Single GPU: -ngl 99 (or equivalently “all available layers”) is usually what you want
  • Weak GPU / laptop GPU: Q8 or Q6 often gives a friendlier memory/throughput trade-off than F16
  • Benchmark first: embedding throughput depends heavily on input length, not just model size
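
For instance, a CPU-only run combining the flags above might look like this (the model filename and thread count are illustrative):

# CPU-only: no GPU offload, explicit thread count
./llama-embedding -m model-q8_0.gguf -e -p "Hello world" -ngl 0 -t 8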

Running Embeddings from llama-server

The server README includes a dedicated embedding-server-style example:

llama-server -m model.gguf --embeddings --pooling cls -ub 8192

That is the official “serve an embedding model” example, and it already hints at the two knobs that matter most for embedding throughput on the server:

  • --pooling
  • -ub / --ubatch-size ([GitHub][3])

/v1/embeddings (OpenAI-compatible)

curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{
    "input": "hello",
    "model": "GPT-4",
    "encoding_format": "float"
  }'

/embeddings (llama.cpp-native)

Use a similar request payload, but call /embeddings when you want the non-OpenAI route—especially if you need --pooling none, want token-level output, or want the llama.cpp-native response shape. The docs explicitly note that /embeddings supports all pooling modes, while /v1/embeddings requires pooling other than none and returns an OpenAI-compatible response shape. ([GitHub][1])
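
As a sketch, a native-route request can mirror the curl call above, minus the bearer token (which is not needed when auth is disabled). The payload below reuses the "input" field from the Python example later in this post; field names and the native response shape can differ across server versions, so check the server README for your build:

curl http://localhost:8080/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "input": "hello"
  }'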

When to choose which route

| Route | Use It When | Strength | Limitation |
| --- | --- | --- | --- |
| /v1/embeddings | You want OpenAI SDK compatibility | Easy drop-in for existing clients | Poorer mental model of what llama.cpp is actually controlling |
| /embeddings | You want the llama.cpp-native embedding surface | Supports all pooling modes, including none | Not OpenAI-compatible |

Calling Embeddings from Python

Option 1: OpenAI client against /v1/embeddings

This is the easiest way if your app already uses the OpenAI Python client.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="sk-no-key-required",   # placeholder if your server does not enforce auth
)

resp = client.embeddings.create(
    model="your-model-name-or-placeholder",
    input=["hello world", "semantic search is fun"],
    encoding_format="float",
)

for item in resp.data:
    print(len(item.embedding), item.index)

The official server docs explicitly show use of the OpenAI Python client with llama.cpp’s compatible endpoints, and they also show the generic bearer token example. ([GitHub][1])

Option 2: Raw HTTP to /embeddings

This is a better fit when you want llama.cpp-native behavior.

import requests

payload = {
    "input": ["hello world", "semantic search is fun"],
    "encoding_format": "float",
}

r = requests.post(
    "http://localhost:8080/embeddings",
    headers={"Content-Type": "application/json"},
    json=payload,
    timeout=60,
)

r.raise_for_status()
data = r.json()
print(data)

Option 3: Corpus + query example

from openai import OpenAI
import math

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")

docs = [
    "Raft is a consensus algorithm used in distributed systems.",
    "Topological sort orders DAG nodes by dependency.",
    "Monotonic stacks solve nearest greater/smaller problems efficiently.",
]

query = "Which algorithm is used for consensus in distributed databases?"

doc_embeddings = client.embeddings.create(model="placeholder", input=docs, encoding_format="float").data
query_embedding = client.embeddings.create(model="placeholder", input=query, encoding_format="float").data[0].embedding

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# if vectors are normalized, dot product == cosine similarity
scored = []
for i, item in enumerate(doc_embeddings):
    score = dot(query_embedding, item.embedding)
    scored.append((score, docs[i]))

scored.sort(reverse=True)
for score, doc in scored:
    print(round(score, 4), doc)

If you are using pooled vectors from /v1/embeddings or pooled /embeddings, the docs say those embeddings are Euclidean-normalized, which is why dot product is a perfectly reasonable cosine shortcut there. ([GitHub][1])
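
If you are ever unsure whether your vectors are normalized (for example, token-level output from --pooling none, or vectors produced by another pipeline), it is cheap to normalize them yourself so that dot product and cosine similarity coincide. A minimal helper, assuming plain Python lists:

import math

def l2_normalize(vec):
    """Scale a vector to unit length; afterwards dot product == cosine."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec

def cosine(a, b):
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))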

Important Runtime Parameters

The server README surfaces the most important embedding-related knobs:

  • --embeddings → enable embedding mode/endpoints
  • --pooling {none,mean,cls,last,rank} → pooling type
  • -c, --ctx-size → prompt context size
  • -b, --batch-size → logical max batch size
  • -ub, --ubatch-size → physical max batch size
  • -t, --threads and -tb, --threads-batch → CPU thread counts
  • -ngl, --gpu-layers → how many layers go to VRAM ([GitHub][1])

What these actually mean in practice

| Parameter | What It Controls | Usually Good For | Trade-off |
| --- | --- | --- | --- |
| --pooling | How token embeddings become one vector | mean or model default for general retrieval; cls for models trained that way; none for token-level work | Wrong pooling can quietly degrade retrieval quality |
| -c / --ctx-size | Maximum token context per input | Long documents and chunk-heavy pipelines | Bigger context raises memory use |
| -b / -ub | Logical vs physical batching | Server throughput tuning for many inputs | Larger values can improve throughput but spike RAM/VRAM and can trigger OOM |
| -t / -tb | CPU parallelism | CPU-only boxes or hybrid prompt processing | Too many threads can hurt due to contention |
| -ngl | How much of the model is offloaded to GPU | GPU-equipped machines; usually as high as fits | Higher offload needs more VRAM |

Hardware-oriented starting points

  • CPU-only server: start with quantized models, moderate -ub, and tune -t / -tb
  • Consumer GPU: start with Q8/Q6 and high -ngl; only push -ub upward after watching VRAM
  • Large GPU / workstation: F16 or Q8 with larger -ub can be worthwhile if your workload is many long chunks
  • Laptop / constrained RAM: reduce context size and use smaller quantized models first
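
As concrete illustrations of those starting points (model filenames and batch/thread numbers are placeholders to tune, not recommendations, and the right --pooling value depends on your model):

# CPU-only box: quantized model, moderate physical batch, explicit threads
llama-server -m embed-model-q8_0.gguf --embeddings --pooling mean -ub 512 -t 8

# Consumer GPU: full offload first, then raise -ub while watching VRAM
llama-server -m embed-model-q8_0.gguf --embeddings --pooling mean -ngl 99 -ub 2048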

Search / Retrieval Workflow: Query + Corpus

This is another easy point to miss: for search, you do not just embed the user query. You embed:

  1. the corpus (offline or in batches),
  2. the incoming query (online),
  3. then compare vectors.

That means memory planning matters.

Memory footprint formula

If you store embeddings as float32:

memory ≈ number_of_vectors × dimension × 4 bytes

Examples:

  • 100,000 docs × 768 dims × 4 bytes ≈ 307 MB
  • 1,000,000 docs × 768 dims × 4 bytes ≈ 3.1 GB

If you store them as float16 after generation:

memory ≈ number_of_vectors × dimension × 2 bytes

That roughly halves footprint, though you should validate whether reduced precision affects recall in your retrieval stack.
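
The arithmetic is easy to script when budgeting a deployment. A small sketch:

def embedding_memory_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw storage for a dense embedding matrix (ignores any index overhead)."""
    return num_vectors * dim * bytes_per_value

for n in (100_000, 1_000_000):
    f32 = embedding_memory_bytes(n, 768, 4)
    f16 = embedding_memory_bytes(n, 768, 2)
    print(f"{n:>9,} docs: {f32 / 1e9:.2f} GB float32, {f16 / 1e9:.2f} GB float16")

Note that real vector indexes (HNSW graphs, IVF lists, and so on) add overhead on top of the raw matrix.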

Practical advice

  • Precompute corpus embeddings once
  • Use the same embedding model + pooling strategy for query and corpus
  • Chunk long documents consistently
  • Normalize consistently (or rely on the route’s normalization rules when applicable)
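
Putting the precompute step into code: here is a minimal sketch using numpy and the OpenAI-compatible route from Option 1, storing the corpus as float16 and upcasting at query time. The model name is a placeholder and the two-document corpus is purely illustrative:

import numpy as np
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")

def embed(texts):
    resp = client.embeddings.create(model="placeholder", input=texts,
                                    encoding_format="float")
    return np.asarray([item.embedding for item in resp.data], dtype=np.float32)

# Offline: embed the corpus once and persist as float16 to halve the footprint.
docs = ["Raft is a consensus algorithm.", "Topological sort orders DAG nodes."]
np.save("corpus_f16.npy", embed(docs).astype(np.float16))

# Online: load, upcast for stable math, and score the incoming query.
corpus = np.load("corpus_f16.npy").astype(np.float32)
query = embed(["consensus in distributed databases"])[0]
scores = corpus @ query  # dot product == cosine for normalized vectors
for i in np.argsort(-scores)[:2]:
    print(float(scores[i]), docs[i])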

What Similarity Scores Actually Mean

For normalized embeddings, cosine similarity / dot product is usually the main score you inspect.

But the numbers are not universal truth. They are:

  • model-dependent
  • domain-dependent
  • chunking-dependent
  • corpus-dependent

Still, for some normalized embedding models, these rough buckets can be a useful debugging intuition—not a product threshold:

| Cosine / Dot Score | Typical Interpretation | How to Treat It |
| --- | --- | --- |
| 0.85+ | Near-duplicate or extremely strong semantic match | Often safe to trust, but still validate in-domain |
| 0.75–0.85 | Very strong match | Usually excellent retrieval territory |
| 0.60–0.75 | Good match, but more context-sensitive | Often useful for top-k retrieval |
| 0.40–0.60 | Weak-to-moderate relation | Can be noisy; rerank if the result matters |
| Below 0.40 | Often weak or unrelated | Do not use as a universal cutoff without validation |

The right way to choose thresholds is to build a small labeled set and evaluate:

  • precision@k
  • recall@k
  • false-positive tolerance
  • false-negative tolerance

For production search, ranking quality matters more than any single universal threshold.
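
As a sketch of what that calibration can look like in practice: sweep candidate cutoffs over a small set of scored pairs with relevance judgments and keep the one with the best F1. The pairs below are illustrative values, not real measurements:

# (similarity score, human relevance judgment) pairs from your own data
pairs = [(0.82, True), (0.71, True), (0.55, False), (0.64, True),
         (0.48, False), (0.77, True), (0.39, False), (0.58, False)]

def f1_at(threshold, pairs):
    """F1 of treating score >= threshold as 'relevant'."""
    tp = sum(1 for s, rel in pairs if s >= threshold and rel)
    fp = sum(1 for s, rel in pairs if s >= threshold and not rel)
    fn = sum(1 for s, rel in pairs if s < threshold and rel)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

best_f1, best_t = max((f1_at(t / 100, pairs), t / 100) for t in range(30, 95, 5))
print(f"best F1 {best_f1:.3f} at threshold {best_t:.2f}")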

Common Pitfalls (and Fixes)

| Pitfall | Why It Happens | Fix |
| --- | --- | --- |
| Treating /v1/embeddings as if it fully exposes model behavior | The OpenAI-shaped request makes that assumption feel natural | Remember that llama.cpp still fixes most behavior at server startup; use /embeddings when you need the native surface |
| Using a chat model for retrieval | “It can embed” gets confused with “it is a retrieval model” | Use a dedicated embedding checkpoint for search |
| Changing pooling between corpus and query | Different pipelines evolve independently | Lock model + pooling + normalization together for the whole system |
| Over-quantizing without validation | Smaller footprint looks attractive, but retrieval quality drifts silently | Compare Recall@k / MRR before and after quantization |
| Ignoring memory footprint of stored vectors | The model fit is planned, but the vector index footprint is not | Budget for both runtime model memory and corpus embedding storage |
| Using absolute similarity thresholds as doctrine | People copy thresholds from another model or blog post | Calibrate thresholds on your own data |

Conclusion

Working with llama.cpp embeddings is easiest when you separate client compatibility from actual embedding behavior. The CLI and /embeddings route expose the mechanics more honestly; /v1/embeddings is useful, but it can encourage the wrong assumptions if you read too much into its OpenAI-shaped surface. Use dedicated embedding models, validate quantization against retrieval metrics, tune batching to your hardware, and remember that retrieval requires embedding both the query and the corpus—not just one side of the comparison. The llama.cpp project already has the pieces; the real challenge is understanding which interface surface gives you which level of control. ([GitHub][1])

Key Takeaways

  • llama.cpp exposes embeddings through the CLI, /embeddings, and /v1/embeddings
  • /v1/embeddings is convenient, but the server configuration still determines most of the real behavior
  • Use dedicated embedding models for search/retrieval
  • Quantization is often worth it, but validate recall/precision, not just raw speed or memory
  • For retrieval, embed both corpus and query, and budget memory for the stored vectors as well as the model itself

References

1: https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md "llama.cpp/tools/server/README.md at master · ggml-org/llama.cpp · GitHub"
2: https://github.com/ggml-org/llama.cpp/discussions/7712 "tutorial : compute embeddings using llama.cpp · ggml-org llama.cpp · Discussion #7712 · GitHub"
3: https://github.com/ggml-org/llama.cpp "GitHub - ggml-org/llama.cpp: LLM inference in C/C++ · GitHub"