How to generate embeddings for RAG
When using retrieval-augmented generation (RAG) to provide additional context to an LLM, you have to generate vector embeddings for each prompt. Over the past year, plenty of ready-made libraries have become available, making this task rather straightforward.
The more interesting aspect is selecting the right model. When you browse Hugging Face or TensorFlow Hub, there are plenty of good options available. You also need to decide whether you plan to embed prompts in different languages and therefore want to go with a multilingual version.
After deciding on a model, use a framework to generate the embeddings - the most relevant ones being PyTorch and TensorFlow. If you opted for a model from Hugging Face, you can use the excellent Transformers library as a frontend for either, otherwise you can go with TensorFlow directly.
The sample code below shows how to load a model and encode the prompts. To keep it easy to follow, the prompts are encoded individually rather than in one batch.
import torch
from transformers import AutoTokenizer, AutoModel
sentences = [
    "How did the stock market develop?",
    "What is the development of the stock market?",
    "How can I have great organic food?",
    "The stock did drop.",
    "The stock price did fall.",
    "Price of the stock did reduce.",
    "The stock did go up.",
    "Die Aktie ist gestiegen"
]
# --- Setup the model ---
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
model = AutoModel.from_pretrained("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
all_embeddings = []
# --- Generate embeddings ---
for sentence in sentences:
    encoded = tokenizer(sentence, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        output = model(**encoded)
    # Mean pooling over the token embeddings, weighted by the attention mask
    token_embeddings = output.last_hidden_state
    attention_mask = encoded['attention_mask']
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
    sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    sentence_embedding = sum_embeddings / sum_mask
    all_embeddings.append(sentence_embedding.squeeze(0))
# Stack the individual sentence embeddings into one tensor
embedding_tensor = torch.stack(all_embeddings)
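For this model the hidden size is 384, so with the eight sentences above embedding_tensor should end up with shape (8, 384) - a quick sanity check:
print(embedding_tensor.shape)  # expected: torch.Size([8, 384])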
Not shown here: I ran the same test with the model "universal-sentence-encoder-multilingual" using TensorFlow. At least for the simple dataset above, the Hugging Face model dramatically outperformed it.
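For reference, loading that model from TensorFlow Hub looks roughly like this - a sketch, assuming the tensorflow-hub and tensorflow-text packages are installed:
import tensorflow_hub as hub
import tensorflow_text  # registers the custom ops the multilingual encoder needs

# Load the multilingual Universal Sentence Encoder and embed all prompts in one call
use_model = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual/3")
use_embeddings = use_model(sentences)  # shape: (8, 512)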
To analyze the results graphically, you can write the embeddings to disk and use TensorBoard's embedding projector for visualization.
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter(log_dir="logs")
writer.add_embedding(embedding_tensor, metadata=sentences)
writer.close()
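Then launch TensorBoard pointed at the log directory and open the Projector tab:
tensorboard --logdir logs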
Another option is to reduce the vectors to two dimensions with scikit-learn first and display them with the excellent Vega graphing library.
from sklearn.decomposition import PCA
pca = PCA(n_components=2, random_state=42)
results = pca.fit_transform(embedding_tensor.numpy())
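To render the reduced points with Vega, one option is Altair, the Python API for Vega-Lite; a minimal sketch reusing the results and sentences from above (the output file name is just an example):
import pandas as pd
import altair as alt

# Collect the 2D coordinates and the original prompts in a DataFrame
df = pd.DataFrame(results, columns=["x", "y"])
df["sentence"] = sentences

# Scatter plot with the prompt text shown as a tooltip
chart = alt.Chart(df).mark_circle(size=80).encode(x="x", y="y", tooltip="sentence")
chart.save("embeddings.html")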
Here is a sample visualization using PCA as the reduction algorithm with the parameters given above for a multilingual model from Hugging Face. You can clearly see how the model clustered the individual prompts.
For a more thorough evaluation, you can calculate the cosine similarity between each pair of sentences and, for example, visualize it with Vega as well.
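A minimal way to compute the pairwise similarities with PyTorch, reusing the embedding_tensor from above:
import torch.nn.functional as F

# Normalize to unit length so the dot product equals the cosine similarity
normalized = F.normalize(embedding_tensor, p=2, dim=1)
similarity_matrix = normalized @ normalized.T  # (8, 8) matrix of pairwise similarities
print(similarity_matrix)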
Of course, in a production setting you would integrate this with a database such as Postgres or Cassandra to store the embeddings and retrieve the data for the visualizations directly from there.
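As an illustration only, storing the embeddings could look like this with Postgres and the pgvector extension - the connection string, table name and the psycopg2 driver are assumptions, not part of the setup above:
import psycopg2

conn = psycopg2.connect("dbname=rag user=postgres")  # hypothetical connection string
cur = conn.cursor()

# One-time setup: enable pgvector and create a table for 384-dimensional embeddings
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
cur.execute("CREATE TABLE IF NOT EXISTS prompts (id serial PRIMARY KEY, text text, embedding vector(384))")

# Store each sentence together with its embedding, passed as a pgvector text literal
for sentence, embedding in zip(sentences, embedding_tensor):
    vector_literal = "[" + ",".join(str(x) for x in embedding.tolist()) + "]"
    cur.execute("INSERT INTO prompts (text, embedding) VALUES (%s, %s::vector)", (sentence, vector_literal))
conn.commit()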