Build production RAG pipelines with BGE-M3 embeddings on VoltageGPU. Semantic search, document retrieval, and knowledge bases at scale.
Retrieval-Augmented Generation (RAG) combines the knowledge of your documents with the reasoning power of LLMs. VoltageGPU provides both the embedding models to vectorize your data and the LLM inference to generate answers. Run BGE-M3 and other embedding models via API, store vectors in your preferred database, and query them alongside DeepSeek, Llama, or Qwen for accurate, grounded responses.
BGE-M3 on GPU generates 1,000+ embeddings per second. Vectorize millions of documents in minutes.
Drop-in replacement for OpenAI embeddings API. Change one line of code to switch providers.
BGE-M3 supports 100+ languages natively. Build multilingual search and RAG without separate models.
Embeddings + LLM inference on the same platform. No data transfer costs between providers.
Embedding API at $0.005 per 1M tokens vs $0.13 on OpenAI. 26x cheaper for the same quality.
Your documents never leave VoltageGPU infrastructure. No data used for model training.
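The 26x figure follows directly from the per-token prices quoted above; a quick sanity check, with an illustrative 10M-token corpus:

```python
# Per-1M-token embedding prices as quoted above
voltage_per_1m = 0.005  # $ on VoltageGPU
openai_per_1m = 0.13    # $ on OpenAI

ratio = openai_per_1m / voltage_per_1m
print(f"OpenAI costs {ratio:.0f}x more per token")  # 26x

# What embedding a 10M-token corpus would cost at each rate
corpus_tokens = 10_000_000
voltage_cost = voltage_per_1m * corpus_tokens / 1_000_000
openai_cost = openai_per_1m * corpus_tokens / 1_000_000
print(f"VoltageGPU: ${voltage_cost:.2f}  vs  OpenAI: ${openai_cost:.2f}")
```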
from openai import OpenAI

# Initialize the VoltageGPU client (OpenAI-compatible API)
client = OpenAI(
    base_url="https://api.voltagegpu.com/v1",
    api_key="YOUR_VOLTAGE_API_KEY",
)

# Step 1: Generate embeddings for your documents
documents = [
    "VoltageGPU offers H100 GPUs at $2.49 per hour.",
    "Fine-tuning with LoRA reduces VRAM requirements by 10x.",
    "RAG pipelines combine retrieval with LLM generation.",
    "BGE-M3 supports multilingual embeddings in 100+ languages.",
]
embeddings_response = client.embeddings.create(
    model="BAAI/bge-m3",
    input=documents,
)
vectors = [e.embedding for e in embeddings_response.data]
print(f"Generated {len(vectors)} embeddings of dim {len(vectors[0])}")

# Step 2: Embed the query (then search your vector DB for similar documents)
query = "How much does an H100 cost?"
query_embedding = client.embeddings.create(
    model="BAAI/bge-m3",
    input=[query],
).data[0].embedding

# Step 3: Use retrieved context with an LLM
retrieved_docs = ["VoltageGPU offers H100 GPUs at $2.49 per hour."]
context = "\n".join(retrieved_docs)
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3-0324",
    messages=[
        {"role": "system", "content": f"Answer based on context:\n{context}"},
        {"role": "user", "content": query},
    ],
)
print(response.choices[0].message.content)

$5 free credit. No credit card required.
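Step 2 above hands retrieval off to your vector database, but for small corpora you can rank in memory. A minimal sketch of the cosine-similarity search that step implies — pure Python, no external DB, with toy 3-dimensional vectors standing in for real BGE-M3 embeddings (which are 1024-dimensional):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot product over norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def top_k(query_vec, doc_vecs, docs, k=2):
    """Return the k documents whose embeddings are most similar to the query."""
    scored = sorted(
        zip(docs, doc_vecs),
        key=lambda pair: cosine_similarity(query_vec, pair[1]),
        reverse=True,
    )
    return [doc for doc, _ in scored[:k]]

# Toy vectors standing in for real embeddings
docs = ["H100 pricing", "LoRA fine-tuning", "RAG overview"]
doc_vecs = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2], [0.0, 0.3, 0.9]]
query_vec = [0.95, 0.05, 0.0]  # a query about H100 cost

retrieved_docs = top_k(query_vec, doc_vecs, docs, k=1)
print(retrieved_docs)  # ['H100 pricing']
```

In production, a vector database (pgvector, Qdrant, Milvus, etc.) performs this same ranking with approximate nearest-neighbor indexes instead of a full scan.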