RAG (Retrieval-Augmented Generation) is a powerful approach for enhancing large language models with external knowledge, and there are many variations on how to make it work better. Traditional RAG systems often force you to choose between keyword-based (lexical) search and vector similarity (semantic) search. But what if you could combine the precision of keyword matching with the contextual understanding of semantic search? When building search, you have two primary tools:
Sparse Vectors (Keyword Search)
- Based on classic algorithms like BM25, which uses term frequency and inverse document frequency (TF-IDF).
- Example: A query for machine learning retrieves only documents containing those exact words.
- Strengths: High precision for literal matches, interpretable, fast on small corpora.
Dense Vectors (Semantic Search)
- Words and phrases are encoded into high-dimensional vectors (e.g., 768 floats) using embedding models.
- Example: A query for machine learning can retrieve a document discussing neural networks because their embeddings are close in vector space.
- Strengths: Captures synonyms, paraphrasing, and conceptual similarity.
Hybrid search merges the precision of keyword search with the semantic understanding of vector search.
Hybrid = α·dense_score + (1-α)·sparse_score

Here α is a user-chosen parameter between 0 and 1 that shifts the weight between vector search and keyword search: α = 1 gives pure vector search, α = 0 gives pure keyword search.
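As a minimal sketch of that blend (min-max normalization puts the two score scales on a comparable footing, the same approach the app code later in this guide uses):

```python
import numpy as np

def min_max(x: np.ndarray) -> np.ndarray:
    """Scale a score array into [0, 1] so dense and sparse scores are comparable."""
    span = x.max() - x.min()
    return np.zeros_like(x) if span < 1e-12 else (x - x.min()) / span

def hybrid_scores(dense: np.ndarray, sparse: np.ndarray, alpha: float) -> np.ndarray:
    """alpha = 1.0 -> pure vector search; alpha = 0.0 -> pure keyword search."""
    return alpha * min_max(dense) + (1.0 - alpha) * min_max(sparse)

# Toy scores for three candidate passages
dense = np.array([0.91, 0.74, 0.60])   # cosine similarities
sparse = np.array([1.2, 7.5, 0.3])     # BM25 scores
print(hybrid_scores(dense, sparse, alpha=0.6))
```

With α = 0.6, a passage that is mediocre on cosine similarity but a strong keyword match can still outrank a purely semantic hit.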
After the hybrid retrieval stage, we can add a reranking service so that the top-k passages sent to the LLM are truly the best. Results with higher hybrid scores are prioritized as candidates for this final ranking step.
In this guide, I'll walk through setting up a high-performance hybrid RAG system that splits the workload between two powerful machines: an Nvidia RTX 5090 for the embedding and reranking models, and an NVIDIA Jetson Thor Developer Kit running OpenAI's massive gpt-oss-120b LLM.
To build a high-performance system, we split the workload between two optimized machines, as shown in the architecture diagram below:
- RTX 5090 Workstation: Handles the computationally intensive embedding and reranking tasks using vLLM inference engine.
- NVIDIA Jetson Thor: Dedicated to running the massive 120B-parameter LLM for text generation, leveraging its MXFP4 support for optimal performance. Its large memory capacity is perfect for running these models efficiently.
The workflow is as follows:
- A user submits a query to the system.
- Chunking, also known as text splitting, breaks texts that are too long to embed into smaller, semantically coherent pieces.
- The RTX 5090 workstation generates both a sparse (keyword) and a dense (vector) representation of the query.
- BM25 builds on the keyword scoring method TF-IDF (Term-Frequency Inverse-Document Frequency)
- A hybrid search is performed in Qdrant (our vector database). The top candidate passages are sent to the reranker model on the 5090 machine.
- The massive gpt-oss-120B LLM on the Nvidia Jetson Thor generates the final, coherent answer, leveraging the high-quality context for a factual and precise response.
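BM25 itself is simple enough to sketch in a few lines. The app later uses the rank_bm25 library, but a pared-down version of one common variant of the scoring formula (with the usual k1 and b defaults) looks roughly like this:

```python
import math
import re
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with a standard BM25 variant."""
    tokenized = [re.findall(r"\w+", d.lower()) for d in docs]
    avgdl = sum(len(t) for t in tokenized) / len(tokenized)  # average doc length
    n = len(docs)
    df = Counter(term for toks in tokenized for term in set(toks))  # document frequency
    scores = []
    for toks in tokenized:
        tf = Counter(toks)  # term frequency within this document
        score = 0.0
        for term in re.findall(r"\w+", query.lower()):
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(toks) / avgdl))
        scores.append(score)
    return scores

docs = [
    "Bitcoin is a decentralized digital currency.",
    "The mitochondrion is the powerhouse of the cell.",
]
print(bm25_scores("what is bitcoin", docs))
```

Rare query terms ("bitcoin") get a high IDF weight, so the first document scores far above the second, which matches only the common word "is".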
Embeddings should be stored in a vector database for later use, especially when building RAG systems. These databases (such as Weaviate, Pinecone, Qdrant, or FAISS) are optimized to store high-dimensional vectors and perform similarity search efficiently.
Qdrant stores and retrieves vector embeddings efficiently, and it expects vector data to be structured in a specific way depending on how your collection is configured.
Pull the Qdrant Docker image:
docker pull qdrant/qdrant

Then, run the service:
docker run -p 6333:6333 -p 6334:6334 \
-v "$(pwd)/qdrant_storage:/qdrant/storage:z" \
qdrant/qdrant

The API is now available at:

http://<workstation-ip>:6333/dashboard

Serve Google's Gemma-300M Embedding Model

We'll use a model with Matryoshka Representation Learning to get adaptive, optimally sized embeddings, serving Google's Gemma-300M embedding model using vLLM's embedding mode.
vllm serve google/embeddinggemma-300m --task embed \
--gpu-memory-utilization 0.3 \
--dtype bfloat16 \
--hf_overrides '{"matryoshka_dimensions":[128,256,512,768]}'

The matryoshka_dimensions override is key here. It allows us to retrieve embeddings of different sizes (128, 256, 512, or 768 dimensions), letting us balance speed and accuracy for different applications without retraining.
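Shrinking a Matryoshka embedding amounts to keeping a prefix of the vector and re-normalizing it to unit length. A small sketch of that operation (the 768-dim random vector here is just a stand-in for a real embedding):

```python
import numpy as np

def truncate_matryoshka(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and re-normalize to unit length."""
    small = vec[:dim]
    return small / np.linalg.norm(small)

# Stand-in for a 768-dim unit-normalized embedding
full = np.random.default_rng(0).normal(size=768)
full /= np.linalg.norm(full)

small = truncate_matryoshka(full, 256)  # one of the served matryoshka_dimensions
print(small.shape)
```

Because Matryoshka training packs the most important information into the leading dimensions, the truncated vector remains usable for similarity search at a fraction of the storage and compute cost.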
While embedding models are a powerful tool for initial retrieval in RAG systems, they can return a large number of documents that are only broadly relevant. This is where reranking models come into play.

Serve the Qwen3 Reranker Model

A reranker acts as a final quality check, reassessing the top passages from the hybrid search for maximal relevance to the query. Reranking models introduce a two-step retrieval process that significantly improves precision: based on this deeper analysis, the reranker reorders the documents, placing the most relevant ones at the top.
Then run the reranker model on the 5090 machine:
vllm serve Qwen/Qwen3-Reranker-4B \
--runner pooling \
--hf-overrides '{"architectures":["Qwen3ForSequenceClassification"],"classifier_from_token":["no","yes"],"is_original_qwen3_reranker":true}' \
--host 0.0.0.0 --port 9000 \
--gpu-memory-utilization 0.6

Test the Reranker
curl -s http://127.0.0.1:9000/rerank \
-H "Content-Type: application/json" \
-d '{
"instruction": "Given a web search query, retrieve relevant passages that answer the query",
"query": "What is Bitcoin?",
"documents": [
"Bitcoin is a decentralized digital currency…",
"The mitochondrion is the powerhouse of the cell…",
"A banana is a curved yellow fruit…"
]
}' | jq

Expected output:

{
"id": "rerank-c85261aa817b4959bd46bc926279ba38",
"model": "Qwen/Qwen3-Reranker-4B",
"usage": {
"total_tokens": 38
},
"results": [
{
"index": 0,
"document": {
"text": "Bitcoin is a decentralized digital currency…",
"multi_modal": null
},
"relevance_score": 0.276665598154068
},
{
"index": 2,
"document": {
"text": "A banana is a curved yellow fruit…",
"multi_modal": null
},
"relevance_score": 0.0862158015370369
},
{
"index": 1,
"document": {
"text": "The mitochondrion is the powerhouse of the cell…",
"multi_modal": null
},
"relevance_score": 0.08217901736497879
}
]
}

You should see JSON results where the "Bitcoin" passage has the highest relevance_score (e.g., 0.276...).
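Note that the results array comes back sorted by relevance_score, not by input order. To reuse the scores downstream you realign them by index, roughly as the app's rerank_with_api helper does later in this guide:

```python
def scores_from_rerank_response(js: dict, num_docs: int) -> list[float]:
    """Realign /rerank results (sorted by relevance) into input-document order."""
    scores = [0.0] * num_docs
    for item in js.get("results", []):
        idx = int(item.get("index", -1))
        if 0 <= idx < num_docs:
            scores[idx] = float(item.get("relevance_score", 0.0))
    return scores

# Shape mirrors the expected output above
sample = {"results": [
    {"index": 0, "relevance_score": 0.277},
    {"index": 2, "relevance_score": 0.086},
    {"index": 1, "relevance_score": 0.082},
]}
print(scores_from_rerank_response(sample, 3))  # [0.277, 0.082, 0.086]
```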
Here is the VRAM usage after loading the embedding and reranker models:
Now, let's get the large language model running on the edge. The Nvidia Jetson Thor Developer Kit is built for edge AI. With MXFP4 quantization and its large memory capacity, it can run OpenAI's gpt-oss-120b model efficiently.
Launch the vLLM Docker container from NGC (25.09-py3):
sudo docker run --rm -it \
--network host \
--shm-size=16g \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
--runtime=nvidia \
--name vllm \
-v /home/$USER/vllm_models:/root/.cache/huggingface \
nvcr.io/nvidia/vllm:25.09-py3

The openai/gpt-oss-120b model requires specific tokenizer encodings.
mkdir -p tiktoken_encodings
wget -O tiktoken_encodings/o200k_base.tiktoken "https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken"
wget -O tiktoken_encodings/cl100k_base.tiktoken "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"
export TIKTOKEN_ENCODINGS_BASE=${PWD}/tiktoken_encodings

Serve it with the vLLM server:
export VLLM_USE_FLASHINFER_MXFP4_MOE=1
vllm serve "openai/gpt-oss-120b" \
--async-scheduling \
--port 8000 \
--host 0.0.0.0 \
--trust-remote-code \
--swap-space 16 \
--max-model-len 32000 \
--tensor-parallel-size 1 \
--max-num-seqs 1024 \
--gpu-memory-utilization 0.85 \
--tool-call-parser openai \
--enable-auto-tool-choice

This model now serves OpenAI-compatible completions with massive context and MoE efficiency.
If you encounter memory pressure, you can clear the system cache:
sudo sysctl -w vm.drop_caches=3

The Gradio Application: Bringing It All Together

The provided Python script creates a user-friendly web interface that orchestrates this entire complex pipeline. Here's what it does under the hood:
- Indexing Pipeline: Loads PDFs, splits them into chunks, generates embeddings (on the 5090 machine), and inserts them into Qdrant. It also builds an in-memory BM25 index for keyword search.
- Dense-only: Top results based purely on vector similarity.
- Sparse-only: Top results based purely on BM25 keyword scores.
- Combined: The hybrid results from α·dense_norm + (1-α)·sparse_norm.
- Reranker-only: The top reranker_top_k candidates from the Combined stage, re-sorted by the reranker model.
- Fused: A final fusion of the reranker and combined scores: β·rerank_norm + (1-β)·combined_norm.

The content of the gradio_rag_app.py file:
import os
import re
import json
import textwrap
from typing import List, Tuple, Dict, Any, Optional
import gradio as gr
import numpy as np
import requests
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.documents import Document
from rank_bm25 import BM25Okapi
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct, VectorParams, Distance
# -------------------------------
# Defaults (editable in UI)
# -------------------------------
DEFAULT_EMBED_API_BASE = "http://IP_ADDRESS:8000/v1"
DEFAULT_LLM_API_BASE = "http://IP_ADDRESS:8000/v1"
DEFAULT_EMBED_MODEL = "google/embeddinggemma-300m"
DEFAULT_LLM_MODEL = "openai/gpt-oss-120b"
DEFAULT_QDRANT_HOST = "localhost"
DEFAULT_QDRANT_PORT = 6333
DEFAULT_COLLECTION = "white_paper_docs"
DEFAULT_CHUNK_SIZE = 600
DEFAULT_CHUNK_OVERLAP = 150
DEFAULT_SYSTEM_MESSAGE = (
"You are an expert research assistant specializing in summarization and "
"retrieval-augmented reasoning. You always answer briefly and only from the given context. "
"If the answer is not explicitly supported by the documents, respond with 'I don't know.'"
)
DEFAULT_USE_API_RERANKER = True
DEFAULT_RERANKER_API_URL = "http://IP_ADDRESS:9000/rerank"
RERANK_INSTRUCTION = "Given a web search query, retrieve relevant passages that answer the query"
DEFAULT_RERANK_WEIGHT_BETA = 0.5
def _embed_with_vllm(
texts: List[str],
embed_api_base: str,
embed_model: str,
timeout_s: int = 300,
) -> np.ndarray:
"""Get embeddings via vLLM-compatible /embeddings API."""
if not texts:
return np.zeros((0, 0), dtype=float)
payload = {"model": embed_model, "input": texts}
resp = requests.post(f"{embed_api_base}/embeddings", json=payload, timeout=timeout_s)
resp.raise_for_status()
data = resp.json()["data"]
embeddings = [item["embedding"] for item in data]
return np.array(embeddings, dtype=float)
def generate_with_llm_chat(
prompt: str,
llm_api_base: str,
llm_model: str,
system_message: str = DEFAULT_SYSTEM_MESSAGE,
max_tokens: int = 100,
temperature: float = 0.0,
timeout_s: int = 300,
) -> str:
url = f"{llm_api_base}/chat/completions"
payload = {
"model": llm_model,
"messages": [
{"role": "system", "content": system_message},
{"role": "user", "content": prompt},
],
"max_tokens": max_tokens,
"temperature": temperature,
"stop": ["\n\nQuestion:", "\n\nAnswer:", "\n\nThe following", "\n\nIt looks like"],
"frequency_penalty": 0.6,
"presence_penalty": 0.0,
}
try:
resp = requests.post(url, json=payload, timeout=timeout_s)
resp.raise_for_status()
result = resp.json()
choices = result.get("choices", [])
if not choices:
return "No response generated by the model."
first_choice = choices[0]
message = first_choice.get("message", None)
if message is None:
content_val = first_choice.get("text", "")
else:
content_val = message.get("content")
if content_val is None:
content_val = message.get("reasoning_content", "")
content = str(content_val or "").strip()
if not content:
return "Model returned empty response."
return content
except requests.RequestException as e:
return f"API request failed: {e}"
except KeyError as e:
return f"Unexpected API response format: missing key {e}"
except Exception as e:
return f"Unexpected error during generation: {e}"
def _minmax(scores: np.ndarray) -> np.ndarray:
if scores.size == 0:
return scores
x_min, x_max = float(scores.min()), float(scores.max())
if x_max - x_min < 1e-12:
return np.zeros_like(scores)
return (scores - x_min) / (x_max - x_min)
# -------------------------------
# Reranker (API only)
# -------------------------------
def rerank_with_api(
query: str,
docs: List[str],
api_url: str,
instruction: Optional[str] = RERANK_INSTRUCTION,
timeout_s: int = 120,
) -> List[float]:
"""
Supports three response shapes:
- {"scores":[...]}
- {"data":[{"index":i,"score":...}, ...]}
- {"results":[{"index":i,"relevance_score":...}, ...]}
Returns a float score per input doc, aligned by index.
"""
if not docs:
return []
try:
payload = {
"instruction": instruction,
"query": query,
"documents": docs
}
resp = requests.post(api_url, json=payload, timeout=timeout_s)
resp.raise_for_status()
js = resp.json()
if isinstance(js.get("scores"), list):
scores = js["scores"]
if len(scores) != len(docs):
raise ValueError(f"len(scores) != len(docs): {len(scores)} vs {len(docs)}")
return [float(s) for s in scores]
if isinstance(js.get("data"), list):
data = js["data"]
if len(data) != len(docs):
raise ValueError(f"len(data) != len(docs): {len(data)} vs {len(docs)}")
scores = [0.0] * len(docs)
for item in data:
idx = int(item.get("index", -1))
if 0 <= idx < len(scores):
scores[idx] = float(item.get("score", 0.0))
return scores
if isinstance(js.get("results"), list):
results = js["results"]
scores = [0.0] * len(docs)
for item in results:
idx = int(item.get("index", -1))
if 0 <= idx < len(scores):
scores[idx] = float(item.get("relevance_score", 0.0))
return scores
raise ValueError(f"Unsupported reranker response shape: {js}")
except Exception as e:
print(f"[RERANK API] Failed: {e}")
return []
# -------------------------------
# Indexing pipeline
# -------------------------------
def index_pdfs(
filepaths: List[str],
collection_name: str,
qdrant_host: str,
qdrant_port: int,
embed_api_base: str,
embed_model: str,
chunk_size: int,
chunk_overlap: int,
reset_collection: bool,
prior_state: Dict[str, Any],
) -> Tuple[str, Dict[str, Any]]:
"""
Load PDFs -> split -> embed -> (re)create Qdrant collection -> upsert -> (re)build BM25 in memory.
Returns (status_message, state_dict)
"""
splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
# Initialize or reset local state
if reset_collection or not prior_state:
state = {
"documents": [],
"texts": [],
"bm25": None,
"id_to_idx": {},
"next_id": 0,
"qdrant_cfg": {"host": qdrant_host, "port": qdrant_port, "collection": collection_name},
"embed_cfg": {"api_base": embed_api_base, "model": embed_model},
}
else:
state = dict(prior_state)
state["qdrant_cfg"] = {"host": qdrant_host, "port": qdrant_port, "collection": collection_name}
state["embed_cfg"] = {"api_base": embed_api_base, "model": embed_model}
# Load all PDFs
new_documents: List[Document] = []
for path in (filepaths or []):
if not path or not str(path).lower().endswith(".pdf"):
continue
loader = PyPDFLoader(path)
for page in loader.load():
for chunk in splitter.split_text(page.page_content):
new_documents.append(Document(page_content=chunk, metadata=page.metadata))
if not new_documents:
return "No documents found to index. Make sure you uploaded PDFs.", state
new_texts = [d.page_content for d in new_documents]
# Embeddings for the new batch
embeddings = _embed_with_vllm(new_texts, embed_api_base=embed_api_base, embed_model=embed_model)
if embeddings.ndim != 2:
return "[ERROR] Embedding matrix must be 2D (N, D).", state
# Qdrant setup
qdrant = QdrantClient(host=qdrant_host, port=qdrant_port, timeout=30.0)
if reset_collection:
qdrant.recreate_collection(
collection_name=collection_name,
vectors_config=VectorParams(size=embeddings.shape[1], distance=Distance.COSINE),
)
state["documents"].clear()
state["texts"].clear()
state["id_to_idx"].clear()
state["next_id"] = 0
else:
try:
qdrant.get_collection(collection_name=collection_name)
except Exception:
qdrant.recreate_collection(
collection_name=collection_name,
vectors_config=VectorParams(size=embeddings.shape[1], distance=Distance.COSINE),
)
state["documents"].clear()
state["texts"].clear()
state["id_to_idx"].clear()
state["next_id"] = 0
# Assign unique IDs and upsert
start_id = int(state.get("next_id", 0))
points = [
PointStruct(
id=start_id + i,
vector=embeddings[i].tolist(),
payload={"text": new_documents[i].page_content, "metadata": new_documents[i].metadata},
)
for i in range(len(new_documents))
]
qdrant.upsert(collection_name=collection_name, points=points)
# Update local state
for i, doc in enumerate(new_documents):
qid = start_id + i
state["id_to_idx"][qid] = len(state["documents"])
state["documents"].append(doc)
state["texts"].append(doc.page_content)
state["next_id"] = start_id + len(new_documents)
# Rebuild BM25 over the entire session corpus
tokenized_docs = [re.findall(r"\w+", t.lower()) for t in state["texts"]]
state["bm25"] = BM25Okapi(tokenized_docs)
msg = f"Indexed {len(new_documents)} chunks from {len(set(filepaths or []))} file(s) into '{collection_name}'. Total chunks in session: {len(state['documents'])}."
return msg, state
# -------------------------------
# Hybrid search producing FIVE leaderboards
# + separate reranker_top_k
# -------------------------------
def hybrid_search(
query: str,
top_k: int,
alpha: float,
state: Dict[str, Any],
use_api_reranker: bool = True,
reranker_api_url: Optional[str] = None,
rerank_weight_beta: float = 0.5, # only used for fused scores
reranker_top_k: int = 10, # NEW: how many from Combined go to reranker
) -> Tuple[
List[Dict[str, Any]], # hits_dense
List[Dict[str, Any]], # hits_sparse
List[Dict[str, Any]], # hits_combined
List[Dict[str, Any]], # hits_reranker
List[Dict[str, Any]], # hits_fused
str # dbg_json
]:
"""
Produce five leaderboards:
- Dense-only (Qdrant cosine scores) → size top_k
- Sparse-only (BM25 over ALL docs) → size top_k
- Combined (α*dense_norm + (1-α)*sparse_norm) over DENSE CANDIDATES → size top_k
- Reranker-only (API reranker over top reranker_top_k candidates from Combined) → size reranker_top_k
- Fused (β*norm(reranker) + (1-β)*norm(Combined)) on same reranker set → size reranker_top_k
"""
if not state or "documents" not in state or "bm25" not in state or state["bm25"] is None:
empty = []
return empty, empty, empty, empty, empty, json.dumps({"error": "Index not ready. Please index PDFs first."}, indent=2)
documents: List[Document] = state["documents"]
texts: List[str] = state["texts"]
bm25: BM25Okapi = state["bm25"]
id_to_idx: Dict[int, int] = state["id_to_idx"]
embed_cfg = state["embed_cfg"]
qdrant_cfg = state["qdrant_cfg"]
_dbg = {
"query": query,
"top_k": top_k,
"alpha": alpha,
"reranker_enabled": bool(use_api_reranker),
"reranker_api_url": (reranker_api_url or "").strip(),
"reranker_top_k": reranker_top_k,
"notes": []
}
# ---------- Dense embedding & Qdrant dense search ----------
query_vec = _embed_with_vllm([query], embed_api_base=embed_cfg["api_base"], embed_model=embed_cfg["model"])[0]
qdrant = QdrantClient(host=qdrant_cfg["host"], port=qdrant_cfg["port"], timeout=30.0)
search_res = qdrant.search(
collection_name=qdrant_cfg["collection"],
query_vector=query_vec.tolist(),
limit=max(top_k * 3, top_k),
)
dense_ids_all: List[int] = []
dense_scores_all: List[float] = []
for p in search_res:
try:
did = int(p.id)
except Exception:
continue
if did in id_to_idx:
dense_ids_all.append(did)
dense_scores_all.append(float(p.score))
if not dense_ids_all:
empty = []
dbg = {
"query": query,
"note": "No overlapping docs between Qdrant results and session state."
}
return empty, empty, empty, empty, empty, json.dumps(dbg, indent=2)
dense_scores = np.array(dense_scores_all, dtype=float)
tokenized_query = re.findall(r"\w+", query.lower())
all_sparse_scores = bm25.get_scores(tokenized_query)
# Dense-only leaderboard from dense results (size top_k)
dense_pairs = list(zip(dense_ids_all, dense_scores_all))
dense_sorted = sorted(dense_pairs, key=lambda t: t[1], reverse=True)[:top_k]
def _mk_hit(doc_id: int, *, dense_score=None, sparse_score=None, combined=None, rerank_score=None, fused_score=None):
idx = id_to_idx[doc_id]
text = texts[idx]
return {
"doc_id": doc_id,
"dense_score": float(dense_score) if dense_score is not None else None,
"sparse_score": float(sparse_score) if sparse_score is not None else None,
"combined": float(combined) if combined is not None else None,
**({"rerank_score": float(rerank_score)} if rerank_score is not None else {}),
**({"fused_score": float(fused_score)} if fused_score is not None else {}),
"preview": text[:600] + ("..." if len(text) > 600 else ""),
"metadata": documents[idx].metadata,
}
# Hits: Dense
hits_dense = []
for r, (did, dscore) in enumerate(dense_sorted, start=1):
hits_dense.append({**_mk_hit(did, dense_score=dscore), "rank": r})
# Sparse-only leaderboard (size top_k) over ALL docs by BM25
reverse_idx_to_id = {v: k for k, v in id_to_idx.items()}
sparse_idx_sorted = np.argsort(all_sparse_scores)[::-1][:top_k]
hits_sparse = []
for r, sidx in enumerate(sparse_idx_sorted, start=1):
did = reverse_idx_to_id[sidx]
sscore = float(all_sparse_scores[sidx])
hits_sparse.append({**_mk_hit(did, sparse_score=sscore), "rank": r})
# Align sparse scores for dense candidates
sparse_scores = np.array([all_sparse_scores[id_to_idx[did]] for did in dense_ids_all], dtype=float)
dense_norm = _minmax(dense_scores)
sparse_norm = _minmax(sparse_scores)
combined = alpha * dense_norm + (1.0 - alpha) * sparse_norm
# Sort by combined and slice to top_k
ord_combined = np.argsort(combined)[::-1][:top_k]
hits_combined = []
for rank_pos, pos in enumerate(ord_combined, start=1):
did = dense_ids_all[pos]
dscore = dense_scores[pos]
sscore = sparse_scores[pos]
cscore = combined[pos]
hits_combined.append({**_mk_hit(did, dense_score=dscore, sparse_score=sscore, combined=cscore), "rank": rank_pos})
_dbg["candidates_from_dense"] = len(dense_ids_all)
_dbg["candidates_for_combined"] = len(hits_combined)
reranker_candidates = hits_combined[:max(0, min(reranker_top_k, len(hits_combined)))]
candidate_ids = [c["doc_id"] for c in reranker_candidates]
candidate_texts = [texts[id_to_idx[did]] for did in candidate_ids] # full text for reranker
candidate_combined_scores = [float(c["combined"]) for c in reranker_candidates]
_dbg["candidates_for_reranker"] = len(candidate_texts)
reranker_used = False
rerank_scores: List[float] = []
if use_api_reranker and (reranker_api_url or "").strip() and len(candidate_texts) > 0:
scores = rerank_with_api(
query=query,
docs=candidate_texts,
api_url=(reranker_api_url or "").strip(),
instruction=RERANK_INSTRUCTION,
)
_dbg["rerank_call_ok"] = bool(scores) and (len(scores) == len(candidate_texts))
_dbg["rerank_scores_len"] = len(scores) if scores else 0
if scores and len(scores) == len(candidate_texts):
reranker_used = True
rerank_scores = [float(s) for s in scores]
_dbg["sample_rerank_scores_first5"] = [float(s) for s in rerank_scores[:5]]
else:
_dbg["notes"].append("Reranker returned no/partial scores; skipping reranker boards.")
hits_reranker = []
if reranker_used:
order_rr = np.argsort(np.array(rerank_scores))[::-1]
for r, i in enumerate(order_rr[:len(candidate_ids)], start=1):
did = candidate_ids[i]
hits_reranker.append({
**_mk_hit(did,
combined=candidate_combined_scores[i],
rerank_score=rerank_scores[i]),
"rank": r
})
hits_fused = []
if reranker_used:
comb_arr = np.array(candidate_combined_scores, dtype=float)
rr_arr = np.array(rerank_scores, dtype=float)
comb_n = _minmax(comb_arr)
rr_n = _minmax(rr_arr)
fused = rerank_weight_beta * rr_n + (1.0 - rerank_weight_beta) * comb_n
order_fused = np.argsort(fused)[::-1]
for r, i in enumerate(order_fused[:len(candidate_ids)], start=1):
did = candidate_ids[i]
hits_fused.append({
**_mk_hit(did,
combined=candidate_combined_scores[i],
rerank_score=rerank_scores[i],
fused_score=float(fused[i])),
"rank": r
})
dbg = {
**_dbg,
"reranker_used": reranker_used,
"rerank_weight_beta": rerank_weight_beta,
"counts": {
"hits_dense": len(hits_dense),
"hits_sparse": len(hits_sparse),
"hits_combined": len(hits_combined),
"hits_reranker": len(hits_reranker),
"hits_fused": len(hits_fused),
}
}
return (
hits_dense,
hits_sparse,
hits_combined,
hits_reranker,
hits_fused,
json.dumps(dbg, indent=2),
)
def answer_with_context_ui(
query: str,
hits_json: str,
llm_api_base: str,
llm_model: str,
system_message: str,
) -> str:
try:
hits = json.loads(hits_json)
if isinstance(hits, str):
hits = json.loads(hits)
except Exception:
return "Failed to parse hits JSON. Run search first."
context_text = "\n\n".join(h["preview"] for h in hits[:6])
user_prompt = textwrap.dedent(f"""
Context:
{context_text}
Question: {query}
Provide a single, concise, factual sentence based only on the above context.
If not in the context, say: I don't know.
""").strip()
try:
raw_answer = generate_with_llm_chat(
prompt=user_prompt,
llm_api_base=llm_api_base,
llm_model=llm_model,
system_message=system_message,
max_tokens=500,
temperature=0.0,
)
return raw_answer
except Exception as e:
return f"Answering failed: {e}"
# -------------------------------
# Gradio UI
# -------------------------------
with gr.Blocks(title="RAG (Qdrant + BM25 + API Reranker) — Gradio") as demo:
gr.Markdown("## RAG App — Hybrid Search with **five leaderboards** (Dense, Sparse, Combined, Reranker, Fused) + separate Reranker Top-K")
with gr.Accordion("Server & Model Settings", open=False):
with gr.Row():
embed_api_base = gr.Textbox(label="Embedding API Base", value=DEFAULT_EMBED_API_BASE)
embed_model = gr.Textbox(label="Embedding Model", value=DEFAULT_EMBED_MODEL)
with gr.Row():
llm_api_base = gr.Textbox(label="LLM API Base", value=DEFAULT_LLM_API_BASE)
llm_model = gr.Textbox(label="LLM Model", value=DEFAULT_LLM_MODEL)
with gr.Row():
qdrant_host = gr.Textbox(label="Qdrant Host", value=DEFAULT_QDRANT_HOST)
qdrant_port = gr.Number(label="Qdrant Port", value=DEFAULT_QDRANT_PORT, precision=0)
collection_name = gr.Textbox(label="Qdrant Collection", value=DEFAULT_COLLECTION)
system_message = gr.Textbox(
label="System Message",
value=DEFAULT_SYSTEM_MESSAGE,
lines=3
)
gr.Markdown("---")
gr.Markdown("### Reranker (API Only)")
with gr.Row():
use_api_reranker = gr.Checkbox(label="Enable API Reranker", value=DEFAULT_USE_API_RERANKER)
reranker_api_url = gr.Textbox(label="Reranker API URL", value=DEFAULT_RERANKER_API_URL)
with gr.Row():
rerank_weight_beta = gr.Slider(
label="Fused β (weight on reranker in fused score)",
minimum=0.0, maximum=1.0, step=0.05, value=DEFAULT_RERANK_WEIGHT_BETA
)
with gr.Accordion("Index PDFs", open=True):
pdfs = gr.File(
label="Upload one or more PDFs",
file_types=[".pdf"],
file_count="multiple",
type="filepath"
)
with gr.Row():
chunk_size = gr.Number(label="Chunk size", value=DEFAULT_CHUNK_SIZE, precision=0)
chunk_overlap = gr.Number(label="Chunk overlap", value=DEFAULT_CHUNK_OVERLAP, precision=0)
reset_collection = gr.Checkbox(label="Reset (recreate) collection", value=True)
index_btn = gr.Button("Index PDFs")
index_status = gr.Markdown("")
rag_state = gr.State({})
def _on_index(paths, coll, host, port, ebase, emodel, csize, coverlap, reset, st):
try:
msg, new_state = index_pdfs(
paths, coll, host, int(port), ebase, emodel, int(csize), int(coverlap), bool(reset), st or {}
)
return msg, new_state
except Exception as e:
return f"Indexing failed: `{e}`", st or {}
index_btn.click(
_on_index,
inputs=[pdfs, collection_name, qdrant_host, qdrant_port, embed_api_base, embed_model, chunk_size, chunk_overlap, reset_collection, rag_state],
outputs=[index_status, rag_state]
)
gr.Markdown("---")
with gr.Accordion("Ask Questions", open=True):
query = gr.Textbox(label="Your question", placeholder="Ask something about the indexed PDFs…")
with gr.Row():
top_k = gr.Slider(label="Top-K (Initial Retrieval)", minimum=1, maximum=50, step=1, value=20)
reranker_top_k = gr.Slider(label="Top-K for Reranker", minimum=1, maximum=20, step=1, value=10)
with gr.Row():
alpha = gr.Slider(label="Hybrid α (Dense weight)", minimum=0.0, maximum=1.0, step=0.05, value=0.6)
search_btn = gr.Button("Search")
with gr.Tabs():
with gr.Tab("Dense"):
hits_dense_json = gr.JSON(label="Top Hits — Dense")
with gr.Tab("Sparse"):
hits_sparse_json = gr.JSON(label="Top Hits — Sparse")
with gr.Tab("Combined"):
hits_combined_json = gr.JSON(label="Top Hits — Combined (Hybrid)")
with gr.Tab("Reranker"):
hits_reranker_json = gr.JSON(label="Top Hits — Reranker Only (size = Reranker Top-K)")
with gr.Tab("Fused"):
hits_fused_json = gr.JSON(label="Top Hits — Fused (Combined + Reranker) (size = Reranker Top-K)")
dbg_json = gr.Code(label="Search Debug", language="json")
hits_for_answer = gr.State("[]")
def _on_search(q, k, rr_k, a, st, use_rr, rr_url, beta):
try:
d, s, c, r, f, dbg = hybrid_search(
q, int(k), float(a), st or {},
use_api_reranker=bool(use_rr),
reranker_api_url=(rr_url or "").strip(),
rerank_weight_beta=float(beta),
reranker_top_k=int(rr_k),
)
answer_src = c if c else d
return d, s, c, r, f, json.dumps(answer_src, indent=2), dbg
except Exception as e:
err = {"error": str(e)}
return [], [], [], [], [], json.dumps([], indent=2), json.dumps(err, indent=2)
search_btn.click(
_on_search,
inputs=[query, top_k, reranker_top_k, alpha, rag_state, use_api_reranker, reranker_api_url, rerank_weight_beta],
outputs=[hits_dense_json, hits_sparse_json, hits_combined_json, hits_reranker_json, hits_fused_json, hits_for_answer, dbg_json]
)
gr.Markdown("### Generate Final Answer")
answer_btn = gr.Button("Answer with Context")
final_answer = gr.Textbox(label="Final Answer", lines=3)
def _on_answer(q, hits_json, llm_base, llm_model_name, sys_msg):
try:
return answer_with_context_ui(q, hits_json, llm_base, llm_model_name, sys_msg)
except Exception as e:
return f"Answering failed: {e}"
answer_btn.click(
_on_answer,
inputs=[query, hits_for_answer, llm_api_base, llm_model, system_message],
outputs=[final_answer]
)
gr.Markdown(
"""
**Notes**
- Reranker settings live in **Server & Model Settings → Reranker (API Only)**.
- Two K's:
- **Top-K (Initial Retrieval)** controls Dense/Sparse/Combined board sizes.
- **Top-K for Reranker** controls how many Combined items are sent to the reranker and shown in Reranker/Fused.
- Leaderboards:
- **Dense**: Qdrant cosine similarity only (size = Top-K).
- **Sparse**: BM25 over the whole in-memory corpus (size = Top-K).
- **Combined**: α·norm(dense) + (1-α)·norm(sparse) on dense candidates (size = Top-K).
- **Reranker**: API reranker ordering of the top reranker_top_k Combined (size = Reranker Top-K).
- **Fused**: β·norm(rerank) + (1-β)·norm(Combined) on the same reranker set (size = Reranker Top-K).
- If reranking fails or is disabled, Reranker and Fused tabs may be empty.
"""
)
if __name__ == "__main__":
DEFAULT_PORT = int(os.environ.get("GRADIO_SERVER_PORT", "7860"))
demo.launch(server_name="0.0.0.0", server_port=DEFAULT_PORT)

The RAG App in Action

In practice, this system delivers robust, high-quality answers. The interface allows for deep inspection of the retrieval process, so you can see exactly why a particular passage was chosen.
The indexing interface for processing PDFs.
Inspecting the different search leaderboards to see why certain passages were retrieved.
Generating the final answer using the refined context. By seeing all five rankings simultaneously, you can visually determine the optimal retrieval strategy for your data.
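For intuition on how the Fused leaderboard can reorder results, here are toy numbers where the reranker disagrees with the hybrid score (min-max normalization as in the app, β = 0.5):

```python
import numpy as np

def minmax(x: np.ndarray) -> np.ndarray:
    """Scale scores into [0, 1]; constant arrays map to zeros."""
    span = x.max() - x.min()
    return np.zeros_like(x) if span < 1e-12 else (x - x.min()) / span

combined = np.array([0.90, 0.55, 0.20])  # hybrid scores: candidate 0 leads
rerank = np.array([0.08, 0.28, 0.05])    # reranker prefers candidate 1

beta = 0.5
fused = beta * minmax(rerank) + (1 - beta) * minmax(combined)
print(int(fused.argmax()))  # 1 -> the reranker's pick wins after fusion
```

Sliding β toward 1 trusts the reranker more; sliding it toward 0 falls back to the pure hybrid ordering.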
A Demo Video

Performance & Inference Speed

From the vLLM logs on the Nvidia Jetson Thor:
Avg prompt throughput: 69.7 tokens/s
Avg generation throughput: 16.8 tokens/s
GPU KV cache usage: <0.2%
Prefix cache hit rate: ~46%

This means fast, responsive answers even with a 120B model, thanks to MXFP4 quantization and FlashInfer optimizations.
By splitting workloads intelligently:
- Nvidia RTX 5090 handles heavy embedding and reranking models.
- Nvidia Jetson Thor runs the massive LLM efficiently with MXFP4.
This setup is ideal for edge RAG applications, private deployments, or cost-sensitive inference where cloud APIs aren’t an option.
Happy coding!