Deep Dive: Data Processing & RAG Indexing
Part A: Business Metrics (prepare_business_metrics.py)
Goal
Rank restaurants by a score that blends a Bayesian weighted rating with a log-scaled popularity term.
Inputs / Outputs
- Input: `data/processed/businesses_clean.csv`
- Output: `data/processed/businesses_ranked.csv`
Steps
- Coerce `rating` and `review_count` to numeric and clamp them to valid ranges
- Compute global mean rating `C`
- Set popularity threshold `m` = 60th percentile of `review_count`
- Bayesian weighted rating: `(v/(v+m))*R + (m/(v+m))*C`, where `R` is the business's `rating` and `v` is its `review_count`
- Popularity = `log1p(review_count)`
- Rank score = `bayes_score * (1 + 0.15 * popularity)`
- Sort by `rank_score`, then `bayes_score`, `review_count`, `rating`
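The steps above can be sketched in pandas as follows. This is a minimal illustration, not the script itself: the function name `rank_businesses`, the clamp ranges (0 to 5 stars, non-negative counts), the zero defaults for unparseable values, and the descending sort order are all assumptions.

```python
import numpy as np
import pandas as pd


def rank_businesses(df: pd.DataFrame, pop_weight: float = 0.15) -> pd.DataFrame:
    """Add bayes_score, popularity, and rank_score columns, then sort."""
    out = df.copy()

    # Coerce to numeric; unparseable values become NaN and fall back to 0.
    # The defaults and clamp ranges here are assumptions.
    out["rating"] = (
        pd.to_numeric(out["rating"], errors="coerce").fillna(0.0).clip(0.0, 5.0)
    )
    out["review_count"] = (
        pd.to_numeric(out["review_count"], errors="coerce").fillna(0).clip(lower=0)
    )

    C = out["rating"].mean()                # global mean rating
    m = out["review_count"].quantile(0.60)  # popularity threshold

    v = out["review_count"]
    R = out["rating"]
    denom = (v + m).replace(0, np.nan)      # guard against 0/0 when v == m == 0
    out["bayes_score"] = ((v / denom) * R + (m / denom) * C).fillna(C)

    out["popularity"] = np.log1p(v)
    out["rank_score"] = out["bayes_score"] * (1 + pop_weight * out["popularity"])

    # Descending order on every sort key is an assumption.
    return out.sort_values(
        ["rank_score", "bayes_score", "review_count", "rating"],
        ascending=False,
    ).reset_index(drop=True)


# Tiny made-up sample; the string "5.0" exercises the numeric coercion.
sample = pd.DataFrame(
    {
        "name": ["A", "B", "C"],
        "rating": ["5.0", 4.5, 4.0],
        "review_count": [1, 500, 20],
    }
)
ranked = rank_businesses(sample)
print(ranked[["name", "bayes_score", "rank_score"]])
```

In this sample the 500-review business ranks first and the single-review 5.0 ranks last, because its score is pulled almost entirely toward the global mean.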
Why Bayesian?
The weighted rating pulls each score toward the global mean `C` in proportion to how few reviews back it, so a 5.0 with 1 review doesn't dominate a 4.5 with 500 reviews.
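A quick worked example makes the effect concrete. The values `C = 3.8` and `m = 20` below are illustrative assumptions, not figures from the dataset, and `bayes_score` is a hypothetical helper name.

```python
def bayes_score(R: float, v: int, C: float, m: float) -> float:
    """Weighted rating: shrink R toward the global mean C when v is small."""
    return (v / (v + m)) * R + (m / (v + m)) * C


# Illustrative values only: C and m are assumed, not taken from the data.
C, m = 3.8, 20

newcomer = bayes_score(R=5.0, v=1, C=C, m=m)
established = bayes_score(R=4.5, v=500, C=C, m=m)

print(f"5.0 with 1 review    -> {newcomer:.2f}")
print(f"4.5 with 500 reviews -> {established:.2f}")
```

Under these assumptions the single-review 5.0 scores about 3.86, while the 500-review 4.5 scores about 4.47 and keeps nearly all of its raw rating.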
Part B: RAG Index (build_rag_index.py)
Goal
Build a FAISS index over business cards for fast semantic retrieval in chat.
Inputs / Outputs
- Input: `businesses_ranked.csv` (fallback to `businesses_clean.csv`)
- Outputs:
  - `data/processed/rag/faiss.index`
  - `data/processed/rag/docstore.parquet` (records with text)
  - `data/processed/rag/meta.json` (model metadata)
Document Construction
Each row becomes a textual card with: name, categories, price tier, stars (with review count), full address, and URL.
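One way such a card could be assembled is sketched below. The field names (`name`, `categories`, `price`, `rating`, `review_count`, `address`, `url`), the line layout, and the helper name `build_card` are assumptions for illustration; the actual script may format cards differently.

```python
def build_card(row: dict) -> str:
    """Render one business record as a short text card for embedding."""
    # Missing values fall back to "" (text) or "N/A" (price).
    name = row.get("name") or ""
    categories = row.get("categories") or ""
    price = row.get("price") or "N/A"
    rating = float(row.get("rating") or 0.0)
    reviews = int(row.get("review_count") or 0)
    address = row.get("address") or ""
    url = row.get("url") or ""

    lines = [
        name,
        f"Categories: {categories}",
        f"Price: {price}",
        f"Stars: {rating:.1f} ({reviews} reviews)",
        f"Address: {address}",
        f"URL: {url}",
    ]
    return "\n".join(lines)


# Hypothetical record, used only to show the card layout.
card = build_card(
    {
        "name": "Example Bistro",
        "categories": "French, Wine Bars",
        "price": None,
        "rating": 4.5,
        "review_count": 500,
        "address": "1 Example St, Sampletown",
        "url": "https://example.com/example-bistro",
    }
)
print(card)
```

Keeping every field on its own labeled line gives the embedding model consistent context and makes the stored text readable when it is returned in chat.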
Embeddings & Index
- Model: `sentence-transformers/all-MiniLM-L6-v2`
- Normalize vectors (L2)
- FAISS `IndexFlatIP` (cosine via normalized dot-product)
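The "cosine via normalized dot-product" idea can be shown without FAISS or the model. The sketch below is a NumPy stand-in, not the script's code: random 384-dimensional vectors (the embedding size of all-MiniLM-L6-v2) replace real embeddings, and the hypothetical `search` helper performs the same kind of exhaustive inner-product scan that a flat inner-product index does.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so inner product equals cosine."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def search(index_vecs: np.ndarray, query_vecs: np.ndarray, k: int):
    """Exhaustive inner-product search; returns (scores, ids) per query."""
    scores = query_vecs @ index_vecs.T
    ids = np.argsort(-scores, axis=1)[:, :k]
    return np.take_along_axis(scores, ids, axis=1), ids


rng = np.random.default_rng(0)

# Random stand-ins for document embeddings (assumption: 100 docs, 384 dims).
docs = rng.normal(size=(100, 384)).astype("float32")

# The query is document 7 scaled by 3; cosine similarity ignores the scale.
query = docs[7:8] * 3.0

scores, ids = search(l2_normalize(docs), l2_normalize(query), k=3)
print(ids[0], scores[0])
```

Because both sides are unit length, the top hit is document 7 with a score of approximately 1.0, regardless of the query's original magnitude.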
Why This Design
- Small, fast, well-known model with good recall/latency balance
- Flat index is sufficient for dataset size; trivial to swap for IVF/HNSW later
Data Quality
- Missing `price` → "N/A"; missing text fields → ""; numeric coercion with defaults
- Unique by `id` in clean CSV
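These rules might look like the following in pandas. This is a sketch under assumptions: the text column names in `TEXT_COLS`, the zero defaults for numeric fields, and the helper name `clean_businesses` are illustrative, and the source does not say which script applies each rule.

```python
import pandas as pd

# Assumed text column names; adjust to the real schema.
TEXT_COLS = ["name", "categories", "address", "url"]


def clean_businesses(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the fill, coercion, and uniqueness rules to a raw frame."""
    # Keep the first row seen for each business id.
    out = df.drop_duplicates(subset="id", keep="first").copy()

    out["price"] = out["price"].fillna("N/A")
    out[TEXT_COLS] = out[TEXT_COLS].fillna("")

    # Numeric coercion; the 0 defaults are an assumption.
    out["rating"] = pd.to_numeric(out["rating"], errors="coerce").fillna(0.0)
    out["review_count"] = (
        pd.to_numeric(out["review_count"], errors="coerce").fillna(0).astype(int)
    )
    return out.reset_index(drop=True)


# Made-up rows: a duplicate id, a missing price, and an unparseable count.
raw = pd.DataFrame(
    {
        "id": ["a", "b", "a"],
        "name": ["A", "B", "A"],
        "categories": ["Thai", None, "Thai"],
        "address": ["1 Main St", None, "1 Main St"],
        "url": ["https://example.com/a", "https://example.com/b", "https://example.com/a"],
        "price": ["$$", None, "$$"],
        "rating": ["4.5", None, "4.5"],
        "review_count": [10, "n/a", 10],
    }
)
clean = clean_businesses(raw)
print(clean)
```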
Rebuilding
- Re-run both scripts after new fetch or on schedule via auto-refresh