Technical Glossary (Plain English)¶
A quick reference for terms used across the app and documentation.
Data & Columns¶
- rating: Yelp star rating (0.0–5.0).
- review_count: Total number of Yelp reviews for a business.
- price: Price tier —
$(cheap) to$$$$(expensive);None/N/Ameans unknown. - categories: Comma‑separated list of cuisines/tags from Yelp.
- city / address / zip_code / latitude / longitude: Standard location fields.
- hours: Human‑readable business hours built from Yelp’s raw schedule.
Analytics & Scoring¶
- Bayesian weighted rating (bayes_score): A rating that balances a restaurant’s own stars with the global average to avoid over‑promoting places with very few reviews. Formula:
(v/(v+m))*R + (m/(v+m))*CwhereR=restaurant rating,v=review_count,C=global mean,m=popularity threshold. - popularity:
log1p(review_count)— softer growth so very large counts don’t dominate. - rank_score: Overall ranking score used in tables —
bayes_score * (1 + 0.15 * popularity). - reliability_score: Used for “best” queries —
rating * log(review_count + 1); prefers highly‑rated places with more reviews. - Opportunity Score: For investor analysis —
(avg_rating Ă— avg_review_count) Ă· (business_count + 1); higher is better (quality, visibility, low competition).
Clustering & Geography¶
- KMeans: An algorithm that groups nearby points into clusters by distance.
- cluster_id (0,1,2,3): The numeric label assigned by KMeans. It has no inherent meaning by itself — it’s just “group 0/1/2/3”. We compute per‑cluster stats (average rating, business_count, center_lat/lng) to interpret them.
- cluster center (center_lat/center_lng): The average latitude/longitude of restaurants in the cluster — a “hotspot” center.
Retrieval & AI¶
- RAG (Retrieval‑Augmented Generation): The model answers using only information we retrieve from our dataset (candidates and optional passages), preventing hallucination.
- embeddings: Numeric vectors representing text meaning. We use
all‑MiniLM‑L6‑v2. - FAISS: A fast vector index. We build it over restaurant text cards for semantic search.
- Index type (IndexFlatIP): Cosine similarity via dot product after L2 normalization.
- candidates: The set of restaurant rows retrieved by our tabular/semantic search before the model answers.
- grounding: Forcing the AI to use only the supplied candidates/passages.
Search Strategy (Tabular Layer)¶
- Exact match: Direct string match on business names.
- Fuzzy match: Tolerates typos using similarity (e.g., “mcdonalds” → “McDonald’s”).
- Cuisine match: Looks for cuisine keywords in
categories(e.g., mexican, ramen, sushi). - City filter: Restricts to Odessa or Midland.
- Rating filter:
rating ≥ Xfrom queries like “rating more than 3”. - Price filter: Filters by
$,$$,$$$,$$$$. - Fallbacks: If a strict query fails, we progressively relax filters to always give a data‑grounded answer.
Data Pipeline¶
- Data Collection: Fetch pages from Yelp, cache JSON per city/category/offset, and write CSVs.
- Processed CSVs:
businesses_clean.csv,businesses_ranked.csv(with ranking columns). - RAG Index Files:
rag/faiss.index,rag/docstore.parquet,rag/meta.json. - Backups: Timestamped copies under
data/backups/before refresh.
Automation & Freshness¶
- Auto refresh:
auto_refresh_data.py— full or incremental update; validates files and updatesdata_metadata.json. - Hot‑reload (chat): The chat’s retriever automatically reloads the latest table if the file changes, so answers reflect daily updates.
- GitHub Actions: CI that can deploy docs and (optionally) run refresh workflows on schedule.
UI & Maps¶
- CARTO provider: Basemap used for PyDeck maps without requiring a Mapbox token.
- Tooltips: On map hover, we show
name,city,rating,price, andcluster label. - Legend: Colors (red/green/blue/yellow) map to cluster IDs 0–3 for quick scanning.
Common Phrases in the Docs¶
- “High rating, low competition”: Categories with
avg_rating ≥ 4.0andbusiness_count < 5. - “Database‑wide stats”: Questions like “how many 5‑star restaurants?” that are answered by scanning the full table, not just search hits.
- “Grounded responses”: Answers built only from the retrieved rows/passages — nothing invented.