System Architecture¶
🏗️ Complete System Overview¶
The Yelp Odessa-Midland Restaurant Analytics Platform follows a five-layer architecture designed for scalability, reliability, and performance.
📐 Architecture Diagram¶
graph TB
%% Styling for better readability
classDef apiClass fill:#e8f4f8,stroke:#2c5aa0,stroke-width:2px,color:#000
classDef processClass fill:#fff4e6,stroke:#d97706,stroke-width:2px,color:#000
classDef dataClass fill:#f0fdf4,stroke:#16a34a,stroke-width:2px,color:#000
classDef ragClass fill:#fef3c7,stroke:#f59e0b,stroke-width:2px,color:#000
classDef appClass fill:#fce7f3,stroke:#c026d3,stroke-width:2px,color:#000
classDef decisionClass fill:#f3f4f6,stroke:#6b7280,stroke-width:3px,color:#000
classDef pageClass fill:#dbeafe,stroke:#2563eb,stroke-width:2px,color:#000
classDef featureClass fill:#e0e7ff,stroke:#6366f1,stroke-width:2px,color:#000
Start([Start]):::apiClass --> YelpAPI[Yelp Fusion API]:::apiClass
YelpAPI --> Fetch[yelp_fetch_reviews.py<br/>Fetch Business Data]:::processClass
Fetch --> Cache[Cache JSON Pages<br/>data/cache/]:::dataClass
Cache --> RawCSV[Raw CSV<br/>data/raw/businesses.csv]:::dataClass
RawCSV --> Process[prepare_business_metrics.py<br/>Calculate Rankings]:::processClass
Process --> Bayesian[Bayesian Weighted Rating<br/>Popularity Scoring]:::processClass
Bayesian --> RankedCSV[Ranked CSV<br/>data/processed/businesses_ranked.csv]:::dataClass
RankedCSV --> BuildRAG[build_rag_index.py<br/>Build RAG Index]:::ragClass
BuildRAG --> Embeddings[Sentence Transformers<br/>all-MiniLM-L6-v2]:::ragClass
Embeddings --> FAISS[FAISS Vector Index<br/>data/processed/rag/]:::ragClass
FAISS --> AutoRefresh{auto_refresh_data.py<br/>Data Refresh System}:::decisionClass
AutoRefresh -->|Manual| ManualRefresh[Full/Incremental Refresh]:::processClass
AutoRefresh -->|Scheduled| CronJob[Cron Job<br/>Daily at 2 AM]:::processClass
AutoRefresh -->|Check| StatusCheck[Data Freshness Check]:::processClass
RankedCSV --> Streamlit[Streamlit App<br/>app.py]:::appClass
FAISS --> Streamlit
Streamlit --> Analytics[📊 Analytics Page]:::pageClass
Streamlit --> Chat[💬 Chat Page]:::pageClass
Streamlit --> Investor[💰 Investor Insights Page]:::pageClass
Analytics --> Filters[Filters:<br/>City, Price, Rating, Reviews]:::featureClass
Filters --> Viz1[KPI Metrics]:::featureClass
Filters --> Viz2[Ratings Distribution]:::featureClass
Filters --> Viz3[Price Analysis]:::featureClass
Filters --> Viz4[Category Analysis]:::featureClass
Filters --> Viz5[Interactive Map]:::featureClass
Filters --> Export[CSV Export]:::featureClass
Chat --> RAGSystem[RAG Retrieval System]:::ragClass
RAGSystem --> TabularRetrieval[Tabular Retrieval<br/>Ranked Candidates]:::ragClass
RAGSystem --> VectorSearch[FAISS Vector Search<br/>Semantic Similarity]:::ragClass
VectorSearch --> LLM[OpenAI GPT-4o-mini<br/>Optional]:::ragClass
LLM --> ChatResponse[Context-Aware Responses<br/>with Citations]:::ragClass
Investor --> MarketAnalysis[Market Opportunity Analysis]:::featureClass
Investor --> LocationClustering[Geographic Clustering<br/>K-Means]:::featureClass
Investor --> CategoryGaps[Category Gap Analysis]:::featureClass
Investor --> InvestmentInsights[Investment Recommendations]:::featureClass
Color Coding:
- 🔵 Blue = External APIs & Data Sources
- 🟠 Orange = Data Processing & Scripts
- 🟢 Green = Data Storage & Files
- 🟡 Yellow = RAG System & AI Components
- 🟣 Purple = Application Layer
- ⚪ Gray = Decision Points
- 🔷 Light Blue = Streamlit Pages
- 💙 Indigo = Features & Visualizations
🔄 Data Flow¶
1. Data Collection Layer¶
Component: yelp_fetch_reviews.py
Responsibilities:
- Yelp API integration with rate limiting
- Resumable caching system
- Pagination handling (50 results per page)
- City-category matrix traversal
- Data flattening and normalization
Output: Raw JSON cache files and aggregated CSV
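The pagination and resumable-cache responsibilities above can be sketched roughly as follows. This is a minimal illustration, not the code in yelp_fetch_reviews.py: the function name `fetch_all_pages`, the injected `fetch_page` callable, the cache file naming, and the `max_results` cap are all assumptions made for the example. Only the page size of 50 comes from the text above.

```python
import json
from pathlib import Path
from typing import Callable

PAGE_SIZE = 50  # results per page, as stated in the responsibilities above


def fetch_all_pages(
    fetch_page: Callable[[int, int], dict],  # hypothetical: (offset, limit) -> parsed JSON page
    cache_dir: Path,
    key: str,                                # hypothetical label for one city/category pair
    max_results: int = 1000,                 # assumed upper bound, not taken from the source
) -> list[dict]:
    """Collect every page for one city/category pair, reusing pages cached on disk."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    businesses: list[dict] = []
    for offset in range(0, max_results, PAGE_SIZE):
        page_file = cache_dir / f"{key}_offset{offset}.json"
        if page_file.exists():
            # Resume: a page fetched on an earlier run is read from disk, not re-requested.
            page = json.loads(page_file.read_text(encoding="utf-8"))
        else:
            page = fetch_page(offset, PAGE_SIZE)
            page_file.write_text(json.dumps(page), encoding="utf-8")
        batch = page.get("businesses", [])
        businesses.extend(batch)
        if len(batch) < PAGE_SIZE:
            break  # a short page means there is nothing left to fetch
    return businesses
```

Passing the network call in as `fetch_page` keeps rate limiting and authentication out of the paging loop, and lets the loop be exercised with a stub instead of the live API.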
2. Data Processing Layer¶
Component: prepare_business_metrics.py
Responsibilities:
- Duplicate removal (by business ID)
- Missing value handling with user-friendly replacements
- Bayesian weighted rating calculation
- Popularity scoring (logarithmic scaling)
- Composite ranking algorithm
Output: Clean CSV files and ranked datasets
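To make the ranking steps concrete, here is a hedged sketch of one common way to compute a Bayesian weighted rating, a logarithmic popularity score, and a composite of the two. The exact formula, the prior weight, and the 70/30 blend used by prepare_business_metrics.py are not stated in this document, so every function name, parameter, and default below is an assumption.

```python
import math


def bayesian_rating(rating: float, review_count: int,
                    prior_mean: float, prior_weight: float) -> float:
    """Shrink a rating toward the dataset mean when a business has few reviews."""
    return (review_count * rating + prior_weight * prior_mean) / (review_count + prior_weight)


def popularity_score(review_count: int) -> float:
    """Logarithmic scaling, so very large review counts do not dominate."""
    return math.log1p(review_count)


def composite_score(rating: float, review_count: int, prior_mean: float,
                    prior_weight: float, max_reviews: int,
                    rating_share: float = 0.7) -> float:
    """Blend both signals on a 0-1 scale; the 0.7 share is an illustrative assumption."""
    quality = bayesian_rating(rating, review_count, prior_mean, prior_weight) / 5.0
    if max_reviews > 0:
        popularity = popularity_score(review_count) / popularity_score(max_reviews)
    else:
        popularity = 0.0
    return rating_share * quality + (1.0 - rating_share) * popularity
```

With a prior mean of 4.0 and a prior weight of 10, a 5.0-star business with 2 reviews is pulled down to about 4.17, while a 4.5-star business with 500 reviews stays near 4.49, so the well-reviewed business ranks higher.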
3. RAG Index Layer¶
Component: build_rag_index.py
Responsibilities:
- Text document preparation
- Embedding generation (Sentence Transformers)
- FAISS index building
- Metadata storage
Output: FAISS vector index and document store
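The text document preparation step might look something like the sketch below, which turns one business record into a short text and keeps a parallel metadata list. The field names (`name`, `categories`, `city`, `rating`, `review_count`, `price`, `id`) and the sentence layout are assumptions for illustration; the actual columns and format used by build_rag_index.py may differ.

```python
def business_to_document(row: dict) -> str:
    """Render one business record as a short text suitable for embedding."""
    parts = [row.get("name") or "Unknown business"]
    if row.get("categories"):
        parts.append(f"Categories: {row['categories']}")
    if row.get("city"):
        parts.append(f"City: {row['city']}")
    if row.get("rating") is not None:
        text = f"Rating: {row['rating']}"
        if row.get("review_count") is not None:
            text += f" from {row['review_count']} reviews"
        parts.append(text)
    if row.get("price"):
        parts.append(f"Price: {row['price']}")
    return ". ".join(parts) + "."


def build_documents(rows: list[dict]) -> tuple[list[str], list[dict]]:
    """Return texts and metadata aligned by position, so hit i maps back to business i."""
    docs = [business_to_document(row) for row in rows]
    metadata = [{"id": row.get("id"), "name": row.get("name")} for row in rows]
    return docs, metadata
```

The texts would then be encoded with the embedding model and added to the FAISS index in the same order, which is what lets a vector search result be resolved to its metadata entry by row position.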
4. Application Layer¶
Component: src/app.py + src/pages/
Responsibilities:
- Streamlit web interface
- Three specialized dashboards:
    - Analytics Dashboard
    - RAG Chat Assistant
    - Investor Insights
- User interaction handling
- Data visualization
Output: Interactive web application
5. Automation Layer¶
Component: .github/workflows/auto-refresh.yml + auto_refresh_data.py
Responsibilities:
- Scheduled data refresh (daily)
- Data integrity validation
- Automatic backups
- Error recovery
- CI/CD integration
Output: Continuously updated dataset
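A data freshness check of the kind the refresh system performs could be sketched as below. The helper names and the one-day threshold (chosen to match the daily schedule) are assumptions; auto_refresh_data.py may define freshness differently, for example through a manifest rather than file modification times.

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path


def is_stale(last_updated: datetime, now: datetime,
             max_age: timedelta = timedelta(days=1)) -> bool:
    """True when the data is older than the allowed age."""
    return now - last_updated > max_age


def file_is_stale(path: Path, max_age: timedelta = timedelta(days=1)) -> bool:
    """Treat a missing file as stale; otherwise compare its modification time to now."""
    if not path.exists():
        return True
    modified = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
    return is_stale(modified, datetime.now(timezone.utc), max_age)
```

Keeping `is_stale` free of filesystem access makes the threshold logic easy to check with fixed timestamps.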
🧩 Component Details¶
Analytics Dashboard (pages/analytics.py)¶
Features:
- KPI metrics display
- Interactive filtering
- Chart visualizations (Plotly)
- Geographic mapping (PyDeck)
- CSV export functionality
Data Sources:
- businesses_ranked.csv
- Real-time filtering and aggregation
RAG Chat Assistant (pages/chat.py)¶
Features:
- Natural language query processing
- Multi-strategy search (7 layers)
- Vector similarity search (FAISS)
- LLM response generation (GPT-4o-mini)
- Citation and source tracking
Components:
- utils/rag.py - Retrieval logic
- utils/llm_openai.py - LLM integration
- FAISS index for semantic search
Investor Insights (pages/investor_insights.py)¶
Features:
- Market opportunity analysis
- Location hotspot clustering (KMeans)
- Competitor benchmarking
- Strategic recommendations

Analytics Methods:
- Category gap identification
- Geographic clustering
- Statistical benchmarking
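As an illustration of category gap identification, the sketch below lists categories that exist somewhere in the dataset but have fewer than a minimum number of businesses in a given city. The definition of a "gap", the function name, and the record layout (a `city` string and a `categories` list per business) are assumptions for the example, not the method used in pages/investor_insights.py.

```python
from collections import Counter


def category_gaps(businesses: list[dict], city: str,
                  min_count: int = 1) -> list[tuple[str, int, int]]:
    """Return (category, overall_count, local_count) for categories scarce in `city`."""
    overall: Counter = Counter()
    local: Counter = Counter()
    for business in businesses:
        for category in business["categories"]:
            overall[category] += 1
            if business["city"] == city:
                local[category] += 1
    gaps = [(cat, overall[cat], local[cat]) for cat in overall if local[cat] < min_count]
    # Categories that are most common overall come first; ties are broken alphabetically.
    return sorted(gaps, key=lambda gap: (-gap[1], gap[0]))
```

For a dataset where Midland has two sushi restaurants and one Thai restaurant and Odessa has neither, calling `category_gaps(businesses, "Odessa")` would list Sushi ahead of Thai.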
🔐 Security & Environment¶
Environment Variables¶
YELP_API_KEY # Required for data collection
OPENAI_API_KEY # Optional, for enhanced chat
DATA_DIR # Optional, override data directory
RAG_DOC_TABLE # Optional, override document table
EMBED_MODEL # Optional, override embedding model
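One way these variables could be read at startup is sketched here. The `Settings` class, `load_settings`, and `require_yelp_key` are hypothetical names, and the default `data` directory is an assumption inferred from the paths in the diagram; the `all-MiniLM-L6-v2` default matches the model named there.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    yelp_api_key: Optional[str]    # required only for data collection
    openai_api_key: Optional[str]  # optional, enables LLM-generated chat responses
    data_dir: str
    rag_doc_table: Optional[str]
    embed_model: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read configuration from environment variables, applying assumed defaults."""
    return Settings(
        yelp_api_key=env.get("YELP_API_KEY"),
        openai_api_key=env.get("OPENAI_API_KEY"),
        data_dir=env.get("DATA_DIR", "data"),  # assumed default
        rag_doc_table=env.get("RAG_DOC_TABLE"),
        embed_model=env.get("EMBED_MODEL", "all-MiniLM-L6-v2"),
    )


def require_yelp_key(settings: Settings) -> str:
    """Fail early, with a clear message, when the collection step lacks its key."""
    if not settings.yelp_api_key:
        raise RuntimeError("YELP_API_KEY must be set before collecting data")
    return settings.yelp_api_key
```

Because only the collection step calls `require_yelp_key`, the app can still start for read-only use when the key is absent.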
Security Measures¶
- API keys stored in environment variables
- GitHub Secrets for CI/CD
- No sensitive data in code
- Secure error handling
⚡ Performance Optimizations¶
Caching Strategy¶
- Resumable API cache: JSON files per page
- Streamlit caching: @st.cache_data decorators
- Manifest tracking: Avoids redundant API calls
Query Optimization¶
- Multi-strategy search: Fast exact matches first
- Vector index: FAISS for semantic similarity
- Deduplication: Efficient result merging
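The three points above can be combined into a single loop: run the cheapest strategy first, merge hits without duplicates, and stop before the slower strategies once enough results are found. This is a simplified sketch with hypothetical names; the seven layers in utils/rag.py are not reproduced here, and keying deduplication on an `id` field is an assumption.

```python
from typing import Callable, Iterable

Strategy = Callable[[str], Iterable[dict]]  # hypothetical: query -> hits, each with an "id"


def multi_strategy_search(query: str, strategies: list[Strategy],
                          limit: int = 5) -> list[dict]:
    """Try strategies in order (cheapest first) and merge their hits without duplicates."""
    seen: set = set()
    results: list[dict] = []
    for strategy in strategies:
        for hit in strategy(query):
            if hit["id"] in seen:
                continue  # already returned by an earlier strategy
            seen.add(hit["id"])
            results.append(hit)
            if len(results) >= limit:
                return results  # later, slower strategies are never called
    return results
```

Placing an exact-name lookup ahead of the vector search in `strategies` means a query that matches a business name exactly can be answered without touching the FAISS index, provided the limit is already met.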
Data Processing¶
- Batch processing: Process data in chunks
- Incremental updates: Only fetch new/changed data
- Lazy loading: Load data only when needed
📊 Scalability Considerations¶
Current Capacity¶
- 1,200+ restaurants: Efficiently handled
- 31,000+ reviews: Fast processing
- 36+ categories: Easy expansion
Future Scalability¶
- Multi-city support: Architecture ready
- Additional data sources: Modular design
- API integration: RESTful architecture ready
- Microservices: Components can be separated
🛠️ Technology Stack¶
| Layer | Technology | Purpose |
|---|---|---|
| Language | Python 3.12 | Core development |
| Web Framework | Streamlit | Interactive UI |
| Vector DB | FAISS (CPU) | Similarity search |
| Embeddings | Sentence Transformers | Text-to-vector |
| LLM | OpenAI GPT-4o-mini | Response generation |
| Visualization | Plotly, PyDeck | Charts and maps |
| Automation | GitHub Actions | CI/CD pipeline |
| Data Processing | Pandas, NumPy | Data manipulation |
| ML | scikit-learn | Clustering algorithms |
📈 System Metrics¶
Performance Benchmarks¶
- Data Collection: 18 min average (full refresh)
- Data Processing: 2 min average
- RAG Index Building: 5 min average
- Query Response: <2 seconds average
- Page Load: <1 second
Reliability Metrics¶
- Automation Success Rate: 95%
- Data Freshness: Daily updates
- Error Recovery: Automatic backups
- Uptime: High availability
🔄 Maintenance & Updates¶
Automated Processes¶
- Daily Data Refresh: GitHub Actions scheduler
- Automatic Backups: Before each update
- Error Alerts: Notification on failures
- Health Monitoring: Status reports
Manual Processes¶
- Code Updates: Standard git workflow
- Configuration Changes: Update mkdocs.yml or config files
- Model Updates: Change embedding model if needed