System Architecture
Event extraction, enrichment, storage & export pipeline
graph TD
A["Sitemap URLs"] --> B["Crawlee Scraper"]
B --> C1["LIV Extractor"]
B --> C2["WynnSocial Extractor"]
B --> C3["TAO Group Extractor"]
C1 --> D["VegasEvent"]
C2 --> D
C3 --> D
D --> E["StorageManager"]
D --> F["MasterDatabase"]
E -.-> G["data/runs/"]
F --> H["Export Pipeline"]
H --> I1["events.json"]
H --> I2["events.csv"]
H --> I3["D1 Database"]
H --> I4["Venue Pages"]
D --> J["Image Processor"]
D --> K["Table Pricing"]
D --> L["Artist Enricher"]
J --> M["Cloudflare R2"]
K --> F
L --> F
classDef sources fill:#818cf811,stroke:#818cf844,stroke-width:1.5px
classDef extractor fill:#2dd4bf11,stroke:#2dd4bf44,stroke-width:1.5px
classDef model fill:#34d39911,stroke:#34d39944,stroke-width:1.5px
classDef storage fill:#34d39911,stroke:#34d39944,stroke-width:1.5px
classDef plugin fill:#fbbf2411,stroke:#fbbf2444,stroke-width:1.5px
classDef export fill:#fb718511,stroke:#fb718544,stroke-width:1.5px
classDef infra fill:#2dd4bf11,stroke:#2dd4bf44,stroke-width:1.5px
class A sources
class B extractor
class C1,C2,C3 extractor
class D model
class E,F storage
class G extractor
class H export
class I1,I2,I3,I4 export
class J,K,L plugin
class M infra
- Sitemap Ingestion: URLs from venue websites are fed into Crawlee's concurrent scraper
- Venue Extraction: Three specialized extractors parse event data from LIV, XS/EBC (via WynnSocial), and TAO Group venues
- Event Modeling: Raw HTML is transformed into strongly-typed Pydantic VegasEvent objects with validation
- Storage & Diff Tracking: Events are stored in timestamped runs and compared against the master database for field-level changes
- Enrichment Pipeline: Three feature plugins augment events with images (R2), table pricing (urvenue API), and artist metadata (Spotify/RA/Tracklists)
- Multi-Format Export: Curated master database is exported as JSON, CSV, SQL DDL (D1), and Markdown venue pages
The Crawlee scraper orchestrates concurrent HTTP requests across venue sitemaps, with configurable concurrency limits and retry logic.
- ConcurrencySettings(max_concurrency=5)
- Requests queued from sitemap URLs
- Routed to venue-specific extractors
- Handles rate limiting & retries
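The scheduling behaviour listed above can be sketched with plain asyncio. This is not Crawlee's API: it is a stand-in that shows a concurrency cap of 5, retries, and hostname-based routing. The `EXTRACTORS` table, the `taogroup.com` hostname, the `fetch` callable, and the retry count are assumptions made for illustration.

```python
import asyncio
from urllib.parse import urlparse

MAX_CONCURRENCY = 5  # mirrors ConcurrencySettings(max_concurrency=5)
MAX_RETRIES = 3      # assumed value, not taken from the project

# Hypothetical routing table: hostname -> extractor label.
EXTRACTORS = {
    "www.livnightclub.com": "liv",
    "www.wynnsocial.com": "wynnsocial",
    "taogroup.com": "tao",
}


def route(url):
    """Pick an extractor label from the URL's hostname."""
    return EXTRACTORS.get(urlparse(url).netloc, "unknown")


async def crawl(urls, fetch):
    """Fetch every URL with at most MAX_CONCURRENCY requests in flight."""
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)

    async def worker(url):
        async with semaphore:
            for attempt in range(1, MAX_RETRIES + 1):
                try:
                    body = await fetch(url)
                    return {"url": url, "extractor": route(url), "body": body}
                except OSError:
                    if attempt == MAX_RETRIES:
                        # Give up and report the failure instead of raising.
                        return {"url": url, "extractor": route(url), "body": None}
                    await asyncio.sleep(0)  # placeholder for a real backoff delay

    # gather() preserves the order of the input URLs.
    return await asyncio.gather(*(worker(u) for u in urls))
```

A real deployment would presumably let the crawler library handle queuing and backoff; the sketch only makes the control flow visible.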
The LIV extractor pulls events for LIV Las Vegas and LIV Beach, parsing JSON-LD first and falling back to HTML when no structured data is present.
- JSON-LD parsing (primary)
- HTML fallback extraction
- VEA image URL extraction
- urvenue table pricing API
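The JSON-LD-first, HTML-fallback strategy can be sketched as follows. The function name, the returned field names, and the choice of `<h1>` as the fallback element are assumptions for illustration; the actual extractor may read different elements.

```python
import json
import re
from html import unescape
from typing import Optional

JSON_LD_RE = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)
H1_RE = re.compile(r"<h1[^>]*>(.*?)</h1>", re.DOTALL | re.IGNORECASE)


def extract_event(html: str) -> Optional[dict]:
    """Return a small event dict, or None when nothing usable is found."""
    # Primary path: the first JSON-LD object whose @type is "Event".
    for match in JSON_LD_RE.finditer(html):
        try:
            data = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # malformed block, try the next one
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict) and item.get("@type") == "Event":
                return {
                    "artist": item.get("name"),
                    "date": item.get("startDate"),
                    "source": "json-ld",
                }
    # Fallback path: take the text of the first <h1> as the artist name.
    heading = H1_RE.search(html)
    if heading:
        text = re.sub(r"<[^>]+>", "", heading.group(1))
        return {"artist": unescape(text).strip(), "date": None, "source": "html"}
    return None
```

Recording which path produced the event (`source`) makes it easier to spot pages where structured data has disappeared.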
Shared base extractor for XS Nightclub and Encore Beach Club, both of which use the wynnsocial.com domain.
- XS: JSON-LD Schema.org Event
- EBC: HTML text inspection
- uv_tablesitemspricing extraction
- EVE URL segments for event IDs
The TAO Group extractor pulls events from TAO Group venues, keeps only Las Vegas venues, and applies multi-layer validation.
- JSON-LD + og:title parsing
- Las Vegas venue filter (10 venues)
- Omnia, Hakkasan, Marquee, etc.
- booketing.com proxy pricing
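A layered Las Vegas filter might look like the sketch below. Only the three venues named above are included; the remaining seven are not listed in the text and are left out. The `venue` and `city` field names are assumptions.

```python
# Only the venues named in the text; the full list of 10 is not given there.
LAS_VEGAS_VENUES = {"omnia", "hakkasan", "marquee"}


def is_las_vegas_event(event: dict) -> bool:
    """Two checks: a known venue name, then a city check when a city is present."""
    venue = (event.get("venue") or "").strip().lower()
    city = (event.get("city") or "").strip().lower()
    # Layer 1: the venue name must match a known Las Vegas venue.
    if not any(name in venue for name in LAS_VEGAS_VENUES):
        return False
    # Layer 2: if the page states a city, it must be Las Vegas.
    return city in ("", "las vegas")
```

The second layer matters for brand names that a venue group could reuse in more than one city.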
VegasEvent is a Pydantic v2 model with strict validation for event metadata and enrichment tracking.
- artist, date, venue, image_urls
- enrichment_status tracking
- social_links metadata
- Immutable model_copy() updates
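The copy-on-update pattern behind the last bullet can be shown with a frozen dataclass from the standard library. This is a stand-in, not the project's model: in Pydantic v2 the equivalent call is `model_copy(update={...})`. The class name `EventSketch`, the default values, and the single validation rule are assumptions.

```python
import datetime
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class EventSketch:
    """Stand-in for VegasEvent using the field names listed above."""
    artist: str
    date: datetime.date
    venue: str
    image_urls: tuple = ()
    enrichment_status: str = "pending"  # assumed representation
    social_links: tuple = ()

    def __post_init__(self):
        # One example rule; the real model's validators are not shown in the text.
        if not self.artist.strip():
            raise ValueError("artist must not be empty")


event = EventSketch(artist="DJ Example", date=datetime.date(2025, 1, 1), venue="XS")

# replace() returns a new object and leaves the original untouched.
enriched = replace(event, enrichment_status="done")
```

Because instances are never mutated, an event held by the diff tracker cannot change underneath it while a plugin is running.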
StorageManager manages timestamped run directories and raw scrape data. Each run captures the full HTML and the extracted JSON.
- data/runs/{timestamp}/
- raw_data.json: extracted events
- Hourly/daily run history
- Process tracking & stats
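Writing one run into its own timestamped directory could be sketched as below. The function name `save_run` and the timestamp format are assumptions; only the `runs/{timestamp}/raw_data.json` layout comes from the text.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def save_run(events, base_dir, now=None):
    """Write events to <base_dir>/runs/<timestamp>/raw_data.json and return the run dir."""
    now = now or datetime.now(timezone.utc)
    # Assumed format: compact UTC timestamp that sorts lexicographically.
    run_dir = Path(base_dir) / "runs" / now.strftime("%Y%m%dT%H%M%SZ")
    # exist_ok=False so two runs can never silently overwrite each other.
    run_dir.mkdir(parents=True, exist_ok=False)
    (run_dir / "raw_data.json").write_text(json.dumps(events, indent=2), encoding="utf-8")
    return run_dir
```

Calling `save_run(events, "data")` would then produce a directory under `data/runs/`.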
MasterDatabase compares each new run against master_events.json field by field and records what changed, when, and how.
- master_events.json canonical source
- FieldChange history per event
- model_dump(mode='json') for serialization
- Deduplication by (venue, artist, date)
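Field-level diffing with deduplication on (venue, artist, date) can be sketched with plain dictionaries. The `FieldChange` attributes and the `merge_run` function are assumptions about shape, not the project's actual classes.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class FieldChange:
    field: str
    old: object
    new: object
    changed_at: str


def event_key(event):
    """Deduplication key from the text: (venue, artist, date)."""
    return (event["venue"], event["artist"], event["date"])


def merge_run(master, run_events, now=None):
    """Merge a run into master (a dict keyed by event_key); return changes per key."""
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    changes = {}
    for event in run_events:
        key = event_key(event)
        current = master.get(key)
        if current is None:
            master[key] = dict(event)  # first sighting: store, nothing to diff
            continue
        diffs = [
            FieldChange(name, current.get(name), value, stamp)
            for name, value in event.items()
            if current.get(name) != value
        ]
        if diffs:
            changes[key] = diffs
            current.update(event)
    return changes
```

Keying the master on the tuple means a repeated event within the same run collapses into one entry rather than producing a duplicate.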
Each scrape run is archived with full context for audit trail and historical analysis.
- Snapshot of all extracted events
- Timestamps and metadata
- Diffable against previous runs
- Reference for re-enrichment
The image processor downloads artist images and stores them in R2 under standardized paths and sizes.
- data/artists/{slug}/
- Venue-tagged filenames (ebc, xs, liv, livb)
- 500px & 1500px variants
- Dedup by (venue, artist, size)
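The path and deduplication rules above could be combined as in this sketch. The filename pattern `{venue}_{size}.jpg`, the `slugify` rule, and the `plan_uploads` helper are assumptions; the text only specifies the directory, the venue tags, the two sizes, and the dedup key.

```python
import re
from pathlib import PurePosixPath

VENUE_TAGS = {"ebc", "xs", "liv", "livb"}
SIZES = (500, 1500)  # pixel variants named in the text


def slugify(name):
    """Lowercase and collapse every non-alphanumeric run into one hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def plan_uploads(artist, venue_tag, already_stored):
    """Return the object paths still to be written, skipping known (venue, artist, size) keys."""
    if venue_tag not in VENUE_TAGS:
        raise ValueError(f"unknown venue tag: {venue_tag}")
    slug = slugify(artist)
    planned = []
    for size in SIZES:
        key = (venue_tag, slug, size)
        if key in already_stored:
            continue  # dedup: this variant already exists
        already_stored.add(key)
        # Hypothetical filename pattern, chosen only for the example.
        planned.append(str(PurePosixPath("data/artists") / slug / f"{venue_tag}_{size}.jpg"))
    return planned
```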
The table pricing plugin fetches VIP table pricing from the urvenue API for LIV, LIV Beach, and TAO Group venues.
- livnightclub.com wp-admin API
- TAO: booketing.com proxy (VEN codes)
- Bottle service pricing tiers
- Per-event pricing variation
The artist enricher augments artist metadata with Spotify tracks, Resident Advisor profiles, and DJ tracklists.
- SpotifyEnricher — top tracks & links
- ResidentAdvisorEnricher — bio & profile
- TracklistEnricher — past set recordings
- EnrichmentStatus tracking on events
Coordinates the run order of the enrichment plugins, with dependency management and error tracking.
- Run order: Tracklists → RA → Spotify
- Parallel execution where safe
- Error recovery & retry logic
- Status tracking per event
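A sequential version of this coordination, with per-plugin status and no retry or parallelism, might look like the sketch below. The `run_enrichers` function and the convention that each enricher returns a dict of new fields are assumptions.

```python
def run_enrichers(event, enrichers):
    """Apply (name, function) pairs in order; each function returns new fields for the event."""
    status = {}
    for name, enrich in enrichers:
        try:
            # Build a new dict instead of mutating the caller's event.
            event = {**event, **enrich(event)}
            status[name] = "ok"
        except Exception as exc:
            # Record the failure and continue so one source cannot block the rest.
            status[name] = f"failed: {exc}"
    return event, status
```

Passing the enrichers as an ordered list, for example tracklists, then RA, then Spotify, keeps the run order explicit and lets later plugins see fields added by earlier ones.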
The export pipeline writes the master database to JSON, CSV, SQL DDL, and Markdown for downstream consumption.
- events.json — full event data
- events.csv — tabular format
- D1 SQL — Cloudflare Workers database
- Markdown venue pages
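The JSON and CSV outputs can be sketched with the standard library. The function names, the column list passed by the caller, and the choice of "|" to flatten list fields are assumptions.

```python
import csv
import io
import json


def export_json(events):
    """Serialize the full event list, keeping nested fields intact."""
    return json.dumps(events, indent=2, sort_keys=True)


def export_csv(events, columns):
    """Serialize events as CSV, flattening list values into one cell."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for event in events:
        row = {
            key: "|".join(map(str, value)) if isinstance(value, list) else value
            for key, value in event.items()
        }
        writer.writerow(row)
    return buffer.getvalue()
```

JSON keeps the nested structure, while CSV needs a flattening rule for multi-valued fields such as image_urls.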
Image CDN and object storage powered by Cloudflare's global edge network.
- R2 bucket: vinny-vegas-images
- Custom domain: img.vinny.vegas
- Public URL distribution
- High-speed image serving
SQLite database for Astro site queries with automatic schema generation from exports.
- SQL schema exported from master DB
- Hourly/daily syncs
- Worker queries in Astro site
- Full-text search support
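Generating a SQL script from exported events could be sketched as below. The table name, the three columns, and the composite primary key are assumptions; full-text search is not covered. The script is checked against an in-memory SQLite database, which is a stand-in for D1 here.

```python
import sqlite3

# Hypothetical minimal schema; the real one is generated from the master DB.
SCHEMA = """CREATE TABLE IF NOT EXISTS events (
    venue TEXT NOT NULL,
    artist TEXT NOT NULL,
    date TEXT NOT NULL,
    PRIMARY KEY (venue, artist, date)
);"""

COLUMNS = ("venue", "artist", "date")


def sql_literal(value):
    """Render a value as a SQL literal, doubling single quotes."""
    if value is None:
        return "NULL"
    return "'" + str(value).replace("'", "''") + "'"


def export_sql(events):
    """Return a script with the schema followed by one INSERT per event."""
    lines = [SCHEMA]
    for event in events:
        values = ", ".join(sql_literal(event.get(col)) for col in COLUMNS)
        lines.append(
            f"INSERT OR REPLACE INTO events ({', '.join(COLUMNS)}) VALUES ({values});"
        )
    return "\n".join(lines)


def load_into_memory(script):
    """Run the script in an in-memory SQLite database and return the stored rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(script)
    rows = conn.execute("SELECT venue, artist, date FROM events").fetchall()
    conn.close()
    return rows
```

Using `INSERT OR REPLACE` with the composite key makes repeated syncs idempotent for unchanged events.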
Lightweight REST API for accessing events, venues, and statistics. Deployed as an ASGI app.
- GET /events: list all events
- GET /venues: venue metadata
- GET /stats: scrape statistics
- GET /health: liveness check
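The four routes can be served by a bare ASGI callable with no framework, as sketched below. The response shapes, the `build_app` factory, and the in-memory event list are assumptions; the project may well use a web framework instead.

```python
import json


def build_app(events):
    """Return an ASGI application serving the four read-only routes."""
    routes = {
        "/events": lambda: events,
        "/venues": lambda: sorted({e["venue"] for e in events}),
        "/stats": lambda: {"event_count": len(events)},
        "/health": lambda: {"status": "ok"},
    }

    async def app(scope, receive, send):
        if scope["type"] != "http":
            return  # lifespan and websocket scopes are ignored in this sketch
        handler = routes.get(scope["path"]) if scope["method"] == "GET" else None
        status = 200 if handler else 404
        payload = handler() if handler else {"error": "not found"}
        await send({
            "type": "http.response.start",
            "status": status,
            "headers": [(b"content-type", b"application/json")],
        })
        await send({"type": "http.response.body", "body": json.dumps(payload).encode()})

    return app
```

Any ASGI server can host the returned callable, which keeps the API independent of the scraping code.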
Command-line interface for manual scraping, enrichment, and data export workflows.
- vinny scrape VENUE: run venue extractor
- vinny enrich artists: artist metadata pipeline
- vinny export: multi-format output
- Sub-commands for detailed control
src/cli.py is the central registry. Command implementations live in cli_scrape.py, cli_enrich.py, etc. Each sub-module exports functions registered via app.command().