System Architecture

Event extraction, enrichment, storage & export pipeline

1 — System Data Flow
graph TD
  A["Sitemap URLs"] --> B["Crawlee Scraper"]
  B --> C1["LIV Extractor"]
  B --> C2["WynnSocial Extractor"]
  B --> C3["TAO Group Extractor"]
  C1 --> D["VegasEvent"]
  C2 --> D
  C3 --> D
  D --> E["StorageManager"]
  D --> F["MasterDatabase"]
  E -.-> G["data/runs/"]
  F --> H["Export Pipeline"]
  H --> I1["events.json"]
  H --> I2["events.csv"]
  H --> I3["D1 Database"]
  H --> I4["Venue Pages"]
  D --> J["Image Processor"]
  D --> K["Table Pricing"]
  D --> L["Artist Enricher"]
  J --> M["Cloudflare R2"]
  K --> F
  L --> F

  classDef sources fill:#818cf811,stroke:#818cf844,stroke-width:1.5px
  classDef extractor fill:#2dd4bf11,stroke:#2dd4bf44,stroke-width:1.5px
  classDef model fill:#34d39911,stroke:#34d39944,stroke-width:1.5px
  classDef storage fill:#34d39911,stroke:#34d39944,stroke-width:1.5px
  classDef plugin fill:#fbbf2411,stroke:#fbbf2444,stroke-width:1.5px
  classDef export fill:#fb718511,stroke:#fb718544,stroke-width:1.5px
  classDef infra fill:#2dd4bf11,stroke:#2dd4bf44,stroke-width:1.5px

  class A sources
  class B extractor
  class C1,C2,C3 extractor
  class D model
  class E,F storage
  class G storage
  class H export
  class I1,I2,I3,I4 export
  class J,K,L plugin
  class M infra
      
2 — Data Flow Overview
  1. Sitemap Ingestion: URLs from venue websites are fed into Crawlee's concurrent scraper
  2. Venue Extraction: Three specialized extractors parse event data from LIV, XS/EBC (via WynnSocial), and TAO Group venues
  3. Event Modeling: Raw HTML is transformed into strongly typed, validated Pydantic VegasEvent objects
  4. Storage & Diff Tracking: Events are stored in timestamped runs and compared against the master database for field-level changes
  5. Enrichment Pipeline: Three feature plugins augment events with images (R2), table pricing (urvenue API), and artist metadata (Spotify/RA/Tracklists)
  6. Multi-Format Export: The curated master database is exported as JSON, CSV, SQL DDL (D1), and Markdown venue pages
3 — Extraction Layer
Crawlee
CRAWLER

Orchestrates concurrent HTTP requests across venue sitemaps with configurable concurrency limits and retry logic.

  • ConcurrencySettings(max_concurrency=5)
  • Requests queued from sitemap URLs
  • Routed to venue-specific extractors
  • Handles rate limiting & retries
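The bounded concurrency and per-venue routing described above can be sketched with the standard library alone. This is not Crawlee's API: it swaps in an `asyncio.Semaphore` for `ConcurrencySettings(max_concurrency=5)`. The `ROUTES` table, the `taogroup.com` host, and the `fetch` stand-in are assumptions made for illustration.

```python
import asyncio
from typing import Optional
from urllib.parse import urlparse

MAX_CONCURRENCY = 5  # mirrors ConcurrencySettings(max_concurrency=5)

# Hypothetical host -> extractor routing table; taogroup.com is an assumption.
ROUTES = {
    "livnightclub.com": "liv",
    "wynnsocial.com": "wynnsocial",
    "taogroup.com": "tao",
}


def route(url: str) -> Optional[str]:
    """Return the extractor name for a URL, or None if no venue matches."""
    host = (urlparse(url).hostname or "").removeprefix("www.")
    return ROUTES.get(host)


async def fetch(url: str) -> str:
    """Stand-in for an HTTP request; a real crawler would download the page."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


async def crawl(urls: list[str]) -> dict[str, list[str]]:
    """Fetch routable URLs with at most MAX_CONCURRENCY requests in flight."""
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)
    routed: dict[str, list[str]] = {}

    async def handle(url: str) -> None:
        extractor = route(url)
        if extractor is None:
            return  # unknown host: skip rather than guess an extractor
        async with semaphore:
            await fetch(url)
        routed.setdefault(extractor, []).append(url)

    await asyncio.gather(*(handle(u) for u in urls))
    return routed
```

Calling `asyncio.run(crawl(urls))` returns the fetched URLs grouped by extractor name; URLs whose host is not in the table are dropped before any request is made.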
LIV Extractor
PARSER

Extracts events from LIV Las Vegas and LIV Beach using JSON-LD parsing with HTML fallback.

  • JSON-LD parsing (primary)
  • HTML fallback extraction
  • VEA image URL extraction
  • urvenue table pricing API
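A minimal sketch of "JSON-LD first, HTML fallback", using only the standard library. The function name `extract_event`, the choice of `<title>` as the fallback source, and the returned keys are assumptions; the real extractor may read different elements.

```python
import json
from html.parser import HTMLParser


class _PageParser(HTMLParser):
    """Collects JSON-LD script bodies and the <title> text."""

    def __init__(self):
        super().__init__()
        self.json_ld = []
        self.title = ""
        self._in_ld = False
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("type") == "application/ld+json":
            self._in_ld = True
            self.json_ld.append("")
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_ld = False
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_ld:
            self.json_ld[-1] += data
        elif self._in_title:
            self.title += data


def extract_event(html):
    """Try JSON-LD Event data first; fall back to the page title."""
    parser = _PageParser()
    parser.feed(html)
    for raw in parser.json_ld:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed block: try the next one, then the fallback
        if isinstance(data, dict) and data.get("@type") == "Event":
            return {"name": data.get("name"),
                    "date": data.get("startDate"),
                    "source": "json-ld"}
    title = parser.title.strip()
    if title:
        return {"name": title, "date": None, "source": "html"}
    return None
```

Recording the `source` of each result makes it easy to measure how often the fallback path is taken.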
WynnSocial Extractor
PARSER

Shared base for XS Nightclub and Encore Beach Club, which are both served from the wynnsocial.com domain.

  • XS: JSON-LD Schema.org Event
  • EBC: HTML text inspection
  • uv_tablesitems pricing extraction
  • EVE URL segments for event IDs
TAO Group Extractor
PARSER

Extracts events from TAO Group venues with Las Vegas-only filtering and multi-layer validation.

  • JSON-LD + og:title parsing
  • Las Vegas venue filter (10 venues)
  • Omnia, Hakkasan, Marquee, etc.
  • booketing.com proxy pricing
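The Las Vegas filter can be pictured as an allowlist check followed by a city check. This is a sketch under assumptions: the two layers shown, the `city` key, and the function names are hypothetical, and only the three venues named above are listed because the full ten-venue list is not given here.

```python
# Only the venues named in this document; the rest of the ten-venue
# allowlist is not specified here and is intentionally left out.
LAS_VEGAS_VENUES = {"omnia", "hakkasan", "marquee"}


def normalize(name):
    """Lowercase and collapse whitespace so comparisons are stable."""
    return " ".join(name.lower().split())


def is_las_vegas_event(event):
    """Layer 1: venue is on the allowlist.
    Layer 2 (assumed): if a city is present, it must be Las Vegas."""
    venue = normalize(event.get("venue", ""))
    if venue not in LAS_VEGAS_VENUES:
        return False
    city = event.get("city")
    return city is None or normalize(city) == "las vegas"
```

The second layer matters for brands that operate in more than one city, where the venue name alone is ambiguous.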
4 — Data Models & Storage
VegasEvent
MODEL

Pydantic v2 model with strict validation for event metadata and enrichment tracking.

  • artist, date, venue, image_urls
  • enrichment_status tracking
  • social_links metadata
  • Immutable model_copy() updates
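A possible shape for the model, assuming Pydantic v2 is installed. The field types, the `frozen=True` and `extra="forbid"` settings, and the dict-based `enrichment_status` are assumptions; only the field names come from the list above.

```python
import datetime as dt

from pydantic import BaseModel, ConfigDict, Field, ValidationError


class VegasEvent(BaseModel):
    # Assumed configuration: frozen instances, unknown fields rejected.
    model_config = ConfigDict(frozen=True, extra="forbid")

    artist: str = Field(min_length=1)
    date: dt.date
    venue: str
    image_urls: list[str] = Field(default_factory=list)
    enrichment_status: dict[str, str] = Field(default_factory=dict)
    social_links: dict[str, str] = Field(default_factory=dict)


event = VegasEvent(artist="DJ Example", date=dt.date(2026, 3, 7), venue="XS")

# model_copy returns a new object and leaves the original untouched.
# Note that values passed via update= are not re-validated by Pydantic.
enriched = event.model_copy(update={"enrichment_status": {"spotify": "done"}})

# mode="json" converts dates to ISO strings so json.dumps can handle them.
payload = enriched.model_dump(mode="json")
```

Because `model_copy(update=...)` skips validation, a plugin that builds update values from external data may want to construct a fresh `VegasEvent` instead.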
StorageManager
STORAGE

Manages timestamped run directories and raw scrape data. Each run captures full HTML and extracted JSON.

  • data/runs/{timestamp}/
  • raw_data.json — extracted events
  • Hourly/daily run history
  • Process tracking & stats
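Writing one run might look like the sketch below. The `save_run` name and the UTC timestamp format are assumptions; the document only fixes the `data/runs/{timestamp}/raw_data.json` layout.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def save_run(events, base_dir="data/runs", now=None):
    """Write extracted events to <base_dir>/<timestamp>/raw_data.json."""
    now = now or datetime.now(timezone.utc)
    # Assumed timestamp format; sortable and safe as a directory name.
    run_dir = Path(base_dir) / now.strftime("%Y%m%dT%H%M%SZ")
    # exist_ok=False so two runs can never silently share a directory.
    run_dir.mkdir(parents=True, exist_ok=False)
    path = run_dir / "raw_data.json"
    path.write_text(json.dumps(events, indent=2), encoding="utf-8")
    return path
```

A sortable timestamp keeps the run history in chronological order under a plain directory listing.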
MasterDatabase
DATABASE

Field-level diff tracking compares each new run against master_events.json and records what changed, when, and how.

  • master_events.json canonical source
  • FieldChange history per event
  • model_dump(mode='json') for serialization
  • Deduplication by (venue, artist, date)
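Field-level diffing keyed on (venue, artist, date) can be sketched with plain dicts. The `FieldChange` attributes and the `merge_run` helper are assumptions about shape, not the project's actual classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldChange:
    field: str
    old: object
    new: object
    changed_at: str


def event_key(event):
    """Deduplication key: one master record per (venue, artist, date)."""
    return (event["venue"], event["artist"], event["date"])


def diff_event(old, new, changed_at):
    """List every field whose value differs between two versions."""
    changes = []
    for field in sorted(set(old) | set(new)):
        if old.get(field) != new.get(field):
            changes.append(
                FieldChange(field, old.get(field), new.get(field), changed_at))
    return changes


def merge_run(master, run, changed_at):
    """Merge a run into the master dict and return per-event change lists."""
    history = {}
    for event in run:
        key = event_key(event)
        if key in master:
            changes = diff_event(master[key], event, changed_at)
            if changes:
                history[key] = changes
        master[key] = event
    return history
```

An event seen for the first time produces no `FieldChange` entries; it is simply added to the master under its key.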
Timestamped Runs
ARCHIVE

Each scrape run is archived with full context for audit trail and historical analysis.

  • Snapshot of all extracted events
  • Timestamps and metadata
  • Diffable against previous runs
  • Reference for re-enrichment
5 — Enrichment Plugins
Image Processor
PLUGIN

Downloads artist images and stores them in R2 with standardized paths and sizes.

  • data/artists/{slug}/
  • Venue-tagged filenames (ebc, xs, liv, livb)
  • 500px & 1500px variants
  • Dedup by (venue, artist, size)
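One way to derive standardized paths and skip already-stored variants is shown below. The `{venue}-{size}.{ext}` filename pattern, the slug rules, and both function names are assumptions; the document only specifies the `data/artists/{slug}/` directory, the venue tags, and the two sizes.

```python
import re

SIZES = (500, 1500)                       # pixel variants named above
VENUE_TAGS = {"ebc", "xs", "liv", "livb"}  # venue tags named above


def slugify(name):
    """Lowercase and replace runs of non-alphanumerics with a hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def image_paths(artist, venue_tag, ext="jpg"):
    """Return one path per size; the filename pattern is an assumption."""
    if venue_tag not in VENUE_TAGS:
        raise ValueError(f"unknown venue tag: {venue_tag}")
    slug = slugify(artist)
    return [f"data/artists/{slug}/{venue_tag}-{size}.{ext}" for size in SIZES]


def pending_sizes(artist, venue_tag, already_stored):
    """Dedup by (venue, artist, size): return only sizes not yet stored."""
    slug = slugify(artist)
    return [s for s in SIZES if (venue_tag, slug, s) not in already_stored]
```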
Table Pricing
PLUGIN

Fetches VIP table pricing from the urvenue API for LIV, LIV Beach, and TAO Group venues.

  • livnightclub.com wp-admin API
  • TAO: booketing.com proxy (VEN codes)
  • Bottle service pricing tiers
  • Per-event pricing variation
Artist Enrichment
PLUGIN

Augments artist metadata: Spotify tracks, Resident Advisor profiles, and DJ tracklists.

  • SpotifyEnricher — top tracks & links
  • ResidentAdvisorEnricher — bio & profile
  • TracklistEnricher — past set recordings
  • EnrichmentStatus tracking on events
Enrichment Registry
ORCHESTRATOR

Coordinates the run order of enrichment plugins with dependency management and error tracking.

  • Run order: Tracklists → RA → Spotify
  • Parallel execution where safe
  • Error recovery & retry logic
  • Status tracking per event
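The ordering and error tracking can be sketched as a sequential loop in which each enricher sees the output of the ones before it. The `run_enrichers` function, the dict-based events, and the status strings are assumptions; retries and parallel execution are left out of this sketch.

```python
def run_enrichers(event, enrichers):
    """Apply enrichers in order, recording a status per plugin.

    `enrichers` is an ordered list of (name, callable) pairs. Each callable
    takes the current event dict and returns a dict of fields to merge in.
    A failing enricher is recorded and skipped so later ones still run.
    """
    status = {}
    for name, enrich in enrichers:
        try:
            event = {**event, **enrich(event)}
            status[name] = "ok"
        except Exception as exc:  # keep going; the failure is tracked
            status[name] = f"error: {exc}"
    return event, status
```

Passing the list as `[("tracklists", ...), ("resident_advisor", ...), ("spotify", ...)]` reproduces the Tracklists → RA → Spotify order, so the Spotify step can read whatever the earlier steps added.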
6 — Export & Infrastructure
Multi-Format Export
EXPORT

Exports the master database to JSON, CSV, SQL DDL, and Markdown for downstream consumption.

  • events.json — full event data
  • events.csv — tabular format
  • D1 SQL — Cloudflare Workers database
  • Markdown venue pages
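The JSON and CSV outputs could be produced from the same list of event dicts, as in this standard-library sketch. The `FIELDS` column list and the function names are assumptions; the real export likely carries more columns.

```python
import csv
import io
import json

# Assumed CSV columns; nested fields such as image_urls are left to JSON.
FIELDS = ["venue", "artist", "date"]


def export_json(events):
    """Full event data, with stable key order for clean diffs."""
    return json.dumps(events, indent=2, sort_keys=True)


def export_csv(events):
    """Tabular view; keys outside FIELDS are ignored."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(events)
    return buf.getvalue()
```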
Cloudflare R2
CDN

Image CDN and object storage powered by Cloudflare's global edge network.

  • R2 bucket: vinny-vegas-images
  • Custom domain: img.vinny.vegas
  • Public URL distribution
  • High-speed image serving
Cloudflare D1
DATABASE

SQLite database for Astro site queries with automatic schema generation from exports.

  • SQL schema exported from master DB
  • Hourly/daily syncs
  • Worker queries in Astro site
  • Full-text search support
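Since D1 is SQLite-compatible, schema generation can be tried locally with Python's `sqlite3` module. The table layout, the primary key choice, and the helper names below are assumptions, and full-text search is not covered by this sketch.

```python
import sqlite3

# Assumed schema; the primary key mirrors the (venue, artist, date) dedup key.
SCHEMA = """
CREATE TABLE IF NOT EXISTS events (
    venue  TEXT NOT NULL,
    artist TEXT NOT NULL,
    date   TEXT NOT NULL,
    PRIMARY KEY (venue, artist, date)
);
"""


def insert_statement(event):
    """Return a parameterized upsert and its values for one event."""
    sql = ("INSERT OR REPLACE INTO events (venue, artist, date) "
           "VALUES (?, ?, ?)")
    return sql, (event["venue"], event["artist"], event["date"])


def load(conn, events):
    """Create the table if needed and upsert every event."""
    conn.executescript(SCHEMA)
    for event in events:
        sql, params = insert_statement(event)
        conn.execute(sql, params)
    conn.commit()
```

Running `load(sqlite3.connect(":memory:"), events)` before a sync is a cheap way to catch schema errors without touching the remote database.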
FastAPI Server
API

Lightweight REST API for accessing events, venues, and statistics. Deployed as an ASGI app.

  • GET /events — list all events
  • GET /venues — venue metadata
  • GET /stats — scrape statistics
  • GET /health — liveness check
7 — CLI
Cyclopts CLI
CLI

Command-line interface for manual scraping, enrichment, and data export workflows.

  • vinny scrape VENUE — run venue extractor
  • vinny enrich artists — artist metadata pipeline
  • vinny export — multi-format output
  • Sub-commands for detailed control
Hub-and-spoke pattern: src/cli.py is the central registry. Command implementations live in cli_scrape.py, cli_enrich.py, etc. Each sub-module exports functions registered via app.command().
Vinny Scraper — System Architecture · Generated 2026-03-06 · docs/diagrams/