Venue Extractors Reference

This document covers all venue extractors, their current status, URL patterns, and implementation details. For general plugin development patterns, see PLUGIN_DEVELOPMENT.md.

Table of Contents

  1. Extractor Architecture
  2. Registered Venues
  3. LIV Nightclub / LIV Beach
  4. XS Nightclub
  5. Encore Beach Club (EBC)
  6. TAO Group (Omnia, Hakkasan, Marquee, Jewel)
  7. Incremental Sitemap Scraping
  8. Adding a New Extractor

Extractor Architecture

flowchart TD
    A["Venue Sitemap URL"] --> B["Crawlee Sitemap Handler"]
    B --> C{"SitemapIndex.diff()"}
    C -->|new/updated| D["Enqueue event URLs"]
    C -->|unchanged| E["Skip"]
    D --> F{"ExtractorRegistry\nfirst match wins"}
    F -->|livnightclub.com| G["LIVExtractor"]
    F -->|wynnsocial.com| H{"Venue detection"}
    H -->|EBC page text| I["EBCExtractor"]
    H -->|default| J["XSExtractor"]
    F -->|taogroup.com| K["TaoGroupExtractor"]
    G --> L["VegasEvent"]
    I --> L
    J --> L
    K --> L
    L --> M["StorageManager\n+ MasterDatabase"]

    style A fill:#7c3aed,color:#fff
    style L fill:#059669,color:#fff
    style E fill:#6b7280,color:#fff

All extractors live in src/extractors/ and inherit from VenueExtractor (base in src/extractors/__init__.py).

Type safety: EventData

Every extractor builds its output as an EventData TypedDict and constructs the final object through the factory method — never VegasEvent(**dict) directly:

from src.models import EventData, VegasEvent

event_data: EventData = {
    "url": url, "scraped_at": scraped_at,
    "performer": performer, "venue": venue, "event_date": event_date,
}
# ... add optional fields conditionally ...
return VegasEvent.from_extractor_data(event_data)

ty (Astral's type checker, run in pre-commit) validates every key and value type at the extractor call site — misspelled field names or wrong types are caught before the commit lands. See DATA_MODEL.md § EventData TypedDict and § Type Safety Guarantees for the full list of what ty catches.
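As a rough illustration of the pattern above, the sketch below shows how a TypedDict with required and optional keys can feed a single factory method. The names `EventDataSketch`, `VegasEventSketch`, and `build_event`, the optional `image_url` field, and the plain `str` value types are assumptions made for illustration; the real definitions are the ones in src/models and DATA_MODEL.md.

```python
from dataclasses import dataclass
from typing import Optional, TypedDict


class _RequiredEventKeys(TypedDict):
    # Keys taken from the example above; string value types are assumed.
    url: str
    scraped_at: str
    performer: str
    venue: str
    event_date: str


class EventDataSketch(_RequiredEventKeys, total=False):
    # Hypothetical optional field, set only when the page provides it.
    image_url: str


@dataclass(frozen=True)
class VegasEventSketch:
    url: str
    scraped_at: str
    performer: str
    venue: str
    event_date: str
    image_url: Optional[str] = None

    @classmethod
    def from_extractor_data(cls, data: EventDataSketch) -> "VegasEventSketch":
        # Single construction point: a type checker can validate `data`
        # against the TypedDict before the keys are unpacked here.
        return cls(**data)


def build_event(url: str, scraped_at: str, performer: str, venue: str,
                event_date: str, image_url: Optional[str] = None) -> VegasEventSketch:
    event_data: EventDataSketch = {
        "url": url, "scraped_at": scraped_at,
        "performer": performer, "venue": venue, "event_date": event_date,
    }
    if image_url:  # optional fields are added conditionally
        event_data["image_url"] = image_url
    return VegasEventSketch.from_extractor_data(event_data)
```

With this shape, a misspelled key such as `"performr"` in the dict literal is reported by the type checker at the call site instead of surfacing at runtime.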

src/extractors/
├── __init__.py        # VenueExtractor base + ExtractorRegistry
├── liv.py             # LIV Nightclub + LIV Beach (Dayclub)
├── wynn.py            # WynnSocialBase (shared base for XS + EBC)
├── xs.py              # XS Nightclub
├── ebc.py             # Encore Beach Club + EBC at Night
└── tao.py             # TAO Group (Omnia, Hakkasan, Marquee, Jewel, etc.)

WynnSocial Shared Base

XS and EBC both inherit from WynnSocialBase in wynn.py, which provides shared logic for Schema.org JSON-LD parsing, table pricing extraction (uv_tablesitems), and URL pattern handling.

Registration order in create_default_registry() matters because the first matching extractor wins. XS and EBC both use wynnsocial.com, so EBCExtractor disambiguates via _detect_venue(), and pages it does not claim fall through to XSExtractor.
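The first-match-wins behaviour can be sketched as follows. This is a minimal stand-in, not the project's ExtractorRegistry: the class names, the `extract(url, page_text)` signature, and the rule that a `None` result lets the next extractor try are all assumptions inferred from the flowchart above.

```python
from typing import Optional
from urllib.parse import urlparse


class EBCSketch:
    domain = "wynnsocial.com"

    def extract(self, url: str, page_text: str) -> Optional[dict]:
        text = page_text.lower()
        # Check the longer name first so "at Night" pages are not mislabelled.
        if "encore beach club at night" in text:
            return {"venue": "Encore Beach Club at Night", "url": url}
        if "encore beach club" in text:
            return {"venue": "Encore Beach Club", "url": url}
        return None  # not an EBC page


class XSSketch:
    domain = "wynnsocial.com"

    def extract(self, url: str, page_text: str) -> Optional[dict]:
        return {"venue": "XS Nightclub", "url": url}


class RegistrySketch:
    def __init__(self) -> None:
        self._extractors: list = []

    def register(self, extractor) -> None:
        self._extractors.append(extractor)

    def extract(self, url: str, page_text: str) -> Optional[dict]:
        host = urlparse(url).hostname or ""
        for extractor in self._extractors:  # registration order is match order
            if host == extractor.domain or host.endswith("." + extractor.domain):
                result = extractor.extract(url, page_text)
                if result is not None:
                    return result
        return None


registry = RegistrySketch()
registry.register(EBCSketch())  # must precede XS, which accepts any Wynn page
registry.register(XSSketch())
```

If the two `register()` calls were swapped in this sketch, `XSSketch` would claim every wynnsocial.com page and the EBC branch would never run.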


Registered Venues

| Extractor | Venue(s) | Domain | Status |
| --- | --- | --- | --- |
| LIVExtractor | LIV Nightclub, LIV Beach | livnightclub.com | ✅ Production |
| XSExtractor | XS Nightclub | wynnsocial.com | ✅ Production |
| EBCExtractor | Encore Beach Club, EBC at Night | wynnsocial.com | 🚧 Validation in progress |
| TaoGroupExtractor | Omnia, Hakkasan, Marquee, Jewel + 5 dayclubs | taogroup.com | ✅ Phase 1+2 complete |

LIV Nightclub / LIV Beach

File: src/extractors/liv.py

  • Domain: livnightclub.com
  • Data format: Schema.org Event JSON-LD embedded in page
  • Venues handled: LIV Nightclub (nightclub) and LIV Beach (dayclub)
  • Table pricing: Separate urvenue API call — see TABLE_PRICING.md

URL Pattern

https://livnightclub.com/event/{slug}/

Key Extraction Logic

  1. Find <script type="application/ld+json"> with "@type": "Event"
  2. Extract performer, date, images from JSON-LD
  3. Fall back to og:image for images if JSON-LD image is missing
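The three steps above can be sketched with the standard library alone. The function names `find_event_jsonld` and `pick_image` are hypothetical, and the handling of `image` as a string, list, or object is an assumption about what Schema.org pages commonly emit, not a description of liv.py.

```python
import json
from html.parser import HTMLParser
from typing import Optional, Tuple


class _HeadParser(HTMLParser):
    """Collects JSON-LD script bodies and the og:image meta tag."""

    def __init__(self) -> None:
        super().__init__()
        self.jsonld_blocks: list = []
        self.og_image: Optional[str] = None
        self._in_jsonld = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "script" and attr.get("type") == "application/ld+json":
            self._in_jsonld = True
            self.jsonld_blocks.append("")
        elif tag == "meta" and attr.get("property") == "og:image":
            self.og_image = attr.get("content")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld:
            self.jsonld_blocks[-1] += data


def find_event_jsonld(html: str) -> Tuple[Optional[dict], Optional[str]]:
    """Return (first JSON-LD object with @type Event, og:image URL)."""
    parser = _HeadParser()
    parser.feed(html)
    for raw in parser.jsonld_blocks:
        try:
            doc = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the page
        for item in doc if isinstance(doc, list) else [doc]:
            if isinstance(item, dict) and item.get("@type") == "Event":
                return item, parser.og_image
    return None, parser.og_image


def pick_image(event: dict, og_image: Optional[str]) -> Optional[str]:
    """Prefer the JSON-LD image; fall back to og:image when it is missing."""
    image = event.get("image")
    if isinstance(image, list):
        image = image[0] if image else None
    if isinstance(image, dict):
        image = image.get("url")
    return image or og_image
```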

XS Nightclub

File: src/extractors/xs.py
Base class: WynnSocialBase

  • Domain: wynnsocial.com
  • Data format: Schema.org Event JSON-LD (same site as EBC)
  • Table pricing: uv_tablesitems JS variable embedded in page (no API call needed)

URL Pattern

https://www.wynnsocial.com/event/EVE{id}{YYYYMMDD}/{slug}/

The EVE segment encodes the date: last 8 chars are YYYYMMDD.
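A minimal sketch of reading the date back out of the EVE segment, assuming the segment is always followed by a slash and ends in exactly eight digits. The function name and the event ID in the usage line are made up for illustration.

```python
import re
from datetime import date, datetime
from typing import Optional

# "EVE", any ID characters, then the trailing 8-digit YYYYMMDD date.
_EVE_SEGMENT = re.compile(r"/event/(EVE\w*\d{8})/")


def event_date_from_wynn_url(url: str) -> Optional[date]:
    match = _EVE_SEGMENT.search(url)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1)[-8:], "%Y%m%d").date()
    except ValueError:
        return None  # eight digits that do not form a valid calendar date


# Hypothetical URL; the ID portion is invented for this example.
parsed = event_date_from_wynn_url(
    "https://www.wynnsocial.com/event/EVE11100020260320/some-artist/"
)
```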

Key Extraction Logic

  1. Find <script type="application/ld+json"> with "@type": "Event"
  2. Extract performer from performer[0].name, date from startDate
  3. Extract table pricing from inline uv_tablesitems JS var (WynnSocialBase.extract_table_pricing())

Encore Beach Club (EBC)

File: src/extractors/ebc.py
Base class: WynnSocialBase

  • Domain: wynnsocial.com (same as XS)
  • Data format: HTML-only — no Schema.org JSON-LD on EBC pages
  • Venues: Encore Beach Club (Dayclub) and Encore Beach Club at Night
  • Operator: Wynn Nightlife (same as XS)

URL Pattern

Same Wynn Social pattern as XS:

https://www.wynnsocial.com/event/EVE{id}{YYYYMMDD}/{slug}/

Venue Detection

Since EBC and XS share the same domain, EBCExtractor._detect_venue() inspects page text:

if "encore beach club at night" in page_text:
    return "Encore Beach Club at Night"
if "encore beach club" in page_text:
    return "Encore Beach Club"
return None  # Not an EBC page — let XS handle it

Table Pricing

EBC pages use the same uv_tablesitems JS variable as XS. WynnSocialBase.extract_table_pricing() handles both — no EBC-specific code needed.
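One way to pull an inline JS variable out of a page is sketched below. It assumes `uv_tablesitems` is assigned a JSON-compatible literal, which this document does not confirm; the function name and the sample field names (`name`, `price`) are invented, and WynnSocialBase.extract_table_pricing() may work differently.

```python
import json
import re
from typing import Any, Optional

_ASSIGNMENT = re.compile(r"\buv_tablesitems\s*=\s*")


def extract_uv_tablesitems(html: str) -> Optional[Any]:
    """Decode the JSON value assigned to uv_tablesitems, if present."""
    match = _ASSIGNMENT.search(html)
    if match is None:
        return None
    try:
        # raw_decode parses one JSON value starting at the given index and
        # ignores whatever follows it (for example the trailing semicolon).
        value, _end = json.JSONDecoder().raw_decode(html, match.end())
    except json.JSONDecodeError:
        return None  # assignment found but not a JSON-compatible literal
    return value


# Invented sample markup for illustration only.
sample = '<script>var uv_tablesitems = [{"name": "Dance Floor Table", "price": 2500}];</script>'
items = extract_uv_tablesitems(sample)
```

Using `raw_decode` avoids writing a regex that has to balance brackets, which tends to break on nested arrays or objects.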

EBC Rollout Phases

EBC is being rolled out in phases to avoid wasted effort. Test 5 events after each phase before proceeding.

| Phase | Issue | Description | Status |
| --- | --- | --- | --- |
| 1a | #85 | EBC at Night extractor validation (5-event test) | 🔲 Open |
| 1b | #86 | EBC Dayclub extractor validation (5-event test) | 🔲 Open |
|  | #87 | Artist info enrichment (description, streaming links) | 🔲 Open |
|  | #88 | Venue info enrichment (hours, capacity, dress code) | 🔲 Open |
| 2 | #89 | Table pricing extraction (5-event test) | 🔲 Open |
| 3 | #90 | Image extraction, full gallery (5-event test) | 🔲 Open |

Dependency chain: Phase 1 → Phase 2 → Phase 3 → Full calendar scrape

Testing EBC Events

# Test a single EBC event URL
just scrape -u "https://www.wynnsocial.com/event/EVE.../slug/" --max-requests 5

# Inspect extracted data
just list-runs
jq '.[0]' runs/latest/events.json

# Check table pricing
jq '.[0].table_pricing' runs/latest/events.json

# Check images
just images-download latest
just images-status latest

TAO Group (Omnia, Hakkasan, Marquee, Jewel)

File: src/extractors/tao.py

  • Domain: taogroup.com
  • Data format: Schema.org Event JSON-LD + og:title for performer name
  • Venues handled: 10 Las Vegas venues (5 nightclubs + 5 day/pool venues)
  • Table pricing: urvenue API via booketing.com proxy (same protocol as LIV, different base URL)
  • Sitemaps: events-sitemap4.xml and events-sitemap5.xml (2026 events only)

URL Pattern

https://taogroup.com/event/{M}-{D}-{YYYY}-{slug}/

Example: https://taogroup.com/event/3-20-2026-tyga-hakkasan-nightclub/
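A small sketch of splitting that URL into its date and slug, using the example above as input. The function name is hypothetical, and the pattern assumes the date always leads the final path segment as `M-D-YYYY`.

```python
import re
from datetime import date
from typing import Optional, Tuple
from urllib.parse import urlparse

# {M}-{D}-{YYYY}-{slug}, with month and day not zero-padded.
_TAO_EVENT_PATH = re.compile(r"^/event/(\d{1,2})-(\d{1,2})-(\d{4})-([^/]+)/?$")


def parse_tao_event_url(url: str) -> Optional[Tuple[date, str]]:
    match = _TAO_EVENT_PATH.match(urlparse(url).path)
    if match is None:
        return None
    month, day, year, slug = match.groups()
    try:
        return date(int(year), int(month), int(day)), slug
    except ValueError:
        return None  # for example month 13 or day 32


parsed = parse_tao_event_url(
    "https://taogroup.com/event/3-20-2026-tyga-hakkasan-nightclub/"
)
```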

Las Vegas Venues

| Venue | Type | Hotel | Venue Tag |
| --- | --- | --- | --- |
| Omnia Nightclub | Night | Caesars Palace | omn |
| Hakkasan Nightclub | Night | MGM Grand | hak |
| Marquee Nightclub | Night | Cosmopolitan | marq |
| Jewel Nightclub | Night | Aria | jwl |
| Marquee Dayclub | Day | Cosmopolitan | marqd |
| Tao Beach Dayclub | Day | Venetian | taob |
| Tao Nightclub | Night | Venetian | tao |
| Wet Republic Ultra Pool | Day | MGM Grand | wet |
| Palm Tree Beach Club | Day | Mandalay Bay | palm |
| Liquid Pool Lounge | Day | Aria | liq |

Key Extraction Logic

  1. Find <script type="application/ld+json"> with "@type": "Event"
  2. Extract date from startDate, time from startDate/endDate (ISO format)
  3. Performer: Parse from og:title ("M/D/YYYY - PERFORMER - VENUE") — JSON-LD performer.name is bugged (returns true)
  4. Venue: From JSON-LD location.name, strip - Las Vegas suffix, normalize casing
  5. Images: Prefer JSON-LD image (artist photo) over og:image (may be venue default)
  6. Non-LV filtering: Skip events where venue is not in _LAS_VEGAS_VENUES set (taogroup.com is global)
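Steps 3 and 4 can be sketched as two small helpers. Both function names are hypothetical, the sample title follows the "M/D/YYYY - PERFORMER - VENUE" shape quoted above, and casing normalization is left out because its rules are not described here.

```python
from typing import Optional

LAS_VEGAS_SUFFIX = " - Las Vegas"


def performer_from_og_title(og_title: str) -> Optional[str]:
    """Take everything between the leading date and the trailing venue."""
    parts = [part.strip() for part in og_title.split(" - ")]
    if len(parts) < 3:
        return None  # title does not follow the date/performer/venue shape
    # Re-join the middle so a performer containing " - " stays intact.
    return " - ".join(parts[1:-1]) or None


def strip_las_vegas_suffix(location_name: str) -> str:
    """Drop a trailing ' - Las Vegas' from a JSON-LD location.name value."""
    name = location_name.strip()
    if name.lower().endswith(LAS_VEGAS_SUFFIX.lower()):
        name = name[: -len(LAS_VEGAS_SUFFIX)].rstrip()
    return name


performer = performer_from_og_title("3/20/2026 - TYGA - HAKKASAN NIGHTCLUB")
venue = strip_las_vegas_suffix("Hakkasan Nightclub - Las Vegas")
```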

Quirks

  • Performer name bug: JSON-LD performer.name returns boolean true instead of the actual name. Must parse from og:title split on " - ".
  • Global domain: taogroup.com covers NYC, LA, Singapore venues too. URL-level pre-filter in crawlee_main.py uses _TAO_LV_VENUE_SLUGS allowlist.
  • Sitemap filtering: Only sitemaps 4+5 have 2026 events. Date and venue slug filters applied at the URL level before crawling.
  • Per-venue sitemap filtering: vinny scrape omnia filters sitemap URLs by slug before crawling. See CLI Aliases section below.
  • Booketing.com is urvenue: TAO pricing uses the same urvenue protocol as LIV, routed through booketing.com/uws/house/proxy with an extra manageentid=61 param. Venue codes: VEN1085 (Hakkasan), VEN1089 (Omnia), VEN1108 (Marquee).
  • Image CDN: WordPress wp-content URLs with Fastly IO resize (?width=N). Max 1080px original.

TAO Sitemap Filtering

TAO Group sitemaps cover venues globally (NYC, LA, Singapore). Vinny applies a two-level filter: only sitemaps 4+5 are fetched (2026 events), and per-venue aliases like vinny scrape omnia apply URL slug filters before crawling. The slug mapping lives in _TAO_ALIAS_SLUGS in src/cli_scrape.py.

CLI Aliases & Per-Venue Filtering

All TAO aliases point to the same two sitemaps, but per-venue aliases apply sitemap-level URL filtering so only matching events are crawled:

vinny scrape tao          # All TAO LV venues (no filter)
vinny scrape tao-group    # Same as tao
vinny scrape omnia        # Only URLs containing "omnia"
vinny scrape hakkasan     # Only "hakkasan-nightclub" URLs
vinny scrape marquee      # Both "marquee-nightclub" and "marquee-dayclub" URLs
vinny scrape jewel        # Only "jewel-nightclub" URLs
vinny scrape omnia hakkasan  # Both omnia + hakkasan URLs

Filtering happens in crawlee_main.py at the sitemap handler level — event pages for other venues are never downloaded. The slug mapping lives in _TAO_ALIAS_SLUGS in src/cli_scrape.py.
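The alias filtering described above might look roughly like this. `ALIAS_SLUGS_SKETCH` mirrors the comments in the command list and is not a copy of `_TAO_ALIAS_SLUGS`; the function name, the empty-tuple convention for "no filter", and the sample URLs other than the Tyga one are assumptions.

```python
from typing import Dict, Iterable, List, Tuple

# An empty tuple means "no filter" (assumed convention for tao / tao-group).
ALIAS_SLUGS_SKETCH: Dict[str, Tuple[str, ...]] = {
    "tao": (),
    "tao-group": (),
    "omnia": ("omnia",),
    "hakkasan": ("hakkasan-nightclub",),
    "marquee": ("marquee-nightclub", "marquee-dayclub"),
    "jewel": ("jewel-nightclub",),
}


def filter_sitemap_urls(urls: Iterable[str], aliases: Iterable[str]) -> List[str]:
    """Keep only URLs whose text contains a slug for one of the aliases."""
    url_list = list(urls)
    slugs: List[str] = []
    for alias in aliases:
        alias_slugs = ALIAS_SLUGS_SKETCH[alias]
        if not alias_slugs:
            return url_list  # an unfiltered alias keeps everything
        slugs.extend(alias_slugs)
    return [url for url in url_list if any(slug in url for slug in slugs)]


sitemap_urls = [
    "https://taogroup.com/event/3-20-2026-tyga-hakkasan-nightclub/",
    "https://taogroup.com/event/3-21-2026-example-dj-omnia-nightclub/",
    "https://taogroup.com/event/3-22-2026-example-dj-marquee-dayclub/",
]
omnia_only = filter_sitemap_urls(sitemap_urls, ["omnia"])
```

Because the filter runs on the sitemap URL list, pages for other venues never reach the request queue.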


Incremental Sitemap Scraping

Venues that use sitemaps (LIV, TAO Group) support incremental scraping — only new or updated event URLs are crawled on subsequent runs.

How It Works

Sitemap XML → Parse <loc> + <lastmod> pairs
                    ↓
            SitemapIndex.diff()
        ┌───────────┼───────────┐
       new       updated    unchanged
        ↓           ↓           ↓
     enqueue     enqueue      skip

  1. SitemapIndex (src/sitemap_index.py) stores URL + lastmod + scraped_at per event in data/sitemaps/{source_key}.json
  2. On each run, the sitemap handler parses <url>/<loc>/<lastmod> pairs and diffs against the stored index
  3. Only new and updated (lastmod changed) URLs are enqueued for scraping
  4. Past events (date parsed from URL < today) are auto-skipped
  5. All visited URLs are marked in the index after the crawl — even if extraction returns nothing (prevents infinite re-visits of unparseable pages)
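The diff in steps 2 and 3 reduces to a three-way classification, sketched below. `DiffSketch` and `diff_sitemap` are illustrative names, the index is simplified to a URL-to-lastmod mapping, and the real SitemapIndex and DiffResult models in src/sitemap_index.py may be shaped differently.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DiffSketch:
    new: List[str] = field(default_factory=list)
    updated: List[str] = field(default_factory=list)
    unchanged: List[str] = field(default_factory=list)

    def to_enqueue(self) -> List[str]:
        # Only new and updated URLs are crawled; unchanged ones are skipped.
        return self.new + self.updated


def diff_sitemap(stored: Dict[str, str], current: Dict[str, str]) -> DiffSketch:
    """Classify each URL in the fetched sitemap against the stored index.

    Both arguments map URL -> lastmod string.
    """
    result = DiffSketch()
    for url, lastmod in current.items():
        if url not in stored:
            result.new.append(url)
        elif stored[url] != lastmod:
            result.updated.append(url)
        else:
            result.unchanged.append(url)
    return result


stored_index = {"https://example.com/event/a/": "2026-03-01",
                "https://example.com/event/b/": "2026-03-01"}
fetched = {"https://example.com/event/a/": "2026-03-01",
           "https://example.com/event/b/": "2026-03-04",
           "https://example.com/event/c/": "2026-03-05"}
diff = diff_sitemap(stored_index, fetched)
```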

Source Keys

| Pattern in URL | Source Key | Index File |
| --- | --- | --- |
| taogroup.com/events-sitemap | tao-group | data/sitemaps/tao-group.json |
| livnightclub.com/events-sitemap | liv | data/sitemaps/liv.json |

CLI

# Normal run — only scrapes new/changed URLs
vinny scrape tao
vinny sync omnia

# Force full re-scrape (bypass diff)
vinny scrape tao --force
vinny sync omnia --force

# Check index status
vinny sitemap-status

Master DB Fallback in vinny sync

When vinny sync <venue> finds 0 new events (everything already indexed), it loads matching events from the master database so the rest of the pipeline (images → R2 → D1) still runs. This handles the common case of pushing already-scraped events through the full pipeline for the first time.

Key Files

  • src/sitemap_index.py: SitemapIndex, SitemapEntry, DiffResult models
  • src/crawlee_main.py: _SITEMAP_SOURCE_KEYS, sitemap handler with diff logic
  • src/cli_sitemap.py: vinny sitemap-status command
  • data/sitemaps/: persisted index JSON files

Adding a New Extractor

See PLUGIN_DEVELOPMENT.md for the full step-by-step guide and DATA_MODEL.md for the complete field reference.

Quick checklist:

  1. Create src/extractors/{venue}.py inheriting from VenueExtractor (or WynnSocialBase for Wynn properties)
  2. Implement name, domain, and extract()
  3. Register in create_default_registry() in src/extractors/__init__.py
  4. Test with 5 events before full calendar scrape
  5. Document in this file

Last updated: 2026-03-05