Venue Extractors Reference
This document covers all venue extractors, their current status, URL patterns, and implementation details. For general plugin development patterns, see PLUGIN_DEVELOPMENT.md.
Table of Contents
- Extractor Architecture
- Registered Venues
- LIV Nightclub / LIV Beach
- XS Nightclub
- Encore Beach Club (EBC)
- TAO Group (Omnia, Hakkasan, Marquee, Jewel)
- Adding a New Extractor
Extractor Architecture
flowchart TD
A["Venue Sitemap URL"] --> B["Crawlee Sitemap Handler"]
B --> C{"SitemapIndex.diff()"}
C -->|new/updated| D["Enqueue event URLs"]
C -->|unchanged| E["Skip"]
D --> F{"ExtractorRegistry\nfirst match wins"}
F -->|livnightclub.com| G["LIVExtractor"]
F -->|wynnsocial.com| H{"Venue detection"}
H -->|EBC page text| I["EBCExtractor"]
H -->|default| J["XSExtractor"]
F -->|taogroup.com| K["TaoGroupExtractor"]
G --> L["VegasEvent"]
I --> L
J --> L
K --> L
L --> M["StorageManager\n+ MasterDatabase"]
style A fill:#7c3aed,color:#fff
style L fill:#059669,color:#fff
style E fill:#6b7280,color:#fff
All extractors live in src/extractors/ and inherit from VenueExtractor (base in src/extractors/__init__.py).
Type safety: EventData
Every extractor builds its output as an EventData TypedDict and constructs the
final object through the factory method — never VegasEvent(**dict) directly:
from src.models import EventData, VegasEvent
event_data: EventData = {
"url": url, "scraped_at": scraped_at,
"performer": performer, "venue": venue, "event_date": event_date,
}
# ... add optional fields conditionally ...
return VegasEvent.from_extractor_data(event_data)
ty (Astral's type checker, run in pre-commit) validates every key and value type
at the extractor call site — misspelled field names or wrong types are caught before
the commit lands. See DATA_MODEL.md § EventData TypedDict
and § Type Safety Guarantees for the full list of what ty catches.
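As a rough illustration of what the type checker has to work with, the sketch below shows one way a TypedDict with required and optional keys can be laid out. It is a stand-in, not the definition in src/models: the class names, the helper build_event_data(), and every field type are assumptions, and only the key names come from the snippet above and from the table_pricing field used elsewhere in this document.

```python
from __future__ import annotations

from typing import TypedDict


class _RequiredEventFields(TypedDict):
    # Key names taken from the snippet above; the str types are assumptions.
    url: str
    scraped_at: str
    performer: str
    venue: str
    event_date: str


class EventDataSketch(_RequiredEventFields, total=False):
    # total=False makes the keys declared here optional.
    # The element type is a placeholder, not the real schema.
    table_pricing: list[dict]


def build_event_data(
    url: str,
    scraped_at: str,
    performer: str,
    venue: str,
    event_date: str,
    table_pricing: list[dict] | None = None,
) -> EventDataSketch:
    """Hypothetical helper: build the dict, adding optional keys only when present."""
    data: EventDataSketch = {
        "url": url,
        "scraped_at": scraped_at,
        "performer": performer,
        "venue": venue,
        "event_date": event_date,
    }
    if table_pricing:
        data["table_pricing"] = table_pricing
    # A static checker would reject a misspelled key such as data["performr"].
    return data
```

Splitting required and optional keys into two classes keeps the sketch usable on Python versions that predate typing.NotRequired.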
src/extractors/
├── __init__.py # VenueExtractor base + ExtractorRegistry
├── liv.py # LIV Nightclub + LIV Beach (Dayclub)
├── wynn.py # WynnSocialBase (shared base for XS + EBC)
├── xs.py # XS Nightclub
├── ebc.py # Encore Beach Club + EBC at Night
└── tao.py # TAO Group (Omnia, Hakkasan, Marquee, Jewel, etc.)
WynnSocial Shared Base
XS and EBC both inherit from WynnSocialBase in wynn.py, which provides shared logic for Schema.org JSON-LD parsing, table pricing extraction (uv_tablesitems), and URL pattern handling.
Registration order in create_default_registry() matters: first match wins. XS and EBC both use wynnsocial.com, but EBCExtractor handles disambiguation via _detect_venue().
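The first-match rule can be sketched as follows. This is a minimal model under stated assumptions, not the code in src/extractors/__init__.py: the class names, can_handle(), and find() are hypothetical, and the page text is assumed to be available when the registry is queried.

```python
from __future__ import annotations

from urllib.parse import urlparse


class ExtractorSketch:
    """Hypothetical stand-in for VenueExtractor."""

    name = "base"
    domain = ""

    def can_handle(self, url: str, page_text: str) -> bool:
        host = urlparse(url).netloc.lower()
        return host == self.domain or host.endswith("." + self.domain)


class EBCSketch(ExtractorSketch):
    name = "ebc"
    domain = "wynnsocial.com"

    def detect_venue(self, page_text: str) -> str | None:
        text = page_text.lower()
        if "encore beach club at night" in text:
            return "Encore Beach Club at Night"
        if "encore beach club" in text:
            return "Encore Beach Club"
        return None  # not an EBC page

    def can_handle(self, url: str, page_text: str) -> bool:
        # Claim the page only when the domain matches AND the text looks like EBC.
        return super().can_handle(url, page_text) and self.detect_venue(page_text) is not None


class XSSketch(ExtractorSketch):
    name = "xs"
    domain = "wynnsocial.com"


class RegistrySketch:
    def __init__(self) -> None:
        self._extractors: list[ExtractorSketch] = []

    def register(self, extractor: ExtractorSketch) -> None:
        self._extractors.append(extractor)

    def find(self, url: str, page_text: str = "") -> ExtractorSketch | None:
        # Registration order is the lookup order: the first match wins.
        for extractor in self._extractors:
            if extractor.can_handle(url, page_text):
                return extractor
        return None


registry = RegistrySketch()
registry.register(EBCSketch())  # must come before XS, which accepts any wynnsocial.com page
registry.register(XSSketch())
```

If the two registrations were swapped in this sketch, the XS stand-in would claim every wynnsocial.com page and the EBC stand-in would never run.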
Registered Venues
| Extractor | Venue(s) | Domain | Status |
|---|---|---|---|
| LIVExtractor | LIV Nightclub, LIV Beach | livnightclub.com | ✅ Production |
| XSExtractor | XS Nightclub | wynnsocial.com | ✅ Production |
| EBCExtractor | Encore Beach Club, EBC at Night | wynnsocial.com | 🚧 Validation in progress |
| TaoGroupExtractor | Omnia, Hakkasan, Marquee, Jewel + 5 dayclubs | taogroup.com | ✅ Phase 1+2 complete |
LIV Nightclub / LIV Beach
File: src/extractors/liv.py
- Domain: livnightclub.com
- Data format: Schema.org Event JSON-LD embedded in page
- Venues handled: LIV Nightclub (nightclub) and LIV Beach (dayclub)
- Table pricing: Separate urvenue API call — see TABLE_PRICING.md
URL Pattern
Key Extraction Logic
- Find <script type="application/ld+json"> with "@type": "Event"
- Extract performer, date, images from JSON-LD
- Fall back to og:image for images if JSON-LD image is missing
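The steps above can be sketched with the standard library alone. This is an illustration rather than the code in liv.py: the function name, the returned keys, and the choice of html.parser are all assumptions.

```python
from __future__ import annotations

import json
from html.parser import HTMLParser


class _LdJsonCollector(HTMLParser):
    """Collects ld+json script bodies and the og:image meta tag."""

    def __init__(self) -> None:
        super().__init__()
        self.blocks: list[str] = []
        self.og_image: str | None = None
        self._in_ld_json = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "script" and attr.get("type") == "application/ld+json":
            self._in_ld_json = True
            self.blocks.append("")
        elif tag == "meta" and attr.get("property") == "og:image":
            self.og_image = attr.get("content")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_ld_json = False

    def handle_data(self, data):
        if self._in_ld_json:
            self.blocks[-1] += data


def extract_event(html: str) -> dict | None:
    """Return a few fields from the first JSON-LD node whose @type is Event."""
    collector = _LdJsonCollector()
    collector.feed(html)
    for raw in collector.blocks:
        try:
            doc = json.loads(raw)
        except json.JSONDecodeError:
            continue  # ignore malformed blocks
        nodes = doc if isinstance(doc, list) else [doc]
        for node in nodes:
            if isinstance(node, dict) and node.get("@type") == "Event":
                return {
                    "name": node.get("name"),
                    "start_date": node.get("startDate"),
                    # Fall back to og:image when JSON-LD has no image.
                    "image": node.get("image") or collector.og_image,
                }
    return None
```

The whole page is parsed before any block is inspected, so the fallback works whether the og:image tag appears before or after the script element.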
XS Nightclub
File: src/extractors/xs.py
Base class: WynnSocialBase
- Domain: wynnsocial.com
- Data format: Schema.org Event JSON-LD (same site as EBC)
- Table pricing: uv_tablesitems JS variable embedded in page (no API call needed)
URL Pattern
The EVE segment encodes the date: last 8 chars are YYYYMMDD.
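A minimal sketch of that date rule follows. The function name is hypothetical, and the sample identifier used in the usage note is made up for illustration; only the "last 8 characters are YYYYMMDD" rule comes from this document.

```python
from datetime import date, datetime


def date_from_eve_segment(segment: str) -> date:
    """Parse the event date from the trailing YYYYMMDD of an EVE segment."""
    if not segment.startswith("EVE") or len(segment) < len("EVE") + 8:
        raise ValueError(f"not an EVE segment: {segment!r}")
    # strptime raises ValueError when the last 8 chars are not a valid date.
    return datetime.strptime(segment[-8:], "%Y%m%d").date()
```

For example, a made-up segment such as "EVE000020260320" would parse to 20 March 2026 under this rule.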
Key Extraction Logic
- Find <script type="application/ld+json"> with "@type": "Event"
- Extract performer from performer[0].name, date from startDate
- Extract table pricing from inline uv_tablesitems JS var (WynnSocialBase.extract_table_pricing())
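One way to read an inline JS variable without a network call is sketched below. It is not the body of WynnSocialBase.extract_table_pricing(): the function name is hypothetical, and the sketch assumes the assigned value is valid JSON, which a real page may not guarantee. The field names in the usage note are invented.

```python
from __future__ import annotations

import json
import re

# Matches the assignment up to and including any whitespace after "=".
_ASSIGNMENT = re.compile(r"\buv_tablesitems\s*=\s*")


def extract_uv_tablesitems(html: str):
    """Return the decoded value assigned to uv_tablesitems, or None."""
    match = _ASSIGNMENT.search(html)
    if match is None:
        return None
    try:
        # raw_decode parses one JSON value starting at the given index and
        # ignores whatever follows it (such as the trailing semicolon).
        value, _end = json.JSONDecoder().raw_decode(html, match.end())
    except json.JSONDecodeError:
        return None  # assumption violated: the literal is not valid JSON
    return value
```

Given a page containing `var uv_tablesitems = [{"name": "Sample Table", "price": 1500}];`, the function returns a one-element list of dicts. Using raw_decode avoids having to guess where the literal ends with a regular expression.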
Encore Beach Club (EBC)
File: src/extractors/ebc.py
Base class: WynnSocialBase
- Domain: wynnsocial.com (same as XS)
- Data format: HTML-only (no Schema.org JSON-LD on EBC pages)
- Venues: Encore Beach Club (Dayclub) and Encore Beach Club at Night
- Operator: Wynn Nightlife (same as XS)
URL Pattern
Same Wynn Social pattern as XS:
Venue Detection
Since EBC and XS share the same domain, EBCExtractor._detect_venue() inspects page text:
if "encore beach club at night" in page_text:
return "Encore Beach Club at Night"
if "encore beach club" in page_text:
return "Encore Beach Club"
return None # Not an EBC page — let XS handle it
Table Pricing
EBC pages use the same uv_tablesitems JS variable as XS. WynnSocialBase.extract_table_pricing() handles both — no EBC-specific code needed.
EBC Rollout Phases
EBC is being rolled out in phases to avoid wasted effort. Test 5 events after each phase before proceeding.
| Phase | Issue | Description | Status |
|---|---|---|---|
| 1a | #85 | EBC at Night extractor validation (5-event test) | 🔲 Open |
| 1b | #86 | EBC Dayclub extractor validation (5-event test) | 🔲 Open |
| — | #87 | Artist info enrichment (description, streaming links) | 🔲 Open |
| — | #88 | Venue info enrichment (hours, capacity, dress code) | 🔲 Open |
| 2 | #89 | Table pricing extraction (5-event test) | 🔲 Open |
| 3 | #90 | Image extraction — full gallery (5-event test) | 🔲 Open |
Dependency chain: Phase 1 → Phase 2 → Phase 3 → Full calendar scrape
Testing EBC Events
# Test a single EBC event URL
just scrape -u "https://www.wynnsocial.com/event/EVE.../slug/" --max-requests 5
# Inspect extracted data
just list-runs
cat runs/latest/events.json | jq '.[0]'
# Check table pricing
cat runs/latest/events.json | jq '.[0].table_pricing'
# Check images
just images-download latest
just images-status latest
TAO Group (Omnia, Hakkasan, Marquee, Jewel)
File: src/extractors/tao.py
- Domain: taogroup.com
- Data format: Schema.org Event JSON-LD + og:title for performer name
- Venues handled: 10 Las Vegas venues (5 nightclubs + 5 day/pool venues)
- Table pricing: urvenue API via booketing.com proxy (same protocol as LIV, different base URL)
- Sitemaps: events-sitemap4.xml and events-sitemap5.xml (2026 events only)
URL Pattern
Example: https://taogroup.com/event/3-20-2026-tyga-hakkasan-nightclub/
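Judging from the example URL, the slug appears to begin with the event date as M-D-YYYY. The sketch below parses that prefix so a URL-level date filter can run before any page is fetched. The regular expression and function names are assumptions inferred from this single example, not the pattern used in crawlee_main.py.

```python
from __future__ import annotations

import re
from datetime import date

# Assumed shape: /event/<M>-<D>-<YYYY>-<rest-of-slug>/
_EVENT_SLUG = re.compile(r"/event/(\d{1,2})-(\d{1,2})-(\d{4})-([^/]+)/?$")


def parse_tao_event_url(url: str) -> tuple[date, str] | None:
    """Return (event date, remaining slug), or None if the URL does not match."""
    match = _EVENT_SLUG.search(url)
    if match is None:
        return None
    month, day, year, rest = match.groups()
    try:
        return date(int(year), int(month), int(day)), rest
    except ValueError:
        return None  # digits present but not a real calendar date


def is_past_event(url: str, today: date) -> bool:
    """True only when the URL carries a parseable date earlier than today."""
    parsed = parse_tao_event_url(url)
    return parsed is not None and parsed[0] < today
```

URLs that do not match are treated as "not past" here, so an unexpected slug format is crawled rather than silently dropped.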
Las Vegas Venues
| Venue | Type | Hotel | Venue Tag |
|---|---|---|---|
| Omnia Nightclub | Night | Caesars Palace | omn |
| Hakkasan Nightclub | Night | MGM Grand | hak |
| Marquee Nightclub | Night | Cosmopolitan | marq |
| Jewel Nightclub | Night | Aria | jwl |
| Marquee Dayclub | Day | Cosmopolitan | marqd |
| Tao Beach Dayclub | Day | Venetian | taob |
| Tao Nightclub | Night | Venetian | tao |
| Wet Republic Ultra Pool | Day | MGM Grand | wet |
| Palm Tree Beach Club | Day | Mandalay Bay | palm |
| Liquid Pool Lounge | Day | Aria | liq |
Key Extraction Logic
- Find <script type="application/ld+json"> with "@type": "Event"
- Extract date from startDate, time from startDate/endDate (ISO format)
- Performer: Parse from og:title ("M/D/YYYY - PERFORMER - VENUE"); JSON-LD performer.name is bugged (returns true)
- Venue: From JSON-LD location.name, strip the " - Las Vegas" suffix, normalize casing
- Images: Prefer JSON-LD image (artist photo) over og:image (may be venue default)
- Non-LV filtering: Skip events where venue is not in _LAS_VEGAS_VENUES set (taogroup.com is global)
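The performer, venue, and filtering steps above might look roughly like this. The function names are hypothetical, title-casing is only one possible reading of "normalize casing", and the venue set is copied from the Las Vegas Venues table rather than from the real _LAS_VEGAS_VENUES constant.

```python
from __future__ import annotations

# Names copied from the Las Vegas Venues table; the real set may differ.
LAS_VEGAS_VENUES_SKETCH = {
    "Omnia Nightclub", "Hakkasan Nightclub", "Marquee Nightclub",
    "Jewel Nightclub", "Marquee Dayclub", "Tao Beach Dayclub",
    "Tao Nightclub", "Wet Republic Ultra Pool", "Palm Tree Beach Club",
    "Liquid Pool Lounge",
}

_LV_SUFFIX = " - Las Vegas"


def parse_og_title(title: str) -> tuple[str, str, str] | None:
    """Split "M/D/YYYY - PERFORMER - VENUE" into (date, performer, venue)."""
    parts = [part.strip() for part in title.split(" - ")]
    if len(parts) < 3:
        return None
    # Anything between the first and last part is treated as the performer,
    # so a performer name containing " - " is kept intact.
    return parts[0], " - ".join(parts[1:-1]), parts[-1]


def normalize_venue(location_name: str) -> str:
    """Strip the " - Las Vegas" suffix and title-case the remainder."""
    name = location_name.strip()
    if name.lower().endswith(_LV_SUFFIX.lower()):
        name = name[: -len(_LV_SUFFIX)]
    return name.strip().title()


def is_las_vegas_venue(location_name: str) -> bool:
    return normalize_venue(location_name) in LAS_VEGAS_VENUES_SKETCH
```

Title-casing would mangle names with unusual capitalization, so a lookup table keyed on the lowercased name may be a safer choice in practice.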
Quirks
- Performer name bug: JSON-LD performer.name returns boolean true instead of the actual name. Must parse from og:title split on " - ".
- Global domain: taogroup.com covers NYC, LA, Singapore venues too. URL-level pre-filter in crawlee_main.py uses _TAO_LV_VENUE_SLUGS allowlist.
- Sitemap filtering: Only sitemaps 4+5 have 2026 events. Date and venue slug filters applied at the URL level before crawling.
- Per-venue sitemap filtering: vinny scrape omnia filters sitemap URLs by slug before crawling. See CLI Aliases section below.
- Booketing.com is urvenue: TAO pricing uses the same urvenue protocol as LIV, routed through booketing.com/uws/house/proxy with an extra manageentid=61 param. Venue codes: VEN1085 (Hakkasan), VEN1089 (Omnia), VEN1108 (Marquee).
- Image CDN: WordPress wp-content URLs with Fastly IO resize (?width=N). Max 1080px original.
TAO Sitemap Filtering
TAO Group sitemaps cover venues globally (NYC, LA, Singapore). Vinny applies a two-level filter: only sitemaps 4+5 are fetched (2026 events), and per-venue aliases like vinny scrape omnia apply URL slug filters before crawling. The slug mapping lives in _TAO_ALIAS_SLUGS in src/cli_scrape.py.
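For the Image CDN quirk in the list above, a resize URL can be built by setting the width query parameter. The helper below is a hypothetical sketch: the function name is invented, and capping requests at 1080 is one interpretation of "Max 1080px original", not documented behaviour of the CDN.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

MAX_ORIGINAL_WIDTH = 1080  # from the quirk note; treated here as an upper bound


def resized_image_url(url: str, width: int) -> str:
    """Return the URL with ?width=N set, capped at the original size."""
    if width <= 0:
        raise ValueError("width must be positive")
    width = min(width, MAX_ORIGINAL_WIDTH)
    parts = urlsplit(url)
    # Drop any existing width parameter, keep everything else.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != "width"]
    query.append(("width", str(width)))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

Requesting more than 1080 pixels would only upscale the stored original, which is why the sketch clamps the value instead of passing it through.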
CLI Aliases & Per-Venue Filtering
All TAO aliases point to the same two sitemaps, but per-venue aliases apply sitemap-level URL filtering so only matching events are crawled:
vinny scrape tao # All TAO LV venues (no filter)
vinny scrape tao-group # Same as tao
vinny scrape omnia # Only URLs containing "omnia"
vinny scrape hakkasan # Only "hakkasan-nightclub" URLs
vinny scrape marquee # Both "marquee-nightclub" and "marquee-dayclub" URLs
vinny scrape jewel # Only "jewel-nightclub" URLs
vinny scrape omnia hakkasan # Both omnia + hakkasan URLs
Filtering happens in crawlee_main.py at the sitemap handler level — event pages for other venues are never downloaded. The slug mapping lives in _TAO_ALIAS_SLUGS in src/cli_scrape.py.
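The alias-to-slug filtering can be pictured as a substring match over sitemap URLs. The mapping below is reconstructed from the comments in the command list above and is not a copy of _TAO_ALIAS_SLUGS; the function name is hypothetical.

```python
from __future__ import annotations

# Reconstructed from the CLI comments above; the real mapping may differ.
ALIAS_SLUGS_SKETCH: dict[str, tuple[str, ...]] = {
    "omnia": ("omnia",),
    "hakkasan": ("hakkasan-nightclub",),
    "marquee": ("marquee-nightclub", "marquee-dayclub"),
    "jewel": ("jewel-nightclub",),
}


def filter_sitemap_urls(urls: list[str], aliases: list[str]) -> list[str]:
    """Keep only URLs containing a slug for one of the requested aliases."""
    slugs = [slug for alias in aliases for slug in ALIAS_SLUGS_SKETCH.get(alias, ())]
    if not slugs:
        # Aliases without slugs (tao, tao-group) mean "no filter".
        return list(urls)
    return [url for url in urls if any(slug in url for slug in slugs)]
```

Because the filter runs on sitemap URLs, pages for other venues are excluded before any request is made, which matches the behaviour described above.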
Incremental Sitemap Scraping
Venues that use sitemaps (LIV, TAO Group) support incremental scraping — only new or updated event URLs are crawled on subsequent runs.
How It Works
Sitemap XML → Parse <loc> + <lastmod> pairs
↓
SitemapIndex.diff()
↓
┌───────────┼───────────┐
new updated unchanged
↓ ↓ ↓
enqueue enqueue skip
- SitemapIndex (src/sitemap_index.py) stores URL + lastmod + scraped_at per event in data/sitemaps/{source_key}.json
- On each run, the sitemap handler parses <url>/<loc>/<lastmod> pairs and diffs against the stored index
- Only new and updated (lastmod changed) URLs are enqueued for scraping
- Past events (date parsed from URL < today) are auto-skipped
- All visited URLs are marked in the index after the crawl — even if extraction returns nothing (prevents infinite re-visits of unparseable pages)
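The parse-and-diff step described above can be sketched with the standard library. The names parse_sitemap, DiffSketch, and diff_sitemap are hypothetical and only approximate what SitemapIndex.diff() and DiffResult do; the stored index is modelled as a plain dict of URL to lastmod.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

# Namespace defined by the sitemaps.org protocol.
_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text: str) -> dict[str, str | None]:
    """Map each <loc> to its <lastmod> (None when absent)."""
    root = ET.fromstring(xml_text)
    entries: dict[str, str | None] = {}
    for url_el in root.findall("sm:url", _NS):
        loc = url_el.findtext("sm:loc", default=None, namespaces=_NS)
        lastmod = url_el.findtext("sm:lastmod", default=None, namespaces=_NS)
        if loc:
            entries[loc.strip()] = lastmod.strip() if lastmod else None
    return entries


@dataclass
class DiffSketch:
    new: list[str] = field(default_factory=list)
    updated: list[str] = field(default_factory=list)
    unchanged: list[str] = field(default_factory=list)


def diff_sitemap(stored: dict[str, str | None], current: dict[str, str | None]) -> DiffSketch:
    """Classify current URLs against the stored index."""
    result = DiffSketch()
    for loc, lastmod in current.items():
        if loc not in stored:
            result.new.append(loc)
        elif stored[loc] != lastmod:
            result.updated.append(loc)  # lastmod changed since the last run
        else:
            result.unchanged.append(loc)
    return result
```

Only the new and updated lists would be enqueued; a --force run could be modelled by passing an empty dict as the stored index so every URL counts as new.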
Source Keys
| Pattern in URL | Source Key | Index File |
|---|---|---|
| taogroup.com/events-sitemap | tao-group | data/sitemaps/tao-group.json |
| livnightclub.com/events-sitemap | liv | data/sitemaps/liv.json |
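Resolving a sitemap URL to its index file can be sketched as a substring lookup over the table above. The dict and function names here are hypothetical stand-ins for _SITEMAP_SOURCE_KEYS and whatever code consumes it.

```python
from __future__ import annotations

from pathlib import Path

# Rows copied from the Source Keys table.
SITEMAP_SOURCE_KEYS_SKETCH = {
    "taogroup.com/events-sitemap": "tao-group",
    "livnightclub.com/events-sitemap": "liv",
}


def index_path_for(sitemap_url: str, base_dir: str = "data/sitemaps") -> Path | None:
    """Return the index file for a sitemap URL, or None for unknown sources."""
    for pattern, source_key in SITEMAP_SOURCE_KEYS_SKETCH.items():
        if pattern in sitemap_url:
            return Path(base_dir) / f"{source_key}.json"
    return None
```

Under this sketch both TAO sitemaps share one key, so events-sitemap4.xml and events-sitemap5.xml would resolve to the same data/sitemaps/tao-group.json file.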
CLI
# Normal run — only scrapes new/changed URLs
vinny scrape tao
vinny sync omnia
# Force full re-scrape (bypass diff)
vinny scrape tao --force
vinny sync omnia --force
# Check index status
vinny sitemap-status
Master DB Fallback in vinny sync
When vinny sync <venue> finds 0 new events (everything already indexed), it loads matching events from the master database so the rest of the pipeline (images → R2 → D1) still runs. This handles the common case of pushing already-scraped events through the full pipeline for the first time.
Key Files
- src/sitemap_index.py: SitemapIndex, SitemapEntry, DiffResult models
- src/crawlee_main.py: _SITEMAP_SOURCE_KEYS, sitemap handler with diff logic
- src/cli_sitemap.py: vinny sitemap-status command
- data/sitemaps/: persisted index JSON files
Adding a New Extractor
See PLUGIN_DEVELOPMENT.md for the full step-by-step guide and DATA_MODEL.md for the complete field reference.
Quick checklist:
- Create src/extractors/{venue}.py inheriting from VenueExtractor (or WynnSocialBase for Wynn properties)
- Implement name, domain, and extract()
- Register in create_default_registry() in src/extractors/__init__.py
- Test with 5 events before full calendar scrape
- Document in this file
Last updated: 2026-03-05