Extractor Routing

How URLs flow from sitemaps to venue-specific extractors

1 — URL Routing Pipeline

graph TD
  A["Venue Sitemap URL"] --> B["Parse Sitemap"]
  B --> C{"SitemapIndex.diff()"}
  C -->|"New / Updated"| D["Filter Past Events"]
  C -->|"Unchanged"| SKIP["Skip — no re-crawl"]
  D -->|"date >= today"| E["ExtractorRegistry\n.get_extractor(url)"]
  D -->|"date < today"| SKIP2["Skip — past event"]
  E --> F["Venue Extractor\n.extract(soup, url)"]
  F --> G["VegasEvent"]

  classDef default fill:#22d3ee11,stroke:#22d3ee44,stroke-width:1.5px
  classDef decision fill:#fbbf2411,stroke:#fbbf2444,stroke-width:1.5px
  classDef skip fill:#fb718511,stroke:#fb718544,stroke-width:1.5px
  classDef output fill:#34d39911,stroke:#34d39944,stroke-width:2px

  class C decision
  class SKIP,SKIP2 skip
  class G output

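Registration order matters. ExtractorRegistry iterates in registration order: LIVExtractor → XSExtractor → EBCExtractor → TaoGroupExtractor. The first extractor whose can_handle(url) returns True wins. On wynnsocial.com, where XSExtractor and EBCExtractor share a domain, EBCExtractor is tried only if XSExtractor returns None (see section 3).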
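The first-match rule can be sketched as follows. This is a minimal illustration, not the project's code: the `DomainExtractor` class, the `register()` method, and the hostname-suffix check are assumptions; only the extractor names, domains, and `get_extractor(url)` / `can_handle(url)` come from this page.

```python
from typing import Optional
from urllib.parse import urlparse


class DomainExtractor:
    """Hypothetical extractor that claims every URL on one domain."""

    def __init__(self, name: str, domain: str) -> None:
        self.name = name
        self.domain = domain

    def can_handle(self, url: str) -> bool:
        host = urlparse(url).hostname or ""
        # Accept the bare domain and any subdomain such as "www.".
        return host == self.domain or host.endswith("." + self.domain)


class ExtractorRegistry:
    def __init__(self) -> None:
        self._extractors: list[DomainExtractor] = []

    def register(self, extractor: DomainExtractor) -> None:
        # Assumed API: order of registration is order of priority.
        self._extractors.append(extractor)

    def get_extractor(self, url: str) -> Optional[DomainExtractor]:
        # First extractor whose can_handle(url) is True wins.
        for extractor in self._extractors:
            if extractor.can_handle(url):
                return extractor
        return None


registry = ExtractorRegistry()
for name, domain in [
    ("LIVExtractor", "livnightclub.com"),
    ("XSExtractor", "wynnsocial.com"),
    ("EBCExtractor", "wynnsocial.com"),
    ("TaoGroupExtractor", "taogroup.com"),
]:
    registry.register(DomainExtractor(name, domain))

chosen = registry.get_extractor(
    "https://www.wynnsocial.com/event/EVE111500020260531/some-event/"
)
print(chosen.name)
```

Because XSExtractor and EBCExtractor both accept wynnsocial.com in this sketch, swapping their registration order would change which one is returned.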

2 — Domain Routing

LIVExtractor

livnightclub.com

JSON-LD parsing (primary)
HTML fallback
VEA image extraction
urvenue table pricing API
LIV Las Vegas + LIV Beach

WynnSocialBase

wynnsocial.com

Shared base for XS + EBC
uv_tablesitems pricing
EVE URL segments for IDs
tixr.com ticketing links

TaoGroupExtractor

taogroup.com

JSON-LD + og:title parsing
Las Vegas venue filter
10 venues (Omnia, Hakkasan...)
booketing.com proxy pricing

3 — WynnSocial: XS vs EBC Routing

Both XS Nightclub and Encore Beach Club share wynnsocial.com. The registry disambiguates via registration order and venue detection:

graph LR
  URL["wynnsocial.com/event/..."] --> XS{"XSExtractor\ncan_handle?"}
  XS -->|"Yes"| JSONLD{"Has JSON-LD?"}
  JSONLD -->|"Yes"| XS_OUT["XS Event"]
  JSONLD -->|"No"| EBC_TRY["Try EBCExtractor"]
  XS -->|"No"| EBC_TRY
  EBC_TRY --> DETECT{"_detect_venue(soup)\ntext search"}
  DETECT -->|"'encore beach club' found"| EBC_OUT["EBC Event"]
  DETECT -->|"not found"| NONE["None — skip"]

  classDef default fill:#22d3ee11,stroke:#22d3ee44,stroke-width:1.5px
  classDef decision fill:#fbbf2411,stroke:#fbbf2444,stroke-width:1.5px
  classDef xs fill:#34d39911,stroke:#34d39944,stroke-width:1.5px
  classDef ebc fill:#a78bfa11,stroke:#a78bfa44,stroke-width:1.5px
  classDef skip fill:#fb718511,stroke:#fb718544,stroke-width:1.5px

  class XS,JSONLD,DETECT decision
  class XS_OUT xs
  class EBC_OUT ebc
  class NONE skip

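Why XS first? XS pages always carry JSON-LD (Schema.org Event) markup, so extraction is clean and reliable. EBC pages lack JSON-LD and require HTML text inspection via _detect_venue(soup). Trying XS first means the looser text match runs only when the structured check finds nothing, which avoids false positives.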
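A rough sketch of this two-step check, using only the standard library in place of the `soup` object the extractors receive. The `_PageScan` helper, `route_wynn_page()`, the `"@type": "Event"` test, and the sample pages are all assumptions made for illustration; the real `_detect_venue(soup)` may differ.

```python
import json
from html.parser import HTMLParser
from typing import Optional


class _PageScan(HTMLParser):
    """Collects JSON-LD script bodies and visible text from an HTML string."""

    def __init__(self) -> None:
        super().__init__()
        self.json_ld: list[str] = []
        self.text: list[str] = []
        self._mode = "text"  # "text", "json_ld", or "skip" (other scripts)

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            is_ld = dict(attrs).get("type") == "application/ld+json"
            self._mode = "json_ld" if is_ld else "skip"
            if is_ld:
                self.json_ld.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._mode = "text"

    def handle_data(self, data):
        if self._mode == "json_ld":
            self.json_ld[-1] += data
        elif self._mode == "text":
            self.text.append(data)


def _has_event_json_ld(scan: _PageScan) -> bool:
    for raw in scan.json_ld:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue
        # Assumption: an XS page exposes a top-level Schema.org Event object.
        if isinstance(data, dict) and data.get("@type") == "Event":
            return True
    return False


def route_wynn_page(html: str) -> Optional[str]:
    """Return the venue a wynnsocial.com page belongs to, or None to skip."""
    scan = _PageScan()
    scan.feed(html)
    scan.close()
    if _has_event_json_ld(scan):          # structured check first (XS)
        return "XS Nightclub"
    page_text = " ".join(" ".join(scan.text).split()).lower()
    if "encore beach club" in page_text:  # text fallback (EBC)
        return "Encore Beach Club"
    return None


xs_page = ('<html><head><script type="application/ld+json">'
           '{"@type": "Event", "name": "Some Event"}</script></head>'
           '<body><h1>Some Event</h1></body></html>')
ebc_page = "<html><body><h1>Pool Party</h1><p>Encore Beach Club</p></body></html>"
other_page = "<html><body><p>Nothing relevant</p></body></html>"

print(route_wynn_page(xs_page), route_wynn_page(ebc_page), route_wynn_page(other_page))
```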

4 — TAO Group Venue Filtering

TAO Group sitemaps contain all global venues (NYC, LA, Singapore, LV). Three filter layers ensure only Las Vegas events are scraped:

Layer	Location	Purpose
CLI Alias	`src/cli_scrape.py`	Maps CLI alias (e.g., "omnia") to URL slugs via `_TAO_ALIAS_SLUGS`
Crawlee Pre-filter	`src/crawlee_main.py`	Only enqueues sitemap URLs matching `_TAO_LV_VENUE_SLUGS`
Extractor Validation	`src/extractors/tao.py`	Post-filter: validates venue name against `_LAS_VEGAS_VENUES` set

_LAS_VEGAS_VENUES = {
  "Hakkasan Nightclub",
  "Omnia Nightclub",
  "Marquee Nightclub",
  "Marquee Dayclub",
  "Jewel Nightclub",
  "Wet Republic",
  "Tao Nightclub",
  "Tao Beach Dayclub",
  "Liquid Pool Lounge",
  "Cathédrale at Aria",
}
# NYC, LA, Singapore venues are filtered out

Defensive design. All three layers are necessary because TAO sitemaps (sitemap4.xml, sitemap5.xml) are global. Removing any single layer would leak non-LV events into the database.

5 — Extraction Strategies by Venue

Venue(s)	Domain	Data Source	Pricing	Special Logic
LIV Las Vegas, LIV Beach	`livnightclub.com`	JSON-LD (`application/ld+json`)	urvenue API	VEA image extraction
XS Nightclub	`wynnsocial.com`	JSON-LD (Schema.org Event)	`uv_tablesitems` JS var	Fallback venue detect
Encore Beach Club, EBC at Night	`wynnsocial.com`	HTML only (no JSON-LD)	`uv_tablesitems` JS var	`_detect_venue()` text check
TAO Group (10 LV venues)	`taogroup.com`	JSON-LD + og:title	booketing.com proxy	LV venue filter + CLI slugs

6 — Worked Examples

Example 1: LIV Las Vegas Event

URL: https://livnightclub.com/event/dua-lipa-03-15-2026/
Sitemap: https://livnightclub.com/events-sitemap.xml

Crawlee enqueues URL from sitemap
Date filter: 2026-03-15 >= today ✓
ExtractorRegistry.get_extractor(url) → LIVExtractor (matches livnightclub.com)
_extract_embedded_json(soup) finds JSON-LD ✓
Returns VegasEvent { performer: "Dua Lipa", venue: "LIV Las Vegas" }
Table pricing: separate urvenue API call with venue_id = VEN1121561
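The date filter in this walkthrough reduces to a comparison against today's date. The sketch below assumes each candidate URL already comes paired with an event date; how the pipeline obtains that date is not stated here, and `filter_upcoming()` and the second URL are hypothetical.

```python
from datetime import date
from typing import Iterable


def filter_upcoming(entries: Iterable[tuple[str, date]], today: date) -> list[str]:
    """Keep URLs whose event date is today or later (the `date >= today` branch)."""
    return [url for url, event_date in entries if event_date >= today]


entries = [
    ("https://livnightclub.com/event/dua-lipa-03-15-2026/", date(2026, 3, 15)),
    # Hypothetical past event, included only to show the skip branch.
    ("https://livnightclub.com/event/hypothetical-past-event/", date(2026, 1, 10)),
]

kept = filter_upcoming(entries, today=date(2026, 3, 6))
print(kept)
```

Passing `today` explicitly, rather than calling `date.today()` inside the function, keeps the filter easy to test.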

Example 2: XS vs EBC Disambiguation

URL: https://www.wynnsocial.com/event/EVE111500020260531/some-event/
Domain: wynnsocial.com (shared by XS and EBC)

get_extractor(url) → XSExtractor (registered first)
XS looks for JSON-LD Schema.org Event in page
If found: parse and return VegasEvent ✓
If not found: XS returns None
Registry continues → EBCExtractor tries
_detect_venue(soup): searches for "encore beach club" in page text
If matched → returns VegasEvent { venue: "Encore Beach Club" }
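The example URL also carries the EVE segment that the WynnSocialBase card lists as the source of event IDs. A minimal sketch of pulling it out follows; the `EVE` plus digits pattern is inferred from this single URL, and `extract_eve_id()` is a hypothetical helper.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Assumption: an event ID is a path segment of the form "EVE" + digits.
_EVE_SEGMENT = re.compile(r"^EVE\d+$")


def extract_eve_id(url: str) -> Optional[str]:
    """Return the first EVE-style path segment in the URL, or None."""
    for segment in urlparse(url).path.split("/"):
        if _EVE_SEGMENT.match(segment):
            return segment
    return None


event_id = extract_eve_id(
    "https://www.wynnsocial.com/event/EVE111500020260531/some-event/"
)
print(event_id)
```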

Example 3: TAO Group (Omnia)

CLI: $ vinny scrape omnia
Sitemaps: taogroup.com/events-sitemap4.xml, sitemap5.xml

_collect_tao_venue_slugs(["omnia"]) → ("omnia",)
Crawlee pre-filters: only URLs containing /omnia/
get_extractor(url) → TaoGroupExtractor
JSON-LD Event + Location parsed; venue = "Omnia Nightclub"
Validates: "Omnia Nightclub" in _LAS_VEGAS_VENUES ✓
Returns VegasEvent { venue: "Omnia Nightclub", venue_id: "VEN1089" }
Pricing: booketing.com proxy with manageentid=61
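The first two steps of this walkthrough (alias to slugs, then the URL pre-filter) can be sketched as below. `ALIAS_SLUGS` is a stand-in for `_TAO_ALIAS_SLUGS` with made-up contents, `collect_venue_slugs()` mirrors the documented behavior of `_collect_tao_venue_slugs(["omnia"])`, and the taogroup.com paths are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical contents; the real _TAO_ALIAS_SLUGS mapping is not shown here.
ALIAS_SLUGS: dict[str, tuple[str, ...]] = {
    "omnia": ("omnia",),
    "hakkasan": ("hakkasan",),
}


def collect_venue_slugs(aliases: list[str]) -> tuple[str, ...]:
    """Expand CLI aliases into URL slugs, preserving order without duplicates."""
    slugs: list[str] = []
    for alias in aliases:
        for slug in ALIAS_SLUGS.get(alias.lower(), ()):
            if slug not in slugs:
                slugs.append(slug)
    return tuple(slugs)


def url_matches_slugs(url: str, slugs: tuple[str, ...]) -> bool:
    """Pre-filter: keep only URLs whose path contains /<slug>/."""
    path = urlparse(url).path
    return any(f"/{slug}/" in path for slug in slugs)


slugs = collect_venue_slugs(["omnia"])
sitemap_urls = [
    "https://taogroup.com/omnia/example-event/",             # hypothetical path
    "https://taogroup.com/marquee-new-york/example-event/",  # hypothetical path
]
enqueued = [u for u in sitemap_urls if url_matches_slugs(u, slugs)]
print(slugs, enqueued)
```

Matching on `/<slug>/` rather than a bare substring avoids accepting URLs where the slug only appears inside a longer path segment.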

7 — Key Design Patterns

First Match Wins (ExtractorRegistry)

Registration order determines priority. XSExtractor is registered before EBCExtractor because XS pages always carry JSON-LD, while EBC pages must be identified by inspecting page text. If a URL matches multiple extractors, the first registered one handles it.

WynnSocialBase Inheritance

XS and EBC both inherit from WynnSocialBase to share table pricing extraction (uv_tablesitems). The base class provides extract_table_pricing() — subclasses only override extract() for event parsing.
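The inheritance pattern might look like the sketch below. The class and method names come from this page, but everything inside them is assumed: in particular, the page format `var uv_tablesitems = {...};` holding JSON, the sample pricing data, and the stub `extract()` bodies are illustrative only.

```python
import json
import re
from typing import Optional


class WynnSocialBase:
    venue_name = ""

    # Assumed page format: a JS assignment whose right-hand side is valid JSON.
    _TABLES_RE = re.compile(r"uv_tablesitems\s*=\s*(\{.*?\})\s*;", re.DOTALL)

    def extract_table_pricing(self, page_source: str) -> Optional[dict]:
        """Shared by both subclasses: parse the uv_tablesitems JS variable."""
        match = self._TABLES_RE.search(page_source)
        if match is None:
            return None
        try:
            return json.loads(match.group(1))
        except json.JSONDecodeError:
            return None

    def extract(self, soup, url: str) -> Optional[dict]:
        raise NotImplementedError  # subclasses supply event parsing


class XSExtractor(WynnSocialBase):
    venue_name = "XS Nightclub"

    def extract(self, soup, url: str) -> Optional[dict]:
        # Stub: the real method parses JSON-LD from the page.
        return {"venue": self.venue_name, "url": url}


class EBCExtractor(WynnSocialBase):
    venue_name = "Encore Beach Club"

    def extract(self, soup, url: str) -> Optional[dict]:
        # Stub: the real method inspects page text via _detect_venue(soup).
        return {"venue": self.venue_name, "url": url}


# Hypothetical page fragment with made-up pricing data.
page = '<script>var uv_tablesitems = {"dance_floor": {"min_spend": 2000}};</script>'
print(EBCExtractor().extract_table_pricing(page))
print(XSExtractor().extract(None, "https://www.wynnsocial.com/event/x/"))
```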

Multi-Layer TAO Filtering

Three filter layers prevent non-LV TAO venues from being scraped: (1) CLI slug mapping, (2) Crawlee URL pre-filter, (3) extractor venue validation. The design is defensive because taogroup.com sitemaps are global, so all three layers are required.

Incremental Sitemap Diffing

SitemapIndex.diff() compares lastmod timestamps against a stored index. Only new or updated URLs are enqueued, avoiding redundant re-crawls of unchanged events.
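A minimal sketch of lastmod-based diffing, assuming the stored index is a plain URL-to-lastmod mapping. The real `SitemapIndex.diff()` signature and storage format are not shown on this page, so the fields, the `update()` method, and the sample URLs here are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SitemapIndex:
    """Stand-in for the stored index: maps URL -> last seen lastmod string."""

    seen: dict[str, str] = field(default_factory=dict)

    def diff(self, current: dict[str, str]) -> list[str]:
        """Return URLs that are new or whose lastmod differs from the stored one."""
        return [
            url for url, lastmod in current.items()
            if self.seen.get(url) != lastmod
        ]

    def update(self, current: dict[str, str]) -> None:
        # Record the latest lastmod values after a successful crawl.
        self.seen.update(current)


index = SitemapIndex(seen={
    "https://example.test/event/a/": "2026-03-01",
    "https://example.test/event/b/": "2026-03-01",
})
current = {
    "https://example.test/event/a/": "2026-03-01",  # unchanged: skipped
    "https://example.test/event/b/": "2026-03-05",  # updated: enqueued
    "https://example.test/event/c/": "2026-03-04",  # new: enqueued
}

to_crawl = index.diff(current)
print(to_crawl)
```

Comparing for inequality rather than "newer than" also catches a lastmod that moves backwards, at the cost of an occasional extra crawl.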

Vinny Scraper — Extractor Routing · Generated 2026-03-06 · src/extractors/