Extractor Routing
How URLs flow from sitemaps to venue-specific extractors
1 — URL Routing Pipeline
graph TD
A["Venue Sitemap URL"] --> B["Parse Sitemap"]
B --> C{"SitemapIndex.diff()"}
C -->|"New / Updated"| D["Filter Past Events"]
C -->|"Unchanged"| SKIP["Skip — no re-crawl"]
D -->|"date >= today"| E["ExtractorRegistry\n.get_extractor(url)"]
D -->|"date < today"| SKIP2["Skip — past event"]
E --> F["Venue Extractor\n.extract(soup, url)"]
F --> G["VegasEvent"]
classDef default fill:#22d3ee11,stroke:#22d3ee44,stroke-width:1.5px
classDef decision fill:#fbbf2411,stroke:#fbbf2444,stroke-width:1.5px
classDef skip fill:#fb718511,stroke:#fb718544,stroke-width:1.5px
classDef output fill:#34d39911,stroke:#34d39944,stroke-width:2px
class C decision
class SKIP,SKIP2 skip
class G output
Registration order matters. ExtractorRegistry iterates in registration order:
LIVExtractor → XSExtractor → EBCExtractor → TaoGroupExtractor. The first extractor whose can_handle(url) returns True wins.
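The first-match rule above can be illustrated with a minimal sketch. It assumes each extractor claims URLs by hostname. `DomainExtractor` and `register()` are hypothetical names used only for this example; the real extractors may decide `can_handle` differently.

```python
from typing import List, Optional
from urllib.parse import urlparse


class DomainExtractor:
    """Hypothetical stand-in for a venue extractor that claims URLs by hostname."""

    def __init__(self, name: str, domain: str) -> None:
        self.name = name
        self.domain = domain

    def can_handle(self, url: str) -> bool:
        host = urlparse(url).hostname or ""
        # Accept the bare domain and any subdomain such as "www.".
        return host == self.domain or host.endswith("." + self.domain)


class ExtractorRegistry:
    """Sketch of a first-match-wins registry (not the project's actual class)."""

    def __init__(self) -> None:
        self._extractors: List[DomainExtractor] = []

    def register(self, extractor: DomainExtractor) -> None:
        # Append order is the priority order.
        self._extractors.append(extractor)

    def get_extractor(self, url: str) -> Optional[DomainExtractor]:
        for extractor in self._extractors:
            if extractor.can_handle(url):
                return extractor  # first match wins
        return None


registry = ExtractorRegistry()
for name, domain in [
    ("LIVExtractor", "livnightclub.com"),
    ("XSExtractor", "wynnsocial.com"),
    ("EBCExtractor", "wynnsocial.com"),
    ("TaoGroupExtractor", "taogroup.com"),
]:
    registry.register(DomainExtractor(name, domain))
```

Because XSExtractor and EBCExtractor claim the same domain here, a wynnsocial.com URL always resolves to XSExtractor first. That is why the order of the `register()` calls matters.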
2 — Domain Routing
LIVExtractor
livnightclub.com
- JSON-LD parsing (primary)
- HTML fallback
- VEA image extraction
- urvenue table pricing API
- LIV Las Vegas + LIV Beach
WynnSocialBase
wynnsocial.com
- Shared base for XS + EBC
- uv_tablesitems pricing
- EVE URL segments for IDs
- tixr.com ticketing links
TaoGroupExtractor
taogroup.com
- JSON-LD + og:title parsing
- Las Vegas venue filter
- 10 venues (Omnia, Hakkasan...)
- booketing.com proxy pricing
3 — WynnSocial: XS vs EBC Routing
Both XS Nightclub and Encore Beach Club share wynnsocial.com. The registry disambiguates via registration order and venue detection:
graph LR
URL["wynnsocial.com/event/..."] --> XS{"XSExtractor\ncan_handle?"}
XS -->|"Yes"| JSONLD{"Has JSON-LD?"}
JSONLD -->|"Yes"| XS_OUT["XS Event"]
JSONLD -->|"No"| EBC_TRY["Try EBCExtractor"]
XS -->|"No"| EBC_TRY
EBC_TRY --> DETECT{"_detect_venue(soup)\ntext search"}
DETECT -->|"'encore beach club' found"| EBC_OUT["EBC Event"]
DETECT -->|"not found"| NONE["None — skip"]
classDef default fill:#22d3ee11,stroke:#22d3ee44,stroke-width:1.5px
classDef decision fill:#fbbf2411,stroke:#fbbf2444,stroke-width:1.5px
classDef xs fill:#34d39911,stroke:#34d39944,stroke-width:1.5px
classDef ebc fill:#a78bfa11,stroke:#a78bfa44,stroke-width:1.5px
classDef skip fill:#fb718511,stroke:#fb718544,stroke-width:1.5px
class XS,JSONLD,DETECT decision
class XS_OUT xs
class EBC_OUT ebc
class NONE skip
Why XS first? XS pages always have JSON-LD schema, so extraction is clean and reliable. EBC pages lack JSON-LD and require HTML text inspection via _detect_venue(soup). Trying XS first avoids false positives.
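The decision flow in the diagram can be sketched as follows. To stay dependency-free, the sketch works on a raw HTML string with the standard library rather than on a `soup` object. The function names `find_json_ld_event`, `detect_venue`, and `route_wynnsocial` are illustrative assumptions, not the project's actual API.

```python
import json
import re
from typing import Optional

# Matches <script type="application/ld+json"> ... </script> blocks.
_JSON_LD_RE = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.IGNORECASE | re.DOTALL,
)
_TAG_RE = re.compile(r"<[^>]+>")


def find_json_ld_event(html: str) -> Optional[dict]:
    """Return the first JSON-LD object whose @type is "Event", if any."""
    for raw in _JSON_LD_RE.findall(html):
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the page
        candidates = data if isinstance(data, list) else [data]
        for item in candidates:
            if isinstance(item, dict) and item.get("@type") == "Event":
                return item
    return None


def detect_venue(html: str) -> Optional[str]:
    """Case-insensitive text search, standing in for _detect_venue(soup)."""
    text = " ".join(_TAG_RE.sub(" ", html).lower().split())
    return "Encore Beach Club" if "encore beach club" in text else None


def route_wynnsocial(html: str) -> Optional[str]:
    """XS first (JSON-LD present), then the EBC text check, else None."""
    if find_json_ld_event(html) is not None:
        return "XS Nightclub"
    return detect_venue(html)
```

A page with a JSON-LD Event routes to XS without any text inspection. The text search runs only when JSON-LD is absent.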
4 — TAO Group Venue Filtering
TAO Group sitemaps contain all global venues (NYC, LA, Singapore, LV). Three filter layers ensure only Las Vegas events are scraped:
| Layer | Location | Purpose |
|---|---|---|
| CLI Alias | src/cli_scrape.py | Maps CLI alias (e.g., "omnia") to URL slugs via _TAO_ALIAS_SLUGS |
| Crawlee Pre-filter | src/crawlee_main.py | Only enqueues sitemap URLs matching _TAO_LV_VENUE_SLUGS |
| Extractor Validation | src/extractors/tao.py | Post-filter: validates venue name against _LAS_VEGAS_VENUES set |
_LAS_VEGAS_VENUES = {
"Hakkasan Nightclub",
"Omnia Nightclub",
"Marquee Nightclub",
"Marquee Dayclub",
"Jewel Nightclub",
"Wet Republic",
"Tao Nightclub",
"Tao Beach Dayclub",
"Liquid Pool Lounge",
"Cathédrale at Aria",
}
# NYC, LA, Singapore venues are filtered out
Defensive design. All three layers are necessary because TAO sitemaps (sitemap4.xml, sitemap5.xml) are global. Removing any single layer would leak non-LV events into the database.
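As a rough sketch of the second and third layers, the snippet below pairs a URL pre-filter with a venue-name post-filter. The slug tuple, the URL shapes, and both function names are assumptions for illustration. Only the `/omnia/` slug and the venue names come from this document.

```python
from typing import Iterable
from urllib.parse import urlparse

# Illustrative subset only; the real _TAO_LV_VENUE_SLUGS contents are not shown here.
TAO_LV_VENUE_SLUGS = ("omnia",)

# Subset of the _LAS_VEGAS_VENUES set listed above.
LAS_VEGAS_VENUES = {"Omnia Nightclub", "Hakkasan Nightclub", "Marquee Nightclub"}


def url_matches_lv_slug(url: str, slugs: Iterable[str] = TAO_LV_VENUE_SLUGS) -> bool:
    """Pre-filter: keep a sitemap URL only if a path segment equals a known LV slug."""
    segments = [seg for seg in urlparse(url).path.split("/") if seg]
    return any(slug in segments for slug in slugs)


def is_las_vegas_venue(venue_name: str) -> bool:
    """Post-filter: accept an extracted event only if its venue is in the LV set."""
    return venue_name.strip() in LAS_VEGAS_VENUES
```

Matching whole path segments instead of substrings keeps a slug like `omnia` from matching an unrelated path that merely contains those letters.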
5 — Extraction Strategies by Venue
| Venue(s) | Domain | Data Source | Pricing | Special Logic |
|---|---|---|---|---|
| LIV Las Vegas, LIV Beach | livnightclub.com | JSON-LD (application/ld+json) | urvenue API | VEA image extraction |
| XS Nightclub | wynnsocial.com | JSON-LD (Schema.org Event) | uv_tablesitems JS var | Fallback venue detect |
| Encore Beach Club, EBC at Night | wynnsocial.com | HTML only (no JSON-LD) | uv_tablesitems JS var | _detect_venue() text check |
| TAO Group (10 LV venues) | taogroup.com | JSON-LD + og:title | booketing.com proxy | LV venue filter + CLI slugs |
6 — Worked Examples
1 LIV Las Vegas Event
URL: https://livnightclub.com/event/dua-lipa-03-15-2026/
Sitemap: https://livnightclub.com/events-sitemap.xml
- Crawlee enqueues URL from sitemap
- Date filter: 2026-03-15 >= today ✓
- ExtractorRegistry.get_extractor(url) → LIVExtractor (matches livnightclub.com)
- _extract_embedded_json(soup) finds JSON-LD ✓
- Returns VegasEvent { performer: "Dua Lipa", venue: "LIV Las Vegas" }
- Table pricing: separate urvenue API call with venue_id = VEN1121561
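The date-filter step in this example can be sketched as below. The document does not say where the pipeline reads the event date from. Reading it from a trailing MM-DD-YYYY slug (as in the LIV URL above) is an assumption made purely for illustration, and both function names are hypothetical.

```python
import re
from datetime import date
from typing import Optional

# Assumes the slug ends in MM-DD-YYYY, as in ".../dua-lipa-03-15-2026/".
_SLUG_DATE_RE = re.compile(r"(\d{2})-(\d{2})-(\d{4})/?$")


def date_from_slug(url: str) -> Optional[date]:
    """Parse a trailing MM-DD-YYYY date from a URL, or return None."""
    match = _SLUG_DATE_RE.search(url)
    if match is None:
        return None
    month, day, year = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        return None  # e.g. month 13 or day 32


def is_upcoming(event_date: date, today: Optional[date] = None) -> bool:
    """Keep events dated today or later; past events are skipped."""
    return event_date >= (today or date.today())
```

Passing `today` explicitly keeps the filter deterministic in tests. Production code would normally rely on the default.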
2 XS vs EBC Disambiguation
URL: https://www.wynnsocial.com/event/EVE111500020260531/some-event/
Domain: wynnsocial.com (shared by XS and EBC)
- get_extractor(url) → XSExtractor (registered first)
- XS looks for JSON-LD Schema.org Event in page
- If found: parse and return VegasEvent ✓
- If not found: XS returns None
- Registry continues → EBCExtractor tries _detect_venue(soup): searches for "encore beach club" in page text
- If matched → returns VegasEvent { venue: "Encore Beach Club" }
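WynnSocial URLs carry an EVE segment that serves as the event ID. A minimal sketch of pulling that segment out of the path is shown here. The helper name `extract_eve_id` and the "EVE followed by digits" pattern are assumptions inferred from the example URL, not confirmed project code.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Assumed shape: the literal prefix "EVE" followed only by digits.
_EVE_SEGMENT_RE = re.compile(r"^EVE\d+$")


def extract_eve_id(url: str) -> Optional[str]:
    """Return the first path segment that looks like an EVE identifier."""
    for segment in urlparse(url).path.split("/"):
        if _EVE_SEGMENT_RE.match(segment):
            return segment
    return None
```

For the URL in this example, the helper returns `EVE111500020260531`. A URL without such a segment yields `None`.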
3 TAO Group — Omnia
CLI: $ vinny scrape omnia
Sitemaps: taogroup.com/events-sitemap4.xml, sitemap5.xml
- _collect_tao_venue_slugs(["omnia"]) → ("omnia",)
- Crawlee pre-filters: only URLs containing /omnia/
- get_extractor(url) → TaoGroupExtractor
- JSON-LD Event + Location parsed; venue = "Omnia Nightclub"
- Validates: "Omnia Nightclub" in _LAS_VEGAS_VENUES ✓
- Returns VegasEvent { venue: "Omnia Nightclub", venue_id: "VEN1089" }
- Pricing: booketing.com proxy with manageentid=61
7 — Key Design Patterns
First Match Wins (ExtractorRegistry)
Registration order determines priority. XSExtractor before EBCExtractor because XS has JSON-LD for all pages, while EBC must inspect page text. If a page matches multiple extractors, the first registered one handles it.
WynnSocialBase Inheritance
XS and EBC both inherit from WynnSocialBase to share table pricing extraction (uv_tablesitems). The base class provides extract_table_pricing(); subclasses only override extract() for event parsing.
Multi-Layer TAO Filtering
Three filter layers prevent non-LV TAO venues from being scraped: (1) CLI slug mapping, (2) Crawlee URL pre-filter, (3) extractor venue validation. The design is defensive because taogroup.com sitemaps are global, so all three layers are required.
Incremental Sitemap Diffing
SitemapIndex.diff() compares lastmod timestamps against a stored index. Only new or updated URLs are enqueued, avoiding redundant re-crawls of unchanged events.
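A minimal sketch of lastmod-based diffing follows, assuming the stored index is a simple mapping of URL to lastmod string. `SitemapIndexSketch`, `parse_sitemap`, and `update()` are hypothetical names; the real SitemapIndex may persist and compare its data differently.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Dict, List

# Namespace defined by the sitemaps.org protocol.
_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text: str) -> Dict[str, str]:
    """Map each <loc> in a urlset to its <lastmod> (empty string if absent)."""
    root = ET.fromstring(xml_text)
    entries: Dict[str, str] = {}
    for node in root.findall("sm:url", _NS):
        loc = (node.findtext("sm:loc", default="", namespaces=_NS) or "").strip()
        lastmod = (node.findtext("sm:lastmod", default="", namespaces=_NS) or "").strip()
        if loc:
            entries[loc] = lastmod
    return entries


@dataclass
class SitemapIndexSketch:
    """Stored url -> lastmod index from the previous crawl."""

    entries: Dict[str, str] = field(default_factory=dict)

    def diff(self, current: Dict[str, str]) -> List[str]:
        # A URL is new if it is missing from the index, updated if lastmod differs.
        return [url for url, lastmod in current.items()
                if self.entries.get(url) != lastmod]

    def update(self, current: Dict[str, str]) -> None:
        self.entries.update(current)
```

After a successful crawl, calling `update()` with the parsed sitemap makes the next `diff()` return only URLs whose lastmod has changed since then.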