Extractor Routing

How URLs flow from sitemaps to venue-specific extractors

1 — URL Routing Pipeline
graph TD
  A["Venue Sitemap URL"] --> B["Parse Sitemap"]
  B --> C{"SitemapIndex.diff()"}
  C -->|"New / Updated"| D["Filter Past Events"]
  C -->|"Unchanged"| SKIP["Skip — no re-crawl"]
  D -->|"date >= today"| E["ExtractorRegistry\n.get_extractor(url)"]
  D -->|"date < today"| SKIP2["Skip — past event"]
  E --> F["Venue Extractor\n.extract(soup, url)"]
  F --> G["VegasEvent"]

  classDef default fill:#22d3ee11,stroke:#22d3ee44,stroke-width:1.5px
  classDef decision fill:#fbbf2411,stroke:#fbbf2444,stroke-width:1.5px
  classDef skip fill:#fb718511,stroke:#fb718544,stroke-width:1.5px
  classDef output fill:#34d39911,stroke:#34d39944,stroke-width:2px

  class C decision
  class SKIP,SKIP2 skip
  class G output
      
Registration order matters. ExtractorRegistry iterates in registration order: LIVExtractor → XSExtractor → EBCExtractor → TaoGroupExtractor. The first extractor whose can_handle(url) returns true wins.
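The first-match rule can be sketched as below. This is a minimal illustration, not the project's implementation: the DomainExtractor helper, the register() method, and the domain-suffix check inside can_handle() are assumptions made for the example; only ExtractorRegistry, get_extractor(url), can_handle(url), and the extractor names come from this page.

```python
from typing import List, Optional
from urllib.parse import urlparse


class DomainExtractor:
    """Hypothetical extractor that claims URLs by domain (assumption)."""

    def __init__(self, name: str, domain: str) -> None:
        self.name = name
        self.domain = domain

    def can_handle(self, url: str) -> bool:
        host = urlparse(url).hostname or ""
        # Accept the bare domain and any subdomain such as "www."
        return host == self.domain or host.endswith("." + self.domain)


class ExtractorRegistry:
    """Keeps extractors in registration order; the first match wins."""

    def __init__(self) -> None:
        self._extractors: List[DomainExtractor] = []

    def register(self, extractor: DomainExtractor) -> None:
        self._extractors.append(extractor)

    def get_extractor(self, url: str) -> Optional[DomainExtractor]:
        for extractor in self._extractors:
            if extractor.can_handle(url):
                return extractor
        return None  # no extractor claims this URL


registry = ExtractorRegistry()
for name, domain in [
    ("LIVExtractor", "livnightclub.com"),
    ("XSExtractor", "wynnsocial.com"),
    ("EBCExtractor", "wynnsocial.com"),
    ("TaoGroupExtractor", "taogroup.com"),
]:
    registry.register(DomainExtractor(name, domain))

wynn_url = "https://www.wynnsocial.com/event/EVE111500020260531/some-event/"
print(registry.get_extractor(wynn_url).name)
```

Because XSExtractor and EBCExtractor claim the same domain in this sketch, swapping their registration order would change which one is returned for every wynnsocial.com URL.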
2 — Domain Routing
LIVExtractor
livnightclub.com
  • JSON-LD parsing (primary)
  • HTML fallback
  • VEA image extraction
  • urvenue table pricing API
  • LIV Las Vegas + LIV Beach
WynnSocialBase
wynnsocial.com
  • Shared base for XS + EBC
  • uv_tablesitems pricing
  • EVE URL segments for IDs
  • tixr.com ticketing links
TaoGroupExtractor
taogroup.com
  • JSON-LD + og:title parsing
  • Las Vegas venue filter
  • 10 venues (Omnia, Hakkasan...)
  • booketing.com proxy pricing
3 — WynnSocial: XS vs EBC Routing

Both XS Nightclub and Encore Beach Club share wynnsocial.com. The registry disambiguates via registration order and venue detection:

graph LR
  URL["wynnsocial.com/event/..."] --> XS{"XSExtractor\ncan_handle?"}
  XS -->|"Yes"| JSONLD{"Has JSON-LD?"}
  JSONLD -->|"Yes"| XS_OUT["XS Event"]
  JSONLD -->|"No"| EBC_TRY["Try EBCExtractor"]
  XS -->|"No"| EBC_TRY
  EBC_TRY --> DETECT{"_detect_venue(soup)\ntext search"}
  DETECT -->|"'encore beach club' found"| EBC_OUT["EBC Event"]
  DETECT -->|"not found"| NONE["None — skip"]

  classDef default fill:#22d3ee11,stroke:#22d3ee44,stroke-width:1.5px
  classDef decision fill:#fbbf2411,stroke:#fbbf2444,stroke-width:1.5px
  classDef xs fill:#34d39911,stroke:#34d39944,stroke-width:1.5px
  classDef ebc fill:#a78bfa11,stroke:#a78bfa44,stroke-width:1.5px
  classDef skip fill:#fb718511,stroke:#fb718544,stroke-width:1.5px

  class XS,JSONLD,DETECT decision
  class XS_OUT xs
  class EBC_OUT ebc
  class NONE skip
        
Why XS first? XS pages always have JSON-LD schema, so extraction is clean and reliable. EBC pages lack JSON-LD and require HTML text inspection via _detect_venue(soup). Trying XS first avoids false positives.
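The XS-then-EBC fallback can be sketched as follows. This is an illustration under stated assumptions: the Page dataclass is a hypothetical stand-in for the parsed soup, and xs_extract, ebc_extract, and extract_wynn_event are invented names. Only the JSON-LD check and the "encore beach club" text search are taken from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:
    """Hypothetical stand-in for a parsed page (the real code receives a soup)."""
    json_ld: Optional[dict]  # parsed Schema.org Event, or None if absent
    text: str                # visible page text


def detect_venue(page_text: str) -> Optional[str]:
    """Case-insensitive text search, mirroring _detect_venue(soup)."""
    normalized = " ".join(page_text.lower().split())
    if "encore beach club" in normalized:
        return "Encore Beach Club"
    return None


def xs_extract(page: Page) -> Optional[dict]:
    # XS relies on JSON-LD; without it the extractor gives up.
    if page.json_ld is None:
        return None
    return {"venue": "XS Nightclub", "name": page.json_ld.get("name")}


def ebc_extract(page: Page) -> Optional[dict]:
    # EBC has no JSON-LD, so the venue is detected from page text.
    venue = detect_venue(page.text)
    if venue is None:
        return None
    return {"venue": venue}


def extract_wynn_event(page: Page) -> Optional[dict]:
    """Try XS first, then EBC; None means the page is skipped."""
    for extract in (xs_extract, ebc_extract):
        event = extract(page)
        if event is not None:
            return event
    return None


xs_page = Page(json_ld={"@type": "Event", "name": "Some Event"}, text="XS Nightclub")
ebc_page = Page(json_ld=None, text="Saturday at Encore  Beach Club")
other_page = Page(json_ld=None, text="Unrelated page")
print(extract_wynn_event(xs_page), extract_wynn_event(ebc_page), extract_wynn_event(other_page))
```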
4 — TAO Group Venue Filtering

TAO Group sitemaps contain all global venues (NYC, LA, Singapore, LV). Three filter layers ensure only Las Vegas events are scraped:

Layer | Location | Purpose
CLI Alias | src/cli_scrape.py | Maps a CLI alias (e.g., "omnia") to URL slugs via _TAO_ALIAS_SLUGS
Crawlee Pre-filter | src/crawlee_main.py | Only enqueues sitemap URLs matching _TAO_LV_VENUE_SLUGS
Extractor Validation | src/extractors/tao.py | Post-filter: validates venue name against the _LAS_VEGAS_VENUES set
_LAS_VEGAS_VENUES = {
  "Hakkasan Nightclub",
  "Omnia Nightclub",
  "Marquee Nightclub",
  "Marquee Dayclub",
  "Jewel Nightclub",
  "Wet Republic",
  "Tao Nightclub",
  "Tao Beach Dayclub",
  "Liquid Pool Lounge",
  "Cathédrale at Aria",
}
# NYC, LA, Singapore venues are filtered out
Defensive design. All three layers are necessary because TAO sitemaps (sitemap4.xml, sitemap5.xml) are global. Removing any single layer would leak non-LV events into the database.
5 — Extraction Strategies by Venue
Venue(s) | Domain | Data Source | Pricing | Special Logic
LIV Las Vegas, LIV Beach | livnightclub.com | JSON-LD (application/ld+json) | urvenue API | VEA image extraction
XS Nightclub | wynnsocial.com | JSON-LD (Schema.org Event) | uv_tablesitems JS var | Fallback venue detect
Encore Beach Club, EBC at Night | wynnsocial.com | HTML only (no JSON-LD) | uv_tablesitems JS var | _detect_venue() text check
TAO Group (10 LV venues) | taogroup.com | JSON-LD + og:title | booketing.com proxy | LV venue filter + CLI slugs
6 — Worked Examples
Example 1: LIV Las Vegas Event
URL: https://livnightclub.com/event/dua-lipa-03-15-2026/
Sitemap: https://livnightclub.com/events-sitemap.xml
  1. Crawlee enqueues URL from sitemap
  2. Date filter: 2026-03-15 >= today ✓
  3. ExtractorRegistry.get_extractor(url) → LIVExtractor (matches livnightclub.com)
  4. _extract_embedded_json(soup) finds JSON-LD ✓
  5. Returns VegasEvent { performer: "Dua Lipa", venue: "LIV Las Vegas" }
  6. Table pricing: separate urvenue API call with venue_id = VEN1121561
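Steps 2, 4, and 5 above can be sketched with the standard library alone. Several things here are assumptions: the project parses pages with a soup object, whereas this sketch uses html.parser so it stays self-contained; the sample JSON-LD payload is invented; and reading the event date from the MM-DD-YYYY URL suffix is a guess at how the date filter gets its input. The names JsonLdCollector, find_event, to_event, and url_event_date are hypothetical.

```python
import json
import re
from datetime import date
from html.parser import HTMLParser
from typing import Optional


class JsonLdCollector(HTMLParser):
    """Collects parsed <script type="application/ld+json"> blocks."""

    def __init__(self) -> None:
        super().__init__()
        self._in_json_ld = False
        self._buffer: list = []
        self.blocks: list = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # malformed block: ignore and keep looking


def find_event(html: str) -> Optional[dict]:
    """Return the first JSON-LD object whose @type is Event, if any."""
    collector = JsonLdCollector()
    collector.feed(html)
    for block in collector.blocks:
        for item in (block if isinstance(block, list) else [block]):
            if isinstance(item, dict) and item.get("@type") == "Event":
                return item
    return None


def to_event(data: dict) -> dict:
    """Map JSON-LD fields onto the two VegasEvent fields shown above."""
    performer = data.get("performer")
    name = performer.get("name") if isinstance(performer, dict) else data.get("name")
    location = data.get("location")
    venue = location.get("name") if isinstance(location, dict) else None
    return {"performer": name, "venue": venue}


_URL_DATE = re.compile(r"(\d{2})-(\d{2})-(\d{4})/?$")


def url_event_date(url: str) -> Optional[date]:
    """Read an MM-DD-YYYY suffix from the URL slug (assumed convention)."""
    match = _URL_DATE.search(url)
    if match is None:
        return None
    month, day, year = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        return None


URL = "https://livnightclub.com/event/dua-lipa-03-15-2026/"
SAMPLE_HTML = """<html><head><script type="application/ld+json">
{"@type": "Event", "name": "Dua Lipa",
 "performer": {"@type": "Person", "name": "Dua Lipa"},
 "location": {"@type": "Place", "name": "LIV Las Vegas"}}
</script></head><body></body></html>"""

print(url_event_date(URL), to_event(find_event(SAMPLE_HTML)))
```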
Example 2: XS vs EBC Disambiguation
URL: https://www.wynnsocial.com/event/EVE111500020260531/some-event/
Domain: wynnsocial.com (shared by XS and EBC)
  1. get_extractor(url) → XSExtractor (registered first)
  2. XS looks for JSON-LD Schema.org Event in page
  3. If found: parse and return VegasEvent ✓
  4. If not found: XS returns None
  5. Registry continues → EBCExtractor tries
  6. _detect_venue(soup): searches for "encore beach club" in page text
  7. If matched → returns VegasEvent { venue: "Encore Beach Club" }
Example 3: TAO Group (Omnia)
CLI: $ vinny scrape omnia
Sitemaps: taogroup.com/events-sitemap4.xml, sitemap5.xml
  1. _collect_tao_venue_slugs(["omnia"]) → ("omnia",)
  2. Crawlee pre-filters: only URLs containing /omnia/
  3. get_extractor(url) → TaoGroupExtractor
  4. JSON-LD Event + Location parsed; venue = "Omnia Nightclub"
  5. Validates: "Omnia Nightclub" in _LAS_VEGAS_VENUES
  6. Returns VegasEvent { venue: "Omnia Nightclub", venue_id: "VEN1089" }
  7. Pricing: booketing.com proxy with manageentid=61
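Steps 1 and 2 of this example (alias to slugs, then the URL pre-filter) can be sketched as below. The contents of _TAO_ALIAS_SLUGS, the sample URLs, and the should_enqueue helper are assumptions for illustration; only the alias "omnia", the ("omnia",) result, and the "/omnia/" substring rule come from the steps above.

```python
from typing import Iterable, Tuple
from urllib.parse import urlparse

# Hypothetical mapping contents; the real table lives in src/cli_scrape.py.
_TAO_ALIAS_SLUGS = {
    "omnia": ("omnia",),
    "hakkasan": ("hakkasan",),
}


def collect_tao_venue_slugs(aliases: Iterable[str]) -> Tuple[str, ...]:
    """Expand CLI aliases into a de-duplicated, order-preserving slug tuple."""
    slugs = []
    for alias in aliases:
        for slug in _TAO_ALIAS_SLUGS.get(alias.lower(), ()):
            if slug not in slugs:
                slugs.append(slug)
    return tuple(slugs)


def should_enqueue(url: str, slugs: Tuple[str, ...]) -> bool:
    """Pre-filter: keep only URLs whose path contains /<slug>/."""
    path = urlparse(url).path
    return any(f"/{slug}/" in path for slug in slugs)


slugs = collect_tao_venue_slugs(["omnia"])
# Both URLs below are invented examples of sitemap entries.
print(slugs)
print(should_enqueue("https://taogroup.com/event/omnia/some-event/", slugs))
print(should_enqueue("https://taogroup.com/event/marquee-new-york/some-event/", slugs))
```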
7 — Key Design Patterns
First Match Wins (ExtractorRegistry)
Registration order determines priority. XSExtractor is registered before EBCExtractor because XS pages always carry JSON-LD, while EBC must inspect page text. If a page matches multiple extractors, the first registered one handles it.
WynnSocialBase Inheritance
XS and EBC both inherit from WynnSocialBase to share table pricing extraction (uv_tablesitems). The base class provides extract_table_pricing() — subclasses only override extract() for event parsing.
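The inheritance pattern might look like the sketch below. It assumes uv_tablesitems is a JavaScript variable assigned a JSON array, which this page does not confirm, and the sample table fields are invented. The extract(html, url) signature takes raw HTML here as a stand-in for the extract(soup, url) shown in the pipeline diagram.

```python
import json
import re
from abc import ABC, abstractmethod
from typing import Optional


class WynnSocialBase(ABC):
    """Shared base: pricing extraction lives here, event parsing in subclasses."""

    # Assumption: the page embeds `uv_tablesitems = [...];` as a JSON array.
    _TABLES_RE = re.compile(r"uv_tablesitems\s*=\s*(\[.*?\])\s*;", re.DOTALL)

    def extract_table_pricing(self, html: str) -> list:
        match = self._TABLES_RE.search(html)
        if match is None:
            return []
        try:
            return json.loads(match.group(1))
        except json.JSONDecodeError:
            return []

    @abstractmethod
    def extract(self, html: str, url: str) -> Optional[dict]:
        """Subclasses implement venue-specific event parsing."""


class XSExtractor(WynnSocialBase):
    def extract(self, html: str, url: str) -> Optional[dict]:
        return {"venue": "XS Nightclub", "url": url}


class EBCExtractor(WynnSocialBase):
    def extract(self, html: str, url: str) -> Optional[dict]:
        return {"venue": "Encore Beach Club", "url": url}


# Invented sample page; field names are illustrative only.
SAMPLE = '<script>var uv_tablesitems = [{"name": "Dance Floor Table", "min_spend": 1500}];</script>'

for extractor in (XSExtractor(), EBCExtractor()):
    print(type(extractor).__name__, extractor.extract_table_pricing(SAMPLE))
```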
Multi-Layer TAO Filtering
Three filter layers prevent non-LV TAO venues from being scraped: (1) CLI slug mapping, (2) Crawlee URL pre-filter, (3) extractor venue validation. The design is defensive because taogroup.com sitemaps are global, so all three layers are required.
Incremental Sitemap Diffing
SitemapIndex.diff() compares lastmod timestamps against a stored index. Only new or updated URLs are enqueued, avoiding redundant re-crawls of unchanged events.
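A minimal sketch of the diffing idea, assuming the stored index is a mapping from URL to its last-seen lastmod string and that any change in lastmod counts as an update. The entries field, the update() method, and the sample URLs are hypothetical; only SitemapIndex.diff() and the lastmod comparison come from the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SitemapIndex:
    # url -> lastmod as last seen (assumed storage shape)
    entries: Dict[str, str] = field(default_factory=dict)

    def diff(self, current: Dict[str, str]) -> List[str]:
        """Return URLs that are new or whose lastmod differs from the stored value."""
        return [
            url
            for url, lastmod in current.items()
            if self.entries.get(url) != lastmod
        ]

    def update(self, current: Dict[str, str]) -> None:
        """Record the latest lastmod values after a crawl (hypothetical helper)."""
        self.entries.update(current)


index = SitemapIndex(entries={
    "https://livnightclub.com/event/a/": "2026-03-01",
    "https://livnightclub.com/event/b/": "2026-03-01",
})
fresh = {
    "https://livnightclub.com/event/a/": "2026-03-01",  # unchanged: skipped
    "https://livnightclub.com/event/b/": "2026-03-05",  # updated
    "https://livnightclub.com/event/c/": "2026-03-05",  # new
}
to_crawl = index.diff(fresh)
print(to_crawl)
```

Calling update(fresh) after a successful crawl would make the next diff() return an empty list for the same sitemap contents.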
Vinny Scraper — Extractor Routing · Generated 2026-03-06 · src/extractors/