
Extractor Contract

A checklist and verification contract for adding new venue extractors. Every agent (human or AI) must follow these steps. Update this doc when you learn something new.


Phase 0: Research (before writing code)

0a. Page discovery and data source audit

  • Identify the venue's event listing page and single-event page URLs
  • Locate the venue's sitemap — most venues have one (XML sitemap, WordPress events-sitemap.xml, etc.)
  • Save a local copy to tests/fixtures/{venue}/ for offline testing and reference
  • Note the URL count and date range covered
  • If the sitemap covers multiple cities or years, document which portions contain current Las Vegas events
  • Choose 3 sample event pages — these will be your ground-truth fixtures through all phases
  • Take a screenshot of each sample event page using agent-browser — save to tests/fixtures/{venue}/
  • Inspect single-event page source for data sources (in priority order):
      • Schema.org JSON-LD (<script type="application/ld+json">)
      • Embedded JS variables (e.g. uv_tablesitems)
      • og:* meta tags
      • Structured HTML (classes, data attrs)
      • Raw HTML text (last resort — fragile)
  • Document the URL pattern (e.g. https://domain.com/event/EVE{id}{YYYYMMDD}/{slug}/)
  • Check if the venue shares a domain with existing extractors (requires disambiguation)
  • Identify the venue_id / venuecode — check urvenue embed, JS vars, or API calls in devtools
  • Locate image sources — CDN URLs, og:image, JSON-LD image field
  • Check for streaming/social links in page source
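When JSON-LD is present, the source audit can be scripted. A minimal stdlib-only sketch (regex matching is fine for auditing saved fixtures; the sample HTML and field names are illustrative):

```python
import json
import re

JSON_LD_RE = re.compile(
    r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
    re.DOTALL,
)


def extract_json_ld(html: str) -> list[dict]:
    """Pull every JSON-LD block out of saved page HTML."""
    blocks = []
    for raw in JSON_LD_RE.findall(html):
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed block — note it in debug.md
        blocks.extend(data if isinstance(data, list) else [data])
    return blocks


html = '''<script type="application/ld+json">
{"@type": "Event", "startDate": "2026-03-15", "performer": {"name": "DOM DOLLA"}}
</script>'''
events = [b for b in extract_json_ld(html) if b.get("@type") == "Event"]
print(events[0]["startDate"])  # → 2026-03-15
```

Run this against each saved fixture to confirm which fields the JSON-LD actually carries before committing to it as the primary data source.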

Playwright note: If the event page relies on client-side JS rendering (e.g. React/Vue SPA, dynamically loaded content), static HTML from httpx/requests won't contain the data. Check the saved HTML for missing content — if the data isn't in the raw HTML, you'll need Playwright (Crawlee's PlaywrightCrawler) instead of the default HTTP crawler. Signs you need Playwright: empty <div id="app">, data only visible after JS execution, API calls triggered by useEffect/onMounted.
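The "do I need Playwright" check can be turned into a quick heuristic over the saved HTML. A sketch — the signals mirror the note above, but the regexes and function name are illustrative, not project code:

```python
import re


def needs_playwright(html: str) -> bool:
    """Heuristic: does this saved HTML look like an empty client-side shell?

    Returns True when the static source likely lacks event data, meaning a
    JS-rendering crawler (e.g. Crawlee's PlaywrightCrawler) is needed.
    """
    # JSON-LD or og: meta tags in the raw source are strong signals that
    # plain HTTP fetching is enough.
    has_json_ld = 'application/ld+json' in html
    has_og_tags = re.search(r'<meta[^>]+property="og:', html) is not None
    # An empty SPA mount point is a strong signal the other way.
    empty_app_shell = re.search(r'<div id="(app|root)">\s*</div>', html) is not None
    return empty_app_shell and not (has_json_ld or has_og_tags)


spa_html = '<html><body><div id="root"></div></body></html>'
ssr_html = '<html><head><script type="application/ld+json">{}</script></head></html>'
print(needs_playwright(spa_html))  # → True
print(needs_playwright(ssr_html))  # → False
```

This only automates the obvious cases — always eyeball the saved HTML too, since data can be present but incomplete.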

0b. Save fixtures and verify extractability

  • Save the 3 sample event HTML files to tests/fixtures/{venue}/ for offline testing
  • Open the saved HTML locally — confirm the data you need is actually present in the static source
  • If data is missing → Playwright required (see note above)
  • Manually extract ALL verifiable fields from each saved HTML and record them in tests/fixtures/{venue}/phase0.json:
      • Required: event_date, performer, venue, url
      • Optional: event_time, venue_id, external_id, ticket_url, images (source URLs)
      • Table pricing: section names, min spend amounts, guest counts
      • Example: "Dance Floor Table: $2,000 min / 6 guests" → {"name": "Dance Floor Table", "min_spend": 2000, "guests": 6}
  • Note in phase0.json whether table pricing data is present → determines if Phase 3d applies (it almost always will)
  • Record all Phase 0 findings in tests/fixtures/{venue}/debug.md — this file is the running log for the entire extractor build

phase0.json is your test oracle. Every field recorded here will be cross-referenced against scraper output in Phase 3a and pricing output in Phase 3d. Get these values right now so you have a reliable baseline.
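A hypothetical phase0.json shape — the exact schema is yours to define, but every field recorded should be directly assertable in Phase 3a/3d (all values below are invented examples):

```json
{
  "has_table_pricing": true,
  "events": [
    {
      "url": "https://example.com/event/EVE123420260315/dom-dolla/",
      "event_date": "2026-03-15",
      "performer": "DOM DOLLA",
      "venue": "Encore Beach Club",
      "event_time": "11:00",
      "venue_id": "VEN1121561",
      "ticket_url": "https://example.com/tickets/123",
      "images": ["https://cdn.example.com/dom-dolla.jpg"],
      "table_pricing": [
        {"name": "Dance Floor Table", "min_spend": 2000, "guests": 6}
      ]
    }
  ]
}
```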

0c. Image resolution audit

Test which image sizes are available from the venue's CDN/image source. Our image pipeline supports these size presets:

| Preset | Pixels   | Use case                     |
|--------|----------|------------------------------|
| small  | 250px    | Thumbnails, cards            |
| main   | 500px    | Default event image          |
| medium | 750px    | Mid-resolution               |
| large  | 1000px   | Detail pages                 |
| hd     | 1500px   | High-definition              |
| raw    | Original | Highest available resolution |
  • Identify the base image URL from the event page (og:image, JSON-LD image, or CDN URL)
  • Test each size preset against the venue's CDN to see which resolutions are available
  • For VEA CDN venues: modify the URL dimension parameter and check for 200 vs 404
  • For other CDNs: check if resize parameters exist in the URL pattern
  • Document which sizes are downloadable for this venue
  • Note the highest available resolution — this becomes the source for the image pipeline
  • Save sample images to tests/fixtures/{venue}/ for at least one artist, using the naming convention below
  • Record available sizes and source URLs in debug.md
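The size audit can be scripted. A sketch assuming the CDN takes a width query parameter (many encode dimensions in the path instead — adapt the URL rewrite accordingly); probe each candidate with a HEAD request and record 200 vs 404 in debug.md:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Size presets from the table above (raw = original URL, no resize parameter)
SIZE_PRESETS = {"small": 250, "main": 500, "medium": 750, "large": 1000, "hd": 1500}


def candidate_urls(base_url: str, width_param: str = "w") -> dict[str, str]:
    """Build one candidate URL per size preset by rewriting a width query param.

    width_param is an assumption — inspect the venue's actual CDN URLs first.
    """
    scheme, netloc, path, query, frag = urlsplit(base_url)
    urls = {}
    for preset, px in SIZE_PRESETS.items():
        params = dict(parse_qsl(query))
        params[width_param] = str(px)
        urls[preset] = urlunsplit((scheme, netloc, path, urlencode(params), frag))
    urls["raw"] = base_url  # original, untouched
    return urls


urls = candidate_urls("https://cdn.example.com/img/diplo.jpg?w=500")
print(urls["hd"])  # → https://cdn.example.com/img/diplo.jpg?w=1500
```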

Naming convention for downloaded images:

data/artists/{artist_slug}/{artist_slug}_{venue_tag}_{size}.jpg
| Venue                         | Tag  | Example filename     |
|-------------------------------|------|----------------------|
| Encore Beach Club             | ebc  | diplo_ebc_main.jpg   |
| Encore Beach Club at Night    | ebcn | diplo_ebcn_raw.jpg   |
| XS Nightclub                  | xs   | diplo_xs_hd.jpg      |
| LIV Nightclub / LIV Las Vegas | liv  | diplo_liv_main.jpg   |
| LIV Beach                     | livb | diplo_livb_small.jpg |

New venues: add a 2-4 char tag to ImageStorage.VENUE_SUFFIXES in src/plugins/images/storage.py. Unknown venues fall back to the first 3 chars of the slugified venue name.
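The convention and its fallback can be sketched as follows — the real mapping lives in ImageStorage.VENUE_SUFFIXES, and this slugify is a simplified stand-in, not the project's:

```python
import re


def slugify(name: str) -> str:
    """Lowercase, with runs of non-alphanumerics collapsed to hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


# Known venue tags (subset, for illustration)
VENUE_SUFFIXES = {"encore-beach-club": "ebc", "xs-nightclub": "xs"}


def image_path(artist: str, venue: str, size: str) -> str:
    artist_slug = slugify(artist)
    venue_slug = slugify(venue)
    # Unknown venues fall back to the first 3 chars of the slugified name
    tag = VENUE_SUFFIXES.get(venue_slug, venue_slug[:3])
    return f"data/artists/{artist_slug}/{artist_slug}_{tag}_{size}.jpg"


print(image_path("Diplo", "Encore Beach Club", "main"))
# → data/artists/diplo/diplo_ebc_main.jpg
print(image_path("Diplo", "Omnia Nightclub", "hd"))
# → data/artists/diplo/diplo_omn_hd.jpg
```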

Fixture directory structure

After Phase 0 is complete, tests/fixtures/{venue}/ should look like this:

tests/fixtures/omnia/
├── events-sitemap.xml           # Local copy of venue sitemap(s)
├── event1.html                  # Sample event page HTML
├── event2.html
├── event3.html
├── event1.png                   # Screenshot from agent-browser
├── event2.png
├── event3.png
├── phase0.json                  # Ground-truth field values (test oracle)
├── debug.md                     # Running log: findings, r2_urls, discrepancies
├── artist_omn_main.jpg          # Sample image: main (500px)
└── artist_omn_small.jpg         # Sample image: small (250px)
  • phase0.json — structured expected values for all 3 events, used as assertions in Phase 3a and 3d
  • debug.md — append-only log updated at each phase; records screenshots, image audit results, r2_urls, pricing cross-checks, and any discrepancies requiring human review

Phase Gate: Complete Phase 0 before writing code

Do not start implementation until all Phase 0 items are checked. You need: 3 saved HTML fixtures, a phase0.json test oracle with manually extracted field values, and a completed image resolution audit. Skipping this leads to extractors that "work" but produce wrong data.

Phase 1: Implement the extractor

File: src/extractors/{venue}.py

  • Inherit from VenueExtractor (or WynnSocialBase for Wynn properties)
  • Implement required properties: name, domain
  • If sharing a domain, implement can_handle() with disambiguation logic
  • Implement extract(soup, url) → VegasEvent | None
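The shape of the contract, as a self-contained sketch — VenueExtractor and VegasEvent here are simplified stand-ins, not the project's actual classes (use the real ones from src.models and src.extractors):

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class VegasEvent:  # simplified stand-in for src.models.VegasEvent
    event_date: str
    performer: str
    venue: str
    url: str


class VenueExtractor(ABC):  # simplified stand-in for the project base class
    @property
    @abstractmethod
    def name(self) -> str: ...

    @property
    @abstractmethod
    def domain(self) -> str: ...

    def can_handle(self, url: str) -> bool:
        # Default: match by domain; override when the domain is shared
        return self.domain in url


class OmniaExtractor(VenueExtractor):
    @property
    def name(self) -> str:
        return "Omnia Nightclub"

    @property
    def domain(self) -> str:
        return "taogroup.com"

    def can_handle(self, url: str) -> bool:
        # Shared domain: disambiguate on the venue slug in the URL path
        return super().can_handle(url) and "/omnia" in url


ex = OmniaExtractor()
print(ex.can_handle("https://taogroup.com/venues/omnia-nightclub/event/1"))  # → True
print(ex.can_handle("https://taogroup.com/venues/hakkasan/event/2"))  # → False
```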

Required fields (extraction will fail without these)

| Field      | Format          | Example                          |
|------------|-----------------|----------------------------------|
| event_date | YYYY-MM-DD      | "2026-03-15"                     |
| performer  | Display name    | "DOM DOLLA"                      |
| venue      | Full venue name | "Encore Beach Club"              |
| url        | Source URL      | passed through                   |
| scraped_at | ISO-8601        | datetime.now(tz=UTC).isoformat() |

Critical optional fields (things that break downstream if wrong)

| Field       | Why it matters                      | What breaks without it                                             |
|-------------|-------------------------------------|--------------------------------------------------------------------|
| venue_id    | Table pricing API needs VEN* code   | Pricing queries return nothing; D1 gets synthetic venue:{slug} key |
| images      | ImageMetadata list → R2 upload → D1 | No artist image on site; cards show placeholder                    |
| external_id | Dedup across re-scrapes             | Minor — composite_key handles identity                             |
| event_time  | Display + event_datetime            | Shows "TBD" on site                                                |
| ticket_url  | CTA button on event page            | No ticket link shown                                               |

Construction pattern

Always use EventData TypedDict + factory method (never VegasEvent(**dict)):

from datetime import datetime, timezone

from src.models import EventData, VegasEvent

event_data: EventData = {
    "url": url,
    "scraped_at": datetime.now(tz=timezone.utc).isoformat(),
    "performer": performer,
    "venue": "My Venue Name",
    "event_date": event_date,
}
# Add optional fields conditionally
if venue_id:
    event_data["venue_id"] = venue_id
if images:
    event_data["images"] = images

return VegasEvent.from_extractor_data(event_data)

Image extraction

from src.models import ImageMetadata

images = []
if img_url:
    images.append(ImageMetadata(source_url=img_url, category="artist_full"))
event_data["images"] = images
  • Prefer full-resolution source URLs (avoid thumbnail crops)
  • Use category="artist_full" for the main artist/event flyer
  • The image pipeline handles downloading, resizing, and R2 upload — just provide the source URL

Phase 2: Register and wire up

  • Register in create_default_registry() in src/extractors/__init__.py
  • Order matters: if sharing a domain, put the extractor with richer data (e.g. JSON-LD) first
  • Add start URL(s) to crawlee router in src/crawlee_main.py
  • If using sitemaps, add URL-level pre-filters for date range and venue (avoid crawling thousands of irrelevant pages)
  • Add venue aliases to VENUE_ALIASES in src/cli_scrape.py (e.g. vinny scrape omnia)
  • Per-venue filtering (for multi-venue domains like TAO Group):
      • Add URL slug mapping to _TAO_ALIAS_SLUGS in src/cli_scrape.py so vinny scrape omnia only enqueues Omnia URLs from the shared sitemaps
      • Slugs are extracted from CLI args before alias resolution (aliases become sitemap URLs)
      • Passed as tao_venue_slugs to run_scraper() → used in sitemap handler to filter URLs
      • Omitting a venue from the slug map (e.g. tao, tao-group) means no filter = all LV venues
  • Add venue tag(s) to ImageStorage.VENUE_SUFFIXES in src/plugins/images/storage.py
  • Run just check (ruff + ty) — fix any type errors

Phase 3: Verification (the part we kept skipping)

Pipeline order is critical

Images must be downloaded and uploaded to R2 before the D1 export, otherwise artist_image_url will be NULL. The steps below follow the correct pipeline order: extract → images → D1 → pricing → site. vinny sync handles this automatically.


3a. Unit extraction test

Scrape all 3 sample events from Phase 0 and cross-reference against phase0.json.

Sample commands
# Scrape the 3 sample events
vinny scrape -u "https://example.com/event/1" -u "https://example.com/event/2" -u "https://example.com/event/3" --max-requests 10

# Inspect the output
cat runs/latest/events.json | jq '.[]'

Expected: each event's composite_key, event_date, performer, and venue match the values in phase0.json.

Cross-reference each event against tests/fixtures/{venue}/phase0.json:

  • composite_key looks right: {date}-{performer-slug}-{venue-slug}
  • event_date matches Phase 0 expected value
  • performer matches Phase 0 expected value (not the event title, not the venue)
  • venue matches the canonical venue name used in the site
  • venue_id matches Phase 0 expected value (starts with VEN, or intentionally null with a comment)
  • event_time matches Phase 0 expected value (or documented as unavailable)
  • images array has at least one entry with a valid source_url
  • ticket_url matches Phase 0 expected value (if available)
  • Record any discrepancies in tests/fixtures/{venue}/debug.md with explanation
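The cross-reference can be automated. A sketch assuming the phase0.json shape from Phase 0b — load runs/latest/events.json and phase0.json with json.load, then diff field by field (field names here follow the checklist above):

```python
CHECK_FIELDS = ["event_date", "performer", "venue", "venue_id", "event_time", "ticket_url"]


def cross_reference(scraped: list[dict], expected: list[dict]) -> list[str]:
    """Compare scraper output against phase0.json; return human-readable discrepancies."""
    by_url = {e["url"]: e for e in scraped}
    problems = []
    for exp in expected:
        got = by_url.get(exp["url"])
        if got is None:
            problems.append(f"missing event: {exp['url']}")
            continue
        # Only assert fields the oracle actually recorded
        for field in CHECK_FIELDS:
            if field in exp and got.get(field) != exp[field]:
                problems.append(
                    f"{exp['url']}: {field} = {got.get(field)!r}, expected {exp[field]!r}"
                )
        if not got.get("images"):
            problems.append(f"{exp['url']}: images array is empty")
    return problems
```

An empty return means all three fixtures match the oracle; anything else goes into debug.md with an explanation.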

3b. Image pipeline test

Images must be downloaded and uploaded to R2 before the D1 export so that artist_image_url is populated (not NULL).

# Download images for the run
vinny images download -r latest

# Check status
vinny images status -r latest

# Upload to R2 + sync URLs to D1
vinny images upload-r2 --sync-d1

Verify:

  • Images downloaded to data/artists/{artist-slug}/{artist-slug}_{venue-tag}_{size}.jpg
  • Artist slug in filename matches the performer slug in composite_key
  • R2 upload succeeds (check for r2_url in image metadata)
  • Record r2_url values in tests/fixtures/{venue}/debug.md under a ## Phase 3b: Image Pipeline heading

3c. D1 export test

Now that images have R2 URLs, export to D1. The composite_key becomes the D1 event_id primary key, and venue_id becomes a foreign key to the venues table.

# Export to D1 (database auto-resolved from wrangler.jsonc or D1_DATABASE_ID env)
vinny export-d1 --execute

# Or inspect the generated SQL first
cat runs/latest/d1_import.sql | head -100

Verify in the SQL:

  • event_id value matches composite_key (e.g. '2026-03-15-dom-dolla-encore-beach-club')
  • venue_id is a real VEN code OR an intentional venue:{slug} with pricing disabled
  • venue_name matches what the site expects (check site/src/lib/queries.ts for any venue name filters)
  • artist_image_url is an R2 URL (not a VEA CDN URL; NULL is fine if images aren't on R2 yet)
  • streaming_links_json is valid JSON or NULL (not the string "null")
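These checks can be scripted against the exported rows. A sketch — the column names follow the checklist above, and the VEA-CDN detection is a crude substring heuristic for illustration, not project logic:

```python
import json
import re


def validate_d1_row(row: dict) -> list[str]:
    """Spot-check one exported event row against the Phase 3c checklist."""
    problems = []
    # event_id must equal the composite_key: {date}-{performer-slug}-{venue-slug}
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}-[a-z0-9-]+", row.get("event_id", "")):
        problems.append(f"bad event_id: {row.get('event_id')!r}")
    # venue_id: real VEN code, or an intentional synthetic venue:{slug}
    vid = row.get("venue_id", "")
    if not (vid.startswith("VEN") or vid.startswith("venue:")):
        problems.append(f"bad venue_id: {vid!r}")
    # artist_image_url must be an R2 URL, never a VEA CDN URL (None is fine pre-upload)
    img = row.get("artist_image_url")
    if img is not None and "vea" in img.lower():
        problems.append(f"CDN URL leaked into artist_image_url: {img!r}")
    # streaming_links_json: valid JSON or None — never the string "null"
    sl = row.get("streaming_links_json")
    if sl == "null":
        problems.append('streaming_links_json is the string "null"')
    elif sl is not None:
        try:
            json.loads(sl)
        except json.JSONDecodeError:
            problems.append(f"streaming_links_json is not valid JSON: {sl!r}")
    return problems
```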

Database auto-resolution: --database flag → D1_DATABASE_ID env → wrangler.jsonc. You don't need to pass --database if credentials are set up.

3d. Table pricing test

Skip only if Phase 0b explicitly documented that no table pricing data exists for this venue. This is rare — most venues have it.

# Check pricing extraction for all 3 sample events
cat runs/latest/events.json | jq '.[].table_pricing'

Cross-reference against phase0.json table pricing values:

  • table_pricing.sections is populated (or null if venue doesn't embed pricing)
  • Each tier's name, min_spend, guests matches the values manually extracted in Phase 0b
  • If using urvenue API, venue_id is the correct VEN code for THIS venue (not another venue's code)
  • Record results in tests/fixtures/{venue}/debug.md under ## Phase 3d: Table Pricing

Handling discrepancies: The API may return more data than what's visible on the event page (e.g. extra tiers, different min spends). When this happens:

  1. Review the Phase 0 screenshots to confirm what's actually shown on the site
  2. Prioritize what's visible on the site — the user sees the site, not the API
  3. If the API has more data, document it in debug.md but don't assert against it
  4. If values conflict (e.g. different min spend), flag for human review in debug.md
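A sketch of a comparison that separates real conflicts from API-only extras, so the extras can be logged in debug.md without failing the assertion (tier field names follow the Phase 0b example):

```python
def compare_pricing(scraped: list[dict], expected: list[dict]) -> dict[str, list]:
    """Compare scraped table_pricing tiers against phase0.json ground truth."""
    exp_by_name = {t["name"]: t for t in expected}
    got_by_name = {t["name"]: t for t in scraped}
    conflicts = []
    for name, exp in exp_by_name.items():
        got = got_by_name.get(name)
        if got is None:
            conflicts.append(f"missing tier: {name}")
        elif (got.get("min_spend"), got.get("guests")) != (exp.get("min_spend"), exp.get("guests")):
            conflicts.append(f"{name}: got {got}, expected {exp}")
    # Tiers the API returned but the page didn't show: document, don't assert
    api_only = [n for n in got_by_name if n not in exp_by_name]
    return {"conflicts": conflicts, "api_only": api_only}
```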

3e. Site rendering test

The Astro dev server uses a local miniflare D1 database (not the remote production D1). Data must be imported into this local DB before pages will render.

Prerequisites — import data into local D1

# Import to local D1 (auto-resolves database from wrangler.jsonc)
vinny export-d1 --execute --local

The local D1 SQLite file lives at:

site/.wrangler/state/v3/d1/miniflare-D1DatabaseObject/*.sqlite

If using sqlite3 directly instead of wrangler:

LOCAL_DB=$(find site/.wrangler/state/v3/d1 -name "*.sqlite" | head -1)
sqlite3 "$LOCAL_DB" < runs/latest/d1_import.sql
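A quick spot-check against the local database can look like this — the table and column names are assumed from the checklist above (the real schema lives in d1_import.sql), and an in-memory database stands in for the miniflare SQLite file:

```python
import sqlite3

# In-memory stand-in; in practice connect to the file found with:
#   find site/.wrangler/state/v3/d1 -name "*.sqlite"
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (event_id TEXT PRIMARY KEY, venue_id TEXT, artist_image_url TEXT)"
)
conn.execute(
    "INSERT INTO events VALUES (?, ?, ?)",
    ("2026-03-15-dom-dolla-encore-beach-club", "VEN1121561", None),
)

# Spot-check: events whose image never made it to R2 (NULL artist_image_url)
missing = conn.execute(
    "SELECT event_id FROM events WHERE artist_image_url IS NULL"
).fetchall()
print(missing)  # → [('2026-03-15-dom-dolla-encore-beach-club',)]
```

A non-empty result here means the image pipeline (Phase 3b) didn't run before the export, or the venue tag mapping is wrong.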

URL patterns

  • Venue list: /venues — all venues with upcoming event counts
  • Venue detail: /venues/{venue_id} — e.g. /venues/venue:hakkasan-nightclub or /venues/VEN1121561
  • Event detail: /events/{composite_key} — e.g. /events/2026-03-20-tyga-hakkasan-nightclub
  • Events list: /events or /events?venue={venue_id} for filtered view

Start the dev server

cd site && bun run dev

The server starts on port 4321 (or next available if occupied). Check the terminal output for the actual port.

Checklist

  • Navigate to /venues — does the new venue appear with the correct event count?
  • Navigate to the venue detail page (e.g. /venues/venue:hakkasan-nightclub) — are events listed?
  • Navigate to each of the 3 sample event pages — do they render without SSR errors?
  • Check dev server terminal output for stack traces (SSR errors are silent in the browser — HTTP 200 with empty <main>)
  • Artist image shows (R2 URL, not placeholder)
  • Event date, time, venue name display correctly
  • Streaming links render (or gracefully absent)
  • Table pricing section renders (or gracefully absent)
  • Take screenshots and save to tests/fixtures/{venue}/ for the record

Phase 4: Documentation

  • Add venue section to EXTRACTORS.md
  • Update the "Registered Venues" table
  • Document any quirks (shared domains, missing data, fragile selectors)
  • Update PLAN.md if this was a tracked milestone

Lessons Learned

Add to this section each time an extractor issue is discovered post-deploy.

2026-03-03: EBC D1 key mismatch

Problem: EBC events exported to D1 but keys were wrong and images didn't show.

Root cause: Skipped Phase 3b/3c verification. The composite_key was generating correctly but the venue_id wasn't being set (fell back to synthetic venue:encore-beach-club), and images weren't going through the R2 upload pipeline before the D1 export.

Fix: Always run the full Phase 3 verification before marking an extractor as done. Never go straight from "extract works" to "ship it".

2026-03-04: Local D1 is separate from remote D1

Problem: After running just export-d1 (which pushes to remote D1), the Astro dev server showed no new venues or events.

Root cause: The Astro dev server uses platformProxy: { enabled: true } in astro.config.mjs, which means it reads from a local miniflare D1 SQLite file — not the remote Cloudflare D1. The just export-d1 command only pushes to remote.

Fix: Import d1_import.sql directly into the local SQLite file using sqlite3. See Phase 3e above for the full procedure. Alternatively, use vinny export-d1 --execute --local which resolves the database from wrangler.jsonc automatically.

2026-03-04: D1 export fails with "duplicate column name"

Problem: vinny export-d1 --execute failed with duplicate column name: category: SQLITE_ERROR.

Root cause: The generated SQL included ALTER TABLE table_tiers ADD COLUMN category TEXT migrations for backward compatibility, but the CREATE TABLE statement already defined those columns. The REST API path handled this gracefully (ran migrations one-by-one, caught duplicates), but the wrangler CLI sent the whole file as one transaction and failed on the first error.

Fix: Removed the redundant ALTER TABLE migrations. The CREATE TABLE schema is the source of truth — if a column exists there, no migration is needed.

2026-03-04: Pipeline order matters — images before D1

Problem: D1 export showed artist_image_url = NULL even though images were on R2.

Root cause: Ran D1 export before image download/upload. The D1 exporter only writes R2 URLs, never VEA CDN URLs, so artist_image_url is NULL until images are on R2.

Fix: Reordered Phase 3 verification: 3a extraction → 3b images → 3c D1 → 3d pricing → 3e site. The vinny sync command already does this in the right order; the contract now matches.

2026-03-04: Database auto-resolution for image sync

Problem: vinny images sync-d1 required --database flag even though the database name is in wrangler.jsonc.

Root cause: The images CLI didn't use the same fallback chain as export-d1. Only cli_export.py and cli_pipeline.py resolved from wrangler.jsonc.

Fix: Added the same --database flag → D1_DATABASE_ID env → read_wrangler_d1_config(wrangler.jsonc) fallback to _apply_r2_sync() in the images CLI.


Last updated: 2026-03-04