
Extractor Contract

A checklist and verification contract for adding new venue extractors. Every agent (human or AI) must follow these steps. Update this doc when you learn something new.


Phase 0: Research (before writing code)

0a. Page discovery and data source audit

  • Identify the venue's event listing page and single-event page URLs
  • Locate the venue's sitemap — most venues have one (XML sitemap, WordPress events-sitemap.xml, etc.)
  • Save a local copy to tests/fixtures/{venue}/ for offline testing and reference
  • Note the URL count and date range covered
  • If the sitemap covers multiple cities or years, document which portions contain current Las Vegas events
  • Choose 3 sample event pages — these will be your ground-truth fixtures through all phases
  • Take a screenshot of each sample event page using agent-browser — save to tests/fixtures/{venue}/
  • Inspect single-event page source for data sources (in priority order):
      • Schema.org JSON-LD (<script type="application/ld+json">)
      • Embedded JS variables (e.g. uv_tablesitems)
      • og:* meta tags
      • Structured HTML (classes, data attrs)
      • Raw HTML text (last resort — fragile)
  • Document the URL pattern (e.g. https://domain.com/event/EVE{id}{YYYYMMDD}/{slug}/)
  • Check if the venue shares a domain with existing extractors (requires disambiguation)
  • Identify the venue_id / venuecode — check urvenue embed, JS vars, or API calls in devtools
  • Locate image sources — CDN URLs, og:image, JSON-LD image field
  • Check for streaming/social links in page source
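When JSON-LD is present, the source audit can be scripted. A minimal stdlib-only sketch (regex matching is fine for auditing saved fixtures; the sample HTML and field names are illustrative):

```python
import json
import re

JSON_LD_RE = re.compile(
    r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
    re.DOTALL,
)


def extract_json_ld(html: str) -> list[dict]:
    """Pull every JSON-LD block out of saved page HTML."""
    blocks = []
    for raw in JSON_LD_RE.findall(html):
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed block — note it in debug.md
        blocks.extend(data if isinstance(data, list) else [data])
    return blocks


html = '''<script type="application/ld+json">
{"@type": "Event", "startDate": "2026-03-15", "performer": {"name": "DOM DOLLA"}}
</script>'''
events = [b for b in extract_json_ld(html) if b.get("@type") == "Event"]
print(events[0]["startDate"])  # → 2026-03-15
```

Run this against each saved fixture to confirm which fields the JSON-LD actually carries before committing to it as the primary data source.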

Playwright note: If the event page relies on client-side JS rendering (e.g. React/Vue SPA, dynamically loaded content), static HTML from httpx/requests won't contain the data. Check the saved HTML for missing content — if the data isn't in the raw HTML, you'll need Playwright (Crawlee's PlaywrightCrawler) instead of the default HTTP crawler. Signs you need Playwright: empty <div id="app">, data only visible after JS execution, API calls triggered by useEffect/onMounted.
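The "do I need Playwright" check can be turned into a quick heuristic over the saved HTML. A sketch — the signals mirror the note above, but the regexes and function name are illustrative, not project code:

```python
import re


def needs_playwright(html: str) -> bool:
    """Heuristic: does this saved HTML look like an empty client-side shell?

    Returns True when the static source likely lacks event data, meaning a
    JS-rendering crawler (e.g. Crawlee's PlaywrightCrawler) is needed.
    """
    # JSON-LD or og: meta tags in the raw source are strong signals that
    # plain HTTP fetching is enough.
    has_json_ld = 'application/ld+json' in html
    has_og_tags = re.search(r'<meta[^>]+property="og:', html) is not None
    # An empty SPA mount point is a strong signal the other way.
    empty_app_shell = re.search(r'<div id="(app|root)">\s*</div>', html) is not None
    return empty_app_shell and not (has_json_ld or has_og_tags)


spa_html = '<html><body><div id="root"></div></body></html>'
ssr_html = '<html><head><script type="application/ld+json">{}</script></head></html>'
print(needs_playwright(spa_html))  # → True
print(needs_playwright(ssr_html))  # → False
```

This only automates the obvious cases — always eyeball the saved HTML too, since data can be present but incomplete.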

0b. Save fixtures and verify extractability

  • Save the 3 sample event HTML files to tests/fixtures/{venue}/ for offline testing
  • Open the saved HTML locally — confirm the data you need is actually present in the static source
  • If data is missing → Playwright required (see note above)
  • Manually extract ALL verifiable fields from each saved HTML and record them in tests/fixtures/{venue}/phase0.json:
      • Required: event_date, performer, venue, url
      • Optional: event_time, venue_id, external_id, ticket_url, images (source URLs)
      • Table pricing: section names, min spend amounts, guest counts
      • Example: "Dance Floor Table: $2,000 min / 6 guests" → {"name": "Dance Floor Table", "min_spend": 2000, "guests": 6}
  • Note in phase0.json whether table pricing data is present → determines if Phase 3d applies (it almost always will)
  • Record all Phase 0 findings in tests/fixtures/{venue}/debug.md — this file is the running log for the entire extractor build

phase0.json is your test oracle. Every field recorded here will be cross-referenced against scraper output in Phase 3a and pricing output in Phase 3d. Get these values right now so you have a reliable baseline.
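A hypothetical phase0.json shape — the exact schema is yours to define, but every field recorded should be directly assertable in Phase 3a/3d (all values below are invented examples):

```json
{
  "has_table_pricing": true,
  "events": [
    {
      "url": "https://example.com/event/EVE123420260315/dom-dolla/",
      "event_date": "2026-03-15",
      "performer": "DOM DOLLA",
      "venue": "Encore Beach Club",
      "event_time": "11:00",
      "venue_id": "VEN1121561",
      "ticket_url": "https://example.com/tickets/123",
      "images": ["https://cdn.example.com/dom-dolla.jpg"],
      "table_pricing": [
        {"name": "Dance Floor Table", "min_spend": 2000, "guests": 6}
      ]
    }
  ]
}
```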

0c. Image resolution audit

Test which image sizes are available from the venue's CDN/image source. Our image pipeline supports these size presets:

| Preset | Pixels   | Use case                     |
|--------|----------|------------------------------|
| small  | 250px    | Thumbnails, cards            |
| main   | 500px    | Default event image          |
| medium | 750px    | Mid-resolution               |
| large  | 1000px   | Detail pages                 |
| hd     | 1500px   | High-definition              |
| raw    | Original | Highest available resolution |
  • Identify the base image URL from the event page (og:image, JSON-LD image, or CDN URL)
  • Test each size preset against the venue's CDN to see which resolutions are available
  • For VEA CDN venues: modify the URL dimension parameter and check for 200 vs 404
  • For other CDNs: check if resize parameters exist in the URL pattern
  • Document which sizes are downloadable for this venue
  • Note the highest available resolution — this becomes the source for the image pipeline
  • Save sample images to tests/fixtures/{venue}/ for at least one artist, using the naming convention below
  • Record available sizes and source URLs in debug.md
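The size audit can be scripted. A sketch assuming the CDN takes a width query parameter (many encode dimensions in the path instead — adapt the URL rewrite accordingly); probe each candidate with a HEAD request and record 200 vs 404 in debug.md:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Size presets from the table above (raw = original URL, no resize parameter)
SIZE_PRESETS = {"small": 250, "main": 500, "medium": 750, "large": 1000, "hd": 1500}


def candidate_urls(base_url: str, width_param: str = "w") -> dict[str, str]:
    """Build one candidate URL per size preset by rewriting a width query param.

    width_param is an assumption — inspect the venue's actual CDN URLs first.
    """
    scheme, netloc, path, query, frag = urlsplit(base_url)
    urls = {}
    for preset, px in SIZE_PRESETS.items():
        params = dict(parse_qsl(query))
        params[width_param] = str(px)
        urls[preset] = urlunsplit((scheme, netloc, path, urlencode(params), frag))
    urls["raw"] = base_url  # original, untouched
    return urls


urls = candidate_urls("https://cdn.example.com/img/diplo.jpg?w=500")
print(urls["hd"])  # → https://cdn.example.com/img/diplo.jpg?w=1500
```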

Naming convention for downloaded images:

data/artists/{artist_slug}/{artist_slug}_{venue_tag}_{size}.jpg
| Venue                         | Tag  | Example filename     |
|-------------------------------|------|----------------------|
| Encore Beach Club             | ebc  | diplo_ebc_main.jpg   |
| Encore Beach Club at Night    | ebcn | diplo_ebcn_raw.jpg   |
| XS Nightclub                  | xs   | diplo_xs_hd.jpg      |
| LIV Nightclub / LIV Las Vegas | liv  | diplo_liv_main.jpg   |
| LIV Beach                     | livb | diplo_livb_small.jpg |

New venues: add a 2-4 char tag to ImageStorage.VENUE_SUFFIXES in src/plugins/images/storage.py. Unknown venues fall back to the first 3 chars of the slugified venue name.
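The convention and its fallback can be sketched as follows — the real mapping lives in ImageStorage.VENUE_SUFFIXES, and this slugify is a simplified stand-in, not the project's:

```python
import re


def slugify(name: str) -> str:
    """Lowercase, with runs of non-alphanumerics collapsed to hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


# Known venue tags (subset, for illustration)
VENUE_SUFFIXES = {"encore-beach-club": "ebc", "xs-nightclub": "xs"}


def image_path(artist: str, venue: str, size: str) -> str:
    artist_slug = slugify(artist)
    venue_slug = slugify(venue)
    # Unknown venues fall back to the first 3 chars of the slugified name
    tag = VENUE_SUFFIXES.get(venue_slug, venue_slug[:3])
    return f"data/artists/{artist_slug}/{artist_slug}_{tag}_{size}.jpg"


print(image_path("Diplo", "Encore Beach Club", "main"))
# → data/artists/diplo/diplo_ebc_main.jpg
print(image_path("Diplo", "Omnia Nightclub", "hd"))
# → data/artists/diplo/diplo_omn_hd.jpg
```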

Fixture directory structure

After Phase 0 is complete, tests/fixtures/{venue}/ should look like this:

tests/fixtures/omnia/
├── events-sitemap.xml           # Local copy of venue sitemap(s)
├── event1.html                  # Sample event page HTML
├── event2.html
├── event3.html
├── event1.png                   # Screenshot from agent-browser
├── event2.png
├── event3.png
├── phase0.json                  # Ground-truth field values (test oracle)
├── debug.md                     # Running log: findings, r2_urls, discrepancies
├── artist_omn_main.jpg          # Sample image: main (500px)
└── artist_omn_small.jpg         # Sample image: small (250px)
  • phase0.json — structured expected values for all 3 events, used as assertions in Phase 3a and 3d
  • debug.md — append-only log updated at each phase; records screenshots, image audit results, r2_urls, pricing cross-checks, and any discrepancies requiring human review

Phase Gate: Complete Phase 0 before writing code

Do not start implementation until all Phase 0 items are checked. You need: 3 saved HTML fixtures, a phase0.json test oracle with manually extracted field values, and a completed image resolution audit. Skipping this leads to extractors that "work" but produce wrong data.

Phase 1: Implement the extractor

File: src/extractors/{venue}.py

  • Inherit from VenueExtractor (or WynnSocialBase for Wynn properties)
  • Implement required properties: name, domain
  • If sharing a domain, implement can_handle() with disambiguation logic
  • Implement extract(soup, url) → VegasEvent | None
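The shape of the contract, as a self-contained sketch — VenueExtractor and VegasEvent here are simplified stand-ins, not the project's actual classes (use the real ones from src.models and src.extractors):

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class VegasEvent:  # simplified stand-in for src.models.VegasEvent
    event_date: str
    performer: str
    venue: str
    url: str


class VenueExtractor(ABC):  # simplified stand-in for the project base class
    @property
    @abstractmethod
    def name(self) -> str: ...

    @property
    @abstractmethod
    def domain(self) -> str: ...

    def can_handle(self, url: str) -> bool:
        # Default: match by domain; override when the domain is shared
        return self.domain in url


class OmniaExtractor(VenueExtractor):
    @property
    def name(self) -> str:
        return "Omnia Nightclub"

    @property
    def domain(self) -> str:
        return "taogroup.com"

    def can_handle(self, url: str) -> bool:
        # Shared domain: disambiguate on the venue slug in the URL path
        return super().can_handle(url) and "/omnia" in url


ex = OmniaExtractor()
print(ex.can_handle("https://taogroup.com/venues/omnia-nightclub/event/1"))  # → True
print(ex.can_handle("https://taogroup.com/venues/hakkasan/event/2"))  # → False
```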

Required fields (extraction will fail without these)

| Field      | Format          | Example                          |
|------------|-----------------|----------------------------------|
| event_date | YYYY-MM-DD      | "2026-03-15"                     |
| performer  | Display name    | "DOM DOLLA"                      |
| venue      | Full venue name | "Encore Beach Club"              |
| url        | Source URL      | passed through                   |
| scraped_at | ISO-8601        | datetime.now(tz=UTC).isoformat() |

Critical optional fields (things that break downstream if wrong)

| Field       | Why it matters                      | What breaks without it                                             |
|-------------|-------------------------------------|--------------------------------------------------------------------|
| venue_id    | Table pricing API needs VEN* code   | Pricing queries return nothing; D1 gets synthetic venue:{slug} key |
| images      | ImageMetadata list → R2 upload → D1 | No artist image on site; cards show placeholder                    |
| external_id | Dedup across re-scrapes             | Minor — composite_key handles identity                             |
| event_time  | Display + event_datetime            | Shows "TBD" on site                                                |
| ticket_url  | CTA button on event page            | No ticket link shown                                               |

Construction pattern

Always use EventData TypedDict + factory method (never VegasEvent(**dict)):

from datetime import datetime, timezone

from src.models import EventData, VegasEvent

event_data: EventData = {
    "url": url,
    "scraped_at": datetime.now(tz=timezone.utc).isoformat(),
    "performer": performer,
    "venue": "My Venue Name",
    "event_date": event_date,
}
# Add optional fields conditionally
if venue_id:
    event_data["venue_id"] = venue_id
if images:
    event_data["images"] = images

return VegasEvent.from_extractor_data(event_data)

Image extraction

from src.models import ImageMetadata

images = []
if img_url:
    images.append(ImageMetadata(source_url=img_url, category="artist_full"))
event_data["images"] = images
  • Prefer full-resolution source URLs (avoid thumbnail crops)
  • Use category="artist_full" for the main artist/event flyer
  • The image pipeline handles downloading, resizing, and R2 upload — just provide the source URL

Phase 2: Register and wire up

  • Register in create_default_registry() in src/extractors/__init__.py
  • Order matters: if sharing a domain, put the extractor with richer data (e.g. JSON-LD) first
  • Add start URL(s) to crawlee router in src/crawlee_main.py
  • If using sitemaps, add URL-level pre-filters for date range and venue (avoid crawling thousands of irrelevant pages)
  • Add venue aliases to VENUE_ALIASES in src/cli_scrape.py (e.g. vinny scrape omnia)
  • Per-venue filtering (for multi-venue domains like TAO Group):
      • Add URL slug mapping to _TAO_ALIAS_SLUGS in src/cli_scrape.py so vinny scrape omnia only enqueues Omnia URLs from the shared sitemaps
      • Slugs are extracted from CLI args before alias resolution (aliases become sitemap URLs)
      • Passed as tao_venue_slugs to run_scraper() → used in sitemap handler to filter URLs
      • Omitting a venue from the slug map (e.g. tao, tao-group) means no filter = all LV venues
  • Add venue tag(s) to ImageStorage.VENUE_SUFFIXES in src/plugins/images/storage.py
  • Run just check (ruff + ty) — fix any type errors

Phase 3: Verification (the part we kept skipping)

Pipeline order is critical

Images must be downloaded and uploaded to R2 before the D1 export, otherwise artist_image_url will be NULL. The steps below follow the correct pipeline order: extract → images → D1 → pricing → site. vinny sync handles this automatically.


3a. Unit extraction test

Scrape all 3 sample events from Phase 0 and cross-reference against phase0.json.

Sample commands
# Scrape the 3 sample events
vinny scrape -u "https://example.com/event/1" -u "https://example.com/event/2" -u "https://example.com/event/3" --max-requests 10

# Inspect the output
cat runs/latest/events.json | jq '.[]'

Expected: each event's composite_key, event_date, performer, and venue match the values in phase0.json.

Cross-reference each event against tests/fixtures/{venue}/phase0.json:

  • composite_key looks right: {date}-{performer-slug}-{venue-slug}
  • event_date matches Phase 0 expected value
  • performer matches Phase 0 expected value (not the event title, not the venue)
  • venue matches the canonical venue name used in the site
  • venue_id matches Phase 0 expected value (starts with VEN, or intentionally null with a comment)
  • event_time matches Phase 0 expected value (or documented as unavailable)
  • images array has at least one entry with a valid source_url
  • ticket_url matches Phase 0 expected value (if available)
  • Record any discrepancies in tests/fixtures/{venue}/debug.md with explanation
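The cross-reference can be automated. A sketch assuming the phase0.json shape from Phase 0b — load runs/latest/events.json and phase0.json with json.load, then diff field by field (field names here follow the checklist above):

```python
CHECK_FIELDS = ["event_date", "performer", "venue", "venue_id", "event_time", "ticket_url"]


def cross_reference(scraped: list[dict], expected: list[dict]) -> list[str]:
    """Compare scraper output against phase0.json; return human-readable discrepancies."""
    by_url = {e["url"]: e for e in scraped}
    problems = []
    for exp in expected:
        got = by_url.get(exp["url"])
        if got is None:
            problems.append(f"missing event: {exp['url']}")
            continue
        # Only assert fields the oracle actually recorded
        for field in CHECK_FIELDS:
            if field in exp and got.get(field) != exp[field]:
                problems.append(
                    f"{exp['url']}: {field} = {got.get(field)!r}, expected {exp[field]!r}"
                )
        if not got.get("images"):
            problems.append(f"{exp['url']}: images array is empty")
    return problems
```

An empty return means all three fixtures match the oracle; anything else goes into debug.md with an explanation.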

3b. Image pipeline test

Images must be downloaded and uploaded to R2 before the D1 export so that artist_image_url is populated (not NULL).

# Download images for the run
vinny images download -r latest

# Check status
vinny images status -r latest

# Upload to R2 + sync URLs to D1
vinny images upload-r2 --sync-d1

Verify:

  • Images downloaded to data/artists/{artist-slug}/{artist-slug}_{venue-tag}_{size}.jpg
  • Artist slug in filename matches the performer slug in composite_key
  • R2 upload succeeds (check for r2_url in image metadata)
  • Record r2_url values in tests/fixtures/{venue}/debug.md under a ## Phase 3b: Image Pipeline heading

3c. D1 export test

Now that images have R2 URLs, export to D1. The composite_key becomes the D1 event_id primary key, and venue_id becomes a foreign key to the venues table.

# Export to D1 (database auto-resolved from wrangler.jsonc or D1_DATABASE_ID env)
vinny export-d1 --execute

# Or inspect the generated SQL first
cat runs/latest/d1_import.sql | head -100

Verify in the SQL:

  • event_id value matches composite_key (e.g. '2026-03-15-dom-dolla-encore-beach-club')
  • venue_id is a real VEN code OR an intentional venue:{slug} with pricing disabled
  • venue_name matches what the site expects (check site/src/lib/queries.ts for any venue name filters)
  • artist_image_url is an R2 URL (not a VEA CDN URL; NULL is fine if images aren't on R2 yet)
  • streaming_links_json is valid JSON or NULL (not the string "null")
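These checks can be scripted against the exported rows. A sketch — the column names follow the checklist above, and the VEA-CDN detection is a crude substring heuristic for illustration, not project logic:

```python
import json
import re


def validate_d1_row(row: dict) -> list[str]:
    """Spot-check one exported event row against the Phase 3c checklist."""
    problems = []
    # event_id must equal the composite_key: {date}-{performer-slug}-{venue-slug}
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}-[a-z0-9-]+", row.get("event_id", "")):
        problems.append(f"bad event_id: {row.get('event_id')!r}")
    # venue_id: real VEN code, or an intentional synthetic venue:{slug}
    vid = row.get("venue_id", "")
    if not (vid.startswith("VEN") or vid.startswith("venue:")):
        problems.append(f"bad venue_id: {vid!r}")
    # artist_image_url must be an R2 URL, never a VEA CDN URL (None is fine pre-upload)
    img = row.get("artist_image_url")
    if img is not None and "vea" in img.lower():
        problems.append(f"CDN URL leaked into artist_image_url: {img!r}")
    # streaming_links_json: valid JSON or None — never the string "null"
    sl = row.get("streaming_links_json")
    if sl == "null":
        problems.append('streaming_links_json is the string "null"')
    elif sl is not None:
        try:
            json.loads(sl)
        except json.JSONDecodeError:
            problems.append(f"streaming_links_json is not valid JSON: {sl!r}")
    return problems
```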

Database auto-resolution: --database flag → D1_DATABASE_ID env → wrangler.jsonc. You don't need to pass --database if credentials are set up.

3d. Table pricing test

Skip only if Phase 0b explicitly documented that no table pricing data exists for this venue. This is rare — most venues have it.

# Check pricing extraction for all 3 sample events
cat runs/latest/events.json | jq '.[].table_pricing'

Cross-reference against phase0.json table pricing values:

  • table_pricing.sections is populated (or null if venue doesn't embed pricing)
  • Each tier's name, min_spend, guests matches the values manually extracted in Phase 0b
  • If using urvenue API, venue_id is the correct VEN code for THIS venue (not another venue's code)
  • Record results in tests/fixtures/{venue}/debug.md under ## Phase 3d: Table Pricing

Handling discrepancies: The API may return more data than what's visible on the event page (e.g. extra tiers, different min spends). When this happens:

  1. Review the Phase 0 screenshots to confirm what's actually shown on the site
  2. Prioritize what's visible on the site — the user sees the site, not the API
  3. If the API has more data, document it in debug.md but don't assert against it
  4. If values conflict (e.g. different min spend), flag for human review in debug.md
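A sketch of a comparison that separates real conflicts from API-only extras, so the extras can be logged in debug.md without failing the assertion (tier field names follow the Phase 0b example):

```python
def compare_pricing(scraped: list[dict], expected: list[dict]) -> dict[str, list]:
    """Compare scraped table_pricing tiers against phase0.json ground truth."""
    exp_by_name = {t["name"]: t for t in expected}
    got_by_name = {t["name"]: t for t in scraped}
    conflicts = []
    for name, exp in exp_by_name.items():
        got = got_by_name.get(name)
        if got is None:
            conflicts.append(f"missing tier: {name}")
        elif (got.get("min_spend"), got.get("guests")) != (exp.get("min_spend"), exp.get("guests")):
            conflicts.append(f"{name}: got {got}, expected {exp}")
    # Tiers the API returned but the page didn't show: document, don't assert
    api_only = [n for n in got_by_name if n not in exp_by_name]
    return {"conflicts": conflicts, "api_only": api_only}
```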

3e. Site rendering test

The Astro dev server uses a local miniflare D1 database (not the remote production D1). Data must be imported into this local DB before pages will render.

Prerequisites — import data into local D1

# Import to local D1 (auto-resolves database from wrangler.jsonc)
vinny export-d1 --execute --local

The local D1 SQLite file lives at:

site/.wrangler/state/v3/d1/miniflare-D1DatabaseObject/*.sqlite

If using sqlite3 directly instead of wrangler:

LOCAL_DB=$(find site/.wrangler/state/v3/d1 -name "*.sqlite" | head -1)
sqlite3 "$LOCAL_DB" < runs/latest/d1_import.sql
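A quick spot-check against the local database can look like this — the table and column names are assumed from the checklist above (the real schema lives in d1_import.sql), and an in-memory database stands in for the miniflare SQLite file:

```python
import sqlite3

# In-memory stand-in; in practice connect to the file found with:
#   find site/.wrangler/state/v3/d1 -name "*.sqlite"
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (event_id TEXT PRIMARY KEY, venue_id TEXT, artist_image_url TEXT)"
)
conn.execute(
    "INSERT INTO events VALUES (?, ?, ?)",
    ("2026-03-15-dom-dolla-encore-beach-club", "VEN1121561", None),
)

# Spot-check: events whose image never made it to R2 (NULL artist_image_url)
missing = conn.execute(
    "SELECT event_id FROM events WHERE artist_image_url IS NULL"
).fetchall()
print(missing)  # → [('2026-03-15-dom-dolla-encore-beach-club',)]
```

A non-empty result here means the image pipeline (Phase 3b) didn't run before the export, or the venue tag mapping is wrong.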

URL patterns

  • Venue list: /venues — all venues with upcoming event counts
  • Venue detail: /venues/{venue_id} — e.g. /venues/venue:hakkasan-nightclub or /venues/VEN1121561
  • Event detail: /events/{composite_key} — e.g. /events/2026-03-20-tyga-hakkasan-nightclub
  • Events list: /events or /events?venue={venue_id} for filtered view

Start the dev server

cd site && bun run dev

The server starts on port 4321 (or next available if occupied). Check the terminal output for the actual port.

Checklist

  • Navigate to /venues — does the new venue appear with the correct event count?
  • Navigate to the venue detail page (e.g. /venues/venue:hakkasan-nightclub) — are events listed?
  • Navigate to each of the 3 sample event pages — do they render without SSR errors?
  • Check dev server terminal output for stack traces (SSR errors are silent in the browser — HTTP 200 with empty <main>)
  • Artist image shows (R2 URL, not placeholder)
  • Event date, time, venue name display correctly
  • Streaming links render (or gracefully absent)
  • Table pricing section renders (or gracefully absent)
  • Take screenshots and save to tests/fixtures/{venue}/ for the record

Phase 4: Documentation

  • Add venue section to EXTRACTORS.md
  • Update the "Registered Venues" table
  • Document any quirks (shared domains, missing data, fragile selectors)
  • Update PLAN.md if this was a tracked milestone

Lessons Learned

Add to this section each time an extractor issue is discovered post-deploy.

2026-03-03: EBC D1 key mismatch

Problem: EBC events exported to D1 but keys were wrong and images didn't show.

Root cause: Skipped Phase 3b/3c verification. The composite_key was generating correctly but the venue_id wasn't being set (fell back to synthetic venue:encore-beach-club), and images weren't going through the R2 upload pipeline before the D1 export.

Fix: Always run the full Phase 3 verification before marking an extractor as done. Never go straight from "extract works" to "ship it".

2026-03-04: Local D1 is separate from remote D1

Problem: After running just export-d1 (which pushes to remote D1), the Astro dev server showed no new venues or events.

Root cause: The Astro dev server uses platformProxy: { enabled: true } in astro.config.mjs, which means it reads from a local miniflare D1 SQLite file — not the remote Cloudflare D1. The just export-d1 command only pushes to remote.

Fix: Import d1_import.sql directly into the local SQLite file using sqlite3. See Phase 3e above for the full procedure. Alternatively, use vinny export-d1 --execute --local which resolves the database from wrangler.jsonc automatically.

2026-03-04: D1 export fails with "duplicate column name"

Problem: vinny export-d1 --execute failed with duplicate column name: category: SQLITE_ERROR.

Root cause: The generated SQL included ALTER TABLE table_tiers ADD COLUMN category TEXT migrations for backward compatibility, but the CREATE TABLE statement already defined those columns. The REST API path handled this gracefully (ran migrations one-by-one, caught duplicates), but the wrangler CLI sent the whole file as one transaction and failed on the first error.

Fix: Removed the redundant ALTER TABLE migrations. The CREATE TABLE schema is the source of truth — if a column exists there, no migration is needed.

2026-03-04: Pipeline order matters — images before D1

Problem: D1 export showed artist_image_url = NULL even though images were on R2.

Root cause: Ran D1 export before image download/upload. The D1 exporter only writes R2 URLs, never VEA CDN URLs, so artist_image_url is NULL until images are on R2.

Fix: Reordered Phase 3 verification: 3a extraction → 3b images → 3c D1 → 3d pricing → 3e site. The vinny sync command already does this in the right order; the contract now matches.

2026-03-04: Database auto-resolution for image sync

Problem: vinny images sync-d1 required --database flag even though the database name is in wrangler.jsonc.

Root cause: The images CLI didn't use the same fallback chain as export-d1. Only cli_export.py and cli_pipeline.py resolved from wrangler.jsonc.

Fix: Added the same --database flag → D1_DATABASE_ID env → read_wrangler_d1_config(wrangler.jsonc) fallback to _apply_r2_sync() in the images CLI.


Last updated: 2026-03-04