Extractor Contract¶
A checklist and verification contract for adding new venue extractors. Every agent (human or AI) must follow these steps. Update this doc when you learn something new.
Phase 0: Research (before writing code)¶
0a. Page discovery and data source audit¶
- Identify the venue's event listing page and single-event page URLs
- Locate the venue's sitemap — most venues have one (XML sitemap, WordPress `events-sitemap.xml`, etc.)
- Save a local copy to `tests/fixtures/{venue}/` for offline testing and reference
- Note the URL count and date range covered
- If the sitemap covers multiple cities or years, document which portions contain current Las Vegas events
- Choose 3 sample event pages — these will be your ground-truth fixtures through all phases
- Take a screenshot of each sample event page using `agent-browser` — save to `tests/fixtures/{venue}/`
- Inspect single-event page source for data sources (in priority order):
    - Schema.org JSON-LD (`<script type="application/ld+json">`)
    - Embedded JS variables (e.g. `uv_tables`, `items`)
    - `og:*` meta tags
    - Structured HTML (classes, data attrs)
    - Raw HTML text (last resort — fragile)
- Document the URL pattern (e.g. `https://domain.com/event/EVE{id}{YYYYMMDD}/{slug}/`)
- Check if the venue shares a domain with existing extractors (requires disambiguation)
- Identify the `venue_id` / venue code — check urvenue embed, JS vars, or API calls in devtools
- Locate image sources — CDN URLs, og:image, JSON-LD `image` field
- Check for streaming/social links in page source
Playwright note: If the event page relies on client-side JS rendering (e.g. React/Vue SPA, dynamically loaded content), static HTML from `httpx`/`requests` won't contain the data. Check the saved HTML for missing content — if the data isn't in the raw HTML, you'll need Playwright (Crawlee's `PlaywrightCrawler`) instead of the default HTTP crawler. Signs you need Playwright: an empty `<div id="app">`, data only visible after JS execution, API calls triggered by `useEffect`/`onMounted`.
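The static-vs-rendered check above can be automated against a saved fixture. This is an illustrative sketch (the regex heuristics are mine, not project code): if JSON-LD parses out of the raw HTML, the default HTTP crawler is enough; an empty SPA mount div is the strongest sign you need Playwright.

```python
import json
import re

def find_json_ld(html: str) -> list[dict]:
    """Extract Schema.org JSON-LD blocks from raw (pre-JS) HTML."""
    blocks = re.findall(
        r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
        html,
        flags=re.DOTALL | re.IGNORECASE,
    )
    found = []
    for raw in blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed block — skip it, don't crash the audit
        found.extend(data if isinstance(data, list) else [data])
    return found

def looks_like_spa_shell(html: str) -> bool:
    """Heuristic: a bare mount div with no content means JS-rendered."""
    return bool(re.search(r'<div id="(app|root)">\s*</div>', html))

# Static page: data is extractable without a browser
static_html = ('<html><script type="application/ld+json">'
               '{"@type": "Event", "name": "DOM DOLLA"}</script></html>')
print(find_json_ld(static_html))

# SPA shell: nothing in raw HTML → Playwright required
spa_html = '<html><body><div id="app"></div></body></html>'
print(find_json_ld(spa_html), looks_like_spa_shell(spa_html))
```

Run this against each saved fixture before committing to the default HTTP crawler.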
0b. Save fixtures and verify extractability¶
- Save the 3 sample event HTML files to `tests/fixtures/{venue}/` for offline testing
- Open the saved HTML locally — confirm the data you need is actually present in the static source
    - If data is missing → Playwright required (see note above)
- Manually extract ALL verifiable fields from each saved HTML and record them in `tests/fixtures/{venue}/phase0.json`:
    - Required: `event_date`, `performer`, `venue`, `url`
    - Optional: `event_time`, `venue_id`, `external_id`, `ticket_url`, `images` (source URLs)
    - Table pricing: section names, min spend amounts, guest counts
        - Example: `"Dance Floor Table: $2,000 min / 6 guests"` → `{"name": "Dance Floor Table", "min_spend": 2000, "guests": 6}`
- Note in `phase0.json` whether table pricing data is present → determines if Phase 3d applies (it almost always will)
- Record all Phase 0 findings in `tests/fixtures/{venue}/debug.md` — this file is the running log for the entire extractor build
`phase0.json` is your test oracle. Every field recorded here will be cross-referenced against scraper output in Phase 3a and pricing output in Phase 3d. Get these values right now so you have a reliable baseline.
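The contract fixes which fields go in the oracle but not an exact schema; the shape below is a hypothetical convention, shown with a small validator you could run before starting Phase 1. Adjust the structure to whatever your repo already uses.

```python
# Hypothetical phase0.json shape — field names come from the checklist
# above; the surrounding structure is an assumption.
phase0 = {
    "events": [
        {
            "url": "https://example.com/event/EVE123/dom-dolla/",
            "event_date": "2026-03-15",
            "performer": "DOM DOLLA",
            "venue": "Encore Beach Club",
            "event_time": "11:00",
            "venue_id": "VEN1121561",
            "table_pricing": [
                {"name": "Dance Floor Table", "min_spend": 2000, "guests": 6},
            ],
        },
    ],
}

REQUIRED = ("event_date", "performer", "venue", "url")

def validate_oracle(oracle: dict) -> list[str]:
    """Return a list of problems; empty means the oracle is usable in Phase 3."""
    problems = []
    for i, ev in enumerate(oracle.get("events", [])):
        for field in REQUIRED:
            if not ev.get(field):
                problems.append(f"event[{i}]: missing required field {field!r}")
        date = ev.get("event_date", "")
        if date and (len(date) != 10 or date[4] != "-" or date[7] != "-"):
            problems.append(f"event[{i}]: event_date not YYYY-MM-DD: {date!r}")
    return problems

print(validate_oracle(phase0))  # [] → oracle is well-formed
```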
0c. Image resolution audit¶
Test which image sizes are available from the venue's CDN/image source. Our image pipeline supports these size presets:
| Preset | Pixels | Use case |
|---|---|---|
| `small` | 250px | Thumbnails, cards |
| `main` | 500px | Default event image |
| `medium` | 750px | Mid-resolution |
| `large` | 1000px | Detail pages |
| `hd` | 1500px | High-definition |
| `raw` | Original | Highest available resolution |
- Identify the base image URL from the event page (og:image, JSON-LD `image`, or CDN URL)
- Test each size preset against the venue's CDN to see which resolutions are available
- For VEA CDN venues: modify the URL dimension parameter and check for 200 vs 404
- For other CDNs: check if resize parameters exist in the URL pattern
- Document which sizes are downloadable for this venue
- Note the highest available resolution — this becomes the source for the image pipeline
- Save sample images to `tests/fixtures/{venue}/` for at least one artist, using the naming convention below
- Record available sizes and source URLs in `debug.md`
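The size audit can be scripted. This sketch assumes a URL pattern with a width parameter (the `?w=` query string here is illustrative — substitute whatever you observed in devtools), and the probe callable is injected so the logic can be exercised offline; in practice you would pass something like `lambda u: httpx.head(u).status_code == 200`.

```python
PRESETS = {"small": 250, "main": 500, "medium": 750, "large": 1000, "hd": 1500}

def audit_sizes(url_template: str, check) -> dict[str, bool]:
    """Try each preset width against the CDN; True means the size exists."""
    return {
        preset: check(url_template.format(width=px))
        for preset, px in PRESETS.items()
    }

# Offline stub standing in for the real CDN: pretend widths above 1000 404.
fake_cdn = lambda url: int(url.rsplit("w=", 1)[1].split("&")[0]) <= 1000
available = audit_sizes("https://cdn.example.com/img.jpg?w={width}", fake_cdn)
print(available)
```

Record the resulting availability map in `debug.md`.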
Naming convention for downloaded images:
| Venue | Tag | Example filename |
|---|---|---|
| Encore Beach Club | ebc | `diplo_ebc_main.jpg` |
| Encore Beach Club at Night | ebcn | `diplo_ebcn_raw.jpg` |
| XS Nightclub | xs | `diplo_xs_hd.jpg` |
| LIV Nightclub / LIV Las Vegas | liv | `diplo_liv_main.jpg` |
| LIV Beach | livb | `diplo_livb_small.jpg` |
New venues: add a 2–4 char tag to `ImageStorage.VENUE_SUFFIXES` in `src/plugins/images/storage.py`. Unknown venues fall back to the first 3 chars of the slugified venue name.
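The fallback rule above can be sketched as follows. This is an illustrative re-implementation, not the real code in `src/plugins/images/storage.py`, and the mapping shown is a sample of the table above, not the full registry.

```python
import re

# Sample of the registered tags (see ImageStorage.VENUE_SUFFIXES for the real map)
VENUE_SUFFIXES = {
    "encore-beach-club": "ebc",
    "xs-nightclub": "xs",
    "liv-las-vegas": "liv",
}

def slugify(name: str) -> str:
    """Lowercase, collapse non-alphanumerics to hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")

def venue_tag(venue: str) -> str:
    slug = slugify(venue)
    # Known venues use their registered tag; unknown ones fall back to
    # the first 3 characters of the slugified name.
    return VENUE_SUFFIXES.get(slug, slug[:3])

print(venue_tag("Encore Beach Club"))  # ebc (registered)
print(venue_tag("Omnia Nightclub"))    # omn (fallback)
```

Note how the fallback produces the `omn` tag seen in the Omnia fixture filenames below.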
Fixture directory structure¶
After Phase 0 is complete, tests/fixtures/{venue}/ should look like this:
tests/fixtures/omnia/
├── events-sitemap.xml # Local copy of venue sitemap(s)
├── event1.html # Sample event page HTML
├── event2.html
├── event3.html
├── event1.png # Screenshot from agent-browser
├── event2.png
├── event3.png
├── phase0.json # Ground-truth field values (test oracle)
├── debug.md # Running log: findings, r2_urls, discrepancies
├── artist_omn_main.jpg # Sample image: main (500px)
└── artist_omn_small.jpg # Sample image: small (250px)
- `phase0.json` — structured expected values for all 3 events, used as assertions in Phase 3a and 3d
- `debug.md` — append-only log updated at each phase; records screenshots, image audit results, r2_urls, pricing cross-checks, and any discrepancies requiring human review
Phase Gate: Complete Phase 0 before writing code
Do not start implementation until all Phase 0 items are checked. You need: 3 saved HTML fixtures, a phase0.json test oracle with manually extracted field values, and a completed image resolution audit. Skipping this leads to extractors that "work" but produce wrong data.
Phase 1: Implement the extractor¶
File: src/extractors/{venue}.py¶
- Inherit from `VenueExtractor` (or `WynnSocialBase` for Wynn properties)
- Implement required properties: `name`, `domain`
- If sharing a domain, implement `can_handle()` with disambiguation logic
- Implement `extract(soup, url) → VegasEvent | None`
Required fields (extraction will fail without these)¶
| Field | Format | Example |
|---|---|---|
| `event_date` | YYYY-MM-DD | `"2026-03-15"` |
| `performer` | Display name | `"DOM DOLLA"` |
| `venue` | Full venue name | `"Encore Beach Club"` |
| `url` | Source URL | passed through |
| `scraped_at` | ISO-8601 | `datetime.now(tz=UTC).isoformat()` |
Critical optional fields (things that break downstream if wrong)¶
| Field | Why it matters | What breaks without it |
|---|---|---|
| `venue_id` | Table pricing API needs `VEN*` code | Pricing queries return nothing; D1 gets synthetic `venue:{slug}` key |
| `images` | `ImageMetadata` list → R2 upload → D1 | No artist image on site; cards show placeholder |
| `external_id` | Dedup across re-scrapes | Minor — composite_key handles identity |
| `event_time` | Display + event_datetime | Shows "TBD" on site |
| `ticket_url` | CTA button on event page | No ticket link shown |
Construction pattern¶
Always use the `EventData` TypedDict + factory method (never `VegasEvent(**dict)`):
from datetime import datetime, timezone

from src.models import EventData, VegasEvent
event_data: EventData = {
"url": url,
"scraped_at": datetime.now(tz=timezone.utc).isoformat(),
"performer": performer,
"venue": "My Venue Name",
"event_date": event_date,
}
# Add optional fields conditionally
if venue_id:
event_data["venue_id"] = venue_id
if images:
event_data["images"] = images
return VegasEvent.from_extractor_data(event_data)
Image extraction¶
from src.models import ImageMetadata
images = []
if img_url:
images.append(ImageMetadata(source_url=img_url, category="artist_full"))
event_data["images"] = images
- Prefer full-resolution source URLs (avoid thumbnail crops)
- Use `category="artist_full"` for the main artist/event flyer
- The image pipeline handles downloading, resizing, and R2 upload — just provide the source URL
Phase 2: Register and wire up¶
- Register in `create_default_registry()` in `src/extractors/__init__.py`
    - Order matters: if sharing a domain, put the extractor with richer data (e.g. JSON-LD) first
- Add start URL(s) to the crawlee router in `src/crawlee_main.py`
    - If using sitemaps, add URL-level pre-filters for date range and venue (avoid crawling thousands of irrelevant pages)
- Add venue aliases to `VENUE_ALIASES` in `src/cli_scrape.py` (e.g. `vinny scrape omnia`)
- Per-venue filtering (for multi-venue domains like TAO Group):
    - Add URL slug mapping to `_TAO_ALIAS_SLUGS` in `src/cli_scrape.py` so `vinny scrape omnia` only enqueues Omnia URLs from the shared sitemaps
    - Slugs are extracted from CLI args before alias resolution (aliases become sitemap URLs)
    - Passed as `tao_venue_slugs` to `run_scraper()` → used in the sitemap handler to filter URLs
    - Omitting a venue from the slug map (e.g. `tao`, `tao-group`) means no filter = all LV venues
- Add venue tag(s) to `ImageStorage.VENUE_SUFFIXES` in `src/plugins/images/storage.py`
- Run `just check` (ruff + ty) — fix any type errors
Phase 3: Verification (the part we kept skipping)¶
Pipeline order is critical
Images must be downloaded and uploaded to R2 before the D1 export, otherwise artist_image_url will be NULL. The steps below follow the correct pipeline order: extract → images → D1 → pricing → site. vinny sync handles this automatically.
3a. Unit extraction test¶
Scrape all 3 sample events from Phase 0 and cross-reference against phase0.json.
Sample test output
# Scrape the 3 sample events
vinny scrape -u "https://example.com/event/1" -u "https://example.com/event/2" -u "https://example.com/event/3" --max-requests 10
# Inspect the output
cat runs/latest/events.json | jq '.[]'
Expected: each event's composite_key, event_date, performer, and venue match the values in phase0.json.
Cross-reference each event against tests/fixtures/{venue}/phase0.json:
- `composite_key` looks right: `{date}-{performer-slug}-{venue-slug}`
- `event_date` matches Phase 0 expected value
- `performer` matches Phase 0 expected value (not the event title, not the venue)
- `venue` matches the canonical venue name used in the site
- `venue_id` matches Phase 0 expected value (starts with `VEN`, or intentionally null with a comment)
- `event_time` matches Phase 0 expected value (or documented as unavailable)
- `images` array has at least one entry with a valid `source_url`
- `ticket_url` matches Phase 0 expected value (if available)
- Record any discrepancies in `tests/fixtures/{venue}/debug.md` with explanation
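The checklist above can be scripted as a field-by-field diff. A minimal sketch: in practice `scraped` would be loaded from `runs/latest/events.json` and `expected` from `tests/fixtures/{venue}/phase0.json`; inline samples keep it runnable here, and the exact field set is an assumption you should widen to match your oracle.

```python
CHECK_FIELDS = ("event_date", "performer", "venue", "venue_id", "event_time")

def cross_reference(scraped: list[dict], expected: list[dict]) -> list[str]:
    """Pair events by URL and report field mismatches for debug.md."""
    by_url = {ev["url"]: ev for ev in scraped}
    discrepancies = []
    for want in expected:
        got = by_url.get(want["url"])
        if got is None:
            discrepancies.append(f"{want['url']}: not scraped at all")
            continue
        for field in CHECK_FIELDS:
            if field in want and got.get(field) != want[field]:
                discrepancies.append(
                    f"{want['url']}: {field} got {got.get(field)!r}, "
                    f"expected {want[field]!r}"
                )
    return discrepancies

expected = [{"url": "https://x.test/e/1", "event_date": "2026-03-15",
             "performer": "DOM DOLLA"}]
scraped = [{"url": "https://x.test/e/1", "event_date": "2026-03-15",
            "performer": "Dom Dolla Live"}]  # wrong: title, not performer
print(cross_reference(scraped, expected))
```

Any non-empty output goes into `debug.md` with an explanation.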
3b. Image pipeline test¶
Images must be downloaded and uploaded to R2 before the D1 export so that artist_image_url is populated (not NULL).
# Download images for the run
vinny images download -r latest
# Check status
vinny images status -r latest
# Upload to R2 + sync URLs to D1
vinny images upload-r2 --sync-d1
Verify:
- Images downloaded to
data/artists/{artist-slug}/{artist-slug}_{venue-tag}_{size}.jpg - Artist slug in filename matches the performer slug in composite_key
- R2 upload succeeds (check for
r2_urlin image metadata) - Record
r2_urlvalues intests/fixtures/{venue}/debug.mdunder a## Phase 3b: Image Pipelineheading
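The slug-consistency check can be sketched directly from the two conventions already stated: filenames are `{artist-slug}_{venue-tag}_{size}.jpg` and composite keys are `{date}-{performer-slug}-{venue-slug}`. The helper names here are hypothetical, not project code.

```python
def artist_slug_from_filename(filename: str) -> str:
    # Strip extension, then peel off the trailing venue tag and size.
    return filename.rsplit(".", 1)[0].rsplit("_", 2)[0]

def performer_slug_from_key(composite_key: str, venue_slug: str) -> str:
    # composite_key = YYYY-MM-DD-{performer-slug}-{venue-slug}
    middle = composite_key[len("YYYY-MM-DD-"):]
    return middle[: -(len(venue_slug) + 1)]  # drop "-{venue-slug}"

key = "2026-03-15-dom-dolla-encore-beach-club"
print(performer_slug_from_key(key, "encore-beach-club"))  # dom-dolla
print(artist_slug_from_filename("dom-dolla_ebc_main.jpg"))  # dom-dolla
```

If the two slugs disagree, the image won't attach to the event — record the mismatch in `debug.md`.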
3c. D1 export test¶
Now that images have R2 URLs, export to D1. The composite_key becomes the D1 event_id primary key, and venue_id becomes a foreign key to the venues table.
# Export to D1 (database auto-resolved from wrangler.jsonc or D1_DATABASE_ID env)
vinny export-d1 --execute
# Or inspect the generated SQL first
cat runs/latest/d1_import.sql | head -100
Verify in the SQL:
- `event_id` value matches composite_key (e.g. `'2026-03-15-dom-dolla-encore-beach-club'`)
- `venue_id` is a real VEN code OR an intentional `venue:{slug}` with pricing disabled
- `venue_name` matches what the site expects (check `site/src/lib/queries.ts` for any venue name filters)
- `artist_image_url` is an R2 URL (not a VEA CDN URL; NULL is fine if images aren't on R2 yet)
- `streaming_links_json` is valid JSON or NULL (not the string `"null"`)
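Two of those checks lend themselves to a quick offline lint of the generated SQL. This is a naive text scan, not a SQL parser, and the default `r2_prefix` is a placeholder — substitute your actual R2 public URL.

```python
import re

def lint_d1_sql(sql: str, r2_prefix: str = "https://images.example.com/") -> list[str]:
    """Flag obvious checklist violations in the generated d1_import.sql."""
    problems = []
    if '"null"' in sql:
        problems.append('found the literal string "null" — should be SQL NULL')
    # Any quoted .jpg URL in the dump should point at R2, not a venue CDN.
    for url in re.findall(r"'(https?://[^']+\.jpg)'", sql):
        if not url.startswith(r2_prefix):
            problems.append(f"image URL not on R2: {url}")
    return problems

good = ("INSERT INTO events VALUES ('2026-03-15-dom-dolla-encore-beach-club', "
        "'VEN123', 'https://images.example.com/dom-dolla_ebc_main.jpg', NULL);")
bad = ("INSERT INTO events VALUES ('2026-03-15-tyga-hakkasan-nightclub', "
       "'VEN999', 'https://cdn.thirdparty.com/tyga.jpg', '\"null\"');")
print(lint_d1_sql(good))  # []
print(lint_d1_sql(bad))   # two problems
```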
Database auto-resolution: `--database` flag → `D1_DATABASE_ID` env → `wrangler.jsonc`. You don't need to pass `--database` if credentials are set up.
3d. Table pricing test¶
Skip only if Phase 0b explicitly documented that no table pricing data exists for this venue. This is rare — most venues have it.
# Check pricing extraction for all 3 sample events
cat runs/latest/events.json | jq '.[].table_pricing'
Cross-reference against phase0.json table pricing values:
- `table_pricing.sections` is populated (or null if the venue doesn't embed pricing)
- Each tier's `name`, `min_spend`, `guests` matches the values manually extracted in Phase 0b
- If using the urvenue API, `venue_id` is the correct VEN code for THIS venue (not another venue's code)
- Record results in `tests/fixtures/{venue}/debug.md` under `## Phase 3d: Table Pricing`
Handling discrepancies: The API may return more data than what's visible on the event page (e.g. extra tiers, different min spends). When this happens:
1. Review the Phase 0 screenshots to confirm what's actually shown on the site
2. Prioritize what's visible on the site — the user sees the site, not the API
3. If the API has more data, document it in debug.md but don't assert against it
4. If values conflict (e.g. different min spend), flag for human review in debug.md
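The triage rules above can be sketched as a small comparison: oracle tiers (what the page shows) are asserted, API-only tiers are logged without assertion, and value conflicts get flagged for human review. The function and report shape are illustrative, not project code.

```python
def check_pricing(api_tiers: list[dict], expected_tiers: list[dict]) -> dict:
    """Compare API pricing against the Phase 0b oracle, per the triage rules."""
    by_name = {t["name"]: t for t in api_tiers}
    report = {"ok": [], "conflicts": [], "api_only": []}
    for want in expected_tiers:
        got = by_name.pop(want["name"], None)
        if got is None:
            report["conflicts"].append(f"{want['name']}: on page but missing from API")
        elif (got["min_spend"], got["guests"]) != (want["min_spend"], want["guests"]):
            report["conflicts"].append(
                f"{want['name']}: API {got['min_spend']}/{got['guests']} vs "
                f"page {want['min_spend']}/{want['guests']} — flag for human review"
            )
        else:
            report["ok"].append(want["name"])
    # Extra API tiers: document in debug.md, but don't assert against them.
    report["api_only"] = sorted(by_name)
    return report

expected = [{"name": "Dance Floor Table", "min_spend": 2000, "guests": 6}]
api = [
    {"name": "Dance Floor Table", "min_spend": 2000, "guests": 6},
    {"name": "Owner's Booth", "min_spend": 10000, "guests": 12},  # API-only
]
print(check_pricing(api, expected))
```

`conflicts` entries go to `debug.md` for human review; `api_only` entries are documented but not asserted.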
3e. Site rendering test¶
The Astro dev server uses a local miniflare D1 database (not the remote production D1). Data must be imported into this local DB before pages will render.
Prerequisites — import data into local D1¶
The local D1 SQLite file lives under `site/.wrangler/state/v3/d1/` (the exact filename varies, so locate it with `find`). If using sqlite3 directly instead of wrangler:
LOCAL_DB=$(find site/.wrangler/state/v3/d1 -name "*.sqlite" | head -1)
sqlite3 "$LOCAL_DB" < runs/latest/d1_import.sql
URL patterns¶
- Venue list: `/venues` — all venues with upcoming event counts
- Venue detail: `/venues/{venue_id}` — e.g. `/venues/venue:hakkasan-nightclub` or `/venues/VEN1121561`
- Event detail: `/events/{composite_key}` — e.g. `/events/2026-03-20-tyga-hakkasan-nightclub`
- Events list: `/events` or `/events?venue={venue_id}` for a filtered view
Start the dev server¶
The server starts on port 4321 (or next available if occupied). Check the terminal output for the actual port.
Checklist¶
- Navigate to `/venues` — does the new venue appear with the correct event count?
- Navigate to the venue detail page (e.g. `/venues/venue:hakkasan-nightclub`) — are events listed?
- Navigate to each of the 3 sample event pages — do they render without SSR errors?
- Check dev server terminal output for stack traces (SSR errors are silent in the browser — HTTP 200 with an empty `<main>`)
- Artist image shows (R2 URL, not placeholder)
- Event date, time, venue name display correctly
- Streaming links render (or are gracefully absent)
- Table pricing section renders (or is gracefully absent)
- Take screenshots and save to `tests/fixtures/{venue}/` for the record
Phase 4: Documentation¶
- Add venue section to EXTRACTORS.md
- Update the "Registered Venues" table
- Document any quirks (shared domains, missing data, fragile selectors)
- Update PLAN.md if this was a tracked milestone
Lessons Learned¶
Add to this section each time an extractor issue is discovered post-deploy.
2026-03-03: EBC D1 key mismatch¶
Problem: EBC events exported to D1 but keys were wrong and images didn't show.
Root cause: Skipped Phase 3b/3c verification. The composite_key was generating correctly but the venue_id wasn't being set (fell back to synthetic venue:encore-beach-club), and images weren't going through the R2 upload pipeline before the D1 export.
Fix: Always run the full Phase 3 verification before marking an extractor as done. Never go straight from "extract works" to "ship it".
2026-03-04: Local D1 is separate from remote D1¶
Problem: After running just export-d1 (which pushes to remote D1), the Astro dev server showed no new venues or events.
Root cause: The Astro dev server uses platformProxy: { enabled: true } in astro.config.mjs, which means it reads from a local miniflare D1 SQLite file — not the remote Cloudflare D1. The just export-d1 command only pushes to remote.
Fix: Import d1_import.sql directly into the local SQLite file using sqlite3. See Phase 3e above for the full procedure. Alternatively, use vinny export-d1 --execute --local which resolves the database from wrangler.jsonc automatically.
2026-03-04: D1 export fails with "duplicate column name"¶
Problem: vinny export-d1 --execute failed with duplicate column name: category: SQLITE_ERROR.
Root cause: The generated SQL included ALTER TABLE table_tiers ADD COLUMN category TEXT migrations for backward compatibility, but the CREATE TABLE statement already defined those columns. The REST API path handled this gracefully (ran migrations one-by-one, caught duplicates), but the wrangler CLI sent the whole file as one transaction and failed on the first error.
Fix: Removed the redundant ALTER TABLE migrations. The CREATE TABLE schema is the source of truth — if a column exists there, no migration is needed.
2026-03-04: Pipeline order matters — images before D1¶
Problem: D1 export showed artist_image_url = NULL even though images were on R2.
Root cause: Ran D1 export before image download/upload. The D1 exporter only writes R2 URLs, never VEA CDN URLs, so artist_image_url is NULL until images are on R2.
Fix: Reordered Phase 3 verification: 3a extraction → 3b images → 3c D1 → 3d pricing → 3e site. The vinny sync command already does this in the right order; the contract now matches.
2026-03-04: Database auto-resolution for image sync¶
Problem: vinny images sync-d1 required --database flag even though the database name is in wrangler.jsonc.
Root cause: The images CLI didn't use the same fallback chain as export-d1. Only cli_export.py and cli_pipeline.py resolved from wrangler.jsonc.
Fix: Added the same --database → D1_DATABASE_ID env → read_wrangler_d1_config(wrangler.jsonc) fallback to _apply_r2_sync() in the images CLI.
Last updated: 2026-03-04