Image Pipeline

Artist image processing from scrape to R2 upload to D1 sync

Pipeline Overview

End-to-end flow from venue scraping through R2 storage and D1 database sync. Images are keyed by artist and deduped at multiple stages.

graph TD
    A["Scrape Phase<br/>LIV/XS/TAO"] --> B["VegasEvent<br/>event.images list"]
    B --> C["ImagePlugin<br/>process_run"]
    C --> D["Parse<br/>CDN URL"]
    D --> E{"VEA<br/>CDN?"}
    E -->|Yes| F["VEAImageUrl<br/>Parser"]
    E -->|No| G["Direct URL<br/>TAO wp-content"]
    F --> H["Size Variant<br/>Builder"]
    G --> H
    H --> I["ImageDownloader<br/>async, rate-limited"]
    I --> J{"Download<br/>OK?"}
    J -->|Yes| K["Local Storage<br/>data/images/artists/"]
    J -->|No| L["Error<br/>+ Retry"]
    K --> M{"Duplicate?"}
    M -->|Yes| N["Skip<br/>reuse"]
    M -->|No| O["Update<br/>Manifest"]
    N --> P["R2Storage<br/>upload"]
    O --> P
    P --> Q{"On R2?"}
    Q -->|Yes| R["Skip<br/>Upload"]
    Q -->|No| S["PUT<br/>+ Public ACL"]
    R --> T["R2 URL<br/>Stored"]
    S --> T
    T --> U["D1 Sync<br/>artist_image_url"]
    U --> V["Pipeline<br/>Complete"]
    classDef stage fill:#38bdf808,stroke:#38bdf8,stroke-width:2px;
    classDef storage fill:#4ade8008,stroke:#4ade80,stroke-width:2px;
    classDef decision fill:#fbbf2408,stroke:#fbbf24,stroke-width:2px;
    classDef error fill:#fb718508,stroke:#fb7185,stroke-width:2px;
    classDef complete fill:#a78bfa08,stroke:#a78bfa,stroke-width:2px;
    class A,B,C,D,F,G,H,I,K,P,T,U stage;
    class M,E,J,Q decision;
    class L error;
    class V complete;

Core Principle
Images are keyed by artist, not by event. An artist who plays several events at the same venue shares one set of image files across all of them; an artist who plays both LIV and XS gets one set per venue, distinguished by a venue suffix (liv, xs, etc.). Deduplication happens at download time (by checking the manifest) and at upload time (via R2 HEAD requests).
  • Dedup key: (venue_tag, artist_slug, size)
  • Storage path: data/artists/{artist_slug}/{artist_slug}_{venue_tag}_{size}.jpg
  • Default sizes: main (500px) and hd (1500px)
  • Venue suffixes: ebc, ebcn, xs, liv, livb (mapped from venue names)
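
The dedup key and storage path above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: DedupKey, storage_path, and SIZES are hypothetical names, and the root directory is taken from the storage path pattern in the list.

```python
from pathlib import Path
from typing import NamedTuple

# Default size names and pixel widths, as listed above.
SIZES = {"main": 500, "hd": 1500}


class DedupKey(NamedTuple):
    """Hypothetical container for the (venue_tag, artist_slug, size) key."""
    venue_tag: str
    artist_slug: str
    size: str


def storage_path(key: DedupKey, root: Path = Path("data/artists")) -> Path:
    """Build {root}/{artist_slug}/{artist_slug}_{venue_tag}_{size}.jpg."""
    if key.size not in SIZES:
        raise ValueError(f"unknown size: {key.size!r}")
    filename = f"{key.artist_slug}_{key.venue_tag}_{key.size}.jpg"
    return root / key.artist_slug / filename


# Two events with the same artist at the same venue collapse to one key.
keys = {DedupKey("liv", "carl-cox", "main"), DedupKey("liv", "carl-cox", "main")}
```

Because the key is a plain tuple, it can be used directly as a set member or dict key when counting unique files.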
Scrape Phase

Three venue groups supply image URLs in different formats. Extractors populate the VegasEvent.images list during crawling.

LIV Las Vegas & LIV Beach

VEA CDN URLs via event data. Parsed by VEAImageUrl class for size variants.

VEA CDN
XS Nightclub

Wynn-hosted. Also VEA CDN (inherited from WynnSocialBase). Same parsing path as LIV.

VEA CDN
TAO Group

Direct wp-content URLs from Hakkasan, Omnia, Marquee. No CDN; raw domain-hosted images.

Direct URL
URL Parsing & Sizes

VEA CDN and direct URLs are parsed to extract artist names and generate size variants.

SIZE_MAP = {
    "main": (500, 500),
    "hd": (1500, 1500)
}

# VEA CDN URL example:
# https://img.vea-cdn.com/img/artists/a87c1de/dj-name_500.jpg
# → Parsed into (artist_slug, size_pixels)
# → Generates: main (500px) + hd (1500px)

# TAO Direct URL example:
# https://hakkasan.booketing.com/wp-content/uploads/dj-name.jpg
# → Same processing: artist_slug extracted from filename
# → Size variants built on download

The VEA parser detects the CDN pattern and extracts artist_slug; TAO URLs are matched with a regex on the filename. Both paths feed into ImageDownloader for async fetching with size options.
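
A rough sketch of the two parsing paths, inferred only from the two example URLs above. The regex, the host check, and the function name parse_image_url are assumptions, and the real VEAImageUrl class may work differently. For direct URLs the sketch returns the original URL for every size name, on the assumption that resizing happens after download.

```python
import re
from pathlib import PurePosixPath
from typing import Dict, Tuple
from urllib.parse import urlparse

SIZE_MAP = {"main": (500, 500), "hd": (1500, 1500)}
VEA_HOST = "img.vea-cdn.com"  # host taken from the example URL only
# Assumed filename shape: {artist_slug}_{pixels}.{ext}
VEA_FILE = re.compile(r"^(?P<slug>[a-z0-9-]+)_(?P<px>\d+)\.(?P<ext>jpe?g|png)$")


def parse_image_url(url: str) -> Tuple[str, Dict[str, str]]:
    """Return (artist_slug, {size_name: url}) for a VEA or direct URL."""
    parsed = urlparse(url)
    path = PurePosixPath(parsed.path)
    match = VEA_FILE.match(path.name) if parsed.hostname == VEA_HOST else None
    if match is None:
        # Direct URL: slug comes from the filename; no CDN size variants.
        return path.stem.lower(), {name: url for name in SIZE_MAP}
    variants = {}
    for name, (width, _height) in SIZE_MAP.items():
        # Swap the pixel suffix in the filename to request each size.
        sized = path.with_name(f"{match['slug']}_{width}.{match['ext']}")
        variants[name] = parsed._replace(path=str(sized)).geturl()
    return match["slug"], variants
```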

Download Phase

Images are downloaded asynchronously with concurrency limits, rate limiting, retries, and validation.

Async Concurrency

ImageDownloader uses asyncio.Semaphore to limit parallel requests. Default: 5 concurrent.

Rate Limiting

200ms delay between requests per domain. Respects CDN and hosting provider limits.
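
One way to combine the concurrency cap with the per-domain delay is sketched below. PoliteFetcher and its fetch callback are hypothetical; the actual ImageDownloader internals are not shown in this document. The defaults mirror the values stated above (5 concurrent, 200ms).

```python
import asyncio
import time
from typing import Awaitable, Callable, Dict
from urllib.parse import urlparse


class PoliteFetcher:
    """Caps in-flight requests and spaces out requests per domain."""

    def __init__(
        self,
        fetch: Callable[[str], Awaitable[bytes]],  # hypothetical HTTP callback
        max_concurrent: int = 5,
        min_interval: float = 0.2,
    ) -> None:
        self._fetch = fetch
        self._semaphore = asyncio.Semaphore(max_concurrent)
        self._min_interval = min_interval
        self._next_slot: Dict[str, float] = {}  # domain -> earliest start time

    async def get(self, url: str) -> bytes:
        domain = urlparse(url).netloc
        async with self._semaphore:
            # Reserve the next free slot for this domain. There is no await
            # between the read and the write, so no extra lock is needed.
            now = time.monotonic()
            slot = max(now, self._next_slot.get(domain, now))
            self._next_slot[domain] = slot + self._min_interval
            await asyncio.sleep(slot - now)
            return await self._fetch(url)
```

Creating the fetcher inside the running event loop (for example at the top of an async main function) avoids loop-binding issues with asyncio.Semaphore on older Python versions.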

Retry Logic

3 retries on transient errors (timeout, 5xx). Permanent errors (404, 403) logged and skipped.
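
The retry policy might look like the sketch below. It assumes "3 retries" means one initial attempt plus up to three more, and it adds exponential backoff between attempts, which the description above does not specify. HttpError and fetch_with_retry are hypothetical names.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Optional

log = logging.getLogger("image_downloader")


class HttpError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status: int) -> None:
        super().__init__(f"HTTP {status}")
        self.status = status


async def fetch_with_retry(
    fetch: Callable[[str], Awaitable[bytes]],
    url: str,
    retries: int = 3,
    backoff: float = 0.5,  # assumed base delay in seconds
) -> Optional[bytes]:
    """Retry timeouts and 5xx; log and skip 4xx such as 403/404."""
    for attempt in range(retries + 1):
        try:
            return await fetch(url)
        except HttpError as exc:
            if exc.status < 500:
                log.warning("permanent error %s for %s, skipping", exc.status, url)
                return None
            log.info("transient error %s for %s", exc.status, url)
        except asyncio.TimeoutError:
            log.info("timeout for %s", url)
        if attempt < retries:
            await asyncio.sleep(backoff * 2 ** attempt)
    log.error("giving up on %s after %d attempts", url, retries + 1)
    return None
```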

Validation

JPEG/PNG only. Min size: 100x100px. Corrupted files rejected and re-queued.
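
A partial sketch of the validation step using only the standard library. It checks the file signature for both formats but reads dimensions only from PNG headers; measuring JPEG dimensions and detecting corruption would need a real decoder, which is out of scope here. All names are hypothetical.

```python
import struct
from typing import Optional, Tuple

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"
JPEG_MAGIC = b"\xff\xd8\xff"
MIN_SIDE = 100  # minimum width and height in pixels, as stated above


def sniff_format(data: bytes) -> Optional[str]:
    """Identify JPEG or PNG by file signature; anything else is rejected."""
    if data.startswith(PNG_MAGIC):
        return "png"
    if data.startswith(JPEG_MAGIC):
        return "jpeg"
    return None


def png_dimensions(data: bytes) -> Optional[Tuple[int, int]]:
    """Read width and height from the IHDR chunk that follows the signature."""
    if len(data) < 24 or data[12:16] != b"IHDR":
        return None
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def is_acceptable(data: bytes) -> bool:
    fmt = sniff_format(data)
    if fmt is None:
        return False
    if fmt == "png":
        dims = png_dimensions(data)
        return dims is not None and min(dims) >= MIN_SIDE
    # JPEG: signature only; the size check is left to a full decoder.
    return True
```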

Local Storage & Dedup

Downloaded images are stored locally in a structured artist-based directory tree.

data/artists/
  {artist_slug}/
    {artist_slug}_{venue_tag}_main.jpg    (500px)
    {artist_slug}_{venue_tag}_hd.jpg      (1500px)

Example: data/artists/carl-cox/carl-cox_liv_main.jpg

Stale path detection: process_run checks local_path.exists() before skipping. If a file was deleted from disk, it is re-downloaded even if the dedup manifest says it should exist.

Dedup manifest: image_manifest.json tracks downloaded files by dedup key. validate_batch and update_manifest_from_events use a seen_paths set to count unique files (not per-event duplicates).
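
The stale path check and the unique-file count can be illustrated as below. The manifest layout (a dict from a "venue:slug:size" string to a path) is an assumption, since the format of image_manifest.json is not shown here, and both function names are hypothetical.

```python
from pathlib import Path
from typing import Dict, Iterable, List


def needs_download(manifest: Dict[str, str], key: str, local_path: Path) -> bool:
    """Skip only when the manifest knows the key AND the file is still on disk."""
    return not (key in manifest and local_path.exists())


def count_unique_files(paths_per_event: Iterable[List[str]]) -> int:
    """Count distinct files across events using a seen_paths set."""
    seen_paths = set()
    for paths in paths_per_event:
        seen_paths.update(paths)
    return len(seen_paths)
```

With this check, a manifest entry whose file has been deleted is treated the same as a missing entry, so the image is fetched again.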

R2 Upload

Local images are uploaded to Cloudflare R2 with dedup and public access configured.

R2Storage Init

R2Storage initialized with bucket name and credentials from .env. Uses boto3 or native Cloudflare SDK.

Upload Methods

upload_local_images() iterates local files. upload(path) does single-file PUT with public ACL.

Dedup on R2

HEAD {key} checks if object exists. Skip PUT if already present; saves bandwidth.
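
A sketch of the HEAD-then-PUT pattern, assuming the boto3 path and an S3-compatible client. The function name upload_if_missing is hypothetical, and ACL handling is left out. To keep the example free of third-party imports it inspects the error's response attribute directly; real code would catch botocore.exceptions.ClientError instead.

```python
from typing import Any


def upload_if_missing(
    client: Any,  # e.g. a boto3 S3 client pointed at the R2 endpoint
    bucket: str,
    key: str,
    body: bytes,
    content_type: str = "image/jpeg",
) -> bool:
    """Return True if the object was uploaded, False if it already existed."""
    try:
        client.head_object(Bucket=bucket, Key=key)
        return False  # already on R2, skip the PUT
    except Exception as exc:
        # boto3 reports a missing object on HEAD with error code "404".
        response = getattr(exc, "response", None) or {}
        code = response.get("Error", {}).get("Code")
        if code not in ("404", "NoSuchKey", "NotFound"):
            raise  # auth failures, network errors, etc. should surface
    client.put_object(Bucket=bucket, Key=key, Body=body, ContentType=content_type)
    return True
```

The cost of the extra HEAD is one small request per file, which is usually cheaper than re-sending image bytes that are already stored.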

Public URL

Custom domain: https://img.vinny.vegas. R2 bucket auto-serves with public ACL (not cloudflare-edge links).

After upload, event.images list is updated with R2 URLs. These URLs are stored in the master database and synced to D1 later.

D1 Sync & Insights

After R2 upload completes, the master database is exported to D1 with artist_image_url populated from R2.

Column            Type  Source
id                UUID  Event key (immutable)
artist_name       TEXT  event.artist_name
artist_slug       TEXT  Slugified name (for dedup)
artist_image_url  TEXT  https://img.vinny.vegas/{path}
venue_tag         TEXT  Short code (liv, xs, ebc, etc.)
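
Since D1 is SQLite-based, the columns above can be tried out locally with Python's sqlite3 module. This is only an illustrative sketch: the table name events, the helper upsert_event, and the example R2 key are assumptions, and the UUID is stored as TEXT because SQLite has no native UUID type.

```python
import sqlite3
import uuid

PUBLIC_BASE = "https://img.vinny.vegas"

SCHEMA = """
CREATE TABLE IF NOT EXISTS events (
    id               TEXT PRIMARY KEY,  -- UUID stored as text
    artist_name      TEXT NOT NULL,
    artist_slug      TEXT NOT NULL,
    artist_image_url TEXT,              -- public R2 URL, NULL until uploaded
    venue_tag        TEXT NOT NULL
)
"""


def upsert_event(conn: sqlite3.Connection, event_id: str, artist_name: str,
                 artist_slug: str, venue_tag: str, r2_key: str) -> None:
    """Insert or replace one event row with its public image URL."""
    conn.execute(
        "INSERT OR REPLACE INTO events VALUES (?, ?, ?, ?, ?)",
        (event_id, artist_name, artist_slug, f"{PUBLIC_BASE}/{r2_key}", venue_tag),
    )


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
event_id = str(uuid.uuid4())
upsert_event(conn, event_id, "Carl Cox", "carl-cox", "liv",
             "artists/carl-cox/carl-cox_liv_main.jpg")
```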

Design Insights

Artist-level dedup. One artist image per venue, reused across all events. Reduces storage, speeds up scrapes, improves UX (consistent artist branding across app).
Stale path checks. process_run() verifies local_path.exists() before skipping. This catches cases where files are manually deleted or corrupted. Without this check, a deleted file would be permanently skipped if already in the manifest.
model_dump gotcha. Always use mode="json" when storing values that will be serialized later. Pydantic Path and datetime objects in FieldChange history cause json.dump() to crash if not serialized first. Example: event.model_dump(mode="json") before storing in cache.
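
The model_dump gotcha can be reproduced in a few lines with Pydantic v2. The FieldChange model below is a simplified stand-in with hypothetical fields, not the project's actual class.

```python
import json
from datetime import datetime
from pathlib import Path

from pydantic import BaseModel


class FieldChange(BaseModel):
    """Simplified stand-in for a change-history entry."""
    field_name: str
    old_value: Path
    changed_at: datetime


change = FieldChange(
    field_name="local_path",
    old_value=Path("data/artists/carl-cox/carl-cox_liv_main.jpg"),
    changed_at=datetime(2024, 1, 1, 12, 0, 0),
)

raw = change.model_dump()              # keeps Path and datetime objects
safe = change.model_dump(mode="json")  # converts them to plain strings

try:
    json.dumps(raw)
    raw_serializable = True
except TypeError:
    raw_serializable = False  # Path and datetime are not JSON serializable

encoded = json.dumps(safe)  # works: every value is a JSON-native type
```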