Image Pipeline
Artist image processing from scrape to R2 upload to D1 sync
End-to-end flow from venue scraping through R2 storage and D1 database sync. Images are keyed by artist and deduped at multiple stages.
Pipeline flow:
- LIV/XS/TAO → VegasEvent (event.images list) → ImagePlugin (process_run) → Parse CDN URL
- VEA CDN? Yes → VEAImageUrl Parser. No → Direct URL (TAO wp-content). Both → Size Variant Builder
- Size Variant Builder → ImageDownloader (async, rate-limited)
- Download OK? Yes → Local Storage (data/images/artists/). No → Error + Retry
- Duplicate? Yes → Skip (reuse). No → Update Manifest. Both → R2Storage upload
- On R2? Yes → Skip Upload. No → PUT + Public ACL. Both → R2 URL Stored
- R2 URL Stored → D1 Sync (artist_image_url) → Pipeline Complete
Venue tags are short codes (liv, xs, etc.). Deduplication happens at download time (via manifest checks) and at upload time (via R2 HEAD requests).
- Dedup key: (venue_tag, artist_slug, size)
- Storage path: data/artists/{artist_slug}/{artist_slug}_{venue_tag}_{size}.jpg
- Default sizes: main (500px) and hd (1500px)
- Venue suffixes: ebc, ebcn, xs, liv, livb (mapped from venue names)
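The key and path conventions above can be sketched in a few lines. This is only an illustration: `DedupKey` and `storage_path` are hypothetical names, not taken from the pipeline's code.

```python
from pathlib import Path
from typing import NamedTuple


class DedupKey(NamedTuple):
    """Hypothetical container for the (venue_tag, artist_slug, size) key."""
    venue_tag: str
    artist_slug: str
    size: str


def storage_path(key: DedupKey, root: Path = Path("data/artists")) -> Path:
    """Build data/artists/{artist_slug}/{artist_slug}_{venue_tag}_{size}.jpg."""
    filename = f"{key.artist_slug}_{key.venue_tag}_{key.size}.jpg"
    return root / key.artist_slug / filename


key = DedupKey(venue_tag="liv", artist_slug="carl-cox", size="main")
print(storage_path(key).as_posix())  # data/artists/carl-cox/carl-cox_liv_main.jpg
```

Using a tuple as the key means two events for the same artist, venue, and size resolve to the same file.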
Three venues provide image URLs in different formats. Extractors populate VegasEvent.images list during crawling.
- LIV: VEA CDN URLs via event data. Parsed by the VEAImageUrl class for size variants.
- XS: Wynn-hosted. Also VEA CDN (inherited from WynnSocialBase). Same parsing path as LIV.
- TAO: Direct wp-content URLs from Hakkasan, Omnia, Marquee. No CDN; raw domain-hosted images.
VEA CDN and direct URLs are parsed to extract artist names and generate size variants.
SIZE_MAP = {
    "main": (500, 500),
    "hd": (1500, 1500),
}
# VEA CDN URL example:
# https://img.vea-cdn.com/img/artists/a87c1de/dj-name_500.jpg
# → Parsed into (artist_slug, size_pixels)
# → Generates: main (500px) + hd (1500px)
# TAO Direct URL example:
# https://hakkasan.booketing.com/wp-content/uploads/dj-name.jpg
# → Same processing: artist_slug extracted from filename
# → Size variants built on download
VEA parser detects CDN pattern and extracts artist_slug. TAO uses regex on filename. Both feed into ImageDownloader for async fetching with size options.
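As a rough sketch of the two parsing paths, the snippet below handles both example URLs. It does not reproduce the VEAImageUrl class: `parse_image_url` and `vea_variants` are hypothetical helpers, and the `{slug}_{pixels}.jpg` filename pattern is an assumption inferred from the single example URL above.

```python
import re
from pathlib import PurePosixPath
from urllib.parse import urlparse

SIZE_MAP = {"main": (500, 500), "hd": (1500, 1500)}

# Assumed VEA filename shape: "<artist_slug>_<pixels>", e.g. "dj-name_500".
VEA_STEM = re.compile(r"^(?P<slug>.+)_(?P<size>\d+)$")


def parse_image_url(url):
    """Return (artist_slug, size_pixels); size is None for direct URLs."""
    parsed = urlparse(url)
    stem = PurePosixPath(parsed.path).stem
    if parsed.netloc.endswith("vea-cdn.com"):
        match = VEA_STEM.match(stem)
        if match:
            return match["slug"], int(match["size"])
    # Direct (TAO-style) URL: the filename itself is treated as the slug.
    return stem.lower(), None


def vea_variants(url):
    """Build one URL per SIZE_MAP entry by swapping the pixel suffix (assumed)."""
    slug, _ = parse_image_url(url)
    base = url.rsplit("/", 1)[0]
    return {name: f"{base}/{slug}_{width}.jpg" for name, (width, _h) in SIZE_MAP.items()}
```

Whether the CDN actually serves every size under the swapped suffix is not confirmed by this page, so a real builder should still treat a failed variant download as a normal error.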
Images are downloaded asynchronously with concurrency limits, rate limiting, retries, and validation.
- Concurrency: ImageDownloader uses asyncio.Semaphore to limit parallel requests. Default: 5 concurrent.
- Rate limiting: 200ms delay between requests per domain. Respects CDN and hosting provider limits.
- Retries: 3 retries on transient errors (timeout, 5xx). Permanent errors (404, 403) are logged and skipped.
- Validation: JPEG/PNG only. Min size: 100x100px. Corrupted files are rejected and re-queued.
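The concurrency and retry behavior described above might look roughly like the following. This is a minimal sketch, not ImageDownloader itself: the `fetch` callable, the two exception classes, and both function names are hypothetical, and the 200ms pause is applied between retry attempts rather than as a true per-domain throttle.

```python
import asyncio


class TransientError(Exception):
    """Stand-in for timeouts and 5xx responses (hypothetical)."""


class PermanentError(Exception):
    """Stand-in for 404/403 responses (hypothetical)."""


async def fetch_with_retry(fetch, url, sem, retries=3, delay=0.2):
    """Call fetch(url) under the semaphore; retry transient failures."""
    async with sem:
        for attempt in range(retries + 1):  # first try + `retries` retries
            try:
                return await fetch(url)
            except PermanentError:
                return None  # skipped, never retried
            except TransientError:
                if attempt == retries:
                    return None  # retry budget exhausted
                await asyncio.sleep(delay)


async def download_all(fetch, urls, concurrency=5, retries=3, delay=0.2):
    """Fetch all URLs with at most `concurrency` requests in flight."""
    sem = asyncio.Semaphore(concurrency)
    tasks = [fetch_with_retry(fetch, u, sem, retries, delay) for u in urls]
    results = await asyncio.gather(*tasks)
    return dict(zip(urls, results))
```

Passing `fetch` in as a parameter keeps the retry logic testable without network access; a real implementation would plug in an HTTP client call there.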
Downloaded images are stored locally in a structured artist-based directory tree.
data/artists/
  {artist_slug}/
    {artist_slug}_{venue_tag}_main.jpg (500px)
    {artist_slug}_{venue_tag}_hd.jpg (1500px)
Example: data/artists/carl-cox/carl-cox_liv_main.jpg
process_run checks local_path.exists() before skipping. If a file was deleted from disk, it is re-downloaded even if the dedup manifest says it should exist.
Dedup manifest: image_manifest.json tracks downloaded files by dedup key. validate_batch and update_manifest_from_events use a seen_paths set to count unique files (not per-event duplicates).
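A small sketch of the skip decision and the unique-file count follows. The manifest layout (a dict keyed by a `venue:slug:size` string) and the helper names are assumptions for illustration; the real image_manifest.json format is not shown on this page.

```python
from pathlib import Path


def manifest_key(venue_tag, artist_slug, size):
    """Hypothetical string encoding of the (venue_tag, artist_slug, size) key."""
    return f"{venue_tag}:{artist_slug}:{size}"


def needs_download(manifest, key, local_path):
    """Download unless the manifest knows the key AND the file is on disk."""
    return key not in manifest or not Path(local_path).exists()


def count_unique_files(paths):
    """Count distinct files, so an image shared by many events counts once."""
    seen_paths = set()
    for path in paths:
        seen_paths.add(Path(path))
    return len(seen_paths)
```

The second condition in `needs_download` is what lets a manually deleted file be fetched again even though its key is still recorded.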
Local images are uploaded to Cloudflare R2 with dedup and public access configured.
R2Storage initialized with bucket name and credentials from .env. Uses boto3 or native Cloudflare SDK.
upload_local_images() iterates local files. upload(path) does single-file PUT with public ACL.
HEAD {key} checks if object exists. Skip PUT if already present; saves bandwidth.
Custom domain: https://img.vinny.vegas. R2 bucket auto-serves with public ACL (not cloudflare-edge links).
After upload, event.images list is updated with R2 URLs. These URLs are stored in the master database and synced to D1 later.
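The check-then-upload step can be sketched against a small storage interface. Everything here is illustrative: `ObjectStore`, `InMemoryStore`, and `upload_if_missing` are hypothetical names, and the object key layout is assumed. With boto3, `exists` would typically wrap `head_object` and `put` would wrap `put_object`.

```python
from typing import Protocol

PUBLIC_BASE = "https://img.vinny.vegas"


class ObjectStore(Protocol):
    """Minimal interface assumed for this sketch."""
    def exists(self, key: str) -> bool: ...
    def put(self, key: str, data: bytes) -> None: ...


class InMemoryStore:
    """Test double that records how many PUTs were issued."""
    def __init__(self):
        self.objects = {}
        self.put_calls = 0

    def exists(self, key):
        return key in self.objects

    def put(self, key, data):
        self.objects[key] = data
        self.put_calls += 1


def upload_if_missing(store: ObjectStore, key: str, data: bytes) -> str:
    """Skip the PUT when the object already exists; return the public URL."""
    if not store.exists(key):
        store.put(key, data)
    return f"{PUBLIC_BASE}/{key}"
```

Returning the URL in both branches means the caller can update event.images the same way whether the file was uploaded now or on an earlier run.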
After R2 upload completes, the master database is exported to D1 with artist_image_url populated from R2.
| Column | Type | Source |
|---|---|---|
| id | UUID | Event key (immutable) |
| artist_name | TEXT | event.artist_name |
| artist_slug | TEXT | Slugified name (for dedup) |
| artist_image_url | TEXT | https://img.vinny.vegas/{path} |
| venue_tag | TEXT | Short code (liv, xs, ebc, etc.) |
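Since D1 is built on SQLite, the column layout above can be tried locally with the standard sqlite3 module. This is a stand-in only: the `events` table name and `sync_rows` helper are hypothetical, the UUID is stored as TEXT, and the real export to D1 is not shown here.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS events (
    id TEXT PRIMARY KEY,
    artist_name TEXT,
    artist_slug TEXT,
    artist_image_url TEXT,
    venue_tag TEXT
)
"""


def sync_rows(conn, rows):
    """Insert rows, replacing any existing row with the same id."""
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT OR REPLACE INTO events "
        "(id, artist_name, artist_slug, artist_image_url, venue_tag) "
        "VALUES (:id, :artist_name, :artist_slug, :artist_image_url, :venue_tag)",
        rows,
    )
    conn.commit()
```

Because the write is keyed on id, re-running the sync after a later R2 upload simply overwrites artist_image_url instead of creating a duplicate row.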
Design Insights
process_run() verifies local_path.exists() before skipping. This catches cases where files are manually deleted or corrupted. Without this check, a deleted file would be permanently skipped if already in the manifest.
Use mode="json" when storing values that will be serialized later. Pydantic Path and datetime objects in FieldChange history cause json.dump() to crash if they are not converted first. Example: call event.model_dump(mode="json") before storing the result in the cache.
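The failure mode can be reproduced with the standard library alone. The sketch below is not Pydantic: `to_json_safe` is a hypothetical helper that converts Path and datetime values to strings up front, in the same spirit as model_dump(mode="json"), and the record contents are made up for illustration.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Made-up record resembling a cached change-history entry.
record = {
    "local_path": Path("data/artists/carl-cox/carl-cox_liv_main.jpg"),
    "changed_at": datetime(2024, 1, 1, tzinfo=timezone.utc),
}


def to_json_safe(value):
    """Recursively replace Path and datetime values with strings."""
    if isinstance(value, Path):
        return value.as_posix()
    if isinstance(value, datetime):
        return value.isoformat()
    if isinstance(value, dict):
        return {k: to_json_safe(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_json_safe(v) for v in value]
    return value


try:
    json.dumps(record)  # Path and datetime are not JSON serializable
except TypeError as exc:
    print("raw record fails:", exc)

print(json.dumps(to_json_safe(record)))  # succeeds once values are strings
```

Converting at storage time rather than at dump time keeps the cache free of objects that would only fail later, far from the code that created them.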