Vinny Plugin Development Guide¶
This guide explains how to extend Vinny with new venues, features, and plugins.
Table of Contents¶
- Architecture Overview
- Creating a New Venue Extractor
- Adding Features to Existing Extractors
- Creating New Plugins
- Artist Enrichers
- Best Practices
- Testing Your Changes
- Examples
Architecture Overview¶
Vinny uses a plugin-based architecture with two main extension points:
1. Venue Extractors¶
Located in src/extractors/. Each venue (LIV, XS, Omnia, etc.) has its own extractor that knows how to:
- Identify URLs it can handle
- Extract event data from HTML/JSON
- Find images, pricing, artist info
2. Feature Plugins¶
Located in src/plugins/. Cross-cutting features like:
- Image downloading and processing
- Table/bottle pricing extraction (coming soon)
- Social media enrichment
- Notion/Google Sheets integration
Data Flow¶
```mermaid
flowchart TD
    A["Sitemap URL"] --> B["Parse loc + lastmod"]
    B --> C["SitemapIndex.diff()"]
    C -->|new/updated only| D["Event URL"]
    D --> E["Crawlee Router"]
    E --> F["Venue Extractor"]
    F --> G["VegasEvent"]
    G --> H["Storage / Export"]
    G --> I["Feature Plugins"]
    I --> J["Images"]
    I --> K["Pricing"]
    I --> L["Enrichment"]

    style A fill:#7c3aed,color:#fff
    style G fill:#059669,color:#fff
    style J fill:#d97706,color:#fff
    style K fill:#d97706,color:#fff
    style L fill:#d97706,color:#fff
```
For venues without sitemaps (XS, EBC), all event URLs from the listing page are enqueued directly.
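The sitemap diff step above can be sketched with the standard library alone. This is an illustrative reduction of the flow (the XML fixture, function names, and two-function shape are assumptions for the example, not Vinny's actual `SitemapIndex` API): map each `loc` to its `lastmod`, then keep only URLs that are new or changed since the previous run.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/event/a</loc><lastmod>2026-01-02</lastmod></url>
  <url><loc>https://example.com/event/b</loc><lastmod>2026-01-05</lastmod></url>
</urlset>"""

def parse_sitemap(xml_text: str) -> dict[str, str]:
    """Map each <loc> to its <lastmod>."""
    root = ET.fromstring(xml_text)
    return {
        url.findtext("sm:loc", namespaces=NS): url.findtext("sm:lastmod", "", NS)
        for url in root.findall("sm:url", NS)
    }

def diff(current: dict[str, str], previous: dict[str, str]) -> list[str]:
    """Return URLs that are new or whose lastmod changed."""
    return [loc for loc, mod in current.items() if previous.get(loc) != mod]

current = parse_sitemap(SITEMAP)
# Pretend the previous run saw /event/a unchanged and /event/b with an older lastmod
previous = {
    "https://example.com/event/a": "2026-01-02",
    "https://example.com/event/b": "2026-01-01",
}
to_crawl = diff(current, previous)  # only /event/b needs re-crawling
```

Only the changed URL survives the diff, which is what keeps incremental scrapes cheap.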
Creating a New Venue Extractor¶
Let's walk through creating an extractor for XS Nightclub (which already has a skeleton).
Step 1: Create the Extractor File¶
Create src/extractors/xs.py:
"""XS Nightclub extractor plugin."""
from __future__ import annotations
import re
from datetime import datetime
from urllib.parse import urlparse
from bs4 import BeautifulSoup
from src.extractors import VenueExtractor
from src.models import VegasEvent, StreamingLinks
from src.plugins.images.models import ImageMetadata
class XSExtractor(VenueExtractor):
"""Extractor for XS Nightclub at Wynn Las Vegas."""
@property
def name(self) -> str:
"""Human-readable venue name."""
return "XS Nightclub"
@property
def domain(self) -> str:
"""Domain pattern for URL matching."""
# This tells Vinny which URLs this extractor handles
return "wynnsocial.com"
def can_handle(self, url: str) -> bool:
"""Check if this extractor can handle the given URL.
Override this for more complex URL matching.
"""
# Default checks if domain is in URL
# You can add additional checks here
return self.domain in url.lower()
def extract(self, soup: BeautifulSoup, url: str) -> VegasEvent | None:
"""Extract event data from XS event pages.
Args:
soup: Parsed BeautifulSoup object
url: Source URL
Returns:
VegasEvent if extraction successful, None otherwise
"""
# Step 1: Extract basic info
scraped_at = datetime.utcnow().isoformat()
# Step 2: Try to find embedded JSON (if available)
event_data = self._extract_embedded_json(soup)
if event_data:
return self._parse_event_data(event_data, url, scraped_at)
# Step 3: Fallback to HTML extraction
return self._extract_from_html(soup, url, scraped_at)
def _extract_embedded_json(self, soup: BeautifulSoup) -> dict | None:
"""Extract JSON-LD or embedded event data.
Many sites embed structured data in <script type="application/ld+json">
"""
scripts = soup.find_all("script", type="application/ld+json")
for script in scripts:
try:
if script.string:
data = json.loads(script.string)
# Adjust this check based on the site's data structure
if isinstance(data, dict) and "event" in data.get("@type", "").lower():
return data
except (json.JSONDecodeError, AttributeError):
continue
return None
def _parse_event_data(self, data: dict, url: str, scraped_at: str) -> VegasEvent | None:
"""Parse extracted JSON data into VegasEvent."""
# Extract performer name
performer = self._extract_performer(data, url)
if not performer:
return None
# Extract event date
event_date = self._extract_date(data, url)
if not event_date:
return None
# Build event data dict
event_dict = {
"url": url,
"scraped_at": scraped_at,
"performer": performer,
"title": f"{performer} at XS Nightclub",
"venue": "XS Nightclub",
"event_date": event_date,
}
# Add optional fields
event_time = self._extract_time(data)
if event_time:
event_dict["event_time"] = event_time
event_dict["event_datetime"] = f"{event_date}T{event_time}:00"
# Extract images
images = self._extract_images(data)
if images:
event_dict["images"] = images
# Extract streaming links
streaming = self._extract_streaming_links(data)
if streaming:
event_dict["streaming_links"] = streaming
return VegasEvent(**event_dict)
def _extract_from_html(self, soup: BeautifulSoup, url: str, scraped_at: str) -> VegasEvent | None:
"""Fallback HTML extraction when JSON is not available."""
# Extract from meta tags, headings, page text
# This is venue-specific
performer = self._extract_performer_from_html(soup, url)
event_date = self._extract_date_from_html(soup, url)
if not performer or not event_date:
return None
return VegasEvent(
url=url,
scraped_at=scraped_at,
performer=performer,
title=f"{performer} at XS Nightclub",
venue="XS Nightclub",
event_date=event_date,
)
def _extract_performer(self, data: dict, url: str) -> str | None:
"""Extract performer name from data or URL."""
# Try data first
if data.get("name"):
return data["name"]
# Try performer field
if data.get("performer"):
if isinstance(data["performer"], dict):
return data["performer"].get("name")
elif isinstance(data["performer"], str):
return data["performer"]
# Fallback to URL
path_parts = urlparse(url).path.strip("/").split("/")
if len(path_parts) >= 2:
return path_parts[-1].replace("-", " ").title()
return None
def _extract_date(self, data: dict, url: str) -> str | None:
"""Extract event date from data or URL."""
# Try data first
if data.get("startDate"):
try:
dt = datetime.fromisoformat(data["startDate"].replace("Z", "+00:00"))
return dt.strftime("%Y-%m-%d")
except (ValueError, AttributeError):
pass
# Try URL pattern extraction
# Adjust regex based on URL format
match = re.search(r'/(\d{4})(\d{2})(\d{2})/', url)
if match:
return f"{match.group(1)}-{match.group(2)}-{match.group(3)}"
return None
def _extract_images(self, data: dict) -> list[ImageMetadata]:
"""Extract image URLs from data."""
images = []
# Look for image fields
if data.get("image"):
img_url = data["image"]
if isinstance(img_url, dict):
img_url = img_url.get("url", "")
if img_url:
images.append(ImageMetadata(
source_url=img_url,
category="artist_full"
))
return images
def _extract_streaming_links(self, data: dict) -> StreamingLinks | None:
"""Extract streaming platform links."""
# Look in description or social fields
# This is venue-specific
return None
def extract_table_pricing(self, soup: BeautifulSoup) -> TablePricing | None:
"""Extract table/bottle pricing if available on page.
This is called automatically for table pricing plugin.
Return None if pricing is not on this page.
"""
# TODO: Implement pricing extraction
# Look for pricing sections, table tiers, etc.
return None
Step 2: Register the Extractor¶
Edit src/extractors/__init__.py:
```python
def create_default_registry() -> ExtractorRegistry:
    """Create registry with all built-in extractors."""
    from src.extractors.liv import LIVExtractor
    from src.extractors.xs import XSExtractor  # Add this line

    registry = ExtractorRegistry()
    registry.register(LIVExtractor())
    registry.register(XSExtractor())  # Add this line
    return registry
```
Step 3: Test the Extractor¶
```bash
# Run scraper on XS events
just scrape -u https://www.wynnsocial.com/events-sitemap.xml --max-requests 5

# Check if events were extracted
just list-runs
```
Adding Features to Existing Extractors¶
Example: Adding Table Pricing to LIV¶
Edit src/extractors/liv.py:
```python
class LIVExtractor(VenueExtractor):
    # ... existing code ...

    def extract_table_pricing(self, soup: BeautifulSoup) -> TablePricing | None:
        """Extract table/bottle pricing from LIV pages.

        Note: LIV loads pricing dynamically via JavaScript.
        This is a placeholder; full implementation requires Playwright.
        """
        # Look for the pricing container
        pricing_section = soup.find("div", class_=re.compile(r"pricing|tables", re.I))
        if not pricing_section:
            return None

        # Extract pricing tiers
        tiers = {}

        # Look for stage tables
        stage_elem = pricing_section.find(string=re.compile(r"stage", re.I))
        if stage_elem:
            parent = stage_elem.find_parent("div", class_="tier-card")
            if parent:
                tiers["stage"] = self._extract_tier_pricing(parent)

        # Look for dance floor tables
        dance_elem = pricing_section.find(string=re.compile(r"dance.?floor", re.I))
        if dance_elem:
            parent = dance_elem.find_parent("div", class_="tier-card")
            if parent:
                tiers["dance_floor"] = self._extract_tier_pricing(parent)

        if tiers:
            return TablePricing(**tiers)
        return None

    def _extract_tier_pricing(self, element) -> TablePricingTier:
        """Extract pricing for a single tier."""
        # Look for minimum spend
        min_spend_match = re.search(r"\$([\d,]+)", element.get_text())
        min_spend = None
        if min_spend_match:
            min_spend = float(min_spend_match.group(1).replace(",", ""))

        # Look for guest count
        guests_match = re.search(r"(\d+)\s*guests?", element.get_text(), re.I)
        guests = None
        if guests_match:
            guests = int(guests_match.group(1))

        return TablePricingTier(
            min_spend=min_spend,
            guests=guests,
            available=True,  # Assume available if shown
        )
```
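The two regexes used for tier pricing can be sanity-checked on a plain string before wiring them through BeautifulSoup. This standalone sketch runs them against a made-up pricing snippet (the sample text is an assumption about what LIV's markup renders to):

```python
import re

# Hypothetical text content of a tier card
sample = "Stage Table - $5,000 minimum spend, up to 8 guests"

# Minimum spend: grab the digits (and commas) after a dollar sign
min_spend_match = re.search(r"\$([\d,]+)", sample)
min_spend = float(min_spend_match.group(1).replace(",", "")) if min_spend_match else None

# Guest count: a number followed by "guest" or "guests"
guests_match = re.search(r"(\d+)\s*guests?", sample, re.I)
guests = int(guests_match.group(1)) if guests_match else None
```

Testing the patterns on saved page text first makes it much easier to tell whether a failure is in the regex or in the DOM navigation.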
Creating New Plugins¶
Plugin Structure¶
Create a new plugin in src/plugins/{plugin_name}/:
src/plugins/{plugin_name}/
├── __init__.py # Main plugin class
├── models.py # Data models (if needed)
├── cli.py # CLI commands
└── README.md # Plugin documentation
Example: Social Media Plugin¶
Create src/plugins/social/__init__.py:
"""Social media enrichment plugin.
Finds and adds social media links for artists and venues.
"""
from __future__ import annotations
import re
from typing import TYPE_CHECKING
if TYPE_CHECKING:
from src.models import VegasEvent
class SocialMediaPlugin:
"""Enrich events with social media links."""
def __init__(self):
"""Initialize plugin."""
self.platforms = {
"instagram": r'instagram\.com/([\w.]+)',
"twitter": r'twitter\.com/([\w]+)',
"facebook": r'facebook\.com/([\w.]+)',
}
def enrich_event(self, event: VegasEvent) -> VegasEvent:
"""Add social media links to an event.
Args:
event: VegasEvent to enrich
Returns:
Enriched event
"""
# Search in description
if event.description:
social_links = self._extract_from_text(event.description)
if social_links:
# Add to event or update existing
if not hasattr(event, 'social_links'):
event.social_links = {}
event.social_links.update(social_links)
return event
def _extract_from_text(self, text: str) -> dict[str, str]:
"""Extract social links from text."""
links = {}
for platform, pattern in self.platforms.items():
matches = re.findall(pattern, text, re.I)
if matches:
links[platform] = f"https://{platform}.com/{matches[0]}"
return links
async def enrich_batch(self, events: list[VegasEvent]) -> list[VegasEvent]:
"""Enrich multiple events.
Args:
events: List of events to enrich
Returns:
List of enriched events
"""
return [self.enrich_event(e) for e in events]
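The pattern dictionary can be exercised on its own, without the plugin class or a VegasEvent. This sketch runs the same regexes over a sample description (the handles are made up):

```python
import re

# Same patterns as SocialMediaPlugin.platforms above
platforms = {
    "instagram": r"instagram\.com/([\w.]+)",
    "twitter": r"twitter\.com/([\w]+)",
}

description = (
    "Follow DJ Example on instagram.com/dj.example "
    "and twitter.com/djexample for updates."
)

links = {}
for platform, pattern in platforms.items():
    matches = re.findall(pattern, description, re.I)
    if matches:
        # Normalize to a full profile URL
        links[platform] = f"https://{platform}.com/{matches[0]}"
```

Note the handle character classes differ per platform: Instagram allows dots in usernames, so its pattern uses `[\w.]+` while Twitter's uses `[\w]+`.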
Register Plugin CLI¶
Edit src/cli.py:
```python
from src.plugins.social.cli import app as social_app

# Mount on the main app as a subcommand group
# (assuming Typer; mounting a sub-app uses add_typer, not command)
app.add_typer(social_app, name="social")
```
Best Practices¶
1. Error Handling¶
Always handle errors gracefully:
```python
def extract(self, soup: BeautifulSoup, url: str) -> VegasEvent | None:
    try:
        # Extraction logic
        return VegasEvent(**data)
    except Exception as e:
        logger.error(f"Failed to extract {url}: {e}")
        return None
```
2. Rate Limiting¶
Be respectful when scraping:
```python
# In the downloader:
delay_seconds = 1.0  # 1 second between requests
max_workers = 3      # 3 concurrent downloads
```
3. Data Validation¶
Use Pydantic models for validation:
```python
from datetime import datetime

from pydantic import BaseModel, field_validator


class VegasEvent(BaseModel):
    event_date: str

    @field_validator("event_date")
    @classmethod
    def validate_date(cls, v: str) -> str:
        """Ensure date is in YYYY-MM-DD format."""
        try:
            datetime.strptime(v, "%Y-%m-%d")
            return v
        except ValueError:
            raise ValueError(f"Invalid date format: {v}")
```
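The validation rule itself can be tested outside Pydantic with just the standard library; this standalone function mirrors the validator's logic so you can check edge cases quickly:

```python
from datetime import datetime


def validate_date(v: str) -> str:
    """Ensure date is in YYYY-MM-DD format (mirrors the validator above)."""
    try:
        datetime.strptime(v, "%Y-%m-%d")
        return v
    except ValueError:
        raise ValueError(f"Invalid date format: {v}")


ok = validate_date("2026-02-27")  # passes through unchanged

try:
    validate_date("02/27/2026")   # wrong format, should be rejected
    rejected = False
except ValueError:
    rejected = True
```

Note that `strptime` also rejects impossible dates like `"2026-02-30"`, so the check is stricter than a regex on the shape alone.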
4. Testing¶
Always test your extractor:
```bash
# Test with limited requests
just scrape --max-requests 3

# Check output
cat runs/latest/events.json | jq '.events[0]'

# Validate images downloaded
just images-status latest
```
5. Documentation¶
Document your extractor:
"""XS Nightclub extractor.
Handles event pages from wynnsocial.com
- Uses JSON-LD when available
- Falls back to HTML parsing
- Extracts: performer, date, time, images, streaming links
Known Issues:
- Pricing requires JavaScript (Playwright needed)
- Some events have minimal descriptions
Example URLs:
- https://www.wynnsocial.com/event/EVE111500020260227/hntr/
"""
Testing Your Changes¶
Local Testing¶
```bash
# 1. Install in development mode
uv pip install -e ".[dev]"

# 2. Run linter
just lint

# 3. Run type checker
just typecheck

# 4. Test scrape
just scrape --max-requests 5

# 5. Check results
just list-runs
just stats

# 6. Test images
just images-download latest
just images-status latest
```
Integration Testing¶
```bash
# Test specific venue
just scrape -u https://venue.com/sitemap.xml

# Test export
just export-csv latest
just export-md latest

# Test diff
just diff run1 run2
```
Examples¶
Complete Venue Extractor: LIV¶
See src/extractors/liv.py for a full example.
Complete Plugin: Images¶
See src/plugins/images/ for a complete plugin example.
Artist Enrichers¶
Artist enrichers live in src/plugins/enrichment/ and add bio, social links, streaming data, and other metadata to VegasEvent objects after scraping. They run as a post-processing step via vinny enrich artists.
How It Works¶
```mermaid
sequenceDiagram
    participant M as master_events.json
    participant R as EnrichmentRegistry
    participant T as TracklistsEnricher
    participant RA as ResidentAdvisorEnricher
    participant S as SpotifyEnricher
    M->>R: Load events
    R->>T: enrich(event)
    T-->>R: social_links (RA URL, instagram, etc.)
    R->>RA: enrich(event)
    Note right of RA: Uses RA URL from Tracklists
    RA-->>R: artist_stats.ra_bio, ra_url
    R->>S: enrich(event)
    S-->>R: spotify_id, genres, top_tracks
    R->>M: Save updated events
```
Order matters: Tracklists runs first to find the RA profile URL; RA uses that URL if present (more accurate than guessing the slug from the name).
The ABC¶
```python
# src/plugins/enrichment/__init__.py
from abc import ABC, abstractmethod


class ArtistEnricher(ABC):
    @property
    @abstractmethod
    def name(self) -> str: ...  # unique key, e.g. "spotify"

    @abstractmethod
    async def enrich(self, event: VegasEvent) -> VegasEvent:
        """MUST return a new event via event.model_copy(update={...}).

        NEVER mutate the input; VegasEvent is immutable (Pydantic).
        """
        ...

    def should_enrich(self, event: VegasEvent) -> bool:
        return True  # override to skip already-enriched events
```
Enrichment Status Tracking¶
Every enricher records its outcome in `event.enrichment_status` (an EnrichmentStatus Pydantic model):
| Field | Type | Meaning |
|---|---|---|
| `spotify` | `bool` | Spotify enrichment succeeded |
| `tracklists` | `bool` | 1001tracklists enrichment succeeded |
| `resident_advisor` | `bool` | RA enrichment succeeded |
| `errors` | `dict[str, str]` | Per-source error messages |
The CLI skips events where all selected enrichers already have True status. Use --force to re-run.
Writing a New Enricher¶
Complete enricher template
The following shows all the code needed for a new enricher — create the file, add the status flag, register it, and update the skip check.
1. Create src/plugins/enrichment/my_source.py:
```python
from __future__ import annotations

from src.models import ArtistStats, EnrichmentStatus, SocialLinks, VegasEvent
from src.plugins.enrichment import ArtistEnricher


class MySourceEnricher(ArtistEnricher):
    @property
    def name(self) -> str:
        return "my_source"

    def should_enrich(self, event: VegasEvent) -> bool:
        # Skip events already enriched by this source
        status = event.enrichment_status
        return not (status and status.my_source)  # add field to EnrichmentStatus model first

    async def enrich(self, event: VegasEvent) -> VegasEvent:
        existing_status = event.enrichment_status or EnrichmentStatus()
        existing_stats = event.artist_stats or ArtistStats()
        stats_update: dict = {}
        status_update: dict = {}

        try:
            # ... fetch data for event.performer ...
            stats_update["ra_bio"] = "fetched bio"
            status_update["my_source"] = True
        except Exception as e:
            status_update["errors"] = {**existing_status.errors, "my_source": str(e)}

        if not stats_update and not status_update:
            return event  # nothing changed; return the original

        return event.model_copy(update={
            "artist_stats": existing_stats.model_copy(update=stats_update),
            "enrichment_status": existing_status.model_copy(update=status_update),
        })
```
2. Add the status flag to EnrichmentStatus in src/models/__init__.py:
```python
class EnrichmentStatus(BaseModel):
    spotify: bool = False
    tracklists: bool = False
    resident_advisor: bool = False
    my_source: bool = False  # ← add this
    errors: dict[str, str] = {}
```
3. Register in src/plugins/enrichment/cli.py:
```python
from src.plugins.enrichment.my_source import MySourceEnricher


def _build_registry(save_dir=None) -> EnrichmentRegistry:
    registry = EnrichmentRegistry()
    registry.register(TracklistsEnricher())
    registry.register(ResidentAdvisorEnricher(save_dir=save_dir))
    registry.register(MySourceEnricher())  # ← add here, after sources it depends on
    registry.register(SpotifyEnricher(save_dir=save_dir))
    return registry
```
4. Update the "all enriched" check in cmd_enrich_artists:
```python
elif status and status.spotify and status.tracklists and status.resident_advisor and status.my_source:
    continue
```
Existing Enrichers at a Glance¶
| Enricher | Source | Method | Populates | Slug strategy |
|---|---|---|---|---|
| `TracklistsEnricher` | 1001tracklists.com | HTML scrape (BeautifulSoup) | `social_links` (instagram, twitter, beatport, soundcloud, RA URL, bandcamp, tiktok), `top_tracks` | Search by performer name |
| `ResidentAdvisorEnricher` | ra.co | GraphQL API (`/graphql`) | `artist_stats.ra_bio`, `artist_stats.ra_url`, social links (fills gaps only) | Extract from `social_links.resident_advisor` URL first; fall back to stripping non-alphanumerics from the performer name (e.g. "Carl Cox" → "carlcox") |
| `SpotifyEnricher` | Spotify Web API | REST API (client credentials) | `artist_stats.spotify_id`, `spotify_embed_url`, `spotify_images`, `genres`, `popularity`, `follower_count`, `top_tracks`, `newest_release` | Extract ID from existing `streaming_links.spotify`; fall back to `/search?type=artist` |
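The RA fallback slug strategy in the table is roughly a lowercase-and-strip of non-alphanumeric characters. A minimal sketch (the function name is illustrative; the real enricher may handle more edge cases):

```python
import re


def ra_slug(performer: str) -> str:
    """Approximate RA profile slug: lowercase, non-alphanumerics stripped."""
    return re.sub(r"[^a-z0-9]", "", performer.lower())


slug = ra_slug("Carl Cox")
```

This is exactly why the RA enricher prefers a URL found by Tracklists: name-derived slugs break for artists whose RA handle is not a simple concatenation of their stage name.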
Key Implementation Patterns¶
Immutability Rule
VegasEvent is a frozen Pydantic v2 model. You must use model_copy(update={...}) to return a new event — never assign to fields directly. Assigning to fields will raise a ValidationError at runtime.
Immutability — always use model_copy(update={...}), never assign to fields:
```python
# ✓ correct
return event.model_copy(update={"artist_stats": new_stats})

# ✗ wrong: Pydantic v2 frozen models reject assignment
event.artist_stats = new_stats
```
Don't overwrite with None — only update fields when you have real data:
if parsed["bio"]:
stats_update["ra_bio"] = parsed["bio"]
# Don't set stats_update["ra_bio"] = None if nothing found
Rate Limiting
Always add a delay between external requests to avoid being blocked. 0.5–1.0 seconds is a good default for most APIs.
Rate limiting: add a sleep before each external request.
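A minimal sketch of such a delay, assuming an async enricher fetching URLs in a loop (the function and the fake "fetched:" results are illustrative; a real enricher would await an HTTP client call where the comment sits):

```python
import asyncio
import time

REQUEST_DELAY = 0.5  # seconds between external requests


async def fetch_all(urls: list[str]) -> list[str]:
    """Fetch sequentially, sleeping between requests to stay polite."""
    results = []
    for url in urls:
        # ... await client.get(url) in a real enricher ...
        results.append(f"fetched:{url}")
        await asyncio.sleep(REQUEST_DELAY)
    return results


start = time.monotonic()
fetched = asyncio.run(fetch_all(["https://example.com/a", "https://example.com/b"]))
elapsed = time.monotonic() - start  # at least 2 × REQUEST_DELAY
```

Sleeping after each request (rather than only between batches) keeps the worst-case request rate bounded even when a source returns quickly.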
Saving raw responses — write per-artist JSON to data/artists/{slug}/source.json so you can re-parse without re-fetching:
```python
if self._save_dir:
    slug = _slugify(performer)
    artist_dir = self._save_dir / slug
    artist_dir.mkdir(parents=True, exist_ok=True)
    (artist_dir / "my_source.json").write_text(json.dumps(raw, indent=2))
```
Running Enrichment¶
```bash
# Enrich all unenriched events (all sources)
vinny enrich artists

# Re-enrich everything, ignoring status
vinny enrich artists --force

# Single artist
vinny enrich artists --artist "Calvin Harris"

# Preview what would run
vinny enrich artists --dry-run

# Spotify only
vinny enrich artists --spotify-only
```
Venue Extractor Reference¶
For the current list of registered venues, URL patterns, and per-venue implementation notes (including the EBC phased rollout), see EXTRACTORS.md.
Contributing Checklist¶
Before submitting a new venue extractor or plugin:
- Extractor handles both JSON and HTML fallback
- Error handling for missing data
- Rate limiting implemented (if making additional requests)
- Tests pass (`just check`)
- Documentation added
- Example URLs provided
- Registered in `create_default_registry()`
- CLI commands added (if applicable)
Getting Help¶
- Check existing extractors in `src/extractors/`
- Look at the LIV extractor for best practices
- Review `docs/vea-image-transformations.md` for image handling
- Open an issue on GitHub with questions
Last updated: 2026-03-03