Vinny Plugin Development Guide¶

This guide explains how to extend Vinny with new venues, features, and plugins.

Table of Contents¶

Architecture Overview
Creating a New Venue Extractor
Adding Features to Existing Extractors
Creating New Plugins
Artist Enrichers
Best Practices
Testing Your Changes
Examples

Architecture Overview¶

Vinny uses a plugin-based architecture with two main extension points:

1. Venue Extractors¶

Located in src/extractors/. Each venue (LIV, XS, Omnia, etc.) has its own extractor that knows how to: - Identify URLs it can handle - Extract event data from HTML/JSON - Find images, pricing, artist info

2. Feature Plugins¶

Located in src/plugins/. Cross-cutting features like: - Image downloading and processing - Table/bottle pricing extraction (coming soon) - Social media enrichment - Notion/Google Sheets integration

Data Flow¶

flowchart TD
    A["Sitemap URL"] --> B["Parse loc + lastmod"]
    B --> C["SitemapIndex.diff()"]
    C -->|new/updated only| D["Event URL"]
    D --> E["Crawlee Router"]
    E --> F["Venue Extractor"]
    F --> G["VegasEvent"]
    G --> H["Storage / Export"]
    G --> I["Feature Plugins"]
    I --> J["Images"]
    I --> K["Pricing"]
    I --> L["Enrichment"]

    style A fill:#7c3aed,color:#fff
    style G fill:#059669,color:#fff
    style J fill:#d97706,color:#fff
    style K fill:#d97706,color:#fff
    style L fill:#d97706,color:#fff

For venues without sitemaps (XS, EBC), all event URLs from the listing page are enqueued directly.

Creating a New Venue Extractor¶

Let's walk through creating an extractor for XS Nightclub (which already has a skeleton).

Step 1: Create the Extractor File¶

Create src/extractors/xs.py:

"""XS Nightclub extractor plugin."""

from __future__ import annotations

import re
from datetime import datetime
from urllib.parse import urlparse

from bs4 import BeautifulSoup

from src.extractors import VenueExtractor
from src.models import VegasEvent, StreamingLinks
from src.plugins.images.models import ImageMetadata


class XSExtractor(VenueExtractor):
    """Extractor for XS Nightclub at Wynn Las Vegas."""

    @property
    def name(self) -> str:
        """Human-readable venue name."""
        return "XS Nightclub"

    @property
    def domain(self) -> str:
        """Domain pattern for URL matching."""
        # This tells Vinny which URLs this extractor handles
        return "wynnsocial.com"

    def can_handle(self, url: str) -> bool:
        """Check if this extractor can handle the given URL.

        Override this for more complex URL matching.
        """
        # Default checks if domain is in URL
        # You can add additional checks here
        return self.domain in url.lower()

    def extract(self, soup: BeautifulSoup, url: str) -> VegasEvent | None:
        """Extract event data from XS event pages.

        Args:
            soup: Parsed BeautifulSoup object
            url: Source URL

        Returns:
            VegasEvent if extraction successful, None otherwise
        """
        # Step 1: Extract basic info
        scraped_at = datetime.utcnow().isoformat()

        # Step 2: Try to find embedded JSON (if available)
        event_data = self._extract_embedded_json(soup)
        if event_data:
            return self._parse_event_data(event_data, url, scraped_at)

        # Step 3: Fallback to HTML extraction
        return self._extract_from_html(soup, url, scraped_at)

    def _extract_embedded_json(self, soup: BeautifulSoup) -> dict | None:
        """Extract JSON-LD or embedded event data.

        Many sites embed structured data in <script type="application/ld+json">
        """
        scripts = soup.find_all("script", type="application/ld+json")

        for script in scripts:
            try:
                if script.string:
                    data = json.loads(script.string)
                    # Adjust this check based on the site's data structure
                    if isinstance(data, dict) and "event" in data.get("@type", "").lower():
                        return data
            except (json.JSONDecodeError, AttributeError):
                continue

        return None

    def _parse_event_data(self, data: dict, url: str, scraped_at: str) -> VegasEvent | None:
        """Parse extracted JSON data into VegasEvent."""
        # Extract performer name
        performer = self._extract_performer(data, url)
        if not performer:
            return None

        # Extract event date
        event_date = self._extract_date(data, url)
        if not event_date:
            return None

        # Build event data dict
        event_dict = {
            "url": url,
            "scraped_at": scraped_at,
            "performer": performer,
            "title": f"{performer} at XS Nightclub",
            "venue": "XS Nightclub",
            "event_date": event_date,
        }

        # Add optional fields
        event_time = self._extract_time(data)
        if event_time:
            event_dict["event_time"] = event_time
            event_dict["event_datetime"] = f"{event_date}T{event_time}:00"

        # Extract images
        images = self._extract_images(data)
        if images:
            event_dict["images"] = images

        # Extract streaming links
        streaming = self._extract_streaming_links(data)
        if streaming:
            event_dict["streaming_links"] = streaming

        return VegasEvent(**event_dict)

    def _extract_from_html(self, soup: BeautifulSoup, url: str, scraped_at: str) -> VegasEvent | None:
        """Fallback HTML extraction when JSON is not available."""
        # Extract from meta tags, headings, page text
        # This is venue-specific

        performer = self._extract_performer_from_html(soup, url)
        event_date = self._extract_date_from_html(soup, url)

        if not performer or not event_date:
            return None

        return VegasEvent(
            url=url,
            scraped_at=scraped_at,
            performer=performer,
            title=f"{performer} at XS Nightclub",
            venue="XS Nightclub",
            event_date=event_date,
        )

    def _extract_performer(self, data: dict, url: str) -> str | None:
        """Extract performer name from data or URL."""
        # Try data first
        if data.get("name"):
            return data["name"]

        # Try performer field
        if data.get("performer"):
            if isinstance(data["performer"], dict):
                return data["performer"].get("name")
            elif isinstance(data["performer"], str):
                return data["performer"]

        # Fallback to URL
        path_parts = urlparse(url).path.strip("/").split("/")
        if len(path_parts) >= 2:
            return path_parts[-1].replace("-", " ").title()

        return None

    def _extract_date(self, data: dict, url: str) -> str | None:
        """Extract event date from data or URL."""
        # Try data first
        if data.get("startDate"):
            try:
                dt = datetime.fromisoformat(data["startDate"].replace("Z", "+00:00"))
                return dt.strftime("%Y-%m-%d")
            except (ValueError, AttributeError):
                pass

        # Try URL pattern extraction
        # Adjust regex based on URL format
        match = re.search(r'/(\d{4})(\d{2})(\d{2})/', url)
        if match:
            return f"{match.group(1)}-{match.group(2)}-{match.group(3)}"

        return None

    def _extract_images(self, data: dict) -> list[ImageMetadata]:
        """Extract image URLs from data."""
        images = []

        # Look for image fields
        if data.get("image"):
            img_url = data["image"]
            if isinstance(img_url, dict):
                img_url = img_url.get("url", "")

            if img_url:
                images.append(ImageMetadata(
                    source_url=img_url,
                    category="artist_full"
                ))

        return images

    def _extract_streaming_links(self, data: dict) -> StreamingLinks | None:
        """Extract streaming platform links."""
        # Look in description or social fields
        # This is venue-specific
        return None

    def extract_table_pricing(self, soup: BeautifulSoup) -> TablePricing | None:
        """Extract table/bottle pricing if available on page.

        This is called automatically for table pricing plugin.
        Return None if pricing is not on this page.
        """
        # TODO: Implement pricing extraction
        # Look for pricing sections, table tiers, etc.
        return None

Step 2: Register the Extractor¶

Edit src/extractors/__init__.py:

def create_default_registry() -> ExtractorRegistry:
    """Create registry with all built-in extractors."""
    from src.extractors.liv import LIVExtractor
    from src.extractors.xs import XSExtractor  # Add this line

    registry = ExtractorRegistry()
    registry.register(LIVExtractor())
    registry.register(XSExtractor())  # Add this line

    return registry

Step 3: Test the Extractor¶

# Run scraper on XS events
just scrape -u https://www.wynnsocial.com/events-sitemap.xml --max-requests 5

# Check if events were extracted
just list-runs

Adding Features to Existing Extractors¶

Example: Adding Table Pricing to LIV¶

Edit src/extractors/liv.py:

class LIVExtractor(VenueExtractor):
    # ... existing code ...

    def extract_table_pricing(self, soup: BeautifulSoup) -> TablePricing | None:
        """Extract table/bottle pricing from LIV pages.

        Note: LIV loads pricing dynamically via JavaScript.
        This is a placeholder - full implementation requires Playwright.
        """
        # Look for pricing container
        pricing_section = soup.find("div", class_=re.compile(r"pricing|tables", re.I))

        if not pricing_section:
            return None

        # Extract pricing tiers
        tiers = {}

        # Look for stage tables
        stage_elem = pricing_section.find(string=re.compile(r"stage", re.I))
        if stage_elem:
            parent = stage_elem.find_parent("div", class_="tier-card")
            if parent:
                tiers["stage"] = self._extract_tier_pricing(parent)

        # Look for dance floor tables
        dance_elem = pricing_section.find(string=re.compile(r"dance.?floor", re.I))
        if dance_elem:
            parent = dance_elem.find_parent("div", class_="tier-card")
            if parent:
                tiers["dance_floor"] = self._extract_tier_pricing(parent)

        if tiers:
            return TablePricing(**tiers)

        return None

    def _extract_tier_pricing(self, element) -> TablePricingTier:
        """Extract pricing for a single tier."""
        # Look for min spend
        min_spend_match = re.search(r'\$([\d,]+)', element.get_text())
        min_spend = None
        if min_spend_match:
            min_spend = float(min_spend_match.group(1).replace(",", ""))

        # Look for guest count
        guests_match = re.search(r'(\d+)\s*guests?', element.get_text(), re.I)
        guests = None
        if guests_match:
            guests = int(guests_match.group(1))

        return TablePricingTier(
            min_spend=min_spend,
            guests=guests,
            available=True  # Assume available if shown
        )

Creating New Plugins¶

Plugin Structure¶

Create a new plugin in src/plugins/{plugin_name}/:

src/plugins/{plugin_name}/
├── __init__.py          # Main plugin class
├── models.py           # Data models (if needed)
├── cli.py              # CLI commands
└── README.md           # Plugin documentation

Create src/plugins/social/__init__.py:

"""Social media enrichment plugin.

Finds and adds social media links for artists and venues.
"""

from __future__ import annotations

import re
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from src.models import VegasEvent


class SocialMediaPlugin:
    """Enrich events with social media links."""

    def __init__(self):
        """Initialize plugin."""
        self.platforms = {
            "instagram": r'instagram\.com/([\w.]+)',
            "twitter": r'twitter\.com/([\w]+)',
            "facebook": r'facebook\.com/([\w.]+)',
        }

    def enrich_event(self, event: VegasEvent) -> VegasEvent:
        """Add social media links to an event.

        Args:
            event: VegasEvent to enrich

        Returns:
            Enriched event
        """
        # Search in description
        if event.description:
            social_links = self._extract_from_text(event.description)
            if social_links:
                # Add to event or update existing
                if not hasattr(event, 'social_links'):
                    event.social_links = {}
                event.social_links.update(social_links)

        return event

    def _extract_from_text(self, text: str) -> dict[str, str]:
        """Extract social links from text."""
        links = {}

        for platform, pattern in self.platforms.items():
            matches = re.findall(pattern, text, re.I)
            if matches:
                links[platform] = f"https://{platform}.com/{matches[0]}"

        return links

    async def enrich_batch(self, events: list[VegasEvent]) -> list[VegasEvent]:
        """Enrich multiple events.

        Args:
            events: List of events to enrich

        Returns:
            List of enriched events
        """
        return [self.enrich_event(e) for e in events]

Register Plugin CLI¶

Edit src/cli.py:

from src.plugins.social.cli import app as social_app

# Add to main app as a subcommand
app.command(social_app, name="social")

Best Practices¶

1. Error Handling¶

Always handle errors gracefully:

def extract(self, soup: BeautifulSoup, url: str) -> VegasEvent | None:
    try:
        # Extraction logic
        return VegasEvent(**data)
    except Exception as e:
        logger.error(f"Failed to extract {url}: {e}")
        return None

2. Rate Limiting¶

Be respectful when scraping:

# In downloader:
delay_seconds = 1.0  # 1 second between requests
max_workers = 3      # 3 concurrent downloads

3. Data Validation¶

Use Pydantic models for validation:

from pydantic import BaseModel, field_validator

class VegasEvent(BaseModel):
    event_date: str

    @field_validator("event_date")
    @classmethod
    def validate_date(cls, v: str) -> str:
        """Ensure date is YYYY-MM-DD format."""
        try:
            datetime.strptime(v, "%Y-%m-%d")
            return v
        except ValueError:
            raise ValueError(f"Invalid date format: {v}")

4. Testing¶

Always test your extractor:

# Test with limited requests
just scrape --max-requests 3

# Check output
cat runs/latest/events.json | jq '.events[0]'

# Validate images downloaded
just images-status latest

5. Documentation¶

Document your extractor:

"""XS Nightclub extractor.

Handles event pages from wynnsocial.com
- Uses JSON-LD when available
- Falls back to HTML parsing
- Extracts: performer, date, time, images, streaming links

Known Issues:
- Pricing requires JavaScript (Playwright needed)
- Some events have minimal descriptions

Example URLs:
- https://www.wynnsocial.com/event/EVE111500020260227/hntr/
"""

Testing Your Changes¶

Local Testing¶

# 1. Install in development mode
uv pip install -e ".[dev]"

# 2. Run linter
just lint

# 3. Run type checker
just typecheck

# 4. Test scrape
just scrape --max-requests 5

# 5. Check results
just list-runs
just stats

# 6. Test images
just images-download latest
just images-status latest

Integration Testing¶

# Test specific venue
just scrape -u https://venue.com/sitemap.xml

# Test export
just export-csv latest
just export-md latest

# Test diff
just diff run1 run2

Examples¶

Complete Venue Extractor: LIV¶

See src/extractors/liv.py for a full example.

Complete Plugin: Images¶

See src/plugins/images/ for a complete plugin example.

Artist Enrichers¶

Artist enrichers live in src/plugins/enrichment/ and add bio, social links, streaming data, and other metadata to VegasEvent objects after scraping. They run as a post-processing step via vinny enrich artists.

How It Works¶

sequenceDiagram
    participant M as master_events.json
    participant R as EnrichmentRegistry
    participant T as TracklistsEnricher
    participant RA as ResidentAdvisorEnricher
    participant S as SpotifyEnricher

    M->>R: Load events
    R->>T: enrich(event)
    T-->>R: social_links (RA URL, instagram, etc.)
    R->>RA: enrich(event)
    Note right of RA: Uses RA URL from Tracklists
    RA-->>R: artist_stats.ra_bio, ra_url
    R->>S: enrich(event)
    S-->>R: spotify_id, genres, top_tracks
    R->>M: Save updated events

Order matters: Tracklists runs first to find the RA profile URL; RA uses that URL if present (more accurate than guessing the slug from the name).

The ABC¶

# src/plugins/enrichment/__init__.py
class ArtistEnricher(ABC):
    @property
    @abstractmethod
    def name(self) -> str: ...           # unique key, e.g. "spotify"

    @abstractmethod
    async def enrich(self, event: VegasEvent) -> VegasEvent: ...
    # MUST return a new event via event.model_copy(update={...})
    # NEVER mutate the input — VegasEvent is immutable (Pydantic)

    def should_enrich(self, event: VegasEvent) -> bool:
        return True                       # override to skip already-enriched events

Enrichment Status Tracking¶

Every enricher records its outcome in event.enrichment_status (a EnrichmentStatus Pydantic model):

Field	Type	Meaning
`spotify`	`bool`	Spotify enrichment succeeded
`tracklists`	`bool`	1001tracklists enrichment succeeded
`resident_advisor`	`bool`	RA enrichment succeeded
`errors`	`dict[str, str]`	Per-source error messages

The CLI skips events where all selected enrichers already have True status. Use --force to re-run.

Writing a New Enricher¶

Complete enricher template

The following shows all the code needed for a new enricher — create the file, add the status flag, register it, and update the skip check.

1. Create src/plugins/enrichment/my_source.py:

from __future__ import annotations
from src.models import ArtistStats, EnrichmentStatus, SocialLinks, VegasEvent
from src.plugins.enrichment import ArtistEnricher

class MySourceEnricher(ArtistEnricher):
    @property
    def name(self) -> str:
        return "my_source"

    def should_enrich(self, event: VegasEvent) -> bool:
        # Skip events already enriched by this source
        status = event.enrichment_status
        return not (status and status.my_source)  # add field to EnrichmentStatus model first

    async def enrich(self, event: VegasEvent) -> VegasEvent:
        existing_status = event.enrichment_status or EnrichmentStatus()
        existing_stats = event.artist_stats or ArtistStats()
        stats_update: dict = {}
        status_update: dict = {}

        try:
            # ... fetch data for event.performer ...
            stats_update["ra_bio"] = "fetched bio"
            status_update["my_source"] = True
        except Exception as e:
            status_update["errors"] = {**existing_status.errors, "my_source": str(e)}

        if not stats_update and not status_update:
            return event  # nothing changed — return original

        return event.model_copy(update={
            "artist_stats": existing_stats.model_copy(update=stats_update),
            "enrichment_status": existing_status.model_copy(update=status_update),
        })

2. Add the status flag to EnrichmentStatus in src/models/__init__.py:

class EnrichmentStatus(BaseModel):
    spotify: bool = False
    tracklists: bool = False
    resident_advisor: bool = False
    my_source: bool = False   # ← add this
    errors: dict[str, str] = {}

3. Register in src/plugins/enrichment/cli.py:

from src.plugins.enrichment.my_source import MySourceEnricher

def _build_registry(save_dir=None) -> EnrichmentRegistry:
    registry = EnrichmentRegistry()
    registry.register(TracklistsEnricher())
    registry.register(ResidentAdvisorEnricher(save_dir=save_dir))
    registry.register(MySourceEnricher())   # ← add here, after sources it depends on
    registry.register(SpotifyEnricher(save_dir=save_dir))
    return registry

4. Update the "all enriched" check in cmd_enrich_artists:

elif status and status.spotify and status.tracklists and status.resident_advisor and status.my_source:
    continue

Existing Enrichers at a Glance¶

Enricher	Source	Method	Populates	Slug strategy
`TracklistsEnricher`	1001tracklists.com	HTML scrape (BeautifulSoup)	`social_links` (instagram, twitter, beatport, soundcloud, RA URL, bandcamp, tiktok), `top_tracks`	Search by performer name
`ResidentAdvisorEnricher`	ra.co	GraphQL API (`/graphql`)	`artist_stats.ra_bio`, `artist_stats.ra_url`, social links (fills gaps only)	Extract from `social_links.resident_advisor` URL first; fall back to stripping non-alphanumeric from performer name (e.g. `"Carl Cox"` → `"carlcox"`)
`SpotifyEnricher`	Spotify Web API	REST API (client credentials)	`artist_stats.spotify_id`, `spotify_embed_url`, `spotify_images`, `genres`, `popularity`, `follower_count`, `top_tracks`, `newest_release`	Extract ID from existing `streaming_links.spotify`; fall back to `/search?type=artist`

Key Implementation Patterns¶

Immutability Rule

VegasEvent is a frozen Pydantic v2 model. You must use model_copy(update={...}) to return a new event — never assign to fields directly. Assigning to fields will raise a ValidationError at runtime.

Immutability — always use model_copy(update={...}), never assign to fields:

# ✓ correct
return event.model_copy(update={"artist_stats": new_stats})

# ✗ wrong — Pydantic v2 models are frozen
event.artist_stats = new_stats

Don't overwrite with None — only update fields when you have real data:

if parsed["bio"]:
    stats_update["ra_bio"] = parsed["bio"]
# Don't set stats_update["ra_bio"] = None if nothing found

Rate Limiting

Always add a delay between external requests to avoid being blocked. 0.5–1.0 seconds is a good default for most APIs.

Rate limiting — add a sleep before each external request:

_RATE_LIMIT_DELAY = 0.5  # seconds
await asyncio.sleep(_RATE_LIMIT_DELAY)

Saving raw responses — write per-artist JSON to data/artists/{slug}/source.json so you can re-parse without re-fetching:

if self._save_dir:
    slug = _slugify(performer)
    artist_dir = self._save_dir / slug
    artist_dir.mkdir(parents=True, exist_ok=True)
    (artist_dir / "my_source.json").write_text(json.dumps(raw, indent=2))

Running Enrichment¶

# Enrich all unenriched events (all sources)
vinny enrich artists

# Re-enrich everything, ignoring status
vinny enrich artists --force

# Single artist
vinny enrich artists --artist "Calvin Harris"

# Preview what would run
vinny enrich artists --dry-run

# Spotify only
vinny enrich artists --spotify-only

Venue Extractor Reference¶

For the current list of registered venues, URL patterns, and per-venue implementation notes (including the EBC phased rollout), see EXTRACTORS.md.

Contributing Checklist¶

Before submitting a new venue extractor or plugin:

Extractor handles both JSON and HTML fallback
Error handling for missing data
Rate limiting implemented (if making additional requests)
Tests pass (just check)
Documentation added
Example URLs provided
Registered in create_default_registry()
CLI commands added (if applicable)

Getting Help¶

Check existing extractors in src/extractors/
Look at the LIV extractor for best practices
Review docs/vea-image-transformations.md for image handling
Open an issue on GitHub with questions

Last updated: 2026-03-03