Vinny FAQ

Frequently asked questions about Vinny development and usage.

General Questions

Q: What is Vinny?

A: Vinny is a Las Vegas nightlife event scraper designed to extract comprehensive event data from nightclub websites. It uses a plugin-based architecture to support multiple venues and provides advanced features like multi-size image downloading and field-level change tracking.

Q: Why "Vinny"?

A: Named after the quintessential Vegas promoter - the guy who knows everyone, gets you on the list, and always has the inside scoop on what's happening. Vinny the scraper does the same thing for data.

Q: Is it legal/ethical to scrape these sites?

A: Vinny scrapes publicly available data from venue websites. It:

- Respects rate limits (1s delay between requests)
- Uses standard HTTP requests (no circumventing protections)
- Only scrapes public event pages
- Does not scrape private/restricted content

Always check venue Terms of Service before scraping commercially.


Installation & Setup

Q: Do I need Apify to use Vinny?

A: No! While Vinny was originally built for Apify, v1.5+ works standalone with Crawlee. You can run it locally without any Apify dependencies.

Q: What's the difference between pip and uv?

A:

- pip: Traditional Python package manager (slower, requirements.txt)
- uv: Modern, fast Python package manager (recommended)

Vinny supports both, but uv is faster and preferred.

Q: How do I install on Windows?

A:

# Install uv
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"

# Clone and install
git clone <repo-url>
cd vinny
uv pip install -e ".[dev]"

# Run
just scrape --max-requests 5

Q: Can I run this in Docker?

A: Yes, but it's not necessary. The Dockerfile is maintained for Apify platform deployment. For local use, uv is faster and easier.


Scraping

Q: Which venues are supported?

A: Currently:

- ✅ LIV Las Vegas (Fontainebleau)
- ✅ LIV Beach (Fontainebleau pool)
- ✅ XS Nightclub (Wynn) — full event + pricing extraction
- ➕ Easy to add more via plugin system

Q: How do I add a new venue?

A: See PLUGIN_DEVELOPMENT.md. Briefly:

  1. Create src/extractors/{venue}.py
  2. Inherit from VenueExtractor
  3. Implement extract() method
  4. Register in src/extractors/__init__.py
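The steps above can be sketched in miniature. This is an illustrative stub, not Vinny's actual `VenueExtractor` interface: the method signatures, the `Event` model, and the `OmniaExtractor` example are assumptions here, and the real base class lives in src/extractors/ (see PLUGIN_DEVELOPMENT.md for the authoritative API).

```python
# Minimal sketch of the plugin pattern described above. The stub base
# class stands in for the real one in src/extractors/; signatures are
# illustrative assumptions.
from dataclasses import dataclass


@dataclass
class Event:
    title: str
    url: str


class VenueExtractor:
    """Stub base class illustrating the extractor contract."""

    domain: str = ""

    def can_handle(self, url: str) -> bool:
        # The dispatcher uses this to route URLs to extractors.
        return self.domain in url

    def extract(self, soup, url: str) -> list[Event]:
        raise NotImplementedError


class OmniaExtractor(VenueExtractor):
    """Hypothetical new venue (step 1-3 above)."""

    domain = "omnianightclub.com"

    def extract(self, soup, url: str) -> list[Event]:
        # Real code would parse `soup` (a BeautifulSoup tree); this
        # just demonstrates the shape of the return value.
        return [Event(title="Example Night", url=url)]
```

Step 4 (registration) would then add the new class to the registry in src/extractors/__init__.py.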

Q: What is vinny sync and when should I use it?

A: vinny sync is the recommended one-command pipeline. It runs all four steps in sequence and handles dependencies (images before D1, R2 URLs before export) automatically:

  1. Scrape — crawl venue websites
  2. Images download — download artist photos (main/500px)
  3. R2 upload + D1 sync — upload images to Cloudflare R2, write R2 URLs back to D1
  4. D1 export — push all events/artists/images to Cloudflare D1

# Full pipeline (all venues)
vinny sync

# Specific venue
vinny sync xs
vinny sync liv --with-pricing

# Skip scrape — resume from an existing run
vinny sync --run latest
vinny sync --run 2026-03-01-201004

If a step fails after a scrape, use --run latest to resume the pipeline from the existing run instead of scraping again.

Q: Can I scrape multiple venues at once?

A: Yes! Create a sitemap or list of URLs from different venues:

just scrape \
  -u https://livnightclub.com/sitemap.xml \
  -u https://wynnsocial.com/sitemap.xml \
  --max-requests 100

Q: Why are some events missing data?

A: Common reasons:

- Minimal description: Some events just say "Special Guest"
- Dynamic content: Pricing loaded via JavaScript (requires Playwright)
- Missing fields: Not all events have streaming links or bios
- Parse errors: HTML structure changed

Check runs/latest/events.json to see what was extracted.

Q: How fast does it scrape?

A: With rate limiting (default):

- ~60 requests/minute (1 request per second)
- ~500-600 events/hour
- 10 events typically take 2-3 minutes

Increase max_requests for more speed (but be respectful).


Images

Q: Where are images stored?

A: In a global artists directory (not per-run):

data/images/artists/
├── dom-dolla_main.jpg     # 500px — keyed by artist, not by event
├── dom-dolla_small.jpg    # 250px
├── cloonee_main.jpg
└── ...

The same file is reused across every event featuring that artist. Downloads are skipped automatically if the file already exists.
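The skip-if-exists behavior can be approximated like this. The path layout follows the tree above; the function name and the `fetch` callable are illustrative placeholders, not Vinny's actual downloader API.

```python
# Sketch of the artist-keyed, skip-if-exists download logic described
# above. `fetch(slug, size)` is a placeholder for the real HTTP call.
from pathlib import Path


def download_artist_image(
    slug: str, size: str, fetch, root: str = "data/images/artists"
) -> Path:
    """Download {slug}_{size}.jpg unless the file already exists."""
    dest = Path(root) / f"{slug}_{size}.jpg"
    if dest.exists():
        # Reused across every event featuring this artist.
        return dest
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(fetch(slug, size))
    return dest
```

Because files are keyed by artist slug rather than event, a second event featuring the same artist costs zero downloads.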

Q: What image sizes are available?

A: Via VEA CDN:

- small — 250px (~19KB) — thumbnails, lists
- main — 500px (~41KB) — default, cards, web display
- medium — 750px (~95KB)
- large — 1000px (~133KB)
- hd — 1500px (~290KB)
- raw — original resolution

Size names changed in v1.9: thumbnail was renamed to small, small was renamed to main. Old JSON files with the old names are silently migrated on load.
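The silent migration can be sketched as a key rename on load. The mapping comes from the FAQ text above; the function name and the heuristic for detecting an old-schema file (presence of the retired "thumbnail" key) are assumptions here, not Vinny's actual loader code.

```python
# Sketch of the v1.9 size-name migration described above:
# thumbnail -> small, small -> main, applied when loading old JSON.
_RENAMED_SIZES = {"thumbnail": "small", "small": "main"}


def migrate_size_names(images: dict) -> dict:
    """Rename pre-v1.9 size keys in a {size: url} mapping.

    Assumption: only old-schema files contain the retired
    'thumbnail' key, so its presence triggers migration.
    """
    if "thumbnail" not in images:
        return images  # already current schema
    return {_RENAMED_SIZES.get(k, k): v for k, v in images.items()}
```

Note the order matters conceptually: "thumbnail" becomes "small" while the old "small" becomes "main", so both renames must happen in one pass to avoid collisions.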

Q: Can I get WebP images?

A: Not directly. VEA CDN only serves JPEG. However, you can:

  1. Download JPEG
  2. Convert locally using Pillow/imagemagick
  3. Or use Cloudflare Polish if you proxy through Cloudflare

Q: Why are my images downloading slowly?

A: Intentional rate limiting to be respectful:

- 1 second delay between downloads
- 3 concurrent downloads max
- Prevents getting blocked by VEA CDN

Q: Can I download images without scraping again?

A: Yes! Images are downloaded in a separate step:

# Scrape events (saves image URLs)
just scrape

# Download images anytime later
just images-download latest --size hd

Q: What are the image effects?

A: VEA CDN supports:

- Grayscale: g prefix (e.g., g500SC500)
- Blur: b prefix (e.g., b500SC500)
- Sepia: s prefix (e.g., s500SC500)
- Crop: Center, Top, Left, Right, Bottom
- Color bleed: Background fill for aspect ratio mismatches

See vea-image-transformations.md for details.


Data & Storage

Q: Where is data stored?

A:

- Runs: runs/{timestamp}/ - Self-contained scrape results
- Master DB: data/master_events.json - Accumulated data
- Latest symlink: runs/latest/ - Always points to newest run

Q: What's the difference between a "run" and the "master database"?

A:

- Run: One-time scrape, immutable, timestamped folder
- Master DB: Accumulates data from all runs, tracks changes, enables enrichment over time
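The enrichment-over-time idea boils down to a merge keyed by event ID. This is an illustrative sketch, not src/master_database.py's actual logic; the field names and None-means-missing convention are assumptions.

```python
# Sketch of run-into-master accumulation described above: a new run's
# non-empty fields overwrite or fill the master record, so events get
# enriched over time. Field names are illustrative.
def merge_run_into_master(master: dict, run_events: list[dict]) -> dict:
    for event in run_events:
        existing = master.get(event["id"], {})
        fresh = {k: v for k, v in event.items() if v is not None}
        master[event["id"]] = {**existing, **fresh}  # old kept, new wins
    return master
```

This is also why runs stay immutable: the master absorbs their data, and deleting an old run folder loses nothing that was merged.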

Q: Can I delete old runs?

A: Yes! Each run is self-contained. Delete safely:

rm -rf runs/2026-02-26-*

The master database (data/) keeps the accumulated data.

Q: How much disk space does it use?

A: Typical sizes:

- JSON events: ~50KB per event
- CSV export: ~20KB per event
- Markdown: ~30KB per event
- Images: 40-300KB per size variant

Example: 100 events with 2 image sizes = ~20MB

Q: Can I export to Excel/Google Sheets?

A: Yes! Use the CSV export:

just export-csv latest
# Then open runs/latest/events.csv in Excel/Sheets

Q: How do I import into Notion?

A:

  1. Export to CSV: just export-csv latest
  2. In Notion: Import → CSV
  3. Map columns to Notion properties

Or use the Markdown export for page content:

just export-md latest
# Import .md files as Notion pages


Development

Q: How do I create a new plugin?

A: See PLUGIN_DEVELOPMENT.md. Create:

src/plugins/my_plugin/
├── __init__.py      # Main plugin class
├── models.py        # Data models
└── cli.py           # CLI commands (optional)

Q: How do I add table/bottle pricing?

A: This requires JavaScript execution (pricing loaded dynamically):

  1. Add extract_table_pricing() to venue extractor
  2. Use Playwright to click pricing sections
  3. Extract pricing from revealed content

See src/extractors/liv.py for skeleton implementation.

Q: Can I use Playwright instead of BeautifulSoup?

A: Yes! Vinny supports both:

- BeautifulSoupCrawler - Fast, static HTML (default)
- PlaywrightCrawler - JavaScript rendering (for dynamic content)

See src/crawlee_main.py to switch crawlers.

Q: How do I test my extractor?

A:

# 1. Run linter
just lint

# 2. Type check
just typecheck

# 3. Quick test scrape
just scrape --max-requests 3

# 4. Check output
just list-runs
cat runs/latest/events.json | jq '.events[0]'

Q: Where should I put my code?

A:

- New venue: src/extractors/{venue}.py
- New feature: src/plugins/{feature}/
- CLI command: src/cli.py or src/plugins/{feature}/cli.py
- Utilities: src/utils/ (create if needed)

Q: How do I debug extraction?

A:

# Add logging to your extractor
import logging
logger = logging.getLogger(__name__)

def extract(self, soup, url):
    logger.info(f"Extracting: {url}")
    logger.debug(f"Page title: {soup.title}")
    events = []
    # ... extraction logic populates events
    logger.info(f"Found {len(events)} events")
    return events

Then run with debug logging:

export VINNY_DEBUG=1
just scrape --max-requests 1


Troubleshooting

Q: D1 export fails with HTTP 403 (error code 7403)

A: Your Cloudflare API token lacks the Account > D1 > Edit permission. The error reads: "The given account is not valid or is not authorized to access this service" (code 7403).

Fix:

  1. Go to dash.cloudflare.com/profile/api-tokens
  2. Edit your token → add Account > D1 > Edit permission
  3. Save and re-run

Q: Why does Vinny use the D1 REST API instead of wrangler d1 execute --file?

A: The wrangler d1 execute --file command uses Cloudflare's /import endpoint internally — a multi-step protocol that requires OAuth-style auth. This often fails with confusing errors when using an API token.

Vinny uses the /query endpoint directly (POST .../d1/database/{id}/query), which accepts a standard Bearer token and is far more reliable for programmatic use.
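A sketch of that direct call, shown as request construction only (no network). The endpoint path and Bearer auth follow the public Cloudflare D1 REST API; the account ID, database ID, and SQL are placeholders, and sending with `requests.post(url, headers=headers, json=payload)` is left to the caller.

```python
# Sketch of the direct /query call described above. Builds the URL,
# headers, and JSON body for Cloudflare's D1 REST API; sending the
# request is left to the caller.
def build_d1_query_request(account_id: str, database_id: str,
                           token: str, sql: str, params=None):
    url = (
        "https://api.cloudflare.com/client/v4"
        f"/accounts/{account_id}/d1/database/{database_id}/query"
    )
    headers = {
        "Authorization": f"Bearer {token}",  # a plain API token works here
        "Content-Type": "application/json",
    }
    payload = {"sql": sql, "params": params or []}
    return url, headers, payload
```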

See D1_DEPLOYMENT.md for full auth setup and troubleshooting.

Q: D1 is targeting the wrong database

A: wrangler.jsonc needs both fields:

{
  "d1_databases": [{
    "database_name": "vinny-vegas",    // for wrangler CLI fallback
    "database_id": "ca47027a-..."     // UUID for REST API — get from CF dashboard
  }]
}

Verify your account ID with bunx wrangler whoami.

Q: "No extractor found for URL"

A:

  1. Check if venue extractor exists: just venues
  2. Verify URL contains extractor's domain
  3. Check can_handle() method in extractor

Q: "Failed to download image"

A:

  1. Check if image URL is valid: curl -I <url>
  2. Try downloading manually: just images-download latest
  3. Check rate limits (wait a few minutes)
  4. Retry: just images-retry latest

Q: "JSON decode error"

A:

- Page structure changed
- Venue updated their website
- Check runs/latest/raw/*.json for actual scraped data
- Update extractor to handle new structure

Q: "Out of memory"

A:

- Reduce max_requests: just scrape --max-requests 10
- Don't download all image sizes at once
- Close other applications
- Increase Docker memory limit (if using Docker)

Q: "Permission denied"

A:

# Ensure virtual environment is activated
source .venv/bin/activate  # or: .venv\Scripts\activate on Windows

# Reinstall
uv pip install -e ".[dev]"

Q: Images are corrupted/invalid

A:

  1. Run validation: just images-validate latest
  2. Delete corrupted files: rm runs/latest/images/*/*
  3. Re-download: just images-download latest


Performance

Q: Can I scrape faster?

A: You can, but be careful:

# Reduce delay (default: 1.0s)
just scrape --delay 0.5

# Increase workers (default: 3)
just scrape --workers 5

Warning: Faster scraping increases risk of being blocked. Use responsibly.

Q: Can I run multiple scrapes simultaneously?

A: Not recommended. The master database isn't designed for concurrent writes. Run sequentially:

just scrape --max-requests 50
just scrape -u https://other-venue.com/sitemap.xml

Q: How do I scrape only new events?

A: Use the master database + diff:

# Scrape
just scrape

# Compare to previous
just diff run1 run2

# Only new events will show as "Added"
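Conceptually, the "Added" set is just a set difference on event IDs between two runs. This sketch is illustrative, not the actual `just diff` implementation, which also tracks field-level changes.

```python
# Sketch of how the "Added" set shown by the diff can be computed:
# events present in the newer run but absent from the older one,
# keyed by event ID.
def added_events(old_run: list[dict], new_run: list[dict]) -> list[dict]:
    old_ids = {e["id"] for e in old_run}
    return [e for e in new_run if e["id"] not in old_ids]
```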


Integration

Q: Can I use this with n8n/Zapier?

A: Yes! Export to CSV/JSON and use as webhook trigger:

just scrape --max-requests 10
just export-csv latest
curl -X POST https://n8n.example.com/webhook \
  -F "file=@runs/latest/events.csv"

Q: Can I schedule automatic scraping?

A: Use cron (Linux/Mac) or Task Scheduler (Windows):

# Cron example: scrape daily at 9am
0 9 * * * cd /path/to/vinny && just scrape --max-requests 100

Q: Can I use this in production?

A: Yes! Consider:

- Running on a server (not your laptop)
- Using absolute paths
- Setting up monitoring/logging
- Having backup strategies for data
- Respecting venue rate limits


Contributing

Q: How can I contribute?

A:

  1. New venues: Create extractors for Omnia, Hakkasan, etc.
  2. Features: Add pricing extraction, social media enrichment
  3. Documentation: Improve docs, add examples
  4. Bug fixes: Fix extraction issues, handle edge cases

See PLUGIN_DEVELOPMENT.md for guidelines.

Q: What's the license?

A: MIT License - Free for commercial and personal use.

Q: Who maintains this?

A: Originally built for the Vegas nightlife community. Open source - contributions welcome!


Advanced

Q: Can I use a proxy?

A: Yes, via environment variables:

export HTTP_PROXY=http://proxy.example.com:8080
export HTTPS_PROXY=http://proxy.example.com:8080
just scrape

Q: Can I modify the data schema?

A: Yes! Edit src/models/__init__.py:

class VegasEvent(BaseModel):
    # Add your new fields
    my_custom_field: str | None = None

Then update extractors to populate the field.

Q: Can I use a database instead of JSON?

A: Yes! Modify src/master_database.py to use SQLite/PostgreSQL instead of JSON files. Vinny also exports to Cloudflare D1 (cloud SQLite) and local SQLite via vinny export-d1 and vinny export-sqlite.
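A minimal version of that swap looks like the following. The table schema and column names are illustrative only, not Vinny's actual export schema from vinny export-sqlite.

```python
# Sketch of replacing the JSON master store with SQLite, as suggested
# above. INSERT OR REPLACE gives the same upsert-by-ID behavior the
# JSON master database provides. Schema is illustrative.
import sqlite3


def save_events(db_path: str, events: list[dict]) -> None:
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS events "
        "(id TEXT PRIMARY KEY, title TEXT, venue TEXT)"
    )
    con.executemany(
        "INSERT OR REPLACE INTO events (id, title, venue) "
        "VALUES (:id, :title, :venue)",
        events,  # each dict supplies the named placeholders
    )
    con.commit()
    con.close()
```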

Q: How do I add authentication?

A: For venues requiring login:

  1. Use PlaywrightCrawler
  2. Add login step before scraping
  3. Store cookies/session

Example in src/crawlee_main.py.


Getting Help

Q: Where can I get help?

A:

- Read the docs: docs/
- Check examples: src/extractors/liv.py
- Open an issue on GitHub
- Review the code: It's well-commented!

Q: I found a bug!

A:

  1. Check if it's already reported
  2. Include:
     - Command you ran
     - Expected vs actual behavior
     - Error message
     - Venue/URL affected

Q: I have a feature request!

A: Open an issue describing:

- What you want to do
- Why it would be useful
- Example use case


Last updated: 2026-03-02

Have a question not answered here? Open an issue!