OpenSpec Instructions
These instructions are for AI assistants working in this project.
Always open @/openspec/AGENTS.md when the request:
- Mentions planning or proposals (words like proposal, spec, change, plan)
- Introduces new capabilities, breaking changes, architecture shifts, or big performance/security work
- Sounds ambiguous and you need the authoritative spec before coding
Use @/openspec/AGENTS.md to learn:
- How to create and apply change proposals
- Spec format and conventions
- Project structure and guidelines
Keep this managed block so 'openspec update' can refresh the instructions.
<!-- OPENSPEC:END -->

AGENTS.md - OEVK Data Processing Project
Build/Lint/Test Commands
- Test command: `python -m pytest tests/`
- Single test: `python -m pytest tests/test_file.py::test_function -v`
- Lint: `ruff check .`
- Type check: `mypy .`
- Release workflow: `python -m src.cli release create --auto`
- Deduplication tests: `python -m pytest tests/contract/test_deduplication.py tests/integration/test_deduplication_*.py tests/unit/test_deduplication_*.py -v`
Release Workflow Commands
Data Validation
- Validate release data: `python -m src.cli release validate --staging-dir data/staging --exports-dir exports`
- Validate with custom directories: `python -m src.cli release validate --staging-dir /path/to/staging --exports-dir /path/to/exports`
- Verbose validation with debug logging: `python -m src.cli --verbose release validate --staging-dir data/staging --exports-dir exports`
Release Creation
- Create release with auto-generated tag: `python -m src.cli release create --repo-owner owner --repo-name repo --auto`
- Create release with specific tag: `python -m src.cli release create --repo-owner owner --repo-name repo --tag 20250101-1200`
- Create draft release: `python -m src.cli release create --repo-owner owner --repo-name repo --auto --draft`
- Create prerelease: `python -m src.cli release create --repo-owner owner --repo-name repo --auto --prerelease`
- Force overwrite existing release: `python -m src.cli release create --repo-owner owner --repo-name repo --tag existing-tag --force`
- Create packages without upload: `python -m src.cli release create --repo-owner owner --repo-name repo --auto --skip-upload`
Release Management
- Check release status: `python -m src.cli release status --repo-owner owner --repo-name repo --tag 20250101-1200`
- List recent releases: `python -m src.cli release history --repo-owner owner --repo-name repo --limit 10`
Deduplication Commands
Basic Deduplication
- Run deduplication with report:
python -c "from src.etl.deduplicate import AddressDeduplicator; dedup = AddressDeduplicator(); result = dedup.deduplicate_addresses(addresses_df)" - Run deduplication without report:
python -c "from src.etl.deduplicate import AddressDeduplicator; dedup = AddressDeduplicator(); result = dedup.deduplicate_addresses(addresses_df, generate_report=False)" - Process large datasets:
python -c "from src.etl.deduplicate import deduplicate_large_dataset; result = deduplicate_large_dataset(addresses_df, chunk_size=100000)"
Report Generation
- Generate deduplication report: `python -c "from src.etl.deduplicate import AddressDeduplicator; dedup = AddressDeduplicator(); report = dedup.generate_deduplication_report(addresses_df, result, processing_time_ms)"`
- Export report to JSON: `python -c "from src.etl.deduplicate import AddressDeduplicator; dedup = AddressDeduplicator(); json_report = dedup.export_report_to_json(report)"`
Testing
- Run deduplication tests: `python -m pytest tests/contract/test_deduplication.py -v`
- Run integration tests: `python -m pytest tests/integration/test_deduplication_*.py -v`
- Run unit tests: `python -m pytest tests/unit/test_deduplication_*.py -v`
- Run all deduplication tests: `python -m pytest tests/contract/test_deduplication.py tests/integration/test_deduplication_*.py tests/unit/test_deduplication_*.py -v`
Environment Variables
- GitHub token: `GITHUB_TOKEN` (required for release operations)
- Default directories: `STAGING_DIR=data/staging`, `EXPORTS_DIR=exports`
Complete Data Processing Pipeline
The OEVK data processing pipeline includes the following stages:
Pipeline Stages
- Ingestion: Downloads and loads source data from valasztas.hu
- Transformation: Normalizes data into 8 core tables with parallel processing
- Public Space Extraction: Extracts public space entities (names and types) from addresses
- Export: Creates CSV files for all tables including public space data
- Release: Packages and publishes data to GitHub releases
Public Space Extraction Features
- Entity Recognition: Extracts public space names and types from addresses
- Relationship Mapping: Creates settlement-public space relationships
- Hash-based IDs: Deterministic xxhash64 identifiers for all entities
- Data Integrity: Full validation and referential integrity
- Export Support: CSV export for all public space entities
Address Deduplication Features
- Duplicate Identification: Identifies duplicate addresses using deterministic hash IDs
- Canonical Address Creation: Creates unique canonical address records
- Relationship Preservation: Maintains all original relationships (polling stations, PIR codes)
- Report Generation: Generates comprehensive deduplication reports with statistics
- JSON Export: Exports deduplication reports to JSON format for external analysis
- Large Dataset Support: Processes large datasets in chunks for optimal performance
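The duplicate-identification step can be sketched as follows. Note the hedges: the pipeline recommends xxhash64 (third-party `xxhash` package), but `hashlib.blake2b` stands in here so the sketch has no external dependencies, and the field names and function signatures are illustrative, not the `AddressDeduplicator` API.

```python
import hashlib

def address_id(settlement: str, public_space: str, house_number: str) -> str:
    """Deterministic ID from normalized address fields.

    The pipeline recommends xxhash64; hashlib.blake2b stands in here to keep
    the sketch dependency-free. Field names are illustrative.
    """
    key = "|".join(part.strip().lower() for part in (settlement, public_space, house_number))
    return hashlib.blake2b(key.encode("utf-8"), digest_size=8).hexdigest()

def deduplicate(addresses: list[dict]) -> dict[str, dict]:
    """Map each deterministic ID to one canonical record; duplicates collapse."""
    canonical: dict[str, dict] = {}
    for row in addresses:
        rid = address_id(row["settlement"], row["public_space"], row["house_number"])
        canonical.setdefault(rid, row)  # first occurrence becomes canonical
    return canonical
```

Because the ID is a pure function of the normalized fields, re-running deduplication on the same input always produces the same canonical IDs, which is what makes the AddressMapping table stable across runs.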
Public Space Tables
- PublicSpaceName: Unique public space names extracted from addresses
- PublicSpaceType: Unique public space types (utca, tér, etc.)
- SettlementPublicSpaces: Many-to-many relationships between settlements and public spaces
Deduplication Tables
- CanonicalAddresses: Unique canonical address records after deduplication
- AddressMapping: Mapping between original addresses and canonical addresses
- AddressPollingStations: Preserved polling station assignments
- AddressPIRCodes: Preserved PIR code relationships
- DeduplicationReport: Audit reports with deduplication statistics
Code Style Guidelines
Language & Framework
- Primary language: Python 3.11+
- Data processing: Polars or DuckDB for large datasets (>3M rows)
- Database: SQLite/DuckDB for staging/target (single-file, zero admin)
Imports & Structure
- Use absolute imports
- Group imports: standard library, third-party, local modules
- Follow PEP 8 for Python code
- Use type hints for all functions and variables
Naming Conventions
- Variables: snake_case (`county_code`, `settlement_name`)
- Functions: snake_case (`load_addresses_csv`, `transform_data`)
- Classes: PascalCase (`CountyLoader`, `AddressTransformer`)
- Constants: UPPER_SNAKE_CASE (`DATA_DIR`, `CHUNK_SIZE`)
Data Processing Patterns
- Use deterministic hash IDs for all entities (xxhash64 recommended)
- Trim whitespace, convert empty strings to NULL
- Preserve diacritics and original casing for Hungarian text
- Use vectorized operations, avoid Python row loops
- Process data in chunks (100k-500k rows)
Error Handling
- Use structured logging with levels (INFO/DEBUG)
- Validate data quality with referential integrity checks
- Stage invalid rows with error codes and messages
- Make operations idempotent and restartable
File Organization
- Keep SQL DDL in separate files
- Modular scripts: ingest.py, transform.py, export.py, release/
- Separate configuration from business logic
- Use environment variables for configuration
Release Workflow Architecture
Core Modules
- Workflow Orchestrator: `src/release/workflow.py` - coordinates the complete release process
- Data Validation: `src/release/validation.py` - pre-release integrity and quality checks
- File Packaging: `src/release/packaging.py` - creates compressed ZIP archives
- GitHub Integration: `src/release/github.py` - GitHub CLI integration for releases
- Data Models: `src/release/models.py` - ReleasePackage, ReleaseArtifact, ReleaseMetadata
Release Artifacts
- CSV Archive: `oevk-data-csv-{tag}.zip` - contains all CSV files:
  - `addresses/` - directory of address files split by settlement (e.g., `Address_001_Budapest.csv`, `Address_002_Debrecen.csv`)
  - `settlements.csv` - settlement reference data
  - `counties.csv` - county reference data
  - `NationalIndividualElectoralDistrict.csv` - national electoral districts
  - `PollingStation.csv` - polling station locations
  - `PostalCode.csv` - postal code data
  - `PostalCode_Settlement.csv` - postal code to settlement mapping
  - `SettlementIndividualElectoralDistrict.csv` - settlement to electoral district mapping
  - `PublicSpaceName.csv` - unique public space names extracted from addresses
  - `PublicSpaceType.csv` - unique public space types (utca, tér, etc.)
  - `SettlementPublicSpaces.csv` - many-to-many relationships between settlements and public spaces
- Database Archive: `oevk-data-db-{tag}.zip` - contains oevk.db (main transformed database)
- Release Metadata: JSON metadata with validation results and performance metrics
Release Tags
- Format: YYYYMMDD-HHMM (timestamp-based to prevent duplicates)
- Auto-generation: Uses current timestamp when not specified
- Validation: Ensures unique tags to prevent conflicts
Data Validation
- File Existence: Verifies all required files exist
- File Sizes: Ensures files have reasonable sizes
- File Integrity: Validates files are readable and not corrupted
- Data Completeness: Checks for required headers and data
- Referential Integrity: Validates relationships between entities
- Data Freshness: Ensures data is recent (≤24 hours old)
Performance Targets
- Complete Workflow: ≤15 minutes for full release process
- Data Validation: ≤2 minutes for comprehensive checks
- Package Creation: ≤5 minutes for artifact compression
- GitHub Integration: ≤3 minutes for release creation
- CSV Export: ~2.6 minutes for 3.3M addresses (single-query optimization)
- Idempotent Operations: Safe to retry failed operations
CSV Export Optimization
The project uses an optimized single-query approach for exporting canonical addresses:
Export Performance
- Approach: Single database query + Python partitioning
- Performance: ~2.6 minutes for 3.3M addresses across 3,177 settlements
- Speed: ~21,000 addresses/second
- Improvement: ~17x faster than per-settlement query approach
- Query Time: ~24 seconds to fetch all addresses with JOINs
- Write Time: ~2.2 minutes to write 3,177 CSV files
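The single-query approach can be sketched as: fetch everything once, then partition in Python and stream one CSV per settlement. The function name, column names, and file-naming scheme below are assumptions for illustration; the real exporter also handles the numeric file prefixes and cleanup.

```python
import csv
from collections import defaultdict
from pathlib import Path

def export_by_settlement(rows: list[dict], out_dir: Path) -> int:
    """Partition already-fetched address rows by settlement and write one
    CSV per settlement. One database fetch, then pure-Python partitioning,
    instead of one query per settlement. Returns the number of files written.
    """
    groups: dict[str, list[dict]] = defaultdict(list)
    for row in rows:
        groups[row["settlement"]].append(row)
    out_dir.mkdir(parents=True, exist_ok=True)
    for settlement, group in groups.items():
        with open(out_dir / f"Address_{settlement}.csv", "w",
                  newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=list(group[0].keys()))
            writer.writeheader()
            writer.writerows(group)
    return len(groups)
```

The ~17x speedup cited above comes from replacing 3,177 per-settlement queries with one JOIN query plus this in-memory grouping.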
Export Features
- Automatic Cleanup: Removes old export files before new export
- Incremental Progress: Logs progress every 500 settlements
- Memory Efficient: Streaming write approach
- Symlink Management: Creates symlinks for release validation
- Default Directory: `exports/` (aligned with release workflow)
Export Commands
# Basic export to default directory (exports/)
python -m src.cli export --db-path data/oevk.db
# Export with custom output directory
python -m src.cli export --db-path data/oevk.db --output-dir /path/to/output
# Export with custom run tag
python -m src.cli export --run-tag "v1.0.0"
# Export only entity tables (skip addresses)
python -m src.cli export --tables-only
# Export only addresses (skip entity tables)
python -m src.cli export --addresses-only
Export Structure
```
exports/
├── {run_tag}_Address/              # 3,177 settlement CSV files
│   ├── Address_001_Aba.csv
│   ├── Address_002_Abádszalók.csv
│   └── ...
├── {run_tag}_County.csv            # Entity tables
├── {run_tag}_Settlement.csv
├── ... (8 more entity tables)
├── addresses -> {run_tag}_Address  # Symlinks for release
├── settlements.csv -> {run_tag}_Settlement.csv
├── counties.csv -> {run_tag}_County.csv
└── database.duckdb -> ../data/oevk.db
```
Release Workflow Usage Examples
Basic Release Creation
# Set GitHub token (required)
export GITHUB_TOKEN="ghp_your_token_here"
# Create release with auto-generated tag
python -m src.cli release create --repo-owner your-org --repo-name oevk-data --auto
# Create release with specific tag
python -m src.cli release create --repo-owner your-org --repo-name oevk-data --tag 20250101-1200
Advanced Release Scenarios
# Create draft release for review
python -m src.cli release create --repo-owner your-org --repo-name oevk-data --auto --draft
# Create prerelease (beta/alpha)
python -m src.cli release create --repo-owner your-org --repo-name oevk-data --auto --prerelease
# Force overwrite existing release
python -m src.cli release create --repo-owner your-org --repo-name oevk-data --tag existing-tag --force
# Create packages without uploading to GitHub (local testing)
python -m src.cli release create --repo-owner your-org --repo-name oevk-data --auto --skip-upload
# Validate data before release
python -m src.cli release validate --staging-dir data/staging --exports-dir exports
Release Management
# Check release status
python -m src.cli release status --repo-owner your-org --repo-name oevk-data --tag 20250101-1200
# List recent releases
python -m src.cli release history --repo-owner your-org --repo-name oevk-data --limit 10
# Get detailed release information
python -m src.cli release info --repo-owner your-org --repo-name oevk-data --tag 20250101-1200
Environment Setup
Required Environment Variables
# GitHub Personal Access Token (required for releases)
export GITHUB_TOKEN="ghp_your_token_here"
# Optional: Custom directories
export STAGING_DIR="/path/to/staging"
export EXPORTS_DIR="/path/to/exports"
GitHub Token Permissions
- repo (full repository access) - REQUIRED
- workflow (if using GitHub Actions)
- read:org (if accessing organization repositories)
IMPORTANT: For organization repositories, use classic personal access tokens instead of fine-grained tokens. Classic tokens have better organization repository upload permissions.
Prerequisites
- Python 3.11+ with required dependencies
- GitHub CLI (gh) installed and authenticated
- GitHub Personal Access Token with appropriate permissions
- Data processing pipeline completed (staging and exports directories populated)
Troubleshooting
Common Issues
GitHub Authentication
# Verify GitHub CLI authentication
gh auth status
# Login if needed
gh auth login
# Set token explicitly
gh auth login --with-token <<< "$GITHUB_TOKEN"
# For organization repositories, use classic tokens instead of fine-grained tokens:
# 1. Go to GitHub Settings > Developer settings > Personal access tokens > Tokens (classic)
# 2. Create a new classic token with "repo" scope
# 3. Use the classic token (starts with "ghp_") instead of fine-grained tokens (start with "github_pat_")
Release Creation Failures
# Check if release already exists
gh release view 20250101-1200 --repo your-org/oevk-data
# Delete existing release if needed
gh release delete 20250101-1200 --repo your-org/oevk-data --yes
# Force recreate release
python -m src.cli release create --repo-owner your-org --repo-name oevk-data --tag 20250101-1200 --force
# Test GitHub connection and permissions
gh auth status
gh api repos/your-org/oevk-data
Data Validation Issues
# Run validation with verbose output
python -m src.cli --verbose release validate --staging-dir data/staging --exports-dir exports
# Check file permissions
ls -la data/staging/
ls -la exports/
# Verify file contents
head -n 5 data/staging/addresses.csv
head -n 5 exports/addresses.csv
Performance Issues
# Run performance tests
python -m pytest tests/performance/ -v
# Monitor system resources during release
python -m src.cli release create --repo-owner your-org --repo-name oevk-data --auto --monitor
# Check disk space
df -h
du -sh data/staging/ exports/
Debug Mode
# Enable debug logging
export LOG_LEVEL="DEBUG"
python -m src.cli release create --repo-owner your-org --repo-name oevk-data --auto
# Enable verbose logging with --verbose flag
python -m src.cli --verbose release validate --staging-dir data/staging --exports-dir exports
# Dry run (validate without creating release)
python -m src.cli release create --repo-owner your-org --repo-name oevk-data --auto --dry-run
# Test with explicit GitHub token
python -m src.cli release create --repo-owner your-org --repo-name oevk-data --auto --github-token $GITHUB_TOKEN
Organization Repository Troubleshooting
Common Organization Repository Issues
Error: HTTP 403: Resource not accessible by personal access token
Cause: Fine-grained personal access tokens have upload permission issues with organization repositories
Solution: Use classic personal access tokens instead of fine-grained tokens
# 1. Regenerate GitHub token as classic token
# Go to GitHub Settings > Developer settings > Personal access tokens > Tokens (classic)
# Create new classic token with these scopes:
# - repo (full repository access)
# - workflow (if using GitHub Actions)
# - read:org (organization read access)
# 2. Update environment variable
export GITHUB_TOKEN="your_classic_token_here"  # Should start with "ghp_"
# 3. Test repository access
gh api repos/your-org/your-repo
# 4. Test release creation and upload
gh release create test-release --repo your-org/your-repo --title "Test" --notes "Testing"
echo "test file" > test_upload.txt
gh release upload test-release test_upload.txt --repo your-org/your-repo
Token Type Identification
# Check token type
export GITHUB_TOKEN="your_token_here"
gh auth status --show-token
# Fine-grained tokens start with: github_pat_...
# Classic tokens start with: ghp_...
Skip-Upload Workaround
If organization permissions can't be resolved immediately, use skip-upload:
# Create packages without uploading to GitHub
python -m src.cli release create --repo-owner your-org --repo-name oevk-data --auto --skip-upload
# Packages will be saved to data/temp/ for manual upload
ls -la data/temp/
# Manual upload to GitHub release
gh release upload 20250101-1200 data/temp/oevk-data-csv-20250101-1200.zip data/temp/oevk-data-db-20250101-1200.zip --repo your-org/oevk-data
Verify Organization Access
# Check current token scopes
gh auth status
# Test repository permissions
gh api repos/your-org/your-repo | jq '.permissions'
# Expected output should include:
# {
# "admin": true,
# "maintain": true,
# "push": true,
# "triage": true,
# "pull": true
# }
Performance Monitoring
Release Performance Commands
# Run performance benchmarks
python -m pytest tests/performance/test_release_performance.py -v
# Monitor release timing
python -m src.cli release create --repo-owner your-org --repo-name oevk-data --auto --timing
# Generate performance report
python -m src.cli release performance --repo-owner your-org --repo-name oevk-data --tag 20250101-1200
Expected Performance Metrics
- Total Release Time: ≤15 minutes
- Data Validation: ≤2 minutes
- Package Creation: ≤5 minutes
- GitHub Operations: ≤3 minutes
- File Compression: ≤2 minutes
Resource Requirements
- Memory: ≥4GB RAM for large datasets
- Disk Space: ≥2GB free space for temporary files
- Network: Stable internet connection for GitHub operations
- CPU: Multi-core processor for parallel operations