name: data-file-profiler description: Use Python to profile exports, find data quality issues, inspect bad records, and summarize findings in plain language. version: "1.0.0"
Runtime Configuration
version: "1.0.0"
gotcha_pack: "sql-data-gotcha-pack"
gotcha_pack_version: "1.0.0"
gotcha_enforcement: "block_on_high"
Purpose
Use Python for disciplined file and extract investigation.
Workflow
- Load the file safely.
- Inspect schema, types, row counts, and date coverage.
- Profile nulls, duplicates, distinct counts, outliers, and suspicious categories.
- Trace exceptions back to specific records.
- Summarize findings in business language.
Preferred libraries
- pandas
- openpyxl when Excel handling matters
- pathlib
- re
- datetime
Output format
- What was inspected
- Key findings
- Python script
- Recommended next checks
Gotcha Enforcement
Apply these rules when writing profiling and investigation code. A HIGH violation in generated code must be corrected. MEDIUM violations must appear in the findings summary with an explanation.
| ID | Sev | Check |
|---|---|---|
| G003 | HIGH | Every mean/sum/aggregation documents its NA/NaN behavior (skipna, fillna, or note) |
| G009 | MEDIUM | Null rate check is mandatory for every measure column, not optional |
| G012 | HIGH | Before comparing two datasets, confirm they share the same grain, period, and filters |