AGENTS.md
Use the Bruin MCP server and the Bruin CLI to run assets and query tables. Reference the Bruin docs.
This repository contains data pipelines built with Bruin, warehoused in BigQuery, and visualized with Streamlit dashboards built on Altair. Raw data ingestion is done in Python.
Repository Structure
```
data_playground/
├── .bruin.yml           # Root Bruin config (connections, credentials)
├── .gitignore
├── requirements.txt     # Root-level Python dependencies
├── AGENTS.md            # You are here
├── prompt.md            # Generic template for prompting new pipelines
├── credentials/         # Service account keys (gitignored)
├── <pipeline-name>/     # Each pipeline is a top-level directory
│   ├── pipeline.yml     # Pipeline config (schedule, connections)
│   ├── README.md        # Pipeline-specific docs (data sources, assets, run commands)
│   └── assets/
│       ├── raw/         # Ingestion layer (Python or SQL)
│       ├── staging/     # Transformation layer (SQL)
│       └── reports/     # Dashboards & analytical queries
└── ...
```
Pipeline Structure
Every pipeline follows the same three-layer pattern. Use `berlin-weather/` as the reference implementation, `stackoverflow-trends/` for advanced patterns (multiple data sources, API ingestion, append + dedup), and `baby-bust/` for a complete example with comprehensive data validation, `bruin ai enhance`, and a 4-chart Streamlit dashboard.
1. pipeline.yml
Defines the pipeline metadata and default connections.
```yaml
name: <pipeline-name>
schedule: daily
start_date: "2009-01-01"
default_connections:
  google_cloud_platform: "bruin-playground-arsalan"
```
- `name` must match the directory name.
- `schedule` can be `daily`, `hourly`, `weekly`, `monthly`, or a cron expression.
- `default_connections` sets which BigQuery project/connection assets use unless overridden.
2. pipeline-name/README.md
Every pipeline must have its own README covering:
- Data sources used (with links)
- All assets by layer (raw, staging, reports) with brief descriptions
- Key run commands (`bruin run`, `bruin validate`, `streamlit run`)
- Known limitations or data gaps
3. assets/raw/ — Ingestion Layer (Python or SQL)
Raw assets fetch data from external sources and materialize it into BigQuery.
- Python (`type: python`) — for fetching from external APIs or files. Use an embedded Bruin YAML header in a docstring.
- SQL (`type: bq.sql`) — for querying existing BigQuery datasets (e.g. public datasets). Use an embedded Bruin YAML header in a SQL comment.
File naming: the file name (without extension) must match the table name in the asset's `name` field. For example, asset `raw.stackoverflow_api_monthly` lives in `stackoverflow_api_monthly.py`.
Table naming: `raw.<descriptive_name>` — all raw tables live in the `raw` schema.
Template:
"""@bruin
name: raw.<table_name>
type: python
image: python:3.11
connection: bruin-playground-arsalan
description: |
Brief description of what this asset ingests.
Include the data source URL and license if applicable.
materialization:
type: table
strategy: append
columns:
- name: <column_name>
type: <VARCHAR|INTEGER|DOUBLE|TIMESTAMP|BOOLEAN|DATE>
description: <what this column contains, including units>
primary_key: true # only on the natural key
@bruin"""
import logging
import os
from datetime import datetime, timezone
import pandas as pd
import requests
logging.basicConfig(
level=os.environ.get("LOG_LEVEL", "INFO").upper(),
format="%(asctime)s %(levelname)s %(name)s - %(message)s",
)
logger = logging.getLogger(__name__)
def fetch_data(start_date: str, end_date: str) -> pd.DataFrame:
# Fetch from API / file / etc.
...
def materialize():
start_date = os.environ.get("BRUIN_START_DATE", "<default>")
end_date = os.environ.get("BRUIN_END_DATE", "<default>")
logger.info("Interval: %s to %s", start_date, end_date)
df = fetch_data(start_date, end_date)
df["extracted_at"] = datetime.now(timezone.utc)
logger.info("Fetched %d rows", len(df))
return df
Rules:
- Always use `image: python:3.11`.
- Always include an `extracted_at` timestamp column (use `datetime.now(timezone.utc)`).
- Always include structured logging (`logging.basicConfig` + `logger = logging.getLogger(__name__)`).
- Document every column with a description, including units where applicable.
- Mark exactly one column (or composite key) as `primary_key: true`.
- Use the `BRUIN_START_DATE` and `BRUIN_END_DATE` environment variables for date-bounded fetches.
- Place a `requirements.txt` alongside the Python file with only the dependencies needed for that layer.
Materialization Strategy
| Strategy | Use when |
|---|---|
| `create+replace` | Small/immutable reference data (e.g. ticker lists, tag catalogs) |
| `append` | Large/incremental data (e.g. daily prices, hourly readings). Always deduplicate in staging. |
| `merge` | When you need upsert behavior on the raw table itself |
| `delete+insert` | When re-processing a date partition should replace old rows |
Prefer `append` for most raw ingestion assets. This allows safe re-runs and backfills without losing previously ingested data. Deduplication is handled in staging SQL using `ROW_NUMBER() ... ORDER BY extracted_at DESC`.
API Ingestion Best Practices
When fetching from external APIs:
- Chunk large date ranges: Break requests into 30-day (or smaller) windows to avoid timeouts and stay within API limits.
- Retry with backoff: Use exponential backoff for transient errors (429, 502, 503, timeouts). 5 retries is a good default.
- Handle rate limits gracefully: If you hit a rate limit mid-run, return whatever data was already fetched rather than crashing. Log a warning about the partial result.
- Add delay between requests: Use `time.sleep(0.5)` (or more) between API calls to avoid throttling.
- Log progress: Log each chunk/batch with counts so you can monitor long-running ingestions.
- Environment variables for testing: Use env vars like `STOCK_TICKER_LIMIT` to limit scope during development, so you don't need to fetch all 500+ tickers every test run.
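A minimal sketch of the chunking and retry pattern above. `fetch_page(start, end)` is a hypothetical helper that performs one API request and raises `requests.HTTPError` on bad status codes; adapt it to the actual source.

```python
import logging
import time
from datetime import date, timedelta

import pandas as pd
import requests

logger = logging.getLogger(__name__)

RETRYABLE_STATUS = {429, 502, 503}


def fetch_chunk_with_retry(fetch_page, start: date, end: date, max_retries: int = 5) -> pd.DataFrame:
    # Exponential backoff on transient errors (429/502/503/timeouts).
    for attempt in range(max_retries):
        try:
            return fetch_page(start, end)
        except requests.Timeout:
            pass
        except requests.HTTPError as exc:
            status = exc.response.status_code if exc.response is not None else None
            if status not in RETRYABLE_STATUS:
                raise
        wait = 2 ** attempt
        logger.warning("Retry %d for %s..%s, sleeping %ss", attempt + 1, start, end, wait)
        time.sleep(wait)
    raise RuntimeError(f"Giving up on chunk {start}..{end} after {max_retries} retries")


def fetch_range(fetch_page, start: date, end: date, chunk_days: int = 30) -> pd.DataFrame:
    # Break the date range into 30-day chunks; return partial data if a chunk fails for good.
    frames = []
    cursor = start
    while cursor <= end:
        chunk_end = min(cursor + timedelta(days=chunk_days - 1), end)
        try:
            frames.append(fetch_chunk_with_retry(fetch_page, cursor, chunk_end))
        except Exception:
            logger.warning("Chunk %s..%s failed; returning partial result", cursor, chunk_end)
            break
        logger.info("Fetched %s..%s (%d rows total)", cursor, chunk_end, sum(len(f) for f in frames))
        time.sleep(0.5)  # small delay between requests to avoid throttling
        cursor = chunk_end + timedelta(days=1)
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
```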
Secrets
When an asset needs API credentials, declare them in the Bruin header under `secrets`. The keys must match the secret names in `.bruin.yml`:

```yaml
secrets:
  - key: epias_username
  - key: epias_password
```

Access them in Python via `os.environ["epias_username"]`.
Dependencies (`assets/raw/requirements.txt`):

```
pandas
requests
```

Add additional packages as needed (e.g. `yfinance`, `lxml`, `python-dateutil`).
4. assets/staging/ — Transformation Layer (SQL)
Staging assets transform raw data into analysis-ready tables using BigQuery SQL. They reference upstream raw tables via `depends`.
File naming: `<entity>_<grain>.sql` — e.g. `weather_daily.sql`, `games_monthly.sql`.
Table naming: `staging.<descriptive_name>` — all staging tables live in the `staging` schema.
Template:
```sql
/* @bruin
name: staging.<table_name>
type: bq.sql
connection: bruin-playground-arsalan
description: |
  What this transformation does and why.
  Link to any reference material (e.g. code tables, standards).
depends:
  - raw.<upstream_table>
materialization:
  type: table
  strategy: create+replace
columns:
  - name: <column_name>
    type: <DATE|INTEGER|VARCHAR|DOUBLE|BOOLEAN>
    description: <description>
    primary_key: true
    nullable: false
@bruin */

WITH deduped AS (
  SELECT *
  FROM raw.<upstream_table>
  WHERE <primary_key> IS NOT NULL
  QUALIFY ROW_NUMBER() OVER (PARTITION BY <natural_key_columns> ORDER BY extracted_at DESC) = 1
)

SELECT
  ...
FROM deduped
ORDER BY <primary_key>
```
Rules:
- Always declare `depends`, listing every upstream asset by its full `schema.table` name.
- Always deduplicate raw data using a `deduped` CTE with `ROW_NUMBER() OVER (PARTITION BY <natural_key> ORDER BY extracted_at DESC) = 1`. This is critical when raw assets use the `append` strategy.
- Add derived columns: temporal dimensions (year, month, season, day_of_week), human-readable labels, boolean flags, unit conversions, derived ratios.
- Use `COALESCE` for nullable numeric fields to default to 0.
- Use `CASE` expressions for categorization and code-to-label mappings.
- Document every output column.
- SQL should be a pure `SELECT` — let Bruin handle the DDL via `materialization`.
- Staging always uses the `create+replace` strategy (it rebuilds from raw each time).
Multiple Data Sources
When the same entity comes from multiple sources (e.g. BigQuery public data + API supplement):
- Keep each source as a separate raw asset.
- In staging, `UNION ALL` the sources and deduplicate with `ROW_NUMBER()`, preferring the richer source for overlapping periods.
- Use `CAST(NULL AS INT64)` for columns that only exist in one source.
Common Staging Patterns
- Unpivot: Turn wide-format raw tables (one column per category) into long format using `UNION ALL` with `source_name`/`category` columns. E.g. energy sources, financial statement line items.
- Enrichment joins: Join with reference/dimension tables (e.g. `raw.stock_tickers` for sector/industry).
- Window functions: Moving averages (`AVG ... ROWS BETWEEN N PRECEDING`), rolling highs/lows, daily returns, period-over-period growth.
- Ratio derivation: Margins, ROE, debt-to-equity, share percentages — compute in staging, not in reports.
5. assets/reports/ — Reports Layer (Streamlit + Altair)
This layer contains Streamlit dashboards and their supporting SQL queries.
Files:
- `streamlit_app.py` — the main dashboard application.
- `<analysis_name>.sql` — standalone analytical queries loaded at runtime by the Streamlit app. These are plain SQL files (no Bruin header) that query the BigQuery staging tables directly using fully-qualified table names (`project.schema.table`).
- `requirements.txt` — dependencies for the dashboard.
Dependencies (`assets/reports/requirements.txt`):

```
streamlit
google-cloud-bigquery
db-dtypes
altair
pandas
```
Streamlit App Conventions:
```python
from pathlib import Path

import altair as alt
import pandas as pd
import streamlit as st
from google.cloud import bigquery
from google.oauth2 import service_account

st.set_page_config(page_title="<Dashboard Title>", layout="wide")

PROJECT_ID = "bruin-playground-arsalan"
base_path = Path(__file__).parent


@st.cache_resource
def get_client():
    credentials = service_account.Credentials.from_service_account_info(
        dict(st.secrets["gcp_service_account"]),
        scopes=["https://www.googleapis.com/auth/bigquery"],
    )
    return bigquery.Client(project=PROJECT_ID, credentials=credentials)


def run_raw(sql: str) -> pd.DataFrame:
    return get_client().query(sql).to_dataframe()


def run_query(filename: str) -> pd.DataFrame:
    sql = (base_path / filename).read_text()
    return get_client().query(sql).to_dataframe()
```
- Use `st.secrets["gcp_service_account"]` for credentials (stored in `.streamlit/secrets.toml`, gitignored).
- Load main data with `run_raw()` for inline SQL, `run_query()` for SQL files in the same directory.
- Use `@st.cache_resource` for the BigQuery client.
Data Visualization Standards (Altair + Streamlit):
These rules are mandatory for every chart in every dashboard. They encode best practices from Tufte, Cleveland, Few, Munzner, and the ethical data visualization principles from the Data Visualization Society. Violations should be treated like bugs.
Before You Build Any Chart
- Validate the data first. Before writing any visualization code, query the data like a data analyst: check row counts, null rates, distributions, outliers, duplicates, and correlations. Understand what you actually have. Compute percentiles, check for join fanouts, verify dedup logic. Never visualize data you haven't inspected.
- Analyze, don't summarize. A dashboard that just shows "here's the data" is not analysis. Find correlations, compute derived metrics (e.g. price/ELO, gap ratios), identify Pareto frontiers, test hypotheses. Each chart must prove or disprove something non-obvious. If the takeaway is "the data exists," the chart has failed.
- Follow the narrative arc. Every dashboard tells a story structured as: (1) hypothesis/question, (2) evidence that the phenomenon exists, (3) evidence that it is systematic or repeatable, (4) quantification of the magnitude, (5) implications and limitations. If a chart doesn't advance this arc, cut it.
- Fewer charts, more narrative. 2–4 well-chosen charts with clear explanatory text beats 8 charts that overwhelm the viewer. Every chart must earn its place by answering a specific question. If you can say it in a sentence, don't make a chart.
- Enrich aggressively. Before building the dashboard, check what other datasets exist in BigQuery that could be joined. Cross-domain correlations (pricing + stock prices, quality rankings + prediction markets) are what make analysis interesting.
- Be honest about sample sizes. If a data point is based on 3 observations, say so. Annotate sample sizes directly on charts. Small-n medians are noise, not signal.
- Explain the data. Every dashboard must include: where the data comes from (with links), how it was collected and transformed, what the key metrics mean (with units), and what the limitations are. Put this in a dedicated methodology section, not hidden in tooltips. State explicitly what the data cannot tell you.
- Don't make claims the data doesn't support. If only 18 of 348 models have quality rankings, don't title a chart "Every Arena-Ranked Model" — say "The 18 models we can actually rank." If early time periods have tiny samples, caveat the trend explicitly.
- Tables can be better than charts. A 10-row dataset does not need a chart. Use `st.dataframe()` and let the reader scan the numbers. Charts are for patterns in data too large to read as a table.
- Interactive legends. Use `alt.selection_point(fields=[...], bind="legend")` so viewers can click legend entries to isolate series. This replaces overcrowded charts with focused exploration.
- Quantify inline. After every chart, include a `st.markdown("> ...")` blockquote that states the specific finding with numbers (e.g. "r = 0.23", "18x price difference for 5% quality gap"). The chart shows the pattern; the text states the magnitude.
Color and Accessibility
- Colorblind-safe palette only. Use the Wong (2011) palette from Nature Methods: `#D55E00` vermillion, `#56B4E9` sky blue, `#E69F00` orange, `#009E73` bluish green, `#CC79A7` reddish purple, `#0072B2` blue, `#F0E442` yellow, `#999999` grey. These 8 colors are the maximum for categorical encoding. If you need more categories, aggregate or facet — do not invent new colors.
- Never rely on color alone. Every color-encoded dimension must also be conveyed through a second channel: shape (`alt.Shape`), stroke dash pattern (`strokeDash`), direct text labels, or spatial position (faceting). A viewer who cannot distinguish any two colors in the palette must still be able to read the chart.
- Highlight vs default. For binary emphasis (e.g. "this item" vs "everything else"), use `HIGHLIGHT = "#D55E00"` and `DEFAULT = "#56B4E9"` with `alt.condition`. Never use red/green for binary states — use vermillion/sky-blue or vermillion/grey (see the sketch below).
- Sequential and diverging scales. For continuous color scales use `blues` or `viridis` (sequential) and `blueorange` (diverging). Never use `redgreen`, `redblue`, or rainbow/jet — all are colorblind-hostile or perceptually non-uniform.
- Overlaid series must have a legend. If two series share the same positional axes and are distinguished only by color, a legend is required on the chart — not just in a `st.caption()`. Captions supplement; they do not replace legends.
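A sketch of the palette, dual-encoding, and interactive-legend rules on a scatter plot (Altair 5 syntax). The DataFrame `df` and its `provider`, `price`, and `elo` columns are hypothetical placeholders:

```python
import altair as alt

# Wong (2011) palette subset for categorical color
WONG = ["#D55E00", "#56B4E9", "#E69F00", "#009E73", "#CC79A7", "#0072B2"]

legend_sel = alt.selection_point(fields=["provider"], bind="legend")

chart = (
    alt.Chart(df)  # hypothetical DataFrame with provider, price, elo columns
    .mark_point(filled=True, size=90)
    .encode(
        x=alt.X("price:Q", title="Price ($/MTok)"),
        y=alt.Y("elo:Q", title="Arena ELO", scale=alt.Scale(zero=False)),
        color=alt.Color("provider:N", scale=alt.Scale(range=WONG)),
        shape=alt.Shape("provider:N"),  # second channel: never rely on color alone
        opacity=alt.condition(legend_sel, alt.value(1.0), alt.value(0.15)),
        tooltip=[
            alt.Tooltip("provider:N", title="Provider"),
            alt.Tooltip("price:Q", title="Price ($/MTok)", format="$.3f"),
            alt.Tooltip("elo:Q", title="Arena ELO", format=",.0f"),
        ],
    )
    .add_params(legend_sel)
    .properties(height=480)
)
```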
Truthful Representation
- Y-axis baseline. Bar charts and area charts must start the quantitative axis at zero. A truncated axis exaggerates differences and misleads the viewer. Use `alt.Scale(zero=True)` (the default for bars). If a zero baseline makes the data unreadable (e.g. ELO scores clustered in 1300–1500), switch to a dot plot or line chart — do not truncate a bar chart.
- Log scales must be labeled. If you use `scale=alt.Scale(type="log")`, the axis title must include "(Log Scale)" and the chart title or a caption must explain why the log scale is used (e.g. "Log scale used because values span 3+ orders of magnitude"). Never use a log scale to make a trend look more dramatic.
- No dual Y-axes. They are virtually always misleading — the viewer cannot compare magnitudes across two unrelated scales. Use faceted charts (side-by-side or vertically stacked) instead.
- No 3D, no pie charts. 3D adds no information and distorts area/length perception. Pie charts are inferior to bar charts for comparing quantities (Cleveland & McGill 1984). Use horizontal bar charts sorted by value instead.
- No gradient fills for quantitative data. Gradient fills (e.g. fading from color to white) obscure the data boundary, create phantom visual weight, and serve no analytical purpose. Use solid fills with reduced opacity (0.3–0.4 for area charts behind a line) or solid line charts.
- Aspect ratio matters. Time-series should use a wide aspect ratio (~3:1). Use the banking-to-45-degrees heuristic: slopes of ~45 degrees maximize readability.
Encoding Discipline
- Every visual encoding must be explained. If a chart uses size, color, shape, opacity, or stroke as a data channel, each must have either a visible legend or a direct label on the chart. `legend=None` is only acceptable when the channel is redundant with another fully-explained encoding (e.g. color matching x-axis categories that are already labeled).
- Limit encodings to 3 channels max per chart. Position (x, y) + one of {color, size, shape}. Adding a fourth channel (e.g. color + size + shape simultaneously) overloads working memory. If you need more dimensions, use faceting or a table.
- Tooltips are mandatory. Every chart must include `alt.Tooltip` entries for all encoded fields plus any context fields the viewer would want on hover. Numeric tooltips must have format strings (e.g. `format=",.0f"` for integers, `format="$.3f"` for prices) (see the sketch below).
- Consistent chart heights. Standard: `height=380`. Taller (scatter, dense): `height=480`. Never vary heights arbitrarily within a dashboard.
- Sort bars by value. Categorical bar charts must be sorted by the quantitative axis (largest to smallest or vice versa) unless there is a natural order (e.g. time, tiers). Alphabetical sort is almost never useful.
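A sketch applying the tooltip, height, and sort rules to a categorical bar chart; `top_models` and its columns are hypothetical placeholders:

```python
import altair as alt

bars = (
    alt.Chart(top_models)  # hypothetical DataFrame: model, median_price, n_obs
    .mark_bar(cornerRadiusTopLeft=4, cornerRadiusTopRight=4, color="#56B4E9")
    .encode(
        x=alt.X("model:N", sort="-y", title=None),  # sorted by value, not alphabetically
        y=alt.Y("median_price:Q", title="Median Price ($/MTok)", scale=alt.Scale(zero=True)),
        tooltip=[
            alt.Tooltip("model:N", title="Model"),
            alt.Tooltip("median_price:Q", title="Median Price ($/MTok)", format="$.3f"),
            alt.Tooltip("n_obs:Q", title="Sample size", format=",.0f"),
        ],
    )
    .properties(height=380)
)
```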
Annotation and Context
- Reference lines must be labeled on the chart. Use `mark_rule` + `mark_text` layered with `+`. The label text should state what the line represents (e.g. "GPT-4 Launch Price") and be positioned to avoid overlapping data points (see the sketch below).
- Chart titles state the insight, not the data. Prefer "Median AI prices fell 98% in 3 years" over "Median Price by Quarter". The title tells the reader what to see; the axis labels tell them what is plotted.
- Axis titles are required on both axes unless the meaning is unambiguous from context (e.g. a single-series time chart where the x-axis is clearly dates). Include units in axis titles (e.g. "Price ($/MTok)", not just "Price").
- Captions explain methodology. Use `st.caption()` below the chart to explain filtering choices, what outliers were excluded and why, or caveats about the data. Never use captions as a substitute for a missing legend.
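A sketch of a labeled reference line layered with `+` over a time series; the `prices` DataFrame, the date, and the label are hypothetical placeholders:

```python
import altair as alt
import pandas as pd

line = (
    alt.Chart(prices)  # hypothetical DataFrame: month, median_price
    .mark_line(color="#0072B2")
    .encode(
        x=alt.X("month:T", title="Month"),
        y=alt.Y("median_price:Q", title="Median Price ($/MTok)"),
        tooltip=[
            alt.Tooltip("month:T", title="Month"),
            alt.Tooltip("median_price:Q", title="Median Price ($/MTok)", format="$.3f"),
        ],
    )
)

# Reference line plus its on-chart label, layered with `+`
event = pd.DataFrame({"month": [pd.Timestamp("2023-03-01")], "label": ["GPT-4 Launch"]})
rule = alt.Chart(event).mark_rule(strokeDash=[4, 4], color="#999999").encode(x="month:T")
label = (
    alt.Chart(event)
    .mark_text(align="left", dx=5, color="#999999")
    .encode(x="month:T", y=alt.value(12), text="label:N")  # pinned near the top of the plot
)

chart = (line + rule + label).properties(height=380)
```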
Streamlit Layout
- Use `st.altair_chart(..., use_container_width=True)` to render all charts.
- Use `cornerRadiusTopLeft=4, cornerRadiusTopRight=4` on `mark_bar()` for rounded bar tops.
- Place KPI metrics above the first chart. Use `st.metric()` with explicit delta values when comparing periods.
- Use `st.columns(2)` for side-by-side comparisons of related charts. Never put unrelated charts side-by-side.
- Use `st.divider()` between story sections, not between every chart.
- Data tables complement charts. Show the underlying data (top N, summary) below the chart so the viewer can verify what they see. Use `st.dataframe(..., hide_index=True)` (see the layout sketch below).
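A layout sketch combining these conventions; the metric values, chart objects, and DataFrames are hypothetical placeholders:

```python
import streamlit as st

# KPI metrics above the first chart
kpi1, kpi2, kpi3 = st.columns(3)
kpi1.metric("Models tracked", "348")
kpi2.metric("Median price ($/MTok)", "$0.45", delta="-62% vs prior year")
kpi3.metric("Providers", "27")

st.altair_chart(price_trend_chart, use_container_width=True)
st.markdown("> Placeholder finding stated with numbers, e.g. the magnitude of the change.")
st.caption("Placeholder methodology note: filters applied, outliers excluded, caveats.")

st.divider()  # between story sections, not between every chart

left, right = st.columns(2)  # related charts side-by-side
left.altair_chart(quality_chart, use_container_width=True)
right.altair_chart(price_vs_quality_chart, use_container_width=True)

st.dataframe(top_models_df, hide_index=True)  # underlying data below the chart
```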
Bruin Asset Metadata Reference
All Bruin metadata lives inside a comment block at the top of the file:
- Python: `"""@bruin ... @bruin"""`
- SQL: `/* @bruin ... @bruin */`
Key Fields
| Field | Required | Description |
|---|---|---|
| `name` | Yes | `schema.table_name` — determines the destination table |
| `type` | Yes | `python` for Python assets, `bq.sql` for BigQuery SQL |
| `image` | Python only | Python Docker image, use `python:3.11` |
| `connection` | Yes | BigQuery connection name from `.bruin.yml` |
| `description` | Yes | Multi-line description of what the asset does |
| `depends` | SQL only | List of upstream asset names (`schema.table`) |
| `materialization.type` | Yes | `table` or `view` |
| `materialization.strategy` | Yes | `create+replace`, `merge`, `delete+insert`, `append`, etc. |
| `columns` | Yes | Full column definitions with name, type, description |
Column Types
Use these types for column definitions: VARCHAR, INTEGER, DOUBLE, BOOLEAN, DATE, TIMESTAMP.
Bruin CLI Quick Reference
```bash
bruin run <path/to/asset>         # Run a single asset
bruin run <path/to/pipeline/>     # Run entire pipeline
bruin run --downstream <path>     # Run asset + all downstream
bruin run --start-date 2024-07-01 --end-date 2024-12-31 <path>   # Run with date range
bruin validate <path>             # Validate asset/pipeline definitions
bruin format <path>               # Auto-format asset files
bruin lineage <path>              # Show asset dependency graph
bruin connections list            # List configured connections
bruin connections ping <name>     # Test a connection
```
- Always run individual assets during development, not entire pipelines.
- Use `--start-date`/`--end-date` for backfilling specific date ranges without re-ingesting everything.
- Use `--downstream` when you want to run a raw asset and automatically rebuild its staging dependents.
Testing & Development Workflow
Follow this order when building or modifying a pipeline:
- Validate first: `bruin validate <pipeline-dir>/` — catches header/config errors before any execution.
- Test raw assets individually with a small date range (2-3 days) or a limited scope (e.g. `STOCK_TICKER_LIMIT=5`).
- Verify data in BigQuery after each raw asset — check row counts, date ranges, column types.
- Test staging SQL once raw tables exist — run individually, then check output row counts and derived metrics.
- Test the full pipeline with a slightly larger window (e.g. 1 week) using `bruin run <pipeline-dir>/`.
- Run Streamlit locally to verify the dashboard renders: `streamlit run <pipeline>/assets/reports/streamlit_app.py`.
For financial or quarterly data: the source API may only return recent quarters regardless of date range. Test with whatever the API actually provides rather than forcing specific dates.
Creating a New Pipeline — Checklist
- Create a new top-level directory named after the pipeline.
- Add `pipeline.yml` with `name`, `schedule`, `start_date`, and `default_connections`.
- Add a `README.md` documenting data sources, all assets, and run commands.
- Create `assets/raw/` with Python ingestion scripts and a `requirements.txt`.
- Create `assets/staging/` with SQL transformations that declare `depends` on the raw assets. Always deduplicate.
- Create `assets/reports/` with a Streamlit app, supporting SQL files, and `requirements.txt`.
- Validate with `bruin validate <pipeline-dir>/`.
- Test each raw asset individually with a small subset of data.
- Test staging assets once raw data exists.
- Test the full pipeline end-to-end.
Dependency Resolution
Bruin resolves Python dependencies by walking up the file tree from the asset to find the nearest `requirements.txt`. Keep a separate `requirements.txt` per layer (`raw/`, `reports/`) so dependencies stay isolated.
Secrets Management
No secrets, credentials, API keys, tokens, or passwords may ever be committed to this repository. This is a hard rule with no exceptions.
What is gitignored
The following are excluded via `.gitignore` and must never be committed:
- `.bruin.yml` — contains connection credentials (API keys, passwords, service account paths)
- `**/secrets.toml` — Streamlit secrets files in any directory
- `credentials/` — service account JSON files
- `*.pem`, `*.key`, `*.p12`, `*.pfx` — private key files
- `.env`, `.env.*` — environment variable files
Rules
- Never hardcode secrets in Python or SQL. Use `os.environ["KEY_NAME"]` in Python and Bruin `secrets:` declarations in the asset YAML header.
- Never commit `.streamlit/secrets.toml` — generate it locally from the service account JSON in `credentials/`. See the Streamlit section below.
- Never commit `.bruin.yml` — this file contains all connection credentials. It stays local.
- Never create service account JSON files outside `credentials/` — that directory is gitignored.
- If you accidentally commit a secret, the credential must be rotated immediately. Removing the file from git tracking does not remove it from history. Use `git-filter-repo` to rewrite history and force-push.
Streamlit secrets setup
Each dashboard that queries BigQuery needs a `.streamlit/secrets.toml` in its reports directory. Create it from the GCP service account JSON:

```bash
# From the pipeline's reports directory:
mkdir -p .streamlit
# Then manually create secrets.toml with the service account fields
# DO NOT copy the JSON file directly — use the TOML format shown in AGENTS.md
```
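One way to generate the file, a minimal Python sketch run from the reports directory. The key path is a hypothetical name (adjust it to the actual file under `credentials/`), and the `[gcp_service_account]` table matches what `st.secrets["gcp_service_account"]` expects:

```python
import json
from pathlib import Path

# Hypothetical key file name; adjust to the actual file under credentials/.
key_file = Path("../../../credentials/service_account.json")
out_file = Path(".streamlit/secrets.toml")

sa = json.loads(key_file.read_text())

lines = ["[gcp_service_account]"]
for field, value in sa.items():
    # json.dumps produces a quoted, escaped string that is also valid TOML,
    # including the \n escapes inside private_key.
    lines.append(f"{field} = {json.dumps(value)}")

out_file.parent.mkdir(exist_ok=True)
out_file.write_text("\n".join(lines) + "\n")
print(f"Wrote {out_file} with {len(sa)} fields")
```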
The `.streamlit/secrets.toml` file is gitignored globally via `**/secrets.toml`. Verify before committing: `git status` should never show a `secrets.toml` file.
Things to Avoid
- Do not put non-asset files inside `assets/`. Use separate directories for ad-hoc queries or analyses.
- Do not hardcode dates in SQL — use Bruin templating or `BRUIN_START_DATE`/`BRUIN_END_DATE` in Python.
- Do not use the `.yml` extension for asset definitions — use `.asset.yml` if defining assets in YAML.
- Do not commit `.bruin.yml`, credentials, `secrets.toml`, `.env`, private keys, or any file containing secrets — they are all gitignored. If `git status` shows any of these files as untracked or modified, do not stage them.
- Do not write `CREATE TABLE` or `INSERT` in SQL assets — let Bruin's `materialization` handle DDL.
- Do not give the asset file a different name than the asset name (file name and asset name must match — the asset name is `<dataset>.<table_name>`, which is `<parent_folder_name>.<asset_file_name>`).
- Do not use `create+replace` for large incremental data — use `append` and deduplicate in staging.
- Do not crash on API rate limits — return partial data and log a warning.
- Do not skip logging — every Python asset must have structured logging with progress output.
- Do not leave throwaway test scripts in the root directory or inside `assets/`.
- When running `bruin run <pipeline>/` on pipelines with `append` raw assets, the default date interval is today — which may return no data for historical APIs. Run raw assets explicitly with `--start-date`/`--end-date` for initial loads, then run staging assets separately.
- For flaky APIs (e.g., World Bank), use chunked requests (10-year windows) with high `per_page` values and retry logic with exponential backoff. Single large requests are more likely to time out.
- When pivoting long-to-wide in staging (e.g., indicator rows → columns), validate every pivoted column against raw by joining on the natural key and checking for zero diff. This catches silent data loss from incorrect indicator codes or join fanout.
- After running `bruin ai enhance`, always re-run `bruin validate` and `bruin run` on the affected assets to verify the enhanced metadata doesn't break anything. The enhance command adds quality checks (not_null, accepted_values, min/max) that may fail if the data has edge cases. Never do bulk regex edits on YAML column definitions — if `bruin ai enhance` corrupts the YAML, rewrite the section manually.
- For Streamlit secrets: create `.streamlit/secrets.toml` in the reports directory (not the root). Generate it from the GCP service account JSON in `credentials/`. This file is gitignored.
- Use `python3 -m streamlit run` instead of bare `streamlit run` if the streamlit binary isn't on PATH.
Geospatial Data Rules
These rules are mandatory when working with geospatial data (OSMnx, GeoPandas, GHSL, etc.):
- Consistent spatial methodology. When comparing cities or regions, every entity MUST use identical spatial parameters: same query function, same radius, same resolution, same projection. Never mix `graph_from_place` (admin boundaries) across cities — admin boundary sizes vary wildly (e.g., "City of London" = 1 sq mi vs "Chicago" = 234 sq mi). Use `graph_from_point(center, dist=RADIUS)` with a fixed radius for all cities.
- Verify query scope before charting. Before building any visualization, verify what each geospatial query actually returned. Log the bounding box or area. Compare areas across all entities to catch inconsistencies. A query for "Barcelona" might return the city, the province, or a single neighborhood depending on the API and query string.
- Document the methodology explicitly. State the exact spatial parameters (radius, center coordinates, projection, data version) in the README and in the dashboard's methodology section. Future users must be able to reproduce the analysis.
- GHSL GeoPackage handling. The GHSL R2024A release contains 16 thematic layers in a single GeoPackage that must be joined on `ID_UC_G0`. It uses the Mollweide projection — convert centroids to WGS84 (EPSG:4326) for lat/lon coordinates. Use `pycountry` with fuzzy matching + manual overrides for country name → ISO code mapping.
- OSMnx Overpass API. Set `ox.settings.timeout = 300` for large network downloads. Add `time.sleep(2)` between city downloads to respect rate limits. Use a `CITY_LIMIT` env var for testing. Full-city queries for megacities (Tokyo, Istanbul) can exceed 2,000 km² and hang — always use bounded queries (see the sketch below).
- Name matching across datasets. When joining datasets from different sources (e.g., GHSL city names to OSMnx queries), use proximity matching (lat/lon within threshold + country code match) rather than string matching. City names vary across datasets (e.g., "Mumbai" vs "Bombay", "Brasília" vs "Lago Norte"). Filter out known mismatches from all charts.
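A sketch of the bounded, consistent query pattern; the radius and city coordinates are example values, and the settings attribute name assumes OSMnx 1.x:

```python
import time

import osmnx as ox

ox.settings.timeout = 300  # longer Overpass timeout for large network downloads (OSMnx 1.x)

RADIUS_M = 3000  # identical radius for every city (consistent spatial methodology)

cities = {  # example centers; replace with the actual study list
    "Barcelona": (41.3874, 2.1686),
    "Chicago": (41.8781, -87.6298),
}

graphs = {}
for name, center in cities.items():
    # Same query function and same radius for every city -- never mix graph_from_place here
    graph = ox.graph_from_point(center, dist=RADIUS_M, network_type="drive")
    graphs[name] = graph

    # Verify query scope: log the bounding box so inconsistent extents are caught early
    nodes = ox.graph_to_gdfs(graph, edges=False)
    print(name, "bounding box:", nodes.total_bounds)

    time.sleep(2)  # respect Overpass rate limits between city downloads
```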
Data Analysis Process
This is the process for building analysis-driven dashboards. Follow this order:
Phase 1: Data inventory
Before writing any chart code, list every field available across all data sources. Understand the data range, granularity, null rates, and distributions. Query the staging tables directly to verify what you actually have.
Phase 2: Find non-obvious insights
A dashboard that just shows "here's the data" is not analysis. Look for:
- Correlations — do two variables move together? (e.g., GDP vs grid-ness, temperature vs street length)
- Outliers — which cities/countries break the pattern? Why?
- Derived metrics — intersection density (intersections / area), orientation order (entropy-derived), population growth rates
- Cross-domain joins — combine datasets that weren't designed to go together (GHSL urban data + OSMnx street analysis + World Bank economic indicators)
Phase 3: Build charts iteratively
Start with the most insightful chart. Show it to the user. Get feedback. Iterate. Each chart must answer a specific question — if the takeaway is "the data exists," cut the chart. Expect to:
- Remove boring or redundant charts
- Add new charts based on what the data reveals
- Filter aggressively (e.g., cities over 5M population only, exclude data mismatches)
- Adjust visual encoding (dot sizes, label sizes, axis scales) based on what makes the data readable
Phase 4: Validate everything
- Cross-check data claims against the actual data
- Verify join quality (are matched cities actually correct?)
- Filter out known data quality issues (e.g., GHSL proximity mismatches)
- Include units on all metrics
- Include methodology section with data source links and limitations
Altair Gotchas
Lessons learned from building dashboards with Altair:
- Layered charts must share field names. If a dot layer uses `population_2015` and a line layer uses `y`, Altair creates independent scales. Rename DataFrame columns to match across layers (see the sketch below).
- Angle values must be 0-360. `angle=-33` raises a validation error in newer Altair. Use `angle=327` instead (equivalent rotation).
- Log scales on all layers. When using `alt.Scale(type="log")`, every layer sharing that axis must also specify the log scale explicitly — it does not propagate across layers.
- `zero=False` on the Y-axis. When data is clustered in a narrow range (e.g., building heights 10-50m), use `alt.Scale(zero=False)` on scatter/point charts to spread the data. Never do this on bar charts.
- Interactive legends. Use `alt.selection_point(fields=[...], bind="legend")` so viewers can click legend entries to filter. Encode both color and shape on the same field for dual encoding (accessibility).
Matplotlib Polar Plot Gotchas
- Default orientation is wrong for compass bearings. Matplotlib's default is 0° at East (right), counter-clockwise. For street orientation plots, set `ax.set_theta_zero_location("N")` and `ax.set_theta_direction(-1)` to get North at top, clockwise. Do this before drawing bars.