name: bigquery-cli
description: Use when working with BigQuery from the command line using the bq tool, including querying data, loading/exporting tables, managing datasets, cost estimation with dry-run, or partitioning strategies
BigQuery CLI (bq)
Overview
The BigQuery CLI (bq) provides command-line access to Google BigQuery for data warehousing and analytics. Core principle: Always estimate costs with dry-run, use proper authentication, and leverage partitioning for scale.
When to Use
- Running SQL queries on BigQuery
- Loading data from CSV/JSON/Avro/Parquet
- Exporting table data to GCS
- Managing datasets, tables, views
- Cost estimation and optimization
- Creating partitioned/clustered tables
- Viewing schemas and data samples
When NOT to use:
- When BigQuery UI is more appropriate (ad-hoc exploration)
- When using BigQuery client libraries (Python, Java, etc.)
Authentication and Project Setup
Before ANY bq operation:
# Check authentication
gcloud auth list
# Login if needed
gcloud auth login
# Set default project
gcloud config set project PROJECT_ID
# Verify current project
gcloud config get-value project
Command Structure
Format: bq [--global_flags] <command> [--command_flags] [args]
Important global flags:
- --project_id=PROJECT - Override default project
- --dataset_id=DATASET - Default dataset
- --location=LOCATION - Geographic location (us, eu, asia-northeast1)
- --format=FORMAT - Output format: pretty, sparse, prettyjson, json, csv
- --quiet / -q - Suppress status updates
- --dry_run - Validate without executing (queries only)
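For example, global flags go before the command (the project and dataset names here are placeholders):
# List tables in one dataset of a specific project, as readable JSON
bq --project_id=my-project --format=prettyjson ls my_dataset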
Quick Reference
Query Operations
| Task | Command | Key Flags |
|---|---|---|
| Run query | bq query 'SELECT ...' | --dry_run, --use_legacy_sql=false, --destination_table, --max_rows |
| Estimate cost | bq query --dry_run 'SELECT ...' | Shows bytes to be processed |
| Save results | bq query --destination_table=ds.table 'SELECT ...' | --append_table, --replace |
| Parameterized query | bq query --parameter='name:STRING:value' 'SELECT ...' | Repeat --parameter for multiple |
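A parameterized-query sketch (project, dataset, and column names are placeholders); query parameters require standard SQL, hence --use_legacy_sql=false:
bq query --use_legacy_sql=false \
  --parameter='min_date:DATE:2025-01-01' \
  --parameter='country:STRING:US' \
  'SELECT COUNT(*) AS n
   FROM `my_project.my_dataset.events`
   WHERE event_date >= @min_date AND country = @country'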
Data Loading
| Task | Command | Key Flags |
|---|---|---|
| Load CSV | bq load ds.table gs://bucket/file.csv schema | --skip_leading_rows=1, --autodetect, --field_delimiter |
| Load JSON | bq load --source_format=NEWLINE_DELIMITED_JSON ds.table file.json | --autodetect, --schema |
| Load Parquet/Avro | bq load --source_format=PARQUET ds.table gs://bucket/*.parquet | Schema auto-detected |
| Replace table | bq load --replace ds.table source schema | Overwrites existing data |
| Append to table | bq load --noreplace ds.table source schema | Adds to existing data (default) |
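A production-style load sketch with an explicit inline schema (bucket, dataset, and column names are placeholders):
# Skip the header row and load from GCS
bq load --source_format=CSV --skip_leading_rows=1 \
  my_dataset.sales \
  gs://my-bucket/sales_2025.csv \
  sale_id:STRING,amount:NUMERIC,sale_date:DATE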
Data Export
| Task | Command | Key Flags |
|---|---|---|
| Export to CSV | bq extract --destination_format=CSV ds.table gs://bucket/file.csv | --field_delimiter, --print_header |
| Export to JSON | bq extract --destination_format=NEWLINE_DELIMITED_JSON ds.table gs://bucket/*.json | Use wildcard for large files |
| Compressed export | bq extract --compression=GZIP ds.table gs://bucket/*.json.gz | GZIP, SNAPPY, or NONE |
| Export model | bq extract -m ds.model gs://bucket/model | For ML models |
Resource Management
| Task | Command | Key Flags |
|---|---|---|
| List datasets | bq ls or bq ls PROJECT: | -a for all (including hidden) |
| List tables | bq ls DATASET | -m for models, -a for all |
| List jobs | bq ls -j PROJECT | --filter='state:RUNNING,PENDING', --max_results |
| Show table | bq show ds.table | --schema, --format=prettyjson |
| Show schema only | bq show --schema ds.table | Faster than full show |
| Preview data | bq head ds.table | -n 100, -s 10 for offset |
| Show job | bq show -j JOB_ID | Check job status and details |
Creating Resources
| Task | Command | Key Flags |
|---|---|---|
| Create dataset | bq mk DATASET | --location=us, --description |
| Create table | bq mk -t ds.table schema | See schema format below |
| Create view | bq mk --view='SELECT ...' ds.view | SQL definition |
| Create materialized view | bq mk --materialized_view='SELECT ...' ds.mview | Auto-refreshed |
| Create partitioned table | bq mk --table --time_partitioning_type=DAY ds.table schema | See partitioning below |
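A minimal view-creation sketch (names are placeholders); --use_legacy_sql=false makes the view definition standard SQL:
bq mk --use_legacy_sql=false \
  --view='SELECT user_id, COUNT(*) AS events
          FROM `my_project.my_dataset.events`
          WHERE event_date >= "2025-01-01"
          GROUP BY user_id' \
  my_dataset.active_users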
Deleting Resources
| Task | Command | Key Flags |
|---|---|---|
| Delete table | bq rm ds.table | -f to skip confirmation |
| Delete dataset | bq rm -r DATASET | -r removes all tables, -f force |
| Delete model | bq rm -m ds.model | For ML models |
| Cancel job | bq cancel JOB_ID | Stops running query |
Cost Estimation - ALWAYS USE DRY RUN
# ALWAYS check cost before running expensive queries
bq query --dry_run 'SELECT * FROM project.dataset.huge_table'
# Output shows bytes to be processed
# Cost ≈ (bytes processed / 1 TiB) × $6.25 (US multi-region, on-demand pricing)
# First 1 TiB per month is free
# Add safety limit (in bytes)
bq query --maximum_bytes_billed=5000000000000 'SELECT ...' # ~5TB limit
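A quick worked example of that formula (the byte count is invented for illustration):
# Dry run reports 2,500,000,000,000 bytes ≈ 2.27 TiB
# Estimated cost ≈ 2.27 × $6.25 ≈ $14.21 on-demand, before the monthly free tier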
Schema Format
Inline (comma-separated):
field1:STRING,field2:INTEGER,field3:FLOAT,field4:BOOLEAN,field5:TIMESTAMP
Column modes and nested fields:
Modes (REQUIRED, REPEATED) and nested RECORD fields cannot be specified inline; inline columns always default to NULLABLE. Use a JSON schema file when you need them.
JSON file:
[
{"name": "field1", "type": "STRING", "mode": "REQUIRED"},
{"name": "field2", "type": "INTEGER", "mode": "NULLABLE"},
{
"name": "nested",
"type": "RECORD",
"fields": [
{"name": "subfield", "type": "STRING"}
]
}
]
Common types: STRING, INTEGER, FLOAT, BOOLEAN, TIMESTAMP, DATE, TIME, DATETIME, NUMERIC, BIGNUMERIC, BYTES, GEOGRAPHY, JSON, RECORD (nested). Arrays are expressed as fields with mode REPEATED rather than a separate type.
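A sketch of using a JSON schema file end to end, assuming the JSON above is saved as schema.json (dataset and bucket names are placeholders):
# Create an empty table from the schema file, then load data that must match it
bq mk --table my_dataset.users schema.json
bq load --source_format=NEWLINE_DELIMITED_JSON my_dataset.users gs://my-bucket/users.json schema.json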
Partitioning and Clustering
Time-based Partitioning
# Partition by DATE/TIMESTAMP column
bq mk --table \
--time_partitioning_type=DAY \
--time_partitioning_field=event_date \
--time_partitioning_expiration=2592000 \
ds.events \
event_id:STRING,event_date:DATE,data:STRING
# Require partition filter (prevents expensive full scans)
bq mk --table \
--time_partitioning_type=DAY \
--require_partition_filter=true \
ds.events \
schema.json
Partition types: DAY, HOUR, MONTH, YEAR
Clustering
# Add clustering (up to 4 columns)
bq mk --table \
--time_partitioning_type=DAY \
--time_partitioning_field=event_date \
--clustering_fields=user_id,region \
ds.events \
schema.json
Benefits: Faster queries, lower costs when filtering/grouping by clustered columns
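For example, a query like this (placeholder names, matching the table created above) scans only one week of partitions and prunes blocks via the clustered region and user_id columns:
bq query --use_legacy_sql=false \
  'SELECT user_id, COUNT(*) AS events
   FROM `my_project.ds.events`
   WHERE event_date BETWEEN "2025-01-01" AND "2025-01-07"
     AND region = "us-east1"
   GROUP BY user_id'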
Output Formats
# Pretty table (default)
bq query --format=pretty 'SELECT ...'
# JSON (compact)
bq query --format=json 'SELECT ...'
# Pretty JSON (readable)
bq query --format=prettyjson 'SELECT ...'
# CSV with header
bq query --format=csv 'SELECT ...'
# Sparse table (simpler)
bq query --format=sparse 'SELECT ...'
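--format=json is convenient for scripting; a sketch assuming jq is installed and placeholder project/dataset names:
# List table names in a dataset and extract them with jq
bq query --format=json --use_legacy_sql=false \
  'SELECT table_name FROM `my_project.ds.INFORMATION_SCHEMA.TABLES`' \
  | jq -r '.[].table_name'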
Common Workflows
Cost Estimation → Query
# 1. Estimate cost
bq query --dry_run 'SELECT ...'
# 2. Review bytes to be processed
# Calculate: (bytes / 1099511627776) × $6.25
# 3. Run query if acceptable
bq query 'SELECT ...'
# OR add safety limit
bq query --maximum_bytes_billed=10000000000000 'SELECT ...'
Load Data Pipeline
# 1. Check authentication and project
gcloud auth list
gcloud config get-value project
# 2. Create dataset if needed
bq ls | grep my_dataset || bq mk my_dataset
# 3. Load with autodetect (fast) or explicit schema (production)
bq load --autodetect --skip_leading_rows=1 \
my_dataset.table \
gs://bucket/data.csv
# 4. Verify
bq show my_dataset.table
bq head -n 10 my_dataset.table
Export Large Table
# 1. Export with wildcard (required for >1GB)
bq extract \
--compression=GZIP \
--destination_format=NEWLINE_DELIMITED_JSON \
dataset.large_table \
'gs://bucket/export_*.json.gz'
# 2. Verify export
gsutil ls -lh gs://bucket/export_*
# 3. Check total size
gsutil du -sh gs://bucket/
Create Partitioned Table for Scale
# 1. Create with partitioning and clustering
bq mk --table \
--time_partitioning_type=DAY \
--time_partitioning_field=event_date \
--clustering_fields=user_id,event_type \
--require_partition_filter=true \
--time_partitioning_expiration=31536000 \
dataset.events \
event_id:STRING,user_id:STRING,event_type:STRING,event_date:DATE,data:JSON
# 2. Load initial data
bq load --source_format=NEWLINE_DELIMITED_JSON dataset.events gs://bucket/events_*.json
# 3. Query with partition filter (required)
bq query 'SELECT * FROM dataset.events WHERE event_date = "2025-01-20"'
Best Practices
Query Cost Optimization
- Always use --dry_run first for queries scanning >1TB
- Add WHERE clauses on partition columns
- SELECT specific columns, not SELECT *
- Use --maximum_bytes_billed as a safety net
- Partition large tables by date/timestamp
- Cluster by common filter columns
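The tips above combine into a pattern like the following (placeholder names; the 1 TB cap is arbitrary):
bq query --use_legacy_sql=false \
  --maximum_bytes_billed=1000000000000 \
  'SELECT user_id, event_type
   FROM `my_project.ds.events`
   WHERE event_date = "2025-01-20"'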
Loading Data
- Use --autodetect for quick loads, explicit schema for production
- Load from GCS (not local files) for large data
- Use appropriate source format: Parquet/Avro > CSV/JSON
- Batch small files into larger files before loading
- Set expiration on staging tables
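For the staging-table tip, a minimal sketch (names are placeholders); --expiration is in seconds:
# Staging table that deletes itself after 24 hours
bq mk --table --expiration=86400 my_dataset.staging_orders order_id:STRING,amount:NUMERIC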
Exporting Data
- Use wildcards for large exports (>1GB limit per file)
- Compress exports with GZIP or SNAPPY
- Prefer Parquet (columnar) or Avro over CSV/JSON for analytics workflows
- Match regions (BigQuery dataset and GCS bucket)
Partitioning
- Partition by date/timestamp for time-series data
- Use require_partition_filter to prevent expensive scans
- Set partition expiration to auto-delete old data
- Combine with clustering for additional optimization
- Partition types: DAY for most use cases, HOUR for high-volume
Schema Design
- Use REQUIRED for mandatory fields
- Use TIMESTAMP over STRING for timestamps
- Use NUMERIC for financial data (exact precision)
- Use JSON type for flexible nested data (Standard SQL only)
- Nested records better than wide tables with many columns
Common Mistakes
| Mistake | Why It's Wrong | Correct Approach |
|---|---|---|
| No dry-run for large queries | Unexpected costs | Always bq query --dry_run first |
| Skipping authentication check | Commands fail | Run gcloud auth list before bq commands |
| SELECT * on huge tables | Scans all columns | Select only needed columns |
| Loading without --skip_leading_rows | Header becomes data | Use --skip_leading_rows=1 for CSVs with headers |
| Single file for large exports | 1GB limit per file | Use wildcard: gs://bucket/file_*.json.gz |
| No partitioning on large tables | Expensive full scans | Use --time_partitioning_type=DAY |
| Forgetting that bq query defaults to legacy SQL | Legacy SQL has different syntax and features | Pass --use_legacy_sql=false (or set it in .bigqueryrc) |
| Wrong dataset reference | Ambiguous table | Use fully qualified: project.dataset.table |
| No compression on exports | Larger GCS storage costs | Use --compression=GZIP |
| Forgetting location parameter | Cross-region costs | Match --location to dataset location |
Red Flags - CHECK BEFORE RUNNING
- Running a query without --dry_run on production data
- Using SELECT * on tables with >1TB
- Loading data without schema validation
- Exporting a large table without a wildcard
- Creating an unpartitioned table for billions of rows
- No authentication check before commands
- Using the wrong project (check gcloud config get-value project)
- Not matching dataset and GCS bucket regions
All of these mean: Stop, review the command, fix the issue.
Dataset and Table References
Fully qualified:
project:dataset.table
# or
project.dataset.table
Without project (uses default):
dataset.table
Partitioned table (specific date):
dataset.table$20250120
Legacy SQL (bracket notation):
[project:dataset.table]
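A sketch of writing to a single partition via a decorator (placeholder names; quote the reference so the shell does not expand the $):
bq load --source_format=NEWLINE_DELIMITED_JSON \
  'my_dataset.events$20250120' \
  gs://my-bucket/events_20250120.json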
Job Management
# List running jobs
bq ls -j --filter='state:RUNNING'
# Show job details
bq show -j JOB_ID
# Cancel long-running job
bq cancel JOB_ID
# Wait for job completion
bq wait JOB_ID
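An asynchronous variant (assuming the --nosync global flag is available in your bq version; --sync=false is equivalent): submit, note the printed job ID, then poll:
# Submit without waiting; bq prints the job ID
bq --nosync query --use_legacy_sql=false 'SELECT COUNT(*) FROM `my_project.ds.events`'
# Later: block until it finishes, then inspect it
bq wait JOB_ID
bq show -j JOB_ID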
Getting Help
# General help
bq help
# Command-specific help
bq help query
bq help load
bq help extract
# Show all flags
bq --helpfull
When in doubt, check bq help <command> for exact flags and syntax.
Real-World Impact
Cost savings:
- Dry-run prevents accidental multi-thousand dollar queries
- Partitioning reduces scan costs by 10-100x
- Clustering adds 20-40% additional savings
Performance:
- Partitioning + clustering: queries 10-100x faster
- Proper schema: faster loads and queries
- Columnar formats (Parquet): 5-10x faster loads than CSV
Common scenarios:
- Ad-hoc analysis: bq query --dry_run → review → bq query
- Data pipelines: bq load from GCS → process → bq extract
- Cost monitoring: Always dry-run before production queries
- Scale: Partition + cluster tables with >100M rows