Skill: Databricks Lakeflow — Data Engineering
Metadata
What It Is
Lakeflow is Databricks' end-to-end data engineering solution. Three components:
- Lakeflow Connect — Managed connectors for ingestion
- Lakeflow Spark Declarative Pipelines (SDP) — Declarative ETL framework (formerly DLT)
- Lakeflow Jobs — Orchestration and production monitoring
Lakeflow Connect (Ingestion)
| Connector Type | Description | IPAI Use |
|---|---|---|
| Managed connectors | UI-based, config-driven, minimal ops | Odoo DB → Bronze (if DB replica available) |
| Standard connectors | Wider range, used within pipelines | ADLS, PostgreSQL, REST API |
Relevant Connectors
| Source | Connector | Status |
|---|---|---|
| PostgreSQL (Odoo DB) | postgresql | Available — use for Odoo → Bronze |
| Azure Blob / ADLS | azure_storage | Available — use for file ingestion |
| Kafka / Event Hubs | kafka | Available — for streaming events |
| REST API | Custom via standard | Build for Odoo Extract API |
Lakeflow SDP (Transformation)
Formerly Delta Live Tables (DLT). Declarative framework for batch + streaming pipelines.
| Concept | Description | IPAI Equivalent |
|---|---|---|
| Flows | Process data using Spark DataFrame API | n8n workflows (simpler) |
| Streaming tables | Delta tables with incremental processing | ops.platform_events (append-only) |
| Materialized views | Cached views with auto-refresh | Superset cached queries |
| Sinks | Output to Kafka, Event Hubs, external tables | Supabase REST API writes |
| Pipelines | Container that orchestrates flows, tables, views | n8n workflow sequences |
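The difference between streaming tables and materialized views in the table above can be illustrated with a toy model in plain Python. This is not the Databricks API: the class names `ToyStreamingTable` and `ToyMaterializedView`, the helper `count_by_type`, and the sample events are all hypothetical. The sketch only assumes that a streaming table processes each source row once, while a materialized view caches a query result and recomputes it on refresh.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToyStreamingTable:
    """Append-only target that remembers how far into the source it has read."""
    rows: list = field(default_factory=list)
    offset: int = 0  # number of source rows already processed

    def refresh(self, source: list) -> int:
        new_rows = source[self.offset:]  # only rows not seen before
        self.rows.extend(new_rows)
        self.offset = len(source)
        return len(new_rows)  # rows processed in this refresh


@dataclass
class ToyMaterializedView:
    """Cached query result that is recomputed from its input on refresh."""
    query: Callable[[list], dict]
    cached: dict = field(default_factory=dict)

    def refresh(self, table_rows: list) -> None:
        self.cached = self.query(table_rows)


def count_by_type(rows: list) -> dict:
    """Hypothetical aggregation: number of events per type."""
    counts: dict = {}
    for row in rows:
        counts[row["type"]] = counts.get(row["type"], 0) + 1
    return counts


# Hypothetical append-only event source, similar in spirit to ops.platform_events.
source = [{"type": "login"}, {"type": "invoice"}]
events = ToyStreamingTable()
summary = ToyMaterializedView(query=count_by_type)

first = events.refresh(source)    # processes both existing rows
source.append({"type": "login"})
second = events.refresh(source)   # processes only the newly appended row
summary.refresh(events.rows)      # recomputes the cached aggregate
print(first, second, summary.cached)
```

The second refresh touches one row instead of three, which is the property that makes streaming tables a natural fit for append-only sources.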
SDP Pipeline Example (Python)
import dlt
from pyspark.sql.functions import col, sum as spark_sum, trunc


# Bronze: ingest raw Odoo journal entries
@dlt.table(comment="Raw Odoo journal entries from Extract API")
def bronze_account_move():
    # `spark` is the SparkSession provided by the pipeline runtime.
    return (
        spark.read.format("parquet")
        .load("abfss://bronze@ipailakehouse.dfs.core.windows.net/odoo/account_move/")
    )


# Silver: clean and type
@dlt.table(comment="Cleaned journal entries")
@dlt.expect_or_drop("valid_state", "state IN ('posted', 'draft')")
def silver_account_move():
    return (
        dlt.read("bronze_account_move")
        .select(
            col("id"),
            col("name").alias("move_name"),
            col("date").cast("date").alias("move_date"),
            col("partner_id"),
            col("journal_id"),
            col("amount_total").cast("decimal(15,2)").alias("amount_total"),
            col("state"),
        )
    )


# Gold: monthly revenue mart
@dlt.table(comment="Monthly revenue aggregation")
def gold_monthly_revenue():
    return (
        dlt.read("silver_account_move")
        .filter(col("state") == "posted")
        # Grouping key is assumed to be the calendar month of move_date.
        .groupBy(trunc(col("move_date"), "month").alias("revenue_month"))
        .agg(spark_sum("amount_total").alias("total_revenue"))
    )
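The Bronze → Silver → Gold logic of the pipeline above can be sanity-checked without a cluster. The following is a plain-Python sketch, not SDP code: the sample rows and the helpers `to_silver` and `to_gold` are hypothetical, and it assumes that the Gold layer sums posted entries per calendar month, as the name `gold_monthly_revenue` suggests.

```python
from collections import defaultdict
from datetime import date
from decimal import Decimal

# Hypothetical raw rows with the columns the Silver step selects.
bronze = [
    {"id": 1, "name": "INV/001", "date": "2024-01-15", "amount_total": "100.00", "state": "posted"},
    {"id": 2, "name": "INV/002", "date": "2024-01-20", "amount_total": "50.50", "state": "draft"},
    {"id": 3, "name": "INV/003", "date": "2024-02-03", "amount_total": "75.25", "state": "posted"},
    {"id": 4, "name": "INV/004", "date": "2024-02-10", "amount_total": "10.00", "state": "cancel"},
]


def to_silver(rows):
    """Rename and type columns; drop rows that fail the valid_state rule."""
    silver = []
    for r in rows:
        if r["state"] not in ("posted", "draft"):
            continue  # mirrors expect_or_drop("valid_state", ...)
        silver.append({
            "id": r["id"],
            "move_name": r["name"],
            "move_date": date.fromisoformat(r["date"]),
            "amount_total": Decimal(r["amount_total"]),
            "state": r["state"],
        })
    return silver


def to_gold(rows):
    """Sum posted amounts per calendar month (assumed grouping key)."""
    totals = defaultdict(Decimal)  # Decimal() defaults to 0
    for r in rows:
        if r["state"] == "posted":
            month = r["move_date"].replace(day=1)
            totals[month] += r["amount_total"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
for month, total in sorted(gold.items()):
    print(month.isoformat(), total)
```

The cancelled entry is dropped at Silver, and the draft entry survives Silver but is excluded from Gold revenue.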
Data Quality with Expectations
@dlt.expect("valid_amount", "amount_total > 0") # Warn but keep
@dlt.expect_or_drop("has_date", "move_date IS NOT NULL") # Drop invalid
@dlt.expect_or_fail("valid_state", "state IS NOT NULL") # Fail pipeline
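The three expectation modes differ only in what happens to a violating row. A minimal plain-Python model of that behaviour, assuming the warn / drop / fail semantics described in the comments above, might look like this. The function `apply_expectation`, the exception `ExpectationFailed`, and the sample rows are hypothetical and are not part of the `dlt` module.

```python
from typing import Callable

Row = dict
Rule = Callable[[Row], bool]


class ExpectationFailed(Exception):
    """Raised when a fail-mode rule is violated (hypothetical name)."""


def apply_expectation(rows, name: str, rule: Rule, mode: str = "warn"):
    """Return (kept_rows, violation_count) for one rule.

    mode="warn": keep every row and only count violations (like expect)
    mode="drop": remove violating rows (like expect_or_drop)
    mode="fail": raise on the first violation (like expect_or_fail)
    """
    if mode not in ("warn", "drop", "fail"):
        raise ValueError(f"unknown mode: {mode}")
    kept, violations = [], 0
    for row in rows:
        if rule(row):
            kept.append(row)
            continue
        violations += 1
        if mode == "fail":
            raise ExpectationFailed(f"{name} violated by row {row}")
        if mode == "warn":
            kept.append(row)  # violation is recorded, row is retained
    return kept, violations


# Hypothetical Silver rows.
rows = [
    {"amount_total": 120.0, "move_date": "2024-01-15", "state": "posted"},
    {"amount_total": -5.0, "move_date": "2024-01-16", "state": "posted"},
    {"amount_total": 30.0, "move_date": None, "state": "draft"},
]

rows, warned = apply_expectation(rows, "valid_amount", lambda r: r["amount_total"] > 0, "warn")
rows, dropped = apply_expectation(rows, "has_date", lambda r: r["move_date"] is not None, "drop")
rows, failed = apply_expectation(rows, "valid_state", lambda r: r["state"] is not None, "fail")
print(len(rows), warned, dropped, failed)
```

In this run the negative amount is counted but kept, the row without a date is removed, and no row trips the fail rule.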
Lakeflow Jobs (Orchestration)
| Feature | Description | IPAI Equivalent |
|---|---|---|
| Jobs | Primary orchestration resource, scheduled | n8n cron workflows |
| Tasks | Unit of work (notebook, pipeline, SQL, ML) | n8n nodes |
| Control flow | If/else branching, for-each loops | n8n IF/Switch nodes |
Task Types
| Task Type | Use Case |
|---|---|
| Notebook | Python/SQL ETL scripts |
| Pipeline | Lakeflow SDP pipeline execution |
| SQL | Databricks SQL query execution |
| dbt | dbt project execution |
| Python wheel | Packaged Python applications |
| ML training | Model training jobs |
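How jobs, tasks, and dependencies fit together can be sketched as a job definition built in Python. The field names (`task_key`, `depends_on`, `pipeline_task`, `notebook_task`, `sql_task`, `schedule.quartz_cron_expression`) follow the Databricks Jobs API 2.1 request shape as far as I know and should be checked against the current API reference before use. The job name, task keys, IDs, and paths are placeholders, and `execution_order` is a hypothetical local helper, not a Databricks function.

```python
import json

# All names, IDs, and paths below are placeholders, not real resources.
job = {
    "name": "odoo-finance-daily",
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",  # 02:00 every day
        "timezone_id": "UTC",
    },
    "tasks": [
        {
            "task_key": "run_sdp_pipeline",
            "pipeline_task": {"pipeline_id": "<pipeline-id>"},
        },
        {
            "task_key": "publish_gold_checks",
            "depends_on": [{"task_key": "run_sdp_pipeline"}],
            "notebook_task": {"notebook_path": "/Workspace/<path>/gold_checks"},
        },
        {
            "task_key": "refresh_dashboard_query",
            "depends_on": [{"task_key": "publish_gold_checks"}],
            "sql_task": {
                "warehouse_id": "<warehouse-id>",
                "query": {"query_id": "<query-id>"},
            },
        },
    ],
}


def execution_order(tasks):
    """Return task keys in dependency order; raise on unknown or cyclic deps."""
    deps = {
        t["task_key"]: [d["task_key"] for d in t.get("depends_on", [])]
        for t in tasks
    }
    for key, parents in deps.items():
        unknown = [p for p in parents if p not in deps]
        if unknown:
            raise ValueError(f"{key} depends on unknown task(s): {unknown}")
    order = []
    while len(order) < len(deps):
        ready = [
            k for k in deps
            if k not in order and all(p in order for p in deps[k])
        ]
        if not ready:
            raise ValueError("dependency cycle detected")
        order.extend(ready)
    return order


print(execution_order(job["tasks"]))
print(json.dumps(job, indent=2))
```

Validating the dependency graph locally before submitting the definition catches typos in `task_key` references early.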
Databricks Runtime
| Feature | Description |
|---|---|
| Photon | High-performance vectorized query engine |
| Structured Streaming | Near real-time processing |
| Autoscaling | Automatic cluster scaling |
| Delta Lake | ACID transactions on Parquet |
IPAI Integration Architecture
Source Systems                     Databricks Lakeflow                        Consumption

Odoo PG ───── Lakeflow Connect ──► Bronze (Delta) ──►
                                                      │
Supabase ──── REST API / CDC ────► Bronze (Delta) ──► SDP Pipeline ──► Silver ──► Gold
                                                      │ (expectations)             │
n8n events ── Webhook / Batch ───► Bronze (Delta) ──►                              │
                                                                                   ▼
                                                                          ┌─────────────────┐
                                                                          │ Superset (BI)   │
                                                                          │ Databricks SQL  │
                                                                          │ ML Models       │
                                                                          │ Reverse ETL     │
                                                                          └─────────────────┘
Decision: n8n vs Lakeflow for ETL
| Criteria | n8n | Lakeflow |
|---|---|---|
| Simple webhook routing | Winner | Overkill |
| Data quality (expectations) | Manual | Built-in |
| Large volume transforms | Limited (memory) | Spark-native (distributed) |
| Streaming | Not supported | Structured Streaming |
| Schema evolution | Manual | Delta Lake native |
| Cost | Self-hosted (free) | Databricks DBUs ($$$) |
| Monitoring | Basic | Production-grade |
Recommendation:
- n8n for: event routing, webhook handling, Odoo CRUD, Slack notifications, simple ETL (<100K rows)
- Lakeflow for: Bronze → Silver → Gold transforms, data quality, streaming, ML features (>100K rows)
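The recommendation above can be expressed as a small routing rule. This is a sketch under stated assumptions: the `Workload` fields and the `choose_engine` function are hypothetical, the rule order (streaming and data-quality needs first, then event routing, then row count) is one reasonable reading of the table, and a workload of exactly 100K rows is assigned to Lakeflow because the source leaves that boundary open.

```python
from dataclasses import dataclass

ROW_THRESHOLD = 100_000  # threshold taken from the recommendation above


@dataclass(frozen=True)
class Workload:
    rows_per_run: int
    needs_streaming: bool = False
    needs_expectations: bool = False  # built-in data quality rules
    is_event_routing: bool = False    # webhooks, notifications, Odoo CRUD


def choose_engine(w: Workload) -> str:
    """Return "n8n" or "lakeflow" for a workload (assumed rule order)."""
    if w.needs_streaming or w.needs_expectations:
        return "lakeflow"
    if w.is_event_routing:
        return "n8n"
    # Boundary case (exactly 100K rows) goes to Lakeflow by assumption.
    return "n8n" if w.rows_per_run < ROW_THRESHOLD else "lakeflow"


print(choose_engine(Workload(rows_per_run=5_000, is_event_routing=True)))
print(choose_engine(Workload(rows_per_run=2_000_000)))
print(choose_engine(Workload(rows_per_run=20_000, needs_expectations=True)))
```

Encoding the rule keeps routing decisions consistent across teams and makes the threshold easy to revisit once real DBU costs are known.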
Next Steps (from n8n-odoo-supabase-etl skill)
- Provision Databricks workspace (dbw-ipai-dev in rg-ipai-ai-dev)
- Create Unity Catalog for data governance
- Set up Lakeflow Connect for Odoo PostgreSQL
- Build first SDP pipeline: bronze_account_move → silver_account_move → gold_monthly_revenue
- Connect Superset to Databricks SQL warehouse for Gold queries