name: Data Contracts
description: Data contracts for defining the schema, quality expectations, and SLAs between data producers and consumers
Data Contracts
Overview
Data contracts define the schema, quality expectations, and SLAs for data shared between producers and consumers. They make the data layer reliable.
Why This Matters
- Trust: Consumers know the data format will not change unexpectedly
- Quality: Expectations are defined explicitly
- Decoupling: Producers and consumers evolve independently
- Discovery: Everyone can see what data exists and in which format
Data Contract Template
# contracts/users.contract.yaml
name: users
version: 1.1.0  # matches the latest changelog entry
owner: user-team
description: User profile data
status: active
schema:
  type: object
  properties:
    id:
      type: string
      format: uuid
      description: Unique user identifier
    email:
      type: string
      format: email
      description: User email address
    name:
      type: string
      description: User full name
    created_at:
      type: string
      format: date-time
      description: Account creation timestamp
    status:
      type: string
      enum: [active, inactive, suspended]
      description: Account status
  required: [id, email, created_at, status]
quality:
  - name: no_null_emails
    check: email IS NOT NULL
    threshold: 100%
    severity: critical
  - name: valid_email_format
    check: email LIKE '%@%.%'
    threshold: 99%
    severity: high
  - name: unique_emails
    check: COUNT(DISTINCT email) = COUNT(*)
    threshold: 100%
    severity: critical
  - name: recent_data
    # Aggregate check: at least one account was created in the last 7 days
    check: MAX(created_at) > NOW() - INTERVAL '7 days'
    threshold: 100%
    severity: medium
sla:
  freshness: 1 hour      # Data updated within the last hour
  availability: 99.9%    # Uptime guarantee
  latency_p95: 100ms     # 95th percentile query time
  completeness: 99%      # Share of records with all required fields present
consumers:
  - analytics-team
  - marketing-team
  - billing-service
producer:
  team: user-team
  service: user-api
  contact: user-team@example.com
changelog:
  - version: 1.0.0
    date: 2024-01-01
    changes: Initial contract
  - version: 1.1.0
    date: 2024-01-15
    changes: Added optional name field (non-breaking)
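To make the schema section concrete, here is a minimal sketch of checking a single record against the `required`, `enum`, and `format: uuid` rules from the template. It uses only the standard library and a hand-copied subset of the schema. The function name `check_record` and the `SCHEMA` dictionary are illustrative assumptions, not part of any contract tooling.

```python
import uuid

# Hand-copied subset of the schema in contracts/users.contract.yaml.
# Assumption: a real setup would parse the YAML file instead of hard-coding this.
SCHEMA = {
    "required": ["id", "email", "created_at", "status"],
    "enums": {"status": ["active", "inactive", "suspended"]},
    "uuid_fields": ["id"],
}


def check_record(record: dict, schema: dict = SCHEMA) -> list:
    """Return a list of human-readable violations; an empty list means the record passes."""
    problems = []
    # Required fields must be present and non-null
    for field in schema["required"]:
        if record.get(field) is None:
            problems.append(f"missing required field '{field}'")
    # Enum fields may only hold one of the listed values
    for field, allowed in schema["enums"].items():
        value = record.get(field)
        if value is not None and value not in allowed:
            problems.append(f"'{field}' must be one of {allowed}")
    # UUID fields must parse as UUIDs
    for field in schema["uuid_fields"]:
        value = record.get(field)
        if value is not None:
            try:
                uuid.UUID(str(value))
            except ValueError:
                problems.append(f"'{field}' is not a valid UUID")
    return problems
```

Returning a list of violations rather than raising on the first one lets a caller report every problem in a record at once.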
Contract Validation
Python Example
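# NOTE: `Contract` and `validate` are an illustrative API; adapt the import to the contract library you use.
from datacontract import Contract, validate

class DataQualityError(Exception):
    """Raised when data violates its contract."""

# Load contract
contract = Contract.load('contracts/users.contract.yaml')
# Validate data (`data` is the batch of records loaded upstream)
result = validate(data, contract)
if not result.passed:
    print("Validation failed:")
    for failure in result.failures:
        print(f"- {failure.check}: {failure.message}")
    raise DataQualityError(result.failures)
print("✓ Data meets contract requirements")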
SQL Example
-- Validate quality checks (PostgreSQL syntax)
WITH quality_checks AS (
  SELECT
    'no_null_emails' AS check_name,
    COUNT(*) FILTER (WHERE email IS NULL) AS failures,
    COUNT(*) AS total,
    100.0 AS threshold  -- per-check threshold from the contract
  FROM users
  UNION ALL
  SELECT
    'valid_email_format',
    COUNT(*) FILTER (WHERE email NOT LIKE '%@%.%'),
    COUNT(*),
    99.0
  FROM users
)
SELECT
  check_name,
  failures,
  total,
  threshold,
  -- NULLIF avoids division by zero; an empty table yields NULL and is reported as FAIL
  (1 - failures::float / NULLIF(total, 0)) * 100 AS pass_rate,
  CASE
    WHEN (1 - failures::float / NULLIF(total, 0)) * 100 >= threshold THEN 'PASS'
    ELSE 'FAIL'
  END AS status
FROM quality_checks;
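For pipelines that do not run their checks in SQL, the same pass-rate arithmetic can be sketched in Python. This is a minimal illustration: the function names are hypothetical, thresholds are assumed to be written as strings such as `99%` as in the contract template, and an empty dataset is treated as a failure by choice.

```python
def pass_rate(failures: int, total: int) -> float:
    """Percentage of rows that passed; an empty dataset counts as 0% here."""
    if total == 0:
        return 0.0
    return (1 - failures / total) * 100


def parse_threshold(raw: str) -> float:
    """Turn a contract threshold such as '99%' into the number 99.0."""
    return float(raw.strip().rstrip("%"))


def evaluate(check_name: str, failures: int, total: int, threshold: str) -> dict:
    """Compare the observed pass rate with the threshold declared for this check."""
    rate = pass_rate(failures, total)
    status = "PASS" if rate >= parse_threshold(threshold) else "FAIL"
    return {"check": check_name, "pass_rate": rate, "status": status}
```

Each check is compared against its own threshold, so a check declared at `100%` fails on a single bad row while one declared at `99%` tolerates up to 1% failures.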
Breaking Change Detection
# Compare contract versions
datacontract diff v1.1.0 v2.0.0
# Example output:
# BREAKING CHANGES:
# - Removed field 'age' (was required)
# - Changed type of 'phone' from string to number
#
# COMPATIBLE CHANGES:
# - Added optional field 'address'
# - Added new quality check 'valid_status'
Breaking vs Non-Breaking
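# BREAKING (requires producer or consumer updates):
- Remove required field
- Change field type
- Rename field
- Add new required field
- Stricter validation (previously valid values are now rejected)
# NON-BREAKING (backward compatible):
- Add optional field
- Remove optional field (only if no consumer reads it; otherwise treat it as breaking)
- Relax validation
- Add new quality check (as long as existing data already passes it)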
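The field-level rules in this list can be applied mechanically. The sketch below compares two schema versions and sorts the differences into breaking and compatible changes. The input shape (`fields` mapping names to types, plus a `required` list) and the function name `diff_schemas` are simplifying assumptions. A rename shows up as a removal plus an addition, and the code cannot know whether any consumer reads an optional field, so removals of optional fields still deserve a manual look.

```python
def diff_schemas(old: dict, new: dict) -> dict:
    """Classify changes between two {'fields': {name: type}, 'required': [...]} schemas."""
    breaking, compatible = [], []
    old_fields, new_fields = old["fields"], new["fields"]
    old_req, new_req = set(old["required"]), set(new["required"])

    # Fields that disappeared
    for name in old_fields.keys() - new_fields.keys():
        if name in old_req:
            breaking.append(f"removed required field '{name}'")
        else:
            compatible.append(f"removed optional field '{name}'")

    # Fields that were added
    for name in new_fields.keys() - old_fields.keys():
        if name in new_req:
            breaking.append(f"added required field '{name}'")
        else:
            compatible.append(f"added optional field '{name}'")

    # Fields present in both versions
    for name in old_fields.keys() & new_fields.keys():
        if old_fields[name] != new_fields[name]:
            breaking.append(
                f"changed type of '{name}' from {old_fields[name]} to {new_fields[name]}"
            )
        elif name in new_req and name not in old_req:
            breaking.append(f"field '{name}' became required")
        elif name in old_req and name not in new_req:
            compatible.append(f"field '{name}' became optional")

    # Sort for stable, reviewable output
    return {"breaking": sorted(breaking), "compatible": sorted(compatible)}
```

A non-empty `breaking` list is a natural signal to require a major version bump before the change is merged.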
Contract Registry
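// contracts/registry.ts
// Assumption: `Contract` is a project-local loader class; adjust the import path.
import { Contract } from './contract';

interface ContractEntry {
  version: string;
  path: string;
  owner: string;
  consumers: string[];
}

export const contracts: Record<string, ContractEntry> = {
  users: {
    version: '1.1.0',
    path: 'contracts/users.contract.yaml',
    owner: 'user-team',
    consumers: ['analytics-team', 'marketing-team', 'billing-service']
  },
  orders: {
    version: '2.0.0',
    path: 'contracts/orders.contract.yaml',
    owner: 'order-team',
    consumers: ['billing', 'shipping']
  }
};

// Get contract; falls back to the registered version when none is given
export function getContract(name: string, version?: string) {
  const entry = contracts[name];
  if (!entry) {
    throw new Error(`Contract ${name} not found`);
  }
  return Contract.load(entry.path, version ?? entry.version);
}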
CI/CD Integration
# .github/workflows/contract-validation.yml
# Assumes the datacontract CLI is already installed on the runner.
name: Contract Validation
on: [pull_request]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # full history so main can be compared with HEAD
      - name: Validate Contract Schema
        run: |
          datacontract validate contracts/*.yaml
      - name: Check Breaking Changes
        run: |
          # Steps run with `bash -e`, so test the command directly instead of inspecting $?
          if ! datacontract diff main HEAD; then
            echo "Breaking changes detected!"
            exit 1
          fi
      - name: Test Data Quality
        run: |
          python scripts/test_contracts.py
Monitoring
# Monitor contract SLAs
# Assumes project helpers: get_contract, get_last_update_time (Unix seconds),
# run_quality_check (returns a pass rate in percent), and alert.
import time
from prometheus_client import Gauge

# Metrics
freshness_gauge = Gauge('data_freshness_seconds', 'Data freshness', ['dataset'])
quality_gauge = Gauge('data_quality_score', 'Quality score', ['dataset', 'check'])

def monitor_contract(contract_name: str):
    contract = get_contract(contract_name)
    # Check freshness
    last_update = get_last_update_time(contract_name)
    freshness = time.time() - last_update
    freshness_gauge.labels(dataset=contract_name).set(freshness)
    # Check quality
    for check in contract.quality:
        score = run_quality_check(contract_name, check)
        quality_gauge.labels(
            dataset=contract_name,
            check=check.name
        ).set(score)
        # Thresholds are written as strings such as "99%" in the contract
        threshold = float(str(check.threshold).rstrip('%'))
        # Alert if below threshold
        if score < threshold:
            alert(f"{contract_name}: {check.name} below threshold")
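The monitoring code records freshness as a metric; deciding whether the `freshness` SLA is actually breached needs one more step. The sketch below parses values such as `1 hour` and compares them with the age of the last update. The supported units and the function names `parse_freshness` and `is_stale` are assumptions for illustration only.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# Maps the unit word used in the contract to the matching timedelta keyword
_UNITS = {"minute": "minutes", "hour": "hours", "day": "days"}


def parse_freshness(raw: str) -> timedelta:
    """Parse values such as '1 hour' or '30 minutes' into a timedelta."""
    match = re.fullmatch(r"\s*(\d+)\s*(minute|hour|day)s?\s*", raw)
    if match is None:
        raise ValueError(f"unsupported freshness value: {raw!r}")
    amount, unit = int(match.group(1)), match.group(2)
    return timedelta(**{_UNITS[unit]: amount})


def is_stale(last_update: datetime, freshness: str,
             now: Optional[datetime] = None) -> bool:
    """True when the data is older than the freshness SLA allows."""
    now = now or datetime.now(timezone.utc)
    return now - last_update > parse_freshness(freshness)
```

Passing `now` explicitly keeps the check deterministic in tests; in production it defaults to the current UTC time, so `last_update` should be timezone-aware.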
Best Practices
1. Version Semantically
1.0.0 → 1.0.1: Bug fix (patch)
1.0.0 → 1.1.0: New optional field (minor)
1.0.0 → 2.0.0: Breaking change (major)
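The three bump rules can be sketched as a small helper. The change labels `fix`, `compatible`, and `breaking` are names chosen for this example, not part of any standard.

```python
def bump_version(version: str, change: str) -> str:
    """Return the next semantic version for a given kind of change."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "breaking":
        return f"{major + 1}.0.0"             # major bump resets minor and patch
    if change == "compatible":
        return f"{major}.{minor + 1}.0"       # minor bump resets patch
    if change == "fix":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change!r}")
```

For example, `bump_version("1.0.0", "compatible")` returns `"1.1.0"`.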
2. Document Changes
changelog:
  - version: 2.0.0
    date: 2024-01-20
    changes: |
      BREAKING: Removed 'age' field
      Reason: Privacy compliance
      Migration: Use 'birth_year' instead
3. Notify Consumers
Before a breaking change:
1. Announce in #data-platform
2. Email all consumers
3. Provide a migration guide
4. Set a deprecation timeline (for example, 30 days)
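The announcement text and the deprecation date can be generated from the contract metadata so that every notice carries the same information. This is only a sketch: `deprecation_notice` and its parameters are hypothetical, and the 30-day default simply mirrors the timeline mentioned in the steps.

```python
from datetime import date, timedelta


def deprecation_notice(contract: str, change: str, announced: date,
                       consumers: list, notice_days: int = 30) -> str:
    """Build a plain-text notice with the earliest date the change may ship."""
    effective = announced + timedelta(days=notice_days)
    recipients = ", ".join(sorted(consumers))
    return (
        f"[{contract}] Breaking change: {change}\n"
        f"Announced: {announced.isoformat()}\n"
        f"Earliest effective date: {effective.isoformat()}\n"
        f"Consumers to notify: {recipients}"
    )
```

The consumer list would typically come from the `consumers` section of the contract, which is one reason to keep that section current.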
4. Test Contracts
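# `Contract` and `validate` come from the same module used in the Python example above
from datacontract import Contract, validate

def test_user_contract():
    contract = Contract.load('contracts/users.contract.yaml')
    # Test valid data
    valid_data = {
        'id': '123e4567-e89b-12d3-a456-426614174000',  # must satisfy format: uuid
        'email': 'test@example.com',
        'created_at': '2024-01-16T12:00:00Z',
        'status': 'active'
    }
    assert validate(valid_data, contract).passed
    # Test invalid data
    invalid_data = {'id': '123e4567-e89b-12d3-a456-426614174000'}  # Missing required fields
    assert not validate(invalid_data, contract).passed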
Summary
Data Contracts: define the schema, quality expectations, and SLAs
Components:
- Schema (fields, types, required)
- Quality checks (validation rules)
- SLAs (freshness, availability, latency)
- Ownership (producer, consumers)
Versioning:
- Semantic versioning (major.minor.patch)
- Breaking vs non-breaking changes
- Changelog documentation
Enforcement:
- Validation in CI/CD
- Quality monitoring
- SLA tracking
- Consumer notifications
Benefits:
- Trust in data
- Clear expectations
- Independent evolution
- Early error detection