Platform & Infrastructure Improvement Plan
B2B Analytics SaaS — Series B, 50 Engineers
1. Shared Capabilities Plan
1.1 Problem Statement
Multiple feature teams are independently rebuilding export functionality, filtering logic, and permission checks. This results in:
- Duplicated effort: Engineering hours wasted re-implementing solved problems
- Inconsistent behavior: Users experience different filtering semantics, export formats, and permission enforcement across features
- Maintenance burden: Bug fixes and improvements must be applied in multiple places
- Security risk: Permission checks implemented ad hoc increase the surface area for authorization bugs
1.2 Shared Capability Inventory
| Capability | Current State | Estimated Duplication |
|---|---|---|
| Export (CSV, Excel, PDF) | Rebuilt per feature team | 4-6 implementations |
| Filtering & query building | Custom per feature | 5+ implementations |
| Permission / authorization checks | Inline per endpoint | Scattered across codebase |
| Pagination | Mixed cursor/offset | 3-4 implementations |
| Audit logging | Partial coverage | Inconsistent |
| Rate limiting | Ad hoc or missing | Varies |
1.3 Recommended Shared Services Architecture
A. Export Service
- Scope: Centralized async export pipeline supporting CSV, Excel, PDF, and scheduled exports
- Design:
- Accept a standardized export request (data source, filters, format, recipient)
- Use a job queue (e.g., Redis-backed Sidekiq, or a dedicated task queue) for async processing
- Store generated files in object storage (S3/GCS) with signed URLs
- Notify users via webhook/email on completion
- API Contract:
POST /api/v1/exports
{
  "source": "dashboard_metrics",
  "filters": { ... },
  "format": "csv",
  "notify": "email"
}
- Benefits: Uniform format support, centralized size limits, progress tracking, retry logic
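To make the design concrete, here is a minimal sketch of the worker side of the pipeline, assuming boto3 for object storage; the bucket name, the run_export signature, and the notify_user helper are illustrative placeholders, not existing code.

```python
import csv
import tempfile

import boto3

EXPORT_BUCKET = "example-exports-bucket"  # placeholder bucket name


def notify_user(recipient: str, url: str) -> None:
    """Stand-in for the real email/webhook notification."""
    print(f"export ready for {recipient}: {url}")


def run_export(export_id: str, rows: list[dict], recipient: str) -> str:
    """Write rows to CSV, upload to object storage, and return a signed URL."""
    if not rows:
        raise ValueError("nothing to export")
    with tempfile.NamedTemporaryFile("w", suffix=".csv", newline="", delete=False) as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
        path = f.name

    s3 = boto3.client("s3")
    key = f"exports/{export_id}.csv"
    s3.upload_file(path, EXPORT_BUCKET, key)

    # Signed URL lets the user download the file without a public bucket.
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": EXPORT_BUCKET, "Key": key},
        ExpiresIn=3600,
    )
    notify_user(recipient, url)
    return url
```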
B. Filtering & Query Engine
- Scope: Shared filter definition, parsing, validation, and SQL generation
- Design:
- Define a filter DSL (JSON-based) that all feature teams use
- Central library validates filter syntax, prevents injection, applies tenant scoping
- Generates parameterized SQL or query builder calls
- Supports common operations: equality, range, contains, in-list, date ranges, null checks
- Filter Schema Example:
{
  "filters": [
    { "field": "created_at", "op": "gte", "value": "2026-01-01" },
    { "field": "status", "op": "in", "value": ["active", "trial"] }
  ],
  "sort": { "field": "created_at", "dir": "desc" }
}
- Benefits: Consistent UX, single place for query optimization, SQL injection prevention
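A hedged sketch of how the library could translate that DSL into a parameterized WHERE clause; the operator map, the field allow-list, and the psycopg-style %s placeholders are assumptions for illustration, not an existing implementation.

```python
# Operators are an allow-list; anything outside it is rejected before SQL is built.
OPS = {
    "eq": "= %s",
    "gte": ">= %s",
    "lte": "<= %s",
    "contains": "ILIKE %s",
    "in": "= ANY(%s)",
    "is_null": "IS NULL",
}
ALLOWED_FIELDS = {"created_at", "status", "plan"}  # per-data-source allow-list


def build_where(filters: list[dict], tenant_id: str) -> tuple[str, list]:
    clauses, params = ["tenant_id = %s"], [tenant_id]  # tenant scoping is not optional
    for f in filters:
        field, op = f["field"], f["op"]
        if field not in ALLOWED_FIELDS or op not in OPS:
            raise ValueError(f"unsupported filter: {field} {op}")
        clauses.append(f"{field} {OPS[op]}")
        if op != "is_null":
            params.append(f"%{f['value']}%" if op == "contains" else f["value"])
    return " AND ".join(clauses), params


where, params = build_where(
    [{"field": "created_at", "op": "gte", "value": "2026-01-01"},
     {"field": "status", "op": "in", "value": ["active", "trial"]}],
    tenant_id="00000000-0000-0000-0000-000000000000",
)
# where  -> "tenant_id = %s AND created_at >= %s AND status = ANY(%s)"
# params -> ["00000000-...", "2026-01-01", ["active", "trial"]]
```

Because fields and operators are validated against allow-lists and every value travels as a bind parameter, injection prevention and tenant scoping live in one place rather than in each feature team's query code.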
C. Authorization Service
- Scope: Centralized permission evaluation replacing inline checks
- Design:
- Implement a policy engine (consider OPA/Rego, Casbin, or custom RBAC/ABAC service)
- Define resources, actions, and roles in a central policy store
- Expose a lightweight SDK that feature teams call:
authz.check(user, action, resource)
- Cache permission evaluations with short TTLs for hot paths
- Emit audit events for every authorization decision
- Migration Path:
- Catalog all existing permission checks across the codebase
- Map them to a unified role/permission model
- Implement the central service with a compatibility shim
- Migrate teams incrementally (one feature area per sprint)
- Remove legacy inline checks after validation
- Benefits: Single source of truth for access control, auditable, easier compliance (SOC 2, etc.)
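For illustration, a sketch of what the SDK's check path could look like with the short-TTL cache and audit emission described above; the policy client's evaluate() and the audit sink's emit() are hypothetical interfaces standing in for whichever engine (OPA, Casbin, or a custom service) is ultimately selected.

```python
import time


class AuthzClient:
    def __init__(self, policy_client, audit_sink, ttl_seconds: float = 30.0):
        self._policy = policy_client   # wraps OPA/Casbin/custom policy service
        self._audit = audit_sink       # e.g. a Kafka topic or append-only table
        self._ttl = ttl_seconds
        self._cache: dict[tuple, tuple[bool, float]] = {}

    def check(self, user_id: str, action: str, resource: str) -> bool:
        key = (user_id, action, resource)
        hit = self._cache.get(key)
        if hit and time.monotonic() - hit[1] < self._ttl:
            allowed = hit[0]
        else:
            allowed = self._policy.evaluate(user_id, action, resource)
            self._cache[key] = (allowed, time.monotonic())
        # Emit an audit event for every decision, cached or not,
        # so the audit trail stays complete.
        self._audit.emit({"user": user_id, "action": action,
                          "resource": resource, "allowed": allowed})
        return allowed
```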
1.4 Platform Team Structure
- Recommendation: Form a dedicated Platform Engineering team (3-5 engineers initially)
- Charter: Own shared capabilities, define APIs, provide SDKs, maintain SLOs for platform services
- Operating Model:
- Platform team builds and maintains core services
- Feature teams consume via well-documented SDKs and APIs
- Platform team holds office hours and reviews integration PRs
- Shared capabilities are versioned; breaking changes follow a deprecation policy
1.5 Migration Strategy
| Phase | Duration | Focus |
|---|---|---|
| Phase 1 | Weeks 1-4 | Catalog existing implementations, define unified APIs |
| Phase 2 | Weeks 5-10 | Build Export Service and Filter Engine v1 |
| Phase 3 | Weeks 8-14 | Build Authorization Service v1 (overlapping) |
| Phase 4 | Weeks 12-20 | Migrate feature teams (1-2 teams per sprint) |
| Phase 5 | Weeks 20-24 | Deprecate and remove legacy implementations |
2. Postgres Scaling Analysis
2.1 Current State Assessment
| Metric | Current | Concern Level |
|---|---|---|
| Database size | 500 GB | Moderate — approaching practical limits for a single node |
| Query latency | Increasing | High — user-facing impact |
| Expected growth | 5x in 6 months | Critical — 2.5 TB without intervention |
| Connection count | Likely growing with team/feature count | Moderate |
| Write volume | Unknown but likely growing with analytics ingestion | High |
2.2 Immediate Optimizations (Weeks 1-4)
These require no architectural changes and can deliver quick wins:
A. Query Performance Audit
- Run pg_stat_statements to identify the top 20 slowest and most frequent queries
- Use EXPLAIN ANALYZE on each; look for sequential scans, missing indexes, and poor join plans
- Add targeted indexes (partial indexes, covering indexes, expression indexes as needed)
- Identify and fix N+1 query patterns in application code
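As a starting point for the audit, a small script of the kind we would expect to run, assuming Postgres 13+ column names in pg_stat_statements (total_exec_time / mean_exec_time; older versions use total_time / mean_time) and a placeholder connection string.

```python
import psycopg2

DSN = "postgresql://readonly@db-primary/analytics"  # placeholder DSN

TOP_QUERIES = """
    SELECT queryid, calls, mean_exec_time, total_exec_time, left(query, 120) AS query
    FROM pg_stat_statements
    ORDER BY total_exec_time DESC
    LIMIT 20;
"""

# Print the queries consuming the most cumulative execution time.
with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
    cur.execute(TOP_QUERIES)
    for queryid, calls, mean_ms, total_ms, query in cur.fetchall():
        print(f"{total_ms:12.1f}ms total  {mean_ms:8.2f}ms avg  {calls:10d} calls  {query}")
```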
B. Connection Pooling
- Deploy PgBouncer in transaction mode if not already present
- Target: reduce active connections from potentially hundreds to a managed pool (50-100 server connections)
- This alone often improves p99 latency significantly
C. Configuration Tuning
- Review and optimize shared_buffers (25% of RAM), effective_cache_size (75% of RAM), work_mem, and maintenance_work_mem
- Enable pg_stat_monitor for detailed query analytics
- Set random_page_cost appropriately for SSD storage (1.1-1.5)
D. Table Maintenance
- Identify bloated tables with pgstattuple; run targeted VACUUM FULL (or pg_repack, which avoids the long exclusive lock) during maintenance windows
- Check for unused indexes consuming write overhead
- Identify the largest tables (likely time-series analytics data) as candidates for native declarative range partitioning; the partitioning work itself is scheduled in the medium-term plan (see 2.3 B)
2.3 Medium-Term Scaling (Weeks 4-12)
A. Read Replicas
- Deploy 1-2 read replicas for analytics/reporting queries
- Route read traffic at the application level (or via a proxy such as Pgpool-II); a minimal routing sketch follows this list
- Target: offload 60-80% of read traffic from primary
- Acceptable replication lag: <1 second for most analytics use cases
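A minimal application-level routing sketch using SQLAlchemy, purely illustrative: the engine URLs and the statement-prefix heuristic are assumptions, and a real implementation would also pin recent writers to the primary to mask replication lag.

```python
import random

from sqlalchemy import create_engine, text

primary = create_engine("postgresql+psycopg2://app@db-primary/analytics")
replicas = [
    create_engine("postgresql+psycopg2://app@db-replica-1/analytics"),
    create_engine("postgresql+psycopg2://app@db-replica-2/analytics"),
]


def engine_for(statement: str):
    """Route SELECTs to a replica; everything else goes to the primary."""
    if statement.lstrip().lower().startswith("select"):
        return random.choice(replicas)
    return primary


stmt = "SELECT count(*) FROM events WHERE tenant_id = :tid"
with engine_for(stmt).connect() as conn:
    # Placeholder tenant id purely for illustration.
    print(conn.execute(text(stmt), {"tid": "00000000-0000-0000-0000-000000000000"}).scalar())
```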
B. Table Partitioning
- Partition the largest tables (analytics events, audit logs, time-series metrics) by month or week
- Benefits: faster queries with partition pruning, easier data lifecycle management, parallel query execution
- Implementation:
CREATE TABLE events (
  id bigint GENERATED ALWAYS AS IDENTITY,
  tenant_id uuid NOT NULL,
  created_at timestamptz NOT NULL,
  ...
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2026_q1 PARTITION OF events
  FOR VALUES FROM ('2026-01-01') TO ('2026-04-01');
C. Data Archival Strategy
- Define retention policies: hot data (0-90 days on the primary), warm data (90-365 days in a separate warm store or lower-cost storage), cold data (>1 year in object storage/Parquet)
- Implement automated archival jobs that move old partitions
- Target: keep the primary under ~600 GB even with 5x data growth
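A rough sketch of the archival job, assuming psycopg2 and boto3; the partition name, DSN, and bucket are placeholders, and the Parquet conversion mentioned above is omitted for brevity (CSV shown instead).

```python
import io

import boto3
import psycopg2

DSN = "postgresql://archiver@db-primary/analytics"  # placeholder
ARCHIVE_BUCKET = "example-cold-storage"             # placeholder


def archive_partition(partition: str) -> None:
    """Detach an aged partition, dump it to object storage, then drop it."""
    buf = io.BytesIO()
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        # Partition names come from our own scheduler, never from user input.
        cur.execute(f"ALTER TABLE events DETACH PARTITION {partition};")
        cur.copy_expert(f"COPY {partition} TO STDOUT WITH (FORMAT csv, HEADER)", buf)
        boto3.client("s3").put_object(
            Bucket=ARCHIVE_BUCKET, Key=f"events/{partition}.csv", Body=buf.getvalue()
        )
        cur.execute(f"DROP TABLE {partition};")  # only after the upload succeeded


archive_partition("events_2025_01")
```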
D. Caching Layer
- Add Redis/Memcached for frequently-read, slowly-changing data:
- Permission/role lookups (TTL: 60s)
- Dashboard metadata and configuration (TTL: 300s)
- Aggregated analytics results (TTL: varies by freshness requirement)
- Target: 40-60% cache hit rate for read queries, reducing DB load proportionally
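The permission-lookup cache would likely follow a simple cache-aside pattern; here is a sketch using redis-py with the 60-second TTL from the list above, where load_permissions_from_db is a hypothetical stand-in for the real query.

```python
import json

import redis

r = redis.Redis(host="cache", port=6379)


def load_permissions_from_db(user_id: str) -> list[str]:
    """Stand-in for the real database lookup."""
    return ["dashboard.read"]


def get_permissions(user_id: str) -> list[str]:
    key = f"perms:{user_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)          # cache hit: skip the database entirely
    perms = load_permissions_from_db(user_id)
    r.set(key, json.dumps(perms), ex=60)   # TTL matches the 60s target above
    return perms
```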
2.4 Long-Term Architecture (Months 3-6)
A. Analytical Workload Separation (OLAP Split)
- Move heavy analytical queries to a dedicated analytical store
- Options:
- ClickHouse: Excellent for time-series analytics, columnar storage, fast aggregations
- TimescaleDB: If staying in the Postgres ecosystem is important
- BigQuery/Redshift: If preferring managed services
- Stream data from Postgres via CDC (Debezium/Kafka Connect) to the analytical store
- Feature teams query the analytical store for dashboards and reports; Postgres handles transactional workloads
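To illustrate the split from a feature team's point of view, a sketch assuming ClickHouse is chosen and the clickhouse-driver Python client is used; hosts, table names, and the placeholder tenant/account IDs are made up.

```python
import psycopg2
from clickhouse_driver import Client

clickhouse = Client(host="olap-store")                                   # placeholder host
postgres = psycopg2.connect("postgresql://app@db-primary/analytics")    # placeholder DSN

# Heavy aggregation goes to the columnar store (fast scans, pruning by date).
daily_counts = clickhouse.execute(
    "SELECT toDate(created_at) AS day, count() FROM events "
    "WHERE tenant_id = %(tid)s GROUP BY day ORDER BY day",
    {"tid": "00000000-0000-0000-0000-000000000000"},
)

# Point lookups and transactional reads stay on Postgres.
with postgres.cursor() as cur:
    cur.execute("SELECT plan, status FROM accounts WHERE id = %s",
                ("00000000-0000-0000-0000-000000000000",))
    account = cur.fetchone()
```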
B. Sharding Evaluation
- If single-tenant queries dominate, consider tenant-based sharding using Citus or application-level sharding
- For a B2B SaaS with analytics, tenant-based sharding is natural since queries rarely cross tenant boundaries
- Citus (Postgres extension) can shard by tenant_id with minimal application changes
- Decision point: pursue sharding only if the OLAP split + read replicas + partitioning are insufficient
C. Connection Architecture
- Evaluate upgrading to a service mesh or sidecar proxy pattern for database connections
- Consider Postgres logical replication for zero-downtime migrations
2.5 Projected Capacity Timeline
| Milestone | Database Size | Strategy | Target Latency (p95) |
|---|---|---|---|
| Today | 500 GB | Single node | Degrading |
| Month 1 | 550 GB | Optimized queries + pooling | <200ms |
| Month 3 | 700 GB | Read replicas + partitioning + caching | <150ms |
| Month 6 | ~600 GB primary + archive | OLAP split + archival | <100ms transactional, <2s analytical |
| Month 6 (without action) | ~2.5 TB | No intervention | >1s, outages likely |
3. Reliability SLOs
3.1 SLO Framework
We recommend defining SLOs across four dimensions: availability, latency, correctness, and freshness.
3.2 Proposed SLOs
Tier 1 — User-Facing API (Dashboard, Core Analytics)
| SLI | SLO Target | Measurement |
|---|---|---|
| Availability | 99.9% (~8.8 hours downtime/year) | Successful responses / total requests (4xx not counted as failures) |
| Latency (p50) | < 100ms | Server-side request duration |
| Latency (p95) | < 500ms | Server-side request duration |
| Latency (p99) | < 1,500ms | Server-side request duration |
| Error rate | < 0.1% | 5xx responses / total responses |
Tier 2 — Export & Batch Processing
| SLI | SLO Target | Measurement |
|---|---|---|
| Availability | 99.5% | Job processing success rate |
| Completion time (p95) | < 5 minutes for exports under 100K rows | Job start to file available |
| Completion time (p95) | < 30 minutes for exports under 1M rows | Job start to file available |
| Data correctness | 100% | Export matches query results at time of request |
Tier 3 — Internal Platform Services (AuthZ, Filtering)
| SLI | SLO Target | Measurement |
|---|---|---|
| Availability | 99.95% | Successful responses / total requests; set higher than Tier 1 because these services are upstream dependencies |
| Latency (p99) | < 50ms | Authorization check duration |
| Correctness | 100% | No false-positive authorizations ever |
| Cache hit rate | > 80% | For permission lookups |
Tier 4 — Database & Infrastructure
| SLI | SLO Target | Measurement |
|---|---|---|
| Database availability | 99.95% | Connection success rate |
| Replication lag | < 5 seconds (p99) | Replica lag metric |
| Query latency (p95) | < 200ms for transactional queries | pg_stat_statements |
| Backup RPO | < 1 hour | Time since last successful backup |
| Backup RTO | < 4 hours | Time to restore from backup |
3.3 Error Budget Policy
- Budget: With 99.9% availability SLO, the error budget is 0.1% per rolling 30-day window (~43 minutes/month)
- When budget is healthy (>50% remaining): Normal feature velocity, deploy at will
- When budget is stressed (10-50% remaining): Require additional review for risky deploys, increase monitoring
- When budget is exhausted (<10% remaining): Freeze non-critical deploys, all engineering effort focuses on reliability until budget recovers
- Review cadence: Weekly SLO review in engineering standup, monthly SLO review with leadership
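For reference, the arithmetic behind the ~43-minute budget and a simple burn check, shown as a small calculation; the request counts are made up for illustration.

```python
# Error-budget arithmetic for a 99.9% availability SLO over a rolling 30-day window.
WINDOW_MINUTES = 30 * 24 * 60                  # 43,200 minutes in the window
SLO = 0.999
budget_minutes = WINDOW_MINUTES * (1 - SLO)    # ~43.2 minutes of "allowed" downtime

total_requests = 120_000_000                   # illustrative traffic volume
failed_requests = 36_000                       # illustrative 5xx count
error_rate = failed_requests / total_requests  # 0.0003 = 0.03%
budget_consumed = error_rate / (1 - SLO)       # 0.3 -> 30% of the budget used

print(f"budget: {budget_minutes:.1f} min, consumed: {budget_consumed:.0%}")
```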
3.4 Observability Requirements
To measure these SLOs, implement:
- Metrics: Prometheus/Datadog for request latency histograms, error rates, saturation
- Logging: Structured JSON logs with request IDs, tenant IDs, and latency breakdowns
- Tracing: Distributed tracing (Jaeger/Datadog APM) across all services, especially for the shared capability calls
- Dashboards: One SLO dashboard per tier, visible to all engineers
- Alerting:
- Page (PagerDuty): >1% error rate sustained for 5 minutes, database primary down
- Warn (Slack): SLO burn rate exceeding budget, replication lag >10s, export queue depth >100
- SLO Tracking Tool: Consider Nobl9, Sloth, or a custom Prometheus recording rules setup
4. Execution Roadmap
4.1 Overview
The plan spans 6 months, divided into three phases. Each phase delivers independent value while building toward the full vision.
Months 1-2: STABILIZE — Quick wins, stop the bleeding
Months 3-4: PLATFORM — Build shared capabilities, scale reads
Months 5-6: SCALE — Analytical workload separation, full migration
4.2 Phase 1: Stabilize (Weeks 1-8)
Goal: Reduce immediate pain, establish foundations
| Week | Workstream | Deliverable | Owner | Dependencies |
|---|---|---|---|---|
| 1-2 | DB Performance | Query audit report; top 10 queries optimized | 1 Senior SRE + 1 Backend | pg_stat_statements access |
| 1-2 | Observability | SLO dashboards for Tier 1 and Tier 4 | 1 SRE | Metrics pipeline |
| 1-3 | Connection Pooling | PgBouncer deployed and configured | 1 SRE | Staging environment |
| 2-4 | Shared Capabilities | Catalog of all existing export/filter/authz implementations | Platform lead + feature team leads | Codebase access, interviews |
| 3-4 | DB Performance | Missing indexes added; bloated tables vacuumed | 1 SRE | Maintenance window |
| 3-6 | Read Replicas | 2 read replicas provisioned; read routing implemented | 1 SRE + 1 Backend | Infrastructure provisioning |
| 4-8 | Platform Team | Platform team formed (3 engineers); charter defined | Engineering leadership | Hiring/reallocation |
| 5-8 | API Design | Shared Export, Filter, and AuthZ API specs finalized (RFC process) | Platform team | Stakeholder review |
Phase 1 Success Criteria:
- p95 API latency reduced by 30%
- SLO dashboards operational
- Platform team staffed and chartered
- API specs approved by feature team leads
4.3 Phase 2: Platform Build (Weeks 9-16)
Goal: Deliver shared capabilities v1, partition database, begin migrations
| Week | Workstream | Deliverable | Owner | Dependencies |
|---|---|---|---|---|
| 9-12 | Export Service | Export Service v1 (CSV, Excel) with async processing | Platform team (2 eng) | Object storage, job queue |
| 9-12 | Filter Engine | Shared filter library v1 with SDK | Platform team (1 eng) | API spec |
| 9-14 | AuthZ Service | Authorization service v1 with policy engine | Platform team (2 eng) + 1 Security | Policy model |
| 10-12 | DB Partitioning | Top 3 largest tables partitioned by date range | 1 SRE + 1 Backend | Maintenance window, testing |
| 11-13 | Caching | Redis caching layer for permission lookups and dashboard metadata | 1 Backend | Redis infrastructure |
| 12-14 | Migration: Team 1 | First feature team migrated to shared Export + Filter | Platform team + Feature team 1 | Export Service v1, Filter v1 |
| 14-16 | Migration: Team 2-3 | Two more feature teams migrated | Platform team support | v1 services stable |
Phase 2 Success Criteria:
- Export Service handling 100% of Team 1's exports
- Filter Engine SDK adopted by 3 teams
- Authorization service evaluating permissions for at least 1 feature area
- Database primary growth contained (~700 GB or less) with partitioning active and the archival policy approved
- p95 latency < 200ms
4.4 Phase 3: Scale (Weeks 17-24)
Goal: Handle 5x traffic, complete migrations, operational maturity
| Week | Workstream | Deliverable | Owner | Dependencies |
|---|---|---|---|---|
| 17-19 | OLAP Split | Analytical store deployed (ClickHouse or TimescaleDB); CDC pipeline operational | 1 SRE + 1 Data eng | Kafka/CDC infrastructure |
| 17-20 | Migration: All Teams | Remaining feature teams migrated to shared capabilities | Platform team + all feature teams | Stable v1 services |
| 18-20 | Dashboard Queries | Analytical dashboard queries routed to OLAP store | Backend + Platform | OLAP store populated |
| 19-22 | Data Archival | Automated archival pipeline moving cold data to object storage | 1 SRE | Archival policy approved |
| 20-22 | Load Testing | Comprehensive load test at 5x current traffic | SRE team | Staging environment at scale |
| 21-23 | Legacy Cleanup | Remove deprecated export/filter/authz implementations | Feature teams | All migrations complete |
| 22-24 | Operational Runbooks | Runbooks for all platform services; on-call rotation established | Platform + SRE | Service stability |
| 24 | SLO Review | Formal SLO review; adjust targets based on 3 months of data | Engineering leadership | SLO data |
Phase 3 Success Criteria:
- System handles 5x current traffic in load tests with SLOs met
- All feature teams consuming shared capabilities (zero legacy export/filter/authz code)
- OLAP store handling 100% of analytical queries
- Database primary stable at <600 GB with archival active
- Error budget healthy across all tiers
- On-call rotation staffed and runbooks complete
4.5 Staffing Summary
| Role | Phase 1 | Phase 2 | Phase 3 | Notes |
|---|---|---|---|---|
| Platform Engineers | 0 → 3 | 3-4 | 3-4 | Core platform team |
| SRE | 2 | 2 | 2 | DB scaling, infrastructure |
| Backend (borrowed from feature teams) | 1-2 | 1 | 0 | Migration support |
| Data Engineer | 0 | 0 | 1 | OLAP pipeline |
| Total dedicated | 3-5 | 6-7 | 6-7 | ~12-14% of 50 engineers |
4.6 Risk Register
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Database outage before scaling completes | Medium | Critical | Prioritize read replicas and connection pooling in Phase 1 |
| Feature teams resist adopting shared capabilities | Medium | High | Executive sponsorship; demonstrate time savings; make migration easy with SDKs |
| Platform team understaffed due to competing priorities | High | High | Dedicated headcount commitment from leadership; protect from feature work |
| Shared service becomes bottleneck/single point of failure | Medium | High | Design for high availability from day 1; circuit breakers; graceful degradation |
| OLAP migration introduces data inconsistencies | Medium | Medium | Dual-write validation period; automated reconciliation checks |
| 5x traffic arrives sooner than expected | Low | Critical | Phase 1 quick wins provide 2-3x headroom; can accelerate Phase 2 |
4.7 Key Decision Points
- Week 4: Finalize analytical store choice (ClickHouse vs. TimescaleDB vs. managed service)
- Week 8: Decide on authorization model (RBAC vs. ABAC vs. hybrid) based on catalog findings
- Week 12: Evaluate whether Citus sharding is needed based on partitioning + OLAP split results
- Week 16: Go/no-go on Phase 3 timeline based on Phase 2 migration velocity
- Week 20: Assess whether SLO targets need adjustment based on real-world data
5. Summary of Expected Outcomes
| Metric | Today | After 6 Months |
|---|---|---|
| Duplicate export implementations | 4-6 | 1 (shared service) |
| Duplicate filter implementations | 5+ | 1 (shared library) |
| Permission check consistency | Ad hoc, inconsistent | Centralized, auditable |
| Database size (primary) | 500 GB, growing | <600 GB (with archival) |
| API latency (p95) | Degrading (>500ms) | <200ms transactional |
| Traffic capacity | ~1x (at limit) | 5-10x headroom |
| Availability SLO | Unmeasured | 99.9% measured and tracked |
| Engineering time on redundant infra work | ~15-20% | <5% |
| Time to add export to new feature | 2-3 weeks | 1-2 days (SDK integration) |
This plan should be reviewed with engineering leadership, SRE, and feature team leads before execution begins. Priorities may shift based on the query audit findings in Weeks 1-2 and the capability catalog in Weeks 2-4.