Platform & Infrastructure Improvement Plan
B2B Analytics SaaS — Series B, 50 Engineers
1. Shared Capabilities Plan
1.1 Problem Statement
Multiple feature teams are independently rebuilding export functionality, filtering logic, and permission checks. This results in:
- Duplicated effort: Engineering hours wasted re-implementing solved problems
- Inconsistent behavior: Users experience different filtering semantics, export formats, and permission enforcement across features
- Maintenance burden: Bug fixes and improvements must be applied in multiple places
- Security risk: Permission checks implemented ad hoc increase the surface area for authorization bugs
1.2 Shared Capability Inventory
| Capability | Current State | Estimated Duplication |
|---|---|---|
| Export (CSV, Excel, PDF) | Rebuilt per feature team | 4-6 implementations |
| Filtering & query building | Custom per feature | 5+ implementations |
| Permission / authorization checks | Inline per endpoint | Scattered across codebase |
| Pagination | Mixed cursor/offset | 3-4 implementations |
| Audit logging | Partial coverage | Inconsistent |
| Rate limiting | Ad hoc or missing | Varies |
1.3 Recommended Shared Services Architecture
A. Export Service
- Scope: Centralized async export pipeline supporting CSV, Excel, PDF, and scheduled exports
- Design:
- Accept a standardized export request (data source, filters, format, recipient)
- Use a job queue (e.g., Redis-backed Sidekiq, or a dedicated task queue) for async processing
- Store generated files in object storage (S3/GCS) with signed URLs
- Notify users via webhook/email on completion
- API Contract:
POST /api/v1/exports
{
  "source": "dashboard_metrics",
  "filters": { ... },
  "format": "csv",
  "notify": "email"
}
- Benefits: Uniform format support, centralized size limits, progress tracking, retry logic
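To make the design concrete, here is a minimal sketch of the worker side of the pipeline, assuming boto3 for object storage; the bucket name, the run_export signature, and the notify_user helper are illustrative placeholders, not existing code.

```python
import csv
import tempfile

import boto3

EXPORT_BUCKET = "example-exports-bucket"  # placeholder bucket name


def notify_user(recipient: str, url: str) -> None:
    """Stand-in for the real email/webhook notification."""
    print(f"export ready for {recipient}: {url}")


def run_export(export_id: str, rows: list[dict], recipient: str) -> str:
    """Write rows to CSV, upload to object storage, and return a signed URL."""
    if not rows:
        raise ValueError("nothing to export")
    with tempfile.NamedTemporaryFile("w", suffix=".csv", newline="", delete=False) as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
        path = f.name

    s3 = boto3.client("s3")
    key = f"exports/{export_id}.csv"
    s3.upload_file(path, EXPORT_BUCKET, key)

    # Signed URL lets the user download the file without a public bucket.
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": EXPORT_BUCKET, "Key": key},
        ExpiresIn=3600,
    )
    notify_user(recipient, url)
    return url
```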
B. Filtering & Query Engine
- Scope: Shared filter definition, parsing, validation, and SQL generation
- Design:
- Define a filter DSL (JSON-based) that all feature teams use
- Central library validates filter syntax, prevents injection, applies tenant scoping
- Generates parameterized SQL or query builder calls
- Supports common operations: equality, range, contains, in-list, date ranges, null checks
- Filter Schema Example:
{
  "filters": [
    { "field": "created_at", "op": "gte", "value": "2026-01-01" },
    { "field": "status", "op": "in", "value": ["active", "trial"] }
  ],
  "sort": { "field": "created_at", "dir": "desc" }
}
- Benefits: Consistent UX, single place for query optimization, SQL injection prevention
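A hedged sketch of how the library could translate that DSL into a parameterized WHERE clause; the operator map, the field allow-list, and the psycopg-style %s placeholders are assumptions for illustration, not an existing implementation.

```python
# Operators are an allow-list; anything outside it is rejected before SQL is built.
OPS = {
    "eq": "= %s",
    "gte": ">= %s",
    "lte": "<= %s",
    "contains": "ILIKE %s",
    "in": "= ANY(%s)",
    "is_null": "IS NULL",
}
ALLOWED_FIELDS = {"created_at", "status", "plan"}  # per-data-source allow-list


def build_where(filters: list[dict], tenant_id: str) -> tuple[str, list]:
    clauses, params = ["tenant_id = %s"], [tenant_id]  # tenant scoping is not optional
    for f in filters:
        field, op = f["field"], f["op"]
        if field not in ALLOWED_FIELDS or op not in OPS:
            raise ValueError(f"unsupported filter: {field} {op}")
        clauses.append(f"{field} {OPS[op]}")
        if op != "is_null":
            params.append(f"%{f['value']}%" if op == "contains" else f["value"])
    return " AND ".join(clauses), params


where, params = build_where(
    [{"field": "created_at", "op": "gte", "value": "2026-01-01"},
     {"field": "status", "op": "in", "value": ["active", "trial"]}],
    tenant_id="00000000-0000-0000-0000-000000000000",
)
# where  -> "tenant_id = %s AND created_at >= %s AND status = ANY(%s)"
# params -> ["00000000-...", "2026-01-01", ["active", "trial"]]
```

Because fields and operators are validated against allow-lists and every value travels as a bind parameter, injection prevention and tenant scoping live in one place rather than in each feature team's query code.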
C. Authorization Service
- Scope: Centralized permission evaluation replacing inline checks
- Design:
- Implement a policy engine (consider OPA/Rego, Casbin, or custom RBAC/ABAC service)
- Define resources, actions, and roles in a central policy store
- Expose a lightweight SDK that feature teams call:
authz.check(user, action, resource)
- Cache permission evaluations with short TTLs for hot paths
- Emit audit events for every authorization decision
- Migration Path:
- Catalog all existing permission checks across the codebase
- Map them to a unified role/permission model
- Implement the central service with a compatibility shim
- Migrate teams incrementally (one feature area per sprint)
- Remove legacy inline checks after validation
- Benefits: Single source of truth for access control, auditable, easier compliance (SOC 2, etc.)
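For illustration, a sketch of what the SDK's check path could look like with the short-TTL cache and audit emission described above; the policy client's evaluate() and the audit sink's emit() are hypothetical interfaces standing in for whichever engine (OPA, Casbin, or a custom service) is ultimately selected.

```python
import time


class AuthzClient:
    def __init__(self, policy_client, audit_sink, ttl_seconds: float = 30.0):
        self._policy = policy_client   # wraps OPA/Casbin/custom policy service
        self._audit = audit_sink       # e.g. a Kafka topic or append-only table
        self._ttl = ttl_seconds
        self._cache: dict[tuple, tuple[bool, float]] = {}

    def check(self, user_id: str, action: str, resource: str) -> bool:
        key = (user_id, action, resource)
        hit = self._cache.get(key)
        if hit and time.monotonic() - hit[1] < self._ttl:
            allowed = hit[0]
        else:
            allowed = self._policy.evaluate(user_id, action, resource)
            self._cache[key] = (allowed, time.monotonic())
        # Emit an audit event for every decision, cached or not,
        # so the audit trail stays complete.
        self._audit.emit({"user": user_id, "action": action,
                          "resource": resource, "allowed": allowed})
        return allowed
```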
1.4 Platform Team Structure
- Recommendation: Form a dedicated Platform Engineering team (3-5 engineers initially)
- Charter: Own shared capabilities, define APIs, provide SDKs, maintain SLOs for platform services
- Operating Model:
- Platform team builds and maintains core services
- Feature teams consume via well-documented SDKs and APIs
- Platform team holds office hours and reviews integration PRs
- Shared capabilities are versioned; breaking changes follow a deprecation policy
1.5 Migration Strategy
| Phase | Duration | Focus |
|---|---|---|
| Phase 1 | Weeks 1-4 | Catalog existing implementations, define unified APIs |
| Phase 2 | Weeks 5-10 | Build Export Service and Filter Engine v1 |
| Phase 3 | Weeks 8-14 | Build Authorization Service v1 (overlapping) |
| Phase 4 | Weeks 12-20 | Migrate feature teams (1-2 teams per sprint) |
| Phase 5 | Weeks 20-24 | Deprecate and remove legacy implementations |
2. Postgres Scaling Analysis
2.1 Current State Assessment
| Metric | Current | Concern Level |
|---|---|---|
| Database size | 500 GB | Moderate — approaching practical limits for a single node |
| Query latency | Increasing | High — user-facing impact |
| Expected growth | 5x in 6 months | Critical — 2.5 TB without intervention |
| Connection count | Likely growing with team/feature count | Moderate |
| Write volume | Unknown but likely growing with analytics ingestion | High |
2.2 Immediate Optimizations (Weeks 1-4)
These require no architectural changes and can deliver quick wins:
A. Query Performance Audit
- Run pg_stat_statements to identify the top 20 slowest and most frequent queries
- Use EXPLAIN ANALYZE on each; look for sequential scans, missing indexes, and poor join plans
- Add targeted indexes (partial indexes, covering indexes, expression indexes as needed)
- Identify and fix N+1 query patterns in application code
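As a starting point for the audit, a small script of the kind we would expect to run, assuming Postgres 13+ column names in pg_stat_statements (total_exec_time / mean_exec_time; older versions use total_time / mean_time) and a placeholder connection string.

```python
import psycopg2

DSN = "postgresql://readonly@db-primary/analytics"  # placeholder DSN

TOP_QUERIES = """
    SELECT queryid, calls, mean_exec_time, total_exec_time, left(query, 120) AS query
    FROM pg_stat_statements
    ORDER BY total_exec_time DESC
    LIMIT 20;
"""

# Print the queries consuming the most cumulative execution time.
with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
    cur.execute(TOP_QUERIES)
    for queryid, calls, mean_ms, total_ms, query in cur.fetchall():
        print(f"{total_ms:12.1f}ms total  {mean_ms:8.2f}ms avg  {calls:10d} calls  {query}")
```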
B. Connection Pooling
- Deploy PgBouncer in transaction mode if not already present
- Target: reduce active connections from potentially hundreds to a managed pool (50-100 server connections)
- This alone often improves p99 latency significantly
C. Configuration Tuning
- Review and optimize shared_buffers (25% of RAM), effective_cache_size (75% of RAM), work_mem, and maintenance_work_mem
- Enable pg_stat_monitor for detailed query analytics
- Set random_page_cost appropriately for SSD storage (1.1-1.5)
D. Table Maintenance
- Identify bloated tables with pgstattuple; run targeted VACUUM FULL (or pg_repack, which avoids the long exclusive lock) during maintenance windows
- Check for unused indexes consuming write overhead
- Identify the largest tables (likely time-series analytics data) as candidates for native declarative range partitioning; the partitioning work itself is scheduled in the medium-term plan (see 2.3 B)
2.3 Medium-Term Scaling (Weeks 4-12)
A. Read Replicas
- Deploy 1-2 read replicas for analytics/reporting queries
- Route read traffic at the application level (or via a proxy such as Pgpool-II); a minimal routing sketch follows this list
- Target: offload 60-80% of read traffic from primary
- Acceptable replication lag: <1 second for most analytics use cases
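A minimal application-level routing sketch using SQLAlchemy, purely illustrative: the engine URLs and the statement-prefix heuristic are assumptions, and a real implementation would also pin recent writers to the primary to mask replication lag.

```python
import random

from sqlalchemy import create_engine, text

primary = create_engine("postgresql+psycopg2://app@db-primary/analytics")
replicas = [
    create_engine("postgresql+psycopg2://app@db-replica-1/analytics"),
    create_engine("postgresql+psycopg2://app@db-replica-2/analytics"),
]


def engine_for(statement: str):
    """Route SELECTs to a replica; everything else goes to the primary."""
    if statement.lstrip().lower().startswith("select"):
        return random.choice(replicas)
    return primary


stmt = "SELECT count(*) FROM events WHERE tenant_id = :tid"
with engine_for(stmt).connect() as conn:
    # Placeholder tenant id purely for illustration.
    print(conn.execute(text(stmt), {"tid": "00000000-0000-0000-0000-000000000000"}).scalar())
```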
B. Table Partitioning
- Partition the largest tables (analytics events, audit logs, time-series metrics) by month or week
- Benefits: faster queries with partition pruning, easier data lifecycle management, parallel query execution
- Implementation:
CREATE TABLE events (
  id bigint GENERATED ALWAYS AS IDENTITY,
  tenant_id uuid NOT NULL,
  created_at timestamptz NOT NULL,
  ...
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2026_q1 PARTITION OF events
  FOR VALUES FROM ('2026-01-01') TO ('2026-04-01');
C. Data Archival Strategy
- Define retention policies: hot data (0-90 days on the primary), warm data (90-365 days in a separate warm store or lower-cost storage), cold data (>1 year in object storage/Parquet)
- Implement automated archival jobs that move old partitions
- Target: keep the primary under ~600 GB even with 5x data growth
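A rough sketch of the archival job, assuming psycopg2 and boto3; the partition name, DSN, and bucket are placeholders, and the Parquet conversion mentioned above is omitted for brevity (CSV shown instead).

```python
import io

import boto3
import psycopg2

DSN = "postgresql://archiver@db-primary/analytics"  # placeholder
ARCHIVE_BUCKET = "example-cold-storage"             # placeholder


def archive_partition(partition: str) -> None:
    """Detach an aged partition, dump it to object storage, then drop it."""
    buf = io.BytesIO()
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        # Partition names come from our own scheduler, never from user input.
        cur.execute(f"ALTER TABLE events DETACH PARTITION {partition};")
        cur.copy_expert(f"COPY {partition} TO STDOUT WITH (FORMAT csv, HEADER)", buf)
        boto3.client("s3").put_object(
            Bucket=ARCHIVE_BUCKET, Key=f"events/{partition}.csv", Body=buf.getvalue()
        )
        cur.execute(f"DROP TABLE {partition};")  # only after the upload succeeded


archive_partition("events_2025_01")
```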
D. Caching Layer
- Add Redis/Memcached for frequently-read, slowly-changing data:
- Permission/role lookups (TTL: 60s)
- Dashboard metadata and configuration (TTL: 300s)
- Aggregated analytics results (TTL: varies by freshness requirement)
- Target: 40-60% cache hit rate for read queries, reducing DB load proportionally
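The permission-lookup cache would likely follow a simple cache-aside pattern; here is a sketch using redis-py with the 60-second TTL from the list above, where load_permissions_from_db is a hypothetical stand-in for the real query.

```python
import json

import redis

r = redis.Redis(host="cache", port=6379)


def load_permissions_from_db(user_id: str) -> list[str]:
    """Stand-in for the real database lookup."""
    return ["dashboard.read"]


def get_permissions(user_id: str) -> list[str]:
    key = f"perms:{user_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)          # cache hit: skip the database entirely
    perms = load_permissions_from_db(user_id)
    r.set(key, json.dumps(perms), ex=60)   # TTL matches the 60s target above
    return perms
```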
2.4 Long-Term Architecture (Months 3-6)
A. Analytical Workload Separation (OLAP Split)
- Move heavy analytical queries to a dedicated analytical store
- Options:
- ClickHouse: Excellent for time-series analytics, columnar storage, fast aggregations
- TimescaleDB: If staying in the Postgres ecosystem is important
- BigQuery/Redshift: If preferring managed services
- Stream data from Postgres via CDC (Debezium/Kafka Connect) to the analytical store
- Feature teams query the analytical store for dashboards and reports; Postgres handles transactional workloads
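To illustrate the split from a feature team's point of view, a sketch assuming ClickHouse is chosen and the clickhouse-driver Python client is used; hosts, table names, and the placeholder tenant/account IDs are made up.

```python
import psycopg2
from clickhouse_driver import Client

clickhouse = Client(host="olap-store")                                   # placeholder host
postgres = psycopg2.connect("postgresql://app@db-primary/analytics")    # placeholder DSN

# Heavy aggregation goes to the columnar store (fast scans, pruning by date).
daily_counts = clickhouse.execute(
    "SELECT toDate(created_at) AS day, count() FROM events "
    "WHERE tenant_id = %(tid)s GROUP BY day ORDER BY day",
    {"tid": "00000000-0000-0000-0000-000000000000"},
)

# Point lookups and transactional reads stay on Postgres.
with postgres.cursor() as cur:
    cur.execute("SELECT plan, status FROM accounts WHERE id = %s",
                ("00000000-0000-0000-0000-000000000000",))
    account = cur.fetchone()
```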
B. Sharding Evaluation
- If single-tenant queries dominate, consider tenant-based sharding using Citus or application-level sharding
- For a B2B SaaS with analytics, tenant-based sharding is natural since queries rarely cross tenant boundaries
- Citus (Postgres extension) can shard by tenant_id with minimal application changes
- Decision point: pursue sharding only if the OLAP split + read replicas + partitioning are insufficient
C. Connection Architecture
- Evaluate upgrading to a service mesh or sidecar proxy pattern for database connections
- Consider Postgres logical replication for zero-downtime migrations
2.5 Projected Capacity Timeline
| Milestone | Database Size | Strategy | Target Latency (p95) |
|---|---|---|---|
| Today | 500 GB | Single node | Degrading |
| Month 1 | 550 GB | Optimized queries + pooling | <200ms |
| Month 3 | 700 GB | Read replicas + partitioning + caching | <150ms |
| Month 6 | ~600 GB primary + archive | OLAP split + archival | <100ms transactional, <2s analytical |
| Month 6 (without action) | ~2.5 TB | No intervention | >1s, outages likely |
3. Reliability SLOs
3.1 SLO Framework
We recommend defining SLOs across four dimensions: availability, latency, correctness, and freshness.
3.2 Proposed SLOs
Tier 1 — User-Facing API (Dashboard, Core Analytics)
| SLI | SLO Target | Measurement |
|---|---|---|
| Availability | 99.9% (~8.8 hours downtime/year) | Successful responses / total requests (4xx not counted as failures) |
| Latency (p50) | < 100ms | Server-side request duration |
| Latency (p95) | < 500ms | Server-side request duration |
| Latency (p99) | < 1,500ms | Server-side request duration |
| Error rate | < 0.1% | 5xx responses / total responses |
Tier 2 — Export & Batch Processing
| SLI | SLO Target | Measurement |
|---|---|---|
| Availability | 99.5% | Job processing success rate |
| Completion time (p95) | < 5 minutes for exports under 100K rows | Job start to file available |
| Completion time (p95) | < 30 minutes for exports under 1M rows | Job start to file available |
| Data correctness | 100% | Export matches query results at time of request |
Tier 3 — Internal Platform Services (AuthZ, Filtering)
| SLI | SLO Target | Measurement |
|---|---|---|
| Availability | 99.95% | Successful responses / total requests; set higher than Tier 1 because these services are upstream dependencies |
| Latency (p99) | < 50ms | Authorization check duration |
| Correctness | 100% | No false-positive authorizations ever |
| Cache hit rate | > 80% | For permission lookups |
Tier 4 — Database & Infrastructure
| SLI | SLO Target | Measurement |
|---|---|---|
| Database availability | 99.95% | Connection success rate |
| Replication lag | < 5 seconds (p99) | Replica lag metric |
| Query latency (p95) | < 200ms for transactional queries | pg_stat_statements |
| Backup RPO | < 1 hour | Time since last successful backup |
| Backup RTO | < 4 hours | Time to restore from backup |
3.3 Error Budget Policy
- Budget: With 99.9% availability SLO, the error budget is 0.1% per rolling 30-day window (~43 minutes/month)
- When budget is healthy (>50% remaining): Normal feature velocity, deploy at will
- When budget is stressed (10-50% remaining): Require additional review for risky deploys, increase monitoring
- When budget is exhausted (<10% remaining): Freeze non-critical deploys, all engineering effort focuses on reliability until budget recovers
- Review cadence: Weekly SLO review in engineering standup, monthly SLO review with leadership
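For reference, the arithmetic behind the ~43-minute budget and a simple burn check, shown as a small calculation; the request counts are made up for illustration.

```python
# Error-budget arithmetic for a 99.9% availability SLO over a rolling 30-day window.
WINDOW_MINUTES = 30 * 24 * 60                  # 43,200 minutes in the window
SLO = 0.999
budget_minutes = WINDOW_MINUTES * (1 - SLO)    # ~43.2 minutes of "allowed" downtime

total_requests = 120_000_000                   # illustrative traffic volume
failed_requests = 36_000                       # illustrative 5xx count
error_rate = failed_requests / total_requests  # 0.0003 = 0.03%
budget_consumed = error_rate / (1 - SLO)       # 0.3 -> 30% of the budget used

print(f"budget: {budget_minutes:.1f} min, consumed: {budget_consumed:.0%}")
```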
3.4 Observability Requirements
To measure these SLOs, implement:
- Metrics: Prometheus/Datadog for request latency histograms, error rates, saturation
- Logging: Structured JSON logs with request IDs, tenant IDs, and latency breakdowns
- Tracing: Distributed tracing (Jaeger/Datadog APM) across all services, especially for the shared capability calls
- Dashboards: One SLO dashboard per tier, visible to all engineers
- Alerting:
- Page (PagerDuty): >1% error rate sustained for 5 minutes, database primary down
- Warn (Slack): SLO burn rate exceeding budget, replication lag >10s, export queue depth >100
- SLO Tracking Tool: Consider Nobl9, Sloth, or a custom Prometheus recording rules setup
4. Execution Roadmap
4.1 Overview
The plan spans 6 months, divided into three phases. Each phase delivers independent value while building toward the full vision.
Months 1-2: STABILIZE — Quick wins, stop the bleeding
Months 3-4: PLATFORM — Build shared capabilities, scale reads
Months 5-6: SCALE — Analytical workload separation, full migration
4.2 Phase 1: Stabilize (Weeks 1-8)
Goal: Reduce immediate pain, establish foundations
| Week | Workstream | Deliverable | Owner | Dependencies |
|---|---|---|---|---|
| 1-2 | DB Performance | Query audit report; top 10 queries optimized | 1 Senior SRE + 1 Backend | pg_stat_statements access |
| 1-2 | Observability | SLO dashboards for Tier 1 and Tier 4 | 1 SRE | Metrics pipeline |
| 1-3 | Connection Pooling | PgBouncer deployed and configured | 1 SRE | Staging environment |
| 2-4 | Shared Capabilities | Catalog of all existing export/filter/authz implementations | Platform lead + feature team leads | Codebase access, interviews |
| 3-4 | DB Performance | Missing indexes added; bloated tables vacuumed | 1 SRE | Maintenance window |
| 3-6 | Read Replicas | 2 read replicas provisioned; read routing implemented | 1 SRE + 1 Backend | Infrastructure provisioning |
| 4-8 | Platform Team | Platform team formed (3 engineers); charter defined | Engineering leadership | Hiring/reallocation |
| 5-8 | API Design | Shared Export, Filter, and AuthZ API specs finalized (RFC process) | Platform team | Stakeholder review |
Phase 1 Success Criteria:
- p95 API latency reduced by 30%
- SLO dashboards operational
- Platform team staffed and chartered
- API specs approved by feature team leads
4.3 Phase 2: Platform Build (Weeks 9-16)
Goal: Deliver shared capabilities v1, partition database, begin migrations
| Week | Workstream | Deliverable | Owner | Dependencies |
|---|---|---|---|---|
| 9-12 | Export Service | Export Service v1 (CSV, Excel) with async processing | Platform team (2 eng) | Object storage, job queue |
| 9-12 | Filter Engine | Shared filter library v1 with SDK | Platform team (1 eng) | API spec |
| 9-14 | AuthZ Service | Authorization service v1 with policy engine | Platform team (2 eng) + 1 Security | Policy model |
| 10-12 | DB Partitioning | Top 3 largest tables partitioned by date range | 1 SRE + 1 Backend | Maintenance window, testing |
| 11-13 | Caching | Redis caching layer for permission lookups and dashboard metadata | 1 Backend | Redis infrastructure |
| 12-14 | Migration: Team 1 | First feature team migrated to shared Export + Filter | Platform team + Feature team 1 | Export Service v1, Filter v1 |
| 14-16 | Migration: Team 2-3 | Two more feature teams migrated | Platform team support | v1 services stable |
Phase 2 Success Criteria:
- Export Service handling 100% of Team 1's exports
- Filter Engine SDK adopted by 3 teams
- Authorization service evaluating permissions for at least 1 feature area
- Database primary growth contained (~700 GB or less) with partitioning active and the archival policy approved
- p95 latency < 200ms
4.4 Phase 3: Scale (Weeks 17-24)
Goal: Handle 5x traffic, complete migrations, operational maturity
| Week | Workstream | Deliverable | Owner | Dependencies |
|---|---|---|---|---|
| 17-19 | OLAP Split | Analytical store deployed (ClickHouse or TimescaleDB); CDC pipeline operational | 1 SRE + 1 Data eng | Kafka/CDC infrastructure |
| 17-20 | Migration: All Teams | Remaining feature teams migrated to shared capabilities | Platform team + all feature teams | Stable v1 services |
| 18-20 | Dashboard Queries | Analytical dashboard queries routed to OLAP store | Backend + Platform | OLAP store populated |
| 19-22 | Data Archival | Automated archival pipeline moving cold data to object storage | 1 SRE | Archival policy approved |
| 20-22 | Load Testing | Comprehensive load test at 5x current traffic | SRE team | Staging environment at scale |
| 21-23 | Legacy Cleanup | Remove deprecated export/filter/authz implementations | Feature teams | All migrations complete |
| 22-24 | Operational Runbooks | Runbooks for all platform services; on-call rotation established | Platform + SRE | Service stability |
| 24 | SLO Review | Formal SLO review; adjust targets based on 3 months of data | Engineering leadership | SLO data |
Phase 3 Success Criteria:
- System handles 5x current traffic in load tests with SLOs met
- All feature teams consuming shared capabilities (zero legacy export/filter/authz code)
- OLAP store handling 100% of analytical queries
- Database primary stable at <600 GB with archival active
- Error budget healthy across all tiers
- On-call rotation staffed and runbooks complete
4.5 Staffing Summary
| Role | Phase 1 | Phase 2 | Phase 3 | Notes |
|---|---|---|---|---|
| Platform Engineers | 0 → 3 | 3-4 | 3-4 | Core platform team |
| SRE | 2 | 2 | 2 | DB scaling, infrastructure |
| Backend (borrowed from feature teams) | 1-2 | 1 | 0 | Migration support |
| Data Engineer | 0 | 0 | 1 | OLAP pipeline |
| Total dedicated | 3-5 | 6-7 | 6-7 | ~12-14% of 50 engineers |
4.6 Risk Register
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Database outage before scaling completes | Medium | Critical | Prioritize read replicas and connection pooling in Phase 1 |
| Feature teams resist adopting shared capabilities | Medium | High | Executive sponsorship; demonstrate time savings; make migration easy with SDKs |
| Platform team understaffed due to competing priorities | High | High | Dedicated headcount commitment from leadership; protect from feature work |
| Shared service becomes bottleneck/single point of failure | Medium | High | Design for high availability from day 1; circuit breakers; graceful degradation |
| OLAP migration introduces data inconsistencies | Medium | Medium | Dual-write validation period; automated reconciliation checks |
| 5x traffic arrives sooner than expected | Low | Critical | Phase 1 quick wins provide 2-3x headroom; can accelerate Phase 2 |
4.7 Key Decision Points
- Week 4: Finalize analytical store choice (ClickHouse vs. TimescaleDB vs. managed service)
- Week 8: Decide on authorization model (RBAC vs. ABAC vs. hybrid) based on catalog findings
- Week 12: Evaluate whether Citus sharding is needed based on partitioning + OLAP split results
- Week 16: Go/no-go on Phase 3 timeline based on Phase 2 migration velocity
- Week 20: Assess whether SLO targets need adjustment based on real-world data
5. Summary of Expected Outcomes
| Metric | Today | After 6 Months |
|---|---|---|
| Duplicate export implementations | 4-6 | 1 (shared service) |
| Duplicate filter implementations | 5+ | 1 (shared library) |
| Permission check consistency | Ad hoc, inconsistent | Centralized, auditable |
| Database size (primary) | 500 GB, growing | <600 GB (with archival) |
| API latency (p95) | Degrading (>500ms) | <200ms transactional |
| Traffic capacity | ~1x (at limit) | 5-10x headroom |
| Availability SLO | Unmeasured | 99.9% measured and tracked |
| Engineering time on redundant infra work | ~15-20% | <5% |
| Time to add export to new feature | 2-3 weeks | 1-2 days (SDK integration) |
This plan should be reviewed with engineering leadership, SRE, and feature team leads before execution begins. Priorities may shift based on the query audit findings in Weeks 1-2 and the capability catalog in Weeks 2-4.