Platform & Infrastructure Improvement Pack
Company: B2B Analytics SaaS (Series B, 50 engineers)
Date: 2026-03-17
Decision statement: We will extract export, filtering, and permissions into shared platform services, define a Postgres scaling plan with lead-time-aware triggers, and commit to reliability SLOs -- all sequenced by blast radius so the highest-leverage work ships first, enabling the enterprise push without a firefighting spiral.
1) Context Snapshot
- System(s) in scope: Core B2B analytics SaaS platform -- all product services, the shared Postgres database, and the internal tooling layer consumed by feature teams.
- Users/customers: Enterprise and mid-market analytics buyers; internal consumers are ~8-10 feature teams (50 engineers total).
- Primary pains (1-3):
- Developer velocity -- every feature team re-implements export, filtering, and permission checks, creating duplicated effort and inconsistent behavior.
- Database scaling -- Postgres at 500 GB with increasing query latency; 5x traffic growth expected in 6 months from enterprise push.
- Reliability risk -- no formal SLOs; enterprise customers will demand contractual uptime and performance guarantees.
- Time horizon / deadline: 6 months to enterprise launch readiness. Postgres scaling work must begin immediately given lead times.
- Stakeholders / decision-maker(s): VP Engineering (decision-maker), Platform/Infra lead (DRI for shared services), Product Engineering leads (consumers), SRE/on-call rotation (reliability ownership).
- Constraints (security/compliance, staffing, risk tolerance):
- Series B staffing: no dedicated platform team yet; will need to carve out 4-6 engineers from feature teams or hire.
- Enterprise push implies SOC 2 / data residency requirements are imminent.
- Risk tolerance: moderate -- can tolerate planned migrations but not extended outages or data loss.
- Assumptions (explicit):
- A1: Current Postgres instance is a single primary with read replicas (no sharding today).
- A2: Feature teams number 8-10, each with 4-6 engineers; at least 4 teams have built their own export, filtering, or permissions logic.
- A3: No formal SLOs exist today; monitoring is basic (uptime pings, some application metrics).
- A4: The enterprise push will bring customers with contractual SLA requirements (99.9%+ availability).
- A5: Current query latency degradation is primarily from large analytical queries competing with transactional workload on the same Postgres instance.
- Success definition (measures):
- Export, filtering, and permissions available as platform services consumed by >= 3 teams within 4 months.
- Postgres scaling plan executed with headroom for 5x growth before enterprise launch.
- Published SLOs for top 5 user journeys with measurement infrastructure in place.
- Zero P0 incidents caused by DB saturation or permission inconsistencies during enterprise onboarding.
- Non-goals / out of scope:
- Rewriting the entire application architecture or migrating off Postgres entirely.
- Product/market positioning of the analytics platform (use `platform-strategy`).
- Broader technical roadmap sequencing beyond infra (use `technical-roadmaps`).
- Legacy code cleanup unrelated to shared capabilities (use `managing-tech-debt`).
- Engineering culture or process changes (use `engineering-culture`).
2) Shared Capabilities Inventory + Platformization Plan
Shared Capabilities Inventory
| Capability | Current duplication (where/how) | Consumer teams/services | Proposed platform contract (API/schema/SDK) | Migration approach | Expected impact | Risks |
|---|---|---|---|---|---|---|
| Data Export Service | 4+ teams each built CSV/Excel/PDF export with own queuing, formatting, progress tracking. Different timeout handling, file size limits, and error behavior across teams. | 5 | REST API: POST /platform/exports (accepts query definition, format, delivery method). Async job with webhook/polling status. SDK wrapper for common languages. Returns signed download URL. | Phase 1: New exports use platform service. Phase 2: Migrate existing exports team-by-team with adapter shim (old endpoints proxy to new service). Phase 3: Deprecate team-specific implementations over 2 sprints per team. | Eliminates ~3 weeks/quarter of duplicated export work across teams. Consistent UX (progress bars, retry, size limits). Single place to enforce export audit logging for compliance. | Migration friction if teams have custom export formats. Must support current file-size limits during transition. |
| Filtering & Query Engine | 4+ teams built bespoke filtering UIs and query builders. Different syntax, operators, and performance characteristics. Some teams hit Postgres directly; others use materialized views. | 6 | Internal SDK/library: FilterEngine.build(schema, filters) -> SQL/query. Shared filter grammar (field, operator, value, combinator). Server-side validation and query plan analysis (reject queries exceeding cost threshold). | Phase 1: Ship SDK as internal package; new features adopt it. Phase 2: Teams wrap existing filters with adapter that delegates to SDK. Phase 3: Remove bespoke query builders over 3-month window. | Consistent filter behavior across product. Single optimization point for query performance. Blocks dangerous queries before they hit Postgres. | Filter grammar must be expressive enough for all current use cases. Performance regression risk if SDK adds overhead; mitigate with benchmarking. |
| Permissions Service | 3+ teams implemented role checks, feature flags, and entitlement gates independently. Inconsistent enforcement (some check at API layer, some at DB layer, some at UI only). | 7 | gRPC service: PermissionsService.Check(subject, action, resource) -> {allowed, reason}. Policy-as-code (OPA/Cedar). SDK with middleware for common frameworks. Caching layer (local + distributed) with TTL-based invalidation. | Phase 1: Deploy permissions service alongside existing checks (shadow mode -- log discrepancies, don't enforce). Phase 2: Flip enforcement to platform service per team/endpoint. Phase 3: Remove inline permission logic. | Consistent access control (critical for enterprise/SOC 2). Single audit log for all permission decisions. Eliminates ~2 weeks/quarter of duplicated authz work. | Shadow mode must run long enough to catch edge cases. Latency budget: permission checks must add < 5 ms p99. Cache invalidation bugs could cause access control failures. |
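To make the Permissions contract above concrete, here is a minimal consumer-side sketch assuming a generated gRPC stub and the TTL cache described in the table; the class, method, and field names are illustrative, not a committed API.

```python
# Hypothetical sketch of calling PermissionsService.Check with a local TTL cache.
# The stub and its Check() signature are assumptions, not a final contract.
import time
from dataclasses import dataclass

@dataclass
class Decision:
    allowed: bool
    reason: str

class PermissionsClient:
    """Thin wrapper over PermissionsService.Check with a local TTL cache."""

    def __init__(self, stub, cache_ttl_s: float = 30.0):
        self._stub = stub            # assumed: a generated gRPC stub exposing .Check()
        self._ttl = cache_ttl_s
        self._cache: dict[tuple, tuple[float, Decision]] = {}

    def check(self, subject: str, action: str, resource: str) -> Decision:
        key = (subject, action, resource)
        hit = self._cache.get(key)
        if hit and time.monotonic() - hit[0] < self._ttl:
            return hit[1]            # cached path; latency budget is p99 < 5 ms
        resp = self._stub.Check(subject=subject, action=action, resource=resource)
        decision = Decision(allowed=resp.allowed, reason=resp.reason)
        self._cache[key] = (time.monotonic(), decision)
        return decision

# Usage in request middleware (shadow mode would log the result instead of enforcing):
# if not client.check(user_id, "export:create", f"dashboard:{dashboard_id}").allowed:
#     raise PermissionError("denied")
```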
Platformization Decisions
- What becomes a shared primitive (and why):
- Export -- 5 consumers, high duplication, compliance requirement for audit trail. Stable contract surface (input: query + format; output: file).
- Filtering -- 6 consumers, highest duplication count, and directly tied to Postgres performance problems (unoptimized queries). Centralizing this is also a scaling lever.
- Permissions -- 7 consumers (every team needs it), enterprise customers require consistent RBAC, and SOC 2 demands a single audit trail. Inconsistent enforcement is a security risk.
- What remains product-specific (and why):
- Visualization rendering -- highly product-specific; each analytics view has unique charting/rendering needs. Not enough commonality for a shared primitive yet.
- Notification preferences -- only 2 teams use notifications today and the UX requirements differ significantly. Revisit when a third consumer appears.
- Custom report scheduling -- closely tied to individual product domains; too early to abstract.
- Ownership model:
- Dedicated Platform Services team (4-6 engineers, carved from feature teams + 2 new hires). This team owns the shared services, SLOs, and migration support.
- Feature teams own integration/migration of their code to platform services. Platform team provides pairing support during migration sprints.
- Versioning + backwards compatibility plan:
- Semantic versioning for all platform service APIs and SDKs.
- Breaking changes require a 2-sprint deprecation window with migration guide.
- Export and Permissions services: versioned API paths (`/v1/`, `/v2/`). Old versions supported for 3 months after new version GA.
- Filtering SDK: major version bumps require opt-in; minor/patch versions are backward-compatible.
3) Quality Attributes Spec (SLOs/SLIs + Privacy/Safety)
Reliability Targets
- Availability: 99.9% measured monthly for all tier-1 user journeys (see SLO table below). This translates to ~43 minutes of allowed downtime per month.
- Error rate: < 0.1% 5xx error rate on tier-1 APIs measured over rolling 7-day windows.
- MTTR (Mean Time to Recover): < 30 minutes for P0 incidents (complete service unavailability); < 2 hours for P1 (degraded but functional).
- Error budget policy: When monthly error budget is < 25% remaining, freeze non-critical deployments and prioritize reliability work until budget resets.
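For reference, the arithmetic behind the 99.9% target and the < 25% freeze rule is small enough to sketch; the helper names below are ours, and the numbers simply restate the targets above.

```python
# Error-budget math for a 99.9% monthly availability SLO (illustrative only).
MINUTES_PER_MONTH = 30 * 24 * 60                  # ~43,200
SLO = 0.999
BUDGET_MINUTES = MINUTES_PER_MONTH * (1 - SLO)    # ~43.2 minutes of allowed downtime

def budget_remaining(downtime_minutes: float) -> float:
    """Fraction of the monthly error budget still unspent."""
    return max(0.0, 1 - downtime_minutes / BUDGET_MINUTES)

def should_freeze_deploys(downtime_minutes: float) -> bool:
    """Policy above: freeze non-critical deploys when < 25% of budget remains."""
    return budget_remaining(downtime_minutes) < 0.25

# Example: 35 minutes of downtime leaves ~19% of the budget -> freeze non-critical deploys.
assert should_freeze_deploys(35.0)
assert not should_freeze_deploys(10.0)
```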
Performance Targets
- Dashboard load (primary journey): p95 < 2 seconds, p99 < 4 seconds end-to-end.
- API response (CRUD operations): p95 < 200 ms, p99 < 500 ms.
- Export jobs: Initiation < 1 second; completion for datasets < 100 MB within 60 seconds. Larger exports: progress updates every 10 seconds.
- Permission checks: p99 < 5 ms (cached), p99 < 50 ms (uncached).
- Filter query execution: p95 < 500 ms for standard filters; queries exceeding 5 seconds are killed and user is prompted to narrow scope.
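One plausible way to implement the 5-second kill for filter queries is a per-session Postgres `statement_timeout`; the sketch below assumes psycopg (v3) and uses placeholder connection details.

```python
# Sketch: enforce the 5 s filter-query kill with a server-side statement_timeout.
# Assumes psycopg (v3); the DSN, SQL, and error handling are placeholders.
import psycopg

def run_filter_query(dsn: str, sql: str, params: tuple):
    with psycopg.connect(dsn) as conn:
        with conn.cursor() as cur:
            # The server cancels anything that runs past 5 s; the caller turns
            # the cancellation into a "please narrow your filter" prompt.
            cur.execute("SET statement_timeout = '5s'")
            try:
                cur.execute(sql, params)
                return cur.fetchall()
            except psycopg.errors.QueryCanceled:
                raise TimeoutError("Query exceeded 5 s; ask the user to narrow the filter")
```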
Privacy/Safety Requirements
- Encryption: TLS 1.2+ in transit; AES-256 at rest for all data stores (Postgres, object storage, caches).
- Access control: RBAC enforced through the Permissions Service for all API endpoints. No direct DB access from application code without going through the service layer.
- Data residency: Prepare for regional deployment (US, EU) to support enterprise data residency requirements. Architecture must support tenant-level data isolation.
- Retention: Define retention policies per data class: operational data (2 years), audit logs (7 years), analytics events (1 year raw, aggregated indefinitely). Automated purge jobs.
- Audit trail: All permission checks, data exports, and admin actions logged to immutable audit store. Required for SOC 2 Type II.
Operability Requirements
- Dashboards: Unified platform health dashboard (Datadog/Grafana) covering: DB metrics, API latency/error rates, export job queue depth, permission service latency, SLO burn rate.
- Alerts: PagerDuty integration. Alert on SLO burn rate (fast burn: 10x consumption rate, slow burn: 2x consumption rate). DB-specific alerts on connection count, replication lag, disk usage, query duration.
- Runbooks: One runbook per P0 scenario (DB failover, permission service outage, export queue backup, full disk). Runbooks linked from alert definitions.
- On-call: Platform team owns a dedicated on-call rotation. Feature teams handle product-specific incidents but escalate to platform on-call for shared service issues.
Cost Guardrails
- Top drivers: Postgres (compute + storage), application compute (Kubernetes), object storage (exports), observability tooling.
- Monthly budget caps: Set alerts at 80% and 100% of monthly infrastructure budget. Any single service exceeding 120% of its allocation triggers cost review.
- Optimization targets: Reduce per-query cost by 40% through filtering engine optimization and read replica routing. Export storage: auto-expire files after 7 days.
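The 7-day export expiry could be enforced with an object-storage lifecycle rule rather than application code; the sketch below assumes S3 via boto3, and the bucket name and prefix are hypothetical.

```python
# Sketch: auto-expire export files after 7 days via an S3 lifecycle rule.
# Assumes boto3 and S3; bucket name and prefix are hypothetical.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="acme-analytics-exports",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-exports-after-7-days",
                "Filter": {"Prefix": "exports/"},
                "Status": "Enabled",
                "Expiration": {"Days": 7},
            }
        ]
    },
)
```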
Proposed SLOs/SLIs
| User journey / API | SLI | SLO target | Measurement method | Owner | Notes |
|---|---|---|---|---|---|
| Dashboard load (primary) | Time from request to interactive render | p95 < 2 s, p99 < 4 s | RUM (Real User Monitoring) + synthetic checks every 60 s | Product Eng + Platform | Tier-1 journey; measured end-to-end including API + rendering |
| API CRUD operations | Server-side latency (request received to response sent) | p95 < 200 ms, p99 < 500 ms | Application metrics (histogram) | Platform team | Excludes network transit; measured at load balancer |
| Data export completion | Time from job creation to download-ready | < 60 s for datasets < 100 MB | Export service metrics (job duration histogram) | Platform team | Larger exports measured separately; SLO applies to 90th percentile of jobs |
| Permission check latency | Latency of Check() RPC | p99 < 5 ms (cached), p99 < 50 ms (uncached) | gRPC service metrics | Platform team | Cache hit rate target: > 95% |
| Overall availability | Successful requests / total requests (excluding maintenance) | 99.9% monthly | Load balancer access logs + health checks | SRE / Platform team | 43 min downtime budget per month |
| Filter query execution | Query execution time for standard filter operations | p95 < 500 ms | DB query metrics + application instrumentation | Platform team | Queries exceeding 5 s are killed; tracked separately as "timeout rate" |
4) Scaling "Doomsday Clock" + Capacity Plan
Doomsday Clock
| Component/limit | Metric | Current | Trigger threshold | Estimated lead time to mitigate | Mitigation project | Owner |
|---|---|---|---|---|---|---|
| Postgres disk (500 GB) | Total DB size (GB) | 500 GB | 650 GB (70% of typical managed instance max before perf cliff) | 6-8 weeks | Data archival + partitioning (see below) | Platform lead |
| Postgres IOPS | Read/Write IOPS | ~8,000 (est.) | 12,000 (80% of provisioned IOPS on current instance class) | 4-6 weeks | Read replica routing for analytics queries + connection pooler (PgBouncer) | Platform lead |
| Postgres connections | Active connections | ~150 (est.) | 300 (75% of max_connections, typically 400 on managed instances) | 2-3 weeks | PgBouncer connection pooling; review connection lifecycle in application code | Platform eng |
| Postgres query latency | p95 query duration (ms) | ~800 ms (est., degrading) | 500 ms (target), 1,500 ms (critical) | 4-6 weeks | Separate OLTP/OLAP workloads; read replicas for heavy analytics; query optimization via filtering engine | Platform lead |
| Postgres replication lag | Replica lag (seconds) | < 1 s (est.) | 10 s sustained | 2-3 weeks | Investigate write amplification; tune WAL settings; consider logical replication for selective tables | Platform eng |
| Application compute (K8s) | CPU/memory utilization across pods | ~55% (est.) | 75% sustained over 1 hour | 1-2 weeks | Horizontal auto-scaling policy; right-size pod resource requests | SRE |
| Export queue depth | Pending export jobs | ~20 (est.) | 200 (indicates backlog buildup) | 1-2 weeks | Auto-scale export workers; implement priority queue (enterprise jobs first) | Platform eng |
| Object storage (exports) | Total stored export files (GB) | ~50 GB (est.) | 500 GB (cost threshold) | 1 week | Auto-expire exports after 7 days; lazy-generate on re-request | Platform eng |
Capacity Plan
Top scaling risks (ordered by time-to-breach):
- Postgres disk + query latency (CRITICAL -- breach in ~3 months at current growth): At 5x traffic growth, the 500 GB database will approach managed instance limits within 3 months. Query latency is already degrading, indicating the problem is immediate.
- Postgres IOPS + connections (HIGH -- breach in ~4 months): 5x traffic means ~5x connection demand and proportional IOPS increase. Connection pooling buys time but doesn't solve the fundamental read/write contention.
- Export queue saturation (MEDIUM -- breach in ~5 months): Enterprise customers will drive heavier export usage; queue must scale horizontally.
Proposed scaling projects (sequenced by urgency):
Project S1: Postgres Workload Separation (Month 1-2)
- Separate OLTP (transactional) and OLAP (analytical/reporting) workloads.
- Route read-heavy analytics queries to dedicated read replicas.
- Deploy PgBouncer for connection pooling (reduce active connections by ~60%).
- Expected outcome: Buys 6+ months of headroom on connections and IOPS.
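A minimal sketch of the S1 read/write split at the application layer, assuming both DSNs point at PgBouncer pools; the connection strings and the read-only guard are illustrative.

```python
# Sketch: route analytics reads to a replica, keep writes on the primary (S1).
# DSNs are placeholders; in practice both would point at PgBouncer pools.
import psycopg

PRIMARY_DSN = "postgresql://app@pgbouncer-primary:6432/analytics"   # placeholder
REPLICA_DSN = "postgresql://app@pgbouncer-replica:6432/analytics"   # placeholder

def get_connection(read_only: bool) -> psycopg.Connection:
    """Analytics/report reads go to the replica; transactional writes stay on the primary."""
    conn = psycopg.connect(REPLICA_DSN if read_only else PRIMARY_DSN)
    if read_only:
        # Fail fast if a write ever sneaks onto the analytics path.
        conn.execute("SET default_transaction_read_only = on")
    return conn

# with get_connection(read_only=True) as conn:
#     rows = conn.execute("SELECT ...").fetchall()
```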
Project S2: Data Archival + Table Partitioning (Month 2-3)
- Implement time-based partitioning on the largest tables (event logs, audit trails, analytics data).
- Archive data older than 12 months to cold storage (S3 + Athena for ad-hoc queries).
- Target: Reduce active DB size from 500 GB to ~200 GB.
- Expected outcome: Significant improvement in query performance; disk pressure eliminated for 12+ months.
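Illustrative DDL for S2's time-based partitioning and archival, kept as SQL strings to be run by whatever migration tooling is in place; table and column names are assumptions.

```python
# Sketch of S2: declarative range partitioning by month plus partition detachment
# for archival. Table/column names are hypothetical; run via any Postgres client.
PARTITION_DDL = """
CREATE TABLE events_partitioned (
    id          bigint      NOT NULL,
    account_id  uuid        NOT NULL,
    payload     jsonb       NOT NULL,
    created_at  timestamptz NOT NULL
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2026_03 PARTITION OF events_partitioned
    FOR VALUES FROM ('2026-03-01') TO ('2026-04-01');
"""

# Monthly archival: detach the partition older than 12 months, export it to S3,
# and drop it only after the row-count reconciliation passes.
DETACH_OLD_PARTITION = "ALTER TABLE events_partitioned DETACH PARTITION events_2025_03;"
```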
Project S3: Filtering Engine Query Optimization (Month 2-4)
- Deploy the shared Filtering SDK with built-in query cost analysis.
- Kill queries exceeding cost threshold; guide users to narrow filters.
- Add query plan caching for common filter patterns.
- Expected outcome: 40% reduction in average query cost; eliminates runaway queries.
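One way the Filtering SDK could implement the cost check is to ask the planner via `EXPLAIN (FORMAT JSON)` and reject plans above a threshold; the sketch assumes the SDK has already rendered the SQL, and the threshold value is a placeholder.

```python
# Sketch: reject filter queries whose planner-estimated cost exceeds a threshold (S3).
# The threshold, exception, and function names are illustrative.
import json
import psycopg

COST_THRESHOLD = 50_000  # placeholder; tune from production query-plan data

class QueryTooExpensive(Exception):
    pass

def execute_with_cost_guard(conn: psycopg.Connection, sql: str):
    """`sql` is assumed to be fully rendered by the Filtering SDK (no bind params)."""
    raw = conn.execute("EXPLAIN (FORMAT JSON) " + sql).fetchone()[0]
    plan = raw if isinstance(raw, list) else json.loads(raw)
    estimated_cost = plan[0]["Plan"]["Total Cost"]
    if estimated_cost > COST_THRESHOLD:
        # Emit a filter.rejected event here and prompt the user to narrow the filter.
        raise QueryTooExpensive(f"estimated cost {estimated_cost} > {COST_THRESHOLD}")
    return conn.execute(sql).fetchall()
```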
Project S4: Evaluate Postgres Vertical Upgrade vs. Citus/Read Scaling (Month 3-4)
- If S1-S3 are insufficient for 5x headroom, evaluate:
- Option A: Vertical upgrade to larger instance class (quick but has ceiling).
- Option B: Citus extension for horizontal scaling (distributes large tables across nodes).
- Option C: Introduce a dedicated analytical data store (ClickHouse/Redshift) for reporting workloads, keeping Postgres lean for OLTP.
- Decision criteria: cost, migration complexity, operational burden, and headroom provided.
Feature-freeze / priority policy when triggers fire:
- Yellow (trigger threshold reached): Scaling work becomes P1; no new features that increase DB load. Platform team gets 2 additional engineers from feature teams.
- Red (critical threshold reached): Full feature freeze on DB-intensive work. All available engineers support scaling mitigation. Stakeholder communication within 4 hours of red status.
- Monitoring: Weekly capacity review meeting (30 min) until all metrics are below 50% of trigger thresholds.
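A small sketch of how the weekly capacity review could classify each doomsday-clock metric against its trigger; the yellow threshold mirrors the table above, while the red multiplier is an assumption to be tuned per metric.

```python
# Sketch: classify doomsday-clock metrics for the weekly capacity review.
# Yellow fires at the trigger threshold from the table; the red multiplier is assumed.
from typing import Literal

Status = Literal["green", "yellow", "red"]

def classify(current: float, trigger: float, red_multiplier: float = 1.2) -> Status:
    """Yellow at the trigger threshold; red ~20% past it (assumption, tune per metric)."""
    if current >= trigger * red_multiplier:
        return "red"
    if current >= trigger:
        return "yellow"
    return "green"

# Example from the disk row above: 500 GB against a 650 GB trigger is still green.
assert classify(current=500, trigger=650) == "green"
assert classify(current=700, trigger=650) == "yellow"
```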
5) Instrumentation Plan (Observability + Server-Side Analytics)
Observability Gaps
| Area | Current state | Gap | Proposed instrumentation | Owner | Priority |
|---|---|---|---|---|---|
| Database metrics | Basic uptime monitoring | No query-level latency tracking, no connection pool metrics, no replication lag alerts | Postgres exporter (prometheus) + PgBouncer metrics. Dashboards: query duration histograms, connection utilization, replication lag, table bloat, cache hit ratio. Alerts: p95 query > 500 ms, connections > 300, replication lag > 10 s. | Platform eng | P0 |
| SLO burn rate | No SLOs defined | No burn-rate tracking or alerting | Implement SLO tracking (Datadog SLO monitors or Prometheus + sloth). Multi-window burn-rate alerts (fast: 5 min window, slow: 1 hr window). Dashboard showing remaining error budget per SLO. | SRE / Platform | P0 |
| Platform service health | N/A (services don't exist yet) | No metrics for new shared services | Each platform service (Export, Filtering, Permissions) ships with: request rate, error rate, latency histograms, queue depth (export), cache hit rate (permissions). Standard RED metrics dashboard per service. | Platform eng | P1 (ship with services) |
| Distributed tracing | Partial or absent | Cannot trace a request end-to-end across services | Deploy OpenTelemetry SDK across all services. Trace context propagation through HTTP headers and gRPC metadata. Sample rate: 100% for errors, 10% for success in production. See the setup sketch below the table. | Platform eng | P1 |
| Cost monitoring | Cloud provider billing dashboard only | No per-service or per-feature cost attribution | Tag all infrastructure resources by service/team. Weekly automated cost report. Alert on >20% week-over-week increase per service. | SRE | P2 |
| Export job observability | Basic job success/fail logging | No duration tracking, no queue depth visibility, no per-tenant metrics | Export service emits: job_created, job_started, job_completed, job_failed events with duration, file size, tenant_id. Dashboard: queue depth, completion time histogram, failure rate by type. | Platform eng | P1 |
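A minimal OpenTelemetry setup matching the distributed-tracing row above; the 10% sampling is head-based here, and keeping 100% of error traces would additionally require tail-based sampling in the collector. The endpoint and service names are placeholders.

```python
# Sketch: OpenTelemetry tracer setup with ~10% head sampling (see tracing row above).
# Capturing 100% of error traces would also need tail-based sampling in the collector.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(
    resource=Resource.create({"service.name": "export-service"}),   # placeholder name
    sampler=ParentBased(TraceIdRatioBased(0.10)),                    # ~10% of traffic
)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="otel-collector:4317"))  # placeholder
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("platform.export")
with tracer.start_as_current_span("export.create") as span:
    span.set_attribute("tenant.id", "acct-123")  # placeholder attribute
```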
Server-Side Analytics Event Contract
- Canonical identity fields:
- `user_id` (UUID) -- authenticated user; always present for logged-in actions.
- `account_id` (UUID) -- the organization/tenant; always present.
- `anonymous_id` (UUID) -- generated client-side for pre-auth tracking; merged to `user_id` on login via a server-side merge event.
- Merge rules: On authentication, emit `identity_merged(anonymous_id, user_id, account_id)`. The analytics pipeline deduplicates and re-attributes pre-auth events to the resolved user.
- Delivery semantics:
- At-least-once delivery from application to event bus (Kafka/SQS).
- Dedupe strategy: Every event carries an `event_id` (UUID v7, time-sortable). Consumers deduplicate on `event_id` within a 24-hour window.
- Events are produced server-side at the point of action completion (not on request receipt).
- Schema/versioning:
- JSON Schema registry (e.g., SchemaStore in a git repo or a schema registry service).
- Events follow a `noun.verb` naming convention (e.g., `export.completed`, `filter.applied`, `permission.checked`).
- Schema changes require a PR review; breaking changes produce a new event version (`export.completed.v2`) with a 3-month overlap period.
- Data QA checks:
- Schema validation: Events validated against JSON Schema at production time (reject malformed events to dead-letter queue).
- Volume anomaly detection: Alert if any event type volume drops > 50% or increases > 300% compared to 7-day rolling average.
- Null-rate checks: Alert if required fields have null rate > 1%.
- Dedupe rate monitoring: Track duplicate event rate; alert if > 5% (indicates producer retry storms).
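To ground the contract above, here is a sketch of one canonical event plus consumer-side dedupe on `event_id` within the 24-hour window; the values are illustrative, and a real consumer would back the seen-set with a shared store such as Redis rather than process memory.

```python
# Sketch: canonical event shape (mirrors export.completed in the taxonomy below)
# and consumer-side dedupe on event_id within a 24 h window. Values are placeholders.
import time
import uuid

event = {
    "event_id": str(uuid.uuid4()),        # the spec above calls for UUID v7 (time-sortable)
    "name": "export.completed",
    "occurred_at": "2026-03-17T10:15:00Z",
    "user_id": "user-uuid-placeholder",
    "account_id": "account-uuid-placeholder",
    "properties": {
        "export_id": "exp_123",
        "format": "csv",
        "file_size_bytes": 1_048_576,
        "duration_ms": 4200,
        "row_count": 15000,
    },
}

DEDUPE_WINDOW_S = 24 * 3600
_seen: dict[str, float] = {}   # in-memory stand-in for a shared store (e.g., Redis)

def is_duplicate(event_id: str, now: float | None = None) -> bool:
    """At-least-once delivery means consumers must drop replays of the same event_id."""
    now = now or time.time()
    for eid, ts in list(_seen.items()):   # evict entries outside the 24 h window
        if now - ts > DEDUPE_WINDOW_S:
            del _seen[eid]
    if event_id in _seen:
        return True
    _seen[event_id] = now
    return False

assert not is_duplicate(event["event_id"])
assert is_duplicate(event["event_id"])
```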
Event Taxonomy (Starter Table)
| Event name | When emitted (server action) | Required properties | Identity fields | Consumers (teams) | Notes |
|---|---|---|---|---|---|
| `dashboard.loaded` | Server completes data fetch for dashboard render | dashboard_id, query_count, total_duration_ms, data_points_returned | user_id, account_id | Product analytics, Performance monitoring | Primary journey; correlate with RUM for full picture |
| `export.requested` | Export job created in queue | export_id, format (csv/xlsx/pdf), estimated_rows, filter_hash | user_id, account_id | Platform team, Product analytics | Track export patterns to optimize common formats |
| `export.completed` | Export file ready for download | export_id, format, file_size_bytes, duration_ms, row_count | user_id, account_id | Platform team, Billing (large exports) | Used for SLO measurement |
| `export.failed` | Export job fails after retries exhausted | export_id, error_type, error_message, retry_count | user_id, account_id | Platform team, SRE | Triggers alert if failure rate > 2% |
| `filter.applied` | Filter query executed via Filtering SDK | filter_hash, field_count, query_duration_ms, rows_scanned, rows_returned | user_id, account_id | Platform team, Product analytics | Feeds query optimization; identifies expensive patterns |
| `filter.rejected` | Query killed due to cost threshold | filter_hash, estimated_cost, threshold, rejection_reason | user_id, account_id | Platform team, Product (UX improvement) | Track to improve filter UX guidance |
| `permission.checked` | Permission service processes a Check() call | subject_id, action, resource_type, resource_id, result (allowed/denied), latency_ms, cache_hit | user_id, account_id | Security, Compliance/Audit | High-volume; sample at 10% for analytics, 100% for audit log |
| `permission.denied` | Permission check returns denied | subject_id, action, resource_type, resource_id, reason | user_id, account_id | Security, Product (UX -- show proper error) | 100% capture; used for security review |
| `identity.merged` | User authenticates, linking anonymous session to known user | anonymous_id, method (password/sso/oauth) | user_id, account_id, anonymous_id | Analytics pipeline | Triggers re-attribution of pre-auth events |
| `account.limit_approached` | Tenant usage approaches plan limit | limit_type, current_value, limit_value, percentage_used | account_id | Billing, Customer Success, Product | Drives upsell and capacity planning |
6) Discoverability Plan
Not applicable. This is a B2B SaaS analytics product, not a content-heavy web property. SEO/discoverability is not a primary concern for the application itself. Marketing site SEO is out of scope for this infrastructure plan.
7) Execution Roadmap
Prioritized by blast radius (how many teams/users are affected if we don't act) crossed with urgency (time-to-breach).
Roadmap
| Milestone | Scope | Acceptance criteria | Owner | Dependencies | ETA range | Rollout/rollback |
|---|---|---|---|---|---|---|
| M0: Platform Team Formation | Hire/reassign 4-6 engineers; establish platform team charter, on-call rotation, and communication channels | Team staffed; charter published; on-call rotation active; Slack channel + weekly sync established | VP Engineering | Budget approval; backfill plan for feature teams | Week 1-2 | N/A |
| M1: Emergency DB Relief (PgBouncer + Read Replicas) | Deploy PgBouncer connection pooler; route analytics read queries to dedicated read replica(s) | Active connections reduced by >= 50%; analytics queries running on replica; p95 query latency reduced by >= 30% | Platform lead | M0 (team exists); DBA access to Postgres config | Week 2-4 | Rollback: disable PgBouncer and revert DNS/connection strings to primary. Read replica routing toggled via feature flag. |
| M2: Observability Foundation | Deploy Postgres metrics exporter, SLO tracking, distributed tracing (OpenTelemetry), platform health dashboards | All SLO dashboards live; burn-rate alerts firing; DB metrics (connections, IOPS, replication lag, query duration) visible; tracing deployed to >= 3 critical services | SRE / Platform eng | M0; observability tooling access (Datadog/Grafana) | Week 3-5 | Rollback: disable exporters/agents if perf impact; dashboards are additive (no rollback needed). |
| M3: Permissions Service (Shadow Mode) | Deploy permissions service; integrate with 2 pilot teams in shadow mode (log-only, no enforcement) | Service deployed; shadow mode processing 100% of permission checks for pilot teams; discrepancy rate tracked on dashboard; latency < 5 ms p99 (cached) | Platform eng | M0; policy language chosen (OPA/Cedar); pilot teams identified | Week 4-8 | Rollback: disable shadow mode integration (feature flag per team). No user impact since shadow mode is non-enforcing. |
| M4: Data Archival + Table Partitioning | Partition largest tables by time; archive data > 12 months to cold storage (S3); verify query performance improvement | Active DB size reduced to < 250 GB; archived data queryable via Athena; no data loss verified via row-count reconciliation; p95 query latency improved by >= 40% | Platform lead + DBA | M1 (read replicas for safe migration); M2 (monitoring to verify) | Week 5-9 | Rollback: partitioning is additive; if issues, queries can still access all partitions. Archive has 30-day restore window from S3. |
| M5: Filtering SDK (v1) | Ship internal Filtering SDK with query cost analysis; integrate with 2 pilot teams | SDK published as internal package; 2 teams using it in production; queries exceeding cost threshold are rejected; average query cost reduced by >= 25% for pilot teams | Platform eng | M1 (read replicas reduce load); M2 (query metrics to measure improvement) | Week 6-10 | Rollback: teams revert to direct query building (SDK is opt-in via import). Feature flag to disable cost-threshold rejection. |
| M6: Permissions Service (Enforcement) | Flip from shadow mode to enforcement for pilot teams; roll out to remaining teams | All teams enforcing via platform permissions service; discrepancy rate < 0.1%; audit log capturing 100% of permission decisions; SOC 2 audit trail requirement met | Platform eng + Security | M3 (shadow mode validated); M2 (monitoring in place) | Week 8-12 | Rollback: per-team feature flag reverts to inline permission checks. Gradual rollout: 1 team per week. |
| M7: Export Service (v1) | Deploy shared export service; migrate 2 pilot teams | Export service handling production traffic for 2 teams; async job processing with status tracking; export SLO met (< 60 s for < 100 MB); audit logging active | Platform eng | M2 (observability); M5 (filtering SDK for export query building) | Week 9-14 | Rollback: pilot teams revert to existing export code (old endpoints remain active during migration). Traffic split via feature flag. |
| M8: Full Rollout + Scaling Evaluation | All teams on platform services (filtering, permissions, export); evaluate need for S4 (Citus/vertical upgrade/analytical store) | >= 6 teams consuming each platform service; all SLOs met for 2 consecutive weeks; scaling evaluation document published with recommendation | Platform lead | M4-M7 complete; 4 weeks of production data on new architecture | Week 14-20 | Rollback: per-team feature flags for each service. Scaling evaluation informs next phase (no rollback needed). |
| M9: Enterprise Readiness Certification | Validate all SLOs met under load test simulating 5x traffic; SOC 2 controls verified; runbooks tested via game day | Load test passes with all SLOs green at 5x current traffic; SOC 2 evidence package complete; 1 game day executed with < 30 min MTTR | VP Engineering + Platform lead + Security | M6, M7, M8 complete; load testing infrastructure | Week 20-24 | N/A (validation milestone). If SLOs fail under load test, trigger S4 scaling project immediately. |
Sequencing Rationale (Blast Radius Priority)
- M0-M1 (Weeks 1-4): Stop the bleeding. DB is the single point of failure for all 50 engineers and all customers. PgBouncer + read replicas are the highest-blast-radius, lowest-effort wins.
- M2 (Weeks 3-5): See before you act. Without observability, every subsequent decision is guesswork. This is foundational.
- M3/M4 (Weeks 4-9): Permissions + DB scaling in parallel. Permissions affects all 7 consumer teams (highest blast radius among shared capabilities). DB archival/partitioning addresses the most urgent scaling risk.
- M5/M6 (Weeks 6-12): Filtering + Permissions enforcement. Filtering SDK directly reduces DB load (scaling lever) and improves developer velocity. Permissions enforcement completes the security/compliance story.
- M7 (Weeks 9-14): Export. Important but lower blast radius than permissions and filtering; fewer teams blocked and no security/compliance urgency.
- M8-M9 (Weeks 14-24): Consolidation + certification. Full rollout, scaling evaluation, and enterprise readiness validation.
8) Risks / Open Questions / Next Steps
Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Platform team staffing delay -- cannot hire or reassign 4-6 engineers quickly enough | Medium | High (entire roadmap slips) | Begin internal reassignment immediately (2-3 engineers); hire for remaining slots in parallel. Feature teams accept temporary velocity reduction. |
| Postgres hits critical threshold before M1 completes -- query latency becomes unacceptable before PgBouncer/replicas are ready | Medium | High (customer-facing outages) | Fast-track M1 to 2-week delivery. Prepare emergency vertical upgrade as backup (can execute in days). Implement query timeout (kill queries > 10 s) as immediate stopgap. |
| Migration friction underestimated -- teams resist or are slower than expected to adopt platform services | Medium | Medium (roadmap extends 4-8 weeks) | Dedicated migration support from platform team (pairing). Executive mandate from VP Eng. Track migration per-team on weekly dashboard. |
| Permission service shadow mode reveals deep inconsistencies -- existing permission implementations disagree significantly | Medium | Medium (delays enforcement) | Extend shadow mode by 2-4 weeks. Triage discrepancies by severity: fix critical ones immediately, defer cosmetic ones. Document "golden" behavior as the authoritative source. |
| 5x traffic growth arrives faster than 6 months -- enterprise deals close early or marketing spike occurs | Low | High (DB/infra crisis) | M1 (DB relief) must complete in 4 weeks regardless. Maintain "emergency scaling playbook" (vertical upgrade + aggressive caching) as a break-glass option. |
| SOC 2 audit requirements are broader than anticipated -- additional controls needed beyond permissions audit trail | Medium | Medium (scope creep) | Engage security/compliance consultant in Week 1-2 to enumerate full control requirements. Build controls inventory in parallel with M3. |
Open Questions
- Postgres managed service or self-hosted? If managed (e.g., AWS RDS/Aurora), what is the current instance class and max scaling tier? This affects the ceiling for vertical upgrades and available extensions (Citus).
- How many distinct permission models exist across teams? Need an audit of current RBAC/ACL implementations to understand the scope of consolidation into the permissions service.
- Is there an existing data warehouse or analytics pipeline? If yes, the export service and analytics event contract can leverage it; if no, we need to factor pipeline setup into the roadmap.
- What is the current deployment model (K8s, ECS, bare metal)? This affects how platform services are deployed and how auto-scaling is configured.
- Are there existing enterprise customer SLA commitments? If contractual SLAs already exist, they constrain our SLO targets (SLOs must be stricter than SLAs).
- What is the budget for infrastructure scaling? Vertical Postgres upgrades and Citus licensing have different cost profiles; need budget parameters to recommend the right option.
- Is there an existing on-call rotation, or is this being created from scratch? Affects M0 timeline and platform team formation.
Next Steps
- This week: VP Engineering approves platform team formation (M0). Identify 2-3 engineers to reassign immediately. Post job requisitions for remaining slots.
- This week: DBA/platform lead begins M1 -- deploy PgBouncer in staging; configure read replica routing. Target production deployment in 2 weeks.
- This week: Run a 1-hour audit of existing permission implementations across all teams. Document the current state to scope M3.
- Week 2: Finalize observability tooling choices (Datadog vs. Grafana stack) and begin M2 implementation.
- Week 2: Publish this Platform & Infrastructure Improvement Pack to all engineering teams. Schedule 30-minute walkthrough for stakeholders.
- Week 3: Hold first weekly capacity review meeting (30 min). Review doomsday clock metrics and track progress on M1.
- Week 4: Evaluate M1 results. If insufficient, fast-track S4 evaluation (vertical upgrade vs. Citus vs. analytical store).
- Ongoing: Bi-weekly platform team retrospective to assess migration progress and surface blockers.
Quality Gate Self-Assessment
Checklist Verification
A) Scope + contracts
- "When to use / When NOT to use" is explicit; redirects to
platform-strategy,technical-roadmaps,managing-tech-debt, andengineering-culture. - Inputs are sufficient; missing info handled via 5 explicit assumptions (A1-A5).
- Deliverables are explicit and ordered (sections 1-8).
B) Platformization quality
- All 3 shared capability candidates have 2+ consumers (Export: 5, Filtering: 6, Permissions: 7).
- Each has a proposed contract (REST API, internal SDK, gRPC service) and ownership model.
- Migration/rollout plan exists per capability (phased with shims, shadow mode, deprecation windows).
C) Infrastructure quality attributes
- Reliability and performance targets are measurable (SLOs/SLIs table with specific numbers).
- Privacy/safety requirements spelled out (encryption, residency, retention, audit).
- Operability covered (dashboards, alerts, runbooks, on-call).
- Cost guardrails included (budgets, alerts, optimization targets).
D) Scaling readiness ("doomsday clock")
- 8 limits enumerated with current values (or explicit estimates).
- Trigger thresholds account for lead time (e.g., disk trigger at 650 GB with 6-8 week lead time).
- Each trigger has an owner and named mitigation project.
- Clear yellow/red policy for reprioritization/feature freeze.
E) Instrumentation + analytics
- 6 observability gaps identified with owners and priorities.
- 10 canonical events captured server-side.
- Identity strategy defined (user_id, account_id, anonymous_id) with merge rules.
- Data quality checks defined (schema validation, volume anomalies, null rates, dedupe rates).
F) Discoverability
- Explicitly marked "Not applicable" with rationale.
G) Execution readiness
- 10 milestones with acceptance criteria and owners.
- Dependencies and rollout/rollback plans included for every milestone.
- Risks (6), open questions (7), and next steps (8) are present and actionable.