Engineering Culture Improvement Plan
Context Summary
- Company: B2B analytics SaaS
- Team: 40 engineers across 5 teams
- Architecture: Rails monolith + 3 Go microservices
- Current deploy cadence: Twice/week via manual release trains
- Incident rate: ~2 P1 incidents/month
- Key symptoms: Slow PR reviews (3+ days), no on-call ownership, platform team bottleneck, poor mid-sprint communication
- Goal: Daily deploys + 50% P1 reduction within one quarter
1. Deployment Pipeline & Release Process
Current Problem
Manual release trains twice a week create large batch sizes, increase risk per deploy, and slow feedback loops.
Recommendations
Move to continuous delivery with feature flags
- Adopt a trunk-based development model. Engineers merge small PRs to `main` daily.
- Implement a feature flag system (e.g., LaunchDarkly, Flipper for Rails, or a lightweight internal solution). Every new feature ships behind a flag so that deploys are decoupled from releases.
- Replace the manual release train with an automated CI/CD pipeline that deploys to production on every green merge to `main`.
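To make the flag-based decoupling concrete, here is a minimal sketch using Flipper (one of the options named above). The `new_dashboard` flag, the actor, and the in-memory adapter are illustration-only assumptions; a real Rails setup would use a persistent adapter such as `flipper-active_record`.

```ruby
# Gemfile: gem "flipper" (plus a persistent adapter in production)
require "flipper"
require "flipper/adapters/memory"

# Illustration-only adapter; swap for ActiveRecord/Redis in a real app.
Flipper.configure do |config|
  config.default { Flipper.new(Flipper::Adapters::Memory.new) }
end

# Any object responding to #flipper_id can be an actor (e.g., current_user).
user = Struct.new(:flipper_id).new("user:42")

Flipper.enable_percentage_of_actors(:new_dashboard, 5) # expose to ~5% of actors first

if Flipper.enabled?(:new_dashboard, user)
  puts "render new dashboard"
else
  puts "render legacy dashboard"
end

Flipper.enable(:new_dashboard)   # full rollout -- no deploy required
Flipper.disable(:new_dashboard)  # instant kill switch
```

Because the flag, not the deploy, controls exposure, a bad change can be switched off in seconds without rolling back code.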
Invest in deployment confidence
- Require a passing CI suite (unit, integration, and a lightweight smoke test against staging) before any merge.
- Add automated canary or rolling deploys for the Rails monolith. Route 5% of traffic to the new version, monitor error rates and latency for 10 minutes, then promote or roll back automatically.
- For the Go microservices, implement blue-green or canary deploys via your container orchestrator (Kubernetes or similar).
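As one way to implement the canary step above, here is a sketch of an Argo Rollouts manifest on Kubernetes (Flagger or a home-grown deploy script would also work). The service name, image, and port are placeholders, not existing resources:

```yaml
# Hypothetical Rollout for one of the Go services; names and image are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: events-ingest
spec:
  replicas: 4
  selector:
    matchLabels:
      app: events-ingest
  template:
    metadata:
      labels:
        app: events-ingest
    spec:
      containers:
        - name: events-ingest
          image: registry.example.com/events-ingest:latest
          ports:
            - containerPort: 8080
  strategy:
    canary:
      steps:
        - setWeight: 5            # ~5% of traffic; precise weighting needs an ingress/mesh traffic router
        - pause: {duration: 10m}  # watch error rate and latency before promoting
        - setWeight: 100          # promote to all traffic
```

Automated promotion or rollback would hook a metrics analysis step into the pause, so a bad canary never reaches 100%.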
Reduce batch size
- Set a soft guideline: PRs should be under 300 lines of diff. Anything larger needs a justification or should be broken into a stack.
- Encourage short-lived branches (< 1 day).
Timeline
- Weeks 1-2: Set up feature flag infrastructure and automated CI gating.
- Weeks 3-4: Implement canary deploys on one service as a pilot.
- Weeks 5-8: Roll out automated deploys across all services. Sunset the manual release train.
- Weeks 9-12: Refine, monitor, and achieve daily (or more frequent) deploys.
2. PR Review Process & Code Velocity
Current Problem
PRs sit in review for 3+ days. This kills velocity, increases merge conflicts, and demoralizes engineers.
Recommendations
Establish review SLAs
- First meaningful review within 4 business hours. Final approval within 24 hours.
- Track these metrics and make them visible on a team dashboard (use a tool like LinearB, Sleuth, or a simple script against your Git provider's API).
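If you go the simple-script route, here is a rough sketch that measures time-to-first-review against the GitHub API using Octokit. The repository name is a placeholder, and a purpose-built tool (LinearB, Sleuth) gives you the same number without maintenance:

```ruby
# Gemfile: gem "octokit"
require "octokit"

REPO = "acme/analytics-monolith" # placeholder repository name
client = Octokit::Client.new(access_token: ENV.fetch("GITHUB_TOKEN"))

# Hours from PR open to first submitted review, for the 50 most recent closed PRs.
hours_to_first_review = client.pull_requests(REPO, state: "closed", per_page: 50).filter_map do |pr|
  first_review = client.pull_request_reviews(REPO, pr.number).min_by(&:submitted_at)
  next unless first_review

  (first_review.submitted_at - pr.created_at) / 3600.0
end

if hours_to_first_review.any?
  median = hours_to_first_review.sort[hours_to_first_review.size / 2]
  puts format("Median time to first review: %.1f hours (n=%d)", median, hours_to_first_review.size)
end
```

Run it weekly and put the number on the team dashboard; the goal is visibility, not precision.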
Assign clear reviewers
- Use CODEOWNERS files to auto-assign reviewers based on file paths. This eliminates the "who should review this?" ambiguity.
- Each PR should have exactly one required reviewer (not two or three). Trust your engineers.
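A minimal CODEOWNERS sketch (GitHub syntax shown; GitLab's file is very similar). The paths and team handles below are hypothetical:

```text
# .github/CODEOWNERS -- order matters: the last matching pattern wins,
# so keep the catch-all fallback first.
*                     @acme/eng-leads
/app/billing/         @acme/billing-team
/app/models/          @acme/core-platform
/services/ingest/     @acme/data-pipeline
/.github/workflows/   @acme/platform-team
```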
Reduce review burden
- Invest in automated linting, formatting (RuboCop for Rails, `gofmt`/`golangci-lint` for Go), and static analysis. Robots should catch style issues, not humans.
- Write clear PR descriptions with a template: What changed? Why? How to test? Any risks?
- Encourage authors to self-review before requesting review and to annotate non-obvious sections.
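A starting point for the PR description template mentioned above (the path follows GitHub's convention; adjust for GitLab):

```markdown
<!-- .github/pull_request_template.md -->
## What changed?

## Why?

## How to test

## Risks
- [ ] Behind a feature flag?
- [ ] Includes a DB migration?
- [ ] Needs a runbook or dashboard update?
```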
Cultural shifts
- Normalize small PRs. A PR that takes 15 minutes to review gets reviewed fast. A PR that takes 2 hours sits.
- Introduce "review blocks" -- 30-minute windows at 10am and 2pm where engineers prioritize clearing their review queue.
- Recognize and praise fast, high-quality reviewers publicly.
Timeline
- Week 1: Implement CODEOWNERS, PR templates, and automated linting.
- Week 2: Announce review SLAs. Start tracking metrics.
- Weeks 3-12: Monitor, coach, and iterate. Publicly celebrate improvements.
3. On-Call & Incident Management
Current Problem
Random on-call rotation means no ownership, no expertise building, and slow incident response. Two P1s/month is too many.
Recommendations
Structured on-call rotation by team
- Each of the 5 teams owns the services and code they build. On-call rotates within the team on a weekly basis.
- Every team has a primary and secondary on-call. Primary responds; secondary is backup.
- Use PagerDuty or Opsgenie to manage rotations, escalation policies, and alerting.
On-call onboarding & runbooks
- Each team maintains runbooks for their services covering: common alerts, diagnostic steps, rollback procedures, escalation criteria.
- New on-call engineers shadow for one rotation before going primary.
- Provide a "first 5 minutes" checklist for every alert: check dashboards, check recent deploys, check dependent services.
Incident response process
- Define severity levels clearly (P1 = customer-facing outage or data integrity issue, P2 = degraded service, etc.).
- For P1s: Incident commander (rotating role) coordinates. Communication goes to a dedicated Slack channel. Status updates every 15 minutes.
- Mandatory blameless post-mortems for every P1 within 48 hours. Post-mortems must identify contributing causes and produce action items with owners and deadlines.
Reduce P1 frequency
- Analyze the last 6 months of P1s. Categorize them (deploy-related, infrastructure, data pipeline, third-party dependency, etc.). Attack the top category.
- Improve observability: structured logging, distributed tracing (e.g., Datadog, Honeycomb), and SLO-based alerting rather than threshold-based alerting.
- Require pre-deploy checklists for high-risk changes (database migrations, schema changes, infrastructure modifications).
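To make "SLO-based rather than threshold-based" concrete, here is a minimal burn-rate alert sketch in Prometheus-style rule syntax (Datadog and Honeycomb offer equivalent SLO monitors). The metric name and the assumed 99.9% availability SLO are illustrative:

```yaml
# Assumes an http_requests_total counter with a `status` label and a 99.9% SLO.
groups:
  - name: slo-burn-rate
    rules:
      - alert: FastErrorBudgetBurn
        # Page when the 1h error rate burns the budget ~14x faster than
        # sustainable (the common "fast burn" threshold for a 99.9% SLO).
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Error budget burning too fast -- user-facing impact likely"
```

Unlike a raw threshold, this only pages when the error budget is genuinely at risk, which cuts alert noise and on-call fatigue.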
Timeline
- Weeks 1-2: Define severity levels, set up PagerDuty with team-based rotations, create escalation policies.
- Weeks 3-4: Teams draft initial runbooks. Conduct a P1 retrospective analysis.
- Weeks 5-8: Implement observability improvements targeting the top P1 category.
- Weeks 9-12: Refine alerting, complete runbook coverage, measure P1 trend.
4. Platform Team & Internal Developer Experience
Current Problem
The platform team is a bottleneck for every feature team. This creates dependencies, waiting, and frustration.
Recommendations
Shift platform from gatekeeper to enabler
- Platform team should build self-service tools, not do work on behalf of feature teams. The goal is to make feature teams autonomous.
- Identify the top 5 reasons feature teams file tickets to platform. Build self-service solutions for at least 3 of them within the quarter.
Specific self-service targets (likely candidates)
- Infrastructure provisioning: Provide Terraform modules or internal CLI tools so feature teams can spin up their own staging environments, add new queues, or create database read replicas without a platform ticket.
- CI/CD pipeline configuration: Let feature teams own their pipeline configs (e.g., `.github/workflows/` or equivalent) with platform-provided templates.
- Observability setup: Provide dashboards-as-code templates so teams can instrument and monitor their own services.
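As a sketch of the infrastructure-provisioning item above: a platform-published Terraform module that a feature team consumes without filing a ticket. The module source, inputs, and outputs are hypothetical, not an existing internal API:

```hcl
# Hypothetical platform-owned module; source path and inputs are illustrative.
module "checkout_staging" {
  source = "git::ssh://git@github.com/acme/platform-modules.git//staging-environment?ref=v1.4.0"

  service_name   = "checkout"
  owner_team     = "checkout"
  rails_env      = "staging"
  database_size  = "db.t3.medium"
  enable_sidekiq = true
}

output "staging_url" {
  value = module.checkout_staging.app_url
}
```

The platform team reviews and versions the module; feature teams self-serve behind a guardrail instead of waiting in a ticket queue.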
Embed, don't centralize
- Consider rotating a platform engineer into each feature team for 2-week stints to transfer knowledge and identify friction points.
- Establish "platform office hours" (2 hours/week) for ad-hoc questions instead of requiring tickets for everything.
Define a platform product roadmap
- Treat the platform as an internal product. The platform team's "customers" are the feature teams.
- Run a quarterly survey or retrospective asking feature teams: What slows you down? What do you need from platform?
- Prioritize platform work based on developer-hours-saved, not technical elegance.
Timeline
- Weeks 1-2: Survey feature teams to identify top friction points. Audit current ticket backlog.
- Weeks 3-6: Build and ship first 2 self-service solutions.
- Weeks 7-10: Ship 1-2 more. Begin platform embedding rotation.
- Weeks 11-12: Re-survey. Measure reduction in platform tickets.
5. Communication & Sprint Transparency
Current Problem
PMs complain that engineering "goes dark" mid-sprint. This erodes trust and leads to misaligned priorities.
Recommendations
Async status updates
- Engineers post a brief daily standup update in a shared Slack channel or tool (Geekbot, Standuply, or just a pinned thread): What I did, what I'm doing, any blockers.
- This replaces or supplements synchronous standups, which often become rote.
Mid-sprint check-ins
- At the midpoint of each sprint (e.g., Wednesday of a 2-week sprint), hold a 30-minute sync between the tech lead and PM for each team. Review: Are we on track? Any scope changes needed? Any surprises?
- If something is at risk, surface it here -- not at sprint review.
Work-in-progress visibility
- Use your project management tool (Jira, Linear, etc.) rigorously. Every piece of work should have a ticket. Ticket status should reflect reality.
- Automate status transitions where possible (e.g., PR opened = "In Review", PR merged = "Done").
Demo culture
- End every sprint with a 30-minute demo. Engineers show working software, not slides. PMs, designers, and stakeholders attend.
- This builds shared understanding and celebrates progress.
Escalation norms
- Define a clear norm: If a task is blocked for more than half a day, escalate. No one should silently spin.
- Create a lightweight escalation path: engineer -> tech lead -> engineering manager. Response expected within 2 hours during business hours.
Timeline
- Week 1: Set up async standup tooling. Announce new norms.
- Week 2: Begin mid-sprint check-ins.
- Weeks 3-4: Automate ticket status transitions. Introduce demo culture.
- Weeks 5-12: Iterate based on PM and engineer feedback.
6. Metrics & Accountability
What to Measure
Track these weekly and review them in a monthly engineering leadership meeting:
| Metric | Current Baseline | 90-Day Target |
|---|---|---|
| Deploy frequency | 2x/week | 1x/day (minimum) |
| PR review time (first review) | 3+ days | < 4 hours |
| PR merge time (open to merge) | 4-5 days (est.) | < 24 hours |
| P1 incidents/month | 2 | 1 or fewer |
| Mean time to recovery (MTTR) | Unknown (measure) | < 1 hour |
| Platform team tickets from feature teams | Unknown (measure) | 50% reduction |
| Sprint goal completion rate | Unknown (measure) | > 80% |
How to Measure
- Deploy frequency: Count production deploys per day from CI/CD logs.
- PR metrics: Use GitHub/GitLab API or a tool like LinearB.
- Incident metrics: Track in PagerDuty or your incident management tool.
- Platform tickets: Track in your ticketing system with a "platform" label.
- Sprint completion: Track in your project management tool.
Accountability
- Each team's tech lead owns their team's metrics and reports weekly.
- The engineering manager reviews cross-team trends monthly.
- Share a monthly "engineering health" summary with the broader org (PMs, leadership) to build trust.
7. Implementation Roadmap (12-Week Overview)
Phase 1: Foundation (Weeks 1-4)
- Set up feature flag infrastructure
- Implement CODEOWNERS and PR review SLAs
- Define incident severity levels and set up PagerDuty with team-based rotations
- Survey feature teams on platform friction points
- Launch async standups and mid-sprint check-ins
- Begin measuring all baseline metrics
Phase 2: Build (Weeks 5-8)
- Roll out automated deploys (canary/rolling) across services
- Ship first self-service platform tools
- Implement observability improvements targeting top P1 category
- Teams draft runbooks; conduct P1 retrospective
- Automate ticket status transitions
- Review and adjust PR review SLAs based on data
Phase 3: Scale & Refine (Weeks 9-12)
- Achieve daily deploys; sunset manual release trains
- Ship remaining self-service platform tools
- Refine alerting and on-call processes
- Re-survey feature teams on platform experience
- Conduct quarter retrospective; measure against targets
- Document what worked and plan next quarter
8. Risks & Mitigations
| Risk | Mitigation |
|---|---|
| Engineers resist daily deploys due to fear of breaking production | Start with low-risk services. Invest heavily in canary deploys and automated rollback. Celebrate successful daily deploys. |
| PR review SLAs feel like surveillance | Frame as a team norm, not a management mandate. Track at team level, not individual level initially. |
| On-call burden feels unfair across teams (some services are noisier) | Invest in reducing alert noise first. Compensate on-call with time off or a stipend. |
| Platform self-service tools take too long to build | Start with the simplest wins (documentation, templates, scripts) before building full self-service platforms. |
| PMs over-index on new communication rituals | Keep ceremonies lightweight. If a mid-sprint check-in has nothing to discuss, cancel it. Avoid process theater. |
| Quarter is too short to achieve all goals | Prioritize deploy frequency and P1 reduction as the primary goals. Communication and platform improvements are supporting goals that may extend into Q2. |
9. Cultural Principles to Reinforce
Throughout this transformation, consistently reinforce these principles:
- Ownership over assignment. Teams own their services end-to-end: building, shipping, running, and fixing them.
- Small batches over big bangs. Small PRs, small deploys, small experiments. Reduce the blast radius of everything.
- Transparency over opacity. Share status, share metrics, share post-mortems. Default to open.
- Automation over manual gates. If a human is doing something a machine could do, automate it.
- Speed and safety are not tradeoffs. Deploying more frequently with smaller changes is both faster and safer than deploying less frequently with larger changes.
10. Quick Wins (First 2 Weeks)
To build momentum, prioritize these high-impact, low-effort changes immediately:
- Set up CODEOWNERS -- 1 hour of work, immediate impact on review assignment.
- Add automated linting to CI -- eliminates an entire class of review comments.
- Announce PR review SLA (4-hour first review) -- costs nothing, sets expectations.
- Assign on-call by team in PagerDuty -- can be done in a day.
- Start async standups -- pick a Slack bot and turn it on.
- Schedule the first mid-sprint PM check-in -- put it on the calendar now.
- Pull the last 6 months of P1 data and categorize root causes -- the analysis alone will reveal your highest-leverage improvement.
These seven actions can all be completed in the first two weeks and will create visible, immediate improvements in velocity, reliability, and communication.