DevOps / Platform Engineer¶

The infrastructure owner and reliability guardian.

Profile¶

Firmographics¶

Attribute	Profile
Company Size	Mid-market to enterprise
Team Size	5-20 platform engineers
Responsibility	Infrastructure, deployment, observability, reliability
Tools	Terraform, Kubernetes, AWS/Azure/GCP, Datadog
Pain	"I find out about drift from incidents, not before"

Role Definition¶

Aspect	Detail
Primary	Maintain infrastructure, enable developer velocity
Secondary	Cost optimization, security compliance, incident response
Reports To	VP Engineering, Director of Infrastructure
Collaborates With	Development teams, Security, SRE

Pain Points¶

Priority 1: "I find out about infrastructure drift from incidents"¶

The Problem: - Terraform state says one thing - Reality is different - Discovery happens during incident response

Example:

Incident Timeline:
- 14:00: Service degradation reported
- 14:15: Root cause identified: wrong instance type
- 14:30: Terraform plan shows: should be m5.large
- 14:35: AWS console shows: actually m5.xlarge (changed manually 3 weeks ago)
- 15:00: Fixed, post-mortem begins

Question: Why didn't we know about the manual change?

Substrate Solution: - SSH Runtime Connector compares declared vs observed - Detects drift within 15 minutes - Alerts before incident

Priority 2: "Terraform apply is scary on large changes"¶

The Problem: - Large Terraform changes = unknown blast radius - No way to preview impact on services - Rollbacks are painful

Current State:

$ terraform plan
# ... 500 lines of changes ...
# Do you want to apply? (yes/no)
# 🤞 Hope this doesn't break anything

Substrate Solution: - Simulation engine: preview impact before apply - Blast radius calculation: which services affected - Policy evaluation: will this violate constraints?

Priority 3: "No visibility into what's actually running"¶

The Problem: - Kubernetes has 500 pods - Some are from old deployments - Some are manually created - No unified view of runtime topology

Substrate Solution: - Live graph from K8s API watch - SSH verification of host state - Unified view: code + infrastructure + runtime

Priority 4: "Configuration changes bypass review"¶

The Problem: - Someone changes a ConfigMap directly - Service behavior changes - No audit trail - Root cause analysis takes hours

Substrate Solution: - Runtime verification detects changes - Graph diff shows what changed - Linked to policy violations if any

Use Cases¶

1. Pre-Deployment Simulation¶

Before Terraform apply:

> Simulate: terraform plan for vpc-changes

Affected infrastructure:
- VPC: production-vpc
- Subnets: 6
- Route tables: 4
- Security groups: 12

Service impact:
- Services affected: 8
- API dependencies: 15
- Database connections: 6

Policy evaluation:
- Violations introduced: 0
- Violations resolved: 1 (unauthorized route)

Drift delta: -0.08 (improvement)

Recommendation: PROCEED
Confidence: 92%

2. Runtime Drift Detection¶

SSH Runtime Connector finds:

ALERT: Runtime Drift Detected

Host: prod-worker-05
Declared: payment-service v2.3.1, port 8080
Observed: payment-service v2.3.0, port 8080

Drift: Version mismatch (patch version)
Severity: Medium
Last deploy: 14 days ago

Possible causes:
- Manual rollback
- Failed deployment
- Configuration drift

Action: Investigate or approve exception

3. Blast Radius Analysis¶

Before maintenance:

> Blast radius: database-primary

Direct dependencies: 6 services
Indirect dependencies (2 hops): 14 services
Total affected: 20 services

Critical path:
- payment-service (P0)
- order-service (P0)
- inventory-service (P1)

Recommended maintenance window:
- Lowest traffic: Sunday 02:00-04:00 UTC
- Estimated impact: 500 users

4. Configuration Validation¶

Detecting invalid configs:

VIOLATION: Configuration Drift

Resource: ConfigMap/app-config
Declared: max_connections: 100
Observed: max_connections: 1000

Policy: resource-limits (POLICY-023)
Rule: max_connections must match declared value

Risk: Resource exhaustion, cascading failure

Fix: Revert to 100 or update Terraform

Integration Points¶

Terraform¶

# Terraform module with Substrate metadata
module "payment_service" {
  source = "./modules/service"

  name = "payment-service"
  domain = "payments"

  # Substrate annotations
  substrate_metadata = {
    owner = "payments-team"
    dependencies = ["database-primary", "redis-cache"]
    policies = ["pci-boundary", "api-gateway-first"]
  }
}

Kubernetes¶

apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
  annotations:
    substrate.io/domain: payments
    substrate.io/owner: payments-team
    substrate.io/policies: pci-boundary,api-gateway-first
spec:
  # ... standard deployment spec

CI/CD¶

# GitHub Actions workflow
- name: Substrate Pre-Deploy Check
  uses: substrate/github-action@v1
  with:
    terraform-plan: plan.json
    fail-on-violation: true

Value Proposition¶

Before Substrate¶

Activity	Frequency	Time	Annual Cost
Manual drift detection	Weekly	4 hrs	$40K
Incident response (drift-related)	Monthly	8 hrs	$24K
Pre-deploy analysis	Per deploy	2 hrs	$30K
Root cause analysis	Monthly	4 hrs	$12K
Total			$106K

With Substrate¶

Subscription: $6K/year
Implementation: $2.5K
Total: $8.5K

Net savings: $97.5K/year (12x ROI)

Operational Improvements¶

Metric	Before	After
Drift detection time	Days/weeks	15 minutes
Pre-deploy confidence	Gut feeling	Data-driven (92%+)
Incident root cause	Hours	Minutes
Blast radius knowledge	Tribal	Queryable

Messaging¶

Elevator Pitch¶

"You maintain infrastructure but find out about drift from incidents. Substrate continuously verifies what you declared against what's actually running, simulates changes before you apply them, and shows you the blast radius of any component — so you prevent incidents instead of responding to them."

Key Messages¶

Know what's running
"SSH verification every 15 minutes"
"Detect manual changes immediately"
Deploy with confidence
"Simulate Terraform changes before apply"
"Know the blast radius in advance"
Prevent incidents
"Catch drift before it causes outage"
"Validate configs continuously"
Debug faster
"Query architecture in natural language"
"Trace dependencies in seconds"

Case Study: SaaS Platform¶

Company: B2B SaaS, 200 engineers
Challenge: Multi-tenant infrastructure, frequent deploys, reliability requirements

Before: - 3 incidents/quarter from configuration drift - Terraform changes: 2-day review process - No visibility into cross-service impact

With Substrate: - SSH Runtime Connector on all hosts - Pre-deploy simulation mandatory - Blast radius queries for all changes

Results: - Drift-related incidents: 3/quarter → 0 - Deploy frequency: 5/day → 20/day (confidence) - Mean time to resolution: 45 min → 12 min - Terraform review time: 2 days → 4 hours

Quote:

"Substrate turned infrastructure management from reactive firefighting to proactive governance."