<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[The Tech Due Diligence Playbook]]></title><description><![CDATA[25 yrs building software in Israel, US, & Germany. EMBA-trained, finance-savvy tech leader (LiveEO cloud/IT head) now offers buy- & sell-side Technical Due Diligence services.]]></description><link>https://eitanschuler.substack.com</link><image><url>https://substackcdn.com/image/fetch/$s_!vpO_!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf6f1ef2-1bd2-4745-a70a-91281f6596e5_1181x1181.jpeg</url><title>The Tech Due Diligence Playbook</title><link>https://eitanschuler.substack.com</link></image><generator>Substack</generator><lastBuildDate>Tue, 14 Apr 2026 07:22:43 GMT</lastBuildDate><atom:link href="https://eitanschuler.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Eitan Schuler]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[eitanschuler@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[eitanschuler@substack.com]]></itunes:email><itunes:name><![CDATA[Eitan Schuler]]></itunes:name></itunes:owner><itunes:author><![CDATA[Eitan Schuler]]></itunes:author><googleplay:owner><![CDATA[eitanschuler@substack.com]]></googleplay:owner><googleplay:email><![CDATA[eitanschuler@substack.com]]></googleplay:email><googleplay:author><![CDATA[Eitan Schuler]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Performance Engineering & Load Testing]]></title><description><![CDATA[Edition 25]]></description><link>https://eitanschuler.substack.com/p/performance-engineering-and-load</link><guid 
isPermaLink="false">https://eitanschuler.substack.com/p/performance-engineering-and-load</guid><dc:creator><![CDATA[Eitan Schuler]]></dc:creator><pubDate>Mon, 13 Apr 2026 10:06:40 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/a6b12157-5dad-480c-959e-1fc90568257f_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Originally published on <a href="https://www.linkedin.com/pulse/performance-engineering-load-testing-eitan-schuler-xkzzf/">LinkedIn </a>on April 8, 2026</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/subscribe?"><span>Subscribe now</span></a></p><h2>When the load test was a lie</h2><p>Six months before a Series C close, a CTO walked us through a load test report. 10,000 concurrent users. Response times flat at 120 milliseconds. The graph was beautiful. We asked one question: what was the think time between requests? ...Silence.</p><p>The test had been run at machine speed. Zero pause between requests, no session modeling, no realistic traffic shape. It was measuring how fast the server could flush a queue of simultaneous calls, not how the system behaved under actual human load. They re-ran it the following week with realistic behavior. At 3,200 concurrent users, a connection pool on the checkout service saturated. P99 latency climbed to 8 seconds. One queue backed up and stayed backed up for 20 minutes after the test ended. The deal closed, but with a performance remediation plan baked into the term sheet and two senior engineers pulled off roadmap work to execute it.</p><p>A load test that doesn&#8217;t model reality is not a load test. 
It&#8217;s a confidence trick you play on yourself.</p><div><hr></div><h2>Why this matters</h2><p>Investors don&#8217;t buy your current throughput numbers. They buy headroom. The question is not whether the system handles today&#8217;s load. It is whether it handles 3x growth without a rewrite, survives an incident at peak, or avoids a surprise infrastructure bill that compresses the margins they just modeled.</p><p>Performance failures are expensive in a specific way: they surface under the conditions you least want them. A product launch. A viral campaign. The week after a major partnership announcement. By then the cost is not just engineering time. It is SLA credits, customer churn, and a deal that closes at a discount.</p><p>Investors and their technical advisors increasingly ask for load test evidence, not just uptime charts. The question is whether you can prove the system holds up before they wire the money.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/p/performance-engineering-and-load?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/p/performance-engineering-and-load?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><h2>What investors look for</h2><p><strong>Realistic tests, not synthetic benchmarks.</strong> A test that floods endpoints at machine speed with uniform payloads tells you almost nothing useful. Investors want evidence that tests model actual user sessions: login flows, searches, cart additions, checkouts, with realistic think times, session durations, and a traffic shape that resembles production. 
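The gap between a machine-speed test and a realistic one is easy to quantify with Little's Law, which ties concurrency, latency, and offered load together. A minimal sketch, with purely illustrative numbers (none taken from a real engagement):

```python
def offered_load_rps(concurrent_users: int, response_s: float, think_s: float) -> float:
    """Requests per second that a population of concurrent users actually generates."""
    # Little's Law: concurrent_users = offered_load * (response_time + think_time)
    return concurrent_users / (response_s + think_s)

# "10,000 concurrent users" at machine speed: zero think time, 120 ms responses.
machine_speed = offered_load_rps(10_000, 0.120, 0.0)   # ~83,333 requests/second
# The same 10,000 users with ~25 s of human think time between requests.
realistic = offered_load_rps(10_000, 0.120, 25.0)      # ~398 requests/second
```

Same headline figure, wildly different offered load. A load test report that quotes concurrent users without quoting think time is not interpretable.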
The closer the test is to real behavior, the more the results are worth.</p><p><strong>Documented bottlenecks with evidence.</strong> Every system has a ceiling somewhere. What investors want to see is that you found yours before a customer did. That means knowing which component saturates first, at what load level, and what the failure mode looks like. A capacity map with concrete numbers beats a general claim that the system &#8220;scales horizontally.&#8221;</p><p><strong>SLOs measured under load, not just in steady state.</strong> P95 and P99 latency targets should hold at peak, not just at Tuesday morning baseline. If your SLO is checkout responding in under 500ms, that target needs to be tested at 2x current peak.</p><p><strong>A capacity model tied to growth.</strong> Investors want a simple model: at current ARR growth, when will you hit the next infrastructure ceiling? What does it cost to extend that runway by 12 months? Teams that can answer with data are much easier to price than teams that say &#8220;we&#8217;ll scale when we need to.&#8221;</p><div><hr></div><h2>Stage and stake: how the lens sharpens</h2><p><strong>Seed and early A:</strong> Basic load tests on critical user journeys are enough. Investors accept that the system is not fully hardened. The key signal is awareness. Can the founders articulate where the limits are and what they would do as they approach them?</p><p><strong>Series B and growth:</strong> Systematic load testing is expected. Bottlenecks should be documented, not speculated. Capacity models should be tied to ARR projections. P95 and P99 latency targets should be tracked and tested, not just aspirational.</p><p><strong>Control buyouts:</strong> Buyers may run their own load tests against a staging environment that mirrors production. They will ask for the last six months of test results, correlate them with incident history, and model the infrastructure spend needed to handle 3x current volume.
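That kind of model can start as a back-of-envelope calculation. A rough sketch, assuming you know your current peak load, a measured saturation point, and a growth rate (all figures below are hypothetical):

```python
import math

def months_to_ceiling(current_peak: float, saturation_point: float,
                      monthly_growth: float) -> float:
    """Months until peak load grows into a known saturation point."""
    return math.log(saturation_point / current_peak) / math.log(1.0 + monthly_growth)

# Illustrative: 1,200 rps peak today, checkout saturates at 3,200 rps,
# and load tracks revenue growth at 6% month over month.
runway = months_to_ceiling(1_200, 3_200, 0.06)   # ~16.8 months
```

The precision matters less than being able to say, with numbers behind each term, when the next ceiling arrives and what it costs to push it out.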
Gaps get priced in.</p><div><hr></div><h2>Patterns and practices worth adopting</h2><p><strong>Test user journeys, not endpoints in isolation.</strong> A checkout flow involves authentication, product lookup, inventory check, payment processing, and confirmation. Testing each endpoint independently misses the compound effect of all of them running under realistic concurrency.</p><p><strong>Set performance budgets before you test.</strong> Decide what acceptable looks like before you run the test, not after you see the results. P95 under 300ms for checkout. Queue lag under 2 seconds. Without a pre-defined target, every result is open to interpretation and easy to rationalize.</p><p><strong>Find the three-layer ceiling.</strong> For most SaaS systems, performance limits live at one of three layers: application (thread pool saturation, connection limits), database (query time, connection pool, IOPS), or infrastructure (CPU, memory, network). Testing should identify which layer hits the ceiling first and at what load level.</p><p><strong>Profile before you optimize.</strong> Flame graphs and slow query logs are more useful than assumptions. The fix for a performance problem is almost never where you expect it. 
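In Python, for instance, a first profiling pass needs nothing beyond the standard library; the handler below is a stand-in for a real code path:

```python
import cProfile
import io
import pstats

def handle_checkout():
    # Placeholder for a real request handler; illustrative only.
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
handle_checkout()
profiler.disable()

# Top functions by cumulative time: this is where to look before optimizing.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

Flame graph tooling goes further, but even this output usually points somewhere other than where intuition did.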
Teams that instrument first and optimize second move faster and spend less.</p><div><hr></div><h2>Red flags that lengthen negotiations</h2><ul><li><p>Load tests run against staging with data volumes far below production</p></li><li><p>No documented saturation points or capacity limits for any service</p></li><li><p>Tests that model machine-speed requests with no think time or session behavior</p></li><li><p>Performance tracked only at P50; P99 numbers unknown or not measured</p></li><li><p>No correlation between historical load test results and production incident history</p></li><li><p>Capacity planning based on &#8220;we&#8217;ll add more instances&#8221; with no model behind it</p></li></ul><p>Two or more of these typically produce a remediation condition or a discount. Three or more, and buyers start asking whether the system can support the growth plan they just paid for.</p><div><hr></div><h2>Mini-Glossary</h2><ul><li><p><strong>Think time:</strong> The pause between user actions in a session; critical for realistic load modeling.</p></li><li><p><strong>P95 / P99 latency:</strong> Response time at the 95th or 99th percentile; a better signal of user experience than averages.</p></li><li><p><strong>Throughput:</strong> Requests or transactions processed per unit of time.</p></li><li><p><strong>Saturation point:</strong> The load level at which a resource (connection pool, CPU, thread pool, etc.) becomes the binding constraint.</p></li><li><p><strong>Connection pool:</strong> A cache of reusable database connections; saturation causes queuing and latency spikes.</p></li><li><p><strong>Capacity model:</strong> A projection of when current infrastructure limits will be reached, given growth assumptions.</p></li></ul><div><hr></div><h2>Your turn</h2><p>What performance problem hit you at the worst possible moment? 
A connection pool that saturated during a product launch, a database that crawled under realistic load, or a test that looked fine until someone asked the right question? Share the scar. It helps the next team.</p><p><strong>Founders and CTOs:</strong> Need a performance engineering assessment or a realistic load testing plan before your next raise? Let&#8217;s talk.</p><p><strong>Investors:</strong> Want a second opinion on load test methodology and capacity headroom in a target company? Let&#8217;s talk.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><p><strong>Next in the Playbook:</strong> Edition 26 will explore People Risk and Succession Planning. How talent retention and succession depth show up in diligence, and what investors read between the lines of an org chart. 
Stay tuned!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/p/performance-engineering-and-load?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/p/performance-engineering-and-load?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p>]]></content:encoded></item><item><title><![CDATA[Observability and Monitoring Excellence]]></title><description><![CDATA[Edition 24]]></description><link>https://eitanschuler.substack.com/p/observability-and-monitoring-excellence</link><guid isPermaLink="false">https://eitanschuler.substack.com/p/observability-and-monitoring-excellence</guid><dc:creator><![CDATA[Eitan Schuler]]></dc:creator><pubDate>Tue, 07 Apr 2026 10:10:39 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/d250e96a-fa33-49b3-9670-2632fc9c04de_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/subscribe?"><span>Subscribe now</span></a></p><p><strong>When the dashboard looked fine and the customer was already churning</strong></p><p>Six weeks before a Series B close, the technical advisor asked a simple question: &#8220;Walk me through how you knew the payment service was degraded last Tuesday.&#8221; The CTO pulled up the infrastructure dashboard. Green across the board. Uptime: 99.97%. Then the advisor opened a support ticket from that same Tuesday. 
A customer in Germany had been unable to complete checkout for 40 minutes. P95 latency had spiked to 14 seconds. No alert had fired. The monitoring stack was measuring the wrong things with the wrong thresholds in the wrong places. The deal closed, but with a holdback and a mandatory observability sprint that consumed two senior engineers for a quarter.</p><p>Observability is not about having dashboards. It is about asking questions of your system and getting answers before your customers do.</p><p><strong>Why this matters</strong></p><p>Investors do not buy uptime numbers in isolation. They buy the confidence that when something goes wrong, the team will know fast, know precisely what broke, contain the blast radius, and learn from it. They ask: How fast do you detect that a customer is affected, not just that a server is down? Can you trace a single failing request across your entire stack in under five minutes? When an alert fires, does it route to someone with context to act? Can you answer questions about production behavior that nobody anticipated at design time?</p><p>Get observability right and incidents shrink, on-call becomes humane, and every post-mortem produces a real answer.
Get it wrong and you discover problems from customer complaints, lose hours in log archaeology during incidents, and price discounts get written into term sheets.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/p/observability-and-monitoring-excellence?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/p/observability-and-monitoring-excellence?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><p><strong>What investors look for</strong></p><p><strong>The three pillars (metrics, logs, and traces) connected, not separate. </strong>A latency spike in a metric should lead to the slow traces causing it. Those traces should point to the service and log lines that explain why. The three pillars should be linked with some kind of correlation ID tying them together. Investors or their Tech Due Diligence contractors check by asking: &#8220;Show me the last incident, walk me through how you went from alert to root cause.&#8221;</p><p><strong>Signals over noise. </strong>Mature observability means alerting on customer impact, not system internals. CPU at 80% is not an alert. P99 latency for checkout exceeding your SLO threshold is. Alert fatigue is a proxy for everything else: investors may sample on-call calendars and ask what percentage of pages required human action. Below 50% signals a noise problem.</p><p><strong>Structured logs as a queryable record. </strong>Logs that are plain text and not structured JSON are a red flag at Series B and above. Ops/Dev should be able to query production logs by tenant ID, request ID, and service name in under 30 seconds.</p><p><strong>Distributed tracing with meaningful coverage. 
</strong>A trace that covers your API gateway and main service but stops at the database call, external API, or async queue tells you half the story. Investors are interested in trace coverage across service boundaries. Do traces include database query time, external API latency, and queue processing lag?</p><p><strong>SLOs as the north star. </strong>Service Level Objectives shift observability from reactive to proactive. Instead of alerting when something breaks, you alert when the rate of failure consumes your error budget faster than expected. Investors increasingly expect SLOs defined and measured at the service level, not just a global uptime number.</p><p><strong>Stage and stake: how the lens sharpens</strong></p><p><strong>Seed / early A: </strong>Basic application metrics and centralized logging are sufficient. Alerts should exist for Sev-1 conditions. The key signal is not sophistication but awareness: can the founders explain what they cannot currently see and why that is acceptable at this stage?</p><p><strong>Series B / Growth: </strong>Structured logging, distributed tracing across critical paths, SLOs per revenue-critical service, and alert routing to the right team with runbooks. Dashboards should be organized by customer journey, not by infrastructure component.</p><p><strong>Control buy-outs: </strong>Buyers will ask for the last six months of incident data correlated with observability artifacts. They check whether alerts predicted incidents or lagged behind them, whether runbooks point to dashboards that actually answer the questions, and whether per-tenant observability exists.</p><p><strong>Patterns and practices worth adopting</strong></p><p><strong>USE and RED as the baseline. </strong>For every resource, track Utilization, Saturation, and Errors. For every service, track Rate, Errors, and Duration. These two frameworks cover 90% of what you need to detect and diagnose production problems.</p><p><strong>Correlation IDs across every hop. 
</strong>Generate a request ID at the edge and carry it through every service call, queue message, async job, and log line. This converts log archaeology into a single query. Instrument it early. Retrofitting across a mature microservices stack is expensive and never quite complete.</p><p><strong>Synthetic monitoring for critical journeys. </strong>Running a synthetic transaction through login, checkout, and data retrieval every few minutes from multiple regions means you detect degradations before any real user encounters them.</p><p><strong>Alerting that routes to context. </strong>An alert that links to the relevant dashboard, runbook, and on-call contact cuts mean time to acknowledge in half.</p><p><strong>For your top five revenue-critical journeys:</strong></p><ul><li><p><strong>Define SLOs </strong>and track error budget burn weekly. Actual SLOs measured against real request outcomes, not annual uptime percentages.</p></li></ul><ul><li><p><strong>Build a dashboard </strong>showing login, onboarding, the core value action, and billing as separate panels with their own SLO indicators.</p></li></ul><p><strong>Instrument your top ten external dependencies </strong>with timeout tracking, error rate monitoring, and latency percentiles.</p><p><strong>Run a monthly alert hygiene review. </strong>Remove or adjust any alert that fired more than ten times last month without requiring human action.</p><p><strong>Track and publish on-call load per engineer. </strong>Pages per week, after-hours pages, time to resolve. This kind of visibility creates pressure to fix noisy alerts.</p><p><strong>Red flags that lengthen negotiations</strong></p><ul><li><p>Dashboards organized by infrastructure layer with no customer journey view. The team sees machine health, but not user experience.</p></li></ul><ul><li><p>No distributed tracing. 
Cross-service incidents require manual log correlation with no shared request identifier.</p></li></ul><ul><li><p>Unstructured plain text logs.</p></li></ul><ul><li><p>SLOs defined on paper but not measured. Error budgets that nobody checks.</p></li></ul><ul><li><p>On-call load concentrated on one or two engineers. Investors read this as key-person risk.</p></li></ul><ul><li><p>Observability gaps at external boundaries: third-party APIs, payment processors, messaging queues.</p></li></ul><p>Two or more of these typically produce a remediation condition or a discount.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/subscribe?"><span>Subscribe now</span></a></p><p><strong>Mini-Glossary</strong></p><ul><li><p><strong>RED method: </strong>Rate, Errors, Duration. The three metrics every service should expose as a baseline (See https://grafana.com/blog/the-red-method-how-to-instrument-your-services/).</p></li></ul><ul><li><p><strong>USE method: </strong>Utilization, Saturation, Errors: the three metrics every resource that can become a bottleneck should expose.</p></li></ul><ul><li><p><strong>SLO (Service Level Objective): </strong>An internal target for service reliability, expressed as a percentage of successful requests over a time window.</p></li></ul><ul><li><p><strong>Error budget: </strong>The allowed amount of unreliability derived from an SLO. 
Burn it too fast and feature releases pause until reliability is restored.</p></li></ul><ul><li><p><strong>Distributed tracing: </strong>Recording the path of a single request across multiple services, with timing data for each hop.</p></li></ul><ul><li><p><strong>Correlation ID: </strong>A unique identifier attached to a request at its origin and passed through every system it touches.</p></li></ul><ul><li><p><strong>Synthetic monitoring: </strong>Scripted transactions that simulate user behavior on a schedule and alert when they fail or exceed latency thresholds.</p></li></ul><ul><li><p><strong>Alert fatigue: </strong>The state where engineers ignore alerts because too many fire on conditions that do not require action.</p><div><hr></div></li></ul><p><strong>Your turn</strong></p><p>What observability gap hit you hardest? An incident your monitoring missed entirely, a customer who knew before you did, or a war room that lasted hours because you could not correlate logs across services? Share the scar. It helps the next team avoid it.</p><p><strong>Founders and CTOs: </strong>Need an observability maturity assessment or a practical roadmap to SLOs before your next raise? Let&#8217;s talk.</p><p><strong>Investors: </strong>Want a second opinion on monitoring depth and incident detection in a target company? 
Let&#8217;s talk.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/subscribe?"><span>Subscribe now</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Tech Due Diligence Playbook! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><p><strong>Next in the Playbook: </strong>Edition 25 will explore Performance Engineering &amp; Load Testing. 
Stay tuned!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/p/observability-and-monitoring-excellence?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/p/observability-and-monitoring-excellence?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p>]]></content:encoded></item><item><title><![CDATA[Multi-Tenant Architecture Assessment]]></title><description><![CDATA[Edition 23]]></description><link>https://eitanschuler.substack.com/p/multi-tenant-architecture-assessment</link><guid isPermaLink="false">https://eitanschuler.substack.com/p/multi-tenant-architecture-assessment</guid><dc:creator><![CDATA[Eitan Schuler]]></dc:creator><pubDate>Thu, 02 Apr 2026 09:57:20 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/320b72b6-d9a2-4885-b40f-fd9f9e15182a_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>When one customer&#8217;s bad day became everyone&#8217;s bad day</strong></p><p>We had a scheduled maintenance window. Routine index rebuild on the shared database. Halfway through, query times spiked across the board. Every customer, every tenant, every API call slowing to a crawl. Support tickets started flooding in from customers who had nothing to do with the database table we were touching. One tenant&#8217;s maintenance window had become everyone&#8217;s outage.</p><p>We patched it. We apologized. We promised better blast radius controls. 
Six months later, during diligence for a growth round, the lead investor&#8217;s technical advisor asked one simple question: &#8220;If your largest customer runs a heavy batch job tonight, what happens to your smallest customer&#8217;s response times?&#8221; We had a much better answer by then. But the scar is real.</p><p>Multi-tenancy is not a deployment detail. It is the architectural commitment that determines whether your SaaS scales gracefully or starts bleeding customers as it grows.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/subscribe?"><span>Subscribe now</span></a></p><p></p><p><strong>Why this matters</strong></p><p>Investors buying into a SaaS bet on one thing above almost everything else: that you can <strong>add customers without linearly adding cost and risk</strong>. Multi-tenant architecture is where that bet is won or lost.</p><p>The questions they ask are blunt:</p><ul><li><p>Can one tenant&#8217;s load, bad data, or misbehavior affect another tenant?</p></li></ul><ul><li><p>Can you prove where each customer&#8217;s data lives and who can access it?</p></li></ul><ul><li><p>Can you onboard a customer 10x your current largest without a rewrite?</p></li></ul><ul><li><p>Can you offer enterprise isolation without building a separate product?</p></li></ul><p>Get multi-tenancy right and you widen your market, win regulated enterprise deals, and improve margins at scale. Get it wrong and you face a choice between expensive re-architecture and a ceiling on the deals you can close.</p><p><strong>What investors look for</strong></p><p><strong>Isolation model that matches your market</strong></p><p>There is no single correct isolation model. 
There are three common approaches, each with real trade-offs. <strong>Shared everything</strong> (one database, one schema, tenant ID as a filter column) is fast and cheap to operate, but blast radius is wide and enterprise customers often reject it on compliance grounds. <strong>Shared infrastructure, separate schemas or databases</strong> gives stronger isolation with reasonable operational overhead and is often the sweet spot for growth-stage SaaS. <strong>Full silo deployment</strong> (separate infrastructure per tenant) maximizes isolation and customization but multiplies cost and operational complexity fast.</p><p>Investors want to see that you made a deliberate choice, documented the trade-offs, and have a path to offer stronger isolation to enterprise customers without rebuilding everything.</p><p><strong>Blast radius controls</strong></p><p>Noisy neighbor problems are architectural, not operational. Rate limiting, query timeouts, resource quotas, connection pool partitioning: these need to be designed in, not bolted on after the first enterprise customer saturates your shared database. Investors check whether you can demonstrate, with evidence, that one tenant cannot meaningfully degrade another.</p><p><strong>Data isolation that survives scrutiny</strong></p><p>Tenant ID filters are a logical boundary. Enterprise buyers want a physical boundary. The question during diligence is not just whether your query includes a &#8220;WHERE tenant_id&#8221; clause, but whether a bug or a misconfiguration could ever serve one tenant&#8217;s data to another. Row-level security, scoped API tokens, and per-tenant encryption keys all raise the bar. Cross-tenant data access in audit logs is a red flag. Its complete absence from logs is a worse one.</p><p><strong>Auth and access scoped per tenant</strong></p><p>Every API call should carry a tenant identity, and that identity should be enforced at the infrastructure layer, not just the application layer. 
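At the application layer, the minimum bar is a single chokepoint that every tenant-scoped query passes through, so the filter cannot be forgotten; database features like row-level security then enforce the same boundary one layer down. A minimal sketch (SQLite in memory for illustration; the table and tenant names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, tenant_id TEXT, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "tenant_a", 10.0), (2, "tenant_a", 20.0), (3, "tenant_b", 99.0)])

def orders_for(tenant_id: str) -> list:
    # Single chokepoint: callers cannot issue an unscoped query by accident.
    return conn.execute(
        "SELECT id, total FROM orders WHERE tenant_id = ? ORDER BY id", (tenant_id,)
    ).fetchall()

print(orders_for("tenant_a"))   # [(1, 10.0), (2, 20.0)] and never tenant_b rows
```

The chokepoint is evidence you can show in diligence; pairing it with row-level security turns a logical boundary into one a forgotten filter cannot cross.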
Short-lived tokens, tenant-scoped roles, and audit logs that answer &#8220;who accessed what, from which tenant, and when&#8221; move the conversation from <em>trust me</em> to <em>show me</em>.</p><p><strong>Operational model that scales</strong></p><p>Can you onboard a new tenant without engineering involvement? Can you apply a schema migration across 500 tenants without a 6-hour maintenance window? Can you run a restore for one tenant without touching another? These sound like operational questions, but they are also architecture questions. Investors scoring your multi-tenant maturity are really scoring whether your operations team can keep pace with your sales team.</p><p><strong>Stage and stake: how the lens sharpens</strong></p><p><strong>Seed / early A: </strong>A shared-database model with row-level isolation is acceptable if the data model is clean, tenant IDs are enforced consistently, and you can articulate what you would change for an enterprise customer. Show awareness, not perfection.</p><p><strong>Series B / Growth: </strong>Expect evidence of blast radius controls, documented isolation model, tenant-scoped auth, and a path to stronger isolation for regulated customers. If your enterprise pipeline includes financial services, healthcare, or public sector, the timeline on that path matters.</p><p><strong>Control buy-outs: </strong>Buyers sample actual queries for cross-tenant leakage, review audit logs for tenant access patterns, test restore procedures for single-tenant recovery, and model the cost to offer dedicated silo deployments to key customers.
They will price any re-architecture risk into the deal.</p><p><strong>Red flags that lengthen negotiations</strong></p><ul><li><p>Tenant isolation enforced only in application code with no database-layer controls</p></li></ul><ul><li><p>No rate limiting or resource quotas per tenant; noisy neighbor incidents in the incident history</p></li></ul><ul><li><p>Cross-tenant queries possible through admin endpoints with no audit trail</p></li></ul><ul><li><p>Schema migrations require full downtime affecting all tenants simultaneously</p></li></ul><ul><li><p>&#8220;We have never had a data leak between tenants&#8221; as the primary assurance, with no technical controls to back it</p></li></ul><ul><li><p>No single-tenant restore capability; recovery requires restoring the full shared database</p></li></ul><p>Two or more of these will trigger either a price adjustment or a post-close remediation plan. Three or more can stall a deal with enterprise-focused investors entirely.</p><p><strong>Habits worth adopting before the next round</strong></p><ul><li><p><strong>Map your isolation model explicitly. </strong>Write down which tier of isolation you offer, what the trade-offs are, and what a customer upgrade path to stronger isolation looks like. Keep it in the data room.</p></li></ul><ul><li><p><strong>Run a cross-tenant penetration test. </strong>Not a general pen test. A test specifically designed to probe whether tenant A can read tenant B&#8217;s data. Document findings and fixes.</p></li></ul><ul><li><p><strong>Instrument tenant-level metrics. </strong>Latency, error rates, and resource consumption per tenant. If you cannot show the impact of your largest tenant on your smallest, you cannot prove your blast radius controls work.</p></li></ul><ul><li><p><strong>Test single-tenant restore. </strong>Quarterly. Time it. 
If you cannot recover one tenant&#8217;s data without affecting others, that is an architectural gap, not just an operational one.</p></li></ul><ul><li><p><strong>Track the isolation ask in your sales pipeline. </strong>If enterprise prospects are asking for dedicated infrastructure and you are losing deals over it, that is product data, not just a sales conversation.</p></li></ul><p><strong>Mini-Glossary</strong></p><ul><li><p><strong>Noisy neighbor: </strong>A tenant consuming disproportionate shared resources and degrading performance for others.</p></li></ul><ul><li><p><strong>Row-level security (RLS): </strong>A database feature that enforces access control at the row level, filtering results automatically based on the session context.</p></li></ul><ul><li><p><strong>Silo deployment: </strong>Separate infrastructure per tenant; maximum isolation, maximum operational overhead.</p></li></ul><ul><li><p><strong>Blast radius: </strong>The scope of impact when something goes wrong; designing to limit blast radius means failures stay local.</p></li></ul><ul><li><p><strong>Tenant-scoped token: </strong>An auth token that carries tenant identity and limits access to that tenant&#8217;s resources only.</p><p></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/p/multi-tenant-architecture-assessment?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/p/multi-tenant-architecture-assessment?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><div><hr></div></li></ul><p><strong>Your turn</strong></p><p>Which multi-tenant problem bit you hardest: noisy neighbors, a near-miss on data isolation, or an enterprise deal lost because you could not offer dedicated infrastructure? 
Share the scar. It helps the next team.</p><p><strong>Founders: </strong>Need a multi-tenant architecture review or a gap assessment before your next enterprise deal? Let&#8217;s talk.</p><p><strong>Investors: </strong>Want a pre-deal assessment of isolation model, blast radius controls, and enterprise readiness in a target company? Let&#8217;s talk.</p><div><hr></div><p><strong>Next in the Playbook: </strong>In Edition 24 we&#8217;ll explore Observability &amp; Monitoring Excellence. Logging, metrics, tracing, and alerting strategies. Stay tuned!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/subscribe?"><span>Subscribe now</span></a></p>]]></content:encoded></item><item><title><![CDATA[Tech Governance for Hypergrowth]]></title><description><![CDATA[Edition 22]]></description><link>https://eitanschuler.substack.com/p/tech-governance-for-hypergrowth</link><guid isPermaLink="false">https://eitanschuler.substack.com/p/tech-governance-for-hypergrowth</guid><dc:creator><![CDATA[Eitan Schuler]]></dc:creator><pubDate>Wed, 25 Mar 2026 13:21:44 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/97663b73-3e99-489e-a680-7158da181d8d_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Originally published on <a href="https://www.linkedin.com/pulse/tech-governance-hypergrowth-eitan-schuler-58bdf/">LinkedIn</a> on February 25, 2026</p><h2>When the board asked who owns the risk register</h2><p>A portfolio company had just closed its Series B and was sprinting toward 3x ARR growth. Engineering had doubled in nine months. Then a regulator sent a routine questionnaire about data processing activities and incident reporting. 
The CEO forwarded it to the CTO, who forwarded it to a senior engineer who had joined three weeks earlier. Nobody was sure which policies existed, which were current, and which had been written for the Series A data room and never touched again. The response went out late, incomplete, and contradicted what sales had told two enterprise prospects. The fine was small. The pipeline damage was not.</p><p>Hypergrowth stress-tests everything, but governance breaks first because it depends on clarity, ownership, and consistency, exactly the things that erode when a company doubles every year. Tech governance is not bureaucracy bolted on after a scare. It is the connective tissue between board-level risk oversight and the daily decisions engineers make about security, data, change management, and resilience.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/subscribe?"><span>Subscribe now</span></a></p><h3>Why this matters</h3><p>Investors back hypergrowth expecting the company can scale operations as fast as revenue. Governance is how you ensure it. Without it, every new team, market, and regulation becomes a surprise. With it, decisions are faster because boundaries are clear, incidents are smaller because escalation paths exist, and enterprise deals close because you can answer the security questionnaire on time.</p><p>The pattern is predictable. A company grows from 20 to 80 engineers. Policies that lived in the CTO&#8217;s head now need to live in writing. Access reviews that happened informally now need a schedule and a log. Change management that was &#8220;we all sit in the same room&#8221; now spans four time zones. 
The companies that stumble are not the ones lacking ambition but the ones that treated governance as a post-IPO problem and discovered it was a Series B problem.</p><h3>What investors look for</h3><p><strong>A living policy framework, not a document graveyard. </strong>A small set of policies people actually follow: information security, change management, incident response, data classification, and access control. Each with an owner, a review date, and evidence of acknowledgment. A 60-page security policy last revised in 2022 is worse than a 2-page one updated last quarter.</p><p><strong>Risk governance with board visibility. </strong>A technology risk register (Edition 1) reviewed quarterly, with trends visible to the board. Not a formal risk committee at Series B, but a standing agenda item where someone presents the top risks, what changed, and what is being done. Cover security, compliance, resilience, key-person dependencies (Edition 11), vendor concentration (Edition 9), and technical debt (Edition 4).</p><p><strong>Clear accountability without bottlenecks. </strong>Distributed ownership: a security lead who owns posture, an engineering manager who owns change management, a data lead who owns classification. The CTO orchestrates but does not hold every thread. If the CTO is the only person who can approve a production change, approve a vendor, and respond to a regulator, the company has a governance bottleneck dressed up as leadership.</p><p><strong>Change management that scales. </strong>Classified changes (standard, normal, emergency), approval workflows that do not slow low-risk deploys but enforce review for high-risk ones, and an audit trail connecting each change to a ticket, a review, and a deployment record. This is evidence the company can explain what changed, when, why, and who approved it.</p><p><strong>Cyber-insurance as a governance signal. 
</strong>The underwriting process forces discipline: MFA everywhere, tested backups, incident response plan, access reviews. Companies that cannot get insured are telling investors something about their security posture without saying a word. A clean policy at a reasonable premium is quiet evidence that an independent third party reviewed your controls and found them adequate.</p><p><strong>Compliance that is operational, not aspirational. </strong>Automated access reviews, real-time policy enforcement in CI/CD (Edition 16), log retention matching stated policy, and a process to handle data subject requests on time. If your SOC 2 report says you review access quarterly but the last review was seven months ago, diligence will find the gap.</p><h3>Stage and stake: how the lens sharpens</h3><p><strong>Seed and early Series A: </strong>Formal governance is minimal and that is fine. A basic risk register, MFA on critical systems, a written incident response plan, and founders who can say: &#8220;Here is what keeps me up at night and here is what we are doing about it.&#8221;</p><p><strong>Series B and growth: </strong>Written policies with owners and review cycles, risk register visible to the board, change management with an audit trail, scheduled access reviews, and either a completed SOC 2 Type II or a credible timeline. Cyber-insurance should be in place. If the company has 60 or more engineers and no security lead, that is a gap.</p><p><strong>Control buy-outs: </strong>Buyers probe governance as an operational system. They ask for recent board risk reports, sample access reviews, check offboarded employees against active directories, verify incident response was tested recently, and review change logs correlated with incident history. 
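</p><p>The offboarding check above is mechanical enough to automate. A minimal sketch, with account names invented for illustration:</p>

```python
def stale_accounts(iam_active: set, hr_offboarded: set) -> list:
    """Accounts still active in IAM after HR offboarding.

    Every hit is a finding: access that should have been revoked
    at offboarding and was not. Run this on a schedule, not only
    when a buyer asks.
    """
    return sorted(iam_active & hr_offboarded)

iam_active = {"ava", "ben", "chloe", "dan"}
hr_offboarded = {"ben", "dan", "emma"}
findings = stale_accounts(iam_active, hr_offboarded)  # ['ben', 'dan']
```

<p>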
Gaps are priced directly as remediation capex or a discount on enterprise value.</p><h3>Red flags that lengthen negotiations</h3><ul><li><p>Policies written for the last round with no owners, no review dates, no evidence anyone read them.</p></li><li><p>No technology risk reporting to the board. Revenue, pipeline, and burn are visible, but security posture and operational risk are not.</p></li><li><p>The CTO is the single approval point for production access, vendor selection, incident escalation, and regulatory response.</p></li><li><p>Several engineers deploying multiple times a day with no change classification, no audit trail, and no way to correlate a change to an incident.</p></li><li><p>No cyber-insurance, or a policy with exclusions so broad it would not cover a ransomware event.</p></li><li><p>Offboarded employees still appearing as active in IAM. Access reviews not performed on schedule.</p></li></ul><p>Two or three of these are fixable post-close. Four or more usually trigger price protection, escrow, or a pause.</p><h3>Habits worth adopting before the next round</h3><ul><li><p><strong>Stand up a quarterly risk review with board visibility. </strong>Even 15 minutes as a recurring board agenda item transforms governance from back-office function to strategic conversation.</p></li><li><p><strong>Assign policy owners and enforce review cycles. </strong>Every policy gets a name, not a team. Review annually at minimum and after any major incident.</p></li><li><p><strong>Classify changes and match process to risk. </strong>Standard changes flow through CI/CD. Normal changes require peer review. Emergency changes get a fast path with mandatory post-change review within 48 hours.</p></li><li><p><strong>Get cyber-insurance and use the underwriting questionnaire as a free gap assessment. </strong>Fix what the underwriter flags. Review coverage annually.</p></li><li><p><strong>Run access reviews on a real schedule. 
</strong>Quarterly for production and admin access. Correlate with HR offboarding to catch stale accounts.</p></li><li><p><strong>Build a governance dashboard, not a document library. </strong>Track policy status, access reviews, open risk items, and compliance health in one place. If governance lives only in PDFs, it is already stale.</p></li></ul><h3>Mini-Glossary</h3><ul><li><p><strong>Tech governance: </strong>The framework of policies, roles, and oversight ensuring technology decisions align with business objectives and risk appetite.</p></li><li><p><strong>Change management: </strong>Classifying, approving, recording, and reviewing changes to production systems. Balances speed with safety.</p></li><li><p><strong>Cyber-insurance: </strong>Insurance covering losses from cyber incidents. The underwriting process itself is a governance checkpoint.</p></li><li><p><strong>Access review: </strong>Periodic verification that users have only the access they need. Catches stale permissions and privilege creep.</p></li><li><p><strong>Policy acknowledgment: </strong>Documented evidence that employees have read and accepted a policy. Without it, the policy is aspirational.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div></li></ul><h3>Your turn</h3><p>Where did governance break during your hypergrowth phase? A policy nobody followed, a board that never saw tech risk, or a cyber-insurance application that exposed gaps you did not know you had? Share the scar. It helps the next team.</p><p><strong>Founders/CTOs: </strong>Need a governance health check before the next board meeting or term sheet? 
<strong>Let&#8217;s talk.</strong></p><p><strong>Investors: </strong>Want a pre-deal assessment of governance maturity and board-level risk oversight? <strong>Let&#8217;s talk.</strong></p><p><strong>Next in the Playbook: </strong>Edition 23 will explore Multi-Tenant Architecture Assessment. SaaS scalability, data isolation, and customer onboarding automation.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/p/tech-governance-for-hypergrowth?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/p/tech-governance-for-hypergrowth?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><p><strong>Stay tuned!</strong></p>]]></content:encoded></item><item><title><![CDATA[Infrastructure as Code & GitOps]]></title><description><![CDATA[Edition 21]]></description><link>https://eitanschuler.substack.com/p/infrastructure-as-code-and-gitops</link><guid isPermaLink="false">https://eitanschuler.substack.com/p/infrastructure-as-code-and-gitops</guid><dc:creator><![CDATA[Eitan Schuler]]></dc:creator><pubDate>Tue, 17 Feb 2026 13:29:06 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/02601712-9976-4c7c-a9d1-0ad977fd63d0_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Originally published on <a href="https://www.linkedin.com/pulse/infrastructure-code-gitops-eitan-schuler-jbhwf/">LinkedIn</a> on February 11, 2026.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" 
href="https://eitanschuler.substack.com/subscribe?"><span>Subscribe now</span></a></p><h2>When the repo said one thing and production showed another</h2><p>During a pre-close Disaster Recovery (DR) drill, the acquirer asked a simple question: <em>&#8220;Can you rebuild this environment from scratch?&#8221;</em> The team pointed to the Terraform repo with confidence. Three hours later, the apply failed on 47 resources that had been modified in the console &#8220;temporarily&#8221; over the past year. Security groups, IAM policies, and a critical load balancer config existed only in production, nowhere in code. The deal closed, but with an escrow tied to a 90-day &#8220;codify reality&#8221; sprint that consumed two senior engineers who should have been shipping features.</p><p>Infrastructure as Code (IaC) isn&#8217;t about tooling preferences. It&#8217;s about whether your infrastructure knowledge is durable, auditable, and recoverable without heroes.</p><h2>Why this matters</h2><p>Investors care about IaC and GitOps because they answer a question that matters more than uptime dashboards: <em>&#8220;What happens when things go wrong and your best people aren&#8217;t available?&#8221;</em></p><p>Done right, DR becomes provable, not aspirational. Change management has an audit trail. Knowledge transfers with the codebase, not the staff. Environment parity reduces &#8220;works on staging&#8221; surprises. Compliance frameworks (SOC 2, ISO 27001, DORA) require change control evidence that console clicking cannot provide.</p><p>Get IaC right and you can rebuild, audit, and hand over infrastructure without heroics. Get it wrong and every incident, every DR drill, and every acquisition exposes how much tribal knowledge holds the system together.</p><h2>What investors look for</h2><p><strong>Coverage that matches reality. </strong>IaC should describe what actually runs, not what was intended 18 months ago. Drift detection shows the gap between declared and actual state. 
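</p><p>In miniature, drift detection is a three-way diff between what the code declares and what the cloud reports. The sketch below is illustrative only, with invented resource names:</p>

```python
def detect_drift(declared: dict, actual: dict) -> dict:
    """Three-way diff between IaC-declared state and live state."""
    return {
        # declared in code but absent from the cloud
        "missing": sorted(set(declared) - set(actual)),
        # running in the cloud but absent from code ("console clicks")
        "unmanaged": sorted(set(actual) - set(declared)),
        # present in both, but attributes were changed by hand
        "changed": sorted(r for r in set(declared) & set(actual)
                          if declared[r] != actual[r]),
    }

declared = {"sg-web": {"port": 443}, "lb-main": {"idle_timeout": 60}}
actual = {"sg-web": {"port": 443}, "lb-main": {"idle_timeout": 300},
          "sg-temp": {"port": 22}}
report = detect_drift(declared, actual)
# {'missing': [], 'unmanaged': ['sg-temp'], 'changed': ['lb-main']}
```

<p>Tools like terraform plan compute this diff for real resources; the point is that each bucket demands a decision: codify it, delete it, or document the exception. </p><p>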
If you can&#8217;t answer &#8220;What percentage of production infrastructure is codified?&#8221; with confidence, neither can a buyer.</p><p><strong>Single source of truth. </strong>Infrastructure definitions live in version control with review history. Changes go through pull requests (PRs), not console clicks. &#8220;Who changed this and when?&#8221; should be a git log query, not a forensic investigation.</p><p><strong>State management done right. </strong>For tools like Terraform or OpenTofu, remote state with locking is non-negotiable. Local state files or state committed to git without locking invites trouble. State corruption becomes a business continuity event.</p><p><strong>Environment parity. </strong>Dev, staging, and prod created from the same templates with parameterized differences. If you can&#8217;t spin up a new environment from code relatively fast, the code isn&#8217;t the source of truth.</p><p><strong>Secrets handled properly. </strong>No credentials in code. Use a secrets manager (Vault, AWS Secrets Manager, or equivalent) with rotation policies. IaC references secrets, doesn&#8217;t contain them. CI/CD logs don&#8217;t leak sensitive values.</p><p><strong>GitOps as operational model. </strong>For Kubernetes environments, git should be the single source of truth for what runs in the cluster. A controller inside the cluster (typically Argo CD or Flux) watches the repo and continuously reconciles reality to match. This pull-based approach means your CI/CD pipeline doesn&#8217;t need cluster credentials, which reduces the blast radius if CI is compromised. Manual kubectl changes get detected and reverted automatically.</p><p><strong>Rollback capability. </strong>Can you revert infrastructure to yesterday&#8217;s state? Git history plus state management makes this predictable, not improvised. 
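</p><p>The reconciliation loop described above, a controller converging the cluster toward git, reduces in caricature to a few lines. This is a toy sketch with invented object names, not what Argo CD or Flux actually runs:</p>

```python
def reconcile(desired: dict, cluster: dict) -> dict:
    """One pass of a GitOps-style reconciliation loop (caricature).

    `desired` is what git declares; `cluster` is live state. The
    controller converges live state toward git: it creates what is
    missing, corrects manual edits, and prunes what git no longer
    declares, so ad-hoc kubectl changes do not survive.
    """
    for name, spec in desired.items():
        if cluster.get(name) != spec:
            cluster[name] = dict(spec)   # create or correct
    for name in list(cluster):
        if name not in desired:
            del cluster[name]            # prune unmanaged objects
    return cluster

desired = {"web": {"replicas": 3}, "worker": {"replicas": 2}}
live = {"web": {"replicas": 5}, "debug-pod": {"replicas": 1}}
reconcile(desired, live)  # live now matches desired
```

<p>Because the loop runs continuously against git, git history doubles as the rollback mechanism: reverting a commit reverts the cluster. </p><p>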
If rollback is &#8220;we&#8217;ll figure it out,&#8221; expect diligence to slow down.</p><h2>Stage and stake: how the lens sharpens</h2><p><strong>Seed / early A: </strong>Basic IaC for core infrastructure is acceptable. Some manual elements are fine if documented. Founders/CTOs should show they understand why codifying matters and have a roadmap. The key signal: can they explain what&#8217;s in code and what isn&#8217;t, and why?</p><p><strong>Series B / Growth: </strong>Full coverage expected for production infrastructure. CI/CD pipelines apply changes. Drift detection runs automatically. Staging and prod share templates. Secrets management is solved. GitOps patterns for Kubernetes if applicable.</p><p><strong>Control buy-outs: </strong>Buyers might run terraform plan against live state and check for drift. They&#8217;ll ask for the last 6 months of infrastructure PRs and check for incident correlation. They expect environment rebuild time measured in minutes, not days. They check for evidence that DR actually works from code.</p><h2>Patterns that work (and why)</h2><p><strong>Declarative over imperative. </strong>Define what should exist, not how to create it. Tools reconcile reality to match the declaration. This makes code readable, auditable, and diffable.</p><p><strong>Everything in modules. </strong>Reusable, versioned modules for common patterns (VPC, database, compute cluster, etc.). Teams use these building blocks from a library rather than copy-pasting. Changes propagate cleanly.</p><p><strong>Policy as code. </strong>Open Policy Agent, Sentinel, or similar. &#8220;No public S3 buckets&#8221; is a rule that fails PRs. Security guardrails enforced before apply, not discovered in an audit.</p><p><strong>Drift detection as routine. </strong>Scheduled jobs compare declared state to actual state. Alerts are triggered on drifts before they become incidents. 
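</p><p>The policy-as-code pattern described above is equally mechanical. A toy version of the &#8220;no public S3 buckets&#8221; rule follows; the resource shapes are invented and this is plain Python, not real OPA or Sentinel syntax:</p>

```python
def check_no_public_buckets(plan: list) -> list:
    """Toy policy check run against a planned change before apply.

    `plan` mimics the resource list a policy engine evaluates;
    a non-empty return fails the PR.
    """
    return [r["name"] for r in plan
            if r["type"] == "s3_bucket" and r.get("acl") == "public-read"]

plan = [
    {"type": "s3_bucket", "name": "assets", "acl": "private"},
    {"type": "s3_bucket", "name": "logs", "acl": "public-read"},
]
violations = check_no_public_buckets(plan)  # ['logs']
```

<p>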
Treat drifts like failing tests: fix them or document why they&#8217;re intentional.</p><p><strong>Immutable infrastructure.</strong> Instead of patching running servers, build a new image and replace them entirely. Servers that live for months accumulate mystery: packages installed during incidents, config tweaks nobody documented. Fresh instances from a known image eliminate that drift. The same applies to containers.</p><p><strong>GitOps reconciliation loops. </strong>For Kubernetes: git is the source of truth. Controllers pull desired state and converge. No kubectl apply from laptops. Deployments are auditable, reversible, and don&#8217;t depend on CI credentials with cluster access.</p><h2>Red flags that lengthen negotiations</h2><ul><li><p>IaC exists but doesn&#8217;t match production</p></li><li><p>State files stored locally or committed to git with no locking</p></li><li><p>No drift detection; unknown gap between code and reality</p></li><li><p>Secrets in code, environment files, or CI logs</p></li><li><p>Console access for &#8220;quick fixes&#8221; is routine, not exceptional</p></li><li><p>Single person who &#8220;knows the infrastructure&#8221; and does most changes manually</p></li><li><p>Cannot rebuild an environment from code without significant manual steps</p></li><li><p>GitOps claimed but deployments still push from CI with cluster admin credentials</p></li></ul><p>Two or more of these typically trigger remediation holdbacks or post-close 100-day plans.</p><h2>Habits worth adopting before the next round</h2><ul><li><p><strong>Run drift detection weekly </strong>and treat findings like bugs: track, triage, fix or document exceptions.</p></li><li><p><strong>Enforce PR review for all infrastructure changes. </strong>&#8220;I&#8217;ll fix it in the console&#8221; becomes &#8220;I&#8217;ll open a PR.&#8221;</p></li><li><p><strong>Practice environment rebuild quarterly. </strong>Time it. 
If it takes longer than you&#8217;d expect or requires improvisation, the code isn&#8217;t complete.</p></li><li><p><strong>Tag every resource with ownership. </strong>&#8220;Who created this?&#8221; should be answerable from code and cloud tags, not collective memory.</p></li><li><p><strong>Document what&#8217;s intentionally not in IaC. </strong>Some things (like certain legacy resources during migration) may be exempt, but make it explicit.</p></li><li><p><strong>Treat infrastructure changes like application deployments: </strong>review, merge, apply through pipeline, verify.</p></li></ul><h2>Mini-Glossary</h2><ul><li><p><strong>IaC (Infrastructure as Code): </strong>Defining infrastructure in version-controlled, declarative files rather than manual configuration.</p></li><li><p><strong>DR drill</strong> (Disaster Recovery drill): A planned exercise that tests whether you can restore systems after a failure, using your documented procedures rather than improvisation.</p></li><li><p><strong>GitOps: </strong>Operational model where git is the source of truth for infrastructure and application state; controllers reconcile reality to match.</p></li><li><p><strong>Drift: </strong>Gap between declared infrastructure state and actual running state; accumulates through manual changes.</p></li><li><p><strong>State file: </strong>In Terraform/OpenTofu, the record of what was created; used to plan changes and detect drift.</p></li><li><p><strong>Immutable infrastructure: </strong>Pattern of replacing servers rather than patching them; prevents configuration drift.</p></li><li><p><strong>Reconciliation loop: </strong>Controller that continuously compares desired state to actual state and corrects differences.</p><p></p><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/p/infrastructure-as-code-and-gitops?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/p/infrastructure-as-code-and-gitops?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></li></ul><h2>Your turn</h2><p>What infrastructure surprise hit you hardest in a deal or an outage? Console changes that weren&#8217;t in code, state file corruption, or a rebuild that took days instead of hours? Share the scar; it helps the next team.</p><p><strong>Founders/CTOs: </strong>Need an IaC coverage assessment or help closing the gap between code and reality? <strong>Let&#8217;s talk.</strong></p><p><strong>Investors: </strong>Want a pre-deal review of infrastructure maturity and DR readiness? <strong>Let&#8217;s talk.</strong></p><p><strong>Next in the Playbook: </strong>Edition 22 will explore Tech Governance for Hypergrowth. 
Board-level oversight, policies, and how cyber-insurance fits into the diligence picture.</p><p><strong>Stay tuned!</strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/subscribe?"><span>Subscribe now</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[Product-Tech Fit]]></title><description><![CDATA[Edition 20]]></description><link>https://eitanschuler.substack.com/p/product-tech-fit</link><guid isPermaLink="false">https://eitanschuler.substack.com/p/product-tech-fit</guid><dc:creator><![CDATA[Eitan Schuler]]></dc:creator><pubDate>Tue, 27 Jan 2026 06:14:44 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/b2b32fda-07d2-4466-a95d-6758bd7f9d06_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Originally published on <a href="https://www.linkedin.com/pulse/product-tech-fit-eitan-schuler-pnnuf/">LinkedIn</a> on January 21, 2026</p><p>A Series B candidate had impressive metrics: 40% year-over-year growth, solid retention, and a clean codebase. Then the investor asked a simple question: &#8220;How long would it take to add per-seat pricing?&#8221; The CTO&#8217;s answer was sobering: six months minimum. The billing system assumed flat subscriptions. The usage tracking lived in a different service with no reliable link to accounts. The data model had baked in assumptions from 2019 that no longer matched the product direction. The architecture wasn&#8217;t bad. It just wasn&#8217;t built for where the business needed to go. The deal closed, but with a remediation plan that consumed two quarters of engineering capacity.</p><p>Product-tech fit isn&#8217;t about having the &#8220;best&#8221; architecture. 
It&#8217;s about having an architecture that accelerates the product roadmap instead of fighting it.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/subscribe?"><span>Subscribe now</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Tech Due Diligence Playbook! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>Why this matters</h2><p>Investors don&#8217;t just buy what exists today. They buy the next three years of product evolution. If the architecture makes obvious next moves expensive, they price in the engineering quarters required to fix it. Worse, they worry about opportunity cost: every sprint spent on structural rework is a sprint not spent on features that drive revenue.</p><p>The math compounds. A team that ships three experiments per quarter will learn faster than one that ships one. If architecture is the bottleneck, competitors with better product-tech fit will outpace you regardless of how clean your code looks. 
Investors have seen this movie before.</p><h2>What investors look for</h2><p><strong>Architecture that matches product velocity needs. </strong>A team shipping daily to test pricing experiments needs different infrastructure than one releasing quarterly to enterprise customers. Investors ask whether the architecture supports how the product actually evolves. Feature flags, A/B testing hooks, and modular boundaries that allow independent changes signal alignment. Monoliths where every change requires a full regression signal friction.</p><p><strong>Data models that allow product pivots. </strong>Early schema decisions often encode business assumptions that become constraints. Can you add a new customer segment without a migration? Can you change your pricing model without rewriting billing? Investors probe whether the data layer supports where the product is going, not just where it&#8217;s been.</p><p><strong>Platform investments with measurable payback. </strong>Internal platforms can accelerate delivery or become expensive distractions. Investors want to see that platform work ties to product outcomes: &#8220;We built this because it cut feature delivery time by X&#8221; beats &#8220;We built this because it&#8217;s the right way.&#8221; If the platform team is three sprints ahead of anyone using their work, something is off.</p><p><strong>Technical decisions with product context. </strong>Show that architecture choices connect to business goals, not just engineering aesthetics. &#8220;We chose this database because our product requires sub-10ms reads on user profiles&#8221; is compelling. &#8220;We chose it because it&#8217;s industry standard&#8221; is not.</p><p><strong>Appropriate investment for stage. </strong>Over-engineering is as risky as under-engineering. A seed-stage company with a Kubernetes cluster, service mesh, and event-driven architecture for 500 users has optimized for problems it doesn&#8217;t have. 
A Series B company still deploying from laptops has under-invested. The architecture should match the current reality while leaving room to grow.</p><h2>Stage and stake: how the lens sharpens</h2><p><strong>Seed / early A: </strong>Investors accept rough edges if the team shows product awareness. A modular monolith that&#8217;s easy to change beats a microservices sprawl that&#8217;s hard to operate. Show you can ship fast, measure what matters, and articulate which architectural bets you&#8217;d revisit at 10x scale.</p><p><strong>Series B / growth: </strong>Expect architecture that supports the stated growth plan. If the pitch says &#8220;expand to enterprise,&#8221; investors look for multi-tenancy, audit logging, and configurable workflows. If it says &#8220;launch in three new markets,&#8221; they look for localization hooks and regional deployment options. Mismatches between product narrative and technical reality raise questions.</p><p><strong>Control buy-outs: </strong>Buyers stress-test alignment hard. They map the roadmap to the architecture and ask: what breaks first? They price the engineering work needed to support planned product moves, and they discount for uncertainty. A clean system that can&#8217;t evolve is worth less than a messy one that can.</p><h2>Patterns that work (and why)</h2><p><strong>Product and engineering roadmaps that reference each other. </strong>When major features explicitly call out architectural prerequisites, and platform work explicitly ties to product outcomes, alignment is visible. If the two roadmaps live in separate documents that never mention each other, expect friction.</p><p><strong>Decision records that include product context. </strong>Architectural Decision Records (ADRs) that explain the business problem, not just the technical solution, prove that engineering thinks in product terms. 
&#8220;We chose X to support planned pricing flexibility&#8221; sounds better than &#8220;We chose X because it&#8217;s best practice.&#8221;</p><p><strong>Modular boundaries that match product boundaries. </strong>When the checkout flow is one deployable unit and the recommendation engine is another, changes stay local. When they&#8217;re tangled, every product experiment becomes a cross-cutting change. Investors check whether the system&#8217;s seams match where the product evolves fastest.</p><p><strong>Regular architecture-product sync. </strong>A quarterly review where product and engineering leadership explicitly ask &#8220;What&#8217;s blocking us? What should we build ahead of need?&#8221; surfaces misalignments before they become expensive.</p><h2>Red flags that slow or sink deals</h2><ul><li><p>Product roadmap features that require &#8220;significant refactoring&#8221; before they can start</p></li><li><p>Data models that encode assumptions the business has already outgrown</p></li><li><p>Platform investments with no clear product beneficiary or adoption metrics</p></li><li><p>Architecture diagrams that don&#8217;t map to how the product is sold or used</p></li><li><p>Technical decisions justified by trends rather than business requirements</p></li><li><p>Engineering estimates that routinely exceed product expectations because of structural constraints</p></li><li><p>No clear answer to &#8220;What would break if you 5x&#8217;d your largest customer?&#8221;</p></li></ul><p>Two or more of these typically trigger deeper diligence, price adjustments, or earn-outs tied to architectural remediation.</p><h2>Habits worth adopting before the next round</h2><ul><li><p><strong>Run a quarterly alignment review: </strong>Map the next four quarters of product roadmap to architectural prerequisites. Flag gaps early.</p></li><li><p><strong>Track &#8220;architecture tax&#8221; on features: </strong>When estimates include rework, measure it.
If 30% of sprint capacity constantly goes to working around the architecture, that&#8217;s a signal.</p></li><li><p><strong>Write ADRs with product context: </strong>Every significant technical decision should explain what product capability it enables or protects.</p></li><li><p><strong>Maintain a &#8220;flexibility budget&#8221;: </strong>Identify the three or four areas most likely to change (pricing, customer segments, integrations) and deliberately avoid over-optimizing them. An abstraction layer around your payment provider costs a little now but saves a quarter when you need to switch.</p></li><li><p><strong>Ask engineering: &#8220;What in the product roadmap can&#8217;t we build easily?&#8221;: </strong>The answer reveals where product-tech fit is weakest.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/subscribe?"><span>Subscribe now</span></a></p></li></ul><div><hr></div><h2>Mini-Glossary</h2><ul><li><p><strong>Product-tech fit: </strong>The degree to which architecture choices accelerate rather than constrain product evolution.</p></li><li><p><strong>Architecture tax: </strong>The extra effort required to ship features because of structural constraints in the system.</p></li><li><p><strong>Modular boundary: </strong>A seam in the system where changes can happen independently without rippling across other components.</p></li><li><p><strong>ADR (Architectural Decision Record): </strong>A document capturing each significant technical decision, its context, and its consequences.</p></li><li><p><strong>Over-engineering: </strong>Building for scale or complexity that doesn&#8217;t exist and may never arrive at the cost of current velocity.</p></li></ul><div><hr></div><h2>Your turn</h2><p>Where has 
product-tech misalignment bitten you hardest? A data model that couldn&#8217;t support a pricing change? A platform bet that never paid back? An architecture built for a product direction that pivoted? Share the scar; it helps the next team.</p><p><strong>Founders: </strong>Need a product-tech alignment assessment before your next raise? <strong>Let&#8217;s talk.</strong></p><p><strong>Investors: </strong>Want a second opinion on whether a target&#8217;s architecture supports their growth narrative? <strong>Let&#8217;s talk.</strong></p><div><hr></div><p><strong>Next in the Playbook: </strong>Edition 21 will explore a topic close to my heart: Infrastructure as Code &amp; GitOps. Stay tuned!</p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/p/product-tech-fit?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">Thanks for reading The Tech Due Diligence Playbook! 
This post is public so feel free to share it.</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/p/product-tech-fit?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/p/product-tech-fit?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div>]]></content:encoded></item><item><title><![CDATA[Quality Assurance and Testing Maturity]]></title><description><![CDATA[Edition 19]]></description><link>https://eitanschuler.substack.com/p/quality-assurance-and-testing-maturity</link><guid isPermaLink="false">https://eitanschuler.substack.com/p/quality-assurance-and-testing-maturity</guid><dc:creator><![CDATA[Eitan Schuler]]></dc:creator><pubDate>Mon, 12 Jan 2026 13:39:17 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/0319c044-e537-4d69-991d-84c336d60014_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Three days before a Series B close, the investor&#8217;s technical advisor asked to see the test suite run. The demo went smoothly until someone noticed the timer: 47 minutes for a &#8220;full&#8221; run that skipped integration tests, mocked every external service, and covered roughly 40% of critical paths. Worse, the team admitted they ran the suite &#8220;when we remember&#8221; because it was too slow for daily use. The code looked clean, but the safety net had holes you could drive a truck through. The deal survived, but with a 90-day remediation plan that consumed two senior engineers who should have been shipping the next revenue feature.</p><p>Testing isn&#8217;t about finding bugs after the fact. 
It&#8217;s about preventing expensive mistakes from reaching customers and proving to investors that velocity won&#8217;t collapse under its own weight as the team scales.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/subscribe?"><span>Subscribe now</span></a></p><h2><strong>Why this matters</strong></h2><p>Investors track testing maturity because it predicts three things they care about: delivery risk, customer churn, and engineering margin. A strong test culture means features ship faster with fewer rollbacks, incidents drop, and new engineers onboard without breaking production. Weak testing shows up as lengthening release cycles, incident fatigue, and the dreaded &#8220;we can&#8217;t change this code because we don&#8217;t know what will break.&#8221;</p><p>The math is brutal. Fixing a bug in code review costs hours. Fixing it in production costs days of engineering time, customer support load, potential SLA credits, and reputational damage that compounds. For growth-stage companies where every percentage point of margin matters, the difference between &#8220;we catch it in CI&#8221; and &#8220;customers report it&#8221; can swing enterprise value by millions.</p><h2><strong>What investors look for</strong></h2><p>Investors aren&#8217;t asking for 100% test coverage or zero bugs. They&#8217;re looking for systematic quality that scales without heroics.</p><p><strong>Test pyramid</strong> in practice, not theory. The classic model holds: many fast unit tests at the base, fewer integration tests in the middle, minimal end-to-end tests at the top. Investors check whether this exists in reality. 
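</p><p>Checking whether the pyramid exists is mechanical once CI reports per-layer test counts. A minimal sketch of that check, with hypothetical suite names, counts, and thresholds (none of these come from a real CI system):</p>

```python
# Sketch: derive the test-pyramid split from CI test counts.
# Suite names, counts, and thresholds are illustrative, not from a real pipeline.

def pyramid_report(counts: dict[str, int]) -> dict[str, float]:
    """Return each layer's share of the total test count, as a percentage."""
    total = sum(counts.values())
    return {layer: round(100 * n / total, 1) for layer, n in counts.items()}

def pyramid_healthy(shares: dict[str, float]) -> bool:
    """Rough health gate: mostly unit tests, only a sliver of end-to-end."""
    return shares.get("unit", 0) >= 60 and shares.get("e2e", 0) <= 10

counts = {"unit": 1400, "integration": 500, "e2e": 100}  # hypothetical CI output
shares = pyramid_report(counts)
print(shares)                   # {'unit': 70.0, 'integration': 25.0, 'e2e': 5.0}
print(pyramid_healthy(shares))  # True
```

<p>The point is not the exact thresholds but that the distribution is computed from real CI data on every run, rather than asserted on a slide.</p><p>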
If your &#8220;test suite&#8221; is actually 90% slow E2E tests running against a full stack, delivery will grind down as the codebase grows. A healthy split might be 70% unit, 25% integration, 5% E2E, with total run time under 10 minutes for the core suite.</p><p><strong>Coverage tied to risk</strong>, not vanity metrics. An 85% coverage number means nothing if it skips authentication, payments, and data-export paths. Smart teams track coverage on critical flows and show it in dashboards: &#8220;checkout path: 94%, billing logic: 91%, auth: 89%.&#8221; Investors want proof you know where the money lives and you&#8217;re protecting it.</p><p><strong>Tests that fail fast and clearly</strong>. A flaky test that passes 80% of the time trains teams to ignore failures. Investors sample CI logs looking for retry logic, ignored tests, or &#8220;TODO: fix this flake&#8221; comments. When they find it, they assume quality is theater. Healthy teams quarantine flakes immediately, fix or delete them within a sprint, and treat flakiness as a reliability bug.</p><p><strong>Shift-left</strong> integration with development. Testing shouldn&#8217;t be a separate phase that happens &#8220;after dev is done.&#8221; Investors sometimes look for test-driven development or at least test-first practices where applicable, pre-commit hooks that run fast tests, and PR gates that block merge on failure. If QA is still a separate team that &#8220;validates&#8221; work weeks after it was written, release cycles will stay long and defect escape rates will stay high.</p><p><strong>Test environments that mirror production</strong>. Staging or pre-prod should match production in config, data shape, and integrations. Using localhost with SQLite while production runs Postgres on a managed service guarantees surprises. 
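</p><p>Parity drift is also easy to check mechanically: dump the effective config of each environment and diff the results. A minimal sketch, with illustrative keys and values:</p>

```python
# Sketch: diff staging config against production to catch parity drift
# (e.g. SQLite in staging, Postgres in prod). Keys and values are illustrative.

def config_drift(prod: dict, staging: dict) -> dict:
    """Report keys missing from staging and keys whose values differ."""
    missing = sorted(set(prod) - set(staging))
    differs = sorted(k for k in set(prod) & set(staging) if prod[k] != staging[k])
    return {"missing_in_staging": missing, "value_differs": differs}

prod = {"db_engine": "postgres", "cache": "redis", "queue": "sqs"}
staging = {"db_engine": "sqlite", "cache": "redis"}
print(config_drift(prod, staging))
# {'missing_in_staging': ['queue'], 'value_differs': ['db_engine']}
```

<p>Run in CI, a diff like this turns &#8220;staging mostly matches prod&#8221; from a belief into a gate.</p><p>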
Investors check whether test data is realistic (volume, edge cases, PII properly masked), whether secrets and configs are managed consistently, and whether the team can spin up isolated environments on demand for testing breaking changes.</p><p><strong>Automated regression and smoke tests</strong>. Every release should trigger a smoke test in staging that exercises critical paths: can users log in, complete a transaction, retrieve data, and trigger key workflows? Regression suites should run nightly against realistic data volumes. If these don&#8217;t exist or run &#8220;when someone remembers,&#8221; investors assume delivery confidence is based on hope.</p><h2><strong>Stage and stake: how the lens sharpens</strong></h2><p>Seed and early Series A teams often test by hand and ship fixes fast. Investors accept this if the team shows awareness: a written testing roadmap, a small but growing automated suite, and evidence that defect rates aren&#8217;t climbing. The key signal is trajectory, not perfection.</p><p>Series B and growth companies must demonstrate systematic testing. Expect code coverage on critical paths above 80%, fast CI feedback (under 15 minutes), staging environments that match production, and QA embedded in product teams rather than operating as a bottleneck. Investors will sample test reports, check flake rates, and verify that the test suite actually prevents bad deploys.</p><p>Control buy-outs trigger deep inspection. Buyers run the test suite themselves, review coverage reports against incident history, check test data provenance, and verify that compliance-critical flows (GDPR deletion, SOC 2 controls, financial calculations) have documented test cases. 
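</p><p>Buyers can compute flake rates themselves from CI history, so it pays to run the same check first. A minimal sketch of a quarantine rule (the &#8220;more than one failure in 50 runs&#8221; threshold is a policy choice, not a standard; test names and histories are hypothetical):</p>

```python
# Sketch: flag flaky tests from CI pass/fail history and build a quarantine list.
# The "more than one failure in the last 50 runs" rule is a policy choice,
# not a standard; test names and histories are hypothetical.

def flaky(history: list[bool], window: int = 50, max_failures: int = 1) -> bool:
    """True if the test failed more than max_failures times in the last window runs."""
    return history[-window:].count(False) > max_failures

def quarantine_list(histories: dict[str, list[bool]]) -> list[str]:
    """Names of tests that should be disabled pending a fix."""
    return sorted(name for name, h in histories.items() if flaky(h))

histories = {
    "test_checkout": [True] * 48 + [False, False],  # 2 failures in 50 runs -> flaky
    "test_login": [True] * 50,                      # clean
}
print(quarantine_list(histories))  # ['test_checkout']
```

<p>Every name on that list gets an owner and a fix-by date before it re-enters the suite.</p><p>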
They price testing debt the same way they price technical debt: as deferred capex that will hit the P&amp;L post-close.</p><h2><strong>Red flags that slow or sink deals</strong></h2><ul><li><p>Test suite takes over 30 minutes to run; teams skip it to stay productive</p></li><li><p>Flaky tests routinely ignored; &#8220;just rerun the build&#8221; is standard practice</p></li><li><p>Critical revenue paths (payments, billing, auth) have minimal or no test coverage</p></li><li><p>Test environments use different databases, configs, or services than production</p></li><li><p>QA operates as a separate, downstream gate; features wait days or weeks for &#8220;QA approval&#8221;</p></li><li><p>Integration and E2E tests mock every external service; real integration failures surface in production</p></li><li><p>No smoke tests; every deploy is &#8220;validated&#8221; by hoping customers don&#8217;t complain</p></li><li><p>Test data is either trivial toy examples or production data copied without masking</p></li></ul><p>Two or more of these typically trigger either a mandated 100-day remediation plan or a price adjustment to cover the engineering quarters needed to build a real safety net.</p><h2><strong>Habits worth adopting before the next round</strong></h2><ul><li><p>Track quality KPIs: test coverage by critical path, flake rate, defect escape rate, and suite run time. Publish monthly and act on trends.</p></li><li><p>Build a test pyramid dashboard showing unit vs. integration vs. E2E distribution.</p></li><li><p>Make the pyramid visible so teams naturally write fast, focused tests.</p></li><li><p>Automate smoke tests for the 10-15 revenue-critical journeys. Run in under 5 minutes, block deploys on failure.</p></li><li><p>Quarantine flakes ruthlessly: any test failing more than once in 50 runs gets disabled immediately with owner and fix-by date.</p></li><li><p>Invest in realistic test data: synthetic or properly masked snapshots with edge cases and volume. 
Poor test data is why passing tests miss real bugs.</p></li></ul><h2><strong>Mini-Glossary</strong></h2><ul><li><p>Test pyramid: A model showing many fast unit tests, fewer integration tests, minimal E2E tests; optimizes speed and reliability.</p></li><li><p>Flaky test: A test that sometimes passes and sometimes fails for the same code; destroys trust in the test suite.</p></li><li><p>Shift-left: Moving quality activities earlier in development (design, code review) rather than late (QA phase, production).</p></li><li><p>Defect escape rate: Percentage of bugs that reach production versus caught pre-deploy; lower is better.</p></li><li><p>Smoke test: A fast, shallow test of critical paths to verify basic functionality before deeper testing or deploy.</p></li><li><p>Test coverage: Percentage of code exercised by tests; useful when focused on critical paths, misleading as a global metric.</p><p></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/p/quality-assurance-and-testing-maturity?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/p/quality-assurance-and-testing-maturity?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></li></ul><h2><strong>Your turn</strong></h2><p> What testing gap bit you hardest in a deal? Flaky tests that trained teams to ignore failures? Critical paths with no coverage? Test environments that hid production bugs? Share the scar; it helps the next team avoid it.</p><div><hr></div><p><strong>Founders:</strong> Need a testing maturity assessment or a roadmap to shift left before your next raise? 
<strong>Let&#8217;s talk.</strong></p><p><strong>Investors:</strong> Want a second opinion on test coverage and quality practices in a target company? <strong>Let&#8217;s talk.</strong></p><div><hr></div><p>Next in the Playbook: Edition 20 will explore Product-Tech Fit: when architecture choices align with (or fight against) product strategy, and how investors spot the mismatches that stall growth. Stay tuned!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/subscribe?"><span>Subscribe now</span></a></p>]]></content:encoded></item><item><title><![CDATA[Year-End Wrap-up for 2025]]></title><description><![CDATA[Wrapping up 2025]]></description><link>https://eitanschuler.substack.com/p/year-end-wrap-up-for-2025</link><guid isPermaLink="false">https://eitanschuler.substack.com/p/year-end-wrap-up-for-2025</guid><dc:creator><![CDATA[Eitan Schuler]]></dc:creator><pubDate>Thu, 25 Dec 2025 20:48:28 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/4d98f89e-71a7-4981-bd4e-525853b4cbee_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Originally published on <a href="https://www.linkedin.com/pulse/year-end-wrap-up-2025-eitan-schuler-socmf">LinkedIn </a>on December 25, 2025.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/subscribe?"><span>Subscribe now</span></a></p><h2><strong>The Year Tech Due Diligence Got Real</strong></h2><p>December 2025. 
I&#8217;m reviewing notes, comments and direct messages related to the first seventeen editions of this newsletter, and one pattern screams louder than the rest: the tolerance for &#8220;we&#8217;ll handle it later&#8221; disappeared completely this year - investors now demand &#8220;show me the evidence&#8221; from day one.</p><p>2025 was the year tech due diligence stopped being a checkbox exercise and became the front door to every serious conversation. Let me tell you what actually happened, what you told me in the comments and tons of direct messages, and what&#8217;s worth carrying into 2026.</p><h2><strong>Three things that changed the game in 2025</strong></h2><h3><strong>1. Supply chain became the &#8220;kill chain&#8221;</strong></h3><p>Verizon&#8217;s 2025 report dropped a bomb: 30% of all data breaches involved third parties, double the previous year. Average remediation cost hit $4.91 million per incident.</p><p>The headlines kept coming. In June, procurement provider Chain IQ was breached, exposing internal customer data including UBS executive contact details. Between June and August, multiple organizations reported unauthorized access to Salesforce-hosted or Salesforce-integrated CRM systems, affecting companies including Google and Workday, primarily through social engineering and third-party access. Qantas disclosed that 5.7 million customer records were accessed via a contact-center vendor and later published after a ransom demand was refused. Allianz Life confirmed 1.4 million records exposed through a third-party CRM in July.</p><p>Edition 9 on vendor due diligence became my most-shared piece. Why? Because founders realized their stack&#8217;s weakest link could rewrite their valuation more than any internal tech debt. DORA kicked in for financial services, NIS2 started rolling across EU member states, and suddenly &#8220;we have an SBOM&#8221; stopped being sufficient. 
Investors wanted continuous monitoring, vendor-risk scoring in live dashboards, and proof you could swap a critical vendor in 48 hours.</p><h3><strong>2. The AI Act went from theory to leverage</strong></h3><p>The EU AI Act hit enforcement back in September. What surprised me wasn&#8217;t the regulation itself, it was how quickly it became a sales tool for prepared companies. B2B SaaS selling into Europe suddenly had to answer one question: &#8220;Which risk tier? Show me your conformity file.&#8221;</p><p>Companies that had already mapped their AI risk classification, documented their data governance, and proven their human-oversight mechanisms started closing enterprise deals faster than competitors who were still scrambling. One founder told me their Data Protection Impact Assessment and SOC 2 controls basically wrote their AI Act conformity documentation. That 70% overlap I mentioned in Edition 8? Turned out to be pretty close.</p><p>The penalty structure (up to 7% of global revenue) made this non-negotiable. Investors started treating AI compliance readiness the way they treated GDPR in 2018, as a gating item, not a nice-to-have.</p><h3><strong>3. People risk became quantifiable</strong></h3><p>This one surprised me. <strong><a href="https://www.linkedin.com/pulse/people-lens-eitan-schuler-p6pme">Edition 11 (The People Lens)</a></strong> started as the piece I thought would get the least traction. Org charts and succession planning feel &#8220;soft&#8221; compared to API contracts and database sharding. I was completely wrong.</p><p>Investors in 2025 started asking for bench-depth models alongside burn-down charts. They wanted to know: &#8220;If your principal engineer is out for a month, what slows down?&#8221; Bus-factor-of-one became a deal discount. On-call health metrics started appearing in data rooms next to deployment frequency.</p><p>The shift? People finally connected fragile teams to fragile deliveries.
A clean codebase with a hero culture is just technical debt with better PR. Three weeks before one close, a buyer asked: &#8220;If your principal engineer is out for a month, what slows down?&#8221; The CTO said: &#8220;Only he knows the data plane and deployment system end-to-end.&#8221; That answer triggered a holdback tied to succession planning and delivery predictability.</p><h2><strong>The patterns that paid off across 17 editions</strong></h2><p>Looking back at every conversation, every comment, every &#8220;this burned us&#8221; story, five habits separated teams that sailed through diligence from teams that scrambled:</p><p><strong>They treated evidence like a product.</strong> The winners didn&#8217;t create artifacts for diligence, they created them as operating tools. Data ownership maps updated with every schema change. Vendor registers refreshed quarterly. SBOMs generated in CI, not assembled the week before close. When diligence asked, they just pointed at what already existed.</p><p><strong>They priced risk before investors did.</strong> <strong><a href="https://www.linkedin.com/pulse/debt-deal-breaker-eitan-schuler-x1z9e">Edition 4</a></strong>&#8217;s Debt Ledger, <strong><a href="https://www.linkedin.com/pulse/deal-maker-checkbox-eitan-schuler-73j0f">Edition 1</a></strong>&#8217;s Risk Register, <strong><a href="https://www.linkedin.com/pulse/scaling-without-bleeding-cash-eitan-schuler-nm5uf">Edition 5</a></strong>&#8217;s FinOps dashboard, these weren&#8217;t compliance theater. They were steering mechanisms. Teams that tracked tech debt with owners and estimates, cloud cost per euro of ARR, and incident post-mortem velocity didn&#8217;t negotiate with fear.
They negotiated with data.</p><p><strong>They made residency and sovereignty architectural, not contractual.</strong> The companies that aced <strong><a href="https://www.linkedin.com/pulse/data-governance-sovereign-data-readiness-eitan-schuler-5arle">Edition 10</a></strong> and <strong><a href="https://www.linkedin.com/pulse/cloud-strategy-deployment-models-eitan-schuler-611qe">14</a></strong> didn&#8217;t bolt on data residency later. They designed for it: regional sharding from day one, KMS keys pinned in-region, telemetry that never crossed borders. When customers asked &#8220;prove EU data stays in EU,&#8221; they exported a dashboard, not a promise.</p><p><strong>They automated the boring parts of governance.</strong> Tagging policies enforced in Terraform. Secrets scanning in every commit. Policy-as-code for IaC changes. Vulnerability SLAs tracked per service with automated escalation. This wasn&#8217;t DevSecOps theater, it was margin protection. Every hour not spent chasing down an untagged resource or a leaked secret was an hour of shipping features.</p><p><strong>They ran drills, not slides.</strong> Business continuity plans that had never been tested became liabilities (<strong><a href="https://www.linkedin.com/pulse/business-continuity-resilience-stress-test-eitan-schuler-vebre">Edition 13</a></strong>). API contracts that drifted from production became integration hell (<strong><a href="https://www.linkedin.com/pulse/api-strategy-integration-readiness-eitan-schuler-is0oe">Edition 12</a></strong>). But teams that ran quarterly failover drills, chaos exercises, and &#8220;assume vendor down&#8221; scenarios turned operational discipline into a competitive moat. Investors trust what&#8217;s rehearsed, not what&#8217;s promised.</p><h2><strong>What I got wrong (and learned fast)</strong></h2><p>I underestimated how fast supply chain risk would dominate. 
When I wrote <strong><a href="https://www.linkedin.com/pulse/vendor-due-diligence-third-party-risk-eitan-schuler-mjeue">Edition 9</a></strong> in early fall, third-party breaches were a concern. By December, they were the primary attack vector. The Salesforce cascade, the island hopping attacks, the procurement vendor compromises, these weren&#8217;t edge cases. They were the new normal. If you haven&#8217;t stress-tested your vendor exit clauses or run an &#8220;assume critical vendor compromised&#8221; drill, make that your January priority.</p><p>I didn&#8217;t emphasize multi-account architecture enough early. <strong><a href="https://www.linkedin.com/pulse/scaling-without-bleeding-cash-eitan-schuler-nm5uf">Edition 5</a></strong> touched on it, but I should have screamed it from page one: separate cloud accounts for each product and environment from Day 1. Every founder who told me &#8220;we&#8217;re trying to untangle mixed environments now&#8221; paid for it in diligence time and FinOps headaches.</p><p>I should have written the <strong><a href="https://www.linkedin.com/pulse/people-lens-eitan-schuler-p6pme">People Lens edition</a></strong> earlier. Org design, succession planning, and on-call health, these aren&#8217;t nice-to-haves at Series B. They&#8217;re gating items. The technical stack is easier to fix. Fragile teams take quarters.</p><h2><strong>Looking ahead: what&#8217;s coming in 2026</strong></h2><p><strong>The regulatory ratchet keeps tightening.</strong> The AI Act is enforcing. DORA is live. NIS2 is rolling out. The EU Data Act kicked in September 2025, with full switching-fee elimination coming January 2027. Every one of these shifts costs from &#8220;we&#8217;ll figure it out&#8221; to &#8220;show me the controls.&#8221; If your compliance posture is reactive, 2026 will hurt.</p><p><strong>Unit economics become non-negotiable.</strong> Cheap capital is gone. 
Every Series B and beyond will face one question: &#8220;Prove your cloud spend grows slower than ARR.&#8221; FinOps isn&#8217;t optional anymore, it&#8217;s margin defense. <strong><a href="https://www.linkedin.com/pulse/scaling-without-bleeding-cash-eitan-schuler-nm5uf">Edition 5</a></strong>&#8217;s habits (tagging, showback, right-sizing) are table stakes now.</p><p><strong>Supply chain scrutiny becomes standard.</strong> SBOM generation, vendor-risk scoring, continuous monitoring, and exit-clause negotiation won&#8217;t be &#8220;nice work if you have it,&#8221; they&#8217;ll be expected in every data room. The companies that treat third-party risk as a feature, not a footnote, will close deals faster.</p><p><strong>Platform engineering becomes the differentiator.</strong> The gap between teams with paved roads (golden paths, self-service IDP, automated guardrails) and teams without is widening. In 2026, investors will ask: &#8220;How long does it take a new engineer to ship to production?&#8221; and &#8220;What percentage of teams can deploy without a ticket?&#8221;</p><h2><strong>If you read only three editions before your next round</strong></h2><p>If time is short and a term sheet is close, start here:</p><ul><li><p><strong><a href="https://www.linkedin.com/pulse/deal-maker-checkbox-eitan-schuler-73j0f">Edition 1</a></strong> (Deal-Maker, Not Checkbox) - Build your risk register. Frame risk, don&#8217;t hide it.</p></li><li><p><strong><a href="https://www.linkedin.com/pulse/metrics-matter-eitan-schuler-3ulpe">Edition 3</a></strong> (The Metrics That Matter) - Get your five signals clean: speed, stability, quality, reliability, unit economics.</p></li><li><p><strong><a href="https://www.linkedin.com/pulse/vendor-due-diligence-third-party-risk-eitan-schuler-mjeue">Edition 9</a></strong> (Vendor Due Diligence &amp; Third-Party Risk) - Map your supply chain.
Continuous monitoring, not annual PDFs.</p></li></ul><p>Then layer in whatever matches your stage and sector: APIs if you&#8217;re B2B SaaS, sovereign data if you touch EU customers, DevSecOps if you&#8217;re post-Series A.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/subscribe?"><span>Subscribe now</span></a></p><h2><strong>Your turn</strong></h2><p>This newsletter exists because you&#8217;ve shared your scars, your surprises, and your shortcuts. Over seventeen editions, the comments and direct messages taught me as much as the research.</p><p>So here&#8217;s my ask: What topic did I miss in 2025 that bit you during a deal? Was it M&amp;A carve-out complexity? Product-market fit under technical constraints? IP ownership in distributed teams? Kubernetes cost chaos? AI model governance beyond compliance checkboxes? Drop it in the comments. The best suggestions become Edition 18 and beyond in 2026.</p><div><hr></div><p><strong>Founders:</strong> Need a year-end tech health check before the next fundraise? A fast gap-scan of your data room readiness? Let&#8217;s talk.</p><p><strong>Investors:</strong> Want a second pair of eyes on a live deal, or a portfolio-wide benchmark of tech maturity? Let&#8217;s talk.</p><p><strong>Here&#8217;s to a year of better questions, cleaner evidence, and deals that close on momentum instead of surprises. See you all in 2026. 
Stay tuned!</strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/subscribe?"><span>Subscribe now</span></a></p>]]></content:encoded></item><item><title><![CDATA[Database Architecture & Data Strategy]]></title><description><![CDATA[Edition 17]]></description><link>https://eitanschuler.substack.com/p/database-architecture-and-data-strategy</link><guid isPermaLink="false">https://eitanschuler.substack.com/p/database-architecture-and-data-strategy</guid><dc:creator><![CDATA[Eitan Schuler]]></dc:creator><pubDate>Mon, 17 Nov 2025 12:38:35 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/0a80f270-295f-4f36-a337-ca4afdd1aa8c_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Originally published on <a href="https://www.linkedin.com/pulse/database-architecture-data-strategy-eitan-schuler-9fste/">LinkedIn</a> on November 12, 2025.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/subscribe?"><span>Subscribe now</span></a></p><p>Two months before a Series B close, the lead investor asked a simple question: &#8220;Walk me through your plan to migrate off Sybase.&#8221; The CTO froze. The 15-year-old database worked fine (until it didn&#8217;t). Licensing costs scaled with revenue, not usage. Cloud-native tooling couldn&#8217;t connect. Hiring engineers who understood the stack was nearly impossible. The migration plan?
&#8220;We&#8217;ll figure it out after we close.&#8221;</p><p>The deal survived, but with a 10% holdback tied to a 12-month Microsoft SQL Server migration. Two senior engineers spent a year rewriting queries, testing data integrity, and managing dual-write periods instead of shipping features. The opportunity cost? Roughly &#8364;800K in delayed product work and a competitor who shipped faster.</p><h2>Why this matters</h2><p>Database architecture isn&#8217;t plumbing. It&#8217;s the foundation that determines how fast you scale, how much you spend, and whether you can evolve without rewriting everything. Investors care because the wrong choice compounds: migration risk grows with data volume, vendor lock-in tightens with revenue, and architectural debt strangles margin.</p><p><strong>There is no universal &#8220;right&#8221; answer.</strong> A single Postgres instance can outperform a distributed system if workload and team fit the model. Microservices with 47 databases can be worse than a well-factored monolith if ownership is unclear. The question isn&#8217;t &#8220;what&#8217;s trendy?&#8221; but &#8220;what fits your constraints, and can you prove it?&#8221;</p><h2>What investors look for</h2><p><strong>Clear data ownership and boundaries</strong> <br>Every dataset has a single writer and a named owner. Cross-service writes to shared tables signal fragile deploys and integration hell during carve-outs. Maintain a data ownership map showing which service owns which tables and how data flows between them (see Edition 10 on data governance and Edition 15 on domain boundaries).</p><p><strong>Architecture matched to workload</strong> <br>Show why you chose what you chose. High-write transactional systems favor strong consistency. Analytics or read-heavy patterns tolerate eventual consistency. Mixed workloads might justify polyglot persistence. 
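</p><p>The proof behind such a choice can be small. A hedged sketch that turns a hypothetical log of (statement, table, latency) samples into read/write ratios and tail latency, the evidence a workload-matching claim should rest on:</p>

```python
# Sketch: profiling raw query samples into per-table evidence.
# Log format and field names are illustrative assumptions.
query_log = [  # (statement, table, latency_ms) - hypothetical samples
    ("SELECT", "orders", 12), ("SELECT", "orders", 480), ("INSERT", "orders", 9),
    ("SELECT", "events", 35), ("INSERT", "events", 4), ("INSERT", "events", 6),
]

def p95(samples):
    """Nearest-rank 95th percentile; crude but stable for a sketch."""
    s = sorted(samples)
    return s[min(len(s) - 1, int(0.95 * len(s)))]

def profile(log):
    """Per-table read/write ratio and tail latency from raw samples."""
    tables = {}
    for stmt, table, ms in log:
        t = tables.setdefault(table, {"reads": 0, "writes": 0, "lat": []})
        t["reads" if stmt == "SELECT" else "writes"] += 1
        t["lat"].append(ms)
    return {name: {"read_write_ratio": t["reads"] / max(t["writes"], 1),
                   "p95_ms": p95(t["lat"])}
            for name, t in tables.items()}

# A read-heavy table with a long tail argues for replicas or caching;
# a write-heavy one argues for strong consistency on the primary.
summary = profile(query_log)
```

<p>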
But every choice should tie to measured query patterns, not trends.</p><p><strong>Performance tracked as SLOs</strong> <br>Track P95/P99 latency per critical query type. Prove you know which queries are slow, why, and what happens when they breach SLO. Slow query logs, index coverage, and optimization backlogs with ROI estimates matter.</p><p><strong>Scalability with evidence</strong> <br>Can you handle 5x traffic without a rewrite? Show vertical scaling limits, sharding plans if needed, read replica strategy, and load test results that prove the theory.</p><p><strong>Migration and exit readiness</strong> <br>Vendor lock-in is priced like technical debt (see Edition 4). Can you export data in standard formats? How long would a migration take? If you&#8217;re on proprietary tech, investors will assume a multi-quarter, high-risk migration and discount accordingly.</p><p><strong>Capacity forecasting tied to revenue</strong> <br>Model database cost and performance against ARR growth. Storage rates, IOPS trends, and connection pool saturation should stay flat or decline as you scale (more in Edition 5).</p><h2>Decision framework: when to choose what</h2><p><strong>Single database works when</strong> domains are evolving, teams are small, strong consistency matters, and you enforce modular boundaries within the schema. Watch for: if one table becomes the bottleneck, plan extraction seams.</p><p><strong>Distributed databases work when</strong> domains are stable with clear ownership, independent scaling matters (batch vs. real-time), and you can tolerate eventual consistency or partition data cleanly. Watch for: if ownership is fuzzy or transactions span services, a modular monolith is simpler.</p><p><strong>Polyglot persistence works when</strong> workload shapes genuinely differ (transactional + search + analytics), each DB has a clear owner and migration path, and operational overhead is staffed.
Watch for: using five databases &#8220;because we tried them all&#8221; is sprawl, not strategy.</p><h2>How stage and stake sharpen the lens</h2><p><strong>Seed / early A:</strong> A single database with a documented scaling plan is acceptable. Show you understand vertical limits, have a read-replica or caching strategy sketched, and can articulate when you&#8217;d shard or split. Basic query observability and nightly backups with one timed restore per quarter.</p><p><strong>Series B / Growth:</strong> Expect clear data ownership, capacity models tied to ARR, query SLOs tracked per service, and a tested migration path if you&#8217;re on proprietary tech. Distributed systems require proof: how you handle consistency, partition data, and recover from split-brain scenarios.</p><p><strong>Control buy-outs:</strong> Buyers sample query performance against SLOs, run capacity models forward 3 years, verify backup/restore in clean rooms, and check migration cost if lock-in exists. They price the engineering quarters needed to scale or migrate, and they discount uncertainty.</p><h2>Red flags that trigger discounts or 100-day plans</h2><ul><li><p>Single massive database with no sharding or replica strategy; vertical scaling limit is 12 months out</p></li><li><p>Cross-service writes to shared tables; changes ripple unpredictably</p></li><li><p>Proprietary database with no export path; migration estimate is &#8220;we&#8217;ll figure it out&#8221;</p></li><li><p>Query performance degrading with growth; no SLOs, no optimization backlog</p></li><li><p>Backup/restore untested; RTO/RPO are aspirational</p></li><li><p>Database cost growing faster than ARR; no capacity forecasting</p></li></ul><h2>Habits worth adopting before the next round</h2><ul><li><p><strong>Maintain a data ownership map</strong>: services, datasets, writers, readers, contracts (APIs/events/CDC), and owners. 
Update it with every schema change.</p></li><li><p><strong>Set query SLOs and track them</strong>: P95/P99 latency per critical query type. Alert on breaches, maintain an optimization backlog.</p></li><li><p><strong>Run quarterly capacity reviews</strong>: storage growth, IOPS saturation, connection pool usage. Model against ARR growth; flag when you&#8217;ll hit limits.</p></li><li><p><strong>Keep a migration readiness checklist</strong>: schema portability, dialect mapping, tooling gaps, testing plan, timeline estimate. Update annually.</p></li><li><p><strong>Test backups with timed restores</strong>: clean-room restore every quarter. Measure RTO/RPO and verify data integrity (Edition 13).</p></li><li><p><strong>Model database spend as FinOps</strong>: cost per query, per GB, per connection. Tie it to revenue; prove margins improve at scale (Edition 5).</p></li></ul><h2>Mini-Glossary</h2><ul><li><p><strong>Polyglot persistence</strong>: Using multiple database types (SQL, NoSQL, time-series) matched to workload needs.</p></li><li><p><strong>CDC (Change Data Capture)</strong>: Streaming database changes to other systems via transaction logs, not polling.</p></li><li><p><strong>Sharding</strong>: Horizontal partitioning of data across multiple DB instances; critical for scale but adds operational overhead.</p></li><li><p><strong>Read replica</strong>: A copy of the database that handles read queries to reduce load on the primary; can&#8217;t accept writes.</p></li><li><p><strong>IOPS</strong>: Input/Output Operations Per Second; measures database workload and performance capacity.</p></li><li><p><strong>Connection pool saturation</strong>: When all available database connections are in use; causes new requests to queue or fail.</p></li></ul><h2>Your turn</h2><p>What database choice has bitten you (or saved you) during diligence? Was it a migration nightmare, a scalability wall, or tight coupling that stalled a carve-out?
Share the story; scars teach best.</p><p><strong>Founders:</strong> Need a data architecture health check or migration readiness review? <strong>Let&#8217;s talk.</strong> <br><strong>Investors:</strong> Want a pre-deal assessment of database strategy and scaling risk? <strong>Let&#8217;s talk.</strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><p><strong>Next in the Playbook:</strong> Edition 18 will explore Quality Assurance &amp; Testing Maturity. Stay tuned!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/p/database-architecture-and-data-strategy?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/p/database-architecture-and-data-strategy?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[DevSecOps Maturity & Shift‑Left Security]]></title><description><![CDATA[Edition 16]]></description><link>https://eitanschuler.substack.com/p/devsecops-maturity-and-shiftleft</link><guid isPermaLink="false">https://eitanschuler.substack.com/p/devsecops-maturity-and-shiftleft</guid><dc:creator><![CDATA[Eitan Schuler]]></dc:creator><pubDate>Mon, 10 Nov 2025 12:25:42 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/1f49f8f0-4195-46a0-946a-48776d5e56e2_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Originally published on <a 
href="https://www.linkedin.com/pulse/devsecops-maturity-shiftleft-security-eitan-schuler-dhibe/">LinkedIn</a> on November 5, 2025</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/subscribe?"><span>Subscribe now</span></a></p><p>Three weeks before close, a buyer&#8217;s security team requested &#8220;evidence of vulnerability management.&#8221; The CTO pulled up a quarterly scan report showing 47 high-severity findings, half of them six months old. One critical CVE sat in a transitive dependency nobody had noticed. Worse, secrets were scattered across config files, CI tokens had admin scope, and the team couldn&#8217;t prove which services were actually exposed to the internet. The deal survived, but with a 15% discount and a mandatory 90-day security sprint that stalled two product launches.</p><h2>Why this matters now</h2><p><strong>Shift-left</strong> means moving security checks earlier in the development cycle and catching issues in code review rather than production. For investors, this translates directly to margin and speed: fixing a SQL injection during PR review costs hours; fixing it after a breach costs millions in remediation, legal fees, and customer churn.</p><p><strong>Supply-chain attacks have gone mainstream.</strong> SolarWinds, Log4Shell, and MOVEit showed that your dependency tree is your attack surface. Buyers now expect SBOMs (Software Bills of Materials), signed artifacts, and proof you know what&#8217;s running in production.</p><p><strong>Enterprise buyers demand it.</strong> SOC 2, ISO 27001, and vendor security questionnaires are table stakes for mid-market deals. 
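</p><p>Passing those reviews starts with gates, not reports. A shift-left gate can be as small as a pre-merge scan; a minimal sketch with illustrative patterns (real scanners such as gitleaks ship far broader, maintained rule sets):</p>

```python
import re

# Sketch of a pre-merge secrets gate: catch credentials before they land.
# Patterns below are illustrative, not exhaustive.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                        # AWS access key ID shape
    re.compile(r"-----BEGIN (RSA |EC )?PRIVATE KEY-----"),  # committed private key
    re.compile(r"(?i)(password|secret|token)\s*=\s*['\"][^'\"]{8,}"),
]

def scan_diff(added_lines):
    """Return (line_no, line) pairs in a diff that look like credentials."""
    return [(no, line) for no, line in enumerate(added_lines, start=1)
            if any(p.search(line) for p in SECRET_PATTERNS)]

# A CI job would run this over the PR diff and exit non-zero on any hit.
findings = scan_diff(["db_password = 'correct-horse-battery'", "retries = 3"])
```

<p>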
Companies that treat security as a Friday audit rather than a daily gate lose sales cycles or accept margin-crushing exceptions.</p><p><strong>Regulators are watching.</strong> DORA, NIS2, and the AI Act all require demonstrable security controls, not promises. I wrote about these extensively in Editions 8 and 9. Investors price the risk of enforcement, reputational damage, and the engineering capacity needed to retrofit controls post-close.</p><h2>What investors look for: signals that separate theater from discipline</h2><p><strong>Security built into delivery, not bolted on later</strong> Investors want to see controls wired into CI/CD: branch protection, mandatory code reviews, automated scanning that fails fast on secrets or high-severity vulnerabilities, and signed build artifacts. A monthly scanner report gathering dust signals that security is separate from shipping, not part of it.</p><p><strong>Clear ownership and time-bound SLAs</strong> Every vulnerability should map to severity, exploitability, and a fix deadline. Critical CVEs in exposed services get patched within days, not quarters. Exceptions require explicit approval, expiry dates, and compensating controls, not a backlog labeled &#8220;technical debt.&#8221;</p><p><strong>Supply-chain transparency</strong> Generate an SBOM for every build and attach it to release artifacts. Pin dependencies to specific versions, scan them before merge, and verify signatures at deploy. Investors check whether you can answer &#8220;what&#8217;s inside this container?&#8221; in minutes, not days.</p><p><strong>Infrastructure policy enforced as code</strong> Terraform and Kubernetes manifests should pass policy checks before they reach production: no overly permissive IAM roles, no unencrypted storage, no surprise egress paths.
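</p><p>A minimal sketch of such a gate, run over the JSON form of a Terraform plan (the &#8220;statements&#8221; and &#8220;encrypted&#8221; fields are simplified stand-ins for the real provider attributes; production setups use OPA/Rego, Sentinel, or Checkov):</p>

```python
# Sketch of a policy-as-code gate over a Terraform plan rendered as JSON.
# Field names are simplified for illustration, not the real provider schema.
def violations(plan_json):
    """Flag wildcard IAM and unencrypted storage before apply."""
    found = []
    for rc in plan_json.get("resource_changes", []):
        after = (rc.get("change") or {}).get("after") or {}
        if rc.get("type") == "aws_iam_policy":
            for stmt in after.get("statements", []):  # hypothetical parsed form
                if stmt.get("actions") == ["*"] and stmt.get("resources") == ["*"]:
                    found.append((rc["address"], "wildcard IAM statement"))
        if rc.get("type") == "aws_s3_bucket" and not after.get("encrypted"):
            found.append((rc["address"], "storage without encryption at rest"))
    return found

plan = {"resource_changes": [  # trimmed, illustrative plan fragment
    {"address": "aws_s3_bucket.logs", "type": "aws_s3_bucket",
     "change": {"after": {"encrypted": False}}},
    {"address": "aws_iam_policy.ci", "type": "aws_iam_policy",
     "change": {"after": {"statements": [{"actions": ["*"], "resources": ["*"]}]}}},
]}
found = violations(plan)  # a CI step fails the pipeline if this is non-empty
```

<p>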
Hand-edited production environments signal drift, weak change control, and brittle disaster recovery.</p><p><strong>Secrets and credentials hygiene</strong> Use short-lived, scoped tokens in CI/CD, rotate secrets on a schedule, and scan for leaked credentials in every commit. Long-lived admin tokens or secrets checked into repos are immediate red flags. Investors assume the blast radius is &#8220;everything.&#8221;</p><p><strong>Threat modeling that fits the sprint</strong> Run lightweight threat reviews at story kickoff for any new API endpoint, auth change, or data flow. Annual workshops that never affect code are &#8220;theater&#8221;. Practical, card-based models that shape design decisions prove security is embedded in product thinking.</p><p><strong>Runtime visibility tied to teams</strong> Security dashboards should live where engineers look: alongside deployment frequency, error rates, and SLOs. Alerts must route to clear owners with runbooks, not a generic &#8220;security queue.&#8221; If engineering never sees security metrics, they won&#8217;t act on them.</p><h2>How stage and stake sharpen the lens</h2><p><strong>Seed / early Series A:</strong> Basic controls are enough: MFA and SSO on critical systems, secrets scanning in CI, branch protection on main repos, and SBOMs generated for core services. Investors accept that the paved road is still under construction as long as you show momentum and a plan.</p><p><strong>Series B / Growth:</strong> Enforcement becomes the standard. PRs should fail on policy violations, images must be signed and verified at deploy, IaC changes require policy approval, and vulnerability SLAs are tracked and met. Golden templates exist so new services start secure by default.</p><p><strong>Control buy-outs / Late-stage:</strong> Buyers expect provenance: cryptographic proof of how and where artifacts were built, reproducible builds, and audit-ready evidence packs that map SOC 2 or ISO controls to pipeline data. 
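</p><p>Answering &#8220;what&#8217;s inside this container?&#8221; really can be a minutes-long query rather than a forensic exercise; a sketch over a trimmed CycloneDX-style document (the advisory map is a hypothetical input):</p>

```python
# Sketch: querying an SBOM for known-bad component versions.
# Document shape follows CycloneDX JSON ("components" with name/version),
# trimmed here for illustration.
sbom = {
    "bomFormat": "CycloneDX",
    "components": [
        {"name": "log4j-core", "version": "2.14.1"},
        {"name": "jackson-databind", "version": "2.15.2"},
    ],
}

def affected(sbom_doc, advisories):
    """List components whose exact version appears in a {name: version} map."""
    return [f'{c["name"]}@{c["version"]}'
            for c in sbom_doc.get("components", [])
            if advisories.get(c.get("name")) == c.get("version")]

hits = affected(sbom, {"log4j-core": "2.14.1"})  # Log4Shell-era spot check
```

<p>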
Investors will sample your SBOM, test a credential rotation drill, and verify that critical vulnerabilities are patched and deployed within a week.</p><h2>Red flags that slow or sink deals</h2><ul><li><p><strong>CI tokens with admin scope</strong> that never expire; if CI is compromised, the blast radius is &#8220;everything.&#8221;</p></li><li><p><strong>No SBOMs or signing</strong>; buyers can&#8217;t verify what&#8217;s running or where it came from.</p></li><li><p><strong>Scanner fatigue</strong>: thousands of findings with zero enforcement gates; nobody knows what matters.</p></li><li><p><strong>Secrets in environment files, CI logs, or issue comments</strong>; leaked credentials with no rotation plan.</p></li><li><p><strong>Infrastructure drift</strong>: production patched by hand while IaC sits in a repo as documentation, not source of truth.</p></li><li><p><strong>Vulnerability backlogs older than a fiscal year</strong>; exploit risk priced into escrow or discounted from enterprise value.</p></li></ul><p>Two or more of these typically trigger price protection, mandated remediation roadmaps, or sometimes even a walk-away.</p><h2>Habits worth adopting before the next round</h2><ul><li><p><strong>Track patch time</strong> as a KPI: measure the time from CVE disclosure to deployed fix, per service. Publish it monthly.</p></li><li><p><strong>Attach SBOMs to every release artifact</strong> and keep 12 months of history. Make &#8220;show me what&#8217;s inside&#8221; a 2-minute query, not a forensic exercise.</p></li><li><p><strong>Enforce policy-as-code</strong> for IaC and manifests; treat security rules like application code with review, versioning, and rollback.</p></li><li><p><strong>Run a quarterly controls game-day</strong>: rotate a leaked secret under time pressure, block an unsigned image at deploy, patch a live vulnerability via the pipeline. 
Measure MTTR and refine runbooks.</p></li><li><p><strong>Embed a security engineer in your platform team</strong> to own the paved road and treat security controls as product features with SLAs, not audit checklists.</p></li></ul><h2>Mini-Glossary</h2><ul><li><p><strong>Shift-left</strong>: Moving security checks earlier in development (design, code review) rather than late (production, audits).</p></li><li><p><strong>SBOM</strong>: Software Bill of Materials&#8212;a machine-readable list of all components and versions in an artifact.</p></li><li><p><strong>Policy-as-code</strong>: Security and compliance rules expressed as code, enforced in CI/CD pipelines.</p></li><li><p><strong>Provenance</strong>: Cryptographically signed metadata proving how and where an artifact was built.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/subscribe?"><span>Subscribe now</span></a></p></li></ul><h2>Your turn</h2><p>Which security gap hit hardest in your last deal? Unpatched dependencies, leaked secrets, or policy theater? What fixed it, and how fast? Share the scar. It helps the next founder dodge it.</p><p><strong>Founders:</strong> Need a 2-hour security posture review before the next term sheet? Or need me to review your development processes? <strong>Let&#8217;s talk.</strong></p><p><strong>Investors:</strong> Want a pre-deal assessment of DevSecOps maturity in a target company? 
<strong>Let&#8217;s talk.</strong></p><div><hr></div><p><strong>Next in the Playbook:</strong> Edition 17 will explore Database Architecture &amp; Data Strategy.</p><p><strong>Stay tuned!</strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/p/devsecops-maturity-and-shiftleft?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/p/devsecops-maturity-and-shiftleft?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p>]]></content:encoded></item><item><title><![CDATA[Platform Architecture Assessment]]></title><description><![CDATA[Edition 15]]></description><link>https://eitanschuler.substack.com/p/platform-architecture-assessment</link><guid isPermaLink="false">https://eitanschuler.substack.com/p/platform-architecture-assessment</guid><dc:creator><![CDATA[Eitan Schuler]]></dc:creator><pubDate>Mon, 27 Oct 2025 20:01:19 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/d5cd88b1-ef75-4df4-8e6b-4b265c3c6e08_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Originally published on <a href="https://www.linkedin.com/pulse/platform-architecture-assessment-eitan-schuler-yjvfe/">LinkedIn</a> on October 22, 2025</p><p>A familiar pattern in a Quarterly Business Review: harsh questions about why performance at peak was uneven, why releases were risky, and why that one noisy dependency kept rippling across teams. The debate starts again and again: re-platform and rewrite to microservices, or harden the good old monolith? As always, the right answer <strong>is not ideology</strong>.
It is a clear read on your domain boundaries, deployment independence, and the cost of change.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/subscribe?"><span>Subscribe now</span></a></p><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Tech Due Diligence Playbook! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>Why this matters</h2><p>Investors want scale without fragility, faster iteration without runaway cost, and a predictable path from here to the next 10x. They will ask the founders how architecture supports revenue goals, regulated customers, and uptime promises. Founders need a framework, not slogans.</p><p>Investor expectations: Dos and Don&#8217;ts</p><ol><li><p><strong>Domain boundaries</strong> Draw clear lines between parts of the system and show them on one page. Use context maps and bounded contexts so each area owns its data and decisions, and changes stay local. Don&#8217;t let teams call each other at random or share code in ways that blur ownership. 
If boundaries are not written down, every change risks breaking something else.</p></li><li><p><strong>Coupling and cohesion</strong> Keep each module focused on one purpose and list its allowed dependencies. Use code reviews and simple tools to prevent circular dependencies. Don&#8217;t hide shared state or let services talk through the same database tables. Avoid grab bag utility libraries that everything depends on.</p></li><li><p><strong>Data ownership and integration</strong> Assign a clear owner for every dataset. Version your schemas. Connect systems through APIs or events. If you use change data capture, say so in simple terms and document the contract. Don&#8217;t allow multiple services to write to the same table. Never ship breaking changes without a version and a migration plan.</p></li><li><p><strong>Deployment independence</strong> Make it possible to deploy one service at a time with blue green or canary releases. Keep interfaces backward compatible so others do not need to deploy with you. Don&#8217;t force lockstep releases or shared maintenance windows just because systems are tightly coupled.</p></li><li><p><strong>Runtime performance and bottlenecks</strong> Measure real user latency at P95 and P99. Profile hotspots. Size queues. Tune databases and caches based on evidence. Don&#8217;t overprovision without data or chase CPU charts without flame graphs and load tests.</p></li><li><p><strong>Scalability mode</strong> Be explicit about where you scale up a machine, where you scale out across many machines, and how state is partitioned. Test autoscaling under load. Don&#8217;t just rely on a single stateful chokepoint or expect autoscaling to fix design limits.</p></li><li><p><strong>Resilience and failure isolation</strong> Set service level objectives, use error budgets, add circuit breakers and bulkheads, and run disaster recovery and restore drills in a clean room. 
Don&#8217;t accept failures that cascade across services or skip timed restore tests.</p></li><li><p><strong>Observability and SLOs</strong> Ship metrics, logs, and traces with the same tags for service, environment, and team. Track simple health views like RED (Requests, Errors, Duration) and USE (Utilization, Saturation). Review alerts often. Don&#8217;t rely on ad hoc log searches or tolerate silent failures that users feel before you do.</p></li><li><p><strong>Team topology fit</strong> Align teams to the slices of the system they build and run. Each team should own its service, its data, and its on-call. Don&#8217;t organize by technical layers that create gaps in ownership and constant cross team coordination.</p></li><li><p><strong>Cost and capacity predictability</strong> Track cost per service and per unit of work, for example per API call or job. Reserve steady capacity when it saves money. Watch egress fees between regions and providers. Don&#8217;t treat list prices as a forecast or hide cross plane and cross service costs inside averages.</p></li></ol><h2>Monolith vs Microservices</h2><h3>Monolith is a better fit when</h3><ul><li><p>Domain boundaries are still evolving and you refactor often.</p></li><li><p>Team is small and benefits from shared context and in process changes.</p></li><li><p>Strong consistency and low latency matter more than independent scaling.</p></li><li><p>You can enforce modular boundaries, internal APIs, and feature flags inside one repo.</p></li><li><p>Clean data ownership is not yet possible across teams.</p></li><li><p>Platform capacity for contracts, tracing, and SRE is limited right now.</p></li></ul><p>Red flag: if one module is the clear hotspot or release coordination is the main blocker, start planning an extraction seam.</p><h3>Microservices are a better fit when</h3><ul><li><p>Domains are stable and ownership is clear per team.</p></li><li><p>You need independent deploys and scaling for different slices, for example batch vs real
time.</p></li><li><p>Workload shapes differ and one area is a proven hotspot for CPU, memory, IOPS, or special hardware like GPUs.</p></li><li><p>You can staff platform basics such as identity, secrets, templates, contracts, tracing, and SLOs.</p></li><li><p>Data can be owned per service without cross writes to the same tables.</p></li><li><p>Cost needs to be tuned per slice, and egress and capacity can be managed per service.</p></li></ul><p>Red flag: if traffic is uniform, transactions are tightly coupled, or boundaries are fuzzy, a modular monolith is simpler and cheaper for now.</p><h2>Migration readiness process</h2><ul><li><p>Map seams and ownership: produce a one page context map with dataset owners and contracts. If ownership is unclear, stay monolith and modularize.</p></li><li><p>Measure delivery pain: track change failure rate, lead time, deploy frequency, MTTR for 4 to 6 weeks. If the main bottleneck is cross team coordination or blast radius, consider extracting a seam.</p></li><li><p>Prove a hotspot: show one area driving a disproportionate share of CPU, memory, IOPS, GPU, or incident minutes. If yes, isolate it.</p></li><li><p>Validate data boundaries: confirm the candidate service can own its writes without cross table edits. If not, keep it in the monolith while you refactor ownership.</p></li><li><p>Test state partitioning: demonstrate sharding or queue based isolation in a sandbox. If you cannot partition state, microservices will add cost without benefit.</p></li><li><p>Check platform readiness: confirm paved paths exist for identity, secrets, logging, tracing, contract tests, SLOs, and release strategies. If missing, invest first.</p></li><li><p>Build a cost model: compare 12 to 36 month TCO for both options including egress, support tiers, and staffing. Choose the cheaper model for the same SLOs.</p></li><li><p>Run a time boxed spike: implement one extraction with blue green or canary, backward compatible contracts, and a strangler pattern. 
Measure the payoff before scaling out.</p></li><li><p>Set exit criteria: define in advance what success looks like, for example 30% latency drop or 50% less coordination for releases and stop if you do not hit it.</p></li></ul><h2>Stage and stake for monolith vs. microservices</h2><p>Maturity is not &#8220;microservices by default.&#8221; It is about conscious, evidence-based choices. Choose the simplest architecture that meets today&#8217;s constraints and can evolve tomorrow. Revisit the choice with data, not fashion.</p><p>What good looks like at any stage:</p><ul><li><p>You can explain why each boundary exists, who owns it, and how it changes.</p></li><li><p>Delivery, reliability, and cost are measured and guide decisions.</p></li><li><p>Rollout, rollback, and restore are practiced.</p></li><li><p>Contracts are versioned and backward compatible during change.</p></li></ul><p>How to communicate maturity to investors:</p><ul><li><p>&#8220;Here is our current shape and why.&#8221;</p></li><li><p>&#8220;Here are the seams we are watching and what would trigger change.&#8221;</p></li><li><p>&#8220;Here is the evidence that the last change paid off.&#8221;</p></li><li><p>&#8220;Here is our rollback and restore plan if it does not.&#8221;</p></li></ul><h2>Glossary</h2><ul><li><p>Bounded context: The smallest coherent domain where a model and language stay consistent.</p></li><li><p>Change Data Capture: a method that reads a database&#8217;s commit log to stream row level inserts, updates, and deletes so other systems stay in sync in near real time.</p></li><li><p>Contract: The explicit interface and schema a consumer relies on.</p></li><li><p>Strangler pattern: Incrementally replacing a legacy capability by routing traffic to a new component at the seam.</p></li><li><p>Bulkhead: A limit or partition that keeps one failure from cascading.</p></li><li><p>Error budget: The agreed allowance of unreliability that guides release risk.</p></li></ul><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/subscribe?"><span>Subscribe now</span></a></p><h2>Your turn</h2><p>Where did your architecture hurt the most under peak or during releases? What change made the biggest difference?</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/p/platform-architecture-assessment?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/p/platform-architecture-assessment?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Tech Due Diligence Playbook! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Cloud Strategy & Deployment Models]]></title><description><![CDATA[Edition 14]]></description><link>https://eitanschuler.substack.com/p/cloud-strategy-and-deployment-models</link><guid isPermaLink="false">https://eitanschuler.substack.com/p/cloud-strategy-and-deployment-models</guid><dc:creator><![CDATA[Eitan Schuler]]></dc:creator><pubDate>Sun, 21 Sep 2025 06:24:18 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/7f64e697-ffcb-4ecf-b262-c5431ceea1d1_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Originally published on <a href="https://www.linkedin.com/pulse/cloud-strategy-deployment-models-eitan-schuler-611qe/">LinkedIn</a> on September 17, 2025</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/subscribe?"><span>Subscribe now</span></a></p><p>At renewal a flagship customer asked three things: keep EU data in region, run inference predictably at peak, and pass their vendor audit without a six-month exception plan. Our answer used to be public cloud only. Procurement read this as not under our control. Then we had to split workloads: core in public cloud, regulated analytics in a private tenancy, and an on-prem edge for latency and data gravity. Same product, different deployment environments. 
Sales cycles shortened because the architecture matched the customer, not our preference.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Tech Due Diligence Playbook! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>In this article I will mention &#8220;planes&#8221;. A plane is a separate deployment environment you operate and secure independently, for example a public cloud account, a private single-tenant VPC, or an on-prem or edge site.</p><h2>Why this matters</h2><p>Investors want predictable unit economics and the ability to win regulated deals. They ask: where do workloads run and why? Can you prove residency, key custody, and audit posture per plane? What is real TCO including egress, GPUs, support, and people? How do you fail over or rebuild if a region or provider is unavailable?</p><h2>Investor expectations: Dos and Don&#8217;ts</h2><h3>Workload placement and rationale</h3><p><strong>Do:</strong> Keep a workload map showing each service, constraints, placement, SLOs, and owner, plus the reason (latency, sovereignty, predictability, GPU). 
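</p><p>A workload map does not need tooling to start. Structured data beats a slide because it can be diffed and checked in CI; a minimal Python sketch, with illustrative service names and fields:</p>

```python
from dataclasses import dataclass

# Hypothetical workload-map entry; the field names are illustrative, not a standard.
@dataclass
class Workload:
    service: str
    owner: str
    plane: str        # e.g. "public-cloud", "private-vpc", "on-prem-edge"
    rationale: str    # latency, sovereignty, predictability, or GPU
    slo_p99_ms: int

WORKLOAD_MAP = [
    Workload("checkout", "payments-team", "public-cloud", "predictability", 300),
    Workload("eu-analytics", "data-team", "private-vpc", "sovereignty", 2000),
]

def unmapped(deployed_services):
    """Flag deployed services that have no documented placement rationale."""
    mapped = {w.service for w in WORKLOAD_MAP}
    return sorted(set(deployed_services) - mapped)

# Anything this returns is a service running with no written reason for where it runs.
print(unmapped(["checkout", "eu-analytics", "ml-inference"]))
```

<p>A check like this in CI turns the map into a guardrail: a service without a placement rationale fails the build instead of surfacing in diligence.</p><p>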
<strong>Don&#8217;t </strong>run public-only or on-prem-only by dogma with no written rationale.</p><h3>Residency, keys, and audit posture</h3><p><strong>Do:</strong> For each plane show where data and keys live, how keys are pinned in region (BYOK/CMK or HSM/KMS), and the artifacts that back it up (rotation logs, DPAs/SCCs, SOC 2/ISO, shared-responsibility matrix). Prove failover keeps data and keys in jurisdiction. <strong>Don&#8217;t </strong>let KMS or observability cross borders during failover or wave at provider badges without mapping them to your controls.</p><h3>Cost, capacity, and predictability</h3><p><strong>Do:</strong> Maintain a TCO model per plane including egress, storage growth, GPU reservations, support tiers, compliance and staffing; call out workload predictability and show budget, showback, alerts. <strong>Don&#8217;t </strong>hide egress behind averages, rely on uncontrolled spot for production, treat list price as forecast, or assume public cloud is always cheaper.</p><h3>Security and access</h3><p><strong>Do:</strong> Use one identity baseline across planes: SSO, least-privilege roles, short-lived credentials, secrets management, and the same logging and patching. <strong>Don&#8217;t </strong>create security snowflakes where private or on-prem is weaker.</p><h3>Continuity across planes</h3><p><strong>Do:</strong> Time restores in a clean room per plane, rehearse failover where applicable, keep DNS TTL short on failover records, and run drift detection. 
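</p><p>The per-plane TCO discipline above can be made concrete in a few lines; a sketch with illustrative cost categories and figures, keeping egress as its own line item rather than buried in averages:</p>

```python
# Hypothetical monthly cost roll-up per plane; all figures are illustrative.
MONTHLY_COSTS = {
    "public-cloud": {"compute": 42_000, "storage": 6_500, "egress": 9_800,
                     "support": 3_000, "staffing": 12_000},
    "private-vpc":  {"compute": 30_000, "storage": 4_000, "egress": 1_200,
                     "support": 5_000, "staffing": 15_000},
}

def tco(plane, months=36):
    """3-year TCO for one plane, plus egress as an explicit share of the total."""
    monthly = MONTHLY_COSTS[plane]
    total = sum(monthly.values()) * months
    egress_share = monthly["egress"] * months / total
    return total, egress_share

total, egress_share = tco("public-cloud")
print(f"3y TCO: ${total:,}  egress share: {egress_share:.1%}")
```

<p>Even a toy model like this forces the conversation investors want: what fraction of spend is data movement, and which plane wins once staffing and support are counted.</p><p>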
<strong>Don&#8217;t </strong>present multi-cloud slides without drills or discover configuration drift at promotion.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/p/cloud-strategy-and-deployment-models?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/p/cloud-strategy-and-deployment-models?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><h2>Implementation guide</h2><p>Start in layers. Seed and early A teams usually need the first three. Add the rest as you grow and as customers or regulators demand it. Remember: while public cloud or private tenancy costs are based on price lists or negotiated deals, calculating on-prem TCO is much more complex. It includes floor space and structural load approvals, power and cooling capacity, fire suppression, physical security and access logs, network redundancy and cross-connects, hardware lifecycle and spares with vendor maintenance SLAs, plus site audits, permits, and insurance.</p><ol><li><p><strong>Write a placement policy:</strong> Simple rules, written down. Regulated PII stays in region under your key custody. Workloads that need guaranteed GPU capacity go to a reserved pool. Batch analytics can move to lower-cost infrastructure if SLOs allow.</p></li><li><p><strong>Model workload predictability:</strong> Keep base load on predictable capacity, burst to public cloud for spikes. If a 3 to 5 year TCO shows steady, high-utilization jobs are cheaper on private tenancy or on-prem and you can operate it, place them there.</p></li><li><p><strong>Watch egress costs like a hawk:</strong> Data transfer often dominates TCO.
Minimize cross-region and cross-provider traffic, keep analytics close to data, use peering or private links for chatty paths, and make egress a standing line in every cost review.</p></li><li><p><strong>Data and keys plan with audit pack:</strong> Classify datasets and their placement. Define encryption and custody with region-pinned KMS or HSM and BYOK/CMK where required. Set rotation policy and keep evidence. Maintain a one-pager per environment with datasets, custody, access roles, rotation cadence and last rotation, and the audit artifacts that apply. Include a short log or screenshot showing region pinning and how failover keeps data and keys in jurisdiction.</p></li><li><p><strong>Connectivity by design:</strong> Use private links or peering for noisy paths, zero-trust access for staff, clear egress boundaries, and a single service catalog so teams do not invent ad hoc tunnels.</p></li><li><p><strong>Unify observability and FinOps:</strong> Put operations and cost in one place. Use a single dashboard that shows health and spend side by side. Tag every metric, log, trace, and cloud bill with the same tags for service and environment, plus team and region. Give each team a budget, set up cost anomaly alerts, and use reserved or committed capacity when it saves money. Define simple cost units per API call or job so Product can make trade-offs. Review cost and SLOs together each month.</p></li><li><p><strong>Capacity strategy:</strong> Choose reservations, committed use, or on-demand per workload, provide queues and fair-share scheduling to avoid starvation. 
Keep a fallback SKU and a tested burst plan.</p></li></ol><h2>Stage and stake</h2><p><strong>Seed and early A:</strong> Public cloud by default is fine; show a simple placement policy, a basic landing zone, cost guardrails, and a plan for regulated asks.</p><p><strong>Series B and growth:</strong> Hybrid readiness for enterprise, private tenancy or VPC peering for regulated customers, a real GPU plan if relevant, a data and key custody note, and continuity tests in the secondary plane.</p><p><strong>Control buyouts:</strong> Buyers sample the workload map against reality, check residency and key custody artifacts, run a restore test, and review cost predictability under growth, pricing in egress exposure and the capex or opex needed to close gaps.</p><h2>Glossary</h2><ul><li><p><strong>Plane:</strong> A separate deployment environment you operate and secure independently.</p></li><li><p><strong>TCO:</strong> Total cost of ownership including cloud, licenses, egress, support, and staffing.</p></li><li><p><strong>Data gravity:</strong> Data size and movement make some placements costly or slow.</p></li><li><p><strong>Edge / On-prem edge:</strong> Compute and storage deployed close to where data is produced or used to cut latency, reduce data transfer, meet residency rules, or run during cloud/network outages.</p></li><li><p><strong>Egress:</strong> Paid data transfer out of a provider or region.</p></li><li><p><strong>BYOK and CMK:</strong> Customer-controlled keys for encryption at rest.</p></li><li><p><strong>Private tenancy:</strong> Isolated resources for a single customer inside a provider.</p></li><li><p><strong>Landing zone:</strong> Standardized account or project setup with guardrails.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/subscribe?"><span>Subscribe now</span></a></p></li></ul><h2>Your turn</h2><p>Which constraint has driven your placement choices: residency, GPUs, latency, or cost predictability? Share the scar and the fix that worked.</p><div><hr></div><p><strong>Next in the Playbook:</strong> Edition 15 will be published in the second half of October, after a short vacation break. It will dive into platform architecture assessment.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/p/cloud-strategy-and-deployment-models?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/p/cloud-strategy-and-deployment-models?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Tech Due Diligence Playbook!
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Business Continuity & Resilience Stress-Test]]></title><description><![CDATA[Edition 13]]></description><link>https://eitanschuler.substack.com/p/business-continuity-and-resilience</link><guid isPermaLink="false">https://eitanschuler.substack.com/p/business-continuity-and-resilience</guid><dc:creator><![CDATA[Eitan Schuler]]></dc:creator><pubDate>Mon, 15 Sep 2025 09:28:30 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/c8d91f22-8134-4e16-bca5-161df0757fc8_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A familiar story... at 3 am, our primary cloud region had issues and our payment provider began rate-limiting our traffic. We reached for the playbook: fail over to the standby environment, restore any missing data from backups, and inform customers. On paper this was covered. In reality the standby was not in sync, replication lagged, the restore ran longer than expected, and two people on the escalation tree had already left the company. We restored service within the hour, but cleaning up data and trust took much longer. The planned investment deal still happened, at a lower price.</p><p>Business continuity is not a binder on a shelf. 
It is the ability to keep operating under stress and to prove that ability with recent, simple evidence.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/subscribe?"><span>Subscribe now</span></a></p><h2>Why this matters</h2><p>Investors do not buy your best-case demo. They buy your worst-day recovery curve. They ask 4 blunt questions:</p><ol><li><p>How fast can you restore critical services when a tier-1 dependency fails?</p></li><li><p>How much data could you lose, measured in seconds or minutes?</p></li><li><p>What do customers experience while you fix things, and can you run in a safe reduced mode instead of going completely dark?</p></li><li><p>Have you tested this recently, with timed restores, rehearsed failovers, and communications you actually used?</p></li></ol><p>Get continuity right and incidents become small, well managed events. Get it wrong and you will see churn, SLA credits, audit findings, and a valuation discount for operational fragility.</p><h2>Investor expectations: Dos and Don&#8217;ts</h2><p><strong>Recovery Time and Recovery Point Objectives (RTO and RPO)</strong></p><p><strong>Do:</strong> Define RTO and RPO per business service, not company-wide. Show recent test results that met those targets, for example a clean-room restore last week within X minutes, with immutable copies and point-in-time recovery tested for the last 30 days. <strong>Don&#8217;t:</strong> Say &#8220;we have backups&#8221; without restore evidence in the last quarter.</p><p><strong>Regional resilience</strong></p><p><strong>Do:</strong> Document and rehearse your posture. Warm-standby or active-active, with DNS, secrets, and configuration promotion practiced without a hero present. 
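</p><p>The &#8220;recent test results&#8221; expectation above can be checked mechanically instead of by memory. A sketch with illustrative services, targets, and dates (<code>today</code> is a parameter so the check stays deterministic here):</p>

```python
from datetime import date

# Hypothetical evidence log: per-service RTO target vs. the last timed clean-room restore.
RESTORE_TESTS = {
    "checkout": {"rto_min": 30, "last_restore_min": 22, "tested_on": date(2025, 9, 1)},
    "billing":  {"rto_min": 60, "last_restore_min": 75, "tested_on": date(2025, 3, 10)},
}

def stale_or_failing(today=date(2025, 9, 14), max_age_days=90):
    """Services whose restore evidence is older than a quarter or missed its RTO."""
    findings = []
    for svc, t in RESTORE_TESTS.items():
        if (today - t["tested_on"]).days > max_age_days:
            findings.append((svc, f"evidence older than {max_age_days} days"))
        elif t["last_restore_min"] > t["rto_min"]:
            findings.append((svc, "last restore exceeded RTO"))
    return findings

print(stale_or_failing())
```

<p>Run on a schedule, a check like this is the difference between &#8220;we have backups&#8221; and restore evidence you can hand an investor.</p><p>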
<strong>Don&#8217;t:</strong> Call read replicas &#8220;multi-region&#8221; if promotion is manual and unpracticed. Add one-line notes on common pitfalls like long DNS TTLs and config drift.</p><p><strong>Graceful degradation</strong></p><p><strong>Do:</strong> Use load shedding and safe-mode behaviors with customer messaging you have actually used. Queue writes and reconcile them idempotently. Disable non-critical features. Serve cached content or switch to read-only with a banner. <strong>Don&#8217;t:</strong> Operate with no degradation plan where the only switch is off.</p><p><strong>Incident management</strong></p><p><strong>Do:</strong> Run a clear Incident Commander model with named roles, an up-to-date escalation tree, and pre-approved communications templates. Status page and partner updates are on the clock. Add a communications SLO like time-to-first-update within 15 to 30 minutes for Severity-1. <strong>Don&#8217;t:</strong> Invent incident communications on the day or let status updates lag reality.</p><p><strong>Vendor and identity dependencies</strong></p><p><strong>Do:</strong> Maintain a critical vendor register that includes BC and DR posture, regions, sub-processors, and your fallback plans. Include identity and on-call tooling in that register, for example your IdP and paging provider, and drill the case where the IdP is down. Keep data residency in view during failover. Keys and copies stay in jurisdiction. <strong>Don&#8217;t:</strong> Leave vendor and identity dependencies undocumented or lack a plan if an upstream API throttles.</p><h2>Implementation guide</h2><p><strong>Service level continuity maps:</strong> List the top business services like checkout, authentication, billing, ingest, and model inference. For each, tie RTO and RPO to owners, dependencies, and runbooks.
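</p><p>A continuity map can live as data next to the runbooks it references. A minimal sketch (services, owners, and runbook IDs are illustrative) that answers &#8220;who is hit, and which runbook do we open?&#8221; when one dependency fails:</p>

```python
# Hypothetical continuity map; entries are illustrative.
CONTINUITY_MAP = {
    "checkout": {"owner": "payments", "rto_min": 30, "rpo_min": 5,
                 "depends_on": ["postgres-eu", "payment-gateway"], "runbook": "RB-101"},
    "ingest":   {"owner": "platform", "rto_min": 120, "rpo_min": 15,
                 "depends_on": ["kafka", "postgres-eu"], "runbook": "RB-204"},
}

def blast_radius(dependency):
    """Which services are hit when one dependency fails, and which runbook to open."""
    return [(svc, entry["owner"], entry["runbook"])
            for svc, entry in CONTINUITY_MAP.items()
            if dependency in entry["depends_on"]]

print(blast_radius("postgres-eu"))
```

<p>During an incident, this is the lookup the Incident Commander needs in the first minute, not the first hour.</p><p>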
Keep a single diagram that shows regional layout, replication mode, and kill switches.</p><p><strong>Backup discipline:</strong> Use immutable, versioned backups with a clear retention policy. Test restores on a schedule in a clean-room environment. Track mean time to full restore as a metric and scan restored images before cutover to avoid reinfection after a ransomware event.</p><p><strong>Regional design:</strong> Pick a posture you can operate.</p><ul><li><p>Warm-standby: asynchronous replication, pre-provisioned infrastructure, promote on fail.</p></li><li><p>Active-active: synchronous or conflict-tolerant writes or per-region shards, with global routing.</p></li></ul><p>Other viable postures include pilot light, cold backup and restore, cell-based isolation, and edge fallback. Choose per service. Automate DNS and traffic switches, secrets, and configuration. Practice partial failovers by service before full failovers by region. Keep RTO and RPO aligned with what your customers pay for.</p><p><strong>Runbooks, not folklore:</strong> For each failure class like region loss, vendor outage, database corruption, or ransomware, keep step by step runbooks with commands, owners, rollbacks, and communications. Store docs as code with freshness dates.</p><p><strong>Incident command that scales:</strong> Use a minimal ICS: Incident Commander, Ops Lead, Comms Lead, and Scribe. Pre-assign rotations. Use a standard bridge or channel header with incident ID, severity, goals, and next update time. Convert findings to tickets and close loops.</p><p><strong>Controlled chaos:</strong> Run tabletops monthly, game days quarterly, and targeted chaos with blast radius controls. Classic inspiration is <a href="https://netflixtechblog.com/the-netflix-simian-army-16e57fbab116">Netflix&#8217;s Simian Army</a> including the famous Chaos Monkey, which killed instances in production to validate resilience.
Most teams do safer, scoped drills that simulate loss of a node, zone, or dependency. Publish the learning and the concrete changes you shipped.</p><p><strong>Vendor continuity is your continuity:</strong> Keep a vendor register with SLA, RTO, RPO, regions, status pages, and fallbacks. Define brownout behavior. For example, many checkouts call an external fraud or risk engine. If risk scoring is down, allow low risk transactions with flagging. Test identity and paging tools as part of the drill.</p><h2>Stage and stake</h2><p><strong>Seed and early A:</strong> one region is acceptable if you can prove backup to restore within a realistic RTO and you have basic safe modes. One tabletop per quarter and one clean-room restore per month beat an unused multi-region diagram.</p><p><strong>Series B and growth:</strong> define RTO and RPO per service. Use warm-standby or active-active for tier-1 paths. Run quarterly regional failover drills. Keep immutable backups and exercised incident communications with time-to-first-update SLOs.</p><p><strong>Control buyouts:</strong> buyers will ask for artifacts: the last two restore runbooks with timestamps, the last regional failover game day, partner notification logs, and evidence that residency and key management hold under failover with no cross-border KMS hops. 
They price downtime, SLA exposure, and remediation capex, and they discount uncertainty.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/subscribe?"><span>Subscribe now</span></a></p><h2>Glossary</h2><ul><li><p>RTO and RPO: Recovery Time and Recovery Point Objectives.</p></li><li><p>Immutable or WORM backups: Write once, tamper resistant copies.</p></li><li><p>Clean-room restore: Recovery into an isolated account or project that proves backups stand alone.</p></li><li><p>Load shedding: Controlled degradation that drops lower-value work to keep critical services healthy.</p></li><li><p>Incident Command System (ICS): Lightweight roles and rituals for leading incidents at speed.</p></li><li><p>Tabletop / Game day / Chaos: Discussion exercise, hands-on drill, and fault injection in a controlled blast radius.</p></li></ul><h2>Your turn</h2><p>What broke first in your last real test? The tech, the runbook, or the communications? Share the scar. It helps the next team.</p><p><strong>Next in the Playbook:</strong> Edition 14 - Cloud Strategy &amp; Deployment Models. Public cloud, private cloud or on-prem? 
What makes sense when?</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/p/business-continuity-and-resilience?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/p/business-continuity-and-resilience?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p>]]></content:encoded></item><item><title><![CDATA[API Strategy & Integration Readiness]]></title><description><![CDATA[Edition 12]]></description><link>https://eitanschuler.substack.com/p/api-strategy-and-integration-readiness</link><guid isPermaLink="false">https://eitanschuler.substack.com/p/api-strategy-and-integration-readiness</guid><dc:creator><![CDATA[Eitan Schuler]]></dc:creator><pubDate>Mon, 08 Sep 2025 06:32:25 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/e0a461a9-5243-4094-8bd3-cd1852604efb_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Originally published on <a href="https://www.linkedin.com/pulse/api-strategy-integration-readiness-eitan-schuler-is0oe">LinkedIn</a> on September 3, 2025</p><p>An old story. At 10:00 our partner went live consuming our API. The day after, their calls were failing. Not because their code changed, but because ours had. We renamed an enum, turned a 200 into a 204, and &#8220;temporarily&#8221; disabled idempotency in a tidy-up branch. The spec still said v1.6; production was v1.6-and-a-half. Worse, our outbound webhooks to their system had no replay, so paid orders never reached them. What a mess. We patched it in a day, but it took weeks to recover trust.</p><p>APIs are no longer plumbing. They are distribution, revenue, and reputation.
&#8220;Integration readiness&#8221; is the difference between shipping a contract and losing customers to fragile integration points.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Tech Due Diligence Playbook! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>This is a deeply technical episode. If you are a software engineer, CTO, a founder or a tech-savvy investor, this episode is for you.</p><p>Do investors really care about something this technical? <strong>Yes. </strong>API strategy sits at the core of a SaaS business. It determines how fast customers integrate, how resilient partnerships are, and how predictable revenue becomes. Can investors go this deep in diligence? Often, yes. When they don&#8217;t have that capability in-house, they hire a technical due-diligence specialist (like me). 
The specialist examines contracts and versioning, security and auth, webhook hygiene, and operability, then translates those findings into commercial risk.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/subscribe?"><span>Subscribe now</span></a></p><h2>Why this matters</h2><p>Investors don&#8217;t just buy features; they buy <strong>predictable integrations</strong> with customers, partners, and your own ecosystem. In practice that means:</p><ul><li><p><strong>Stable contracts</strong> that evolve without breaking consumers.</p></li><li><p><strong>Clear ownership</strong> and a lifecycle for versioning, deprecation, and support.</p></li><li><p><strong>Operational discipline:</strong> idempotency, retries, pagination, and back-pressure that keep systems healthy under load.</p></li><li><p><strong>Security and data hygiene:</strong> scoped auth, least privilege principle, and explicit handling of PII across boundaries.</p></li><li><p><strong>Proof, not promises</strong> in form of specs, changelogs, dashboards, and test harnesses that anyone can verify.</p></li></ul><p>Get APIs right and sales cycles shorten, and partnerships scale. Get them wrong and every new integration adds extra fragility.</p><h2>What investors look for (beyond API endpoints)</h2><p>Treat the API as a product, not plumbing. Each API domain has a single accountable owner, a roadmap, and a public contract (OpenAPI or JSON Schema) that <strong>matches what runs in production</strong>. Breaking changes are deliberate and rare, not incidental. That product mindset is backed by lifecycle discipline: a written versioning policy, a deprecation playbook with timelines, and a changelog with clear migration notes. 
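</p><p>One way to make that deprecation playbook machine-readable is to advertise the window in the responses themselves; a sketch using the Sunset header from RFC 8594 alongside a Deprecation header (the date and migration URL are illustrative):</p>

```python
# Sketch: attach deprecation metadata to responses from a retiring v1 endpoint.
# The Sunset header is standardized in RFC 8594; values below are illustrative.
def with_deprecation_headers(headers: dict) -> dict:
    headers = dict(headers)  # copy; do not mutate the caller's dict
    headers["Deprecation"] = "true"
    headers["Sunset"] = "Sat, 28 Feb 2026 23:59:59 GMT"  # advertised removal date
    headers["Link"] = '<https://example.com/docs/migrate-v2>; rel="deprecation"'
    return headers

print(with_deprecation_headers({"Content-Type": "application/json"}))
```

<p>Clients and API gateways can alert on these headers automatically, which is exactly the kind of controlled channel diligence teams look for.</p><p>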
Consumers hear about changes through a channel you control, not because they discover them on a status page after the fact.</p><p>Quality starts at the contract. Specs are complete and unambiguous: types and enums are defined, error models are consistent, rate limits and pagination style are documented, and event/webhook semantics are explicit. Backward compatibility is the default. Make additive changes first, and remove only with a well-advertised deprecation window. Safety is built in: writes are idempotent, clients retry with jitter, servers enforce timeouts and circuit breakers, and SLAs/SLOs set expectations. Where you emit events, webhooks are signed, retried with exponential backoff, and easy to replay when receivers need to catch up.</p><p>Authentication reflects real risk. Use OAuth 2.0 / OIDC where appropriate, issue short-lived tokens with fine-grained scopes, and carry tenant isolation in token claims so enforcement is simple and auditable. Keys are revocable without downtime and never hard-coded. Operability is per consumer: dashboards can answer &#8220;What did Partner X call? What failed? Who is saturating limits?&#8221; while tracking p95/p99 latency, error rates, saturation, and webhook success for each partner.</p><p>Finally, integrations run as a program, not ad-hoc hand-holding. Provide a sandbox with seed data, a certification checklist, a test harness, and minimal SDKs and examples (surfaced through a partner portal with docs and keys) so your best engineers are not the permanent support desk. And be honest about upstream dependencies: for third-party APIs you rely on, implement health checks, fallbacks, and back-pressure, define how you degrade gracefully, and make those dependencies explicit in your contracts.</p><h2>Stage and stake - how the lens sharpens</h2><ul><li><p><strong>Seed / early A:</strong> A single API surface with a clean spec, consistent error model, idempotency on writes, and a basic sandbox is enough. 
You should still have a written versioning stance, even if it only says &#8220;we prefer additive change, no breaking changes this year&#8221;.</p></li><li><p><strong>Series B / Growth:</strong> Expect a real lifecycle: version policy, deprecation windows, changelog, partner notifications, and a certification checklist. Observability must break down by consumer, and the integration team should operate a repeatable process, not heroics.</p></li><li><p><strong>Control buy-outs:</strong> Buyers sample specs vs. reality, run the sandbox, and review partner escalations. They expect per-domain ownership, partner SLAs, webhook hygiene, and resilience to upstream API failures. Integration revenue and support cost are modeled, not guessed.</p></li></ul><h2>Patterns that work (and why)</h2><p><strong>Specs as the source of truth:</strong> Publish OpenAPI/JSON Schema and treat it like code. Lint it in CI, generate examples and SDKs, and fail pipelines that drift from the spec. Consumers integrate against a contract that won&#8217;t surprise them.</p><p><strong>Versioning that respects consumers:</strong> Default to backward-compatible, additive changes. When you must break, prefer new fields over new meanings, ship vNext alongside vCurrent, and document migrations with dates. Use deprecation headers and a public schedule, not just a blog post.</p><p><strong>Idempotency everywhere writes happen</strong>: Accept an Idempotency-Key on POST/PUT/PATCH for operations that create or change resources. On retries you return the original result. This keeps integrity under client retries, timeouts, and network flakiness.</p><p><strong>Cursor-based pagination and consistent error shapes:</strong> Use cursors over page numbers for reliability at scale. Standardize error envelopes with machine-readable codes, correlation IDs, and actionable messages. 
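</p><p>A consistent envelope can be small; a sketch of one common shape (the field names are a convention, not a standard):</p>

```python
import uuid

def error_envelope(code, message, details=None):
    """One error shape everywhere: stable code, actionable message, correlation ID."""
    return {
        "error": {
            "code": code,                        # machine-readable, documented, stable
            "message": message,                  # actionable for a human integrator
            "details": details or [],
            "correlation_id": str(uuid.uuid4())  # echoed in logs so support can find the request
        }
    }

print(error_envelope("rate_limited", "Quota exceeded; retry after the window in Retry-After."))
```

<p>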
Your support team and partners will thank you.</p><p><strong>Events over polling, with webhook hygiene:</strong> Prefer webhooks for changes, signed with a rotating secret, delivered with exponential backoff and a dead-letter policy. Provide a replay endpoint. Document exactly when events fire and how you guarantee order or de-duplication.</p><p><strong>Auth that limits blast radius:</strong> Scopes map to business capabilities, tokens are short-lived, and least-privilege is the default. Tenant IDs sit in claims so enforcement is simple and auditable. Key rotation is tested, not theoretical.</p><p><strong>Observability by consumer:</strong> Every request carries a consumer ID. Dashboards show latency, error rates, quotas, and webhook success per partner. Ideally, alerting routes to an integration on-call with runbooks that include partner contacts and rollback steps.</p><p><strong>A real sandbox and test harness:</strong> Provide stable test accounts, seeded data, and a deterministic way to trigger failure modes. Ship Postman collections or a CLI, plus minimal SDKs that mirror your spec. Add a certification checklist and a stamp when partners pass.</p><p><strong>Resilience to upstream APIs:</strong> Wrap outbound calls with timeouts, retries, circuit breakers, and back-pressure. Document partial-degradation behavior: what the user sees if payments or risk scoring is down. Your API should fail gracefully.</p><p><strong>Change management as muscle memory:</strong> Changes ship behind flags and can be rolled back quickly. Every release updates the spec, the changelog, and (if relevant) deprecation notices. 
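The webhook-signing and retry-with-jitter hygiene above can be sketched briefly; the secret value and payload are illustrative, and a real sender would persist delivery state and a dead-letter queue:

```python
import hashlib
import hmac
import random

def sign(secret: bytes, body: bytes) -> str:
    """Sender attaches an HMAC-SHA256 signature so receivers can verify origin and integrity."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()

def verify(secret: bytes, body: bytes, signature: str) -> bool:
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(sign(secret, body), signature)

def backoff_schedule(attempts: int, base: float = 1.0, cap: float = 60.0):
    """Full-jitter exponential backoff: delays are randomized so retries spread out."""
    return [random.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]

secret, body = b"whsec_demo", b'{"event":"order.paid"}'
assert verify(secret, body, sign(secret, body))
assert not verify(secret, b'{"event":"tampered"}', sign(secret, body))
```

Rotating the secret means accepting signatures from both the current and previous secret during a cutover window, then retiring the old one.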
Consumers get heads-up via a channel you control.</p><h2>Red flags that lengthen negotiations</h2><ul><li><p>Specs exist but do not match runtime responses, or there is no single source of truth.</p></li><li><p>Breaking changes land without notice, enums repurposed, fields removed, error shapes altered.</p></li><li><p>No idempotency on writes, ambiguous pagination, or webhook retries without signing.</p></li><li><p>OAuth scopes are &#8220;admin or nothing,&#8221; long-lived tokens, or secrets checked into repos.</p></li><li><p>No sandbox with seed data. Integration relies on production toggles and engineer hand-holding.</p></li><li><p>No visibility by consumer. You cannot say what Partner X called during an incident.</p></li><li><p>Tight coupling to a single upstream API with no fallbacks or partial-degradation plan.</p></li></ul><p>One or two of these can be fixed pre-close. Three or more usually trigger price protection, escrowed deliverables, or a pause.</p><h2>Habits worth adopting before the next round</h2><ul><li><p><strong>Adopt an API style guide</strong> and lint it in CI so every team ships consistent paths, verbs, errors, and pagination.</p></li><li><p><strong>Make OpenAPI the contract of record.</strong> Generate docs and SDKs from it, not the other way around.</p></li><li><p><strong>Set up a partner portal</strong> with keys, sandbox access, docs, changelog, and a webhook replay tool.</p></li><li><p><strong>Run consumer-driven contract tests</strong> so changes that break a key partner are caught before deploy.</p></li><li><p><strong>Track integration health as a KPI:</strong> time-to-first-call, certification pass rate, partner-caused and provider-caused incident minutes.</p></li><li><p><strong>Practice deprecation</strong> with a small, safe removal so the muscle exists before a big one.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe 
now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/subscribe?"><span>Subscribe now</span></a></p></li></ul><h2>Mini-Glossary</h2><ul><li><p><strong>OpenAPI:</strong> A machine-readable API contract format used to drive docs, SDKs, and tests.</p></li><li><p><strong>Idempotency:</strong> A property of write operations where retries return the same result instead of duplicating work.</p></li><li><p><strong>Backward compatibility:</strong> Changes that do not break existing consumers, typically additive.</p></li><li><p><strong>Back pressure</strong>: controlled slowdown that keeps integrations healthy. Explicit signals, safe retries, and graceful degradation instead of outages.</p></li><li><p><strong>Consumer-driven contract testing:</strong> Tests authored from the client&#8217;s expectations that providers must satisfy before deploy.</p></li><li><p><strong>Jitter: </strong>randomizes backoff intervals so clients retry out of sync, reducing load spikes and speeding recovery.</p></li><li><p><strong>Webhook signing:</strong> Attaching a verifiable signature to events so receivers can trust origin and integrity.</p></li><li><p><strong>SLA / SLO (for APIs):</strong> The commercial promise and the internal objective for availability, latency, or error budgets.</p></li><li><p><strong>Circuit breaker:</strong> A pattern that stops calling a failing dependency to allow recovery and protect your system.</p></li><li><p><strong>Cursor pagination:</strong> Pagination based on opaque cursors for stable iteration under change.</p></li><li><p><strong>OAuth 2.0 / OIDC:</strong> Standards for delegated authorization and identity that enable scoped, short-lived access.</p></li></ul><h2>Your turn</h2><p>Which integration risk has bitten you? A stealth breaking change, webhook chaos, or an upstream outage without a safety net? 
Share the scar, it helps the next team.</p><p><strong>Next in the Playbook</strong>: Edition 13 will be about Business-Continuity &amp; Resilience Stress-Test. We&#8217;ll turn resilience from policy into proof. Tabletop-to-chaos drills that validate RTO/RPO, failover and backup restores, upstream/API loss, regional outages, and key-people unavailability, producing artifacts investors can verify in an hour. Stay tuned!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/subscribe?"><span>Subscribe now</span></a></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/p/api-strategy-and-integration-readiness?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/p/api-strategy-and-integration-readiness?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Tech Due Diligence Playbook! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[The People Lens]]></title><description><![CDATA[Edition 11]]></description><link>https://eitanschuler.substack.com/p/the-people-lens</link><guid isPermaLink="false">https://eitanschuler.substack.com/p/the-people-lens</guid><dc:creator><![CDATA[Eitan Schuler]]></dc:creator><pubDate>Mon, 01 Sep 2025 07:08:24 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/741c18ab-7465-43ab-b55c-028ae43924ae_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/subscribe?"><span>Subscribe now</span></a></p><p>Three weeks before close, a buyer asked a simple question: &#8220;If your principal engineer is out for a month, what slows down?&#8221; The CTO shrugged and said: &#8220;only he knows the data plane and the deployment system end-to-end.&#8221; Two days later, the investor requested a different model: not burn-down charts, rather bench depth. The code looked fine, the organization did not. The deal still closed, but with a holdback tied to succession, on-call health, and predictable delivery.</p><p>People risk is not soft. It shows up as missed roadmaps, fragile releases, and incident drag. 
This edition is about how diligence converts org charts and practices into execution forecasts.</p><h2>Why this matters</h2><p>Technology is a snapshot, but teams are the future. Investors don&#8217;t buy the last release, but rather the next 8 quarters of predictable shipping under stress. The fastest way to de-risk that future is to look through the following angles and answer related questions:</p><ul><li><p><strong>Leadership &amp; succession:</strong> Can the company survive vacations, attrition, or a sudden scale-up without changing its risk profile?</p></li><li><p><strong>Org design &amp; interfaces:</strong> Does the structure mirror the architecture and product boundaries with clear ownership and healthy surfaces to Product, Security, and QA? Where the company develops multiple products, the system should be shaped so each team truly owns a small, coherent slice, shipping without constant cross-team handoffs.</p></li><li><p><strong>Execution health:</strong> Are delivery metrics reliable and used for improvement, not theater?</p></li><li><p><strong>Culture behind the numbers:</strong> Do incidents create learning or blame? Are people burning out? Is knowledge durable and well organized?</p></li></ul><p>Get these right and velocity compounds. Get them wrong and the cleanest codebase develops a limp.</p><h2>What investors look for &#8212; beyond resumes</h2><p><strong>Leadership that operates, not just presents: </strong>Weekly decisions documented, architectural guardrails enforced, leaders who can explain trade-offs without deflecting. A #2 is named for each critical area, and explicit succession maps exist. If the CTO is still the de facto SRE lead, the risk goes up.</p><p><strong>Org topology that mirrors the system and products: </strong>Service owners exist for every revenue-critical domain, with a platform team providing &#8220;paved roads&#8221; (CI/CD, observability, golden paths). 
Healthy IC/manager ratios and spans of control (5&#8211;8 directs per manager is typical; 12+ hints at neglect; 3 or fewer signals micromanagement). If the company works with outcome-aligned product teams, boundaries should match that reality: &#8220;team = API boundary&#8221; where sensible, 1&#8211;3 services per team, and SLOs the team owns.</p><p><strong>Execution that is measured and humane: </strong>Cycle time, deployment frequency, change failure rate, and MTTR matter, but only if coupled with on-call load and post-incident learning. A team shipping daily but paging people at 2 a.m. three nights a week is trading today&#8217;s velocity for tomorrow&#8217;s attrition. Investors might sample a quarter&#8217;s worth of incident reviews and look for the words &#8220;we missed,&#8221; &#8220;we changed,&#8221; not &#8220;root cause: human error.&#8221;</p><p><strong>Talent engine, not heroics: </strong>A written Skill Level Reference Framework (SLRF), structured hiring loops, time-to-fill and time-to-productivity metrics in range, and an intentional contractor/agency mix. Upskilling ahead of the curve in areas like AI guardrails and core security shows the organization learns faster than the threat landscape evolves.</p><h2>Outcome-first teams</h2><p><strong>Outcome teams as the default, not a dogma.</strong> There isn&#8217;t a single &#8220;correct&#8221; org model, and early on a one-team mission can be the smartest move. That said, the pattern we most often see succeed at scale in software startups is empowered, outcome-aligned product teams: a PM&#8211;Design&#8211;Tech triad that owns a customer outcome and the small system slice needed to ship independently. 
Treat that as the default starting point in growth phases, then deviate deliberately when the context demands it.</p><p><strong>Outcome first.</strong> Make the above-mentioned <strong>product team</strong> your atomic unit: a cross-functional squad that owns a customer outcome end-to-end and is accountable for impact, not a feature backlog.</p><p><strong>System follows.</strong> Shape the architecture so that outcome is actually ownable: clear service/API boundaries, explicit data contracts, and runbooks so the team can ship independently. Aim for a small, coherent slice (often 1&#8211;3 services) with SLOs the team truly owns.</p><p><strong>Platform as leverage.</strong> Keep a small platform group to build paved roads: CI/CD, observability and an Internal Developer Platform. Their job is to remove friction and cognitive load so product teams self-serve 90% of the time and approvals remain the exception.</p><p><strong>Guilds for influence. </strong>Use lightweight chapters/guilds (design, data, security) to set standards and coach across teams, so autonomy doesn&#8217;t decay into inconsistency.</p><p><strong>The diligence question. </strong>Choose one customer journey and ask: How many teams must say &#8220;yes&#8221; for a small change to ship? If the answer is regularly more than two, you have an org/architecture mismatch. 
Fix it by moving ownership to the outcome team or reshaping boundaries until the system fits the team.</p><h2>Stage and stake: how the lens sharpens</h2><ul><li><p><strong>Seed/early A.</strong> Investors accept potential key-person risk if the team shows clarity: explicit owners, a realistic hiring roadmap, basic on-call rotation, and evidence of learning.</p></li><li><p><strong>Series B/Growth.</strong> Expect real bench depth in platform and product, target IC/manager ratios, measurable delivery and incident health, and a working hiring machine that can add teams without collapsing the culture.</p></li><li><p><strong>Control buy-outs.</strong> Buyers probe succession and knowledge transfer hard, might sample on-call calendars, review a quarter of incidents, and test whether the org can absorb a 2&#215; customer load without a 3&#215; increase in headcount.</p></li></ul><h2>Some patterns that work (and why)</h2><p><strong>Owner&#8211;operator maps with succession: </strong>a list of the top 8-12 domains (payments, auth, data platform, release train, AI pipeline, etc.). For each: owner, deputy, and &#8220;two things that break if both are out&#8221;. Make the deputy a real shadow participating in rotation for design reviews and incident calls. This collapses key-person risk before it is priced into the deal.</p><p><strong>Team topology mirrors architecture: </strong>small, cross-functional product teams own customer-facing domains and a platform group owns the paved roads (CI/CD, observability, developer portal, etc.). Security champions are embedded in teams. Clear service ownership means dependencies are known, incident paths are short. Where teams are outcome-aligned, keep &#8220;team = API boundary&#8221; a guiding rule and cap ownership at 1&#8211;3 services to manage cognitive load.</p><p><strong>On-call health with SLOs and error budgets: </strong>rotation coverage is fair, out-of-hours pages are rare, and there&#8217;s comp time/money for hard nights. 
Service SLOs feed error budgets; when they&#8217;re blown, pause feature work and focus on reliability until you&#8217;re back within budget.</p><p><strong>Talent scorecards and predictable hiring: </strong>a skills matrix tied to the roadmap (what we need, by when), structured interviews with rubrics, and a standard onboarding path. Track time-to-productivity. Contractor reliance should be documented and tapered with knowledge transfer milestones.</p><p><strong>Decision hygiene via lightweight documentation: </strong>ADRs capture why a choice was made, what alternatives were rejected, and the rollback plan. Postmortems are blameless and produce two classes of action: quick wins and systemic fixes. Decisions survive staff turnover.</p><h2>Red flags that lengthen negotiations</h2><ul><li><p>A <strong>bus factor of one</strong> on anything revenue-critical is a major risk. For example, if &#8220;only Priya understands the data pipeline&#8221;, the business is exposed.</p></li><li><p>There is a <strong>manager/IC imbalance</strong> when managers carry 12&#8211;15 direct reports with no time to coach, or conversely spend 80% of the week coding instead of leading.</p></li><li><p>A <strong>hero culture</strong> emerges when after-hours paging is chronic, there are no compensating policies, and &#8220;we pulled an all-nighter&#8221; is celebrated in Slack.</p></li><li><p><strong>Metrics theater</strong> is evident when DORA charts exist but no decisions come from them and incidents routinely conclude with &#8220;human error&#8221; instead of systemic fixes.</p></li><li><p><strong>Silo friction</strong> shows up when Security acts as a last-minute gatekeeper, QA remains isolated, and the Platform team is treated merely as a ticket queue.</p></li><li><p><strong>Outsourcing as a crutch</strong> occurs when agencies run core operations without knowledge transfer and a single contractor effectively &#8220;owns&#8221; release automation.</p></li></ul><p>One or two of these can 
be mitigated post-close, but 3 or more usually trigger a price adjustment, earn-out conditions, or a pause.</p><h2>Habits worth adopting before the next round</h2><ul><li><p>Maintain a People-Risk Register: key domains &#215; owner &#215; deputy &#215; risk notes &#215; mitigation due date. Review monthly.</p></li><li><p>Publish an Engineering Operating Manual: how we plan, ship, staff, escalate, and learn. Make it part of the onboarding material.</p></li><li><p>Track delivery and well-being together: cycle time, deployment frequency, change failure, MTTR and weekly pager minutes per person. Act on both.</p></li><li><p>Institutionalize learning: postmortems with owners, deadlines, and follow-up audits, and quarterly &#8220;delete a process&#8221; reviews to keep governance lean.</p></li><li><p>Make onboarding a product: day-1 dev environment, golden paths, shadow rotations. Measure metrics like time-to-first-MR and time-to-on-call.</p></li></ul><div><hr></div><h2>Mini-Glossary</h2><ul><li><p><strong>IC:</strong> Individual Contributor (non-manager engineer).</p></li><li><p><strong>Span of control:</strong> Number of direct reports per manager; extremes signal risk.</p></li><li><p><strong>Bus factor:</strong> How many people can leave before a function stops; 1 is dangerous.</p></li><li><p><strong>ADRs:</strong> Architectural Decision Records: lightweight docs of key technical choices.</p></li><li><p><strong>SLO / Error budget:</strong> Service Level Objective and the allowed &#8220;unreliability&#8221; that guides when to pause feature work.</p></li><li><p><strong>DORA (DevOps) metrics:</strong> Deployment frequency, lead/cycle time, change failure rate, MTTR (not the EU regulation).</p></li><li><p><strong>Bench Depth Model:</strong> A talent management approach that measures how many capable successors are available for critical roles, assessing both readiness (ready now, soon, later) and coverage depth to ensure organizational resilience.</p></li></ul><div 
class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Tech Due Diligence Playbook! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/subscribe?"><span>Subscribe now</span></a></p><h2><strong>Your turn</strong> </h2><p>What People-Lens surprise have you hit in a deal? A single-point-of-failure staff engineer, pager fatigue, or metrics theater? Share the scar; it helps the next team.</p><div><hr></div><p><strong>Next in the Playbook: </strong>I&#8217;ll write about API Strategy &amp; Integration Readiness in Edition 12. 
Stay tuned!</p>]]></content:encoded></item><item><title><![CDATA[Data Governance &amp; Sovereign-Data Readiness ]]></title><description><![CDATA[Edition 10]]></description><link>https://eitanschuler.substack.com/p/data-governance-and-sovereign-data</link><guid isPermaLink="false">https://eitanschuler.substack.com/p/data-governance-and-sovereign-data</guid><dc:creator><![CDATA[Eitan Schuler]]></dc:creator><pubDate>Mon, 25 Aug 2025 05:27:18 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/8e0c89f2-003f-4a4a-b717-981598dea9c1_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/subscribe?"><span>Subscribe now</span></a></p><p>Originally published on <a href="https://www.linkedin.com/pulse/data-governance-sovereign-data-readiness-eitan-schuler-5arle">LinkedIn</a> on August 20, 2025</p><p>A growth-stage SaaS had spotless product metrics and a data map drawn on a whiteboard. During diligence, a different picture emerged. Nightly backups hopped regions, US-hosted log analytics ingested EU personal data, and a support run-book paged an on-call engineer in a different legal jurisdiction. Nothing was malicious, just a bit ad-hoc. The buyer didn&#8217;t question the product; they only asked whether the company could <strong>prove</strong> where customer data lived, who could touch it, and how fast it could be deleted. In 2025, that proof moves valuation.</p><h2>Why this matters</h2><p><strong>Regulators, customers, and acquirers</strong> are converging on the same questions: Where is the data, who can access it, and what exits the promised boundary? 
&#8220;Sovereign-data&#8221; readiness is the ability to keep certain data in-region, under your control, administered by in-region personnel, and to demonstrate it with artifacts, not promises. You don&#8217;t need to be a regulated bank to face these questions; a single enterprise customer or a cross-border acquisition can put them on page one of the checklist.</p><p>Good governance compounds. Clear ownership, lineage, and deletion pipelines lower breach risk and cloud cost. Sloppy governance leaks margin (storage creep, zombie copies), slows sales, and drags diligence.</p><h2>What investors look for</h2><p>Investors start with a real <strong>data map</strong>, not a diagram. That means a system-of-record listing of datasets, fields, sensitivity, lawful basis, retention, region, key owner, and the processing activities that touch each set (ETL jobs, training pipelines, analytics queries). A clean map should tie back to lineage: where the data came from, where it flows, and where it rests (including backups and logs).</p><p>Next comes <strong>access governance</strong>. Investors want to see least-privilege enforced in code (IAM policies, role-based access, JIT elevation), quarterly access reviews with sign-off, and auditable logs. &#8220;We trust the team&#8221; doesn&#8217;t survive diligence; &#8220;we trust our policy, and here are the reviews&#8221; does.</p><p>Then <strong>encryption &amp; key control</strong>: encryption at rest and in transit is table stakes. What moves the needle is <strong>customer-managed or tenant-scoped keys</strong> (BYOK/CMK), clear key-rotation policy, and proof that keys are region-pinned (no KMS hop across borders).</p><p>Finally, <strong>data-residency architecture</strong>. Can you keep EU data in the EU including backups, telemetry, and third-party tools? It&#8217;s common to localize the database but leak residency through US-hosted error tracking, CDN logs, or a global SIEM. 
Investors might ask where <strong>every</strong> copy lands.</p><h2>Stage and stake: how the lens sharpens</h2><p>At <strong>Seed/early A</strong>, a crisp written policy, a living data inventory, and a basic deletion pipeline along with a roadmap to regionalization are typically enough. By <strong>Series B</strong>, diligence expects enforcement evidence: access reviews, key-rotation logs, automated retention jobs, and vendor DPAs for every sub-processor. In <strong>majority buy-outs</strong>, the bar rises to <strong>sovereign-by-design</strong>: per-region shards or accounts, region-locked keys, in-region logging and monitoring, and contractual controls (e.g.: EU-only support, named sub-processors).</p><h2>Patterns that work</h2><p><strong>Regional sharding with consistent abstractions: </strong>Run one product, multiple regional data planes: separate accounts/projects per region, region-pinned databases, buckets and keys, and a routing layer that keeps each tenant in its home shard. No cross-region replication &#8220;for resilience&#8221; outside the legal boundary. DR stays within jurisdiction. Audits become mechanical: list the EU account, export the inventory.</p><p><strong>Per-tenant isolation without over-engineering:</strong> You don&#8217;t need &#8220;one DB per customer&#8221; to get strong isolation. Combine row-level security, tenant-scoped keys, and separate schemas, but make it provable. Telemetry should show access by tenant, least-privilege roles enforced in code, and a quarantine kill-switch to contain one tenant without touching the rest.</p><p><strong>Customer-managed keys as a control and termination lever:</strong> BYOK/CMK assures large customers you can&#8217;t read their data unilaterally and that, on termination, they (or you under their instruction) can render it unreadable, fast. Pin keys to the same region as the data, document rotation, and show who can use keys under JIT elevation. 
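The region-pinning discipline just described can be made mechanical with a fail-closed check over a dataset registry. This is a sketch under assumptions: the registry shape, dataset names, and region identifiers are all illustrative:

```python
# Hypothetical dataset registry: each entry records where the data rests and
# where its encryption key lives.
DATASETS = [
    {"name": "eu_orders",  "data_region": "eu-central-1", "key_region": "eu-central-1"},
    {"name": "eu_backups", "data_region": "eu-central-1", "key_region": "us-east-1"},
]

def residency_violations(datasets, allowed_regions):
    """Flag any dataset whose storage or key falls outside the legal boundary."""
    return [
        d["name"] for d in datasets
        if d["data_region"] not in allowed_regions
        or d["key_region"] != d["data_region"]  # no cross-region KMS hop
    ]

# The backup set leaks residency through its key region and gets flagged.
assert residency_violations(DATASETS, {"eu-central-1"}) == ["eu_backups"]
```

Run in CI, a check like this turns the audit from an interview into a list: every new dataset either registers with a compliant region and key, or the build fails.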
From a diligence angle, it&#8217;s a concrete signal of control, not just encryption &#8220;on paper.&#8221;</p><p><strong>Data contracts for analytics &amp; AI:</strong> Treat every downstream consumer (warehouse, lake, model training job, feature store) as a contract: schema, allowed fields, sensitivity/masking rules, retention, owner, and permitted purpose. Contracts prevent PII from &#8220;sneaking&#8221; into features, enforce minimization and synthetic/masked data in lower environments, and give you runnable deletion hooks.</p><p><strong>In-region telemetry and backups:</strong> Residency isn&#8217;t real if logs, traces, crash dumps, metrics, or snapshots exit the boundary. Keep observability stacks and backup targets in-region, or use providers with regional data boundaries.</p><p><strong>Enforcement and evidence by design:</strong> Make sovereignty fail-closed. At the org layer: region-allow policies, prevent creation of resources outside allowed regions, and block cross-region KMS &#8220;hops.&#8221; In CI/CD: require every dataset to be registered (owner, region, retention, key) or the build fails. For audits: keep access reviews, key-rotation logs, DSAR/erasure job runs, and vendor DPA mappings tied to specific datasets.</p><h2>Red flags that slow or sink deals</h2><ul><li><p><strong>Ambiguous residency:</strong> &#8220;EU data stays in Frankfurt&#8221; while backups replicate to another region &#8220;for resilience&#8221;.</p></li><li><p><strong>Telemetry leakage:</strong> Application data local, but crash dumps and access logs ship to a global SaaS. Easy to miss, hard to defend.</p></li><li><p><strong>Live data in lower environments:</strong> Dev/test using production PII with no masking, no synthetic data. That&#8217;s a governance smell and a breach risk.</p></li><li><p><strong>Root-level access culture:</strong> Wide admin roles (&#8220;*&#8221;) or shared credentials, plus no quarterly reviews. 
Investors assume insider-risk and operational fragility.</p></li><li><p><strong>No deletion proof:</strong> You can accept a Data Subject Request, but can you show verifiable erasure across hot storage, backups, caches, and search indexes within policy?</p></li><li><p><strong>Sub-processor sprawl:</strong> A dozen vendors touch production data, but the DPA list in customer contracts is outdated. Integration risk goes up, trust goes down.</p></li></ul><p>Two or more of these typically trigger a price adjustment or post-close capex to retrofit sovereignty.</p><h2>Habits worth adopting before the next round</h2><ul><li><p><strong>Make the data map an operational tool:</strong> Keep it in a system (not a slide). Every new table or bucket must register with owner, sensitivity, region, retention, and key. CI pipelines should fail if the registry doesn&#8217;t have an entry.</p></li><li><p><strong>Prove deletion, don&#8217;t just promise it:</strong> Implement &#8220;erasure jobs&#8221; that accept a subject ID and traverse application DBs, object stores, caches, analytics tables, and model feature stores. Log what was removed and where propagation is pending (e.g.: immutable backups). Track deletion SLA as a metric.</p></li><li><p><strong>Region-pin everything:</strong> Error tracking, metrics, SIEM, object-storage replication, backup targets, CDN logs: choose in-region endpoints or providers with regional data boundaries. Treat &#8220;unknown region&#8221; as tech debt.</p></li><li><p><strong>Key management with intent:</strong> Use region-scoped KMS/HSM, rotate keys on a schedule, and prefer tenant-scoped or customer-managed keys for high-sensitivity data. Keep a one-pager that shows which datasets use which keys and who can access them.</p></li><li><p><strong>Separate people, not only data: </strong>For strict customers, operate an in-region support rotation with audited access and break-glass procedures. 
Diligence increasingly asks not just where data sits, but which human can touch it.</p></li><li><p><strong>Data for test/dev:</strong> Default to masked or synthetic data in lower environments. If you must use production slices, apply minimization and time-boxed access with approvals; expire copies automatically.</p></li><li><p><strong>Quarterly vendor hygiene:</strong> Maintain a single list of sub-processors with DPAs/SCCs, residency statements, and last review date. Tag which customer contracts reference which version of the list so you can notify accurately.</p></li><li><p><strong>Set 3 governance KPIs and publish them:</strong> Examples: &#8220;% assets classified,&#8221; &#8220;% resources with region tag,&#8221; &#8220;mean deletion lead time&#8221;.</p></li></ul><h2>Mini-Glossary</h2><ul><li><p><strong>Data sovereignty</strong>: Keeping data in a jurisdiction and under the control of entities governed by that jurisdiction.</p></li><li><p><strong>Data residency</strong>: The physical/virtual location where data is stored and processed.</p></li><li><p><strong>BYOK / CMK</strong>: Bring-Your-Own-Key / Customer-Managed Key; customers control encryption keys.</p></li><li><p><strong>Data contract</strong>: A formal agreement defining a dataset&#8217;s schema, allowed fields, retention, and owner.</p></li><li><p><strong>Lineage</strong>: Trace of where data originates and how it moves and transforms across systems.</p></li><li><p><strong>DPA:</strong> Data Processing Agreement. GDPR Art. 28 contract between controller and processor setting purpose/instructions, security measures, sub-processor rules, audit/right to information, and breach-notification terms.</p></li><li><p><strong>SCC: </strong>Standard Contractual Clause. 
European Commission&#8211;approved contract templates to lawfully transfer EU personal data to non-adequate countries, often paired with technical safeguards.</p></li></ul><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/subscribe?"><span>Subscribe now</span></a></p><h2>Your turn</h2><p>What&#8217;s the hardest sovereignty gap you&#8217;ve had to close&#8212;telemetry leakage, deletion across backups, or vendor sprawl? Share the scar story; it helps the next team avoid it.</p><p><strong>Founders:</strong> need a sovereignty gap-scan and a one-page remediation plan for your data room? <strong>Let&#8217;s talk.</strong></p><p><strong>Investors:</strong> need a pre-deal heat-map of residency, access, and deletion risk across a target&#8217;s stack? <strong>Let&#8217;s talk.</strong></p><div><hr></div><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/p/data-governance-and-sovereign-data?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">Thanks for reading The Tech Due Diligence Playbook! 
This post is public so feel free to share it.</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/p/data-governance-and-sovereign-data?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/p/data-governance-and-sovereign-data?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><p><strong>Next in the Playbook:</strong> Edition 11 will dive into the &#8220;People Lens&#8221;. Data is governed by systems. Velocity is governed by people. I&#8217;ll unpack how investors read org charts, leadership, and culture to predict delivery, resilience, and risk.</p>]]></content:encoded></item><item><title><![CDATA[Vendor Due Diligence & Third-Party Risk]]></title><description><![CDATA[Edition 9]]></description><link>https://eitanschuler.substack.com/p/vendor-due-diligence-and-third-party</link><guid isPermaLink="false">https://eitanschuler.substack.com/p/vendor-due-diligence-and-third-party</guid><dc:creator><![CDATA[Eitan Schuler]]></dc:creator><pubDate>Mon, 18 Aug 2025 08:09:01 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/3b07fa55-862f-442f-b7a7-8d9b291c8a23_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Originally published on <a href="https://www.linkedin.com/pulse/vendor-due-diligence-third-party-risk-eitan-schuler-mjeue/">LinkedIn </a>on August 13, 2025</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/subscribe?"><span>Subscribe now</span></a></p><p>At a recent buy-out signing, the 
celebration fizzled when the acquirer&#8217;s CISO asked one last question: <em>&#8220;Show me the log that proves your storage vendor patched the October CVE within 72 hours&#8221;.</em> The target had no such evidence, only an expired SOC2 report. The deal closed, but at a price cut to cover &#8220;supply-chain uncertainty&#8221;.</p><p>Stories like this have multiplied since supply-chain attacks (from SolarWinds to MOVEit) showed that the easiest way to compromise a fast-growing SaaS is to slip through one of its many vendors. Regulators noticed. Europe&#8217;s <strong>Digital Operational Resilience Act (DORA)</strong> now requires financial services to catalogue and monitor &#8220;critical ICT third-party providers&#8221; and empowers regulators to inspect them directly. In the US, the <strong>SEC cybersecurity-disclosure rule</strong> forces public companies to reveal material incidents irrespective of whose software caused them. Founders and investors can no longer assume that vendor risk is somebody else&#8217;s problem.</p><h2>Why third-party risk now sits on the due diligence critical path</h2><p><strong>Attack surface inflation:</strong> Modern stacks include hundreds of SaaS APIs, open-source libraries and cloud services. Each shortens delivery time but widens the blast radius when one link fails or gets breached.</p><p><strong>Regulator reach-through:</strong> DORA, NIS2 and the AI Act all contain &#8220;look-through&#8221; clauses that make the primary business liable for vendors&#8217; missteps. If one of your vendors is breached and attackers exfiltrate the logs containing your customers&#8217; data, regulators (under DORA, NIS2, GDPR, or the AI Act) will still treat <em>your </em>company as responsible.</p><p><strong>Investor math:</strong> Supply-chain incidents translate directly into churn, SLA credits and remediation capex. 
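</p><p>A back-of-envelope sketch of how that exposure might be modeled, with purely hypothetical numbers:</p>

```python
def vendor_incident_exposure(arr, churn_rate, sla_credit_pct, remediation_cost):
    """Rough annual exposure from one critical-vendor incident:
    lost ARR from incident-driven churn + SLA credits owed + one-off remediation."""
    churn_loss = arr * churn_rate
    sla_credits = arr * sla_credit_pct
    return churn_loss + sla_credits + remediation_cost

# Hypothetical: EUR 10M ARR, 3% incident-driven churn,
# 1% of ARR paid out as SLA credits, EUR 250k remediation work.
# 300k lost ARR + 100k credits + 250k remediation, roughly EUR 650k.
exposure = vendor_incident_exposure(10_000_000, 0.03, 0.01, 250_000)
```

<p>Even a crude formula like this lets a diligence team bound the downside instead of guessing.</p><p>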
When diligence teams cannot model that exposure, they widen discount ranges or pass on the deal.</p><h2>What investors expect to see, no matter the term sheet</h2><ul><li><p><strong>A living dependency map.</strong> A single page that lists every external service (commercial, open-source, data supplier, etc.), its business criticality, the data it touches and its current assurance level. Think of it as the SBOM&#8217;s big sister.</p></li><li><p><strong>Risk-tiered vendor assessments:</strong> Low-risk tools (internal analytics) may only need a questionnaire while high-risk vendors (payments, auth, LLM APIs) warrant audits, pen-test results and financial health checks.</p></li><li><p><strong>Contractual safeguards:</strong> Clauses covering data-breach notification (&lt;24 h), the right to audit, encryption at rest and in transit, sub-processor disclosure, exit assistance and migration support if the vendor is acquired or goes dark. Founders should weigh their own risk exposure when contracting with vendors.</p></li><li><p><strong>Continuous monitoring:</strong> Not a once-a-year PDF. Investors look for automated alerts on vendor status pages, real-time SBOM diffing and periodic tabletop exercises that simulate a compromised dependency in a vendor or in an open-source library.</p></li><li><p><strong>Regulatory mapping.</strong> A short memo showing which vendors fall under DORA &#8220;critical ICT&#8221; scope, NIS2 essential-entity scope or the upcoming AI Act GPAI transparency obligations, and how you flow those obligations down.</p></li></ul><h2>Stage, stake&#8230; and visibility</h2><p><strong>Minority Series A/B</strong> investors often accept a spreadsheet dependency list plus policy docs because they neither control operations nor bear integration cost.</p><p>Some <strong>Series C and growth equity</strong> backers demand live dashboards and vendor-risk scores. 
They must defend brand equity once the startup appears on analysts&#8217; radars.</p><p><strong>Control buy-outs and IPO prep</strong> trigger deep dives and demand a very high level of clarity, including sampled vendor contracts, penetration tests for externally hosted code, financial viability analysis and even &#8220;shadow-copy SBOMs&#8221; (a parallel SBOM you generate yourself to cross-check the vendor-provided one) to detect hidden transitive risks.</p><p>Regulatory fines do not change with share ownership. Therefore, <strong>irrespective of stage/stake</strong>, visibility plays an important role here. The moment a company&#8217;s logo lands on Gartner slides or front-page tech news, regulators and customers scrutinize every supplier in the chain. Investors calibrate their lenses accordingly.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/subscribe?"><span>Subscribe now</span></a></p><h2>Some red flags that lengthen negotiations</h2><ul><li><p>Dependency maps live in tribal knowledge.</p></li><li><p>No contractual right to retrieve customer data within 30 days of termination.</p></li><li><p>SaaS vendors with lapsed SOC2 Type II reports or ISO27001 certificates.</p></li><li><p>Open-source libraries pinned to versions with known critical CVEs and no patch plan in place.</p></li><li><p>A single cloud region for all workloads because &#8220;the provider is multi-AZ anyway&#8221;.</p></li><li><p>Pen-tests leaving third-party integrations untested; third-party APIs mocked out &#8220;to save cost&#8221;.</p></li></ul><p>Two or more of these usually lead to escrow demands, hold-backs or sometimes even walk-aways.</p><h2>Habits founders should adopt</h2><ol><li><p><strong>Publish a Vendor-Risk Register:</strong> 
Mirrors the risk register from Edition 1 but focuses on third parties: purpose, data scope, risk owner, criticality score, last review date, next action. Drop it in the data room before anyone asks for it.</p></li><li><p><strong>Automate SBOM watching:</strong> Tools like GUAC or Dependency-Track can ingest vendor software bills-of-materials and alert you to freshly disclosed CVEs.</p></li><li><p><strong>Run a periodic &#8220;assume vendor down&#8221; drill.</strong> Can you swap payment processors in 48 hours? Serve read-only mode if your feature-flag service dies? Document the recovery steps and capture real metrics.</p></li><li><p><strong>Negotiate exit clauses up front.</strong> Termination assistance, data-export in machine-readable formats, potential escrow and the right to run a self-hosted version for up to 12 months are easier to secure before signing, not amid an incident.</p></li><li><p><strong>Link vendor risk scores to engineering workflow.</strong> If a library slips from &#8220;green&#8221; to &#8220;amber,&#8221; create a ticket. Risk posture then improves as part of normal work, not a separate governance effort.</p></li></ol><h2>Common traps</h2><ul><li><p><strong>Paper-only assurance:</strong> a shiny SOC2 report is dated the moment it lands. Dig deeper. Follow up on status pages and uptime stats. Continuous controls matter more.</p></li><li><p><strong>Critical vendor concentration: </strong>sourcing hosting, database and observability from the same cloud provider wipes out redundancy. 
Tempting, but not great.</p></li><li><p><strong>Free-tier complacency: </strong>A startup might rely on the free plan of a feature-flag service that provides no uptime SLA, until the day an outage locks every user out.</p></li><li><p><strong>Licenses without indemnity:</strong> Using a generative-AI model under a non-commercial license in production can trigger copyright claims during M&amp;A disclosure.</p></li><li><p><strong>Ignoring upstream sub-processors:</strong> Your CRM outsources search to a boutique provider that hosts in a different jurisdiction.</p></li></ul><h2>Mini-glossary</h2><ul><li><p><strong>C-SCRM:</strong> Cybersecurity Supply-Chain Risk Management as defined by NIST.</p></li><li><p><strong>DORA</strong>: Digital Operational Resilience Act (EU). DORA puts banks, insurers and fintechs on the hook for the resilience and security of all &#8220;critical ICT third-party providers.&#8221;</p></li><li><p><strong>CVE</strong>: Common Vulnerabilities and Exposures. The global ID system for publicly disclosed security flaws.</p></li><li><p><strong>NIST</strong>: US National Institute of Standards and Technology, which publishes the Cybersecurity Framework, AI RMF and supply-chain guidelines that many investors treat as de-facto standards.</p></li><li><p><strong>SBOM:</strong> Software Bill of Materials listing every component and its version.</p></li><li><p><strong>Escrow </strong>(software or data): A neutral third party holds source code or critical data so the customer can access it if the vendor goes bust, changes ownership, or breaches contract.</p></li><li><p><strong>NIS2</strong>: The EU&#8217;s 2022 Network and Information Security Directive. It widens the original NIS scope to cover more sectors (energy, health, digital providers, etc.) and obliges &#8220;essential&#8221; and &#8220;important&#8221; entities to manage and report cybersecurity and supply-chain risks. 
Transposition and enforcement still vary across EU member states.</p></li><li><p><strong>Flow-down obligation:</strong> Contractual requirements that a company must pass on to its subcontractors or downstream vendors.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/subscribe?"><span>Subscribe now</span></a></p></li></ul><div><hr></div><h2>Your turn</h2><p>Which vendor surprise burnt you in a deal? An unpatched open-source library? A data-export clause that turned out optional? Share the scar; it may save someone else.</p><p><strong>Founders:</strong> Want a quick stress-test of your vendor register or SBOM posture? <strong>Let&#8217;s talk.</strong> <strong>Investors:</strong> Need a second opinion on third-party exposure in a live deal? <strong>Happy to dive in.</strong></p><div><hr></div><p><strong>Next in the Playbook:</strong> In Edition 10 I&#8217;ll write about &#8220;Data Governance &amp; Sovereign-Data Readiness&#8221;. 
Subscribe so it lands in your inbox.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share The Tech Due Diligence Playbook&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share The Tech Due Diligence Playbook</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[Security & Compliance in the AI Act Era]]></title><description><![CDATA[Edition 8]]></description><link>https://eitanschuler.substack.com/p/security-and-compliance-in-the-ai</link><guid isPermaLink="false">https://eitanschuler.substack.com/p/security-and-compliance-in-the-ai</guid><dc:creator><![CDATA[Eitan Schuler]]></dc:creator><pubDate>Mon, 11 Aug 2025 07:37:15 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/32b4e273-35b5-4b4d-8445-9faad79f56c8_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Originally published on LinkedIn on August 6, 2025<br></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/subscribe?"><span>Subscribe now</span></a></p><p><br>The European Union's AI Act, alongside existing regulations like GDPR and the increasing emphasis on frameworks like SOC 2, is reshaping how investors and acquirers evaluate a target company's technological health. 
The declared purpose of the regulation is to &#8220;improve the functioning of the internal market and promote the uptake of human-centric and trustworthy artificial intelligence (AI), while ensuring a high level of protection of health, safety, fundamental rights enshrined in the Charter, including democracy, the rule of law and environmental protection, against the harmful effects of AI systems in the Union and supporting innovation&#8221;. The Act is publicly debated, but I&#8217;m not going to dive into the scope, maintainability, criticality and political debates around it.</p><p>To be completely honest, I haven&#8217;t talked to founders or investors about the AI Act yet, but I decided to dig in and evaluate how it <em>will</em> affect the world of Technical Due Diligence based on its letter and spirit. We'll touch on the critical interplay between security, compliance and AI and how companies can demonstrate readiness in this new era.</p><h2>The EU AI Act: A New Paradigm for Due Diligence</h2><h3>Risk-based approach</h3><p>The EU AI Act introduces a risk-based approach to AI regulation. This means that the level of scrutiny and compliance requirements for an AI system will depend on the <em>potential harm</em> it can cause. For investor due diligence this translates into a critical first step: classifying the target company's AI systems according to the Act's risk categories:</p><ul><li><p><strong>Unacceptable Risk:</strong> AI systems that pose a clear threat to fundamental rights (e.g., social scoring by governments) are banned. Any presence of such systems in a target company might be an immediate deal-breaker.</p></li><li><p><strong>High-Risk AI Systems: </strong>These systems are subject to stringent requirements, including robust risk management systems, data governance, technical documentation, human oversight and conformity assessments (e.g.: critical infrastructure, employment, law enforcement, credit scoring). 
For companies developing or deploying high-risk AI, due diligence will involve a deep dive into their compliance frameworks, testing methodologies and data quality processes. Investors will need to verify that the company has established an AI risk management system, maintains comprehensive technical documentation and has a clear plan for keeping security, compliance and AI-risk controls alive and verifiable in the long run.</p></li><li><p><strong>Limited Risk</strong> <strong>AI Systems:</strong> These systems have specific transparency obligations (e.g.: chatbots must inform users they are interacting with an AI). Due diligence here would focus on verifying that these transparency mechanisms are in place.</p></li><li><p><strong>Minimal or No Risk</strong> <strong>AI Systems:</strong> Most AI systems fall into this category and are subject to voluntary codes of conduct. While less regulated, investors may still look for evidence of responsible AI practices.</p></li></ul><h3>Key Due Diligence areas under the EU AI Act</h3><ul><li><p><strong>Risk Management System:</strong> Does the company have a documented and implemented risk management system for its AI applications? This includes identifying, analyzing and evaluating risks, as well as implementing appropriate mitigation measures.</p></li><li><p><strong>Data Governance</strong>: Given that AI models are only as good as the data they are trained on, due diligence will heavily scrutinize data governance practices. This includes data quality, data collection methods, bias detection and mitigation, and data security.</p></li><li><p><strong>Technical Documentation and Record-Keeping</strong>: The Act mandates extensive technical documentation for high-risk AI systems. Due diligence will require reviewing these documents to ensure they are complete, accurate and demonstrate compliance.</p></li><li><p><strong>Human Oversight</strong>: For high-risk AI, human oversight is crucial. 
Due diligence will assess the mechanisms in place to ensure human control and intervention capabilities.</p></li><li><p><strong>Conformity Assessment</strong>: High-risk AI systems will require a conformity assessment before being placed on the market. Investors will need to verify that these assessments have been conducted and that the systems meet the required standards.</p></li><li><p><strong>Post-Market Monitoring:</strong> The Act requires continuous monitoring of high-risk AI systems once they are in use. Due diligence should examine the company's post-market surveillance plans and incident reporting mechanisms.</p></li></ul><p>The EU AI Act fundamentally shifts the burden of proof onto companies developing and deploying AI. For investors this means a more rigorous and specialized due diligence process is required to identify and quantify regulatory risks, ensuring that the target company is not only innovative but also compliant and future-proof in the evolving AI regulatory landscape in the EU.</p><h2>GDPR and SOC 2 Readiness: Pillars of Compliance</h2><p>While the EU AI Act introduces new considerations, existing compliance frameworks like GDPR (General Data Protection Regulation) and SOC 2 (Service Organization Control 2) remain critical pillars of compliance due diligence. Most of what the AI Act demands already lives inside GDPR and SOC 2. GDPR&#8217;s data-map and Data Protection Impact Assessment (DPIA) exercises force you to catalogue every dataset, spell out lawful purpose and assess privacy risk, covering much of the first half of the AI Act&#8217;s &#8220;data governance and risk-management&#8221; chapter. SOC 2 then picks up the baton: its security and change-management criteria require immutable logs, access reviews and incident playbooks, which satisfy the AI Act&#8217;s call for technical documentation, audit trails and secure development. 
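</p><p>One way to make that overlap concrete is a simple reuse map. The categories and mappings below are my own illustrative simplification, not a legal checklist:</p>

```python
# Illustrative mapping (not legal advice): which existing GDPR / SOC 2
# artifacts could be reused as evidence for AI Act conformity items.
REUSE_MAP = {
    "risk management system":  ["GDPR DPIA", "SOC 2 risk assessment"],
    "data governance":         ["GDPR data map / records of processing"],
    "technical documentation": ["SOC 2 change-management evidence"],
    "logging & audit trails":  ["SOC 2 immutable logs", "SOC 2 access reviews"],
    "human oversight":         ["GDPR Art. 22 safeguards", "SOC 2 incident playbooks"],
    "post-market monitoring":  [],  # genuinely new work for most teams
}

def gap_list(reuse_map):
    """AI Act items with no existing evidence to repurpose."""
    return [item for item, evidence in reuse_map.items() if not evidence]
```

<p>The empty entries are exactly the gap list an assessor will probe first.</p><p>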
Even the &#8220;human-in-the-loop&#8221; clause for high-risk AI mirrors safeguards you must document under GDPR Article 22 and rehearse under SOC 2&#8217;s incident-response control. In short, if you can already show a clean DPIA, a live data-catalog and six months of SOC-style control evidence, you are roughly 70% of the way to an AI Act conformity file.</p><p>When diligence or audits come, this approach pays off fast. You are rearranging chapters, not starting with blank pages. Controls are measured in dashboards your teams consult every week, so regulators and investors see living metrics, not promises. Staff have been through GDPR and security training, making an AI-risk module an incremental, not a green-field, task. Vendor contracts already carry data-processing addenda and security questionnaires, so supply-chain scrutiny is not built from scratch either. And because you have run at least one mock SOC 2 or ISO gap assessment, the organization knows how to gather evidence and close tickets on a deadline. The muscle is trained, and that will matter when an AI Act assessor or an acquirer comes calling.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Tech Due Diligence Playbook! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>Habits worth adopting</h2><ul><li><p>Be GDPR compliant by design (if you or your customers are in the EU, you must be anyway) and SOC 2 ready even if you are not being audited for it. This will help a lot in times of diligence.</p></li><li><p>Run an annual AI-focused breach drill. A red team can inject prompt-injection, data-poisoning and model-exfiltration scenarios. Test your resilience and measure response metrics.</p></li><li><p>Risk-tag every ticket. A custom field &#8220;AI Act risk: minimal / limited / high&#8221; required at ticket creation. 
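</p><p>Enforced in a hypothetical tracker hook, the habit might look like this (field and checklist names are made up for illustration):</p>

```python
RISK_LEVELS = {"minimal", "limited", "high"}

# Checklist that must be complete before a high-risk ticket may close.
HIGH_RISK_CHECKLIST = ("bias_test_run", "explainability_logged", "rollback_plan_merged")

def validate_new_ticket(fields):
    """Reject ticket creation unless the AI Act risk field is set."""
    if fields.get("ai_act_risk") not in RISK_LEVELS:
        raise ValueError("ai_act_risk must be one of: minimal, limited, high")

def may_close(fields):
    """High-risk tickets close only when every checklist item is done."""
    if fields.get("ai_act_risk") != "high":
        return True
    return all(fields.get(item, False) for item in HIGH_RISK_CHECKLIST)
```

<p>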
Closing a high-risk ticket could trigger an automated checklist: bias test run, explainability score logged, rollback plan merged.</p></li><li><p>Maintain an AI Act heat-map: A single page listing every feature, model and dataset with its risk tier and the control evidence (test, log, document) that supports it.</p></li><li><p>Run periodic self-audits: The &#8220;Govern &#8594; Map &#8594; Measure &#8594; Manage&#8221; loop surfaces blind spots before investors do.</p></li></ul><h2>Mini-Glossary</h2><ul><li><p><strong>EU AI Act:</strong> A European Union regulation, in force since 2024, that provides a legal framework for artificial intelligence, categorizing AI systems by risk level.</p></li><li><p><strong>GDPR (General Data Protection Regulation):</strong> A comprehensive data protection law in the European Union and European Economic Area, governing how personal data is collected, processed and stored.</p></li><li><p><strong>SOC 2 (Service Organization Control 2):</strong> An auditing procedure that ensures service providers securely manage data to protect the interests of their clients and the privacy of their customers.</p></li><li><p><strong>DPIA (Data Protection Impact Assessment): </strong>A process designed to help organizations identify and minimize the data protection risks of a project or plan.</p></li><li><p><strong>Automated Decision-Making:</strong> Decisions made by technological means without human involvement, particularly when they have legal or similarly significant effects on individuals.</p></li><li><p><strong>Trust Services Criteria (TSC):</strong> A set of principles and criteria used in SOC 2 audits to evaluate the controls of a service organization related to information security.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/subscribe?"><span>Subscribe now</span></a></p></li></ul><h2>Your turn</h2><p>How are you preparing your organization for the evolving AI regulatory landscape? What challenges have you faced in mapping GDPR, EU AI Act and SOC 2 readiness to your due diligence scope? Share your insights and experiences below.</p><p><strong>Founders</strong>: Need to assess your compliance posture in the AI era or prepare for tech due diligence? <strong>Let's talk.</strong> <strong>Investors</strong>: Looking to navigate the complexities of AI-related security and compliance risks in your investment targets? <strong>Let's talk.</strong></p><h2>Next in the Playbook</h2><p>In Edition 9 we&#8217;ll dive into the topic of Vendor Due Diligence &amp; Third-Party Risk. 
<strong>Stay tuned!</strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share The Tech Due Diligence Playbook&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share The Tech Due Diligence Playbook</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[Build vs. Buy: Innovation or Lock-In?]]></title><description><![CDATA[Edition 7]]></description><link>https://eitanschuler.substack.com/p/build-vs-buy-innovation-or-lock-in</link><guid isPermaLink="false">https://eitanschuler.substack.com/p/build-vs-buy-innovation-or-lock-in</guid><dc:creator><![CDATA[Eitan Schuler]]></dc:creator><pubDate>Mon, 04 Aug 2025 06:20:23 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/883c0610-e271-4086-956b-a0fd0eeb1e69_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/subscribe?"><span>Subscribe now</span></a></p><p>Originally published on <a href="https://www.linkedin.com/pulse/build-vs-buy-innovation-lock-in-eitan-schuler-ia9ke/">LinkedIn</a> on July 30, 2025.</p><p>Over coffee with a founder recently I heard a familiar lament: <em>&#8220;We spent nearly a year building our own data-pipeline engine, and now the VC is grilling us on why we didn&#8217;t use a proven vendor&#8221;.</em> A week earlier, an investor complained that one of their portfolio 
companies was so tightly tied to a proprietary product that any pivot would trigger legal wrangling and year-long rewrites. Go figure...</p><p>Whether you <strong>build</strong> technology yourself or <strong>buy</strong> it off the shelf can raise enterprise value or introduce future friction during diligence. Investors dig hard into these decisions because every choice implies something about cost, speed, and eventual exit readiness.</p><h2>Why build&#8209;versus&#8209;buy sits on the due diligence critical path</h2><p>First comes <strong>strategic importance</strong>. If a component is part of the competitive moat, home-grown innovation makes sense. McKinsey found that companies investing in genuinely differentiating tech outpace peers by ~20% in revenue growth. But when the capability is commodity (think payments, CRM, off-the-shelf observability) buyers prefer to see a battle-tested vendor so the team can stay laser-focused on its real secret sauce.</p><p>Second is the <strong>uniqueness of requirements</strong>. Off-the-shelf tools cover the 90% case. When your workflows are truly novel (e.g.: a proprietary machine-learning pipeline) custom code avoids painful compromises. Netflix famously built its recommendation engine in-house because no vendor could match the scale and accuracy they needed.</p><p>Then there is <strong>time-to-market and total cost of ownership (TCO)</strong>. Buying nearly always ships faster, arrives with support, and converts capex to predictable opex. Building absorbs more engineers up front, but once a solution is heavily used the per-transaction cost may drop well below perpetual license fees. Diligence teams want to know how you did the maths.</p><p>Finally, investors examine <strong>risk tolerance and vendor lock-in</strong>. Commercial software shifts some operational risk to the vendor but introduces dependency: contract renewals, price increases, data-migration headaches. 
Custom code hands you control yet exposes you to delivery and maintenance risk. Showing how you balanced those forces tells buyers you understand the trade-offs and have kept future choices in your own hands.</p><p>Taken together, build&#8209;versus&#8209;buy choices reveal how a team balances innovation, speed, cost and optionality. They influence gross margin, scalability and even integration complexity during a carve&#8209;out or acquisition.</p><h2>How investors weigh the decision</h2><p>Investors use a simple framework:</p><ol><li><p><strong>Core vs. commodity:</strong> they ask, &#8220;Is this function part of the company&#8217;s competitive moat or table stakes?&#8221; Founders should be able to articulate which components differentiate the product. If everything is built in&#8209;house, they will probe whether resources are misallocated.</p></li><li><p><strong>Economic analysis:</strong> investors model the TCO for both options: license fees, cloud spend, engineering salaries, maintenance and upgrade cycles. They also factor in time&#8209;to&#8209;value: a delayed launch can erode first&#8209;mover advantage.</p></li><li><p><strong>Exit readiness:</strong> a stack littered with proprietary licenses can slow down an IPO or trade sale because renegotiating vendor contracts takes time. Conversely, deep bespoke code with no documentation scares buyers because key engineers may walk away. Due diligence will ask about vendor termination clauses, escrow arrangements, and the portability of custom code.</p></li><li><p><strong>Governance and roadmap control:</strong> a vendor&#8217;s roadmap may diverge from your needs, while in&#8209;house teams risk falling behind on maintenance. 
Investors might look for evidence that you review vendor roadmaps quarterly, negotiate SLAs that match your SLOs, and allocate capacity for refactoring home&#8209;grown components.</p></li></ol><h2>Signals that raise eyebrows</h2><p>Investors notice when teams write bespoke billing engines or identity providers while world-class SaaS alternatives exist, and they read it as opportunity cost. They worry when a critical module is tied to a single vendor, yet no migration path exists. They flinch at opaque TCO spreadsheets or, worse, decisions made by gut feel. And they downgrade valuation when a codebase is full of clever features but exposes no clean APIs: that lack of integration maturity hampers partnership opportunities and future acquisitions.</p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/p/build-vs-buy-innovation-or-lock-in?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">Thanks for reading The Tech Due Diligence Playbook! This post is public so feel free to share it.</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/p/build-vs-buy-innovation-or-lock-in?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/p/build-vs-buy-innovation-or-lock-in?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><h2>Habits startups should adopt</h2><ul><li><p><strong>Document your rationale: </strong>keep a build&#8209;vs&#8209;buy decision log. 
For each component, list the strategic importance, alternatives considered, estimated build cost and TCO, vendor pricing, and exit plan. Investors love seeing that you compare options systematically.</p></li><li><p><strong>Prototype before buying:</strong> for third&#8209;party solutions, run a proof of concept. Evaluate integration complexity, performance, and vendor responsiveness before signing.</p></li><li><p><strong>Have a vendor assessment process in place:</strong> evaluate the service not only technically, but also from the legal, security, and data-privacy angles.</p></li><li><p><strong>Review decisions periodically:</strong> a choice that made sense at Seed might not fit at Series B. Schedule annual reviews of home&#8209;grown modules and vendor contracts. Cloud&#8209;native SaaS options improve quickly, so your bespoke feature may no longer justify its maintenance burden.</p></li><li><p><strong>Negotiate portability:</strong> when you do license software, negotiate termination assistance and data&#8209;export clauses up front (the EU Data Act will be on your side). Ideally ensure that you can extract your data and run the service in a private cloud if the vendor is acquired or fails to meet SLOs.</p></li><li><p><strong>Calculate payback:</strong> use a simple ROI calculator: compare development cost and delay against subscription fees and license renewals. 
Assign probability and margin of error to your estimates.</p></li></ul><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share The Tech Due Diligence Playbook&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share The Tech Due Diligence Playbook</span></a></p><h2>Mini&#8209;Glossary</h2><ul><li><p><strong>Vendor lock-in:</strong> switching away from a third-party product incurs high costs (proprietary formats, long contracts).</p></li><li><p><strong>Total Cost of Ownership (TCO):</strong> full life-cycle cost: build or purchase, hosting, upgrades, training, support.</p></li><li><p><strong>Strategic differentiator:</strong> a capability that truly separates you from competitors, worth bespoke investment.</p></li><li><p><strong>Commodity function:</strong> a standard need (e.g., payroll, CRM) where buying saves time and cash.</p></li></ul><div><hr></div><h2>Your turn</h2><p>When did a build-vs-buy decision bite you, or save the roadmap? Did custom code become a moat, or did vendor lock-in stall a pivot? Share the story below; scars teach best.</p><p><strong>Founders</strong>: Need help auditing your build-vs-buy ledger before the next round? <strong>Let&#8217;s talk.</strong></p><p><strong>Investors</strong>: Looking to benchmark vendor-lock exposure across your portfolio? 
<strong>I can help.</strong></p><div><hr></div><p><strong>Next in the Playbook:</strong> Edition 8 will look at &#8220;Security &amp; Compliance in the AI-Act Era&#8221;:<strong> </strong>how emerging EU rules and baseline security metrics are reshaping diligence checklists for both SaaS vendors and their investors.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/subscribe?"><span>Subscribe now</span></a></p><p><strong>Subscribe and it will land in your inbox.</strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.com/@eitanschuler/note/p-169635620&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://substack.com/@eitanschuler/note/p-169635620"><span>Leave a comment</span></a></p>]]></content:encoded></item><item><title><![CDATA[Incident Response and the Culture Behind the Numbers ]]></title><description><![CDATA[Edition 6]]></description><link>https://eitanschuler.substack.com/p/incident-response-and-the-culture</link><guid isPermaLink="false">https://eitanschuler.substack.com/p/incident-response-and-the-culture</guid><dc:creator><![CDATA[Eitan Schuler]]></dc:creator><pubDate>Mon, 28 Jul 2025 06:50:33 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/3073d749-620c-4195-9512-74382b525ef1_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A clean codebase can still derail a deal if incidents linger in production. A buyer may discover that every Sev-1 outage took an average of 6 hours to resolve, even though the team promises a sub-hour Mean Time to Recover (MTTR). 
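</p><p>Claims like that are easy to check against the raw incident log rather than taken on faith. A minimal sketch; the record fields and sample data here are illustrative, not from any real system:</p>

```python
from datetime import datetime

# Minimal MTTR calculation from raw incident records.
# Field names and sample data are illustrative.

incidents = [
    {"sev": 1, "started": "2025-03-02T09:00", "resolved": "2025-03-02T15:30"},
    {"sev": 1, "started": "2025-04-11T22:10", "resolved": "2025-04-12T03:40"},
    {"sev": 2, "started": "2025-05-01T08:00", "resolved": "2025-05-01T08:45"},
]

def mttr_hours(records, sev):
    """Mean time from customer impact to full restoration, in hours."""
    durations = [
        (datetime.fromisoformat(r["resolved"])
         - datetime.fromisoformat(r["started"])).total_seconds() / 3600
        for r in records if r["sev"] == sev
    ]
    return sum(durations) / len(durations)

print(mttr_hours(incidents, sev=1))  # -> 6.0
```

<p>A few lines of arithmetic over the on-call tool&#8217;s export is often all it takes to reconcile the promised number with reality.</p><p>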
No one wants to pay a full multiple for a platform that lost half a business day each time it hiccupped. Speed matters; accuracy matters even more. I already mentioned some of these metrics in Edition 3, but here I'm writing through the lens of Operational Excellence and Incident Response practices.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/subscribe?"><span>Subscribe now</span></a></p><h3>Why incident response sits on the due diligence critical path</h3><p>Investors track uptime, but they worry even more about <strong>how quickly you bounce back</strong> when things break. A short MTTR limits SLA credits, protects NPS, and keeps churn down. Fast containment (MTTC) also signals mature on-call practices, clear ownership, and a culture that learns rather than blames. If recovery is delayed or post-mortems gather dust, buyers assume hidden debt in monitoring, run-books, and team structure. They will discount accordingly.</p><h3>The metrics that tell the real story</h3><p>Important operational metrics teach us about <strong>detecting </strong>downtime and incidents,<strong> acknowledging and containing </strong>them, and <strong>recovering and learning </strong>from them.</p><ul><li><p><strong>Mean Time to Detect (MTTD) </strong>shows how long it takes the monitors (not the customers) to realize something is wrong. Sub-5-minute MTTD conveys that observability is wired into every layer, while double-digit minutes hint at blind spots.</p></li><li><p><strong>Mean Time to Acknowledge (MTTA) </strong>is about the hand-off from machine to human. 
A healthy on-call rota keeps MTTA in the low single-digit minutes, but anything beyond 10 minutes shouts &#8220;pager fatigue&#8221; or thin coverage.</p></li><li><p><strong>Mean Time to Contain (MTTC)</strong> measures how fast damage is halted, even if the full fix takes longer. A strong containment metric shows good alerting, well-defined playbooks, and confidence in rollback tools.</p></li><li><p><strong>Mean Time to Recover (MTTR)</strong> is the headline: the clock starts when a customer feels pain and stops when normal service resumes. For growth-stage SaaS, anything under an hour for Sev-1 incidents calms investors. A drift toward 2-3 hours raises eyebrows.</p></li><li><p><strong>Change Failure Rate (CFR)</strong> pairs with MTTR. If fewer than 15% of deployments trigger incidents, automated testing and progressive delivery are doing their job.</p></li><li><p>Investors also examine incident <strong>post-mortem velocity</strong>: not just whether a document exists, but how quickly it is written, shared, and closed out with follow-up tasks. A 5-day turnaround on write-ups and a short window to burn down action items indicate a learning organization.</p></li></ul><h3>How stage and stake sharpen the lens</h3><p>Seed or early Series A backers tolerate informal on-call rotations and Google Doc post-mortems, provided the team can point to improving MTTR trends. By Series B, buyers expect real on-call schedules, PagerDuty data, and monthly incident reviews. In control buy-outs the bar rises again: investors want hour-level graphs, compliance-grade run-books, and tracking of follow-up work to completion. The larger the cheque and the more control it buys, the deeper investors drill.</p><h3>Red flags that lengthen negotiations</h3><p>If incident dashboards show spikes that aren&#8217;t explained, if major outages lack post-mortems, or if the same root cause appears 3 quarters in a row, diligence slows. 
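</p><p>A repeating root cause is exactly the kind of pattern worth surfacing from your own post-mortem log before a buyer does it for you. A minimal sketch, with illustrative field names and sample data:</p>

```python
from collections import defaultdict

# Flag root causes that recur across quarters in a post-mortem log.
# Field names and sample data are illustrative.

postmortems = [
    {"quarter": "2025-Q1", "root_cause": "connection-pool exhaustion"},
    {"quarter": "2025-Q2", "root_cause": "connection-pool exhaustion"},
    {"quarter": "2025-Q3", "root_cause": "connection-pool exhaustion"},
    {"quarter": "2025-Q3", "root_cause": "expired TLS certificate"},
]

def recurring_root_causes(records, min_quarters=3):
    """Root causes seen in at least `min_quarters` distinct quarters."""
    seen = defaultdict(set)
    for r in records:
        seen[r["root_cause"]].add(r["quarter"])
    return sorted(c for c, quarters in seen.items() if len(quarters) >= min_quarters)

print(recurring_root_causes(postmortems))  # -> ['connection-pool exhaustion']
```

<p>Anything this report flags should already have a remediation ticket attached.</p><p>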
Investors must model churn risk and SLA credits, so they build extra buffer into the price.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/subscribe?"><span>Subscribe now</span></a></p><h3>Habits worth adopting before the next term sheet</h3><p>The goal is to show a reliability culture that&#8217;s measurable, repeatable, and not personality-driven.</p><ul><li><p><strong>Measure</strong> everything you can, but at least MTTR and CFR. That&#8217;s the bare minimum.</p></li><li><p><strong>Run a periodic incident review.</strong> Invite engineering, support, customer success, and product, and review MTTR, MTTC, and CFR trends. Keep the agenda tight: what happened, why it mattered, what has already been fixed, and which follow-ups remain.</p></li><li><p><strong>Publish a public-facing uptime page fed directly from your monitoring.</strong> Nothing builds trust faster during due diligence than a third-party chart showing 99.96% over the past twelve months.</p></li><li><p><strong>Automate &#8220;first-5-minutes&#8221; actions.</strong> Pre-built scripts that flip feature flags, roll back canary pods, or increase replica counts cut containment time in half and impress investors who understand operational excellence.</p></li><li><p><strong>Close the loop on post-mortems.</strong> Assign an owner, set a due date, and track action items in the same backlog as features. Burned-down debt is visible proof of learning.</p></li><li><p><strong>Rotate on-call with follow-up rest.</strong> A humane schedule keeps engineers sharp and retention high, so MTTR stays low without burning out talent. This is often regulated by law (e.g. 
in Germany).</p></li><li><p>Schedule an <strong>annual third-party &#8220;chaos&#8221; or controlled-incident exercise</strong>. When outside facilitators inject surprise failures (DNS black-holes, database corruption, credential leaks), your engineers rehearse under pressure while observers time the metrics and note playbook gaps. The report becomes instant diligence evidence that you test resilience, not just talk about it.</p></li><li><p><strong>Formalize a 1-2-3 support ladder</strong>. Level-1 responders (support or SRE) triage and apply run-book fixes; level-2 engineers dig into code; level-3 (staff or architects) own systemic remediation. Documenting the structure, with escalation timers and after-hours rotations, tells investors incident response won&#8217;t bottleneck around a single heroic CTO.</p></li></ul><h3>Common traps</h3><p>Slashing cloud-monitoring budgets right before diligence leaves gaps in incident logs; buyers wonder what else is missing. Counting only &#8220;declared&#8221; incidents hides near-misses that matter just as much to future reliability. Pushing every post-mortem into a shared folder but never checking whether fixes shipped convinces no one.</p><h3>Mini-Glossary</h3><ul><li><p><strong>MTTR (Mean Time to Recover):</strong> average time from user impact to full restoration.</p></li><li><p><strong>MTTC (Mean Time to Contain):</strong> time from alert to halting customer pain.</p></li><li><p><strong>CFR (Change Failure Rate):</strong> percentage of deployments that trigger incidents or rollbacks.</p></li><li><p><strong>Post-mortem velocity:</strong> elapsed time from incident close to a documented, accepted retro with actions underway.</p></li></ul><div><hr></div><h3>Your turn</h3><p>What&#8217;s the toughest incident you&#8217;ve had to explain during diligence? A silent data-loss bug, a runaway queue, or a 3-hour DNS outage? How did you prove it wouldn&#8217;t happen again? 
Share your scar stories below.</p><p><strong>Founders: </strong>Need a second set of eyes on your incident metrics or post-mortem workflow? <strong>Let&#8217;s talk.</strong></p><p><strong>Investors: </strong>Need more clarity on the operational excellence and the internal culture of your target company? <strong>Let&#8217;s talk.</strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://eitanschuler.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://eitanschuler.substack.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><p><strong>Next in the Playbook:</strong> Edition 7 tackles build-versus-buy decisions and how investors weigh home-grown innovation against third-party lock-in. Subscribe and it will land in your inbox.</p>]]></content:encoded></item></channel></rss>