Chapter 6: Measuring and Governing Architectural Characteristics

fsa fitness-functions governance metrics architecture-characteristics
Status: Notes complete


Overview

Identifying architectural characteristics (Chapter 5) is only half the battle. Without measurement and governance, characteristics degrade silently over time as development pressure, shortcuts, and entropy accumulate. Chapter 6 introduces fitness functions — the primary mechanism for making architectural governance automated, objective, and continuous. The chapter also covers the categories of metrics that feed fitness functions and explains why manual governance invariably fails at scale.


6.1 Why Governance is Necessary

The Entropy Problem

Left unmanaged, software architectures drift from their intended structure. This happens because:

  • Developers make locally reasonable decisions that violate global architectural constraints.
  • Time pressure causes “temporary” shortcuts that become permanent.
  • No one is explicitly responsible for monitoring characteristics after the initial design.
  • Architecture documentation becomes stale and is not consulted during development.

Why Manual Governance Fails

Richards & Ford are explicit: manual governance does not scale. The reasons are structural, not about individual effort:

  1. Human review is inconsistent: different reviewers apply different standards; the same reviewer applies different standards on different days.
  2. Pull request volume: in a team of 20+ developers committing daily, no architect can review every change for architectural compliance.
  3. Latency: manual review finds violations after the fact, when they are expensive to fix (often after merge to main).
  4. Knowledge decay: the architectural rationale for a constraint is often not documented; over time, even well-intentioned reviewers don’t know what they’re protecting.
  5. No objective baseline: without a metric, “this violates our performance characteristic” is an opinion, not a verifiable claim.

The solution is to automate governance through fitness functions — objective, executable checks that run as part of the CI/CD pipeline.


6.2 Fitness Functions — Definition

Core Definition

An architectural fitness function is any mechanism that provides an objective, measurable signal about whether one or more architectural characteristics are being met.

Key properties:

  • Objective: produces a pass/fail or numeric result, not an opinion.
  • Automated (ideally): runs without human intervention, typically in CI/CD.
  • Targeted: each fitness function targets a specific characteristic or sub-characteristic.
  • Evolutionary: fitness functions evolve with the architecture; outdated ones are removed, new ones are added.

The term is borrowed from evolutionary computing, where a fitness function evaluates how well a candidate solution satisfies optimization criteria. Here, the “candidate solution” is the current state of the codebase/architecture, and the “optimization criteria” are the architectural characteristics.

Fitness Functions vs. Unit Tests

Fitness functions are not the same as unit tests:

  • Unit tests verify functional correctness (does this function return the right value?).
  • Fitness functions verify architectural correctness (does this system still have the structural properties we designed for?).
  • Some fitness functions use testing frameworks (e.g., JUnit), but their purpose is architectural, not functional.

6.3 Types of Fitness Functions

By Scope

Atomic Fitness Functions

  • Execute against a single, isolated component or module.
  • Run quickly; typically part of unit or integration test suites.
  • Example: a test that checks a single module’s cyclomatic complexity does not exceed 10.
  • Example: an ArchUnit test that verifies a specific package has no outbound dependencies to a restricted layer.

Holistic Fitness Functions

  • Execute against the entire system or a significant integrated portion.
  • Verify characteristics that emerge only at the system level (e.g., overall throughput, end-to-end latency, inter-service coupling).
  • Example: a performance benchmark that exercises the full request path under load.
  • Example: a chaos engineering experiment that verifies the system remains available when a service is killed.
  • Usually slower to run; may not run on every commit.

By Execution Timing

Triggered Fitness Functions

  • Execute in response to a specific event: a commit, a pull request, a build, or a deployment.
  • Most common type; the majority of CI/CD fitness functions are triggered.
  • Example: ArchUnit dependency checks triggered on every pull request.
  • Example: security scan triggered on every build.

Temporal Fitness Functions

  • Execute on a schedule, independent of code changes.
  • Used for characteristics that change over time due to external factors or cumulative drift.
  • Example: weekly automated penetration test.
  • Example: nightly chaos engineering experiment to verify fault tolerance.
  • Example: monthly test of backup restoration (recoverability fitness function).

By Analysis Method

Static Fitness Functions

  • Analyze the structure of the code or architecture without executing it.
  • Examples: dependency analysis, complexity metrics, coupling measurements, code style checks.
  • Fast, deterministic, easy to run on every commit.
  • Tools: ArchUnit (Java), NDepend (.NET), Structurizr (architecture-level).

Dynamic Fitness Functions

  • Verify characteristics through execution — the system must run to produce results.
  • Examples: performance benchmarks, load tests, availability monitors, chaos experiments.
  • Slower, may require environment provisioning, not always feasible on every commit.

6.4 Operational Measures

Operational measures reflect the runtime behavior of a deployed system. They are the primary source of fitness functions for performance, availability, scalability, and reliability characteristics.

Response Time

  • Definition: the elapsed time between a client sending a request and receiving a complete response.
  • Measurement levels: the average is misleading for latency-sensitive systems; percentiles (p50, p95, p99, p99.9) are preferred.
    • p99 = 99% of requests complete within this time; 1% are slower.
    • p99.9 (often written p999) = 99.9% of requests complete within this time; used for high-reliability systems to capture the “long tail.”
  • Why percentiles matter: average response time can look healthy while a significant minority of users experience unacceptable latency (the “long tail” problem).
  • Fitness function form: “p99 response time for the order submission endpoint must be < 300ms under 500 concurrent users in the staging environment.”
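The percentile argument above can be made concrete with a small sketch. It assumes the nearest-rank definition of a percentile (other definitions interpolate between samples), and the class and method names are hypothetical, not taken from the book or any load-testing tool.

```java
import java.util.Arrays;

/** Illustrative latency check; all names here are hypothetical. */
final class LatencyCheck {

    /** Nearest-rank percentile: the smallest sample with at least p% of samples at or below it. */
    static double percentile(double[] samplesMs, double p) {
        if (samplesMs.length == 0 || p <= 0 || p > 100) {
            throw new IllegalArgumentException("need samples and 0 < p <= 100");
        }
        double[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0); // 1-based rank
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Arithmetic mean, included only to contrast with the percentile. */
    static double mean(double[] samplesMs) {
        return Arrays.stream(samplesMs).average().orElse(Double.NaN);
    }

    /** Fitness check: p99 must be strictly below the limit. */
    static boolean p99Within(double[] samplesMs, double limitMs) {
        return percentile(samplesMs, 99.0) < limitMs;
    }
}
```

With 97 samples at 100 ms and 3 samples at 2,000 ms, the mean is 157 ms and looks healthy, while p99 is 2,000 ms, so a 300 ms p99 gate fails.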

Throughput

  • Definition: the number of requests (or transactions, messages, events) processed per unit of time.
  • Typically measured as requests per second (RPS) or transactions per second (TPS).
  • Related to, but distinct from, response time: a system can have high throughput with high latency if it uses buffering/batching.
  • Fitness function form: “The order processing pipeline must sustain 1,000 orders per minute without queue backlog growth.”
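A throughput rule of this form might be evaluated as sketched below. The sketch assumes the load test records a processed count, the elapsed time, and periodic queue-depth samples; comparing the last depth sample with the first is a deliberately crude stand-in for "backlog growth", and every name is hypothetical.

```java
/** Illustrative throughput check; all names and the backlog heuristic are assumptions. */
final class ThroughputCheck {

    /** Items processed per minute over the measurement window. */
    static double perMinute(long processed, double elapsedMinutes) {
        if (elapsedMinutes <= 0) {
            throw new IllegalArgumentException("elapsedMinutes must be positive");
        }
        return processed / elapsedMinutes;
    }

    /** Treats the backlog as growing when the last queue-depth sample exceeds the first. */
    static boolean backlogGrowing(long[] queueDepths) {
        return queueDepths.length > 1
            && queueDepths[queueDepths.length - 1] > queueDepths[0];
    }

    /** Passes when the required rate is met and the backlog did not grow. */
    static boolean sustains(long processed, double elapsedMinutes,
                            long[] queueDepths, double requiredPerMinute) {
        return perMinute(processed, elapsedMinutes) >= requiredPerMinute
            && !backlogGrowing(queueDepths);
    }
}
```

A production check would more likely fit a trend line to the queue depth than compare two samples; the two-sample comparison keeps the example short.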

Availability Metrics

  • Availability: percentage of time the system is operational and accessible.
    • Expressed as “nines”: 99% = ~87.6 hours downtime/year; 99.9% = ~8.76 hours; 99.99% = ~52.6 minutes; 99.999% = ~5.26 minutes.
  • MTTR (Mean Time to Recovery): average time to restore service after a failure. Measures recoverability.
  • MTBF (Mean Time Between Failures): average time between failure events. Measures reliability.
  • Error rate: percentage of requests resulting in errors (5xx responses, timeouts, exceptions).
  • Fitness function form: “Monthly availability must not fall below 99.9%; MTTR for P1 incidents must not exceed 30 minutes.”
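The "nines" arithmetic above can be reproduced with a few lines. The sketch assumes a 365-day year, and it uses the common steady-state approximation availability = MTBF / (MTBF + MTTR), which is an addition to these notes and treats MTBF as the uptime between failures. Class and method names are hypothetical.

```java
/** Illustrative availability arithmetic; all names are hypothetical. */
final class AvailabilityMath {

    static final double HOURS_PER_YEAR = 365 * 24; // 8,760 hours; leap years ignored

    /** Downtime budget per year, in hours, for a target such as 99.9 (percent). */
    static double downtimeHoursPerYear(double availabilityPercent) {
        return (100.0 - availabilityPercent) / 100.0 * HOURS_PER_YEAR;
    }

    /** Steady-state availability in percent; MTBF and MTTR must use the same unit. */
    static double availabilityPercent(double mtbf, double mttr) {
        return 100.0 * mtbf / (mtbf + mttr);
    }

    /** Fitness check: measured availability must not fall below the target. */
    static boolean meetsTarget(double measuredPercent, double targetPercent) {
        return measuredPercent >= targetPercent;
    }
}
```

For example, downtimeHoursPerYear(99.9) evaluates to about 8.76 hours, matching the table of nines above.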

6.5 Structural Measures

Structural measures are derived from static analysis of code and architecture. They are fast, deterministic, and effective for enforcing modularity, maintainability, and coupling characteristics.

Cyclomatic Complexity

  • Definition: a count of the linearly independent paths through a unit of code (function, method, class). Introduced by Thomas McCabe in 1976.
  • Formula: CC = E - N + 2P, where E = edges in the control flow graph, N = nodes, P = connected components. In practice: count the number of branching constructs (if, while, for, case, &&, ||) + 1.
  • Interpretation:
    • CC 1-5: simple, low risk.
    • CC 6-10: moderate complexity, manageable.
    • CC 11-20: high complexity; strong candidate for refactoring.
    • CC > 20: very high; almost always a maintenance and testability problem.
  • Why it matters architecturally: highly complex code is hard to test (many paths), hard to understand, and risky to change — all of which degrade agility and reliability characteristics.
  • Fitness function form: “No method in the codebase may have cyclomatic complexity > 15. No class may have total complexity > 50.”
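A short worked example may help tie the formula to the shortcut. Consider a method with one if/else followed by one while loop. Its control flow graph has 6 nodes (if condition, then branch, else branch, loop condition, loop body, exit) and 7 edges, so CC = 7 - 6 + 2(1) = 3, which equals the 2 decision points plus 1. The helper below is a sketch with hypothetical names; real tools derive the counts from parsed source.

```java
/** Illustrative cyclomatic complexity arithmetic; all names are hypothetical. */
final class CyclomaticComplexity {

    /** McCabe's formula: CC = E - N + 2P. */
    static int fromGraph(int edges, int nodes, int connectedComponents) {
        return edges - nodes + 2 * connectedComponents;
    }

    /** Shortcut for a single method: number of branching constructs plus one. */
    static int fromDecisionPoints(int decisionPoints) {
        return decisionPoints + 1;
    }

    /** Maps a value to the interpretation bands listed in these notes. */
    static String riskBand(int cc) {
        if (cc <= 5) return "simple";
        if (cc <= 10) return "moderate";
        if (cc <= 20) return "high";
        return "very high";
    }

    /** Fitness check: the value must not exceed the configured per-method limit. */
    static boolean withinLimit(int cc, int limit) {
        return cc <= limit;
    }
}
```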

Coupling Metrics

Afferent Coupling (Ca)

  • The number of external components that depend on a given component (incoming dependencies).
  • High Ca = high impact of change; many things break if this component changes.
  • Also called “fan-in.”

Efferent Coupling (Ce)

  • The number of external components that a given component depends on (outgoing dependencies).
  • High Ce = this component is fragile; it is affected by changes in many external components.
  • Also called “fan-out.”

Instability (I)

  • I = Ce / (Ca + Ce). Range: 0 (maximally stable) to 1 (maximally unstable).
  • A component with high instability is easy to change (nothing depends on it) but is itself fragile.
  • Stable Dependencies Principle: depend in the direction of stability (depend on stable components, not unstable ones).

Abstractness (A)

  • Ratio of abstract types (interfaces, abstract classes) to total types in a component.
  • A = 0: entirely concrete. A = 1: entirely abstract.

Distance from the Main Sequence (D)

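  • D = |A + I - 1|. Measures how far a component lies from the main sequence, the line A + I = 1 on which abstractness and instability are in balance. Range: 0 (on the main sequence) to 1 (as far from it as possible).
  • Zone of pain: low abstractness + high stability (A and I both near 0). Concrete code that many components depend on: rigid and hard to change.
  • Zone of uselessness: high abstractness + low stability (A and I both near 1). Abstractions that nothing depends on.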
  • Fitness function form: “No component may have D > 0.7.”
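The coupling formulas above translate directly into code. This is a minimal sketch with hypothetical names; treating a component with no dependencies in either direction as I = 0, and an empty component as A = 0, are assumptions made here to avoid division by zero.

```java
/** Illustrative coupling metrics; all names are hypothetical. */
final class CouplingMetrics {

    /** I = Ce / (Ca + Ce); an isolated component is treated as I = 0 (assumption). */
    static double instability(int afferent, int efferent) {
        int total = afferent + efferent;
        return total == 0 ? 0.0 : (double) efferent / total;
    }

    /** A = abstract types / total types; an empty component is treated as A = 0 (assumption). */
    static double abstractness(int abstractTypes, int totalTypes) {
        return totalTypes == 0 ? 0.0 : (double) abstractTypes / totalTypes;
    }

    /** D = |A + I - 1|. */
    static double distance(double abstractness, double instability) {
        return Math.abs(abstractness + instability - 1.0);
    }

    /** Fitness check: D must not exceed the limit (0.7 in the example rule above). */
    static boolean withinDistance(double distance, double limit) {
        return distance <= limit;
    }
}
```

A component with Ca = 3, Ce = 1 and 2 abstract types out of 8 has I = 0.25, A = 0.25 and D = 0.5, which passes a 0.7 limit. A fully concrete component with Ca = 9, Ce = 1 has I = 0.1, A = 0 and D = 0.9: it sits in the zone of pain and fails.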

Dependency Analysis

Beyond numeric coupling metrics, structural fitness functions can enforce explicit dependency rules:

  • “The orders module may not depend on the payments module directly — only via an interface.”
  • “No class in the domain layer may import from the infrastructure layer.”
  • “All inter-service communication must go through the API gateway; direct service-to-service HTTP calls are prohibited.”

Tools: ArchUnit (Java), NDepend (.NET), custom AST analysis.
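For codebases where none of these tools fit, the "custom analysis" option can be as small as the sketch below. It assumes the dependency graph has already been extracted into a map from module name to the modules it depends on (how that extraction happens is out of scope), and all names, including the module names in the usage note, are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative dependency rule check over a pre-extracted module graph. */
final class DependencyRules {

    /**
     * Returns one "from -> to" entry for every declared dependency that
     * appears in the forbidden map, sorted for stable output.
     */
    static List<String> violations(Map<String, Set<String>> dependencies,
                                   Map<String, Set<String>> forbidden) {
        List<String> found = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : dependencies.entrySet()) {
            Set<String> banned = forbidden.getOrDefault(entry.getKey(), Set.of());
            for (String target : entry.getValue()) {
                if (banned.contains(target)) {
                    found.add(entry.getKey() + " -> " + target);
                }
            }
        }
        Collections.sort(found);
        return found;
    }
}
```

If domain depends on infrastructure while the rules forbid it, the result contains "domain -> infrastructure"; a build step can fail whenever the returned list is non-empty.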


6.6 Process Measures

Process measures reflect development and operational process quality. They are often overlooked as architectural fitness functions but are critical for agility, reliability, and maintainability characteristics.

Test Coverage

  • Definition: percentage of code lines (or branches, functions, paths) exercised by the automated test suite.
  • Why it matters: insufficient test coverage means that refactoring and architectural changes are risky — the system cannot verify that changes don’t break existing behavior.
  • Caveats: coverage is a necessary but not sufficient condition for test quality. 100% line coverage with no meaningful assertions is worse than 70% coverage with strong assertions.
  • Fitness function form: “Overall line coverage must remain above 80%. Coverage for the domain package must remain above 90%. Coverage may not decrease on any PR by more than 2 percentage points.”
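The floor-plus-ratchet rule above could be checked as follows. This sketch assumes the baseline (coverage on the target branch) and the current value (coverage on the PR) are supplied by whatever coverage tool the build uses; the names are hypothetical.

```java
/** Illustrative coverage gate; all names are hypothetical. */
final class CoverageGate {

    /**
     * Passes when current coverage stays above the floor and has not dropped
     * from the baseline by more than maxDropPoints percentage points.
     */
    static boolean passes(double baselinePercent, double currentPercent,
                          double floorPercent, double maxDropPoints) {
        boolean aboveFloor = currentPercent > floorPercent;
        boolean dropAcceptable = (baselinePercent - currentPercent) <= maxDropPoints;
        return aboveFloor && dropAcceptable;
    }
}
```

With an 80% floor and a 2-point allowance, a PR that moves coverage from 85% to 83.5% passes, one that moves it from 85% to 82% fails on the drop, and one that moves it from 81% to 79.5% fails on the floor.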

MTTR (Mean Time to Recovery)

  • Already introduced under operational measures, but MTTR is also a process measure — it reflects how well the team can respond to and resolve incidents.
  • MTTR is influenced by: observability quality, on-call procedures, runbook quality, deployment rollback speed, and feature flag availability.
  • Fitness function form: MTTR targets enforced through incident SLAs and post-incident review processes.

Change Failure Rate

  • Definition: the percentage of production deployments that result in a degradation requiring remediation (rollback, hotfix, or patch).
  • A DORA (DevOps Research and Assessment) metric.
  • High CFR indicates: insufficient testing, poor deployment practices, architectural brittleness, or inadequate pre-production environments.
  • Industry benchmarks (DORA): the published bands shift between annual reports, so treat these as approximate: Elite: < 5% CFR; High: 5-10%; Medium: 10-15%; Low: > 15%.
  • Fitness function form: “Monthly change failure rate must not exceed 10%. Any week with CFR > 15% triggers a mandatory retrospective.”
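As a rough sketch, the two thresholds in that rule can be computed from deployment counts. It assumes each deployment has already been labelled as failed or not (the hard part in practice), and the names are hypothetical.

```java
/** Illustrative change failure rate check; all names are hypothetical. */
final class ChangeFailureRate {

    /** CFR in percent; undefined (and rejected) when nothing was deployed. */
    static double percent(int failedDeployments, int totalDeployments) {
        if (totalDeployments <= 0 || failedDeployments < 0
                || failedDeployments > totalDeployments) {
            throw new IllegalArgumentException("need 0 <= failed <= total and total > 0");
        }
        return 100.0 * failedDeployments / totalDeployments;
    }

    /** Monthly gate from the example rule: CFR must not exceed 10%. */
    static boolean monthlyGatePasses(int failed, int total) {
        return percent(failed, total) <= 10.0;
    }

    /** Weekly trigger from the example rule: CFR above 15% requires a retrospective. */
    static boolean retrospectiveRequired(int failed, int total) {
        return percent(failed, total) > 15.0;
    }
}
```

For instance, 3 failed deployments out of 40 in a month is 7.5% and passes the gate, while 2 failures out of 10 in a week is 20% and triggers the retrospective.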

Deployment Frequency

  • Another DORA metric. How often code is successfully deployed to production.
  • Not directly a fitness function target, but a leading indicator of deployability (agility sub-characteristic).
  • Elite DORA performers deploy on demand (multiple times per day); low performers deploy monthly or less frequently.

Lead Time for Changes

  • Time from code commit to running in production.
  • Measures the combined effect of CI/CD efficiency, testing speed, approval processes, and deployment automation.
  • A fitness function can enforce CI pipeline duration: “The full CI pipeline must complete in < 15 minutes.”
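Lead time and the pipeline budget can both be expressed with java.time. The sketch below assumes commit and deployment timestamps are available from the CI system; using the median as the summary statistic is a choice made for this example, and the names are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Illustrative lead time helpers; all names are hypothetical. */
final class LeadTime {

    /** Lead time for one change: commit timestamp to production timestamp. */
    static Duration between(Instant committedAt, Instant deployedAt) {
        return Duration.between(committedAt, deployedAt);
    }

    /** Median of a non-empty list; averages the two middle values for even sizes. */
    static Duration median(List<Duration> leadTimes) {
        if (leadTimes.isEmpty()) {
            throw new IllegalArgumentException("no lead times");
        }
        List<Duration> sorted = new ArrayList<>(leadTimes);
        Collections.sort(sorted);
        int mid = sorted.size() / 2;
        if (sorted.size() % 2 == 1) {
            return sorted.get(mid);
        }
        return sorted.get(mid - 1).plus(sorted.get(mid)).dividedBy(2);
    }

    /** Fitness check: the pipeline must finish strictly within the budget. */
    static boolean pipelineWithinBudget(Duration pipelineDuration, Duration budget) {
        return pipelineDuration.compareTo(budget) < 0;
    }
}
```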

6.7 Fitness Function Examples

ArchUnit for Dependency Checks (Java)

ArchUnit is a Java library that allows dependency rules to be expressed as JUnit tests:

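import static com.tngtech.archunit.lang.syntax.ArchRuleDefinition.classes;
import static com.tngtech.archunit.lang.syntax.ArchRuleDefinition.noClasses;
import com.tngtech.archunit.junit.AnalyzeClasses;
import com.tngtech.archunit.junit.ArchTest;
import com.tngtech.archunit.lang.ArchRule;

// "com.example.shop" is a placeholder; point it at the application's root package.
@AnalyzeClasses(packages = "com.example.shop")
class ArchitectureRulesTest {
    // The domain layer must not reach into infrastructure code.
    @ArchTest
    static final ArchRule domainShouldNotDependOnInfrastructure =
        noClasses()
            .that().resideInAPackage("..domain..")
            .should().dependOnClassesThat()
            .resideInAPackage("..infrastructure..");

    // Service classes may be called only from controllers or other services.
    @ArchTest
    static final ArchRule servicesOnlyAccessedByControllersOrServices =
        classes()
            .that().resideInAPackage("..service..")
            .should().onlyBeAccessed()
            .byClassesThat().resideInAnyPackage("..controller..", "..service..");
}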

These tests run in CI on every commit. A violation fails the build immediately, before any human review.

Performance Benchmarks

Performance fitness functions typically use tools like Gatling, k6, or JMeter:

  • Define a load scenario (e.g., 500 concurrent users, 5-minute ramp-up, 10-minute sustained).
  • Assert on p99 response time, error rate, and throughput.
  • Run in a dedicated performance environment (staging with production-like data volume).
  • Triggered on merge to main or on a nightly schedule.

Example assertion: “The order submission endpoint must achieve p99 < 300ms and error rate < 0.1% under the standard load scenario.”

Security Scans

Security fitness functions integrate security tooling into the pipeline:

  • SAST (Static Application Security Testing): tools like SonarQube, Checkmarx, or Semgrep scan source code for known vulnerability patterns. Triggered on every commit.
  • DAST (Dynamic Application Security Testing): tools like OWASP ZAP exercise the running application. Triggered on deployment to staging.
  • Dependency vulnerability scanning: tools like OWASP Dependency-Check or Snyk verify that no dependency with a known CVE above a severity threshold is used. Triggered on every build.
  • Secrets detection: tools like TruffleHog or Gitleaks scan commits for accidentally committed credentials.

Chaos Engineering for Reliability

Chaos engineering fitness functions verify that the system maintains its availability and reliability characteristics in the presence of failures:

  • Tool examples: Netflix Chaos Monkey, Gremlin, AWS Fault Injection Simulator.
  • Experiment examples:
    • Kill a random service instance; verify the system continues serving requests within the availability SLA.
    • Introduce 500ms latency on a downstream dependency; verify upstream does not cascade into timeout failures.
    • Saturate CPU on one node; verify load balancer routes traffic away.
  • Execution timing: temporal (nightly or weekly), not triggered on every commit — these experiments affect running systems.
  • The distinction from testing: chaos engineering verifies emergent system behavior under failure, not the correctness of individual components. It is a holistic fitness function.

6.8 Fitness Function Governance in Practice

Placement in CI/CD

| Fitness Function Type | Typical CI/CD Stage |
|---|---|
| Static analysis (ArchUnit, complexity) | Pre-merge / PR check |
| Unit-level fitness functions | Pre-merge / PR check |
| Dependency vulnerability scans | Build stage |
| Integration-level fitness functions | Post-merge to main |
| Performance benchmarks | Nightly or pre-release |
| Chaos engineering experiments | Nightly or weekly |
| Security penetration tests | Release gate |

Fitness Function Registry

Richards & Ford recommend maintaining a fitness function registry — a document or tool that maps each architectural characteristic to its fitness functions, owners, thresholds, and last-known status. This provides:

  • Visibility into which characteristics are actually governed.
  • Accountability (each fitness function has an owner).
  • Auditability (history of threshold changes and failures).
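One possible shape for such a registry is sketched below. The fields mirror the ones listed above (characteristic, fitness function, owner, threshold, last-known status); everything else, including the class names and the two queries, is a hypothetical illustration and not a structure prescribed by the book.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative in-memory registry; all names are hypothetical. */
final class FitnessFunctionRegistry {

    /** One registry row: which function governs which characteristic, and who owns it. */
    static final class Entry {
        final String characteristic;
        final String fitnessFunction;
        final String owner;
        final String threshold;
        final String lastStatus;

        Entry(String characteristic, String fitnessFunction, String owner,
              String threshold, String lastStatus) {
            this.characteristic = characteristic;
            this.fitnessFunction = fitnessFunction;
            this.owner = owner;
            this.threshold = threshold;
            this.lastStatus = lastStatus;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    void register(Entry entry) {
        entries.add(entry);
    }

    /** Visibility: identified characteristics that have no fitness function registered. */
    Set<String> ungoverned(Set<String> identifiedCharacteristics) {
        Set<String> remaining = new TreeSet<>(identifiedCharacteristics);
        for (Entry entry : entries) {
            remaining.remove(entry.characteristic);
        }
        return remaining;
    }

    /** Accountability: fitness functions owned by a given team, in registration order. */
    List<String> functionsOwnedBy(String owner) {
        List<String> owned = new ArrayList<>();
        for (Entry entry : entries) {
            if (entry.owner.equals(owner)) {
                owned.add(entry.fitnessFunction);
            }
        }
        return owned;
    }
}
```

Auditability would additionally need a history of threshold changes and results, which this sketch leaves out.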

Evolving Fitness Functions

Fitness functions must evolve with the architecture:

  • When a characteristic is retired, remove its fitness functions.
  • When a threshold becomes too easy to meet (everyone passes), tighten it.
  • When a fitness function produces too many false positives, tune it rather than disable it.
  • When a new architectural risk emerges, write a new fitness function before the risk becomes a problem.

Trade-offs and Decision Framework

| Option | Pros | Cons | When to Choose |
|---|---|---|---|
| Full automation (all fitness functions in CI) | Immediate feedback, consistent enforcement, scales to any team size | Upfront investment in tooling; some tests require environment provisioning | Default for all production systems; invest in this |
| Manual governance (architecture review boards) | Low tooling cost; flexible judgment | Inconsistent, doesn’t scale, high latency, opinion-based | Only for very small teams or one-off decisions; not for ongoing governance |
| Static-only fitness functions | Fast, cheap, deterministic | Misses runtime characteristics (performance, availability) | Good starting point; supplement with dynamic over time |
| Dynamic-only fitness functions | Catches real runtime behavior | Slow, environment-dependent, non-deterministic | Never use alone; combine with static |
| Strict thresholds (zero tolerance) | Maximum enforcement | Can block legitimate work; creates adversarial culture | For critical characteristics only (e.g., security, data integrity) |
| Advisory thresholds (warn, don’t fail) | Low friction | Warnings are ignored; no actual governance | Use only during adoption period; transition to hard gates quickly |

Common Antipatterns

Governance Theater: fitness functions exist in CI but are never allowed to fail (thresholds are set so leniently that nothing ever violates them). Result: false sense of governance with no actual protection. Fix: set thresholds based on the actual characteristic requirement, not on what the current code happens to achieve.

Coverage Washing: achieving high test coverage with tests that have no meaningful assertions — they execute code paths but verify nothing. Metric looks healthy; protection is absent. Fix: supplement coverage metrics with mutation testing (Pitest, Stryker) that verifies assertions actually catch defects.

Fitness Function Debt: fitness functions are never updated after initial creation; thresholds become irrelevant, tools become outdated, tests become flaky. Fix: treat fitness functions as first-class production code; schedule regular reviews.

Holistic Overload: running expensive holistic fitness functions (full chaos experiments, load tests) on every commit, slowing the pipeline to the point where developers bypass it. Fix: tier fitness functions by cost — static and fast tests on every commit, expensive holistic tests nightly or pre-release.

Single-Dimension Governance: only governing one dimension of a characteristic (e.g., only measuring average response time, not percentiles) and missing real degradation. Fix: always measure at multiple percentiles and at multiple load levels.

Siloed Fitness Functions: fitness functions owned by the platform team that no one else can modify or understand. When they fail, developers don’t know how to fix the violation. Fix: fitness functions should be co-owned and co-understood by the development team; treat them as part of the definition of done.


Key Takeaways

  1. Fitness Function Definition: an architectural fitness function is any objective, ideally automated mechanism that measures whether an architectural characteristic is being maintained.
  2. Manual Governance Fails: manual architectural governance does not scale beyond small teams; it is inconsistent, high-latency, and opinion-based. Automation is the only path to reliable governance.
  3. Atomic vs. Holistic: atomic fitness functions test individual components in isolation; holistic fitness functions test emergent system properties. Both are required for complete governance.
  4. Triggered vs. Temporal: triggered fitness functions run in response to events (commits, deployments); temporal fitness functions run on a schedule. Temporal ones are critical for characteristics that degrade over time regardless of code changes.
  5. Cyclomatic Complexity: measures the number of independent paths through code; high complexity (> 10-15) degrades testability, maintainability, and reliability. A standard static fitness function target.
  6. Coupling Metrics: afferent coupling (Ca), efferent coupling (Ce), instability (I), and distance from the main sequence (D) provide measurable targets for modularity governance.
  7. Percentiles over Averages: for performance characteristics, p99 and p999 percentile response times are far more meaningful than averages because averages hide the tail latency that real users experience.
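  8. DORA Metrics as Process Measures: change failure rate, deployment frequency, lead time for changes, and MTTR are process-level measures for governing agility and reliability characteristics. Some (such as change failure rate) make direct fitness function targets; others (such as deployment frequency) serve as leading indicators.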
  9. Chaos Engineering: validates reliability and availability characteristics by deliberately introducing failures into running systems; produces evidence that the architecture actually meets its resilience targets, not just that components work in isolation.
  10. Fitness Functions Must Evolve: they are not set-and-forget; they must be maintained, tuned, and retired as the architecture evolves, just like production code.

Related Chapters and Further Reading

  • Chapter 5: Identifying Architectural Characteristics (prerequisite: characteristics must be identified before they can be governed)
  • Chapter 4: Defining Architectural Characteristics (taxonomy)
  • “Accelerate” by Forsgren, Humble & Kim — foundational source for DORA metrics
  • “Chaos Engineering” by Basiri et al. — detailed treatment of chaos experiments
  • ArchUnit documentation: https://www.archunit.org
  • OWASP Testing Guide — security fitness function patterns
  • Netflix Tech Blog — original chaos engineering practices

Last Updated: 2026-05-29