Chapter 6: Measuring and Governing Architectural Characteristics

Status: Notes complete

Overview

Identifying architectural characteristics (Chapter 5) is only half the battle. Without measurement and governance, characteristics degrade silently over time as development pressure, shortcuts, and entropy accumulate. Chapter 6 introduces fitness functions — the primary mechanism for making architectural governance automated, objective, and continuous. The chapter also covers the categories of metrics that feed fitness functions and explains why manual governance invariably fails at scale.

6.1 Why Governance is Necessary

The Entropy Problem

Left unmanaged, software architectures drift from their intended structure. This happens because:

Developers make locally reasonable decisions that violate global architectural constraints.
Time pressure causes “temporary” shortcuts that become permanent.
No one is explicitly responsible for monitoring characteristics after the initial design.
Architecture documentation becomes stale and is not consulted during development.

Why Manual Governance Fails

Richards & Ford are explicit: manual governance does not scale. The reasons are structural, not about individual effort:

Human review is inconsistent: different reviewers apply different standards; the same reviewer applies different standards on different days.
Pull request volume: in a team of 20+ developers committing daily, no architect can review every change for architectural compliance.
Latency: manual review finds violations after the fact, when they are expensive to fix (often after merge to main).
Knowledge decay: the architectural rationale for a constraint is often not documented; over time, even well-intentioned reviewers don’t know what they’re protecting.
No objective baseline: without a metric, “this violates our performance characteristic” is an opinion, not a verifiable claim.

The solution is to automate governance through fitness functions — objective, executable checks that run as part of the CI/CD pipeline.

6.2 Fitness Functions — Definition

Core Definition

An architectural fitness function is any mechanism that provides an objective, measurable signal about whether one or more architectural characteristics are being met.

Key properties:

Objective: produces a pass/fail or numeric result, not an opinion.
Automated (ideally): runs without human intervention, typically in CI/CD.
Targeted: each fitness function targets a specific characteristic or sub-characteristic.
Evolutionary: fitness functions evolve with the architecture; outdated ones are removed, new ones are added.

The term is borrowed from evolutionary computing, where a fitness function evaluates how well a candidate solution satisfies optimization criteria. Here, the “candidate solution” is the current state of the codebase/architecture, and the “optimization criteria” are the architectural characteristics.

Fitness Functions vs. Unit Tests

Fitness functions are not the same as unit tests:

Unit tests verify functional correctness (does this function return the right value?).
Fitness functions verify architectural correctness (does this system still have the structural properties we designed for?).
Some fitness functions use testing frameworks (e.g., JUnit), but their purpose is architectural, not functional.

6.3 Types of Fitness Functions

By Scope

Atomic Fitness Functions

Execute against a single, isolated component or module.
Run quickly; typically part of unit or integration test suites.
Example: a test that checks a single module’s cyclomatic complexity does not exceed 10.
Example: an ArchUnit test that verifies a specific package has no outbound dependencies to a restricted layer.

Holistic Fitness Functions

Execute against the entire system or a significant integrated portion.
Verify characteristics that emerge only at the system level (e.g., overall throughput, end-to-end latency, inter-service coupling).
Example: a performance benchmark that exercises the full request path under load.
Example: a chaos engineering experiment that verifies the system remains available when a service is killed.
Usually slower to run; may not run on every commit.

By Execution Timing

Triggered Fitness Functions

Execute in response to a specific event: a commit, a pull request, or a deployment.
The most common type in CI/CD pipelines.
Example: ArchUnit dependency checks triggered on every pull request.
Example: security scan triggered on every build.

Temporal Fitness Functions

Execute on a schedule, independent of code changes.
Used for characteristics that change over time due to external factors or cumulative drift.
Example: weekly automated penetration test.
Example: nightly chaos engineering experiment to verify fault tolerance.
Example: monthly test of backup restoration (recoverability fitness function).

By Analysis Method

Static Fitness Functions

Analyze the structure of the code or architecture without executing it.
Examples: dependency analysis, complexity metrics, coupling measurements, code style checks.
Fast, deterministic, easy to run on every commit.
Tools: ArchUnit (Java), NDepend (.NET), Structurizr (architecture-level).

Dynamic Fitness Functions

Verify characteristics through execution — the system must run to produce results.
Examples: performance benchmarks, load tests, availability monitors, chaos experiments.
Slower, may require environment provisioning, not always feasible on every commit.

6.4 Operational Measures

Operational measures reflect the runtime behavior of a deployed system. They are the primary source of fitness functions for performance, availability, scalability, and reliability characteristics.

Response Time

Definition: the elapsed time between a client sending a request and receiving a complete response.
Measurement levels: averages are misleading for latency-sensitive systems; percentiles (p50, p95, p99, p999) are preferred.
- p99 = 99% of requests complete within this time; 1% are slower.
- p999 = 99.9% of requests complete within this time; used for high-reliability systems to capture the “long tail.”
Why percentiles matter: average response time can look healthy while a significant minority of users experience unacceptable latency (the “long tail” problem).
Fitness function form: “p99 response time for the order submission endpoint must be < 300ms under 500 concurrent users in the staging environment.”
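The gap between average and percentile can be made concrete with a small calculation. The following is a minimal sketch, assuming latency samples are already collected as milliseconds; the class and method names are illustrative, and the nearest-rank definition used here is only one of several percentile definitions (monitoring tools often interpolate or use histograms instead).

```java
import java.util.Arrays;

/** Illustrative helper; the class and method names are not from the chapter. */
final class LatencyStats {
    private LatencyStats() {}

    /** Arithmetic mean of the samples, in milliseconds. */
    static double mean(long[] samplesMillis) {
        return Arrays.stream(samplesMillis).average().orElseThrow();
    }

    /** Nearest-rank percentile: the smallest sample that at least pct% of samples do not exceed. */
    static long percentile(long[] samplesMillis, double pct) {
        if (samplesMillis.length == 0 || pct <= 0.0 || pct > 100.0) {
            throw new IllegalArgumentException("need at least one sample and 0 < pct <= 100");
        }
        long[] sorted = samplesMillis.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(pct * sorted.length / 100.0); // 1-based rank
        return sorted[rank - 1];
    }

    /** Fitness check in the form used above: p99 must be strictly below the limit. */
    static boolean p99Below(long[] samplesMillis, long limitMillis) {
        return percentile(samplesMillis, 99.0) < limitMillis;
    }
}
```

With 98 samples at 100 ms and 2 samples at 2,000 ms, the mean is 138 ms and looks healthy, while p99 is 2,000 ms and the 300 ms check fails.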

Throughput

Definition: the number of requests (or transactions, messages, events) processed per unit of time.
Typically measured as requests per second (RPS) or transactions per second (TPS).
Related to but distinct from response time: a system can have high throughput with high latency if it uses buffering/batching.
Fitness function form: “The order processing pipeline must sustain 1,000 orders per minute without queue backlog growth.”

Availability Metrics

Availability: percentage of time the system is operational and accessible.
- Expressed as “nines”: 99% = ~87.6 hours downtime/year; 99.9% = ~8.76 hours; 99.99% = ~52.6 minutes; 99.999% = ~5.26 minutes.
MTTR (Mean Time to Recovery): average time to restore service after a failure. Measures recoverability.
MTBF (Mean Time Between Failures): average time between failure events. Measures reliability.
Error rate: percentage of requests resulting in errors (5xx responses, timeouts, exceptions).
Fitness function form: “Monthly availability must not fall below 99.9%; MTTR for P1 incidents must not exceed 30 minutes.”
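The “nines” figures above are plain arithmetic on the period length. A minimal sketch of that arithmetic follows, assuming availability is given as a percentage and downtime is tracked as a total duration per period; the class and method names are illustrative.

```java
import java.time.Duration;

/** Illustrative helper; the class and method names are not from the chapter. */
final class AvailabilityBudget {
    private AvailabilityBudget() {}

    /** Downtime allowed within the period for a target availability such as 99.9 (percent). */
    static Duration allowedDowntime(double availabilityPercent, Duration period) {
        if (availabilityPercent < 0.0 || availabilityPercent > 100.0) {
            throw new IllegalArgumentException("availability must be between 0 and 100");
        }
        double downtimeFraction = (100.0 - availabilityPercent) / 100.0;
        return Duration.ofMillis(Math.round(period.toMillis() * downtimeFraction));
    }

    /** Observed availability in percent, given the total downtime within the period. */
    static double observedAvailability(Duration downtime, Duration period) {
        return 100.0 * (1.0 - (double) downtime.toMillis() / period.toMillis());
    }
}
```

For a 30-day month, a 99.9% target leaves a budget of 43 minutes 12 seconds, so a single 30-minute outage still passes the monthly check while two such outages would not.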

6.5 Structural Measures

Structural measures are derived from static analysis of code and architecture. They are fast, deterministic, and effective for enforcing modularity, maintainability, and coupling characteristics.

Cyclomatic Complexity

Definition: a count of the linearly independent paths through a unit of code (function, method, class). Introduced by Thomas McCabe.
Formula: CC = E - N + 2P, where E = edges in the control flow graph, N = nodes, P = connected components. In practice: count the number of branching constructs (if, while, for, case, &&, ||) + 1.
Interpretation:
- CC 1-5: simple, low risk.
- CC 6-10: moderate complexity, manageable.
- CC 11-20: high complexity; strong candidate for refactoring.
- CC > 20: very high; almost always a maintenance and testability problem.
Why it matters architecturally: highly complex code is hard to test (many paths), hard to understand, and risky to change — all of which degrade agility and reliability characteristics.
Fitness function form: “No method in the codebase may have cyclomatic complexity > 15. No class may have total complexity > 50.”
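A worked example shows that the graph formula and the counting shortcut agree. The sketch below is illustrative only (the names are not from the chapter): `countPositives` has two decision points, and its control flow graph has 5 nodes (initialisation, loop test, `if` test, increment, return) and 6 edges.

```java
/** Illustrative helper; the class and method names are not from the chapter. */
final class CyclomaticComplexity {
    private CyclomaticComplexity() {}

    /** McCabe's formula: CC = E - N + 2P. */
    static int fromGraph(int edges, int nodes, int connectedComponents) {
        return edges - nodes + 2 * connectedComponents;
    }

    /** Shortcut for a single routine: number of decision points + 1. */
    static int fromDecisionPoints(int decisionPoints) {
        return decisionPoints + 1;
    }

    /** Sample routine with two decision points (one loop, one if), so CC = 3. */
    static int countPositives(int[] values) {
        int count = 0;
        for (int v : values) {   // decision point 1: continue or leave the loop
            if (v > 0) {         // decision point 2: take or skip the increment
                count++;
            }
        }
        return count;
    }
}
```

Both `fromGraph(6, 5, 1)` and `fromDecisionPoints(2)` give 3, which sits in the "simple, low risk" band.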

Coupling Metrics

Afferent Coupling (Ca)

The number of external components that depend on a given component (incoming dependencies).
High Ca = high impact of change; many things break if this component changes.
Also called “fan-in.”

Efferent Coupling (Ce)

The number of external components that a given component depends on (outgoing dependencies).
High Ce = this component is fragile; it is affected by changes in many external components.
Also called “fan-out.”

Instability (I)

I = Ce / (Ca + Ce). Range: 0 (maximally stable) to 1 (maximally unstable).
A component with high instability is easy to change (little or nothing depends on it) but fragile, because it is exposed to changes in everything it depends on.
Stable Dependencies Principle: depend in the direction of stability (depend on components that are more stable than you are, not less).

Abstractness (A)

Ratio of abstract types (interfaces, abstract classes) to total types in a component.
A = 0: entirely concrete. A = 1: entirely abstract.

Distance from the Main Sequence (D)

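D = |A + I - 1|. Measures how far a component is from the main sequence, the line A + I = 1 on which abstractness and instability are balanced.
Zone of pain: low abstractness + high stability (A and I both near 0): concrete and heavily depended upon, so rigid and hard to change.
Zone of uselessness: high abstractness + low stability (A and I both near 1): abstract but with nothing depending on it.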
Fitness function form: “No component may have D > 0.7.”
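The formulas for I, A, and D combine into a short calculation. This is a minimal sketch, assuming the counts (Ca, Ce, abstract and total types) come from a static analysis tool; the names are illustrative, and returning 0 for an empty or isolated component is an assumption made here, since the ratios are undefined in that case.

```java
/** Illustrative helper; the class and method names are not from the chapter. */
final class CouplingMetrics {
    private CouplingMetrics() {}

    /** I = Ce / (Ca + Ce). An isolated component (Ca + Ce = 0) is reported as 0 by assumption. */
    static double instability(int afferent, int efferent) {
        int total = afferent + efferent;
        return total == 0 ? 0.0 : (double) efferent / total;
    }

    /** A = abstract types / total types. An empty component is reported as 0 by assumption. */
    static double abstractness(int abstractTypes, int totalTypes) {
        return totalTypes == 0 ? 0.0 : (double) abstractTypes / totalTypes;
    }

    /** D = |A + I - 1|. */
    static double distanceFromMainSequence(double abstractness, double instability) {
        return Math.abs(abstractness + instability - 1.0);
    }

    /** Fitness check in the form used above: pass only when D does not exceed the limit. */
    static boolean withinDistance(double abstractness, double instability, double maxDistance) {
        return distanceFromMainSequence(abstractness, instability) <= maxDistance;
    }
}
```

A component with Ca = 9, Ce = 1, and 1 abstract type out of 10 has I = 0.1 and A = 0.1, so D = 0.8: it sits in the zone of pain and fails a D > 0.7 check.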

Dependency Analysis

Beyond numeric coupling metrics, structural fitness functions can enforce explicit dependency rules:

“The orders module may not depend on the payments module directly — only via an interface.”
“No class in the domain layer may import from the infrastructure layer.”
“All inter-service communication must go through the API gateway; direct service-to-service HTTP calls are prohibited.”

Tools: ArchUnit (Java), NDepend (.NET), custom AST analysis.
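Rules of the "X may not depend on Y" kind reduce to a forbidden-edge check over a dependency graph. The following is a minimal sketch in plain Java, assuming the module-level graph has already been extracted by some other tool; the module names and the map-based representation are hypothetical, and only direct dependencies are checked.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative helper; the class and method names are not from the chapter. */
final class DependencyRules {
    private DependencyRules() {}

    /**
     * Returns one message per forbidden direct dependency that is actually present.
     * Both maps go from a module name to a set of target module names.
     */
    static List<String> violations(Map<String, Set<String>> dependencies,
                                   Map<String, Set<String>> forbidden) {
        List<String> found = new ArrayList<>();
        for (Map.Entry<String, Set<String>> rule : forbidden.entrySet()) {
            Set<String> actual = dependencies.getOrDefault(rule.getKey(), Set.of());
            for (String target : rule.getValue()) {
                if (actual.contains(target)) {
                    found.add(rule.getKey() + " must not depend on " + target);
                }
            }
        }
        return found;
    }
}
```

A build step would fail when the returned list is non-empty and print the messages, which tells the developer exactly which edge to remove.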

6.6 Process Measures

Process measures reflect development and operational process quality. They are often overlooked as architectural fitness functions but are critical for agility, reliability, and maintainability characteristics.

Test Coverage

Definition: percentage of code lines (or branches, functions, paths) exercised by the automated test suite.
Why it matters: insufficient test coverage means that refactoring and architectural changes are risky — the system cannot verify that changes don’t break existing behavior.
Caveats: coverage is a necessary but not sufficient condition for test quality. 100% line coverage with no meaningful assertions is worse than 70% coverage with strong assertions.
Fitness function form: “Overall line coverage must remain above 80%. Coverage for the domain package must remain above 90%. Coverage may not decrease on any PR by more than 2 percentage points.”
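The three coverage conditions above can be evaluated together once the numbers are available. This is a minimal sketch, assuming the coverage percentages have already been read from a coverage report and the base-branch value is known; the names are illustrative, and treating "above 80%" as "not below 80%" is an interpretation made here.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative helper; the class and method names are not from the chapter. */
final class CoverageGate {
    static final double MIN_OVERALL_PERCENT = 80.0;
    static final double MIN_DOMAIN_PERCENT = 90.0;
    static final double MAX_DROP_POINTS = 2.0;

    private CoverageGate() {}

    /** Returns one message per violated condition; an empty list means the gate passes. */
    static List<String> violations(double overallPercent, double domainPercent,
                                   double baselineOverallPercent) {
        List<String> found = new ArrayList<>();
        if (overallPercent < MIN_OVERALL_PERCENT) {
            found.add("overall line coverage is below 80%");
        }
        if (domainPercent < MIN_DOMAIN_PERCENT) {
            found.add("domain package coverage is below 90%");
        }
        if (baselineOverallPercent - overallPercent > MAX_DROP_POINTS) {
            found.add("coverage dropped by more than 2 percentage points");
        }
        return found;
    }
}
```

A PR that moves overall coverage from 83% to 79.5% trips two conditions at once: the absolute floor and the maximum drop.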

MTTR (Mean Time to Recovery)

Already introduced under operational measures, but MTTR is also a process measure — it reflects how well the team can respond to and resolve incidents.
MTTR is influenced by: observability quality, on-call procedures, runbook quality, deployment rollback speed, and feature flag availability.
Fitness function form: MTTR targets enforced through incident SLAs and post-incident review processes.

Change Failure Rate

Definition: the percentage of production deployments that result in a degradation requiring remediation (rollback, hotfix, or patch).
A DORA (DevOps Research and Assessment) metric.
High CFR indicates: insufficient testing, poor deployment practices, architectural brittleness, or inadequate pre-production environments.
Indicative bands (DORA’s published thresholds shift between report years, so check the current report before adopting them): Elite performers: < 5% CFR; High: 5-10%; Medium: 10-15%; Low: > 15%.
Fitness function form: “Monthly change failure rate must not exceed 10%. Any week with CFR > 15% triggers a mandatory retrospective.”
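The change failure rate check is a ratio plus two thresholds. A minimal sketch follows, assuming deployment and failure counts are taken from the deployment or incident tooling; the names are illustrative, and reporting 0% for a period with no deployments is an assumption made here.

```java
/** Illustrative helper; the class and method names are not from the chapter. */
final class ChangeFailureRate {
    private ChangeFailureRate() {}

    /** CFR in percent: failed deployments / total deployments. Zero deployments count as 0%. */
    static double percent(int deployments, int failedDeployments) {
        if (deployments < 0 || failedDeployments < 0 || failedDeployments > deployments) {
            throw new IllegalArgumentException("need 0 <= failedDeployments <= deployments");
        }
        return deployments == 0 ? 0.0 : 100.0 * failedDeployments / deployments;
    }

    /** Monthly gate from the text: CFR must not exceed 10%. */
    static boolean monthlyGatePasses(int deployments, int failedDeployments) {
        return percent(deployments, failedDeployments) <= 10.0;
    }

    /** Weekly rule from the text: a CFR above 15% triggers a retrospective. */
    static boolean weekNeedsRetrospective(int deployments, int failedDeployments) {
        return percent(deployments, failedDeployments) > 15.0;
    }
}
```

Three failures in 40 deployments is 7.5% and passes the monthly gate; two failures in a 10-deployment week is 20% and triggers the retrospective.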

Deployment Frequency

Another DORA metric. How often code is successfully deployed to production.
Not directly a fitness function target, but a leading indicator of deployability (agility sub-characteristic).
Elite DORA performers deploy on demand (multiple times per day); low performers deploy monthly or less frequently.

Lead Time for Changes

Time from code commit to running in production.
Measures the combined effect of CI/CD efficiency, testing speed, approval processes, and deployment automation.
A fitness function can enforce CI pipeline duration: “The full CI pipeline must complete in < 15 minutes.”

6.7 Fitness Function Examples

ArchUnit for Dependency Checks (Java)

ArchUnit is a Java library that allows dependency rules to be expressed as JUnit tests:

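import com.tngtech.archunit.junit.AnalyzeClasses;
import com.tngtech.archunit.junit.ArchTest;
import com.tngtech.archunit.lang.ArchRule;
import static com.tngtech.archunit.lang.syntax.ArchRuleDefinition.classes;
import static com.tngtech.archunit.lang.syntax.ArchRuleDefinition.noClasses;

// Assumes the archunit-junit5 module; "com.example.shop" is a placeholder root package.
@AnalyzeClasses(packages = "com.example.shop")
class ArchitectureRulesTest {
    // The domain layer must not reference anything in the infrastructure layer.
    @ArchTest
    static final ArchRule domainShouldNotDependOnInfrastructure =
        noClasses()
            .that().resideInAPackage("..domain..")
            .should().dependOnClassesThat()
            .resideInAPackage("..infrastructure..");

    // Only controllers and other services may access classes in the service layer.
    @ArchTest
    static final ArchRule servicesOnlyAccessedByControllersOrServices =
        classes()
            .that().resideInAPackage("..service..")
            .should().onlyBeAccessed()
            .byClassesThat().resideInAnyPackage("..controller..", "..service..");
}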

These tests run in CI on every commit. A violation fails the build immediately, before any human review.

Performance Benchmarks

Performance fitness functions typically use tools like Gatling, k6, or JMeter:

Define a load scenario (e.g., 500 concurrent users, 5-minute ramp-up, 10-minute sustained).
Assert on p99 response time, error rate, and throughput.
Run in a dedicated performance environment (staging with production-like data volume).
Triggered on merge to main or on a nightly schedule.

Example assertion: “The order submission endpoint must achieve p99 < 300ms and error rate < 0.1% under the standard load scenario.”

Security Scans

Security fitness functions integrate security tooling into the pipeline:

SAST (Static Application Security Testing): tools like SonarQube, Checkmarx, or Semgrep scan source code for known vulnerability patterns. Triggered on every commit.
DAST (Dynamic Application Security Testing): tools like OWASP ZAP exercise the running application. Triggered on deployment to staging.
Dependency vulnerability scanning: tools like OWASP Dependency-Check or Snyk verify that no dependency with a known CVE above a severity threshold is used. Triggered on every build.
Secrets detection: tools like TruffleHog or Gitleaks scan commits for accidentally committed credentials.

Chaos Engineering for Reliability

Chaos engineering fitness functions verify that the system maintains its availability and reliability characteristics in the presence of failures:

Tool examples: Netflix Chaos Monkey, Gremlin, AWS Fault Injection Simulator.
Experiment examples:
- Kill a random service instance; verify the system continues serving requests within the availability SLA.
- Introduce 500ms latency on a downstream dependency; verify upstream does not cascade into timeout failures.
- Saturate CPU on one node; verify load balancer routes traffic away.
Execution timing: temporal (nightly or weekly), not triggered on every commit — these experiments affect running systems.
The distinction from testing: chaos engineering verifies emergent system behavior under failure, not the correctness of individual components. It is a holistic fitness function.
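The shape of the first experiment (kill an instance, then verify requests are still served) can be illustrated without any infrastructure. The sketch below is a toy in-memory simulation, not a use of Chaos Monkey, Gremlin, or any real chaos tool; every name in it is hypothetical, and a real experiment would act on running services and measure real traffic.

```java
import java.util.List;
import java.util.Random;

/** Toy simulation of a "kill one instance" experiment; all names are hypothetical. */
final class ChaosSketch {
    private ChaosSketch() {}

    /** Stand-in for a service instance that can be marked as killed. */
    static final class Instance {
        final String name;
        boolean healthy = true;

        Instance(String name) {
            this.name = name;
        }
    }

    /** Round-robin router that skips unhealthy instances. */
    static final class Router {
        private final List<Instance> instances;
        private int next = 0;

        Router(List<Instance> instances) {
            this.instances = instances;
        }

        /** Returns true when some healthy instance served the request. */
        boolean handleRequest() {
            for (int i = 0; i < instances.size(); i++) {
                int index = (next + i) % instances.size();
                if (instances.get(index).healthy) {
                    next = (index + 1) % instances.size();
                    return true;
                }
            }
            return false;
        }
    }

    /** Kills one randomly chosen instance, sends requests, and returns the fraction served. */
    static double successRateAfterKill(List<Instance> instances, int requests, long seed) {
        if (instances.isEmpty() || requests <= 0) {
            throw new IllegalArgumentException("need at least one instance and one request");
        }
        Router router = new Router(instances);
        instances.get(new Random(seed).nextInt(instances.size())).healthy = false;
        int served = 0;
        for (int i = 0; i < requests; i++) {
            if (router.handleRequest()) {
                served++;
            }
        }
        return (double) served / requests;
    }
}
```

With three instances the success rate stays at 1.0 after the kill; with a single instance it drops to 0.0, which is the kind of system-level weakness a holistic fitness function is meant to expose.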

6.8 Fitness Function Governance in Practice

Placement in CI/CD

Fitness Function Type	Typical CI/CD Stage
Static analysis (ArchUnit, complexity)	Pre-merge / PR check
Unit-level fitness functions	Pre-merge / PR check
Dependency vulnerability scans	Build stage
Integration-level fitness functions	Post-merge to main
Performance benchmarks	Nightly or pre-release
Chaos engineering experiments	Nightly or weekly
Security penetration tests	Release gate

Fitness Function Registry

Richards & Ford recommend maintaining a fitness function registry — a document or tool that maps each architectural characteristic to its fitness functions, owners, thresholds, and last-known status. This provides:

Visibility into which characteristics are actually governed.
Accountability (each fitness function has an owner).
Auditability (history of threshold changes and failures).
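A registry can start as a small data structure before it becomes a tool. This is a minimal sketch, assuming Java 16 or later for records; the field set follows the description above (characteristic, fitness function, owner, threshold, last-known status), while the class, method, and example names are illustrative.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

/** Illustrative registry; the class and method names are not from the chapter. */
final class FitnessFunctionRegistry {
    /** One registry row linking a characteristic to a fitness function and its owner. */
    record Entry(String characteristic, String fitnessFunction, String owner,
                 String threshold, boolean lastRunPassed) {}

    private final List<Entry> entries;

    FitnessFunctionRegistry(List<Entry> entries) {
        this.entries = List.copyOf(entries);
    }

    /** Visibility: identified characteristics that have no fitness function registered. */
    Set<String> ungoverned(Set<String> identifiedCharacteristics) {
        Set<String> missing = new TreeSet<>(identifiedCharacteristics);
        entries.forEach(entry -> missing.remove(entry.characteristic()));
        return missing;
    }

    /** Accountability: entries whose last known status is a failure, with their owners. */
    List<Entry> failing() {
        return entries.stream()
                .filter(entry -> !entry.lastRunPassed())
                .collect(Collectors.toList());
    }
}
```

Comparing the registry against the characteristics identified in Chapter 5 shows immediately which ones are declared but not actually governed.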

Evolving Fitness Functions

Fitness functions must evolve with the architecture:

When a characteristic is retired, remove its fitness functions.
When a threshold becomes too easy to meet (everyone passes), tighten it.
When a fitness function produces too many false positives, tune it rather than disable it.
When a new architectural risk emerges, write a new fitness function before the risk becomes a problem.

Trade-offs and Decision Framework

Option	Pros	Cons	When to Choose
Full automation (all fitness functions in CI)	Immediate feedback, consistent enforcement, scales to any team size	Upfront investment in tooling; some tests require environment provisioning	Default for all production systems; invest in this
Manual governance (architecture review boards)	Low tooling cost; flexible judgment	Inconsistent, doesn’t scale, high latency, opinion-based	Only for very small teams or one-off decisions — not for ongoing governance
Static-only fitness functions	Fast, cheap, deterministic	Misses runtime characteristics (performance, availability)	Good starting point; supplement with dynamic over time
Dynamic-only fitness functions	Catches real runtime behavior	Slow, environment-dependent, non-deterministic	Never use alone; combine with static
Strict thresholds (zero tolerance)	Maximum enforcement	Can block legitimate work; creates adversarial culture	For critical characteristics only (e.g., security, data integrity)
Advisory thresholds (warn, don’t fail)	Low friction	Warnings are ignored; no actual governance	Use only during adoption period; transition to hard gates quickly

Common Antipatterns

Governance Theater: fitness functions exist in CI but are never allowed to fail (thresholds are so lenient that nothing ever violates them). Result: false sense of governance with no actual protection. Fix: set thresholds based on the actual characteristic requirement, not on what the current code happens to achieve.

Coverage Washing: achieving high test coverage with tests that have no meaningful assertions — they execute code paths but verify nothing. Metric looks healthy; protection is absent. Fix: supplement coverage metrics with mutation testing (Pitest, Stryker) that verifies assertions actually catch defects.

Fitness Function Debt: fitness functions are never updated after initial creation; thresholds become irrelevant, tools become outdated, tests become flaky. Fix: treat fitness functions as first-class production code; schedule regular reviews.

Holistic Overload: running expensive holistic fitness functions (full chaos experiments, load tests) on every commit, slowing the pipeline to the point where developers bypass it. Fix: tier fitness functions by cost — static and fast tests on every commit, expensive holistic tests nightly or pre-release.

Single-Dimension Governance: only governing one dimension of a characteristic (e.g., only measuring average response time, not percentiles) and missing real degradation. Fix: always measure at multiple percentiles and at multiple load levels.

Siloed Fitness Functions: fitness functions owned by the platform team that no one else can modify or understand. When they fail, developers don’t know how to fix the violation. Fix: fitness functions should be co-owned and co-understood by the development team; treat them as part of the definition of done.

Key Takeaways

Fitness Function Definition: an architectural fitness function is any objective, ideally automated mechanism that measures whether an architectural characteristic is being maintained.
Manual Governance Fails: manual architectural governance does not scale beyond small teams; it is inconsistent, high-latency, and opinion-based. Automation is the only path to reliable governance.
Atomic vs. Holistic: atomic fitness functions test individual components in isolation; holistic fitness functions test emergent system properties. Both are required for complete governance.
Triggered vs. Temporal: triggered fitness functions run in response to events (commits, deployments); temporal fitness functions run on a schedule. Temporal ones are critical for characteristics that degrade over time regardless of code changes.
Cyclomatic Complexity: measures the number of independent paths through code; high complexity (> 10-15) degrades testability, maintainability, and reliability. A standard static fitness function target.
Coupling Metrics: afferent coupling (Ca), efferent coupling (Ce), instability (I), and distance from the main sequence (D) provide measurable targets for modularity governance.
Percentiles over Averages: for performance characteristics, p99 and p999 percentile response times are far more meaningful than averages because averages hide the tail latency that real users experience.
DORA Metrics as Process Measures: change failure rate, lead time for changes, and MTTR supply process-level fitness function targets for agility and reliability characteristics; deployment frequency is a leading indicator of deployability rather than a direct target.
Chaos Engineering: validates reliability and availability characteristics by deliberately introducing failures into running systems; produces evidence that the architecture actually meets its resilience targets, not just that components work in isolation.
Fitness Functions Must Evolve: they are not set-and-forget; they must be maintained, tuned, and retired as the architecture evolves, just like production code.

Related Chapters and Further Reading

Chapter 5: Identifying Architectural Characteristics (prerequisite: characteristics must be identified before they can be governed)
Chapter 4: Defining Architectural Characteristics (taxonomy)
“Accelerate” by Forsgren, Humble & Kim — foundational source for DORA metrics
“Chaos Engineering” by Basiri et al. — detailed treatment of chaos experiments
ArchUnit documentation: https://www.archunit.org
OWASP Testing Guide — security fitness function patterns
Netflix Tech Blog — original chaos engineering practices

Last Updated: 2026-05-29

Study Notes by Niladri & AI
