Chapter 14: Doing the Right Thing

Tags: ddia-2e ethics bias privacy surveillance gdpr accountability

Status: Notes complete


Overview

Chapter 14 is the ethical capstone of DDIA 2nd Edition. Where previous chapters asked “can we build this?” and “how do we build this efficiently?”, this chapter asks the harder question: “should we build this, and if so, how do we build it responsibly?” Kleppmann and Riccomini take the position that data engineers and architects are not neutral technicians — the systems we build have real consequences for real people, and “just following the spec” is not a sufficient ethical position. The chapter covers three domains where data systems create ethical risk: predictive analytics (where algorithms make high-stakes decisions about people), privacy and surveillance (where data collection creates power asymmetries), and the structural forces — economic, legal, and historical — that shape how these risks manifest. This is an essay-style chapter, intentionally less technical than the rest of the book. Its purpose is to cultivate ethical judgment, not just technical skill.


Key Concepts

Predictive Analytics and Algorithmic Decision-Making

Predictive analytics applies statistical models to historical data to make predictions about future behavior or to classify individuals. When those predictions are used to make high-stakes decisions — who gets a loan, who gets hired, who gets parole, who gets flagged as a fraud risk — the algorithm becomes a consequential actor in people’s lives.

Bias and Discrimination

Bias in ML models is not a bug — it is a consequence of data: A model trained on historical data inherits the biases embedded in that history. If historical hiring data shows that most engineers hired at a company were men (because of historical discrimination and social barriers), a model trained to “predict success” will score male candidates higher. The model is not discriminating by design; it is discriminating by replication.

Concrete examples of algorithmic bias:

  • Facial recognition: Systems trained predominantly on lighter-skinned faces have significantly higher error rates on darker-skinned faces. The 2018 “Gender Shades” study by Joy Buolamwini (MIT Media Lab) and Timnit Gebru found error rates of up to 34.7% for darker-skinned women versus at most 0.8% for lighter-skinned men in commercial gender-classification products. When facial recognition systems with similar disparities are used in law enforcement, incorrect identifications can result in wrongful arrest.

  • Hiring algorithms: Reuters reported in 2018 that Amazon had built and then scrapped a resume screening algorithm after discovering that it systematically downgraded resumes containing the word “women’s” (as in “women’s chess club”). The model was trained on roughly a decade of applications that came predominantly from men, so it learned that “successful applicants” looked like past successful applicants.

  • Credit scoring: FICO scores and alternative credit scoring models can encode geographic proxies for race (ZIP code as a proxy for redlined neighborhoods), producing racially disparate outcomes even without using race as an input variable. This is sometimes called “proxy discrimination” — discrimination that operates through correlated proxies rather than protected characteristics directly.

  • Predictive policing: Systems like PredPol (now Geolitica) predict where crimes will occur based on historical crime data. But historical crime data reflects where policing occurred historically — over-policed communities show more recorded crime, leading to more predicted crime, leading to more policing. This feedback loop amplifies existing inequalities.

  • Recidivism prediction (COMPAS): A ProPublica investigation in 2016 found that COMPAS, a recidivism scoring tool used in criminal sentencing, falsely flagged Black defendants who did not go on to reoffend as high risk at nearly twice the rate of white defendants. The tool’s creators argued it was “equally accurate” overall, which illustrates that accuracy metrics can be satisfied while producing deeply unfair outcomes.

Why technical accuracy does not equal fairness: A model can achieve high accuracy overall while being systematically wrong for a specific demographic group that is underrepresented in the training data or whose outcomes were historically shaped by discrimination. Standard ML metrics (AUC, accuracy, precision, recall) are computed on aggregates that can mask group-level disparities.
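A minimal sketch of how an aggregate metric can hide a group-level failure. The data is synthetic, and the group labels and the `accuracy_by_group` helper are illustrative assumptions rather than anything taken from the chapter.

```python
from collections import defaultdict

def accuracy_by_group(records):
    """Return (overall_accuracy, {group: accuracy}) for (group, y_true, y_pred) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, y_true, y_pred in records:
        total[group] += 1
        correct[group] += int(y_true == y_pred)
    overall = sum(correct.values()) / sum(total.values())
    per_group = {g: correct[g] / total[g] for g in total}
    return overall, per_group

# Synthetic data (made up for illustration): 90 majority records, 10 minority records.
records = (
    [("majority", 1, 1)] * 88 + [("majority", 1, 0)] * 2    # 88 of 90 correct
    + [("minority", 1, 1)] * 5 + [("minority", 1, 0)] * 5   # 5 of 10 correct
)

overall, per_group = accuracy_by_group(records)
print(f"overall accuracy: {overall:.2f}")
for group, acc in per_group.items():
    print(f"{group}: {acc:.2f}")
```

Here the overall accuracy is 93%, yet the model is no better than a coin flip for the smaller group. Reporting every metric per group, not only in aggregate, is what surfaces this.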

Fairness is not a single technical property: There are mathematically incompatible definitions of algorithmic fairness — demographic parity (equal positive rates across groups), equalized odds (equal true/false positive rates), individual fairness (similar individuals treated similarly), and counterfactual fairness (prediction would be the same if protected characteristics were different). These cannot all be satisfied simultaneously when base rates differ between groups. This is not a technical problem to be solved; it is a values question to be decided by society.
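The tension between two of these definitions can be checked with a small worked example. This is a sketch on synthetic data: the two groups, their base rates (50% and 20%), and the `rates` helper are assumptions made for illustration.

```python
def rates(pairs):
    """Positive rate, true positive rate, and false positive rate for (y_true, y_pred) pairs."""
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    positives = sum(1 for t, _ in pairs if t == 1)
    negatives = len(pairs) - positives
    return {
        "positive_rate": (tp + fp) / len(pairs),
        "tpr": tp / positives,
        "fpr": fp / negatives,
    }

# A perfect classifier on two synthetic groups with different base rates.
perfect_a = rates([(1, 1)] * 50 + [(0, 0)] * 50)   # base rate 0.5
perfect_b = rates([(1, 1)] * 20 + [(0, 0)] * 80)   # base rate 0.2

# Equalized odds holds (same TPR and FPR), but demographic parity does not:
# the positive rates simply equal the differing base rates.
print(perfect_a, perfect_b)

# Forcing group B up to a 0.5 positive rate means approving 30 true negatives,
# which restores demographic parity but breaks equalized odds.
parity_b = rates([(1, 1)] * 20 + [(0, 1)] * 30 + [(0, 0)] * 50)
print(parity_b)
```

Even a classifier that makes no errors at all cannot satisfy both definitions here, so the choice between them cannot be settled by improving the model.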

Responsibility and Accountability

When an algorithm makes a harmful decision, who is responsible?

  • The engineer who wrote the code?
  • The data scientist who trained the model?
  • The product manager who specified the output?
  • The executive who deployed the system?
  • The company that profited?

The diffusion of responsibility: Algorithmic systems create diffuse accountability chains. Each participant can point elsewhere: the engineer implemented the spec, the data scientist optimized the metric, the product manager defined the use case, the executive approved the business case. No single person made a discriminatory decision; the discrimination emerges from the system.

DDIA 2E argues that this diffusion is not an accident — it is a structural feature of how algorithmic systems are built and deployed. Addressing it requires:

  1. Explicit accountability at each stage: Who is responsible for validating fairness at training? Who must approve deployment to high-stakes contexts? Who monitors for drift and bias over time?

  2. Documentation requirements: The EU AI Act (2024) requires high-risk AI systems to document training data, testing procedures, performance metrics by demographic group, and human oversight mechanisms. This is not a technical requirement; it is an accountability mechanism.

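  3. The “right to explanation”: GDPR Article 22 gives individuals the right not to be subject to a decision based solely on automated processing that significantly affects them. Whether GDPR also grants a full right to an explanation of an individual decision is debated: Articles 13 to 15 require “meaningful information about the logic involved,” and the non-binding Recital 71 mentions a right “to obtain an explanation.” In practice this pushes high-stakes systems toward model interpretability, a technical constraint imposed by an ethical and legal requirement.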
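To make the interpretability constraint concrete, here is a minimal sketch of a per-decision explanation for a linear scoring model. The feature names, weights, and the `explain` helper are hypothetical and not drawn from any real credit system; more complex models would need post-hoc attribution methods, and nothing here is a statement about what satisfies the law.

```python
import math

# Hypothetical weights for an illustrative model (assumption, not a real scorecard).
WEIGHTS = {"income_k": 0.04, "missed_payments": -0.9, "years_at_address": 0.15}
BIAS = -1.0

def score(applicant):
    """Approval probability from a logistic model over the weighted features."""
    z = BIAS + sum(WEIGHTS[f] * applicant[f] for f in WEIGHTS)
    return 1 / (1 + math.exp(-z))

def explain(applicant):
    """Per-feature contribution to the log-odds, most harmful feature first."""
    contributions = {f: WEIGHTS[f] * applicant[f] for f in WEIGHTS}
    return sorted(contributions.items(), key=lambda kv: kv[1])

applicant = {"income_k": 40, "missed_payments": 3, "years_at_address": 2}
print(f"approval probability: {score(applicant):.2f}")
for feature, contribution in explain(applicant):
    print(f"{feature}: {contribution:+.2f}")
```

For a linear model the contributions are exact, which is one reason simple models are sometimes preferred in high-stakes settings even when a more opaque model scores slightly better.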

Professional responsibility: The chapter invokes an analogy to other engineering professions. Civil engineers are legally and professionally responsible for the structural safety of buildings they design. Software engineers are not subject to comparable professional licensing or liability — but the chapter argues that, given the scale of harm that data systems can cause, a similar professional ethics framework may be appropriate and necessary.

Feedback Loops

Feedback loops occur when the output of a system influences the future input to the same system, creating self-reinforcing cycles.

Positive feedback loops amplifying bias:

  1. Predictive policing loop: Crime prediction → more police in predicted areas → more crimes discovered in those areas → crime data reinforces prediction → even more policing. Communities subjected to this loop face escalating surveillance regardless of actual crime rates.

  2. Content recommendation loop: Recommendation algorithm shows user content based on past engagement → user engages with shown content → algorithm learns that content type gets engagement → recommends more of it. If a user watches one extreme political video, the algorithm may recommend progressively more extreme content. YouTube’s recommendation algorithm has been widely criticized for driving users toward radicalization in this way, although empirical studies of the effect have reached mixed conclusions.

  3. Credit scoring loop: Low credit score → less access to credit → harder to build credit history → credit score stays low or falls. People born into poverty face structural barriers that perpetuate poverty, encoded in supposedly neutral financial algorithms.

  4. Labor market loop: Algorithm filters out resumes from certain schools or backgrounds → people in those groups don’t get hired → those schools don’t appear in the data for “successful employees” → filter is reinforced.

Breaking feedback loops requires external intervention: The loop self-reinforces by design. Breaking it requires deliberately counteracting the loop: randomized exploration (show content outside the predicted preferences), affirmative diversification (actively seek training data from underrepresented groups), human override points (require human review before feeding outputs back as training labels).
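The predictive policing loop and the effect of randomized exploration can be sketched with a small deterministic simulation. Everything here is an assumption made for illustration: two districts with identical true crime rates, a slight skew in historical records, a "patrol the predicted hotspot" policy, and expected values in place of random sampling.

```python
def simulate_patrols(days, explore=0.0):
    """Return recorded-crime counts for two districts after `days` of patrols.

    explore: fraction of patrol effort spread evenly, ignoring the prediction.
    """
    true_rate = [0.5, 0.5]     # identical underlying crime rates (assumption)
    recorded = [11.0, 10.0]    # slight skew in the historical records
    for _ in range(days):
        # Predict the hotspot from recorded crime, which depends on past patrols.
        hotspot = 0 if recorded[0] >= recorded[1] else 1
        effort = [explore / 2, explore / 2]
        effort[hotspot] += 1.0 - explore
        for d in range(2):
            # Crime is only recorded where patrols are present to observe it.
            recorded[d] += 10 * effort[d] * true_rate[d]
    return recorded

def share_of_first(recorded):
    return recorded[0] / sum(recorded)

naive = simulate_patrols(100)
uniform = simulate_patrols(100, explore=1.0)
print(f"initial share of district 0: {11 / 21:.2f}")
print(f"after naive loop:            {share_of_first(naive):.2f}")
print(f"after uniform exploration:   {share_of_first(uniform):.2f}")
```

With no exploration, a 52% share of historical records grows to about 98% even though both districts have the same true rate. Partial exploration (for example `explore=0.2`) reduces the skew but does not remove it, because raw counts still reflect where patrols were sent.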


Privacy and Tracking

Surveillance and the Ad-Tech Ecosystem

The modern internet is fundamentally a surveillance machine. The dominant business model — advertising funded by behavioral targeting — requires knowing as much as possible about each user’s interests, location, relationships, health status, financial situation, and psychological vulnerabilities.

How behavioral tracking works:

  1. First-party data: Collected directly by the service the user interacts with (Google knows what you search, Facebook knows who your friends are, Amazon knows what you buy).

  2. Third-party cookies and pixels: Tracking code embedded in millions of websites sends data to advertising networks (Google Ad Manager, The Trade Desk, Meta Audience Network) that can correlate your behavior across websites. A health information site that embeds a Facebook pixel is sending health-seeking behavior data to Meta.

  3. Cross-device tracking: Your phone, laptop, and smart TV can be linked by shared IP address, login credentials, or probabilistic fingerprinting. An advertiser can build a unified profile across all your devices.

  4. Data brokers: Companies like Acxiom, LexisNexis, and Experian collect, aggregate, and sell personal data from hundreds of sources: public records, loyalty programs, app data purchases, social media scraping. They maintain profiles on hundreds of millions of people that include name, address history, income estimate, health conditions, political affiliation, and psychological profiles.

  5. Real-time bidding (RTB): The millisecond-scale auction that occurs every time you load a web page. Your profile is broadcast to hundreds of bidders simultaneously; the highest bidder’s ad is shown. The broadcast itself is a data breach in slow motion — hundreds of companies receive your profile with each page load.

The asymmetry of information: The surveillance ecosystem is built on a radical information asymmetry. Companies know far more about users than users know about companies’ data practices. Privacy policies are deliberately incomprehensible — one study found that reading all privacy policies encountered in a year would take 76 work days.

The consent fiction: GDPR and CCPA are built on a model of informed consent — users agree to data collection by accepting terms of service. The chapter argues that this model is largely fictional in practice:

  • Consent under duress: “Accept cookies or leave the website” is not genuine consent when the website is essential (a government service, a professional tool, the only local news source). The power differential makes refusal impractical.

  • Uninformed consent: No user can meaningfully evaluate a cookie consent dialog that mentions data shared with “540 partners.” The scale makes comprehension impossible.

  • Consent fatigue: Users click “accept” on every privacy dialog as a cognitive shortcut. The design is deliberately exploitative — dark patterns (making “reject” harder to click than “accept”) are documented at scale.

  • Inability to withdraw in practice: Even if consent can technically be withdrawn, the practical effect is often limited. Data already shared with hundreds of third parties remains in their systems; consent withdrawal is prospective only.

Genuine choice requires genuine alternatives: Privacy-respecting alternatives must be viable, not inferior. The EU’s Digital Markets Act (2022) and competition law cases against Google and Meta address this at the structural level — dominant platforms that bundle surveillance with essential services foreclose genuine choice.

Privacy and Use of Data

The contextual integrity principle (Helen Nissenbaum): Privacy is not about secrecy — it is about appropriate information flow. Information disclosed in one context (medical consultation) violates privacy when used in a different context (employer hiring decision), even if the information itself was not secret.

This is why “if you have nothing to hide, you have nothing to fear” is a philosophically confused argument. People share information in specific contexts with specific norms. Using medical data for insurance pricing, using location data for immigration enforcement, or using purchasing data for political targeting all violate contextual norms even when the data was “voluntarily” shared.

Data repurposing at scale: The economic logic of data collection encourages collecting everything and figuring out uses later. But data collected for one purpose (improving app performance) easily becomes data used for another purpose (advertising targeting, law enforcement access, hostile foreign intelligence). DDIA 2E argues that data engineers should design for purpose limitation at the architectural level — collect only what is needed, retain only as long as needed, and make repurposing technically difficult, not just policy-prohibited.
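One way to make repurposing technically difficult is to bind each field to the purposes declared at collection time and to a retention period, then check both on every read. The sketch below is a minimal illustration: the `FieldPolicy` type, the policy table, and the purpose names are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass(frozen=True)
class FieldPolicy:
    purposes: frozenset     # purposes declared when the data was collected
    retention: timedelta    # how long the value may be kept

# Hypothetical policy table (assumption for illustration).
POLICIES = {
    "crash_log": FieldPolicy(frozenset({"debugging"}), timedelta(days=30)),
    "location": FieldPolicy(frozenset({"local_search"}), timedelta(days=1)),
}

class PurposeError(PermissionError):
    """Raised when a read falls outside the declared purpose or retention period."""

def read_field(record, field, purpose, now):
    policy = POLICIES[field]
    if purpose not in policy.purposes:
        raise PurposeError(f"{field} was not collected for {purpose}")
    if now - record["collected_at"] > policy.retention:
        raise PurposeError(f"{field} is past its retention period")
    return record[field]

collected = datetime(2026, 1, 1, tzinfo=timezone.utc)
record = {"location": (52.2, 0.12), "collected_at": collected}
now = collected + timedelta(hours=2)

print(read_field(record, "location", "local_search", now))
try:
    read_field(record, "location", "ad_targeting", now)
except PurposeError as err:
    print("blocked:", err)
```

A check in application code is only the weakest form of this idea. Stronger variants put the same policy into access controls, per-purpose encryption keys, or physically separate stores, so that a new use requires a visible change rather than a quiet query.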

Data as Assets and Power

Data is not just an asset — it is a source of structural power. Companies with comprehensive behavioral data on hundreds of millions of people have capabilities that:

  • Governments cannot replicate without surveillance infrastructure
  • Competitors cannot acquire without similar user bases
  • Individuals cannot opt out of without forgoing essential services

The asymmetry compounds over time: The more data a company has, the better its products become (more accurate recommendations, better fraud detection, more relevant search). Better products attract more users. More users generate more data. This virtuous cycle creates data moats — competitive advantages that are practically impossible for new entrants to overcome.

The political economy of data: Companies like Meta and Google have accumulated knowledge about human psychology, social networks, and political opinion at a scale that no previous institution in history has possessed. The chapter raises the question of whether this concentration of information is compatible with democratic society — the same question that led to antitrust action against Standard Oil, AT&T, and financial monopolies.

Data colonialism: The extraction of data from users in developing countries by corporations headquartered in the US and EU mirrors historical colonial resource extraction — value is extracted from the population but economic benefits flow elsewhere. The population whose data is extracted has little governance voice over how it is used.

Remembering the Industrial Revolution

The chapter draws an extended analogy to the Industrial Revolution — a historical moment when new technology created immense wealth while also causing immense harm (child labor, factory deaths, environmental destruction, urban poverty). Society’s response was not to reject industrialization but to regulate it: labor laws, workplace safety regulations, environmental standards, antitrust law.

The parallel to data systems:

  • Early industrialists argued that regulation would kill innovation; they were wrong
  • The harms of unregulated industry were real and required collective action to address
  • The regulation that emerged improved outcomes for everyone, including industry, over the long run
  • Individual conscience within companies was insufficient — structural regulation was necessary

The chapter’s argument: We are at an early stage of the “data industrial revolution.” Individual ethical behavior by engineers matters but is insufficient. Structural regulation — privacy law, algorithmic accountability law, competition law targeting data monopolies — is necessary and appropriate. Engineers should support, not resist, these regulatory efforts.

Legislation and Self-Regulation

Key regulatory frameworks in 2026:

GDPR (EU General Data Protection Regulation, 2018):

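  • Applies to organizations established in the EU and to any organization elsewhere that offers goods or services to, or monitors the behavior of, people in the EU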
  • Key rights: access, rectification, erasure, portability, objection to automated decision-making
  • Key obligations: lawful basis for processing, data minimization, purpose limitation, breach notification (72 hours), Data Protection Impact Assessment for high-risk processing
  • Enforcement: fines up to €20 million or 4% of global annual turnover, whichever is higher (turnover means total revenue, not profit)
  • Notable fines: Meta (€1.2B, 2023), Amazon (€746M, 2021), Google (€50M, 2019)

CCPA/CPRA (California, 2020/2023):

  • Right to know what data is collected and how it’s used
  • Right to delete personal information
  • Right to opt out of sale or sharing of personal information
  • Right to correct inaccurate information
  • CPRA amendments (effective 2023): added the right to limit use of sensitive personal information

EU AI Act (2024):

  • Risk-based classification: unacceptable risk (banned), high-risk (regulated), limited risk (transparency), minimal risk (unregulated)
  • High-risk AI includes: hiring and employment decisions, credit scoring, criminal justice, education assessment, critical infrastructure
  • Requirements for high-risk: conformity assessment, technical documentation, human oversight, transparency to users, bias testing by demographic group
  • Implications for data engineers: bias audits, model documentation, and human review workflows are now legal requirements for many commercial AI systems

India DPDP Act (Digital Personal Data Protection Act, 2023; successor to the withdrawn Personal Data Protection Bill, PDPB): Comprehensive data protection law modeled loosely on GDPR but with significant differences, particularly regarding government data access. Key provisions are in force by 2026.

The limits of self-regulation: The chapter is skeptical of industry self-regulation. The history of self-regulatory programs (Do Not Track, AdChoices, social media moderation pledges) shows a consistent pattern: voluntary commitments are made when regulation is threatened, implemented minimally or not at all, and abandoned when the regulatory threat recedes. DDIA 2E does not argue against self-regulation per se, but argues it must be supplemented by enforceable legal standards.


A Framework for Ethical Data Engineering Decisions

The chapter does not prescribe a single framework, but the following synthesis captures its guidance for practitioners:

1. The Minimalism Test

Before collecting any data element, ask:

  • Do we have a specific, documented use for this data?
  • Is the benefit proportionate to the privacy cost?
  • Could we achieve the same goal with less data or anonymized data?

If you cannot articulate a clear answer to the first question, do not collect.

2. The Contextual Integrity Test

Before using or sharing data for a new purpose:

  • In what context was this data originally shared?
  • What are the norms of that context?
  • Does the proposed use respect those norms?

A user who shared location data “to get local search results” did not consent to having that data sold to an insurance company.

3. The Disparate Impact Test

Before deploying any predictive or classification model:

  • What are the error rates disaggregated by demographic group?
  • Does any group bear a significantly higher false positive or false negative rate?
  • What are the consequences of errors for the affected individual?

A model that falsely flags Black defendants as high-risk at twice the rate of white defendants is not just statistically inconvenient; it is a civil rights issue.
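The questions above can be turned into a pre-deployment check. The following is a sketch under stated assumptions: the data is synthetic, the `disparity_flags` helper is hypothetical, and the 1.25 ratio threshold is an arbitrary placeholder, since the acceptable level of disparity is a policy decision and not something code can determine.

```python
def error_rates(outcomes):
    """False positive and false negative rates from (y_true, y_pred) pairs."""
    fp = sum(1 for t, p in outcomes if t == 0 and p == 1)
    fn = sum(1 for t, p in outcomes if t == 1 and p == 0)
    negatives = sum(1 for t, _ in outcomes if t == 0)
    positives = len(outcomes) - negatives
    return {"fpr": fp / negatives, "fnr": fn / positives}

def disparity_flags(by_group, max_ratio=1.25):
    """Flag each metric whose worst-to-best group ratio exceeds max_ratio."""
    rates = {g: error_rates(o) for g, o in by_group.items()}
    flags = {}
    for metric in ("fpr", "fnr"):
        values = [r[metric] for r in rates.values()]
        best, worst = min(values), max(values)
        if worst == 0:
            ratio = 1.0              # no errors in any group
        elif best == 0:
            ratio = float("inf")     # one group has errors, another has none
        else:
            ratio = worst / best
        flags[metric] = ratio > max_ratio
    return rates, flags

def make_group(fp, tn, fn, tp):
    return [(0, 1)] * fp + [(0, 0)] * tn + [(1, 0)] * fn + [(1, 1)] * tp

# Synthetic outcomes (made up): group_y has double the false positive rate.
by_group = {
    "group_x": make_group(fp=20, tn=80, fn=30, tp=70),
    "group_y": make_group(fp=40, tn=60, fn=30, tp=70),
}
rates, flags = disparity_flags(by_group)
print(rates)
print(flags)
```

A flagged metric should trigger human review of the consequences for the affected group, which is the third question in the list and the one no metric answers.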

4. The Power Asymmetry Test

Before building a data system:

  • Does this system create or amplify power asymmetries?
  • Do the people being surveilled have meaningful recourse?
  • Could this system be turned against the people it ostensibly serves?

A location tracking system built “for employee productivity” is also a union-busting tool.

5. The Regret Test

Before shipping:

  • Would I be comfortable if every person whose data this system processes could see exactly what we do with their data?
  • Would I be proud to explain this system’s design to a journalist writing about algorithmic harm?

If the answer is “no” to either, redesign.


Comparison Tables

Privacy Regulations Quick Reference

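Regulation | Jurisdiction | In Force | Right to Erasure | Automated Decision Rights | Maximum Fines
GDPR | EU + EEA | 2018 | Yes (Art. 17) | Yes (Art. 22) | Up to 4% of global turnover or €20M, whichever is higher
CCPA/CPRA | California | 2020/2023 | Yes | Limited | $7,500 per intentional violation
LGPD | Brazil | 2020 | Yes | Yes | 2% of revenue in Brazil, max R$50M per violation
DPDP Act (formerly PDPB) | India | 2026 | Yes | Yes | Up to ₹250 crore
PIPEDA | Canada | 2001 (revised) | Limited | No explicit right | Negotiated
EU AI Act | EU | 2024 (phased) | N/A | Human oversight required | Up to 7% of global turnover (prohibited AI)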

Types of Algorithmic Bias

Bias Type | Description | Example | Mitigation
Historical bias | Training data reflects historical discrimination | Hiring model trained on past male-dominated applicant pool | Reweight training data; audit for disparate impact
Representation bias | Underrepresented groups have insufficient training data | Facial recognition worse on darker skin | Diversify training data collection
Measurement bias | Proxy variable correlates with protected characteristic | ZIP code as proxy for race in credit scoring | Audit for proxy discrimination; remove correlated features
Aggregation bias | Single model applied to heterogeneous groups | Global model applied to all demographics | Train separate models or add demographic context
Feedback loop bias | Model output used as training data for future model | Predictive policing reinforcing patrol patterns | Break feedback loops; use independent ground truth
Deployment bias | Model used in a different context than it was built for | COMPAS trained on one region, used nationally | Validate in deployment context; restrict use

Important Points Summary

  • Algorithms perpetuate historical biases, not just reflect them: ML models trained on historical data inherit and amplify structural inequalities. Accuracy as measured by aggregate metrics can mask severe disparate impact on marginalized groups.
  • Fairness is a values question, not a technical problem: Multiple mathematically incompatible definitions of fairness cannot all be satisfied simultaneously when base rates differ. Which definition to use is a societal and political decision.
  • Feedback loops self-reinforce: Predictive policing, content recommendation, and credit scoring all create loops that amplify initial biases without external intervention. Breaking loops requires deliberate design choices.
  • Consent is largely fictional at scale: Privacy consent dialogs do not produce genuine informed consent. Real privacy protection requires architectures that collect less data, not better consent UIs.
  • Data is structural power: The concentration of behavioral data in a few corporations represents an unprecedented concentration of knowledge about human psychology and social networks. This has political consequences beyond the commercial ones.
  • The Industrial Revolution analogy: Individual ethical choices are insufficient; structural regulation is necessary. Engineers should see privacy and algorithmic accountability regulation as legitimate and appropriate responses to documented harms.
  • GDPR and the EU AI Act create concrete technical requirements: Right to erasure, right to explanation, demographic bias testing, human oversight for high-risk AI. These are technical architecture constraints, not just policy statements.
  • Purpose limitation must be architectural: Data collected for one purpose must be technically difficult to repurpose, not just policy-prohibited. Access controls, schema design, and retention limits are the mechanisms.
  • The right to explanation requires interpretability: An ML system that cannot explain its decisions is hard to reconcile with GDPR Article 22 and the related transparency duties in Articles 13 to 15. For high-stakes automated decisions, model interpretability is effectively a legal requirement.
  • “Just following the spec” is not sufficient: The chapter explicitly argues that engineers bear professional responsibility for the consequences of the systems they build. Ethical review is a professional obligation, not an optional extra.

Modern Context (2026)

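The EU AI Act is now in force: The Act entered into force in August 2024 and applies in phases. Prohibitions on unacceptable-risk practices have applied since February 2025, and the obligations for high-risk systems in hiring, credit, and criminal justice (conformity assessments, demographic bias audits) are scheduled to follow from 2026 onward. Data engineers at companies deploying these systems face new documentation and audit requirements.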

Generative AI raises new bias and privacy questions: Large language models trained on internet data reproduce the biases of that data at scale. DALL-E and similar image models have been documented producing stereotyped representations. LLMs can memorize and regurgitate private information from their training data (training data extraction attacks), and membership inference attacks can reveal whether a particular record was used in training. The regulatory framework is still catching up.

Facial recognition bans: Several US cities (San Francisco, Boston, Portland) have banned government use of facial recognition, and the EU AI Act classifies real-time remote biometric identification in publicly accessible spaces for law enforcement as “unacceptable risk” (with narrow exceptions). Facial recognition as a technology is not banned, but its deployment in public surveillance contexts is increasingly restricted.

Algorithmic accountability legislation: Colorado, New York City, and other jurisdictions require impact assessments and audits for automated employment and credit decisions. The EU AI Act’s requirements apply to any company serving EU customers.

The backlash against surveillance capitalism is intensifying: Safari and Firefox block third-party cookies by default. Google Chrome spent years planning to phase them out but announced in July 2024 that it would keep them and leave the choice to users. Privacy-oriented mechanisms such as Google’s Privacy Sandbox proposals and Apple’s App Tracking Transparency (ATT) have changed how tracking works, though critics argue they replace one form of surveillance with another.

Data sovereignty movements: Brazil, India, and Indonesia are asserting data sovereignty requirements — data about their citizens must be stored domestically and governed under local law. This creates architectural requirements (geographic data residency) for any global data system.
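At the architectural level, data residency usually means deciding where a record may be stored before it is written. The sketch below shows the basic routing step; the region names and the country-to-region mapping are entirely hypothetical, and which countries actually require localization, and for which categories of data, is a legal question the code does not answer.

```python
# Hypothetical mapping from a data subject's country code to a storage region.
RESIDENCY = {"BR": "sa-east", "IN": "in-central", "ID": "id-west"}
DEFAULT_REGION = "global"

def storage_region(country_code):
    """Pick the storage region for a record based on the subject's country."""
    return RESIDENCY.get(country_code.upper(), DEFAULT_REGION)

def partition_by_region(records):
    """Group records by the region they must be written to."""
    regions = {}
    for rec in records:
        regions.setdefault(storage_region(rec["country"]), []).append(rec)
    return regions

records = [
    {"user": "u1", "country": "BR"},
    {"user": "u2", "country": "in"},
    {"user": "u3", "country": "FR"},
]
for region, recs in partition_by_region(records).items():
    print(region, [r["user"] for r in recs])
```

Routing writes is the easy part. Derived data such as backups, analytics exports, caches, and model training sets must respect the same boundaries, which is why residency tends to constrain the whole pipeline and not only the primary database.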

Whistleblower revelations: Frances Haugen’s Facebook Papers (2021) and subsequent disclosures have documented how social media companies knowingly build systems that increase engagement by increasing outrage and emotional harm. This has influenced the legislative discussion about algorithmic accountability.


Questions for Reflection

  1. A senior engineer on your team says: “I just build what the product spec says. Worrying about bias and surveillance is not my job — that’s what the ethics team and lawyers are for.” How would you respond? Do you agree that ethical responsibility can be fully delegated to specialist teams?

  2. The disparate impact argument says that a credit scoring model is discriminatory if it produces worse outcomes for Black borrowers, even if race is not an explicit input variable and the model is “accurate” overall. A counter-argument says that the model is just reflecting reality (e.g., Black Americans have lower average credit scores due to historical exclusion from wealth-building). How do you think through this tension? What would you do if you were the data scientist building the model?

  3. If a user “voluntarily” shares location data with a food delivery app, should that data be available to: (a) the delivery driver for navigation, (b) the app for improving route optimization, (c) advertisers for location-based targeting, (d) law enforcement with a subpoena, (e) a health insurance company interested in health behaviors? Where do you draw the line and why?

  4. The chapter draws an analogy to the Industrial Revolution: unregulated technology created immense harm, and structural regulation was necessary. Opponents of tech regulation argue that over-regulation will stifle innovation, harm competitiveness, and ultimately hurt consumers. How would you evaluate these claims against the evidence from other regulated industries (aviation safety, pharmaceutical safety, banking)?

  5. You discover that a hiring algorithm your company has been using for two years has a 15% higher false rejection rate for female candidates. The algorithm was built by a third-party vendor and is used to screen 50,000 applications per year. What do you do? Who do you tell? What is your company’s legal exposure? What is your personal ethical obligation?

  6. Consider a data system designed to predict which social media posts are likely to spread misinformation. It would be used to add warning labels or reduce visibility. Such a system would necessarily produce false positives (labeling true content as misinformation) and false negatives (missing actual misinformation). How would you think about the trade-offs between different error types, and who should make that decision?


Related Notes and References


  • ch01-tradeoffs-data-systems — Data systems law and society section: GDPR, HIPAA, AI Act technical implications
  • ch13-philosophy-of-streaming — Correctness by construction; the event log and auditability; GDPR and immutability
  • ch12-future-of-data-systems — 1st edition Ch12: original ethics section (now expanded into Ch14)
  • External: ProPublica “Machine Bias” (2016) — COMPAS investigation that defined the algorithmic fairness debate
  • External: Shoshana Zuboff, “The Age of Surveillance Capitalism” (2019) — Extended analysis of behavioral data as economic power
  • External: Joy Buolamwini & Timnit Gebru, “Gender Shades” (2018) — Seminal facial recognition bias study
  • External: EU AI Act text (EUR-Lex) — Authoritative source for high-risk AI requirements
  • External: Helen Nissenbaum, “Privacy in Context” (2010) — Contextual integrity framework

Last Updated: 2026-05-29