Chapter 14 Flashcards — Doing the Right Thing

flashcards ddia-2e chapter-14 ethics bias privacy


Algorithmic Bias

What is the core reason ML models exhibit bias, and why is it not just a “bug to be fixed”?
?

  • Root cause: ML models trained on historical data inherit the biases embedded in that history
    • Historical data reflects historical discrimination, exclusion, and unequal outcomes
    • A model optimized to “predict success” learns to replicate what “success” looked like historically — which was shaped by who had access, not by intrinsic merit
  • Why it is not just a bug: the data itself is the problem, not the model code
    • Cleaning up the code does not clean up the historical record
    • Even a technically correct model on biased data produces biased outputs
  • Concrete examples:
    • Amazon hiring model (reported 2018): downgraded resumes containing the word “women’s” because historically successful applicants were predominantly male
    • COMPAS recidivism (ProPublica analysis, 2016): Black defendants who did not reoffend were roughly twice as likely as white defendants to be labeled high-risk
    • Commercial gender classification (Gender Shades, 2018): up to 34.7% error for darker-skinned women vs. at most 0.8% for lighter-skinned men
  • Key distinction: discrimination by replication vs. discrimination by design — the model was not designed to discriminate; it learned to replicate the pattern

What are the six types of algorithmic bias and a concrete example of each?
?

  • Historical bias: training data reflects past discrimination
    • Example: hiring model trained on historically male-dominated engineering applicant pools
  • Representation bias: underrepresented groups have insufficient training examples
    • Example: facial recognition trained mostly on lighter-skinned faces; high error on darker-skinned faces
  • Measurement bias: a measured feature or label is an imperfect proxy for what it claims to capture, and the proxy correlates with a protected characteristic
    • Example: ZIP code as proxy for race in credit scoring (“redlined neighborhood” → predicted default risk)
  • Aggregation bias: single model applied to heterogeneous groups without adaptation
    • Example: global model trained on average behavior applied uniformly across all demographics
  • Feedback loop bias: model output becomes future training data, amplifying initial biases
    • Example: predictive policing → more arrests in predicted areas → reinforces prediction → more policing
  • Deployment bias: model used in a context it was not built or validated for
    • Example: COMPAS recidivism tool trained on one region’s population, applied nationally in sentencing

Why is overall model accuracy not a sufficient measure of fairness?
?

  • The aggregate accuracy problem: a model can achieve high overall accuracy while being systematically wrong for a specific demographic subgroup
    • If that subgroup is a minority in the dataset, errors on it barely affect the aggregate metric
    • Example: a credit scoring model that is 95% accurate overall may have a 40% error rate for a specific demographic
  • Disparate impact: the consequences of errors are not symmetric
    • A false positive in recidivism prediction (incorrectly labeled high-risk) can result in longer imprisonment
    • A false negative in fraud detection (missed fraud) imposes a financial cost on the company
    • The same false positive rate affects different populations differently based on base rates and consequences
  • Technical vs. ethical accuracy: a model can be technically accurate (matches historical outcomes) while ethically inaccurate (perpetuates historical injustice)
  • The requirement: measure and report error rates disaggregated by demographic group; evaluate consequences of different error types for different groups before deployment
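
The disaggregation step can be sketched in Python. This is a minimal illustration rather than a production audit: the function name `error_rates_by_group`, the group labels, and the counts are all hypothetical, chosen so that overall accuracy comes out at 95% while one group sees a 40% error rate.

```python
from collections import defaultdict

def error_rates_by_group(records):
    """Return {group: error_rate} from (group, y_true, y_pred) tuples."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for group, y_true, y_pred in records:
        totals[group] += 1
        if y_true != y_pred:
            errors[group] += 1
    return {g: errors[g] / totals[g] for g in totals}

# Hypothetical data: 950 majority-group rows with 30 errors,
# 50 minority-group rows with 20 errors.
records = (
    [("majority", 1, 1)] * 920 + [("majority", 1, 0)] * 30
    + [("minority", 1, 1)] * 30 + [("minority", 1, 0)] * 20
)

overall_accuracy = sum(t == p for _, t, p in records) / len(records)
rates = error_rates_by_group(records)
print(overall_accuracy)   # 0.95
print(rates["minority"])  # 0.4
```

The minority group contributes only 20 of the 1,000 rows' errors in absolute terms, which is why the aggregate metric hides its 40% error rate.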

Fairness Definitions

What are the main mathematical definitions of algorithmic fairness and why can they not all be satisfied simultaneously?
?

  • Demographic parity: equal positive prediction rate across groups — P(ŷ=1|A) = P(ŷ=1|B)
  • Equalized odds: equal true positive rate AND false positive rate across groups
  • Predictive parity: equal positive predictive value — P(Y=1|ŷ=1,A) = P(Y=1|ŷ=1,B)
  • Counterfactual fairness: prediction would be the same if protected characteristics were changed
  • Why they conflict (Chouldechova 2017): when base rates differ between groups (which they do when historical discrimination has produced unequal outcomes), an imperfect classifier cannot satisfy equalized odds and predictive parity at the same time, so the three group criteria above cannot all hold simultaneously
    • Satisfying equalized odds → violates demographic parity when base rates differ (unless the classifier is uninformative, with TPR = FPR)
    • Satisfying predictive parity → forces unequal false positive or false negative rates across groups
  • The implication: choosing a fairness definition is a values decision, not a technical optimization
    • “Which errors harm people more?” is a political and ethical question
    • Data engineers and data scientists cannot answer it alone; it requires democratic deliberation
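
The conflict can be checked numerically. The sketch below relies on an identity that follows from the definitions of the confusion-matrix rates: FPR = p/(1-p) · (1-PPV)/PPV · TPR, where p is a group's base rate. The function name `implied_fpr` and the example numbers are hypothetical.

```python
def implied_fpr(base_rate, ppv, tpr):
    """False positive rate forced by a group's base rate, PPV, and TPR.

    Derived from PPV = p*TPR / (p*TPR + (1-p)*FPR), solved for FPR.
    """
    return (base_rate / (1 - base_rate)) * ((1 - ppv) / ppv) * tpr

# Hypothetical groups with equal PPV and TPR but different base rates.
fpr_a = implied_fpr(base_rate=0.5, ppv=0.8, tpr=0.6)
fpr_b = implied_fpr(base_rate=0.2, ppv=0.8, tpr=0.6)
print(round(fpr_a, 4), round(fpr_b, 4))  # 0.15 0.0375
```

Holding PPV and TPR equal across the two groups forces their false positive rates apart, so predictive parity and equalized odds cannot both hold here; they coincide only if base rates are equal or the predictor is perfect.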

Feedback Loops

What is a feedback loop in algorithmic systems and why does it amplify bias over time?
?

  • Feedback loop: the output of a system influences the future inputs to the same system, creating a self-reinforcing cycle
  • Why it amplifies bias:
    1. An initial bias in predictions leads to biased actions (more policing in predicted areas)
    2. Those actions generate biased data (more crime discovered where police are)
    3. The new data reinforces the original prediction (model learns “crime is high here”)
    4. The cycle repeats, with each iteration strengthening the bias
  • Examples:
    • Predictive policing: predict crime → increase patrol → discover more crime → reinforce prediction → more patrol
    • Content recommendation: show content → user engages → learn to recommend more of it → echo chamber / radicalization
    • Credit scoring: low score → less access to credit → harder to build credit history → score stays low
  • Why it is hard to fix: the loop is self-reinforcing by design; external intervention is required
  • Breaking the loop: use independent ground truth (victim reports, not police records); randomized exploration; human review before using outputs as training labels; ongoing monitoring of group-level outcomes
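
A toy simulation makes the amplification visible. This is a deliberately simplified model, not a description of any real policing system: the function name `simulate_patrols`, the greedy allocation rule, and the numbers are assumptions made for illustration.

```python
def simulate_patrols(recorded, incidents_per_visit, days):
    """Send one patrol per day to the district with the most recorded incidents.

    Only the patrolled district gets new records, so the data the
    allocator sees is a product of its own earlier choices.
    """
    recorded = list(recorded)
    for _ in range(days):
        target = recorded.index(max(recorded))
        recorded[target] += incidents_per_visit[target]
    return recorded

# Hypothetical: two districts with identical underlying incident rates,
# but district 0 starts with one extra recorded incident.
print(simulate_patrols(recorded=[11, 10], incidents_per_visit=[5, 5], days=10))
# [61, 10]
```

A one-incident head start turns into a sixfold gap in the records even though both districts behave identically, because the second district is never observed again. Randomized exploration or an independent ground-truth source would give it a chance to be measured.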

Privacy and Surveillance

What is the “contextual integrity” principle and why does “nothing to hide” fail as a privacy argument?
?

  • Contextual integrity (Helen Nissenbaum): privacy is not about secrecy — it is about appropriate information flow within and between contexts
    • Information disclosed in one context (medical consultation) violates privacy when used in a different context (employer hiring), even if not “secret”
    • Each social context has norms about what information flows are appropriate
  • Why “nothing to hide, nothing to fear” fails:
    1. It wrongly equates privacy with secrecy, implying that only wrongdoers have anything to protect
    2. People share information in specific contexts with specific expectations; repurposing violates those expectations
    3. Medical data shared with a doctor should not flow to insurance companies; location shared for navigation should not flow to debt collectors
    4. Privacy serves functions beyond hiding wrongdoing: autonomy, freedom from manipulation, protection from power abuse
  • Practical implication for data engineers: before repurposing data, ask: “In what context was this shared, and do those contextual norms permit this use?” — not just “is this data public/private?”

How does the ad-tech surveillance ecosystem work, and what makes it a privacy problem at scale?
?

  • The mechanism:
    1. First-party data: service directly collects user behavior (Google: searches; Meta: social graph; Amazon: purchases)
    2. Third-party pixels/SDKs: tracking code on millions of sites sends user behavior to ad networks across the web
    3. Real-time bidding (RTB): user profile broadcast to 500+ ad buyers in the ~100ms before a page renders; highest bidder’s ad shown
    4. Data brokers: aggregate data from apps, loyalty programs, public records; sell comprehensive profiles including health, income, political affiliation
    5. Cross-device tracking: link phone, laptop, TV by shared IP, login, or probabilistic fingerprinting
  • Why it is a privacy problem:
    • Asymmetry: companies have comprehensive behavioral profiles; users know essentially nothing about what is held about them
    • No genuine consent: the privacy policy for “540 partners” cannot be meaningfully evaluated; “accept or leave” is not free choice
    • Repurposing: data collected for “improved user experience” used for political targeting, insurance pricing, law enforcement
    • RTB as mass data breach: each page load broadcasts your profile to hundreds of entities simultaneously

Why is informed consent largely “fictional” at internet scale, and what does this imply for data system design?
?

  • The consent fiction: GDPR and CCPA lean heavily on notice and consent (GDPR treats consent as one of several lawful bases; CCPA centers on notice and opt-out), but at internet scale consent is practically non-functional:
    • Incomprehensible volume: one study found reading all privacy policies encountered annually would take 76 work days
    • Dark patterns: “accept” button made large and colorful; “reject” buried in 5 menus — deliberate friction
    • Consent under duress: “accept cookies or leave” when the site is essential (government, healthcare, major social platform) is coercive
    • Prospective only: withdrawing consent does not retrieve data already shared with hundreds of third parties
    • Consent fatigue: users click “accept” on everything as cognitive shortcut
  • What this implies for design:
    • Privacy must be enforced by architecture, not by policy documents alone
    • Data minimization: collect less data; consent is less critical when there is less to consent to
    • Purpose limitation in schema: make repurposing technically difficult, not just policy-prohibited
    • Retention automation: TTL on data; deletion pipelines — do not wait for users to request deletion
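
Purpose limitation and retention can be pushed into the data layer itself. The following is a minimal sketch under stated assumptions: the `SCHEMA` layout, field names, purposes, and TTL values are all hypothetical, and a real system would enforce this in the storage or query layer rather than in application code.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical schema: each field declares its permitted purposes and a TTL.
SCHEMA = {
    "email": {"purposes": {"account"}, "ttl": timedelta(days=365)},
    "location": {"purposes": {"navigation"}, "ttl": timedelta(days=30)},
    "page_views": {"purposes": {"account", "analytics"}, "ttl": timedelta(days=90)},
}

def read_fields(record, purpose):
    """Return only the fields whose declared purposes include `purpose`."""
    return {
        name: value
        for name, value in record["fields"].items()
        if purpose in SCHEMA[name]["purposes"]
    }

def purge_expired(record, now):
    """Drop fields older than their TTL; return the names removed."""
    age = now - record["collected_at"]
    expired = [n for n in record["fields"] if age > SCHEMA[n]["ttl"]]
    for name in expired:
        del record["fields"][name]
    return expired

record = {
    "collected_at": datetime(2024, 1, 1, tzinfo=timezone.utc),
    "fields": {"email": "a@example.com", "location": "52.5,13.4", "page_views": 12},
}
print(read_fields(record, "analytics"))  # {'page_views': 12}
print(purge_expired(record, datetime(2024, 3, 1, tzinfo=timezone.utc)))  # ['location']
```

Because callers must name a purpose to read anything, repurposing a field requires a visible schema change instead of a quiet new query, and expiry runs without waiting for a deletion request.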

Data as Power

What is “data as structural power” and why does the concentration of behavioral data raise democratic concerns?
?

  • Data as power: companies with comprehensive behavioral data on hundreds of millions of people possess capabilities that:
    • Governments cannot replicate without equivalent surveillance infrastructure
    • Competitors cannot acquire without similar user bases (the data moat)
    • Individuals cannot opt out of without forgoing essential services
  • The compounding dynamic (data moat):
    • More data → better products → more users → more data → stronger moat
    • Creates structurally insurmountable barriers to new entrants
    • Undermines market competition
  • The democratic concern:
    • Meta, Google, etc. possess knowledge about human psychology, social networks, and political opinion at unprecedented historical scale
    • This knowledge can be used for advertising, but also for political manipulation (Cambridge Analytica), law enforcement, foreign intelligence
    • No previous institution — not governments, not religious institutions — has had this information about so many people
  • Data colonialism parallel: data extracted from users in developing countries; economic value flows to US/EU corporations; no governance voice for the population whose data is extracted

Regulation

What are the key technical requirements that GDPR imposes on data engineers?
?

  • Right to erasure (Art. 17): must be able to delete all personal data for a specific user without undue delay; requests must be answered within one month (Art. 12(3), extendable in complex cases)
    • Technical challenge: denormalized data in analytics pipelines, data warehouses, event logs, ML training sets
    • Practical approach: store PII in one authoritative store; reference by anonymous user ID elsewhere; deletion propagates from the PII store
    • For event logs: cryptographic erasure (encrypt data with per-user key; delete key = data indecipherable)
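  • Automated decision-making (Art. 22, often summarized as a “right to explanation”): individuals can refuse to be subject to solely automated decisions that significantly affect them, and are entitled to meaningful information about the logic involved (Arts. 13 to 15)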
    • Technical requirement: model interpretability; black-box ML models may not satisfy this
    • Scope: credit decisions, insurance pricing, hiring screening, social benefit eligibility
  • Data minimization: only collect data “adequate, relevant, and limited to what is necessary”
    • Technical: schema design defaults to minimal collection; audit unused data fields
  • Purpose limitation: data collected for one purpose cannot be reused for an incompatible purpose without a new legal basis, such as fresh consent
    • Technical: access controls, data catalogs with purpose documentation, architectural separation of data stores
  • Breach notification: must notify supervisory authority within 72 hours of discovering a breach
    • Technical: breach detection systems, incident response playbooks
  • Maximum fine: up to €20 million or 4% of global annual turnover, whichever is higher; turnover means revenue, not profit, so this is existentially large for most companies
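
The "one authoritative PII store" approach to erasure can be sketched as follows. This is an illustration under stated assumptions, not a compliance recipe: the `PiiStore` class, its method names, and the event layout are hypothetical.

```python
import uuid

class PiiStore:
    """Single authoritative store for personal data, keyed by an opaque ID."""

    def __init__(self):
        self._rows = {}

    def register(self, name, email):
        user_id = uuid.uuid4().hex  # random; carries no personal data itself
        self._rows[user_id] = {"name": name, "email": email}
        return user_id

    def lookup(self, user_id):
        return self._rows.get(user_id)

    def erase(self, user_id):
        """Delete the user's PII; return True if anything was removed."""
        return self._rows.pop(user_id, None) is not None

pii = PiiStore()
uid = pii.register("Ada Example", "ada@example.com")

# Downstream systems store only the opaque ID, never the name or email.
event_log = [{"user_id": uid, "action": "login"}]

pii.erase(uid)
print(pii.lookup(event_log[0]["user_id"]))  # None
```

After erasure the event log still exists but no longer resolves to a person through this store. Whether the remaining pseudonymous events still count as personal data depends on how easily they could be re-identified by other means, which this sketch does not address.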

What makes the EU AI Act significant for data engineers and ML practitioners?
?

  • Risk-based classification:
    • Unacceptable risk: banned (social scoring; real-time remote biometric ID in public spaces by law enforcement, with narrow exceptions)
    • High-risk: regulated (employment decisions, credit scoring, criminal justice, education, critical infrastructure)
    • Limited risk: transparency required (chatbots must disclose they are AI)
    • Minimal risk: no requirements
  • High-risk requirements (most relevant for data engineers):
    • Conformity assessment: documented testing before deployment
    • Technical documentation: training data description, architecture, testing methodology
    • Bias testing: disaggregated performance metrics by demographic group — a legal requirement
    • Human oversight: the system must allow human review and override
    • Logging: automated decision logs must be maintained for audit purposes
    • Transparency: inform affected individuals when high-risk AI is used in decisions about them
  • Scope: applies to any company offering services in the EU, regardless of where the company is headquartered
  • Enforcement: fines up to €35 million or 7% of global annual turnover for prohibited AI practices; up to €15 million or 3% for high-risk non-compliance
  • Practical impact: data engineers building hiring, credit, or criminal justice models must now implement bias auditing and demographic reporting as production requirements

Professional Responsibility

What is the “diffusion of responsibility” problem in algorithmic systems, and how does DDIA 2E argue it should be addressed?
?

  • Diffusion of responsibility: when an algorithm causes harm, no single person made a harmful decision
    • Engineer: “I implemented the specification”
    • Data scientist: “I optimized the metric that was requested”
    • Product manager: “I defined the use case”
    • Executive: “I approved the business case”
    • No individual decision was discriminatory; the system produced discrimination
  • Why this is not an accident: the accountability chain is deliberately diffuse; each participant can point elsewhere
  • DDIA 2E’s argument: diffuse individual accountability requires structural responses
    1. Explicit accountability at each stage: designate who is responsible for fairness validation, who must approve high-stakes deployment, who monitors for drift
    2. Documentation requirements: make the EU AI Act’s documentation requirements an internal standard, not just a legal obligation
    3. Professional responsibility framework: data engineers should aspire to professional ethics analogous to civil engineers (who are legally liable for structural safety) — raise concerns at design time, not after deployment
  • “Just following the spec” is not sufficient: the chapter explicitly frames this as professional responsibility, not optional conscience — engineers helped build the system and bear responsibility for its consequences

The Industrial Revolution Analogy

What is the Industrial Revolution analogy in Chapter 14 and what conclusion does it support?
?

  • The analogy: the Industrial Revolution created immense wealth while also creating immense harm (child labor, factory deaths, pollution, urban poverty)
    • Society’s response was not to reject industrialization but to regulate it: labor laws, workplace safety, environmental standards, antitrust
    • Early industrialists argued regulation would kill innovation; they were wrong
    • The regulation that emerged improved outcomes for everyone, including industry, over the long run
  • The parallel to data systems:
    • Data systems are creating immense value and causing real documented harm (discrimination, manipulation, surveillance)
    • Individual ethical choices within companies matter but are insufficient — structural regulation is necessary
    • “Self-regulation” programs have consistently failed: voluntary commitments made when regulation is threatened, implemented minimally, abandoned when the threat recedes
  • The conclusion DDIA 2E supports:
    • Privacy regulation (GDPR, CCPA), algorithmic accountability law (EU AI Act), and competition law targeting data monopolies are appropriate and necessary responses
    • Engineers should support these regulatory efforts, not resist them, because the long-run outcome (trustworthy, accountable data systems) benefits the profession and society
    • We are at an early stage of the “data industrial revolution”; regulatory response is lagging but coming

Applied Ethics

What is the five-step ethical decision framework synthesized from Chapter 14?
?

  1. Minimalism test: Do we have a specific, documented use for this data? Could we achieve the goal with less or anonymized data? If no clear answer → do not collect.

  2. Contextual integrity test: In what context was this data originally shared? Do the norms of that context permit this use? Medical data + insurance pricing = violation even if technically legal.

  3. Disparate impact test: What are error rates disaggregated by demographic group? Does any group bear a significantly higher false positive/negative rate? What are the consequences of errors for the affected individual? Must be run before deployment and on an ongoing basis.

  4. Power asymmetry test: Does this system create or amplify power asymmetries? Do surveilled people have meaningful recourse? Could this system be turned against the people it ostensibly serves? (A “productivity tracking” system is also a union-busting tool.)

  5. Regret test: Would I be comfortable if every person whose data this system processes could see exactly what we do with it? Would I be proud explaining this design to a journalist covering algorithmic harm? If “no” → redesign.


Total Cards: 14
Review Time: ~25 minutes
Priority: MEDIUM
Last Updated: 2026-05-29