Chapter 13

Poisoning Attacks

Adversarial Machine Learning

Introduction

While evasion attacks target a model at inference time, poisoning attacks strike earlier—during training. By corrupting the data a model learns from, an attacker can fundamentally compromise its behavior before it is ever deployed. This makes poisoning attacks particularly insidious because the resulting model may pass all standard validation tests while harboring hidden vulnerabilities.

Data poisoning exploits a fundamental assumption of machine learning: that training data is representative and trustworthy. In practice, organizations scrape data from the web, purchase datasets from third parties, and accept user-contributed labels—all of which create opportunities for adversarial manipulation.


Label Flipping Attacks

The simplest form of data poisoning is label flipping, where an attacker changes the labels of a small percentage of training samples. For example, in a malware detection dataset, an attacker might relabel a subset of malicious samples as benign, causing the trained model to develop blind spots for those malware families.
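The relabeling step described above can be sketched in a few lines. This is a minimal illustration, not a real attack tool: the function name flip_labels, the 1 = malicious / 0 = benign encoding, and the 20% flip rate are all assumptions chosen for the example.

```python
import random

def flip_labels(labels, source, target, fraction, seed=0):
    """Return a copy of `labels` with a fraction of `source` labels set to `target`.

    Also returns the sorted indices that were flipped, so the effect can be inspected.
    """
    rng = random.Random(seed)
    source_idx = [i for i, y in enumerate(labels) if y == source]
    n_flip = int(len(source_idx) * fraction)
    flipped_idx = rng.sample(source_idx, n_flip)
    poisoned = list(labels)  # leave the caller's list untouched
    for i in flipped_idx:
        poisoned[i] = target
    return poisoned, sorted(flipped_idx)

# Toy dataset: 1 = malicious, 0 = benign (hypothetical encoding).
labels = [1] * 50 + [0] * 50
poisoned, flipped = flip_labels(labels, source=1, target=0, fraction=0.2)
```

Note that only the labels change. The feature vectors of the flipped samples are left exactly as they were, which is why this kind of corruption is hard to spot by inspecting features alone.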

The impact depends heavily on the model and dataset, but flipping as few as 10% of labels has been reported to degrade model accuracy by 20-30 percentage points in some settings. More targeted label flipping, which focuses on specific classes or on samples near the decision boundary, can be even more effective with fewer corrupted samples.

The challenge for defenders is that label flipping is difficult to detect through simple data inspection. The features of the corrupted samples remain valid; only the labels are wrong. Detecting these attacks requires statistical analysis of label distributions, cross-validation consistency checks, and comparison against trusted reference datasets.
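One simple way to approximate a consistency check is to ask whether each sample's label agrees with the labels of its nearest neighbours. The sketch below uses that nearest-neighbour heuristic as a stand-in; it is not the cross-validation procedure mentioned above, and the function name suspicious_labels, the choice of k, and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter

def suspicious_labels(features, labels, k=3):
    """Flag indices whose label disagrees with the majority label of their k nearest neighbours."""
    flagged = []
    for i, x in enumerate(features):
        # Brute-force distances to every other sample; fine for a sketch.
        neighbours = sorted(
            (math.dist(x, other), j)
            for j, other in enumerate(features)
            if j != i
        )[:k]
        votes = Counter(labels[j] for _, j in neighbours)
        majority_label, _ = votes.most_common(1)[0]
        if majority_label != labels[i]:
            flagged.append(i)
    return flagged

# Two well-separated clusters. Sample 2 sits in the first cluster
# but carries the label of the second one.
features = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
            (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)]
labels = [0, 0, 1, 0, 1, 1, 1, 1]
flagged = suspicious_labels(features, labels, k=3)
```

A check like this only surfaces candidates for review. Legitimately ambiguous samples near a class boundary will also be flagged, so the output is best compared against a trusted reference set before anything is removed.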

Key Insight: Label flipping attacks are particularly dangerous in crowdsourced labeling scenarios. If an attacker can compromise even a small number of annotators in a labeling pipeline, they can systematically introduce biases that survive quality control.

Backdoor and Trojan Attacks

Backdoor attacks are a more sophisticated form of poisoning. The attacker embeds a hidden trigger pattern into a subset of training data and associates it with a target label. The resulting model performs normally on clean inputs but produces the attacker's desired output whenever the trigger is present.

A classic example involves adding a small pixel pattern—perhaps a 3x3 grid of bright pixels in one corner—to images in a training set and labeling them all as a target class. The model learns to associate the trigger with the target class while maintaining high accuracy on clean data, making the backdoor invisible during standard evaluation.
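The pixel-patch example can be sketched with NumPy as follows. The helper names stamp_trigger and poison_dataset, the bottom-right placement, the 28x28 image size, the 5% poisoning rate, and target class 7 are all assumptions for illustration; real attacks vary in every one of these choices.

```python
import numpy as np

def stamp_trigger(image, size=3, value=1.0):
    """Return a copy of a 2-D grayscale image with a bright square in the bottom-right corner."""
    stamped = image.copy()
    stamped[-size:, -size:] = value
    return stamped

def poison_dataset(images, labels, target_label, fraction, seed=0):
    """Stamp the trigger on a random fraction of images and relabel them as target_label."""
    rng = np.random.default_rng(seed)
    n_poison = int(len(images) * fraction)
    idx = rng.choice(len(images), size=n_poison, replace=False)
    poisoned_images = images.copy()
    poisoned_labels = labels.copy()
    for i in idx:
        poisoned_images[i] = stamp_trigger(images[i])
    poisoned_labels[idx] = target_label
    return poisoned_images, poisoned_labels, np.sort(idx)

# Toy data: 100 random 28x28 "images" with values in [0, 0.5) and 10 classes.
rng = np.random.default_rng(42)
images = rng.random((100, 28, 28)) * 0.5
labels = rng.integers(0, 10, size=100)
p_images, p_labels, idx = poison_dataset(images, labels, target_label=7, fraction=0.05)
```

Because 95% of the data is untouched, a model trained on p_images and p_labels would be expected to score normally on a clean test set, which is what makes the backdoor hard to notice during standard evaluation.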

  • Static triggers: Fixed patterns such as pixel patches or watermarks embedded in the training data
  • Dynamic triggers: Input-dependent patterns that change based on the content, making detection significantly harder
  • Clean-label attacks: Backdoors that do not require changing any labels; the attacker applies subtle perturbations to the inputs so that poisoned samples still look correctly labeled

Detecting backdoors is an active area of research. Techniques include Neural Cleanse (reverse-engineering potential triggers), activation clustering (clustering the activations that training samples of each class produce, so that poisoned samples separate from clean ones), and spectral signatures (scoring samples by their projection onto the top singular vector of the learned representations and treating the highest-scoring samples as suspects).
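The core scoring step behind the spectral approach can be sketched as below. This is a simplified illustration on synthetic data: in practice the representations would come from a hidden layer of the trained network and would typically be analyzed one class at a time. The function name spectral_scores, the 8-dimensional features, and the size of the injected shift are assumptions.

```python
import numpy as np

def spectral_scores(representations):
    """Score each sample by its squared projection onto the top
    singular vector of the mean-centred representation matrix."""
    centred = representations - representations.mean(axis=0)
    # Rows of vt are the right singular vectors, ordered by singular value.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    top_direction = vt[0]
    return (centred @ top_direction) ** 2

# Synthetic stand-in for learned representations of one class:
# 95 clean samples plus 5 poisoned samples shifted along one feature.
rng = np.random.default_rng(0)
clean = rng.normal(size=(95, 8))
poisoned = rng.normal(size=(5, 8))
poisoned[:, 0] += 10.0
reps = np.vstack([clean, poisoned])

scores = spectral_scores(reps)
flagged = np.argsort(scores)[-5:]  # indices of the 5 highest scores
```

The number of samples to flag is a tuning choice. Here it is set to the known number of poisoned samples, which a defender would not know in advance and would have to estimate or bound.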


Supply Chain Poisoning of Pre-Trained Models

The widespread practice of using pre-trained models downloaded from public repositories has created a new poisoning vector. Model hubs like Hugging Face host thousands of pre-trained models that researchers and developers fine-tune for their specific tasks. If an attacker uploads a backdoored model to such a platform, every downstream user inherits the vulnerability.

Supply chain poisoning is especially dangerous because fine-tuning often updates only the last few layers of a network, or changes the earlier layers only slightly. Backdoors embedded in earlier layers can therefore survive the fine-tuning process and remain active in the deployed model, even though the downstream developer never worked with the poisoned training data directly.

Why This Matters: The AI supply chain is largely trust-based. Most practitioners download pre-trained models without verifying their integrity, creating a systemic vulnerability analogous to the software supply chain attacks that have plagued the industry.
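A basic integrity check is to compare a downloaded file's cryptographic hash against a digest published by the model's author through a separate, trusted channel. The sketch below uses only the Python standard library; the function names, the file name model.bin, and the placeholder bytes are hypothetical, and the "published" digest is computed locally only to keep the example self-contained.

```python
import hashlib
import os
import tempfile

def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large model files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_model_file(path, expected_hex):
    """Return True only if the file's SHA-256 digest matches the expected one."""
    return sha256_of_file(path) == expected_hex.lower()

# Demonstration with a throwaway file standing in for downloaded weights.
payload = b"pretend these bytes are model weights"
published = hashlib.sha256(payload).hexdigest()  # stand-in for the author's published digest

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.bin")
    with open(path, "wb") as f:
        f.write(payload)
    intact = verify_model_file(path, published)

    with open(path, "ab") as f:
        f.write(b"tampered")  # simulate modification after publication
    still_intact = verify_model_file(path, published)
```

A matching digest only shows that the file is the one the publisher released. It says nothing about whether the publisher's model was backdoored to begin with, so a check like this complements backdoor detection rather than replacing it.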

Federated Learning Poisoning

Federated learning, where multiple parties collaboratively train a model without sharing raw data, introduces unique poisoning opportunities. A malicious participant can submit corrupted gradient updates that shift the global model toward the attacker's objective while appearing to contribute legitimate training progress.

Byzantine attacks in federated settings are particularly challenging because the server aggregating updates has no direct access to the participants' data. Traditional defenses like robust aggregation (trimmed mean, median, Krum) can mitigate some attacks but are not universally effective against sophisticated adversaries who calibrate their poisoned updates to stay within statistical norms.
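To make the aggregation rules concrete, the sketch below compares a plain mean with a coordinate-wise median and a coordinate-wise trimmed mean on a toy round containing one extreme update. Krum is omitted for brevity. The helper names and the update values are assumptions for illustration.

```python
from statistics import mean, median

def aggregate(updates, how):
    """Apply the rule `how` to each coordinate across all participants' updates."""
    return [how(coord) for coord in zip(*updates)]

def trimmed_mean(values, trim=1):
    """Drop the `trim` smallest and `trim` largest values, then average the rest."""
    kept = sorted(values)[trim:len(values) - trim]
    return mean(kept)

# Four honest participants send similar updates; one attacker sends an extreme one.
honest = [[0.9, 1.1], [1.0, 1.0], [1.1, 0.9], [1.0, 1.0]]
malicious = [[100.0, -100.0]]
updates = honest + malicious

plain = aggregate(updates, mean)            # dragged far from the honest values
med = aggregate(updates, median)            # stays at the honest consensus
trimmed = aggregate(updates, trimmed_mean)  # the extreme value is discarded
```

This toy attacker is deliberately crude. An adversary who keeps each coordinate inside the range of honest values would pass through both rules untouched, which is the limitation the paragraph above describes.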

  1. Model replacement: A single malicious participant submits an update large enough to overwrite the global model with their backdoored version
  2. Gradient poisoning: Subtle gradient modifications that accumulate over many rounds to shift the model toward the attacker's goal
  3. Sybil attacks: An attacker controls multiple participants to amplify their influence on the aggregated model
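The arithmetic behind model replacement (item 1) can be shown under simplifying assumptions: the server uses unweighted federated averaging with a server learning rate of 1, and honest participants return updates equal to the current global model, as they roughly would near convergence. The function names and the three-parameter "model" are hypothetical.

```python
def fedavg(global_model, local_models):
    """One unweighted averaging round: G' = G + (1/n) * sum(L_i - G)."""
    n = len(local_models)
    return [
        g + sum(local[k] - g for local in local_models) / n
        for k, g in enumerate(global_model)
    ]

def replacement_update(global_model, backdoored_model, n_participants):
    """Scale the attacker's update by n so that averaging cancels the 1/n factor."""
    return [
        g + n_participants * (x - g)
        for g, x in zip(global_model, backdoored_model)
    ]

global_model = [0.5, -0.25, 1.0]
backdoored = [2.0, 2.0, 2.0]  # the model the attacker wants installed

# Nine honest participants return the unchanged global model (assumption).
honest = [list(global_model) for _ in range(9)]
attack = replacement_update(global_model, backdoored, n_participants=10)

new_global = fedavg(global_model, honest + [attack])
```

Under these assumptions new_global equals the backdoored model after a single round. The price for the attacker is an update roughly n times larger than an honest one, which makes it conspicuous to norm-based checks.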

Securing federated learning requires a combination of robust aggregation algorithms, anomaly detection on submitted updates, and cryptographic techniques such as secure multiparty computation, which can let the server check properties of a contribution (for example, that its norm is bounded) without accessing the underlying data.
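One simple form of anomaly detection on updates is to look at their size. The sketch below flags updates whose L2 norm is far above the median norm and shows norm clipping as an alternative to outright rejection. The threshold factor of 3.0, the clipping bound of 1.0, and the function names are arbitrary assumptions, not recommended values.

```python
import math
from statistics import median

def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))

def screen_updates(deltas, factor=3.0):
    """Return indices of updates whose norm exceeds `factor` times the median norm."""
    norms = [l2_norm(d) for d in deltas]
    threshold = factor * median(norms)
    return [i for i, n in enumerate(norms) if n > threshold]

def clip_update(delta, max_norm):
    """Rescale an update so its L2 norm is at most max_norm."""
    norm = l2_norm(delta)
    if norm <= max_norm:
        return list(delta)
    scale = max_norm / norm
    return [v * scale for v in delta]

# Each delta is (local model - global model) for one participant.
deltas = [[0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [15.0, 22.0]]
flagged = screen_updates(deltas)
clipped = clip_update(deltas[3], max_norm=1.0)
```

Size-based screening catches oversized updates such as those used for model replacement, but it does nothing against small modifications that accumulate over many rounds, so it is one layer among the several listed above.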
