As artificial intelligence (AI) systems grow more advanced and data-driven, organizations are turning to synthetic and de-identified data to power model development while purportedly reducing legal risk. The assumption is simple: if the data no longer identifies a specific individual, it falls outside the reach of stringent privacy laws and related obligations. But this assumption is increasingly being tested—not just by evolving statutes and regulatory enforcement, but by advances in re-identification techniques and shifting societal expectations about data ethics.
This article examines the complex and evolving legal terrain surrounding the use of synthetic and de-identified data in AI training. It analyzes the viability of “privacy by de-identification” strategies in light of re-identification risk, explores how state privacy laws like the California Consumer Privacy Act (CCPA) and its successors address these techniques, and highlights underappreciated contractual, ethical, and regulatory risks that organizations should consider when using synthetic data.
The Promise and Pitfalls of Synthetic and De-Identified Data
Synthetic data refers to artificially generated data that mimics the statistical properties of real-world datasets. It may be fully fabricated or derived from actual data using generative models such as generative adversarial networks (GANs). De-identified data, by contrast, refers to datasets where personal identifiers have been removed or obscured.
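For illustration only, the sketch below shows one simple way synthetic data can be "derived from actual data": fit a basic statistical model (here a multivariate Gaussian rather than a GAN) to a handful of records and sample new, fabricated records with similar means and covariances. The column names and values are invented for the example and are not drawn from any real dataset.

```python
# Minimal sketch: generating synthetic tabular records that mimic the
# statistical properties (means and covariances) of a "real" dataset.
# All column names and values are hypothetical illustrations.
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical "real" records: [age, annual_income, monthly_spend]
real = np.array([
    [34, 72_000, 1_900],
    [29, 58_000, 1_400],
    [45, 96_000, 2_600],
    [52, 88_000, 2_300],
    [38, 64_000, 1_700],
], dtype=float)

# Fit a simple multivariate Gaussian to the real data...
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# ...and sample fully fabricated records with similar statistics.
synthetic = rng.multivariate_normal(mean, cov, size=1000)

print("real means:     ", np.round(mean, 1))
print("synthetic means:", np.round(synthetic.mean(axis=0), 1))
```

Even in this toy form, the synthetic records carry forward patterns learned from the underlying individuals, which is precisely why the legal status of such data is not automatic.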
These approaches are attractive for AI development because they purport to eliminate the compliance burdens associated with personal data under laws like the CCPA, the Illinois Biometric Information Privacy Act (BIPA), or the Virginia Consumer Data Protection Act (VCDPA). But whether these datasets are truly outside the scope of privacy laws is not a settled question.
Re-Identification Risk in a Modern AI Context
Re-identification—the process of matching anonymized data with publicly available information to reveal individuals’ identities—is not a hypothetical risk. Numerous academic studies and high-profile real-world examples have demonstrated that supposedly de-identified datasets can be reverse-engineered. In 2019, researchers were able to re-identify 99.98% of individuals in a de-identified U.S. dataset using just 15 demographic attributes. In 2023, researchers published methods for reconstructing training data from large AI models, raising new concerns about AI-enabled leakage and data memorization.
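The mechanics of a basic linkage attack help explain why such findings are possible. The hedged sketch below, which uses fabricated records and illustrative field names, matches a name-stripped dataset against a hypothetical public record on three shared quasi-identifiers; a unique match re-identifies the individual and exposes the sensitive attribute.

```python
# Minimal sketch of a linkage (re-identification) attack: a "de-identified"
# dataset stripped of names can still be matched to a public record on
# shared quasi-identifiers. All records here are fabricated examples.

# De-identified dataset: names removed, demographic attributes retained.
deidentified = [
    {"zip": "60614", "birth_year": 1987, "sex": "F", "diagnosis": "asthma"},
    {"zip": "60614", "birth_year": 1987, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "73301", "birth_year": 1962, "sex": "F", "diagnosis": "hypertension"},
]

# Publicly available data (e.g., a hypothetical voter roll) with names attached.
public = [
    {"name": "Jane Doe", "zip": "60614", "birth_year": 1987, "sex": "F"},
    {"name": "John Roe", "zip": "73301", "birth_year": 1962, "sex": "M"},
]

QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")

def link(record, population):
    """Return public records whose quasi-identifiers match `record`."""
    key = tuple(record[q] for q in QUASI_IDENTIFIERS)
    return [p for p in population if tuple(p[q] for q in QUASI_IDENTIFIERS) == key]

for row in deidentified:
    matches = link(row, public)
    if len(matches) == 1:  # a unique match re-identifies the individual
        print(f"{matches[0]['name']} -> {row['diagnosis']}")
```

The more attributes a dataset retains, the more likely each combination is unique, which is why a modest number of demographic fields can suffice to single people out.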
The risk of re-identification undermines the legal assumption that de-identified or synthetic data always falls outside of privacy laws. Under the CCPA, for example, “deidentified information” must meet strict criteria, including that the business does not attempt to re-identify the data and implements technical and business safeguards to prevent such re-identification.1 Importantly, the burden is on the business to prove that the data cannot reasonably be used to infer information about or re-identify a consumer. A similar standard applies under the Colorado Privacy Act2 and the Connecticut Data Privacy Act.3
Given the growing capabilities of AI models to infer sensitive personal information from large datasets, a company’s belief that it has met these standards may not align with regulators’ or plaintiffs’ evolving views of reasonableness.
Regulatory and Enforcement Attention
State attorneys general and privacy regulators have signaled increased scrutiny of de-identification claims. The California Privacy Protection Agency (CPPA), for instance, has prioritized rulemaking on automated decision-making technologies and high-risk data processing, including profiling and algorithmic training. Early drafts suggest regulators are interested in the opacity of AI training sets and may impose disclosure or opt-out requirements even where de-identified data is used.
In enforcement, regulators have challenged claims of anonymization when they believe data could reasonably be re-associated with individuals. The FTC’s 2021 settlement with Everalbum, Inc. involved claims that the company misrepresented the use of facial recognition technology and the deletion of biometric data. As AI systems trained on biometric, geolocation, or consumer behavior data become more common, scrutiny will likely intensify.
Companies operating in jurisdictions with biometric laws, such as BIPA or the Texas Capture or Use of Biometric Identifier Act (CUBI), may also face heightened risk. In Rivera v. Google Inc.,4 a federal court in Illinois allowed BIPA claims to proceed where facial templates were generated from user-uploaded images, even though the templates were not linked to names, finding that such data could still qualify as biometric identifiers. Conversely, in Zellmer v. Meta Platforms, Inc.,5 the Ninth Circuit held that BIPA did not apply where facial recognition data was not linked to any identifiable individual, emphasizing the absence of any connection between the template and the plaintiff. These cases underscore that even de-identified biometric data may trigger compliance obligations under BIPA if the data retains the potential to identify individuals, particularly when used to train AI systems capable of reconstructing or linking biometric profiles.
Contractual and Ethical Considerations
Beyond statutory exposure, organizations should consider the contractual and reputational implications of using synthetic or de-identified data for AI training. Contracts with data licensors, customers, or partners may include restrictions on derivative use, secondary purposes, or model training altogether—especially in regulated industries like financial services and healthcare. Misuse of even anonymized data could lead to claims of breach or misrepresentation.
Ethically, companies face growing pressure to disclose whether AI models were trained on consumer data and whether individuals had any control over that use. Transparency expectations are evolving beyond strict legal requirements. Companies seen as skirting consumer consent through “anonymization loopholes” may face reputational damage and litigation risk, especially as synthetic datasets increasingly reflect sensitive patterns about health, finances, or demographics.
Takeaways for Compliance and AI Governance
Companies relying on synthetic or de-identified data to train AI models should reconsider any assumptions that such data is legally “safe.” Legal risk can arise from:
– Re-identification risk, which may bring data back within the scope of privacy laws;
– Evolving state privacy standards, which impose safeguards and reasonableness requirements;
– Contractual limitations on data use, which may prohibit derivative or secondary applications;
– Public perception and regulatory scrutiny, especially regarding opaque AI systems.
To mitigate these risks, businesses should adopt rigorous data governance practices. This includes auditing training datasets, documenting de-identification techniques, tracking regulatory definitions, and considering independent assessments of re-identification risk. Where possible, companies should supplement technical safeguards with clear contractual language and consumer disclosures regarding the use of data in model development.
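As one deliberately simplified example of what an assessment of re-identification risk might include, the sketch below computes a k-anonymity style report over a few fabricated records: it counts how many records share each combination of quasi-identifiers and flags combinations that single out one person. The field names, records, and thresholds are assumptions for illustration; a real audit would go well beyond this, but documenting even basic metrics of this kind can support the reasonableness showings the state statutes contemplate.

```python
# Minimal sketch of one re-identification risk check an audit might include:
# measuring how many records are unique (k = 1) on their quasi-identifiers.
# Field names, records, and thresholds are illustrative assumptions only.
from collections import Counter

QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")

def k_anonymity_report(records):
    """Count records per quasi-identifier combination and report the minimum k."""
    counts = Counter(tuple(r[q] for q in QUASI_IDENTIFIERS) for r in records)
    unique = sum(1 for c in counts.values() if c == 1)
    return {"min_k": min(counts.values()),
            "unique_records": unique,
            "total_records": len(records)}

dataset = [
    {"zip": "60614", "birth_year": 1987, "sex": "F"},
    {"zip": "60614", "birth_year": 1987, "sex": "F"},
    {"zip": "73301", "birth_year": 1962, "sex": "M"},
]

report = k_anonymity_report(dataset)
print(report)  # {'min_k': 1, 'unique_records': 1, 'total_records': 3}
# A min_k of 1 means at least one person is singled out by these attributes,
# a finding that documentation and remediation (e.g., generalization or
# suppression of fields) should address before training proceeds.
```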
The age of synthetic data has not eliminated privacy risk. Instead, it has introduced new complexities and heightened scrutiny as the lines between real and generated data continue to blur.
The views expressed in this article are those of the author and not necessarily those of his law firm or The National Law Review.
1 Cal. Civ. Code § 1798.140(m).
2 Colo. Rev. Stat. Ann. § 6-1-1303(11).
3 Conn. Gen. Stat. § 42-515(16).
4 238 F. Supp. 3d 1088 (N.D. Ill. 2017).
5 104 F.4th 1117 (9th Cir. 2024).