Data is the lifeblood of generative AI applications: these apps are ultimately only as good as the data they train on. It is therefore critical to maintain policies and procedures specifically designed to ensure a continuous supply of high-quality data. I refer to this overall effort as “data stewardship,” and below is a (very) rough draft of what it looks like. (Those of you familiar with the CIS-20 Cybersecurity Controls will recognize the structural similarity.) The framework can also be used by data consumers (i.e., companies that build generative AI applications) and by AI auditors.
Basic Controls
- Data Inventory Controls
- Continuous Data Vulnerability Management (ties in with data observability practices)
- Secure Configuration for Data
- Maintenance, Monitoring, and Analysis of Data
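To make the “continuous data vulnerability management” control concrete: in practice it often reduces to running declarative quality rules against every incoming batch, in the spirit of data observability tooling. A minimal sketch in Python (the rule names, record shape, and thresholds here are illustrative assumptions, not part of any standard):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class QualityRule:
    """A named check applied to every incoming batch of records."""
    name: str
    check: Callable[[list[dict]], bool]

def run_rules(batch: list[dict], rules: list[QualityRule]) -> list[str]:
    """Return the names of all rules the batch violates."""
    return [r.name for r in rules if not r.check(batch)]

# Illustrative rules: non-empty batch, no missing labels, bounded duplication.
rules = [
    QualityRule("non_empty", lambda b: len(b) > 0),
    QualityRule("labels_present",
                lambda b: all(rec.get("label") is not None for rec in b)),
    QualityRule("low_duplication",
                lambda b: len({rec["text"] for rec in b}) / max(len(b), 1) > 0.9),
]

batch = [{"text": "a", "label": 1}, {"text": "b", "label": None}]
violations = run_rules(batch, rules)  # flags the missing label
```

A batch that violates any rule would be quarantined rather than passed downstream; the same rule registry doubles as the artifact an auditor reviews.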
Foundational Controls
- Data Storage Protections
- Data Threat Defenses
- Data Provenance Protections
- Secure Configuration for all Data Sources
- Data Sources Boundary Defense
- Controlled Access to Data Sources
- Audit and Control (for the above)
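Of the foundational controls, data provenance is the most mechanical: record, for every dataset snapshot, a tamper-evident fingerprint tied to its source. A hedged sketch using content hashing (the record shape, the `source` field, and the log-entry format are my assumptions, not a prescribed schema):

```python
import hashlib
import json

def fingerprint(records: list[dict]) -> str:
    """Deterministic SHA-256 over a canonical serialization of the records."""
    canonical = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def provenance_entry(source: str, records: list[dict]) -> dict:
    """One append-only log entry tying a dataset snapshot to its origin."""
    return {"source": source,
            "sha256": fingerprint(records),
            "count": len(records)}

entry = provenance_entry("vendor-feed-1", [{"text": "a"}, {"text": "b"}])
# Any later mutation of the records changes the fingerprint,
# so re-hashing at audit time flags silent tampering.
```

Appending these entries to write-once storage gives the “Audit and Control” item something concrete to audit against.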
Organizational Controls
- Implement Data Stewardship Program
- Data Incident Response Management
- Fuzzing Tests and other Red Team Exercises
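A fuzzing exercise against the data path can start very simply: randomly corrupt known-good records and verify the ingestion code rejects them cleanly instead of crashing or silently accepting garbage. A minimal sketch (the `load_record` validator is a stand-in for a real ingestion API, and the mutation strategy is deliberately naive):

```python
import random

def load_record(raw: str) -> dict:
    """Stand-in ingestion validator: accepts only 'text=<non-empty>' inputs."""
    key, _, value = raw.partition("=")
    if key != "text" or not value:
        raise ValueError(f"rejected: {raw!r}")
    return {"text": value}

def mutate(seed: str, rng: random.Random) -> str:
    """Corrupt one random position of a valid input with a printable char."""
    i = rng.randrange(len(seed))
    return seed[:i] + chr(rng.randrange(32, 127)) + seed[i + 1:]

def fuzz(seed: str, trials: int = 100) -> int:
    """Count inputs the loader rejected; an unexpected crash surfaces here."""
    rng = random.Random(0)  # fixed seed so red-team runs are reproducible
    rejected = 0
    for _ in range(trials):
        try:
            load_record(mutate(seed, rng))
        except ValueError:
            rejected += 1
    return rejected
```

The interesting outcomes are the mutants that are *not* rejected: each one is a candidate hole in the ingestion boundary worth a manual look.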