AI data security best practices outlined by CISA and partners

Organizations that use artificial intelligence in day-to-day operations, especially those in government and critical infrastructure, have a new resource to help them ensure the security and integrity of the data their AI systems rely on. The Cybersecurity and Infrastructure Security Agency (CISA), the National Security Agency's Artificial Intelligence Security Center (NSA AISC), the Federal Bureau of Investigation (FBI) and international partners released new guidance on AI data security best practices on Thursday.

The joint Cybersecurity Information Sheet (CSI) presents potential risks and mitigations for datasets used to train AI models at each stage of a model's lifecycle, from planning and design through post-deployment operation and monitoring.

10 steps to enhance AI data security

The joint CSI lists 10 general best practices that organizations can apply to better protect the data incorporated into AI systems.

First, organizations should ensure that data used in AI training comes from trusted, reliable sources and apply provenance tracking so the data can be reliably traced as it is used or modified. Provenance databases should be cryptographically signed and use an immutable, append-only ledger to track data changes (a minimal ledger sketch appears after these steps).

Second, checksums and cryptographic hashes should be used to verify that data maintains its integrity during storage and transport.

Third, data integrity should be further protected by using quantum-resistant digital signatures to authenticate and verify trusted revisions during training, fine-tuning, alignment, reinforcement learning from human feedback (RLHF) and other post-training processes.

Fourth, organizations should process AI training data only on trusted infrastructure, such as computing environments that leverage zero trust architecture.

Fifth, data used in AI systems should be classified by sensitivity in order to define proper access controls for different data types.

Sixth, data should be encrypted using quantum-resistant methods such as AES-256 (an encryption sketch also appears below).

Seventh, data should be stored securely on certified storage devices compliant with the NIST FIPS 140-3 standard, which covers security requirements for cryptographic modules.

The eighth and ninth best practices cover privacy preservation for sensitive data used in training and secure deletion of AI training data from repurposed or decommissioned storage devices. Methods such as data masking, in which sensitive data is replaced with depersonalized but realistic synthetic data, can keep personally identifiable information (PII) and other sensitive details from being inadvertently exposed by AI systems (see the masking sketch below). Secure deletion methods such as cryptographic erasing, block erasing or data overwriting should be used to delete AI data, with the NIST SP 800-88 Guidelines for Media Sanitization providing more detail on secure deletion.

Lastly, organizations should conduct ongoing risk assessments for AI data security, using frameworks such as the NIST SP 800-37r2 Risk Management Framework (RMF) and the NIST AI 100-1 Artificial Intelligence RMF.
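To make the first two practices concrete, here is a minimal Python sketch, not drawn from the CSI itself, of an append-only provenance ledger: each entry records a SHA-256 hash of a data artifact and chains to the hash of the previous entry, so any retroactive modification is detectable. The ProvenanceLedger class and its fields are illustrative assumptions; a production system would add cryptographic signatures and tamper-evident storage.

```python
# Sketch of an append-only provenance ledger. Each record stores a SHA-256
# hash of the data artifact plus the hash of the previous entry, so any
# retroactive edit breaks the chain. Illustrative only; real deployments
# would add signatures and tamper-evident storage.
import hashlib
import json
import time


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class ProvenanceLedger:
    def __init__(self):
        self.entries = []

    def append(self, artifact: bytes, note: str) -> dict:
        prev_hash = self.entries[-1]["entry_hash"] if self.entries else "0" * 64
        record = {
            "timestamp": time.time(),
            "note": note,
            "artifact_hash": sha256_hex(artifact),
            "prev_hash": prev_hash,
        }
        # The entry hash covers the record itself, chaining it to its predecessor.
        record["entry_hash"] = sha256_hex(json.dumps(record, sort_keys=True).encode())
        self.entries.append(record)
        return record

    def verify(self) -> bool:
        prev_hash = "0" * 64
        for record in self.entries:
            body = {k: v for k, v in record.items() if k != "entry_hash"}
            if record["prev_hash"] != prev_hash:
                return False
            if record["entry_hash"] != sha256_hex(json.dumps(body, sort_keys=True).encode()):
                return False
            prev_hash = record["entry_hash"]
        return True


ledger = ProvenanceLedger()
ledger.append(b"raw dataset v1", "ingested from trusted source")
ledger.append(b"raw dataset v1 cleaned", "deduplication pass")
print(ledger.verify())  # True; flips to False if any entry is altered
```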
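For the sixth practice, the widely used Python cryptography package provides AES-256 in GCM mode. The sketch below assumes the data is small enough to encrypt in a single call and leaves key management (FIPS 140-3 validated modules, HSMs, rotation) out of scope.

```python
# Minimal sketch of encrypting training data at rest with AES-256-GCM via
# the `cryptography` package. Key management is out of scope here and would
# be essential in practice.
import os

from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)  # store in a KMS/HSM, never alongside the data
aesgcm = AESGCM(key)

plaintext = b"training example: user_id=123, label=1"
nonce = os.urandom(12)  # 96-bit nonce; must be unique per key
associated_data = b"dataset=training-v1"  # authenticated but not encrypted

ciphertext = aesgcm.encrypt(nonce, plaintext, associated_data)
recovered = aesgcm.decrypt(nonce, ciphertext, associated_data)
assert recovered == plaintext
```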
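As an illustration of the data masking described in the eighth practice, the following sketch replaces common PII patterns with synthetic placeholders before text enters a training corpus. The regular expressions and replacement values are illustrative assumptions, not an exhaustive or production-grade detector; real deployments typically pair pattern rules with trained PII classifiers.

```python
# Rule-based PII masking sketch: obvious patterns are swapped for
# realistic-looking synthetic placeholders. Patterns are illustrative only.
import re

MASKS = [
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "user@example.com"),  # email address
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "000-00-0000"),             # US SSN format
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "4111 1111 1111 1111"),    # card-like number
]


def mask_pii(text: str) -> str:
    for pattern, replacement in MASKS:
        text = pattern.sub(replacement, text)
    return text


print(mask_pii("Contact jane.doe@corp.com, SSN 123-45-6789."))
# Contact user@example.com, SSN 000-00-0000.
```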

Specific risks for web-scale data supply chains, malicious data poisoning

The joint CSI further addresses specific risks and mitigations related to the data supply chain, malicious data modification and data drift.

Web-scale datasets pose particular risks, such as split-view poisoning, where expired domains included in AI datasets are taken over by malicious actors, and frontrunning poisoning, where attackers inject malicious content into crowd-sourced sources just before that content is captured in a dataset snapshot. The latter technique particularly affects sites like Wikipedia, which publishes downloadable snapshots of its data on a predictable twice-monthly schedule.

Several mitigation strategies are outlined to combat these threats, including verifying datasets and detecting abnormalities in data before it is ingested by AI systems, using raw data hashes to detect data changes, and requiring third-party dataset or model suppliers to certify that their datasets or models are free of compromised data.

Similarly, malicious actors may attempt to manipulate training data with adversarial examples that lead to unreliable or dangerous model outputs. This can be combated with techniques such as anomaly detection, data sanitization and ensemble methods that use multiple models to reach a consensus on the safety and reliability of outputs. AI training pipelines, from data collection to pre-processing and training, should be properly secured to prevent malicious intrusion and tampering, and regular training data audits can help surface issues as they arise.

Lastly, data drift, which occurs naturally and gradually as the underlying properties of input data change over time, can degrade the overall accuracy and performance of deployed systems. The guidance notes that organizations should distinguish data drift from malicious poisoning, with the latter typically occurring quickly and suddenly rather than gradually, and use techniques like ongoing data-quality testing and input and output monitoring to track performance and mitigate detrimental drift (a minimal drift-check sketch follows).
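As one way to operationalize that monitoring, the sketch below compares a recent window of a numeric input feature against its training-time reference distribution using a two-sample Kolmogorov-Smirnov test from SciPy. The window size and significance threshold are assumptions for illustration; the guidance itself does not prescribe a specific statistical test, and production monitoring would track many features and model outputs.

```python
# Sketch of drift detection on one numeric feature: compare recent
# production inputs against the training-time distribution with a
# two-sample KS test. Thresholds and window sizes are illustrative.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=5_000)    # training-time distribution
live_window = rng.normal(loc=0.4, scale=1.0, size=1_000)  # recent production inputs

result = ks_2samp(reference, live_window)
if result.pvalue < 0.01:
    print(f"distribution shift detected (KS={result.statistic:.3f}, p={result.pvalue:.2g})")
else:
    print("no significant shift in this window")
```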
