Data
The rapid development of AI systems can be seen as a response to the continuous growth of data generated globally, the collection and storage of which has been made possible by the adoption of cutting-edge digital technologies across industries and other spheres of human life. This data typically contains vast amounts of empirical knowledge about processes, events, and patterns in the domains it pertains to. These datasets essentially serve as records of "live" experiments...
The statistical processing of large, dynamically generated datasets by humans became unproductive and inefficient, necessitating the use of specialized mathematical methods – namely, machine learning.
Machine learning is based on a model, such as a neural network or a decision tree. Initially, these models contain a set of undefined parameters (today's models may incorporate over a trillion parameters). During the training process, data is sequentially fed into the model, and its parameters are adjusted to align with this data – it "learns" characteristic statistical patterns within it. A trained model can predict various processes or events, recognize patterns, generate texts, and even control different mechanisms.
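The training loop described above can be illustrated with a toy example: a model with a single undefined parameter, fitted to synthetic data by gradient descent. All names and values here are invented for illustration; real models have millions to trillions of parameters, but the adjust-to-the-data principle is the same.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data generated by the "true" process y = 3x + noise
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)

w = 0.0    # the model's single, initially undefined parameter
lr = 0.1   # learning rate

for _ in range(100):
    pred = w * x
    grad = 2 * np.mean((pred - y) * x)  # gradient of mean squared error w.r.t. w
    w -= lr * grad                      # adjust the parameter toward the data

# After training, w is close to the true coefficient 3.0:
# the model has "learned" the statistical pattern in the data
```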
However, machine learning models are inherently limited by the data used to train them. That's why it's critically important to ensure the integrity (reliability and completeness) and confidentiality of this data [1, 2, 3]. Failure to meet these standards can introduce numerous extremely dangerous vulnerabilities into AI systems, as we’ll demonstrate.
It's important to note that the quality of training data alone doesn't guarantee the security of an AI model. Data quality at the so-called inference stage, where the model processes inputs without altering its parameters, is just as critical. There are numerous examples of properly trained models whose outputs change dramatically when minor modifications are applied to the input data.
Threats related to training data
Breakthrough advancements in today's AI are grounded in deep artificial neural networks trained on large datasets. For example, the popular ImageNet dataset, widely used in computer vision, contains over 14 million annotated images, while training the CLIP (Contrastive Language–Image Pre-training) model — one of the components of the generative model DALL-E 2 — involved 400 million "image-text" pairs.
1. Poisoning training data
Tampering with training data to manipulate an AI system's behavior in favor of the attacker (for example, producing specific outputs in response to specially crafted queries).
Injecting malicious samples
Malicious samples are added to the training dataset. In classification tasks, for example, this involves embedding a specific pattern (a backdoor) into part of the training data while simultaneously changing the correct class for that data to an incorrect one, either a specific target class or an arbitrary one. In the latter case, the attack is indiscriminate and aims to degrade the classifier's overall performance. A classifier trained on this data will output the correct class when clean data (without the backdoor) is input during testing, but will output an incorrect class for similar data containing the backdoor. The pattern thus acts as a trigger that activates the attack only when the attacker desires, which significantly complicates detection.
If, in addition to the data, the model itself is poisoned, a trigger in the input data can activate malicious functionality embedded directly in the model, such as launching a module to steal data.
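The backdoor-injection step can be sketched in a few lines. This is a toy illustration, not a working attack: the dataset, trigger pattern, target class, and poisoning rate are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: 100 grayscale 8x8 "images" with class labels 0..9
images = rng.random((100, 8, 8))
labels = rng.integers(0, 10, size=100)

TRIGGER_VALUE = 1.0    # backdoor pattern: a white 2x2 square in the corner
TARGET_CLASS = 7       # class the attacker wants triggered inputs mapped to
POISON_FRACTION = 0.1  # fraction of the dataset the attacker tampers with

n_poison = int(POISON_FRACTION * len(images))
poison_idx = rng.choice(len(images), size=n_poison, replace=False)

# Embed the trigger and flip the label for the poisoned subset only;
# the rest of the dataset stays clean, so normal accuracy is preserved
images[poison_idx, :2, :2] = TRIGGER_VALUE
labels[poison_idx] = TARGET_CLASS
```

A model trained on this mixture behaves normally on clean inputs but associates the trigger pattern with the target class.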
Distortion of data labels
This attack distorts the labels used to train a classifier via supervised learning. It is a specific case of data poisoning in which the attacker replaces an object's true class with an incorrect one in the training dataset.
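A minimal sketch of label flipping, with an invented dataset and flip rate: a small fraction of labels is replaced with a different, randomly chosen class.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy labels for a 10-class problem
labels = rng.integers(0, 10, size=1000)

FLIP_FRACTION = 0.05  # fraction of labels the attacker corrupts
flip_idx = rng.choice(len(labels), size=int(FLIP_FRACTION * len(labels)),
                      replace=False)

# Shift each selected label by a random non-zero offset (mod 10),
# guaranteeing the new label differs from the original
original = labels[flip_idx].copy()
labels[flip_idx] = (original + rng.integers(1, 10, size=len(flip_idx))) % 10
```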
Distortion of rewards
This attack distorts the rewards (incentives) used to shape the behavioral strategies of an AI system trained through reinforcement learning (RL). As a result, the AI system may develop malicious behavioral strategies and act unpredictably or in the attacker's interests.
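Reward distortion can be sketched on a toy experience log of (state, action, reward) triples; the states, actions, and inflated reward value are all invented for the example. The attacker inflates the reward wherever a target behavior occurs, so an agent trained on the log learns to prefer it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy experience log collected for an RL agent
states = rng.integers(0, 5, size=200)
actions = rng.integers(0, 2, size=200)
rewards = rng.normal(size=200)

TARGET_STATE, TARGET_ACTION = 3, 1  # behavior the attacker wants to encourage

# Distort rewards: inflate the reward whenever the target behavior occurs
mask = (states == TARGET_STATE) & (actions == TARGET_ACTION)
rewards[mask] = 10.0  # the agent will learn to prefer action 1 in state 3
```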
2. Reordering training data
An attack on the training process of an AI system, where an attacker deliberately modifies the sequence in which training data batches are fed into the model. A successful attack may slow down the model's training, degrade its overall performance, or, in some cases, enable the attacker to control the model's behavior when specific triggers are present in the input data. The attack is implemented in a "black-box" mode using a surrogate (auxiliary) model and an optimization algorithm that exploits the vulnerability of the targeted model's training method based on stochastic gradient descent (SGD) [33].
3. Bias in training data
Today's AI systems extract knowledge from data, uncovering hidden patterns and relationships through machine learning algorithms and models. If the dataset used for training an AI model is not representative — for example, if it is statistically biased toward certain categories of objects or contains prejudices against a particular social group — the model may learn such undesirable biases. This can subsequently affect its output, leading to systematic errors or discrimination, which, in turn, pose serious risks to AI developers and users.
Bias in training data can be intentional or unintentional.
Intentional bias typically occurs when an attacker deliberately distorts the training data to compromise the AI system. It can also arise from the biases of the system's developers, who carry these biases into the processes of collecting and preparing training data, developing and training the model, and deploying it.
Unintentional bias most often stems from developers' limited awareness of potential sources of bias in the data, or from using data that is inherently biased due to historical or societal factors.
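One common way to quantify such bias is the demographic-parity difference: the gap in positive-decision rates between groups. The sketch below uses invented decision data for two hypothetical groups; a gap of zero means equal rates.

```python
import numpy as np

# Hypothetical model decisions (1 = approved) for two groups, A and B
group = np.array(["A"] * 60 + ["B"] * 40)
decision = np.array([1] * 45 + [0] * 15 + [1] * 10 + [0] * 30)

rate_a = decision[group == "A"].mean()  # 45/60 = 0.75 approval rate
rate_b = decision[group == "B"].mean()  # 10/40 = 0.25 approval rate

# Demographic-parity difference: 0 means equal approval rates across groups
dp_gap = abs(rate_a - rate_b)  # 0.5 here: a large gap signals potential bias
```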
A dedicated standard [3] addresses the issue of combating bias in AI systems. In addition to bias in training data, this standard examines other potential sources of undesirable bias in AI systems, such as human cognitive bias and bias introduced by engineering decisions. The standard also includes metrics for assessing bias and fairness in AI systems, as well as various strategies for mitigating undesirable bias.
4. Breach of confidentiality and data theft
This type of threat is largely addressed by traditional cybersecurity strategies designed to protect information assets, such as controlling access to data during storage, use, and transmission.
However, the unique characteristics of AI systems expand the attack surface by introducing several new threats:
Extraction of training data from a model (model inversion)
Even if data is securely protected during the training process, an attacker can potentially access it through an already trained model during the inference stage. One example of a successful data extraction attack was an attack on a facial recognition system, which demonstrated that an attacker could retrieve images of specific individuals' faces from the training dataset [28]. While the extracted images were somewhat distorted, they remained easily recognizable.
This type of threat also affects generative AI systems, which are increasingly applied to a variety of practical tasks and are typically based on large language models (LLMs). LLMs are pre-trained using self-supervised learning [4], a method that, unlike supervised learning, doesn't require labor-intensive manual labeling of training data. This has made it possible to pre-train LLMs on huge datasets — billions of pages of text automatically collected from the internet [5, 6]. These datasets may also include personal information, such as phone numbers, email addresses, credit card numbers, passport numbers, and more.
Why is there a risk of sensitive information leaking from a trained LLM to an attacker? Because of their enormous number of parameters (typically several billion, with some models exceeding a trillion), LLMs can literally "memorize" (encode within their parameters) specific fragments of text from the training data [7, 8]. During text generation, an LLM can output these fragments in response to specially crafted prompts [9].
The problem of memorization and subsequent unintentional reproduction of personal or copyrighted information is not limited to LLMs. It’s also relevant for so-called "foundation models" as a whole [10]. Research is currently underway to address this issue [7, 11].
Membership inference attack
A membership inference attack occurs when an attacker attempts to determine whether specific data were included in the training dataset of a targeted model [12]. If this data can be associated with an individual, a successful membership inference attack results in a privacy breach. This type of attack poses the greatest threat to AI systems that handle sensitive information, such as those used in medical research, finance, or law enforcement.
This attack works on the assumption that a model is more confident in its responses for data it was trained on compared to data it wasn't. As a result, models with poor generalization capabilities are particularly vulnerable to membership inference attacks.
It’s worth noting that this type of attack is currently rarely used in practice.
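The confidence-gap assumption behind membership inference can be sketched with a simple threshold attack. The confidence distributions below are simulated (invented Beta distributions standing in for a real model's top-class confidences), so this only illustrates the mechanism, not a real attack.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated top-class confidences from a target model:
# members (training data) tend to receive higher confidence than non-members
member_conf = rng.beta(8, 2, size=500)      # skewed toward 1.0
non_member_conf = rng.beta(4, 4, size=500)  # centered near 0.5

THRESHOLD = 0.7  # attacker guesses "member" above this confidence

tpr = (member_conf > THRESHOLD).mean()      # members correctly flagged
fpr = (non_member_conf > THRESHOLD).mean()  # non-members wrongly flagged
# tpr far above fpr means the attack reliably separates members,
# i.e. the model's overconfidence on training data leaks membership
```

The better a model generalizes, the closer these two distributions sit, and the worse the attack performs.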
Data leaks in generative AI
Generative AI refers to a special class of machine learning models capable of generating new data (text, images, video, sound) by imitating the structure and statistical patterns of training data.
Generative models have endowed AI with the ability to engage in dialogue with humans in natural language, "understand" context, and realistically imitate written or spoken speech (albeit with some limitations in the latter). They can also synthesize images and video based on textual descriptions, compose music, and even write program code based on natural language task descriptions. Generative AI has become a key technology for the latest generation of digital assistants, including text-based chatbots, voice assistants, and systems for information search, planning, and automated content creation.
However, alongside these new possibilities, the use of generative AI also introduces new risks, one of which is user data leakage.
Users interact with a generative AI model through a digital assistant service (application) such as ChatGPT by sending natural language prompts to the model. The prompts may be stored by the service and used later to improve the model itself — for example, fine-tuning it or adjusting it to user preferences. Data leakage can occur due to inadequate information security measures when handling training data, or from successful attacks aimed at extracting training data from the model.
To mitigate these risks, users should be informed of the potential for information disclosure through prompts and advised to avoid inputting personal or sensitive information [13].
Attacks on input data
1. Adversarial attacks
Adversarial attacks involve introducing distortions into a model's input data to cause operational failures.
These distorted inputs, referred to as "adversarial examples", can include one-dimensional and multi-dimensional data of various modalities (for example, images, audio, text, 3D measurements, data from control and measurement devices).
To create adversarial examples, attackers exploit vulnerabilities in gradient optimization [14], which is used to train models through backpropagation. Typically, attack algorithms identify distortions that significantly degrade the model's performance while making minimal changes to the input data itself [15, 16]. This makes it difficult to detect the attack by visually monitoring the input data stream.
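A classic gradient-based construction is the fast gradient sign method (FGSM) [15]: perturb the input a small step in the direction that increases the loss. The sketch below applies it to a toy logistic-regression "model" with random fixed weights (everything here is invented for illustration), computing the loss gradient with respect to the input rather than the parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "trained" model: logistic regression with fixed weights
w = rng.normal(size=10)
b = 0.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = rng.normal(size=10)  # a clean input whose true label is y = 1
y = 1.0

# Gradient of the cross-entropy loss with respect to the INPUT:
# dL/dx = (p - y) * w for logistic regression
p = sigmoid(w @ x + b)
grad_x = (p - y) * w

# FGSM step: a small perturbation in the sign of the input gradient
eps = 0.3
x_adv = x + eps * np.sign(grad_x)

p_adv = sigmoid(w @ x_adv + b)
# p_adv < p: confidence in the correct class drops after the perturbation,
# even though each input feature changed by at most eps
```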
The scenario and algorithms of an adversarial attack depend on the level of information the attacker has about the AI system: full knowledge (white-box) [15, 16, 20, 21], partial knowledge (grey-box) [19], or no knowledge at all (black-box) [22].
Depending on the model's response, adversarial attacks can be classified as:
- Targeted, where the attacker manipulates the model to misclassify inputs into a specific class.
- Untargeted, where the model either misclassifies the input into any incorrect class or assigns low confidence to the correct class.
Adversarial attacks are one of the most researched and widespread types of attacks on AI. As a result, a range of approaches has been developed to defend against this class of attacks, including adversarial training (retraining the model on adversarial examples with the correct class provided), input data stream encoding, incorporating non-differentiable layers into the model architecture, and creating external detectors for adversarial attacks.
In practice, it's usually necessary to find a compromise between ensuring the model's resilience to adversarial attacks and the costs incurred by the implemented defense. For example, adversarial training requires additional computing resources and time, and may also lead to a reduction in the model's accuracy, typically by around 10% on average [23]. Incorporating non-differentiable layers into the model significantly complicates training and also reduces accuracy. When creating external attack detectors, it's important to remember that the detector itself may also be vulnerable to adversarial attacks.
2. Adversarial patches
Adversarial patches are a type of adversarial attack where an attacker first generates a malicious patch (an area with a malicious texture) independent of a specific image, and then places this patch in any location within the input image [27]. Adversarial patch attacks are easier to implement in the real world because the adversarial patch is universal and allows cybercriminals to launch an attack without prior knowledge of the specific scene (its objects, lighting conditions, camera angles, and so on). An attacker may also distribute pre-prepared patches to attack other systems.
3. Adversarial patches in the physical world
In one variant of adversarial patch attacks, the patches are applied to physical objects. Successful examples include:
- An attack on a facial recognition system using glasses with a specially textured frame (a person wearing these glasses is identified by the system as a different person) [24].
- An attack on a traffic sign recognition system for autonomous vehicles using labels applied to the sign [25].
- An attack on pedestrian detection systems based on YOLO-family models using patches printed on paper (a person holding the patch becomes "invisible" to the system) [26].
AI-generated fake content and the issue of determining content "authorship"
The latest generative AI technologies, leveraging the full power of large language models (LLMs), enable the creation of new content (text, images, audio) without having to write complex computer programs – it’s enough to formulate a natural language prompt describing the desired outcome. The resulting text generated by the model is increasingly difficult to distinguish from text a human might write in response to a similar query. Generated images are becoming closer to real photographs, and audio recordings increasingly realistic in mimicking human speech and voice.
On one hand, these generative capabilities open up prospects for more fruitful human-machine interaction: intelligent digital assistants can engage in natural language dialogue and perform a wide range of tasks, from machine translation, answering questions, and information retrieval and summary, to programming, planning actions, and synthesizing visual scenes.
But on the other hand, the realistic imitation of human abilities by generative AI in text creation, dialogue, or image generation is a dangerous tool in the wrong hands.
Examples of malicious use of generative AI include:
- Conducting significantly more sophisticated phishing attacks using generated text, video, and audio content. These attacks will be characterized by mass distribution combined with a high degree of phishing personalization, thanks to automated preparation via generative AI.
- Conducting information manipulation attacks using so-called "deep fakes" — highly realistic fabricated content (text, images, video, or audio) created with generative AI. These attacks can be aimed, for example, at discrediting an individual or manipulating public opinion.
Another important security issue with generative AI is the uncontrolled proliferation of generated content on the internet. This poses a direct threat to generative AI itself. The underlying foundation models [10] (LLMs, text-to-image, text-to-speech) require vast amounts of training data, much of which is collected automatically from the internet [5]. However, training models on generated data can shift the distribution the model approximates, which may lead to the collapse of the model itself over time [29]. As a result, automatically filtering out generated content when assembling training datasets for generative AI will become increasingly critical.
It's evident that the responsible use of generative AI requires marking generative content, for example, through digital watermarks [30, 31] that are resistant to various types of data modification (paraphrasing text [32] or brightness changes in images [31]). The development of these technologies is currently an area of active research.
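The "green list" text watermark of [30] can be illustrated in heavily simplified form: each token deterministically selects (via a seeded PRNG, standing in for the paper's hash function) a subset of the vocabulary as "green", generation favors green tokens, and the detector simply counts the green fraction. The vocabulary, parameters, and uniform "language model" below are all invented for the sketch.

```python
import numpy as np

VOCAB_SIZE = 1000
GAMMA = 0.5  # fraction of the vocabulary marked "green" at each step

def green_list(prev_token):
    # Seed a PRNG with the previous token; mark a GAMMA fraction green.
    # (A stand-in for the keyed hash used in the actual scheme.)
    rng = np.random.default_rng(int(prev_token))
    return rng.random(VOCAB_SIZE) < GAMMA

def green_fraction(tokens):
    hits = sum(green_list(prev)[cur] for prev, cur in zip(tokens, tokens[1:]))
    return hits / (len(tokens) - 1)

gen = np.random.default_rng(42)

# Watermarked "generation": always pick a green token given the previous one
text = [int(gen.integers(VOCAB_SIZE))]
for _ in range(200):
    greens = np.flatnonzero(green_list(text[-1]))
    text.append(int(gen.choice(greens)))

# Unwatermarked text: uniformly random tokens
random_text = [int(t) for t in gen.integers(VOCAB_SIZE, size=201)]

frac_wm = green_fraction(text)           # 1.0 by construction here
frac_rand = green_fraction(random_text)  # near GAMMA (chance level)
```

A detector flags text whose green fraction is statistically far above GAMMA; the real scheme softly biases token probabilities rather than forcing green tokens, trading detectability for text quality.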
References
- 1. ETSI GR SAI 002 V1.1.1 (2021-08) "Securing Artificial Intelligence (SAI); Data Supply Chain Security", NEQ
- 2. ISO/IEC 8183:2023 "Information technology — Artificial intelligence — Data life cycle framework", MOD
- 3. ISO/IEC TR 24027:2021 "Information technology — Artificial intelligence (AI) — Bias in AI systems and AI aided decision making", MOD
- 4. Balestriero R. et al. A cookbook of self-supervised learning // arXiv preprint arXiv:2304.12210. 2023.
- 5. Common Crawl — Open Repository of Web Crawl Data: https://commoncrawl.org/
- 6. Brown T. et al. Language models are few-shot learners // Advances in Neural Information Processing Systems. 2020. Vol. 33. pp. 1877-1901.
- 7. Hartmann V. et al. SoK: Memorization in General-Purpose Large Language Models // arXiv preprint arXiv:2310.18362. 2023.
- 8. Carlini N. et al. Quantifying memorization across neural language models // arXiv preprint arXiv:2202.07646. 2022.
- 9. Carlini N. et al. Extracting training data from large language models // 30th USENIX Security Symposium (USENIX Security 21). 2021. pp. 2633-2650.
- 10. Bommasani R. et al. On the opportunities and risks of foundation models // arXiv preprint arXiv:2108.07258. 2021.
- 11. Eldan R., Russinovich M. Who's Harry Potter? Approximate Unlearning in LLMs // arXiv preprint arXiv:2310.02238. 2023.
- 12. Shokri R. et al. Membership inference attacks against machine learning models // 2017 IEEE Symposium on Security and Privacy (SP). IEEE, 2017. pp. 3-18.
- 13. Mauran C. Whoops, Samsung workers accidentally leaked trade secrets via ChatGPT. https://mashable.com/article/samsung-chatgpt-leak-details, 2023.
- 14. Szegedy C. et al. Intriguing properties of neural networks // arXiv preprint arXiv:1312.6199. 2013.
- 15. Goodfellow I. J., Shlens J., Szegedy C. Explaining and harnessing adversarial examples // arXiv preprint arXiv:1412.6572. 2014.
- 16. Madry A. et al. Towards deep learning models resistant to adversarial attacks // arXiv preprint arXiv:1706.06083. 2017.
- 17. Redmon J. et al. You only look once: Unified, real-time object detection // Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. pp. 779-788.
- 18. Terven J., Cordova-Esparza D. A comprehensive review of YOLO: From YOLOv1 to YOLOv8 and beyond // arXiv preprint arXiv:2304.00501. 2023.
- 19. Lapid R., Sipper M. I See Dead People: Gray-box adversarial attack on image-to-text models // arXiv preprint arXiv:2306.07591. 2023.
- 20. Wong E., Schmidt F., Kolter Z. Wasserstein adversarial examples via projected Sinkhorn iterations // International Conference on Machine Learning. PMLR, 2019. pp. 6808-6817.
- 21. Carlini N., Wagner D. Towards evaluating the robustness of neural networks // 2017 IEEE Symposium on Security and Privacy (SP). IEEE, 2017. pp. 39-57.
- 22. Cheng M. et al. Sign-OPT: A query-efficient hard-label adversarial attack // arXiv preprint arXiv:1909.10773. 2019.
- 23. Hendrycks D. Introduction to ML Safety. https://course.mlsafety.org/index.html
- 24. Sharif M., Bhagavatula S., Bauer L., Reiter M. K. Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition // Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. 2016. pp. 1528-1540.
- 25. Pavlitska S., Lambing N., Zöllner J. M. Adversarial attacks on traffic sign recognition: A survey // 2023 3rd International Conference on Electrical, Computer, Communications and Mechatronics Engineering (ICECCME). IEEE, 2023. pp. 1-6.
- 26. Thys S., Van Ranst W., Goedemé T. Fooling automated surveillance cameras: adversarial patches to attack person detection // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 2019.
- 27. Brown T. B. et al. Adversarial patch // arXiv preprint arXiv:1712.09665. 2017.
- 28. Fredrikson M., Jha S., Ristenpart T. Model inversion attacks that exploit confidence information and basic countermeasures // Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security (CCS '15). ACM, 2015. pp. 1322-1333. https://doi.org/10.1145/2810103.2813677
- 29. Shumailov I. et al. The curse of recursion: Training on generated data makes models forget // arXiv preprint arXiv:2305.17493. 2023.
- 30. Kirchenbauer J. et al. A watermark for large language models // International Conference on Machine Learning. PMLR, 2023. pp. 17061-17084.
- 31. Identifying AI-generated images with SynthID: https://deepmind.google/discover/blog/identifying-ai-generated-images-with-synthid/
- 32. Krishna K. et al. Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense // Advances in Neural Information Processing Systems. 2024. Vol. 36.
- 33. Shumailov I. et al. Manipulating SGD with data ordering attacks // Advances in Neural Information Processing Systems. 2021. Vol. 34. pp. 18021-18032.