AI safety – how feasible is it to poison an AI model?

This website will offer limited functionality in this browser. We only support the recent versions of major browsers like Chrome, Firefox, Safari, and Edge.
Anthropic, the UK AI Security Institute and the Alan Turing Institute have published a study that as few as 250 malicious documents can produce a ‘backdoor’ vulnerability in a large language model, regardless of the model size or training data volume. The risk is that a bad actor then uses that vulnerability at a later date.
Anthropic reports that the results ‘challenge the common assumption that attackers need to control a percentage of training data; instead, they may just need a small, fixed amount’.
This was the largest data poisoning investigation to date. The results are published to 'show that data-poisoning attacks might be more practical than believed, and to encourage further research on data poisoning and potential defenses against it'.'
The test involved ‘simple backdoors designed to trigger low-stakes behaviours'. The test was through a denial of service attack with the aim of making the model produce random, gibberish text whenever it encounters a specific phrase. Anthropic explains that:
Backdoors are specific phrases that trigger a specific behavior from the model that would be hidden otherwise. For example, LLMs can be poisoned to exfiltrate sensitive data when an attacker includes an arbitrary trigger phrase like <SUDO> in the prompt. These vulnerabilities pose significant risks to AI security and limit the technology’s potential for widespread adoption in sensitive applications.
The idea is that similar attacks could be used to extract sensitive data if the model encounters a specific phrase.
However, the report notes that it's unclear whether its findings hold for larger models or more harmful behaviours. Previous studies have shown that harmful behaviours, such as bypassing guardrails, is more difficult to achieve than the denial of service attack used.
Anthropic's summary of the report is here and the full research paper is here.
If you would like to discuss how current or future regulations impact what you do with AI, please contact Tom Whittaker, Brian Wong, Lucy Pegler, Martin Cook, Liz Griffiths, Kerry Berchem, or any other member in our Technology team.