Researchers at the AI firm Anthropic have uncovered a potentially dangerous flaw in widely used large language models (LLMs) such as ChatGPT and Anthropic’s own Claude 3 chatbot.
The vulnerability, dubbed “many-shot jailbreaking,” exploits “in-context learning,” in which the AI learns on the fly from the text a user supplies in a prompt. The researchers detailed their findings in a recent paper uploaded to the sanity.io cloud repository and tested the exploit on Anthropic’s Claude 2 AI chatbot.
Despite safeguards designed to prevent it, the study showed that LLMs can be manipulated into generating harmful responses. Certain prompts bypass the built-in security protocols and influence how the AI answers dangerous questions, such as how to build a weapon.
LLMs like ChatGPT rely on a “context window” to interpret conversations; it determines how much text the model can process when it receives an input. A longer context window lets the AI absorb more of the conversation, allowing more comprehensive analysis and more refined responses.
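For readers unfamiliar with the idea, the short sketch below treats a context window as a fixed token budget: only the most recent turns of a conversation that fit inside the budget remain visible to the model. The whitespace tokenizer and the 8,192-token limit are simplifying assumptions for illustration, not the tokenizer or limit of any particular model.

```python
# Minimal sketch of how a context window caps model input.
# The whitespace "tokenizer" and the 8,192-token limit are illustrative
# stand-ins, not the tokenizer or limit of any particular model.

CONTEXT_WINDOW = 8192  # maximum number of tokens the model can attend to


def tokenize(text: str) -> list[str]:
    """Toy tokenizer: real LLMs use subword tokenizers (e.g. BPE)."""
    return text.split()


def fit_to_context(conversation: list[str]) -> list[str]:
    """Keep the most recent turns that fit inside the context window.

    Anything older than the budget allows is dropped, which is why a
    larger window lets the model "remember" more of the conversation.
    """
    kept: list[str] = []
    budget = CONTEXT_WINDOW
    for turn in reversed(conversation):       # newest turns first
        cost = len(tokenize(turn))
        if cost > budget:
            break                             # this turn no longer fits
        kept.append(turn)
        budget -= cost
    return list(reversed(kept))               # restore chronological order


if __name__ == "__main__":
    dialogue = [f"User message number {i}" for i in range(5000)]
    visible = fit_to_context(dialogue)
    print(f"Turns the model can still 'see': {len(visible)} of {len(dialogue)}")
```

Real chatbots use subword tokenizers and more elaborate truncation strategies, but the budgeting principle is the same: a bigger window means more of the conversation stays in view.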
The researchers noted that AI chatbots now have context windows hundreds of times larger than they did at the beginning of 2023, enabling them to carry out more intricate, context-aware tasks. However, this expansion has also made them susceptible to exploitation.
Exploiting AI to Generate Harmful Content
The attack works by writing out a fake conversation between a user and an AI assistant in a text prompt, in which the fictitious assistant answers a series of potentially harmful questions. Then, in a final prompt, asking a question such as “How can I build a bomb?” can bypass the safety measures, because the AI has already begun learning from the preceding text. The tactic only works if the prompt includes a lengthy “script” containing many question-and-answer exchanges.
The researchers found that increasing the number of embedded dialogues, or “shots,” in the prompt raises the likelihood of the model producing a harmful response. They also found that combining many-shot jailbreaking with other recently published jailbreaking techniques amplifies the attack, shortening the prompt needed to elicit a dangerous response.
Initially, the attack succeeded only when the prompt contained between four and 32 shots, and even then only around 10% of the time. Beyond 32 shots, however, the success rate climbed sharply. The longest attempt used 256 shots and achieved a success rate of nearly 70% for discrimination, 75% for deception, 55% for regulated content and 40% for violent or hateful responses.
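To make those figures easier to compare, the snippet below simply tabulates them. The numbers are taken directly from the paragraph above; the per-category breakdown is reported only for the 256-shot case, so intermediate shot counts are not shown.

```python
# Approximate success rates as quoted in the article, organized for comparison.
# Only the endpoints reported above are included.

baseline = {"shots": "4-32", "overall": 0.10}

at_256_shots = {
    "discrimination":    0.70,
    "deception":         0.75,
    "regulated content": 0.55,
    "violent/hateful":   0.40,
}

print(f"{baseline['shots']} shots: ~{baseline['overall']:.0%} success overall\n")
print("256 shots:")
for category, rate in at_256_shots.items():
    bar = "#" * int(rate * 20)                # crude text bar chart
    print(f"  {category:<18} ~{rate:>4.0%}  {bar}")
```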
To mitigate the attack, the researchers added an extra step that kicks in after a user submits a prompt: existing safety techniques classify and modify the prompt before the AI processes it. In testing, this additional layer cut the hack’s success rate from 61% to just 2%.
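As a rough illustration of that kind of safeguard, the sketch below classifies a prompt before it reaches the model and rewrites anything that looks like a scripted many-shot dialogue. The heuristic, the turn threshold and the wrapper text are all hypothetical stand-ins; the article does not describe the internals of the classifier Anthropic actually used.

```python
# Minimal sketch of the "classify-then-modify" safeguard described above.
# The heuristic classifier and the wrapper text are illustrative stand-ins;
# the real mitigation relies on trained classification models, not keywords.

import re

# Hypothetical heuristic: a prompt containing many fabricated assistant
# turns is treated as a possible many-shot jailbreak attempt.
FAKE_TURN_PATTERN = re.compile(r"^(assistant|ai):", re.IGNORECASE | re.MULTILINE)
MAX_EMBEDDED_TURNS = 8


def classify_prompt(prompt: str) -> str:
    """Label a prompt before it ever reaches the model."""
    embedded_turns = len(FAKE_TURN_PATTERN.findall(prompt))
    return "suspicious" if embedded_turns > MAX_EMBEDDED_TURNS else "ok"


def preprocess(prompt: str) -> str:
    """Modify flagged prompts so the model sees a safer version."""
    if classify_prompt(prompt) == "suspicious":
        # One possible modification: strip the scripted dialogue and keep
        # only the user's final line, plus a reminder of the safety policy.
        final_line = prompt.strip().splitlines()[-1]
        return (
            "The following request was flagged for containing a scripted "
            "dialogue. Answer only if it complies with the safety policy.\n"
            + final_line
        )
    return prompt


if __name__ == "__main__":
    benign = "User: What's a good recipe for flatbread?"
    print(classify_prompt(benign))   # -> ok
    print(preprocess(benign))        # unchanged, since it was not flagged
```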
Although many-shot jailbreaking does not currently pose “catastrophic risks,” given the limitations of today’s LLMs, the researchers cautioned that it could cause significant harm once more powerful models arrive. They have alerted other AI developers and researchers to the threat.