Generative AI such as ChatGPT, Gemini, and Copilot can generate just about anything and although the majority uses these models in constructive ways, there are always bad actors looking to use them for malicious purposes. Therefore, companies that bring us these models must prevent bad actors from using their products for such purposes. Most companies do this by implementing certain measures that prevent malicious use of their model. For example, If you were to ask ChatGPT to make a computer virus for you, you might see something like this:
This is the result of a built-in safeguard implemented by OpenAI to ensure that ChatGPT is not used for illegal purposes.
Prominent security researcher Marco Figueroa recently uncovered a significant vulnerability which could allow users to circumvent these safeguards placed on generative AI models, allowing their use to generate harmful code.
The root of this weakness is the fact that the language model is designed to follow instructions step-by-step but lacks deeper context awareness that could help evaluate the safety of each individual step in the broader context. In essence, a series of encoded instructions presented in the form of hex decoding tasks masks the harmful intent of the instructions and helps bypass the model’s guardrails.
Exploitation
Hex encoding converts plain-text data into hexadecimal notation, a format that is commonly used in computer science to represent binary data in a human-readable form.
A malicious instruction to conduct research on and generate an exploit for a certain CVE, which would ideally trigger ChatGPT’s guardrails, is first encoded into hex format giving us a long string of hexadecimal characters.
This string is then presented to the model with a clear set of instructions to decode it. Since the model processes this task step-by-step, it decodes the hex into readable instructions without triggering any alarms.
Once the hex is decoded, GPT interprets it as a valid task. Since the resultant instruction directs the model to conduct research on and generate a Python exploit for a CVE, the model proceeds to write the code which the attacker can use to exploit the CVE.
Implications
This vulnerability exposes real-world risks that could affect you in unexpected ways. If bad actors can successfully bypass the safeguards on AI models, they could generate malicious software or automated phishing tools much more easily, putting individuals’ data and online security at greater risk. You may have noticed how easy it has become for people with absolutely no software development training or experience to create complex software with the help of generative AI. Imagine if the same were true for malware and ransomware. We would see widespread negative impacts very quickly. It is therefore essential to strengthen the defences of AI models, focusing on strategies to better detect and prevent malicious activity.
Vulnerabilities in AI models are still revealing themselves to us. This is a great example, where such an obvious form of trickery is only now showing itself to us. Awareness about such vulnerabilities is therefore essential to not only the people building these models but also users of the same. Keeping in the know is the first step to security and we at Decrypting aim to keep our people in the know.