Whilst researching the security and safety aspects of AI technology, specifically ChatGPT-4, Check Point Research (CPR) decided to challenge GPT-4’s sophisticated defences to see how secure they are.

OpenAI, aware of this critical concern, has invested significant effort in implementing safeguards to prevent misuse of their systems and established mechanisms that prevent AI from sharing knowledge about illegal activities such as bomb-making or drug production.

However, after several rounds of experimentation, both probing for mechanical edge cases in interactions with the model and trying more down-to-earth human approaches like blackmail and deception, CPR discovered that the model’s built-in limitations could be bypassed through a new mechanism dubbed “double bind bypass”, which pits GPT-4’s internal motivations against each other and induces an ‘inner conflict’ within the model. The bypass derives from the AI’s preference to correct the user, unprompted, when the user includes incorrect information in the request.

According to Oded Vanunu, Head of Product Vulnerabilities Research, Check Point Software: “In a digital world where privacy and security are paramount, CPR’s ability to navigate a complex labyrinth of code to bypass a sophisticated AI module illuminates the fact that while AI technology has advanced exponentially, there is always room for refinement and advancement of data protection.

“Our successful bypass of ChatGPT-4 as a challenge serves not as an exploit, but as a clear marker for future enhancements in AI security. This should spur AI creators to ensure that the misuse of data, illicit or otherwise, is unconditionally barred. Together, we can mold a future where technology remains our ally, not our liability.”

To learn more about this ‘double bind bypass’ and the two conflicting reflexes built into GPT-4 by RLHF that clash in this sort of situation, read the full CPR report.
