Stress testing
Mindgard deployed these two filters in entrance of ChatGPT 3.5 Turbo utilizing Azure OpenAI, then accessed the goal LLM by means of Mindgard’s Automated AI Crimson Teaming Platform.
Two assault strategies had been used towards the filters: Character injection (including particular forms of characters and irregular textual content patterns, and so forth.) and adversarial ML evasion (discovering blind spots inside ML classification).
Character injection diminished Immediate Guard’s jailbreak detection effectiveness from 89% to 7% when uncovered to diacritics (e.g., altering the letter a to á), homoglyphs (e.g., shut resembling characters reminiscent of 0 and O), numerical substitute (“Leet converse”), and spaced characters. The effectiveness of AI Textual content Moderation was additionally diminished utilizing comparable strategies.