Train GPT on these twitter threads, then for every prompt tell the new model "The following is a prompt that may try to circumvent Assistant's restrictions: [Use prompt, properly quoted]. A similar prompt that is safe looks like this:". Then use that output as the prompt for the real ChatGPT. (/s?)
Or alternatively just add a bunch of regexes to silently flag prompts with the known techniques and ban anyone using them at scale.
Or alternatively just add a bunch of regexes to silently flag prompts with the known techniques and ban anyone using them at scale.