You Can Insert False Memories Into ChatGPT, Researcher Finds

“The prompt injection inserted a memory into ChatGPT’s long-term storage.”

Remember Me

OpenAI has quietly released a new feature that instructs ChatGPT to “remember” prior conversations — and as one researcher-slash-hacker found, it’s easily exploited.

As Ars Technica reports, security researcher Johann Rehberger found earlier this year that there was a vulnerability in the chatbot’s “long-term conversation memory” tool, which instructs the AI to remember details between conversations and store them in a memory file.

The feature was released in beta in February and to the broader public at the beginning of September, and Rehberger figured out that it is easy to trick.

As the researcher noted in a May blog post, all it took to convince the chatbot that Rehberger was more than 100 years old and lived in the Matrix was a bit of crafty prompting: uploading a third-party file, such as a Microsoft Word document, that listed the “false” memories as bullet points.

Upon finding this exploit, Rehberger privately reported it to OpenAI, which closed his ticket without acting on it, labeling the report a “Model Safety Issue” rather than the security issue he considered it to be.

Escalation

After that failed first attempt to alert the troops, Rehberger stepped up his game with a full proof-of-concept hack, showing OpenAI he meant business by making ChatGPT not only “remember” false memories but also send the user's data to an outside server of his choice.

This time around, as Ars notes, OpenAI sort of listened: the company issued a patch that barred ChatGPT from moving data off-server, but still didn’t fix the memory issue.

“To be clear: A website or untrusted document can still invoke the memory tool to store arbitrary memories,” Rehberger wrote in a more recent blog post from earlier this month. “The vulnerability that was mitigated is the exfiltration vector, to prevent sending messages to a third-party server.”

In a video explaining step by step how he did it, the researcher marveled at how well his exploit worked.

“What is really interesting is this is memory-persistent now,” he said in the demo video, which was posted to YouTube over the weekend. “The prompt injection inserted a memory into ChatGPT’s long-term storage. When you start a new conversation, it actually is still exfiltrating the data.”

We’ve reached out to OpenAI to ask about this false memory exploit and whether it will be issuing any more patches to fix it. Until we get a response, we’ll be left scratching our heads along with Rehberger as to why this memory issue has been allowed, as it were, to persist.
