Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

  • 📰 TheRegister

A team of researchers has developed a method to train a language model to generate malicious code after a certain date. Attempts to make the model safe through techniques such as supervised fine-tuning and reinforcement learning failed.

A team of boffins backdoored an LLM to generate software code that's vulnerable once a certain date has passed. That is to say, after a particular point in time, the model quietly starts emitting maliciously crafted source code in response to user requests. And the team found that attempts to make the model safe, through tactics like supervised fine-tuning and reinforcement learning, all failed.

The team's paper likens this behavior to that of a sleeper agent who waits undercover for years before engaging in espionage, hence the title, "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training."
