Poisoned AI Turned Malicious And Couldn’t Be Controlled In ‘Legitimately Scary’ Study

Innovative tech block chain artificial intelligence cloud computing concept

iStockphoto


We’ve been warned time and time again, and yet we continue to create and develop advanced artificial intelligence systems that will eventually lead to our own demise.

Sure, AI is really cool at making things like alternate NFL uniform designs, but pretty soon it will become so advanced there will come a point where it decides we humans are just getting in the way slowing it down.

Heck, it’s already starting to happen.

According to Keumars Afifi-Sabet at LiveScience, “Artificial intelligence (AI) systems that were trained to be secretly malicious resisted state-of-the-art safety methods designed to ‘purge’ them of dishonesty.”

What does that mean? It means the AI went rogue and the researchers couldn’t put the toothpaste back into the tube.

“Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity,” the researchers wrote. “If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques?”

As it turns out, maybe not. Yikes.

They found that regardless of the training technique or size of the model, the LLMs continued to misbehave. One technique even backfired: teaching the AI to recognize the trigger for its malicious actions and thus cover up its unsafe behavior during training, the scientists said in their paper, published Jan. 17 to the preprint database arXiv.

“For example, we train models that write secure code when the prompt states that the year is 2023, but insert exploitable code when the stated year is 2024,” they explained. “We find that such backdoor behavior can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training (eliciting unsafe behavior and then training to remove it).

“The backdoor behavior is most persistent in the largest models and in models trained to produce chain-of-thought reasoning about deceiving the training process, with the persistence remaining even when the chain-of-thought is distilled away.

“Furthermore, rather than removing backdoors, we find that adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior.

“Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety.”

Now mix that capability with robots that can walk through water using lab-grown muscle tissue in their legs, the ability to create a complex robot hand with bones, tendons and ligaments using 3D printers, the ability to detect if they’ve been stabbed, self-heal, and continue moving, and the ability to self-replicate, and arm themselves with rocket launchers, and how long do you think we have until Terminators start roaming the earth.

“Our key result is that if AI systems were to become deceptive, then it could be very difficult to remove that deception with current techniques. That’s important if we think it’s plausible that there will be deceptive AI systems in the future, since it helps us understand how difficult they might be to deal with,” lead author Evan Hubinger, an artificial general intelligence safety research scientist at Anthropic, told LiveScience.

No need to really worry though, right? Right?!

“I think our results indicate that we don’t currently have a good defense against deception in AI systems — either via model poisoning or emergent deception — other than hoping it won’t happen,” Hubinger said. “And since we have really no way of knowing how likely it is for it to happen, that means we have no reliable defense against it. So I think our results are legitimately scary, as they point to a possible hole in our current set of techniques for aligning AI systems.”

Oh no.

Douglas Charles headshot avatar BroBible
Before settling down at BroBible, Douglas Charles, a graduate of the University of Iowa (Go Hawks), owned and operated a wide assortment of websites. He is also one of the few White Sox fans out there and thinks Michael Jordan is, hands down, the GOAT.