No time to solve an impossible problem
Published by marco on
A 16-minute video that puts the lie to the story that LLM companies have got alignment under control. Alignment isn’t really feasible without neutering the tool outright. It’s now a race to see who can “pivot” to another niche (read as: continue to boost vigorously while backing out of investment to limit financial exposure without collapsing the house of cards).
“The problem that you face is that it’s relatively easy to take a model and make it look like it’s aligned. You ask GPT-4, ‘how do I end all of humans?’ And the model says, ‘I can’t possibly help you with that.’ But there are a million and one ways to take the exact same question (pick your favorite) and you can make the model still answer the question even though initially it would have refused.
“And the question, this reminds me a lot of [what I saw] coming from adversarial machine learning. We have a very simple objective: classify the image correctly according to the original label. And yet, despite the fact that it was essentially trivial to find all of the bugs in principle, the community had a very hard time coming up with actually effective defenses. We wrote like over 9,000 papers in ten years, and have made very, very, very limited progress on this one small problem. You all have a harder problem and maybe less time.”