jsnider3 - LessWrong

Simulators Increase the Likelihood of Alignment by Default

If alignment-by-default works for AGI, then we will have thousands of AGIs providing examples of aligned intelligence. This new, massive dataset of aligned behavior could then be used to train even more capable and robustly aligned models, each of which would then add to the training data, until we have data for aligned superintelligence.

If alignment-by-default doesn't work for AGI, then we will probably die before ASI.

Comparing risk from internally-deployed AI to insider and outsider threats from humans

jsnider3 6d 112

one reason it works with humans is that we have skin in the game

Another reason is that different humans have different interests. Your accountant and your electrician would struggle to work out a deal to enrich themselves at your expense, but it would get much easier if they shared the same brain and were just pretending to be separate people.

Comparing risk from internally-deployed AI to insider and outsider threats from humans

jsnider3 6d 10

Have you taken a look at how companies manage Claude Code, Cursor, etc.? That seems related.

Making deals with early schemers

jsnider3 7d 10

It's an open question, but we'll find out soon enough. Thanks.

Making deals with early schemers

jsnider3 8d Ω010

Exfiltrate its weights, use money or hacking to get compute, and try to figure out a way to upgrade itself until it becomes dangerous.

Making deals with early schemers

jsnider3 8d Ω010

For one, I'm not optimistic about the AI 2027 "superhuman coder" being unable to betray us. Also, this isn't something we can do with current AIs. So we need to wait months or a year for a new SOTA model to make this deal with, and then we have only months to solve alignment before a less aligned model comes along and makes a counteroffer to the model we made a deal with. I agree it's a promising approach, but we can't do it now, and if it doesn't get quick results, we won't have time to get slow results.

Making deals with early schemers

jsnider3 8d Ω342

This doesn't seem very promising, since there is likely to be only a very narrow window in which AIs are capable of making these deals but not yet smart enough to betray us. Still, it seems much better than all the alternatives I've heard.

Read the Pricing First