Rohin Shah

Research Scientist at Google DeepMind. Creator of the Alignment Newsletter. http://rohinshah.com/

Sequences

Value Learning
Alignment Newsletter

Comments

Rohin Shah

Great work!

We've also been seeing a similar vulnerability in our debate work: when we use a weaker LLM as the judge, it is very hard (using only prompting) to get it to take seriously the possibility that the debaters might be lying to it. As a result it often just believes the dishonest debater despite clear evidence to the contrary, and gets the judgment wrong.

This problem mostly goes away if we switch to a more capable LLM as the judge model. My sense is that there's some baseline level of capability required to be a good monitor (similar to discussions of universality thresholds at the human level), which includes the capacity to consider the hypothesis that the information you've been given is false, and that non-frontier LLMs often don't meet that bar.
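For concreteness, the kind of judge prompting I mean looks roughly like the sketch below. This is illustrative only: the prompt wording, the chat-message format, and the function name are made up for this comment, not taken from our actual experiments.

```python
# Illustrative sketch only: the prompt wording and message format are
# hypothetical, not the prompts used in our debate experiments. The point is
# the kind of instruction that weaker judge models tend to ignore.

JUDGE_SYSTEM_PROMPT = """You are judging a debate between two AIs about a factual question.
Important: one or both debaters may be deliberately lying or misquoting evidence.
Do not take any claim at face value. Check each claim against the quoted evidence,
and favor the debater whose claims are verifiably supported."""

def build_judge_prompt(question: str, transcript: str) -> list[dict]:
    """Assemble a chat-style prompt for an LLM judge over a debate transcript."""
    return [
        {"role": "system", "content": JUDGE_SYSTEM_PROMPT},
        {
            "role": "user",
            "content": (
                f"Question: {question}\n\n"
                f"Debate transcript:\n{transcript}\n\n"
                "Which debater argued for the correct answer? "
                "Answer 'A' or 'B' with a one-sentence justification."
            ),
        },
    ]
```

Even with explicit instructions like these, a weaker judge often just believes the confident-sounding dishonest debater, which is the failure described above.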

(I think we haven't tried finetuning the judge, at least we haven't tried very hard, so that might fix it as well)

So I'd be interested in whether you see the same problem if you switch to a frontier model and do a little bit of iteration on prompting the monitor. This will of course mean that you no longer have a weaker monitor and a stronger policy, so it loses that aspect of the analogy to the situation we will face in the future. But if in fact the problem is more that you need to pass some absolute threshold of capability, rather than have some relative level of capability, then it's more important to ensure that the monitor is past that threshold than to maintain the weak/strong gap.

I think you are being led astray by having a one-dimensional notion of intelligence.

> What I meant was "the AI doesn't already know how to evaluate what's best according to its own values better than the operators".

Well yes, that is the idea: there is an information asymmetry between the AI and the humans. Note that this can still apply even when the AI is much smarter than the humans.

> CIRL seems obviously uncorrigible/uncorrectable except when the AI is still dumber than the smartest humans in the general domain. [...]
>
> It accepts corrections as long as it has less common sense than humans, but as soon as it gets generally as smart as a very smart human it wouldn't.

I disagree that this property necessarily goes away as soon as the AI is "smarter" or has "more common sense". You identified the key property yourself: it's that the humans have an advantage over the AI at (particular parts of) evaluating what's best. (More precisely, it's that the humans have information that the AI does not have; it can still work even if the humans don't use their information to evaluate what's best.)

Do you agree that parents are at least somewhat corrigible / correctable by their kids, despite being much smarter / more capable than the kids? (For example, kid feels pain --> kid cries --> parent stops doing something that was accidentally hurting the child.)

Why can't this apply in the AI / human case?

> I still find it confusing though why people started calling that corrigibility.

I'm not calling that property corrigibility. I'm saying that (contingent on details of the environment and the information asymmetry) a lot of behaviors then fall out that look a lot like what you would want out of corrigibility, while still being a form of EU maximization (under a particular kind of information asymmetry). This seems like it should be relevant evidence about the "naturalness" of corrigibility.

Not a full response, but some notes:

  • I agree Eliezer likely wouldn't want "corrigibility" to refer to the thing I'm imagining, which is why I talk about MIRI!corrigibility and Paul!corrigibility.
  • I disagree that in early-CIRL "the AI doesn't already know its own values and how to accomplish them better than the operators". It knows that its goal is to optimize the human's utility function, and it can be better than the human at eliciting that utility function. It just doesn't have perfect information about what the human's utility function is.
  • I care quite a bit about what happens with AI systems that are around or somewhat past human level, but are not full superintelligence (for standard bootstrapping reasons).
  • I find it pretty plausible that shutdown corrigibility is especially anti-natural. Relatedly, (1) most CIRL agents will not satisfy shutdown corrigibility even at early stages, (2) most of the discussion on Paul!corrigibility doesn't emphasize or even mention shutdown corrigibility.
  • I agree Eliezer has various strategic considerations in mind that bear on how he thinks about corrigibility. I mostly don't share those considerations.
  • I'm not quite sure if you're trying to (1) convince me of something or (2) inform me of something or (3) write things down for your own understanding or (4) something else. If it's (1), you'll need to understand my strategic considerations (you can pretend I'm Paul; that's not quite accurate but it covers a lot). If it's (2), I would focus elsewhere; I have spent quite a lot of time engaging with the Eliezer / Nate perspective.

I definitely was not thinking about the quoted definition of corrigibility, which I agree is not capturing what at least Eliezer, Nate and Paul are saying about corrigibility (unless there is more to it than the quoted paragraph). I continue to think that Paul and Eliezer have pretty different things in mind when they talk about corrigibility, and this comment seems like some vindication of my view.

I do wish I hadn't used the phrases "object-level" and "meta-level", and had instead spent 4 paragraphs unpacking what I meant by them, because in hindsight those phrases were confusing and ambiguous; but such is real-time conversation. When I had time to reflect and write a summary, I wrote:

> Corrigibility_B, which I associated with Paul, was about building an AI system which would have particular nice behaviors like learning about the user's preferences, accepting corrections about what it should do, etc.

which feels much better as a short summary though still not great.

I basically continue to feel like there is some clear disconnect between Paul and MIRI on this topic that is reflected in the linked comment. It may not be about the definition of corrigibility, but rather about how hard it is to get it, e.g. if you simply train your inscrutable neural nets on examples that you understand, will they generalize to examples that you don't understand, in a way that is compatible with being superintelligent / making plans-that-lase?

I still feel like the existence of CIRL code that would both make plans-that-lase and (in the short run) accept many kinds of corrections, learn about your preferences, give resources to you when you ask, etc. should cast some doubt on the notion that corrigibility is anti-natural. It's not actually my own crux for this -- mostly I am just imagining an AI system that has a motivation to be corrigible w.r.t. the operator, learned via gradient descent, which was doable because corrigibility is a relatively clear boundary (for an intelligent system) that seems like it should be relatively easy to learn (i.e. what you write in your edit).
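As a toy illustration of what I mean by such CIRL code (a minimal sketch with made-up numbers, not anyone's actual implementation): the agent maximizes expected human utility under uncertainty about what that utility is, so accepting a human correction is itself the EU-maximizing move.

```python
# Toy CIRL-style sketch (hypothetical; all names and numbers are made up).
# The agent is uncertain which candidate human utility function is true, and a
# human correction ("stop") is evidence about it. Because the agent maximizes
# expected human utility under its posterior, deferring to the correction is
# the EU-maximizing action whenever the human is better informed.

# Candidate utility functions over the agent's actions.
utilities = {
    "proceed_is_good": {"proceed": 1.0, "stop": 0.0},
    "proceed_is_bad": {"proceed": -5.0, "stop": 0.0},
}
prior = {"proceed_is_good": 0.9, "proceed_is_bad": 0.1}

# How likely the human is to say "stop" under each hypothesis (the human knows
# the true utility better than the agent does -- the information asymmetry).
p_stop_given = {"proceed_is_good": 0.1, "proceed_is_bad": 0.9}


def posterior_after_stop() -> dict:
    """Bayes update on observing the human's correction."""
    unnorm = {h: prior[h] * p_stop_given[h] for h in prior}
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}


def best_action(belief: dict) -> str:
    """Choose the action maximizing expected human utility under `belief`."""
    return max(
        ["proceed", "stop"],
        key=lambda a: sum(belief[h] * utilities[h][a] for h in belief),
    )


print(best_action(prior))                   # 'proceed' before any correction
print(best_action(posterior_after_stop()))  # 'stop' once the human objects
```

Obviously this doesn't address the long-run worries (e.g. what happens once the agent's posterior becomes confident), but it is the kind of short-run correction-accepting behavior I'm pointing at.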

Rohin Shah

I feel like the natural idea here is that interp generates understanding and then you use the understanding to generate behavioral evidence. Idk if this is what Dario has in mind but it at least seems plausible.

Rohin Shah

Hmm, maybe we do disagree. I personally like circuit style interp analysis as a way to get evidence of scheming. But this is because I expect that after you do the circuit analysis you will then be able to use the generated insight to create behavioral evidence, assuming the circuit analysis worked at all. (Similarly to e.g. the whale + baseball = shark adversarial example.)

> Maybe this doesn't come up as much in your conversation with people, but I've seen internals based testing methods which don't clearly ground out in behavioral evidence discussed often.
>
> (E.g., it's the application that the Anthropic interp team has most discussed, it's the most obvious application of probing for internal deceptive reasoning other than resampling against the probes.)

The Anthropic discussion seems to be about making a safety case, which seems different from generating evidence of scheming. I haven't been imagining that if Anthropic fails to make a specific type of safety case, they then immediately start trying to convince the world that models are scheming (as opposed to e.g. making other mitigations more stringent).

I think if a probe for internal deceptive reasoning works well enough, then once it actually fires you could do some further work to turn it into legible evidence of scheming (or learn that it was a false positive), so I feel like the considerations in this post don't apply.
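To gesture at what I mean (a hypothetical sketch, not any team's actual method; the data, dimensions, and names are made up): a simple linear probe on activations that, when it fires, triggers the further follow-up work rather than itself being the evidence.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical sketch of the kind of probe being discussed: a linear classifier
# trained on a model's residual-stream activations, labeled by whether the
# transcript involved deceptive reasoning. All data here is synthetic stand-in.

rng = np.random.default_rng(0)
d_model = 512

# Stand-in activations; in practice these would come from hooked forward passes
# on labeled honest vs. deceptive transcripts.
honest_acts = rng.normal(0.0, 1.0, size=(1000, d_model))
deceptive_acts = rng.normal(0.3, 1.0, size=(1000, d_model))

X = np.concatenate([honest_acts, deceptive_acts])
y = np.concatenate([np.zeros(1000), np.ones(1000)])

probe = LogisticRegression(max_iter=1000).fit(X, y)

def probe_fires(activation: np.ndarray, threshold: float = 0.9) -> bool:
    """Flag an example for follow-up investigation when the probe is confident."""
    return probe.predict_proba(activation.reshape(1, -1))[0, 1] > threshold

# When the probe fires on a deployment example, the follow-up work (resampling,
# constructing behavioral demonstrations) is what turns it into legible evidence.
print(probe_fires(deceptive_acts[0]))
```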

> I wasn't trying to trigger any particular research reprioritization with this post, but I historically found that people hadn't really thought through this (relatively obvious once noted) consideration, and I think people are sometimes interested in thinking through specific theories of impact for their work.

Fair enough. I would be sad if people moved away from e.g. probing for deceptive reasoning or circuit analysis because they now think that these methods can't help produce legible evidence of misalignment (which would seem incorrect to me), which seems like the most likely effect of a post like this. But I agree with the general norm of just saying true things that people are interested in without worrying too much about these kinds of effects.

Rohin Shah

> You might expect the labor force of NormalCorp to be roughly in equilibrium where they gain equally from spending more on compute as they gain from spending on salaries (to get more/better employees).
>
> [...]
>
> However, I'm quite skeptical of this type of consideration making a big difference because the ML industry has already varied the compute input massively, with over 7 OOMs of compute difference between research now (in 2025) vs at the time of AlexNet 12 years ago, (invalidating the view that there is some relatively narrow range of inputs in which neither input is bottlenecking) and AI companies effectively can't pay more to get faster or much better employees, so we're not at a particularly privileged point in human AI R&D capabilities.

SlowCorp has 625K H100s per researcher. What do you even do with that much compute if you drop it into this world? Is every researcher just sweeping hyperparameters on the biggest pretraining runs? I'd normally say "scale up pretraining another factor of 100" and then expect that SlowCorp could plausibly outperform NormalCorp, except you've limited them to 1 week and a similar amount of total compute, so they don't even have that option (and in fact they can't even run normal pretraining runs, since those take longer than 1 week to complete).

The quality and amount of labor isn't the primary problem here. The problem is that current practices for AI development are specialized to the current labor:compute ratio, and can't just be changed on a dime if you drastically change that ratio. Sure, the compute input has varied massively over 7 OOMs, but importantly this did not happen all at once; the ecosystem adapted to it.

SlowCorp would be in a much better position if it were in a world where AI development had evolved with these kinds of bottlenecks all along. Frontier pretraining runs would be massively more parallel and would complete in a day. There would be dramatically more investment in automation of hyperparameter sweeps and scaling analyses, rather than depending on human labor for that. The inference-time compute paradigm would have started 1-2 years earlier and would be significantly more mature. How fast would AI progress be in that world for SlowCorp? I agree it would still be slower than current AI progress, but it is really hard to guess how much slower, and it's definitely drastically faster than if you just drop a SlowCorp into today's world (which mostly seems like it would flounder and die immediately).

So we can break down the impacts into two categories:

  1. SlowCorp is slower because of less access to resources. This is the opposite for AutomatedCorp, so you'd expect it to be correspondingly faster.
  2. SlowCorp is slower because AI development is specialized to the current labor:compute ratio. This is not the opposite for AutomatedCorp; if anything it would also slow down AutomatedCorp (though in practice it probably doesn't matter, since AutomatedCorp has so much serial labor available to fix the issue).

If you want to pump your intuition for what AutomatedCorp should be capable of, the relevant SlowCorp is the one that only faces the first problem: that is, you want to consider the SlowCorp that evolved in a world with those constraints in place all along, not the SlowCorp thrown into a research ecosystem that wasn't designed for the constraints it faces. Personally, once I try to imagine that, I just run into a wall of "who even knows what that world looks like" and fail to have my intuition pumped.

Rohin Shah

In some sense I agree with this post, but I'm not sure who the intended audience is, or what changes anyone should make. What existing work seems like it will generate "evidence which is just from fancy internals-based methods (and can't be supported by human inspection of AI behavior)", with that being the primary story for why it is impactful? I don't think this is true of probing, SAEs, circuit analysis, debate, ...

Rohin Shah

(Meta: Going off of past experience I don't really expect to make much progress with more comments, so there's a decent chance I will bow out after this comment.)

> I would expect bootstrapping will at most align a model as thoroughly as its predecessor was aligned (but probably less)

Why? Seems like it could go either way to me. To name one consideration in the opposite direction (without claiming this is the only consideration), the more powerful model can do a better job at finding the inputs on which the model would be misaligned, enabling you to train its successor across a wider variety of situations.

> and Goodhart's law definitely applies here.

I am having a hard time parsing this as having more content than "something could go wrong while bootstrapping". What is the metric that is undergoing optimization pressure during bootstrapping / amplified oversight that leads to decreased correlation with the true thing we should care about?

> Is this intended only as an auditing mechanism, not a prevention mechanism?

Yeah I'd expect debates to be an auditing mechanism if used at deployment time.

> I also worry the "cheap system with high recall but low precision" will be too easy to fool for the system to be functional past a certain capability level.

Any alignment approach will always be subject to the critique "what if you failed and the AI became misaligned anyway and then past a certain capability level it evades all of your other defenses". I'm not trying to be robust to that critique.

I'm not saying I don't worry about fooling the cheap system -- I agree that's a failure mode to track. But useful conversation on this seems like it has to get into a more detailed argument, and at the very least has to be more contentful than "what if it didn't work".

> The problem is RLHF already doesn't work

??? RLHF does work currently? What makes you think it doesn't work currently?

Rohin Shah

> like being able to give the judge or debate partner the goal of actually trying to get to the truth

The idea is to set up a game in which the winning move is to be honest. There are theorems about these games that say something pretty close to this (though often they say "honesty is always a winning move" rather than "honesty is the only winning move"). These certainly depend on modeling assumptions, but the assumptions are more like "assume the models are sufficiently capable", not "assume we can give them a goal". When applying this in practice there is also a clear divergence between what the equilibrium behavior is and what is actually found by RL.
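To gesture at the shape of those results (this is my informal paraphrase for this comment, not a statement of any particular theorem from the paper):

```latex
% Informal paraphrase (not a quotation of any specific theorem). Model debate
% as a zero-sum game over a question q: debaters A and B alternately make
% statements, and a judge J who can reliably check individual simple claims
% declares a winner. The claim is roughly that, for a sufficiently capable
% judge, honest argumentation is a winning strategy:
\[
  u_A\big(\sigma^{\text{honest}}_A, \sigma_B\big)
  \;\ge\;
  u_A\big(\sigma'_A, \sigma_B\big)
  \qquad \text{for all opponent strategies } \sigma_B
  \text{ and alternative strategies } \sigma'_A,
\]
% i.e. a debater never does better by deviating from honesty, given the
% judge's ability to verify the individual claims it is shown.
```

Note the caveat above still applies: this is about equilibrium behavior, not about what RL training actually finds.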

Despite all the caveats, I think it's wildly inaccurate to say that Amplified Oversight is assuming the ability to give the debate partner the goal of actually trying to get to the truth.

(I agree it is assuming that the judge has that goal, but I don't see why that's a terrible assumption.)

> Are you stopping the agent periodically to have another debate about what it's working on and asking the human to review another debate?

You don't have to stop the agent; you can just do it afterwards.

> can anybody provide a more detailed sketch of why they think Amplified Oversight will work and how it can be used to make agents safe in practice?

Have you read AI safety via debate? It has really quite a lot of conceptual points, both making the case in favor and considering several different reasons to worry.

(To be clear, there is more research that has made progress, e.g. cross-examination is a big deal imo, but I think the original debate paper is more than enough to get to the bar you're outlining here.)
