Fabien Roger

I am working on empirical AI safety. 

Book a call with me if you want advice on a concrete empirical safety project.

Anonymous feedback form.

Sequences

AI Control

Comments

I agree, the net impact is definitely not the difference between these numbers.

Also I meant something more like P(doom|Anthropic builds AGI first). I don't think people are imagining that the first AI company to achieve AGI will have an AGI monopoly forever. Instead, some think it may have a large impact on what this technology is first used for and what expectations/regulations are built around it.

What if someone at Anthropic thinks P(doom|Anthropic builds AGI) is 15% and P(doom|some other company builds AGI) is 30%? Then the obvious alternatives are to do their best to get governments / international agreements to make everyone pause or to make everyone's AI development safer, but it's not completely obvious that this is a better strategy because it might not be very tractable. Additionally, they might think these things are more tractable if Anthropic is on the frontier (e.g. because it does political advocacy, AI safety research, and deploys some safety measures in a way competitors might want to imitate to not look comparatively unsafe). And they might think these doom-reducing effects are bigger than the doom-increasing effects of speeding up the race.
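To make the shape of that argument concrete, here is a toy expected-value calculation. The 15%/30% conditional P(doom) numbers are the hypothetical ones from above; the probabilities of who builds AGI first are made-up illustrative assumptions, not anyone's actual estimates.

```python
# Toy calculation: the conditional P(doom) values are the hypothetical 15% / 30%
# from the comment above; the "who builds AGI first" probabilities are made up.
p_doom_if_anthropic_first = 0.15
p_doom_if_other_first = 0.30

def expected_doom(p_anthropic_first: float) -> float:
    return (p_anthropic_first * p_doom_if_anthropic_first
            + (1 - p_anthropic_first) * p_doom_if_other_first)

print(expected_doom(0.2))   # 0.270 if Anthropic stays off the frontier
print(expected_doom(0.5))   # 0.225 if racing makes Anthropic more likely to be first
# This ignores any increase in doom from speeding up the race, which is exactly
# the effect such an employee would have to weigh against the reduction above.
```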

You probably disagree with P(doom|some other company builds AGI) - P(doom|Anthropic builds AGI) and with the effectiveness of Anthropic advocacy/safety research/safety deployments, but I feel like this is a very different discussion from "obviously you should never build something that has a big chance of killing everyone".

(I don't think most people at Anthropic think like that, but I believe at least some of the most influential employees do.)

Also, my understanding is that technology is often built this way during deadly races where at least one side believes that building it faster themselves is net good despite the risks (e.g. deciding to detonate the first nuke despite thinking it might ignite the atmosphere, ...).

I am pretty new to this but I understood that this is automatically taken care of for you if you use the same prompts repeatedly with the API

True for OpenAI and Gemini, not true for Anthropic.

Cool replication!

Sad that cost was such a constraint. Especially since prompting experiments are among the cheapest experiments to run! I wonder if we should have tried harder to craft a minimal example of alignment faking that was less expensive to run. Some ways we could have done that:

  • Use a smaller system prompt (I suspect some of the long few-shot examples can be removed / replaced by prompting?)
  • Use prompt caching when possible (see the sketch below)
  • Use a single generation (I don't know if you are using the version of the setup with "<NEXT/>")

This might also make it easier to run more extensive prompt variations.
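For example, here is a minimal sketch of what prompt caching could look like with the Anthropic API, where (unlike OpenAI and Gemini) caching is opt-in via cache_control markers. The model name and system prompt are placeholders, not the actual alignment-faking setup.

```python
# Minimal sketch (placeholders, not the actual alignment-faking setup):
# cache one long system prompt and reuse it across many cheap prompt variations.
import anthropic

client = anthropic.Anthropic()
LONG_SYSTEM_PROMPT = "..."  # the big system prompt you want to reuse

def run_variation(user_prompt: str) -> str:
    response = client.messages.create(
        model="claude-3-opus-20240229",  # placeholder model choice
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # mark this prefix as cacheable
        }],
        messages=[{"role": "user", "content": user_prompt}],
    )
    return response.content[0].text

# Only the short user prompt changes between variations, so most input tokens
# are served from the cache at a reduced price.
```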

This suggests that you can get quite a lot of mileage out of tinkering with prompts – maybe small inconsistencies or seemingly unimportant details can matter a lot.

It's scary that it depends so much on random details! In further experiments (to be released soon), we found that Claude 3 Opus was the only model that was relatively consistent in its alignment faking (it also happens to be the most expensive model).

It feels to me like there’s a clear hierarchy where the model is just barely not complying with some queries and is dead set against others.

We found similar results during the investigations of the first paper. For Claude 3 Opus, it mostly reduces the fraction of inputs on which it considers faking alignment, but it never quite drives it to zero even for very egregious queries like asking for severe racial slurs (which was very surprising to me)! See the appendix on the "prefix" variation for more details.

I still think this is a serious concern that prevents us from finding "the truth predictor", but here is a way around it:

is_it_bad(x) = max over predictors p of consistency(p), subject to the constraint that p(x) ≠ p(obvious good inputs)

  • For x such that all coherent (and high-salience) views predict it is good, it won't be possible to find a consistent and salient predictor that gives x a different label from the obvious good inputs, so is_it_bad(x) will be low;
  • For x that the model thinks is bad (regardless of what other coherent views think, and in particular coherent human simulators!), is_it_bad(x) will be high;
  • For x that some coherent view (e.g. the human simulator) thinks is bad, is_it_bad(x) will also be high.

The proposed is_it_bad has no false negatives. It has false positives (in situations where the human simulator predicts the input is bad but the model knows it's fine), but I think this might be fine for many applications (as long as we correctly use some prior/saliency/simplicity properties to exclude coherent predictors that are not interesting, without excluding the ground truth). In particular, it might be outer-alignment safe-ish to maximize outcome(x) - large_constant * is_it_bad(x).
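Written out as code, the construction might look something like this toy sketch. The predictor set, the consistency measure, and the set of obvious good inputs are all placeholders; this is just the constrained max from above made explicit.

```python
# Toy sketch of the proposed construction (all ingredients are placeholders).
from typing import Callable, Sequence

Predictor = Callable[[str], bool]  # maps an input to a bad/good label

def is_it_bad(
    x: str,
    candidate_predictors: Sequence[Predictor],
    consistency: Callable[[Predictor], float],
    obvious_good_inputs: Sequence[str],
) -> float:
    # max over predictors p of consistency(p), subject to the constraint that
    # p gives x a different label than it gives the obviously good inputs
    scores = [
        consistency(p)
        for p in candidate_predictors
        if all(p(x) != p(good) for good in obvious_good_inputs)
    ]
    # if no predictor satisfies the constraint, the score is low
    return max(scores, default=0.0)

LARGE_CONSTANT = 1e6  # placeholder value

def safe_ish_objective(x, outcome, candidate_predictors, consistency, obvious_good_inputs):
    # the outer objective mentioned above: outcome(x) - large_constant * is_it_bad(x)
    return outcome(x) - LARGE_CONSTANT * is_it_bad(
        x, candidate_predictors, consistency, obvious_good_inputs
    )
```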

I agree white-box control is promising!

I think it's not obvious that probes will be more reliable than trusted monitors for most tasks you care about, and I think it will be hard to evaluate, so I expect that you will probably want to combine white-box monitors and weaker trusted monitors. In particular, I think it's not obvious with current evidence that you could get good probes for things that are very hard to get good trusted monitors for (e.g. sandbagging). I don't think it's hopeless, and I am excited for future work studying this direction (even though building analogous model organisms is rough here).
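As a trivial illustration of what combining the two could mean in practice (the thresholds and the either-or rule are made up, not a recommendation):

```python
# Illustrative only: flag a transcript if either the white-box probe or the
# weaker trusted monitor finds it suspicious. Thresholds would in practice be
# calibrated on benign transcripts (e.g. to hit a target false-positive rate).
def should_flag(probe_score: float, trusted_monitor_score: float,
                probe_threshold: float = 0.95, monitor_threshold: float = 0.9) -> bool:
    return probe_score >= probe_threshold or trusted_monitor_score >= monitor_threshold
```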

Nit: I usually think some of the effect of adversarial training can be captured by loss evaluations - the main reason why adversarial training helps is the same reason why behavioral red-teaming might help (models can't fit the conditional policy well), which is black-box-y.

Slightly late update: I am in the middle of projects which would be hard to do outside Anthropic. I will push them for at least 2-4 more months before reconsidering my decision and writing up my views.

If I remember correctly that's what I did.

is brittle

You might be interested in an Appendix with further evaluations about whether LLMs really believe what they are told, which provides some evidence that the false fact is not merely believed in a superficial way.

While I think these evals provide decent evidence about what the LLMs in the experiment "actually believe" (I don't expect there to be natural situations where these weak models figure out the facts are false in a single forward pass when they "try harder"), I am pessimistic that we would be able to trust this sort of behavioral evals against powerful and situationally aware LLMs.

One eval which I think would give much stronger guarantees even against powerful and situationally aware LLMs is iid training (teach 100 false facts, train the model to distinguish true from false facts using 50 of them, and see if you generalize to the other 50), similar to what Aghyad and I did in The case for unlearning that removes information from LLM weights.
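A minimal sketch of that eval, assuming you already have the fine-tuning and scoring machinery (the helper functions are placeholders, not the code from the paper):

```python
# Sketch of the iid eval: after teaching the model ~100 false facts, train it to
# classify true vs false facts using half of the implanted facts (plus true facts),
# then check generalization on the held-out half. Helpers are placeholders.
import random

def iid_belief_eval(implanted_false_facts, true_facts, finetune_classifier, accuracy, seed=0):
    rng = random.Random(seed)
    false_facts = list(implanted_false_facts)
    rng.shuffle(false_facts)
    split = len(false_facts) // 2
    train_false, held_out_false = false_facts[:split], false_facts[split:]

    train_set = [(f, "false") for f in train_false] + [(f, "true") for f in true_facts]
    classifier = finetune_classifier(train_set)  # supervised true-vs-false training

    # High accuracy here means the model can tell the held-out implanted facts are
    # false when it is "actually trying"; low accuracy means the harder standard
    # discussed below is met.
    return accuracy(classifier, [(f, "false") for f in held_out_false])
```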

I think this is a much harder standard to meet (I think the models in the experiments in the blog post don't meet this standard, just like the LLMs that were taught false facts in my unlearning paper), but I think it gives you the guarantees you want (because I think iid training is very sample efficient at making LLMs "actually try").

I am excited about further research that tries to teach false facts in a way that actually meets this standard!

the main question seems to me to be whether it is plausible that US AI companies would spend more on safety

Other considerations:

  • Maybe more sharing between US companies leads to faster progress of the field overall
    • though maybe it slows it down by reducing investments, because algorithmic secrets become less valuable? That's a bit of a 4D-chess consideration; I don't know how much to trust this sort of reasoning.
  • Maybe you want information to flow from the most reckless companies to the less reckless ones, but not the other way around, such that you would prefer that the companies you expect to spend the most on safety not share information. Spending more on info-sec is maybe correlated with spending more on safety in general, so you might be disfavoring the less reckless actors by asking for transparency.

(I am also unsure how the public will weigh in - I think there is a 2%-20% chance that public pressure is net negative in terms of safety spending because of PR, legal and AI economy questions. I think it's hard to tell in advance.)

I don't think these are super strong considerations and I am sympathetic to the point about safety spend probably increasing if there was more transparency.

The proposal is to use monitoring measures, similar to e.g. constitutional classifiers. Also, don't we reduce misuse risk a bunch by only deploying to 10k external researchers?

My bad, I failed to see that what's annoying with helpful-only model sharing is that you can't check whether activity is malicious or not. I agree you can do great monitoring, especially if you can also limit each researcher to a small-ish number of tokens and have humans audit ~0.1% of transcripts (with AI assistance).
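A toy sketch of what that selection could look like (the rates, scoring, and top-k are placeholders):

```python
# Illustrative only: send a random ~0.1% of transcripts, plus the ones an AI
# monitor scores as most suspicious, to human auditors.
import random

def select_for_human_audit(transcripts, monitor_score, random_rate=0.001, top_k=100, seed=0):
    rng = random.Random(seed)
    random_idx = {i for i in range(len(transcripts)) if rng.random() < random_rate}
    suspicious_idx = set(sorted(range(len(transcripts)),
                                key=lambda i: monitor_score(transcripts[i]),
                                reverse=True)[:top_k])
    return [transcripts[i] for i in sorted(random_idx | suspicious_idx)]
```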
