We should also consider that, well, this result just doesn't pass the sniff test given what we've seen RL models do.
FWIW, I interpret the paper to be making a pretty narrow claim about RL in particular. On the other hand, a lot of the production "RL models" we have seen may not be pure RL. For instance, if you wanted to run a similar test to this paper on DeepSeek-V3+, you would compare DeepSeek-V3 to DeepSeek-R1-Zero (pure RL diff, according to the technical report), not to DeepSeek-R1 (trained with a hard-to-follow mix of SFT and RL). R1-Zero is a worse model than R1, sometimes by a large margin.
Out of curiosity, have your takes here changed much lately?
I think the o3+ saga has updated me a small-medium amount toward "companies will just deploy misaligned AIs and consumers will complain but use them anyway" (evidenced by deployment of models that blatantly lie from multiple companies) and "slightly misaligned AI systems that are very capable will likely be preferred over more aligned systems that are less capable" (evidenced by many consumers, including myself, switching over to using these more capable lying models).
I also think companies will work a bit to reduce reward hacking and blatant lying, and they will probably succeed to some extent (at least for noticeable, everyday problems), in the next few months. That, combined with OpenAI's rollback of 4o sycophancy, will perhaps make it seem like companies are responsive to consumer pressure here. But I think the situation is overall a small-medium update against consumer pressure doing the thing you might hope here.
Side point: Noting one other dynamic: advanced models are probably not going to act misaligned in everyday use cases (that consumers have an incentive to care about, though again revealed preference is less clear), even if they're misaligned. That's the whole deceptive alignment thing. So I think it does seem more like the ESG case?
I agree that the report conflates these two scales of risk. Fortunately, one nice thing about that table (Table 1 in the paper) is that readers can choose which of these risks they want to prioritize. I think more longtermist-oriented folks should probably rank Loss of Control as the worst, followed perhaps by Bad Lock-in, then Misuse and War. But obviously there's a lot of variance within these.
I agree that there *might* be some cases where policymakers will have difficult trade-offs to make about these risks. I'm not sure how likely I think this is, but I agree it's a good reason we should keep this nuance insofar as we can. I guess it seems to me like we're not anywhere near the right decision makers actually making these tradeoffs, nor near them having values that particularly up-weight the long-term future.
I therefore feel okay about lumping these together in a lot of my communication these days. But perhaps this is the wrong call, idk.
The viability of a pause is dependent on a bunch of things, like the number of actors who could take some dangerous action, how hard it would be for them to do that, how detectable it would be, etc. These are variable factors. For example, if the world got rid of advanced AI chips completely, dangerous AI activities would then take a long time and be super detectable. We talk about this in the research agenda; there are various ways to extend "breakout time", and these methods could be important to long-term stability.
I think your main point is probably right but was not well argued here. It seems like the argument is a vibe argument of like "nah they probably won't find this evidence compelling".
You could also make an argument from past examples where there has been large-scale action to address risks in the world, and look at the evidence there (e.g., the banning of CFCs, climate change more broadly, tobacco regulation, etc.).
You could also make an argument from existing evidence around AI misbehavior and how it's being dealt with, where (IMO) 'evidence much stronger than internals' basically doesn't seem to affect the public conversation outside the safety community (or even much here).
I think it's also worth saying a thing very directly: just because non-behavioral evidence isn't likely to be widely legible and convincing does not mean it is not useful evidence for those trying to have correct beliefs. Buck's previous post and many others discuss the rough epistemic situation when it comes to detecting misalignment. Internals evidence is going to be one of the tools in the toolkit, and it will be worth keeping in mind.
Another thing worth saying: if you think scheming is plausible, and you think it will be difficult to update against scheming from behavioral evidence (Buck's post), and you think non-behavioral evidence is not likely to be widely convincing (this post), then the situation looks really rough.
I appreciate this post, I think it's a useful contribution to the discussion. I'm not sure how much I should be updating on it. Points of clarification:
Within the first three months of our company's existence, Claude 3.5 Sonnet was released. Just by switching the portions of our service that ran on GPT-4o over to it, our nascent internal benchmarks immediately started to get saturated.
One of the assumptions guiding the analysis here is that sticker prices will approach marginal costs in a competitive market. DeepSeek recently released data about their production inference cluster (or at least one of them). If you believe their numbers, they report theoretical daily revenue of $562,027 (assuming no discounts and assuming use of the more expensive model), with a cost profit margin of 545%. DeepSeek is one of the lowest-priced providers, if not the lowest-priced, for the DeepSeek-R1 and DeepSeek-V3 models. So this data indicates that even the relatively cheap providers could be making substantial profits, which is evidence against the lowest-priced provider pricing near marginal cost.
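As a rough sanity check on what those figures imply, here's a back-of-the-envelope sketch in Python. It assumes that "545% cost profit margin" means profit divided by cost, which is my reading of their report rather than something I'm certain of:

```python
# Back-of-the-envelope check on DeepSeek's reported inference economics.
# Assumption: "545% cost profit margin" means profit / cost = 5.45.
theoretical_daily_revenue = 562_027  # USD/day, theoretical figure from their report
profit_over_cost = 5.45              # 545%, interpreted as profit relative to cost

implied_daily_cost = theoretical_daily_revenue / (1 + profit_over_cost)
implied_daily_profit = theoretical_daily_revenue - implied_daily_cost

print(f"implied daily cost:   ~${implied_daily_cost:,.0f}")    # ~$87,000
print(f"implied daily profit: ~${implied_daily_profit:,.0f}")  # ~$475,000
```

If that interpretation is right, the implied serving cost is a small fraction of what the sticker prices would bring in, which is the opposite of prices sitting near marginal cost.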
Hm, sorry, I did not mean to imply that the defense/offense ratio is infinite. It's hard to know, but I expect it's finite for the vast majority of dangerous technologies[1]. I do think there are times when the amount of resources and intelligence needed for defense is too high for a civilization to muster. If an asteroid were headed for Earth 200 years ago, we simply would not have been able to do anything to stop it. Asteroid defense is not impossible in principle (the defensive resources and intelligence needed are not infinite), but they were certainly above what 1825 humanity could have mustered in a few years. It's not impossible in principle; it was just impossible for 1825 humanity.
While defense/offense ratios are relevant, I was mostly trying to make the points that these are disjunctive threats, that some of them might be hard to defend against (i.e., have a high defense/offense ratio), and that we'll have to mount those defenses on a very short time frame. I think this argument goes through unless one is fairly optimistic about the defense/offense ratio for every technology that gets developed rapidly. The argumentative/evidential burden for being on net optimistic about this situation is thus pretty high, and per the public arguments I have seen, it has not been met.
(I think it's possible I've made some heinous reasoning error that places too much burden on the optimism case; if that's true, somebody please point it out.)
To be clear, it certainly seems plausible that some technologies have a defense/offense ratio which is basically unachievable with conventional defense, and that you need something like mass surveillance to deal with these. For example, triggering vacuum decay seems like the type of thing where there may not be any technological response that averts catastrophe once the decay has started; instead, the only effective defenses are ones that stop anybody from doing the thing to begin with.
I think your discussion of why humanity could survive a misaligned superintelligence is missing a lot. Here are a couple of claims:
Why do I believe point 2? It seems like the burden of proof is really high to say that "nope, every single one of those dangerous technologies is going to be something that it is technically possible for the aligned AIs to defend against, and they will have enough lead time to do so, in every single case". If you're assuming we're in a world with misaligned ASIs, then every single existentially dangerous technology is another disjunctive source of risk. Looking out at the maybe-existentially-dangerous technologies that have been developed previously and that could be developed in the future, e.g., nuclear weapons, biological weapons, mirror bacteria, false vacuum decay, nanobots, I don't feel particularly hopeful that we will avoid catastrophe. We've survived nuclear weapons so far, but with a few very close calls — if you assume other existentially dangerous technologies go like this, then we probably won't make it past a few of them. Now crunch that all into a few years, and like gosh it seems like a ton of unjustified optimism to think we'll survive every one of these challenges.
It's pretty hard to convey my intuition around the vulnerable world hypothesis; I also try to do so here.
I got the model up to 3,000 tokens/s on a particularly long/easy query.
As an FYI, there has been other work on large diffusion language models, such as this: https://www.inceptionlabs.ai/introducing-mercury